dumpster-dive
is a nodejs script that puts a highly-queryable wikipedia on your computer in a nice afternoon.
It uses worker-nodes to process pages in parallel, and wtf_wikipedia to turn wikiscript into whatever json.
npm install -g dumpster-dive
var dumpster = require('dumpster-dive')
dumpster({ file:'./enwiki-latest-pages-articles.xml', db:'enwiki'}, callback)
dumpster /path/to/my-wikipedia-article-dump.xml --citations=false --html=true
then check out the articles in mongo:
$ mongo #enter the mongo shell
use enwiki #grab the database
db.pages.count()
# 4,926,056...
db.pages.find({title:"Toronto"})[0].categories
#[ "Former colonial capitals in Canada",
# "Populated places established in 1793" ...]
you can do this. just a few Gb. you can do this.
Install nodejs (at least v6
), mongodb (at least v3
)
# install this script
npm install -g dumpster-dive # (that gives you the global command `dumpster`)
# start mongo up
mongod --config /mypath/to/mongod.conf
The Afrikaans wikipedia (around 47,000 articles) only takes a few minutes to download, and 5 mins to load into mongo on a macbook:
# dowload an xml dump (38mb, couple minutes)
wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2
the english dump is 16Gb. The download page is confusing, but you'll want this file: ${LANG}wiki-latest-pages-articles.xml.bz2
i know, this sucks. but it makes the parser so much faster.
bzip2 -d ~/path/afwiki-latest-pages-articles.xml.bz2
On a macbook, unzipping en-wikipedia takes an hour or so. This is the most-boring part. Eat some lunch.
The english wikipedia is around 60Gb.
#load it into mongo (10-15 minutes)
dumpster ./afwiki-latest-pages-articles.xml
just put some epsom salts in there, it feels great.
The en-wiki dump should take a few hours. Maybe 8. Should be done before dinner.
The console will update you every couple seconds to let you know where it's at.
go check-out the data! to view your data in the mongo console:
$ mongo
use afwiki //your db name
//show a random page
db.pages.find().skip(200).limit(2)
//find a specific page
db.pages.findOne({title:"Toronto"}).categories
//find the last page
db.pages.find().sort({$natural:-1}).limit(1)
// all the governors of Kentucky
db.pages.count({ categories : { $eq : "Governors of Kentucky" }}
//pages without images
db.pages.count({ images: {$size: 0} })
alternatively, you can run dumpster-report afwiki
to see a quick spot-check of the records it has created across the database.
the english wikipedia will work under the same process, but the download will take an afternoon, and the loading/parsing a couple hours. The en wikipedia dump is a 13 GB (for enwiki-20170901-pages-articles.xml.bz2), and becomes a pretty legit mongo collection uncompressed. It's something like 51GB, but mongo can do it 💪.
dumpster follows all the conventions of wtf_wikipedia, and you can pass-in any fields for it to include in it's json.
- human-readable plaintext --plaintext
dumpster({file:'./myfile.xml.bz2', db: 'enwiki', plaintext:true, categories:false})
/*
[{
_id:'Toronto',
title:'Toronto',
plaintext:'Toronto is the most populous city in Canada and the provincial capital...'
}]
*/
- disambiguation pages / redirects --skip_disambig, --skip_redirects by default, dumpster skips entries in the dump that aren't full-on articles, you can
let obj = {
file: './path/enwiki-latest-pages-articles.xml.bz2',
db: 'enwiki',
skip_redirects: false,
skip_disambig: false
}
dumpster(obj, () => console.log('done!') )
- reducing file-size: you can tell wtf_wikipedia what you want it to parse, and which data you don't need:
dumpster ./my-wiki-dump.xml --infoboxes=false --citations=false --categories=false --links=false
- custom json formatting
you can grab whatever data you want, by passing-in a
custom
function. It takes a wtf_wikipediaDoc
object, and you can return your cool data:
let obj={
file: path,
db: dbName,
custom: function(doc) {
return {
_id: doc.title(), //for duplicate-detection
title: doc.title(), //for the logger..
sections: doc.sections().map(i => i.json()),
categories: doc.categories() //whatever you want!
}
}
}
dumpster(obj, () => console.log('custom wikipedia!') )
- non-main namespaces:
do you want to parse all the navboxes? change
namespace
in ./config.js to another number
this library uses:
- sunday-driver to stream the gnarly xml file
- wtf_wikipedia to brute-parse the article wikiscript contents into JSON.
since wikimedia makes all pages have globally unique titles, we also use them for the mongo _id
fields.
The benefit is that if it crashes half-way through, or if you want to run it again, running this script repeatedly will not multiply your data. We do a 'upsert' on the record.
mongo has some opinions on special-characters in some of its data. It is weird, but we're using this standard(ish) form of encoding them:
\ --> \\
$ --> \u0024
. --> \u002e
This library should also work on other wikis with standard xml dumps from MediaWiki. I haven't tested them, but the wtf_wikipedia supports all sorts of non-standard wiktionary/wikivoyage templates, and if you can get a bz-compressed xml dump from your wiki, this should work fine. Open an issue if you find something weird.
if the script trips at a specific spot, it's helpful to know the article it breaks on, by setting verbose:true
:
dumpster({
file: '/path/to/file.xml',
verbose: true
})
this prints out every page's title while processing it..
This is an important project, come help us out.