Wikipedia Dump Woes
I've been struggling for the last couple of hours to import the Wikipedia dump, without much luck. Has anybody had a successful attempt? I know at least one person who would appreciate it if you shared the steps you took, etc.
On a sorta separate note, I've come to understand that the dump contains the article content in MediaWiki markup (wikitext). Has anyone attempted to parse that yet?
Finally, I'm currently downloading the "static html" dump of Wikipedia. It's slightly older, and is 14GB, but I'm hoping it will solve both of the issues above. http://static.wikipedia.org/
It's gonna be 5 hours before it's downloaded, I'll post again once I know more...
-
I have some updates.
I was able to import the xml dump to mysql (hopefully, it's still processing!).
Here's what I did:
1. Create the database schema. Run this sql script:
http://svn.wikimedia.org/viewvc/media...
You'll need to replace all the /*$wgDBprefix*/ comments with your db name followed by a dot. e.g. "wikidb." This should create all the tables, etc.
Alternately, you could install MediaWiki, which will create the database and has the bonus of letting you browse your imported Wikipedia dump offline. But I wasn't able to install MediaWiki; I ran into problems with Apache/PHP/MySQL on my Windows machine...
2. Import the xml dump to your newly created database. Instructions for this are provided at the following link; I used the batch file described for Win XP. If you're copy-pasting, remember to change the file names accordingly, including the MySQL connector version, etc.
http://www.mediawiki.org/wiki/Mwdumper
Seems like this will take a while to complete, and I'm still downloading the static html dump in the meantime. I think parsing the MediaWiki format should actually be pretty easy... but I still am not sure which source to use.
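If you'd rather not do the /*$wgDBprefix*/ replacement from step 1 by hand, here's a minimal sketch of how it could be scripted. The function names and the file paths in the usage note are just placeholders I made up, and "wikidb." is the example prefix from above; this is not something that ships with MediaWiki.

```python
from pathlib import Path

# The placeholder comment that appears in front of table names in the schema script.
PREFIX_MARKER = "/*$wgDBprefix*/"


def apply_db_prefix(schema_sql: str, prefix: str) -> str:
    """Return the schema text with every placeholder replaced by `prefix`."""
    return schema_sql.replace(PREFIX_MARKER, prefix)


def rewrite_schema(src: Path, dst: Path, prefix: str) -> None:
    """Read the schema script from `src` and write the prefixed copy to `dst`."""
    text = src.read_text(encoding="utf-8")
    dst.write_text(apply_db_prefix(text, prefix), encoding="utf-8")


if __name__ == "__main__":
    sample = "CREATE TABLE /*$wgDBprefix*/page (page_id INT);"
    print(apply_db_prefix(sample, "wikidb."))
```

For a real file you would call something like rewrite_schema(Path("tables.sql"), Path("tables_wikidb.sql"), "wikidb."), with whatever file names you actually saved the script under.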
-
Blast! :| :(
The import tool crashed. It gave me a duplicate key error. Have no idea why that would happen. I noticed it's actually mentioned on the page on mwdumper (above), but the solution someone has proposed doesn't apply to me. The tables were all empty when I started... :-s
Btw, I also made a silly attempt to parse the 19GB xml file directly in code. I think you all can guess what happened! :)
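For what it's worth, the usual reason a direct parse falls over is loading the whole file into memory at once. A streaming parse avoids that. Below is a rough sketch using Python's standard xml.etree.ElementTree.iterparse; it is not what I ran, the tiny sample document and its namespace URI are made up for illustration, and the helper names are my own.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page>, discarding pages once they are used."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            title, text = None, ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    # If a page has several revisions, the last one wins.
                    text = child.text or ""
            yield title, text
            root.clear()  # drop finished pages so memory use stays roughly flat


# Made-up miniature dump, only to show the shape of the data.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page><title>A</title><revision><text>alpha</text></revision></page>
  <page><title>B</title><revision><text>[[A]] beta</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    for title, text in iter_pages(io.BytesIO(SAMPLE)):
        print(title, len(text))
```

iterparse accepts any binary file object, so the same function should work on an open handle to the real dump instead of the BytesIO sample.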
-
I was finally able to import the data to mysql without any hitches. 7,649,051 pages. I'm guessing they're not all "content pages".
Here's how I resolved the problems I was having:
- On the mwdumper page, a bunch of tips are given toward the end. Like removing indexes from the database tables. Listen to them!
- Increase the java heap size (-Xmx1000m)
- MOST IMPORTANTLY increase the max packet size in mysql. Here's a link with more info: http://dev.mysql.com/doc/refman/5.0/e...
There's a comment at the end that explains how to do it through the mysql administrator GUI. I used the suggested 32mb.
I'm still downloading the static html dump...
-
also... if you are using linux, be sure you are using sun's java runtime instead of gcj (ubuntu's default). changing it made a significant difference for me in the import process
-
I was able to download the static html dump in ~1hr (thank you uci housing bandwidth). however, uncompressed, the tar file is ~200gb, which doesn't seem to work well with my 160gb hard drive.
on ubuntu, setting up mediawiki is extremely easy
- sudo apt-get install mediawiki
- navigate to http://localhost/mediawiki
- complete configuration
i'm currently importing the xml file into mysql.
as far as the mediawiki syntax... i'm thinking about stripping all non-alphabetical characters from the text. unless there's something i haven't thought of, these aren't necessary, and will remove the wiki syntax at the same time.
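a minimal sketch of that stripping idea, assuming plain ascii letters are all we care about (the function name and the sample sentence are made up):

```python
import re

# One or more characters that are not ASCII letters.
NON_ALPHA = re.compile(r"[^A-Za-z]+")


def strip_non_alpha(text):
    """Replace every run of non-letter characters with a single space."""
    return NON_ALPHA.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "'''Irvine''' is a [[city]] in {{convert|66|sqmi}}."
    print(strip_non_alpha(sample))  # Irvine is a city in convert sqmi
```

one thing the sample shows: the punctuation goes away, but words that are part of the markup (template names like "convert", link targets) stay in as ordinary tokens, and any accented or non-latin letters get dropped too.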
stemming:
surprisingly there do not seem to be very many stemmers available. as far as i can tell the big name stemmer is the porter algorithm (java version available here: http://snowball.tartarus.org/runtime/...). if you have lucene available, these classes are in it as well.
-
Number of "content pages": 2,871,055
At least that's what I've come to believe.
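For anyone who wants to check a count like this, here's a minimal sketch of the kind of query involved. It assumes "content page" means a page in the main namespace (page_namespace = 0) that is not a redirect (page_is_redirect = 0). The query text is plain SQL; sqlite3 is only a stand-in so the example runs without a MySQL server, and the toy rows are made up.

```python
import sqlite3

# Count pages in the main (article) namespace that are not redirects.
CONTENT_PAGE_COUNT_SQL = """
SELECT COUNT(*) FROM page
WHERE page_namespace = 0
  AND page_is_redirect = 0
"""


def count_content_pages(conn):
    """Run the count query on an open DB-API connection."""
    return conn.execute(CONTENT_PAGE_COUNT_SQL).fetchone()[0]


def make_toy_db():
    """Build a tiny in-memory table with the same column names as `page`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,"
        " page_title TEXT, page_is_redirect INTEGER)"
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?, ?)",
        [
            (1, 0, "Irvine", 0),      # article
            (2, 0, "Irvine_CA", 1),   # redirect, excluded
            (3, 1, "Irvine", 0),      # talk page, excluded
            (4, 10, "Convert", 0),    # template, excluded
            (5, 0, "Stemming", 0),    # article
        ],
    )
    return conn


if __name__ == "__main__":
    print(count_content_pages(make_toy_db()))  # 2
```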
I used the page_is_redirect and page_namespace columns in the page table to filter out the extra pages...
-
Great hints! Thanks.
-
Hi,
I've been importing the Wikipedia dump for almost two weeks now, and at this rate I estimate another 4. I'm using importDump instead of mwdumper because I've read that mwdumper does not update the internal link tables. From your experience using mwdumper, can you tell me exactly what's not updated?
Thanks
-
This class has ended, so I don't think anyone else is watching this question now! Sorry :(
-
OK, I see..