I'll be the first to admit I'm not the smartest person in the world, but I'm at a loss on this one.
I want to have access to the words and details of each word of the English Wiktionary project. I saw they do data dumps, and got all excited. That lasted all of 3 seconds. Since then, all I've done is swear and smoke in bouts of frustration and irritation.
I'm using Windows 7.
I've installed the latest version of XAMPP (64 bit, installed at root).
I've installed the latest Java JDK.
I've set XAMPP and the JDK to run as admin.
I've grabbed the pages-articles dump files.
I've decompressed them.
I've tried the mwxml2sql tool.
I couldn't get it to run (no matter what settings/flags I tried).
I then tried the GUI version of the mwxml2sql tool.
It ran - and then errored at 4300 rows.
The error was about duplicate keys in name_title.
I've looked at wikokit - but that seems a few years behind.
I'm at a loss.
I've looked at the data that did get into the DB before the dupe-key error.
I can see some data in Blob format.
How am I meant to access that information via PHP?
Is there not a decent (as in "for idiots" :D) guide for this?
Do I really have to grab all the files, install a wiki, parse the files?
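For reference, what I'm hoping is possible is reading the dump XML directly, without installing a wiki at all. Below is a minimal sketch in Python of what I imagine that would look like. I'm assuming the dump is the usual MediaWiki export XML with <page>, <title> and <text> elements; the function name iter_pages and the sample data are made up for illustration, and I haven't run this against a full dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export.

    `source` is a filename or a binary file object. The file is streamed,
    so the whole dump never has to fit in memory.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            root.clear()  # free the pages that have already been handled


# Tiny hand-written sample standing in for the real dump (hypothetical data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>cat</title>
    <ns>0</ns>
    <revision><text>==English==
===Noun===
# A feline.</text></revision>
  </page>
  <page>
    <title>dog</title>
    <ns>0</ns>
    <revision><text>==English==
===Noun===
# A canine.</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(wikitext))
```

If that approach is sound, the same function should accept the path of the decompressed dump file in place of the in-memory sample.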
How am I meant to handle the dupe key issues (not like I can open up the sql file and find the relevant line!)?
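One guess on my part (not confirmed) is that the duplicates are titles that differ only in letter case, such as "Cat" and "cat", which a case-insensitive column would treat as the same key. A rough Python sketch of how I could check a list of titles for that; find_case_collisions is a made-up helper name and the titles are invented examples:

```python
from collections import defaultdict


def find_case_collisions(titles):
    """Group titles that become identical once letter case is ignored.

    Returns {lowercased_title: [distinct original titles]} for every
    group containing more than one distinct spelling.
    """
    groups = defaultdict(set)
    for title in titles:
        groups[title.lower()].add(title)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    # Invented titles, only to show the shape of the result.
    print(find_case_collisions(["cat", "Cat", "dog", "polish", "Polish"]))
```

If this turns up collisions in the real titles, that would at least narrow down where the name_title error comes from.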
So, please - has anyone done this, or does anyone know of a way to do it?
The only thing I can think of is to actually try and scrape the site - which I'd rather not do (and I doubt the wiki group would want me to either).
In case it is relevant - I'm specifically after the word-form, the PoS, the pronunciations, the definitions, any phrases and related words. Things like etymology etc. would be nice, but aren't as important.
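To make that concrete, here is a rough Python sketch of the extraction I have in mind once I can get at the raw page text. It assumes English Wiktionary's layout of a "==English==" language heading, part-of-speech subheadings such as "===Noun===", and definition lines starting with "# ". The function name, the PoS list and the sample entry are my own inventions for illustration, and the list is certainly incomplete.

```python
import re

# Matches wiki headings such as "==English==" or "===Noun===".
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")

# Assumed, incomplete list of part-of-speech heading names.
POS_NAMES = {"Noun", "Verb", "Adjective", "Adverb", "Pronoun",
             "Preposition", "Conjunction", "Interjection"}


def extract_definitions(wikitext, language="English"):
    """Return {part_of_speech: [definition lines]} for one language section."""
    results = {}
    in_language = False
    current_pos = None
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level, name = len(match.group(1)), match.group(2)
            if level == 2:
                # A level-2 heading starts a new language section.
                in_language = (name == language)
                current_pos = None
            elif in_language:
                current_pos = name if name in POS_NAMES else None
            continue
        # "# " lines are definitions; "#:" and "#*" lines are skipped.
        if in_language and current_pos and line.startswith("# "):
            results.setdefault(current_pos, []).append(line[2:].strip())
    return results


# Invented sample entry, not copied from the site.
SAMPLE_ENTRY = """==English==
===Etymology===
Some history here.
===Noun===
# A small domesticated [[feline]].
#: Example sentence, which is skipped.
===Verb===
# To hoist an anchor.
==French==
===Noun===
# Not English, so ignored.
"""

if __name__ == "__main__":
    print(extract_definitions(SAMPLE_ENTRY))
```

The definitions still contain wiki markup (links, templates), so a clean-up step would be needed on top of this.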
In case it gets suggested: yes, I've looked at WordNet (managed to find a MySQL dump, and got that working). I've also seen resources like MRC and the CMU dict - but none have the right licence terms for what I need. That's why Wiktionary looked so attractive. But it seems the format/dumps are far from friendly :(
So, any help or ideas?
Alternative sources, guides, walk-throughs... all would help.
Alternatively, if you can tell me what is causing the error and how to get around it, and how to access the word data, that would be superb.
Sincerely yours - frustrated.