I had a few requests for the Virtual International Authority File ntriples file I wrote about earlier. Having the various flavors of VIAF data available is great, but if an RDF dump is going to be made available I think ntriples kinda makes more sense than line-oriented rdf/xml. I say this only because most RDF libraries and tools have support for bulk loading ntriples, and none (to my knowledge) support loading line-oriented rdf/xml files.

I’ve made the 1.9G bzipped ntriples file available on Amazon S3 if you are interested in having at it:

http://viaf-ntriples.s3.amazonaws.com/viaf-20120524-clusters-rdf.nt.bz2
Incidentally, you can torrent it as well, which would help spread the download of the file (and spare me some cost on S3); just point your BitTorrent client at:

http://viaf-ntriples.s3.amazonaws.com/viaf-20120524-clusters-rdf.nt.bz2?torrent
As with the original VIAF dataset, this ntriples VIAF download contains information from VIAF (Virtual International Authority File), which is made available under the ODC Attribution License (ODC-By). Similarly, I am making the ntriples VIAF download available under the ODC-By license as well, because I think I have to, given the viral nature of ODC-By. At least that’s my unprofessional (I am not a lawyer) reading of the license. I’m not really complaining either; I’m all for openness going viral :-)


On a side note, I upgraded my laptop after the 4 days it took to initially create the ntriples file, and in the process I accidentally deleted the ntriples file when I reformatted my hard drive. So the second time around I spent some time seeing if I could generate it quicker on Elastic MapReduce, by splitting the file across multiple workers that would each generate ntriples from the rdf/xml, and then merging the results back together. The conversion of the rdf/xml to ntriples using rdflib was largely CPU bound on my laptop, so I figured Hadoop Streaming would help me run my little Python script on as many worker nodes as I needed.
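The conversion itself is pretty simple: parse each rdf/xml record with rdflib and re-serialize it as ntriples. Here is a minimal sketch of that kind of script, written as a stdin-to-stdout filter since that is the shape Hadoop Streaming wants a mapper to have. It is not the exact script I ran: it assumes each line of the dump is a standalone rdf/xml document (the real dump may prefix each line with an identifier that would need stripping first), and a recent rdflib where serialize() returns a string.

    #!/usr/bin/env python
    """Sketch: convert line-oriented rdf/xml to ntriples.

    Reads one rdf/xml document per line on stdin and writes ntriples
    to stdout, so it can run standalone or as a Hadoop Streaming
    mapper. Assumes a recent rdflib where serialize() returns a str.
    """

    import sys

    from rdflib import Graph


    def to_ntriples(rdfxml):
        # parse a single rdf/xml record and re-serialize it as ntriples
        g = Graph()
        g.parse(data=rdfxml, format="xml")
        return g.serialize(format="nt")


    if __name__ == "__main__":
        for line in sys.stdin:
            line = line.strip()
            if not line:
                continue
            sys.stdout.write(to_ntriples(line))

Locally you would run it with something like zcat viaf-clusters-rdf.xml.gz | python to_ntriples.py | bzip2 > viaf.nt.bz2 (the filenames here are made up), and the same file could be handed to Hadoop Streaming as the mapper in a map-only job.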

EMR made setting up and running a job quite easy, but I ran into my own ignorance pretty quickly. It turns out that Hadoop was not able to split the gzipped VIAF data, which meant data was only ever sent to one worker, no matter how many I ran. I then ran across some advice to use LZO compression, which is supposedly splittable on EMR, but after lots of experimentation I couldn’t get it to split either. I thought about uncompressing the original gzipped file on S3, but felt kind of depressed about doing that for some reason.

I had time-boxed only a few days to try to get EMR working, so I backpedaled to rewriting the conversion script with Python’s multiprocessing library, thinking it would let me take advantage of a multi-core EC2 machine. But oddly the conversion ran slower using multiprocessing’s Pool than it did as a single process. I chalked this up to the overhead of pickling large strings of rdf/xml and ntriples to send them around during inter-process communication…but I didn’t investigate further.
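The Pool-based version was structured roughly like the sketch below (again an approximation, not the actual script, and the chunksize is just an illustrative value): the parent process reads the rdf/xml records and farms them out to worker processes, which means every record and every chunk of resulting ntriples gets pickled and shipped across process boundaries.

    import sys
    from multiprocessing import Pool

    from rdflib import Graph


    def to_ntriples(rdfxml):
        # runs in a worker process: parse rdf/xml, return ntriples
        g = Graph()
        g.parse(data=rdfxml, format="xml")
        return g.serialize(format="nt")


    if __name__ == "__main__":
        pool = Pool()  # defaults to one worker per CPU core
        records = (line for line in sys.stdin if line.strip())
        # imap keeps output in input order; chunksize batches records
        # so fewer (but bigger) pickled messages cross process lines
        for nt in pool.imap(to_ntriples, records, chunksize=100):
            sys.stdout.write(nt)
        pool.close()
        pool.join()

Every one of those strings has to be pickled on one side and unpickled on the other, so with big rdf/xml records the shuttling back and forth can easily eat whatever the extra cores buy you.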

So, in the end, I just re-ran the script for 4 days, but this time up on EC2 so that I could use my laptop in the meantime. &sigh;