xml spelunking
As part of my day job I’ve been rifling through large foreign XML files: learning the rhyme and reason of the tags used, looking at content, and so on. I opened the files in Firefox and vim and that was OK, but I like working from the command line. After minimal searching I wasn’t able to find a suitable tool that would simply outline the structure of an XML document in the way I wanted, although artunit pointed out Gadget from MIT, which looks like a really wonderful GUI tool to try out. So (predictably) I wrote my own:
biblio:~ ed$ xmltree
Usage: xmltree foo.xml [--depth=n] [--xpath=/foo/bar] [--content]
Specific options:
    -d, --depth n                    max levels
    -x, --xpath /foo/bar             xpath to apply
    -c, --content                    include tag content
    -n, --namespaces                 include namespace information
    -h, --help                       show this message
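That "Specific options:" layout is the sort of thing Ruby's standard OptionParser prints, so the option handling could plausibly look something like the sketch below. This is a guess at the shape, not the actual xmltree source: `parse_options` and the keys of the options hash are made-up names.

```ruby
require 'optparse'

# Hypothetical helper (not from xmltree): parse xmltree-style flags from an
# argv array. Returns the options hash plus the leftover arguments (the file).
def parse_options(argv)
  options = { depth: nil, xpath: nil, content: false, namespaces: false }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: xmltree foo.xml [--depth=n] [--xpath=/foo/bar] [--content]"
    opts.separator "Specific options:"
    opts.on("-d", "--depth n", Integer, "max levels") { |n| options[:depth] = n }
    opts.on("-x", "--xpath /foo/bar", "xpath to apply") { |x| options[:xpath] = x }
    opts.on("-c", "--content", "include tag content") { options[:content] = true }
    opts.on("-n", "--namespaces", "include namespace information") { options[:namespaces] = true }
    opts.on("-h", "--help", "show this message") do
      puts opts
      exit
    end
  end
  # parse (without the bang) leaves the caller's array untouched
  rest = parser.parse(argv)
  [options, rest]
end

options, files = parse_options(%w[--depth=3 --content pmets.xml])
puts options.inspect
puts files.inspect
```

Passing `Integer` to `opts.on` makes OptionParser convert the depth for you and reject non-numeric values.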
You can use it to list all the elements in a document like this:
biblio:~ ed$ xmltree pmets.xml
PorticoMETS
  metsHdr
    agent
      name
  structMapContent
    div
      mdGroup
        descMDcurated
          mdWrap
            xmlData
              PorticoArticleMetadata
                article
... many lines of content removed
Maybe it’s a huge file and you only want to see a few levels in:
biblio:~ ed$ xmltree --depth=3 pmets.xml
PorticoMETS
  metsHdr
    agent
      name
  structMapContent
    div
      mdGroup
      div
      div
      div
  structMapMetadata
    div
      mdGroup
  fileGrp
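For the curious, a depth-limited outline like this boils down to a small recursive walk. Here is a minimal sketch using REXML; it is not the xmltree code itself. `outline` is a hypothetical helper, the sample XML is made up, and counting the root as level 0 is an assumption about how `--depth` counts.

```ruby
require 'rexml/document'

# Hypothetical helper: collect one indented line per element, and stop
# descending once max_depth is reached (root is level 0 in this sketch).
def outline(element, max_depth = nil, level = 0, lines = [])
  lines << ("  " * level) + element.name
  return lines if max_depth && level >= max_depth
  element.elements.each { |child| outline(child, max_depth, level + 1, lines) }
  lines
end

# Made-up sample document, loosely shaped like the files above
xml = <<~XML
  <PorticoMETS>
    <metsHdr><agent><name>someone</name></agent></metsHdr>
    <structMapContent><div><mdGroup><descMDcurated/></mdGroup></div></structMapContent>
  </PorticoMETS>
XML

doc = REXML::Document.new(xml)
puts outline(doc.root, 2)
```

Leaving `max_depth` as `nil` walks the whole tree, which corresponds to running without `--depth`.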
And if you just want to explore a particular node you can use an xpath:
biblio:~ ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/descMDextracted/mdWrap/xmlData sample.pmets
xmlData
  PorticoArticleMetadata
    article
      front
        journal-meta
          journal-id
          journal-title
          issn
          issn
        article-meta
          article-id
          article-id
          title-group
            article-title
          contrib-group
            contrib
          pub-date
            day
            month
            year
            string-date
          volume
          issue
          fpage
          page-range
          product
          copyright-year
          copyright-holder
          self-uri
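Restricting the outline to an xpath can be sketched as "select the matching nodes, then outline each one". The snippet below does that with REXML's XPath support. `outline_at` is a hypothetical name, the sample XML is made up, and the sketch assumes the xpath selects elements (not text or attributes).

```ruby
require 'rexml/document'

# Hypothetical helper: outline every element the xpath selects.
def outline_at(doc, xpath)
  lines = []
  walk = lambda do |element, level|
    lines << ("  " * level) + element.name
    element.elements.each { |child| walk.call(child, level + 1) }
  end
  REXML::XPath.each(doc, xpath) { |node| walk.call(node, 0) }
  lines
end

# Made-up sample document using a few of the tag names above
xml = <<~XML
  <article>
    <front>
      <journal-meta><journal-id/><issn/></journal-meta>
      <article-meta><volume/></article-meta>
    </front>
  </article>
XML

doc = REXML::Document.new(xml)
puts outline_at(doc, "/article/front/journal-meta")
```

If the xpath matches several nodes, each match starts again at indent level 0.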
And finally, if you want to eyeball the content of the fields, you can use the --content option:
biblio:~ ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/descMDextracted/mdWrap/xmlData --content sample.pmets
xmlData
  PorticoArticleMetadata
    article
      front
        journal-meta
          journal-id='bull'
          journal-title='Bulletin of the American Mathematical Society'
          issn='0273-0979'
          issn='1088-9485'
        article-meta
          article-id='S0273-0979-00-00866-1'
          article-id='10.1090/S0273-0979-00-00866-1'
          title-group
            article-title='Book Review'
          contrib-group
            contrib='David Marker'
          pub-date
            day='02'
            month='03'
            year='2000'
            string-date='02 March 2000'
          volume='37'
          issue='03'
          fpage='351'
          page-range='351-357'
          product='Tame topology and o-minimal structures, by Lou van den Dries, Cambridge Univ. Press, New York (1998), x + 180 pp., $39.95, ISBN 0-521-59838-9'
          copyright-year='2000'
          copyright-holder='American Mathematical Society'
          self-uri='http://www.ams.org/jourcgi/jour-getitem?pii=S0273-0979-00-00866-1'
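Showing content next to each tag only takes a small change to the walk: look at the element's own text and append it when it is not just whitespace. A rough sketch with REXML follows. `outline_with_content` is a hypothetical name, and looking only at the first text node of each element is a simplification that may differ from what xmltree really does.

```ruby
require 'rexml/document'

# Hypothetical helper: like a plain outline, but append ='text' for
# elements whose first text node has something other than whitespace.
def outline_with_content(element, level = 0, lines = [])
  text = element.text.to_s.strip
  label = text.empty? ? element.name : "#{element.name}='#{text}'"
  lines << ("  " * level) + label
  element.elements.each { |child| outline_with_content(child, level + 1, lines) }
  lines
end

# Values borrowed from the pub-date shown above
xml = <<~XML
  <pub-date>
    <day>02</day>
    <month>03</month>
    <year>2000</year>
  </pub-date>
XML

doc = REXML::Document.new(xml)
puts outline_with_content(doc.root)
```

Container elements like pub-date only hold whitespace between their children, so they print as a bare tag name.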
Anyhow, if you have a favorite tool for doing this sort of stuff, please let me know. If you want to try out xmltree, you can grab it from my Subversion repository. You’ll just need a modern Ruby.