As part of my day job I’ve been rifling through large foreign XML files–learning the rhyme and reason of tags used, looking at content, etc… I opened files in Firefox and vim and that was OK–but I like working from the command line. After minimal searching I wasn’t able to find a suitable tool that would simply outline the structure of an xml document in the way I wanted–although artunit pointed out Gadget from MIT which looks like a really wonderful GUI tool to try out. So (predictably) I wrote my own:

biblio:~ ed$ xmltree
Usage: xmltree foo.xml [--depth=n] [--xpath=/foo/bar] [--content]

Specific options:
    -d, --depth n                    max levels
    -x, --xpath /foo/bar             xpath to apply
    -c, --content                    include tag content
    -n, --namespaces                 include namespace information
    -h, --help                       show this message

You can use it to list all the elements in a document like this:

biblio:~ ed$ xmltree pmets.xml

PorticoMETS
 metsHdr
  agent
   name
 structMapContent
  div
   mdGroup
    descMDcurated
     mdWrap
      xmlData
       PorticoArticleMetadata
        article
... many lines of content removed

Maybe it’s a huge file and you only want to see a few levels in:

biblio:~ed$ xmltree --depth=3 pmets.xml 

PorticoMETS
 metsHdr
  agent
   name
 structMapContent
  div
   mdGroup
   div
   div
   div
 structMapMetadata
  div
   mdGroup
   fileGrp

And if you just want to explore a particular node you can use an xpath:

biblio:~ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/descMDextracted/mdWrap/xmlData sample.pmets

xmlData
 PorticoArticleMetadata
  article
   front
    journal-meta
     journal-id
     journal-title
     issn
     issn
    article-meta
     article-id
     article-id
     title-group
      article-title
     contrib-group
      contrib
     pub-date
      day
      month
      year
      string-date
     volume
     issue
     fpage
     page-range
     product
     copyright-year
     copyright-holder
     self-uri

And finally if you want to eyeball the content of the fields you can use the –content option:

biblio:~ ed$ xmltree --xpath .//PorticoMETS/structMapContent/div/mdGroup/
  descMDextracted/mdWrap/xmlData --content sample.pmets

xmlData
 PorticoArticleMetadata
  article
   front
    journal-meta
     journal-id='bull'
     journal-title='Bulletin of the American Mathematical Society'
     issn='0273-0979'
     issn='1088-9485'
    article-meta
     article-id='S0273-0979-00-00866-1'
     article-id='10.1090/S0273-0979-00-00866-1'
     title-group
      article-title='Book Review'
     contrib-group
      contrib='David Marker'
     pub-date
      day='02'
      month='03'
      year='2000'
      string-date='02 March 2000'
     volume='37'
     issue='03'
     fpage='351'
     page-range='351-357'
     product='Tame topology and o-minimal structures, by Lou van den Dries,
       Cambridge Univ. Press, New York (1998), x + 180 pp., $39.95,
       ISBN 0-521-59838-9'
     copyright-year='2000'
     copyright-holder='American Mathematical Society'
     self-uri='http://www.ams.org/jourcgi/jour-getitem?pii=S0273-0979-
       00-00866-1'

Anyhow, if you have a favorite tool for doing this sort of stuff please let me know. If you want to try out xmltree you can grab it out of my subversion repository. You’ll just need a modern Ruby.