I’m used to refactoring code smells, but sometimes you can catch a bad whiff in XML too.

Before:

<?xml version="1.0" encoding="UTF-8"?>


     
     
        
            Library of Congress
        
    

    
                    
        
            
                
                    
                            Library of Congress
                    
                    
                        sn82015056
                        
                    
                
            
        
    
                            
   
    
    
    
        
            
                
                    
                        
                           
                              The National Forum (Washington, DC), 1910-19??
                           
                           
                           
                              
The first issue of the National Forum was likely released on April 30, 1910 
and the newspaper ran through at least November 12 of that year. The four-page African-American 
weekly covered such local events as Howard University graduations and Baptist church activities, but its 
pages also included national news, sports, home maintenance, women's news, science, editorial 
cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news 
affected the city's black community. A unique feature was its coverage of Elks Club meetings and 
activities.  Business manager John H. Wills contributed the community-centered "Vanity Fair" column that
 usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who 
went on to publish another African-American newspaper, the 
McDowell Times of Keystone, West Virginia. Originally located at 
609 F St., NW, the newspaper's offices moved in August to 1022 U Street, N.W. to be closer to the 
African-American community it served.  No extant first issue of the National
 Forum exists.
                              
                           
                        
                    
                
            
        
    
        
        
    
        

After:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">

  
    The National Forum (Washington, DC), 1910-19??
    
    
    
    
  
  
    

The first issue of the National Forum was likely released on April 30, 1910 and the newspaper ran through at least November 12 of that year. The four-page African-American weekly covered such local events as Howard University graduations and Baptist church activities, but its pages also included national news, sports, home maintenance, women's news, science, editorial cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news affected the city's black community. A unique feature was its coverage of Elks Club meetings and activities. Business manager John H. Wills contributed the community-centered "Vanity Fair" column that usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who went on to publish another African-American newspaper, the McDowell Times of Keystone, West Virginia. Originally located at 609 F St., NW, the newspaper's offices moved in August to 1022 U Street, N.W. to be closer to the African-American community it served. No extant first issue of the National Forum exists.

I basically took a complicated METS wrapper around some XHTML, a wrapper that was really just expressing metadata about the XHTML, and refactored the whole thing as plain XHTML. Not that METS is a bad XML smell generally, but in this particular case it was overkill. If you look closely you’ll see I’m using RDFa, similar to what Facebook are doing with their Open Graph protocol. There’s less to get wrong, what’s there should look more familiar to web developers who aren’t versed in arcane library standards, and I can now read the metadata from the XHTML with an RDFa-aware parser, like Python’s rdflib:

>>> import rdflib
>>> g = rdflib.Graph()
>>> g.parse('essays/1.html', format='rdfa')
>>> for triple in g: print triple
... 
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/creator'), rdflib.term.Literal(u'Library of Congress'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/title'), rdflib.term.Literal(u'The National Forum (Washington, DC), 1910-19??'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/description'), rdflib.term.Literal(u'\n    

\nThe first issue of the National Forum was likely released on April 30, 1910 and the newspaper ran through at least November 12 of that year. The four-page African-American weekly covered such local events as Howard University graduations and Baptist church activities, but its pages also included national news, sports, home maintenance, women s news, science, editorial cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news affected the city s black community. A unique feature was its coverage of Elks Club meetings and activities. Business manager John H. Wills contributed the community-centered "Vanity Fair" column that usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who went on to publish another African-American newspaper, the McDowell Times of Keystone, West Virginia. Originally located at 609 F St., NW, the newspaper s offices moved in August to 1022 U Street, N.W. to be closer to the African-American community it served. No extant first issue of the National Forum exists.\n

\n ', datatype=rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral')))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/subject'), rdflib.term.Literal(u'http://chroniclingamerica.loc.gov/lccn/sn82015056#title'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/created'), rdflib.term.Literal(u'2007-01-10T09:00:00'))
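
Part of the appeal of putting the metadata in plain attributes is that even a web developer without an RDFa parser can get at the simple values. The sketch below shows this with nothing but Python's standard library. It is not a real RDFa processor: it only collects `property`/`content` pairs from `meta` elements and does not expand prefixes into full URIs. The `SAMPLE` markup, the `MetaPropertyParser` class, and the `read_meta_properties` function are hypothetical names and content made up for illustration; the sample is an assumed shape loosely based on the triples printed above, not a copy of the actual file.

```python
from html.parser import HTMLParser

# Hypothetical sample markup (an assumption for illustration, not the real file).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dcterms="http://purl.org/dc/terms/">
  <head>
    <title property="dcterms:title">The National Forum (Washington, DC), 1910-19??</title>
    <meta property="dcterms:creator" content="Library of Congress" />
    <meta property="dcterms:created" content="2007-01-10T09:00:00" />
  </head>
  <body></body>
</html>"""


class MetaPropertyParser(HTMLParser):
    """Collect property/content attribute pairs from <meta> elements only."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here, because the
        # default handle_startendtag() delegates to handle_starttag().
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if "property" in attr_map and "content" in attr_map:
            self.properties[attr_map["property"]] = attr_map["content"]


def read_meta_properties(markup):
    """Return a dict mapping prefixed property names to their content values."""
    parser = MetaPropertyParser()
    parser.feed(markup)
    parser.close()
    return parser.properties


metadata = read_meta_properties(SAMPLE)
print(metadata["dcterms:creator"])
```

Anything beyond flat `meta` values, such as the title carried in element text or the XML literal description, is better left to a proper RDFa parser, since that is where prefix resolution and literal handling start to matter.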