bad xml smells
I’m used to refactoring code smells, but sometimes you can catch a bad whiff in XML too.
Before:
<?xml version="1.0" encoding="UTF-8"?>
Library of Congress
Library of Congress
sn82015056
The National Forum (Washington, DC), 1910-19??
The first issue of the National Forum was likely released on April 30, 1910
and the newspaper ran through at least November 12 of that year. The four-page African-American
weekly covered such local events as Howard University graduations and Baptist church activities, but its
pages also included national news, sports, home maintenance, women's news, science, editorial
cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news
affected the city's black community. A unique feature was its coverage of Elks Club meetings and
activities. Business manager John H. Wills contributed the community-centered "Vanity Fair" column that
usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who
went on to publish another African-American newspaper, the
McDowell Times of Keystone, West Virginia. Originally located at
609 F St., NW, the newspaper's offices moved in August to 1022 U Street, N.W. to be closer to the
African-American community it served. No extant first issue of the National
Forum exists.
After:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
The National Forum (Washington, DC), 1910-19??
The first issue of the National Forum was likely released on April 30, 1910 and the
newspaper ran through at least November 12 of that year. The four-page African-American weekly
covered such local events as Howard University graduations and Baptist church activities, but its pages
also included national news, sports, home maintenance, women's news, science, editorial cartoons, and
reprinted stories from national newspapers. Its primary focus was on how the news affected the city's black community. A unique feature was its coverage of Elks Club meetings and activities. Business
manager John H. Wills contributed the community-centered "Vanity Fair" column that usually appeared
on the front page of each issue. The publisher and editor was Ralph W. White, who went on to publish
another African-American newspaper, the
McDowell Times of Keystone, West Virginia. Originally located at 609 F St., NW, the
newspaper's offices moved in August to 1022 U Street, N.W. to be closer to the African-American
community it served. No extant first issue of the National Forum exists.
I basically took a complicated METS wrapper around some XHTML, which was really just expressing metadata about the HTML, and refactored it as plain XHTML. Not that METS is a bad XML smell in general, but in this particular case it was overkill. If you look closely you’ll see I’m using RDFa, similar to what Facebook are doing with their Open Graph protocol. There’s less to get wrong, what’s there should look more familiar to web developers who aren’t versed in arcane library standards, and I can now read the metadata from the XHTML with an RDFa-aware parser, like Python’s rdflib:
>>> import rdflib
>>> g = rdflib.Graph()
>>> g.parse('essays/1.html', format='rdfa')
>>> for triple in g: print(triple)
...
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/creator'), rdflib.term.Literal(u'Library of Congress'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/title'), rdflib.term.Literal(u'The National Forum (Washington, DC), 1910-19??'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/description'), rdflib.term.Literal(u'\n \nThe first issue of the National Forum was likely released on April 30, 1910 and the newspaper ran through at least November 12 of that year. The four-page African-American weekly covered such local events as Howard University graduations and Baptist church activities, but its pages also included national news, sports, home maintenance, women\'s news, science, editorial cartoons, and reprinted stories from national newspapers. Its primary focus was on how the news affected the city\'s black community. A unique feature was its coverage of Elks Club meetings and activities. Business manager John H. Wills contributed the community-centered "Vanity Fair" column that usually appeared on the front page of each issue. The publisher and editor was Ralph W. White, who went on to publish another African-American newspaper, the McDowell Times of Keystone, West Virginia. Originally located at 609 F St., NW, the newspaper\'s offices moved in August to 1022 U Street, N.W. to be closer to the African-American community it served. No extant first issue of the National Forum exists.\n\n ', datatype=rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral')))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/subject'), rdflib.term.Literal(u'http://chroniclingamerica.loc.gov/lccn/sn82015056#title'))
(rdflib.term.URIRef('file:///home/ed/Projects/essays/essays/1.html'), rdflib.term.URIRef('http://purl.org/dc/terms/created'), rdflib.term.Literal(u'2007-01-10T09:00:00'))
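
To show why this is friendlier to web developers, here is a minimal sketch of pulling the same kind of metadata out with nothing but Python's standard library. It is not a conforming RDFa processor: it only looks at `property` and `content` attributes and ignores `about`, `rel`, `datatype` and the rest. The `SAMPLE` markup, the `PREFIXES` map and the `extract_properties` helper are all made up for illustration; the real essays/1.html may be marked up differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for essays/1.html (an assumption, not the real file).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dcterms="http://purl.org/dc/terms/">
  <head>
    <title property="dcterms:title">The National Forum (Washington, DC), 1910-19??</title>
    <meta property="dcterms:creator" content="Library of Congress" />
    <meta property="dcterms:created" content="2007-01-10T09:00:00" />
  </head>
  <body>
    <div property="dcterms:description">
      <p>The first issue of the National Forum was likely released on April 30, 1910
      and the newspaper ran through at least November 12 of that year.</p>
    </div>
  </body>
</html>"""

# Prefix map hardcoded for the sketch; a real RDFa parser reads it from the xmlns declarations.
PREFIXES = {"dcterms": "http://purl.org/dc/terms/"}


def extract_properties(xhtml, prefixes):
    """Return (predicate URI, value) pairs for elements carrying a property attribute."""
    root = ET.fromstring(xhtml)
    pairs = []
    for element in root.iter():
        prop = element.get("property")
        if prop is None or ":" not in prop:
            continue
        prefix, _, local_name = prop.partition(":")
        if prefix not in prefixes:
            continue  # unknown prefix: skip rather than guess a namespace
        # Prefer an explicit content attribute, else fall back to the element's text.
        value = element.get("content")
        if value is None:
            value = " ".join("".join(element.itertext()).split())
        pairs.append((prefixes[prefix] + local_name, value))
    return pairs


if __name__ == "__main__":
    for predicate, value in extract_properties(SAMPLE, PREFIXES):
        print(predicate, "=>", value)
```

A proper RDFa parser also gives you the subject of each statement and typed literals, so this is only meant to show how little machinery the simple cases need.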