Flickr Commons LAMs

After the last post Seb got me wondering if there were any differences between libraries, archives and museums when looking at upload and comment activity in Flickr Commons in Aaron’s snapshot of the Flickr Commons metadata.

First I had to get a list of Flickr Commons organizations and classify them as either a library, museum or archive. It wasn’t always easy to pick, but you can see the result here. I lumped galleries in with museums. I also lumped historical societies in with archives. Then I wrote a script that walked around in the Redis database I already had from loading Aaron’s data.

In doing this I noticed there were some Flickr Commons organizations that were missing from Aaron’s snapshot:

  • Tasmanian Archive and Heritage Office Commons
  • The Royal Library, Denmark
  • The Finnish Museum of Photography
  • Musée McCord Museum

Update: Aaron quickly fixed this.

I didn’t do any research to see if these organizations had significant activity. Also, since there were close to a million files, I didn’t load the British Library activity yet. If there’s interest in adding them into the mix I’ll splurge for the larger ec2 instance.

Anyhow, below are the results. You can find the spreadsheet for these graphs up in Google Docs.

This was all done rather quickly, so if you notice anything odd or that looks amiss please let me know. Initially it seemed a bit strange to me that libraries, archives and museums trended so similarly in each graph, even if the volume was different.


Where Brooklyn At?

As a follow up to my last post I added a script to my fork of Aaron’s py-flarchive that will load up a Redis instance with comments, notes, tags and sets for Flickr images that were uploaded by Brooklyn Museum. The script assumes you’ve got a snapshot of the archived metadata, which I downloaded as a tarball. It took several hours to unpack the tarball on a medium ec2 instance; so if you want to play around and just want the redis database let me know and I’ll get it to you.

Once I loaded up Redis I was able to generate some high level stats:

  • images: 5,697
  • authors: 4,617
  • tags: 6,132
  • machine tags: 933
  • comments: 7,353
  • notes: 963
  • sets: 141

For only 5,697 images that is an astonishing number of authors: unique people who added tags, comments or notes. If you are curious I generated a list of the tags and saved them as a Google Doc. The machine tags were particularly interesting to me. The majority (849) of them look like Brooklyn Museum IDs of some kind, for example:

bm:unique=S10_08_Thebes/9928

But there were also 51 geotags, and what looks like 23 links to items in Pleiades, for example:

tag:pleiades:depicts=721417202

If I had to guess I’d say this particular machine tag indicated that the Brooklyn Museum image depicted Abu Simbel. Now there weren’t tons of these machine tags but it’s important to remember that other people use Flickr as a scratch space for annotating images this way.
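Machine tags follow a namespace:predicate=value pattern, so tallying them by namespace is straightforward. Here is a minimal sketch of how that could be done; it is not the script I actually ran, and the function names are made up for illustration. It assumes the namespace is everything before the last colon, so that a prefixed tag like tag:pleiades:depicts keeps tag:pleiades together.

```python
from collections import Counter


def parse_machine_tag(raw):
    """Split 'namespace:predicate=value' into its three parts."""
    left, sep, value = raw.partition("=")
    if not sep:
        raise ValueError(f"not a machine tag: {raw!r}")
    # Split on the last colon so a prefixed namespace such as
    # 'tag:pleiades' stays in one piece (an assumption of this sketch).
    namespace, sep, predicate = left.rpartition(":")
    if not sep:
        raise ValueError(f"missing namespace: {raw!r}")
    return namespace, predicate, value


def count_namespaces(tags):
    """Tally machine tags by namespace."""
    return Counter(parse_machine_tag(t)[0] for t in tags)


# The two example tags from above.
tags = ["bm:unique=S10_08_Thebes/9928", "tag:pleiades:depicts=721417202"]
counts = count_namespaces(tags)
```

Run over the full list of machine tags, a tally like this is how you would arrive at per-namespace counts such as the ones mentioned above.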

If you aren’t familiar with them, Flickr notes are annotations of an image, where the user has attached a textual note to a region in the image. Just eyeballing the list, it appears that there is quite a bit of diversity in them, ranging from the whimsical:

  • cool! they look soo surreal
  • teehee somebody wrote some graffiti in greek
  • Lol are these painted?
  • Steaks are ready!

to the seemingly useful:

  • Hunter’s Island
  • Ramesses III Temple
  • Lapland Village
  • Lake Michigan
  • Montuemhat Crypt
  • Napoleon’s troops are often accused of destroying the nose, but they are not the culprits. The nose was already gone during the 18th century.

Similarly the general comments run the gamut from:

  • very nostalgic…
  • always wanted to visit Egypt

to:

  • Just a few points. This is not ‘East Jordan’ it is in the Hauran region of southern Syria. Second it is not Qarawat (I guess you meant Qanawat) but Suweida. Third there is no mention that the house is enveloped by the colonnade of a Roman peripteral temple.
  • The fire that destroyed the buildings was almost certainly arson. it occurred at the height of the Pullman strike and at the time, rightly or wrongly, the strikers were blamed.
  • You can see in the background, the TROCADERO with two towers .. This “medieval city” was built on the right bank where are now buildings in modern art style erected for the exposition of 1937.

Brooklyn Museum pulled over 48 tags from Flickr before they deleted the account. That’s just 0.7% of the tags that were there. None of the comments or notes were moved over.

In the data that Aaron archived there was one indicator of user engagement: the datetime included with comments. Combined with the upload time for the images it was possible to create a spreadsheet that correlates the number of comments with the number of uploads per month:

Brooklyn Museum Flickr Activity

I’m guessing the drop off in December of 2013 is due to that being the last time Aaron archived Brooklyn Museum’s metadata. You can see that there was a decline in user engagement: the peak in late 2008 / early 2009 was never matched again. I was half expecting to see that user engagement fell off when Brooklyn Museum’s interest in the platform (uploads) fell off. But you can see that they continued to push content to Flickr, without seeing much of a reward, at least in the shape of comments. It’s impossible now to tell if tagging, notes or sets trended differently.
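For anyone who wants to rebuild the spreadsheet, the monthly tally boils down to bucketing two lists of timestamps. This is a minimal sketch rather than the code I used; it assumes the upload and comment times are available as Unix timestamps in seconds, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def month_key(timestamp):
    """Convert a Unix timestamp to a 'YYYY-MM' bucket (UTC)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")


def monthly_activity(upload_times, comment_times):
    """Return rows of (month, uploads, comments) sorted by month."""
    uploads = Counter(month_key(t) for t in upload_times)
    comments = Counter(month_key(t) for t in comment_times)
    months = sorted(set(uploads) | set(comments))
    # Counter returns 0 for months that only appear in the other list.
    return [(m, uploads[m], comments[m]) for m in months]


# Made-up timestamps, for illustration only.
rows = monthly_activity(
    upload_times=[1230768000],
    comment_times=[1230768000 + 86400, 1233446400],
)
```

Each row can be written straight out as a CSV line and charted in a spreadsheet.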

Since Flickr includes the number of times each image was viewed, it’s possible to add up the views across all the images. The answer?

9,193,331

Not a bad run for 5,697 images. I don’t know if Brooklyn Museum downloaded their metadata prior to removing their account. But luckily Aaron did.


Glass Houses

You may have noticed Brooklyn Museum’s recent announcement that they have pulled out of Flickr Commons. Apparently they’ve seen a “steady decline in engagement level” on Flickr, and decided to remove their content from that platform, so they can focus on their own website as well as Wikimedia Commons.

Brooklyn Museum announced three years ago that they would be cross-posting their content to Internet Archive and Wikimedia Commons. Perhaps I’m not seeing their current bot, but they appear to have two, neither of which has done an upload since March of 2011, based on their user activity. It’s kind of ironic that content like this was uploaded to Wikimedia Commons by Flickr Uploader Bot and not by one of their own bots.

The announcement stirred up a fair bit of discussion about how an institution devoted to the preservation and curation of cultural heritage material could delete all the curation that has happened at Flickr. The theory being that all the comments, tagging and annotation that has happened on Flickr has not been migrated to Wikimedia Commons. I’m not even sure if there’s a place where this structured data could live at Wikimedia Commons. Perhaps some sort of template could be created, or it could live in Wikidata?

Fortunately, Aaron Straup-Cope has a backup copy of Flickr Commons metadata, which includes a snapshot of the Brooklyn Museum’s content. He’s been harvesting this metadata out of concern for Flickr’s future, but surprise, surprise – it was an organization devoted to preservation of cultural heritage material that removed it. It would be interesting to see how many comments there were. I’m currently unpacking a tarball of Aaron’s metadata on an ec2 instance just to see if it’s easy to summarize.

But:

I’m pretty sure I’m living in one of those.

I agree with Ben:

It would help if we had a bit more method to the madness of our own Web presence. Too often the Web is treated as a marketing platform instead of our culture’s predominant content delivery mechanism. Brooklyn Museum deserves a lot of credit for talking about this issue openly. Most organizations just sweep it under the carpet and hope nobody notices.

What do you think? Is it acceptable that Brooklyn Museum discarded the user contributions that happened on Flickr, and that all the people who happened to be pointing at said content from elsewhere now have broken links? Could Brooklyn Museum instead have decided to leave the content there, with a banner of some kind indicating that it is no longer actively maintained? Don’t lots of copies keep stuff safe?

Or perhaps having too many copies detracts from the perceived value of the currently endorsed places of finding the content? Curators have too many places to look, which aren’t synchronized, which add confusion and duplication. Maybe it’s better to have one place where people can focus their attention?

Perhaps these two positions aren’t at odds, and what’s actually at issue is a framework for thinking about how to migrate Web content between platforms. And different expectations about content that is self hosted, and content that is hosted elsewhere?


dump truck

I had a strange dream last night.

I was working on a consulting project with an archivist friend, and a group of others I didn’t know as well. I knew that the work was politically sensitive, and that it was important for some reason that escapes me now.

Before the work could start I was required to sign a document. I trusted my friend implicitly, and didn’t really read it closely. Afterwards my friend let me know that I had signed a document stating that I had committed suicide, and that I no longer legally existed. Apparently, this would provide a framework for the work to happen more easily.

I remember being concerned about my family. I walked outside to smoke a cigarette (something I don’t do anymore). A few of the other people joined me, and we got in a dump truck.


I don’t know why, but I woke up from the dream feeling strangely relaxed. I looked up a few things in our cheesy dreamer’s dictionary.

Archives. Anything to do with archives in a dream is a forerunner of unexpected legal entanglements.

Document. Business or legal documents in a dream are usually a warning against speculations, unless the documents were in an indigenous place such as a notary or lawyer’s office, in which case they portend a coming increase, possibly for inheritance.

Suicide. This dream is a signal that you need a change of scene or more mental relaxation. Try sharing your troubles with a trusted friend or adviser, but in any event stop brooding.

Cigar (or Cigarette) Whether you were offering them, smoking yourself, or observing someone else with them, this form of tobacco in a dream is a lucky omen pertaining to prosperity.

Unfortunately, there wasn’t an entry for dump truck.


To Mr. Hazard

Lots of Copies Keeps Stuff Safe, 1791-Style.

Time and accident are committing daily havoc on the originals deposited in our public offices. The late war has done the work of centuries in this business. The lost cannot be recovered, but let us save what remains; not by vaults and locks, which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies as shall place them beyond the reach of accident.

– Thomas Jefferson

GoogleBooks


Incompleteness

Zen spirit has come to mean not only peace and understanding, but devotion to art and to work, the rich unfoldment of contentment, opening the door to insight, the expression of innate beauty, the intangible charm of incompleteness. Zen carries many meanings, none of them entirely definable. If they are defined, they are not Zen.

Zen Flesh, Zen Bones, p. 18


Dissecting GettyImage Embeds

Yes, GettyImages have decided to encourage people to embed their images. Despite opinions to the contrary I think this is A Good Thing. So what happens when you embed a Getty image into your HTML? To get something like this in your page:

you need to include a little snippet of HTML in your pages:

<iframe src="//embed.gettyimages.com/embed/81901686?et=4td6Xm2f0k6pMgQVX7pNFA&sig=fhRom4eoepnZbyWjZ0_2N3SdVG1dxQTC2GUAK4XrPjg=" width="462" height="440" frameborder="0" scrolling="no"></iframe>

which in turn embeds this HTML into your page:

<!DOCTYPE html>
<html>
  <head>
    <base target="_parent" />
    <title>20 - 30 year old female worker pulls box off of warehouse shelf [Getty Images]</title>
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
    <!--[if lt IE 10]>
    <script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body>
  <link rel="stylesheet" type="text/css" href="//embed.gettyimages.com/css/style.css" />
<section id="embed-body" data-asset-id="81901686" data-collection-id="41">
  <a href="http://gty.im/81901686" target="_blank"><img src="http://d2v0gs5b86mjil.cloudfront.net/xc/81901686.jpg?v=1&c=IWSAsset&k=2&d=F5B5107058D53DF50D8BA2399504758256BF753C679B89B417A38C0E9F1FBB9F&Expires=1394499600&Key-Pair-Id=APKAJZZHJ4LGWQENK3OQ&Signature=UC1YXxhGwSAY0BduwMZqnFQ7fcAQTdCksDvYu4WVmNWlTou7NktH7rZ8uk7BLbupJ4sp0ijiDaA93Yi2XijnC-TtcUO1Kylcew4nZpM~Al9jD0OSfx5yNe7jcIalweGpLGOdMLTXn0wRs6XfEh3~1fc~csMrAesHJkUayhBqNxo6Xja-35XQLx98d5fg6UXazOsCRT-UzebWA4dFURz~BSxXgq0RtU~LhKVKRZvkUTvl2RrsqBcN4bW3i~dbNMwHKn~7s9dMy5CxH-7k4ELyJaBClWEO2Jgr5WV9cXy~WGBQnNd-5Lb7CMcZclzn88-LbmDnFcO~BVLgtSU5x-KTpw__" /></a>
  <footer>
    <ul class="meta">
      <li class="gi-logo icon icon-logo"></li>
      <li>Bob O'Connor / Stone</li>
    </ul>
    <ul class="reblog">
      <li>
        <a href="//twitter.com/share" title="Share on Twitter" class="twitter-share-button" data-lang="en" data-count="none" data-url="http://gty.im/81901686"></a>        
      </li>
      <li>
        <a class="icon-tumblr" target="_self" title="Share on Tumblr" href="//www.tumblr.com/share/video?embed=%3Ciframe%20src%3D%22%2f%2fembed.gettyimages.com%2fembed%2f81901686%3fet%3d4td6Xm2f0k6pMgQVX7pNFA%26sig%3dfhRom4eoepnZbyWjZ0_2N3SdVG1dxQTC2GUAK4XrPjg%3d%22%20width%3D%22462%22%20height%3D%22440%22%20frameborder%3D%220%22%20%3E%3C%2Fiframe%3E"></a>
      </li>
      <li>
        <a href="javascript:void(0);" title="Re-embed this image"><i class="icon-code"></i></a>
      </li>
    </ul>
  </footer>
</section>
  <aside class='modal embed-modal' style='display: none;'>
  <div class='contents'>
    <a class="icon modal-close icon-close" href="#close" title="Close"></a> 
    <span id="re-embed-body">
      <h3>Embed this image</h3>
      <p>Copy this code to your website or blog. <a href="http://www.gettyimages.com/helpcenter" target="_blank" id="learn-more">Learn more</a></p>
      <p class="commercial-use">
        Note: Embedded images may not be used for commercial purposes.</p>        
      <p id="embed-link">
        <textarea></textarea></p>
      <p class="terms">
        By embedding this image, you agree to Getty Images
        <a href="http://www.gettyimages.com/corporate/terms.aspx" target="_blank">terms of use</a>.</p>
    </span>
  </div>
</aside>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
<script type="text/javascript" src="/script/embed.js"></script>
    <script src="//platform.tumblr.com/v1/share.js"></script>
    <script src="//platform.twitter.com/widgets.js"></script>
  </body>
</html>

You can see Amazon’s CloudFront is being used as a CDN for the images, and that Getty are using CloudFront’s Signed URLs to expire the images…it looks like after 24 hours? This isn’t a problem because Getty are serving the page up, but anyone that’s tried to snag the image URL for reuse (Google Images?) will end up getting a 400 error.
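The expiry time is right there in the image URL: CloudFront signed URLs carry an Expires query parameter holding a Unix timestamp. A quick sketch for reading it, using an abbreviated copy of the URL above (the function name is mine, not anything from Getty or Amazon):

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url):
    """Return the expiry of a CloudFront signed URL as a UTC datetime."""
    query = parse_qs(urlsplit(url).query)
    return datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc)


# Abbreviated form of the image URL above; the signature is left out
# because only the Expires parameter matters here.
url = ("http://d2v0gs5b86mjil.cloudfront.net/xc/81901686.jpg"
       "?v=1&c=IWSAsset&k=2&Expires=1394499600")
expires = signed_url_expiry(url)
print(expires.isoformat())
```

For this URL the value works out to 01:00 UTC on March 11, 2014. Comparing that with the time the embed was fetched would show how long the window really is.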

I thought it was interesting that the embedded iframe gives you not only the image, author and collection, but also links to re-share the image on Twitter and Tumblr. I guess this is Viral Marketing 101, but it’s smart I think, since it encourages reuse, and the recycling of content on the Web. Conspicuously absent from the reshare buttons is Facebook – maybe there’s a story there? Also, as we’ll see in a second, the description of the image is missing from the embedded view:

20 - 30 year old female worker pulls box off of warehouse shelf

Of course the other big thing the iframe does is give Getty an idea of where their content is being used. Anyone who uses this one line embed iframe will trigger an HTTP request to an embed.gettyimages.com URL (hosted on Amazon EC2 incidentally). These requests, and their referral information can be stashed away and analyzed, so that Getty can get a picture of who is using their content, and how. Embedded images and the Twitter and Tumblr reshares are automatically linked to Getty’s specific short URLs, such as:

http://gty.im/81901686

The number used in the short URL is also used in the expanded URL:

http://www.gettyimages.com/detail/photo/year-old-female-worker-pulls-box-off-of-high-res-stock-photography/81901686

But the title text is just there for SEO, it can be changed to anything:

http://www.gettyimages.com/detail/photo/wikileaks-storage-annex/81901686
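Since only the trailing number matters, mapping any of these URLs back to the short form is a one-liner or two. A small sketch, assuming the asset ID is always the last path segment as it is in the examples here (the helper names are hypothetical):

```python
from urllib.parse import urlsplit


def asset_id(url):
    """Return the trailing numeric ID from a Getty detail or short URL."""
    last = urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]
    if not last.isdigit():
        raise ValueError(f"no asset ID found in {url!r}")
    return last


def short_url(url):
    """Build the gty.im permalink for any URL ending in an asset ID."""
    return "http://gty.im/" + asset_id(url)


# Both long forms shown above collapse to the same short URL.
a = short_url("http://www.gettyimages.com/detail/photo/year-old-female-worker-pulls-box-off-of-high-res-stock-photography/81901686")
b = short_url("http://www.gettyimages.com/detail/photo/wikileaks-storage-annex/81901686")
```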

Ordinarily I’d be down on the use of a short URL, but in this case its role is more of a permalink. Of course these short URLs have the same problem as Handles and PURLs in that people won’t ordinarily bookmark them. But, Que Sera Sera. As the Verge pointed out these embedded iframes could end up depriving Web content of lead images, if GettyImages decides to pull the plug on the embeds and they suddenly 404. But their credibility would suffer quite a bit by a decision like that. I think it’s important that they are encouraging the Web to rely on these URLs, and that they are putting their reputation on the line.

Of course lots of inbound links to those pages should do wonders for their PageRank. Plus, following that link allows you to purchase the image, explore other images by the photographer, related images in the GettyImages collection, as well as see some additional metadata about the photo: item number, rights, license type, original file dimensions, size, dots-per-inch. Some of this metadata is even expressed using RDFa (Facebook’s OpenGraph metadata) … which makes the lack of a Facebook share button even more interesting. In addition there is also some minimal use of schema.org HTML microdata for the search engines to nibble on. If you are curious, Google’s Structured Data Testing Tool provides a view on this metadata.
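If you would rather pull the OpenGraph properties out yourself, Python's standard library is enough. This is a rough sketch run against hypothetical markup, not the actual Getty page source; it only looks at meta elements whose property attribute starts with og:.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            self.properties[prop] = attrs["content"]


# Hypothetical markup, for illustration only.
sample = """
<html><head>
<meta property="og:title" content="Worker pulls box off warehouse shelf" />
<meta property="og:type" content="article" />
<meta name="description" content="not Open Graph" />
</head></html>
"""
parser = OpenGraphParser()
parser.feed(sample)
og = parser.properties
```

A full RDFa or microdata parser would pick up more than this, but for a quick look at what a landing page exposes it does the job.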

It seems like there’s an opportunity to express more information in RDFa or microdata, specifically the details about the original, as well as licensing/rights metadata. Oddly the RDFa doesn’t even mark up the author of the image, I suppose because Facebook’s OpenGraph doesn’t give a way of expressing it. They could start by marking up the author of the image, but what if Getty established photographer pages, so instead of Bob O’Connor linking to:

http://www.gettyimages.com/search/2/image?artist=Bob+O%27Connor&family=Creative

What if it linked to a vanity URL like:

http://www.gettyimages.com/people/bob-oconnor

This would be a perfect place to share links to author’s other social media accounts, a bio, their photographer friends, etc. I’m thinking of the sort of work that National Geographic are doing with their YourShot application, for example this Profile page for Bahareh Mohamadian.

The licensing restrictions and iframes around these images would have ordinarily turned me off. But given Getty’s market position in this space it’s completely understandable, and seems like a useful compromise for now. These landing pages are a perfect place to make more structured metadata available that could be used by integrating applications. Getty should invest in this real estate, not only for the Web, but also for data reuse across their enterprise. The landing pages are an example of just how influential Facebook and Google have been in promoting the use of metadata on the Web. Without them, I think it is safe to assume we wouldn’t have seen any structured metadata on these pages at all.


OCLC Works

The news about OCLC’s Linked Data service circulated widely on Twitter yesterday. I’ve never been a big OCLC cheerleader, but the news really hit home for me. I’ve been writing in my rambling way about Linked Data here for about 6 years. Of course there are many others who’ve been at it much longer than I have … and in a way I think librarians and archivists feel a kinship with the effort because it is cooked into the DNA of how we think about the Web as an information space.

Like Button

This new OCLC service struck me as an excellent development for the library Web community for a few reasons, that I thought I would quickly jot down:

  • it’s evolutionary: OCLC didn’t let the perfect be the enemy of the good. It’s great to hear links to VIAF, FAST, LCSH, etc are planned. But you have to start somewhere, and there is already significant value in expressing the FRBR workset data they have as Linked Data on the Web for others to use. Also, the domain experiment.worldcat.org clearly reflects this is an experiment…but they didn’t let anxiety about changing URLs prevent them from publishing what they can now. The future is longer than the past.
  • it’s snappy: I don’t know if they’ve written about the technical architecture they are using, but the views are quite responsive. Of course I have no idea what kind of load it is under, but so far so good. Update: Ron Buckley of OCLC let me know the service is built on top of a shared Apache HBase Hadoop cluster.
  • schema.org: OCLC has the brains and the market position to create their own vocabulary for bibliographic data. But they worked hard at engaging openly with the Web community to help clarify and adapt the Schema.org vocabulary so that it can be used by our community. There is lots of thrashing going on in this space at the moment, and OCLC is being a great model in trying to work with the Web we have, and iterating to make it better, instead of trying to take a quantum leap forward.
  • json-ld: JSON-LD has been cooking for a while, but it’s a brand new W3C standard for representing RDF as idiomatic JSON. RDF has been somewhat plagued in the past by esoteric and/or hard to understand representations. JSON-LD really seems to have hit a sweet-spot between the expressivity of RDF and the usability of the Web. It’s refreshing to see OCLC kicking JSON-LD’s tires.

Rubber Meet Road

So how do you discover these Work URIs? Richard’s post led me to believe I could get them directly from the xID service using an ISBN. But I found it to be a two step process: first get any OCLC Number associated with an ISBN from xID, and then use the OCLC Number to get the Work Identifier from the xID service:

So for example, to discover the Work URI for Tim Berners-Lee’s Weaving the Web you first look up the ISBN:

http://xisbn.worldcat.org/webservices/xid/isbn/0062515861?method=getMetadata&format=json&fl=*

which should yield:

{
    "list": [
        {
            "author": "Tim Berners-Lee with Mark Fischetti.",
            "city": "San Francisco",
            "ed": "1st ed.",
            "form": [
                "AA",
                "BA"
            ],
            "isbn": [
                "0062515861"
            ],
            "lang": "eng",
            "lccn": [
                "99027665",
                "00039593"
            ],
            "oclcnum": [
                "300691968",
                "318261941",
                "410824754",
                "41238513",
                "470718156",
                "558595430",
                "628749869",
                "768228949",
                "807901805",
                "43903751",
                "699807622"
            ],
            "publisher": "HarperSanFrancisco.",
            "title": "Weaving the Web : the original design and ultimate destiny of the World Wide Web by its inventor",
            "url": [
                "http://www.worldcat.org/oclc/300691968?referer=xid"
            ],
            "year": "1999"
        }
    ],
    "stat": "ok"
}

Then pick one of the OCLC Numbers (oclcnum) at random and use it to do an xID call:

http://xisbn.worldcat.org/webservices/xid/oclcnum/300691968?method=getMetadata&format=json&fl=*

Which should return:

{
    "list": [
        {
            "isbn": [
                "9780062515865",
                "9780062515872"
            ],
            "lccn": [
                "99027665"
            ],
            "oclcnum": [
                "300691968"
            ],
            "owi": [
                "owi27331745"
            ]
        }
    ],
    "stat": "ok"
}

You can then dig out the Work Identifier (owi), trim off the owi prefix, and put it on the end of a URL like:

http://experiment.worldcat.org/entity/work/data/27331745

or, if you want the JSON-LD without doing content negotiation:

http://experiment.worldcat.org/entity/work/data/27331745.jsonld

This returns a chunk of JSON data that I won’t reproduce here, but do check it out.
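The two-step lookup is easy to script. Below is a minimal sketch that follows the URL patterns and JSON shapes shown above; the function names are mine, it takes the first OCLC Number rather than a random one, and the lookup_work function needs network access and assumes the xID service still answers the way it did when I wrote this.

```python
import json
from urllib.request import urlopen

XID = "http://xisbn.worldcat.org/webservices/xid"
WORK = "http://experiment.worldcat.org/entity/work/data/"


def xid_url(id_type, value):
    """Build an xID getMetadata URL for an 'isbn' or 'oclcnum' lookup."""
    return f"{XID}/{id_type}/{value}?method=getMetadata&format=json&fl=*"


def first_oclcnum(response):
    """Pick the first OCLC Number out of an ISBN lookup response."""
    return response["list"][0]["oclcnum"][0]


def work_uri(response):
    """Turn the owi value in an OCLC Number lookup response into a Work URI."""
    owi = response["list"][0]["owi"][0]
    if owi.startswith("owi"):
        owi = owi[len("owi"):]  # trim the 'owi' prefix
    return WORK + owi


def lookup_work(isbn):
    """Chain the two HTTP calls; needs network access and a live service."""
    with urlopen(xid_url("isbn", isbn)) as r:
        oclcnum = first_oclcnum(json.load(r))
    with urlopen(xid_url("oclcnum", oclcnum)) as r:
        return work_uri(json.load(r))
```

Calling lookup_work("0062515861") should walk through the same steps as the example above. Append .jsonld to the result if you want the JSON-LD without content negotiation.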

Update: After hitting publish on this blog post I’ve corresponded a bit with Stephan Schindehette at OCLC and Alf Eaton about some inconsistencies in my blog post (which I’ve fixed), and uncertainty about what the xID API should be returning. Hopefully xID can be updated to return the OCLC Work Identifier when you lookup by ISBN. I’ll update this blog post if I am notified of a change.

Peanut Gallery

One bit of advice that I was given by Dave Longley on the #json-ld IRC channel, which I will pass along to OCLC, is that it might be better to use CURIE-less properties, e.g. name instead of schema:name, to make it easier to use (and read) the JSON from JavaScript. To do this you would need a more expressive @context, but I think it might make sense to reference an external context document and cut down on the size of the JSON-LD document even more.
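To make Dave's suggestion concrete: in JSON-LD the @context maps short terms to full IRIs, so the same statement can be written with or without the schema: prefix. The two documents below are hypothetical, not OCLC's actual output, and expand_keys is a toy that only resolves top-level keys; a real JSON-LD processor does far more.

```python
# Prefixed form: every property carries the schema: CURIE prefix.
prefixed = {
    "@context": {"schema": "http://schema.org/"},
    "schema:name": "Weaving the Web",
}

# CURIE-less form: the context maps the bare term to the full IRI.
plain = {
    "@context": {"name": "http://schema.org/name"},
    "name": "Weaving the Web",
}


def expand_keys(doc):
    """Resolve each top-level key to a full IRI using the inline @context."""
    context = doc["@context"]
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        prefix, sep, local = key.partition(":")
        if key in context:
            iri = context[key]  # bare term defined in the context
        elif sep and prefix in context:
            iri = context[prefix] + local  # CURIE: prefix plus local name
        else:
            iri = key  # unknown key, left untouched
        expanded[iri] = value
    return expanded


same = expand_keys(prefixed) == expand_keys(plain)
```

Both forms say the same thing as far as RDF is concerned, but only the second lets you write doc.name in JavaScript. The price is one context entry per property, which is why moving the context to an external document is attractive.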

It’s wonderful to see that the data is being licensed ODC-BY, but maybe assertions to that effect should be there in the data as well? I think schema.org have steered clear of licensing properties, but cc:license seems like a reasonable property to use, assuming it’s used with the right subject URI.

And one last tiny suggestion I have is that it would be nice to see the service mainstreamed into other parts of OCLC’s website. But I understand all too well the divides between R&D and production … and how challenging it can be to integrate them sometimes, even in the simplest of ways.


Presidential Papers

There’s an interesting piece in the New York Times this morning about the future of Obama’s Presidential library:

There are reports that Mr. Obama used to be skeptical of having a library at all; a bold move would be to revert to tradition and deposit his papers at the Library of Congress. (The National Archives and Records Administration manages the 13 presidential libraries, which nowadays are built and maintained with private funds. The Herbert C. Hoover Library opened in 1962.)

I knew about the Presidential Papers collection at the Library of Congress, but never knew that LC was the place Presidents would deposit their papers before FDR built the first presidential library.

I don’t know if this list is completely accurate, but I diffed the list of pre-FDR presidents with the list of Presidential Papers and came up with this short list of presidents who didn’t choose to deposit their papers with the Library of Congress:

  • John Adams
  • John Quincy Adams
  • Millard Fillmore
  • James Buchanan
  • Rutherford B Hayes
  • Warren Harding

At least one person thinks Obama’s library belongs in the cloud. I’m not 100% sure what they mean by the cloud, but embracing the distributed nature of archival collections, and using the Web to achieve that sounds like a good idea.


Summoner

I’ve recently been working with the Serials Solutions Summon API for some side work at George Washington University Libraries. Since the work is largely in Python I created a very simple reusable module called summoner for talking to the API. It largely just does the authentication bits, and gets out of the way…but it may be of interest to you if you use Python and work at a library that subscribes to Summon.

Anyway, I found myself wanting to get a picture of the metadata fields that are used in Summon responses, so I wrote a little script that does an open ended search and then walks through lots of results tallying up the fields used by content type. You can find the results for a scan of 50,000 records below.

As you can see the majority (78%) of the results were newspaper and journal articles. There was a fair bit of diversity in the fields returned for these. It was kind of interesting that 83% of the journal articles had ISSNs (I was expecting it to be higher). 50% had EISSNs so perhaps the 17% that lacked ISSNs had an EISSN instead. Only about 42% of journal articles had subject terms. The sampling of the other content areas was probably too low to make any guesses.

The big caveat here is that these numbers reflect GWU’s particular holdings and subscription options. Still, if you use the Summon API you might find them interesting; and if you are really curious you can run the script for your own institution.
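The tallying itself is simple. Here is a minimal sketch of the idea, separate from the actual script and leaving out the API calls; it assumes each result document is a dict whose values are lists, and that ContentType holds the content type as its first value. The function name and the sample documents are made up.

```python
from collections import Counter, defaultdict


def field_coverage(documents):
    """Return {content_type: (share_of_results, {field: coverage})}."""
    totals = Counter()
    fields = defaultdict(Counter)
    for doc in documents:
        content_type = doc.get("ContentType", ["Unknown"])[0]
        totals[content_type] += 1
        # Count each field once per document that carries it.
        fields[content_type].update(doc.keys())
    grand_total = sum(totals.values())
    report = {}
    for content_type, count in totals.items():
        coverage = {f: n / count for f, n in fields[content_type].items()}
        report[content_type] = (count / grand_total, coverage)
    return report


# Made-up documents, for illustration only.
docs = [
    {"ContentType": ["Journal Article"], "Title": ["a"], "ISSN": ["x"]},
    {"ContentType": ["Journal Article"], "Title": ["b"]},
    {"ContentType": ["Book"], "Title": ["c"]},
]
report = field_coverage(docs)
```

Multiplying the fractions by 100 gives percentages in the same shape as the tables that follow.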

Newspaper Articles (54.0%)

Field Name Coverage
PublicationYear 100.0%
openUrl 100.0%
PublicationPlace_xml 100.0%
DatabaseTitle 100.0%
Title 100.0%
Score 100.0%
PublicationTitle 100.0%
link 100.0%
PublicationDecade 100.0%
Publisher_xml 100.0%
Copyright 100.0%
LinkModel 100.0%
DBID 100.0%
ISSN 100.0%
MergedId 100.0%
isFullTextHit 100.0%
PQPubID 100.0%
StartPage 100.0%
PublicationDate 100.0%
PublicationPlace 100.0%
Genre 100.0%
PublicationDate_xml 100.0%
Publisher 100.0%
SourceType 100.0%
IsScholarly 100.0%
Language 100.0%
ExternalDocumentID 100.0%
URI 100.0%
PublicationCentury 100.0%
ID 100.0%
IsPeerReviewed 100.0%
inHoldings 100.0%
ContentType 100.0%
SSID 100.0%
thumbnail_l 100.0%
thumbnail_m 100.0%
thumbnail_s 100.0%
hasFullText 100.0%
Copyright_xml 100.0%
EISSN 100.0%
PQID 100.0%
Snippet 88.9%
DatabaseTitleList 88.9%
AbstractList 88.9%
Abstract 88.9%
SubjectTerms 74.1%
GeographicLocations 14.8%
GeographicLocations_xml 14.8%

Journal Articles (24.0%)

Field Name Coverage
PublicationYear 100.0%
openUrl 100.0%
Score 100.0%
PublicationTitle 100.0%
link 100.0%
PublicationDecade 100.0%
isFullTextHit 100.0%
LinkModel 100.0%
Title 100.0%
MergedId 100.0%
PublicationDate 100.0%
ExternalDocumentID 100.0%
PublicationDate_xml 100.0%
IsScholarly 100.0%
Language 100.0%
hasFullText 100.0%
ID 100.0%
IsPeerReviewed 100.0%
inHoldings 100.0%
ContentType 100.0%
SourceType 100.0%
PublicationCentury 100.0%
Author 91.7%
Author_xml 91.7%
StartPage 91.7%
ISSN 83.3%
Issue 83.3%
Volume 83.3%
thumbnail_l 83.3%
thumbnail_m 83.3%
thumbnail_s 83.3%
Publisher_xml 75.0%
Publisher 75.0%
SSID 75.0%
DatabaseTitle 58.3%
Copyright 58.3%
DBID 58.3%
Copyright_xml 58.3%
URI 50.0%
EISSN 50.0%
Snippet 41.7%
Genre 41.7%
SubjectTerms 41.7%
PQID 41.7%
PublicationPlace_xml 33.3%
Database_xml 33.3%
PQPubID 33.3%
PublicationPlace 33.3%
EndPage 33.3%
Discipline 25.0%
PageCount 25.0%
Notes 25.0%
DatabaseTitleList 25.0%
Audience 25.0%
AbstractList 25.0%
Abstract 25.0%
DOI 16.7%
RelatedPersons 16.7%
RelatedPersons_xml 16.7%
PublicationSeriesTitle 16.7%
ClassificationCodes 16.7%
CODEN 8.3%
GeographicLocations 8.3%
DLL_JC 8.3%
GeographicLocations_xml 8.3%
PCID 8.3%

Reference (10.0%)

Field Name Coverage
PublicationYear 100.0%
openUrl 100.0%
Title 100.0%
Source 100.0%
Score 100.0%
PublicationTitle 100.0%
link 100.0%
ISBN 100.0%
PublicationDecade 100.0%
Publisher_xml 100.0%
IsPeerReviewed 100.0%
LinkModel 100.0%
MergedId 100.0%
isFullTextHit 100.0%
PublicationDate 100.0%
ExternalDocumentID 100.0%
PublicationDate_xml 100.0%
Publisher 100.0%
IsScholarly 100.0%
Language 100.0%
PublicationCentury 100.0%
ID 100.0%
SourceType 100.0%
inHoldings 100.0%
ContentType 100.0%
SSID 100.0%
thumbnail_l 100.0%
thumbnail_m 100.0%
thumbnail_s 100.0%
hasFullText 100.0%
Copyright 80.0%
Snippet 80.0%
DBID 80.0%
URI 80.0%
Copyright_xml 80.0%
Discipline 60.0%
SubjectTerms 40.0%
PublicationPlace_xml 20.0%
EISBN 20.0%
Author_xml 20.0%
Edition 20.0%
Author 20.0%
StartPage 20.0%
PublicationPlace 20.0%

Patents (6.0%)

Field Name Coverage
Discipline 100.0%
PublicationYear 100.0%
openUrl 100.0%
Author 100.0%
Snippet 100.0%
Score 100.0%
link 100.0%
PublicationDecade 100.0%
Author_xml 100.0%
IsPeerReviewed 100.0%
LinkModel 100.0%
DBID 100.0%
Title 100.0%
MergedId 100.0%
isFullTextHit 100.0%
PublicationDate 100.0%
ExternalDocumentID 100.0%
PublicationDate_xml 100.0%
IsScholarly 100.0%
Language 100.0%
Notes 100.0%
URI 100.0%
DatabaseTitleList 100.0%
PublicationCentury 100.0%
AbstractList 100.0%
ID 100.0%
inHoldings 100.0%
ContentType 100.0%
SourceType 100.0%
Abstract 100.0%
hasFullText 100.0%

Books (2.0%)

Field Name Coverage
PublicationYear 100.0%
openUrl 100.0%
PublicationPlace_xml 100.0%
Author 100.0%
Source 100.0%
Score 100.0%
link 100.0%
LCCallNum 100.0%
PublicationDecade 100.0%
Author_xml 100.0%
IsPeerReviewed 100.0%
LinkModel 100.0%
LCCN 100.0%
DBID 100.0%
Title 100.0%
MergedId 100.0%
isFullTextHit 100.0%
PublicationDate 100.0%
ExternalDocumentID 100.0%
PublicationDate_xml 100.0%
IsScholarly 100.0%
Language 100.0%
PublicationPlace 100.0%
URI 100.0%
PublicationCentury 100.0%
OCLC 100.0%
FullText_t_NoSnippeting 100.0%
ID 100.0%
inHoldings 100.0%
ContentType 100.0%
SourceType 100.0%
hasFullText 100.0%

Web Resources (2.0%)

Field Name Coverage
PublicationYear 100.0%
openUrl 100.0%
Copyright 100.0%
Author 100.0%
Score 100.0%
link 100.0%
PublicationDecade 100.0%
Author_xml 100.0%
IsPeerReviewed 100.0%
LinkModel 100.0%
DBID 100.0%
Title 100.0%
MergedId 100.0%
isFullTextHit 100.0%
SubjectTerms 100.0%
ExternalDocumentID 100.0%
PublicationDate_xml 100.0%
PublicationDate 100.0%
IsScholarly 100.0%
Language 100.0%
Notes 100.0%
URI 100.0%
Copyright_xml 100.0%
hasFullText 100.0%
ID 100.0%
inHoldings 100.0%
ContentType 100.0%
SourceType 100.0%
PublicationCentury 100.0%

Book Chapters (2.0%)

Field Name Coverage
Discipline 100.0%
PublicationYear 100.0%
DOI 100.0%
openUrl 100.0%
PublicationPlace_xml 100.0%
Copyright 100.0%
Title 100.0%
EISBN 100.0%
Snippet 100.0%
Score 100.0%
PublicationTitle 100.0%
link 100.0%
ISBN 100.0%
PublicationDecade 100.0%
Publisher_xml 100.0%
IsPeerReviewed 100.0%
LinkModel 100.0%
MergedId 100.0%
isFullTextHit 100.0%
StartPage 100.0%
PublicationDate 100.0%
PublicationPlace 100.0%
PublicationDate_xml 100.0%
SubjectTerms 100.0%
Publisher 100.0%
SourceType 100.0%
IsScholarly 100.0%
Language 100.0%
ExternalDocumentID 100.0%
DatabaseTitleList 100.0%
PublicationCentury 100.0%
AbstractList 100.0%
ID 100.0%
SSID 100.0%
inHoldings 100.0%
ContentType 100.0%
EndPage 100.0%
Abstract 100.0%
thumbnail_l 100.0%
thumbnail_m 100.0%
thumbnail_s 100.0%
hasFullText 100.0%
Copyright_xml 100.0%