Wikimania Revisited

I recently attended the Wikimania conference here in Washington, DC. I really can’t express how amazing it was to be a Metro ride away from more than 1,400 people from 87 countries who were passionate about creating a world in which every single human being can freely share in the sum of all knowledge. It was my first Wikimania, and I had pretty high expectations, but I was blown away by the level of enthusiasm and creativity of the attendees. Since my employer supported me by allowing me to spend the week there, I thought I would jot down some notes about the things that I took from the conference, from the perspective of someone working in the cultural heritage sector.

Archivy

Of course the big news from Wikimania for folks like me who work in libraries and archives was the plenary speech by the Archivist of the United States, David Ferriero. Ferriero did an excellent job of connecting NARA’s mission to that of the Wikipedia community. In particular he stressed that NARA cannot solve difficult problems like the preservation of electronic records without the help of open government, transparency and citizen engagement to shape its policies and activities. As a library software developer I’m as interested as the next person in technical innovations in the digital preservation space: be they new repository software, flavors of metadata and digital object packaging, web services and protocols, etc. But over the past few years I’ve been increasingly convinced that access to the content that is being preserved is an absolutely vital ingredient to its preservation. If open access (as in the case of NARA) isn’t possible due to licensing concerns, then it is still essential to let access by some user community drive and ground efforts to collect and preserve digital content. Seeing high level leadership in the cultural heritage space (and from the federal government no less) address this issue was really inspiring.

At the Archives our concepts of openness and access are embedded in our mission. The work we do every day is rooted in the belief that citizens have the right to see, examine, and learn from the records that guarantee citizens rights, document government actions, and tell the story of our nation.

My biggest challenge is visibility: not everyone knows who we are, what we do, or more importantly, the amazing resources we collect and house. The lesson I learned in my time in New York is that it isn’t good enough to create great digital collections, and sit back and expect people to find you. You need to be where the people are.

The astounding thing is that it’s not just talk. Ferriero went on to describe several ways in which the Archives is collaborating with the Wikipedia community, which are also documented at a high level in NARA’s Open Government Plan. One example that stood out for me was NARA’s Today’s Document website, which highlights documents from its collections. On June 1st, 2011 they featured a photograph of Howard P. Perry, who was the first African American to enlist in the US Marine Corps after it was desegregated on June 1st, 1942. The efforts of NARA’s Wikipedian in Residence, Dominic McDevitt-Parks, to bring archival content to the attention of Wikipedians resulted in a new article, Desegregation in the United States Marine Corps, being created that same day…and the photograph on NARA’s website was viewed more than 4 million times in 8 hours. What proportion of the web traffic was driven by Wikipedia specifically rather than other social networking sites wasn’t exactly clear, but the point is that this is what happens when you get your content where the users are. If my blog post is venturing into tl;dr territory, please be sure to at least watch his speech; it’ll just take 20 minutes.

Resident Wikipedians

In a similar vein Sara Snyder made a strong case for the use of archival materials on Wikipedia in her talk 5 Reasons Why Archives are an Untapped Goldmine for Wikimedians. She talked about the work that Sarah Stierch did as the Wikipedian in Residence at the Smithsonian Archives of American Art. The partnership resulted in ~300 WPA images being uploaded to Wikimedia Commons, 37 new Wikipedia articles, and new connections with a community of volunteers who participated in edit-a-thons to improve Wikipedia and learn more about the AAA collections. She also pointed out that since 2010 Wikipedia has driven more traffic to the Archives of American Art website than all other social media combined.

In the same session Dominic McDevitt-Parks spoke about his activities as the Wikipedian in Residence at the US National Archives. Dominic focused much of his presentation on NARA’s digitization work, largely done by volunteers, the use of Wikimedia Commons as a content platform for the images, and ultimately WikiSource as a platform for transcribing the documents. The finished documents are then linked to from NARA’s Online Catalog, as in this example: Appeal for a Sixteenth Amendment from the National Woman Suffrage Association. NARA also prominently links out to the content waiting to be transcribed at WikiSource on its Citizen Archivist Dashboard. If you are interested in learning more, Dominic has written a bit about the work with WikiSource on the NARA blog. Both Dominic and Sara will be speaking next month at the Society of American Archivists Annual Meeting, making the case for Wikipedia to the archival community. Their talk is called 80,000 Volunteers Can’t Be Wrong: The Case for Greater Collaboration with Wikipedia, and I encourage you to attend if you will be at SAA.

The arrival of Wikipedians in Residence is a welcome sea change in the Wikipedia community, where historically there had been some uncertainty about the best way for cultural heritage organizations to highlight their original content in Wikipedia articles. As Sara pointed out in her talk, it helps both sides (the institutional side, and the Wikipedia side) to have an actual, experienced Wikipedian on site to help the organization understand how it wants to engage the community. Having direct contact with archivists, curators and librarians who know their collections backwards and forwards also helps the resident in knowing how to direct their work, and the work of other Wikipedians. The Library of Congress made an announcement at the Wikimania reception that the World Digital Library is seeking a Wikipedian in Residence. I don’t work directly on the project anymore, but I know people who do, so let me know if you are interested and I can try to connect the dots.

I think in a lot of ways the residency program is an excellent start, but really it’s just that: a start. The task at hand of connecting the Wikipedia community and article content with the collections of galleries, libraries, archives and museums is a huge one. One person, especially a temporary volunteer, can only do so much. As you probably know, Wikipedia editors can often be found embedded in cultural heritage organizations. It’s one of the reasons why we started having informal Wikipedia lunches at the Library of Congress: to see what can be done at the grassroots level by staff to integrate Wikipedia into our work. When we started to meet I learned about an earlier, 4-year-old effort to create a policy that provides guidance to staff about how to interact with the Wikipedia community as editors. Establishing a residency program is an excellent way to signal a change in institutional culture, and to bootstrap and focus the work. But I think the residencies also highlight the need to empower staff throughout the organization to participate as well, so that after the resident leaves the work goes on. In addition to establishing a WDL Wikipedian in Residence I would love to see the Library of Congress put the finishing touches on its Wikipedia policy, which would empower staff to use and contribute to Wikipedia as part of their work, without lingering doubt about whether it was correct or not. It would probably be helpful for organizations that already have such policies to publish them as examples for others wanting the same.

Wikipedia as a Platform

Getting back to Wikimania, I wanted to highlight a few other GLAM related projects that use Wikipedia as a platform.

Daniel Mietchen spoke about work he was doing around the Open Access Media Importer (OAMI). The OAMI is a tool that harvests media files (images, movies, etc) from open access materials and uploads them to Wikimedia Commons for use in article content. Efforts to date have focused primarily on PubMed from the National Institutes of Health. For me, as someone working in the digital preservation field, one of the interesting outcomes of the work so far was a table that illustrated the media formats present in PubMed.

Since Daniel and other OAMI collaborators are scientists they have been focused primarily on science related media…so they naturally are interested in working with arXiv. arXiv is a heavily trafficked, volunteer supported, pre-print server, that is normally a poster child for open repositories. But one odd thing about arXiv that Daniel pointed out is that while arXiv collects licensing information from authors as part of deposit, they do not indicate in the user interface which license has been used. This makes it particularly difficult for the OAMI to determine which content can be uploaded to the Wikimedia Commons. I learned from Simeon Warner shortly afterwards that while the licensing information doesn’t show up in the UI currently, and isn’t present in all the metadata formats that their OAI-PMH service provides, it can be found squirreled away in the arXivRaw format. So it should be theoretically possible to modify the OAMI to use arXivRaw.
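To make the arXivRaw idea a little more concrete, here is a minimal Python sketch of how a harvester might look for license information. It assumes arXiv’s public OAI-PMH endpoint at http://export.arxiv.org/oai2, the arXivRaw namespace http://arxiv.org/OAI/arXivRaw/, and a license element inside each record. The sample record, its identifier and the helper names are made up for illustration, and none of this is taken from the OAMI code itself.

```python
from typing import Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint and namespace; check arXiv's OAI-PMH documentation before relying on them.
ARXIV_OAI = "http://export.arxiv.org/oai2"
RAW_NS = {"raw": "http://arxiv.org/OAI/arXivRaw/"}


def get_record_url(arxiv_id: str) -> str:
    """Build an OAI-PMH GetRecord URL that asks for the arXivRaw format."""
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:arXiv.org:{arxiv_id}",
        "metadataPrefix": "arXivRaw",
    }
    return f"{ARXIV_OAI}?{urlencode(params)}"


def extract_license(record_xml: str) -> Optional[str]:
    """Return the license URL from an arXivRaw record, or None if absent."""
    root = ET.fromstring(record_xml)
    # Search at any depth so this works on a bare record or a full OAI-PMH response.
    element = root.find(".//raw:license", RAW_NS)
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()


# Hypothetical, heavily trimmed record used only to exercise the parser.
SAMPLE_RECORD = """
<arXivRaw xmlns="http://arxiv.org/OAI/arXivRaw/">
  <id>0000.00000</id>
  <title>An Example Title</title>
  <license>http://creativecommons.org/licenses/by/3.0/</license>
</arXivRaw>
"""

if __name__ == "__main__":
    print(get_record_url("0000.00000"))
    print(extract_license(SAMPLE_RECORD))
```

A harvester could then skip any record where extract_license returns None or a license that isn’t compatible with the Commons, rather than guessing.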

Another challenge the OAMI faces is the extraction of metadata. For example, media files often don’t share all the subject keywords that are appropriate for the entire article, so knowing which ones to apply can be difficult. In addition, metadata extraction from Wikimedia Commons was reported to not be optimal, since it involves parsing MediaWiki templates, which limits the downstream use of the content added to the Commons. I don’t know if the Public Library of Science is on the radar for harvesting, but if it isn’t it should be. The OAMI work also seems loosely related to the issue of research data storage and citation, which seems to be on the front burner for those interested in digital repositories. Jimmy Wales has reportedly been advising the UK government on how to make funded research available to the public. I’m not sure if datasets fit the purview of the Wikimedia Commons, but since Excel is #3 among the PubMed media formats Daniel showed, perhaps they do. It might be interesting to think more about Wikimedia Commons as a platform for publishing (and citing) datasets.

I learned about another interesting use of the Wikimedia Commons from Maarten Dammers and Dan Entous during their talk about the GLAMwiki Toolset. The project is a partnership between Wikimedia Netherlands and Europeana. If you aren’t already familiar with Europeana, it is an EU-funded effort to enhance access to European cultural heritage material on the Web. The project is just getting kicked off now, and is aiming to:

…develop a scalable, maintainable, easy to use system for mass uploading open content from galleries, libraries, archives and museums to Wikimedia Commons and to create GLAM-specific requirements for usage statistics.

Wikimedia Commons can be difficult to work with in an automated, batch oriented way for a variety of reasons. One that was mentioned above is metadata. The GLAMwiki Toolset will provide some mappings from commonly held metadata formats (starting with Dublin Core) to Commons templates, and will provide a framework for adapting the tool to custom formats. There is also a perceived need for tools to manage batch imports as well as exports from the Commons. The other big need is for usable analytics tools that let you see how content is used and referenced on the Commons once it has been uploaded. Maarten indicated that they are seeking participation in the project from other GLAM organizations. I imagine that there are other organizations that would like to use the Wikimedia Commons as a content platform, to enable collaboration across institutional boundaries. Wikipedia is one of the most popular destinations on the Web, so they have been forced to scale their technical platform to support this demand. Even the largest cultural heritage organizations can often find themselves bound to somewhat archaic legacy systems that make it difficult to similarly scale their infrastructure. I think services like Wikimedia Commons and WikiSource have a lot to offer cash-strapped organizations that want to do more to provide access to their unique materials on the Web, but are not in a position to make the technical investments to make it happen. I’m hoping that efforts like the GLAMwiki Toolset will make this easier to achieve, and it is something I personally would like to get involved in.
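The kind of metadata mapping described above, from Dublin Core to a Commons template, can be sketched in a few lines of Python. This is only an illustration, not the GLAMwiki Toolset’s actual design: the choice of Dublin Core fields, their mapping onto the parameters of the Commons Information template, and the function name are all assumptions.

```python
# Assumed mapping from Dublin Core element names to parameters of the
# Commons {{Information}} template; a real mapping would be institution-specific.
DC_TO_INFORMATION = {
    "description": "description",
    "date": "date",
    "source": "source",
    "creator": "author",
    "rights": "permission",
}


def dc_to_information_template(record: dict) -> str:
    """Render a flat Dublin Core record as {{Information}} wikitext."""
    lines = ["{{Information"]
    for dc_field, param in DC_TO_INFORMATION.items():
        value = record.get(dc_field, "")
        # A raw pipe would end the template parameter early, so escape it.
        value = value.replace("|", "{{!}}")
        lines.append(f"|{param}={value}")
    lines.append("}}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical record for illustration only.
    example = {
        "description": "Portrait photograph",
        "creator": "Unknown photographer",
        "date": "1887",
        "source": "Example Archive, item 42",
        "rights": "Public domain",
    }
    print(dc_to_information_template(example))
```

Keeping the mapping in a plain dictionary is one way an institution could swap in its own field names without touching the rendering code, which is roughly what a framework for custom formats would need.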

Incidentally, one of the more interesting technical track talks I attended was a talk by Ben Hartshorne from the Wikimedia Foundation Operations Team, about their transition from NFS to OpenStack Swift for media storage. I had some detailed notes about this talk, but proceeded to lose them. I seem to remember that in total, the various Wikimedia properties amount to 40 TB of media storage (images, videos, etc), and they want to be able to grow this to 200 TB this year. Ben included lots of juicy details about the hardware and deployment of Swift in their infrastructure, so I’ve got an email out to him to see if he can share his slides (update: he just shared them, thanks Ben!). The placement of various caches (Swift is an HTTP REST API), as well as the hooks into MediaWiki, were really interesting to me. The importance of URL addressable object storage for bitstreams in an enterprise that is made up of many different web applications can’t be overstated. It was also fun to hear about the impact that digitization projects like Wikipedia Loves Monuments and the NARA work mentioned above are having on the backend infrastructure. It’s great to hear that Wikipedia is planning for growth in the area of media storage, and can scale horizontally to meet it, without paying large sums of money for expensive, proprietary, vendor supplied NAS solutions. What wasn’t entirely clear from the presentation is whether there is a generic tipping point where investing in staff and infrastructure to support something like Swift becomes more cost-effective than using a storage solution like Amazon S3. Ben did indicate that their use of Swift and the abstractions they built into MediaWiki would allow for using storage APIs like S3.
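To illustrate what URL addressable object storage means in practice, here is a small Python sketch of how an object’s address is put together in Swift’s REST API, where objects live at /v1/{account}/{container}/{object}. The host, account, container and file names are invented for the example and say nothing about how Wikimedia actually lays out its storage.

```python
from urllib.parse import quote


def swift_object_url(endpoint: str, account: str, container: str, obj: str) -> str:
    """Build the URL of an object using Swift's /v1/{account}/{container}/{object} layout."""
    # Object names may contain slashes (pseudo-directories), so keep "/" unescaped there.
    path = "/".join([
        "v1",
        quote(account, safe=""),
        quote(container, safe=""),
        quote(obj, safe="/"),
    ])
    return f"{endpoint.rstrip('/')}/{path}"


if __name__ == "__main__":
    # Hypothetical values; any web application that knows the URL can GET the same bitstream.
    print(swift_object_url(
        "https://swift.example.org",
        "AUTH_media",
        "images",
        "a/ab/Walt Whitman.jpg",
    ))
```

Because the address is just an HTTP URL, caches can sit in front of it and different applications can share the same stored file without knowing anything about the disks behind it.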

Before I finish this post, there were a couple of other Wikipedia related topics that I didn’t happen to see discussed at Wikimania (it’s a multi-track event so I may have just missed them). One is the topic of image citation on Wikipedia. Helena Zinkham (Chief of the Prints and Photographs Division at the Library of Congress) recently floated a project proposal at LC’s Wikipedia Lunch to more prominently place the source of an image in Wikipedia articles. For an example of what Helena is talking about take a look at the article for Walt Whitman: notice how the caption doesn’t include information about where the image came from? If you click on the image you get a detail page that does indicate that the photograph is from LC’s Prints & Photographs collection, with a link back to the Prints & Photographs Online Catalog. I agree with Helena that more prominent information about the source of photographs and other media in Wikipedia could encourage more participation from the GLAM community. The best way to proceed with the idea is still an open question. I’m new to the way projects get started and RFCs work there. Hopefully we will continue to work on this in the context of the grassroots Wikipedia work at LC. If you are interested please drop me an email.

Another Wikipedia project directly related to my $work is the Digital Preservation WikiProject that the National Digital Stewardship Alliance is trying to kickstart. One of the challenges of digital preservation is the identification of file formats, and their preservation characteristics. English Wikipedia currently has 325 articles about Computer File Formats, and one of the goals of the Digital Preservation project is to enhance these with predictable infoboxes that usefully describe the format. External data sources such as PRONOM and UDFR also contain information about data formats. It’s possible that some of them could be used to improve Wikipedia articles, to more widely disseminate digital preservation information. Also, as Ferriero noted, it’s important for cultural heritage organizations to get their information out to where the people are. Jason Scott of ArchiveTeam has been talking about a similar project to aggregate information about file formats to build better tools for format identification. While I can understand the desire to build a new wiki to support this work, and there are challenges to working with the Wikipedia community, I think Linus’ Law points the way to using Wikipedia.

Beginning

So, I could keep going, but in the interests of time (yours and mine) I have to wrap this Wikimania post up (for now). Thanks for reading this far through my library-colored glasses. Oddly I didn’t even get to mention the most exciting and high profile projects under development, Wikidata and the Visual Editor, which are poised to change what it means to use and contribute to Wikipedia for everyone, not just GLAM organizations. Wikidata is of particular interest to me because if successful it will bring many of the ideas of Linked Data to bear on an eminently practical problem that Wikipedia faces. In some ways the Wikidata project is following in the footsteps of the successful DBpedia and Google Freebase projects. But there is a reason why Freebase and DBpedia have spent time engineering their Wikipedia updates: Wikipedia is where the users are creating content. Hopefully I’ll be able to attend Wikimania next year to see how they are doing. And I hope that my first Wikimania marks the beginning of a more active engagement in what Wikipedia is doing to transform the Web and the World.