Archive for the ‘libraries’ Category

tripleshot

Friday, January 11th, 2008

Recently there was a bit of interesting news around MARBI Discussion Paper 2008-DP04 regarding semweb technologies at LC.

Related to this work are RDF/OWL representations and models for MODS and MARC, which we are also developing. Several representations of MODS in RDF/OWL, such as the one from the SIMILE project, have been made available as part of various projects and we have found them useful for our analysis and to inform our design process. We want to bring them together into one easily downloaded and maintained RDF/OWL file for use in community experimentation with RDF applications. Our time line is to have the MODS RDF ready for community comment by June.

WoGroFuBiCo cloud

Friday, January 11th, 2008

access accessible addition al american analysis application applications appropriate archives areas association authority available based benefit benefits bibliographic broad broader catalog catalogers cataloging catalogs cataloguing chain change changes classification code collaboration collections committee communities community congress consequences consider considered content continue control controlled cooperative cost costs create created creating creation current data databases dc description descriptive desired develop developed development different digital discovery distribution dublin ed education effort encourage enhance environment et evidence exchange exist findings focus format formats frameworks frbr future greater group headings hidden identifiers identify ifla impact include including increase increasingly information institution institutions international knowledge language lc lcs lcsh libraries limited lis maintaining make management marc materials metadata model national need needs networks new number oclc online organization organizations outcomes outside participants particular pcc possible potential practice practices primary principles process processes production program programs provide public publishers quo range rare rda recommendations records reference relationships report require requirements research resource resources responsibility results role rules search serve service services share shared sharing sources special specific standards states status subject supply support systems technology terms time today tools types unique united university use used users using value variety various vendors vocabularies washington ways web working works

same stats as before, but the top 200 this time, and as a cloud. It’s crying out for some kind of stemming to collapse some terms together I suppose…but it’s also 3:17AM.
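A crude sketch of the kind of collapsing meant here, assuming a dictionary of word frequencies like the one behind these stats. The `crude_stem` and `collapse` names and the two suffix rules are illustrative assumptions, not a real stemmer (a proper algorithm such as Porter's would do far better):

```python
from collections import Counter


def crude_stem(word):
    """Collapse simple plurals; a stand-in for a real stemmer (assumption)."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"  # libraries -> library
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]  # records -> record
    return word


def collapse(counts):
    """Merge the counts of words that share a crude stem."""
    merged = Counter()
    for word, n in counts.items():
        merged[crude_stem(word)] += n
    return merged


# A few counts taken from the word count post.
sample = {"library": 263, "libraries": 144, "record": 73, "records": 88, "access": 57}
print(collapse(sample).most_common())
```

The exclusions for "ss", "us" and "is" only keep words like access, status and analysis intact; plenty of other words would still be mangled.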

WoGroFuBiCo wc

Thursday, January 10th, 2008

word count
library 263
bibliographic 236
data 170
libraries 144
lc 127
control 109
information 98
cataloging 91
records 88
subject 82
materials 81
standards 81
use 80
congress 79
work 76
record 73
community 67
users 61
working 59
group 58
access 57
recommendations 56
resources 53
authority 52
metadata 47
future 46
new 40
environment 37
development 37
web 36
collections 35
systems 35
available 35
creation 35
services 34
headings 32
national 31
findings 30
research 30
unique 29
sharing 29
oclc 28
model 28
catalog 28
international 27
develop 27
value 27
lcsh 26
pcc 26
user 26
need 26
report 25
make 25
practices 25
rda 25
used 25
time 24
needs 24
rare 24
including 24
provide 23
discovery 23
communities 23
special 23
frbr 23
current 22
resource 22
rules 22
digital 21
cooperative 21
program 21
participants 21
management 21
service 20
dc 20
programs 20
online 20
costs 20
washington 20
standard 19
support 19
knowledge 19
different 19
appropriate 19
effort 18
applications 18
marc 18
shared 18
exchange 18
process 18
changes 17
lcs 17
increase 16
public 16
search 16
creating 16
broader 16
catalogs 16
controlled 16

I converted the pdf to a text file called ‘lc’ with xpdf and then wrote a little python:

#!/usr/bin/env python3

import re
from collections import Counter
from urllib.request import urlopen

# Fetch the stop word list; a set makes the membership test fast.
url = 'http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words'
stop_words = set(urlopen(url).read().decode('utf-8').split())
text = open('lc').read()

counts = Counter()
for word in text.split():
    # Normalize: lowercase, then strip punctuation and digits.
    word = re.sub(r'\W', '', word.lower())
    word = re.sub(r'\d+', '', word)
    if word == '' or word in stop_words:
        continue
    counts[word] += 1

# Print the 100 most frequent words, highest count first.
for word, count in counts.most_common(100):
    print("%20s %i" % (word, count))

Does me writing code to read the report count as reading the report? …

permalinks reloaded

Monday, December 17th, 2007

The recently announced Zotero / InternetArchive partnership is exciting on a bunch of levels. The one that immediately struck me was the use of the Internet Archive URI. As you may have noticed before, all the content in the Internet Archive Wayback Machine can be referenced with a URL that looks something like:

  • http://web.archive.org/web/{yyyymmddhhmmss}/{url}

Where url is the document URL you want to look up in the archive at the given time. So for example:

is a URL for what http://google.com looked like on December 02, 1998 at 23:04:10. Perhaps this is documented somewhere prominent or is common knowledge, but it looks like you can play with the timestamp, and archive.org will adjust as needed, redirecting you to the closest snapshot it can find:

and even:

which redirects to the most recent content for a given URL. It’s just a good old 302 at work:

ed@curry:~$ curl -I http://web.archive.org/web/199812/http://www.google.com/
HTTP/1.1 302 Found
Date: Mon, 17 Dec 2007 21:11:12 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.2 mod_ssl/2.0.54 OpenSSL/0.9.7g mod_perl/2.0.1 Perl/v5.8.7
Location: http://web.archive.org/web/19981202230410/www.google.com/
Content-Type: text/html; charset=iso-8859-1
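The URL pattern is simple enough to build by hand. A minimal sketch, assuming only the template shown earlier; the `wayback_url` helper is a hypothetical name, and it just builds the string without contacting archive.org:

```python
from datetime import datetime

WAYBACK = "http://web.archive.org/web/{timestamp}/{url}"


def wayback_url(url, when):
    """Build a Wayback Machine URL for `url` at `when`.

    `when` may be a datetime (formatted as yyyymmddhhmmss) or an already
    formatted, possibly partial, timestamp string such as "199812".
    """
    if isinstance(when, datetime):
        when = when.strftime("%Y%m%d%H%M%S")
    return WAYBACK.format(timestamp=when, url=url)


# Full timestamp, as in the Location header above.
print(wayback_url("http://www.google.com/", datetime(1998, 12, 2, 23, 4, 10)))
# Partial timestamp, as in the curl request above.
print(wayback_url("http://www.google.com/", "199812"))
```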

So anyhow, pretty cool use of URIs and HTTP, right? The addition of Zotero to the mix will mean that scholars can cite the web as it appeared at a particular point in time:

… as scholars begin to use not only traditional primary sources that have been digitized but also “born digital” materials on the web (blogs, online essays, documents transcribed into HTML), the possibility arises for Zotero users to leverage the resources of IA to ensure a more reliable form of scholarly communication. One of the Internet Archive’s great strengths is that it has not only archived the web but also given each page a permanent URI that includes a time and date stamp in addition to the URL.

Currently when a scholar using Zotero wishes to save a web page for their research they simply store a local copy. For some, perhaps many, purposes this is fine. But for web documents that a scholar believes will be important to share, cite, or collaboratively annotate (e.g., among a group of coauthors of an article or book) we will provide a second option in the Zotero web save function to grab a permanent copy and URI from IA’s web archive. A scholar who shares this item in their library can then be sure that all others who choose to use it will be referring to the exact same document.

This is pretty fundamental to scholarship on the web. Of course, when generating a time-anchored permalink with Zotero one can well expect that archive.org will on occasion not have a snapshot of said content, resulting in a 404. It would be great if archive.org could leverage these requests for snapshots as requests to go out and archive the page. One could imagine a blocking and a nonblocking request: the former would spawn a request to fetch a particular URI, stash the content away, and return the permalink; the latter would just quickly return the best match it’s already got (which may be a 404).
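To make the two request styles concrete, here is a toy in-memory model. Everything in it is hypothetical: `archive`, `request_snapshot` and the `fetch` callback are made-up names, "best match" is simplified to the latest snapshot at or before the requested time, and nothing here reflects an actual archive.org interface:

```python
# Toy stand-in for the archive: url -> {timestamp: stashed content}.
archive = {"http://www.google.com/": {"19981202230410": "<html>...</html>"}}


def permalink(url, timestamp):
    return "http://web.archive.org/web/%s/%s" % (timestamp, url)


def request_snapshot(url, timestamp, blocking=False, fetch=None):
    """Return a permalink for `url`, or None (standing in for a 404)."""
    stamps = archive.setdefault(url, {})
    if blocking:
        # Blocking: fetch the page now, stash it, and return its permalink.
        stamps[timestamp] = fetch(url)
        return permalink(url, timestamp)
    # Nonblocking: return the best match already held, if there is one.
    earlier = [t for t in stamps if t <= timestamp]
    return permalink(url, max(earlier)) if earlier else None


def fake_fetch(url):
    """Pretend crawler used only for this demonstration."""
    return "<html>stub for %s</html>" % url


print(request_snapshot("http://example.org/", "20071217211112"))
print(request_snapshot("http://example.org/", "20071217211112", blocking=True, fetch=fake_fetch))
print(request_snapshot("http://example.org/", "20080101000000"))
```

The first call finds nothing, the blocking call creates a snapshot, and the later nonblocking call can then return it.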

Anyhow, it’s really good to see these two outfits working together. Nice work!

ps. dear lazyweb is there a documented archive.org api available?

more marcdb

Monday, November 5th, 2007

This morning Clay and I were chatting about Library of Congress Subject Headings and SKOS a bit. At one point we found ourselves musing about how much reuse there is of topical subdivisions in topical headings in the LC authority file. You know how it is. Anyhow, I remembered that I’d used marcdb to import all of Simon Spero’s authority data, so I fired up psql and wrote a query:

SELECT subfields.value AS subdivision, count(*) AS total
FROM subfields, data_fields
WHERE subfields.code = 'x'
  AND subfields.data_field_id = data_fields.id
  AND data_fields.tag = '150'
GROUP BY subfields.value
ORDER BY total DESC;

And a few seconds later…

 subdivision                          | total
--------------------------------------+-------
 Law and legislation                  |  3342
 Religious aspects                    |  2500
 Buddhism, [Christianity, etc.]       |   898
 History                              |   847
 Equipment and supplies               |   571
 Taxation                             |   566
 Baptists, [Catholic Church, etc.]    |   476
 Diseases                             |   450
 Research                             |   422
 Campaigns                            |   378
 Awards                               |   342
 Finance                              |   284
 Study and teaching                   |   284
 Surgery                              |   275
 Employees                            |   269
 Spectra                              |   261
 Computer programs                    |   259
 Labor unions                         |   218
 Testing                              |   207
 Diagnosis                            |   194
 Isotopes                             |   190
 Complications                        |   183
 Physiological effect                 |   172
 Programming                          |   163

There’s nothin’ like the smell of strong set theory in the morning. Although something seems a bit fishy about [Christianity, etc.] and [Catholic Church, etc.]… If you want to try similar queries, use postgres, and don’t want to wait hours for marcdb to import all the data, here’s the full database dump, which you ought to be able to import:

  % createdb authorities
  % wget http://inkdroid.org/data/authorities.sql.bz2
  % bunzip2 authorities.sql.bz2
  % psql authorities < authorities.sql
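For a quick feel for the query without downloading the dump, the same SQL can be run against a toy table. This sketch uses sqlite3 from the Python standard library rather than postgres, assumes only the columns the query itself names (the real marcdb schema has more), and fills the tables with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_fields (id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE subfields (data_field_id INTEGER, code TEXT, value TEXT);
""")

# Made-up rows: (field id, tag, [(subfield code, value), ...])
rows = [
    (1, "150", [("a", "Toy heading 1"), ("x", "Law and legislation")]),
    (2, "150", [("a", "Toy heading 2"), ("x", "Law and legislation")]),
    (3, "150", [("a", "Toy heading 3"), ("x", "History")]),
    (4, "100", [("a", "Toy name"), ("x", "History")]),  # not a 150, so ignored
]
for field_id, tag, subs in rows:
    conn.execute("INSERT INTO data_fields VALUES (?, ?)", (field_id, tag))
    conn.executemany(
        "INSERT INTO subfields VALUES (?, ?, ?)",
        [(field_id, code, value) for code, value in subs],
    )

# The query from the post, unchanged apart from the trailing semicolon.
query = """
    SELECT subfields.value AS subdivision, count(*) AS total
    FROM subfields, data_fields
    WHERE subfields.code = 'x'
      AND subfields.data_field_id = data_fields.id
      AND data_fields.tag = '150'
    GROUP BY subfields.value
    ORDER BY total DESC
"""
results = conn.execute(query).fetchall()
for subdivision, total in results:
    print("%-25s %d" % (subdivision, total))
```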

wdl peeps

Wednesday, October 31st, 2007

Speaking of smarties, here’s a picture of some of the folks I was fortunate to work with on the recent WDL effort. From left to right: Dan Chudnov, Andy Boyko, Babak Hamidzadeh, Dave Hafken, myself, and Chris Thatcher. I feel really fortunate to be working with all of them. The best part is that these are just the folks that were involved with the WDL project–and there are a bunch more equally fun/talented people in our group that are working on other things. I can safely say that I haven’t worked with a group before that is as simultaneously top-notch and fun to work with.

…thanks to Michael Neubert for the snapshot taken outside the Adams building at the Library of Congress

tools

Thursday, October 18th, 2007


At $work recently many late nights were spent hackety-hacking on a prototype that got written up in the New York Times today. Apart from some promotional materials, not much is available to the public just yet. I just got pulled in near the end to do some search stuff. Over the past few months I’ve seen dchud in top form managing complicated data/organizational workflows while making technical decisions. A nice outgrowth of working with smarties is ending up with a fun and productive technology stack: python, django, postgres, jquery, solr, tilecache, ubuntu, trac, subversion, vmware. Given the press and the commitment to UNESCO I think the code is going to start being a bit more than a prototype pretty soon :-)

OCLC deserves some REST

Wednesday, September 26th, 2007

Hey Worldcat Identities you are doing awesome work–you deserve some REST. Why not use content-negotiation to serve up your HTML and XML representations? So:

  curl --header "Accept: text/html" http://orlabs.oclc.org/Identities/key/lccn-no99-10609

would return HTML and

  curl --header "Accept: application/xml" http://orlabs.oclc.org/Identities/key/lccn-no99-10609

would return XML. This would allow you to:

  • not be limited to XSLT-driven user views (doesn’t that get tedious?)
  • scale to other sorts of output (application/rdf+xml, etc.)
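On the server side, the choice of representation could be sketched roughly like this. The `parse_accept` and `negotiate` names are hypothetical, and the matching is deliberately simplified: it honours q-values and `*/*` but ignores partial wildcards such as `text/*` and the specificity rules of the HTTP spec:

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, best first."""
    choices = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if media_type:
            choices.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(choices, key=lambda c: c[1], reverse=True)


def negotiate(header, available):
    """Pick the first available media type the client accepts, else None."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None


AVAILABLE = ["text/html", "application/xml", "application/rdf+xml"]
print(negotiate("application/xml", AVAILABLE))
print(negotiate("application/rdf+xml;q=0.9, text/html;q=0.5", AVAILABLE))
```

Adding a new output format then means adding one entry to the list of available types, with the resource URL staying the same.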

At least from the outside I’d have to disagree w/ Roy — it appears that institutions can and do innovate. But I won’t say it is easy …

How do 26 Nobel Laureates change a light bulb?

Friday, July 13th, 2007

I don’t know … but it sure is nice to see that 26 Nobel Laureates at least understand the direction libraries ought to be headed:

As scientists and Nobel laureates, we are writing to express our strong support for the House and Senate Appropriations Committees’ recent directives to the NIH to enact a mandatory policy that allows public access to published reports of work supported by the agency. We believe that the time is now for Congress to enact this enlightened policy to ensure that the results of research conducted by NIH can be more readily accessed, shared and built upon, to maximize the return on our collective investment in science and to further the public good.

The public at large also has a significant stake in seeing that this research (research funded by the National Institutes of Health) is made more widely available. When a woman goes online to find what treatment options are available to battle breast cancer, she will find many opinions, but peer-reviewed research of the highest quality often remains behind a high-fee barrier. Families seeking clinical trial updates for a loved one with Huntington’s disease search in vain because they do not have a journal subscription. Librarians, physicians, health care workers, students, journalists, and investigators at thousands of academic institutions and companies are currently hindered by unnecessary costs and delays in gaining access to publicly funded research results.

Exciting times for libraries and the medical profession! I just hope they can convince Congress.

ruby-zoom v0.3.0

Tuesday, July 10th, 2007

Thanks to some prodding from William Denton and Jason Ronallo and the kindness of Laurent Sansonetti I’ve been added as a developer to the ruby-zoom project which provides a Ruby wrapper to the yaz Z39.50 library. I essentially wanted to remove some unused code from the project that was interfering with the ruby-marc gem … and I also wanted to create a gem for ruby-zoom. This was the first time I’ve tried packaging up a C wrapper as a gem and it was remarkably smooth. I also added a test suite and a Rakefile. So assuming you have yaz installed you can install ruby-zoom with:

% gem install zoom

I’ll admit, I’m no huge fan of Z39.50 but the fact remains that it’s pretty much the most widely deployed machine API for getting at bibliographic data locked up in online catalogs. It’s really nice to see forward-thinking systems like Talis, Evergreen and Koha that have (or have at least experimented with) OpenSearch implementations.