Posts Tagged ‘data’

public.resource.org to liberate Code of Federal Regulations

Wednesday, September 17th, 2008

good news via the govtrack mailing list

Carl Malamud of public.resource.org, with funding from a bunch of places including a small bit from GovTrack’s ad profits, announced his intention to purchase from the Government Printing Office documents they produce in the course of their statutory obligations and then have the nerve to sell back to the public at prohibitive prices. The document to be purchased is the Code of Federal Regulations, the component of federal law created by executive branch agencies, in electronic form. Once obtained, it will be posted openly/freely online.

More here: http://public.resource.org/gpo.gov/index.html

And Carl’s letter to the GPO:
http://public.resource.org/gpo.gov/the_honorable.html

It’s pretty sad that it has to come to this…but it’s also pretty awesome that it’s happening.

provide and enable

Wednesday, June 18th, 2008

I got a chance to meet Jennifer Rigby of the National Archives UK at the LinkedDataPlanet Conference in New York City (thanks Ian). Jennifer is the Head of IT Strategy, and told me lots of interesting stuff related to a profound shift they’ve had in their online strategies to:

Provide and Enable

So rather than pouring all their energy into making applications to visualize archival resources, the National Archives have recognized that making machine readable resources available to the public (in formats like RDF and RDFa) is really important to their core mission. In addition to providing services and data, they are trying to enable an ecosystem of innovation around their assets–or in their words:

• We will allow others to harness the power of our information, leading to a far wider range of products and services than we could provide ourselves.
• We will continue to work with commercial partners to provide online access to millions of records.

Jennifer said we can look forward to an announcement around OpenTech2008 (July 5th) about a set of important publications that are going to made available by the Archives as RDF and RDFa. In addition I heard about how they work with website data harvested by Internet Archive to create a resolver service for transient publications on the web.

Hearing how a big organization like the National Archives can come to this realization of “Provide and Enable”, and then start to execute on it was really encouraging–and inspiring. It is also refreshing to see people recognize, in writing the importance of semantic web technologies:

We have started exploring new ideas and technologies, including using RDFa for publishing the Gazettes. The way we now publish legislation has a key role to play in the further development of the semantic web.

WoGroFuBiCo cloud

Friday, January 11th, 2008

access accessible addition al american analysis application applications appropriate archives areas association authority available based benefit benefits bibliographic broad broader catalog catalogers cataloging catalogs cataloguing chain change changes classification code collaboration collections committee communities community congress consequences consider considered content continue control controlled cooperative cost costs create created creating creation current data databases dc description descriptive desired develop developed development different digital discovery distribution dublin ed education effort encourage enhance environment et evidence exchange exist findings focus format formats frameworks frbr future greater group headings hidden identifiers identify ifla impact include including increase increasingly information institution institutions international knowledge language lc lcs lcsh libraries limited lis maintaining make management marc materials metadata model national need needs networks new number oclc online organization organizations outcomes outside participants particular pcc possible potential practice practices primary principles process processes production program programs provide public publishers quo range rare rda recommendations records reference relationships report require requirements research resource resources responsibility results role rules search serve service services share shared sharing sources special specific standards states status subject supply support systems technology terms time today tools types unique united university use used users using value variety various vendors vocabularies washington ways web working works

same stats as before, but the top 200 this time, and as a cloud. It’s crying out for some kind of stemming to collapse some terms together I suppose…but it’s also 3:17AM.

WoGroFuBiCo wc

Thursday, January 10th, 2008

word count
library 263
bibliographic 236
data 170
libraries 144
lc 127
control 109
information 98
cataloging 91
records 88
subject 82
materials 81
standards 81
use 80
congress 79
work 76
record 73
community 67
users 61
working 59
group 58
access 57
recommendations 56
resources 53
authority 52
metadata 47
future 46
new 40
environment 37
development 37
web 36
collections 35
systems 35
available 35
creation 35
services 34
headings 32
national 31
findings 30
research 30
unique 29
sharing 29
oclc 28
model 28
catalog 28
international 27
develop 27
value 27
lcsh 26
pcc 26
user 26
need 26
report 25
make 25
practices 25
rda 25
used 25
time 24
needs 24
rare 24
including 24
provide 23
discovery 23
communities 23
special 23
frbr 23
current 22
resource 22
rules 22
digital 21
cooperative 21
program 21
participants 21
management 21
service 20
dc 20
programs 20
online 20
costs 20
washington 20
standard 19
support 19
knowledge 19
different 19
appropriate 19
effort 18
applications 18
marc 18
shared 18
exchange 18
process 18
changes 17
lcs 17
increase 16
public 16
search 16
creating 16
broader 16
catalogs 16
controlled 16

I converted the pdf to text file called ‘lc’ with xpdf and then wrote a little python:

#!/usr/bin/env python
 
from urllib import urlopen
from re import sub
 
stop_words = urlopen('http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words').read().split()
text = file('lc').read()
 
counts = {}
for word in text.split():
    word = word.lower()
    word = sub(r'\W', '', word)
    word = sub(r'\d+', '', word)
    if word == ''  or word in stop_words: continue
    counts[word] = counts.get(word,0) + 1
 
words = counts.keys()
words.sort(lambda a,b: cmp(counts[b], counts[a]))
for word in words[0:100]:
    print "%20s %i" % (word, counts[word])

Does me writing code to read the report count as reading the report? …