WoGroFuBiCo wc
Thursday, January 10th, 2008

| word | count | word | count |
|---|---|---|---|
| library | 263 | bibliographic | 236 |
| data | 170 | libraries | 144 |
| lc | 127 | control | 109 |
| information | 98 | cataloging | 91 |
| records | 88 | subject | 82 |
| materials | 81 | standards | 81 |
| use | 80 | congress | 79 |
| work | 76 | record | 73 |
| community | 67 | users | 61 |
| working | 59 | group | 58 |
| access | 57 | recommendations | 56 |
| resources | 53 | authority | 52 |
| metadata | 47 | future | 46 |
| new | 40 | environment | 37 |
| development | 37 | web | 36 |
| collections | 35 | systems | 35 |
| available | 35 | creation | 35 |
| services | 34 | headings | 32 |
| national | 31 | findings | 30 |
| research | 30 | unique | 29 |
| sharing | 29 | oclc | 28 |
| model | 28 | catalog | 28 |
| international | 27 | develop | 27 |
| value | 27 | lcsh | 26 |
| pcc | 26 | user | 26 |
| need | 26 | report | 25 |
| make | 25 | practices | 25 |
| rda | 25 | used | 25 |
| time | 24 | needs | 24 |
| rare | 24 | including | 24 |
| provide | 23 | discovery | 23 |
| communities | 23 | special | 23 |
| frbr | 23 | current | 22 |
| resource | 22 | rules | 22 |
| digital | 21 | cooperative | 21 |
| program | 21 | participants | 21 |
| management | 21 | service | 20 |
| dc | 20 | programs | 20 |
| online | 20 | costs | 20 |
| washington | 20 | standard | 19 |
| support | 19 | knowledge | 19 |
| different | 19 | appropriate | 19 |
| effort | 18 | applications | 18 |
| marc | 18 | shared | 18 |
| exchange | 18 | process | 18 |
| changes | 17 | lcs | 17 |
| increase | 16 | public | 16 |
| search | 16 | creating | 16 |
| broader | 16 | catalogs | 16 |
| controlled | 16 |
I converted the PDF to a text file called ‘lc’ with xpdf and then wrote a little Python:

    #!/usr/bin/env python

    from urllib import urlopen
    from re import sub

    stop_words = urlopen('http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words').read().split()
    text = file('lc').read()
    counts = {}

    for word in text.split():
        word = word.lower()
        word = sub(r'\W', '', word)
        word = sub(r'\d+', '', word)
        if word == '' or word in stop_words:
            continue
        counts[word] = counts.get(word,0) + 1

    words = counts.keys()
    words.sort(lambda a,b: cmp(counts[b], counts[a]))

    for word in words[0:100]:
        print "%20s %i" % (word, counts[word])

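The script above is Python 2 (`file()`, `cmp`, the `print` statement), so it will not run on a current interpreter. Here is a rough Python 3 sketch of the same counting, not a drop-in replacement. The names `top_words` and `print_report` and the tiny inline `STOP_WORDS` set are illustrative assumptions; the original pulled its stop list from the Glasgow URL instead. Note also that `\W` is Unicode-aware in Python 3, so counts could differ slightly on text with accented characters.

```python
import re
from collections import Counter

# Hypothetical stand-in for the downloaded stop-word list (an assumption,
# not the list the original script used); swap in a fuller list as needed.
STOP_WORDS = frozenset({"the", "of", "and", "to", "a", "in", "for", "is", "on", "that"})


def top_words(text, stop_words=STOP_WORDS, limit=100):
    """Return up to `limit` (word, count) pairs, most frequent first."""
    counts = Counter()
    for token in text.split():
        # Same normalization as the original: lowercase, then strip
        # non-word characters and digits.
        word = re.sub(r"\d+", "", re.sub(r"\W", "", token.lower()))
        if word and word not in stop_words:
            counts[word] += 1
    return counts.most_common(limit)


def print_report(path, limit=100):
    """Print the top words from the text file at `path`, one per line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for word, count in top_words(handle.read(), limit=limit):
            print("%20s %i" % (word, count))
```

Calling `print_report('lc')` on the converted text file would print the same right-aligned word and count listing, though the exact numbers depend on which stop-word list is supplied.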
Does me writing code to read the report count as reading the report? …