citation graphs

JCDL2006 was chock-a-block with good content. A set of papers presented on the first day in the Named Entities track explored a common theme: applying graph theory to citation networks in order to cluster works by the same author. For example, an author name may appear as Daniel Chudnov, D Chudnov, or Dan Chudnov. There is also the inverse problem, where two different people publish under the same name. Being able to group all the works by an author is very important for good search interfaces…and also for calculating citation counts and impact factors.

The most interesting paper in the bunch (for me) was Also By The Same Author: AKTiveAuthor, A Citation Graph Approach To Name Disambiguation. This paper revolves around the hypothesis that authors tend to cite their own works more frequently than others (so-called ‘self-citation’). Self-citation isn’t the result of navel gazing or self-promotion so much as it is the result of researchers building on the work that they’ve done previously. In addition to self-citation graphs, co-authorship and source URL graphs are also used to build a graph of a particular author’s works.
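To make the idea concrete, here is a rough sketch of how linking papers by these three signals could work. This is not the algorithm from the paper, just my own minimal illustration: it assumes every paper passed in already carries some variant of the same author name (so any citation between two of them is treated as a self-citation), and it simply merges papers that share a citation link, a co-author, or a source URL. The record layout, the function name cluster_papers, the co-author names, and the example.org URLs are all made up for the example.

```python
from itertools import combinations


def cluster_papers(papers):
    """Group papers linked by a citation, a shared co-author, or a shared source URL.

    Assumption: every paper in `papers` carries a variant of the same author
    name, so a citation between two of them counts as a self-citation.
    """
    # Union-find: each paper starts out in its own cluster.
    parent = {p["id"]: p["id"] for p in papers}

    def find(x):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in combinations(papers, 2):
        cites = b["id"] in a["cites"] or a["id"] in b["cites"]
        shared_coauthor = bool(a["coauthors"] & b["coauthors"])
        same_source = a["url"] is not None and a["url"] == b["url"]
        if cites or shared_coauthor or same_source:
            union(a["id"], b["id"])

    # Collect paper ids by cluster root.
    clusters = {}
    for p in papers:
        clusters.setdefault(find(p["id"]), set()).add(p["id"])
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical records; only the name variants come from the post.
papers = [
    {"id": "p1", "name": "Daniel Chudnov", "cites": set(),
     "coauthors": {"A. Smith"}, "url": "http://example.org/a"},
    {"id": "p2", "name": "D Chudnov", "cites": {"p1"},
     "coauthors": set(), "url": None},
    {"id": "p3", "name": "Dan Chudnov", "cites": set(),
     "coauthors": {"A. Smith"}, "url": None},
    {"id": "p4", "name": "D Chudnov", "cites": set(),
     "coauthors": {"B. Jones"}, "url": "http://example.org/b"},
]

for cluster in cluster_papers(papers):
    print(sorted(cluster))
```

Here p2 cites p1 and p3 shares a co-author with p1, so those three end up together, while p4 has no link to the others and stays on its own, which is what you would want if it were written by a different D Chudnov. A real system would presumably weigh the signals rather than merge on any single link.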

The paper concludes with some good precision/recall figures (.997/.818), which point to the value of using self-citation for name clustering. This paper and some growing interest I have in RDF and Jena have made me realize that I’d like to spend a bit of time over the coming year learning about graph theory.