I did end up hearing back from Matt Biddulph about the search technology that he’s using with RubyOnRails to build the BBC Programme Catalogue.

The core of the search is nothing more than mysql 4.1’s fulltext indexer. I used to think very poorly of it until I discovered how to turn off its automatic stoplist and minimum indexable word length, and started using its boolean mode. Having the database manage the indexing without having to keep a separate index in sync is very valuable, and of course it’s portable to any client language.

The nice thing with a dataset the size and quality of the BBC’s is that you’re not solely dependent on the quality of the freetext indexer. I’ve done a little statistical analysis on the data to help with scoring the results. For example, programme contributors can be ranked according to how many shows they’ve contributed to, and commonly co-occurring contributors can be easily calculated with a bit of overnight batch processing. This kind of stuff contributes to a pretty good set of search results.

Given the visibility of the BBC Catalogue and that it has nearly a million records this says good things to me about the scalability of MySQL’s fulltext search. I’ll definitely consider it along with Ferret for Rails experiments that need search functionality.