The core of the search is nothing more than MySQL 4.1’s fulltext indexer. I used to think very poorly of it until I discovered how to turn off its automatic stoplist and minimum indexable word length, and started using its boolean mode. Having the database manage the indexing, with no separate index to keep in sync, is very valuable, and of course it’s portable to any client language.
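To make that concrete, here is a rough sketch of how a client might build a boolean-mode query. The settings I’m referring to are, as far as I know, the `ft_stopword_file` and `ft_min_word_len` server variables (fulltext indexes need rebuilding after they change). The table and column names below (`programmes`, `title`, `synopsis`) are made up for illustration, as are the helper names, and this is not the actual catalogue code.

```python
# Hypothetical schema: a `programmes` table with a FULLTEXT index
# on (title, synopsis). These names are assumptions for illustration.
SEARCH_SQL = (
    "SELECT id, title FROM programmes "
    "WHERE MATCH (title, synopsis) AGAINST (%s IN BOOLEAN MODE)"
)


def boolean_query(terms):
    """Turn user-supplied words into a boolean-mode search string.

    Each word is prefixed with '+', which in boolean mode means the
    word must be present in a matching row.
    """
    cleaned = []
    for term in terms:
        # Keep only letters and digits so that stray boolean operators
        # typed by the user (+, -, ", *, etc.) cannot alter the query.
        word = "".join(ch for ch in term if ch.isalnum())
        if word:
            cleaned.append("+" + word)
    return " ".join(cleaned)


def build_search(user_input):
    """Return (sql, params) ready to hand to a DB-API cursor.execute()."""
    return SEARCH_SQL, (boolean_query(user_input.split()),)
```

The search string is passed as a bound parameter rather than pasted into the SQL, so the same pattern should carry over to whichever client language you happen to be using.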
The nice thing with a dataset the size and quality of the BBC’s is that you’re not solely dependent on the quality of the freetext indexer. I’ve done a little statistical analysis on the data to help with scoring the results. For example, programme contributors can be ranked according to how many shows they’ve contributed to, and commonly co-occurring contributors can be easily calculated with a bit of overnight batch processing. This kind of stuff contributes to a pretty good set of search results.
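The kind of batch job I mean can be sketched in a few lines. This is a simplified illustration, not the real processing: it assumes the credits have already been pulled out as (programme id, contributor name) pairs, and the sample names are invented.

```python
from collections import Counter
from itertools import combinations


def contributor_stats(credits):
    """Count appearances per contributor and co-occurrences per pair.

    `credits` is an iterable of (programme_id, contributor_name) pairs.
    Returns (appearances, pairs), both Counter objects.
    """
    # Group contributors by programme, ignoring duplicate credits.
    by_programme = {}
    for programme_id, name in credits:
        by_programme.setdefault(programme_id, set()).add(name)

    appearances = Counter()
    pairs = Counter()
    for names in by_programme.values():
        appearances.update(names)
        # Sort so each pair is always counted under the same key.
        pairs.update(combinations(sorted(names), 2))
    return appearances, pairs


# Invented sample data, purely for illustration.
credits = [
    (1, "Alice"), (1, "Bob"),
    (2, "Alice"), (2, "Bob"), (2, "Carol"),
    (3, "Alice"),
]
appearances, pairs = contributor_stats(credits)
```

Calling `appearances.most_common()` then gives contributors ranked by number of shows, and `pairs.most_common()` gives the most frequent collaborations. Both could be written back to a table overnight and used as scoring hints at query time.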
Given the visibility of the BBC Catalogue, and that it has nearly a million records, this says good things to me about the scalability of MySQL’s fulltext search. I’ll definitely consider it along with Ferret for Rails experiments that need search functionality.