full text searching in javascript
So I have this project I am working on with Nick. One of the things we needed was a fancy customizable search solution. Generally, this is done in the ruby world using something like Spinix, Ferret, Solr or the like. Only problem is that all of these require you to have some fancy server-side setup. But for our project we weren't going to have that-- only a static site that we generate once.
Well, I had a little search engine I had written called Fates search. It was based on Mauricio Fernandez's FT_Search (see http://eigenclass.org/hiki/simple+full+text+search+engine). With some help from Nick I cleaned up a lot of my code (it could use more cleanup, believe me). Then in a couple of late night fits of coding, I converted the whole thing to Javascript.
There was a problem. The reason the fulltext searching was quick was because of some elegant file seeking and binary searching. But if every user had to download megabytes and megabytes of indexes and the fulltext corpus every time they wanted to search... well, they would not be impressed. So I decided that I needed to shard the whole thing. Basically I divided the main index (the suffixes) into four-character shards. So hits for "Jeffrey" would fall into the "jeff" shard. Clearly, the right length depends on the data you are indexing, but for my fifty-thousand name sample, four seemed about right. Most of the index shards turned out to be less than 1kb with a couple of pathological cases for common names.
However, it wasn't just the main index that needed sharding. In order to run the binary search, each step needed to peek in the main fulltext corpus which contained all of the text, and which was of course, very big. Because the file is not ordered in any way, sharding by a prefix wasn't an option. But because I would have offsets into the file for seeking, I figured I could just chunk the whole thing into smaller 1kb file chunks. This way, if the search engine needed to seek to offset 15693, I could simply divide to find the appropriate file chunk, retrieve it, then seek to the position based on the modulus.
Finally, I added a little bit of jQuery goodness so that it would look prettier while you search.
I posted the code to GitHub at http://jeffrafter.github.com/fates and the Indexer is available in the main Fates repository http://github.com/jeffrafter/fates. Try it out and make sure you watch the console in Firebug to see the shards and chunks being retrieved.
0 comments
Jump to comment form | comments rss [?]