One of Oshyn's latest successful projects, http://disneydvd.com, relies heavily on its "search suggest" feature to provide a user-friendly search across all available titles on the site. Needless to say, there are many, many movies and TV shows to search for on such a site, and it is imperative to let users find what they are looking for as quickly as possible and with the least trouble.
With these obvious requirements in mind, Oshyn turned to Apache's Solr enterprise search server. In short, Solr is a Lucene-based web application that can be deployed to any Java servlet container. Solr provides several methods for querying the index through XML/HTTP and JSON APIs, which are ideal for a search suggest feature where the browser makes AJAX calls on every user input.
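To make the HTTP side concrete, here is a minimal Python sketch of the kind of URL a suggest request might hit. The host, port, and /solr/select path are assumptions for a local default install, and build_suggest_url is a hypothetical helper, not part of Solr. The q, wt, and rows parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_suggest_url(base_url, query, rows=10):
    """Build a Solr select URL that asks for a JSON response.

    base_url is an assumption: point it at your own Solr instance.
    """
    params = {
        "q": query,    # the Lucene query string
        "wt": "json",  # response writer: JSON instead of the default XML
        "rows": rows,  # maximum number of suggestions to return
    }
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr instance; adjust host, port, and path as needed.
url = build_suggest_url("http://localhost:8983/solr/select",
                        'title:"pirates of the"')
print(url)
```

In the browser, the equivalent request would be made through AJAX each time the input changes.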
Configuring Solr is not a challenge, and it is very well documented. However, a feature such as search suggest is a different story: there are many variables to take into account before you get a successful "search suggest" working on your site. In this blog post I will identify the challenges Oshyn had to overcome when implementing this feature with Solr and demonstrate the solutions that lead to a solid "Google-like" search suggest on your site.
1) Dealing with the lack of support for wildcard searches on phrase queries:
A "phrase", as defined by the Lucene documentation (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html), is a group of words surrounded by double quotes, such as "hello dolly". With Lucene you can search for phrases within your index by building a Lucene query such as title:"pirates of the", where "title" is the field you want to search. This is a very useful feature for finding results based on partial phrases; however, it is not nearly enough for "search suggest", and I'm going to explain why.
In search suggest, if your users type "Pirates of the" in your search box, you want them to get the suggestion "Pirates of the Caribbean". Lucene's phrase search would work for this example, but what about when they enter "Pirates of th" or "Pirat" or... you can see where this is going. Phrase search will not return the desired results for this input because these are not phrases contained in your index. Remember, Lucene treats a phrase as a group of whole WORDS, so a partial word such as "th" does not match the indexed word "the".
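The behavior is easy to see with a toy model. The Python sketch below is not how Lucene is implemented. It only assumes that a title is indexed as lowercase, whitespace-separated words and that a phrase matches when its words appear consecutively and in full. The function names are made up for illustration.

```python
def tokenize(text):
    """Split text into lowercase words (a stand-in for a real analyzer)."""
    return text.lower().split()


def phrase_matches(title, phrase):
    """Return True if every word of phrase appears, whole and in order,
    as a consecutive run of words in title."""
    title_words = tokenize(title)
    phrase_words = tokenize(phrase)
    n = len(phrase_words)
    if n == 0:
        return False
    # Slide a window of n words across the title and compare whole words.
    return any(
        title_words[i:i + n] == phrase_words
        for i in range(len(title_words) - n + 1)
    )


title = "Pirates of the Caribbean"
for user_input in ["Pirates of the", "Pirates of th", "Pirat"]:
    print(repr(user_input), "->", phrase_matches(title, user_input))
```

Only the first input matches, because "th" and "pirat" are not whole words in the indexed title.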
If you want your users to get the correct results, you need to work around this a bit.
Oshyn implemented a query builder that turned a user's input into a Lucene query that would always return the desired results. How did we do this?
Lucene supports wildcard searches, meaning you can enter a partial word followed by a wildcard and Lucene will return all results that match the partial word plus zero or more characters in place of the wildcard. For example, searching for "Pira*" will return "Pirates", "Pirate", "Piranha", etc. Now we are getting close to where we want to be, but not close enough yet. Lucene does NOT support wildcard searches on phrase queries, meaning that the wildcard in a search for "Pirates of th*" will not be interpreted by Lucene.
It turns out that a combination of phrase and wildcard queries is exactly what you need to provide a fully functional search suggest.
Enough introduction; let's get to the action. How does this look to you?
User input: Pirates of th
Oshyn's Query Builder Output: (title:Pirates AND title:of AND (title:th* OR title:th)) OR title:"pirates of th"
The query builder parses the original input and builds a new query that simulates a wildcard phrase query. It requires every word the user entered and matches the last word either as a prefix, by adding a wildcard (*), or exactly as typed. It also searches for the whole input using a phrase query, in case that exact phrase is found in the index. This should work!
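For readers who want to experiment, here is a minimal Python sketch that reproduces the output shown above. It is not Oshyn's actual query builder: the function name and the default "title" field are assumptions, the lowercasing of the phrase simply mirrors the example output, and escaping of Lucene special characters in the input is left out for brevity.

```python
def build_suggest_query(user_input, field="title"):
    """Turn raw user input into a query that simulates a wildcard phrase.

    Sketch only: special characters in user_input are not escaped.
    """
    words = user_input.split()
    if not words:
        return ""

    *leading, last = words

    # Every word before the last one must match as a whole word.
    clauses = [f"{field}:{word}" for word in leading]

    # The last word may be unfinished, so match it as a prefix or exactly.
    clauses.append(f"({field}:{last}* OR {field}:{last})")

    # Also try the whole input as an exact phrase (lowercased, as in the
    # example output above).
    phrase = " ".join(words).lower()
    term_part = " AND ".join(clauses)
    return f'({term_part}) OR {field}:"{phrase}"'


print(build_suggest_query("Pirates of th"))
# (title:Pirates AND title:of AND (title:th* OR title:th)) OR title:"pirates of th"
```

The resulting string can then be sent to Solr as the q parameter of a search request.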
Part 2 of this entry will show you why such a query might not work immediately for you. The secret is "stopwords", so if you like this so far, please keep reading in part 2, and don't forget to leave a comment!