Implementing Search Suggest with Apache Solr
One of Oshyn's latest successful projects (disneydvd.com) relies heavily on its "search suggest" feature to provide user-friendly search across all the titles available on the site. Needless to say, there are many, many movies and TV shows to search for on such a site, and it is imperative to let users find what they are looking for as fast as possible and with the least trouble.
With these requirements in mind, Oshyn turned to Apache's Solr enterprise search server. In short, Solr is a Lucene-based web application that can be deployed to any Java servlet container. Solr provides several methods for querying the index through XML/HTTP and JSON APIs, which are ideal for a search suggest feature where the browser makes an AJAX call on every user input.
Configuring Solr is not a challenge and is very well documented. However, a feature such as search suggest is a different story, and there are many variables to take into account to get a successful "search suggest" working on your site. In this blog post I will identify the challenges Oshyn had to overcome when implementing this feature with Solr and demonstrate the solutions that led to a solid "Google-like" search suggest.
1) Dealing with the lack of support for wildcard searches on phrase queries:
A "phrase", as defined by the Lucene documentation (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html), is a group of words surrounded by double quotes, such as "hello dolly". Using Lucene you can search for phrases within your index by building a Lucene query such as title:"pirates of the", where the field you want to search is called "title". This is a very useful feature for finding results based on partial phrases; however, it is not close to being enough when it comes to "search suggest", and I'm going to explain why.
In search suggest, if your users input "Pirates of the" in your search box, you want them to get the suggestion "Pirates of the Caribbean". Lucene's phrase search would work for this example, but how about when they enter "Pirates of th" or "Pirat" or... you can see where this is going. Phrase search will not return your desired results for this input because these are not phrases contained in your index. Remember, a phrase in Lucene is a group of complete WORDS, so a partially typed word at the end of the input will not match.
If you want your users to get the correct results you need to work around this a bit.
Oshyn implemented a query builder that turned a user's input into a Lucene query that would always return the desired results. How did we do this?
Lucene supports wildcard searches, meaning you can enter a partial word followed by a wildcard and Lucene will return all results that match the partial word plus 0 or more characters in place of the wildcard. For example, searching for "Pira*" will return "Pirates", "Pirate", "Piranha", etc. Now you see we are getting close to where we want to be, but not close enough yet. Lucene does NOT support wildcard searches on phrase queries, meaning a search for "Pirates of th*" (as a quoted phrase) will not be interpreted by Lucene as a wildcard search.
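To make the gap concrete, here is a toy model in Python of the two matching styles. It is only a sketch: the tokenizer is a plain lowercase-and-split, and the helper names (tokenize, phrase_match, prefix_match) are made up for illustration; none of this is Lucene's actual implementation.

```python
def tokenize(text):
    # Lowercase and split on whitespace; a stand-in for a real analyzer.
    return text.lower().split()


def phrase_match(title, phrase):
    """True if the complete words of `phrase` appear consecutively in `title`."""
    doc, words = tokenize(title), tokenize(phrase)
    n = len(words)
    return n > 0 and any(doc[i:i + n] == words for i in range(len(doc) - n + 1))


def prefix_match(title, prefix):
    """True if any single word of `title` starts with `prefix` (like prefix*)."""
    return any(tok.startswith(prefix.lower()) for tok in tokenize(title))


title = "Pirates of the Caribbean"
print(phrase_match(title, "Pirates of the"))  # complete words: matches
print(phrase_match(title, "Pirates of th"))   # "th" is not a complete word: no match
print(prefix_match(title, "Pira"))            # single partial word: matches
print(prefix_match(title, "Pirates of th"))   # a prefix only applies to one word: no match
```

Neither style on its own handles a multi-word input whose last word is still being typed, which is the typical search suggest case.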
Turns out a combination of phrase and wildcard queries is exactly what you need to provide a fully functional search suggest.
Enough introduction and let's get to the action. How does this look to you?
User input: Pirates of th
Oshyn's Query Builder Output: (title:Pirates AND title:of AND (title:th* OR title:th)) OR title:"pirates of th"
The query builder parses the original input and builds one that simulates a wildcard phrase query. It looks for all the words the user entered and adds a wildcard (*) to the last word. It also searches for the whole phrase the user entered using a phrase query in case the whole phrase is found in the index. This should work!
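For readers who want something to start from, here is a minimal sketch of such a query builder in Python. It is not Oshyn's actual code: the function names (build_suggest_query, escape_term) and the escaping step are my own assumptions, and the phrase clause is lowercased only to mirror the sample output above.

```python
import re

# Characters with special meaning in the Lucene query parser syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\&|])')


def escape_term(term):
    """Backslash-escape query syntax characters in user input (hypothetical helper)."""
    return _SPECIAL.sub(r'\\\1', term)


def build_suggest_query(user_input, field="title"):
    """Build a query that simulates a wildcard phrase query (illustrative sketch)."""
    words = [escape_term(w) for w in user_input.split()]
    if not words:
        return ""
    # Every word except the last must match exactly.
    clauses = [f"{field}:{w}" for w in words[:-1]]
    # The last word may still be incomplete, so match it as a prefix or as-is.
    last = words[-1]
    clauses.append(f"({field}:{last}* OR {field}:{last})")
    all_words = " AND ".join(clauses)
    if len(clauses) > 1:
        all_words = f"({all_words})"
    # Also try the whole input as a phrase, in case it is indexed verbatim.
    phrase = " ".join(words).lower()
    return f'{all_words} OR {field}:"{phrase}"'


print(build_suggest_query("Pirates of th"))
```

The field name is a parameter, so the same function can target a different field than "title" if your schema calls for it.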
Now that we've seen how to build a Lucene query for Solr to implement search suggest in your website, let's look at why you may be having problems with getting such a query to return the results you are expecting. The reason will most likely have to do with "stopwords" and the stop filter applied to the default index analyzer configuration in Solr.
Stopwords in Solr are words (such as prepositions) that are configured to be ignored by the index and query analyzers. The words are listed in a file called stopwords.txt. This is a useful feature since it mimics what other search engines (like Google) do when performing searches. For example, when searching Google for "peter pan and captain hook", the word "and" is ignored in order to bring back more relevant results; otherwise any document containing the word "and" would match the search as well. When the stop filter is applied in Solr, any search will show the same behavior. However, when implementing search suggest, this behavior becomes undesirable. "Pirates of the Caribbean" is indexed in Solr as NOT containing the words "of" and "the", so when performing a search like the one below, for which you would expect correct results:
title:Pirates AND (title:O* OR title:O)
Nothing is returned by Solr. To get your results back, just remove the stop filter from the field used for the search suggest query. The best solution in this case is to create a new field, used only for search suggest, with an analyzer that does NOT use the stop filter, and to run the suggest queries against that field. Here's the final solution in schema.xml, and now search suggest works as expected.
<fieldType name="text_no_stop" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!--filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
*Notice that the stop filter is commented out in both the index and the query analyzer
<fields>
  <field name="titlesuggest" type="text_no_stop" indexed="true" stored="true"/>
</fields>
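As a rough illustration of how a suggest request against this field could be assembled, here is a small Python sketch that builds the URL for Solr's standard select handler with a JSON response. The host, port and path in SOLR_SELECT_URL are assumptions about a local test install, and suggest_url is a made-up helper name; adjust both to your own deployment.

```python
from urllib.parse import urlencode

# Assumed location of the Solr select handler; change this for your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def suggest_url(lucene_query, rows=10):
    """Build a GET URL asking Solr for suggestions as JSON (hypothetical helper)."""
    params = {
        "q": lucene_query,     # query against the stop-filter-free field
        "fl": "titlesuggest",  # only return the field shown to the user
        "rows": rows,          # cap the number of suggestions
        "wt": "json",          # JSON response, convenient for AJAX callers
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(suggest_url("titlesuggest:Pirates AND (titlesuggest:o* OR titlesuggest:o)"))
```

The query string is URL-encoded by urlencode, so characters such as ":" and "*" in the Lucene query are safe to pass through.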
You now have a fully functional search suggest feature working on your site!
I hope these blog entries were useful to anyone out there trying to implement a similar feature. Please keep in mind that everything discussed here was done with Solr 1.2, which was the latest version available at the time. Solr is currently at version 1.3 and I know it's full of new features, so if anyone has stumbled across an easier way to do this with Solr 1.3 (or even 1.2), please feel free to share by adding a comment!