Subdividing Search with Open Source Solr

By David F. Carr  |  Posted 09-18-2009

Open Source Search Powers MTV Networks

The open source Solr search engine is emerging as the standard for new and revamped websites launched by MTV Networks and is already powering the search on key properties such as TheDailyShow.com, ColbertNation.com, and SouthParkStudios.com.

"We have about 15 sites on Solr and more coming later this year," says Mark Cohen, vice president of the data platform group at MTV.

The flagship MTV.com website still bases its search function on the Google Search Appliance and is unlikely to be converted anytime soon, he said. That main website is so sprawling and complex that the Google approach of crawling and indexing web content is the most practical one. But on the more focused websites MTV Networks is building around specific shows or to appeal to specific audiences, Solr is becoming the default choice because of the way it can take advantage of categories or "facets" within search results, Cohen said.

Solr is a subproject of the Apache Lucene project, which has also spawned other powerful spin-offs such as the Hadoop distributed computing system. Lucene itself is a Java library used as a core component for multiple search systems. Solr was originally developed by Yonik Seeley while at CNET Networks, where it is used on the CNET Reviews site to help visitors drill down through search results by category. Seeley has since joined Lucid Imagination, a firm formed to provide commercial distributions, training, and support for Lucene and Solr.

MTV is talking with Lucid (which provided us with an introduction) but has yet to sign up for its services, says Warren Habib, Senior Vice President of Digital Platform Development at MTV Networks. MTV likes to take advantage of open source products wherever practical, Habib says. "We're fundamentally a media company, not a technology company," he says, and open source software not only helps keep costs down but tends to be backed by a large pool of skilled developers. But he says MTV is also "open" to taking advantage of commercial support options to make its use of these technologies more effective.

Habib says another reason open source products tend to find their way into the process is that developers can download them and start working with them without going through a procurement process. "You can imagine what the approval process is like for a purchase at a large corporation. Open source empowers developers to come up with solutions," he says.

That's what happened with Solr, which one of the developers working on SouthParkStudios.com began tinkering with on his own initiative, Cohen says. In that case, the advantage was that Solr made it easier to subdivide search results according to categories such as the names of the characters featured in an episode, he says.
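To make the idea of subdividing results concrete, here is a minimal sketch of a faceted Solr request built in Python. The `facet`, `facet.field`, and `facet.mincount` parameters are standard Solr query parameters; the endpoint URL and the `character` field name are assumptions made for illustration, since the article does not describe MTV's actual schema or deployment.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; the article does not give MTV's real one.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_facet_query(keywords, facet_field="character"):
    """Return a Solr select URL that asks for facet counts on one field.

    `facet_field` is an assumed field name standing in for a category
    such as the characters featured in an episode.
    """
    params = [
        ("q", keywords),              # the visitor's search terms
        ("facet", "true"),            # turn faceting on
        ("facet.field", facet_field), # count matches per value of this field
        ("facet.mincount", "1"),      # omit categories with no matches
        ("wt", "json"),               # ask for a JSON response
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_facet_query("halloween")
print(url)
```

Alongside the matching documents, a response to a request like this would carry per-category counts that a site could render as drill-down links.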

To achieve those results, the website must be configured to provide an XML data feed to Solr that exposes the structure and categories stored in the site's content management system (CMS). Because MTV's other major search solution, the Google appliance, relies primarily on crawling web content, its ability to find search facets is limited to identifying different file types or extracting categorization information from HTML meta tags, Cohen says.
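As a rough illustration of what such a feed can look like, the sketch below serializes a few CMS records into Solr's XML update format, in which an `<add>` element wraps one `<doc>` per item and each `<field>` carries a `name` attribute. The record contents and the field names (`id`, `title`, `character`) are hypothetical; the article does not show MTV's schema or feed.

```python
import xml.etree.ElementTree as ET


def build_solr_add_xml(records):
    """Serialize CMS records (dicts) into a Solr <add> update message.

    A list or tuple value is written as repeated <field> elements,
    which is how a multi-valued category is expressed in the feed.
    """
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", {"name": name})
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical records standing in for what a CMS export might contain.
episodes = [
    {
        "id": "ep-001",
        "title": "Example Episode",
        "character": ["Character A", "Character B"],
    },
]

feed_xml = build_solr_add_xml(episodes)
print(feed_xml)
```

Because the categories arrive as explicit fields rather than being inferred from crawled pages, the index can facet on them directly.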

So Solr works best with sites such as TheDailyShow.com, which runs off a single CMS, as opposed to MTV.com, which features layers of technology built up over decades, Cohen says. Solr is also used on more complex sites, such as comedycentral.com, as a back-end tool for indexing content but not for serving website search requests.

In addition to helping viewers of TheDailyShow.com find segments featuring their favorite comic correspondents, Solr feeds a slider-style timeline user interface widget that sorts through results by when a segment appeared. Cohen says the faceting feature also makes it possible to feed search results to multiple international versions of the MTVMusic site from a single index.
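One plausible way to support both a date slider and several regional sites from one index is with Solr filter queries (`fq`), which narrow a result set without changing the main query. The sketch below is only an assumption about how this could be done, not a description of MTV's implementation; the `region` and `air_date` field names are hypothetical.

```python
from urllib.parse import urlencode


def build_filtered_params(keywords, region=None, start=None, end=None):
    """Build Solr query parameters restricted by region and air date.

    `region` and `air_date` are assumed field names. Dates use Solr's
    ISO 8601 UTC form, e.g. 2009-01-01T00:00:00Z.
    """
    params = [
        ("q", keywords),
        ("sort", "air_date desc"),  # newest segments first
    ]
    if region:
        # One shared index, filtered per international site.
        params.append(("fq", f"region:{region}"))
    if start and end:
        # Inclusive range filter matching the slider's two endpoints.
        params.append(("fq", f"air_date:[{start} TO {end}]"))
    return params


params = build_filtered_params(
    "interview",
    region="uk",
    start="2009-01-01T00:00:00Z",
    end="2009-06-30T23:59:59Z",
)
query_string = urlencode(params)
print(query_string)
```

Each regional site would send the same kind of request with a different `region` value, so no separate index per country is needed.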

The one drawback, Cohen says, is the "tooling" that comes with Solr. The management and monitoring utilities are not as slick as those that come with commercial products, he says.