The year was 2009, and I had one of my first work assignments after college. I was a software engineer on an enterprise search project for one of the biggest financial companies in the US. I was certain of the outcome: I would use Google’s search engine to save the world. Unfortunately, I was immediately informed that the project was not planning to use Google’s search engine. Shockingly, Google wasn’t even considered one of the leaders in enterprise search. (Check this blog post if you want to see Gartner’s Magic Quadrant from 2008.)
How was it possible that Google wasn’t the top choice for a search project? Was there anything that could surpass Google’s search engine? I must admit I was new to search applications, and my knowledge was largely limited to the PageRank paper published by Brin and Page in 1998.
The Three Components of a Search Engine
There are three components of any search engine (web or enterprise):
1. Ingest – The purpose of the ingest stage is to make the data available to the search engine. It is common to use tools provided by your search engine during this stage.
2. Index – Once data is available for consumption, the search engine creates an index to be referenced later when performing search operations. An index is a data structure used to look up the resources associated with a given input. Usually the input consists of some keywords and the output is a list of documents or resources related to those keywords.
3. Query – Once you have an index in place, you can query the index and return the results to your users.
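
To make the three stages concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not how any particular search product works: the sample documents, the function names, and the whitespace tokenizer are all assumptions made for the example. The index is a simple inverted index that maps each keyword to the documents containing it.

```python
from collections import defaultdict

# Hypothetical sample documents standing in for the data to be ingested.
DOCUMENTS = {
    1: "Mortgage application for the Smith property",
    2: "Quarterly report on mortgage rates",
    3: "Employee handbook",
}


def ingest(source):
    """Ingest: make the raw data available as (doc_id, text) pairs."""
    for doc_id, text in source.items():
        yield doc_id, text


def build_index(records):
    """Index: map each lowercase keyword to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in records:
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def query(index, keywords):
    """Query: return the sorted ids of documents that contain every keyword."""
    tokens = keywords.lower().split()
    if not tokens:
        return []
    matches = [index.get(token, set()) for token in tokens]
    return sorted(set.intersection(*matches))


index = build_index(ingest(DOCUMENTS))
print(query(index, "mortgage"))           # [1, 2]
print(query(index, "mortgage property"))  # [1]
```

A real engine would add tokenization rules, ranking, and persistent storage, but the division of labor between the three stages stays the same.
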
Although the major components are the same, it turns out the requirements for enterprise search are quite different from web search.
Web Search vs. Enterprise Search
This is what I’ve learned:
Aspect | Web Search | Enterprise Search |
---|---|---|
Network Access | The Internet is an open network. | An Intranet is a closed network. |
Security | Internet searches are public and anonymous. Access control is not a concern. | Security considerations are critical. Enforcing the access control rules that exist in the systems of record can be complex. |
Resource Types | The resources found on the Internet are mostly file-based (HTML, PDF, TXT, MP3, GIF, etc.). | In an Intranet, the information is not limited to documents or files. Resources can include relational databases (SQL), directory services (such as LDAP), legacy data sources, etc. |
Ingest | If you are building a web search engine, all you need is a web crawler able to traverse the pages. The crawler will create a graph as it moves through all the available web resources. The ingest phase for web search is straightforward and relies primarily on standard protocols such as HTTP and FTP. | Data ingestion for the enterprise is more complex. In many enterprise deployments, data ingestion is the most difficult part of a search project. Access to enterprise resources will likely require HTTP and FTP as well as other mechanisms such as message queues and API calls. Any change in the systems landscape (new systems, system upgrades, schema changes, etc.) may threaten the stability of the search solution and will likely require a change to it. |
Index | For web search, a generic index is sufficient. | Building an index for an enterprise search engine can be complex. An enterprise search deployment needs to consider particular semantics or domains and how those are used within the organization. You need control over how relevancy is determined for a particular resource or set of resources. In an enterprise solution, customization is very important when defining and building your index. You may need to define multiple indexes using different criteria for each one of them in order to capture the semantics of each domain or data source being indexed. For example, when dealing with mortgage data, you may want to use an index that is concerned only with a set of fields that are more likely to be searched such as the customer name and property information. You may want to ignore other fields since they are not relevant in this context. |
Presentation Services | The information is formatted for presentation according to Internet standards (HTML, etc.). | The information may need to be processed before being displayed to the end user (relational databases, XML files, etc.). |
Search Criteria | The web search provider defines general criteria that are used to establish the relevance or weights for the values stored within its index. | You are expected to know a lot about the data in your enterprise, and to define particular criteria and semantics for different datasets. You can create categories and capture relationships that would not be relevant for a web search engine. |
Search Clients | Web searches are designed for end users. | Enterprise searches often need to support both end users and client systems. End users will require UI development. Other internal applications may want to query and surface results within their own UI by using search APIs. |
Search Types | For web searches, users primarily rely on direct searches against keywords. | In enterprise search, the types of searches needed by users are more complex. An important component of enterprise search is faceted search, an approach that lets you drill down into your results by narrowing the scope of the search at every step. Developing faceted search requires detailed knowledge of the data structures and must be driven by highly customized indexes. |
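
Several of the enterprise-side rows above (access control, field-restricted indexing, and faceted search) can be sketched together in a few lines of Python. This is a hedged illustration under stated assumptions, not a real product's API: the mortgage records, every field name, the group-based access model, and the `search` function are all hypothetical, and the code scans records linearly rather than using a prebuilt index to keep the example short.

```python
from collections import Counter

# Hypothetical mortgage records; every field name here is an assumption.
RECORDS = [
    {"id": 1, "customer": "Ann Smith", "property": "12 Oak St", "state": "TX",
     "loan_type": "fixed", "allowed_groups": {"lending"}},
    {"id": 2, "customer": "Bob Smith", "property": "9 Elm Ave", "state": "TX",
     "loan_type": "adjustable", "allowed_groups": {"lending", "audit"}},
    {"id": 3, "customer": "Cara Jones", "property": "4 Smith Rd", "state": "CA",
     "loan_type": "fixed", "allowed_groups": {"audit"}},
]

# Only these fields are searched; other fields are ignored for matching.
SEARCHABLE_FIELDS = ("customer", "property")
# These fields are offered to the user as facets for drilling down.
FACET_FIELDS = ("state", "loan_type")


def search(records, keyword, user_groups, filters=None):
    """Return (matching ids, facet counts) for records the user may see."""
    filters = filters or {}
    keyword = keyword.lower()
    hits = []
    for rec in records:
        # Access control: skip records the user's groups cannot read.
        if not (rec["allowed_groups"] & user_groups):
            continue
        # Keyword match restricted to the searchable fields.
        tokens = " ".join(rec[f] for f in SEARCHABLE_FIELDS).lower().split()
        if keyword not in tokens:
            continue
        # Facet filters chosen so far narrow the scope of the search.
        if any(rec[field] != value for field, value in filters.items()):
            continue
        hits.append(rec)
    # Count the remaining values per facet so the user can drill down further.
    facets = {f: dict(Counter(rec[f] for rec in hits)) for f in FACET_FIELDS}
    return [rec["id"] for rec in hits], facets


ids, facets = search(RECORDS, "smith", {"lending"})
print(ids)     # [1, 2]
print(facets)  # {'state': {'TX': 2}, 'loan_type': {'fixed': 1, 'adjustable': 1}}

# Drill down by selecting one facet value.
print(search(RECORDS, "smith", {"lending"}, {"loan_type": "fixed"})[0])  # [1]
```

Note that the same keyword returns different results for different users: a member of the hypothetical `audit` group would see records 2 and 3 instead, which is exactly the kind of behavior a public web search never has to handle.
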
Google is the leader
Gartner now has an “Enterprise Search MarketScope” category which focuses exclusively on enterprise search vendors.
It turns out Google is now in the lead for enterprise search.