January 15, 2008
What is Google hiding - Just how many servers do they have, anyway? (Part 4 in a Series on Search)
Posted by Chris Tebo, CTO
In my previous post, I started looking at what it takes to deliver real-time search for an email archive. That post focused on the size of the dataset an email archive encompasses, noting that these datasets start to approach the size of Google's web indexes. At those scales, the hardware infrastructure deployed to support real-time search is just as important as the software that executes those searches.
Four years ago I started down the path of building what has become the Fortiva Archiving Suite. Providing real-time search for a corporate email archive was obviously a requirement. To get started, I called a former colleague who had spent the 80s and 90s leading the charge in building market-leading search and knowledge management solutions. He's kept up with the comings and goings in the search technology space since then, and my question for him was simple: whose search technology should we license to solve our email archive search needs?
His answer surprised me and his guidance was straightforward...
1. All of the real-time search technology out there today is based on the concept of an inverted index, which allows you to very quickly identify documents that match search criteria. From one vendor to the next there are differences in their implementations, but at the core, the technology remains the same.
2. The challenge with large search datasets is that to deliver real-time search, the search infrastructure needs to be able to execute searches in parallel over multiple servers. At the time, none of the search vendors were providing this as part of their solution.
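To make the first point concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not Fortiva's implementation or any vendor's: the function names (`tokenize`, `build_inverted_index`, `search`) and the sample emails are made up for this example, and real engines add ranking, stemming, compression, and on-disk storage.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase the text and split it into terms, dropping edge punctuation."""
    terms = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [term for term in terms if term]

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Look up each term's posting set and intersect, smallest set first.
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result

# Hypothetical sample data, for illustration only.
emails = {
    1: "Quarterly budget review attached",
    2: "Lunch on Friday?",
    3: "Budget approval for the Friday offsite",
}
index = build_inverted_index(emails)
matches = search(index, "budget friday")
```

The lookup never scans the emails themselves. It only intersects the short lists of document IDs stored under each term, which is why this structure answers queries quickly.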
As we were setting out to build our SaaS email archiving solution, developing a platform and infrastructure that could distribute work over multiple servers (be it for search or any other function in the application) was fundamental. So we did the obvious thing... We combined the SaaS platform we needed to develop anyway with appropriate hardware and with open source search technology that provides the inverted indexes at the core of search. We built #2 (the infrastructure for searching in parallel over multiple servers), and we leveraged others' work on #1 (the inverted index).
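The idea of running one search in parallel over multiple servers can be sketched as a scatter-gather: split the documents into shards, send the query to every shard at once, and merge the partial results. The Python below is a toy under stated assumptions, not a description of Fortiva's platform. The `Shard` class, `partition`, and `scatter_gather` are hypothetical names, each shard is an in-memory index, and threads stand in for what would be separate servers in a real deployment.

```python
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """Stands in for one search server holding a slice of the archive."""

    def __init__(self, docs):
        # Small inverted index: term -> set of document IDs on this shard.
        self.index = {}
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.index.setdefault(term, set()).add(doc_id)

    def search(self, query):
        """Return IDs of this shard's documents that contain every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        postings = [self.index.get(term, set()) for term in terms]
        return set.intersection(*postings)

def partition(docs, shard_count):
    """Assign each document to a shard by its ID (an assumed, simplistic scheme)."""
    buckets = [{} for _ in range(shard_count)]
    for doc_id, text in docs.items():
        buckets[doc_id % shard_count][doc_id] = text
    return [Shard(bucket) for bucket in buckets]

def scatter_gather(shards, query):
    """Send the query to all shards concurrently and merge their results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = pool.map(lambda shard: shard.search(query), shards)
        merged = set()
        for partial in partial_results:
            merged |= partial
    return merged

# Hypothetical sample data, for illustration only.
emails = {
    0: "budget review for q1",
    1: "lunch on friday",
    2: "budget approval for friday offsite",
    3: "friday budget sync moved",
    4: "server maintenance window",
    5: "offsite travel booking",
}
shards = partition(emails, shard_count=3)
hits = scatter_gather(shards, "budget friday")
```

Because each shard only searches its own slice, adding shards keeps the per-server index small as the archive grows. That scaling comes from the plumbing, which is separate from the index structure itself.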
This may all sound like talk about the plumbing, and it is... But that was the point my colleague was making to me. Delivering real-time search on these large datasets is just as much about the plumbing as it is about the search technology. Take Google for example... They'll let you run their search application on your desktop. That represents point 1 above. But if you try to find out how many servers Google has deployed to support web search, and what that infrastructure looks like, you are unlikely to have much luck. Google remains very secretive about that, for good reason.
We've mentioned one firm we've spoken to whose in-house email archiving solution has taken over 25 days to complete a search. They, like many firms that tackle the challenge of enterprise search on large datasets, have learned the hard way that the software deployed to address this problem is only a small part of the puzzle. Deploying and managing the infrastructure required to deliver on the promise of real-time enterprise search is the hard part. For many firms, the cost of care and feeding for this infrastructure is excessive, and just doesn't make sense.
At Fortiva, we have deep knowledge about the infrastructure required to deliver real-time search to our customers. We know how much index data any one of our servers can manage, and we work hard to improve our software and our processes to deliver better results. I won't share those numbers here, for the same reason that Google doesn't tell you what their infrastructure looks like. Instead, and more importantly, we give our customers a guarantee on search performance. That's what they really care about. More on this next time...



