Proofpoint: Security, Compliance and the Cloud

14 posts categorized "Archiving Infrastructure"

January 15, 2008

What is Google hiding - Just how many servers do they have, anyway? (Part 4 in a Series on Search)

Posted by Chris Tebo, CTO

In my previous post, I started looking at what it takes to deliver real-time search for an email archive. That post focused on the size of the dataset that an email archive encompasses, and showed that these datasets start to approach the size of Google's web indexes. At those scales, the hardware infrastructure deployed to support real-time search is just as important as the software that executes those searches.

Four years ago I started down the path of building what has become the Fortiva Archiving Suite.  Providing real-time search for a corporate email archive was obviously a requirement.  To get started down this path, I called a former colleague who had spent his time in the 80s and 90s leading the charge building market-leading search and knowledge management solutions.  He's kept up with the comings and goings in the search technology space since then, and my question for him was simple - whose search technology should we license to solve our email archive search needs...

His answer surprised me and his guidance was straightforward...

  1. All of the real-time search technology out there today is based on the concept of an inverted index (a structure that maps each term to the documents containing it), which allows you to very quickly identify documents that match search criteria. From one vendor to the next there are differences in their implementations, but at the core, the technology remains the same.
  2. The challenge with large search datasets is that to deliver real-time search, the search infrastructure needs to be able to execute searches in parallel over multiple servers.  At the time, none of the search vendors were providing this as part of their solution.
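The two points above can be sketched in a few lines of Python. This is only an illustration, not the Fortiva implementation: the `Shard` class, the modulo-based document placement, and the thread pool standing in for separate servers are all assumptions made for the example. Each shard holds its own inverted index, and a query is fanned out to every shard in parallel before the partial results are merged.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One server's slice of the archive, with its own inverted index (hypothetical)."""

    def __init__(self):
        # term -> set of document ids that contain the term
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, terms):
        # A document matches only if it contains every query term.
        sets = [self.postings.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()


def build_shards(docs, n_shards):
    """Spread documents over shards; modulo placement is an assumption."""
    shards = [Shard() for _ in range(n_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % n_shards].add(doc_id, text)
    return shards


def parallel_search(shards, query):
    """Run the same query on every shard at once, then merge the answers."""
    terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: shard.search(terms), shards))
    return sorted(set().union(*partials))


docs = {
    0: "quarterly merger report",
    1: "lunch menu",
    2: "merger timeline draft",
    3: "merger report final",
}
shards = build_shards(docs, 3)
matches = parallel_search(shards, "merger report")
```

The index lookup (point 1) is the small `search` method. Most of the remaining code, and in a real deployment most of the operational effort, is in point 2: deciding where data lives and coordinating work across machines.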

As we were setting out to build our SaaS email archiving solution, developing a platform and infrastructure that distributes work over multiple servers (be it for search or any other function in the application) was fundamental. So we did the obvious thing: we combined the SaaS platform we needed to develop anyway with appropriate hardware and with open source search technology that provides the inverted indexes at the core of search. We built #2, and we leveraged others' work on #1.

This may all sound like talk about the plumbing, and it is...  But that was the point my colleague was making to me.  Delivering real time search on these large datasets is just as much about the plumbing as it is the search technology.  Take Google for example... They'll let you run their search application on your desktop.  That represents point 1 above.  But if you try to find out how many servers Google has deployed to support web-search, and what that infrastructure looks like, you are unlikely to have much luck. Google remains very secretive about that, for good reason.

We've mentioned one firm we've spoken to whose in-house email archiving solution took over 25 days to complete a search. They, like many firms that tackle the challenge of enterprise search on large datasets, have learned the hard way that the software deployed to address this problem is only a small part of the puzzle. Deploying and managing the infrastructure required to deliver on the promise of real-time enterprise search is the hard part. For many firms, the cost of care and feeding for this infrastructure is excessive, and just doesn't make sense.

At Fortiva, we have deep knowledge about the infrastructure required to deliver real-time search to our customers. We know how much index data any one of our servers can manage, and we work hard to improve our software and our processes to deliver better results. I won't share those numbers here, for the same reason that Google doesn't tell you what their infrastructure looks like. Instead, and more importantly, we deliver a guarantee on search performance to our customers. That's what they really care about. More on this next time...

Click here to read Part 5 in the Series on Search

January 04, 2008

Searching an Email Archive: Real-World Examples (Part 2 in a Series on Search)

Posted by Rick Dales, VP Product Management

In my previous post, I talked about the significant challenges of enterprise-wide search, and how those challenges directly translate to an email archive (in fact, they’re arguably greater for an email archive).

Today, most organizations archive email for legal discovery purposes. While they may have other goals, including compliance and storage management, searching through the entire repository is a fundamental requirement for any archive. The problem is that firms always underestimate the growth of the data and the infrastructure required to support searching that data (and the sales teams at most email archiving vendors have little or no reason to change that).

To further this point, I wanted to share some real world experiences from companies we recently talked to that have an in-house email archiving solution in place. The first is an international bank that was archiving for a division of about 10,000 users.  Within two years they had amassed several terabytes of information. At that point, every time their legal or compliance department requested data from the archive, it was taking the IT department in excess of 24 hours to run a search.  With an expectation of next day delivery of information, this left no room for error.

This is far from the worst example we’ve seen. Another firm took over 25 days to complete a single search. And these experiences are not uncommon.  Making it even worse, we frequently hear that IT staff must stay up all night monitoring these long-running activities, because turnaround times don't allow for processes that fail overnight to be restarted the next day. 

Almost without exception, the companies we talk to say that their email volume is growing faster than expected.  The end result is that any new investments in the archive go toward growing the data intake processing capacity, not the search or access capability. Companies simply don’t have the budget, staff or time to keep up with search optimization. Which takes me back to my first post of this series, where I explained how a few years’ worth of corporate information can quickly accumulate to the size of all public information on the web, making it unreasonable for a company to even try to achieve short windows for search in-house (it would require hundreds or thousands of dedicated servers).

The big challenge is that for most organizations archiving data for litigation readiness, the data remains largely untouched until a legal issue arises. At that point, critical (and time-sensitive) searches are required. Yet maintaining the infrastructure in-house to conduct those searches on an infrequent basis (even a couple times a week) makes no sense. Leveraging a shared (SaaS) infrastructure for search, on the other hand, is an ideal way to cost-effectively conduct time-sensitive searches on a periodic basis.

As the archiving industry begins to mature, and more companies have experience managing an archive for more than a year or so, this problem will continue to come to light, and the benefits of multi-tenancy for archiving will be better understood. In the meantime, if you’re considering an email archive, take the time to ask the vendors you’re evaluating if they track search performance. Furthermore, ask to speak to customer references that have been archiving email for a significant period of time (and that have storage requirements comparable to your own), and ask them about search times. You might be surprised at the answer.

Click here to read Part 3 in the Series on Search

November 29, 2007

Green Computing and Virtualization – Basics First

Posted by Jeremy Hope, VP Operations

Sitting at the Gartner Data Center conference, I hear discussion of green computing and better utilization of servers to reduce power and cooling requirements within data centers everywhere. In all of these sessions, about two slides into the presentation, the discussion turns to virtualization technology as the way to achieve this goal.

What strikes me as strange is that not one of the presentations I have attended so far has discussed the basic step needed to start reducing power consumption, reducing cooling needs, and implementing virtualization technology: upgrading the old server infrastructure. Sure, everyone is trying to sell their biggest, baddest 32- or 64-way CPU box that can be sliced up into a myriad of VMs. But what about the basics: taking the old power-hungry, heat-spewing Pentium III/Xeon boxes that have been purchased over the past years and replacing them with current (x-core or equivalent) technology?

At Fortiva, we changed our standard Linux server configuration from a Pentium III based system to a Dual Core CPU based solution, at the same time upgrading to a more current motherboard, power supplies and interfaces.  Doing so has not only reduced the cost of each server by approximately $400 (while providing a 150% lift in processing power – perfect for VM implementation)  but these new machines also use approximately 60% less power.  That’s a huge savings and increase in processing power, without implementing a bit of VM code.
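To put those two figures together, here is a small back-of-the-envelope calculation. It assumes that a "150% lift" means the new server does 2.5 times the work of the old one, and that "60% less power" means it draws 0.4 times the power. Both readings are my interpretation, and the function name is made up for this example.

```python
def perf_per_watt_gain(perf_lift_pct, power_reduction_pct):
    """Ratio of new to old performance-per-watt, from two percentage changes."""
    perf_ratio = 1 + perf_lift_pct / 100         # 150% lift -> 2.5x the throughput
    power_ratio = 1 - power_reduction_pct / 100  # 60% less  -> 0.4x the power draw
    return perf_ratio / power_ratio


gain = perf_per_watt_gain(150, 60)
print(f"Work done per watt improves by roughly {gain:.2f}x")
```

Under those assumptions, each new server does roughly six times as much work per watt as the machine it replaces, before any virtualization is layered on top.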

Maybe it’s time for the hardware vendors to provide buyback incentives for replacing older technology that is power hungry with newer equipment that manages power better and would provide a much better basis for most Virtualization technology.

The government in Ontario (where I live) recently started offering tax discounts for anyone replacing an energy-pig appliance with one that is Energy Star compliant, even offering to come pick up that old beer fridge from your basement to help us all save energy. Maybe the hardware vendors and data center hosting providers should get together and offer a similar plan to remove those old Pentium III machines from the data center. Then maybe they can provide adequate power and cooling for all those new virtual machines without a major retrofit.

June 07, 2007

How We Solved the SaaS Security Challenge

Posted by Chris Tebo, CTO

When you start talking to IT people about SaaS, one of the most commonly mentioned concerns that comes up is security. I mentioned in an earlier post that I spent a lot of time developing technology to address those security concerns for a SaaS email archive. Since the problem of security and SaaS is something I’d like to explore further through this blog, I figure I should first explain how we dealt with the issue at Fortiva.

The challenge for us came up when we originally started looking at email archiving as a business opportunity. It seemed to us that it was an obvious application to outsource, since so many things about email archiving lend themselves to managed services. Dealing with large volumes of data and dealing with data that is idle for long periods of time are perfect examples. In fact, they’re some of the same reasons why businesses have for years used third parties to store their documents and backup tapes (think Iron Mountain or Recall).

The problem was that when we started thinking about what we’d be storing, we realized that in many cases it would be a company’s most critical business data – everything from details on mergers and acquisitions to intellectual property. Even more importantly, we’d be storing it in a format that is easy to search through in seconds, unlike the boxes of data or backup tapes at traditional third-party data storage companies. So to gain the trust of customers, we knew the onus was on us to prove that we could offer the highest levels of data security.

We came to the conclusion that a pure outsourcing approach just wouldn’t work - this required more than simply asking customers to trust that we have put appropriate security measures in place. We needed to have a technology that could be put to task, a technology that was built from the ground up to prevent us, as the vendor, from having visibility into the customer’s data.

So we started by looking at other solutions where this problem comes up. The closest example we could find was remote offsite backups. The security problem there has been addressed by encrypting data before it leaves the customer site, and leaving it encrypted during storage. This works well for third-party backup providers because they don’t do anything with the data other than store it and send it back to you when necessary. 

The problem is, when you’re talking about email archiving, you need to provide rich functionality and workflow around the data that’s in the archive. A great example is search – we needed to provide a way for our customers to search through messages in the archive. We also needed to provide a way to apply policy and workflow to the archived data based on the content of the messages. So somehow we needed to find a way to “see” the data, without being able to access the content of the data. 

All of this led to us developing DoubleBlind Encryption™ technology. What this allows us to do is to encrypt the data on an appliance before it leaves the customer site, and then store it in encrypted form, much like the traditional third-party backup provider does. Since the customer has the encryption key (Fortiva doesn’t have a copy), we have no way to decrypt the data stored on our network.

The appliance that encrypts the data is also used to prepare an index of the data before it leaves the customer site. When a customer types in a search request, the request goes through the appliance, where it is encrypted. The encrypted search terms are then cross-referenced against the encrypted archive, and the results are sent back to the appliance, which decrypts them before returning them to the end user. All this happens in seconds, because we built the system on a scalable grid architecture.

To put it more simply, imagine that we were using pig latin as the encryption scheme (of course, we don’t use pig latin, because that’s not particularly safe or secure). In this example, if we wanted to archive a message with the phrase “the quick brown fox,” it would be encrypted to “ethay ickquay ownbray oxfay”. We then build an index over those encrypted terms. When a user types in a query for “brown fox”, the request is encrypted at the customer site to “ownbray oxfay” and that query is sent to the system. Any matches are found and returned to the customer network, decrypted by the appliance, and returned to the end user.
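The pig latin analogy can be turned into a toy program. To be clear, this is a sketch of the analogy only and not of DoubleBlind Encryption itself: pig latin is not encryption, and the names `pig_latin`, `appliance_encode`, and `BlindIndex` are invented for the example. What it shows is that the hosted index can answer queries while only ever seeing transformed terms, because the appliance applies the same transformation to the archived text and to the query.

```python
from collections import defaultdict

VOWELS = "aeiou"


def pig_latin(word):
    """Toy stand-in for encryption: move the leading consonants to the end, add 'ay'."""
    word = word.lower()
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    # Keep "qu" together, so "quick" becomes "ickquay" rather than "uickqay".
    if 0 < i < len(word) and word[i - 1] == "q" and word[i] == "u":
        i += 1
    return word[i:] + word[:i] + "ay"


def appliance_encode(text):
    """Runs at the customer site: transforms every term before it leaves."""
    return [pig_latin(w) for w in text.split()]


class BlindIndex:
    """Runs at the provider: stores and matches transformed terms only."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, encoded_terms):
        for term in encoded_terms:
            self.postings[term].add(doc_id)

    def search(self, encoded_terms):
        sets = [self.postings.get(t, set()) for t in encoded_terms]
        return set.intersection(*sets) if sets else set()


index = BlindIndex()
index.add(1, appliance_encode("the quick brown fox"))  # ethay ickquay ownbray oxfay
index.add(2, appliance_encode("the lazy dog"))

hits = index.search(appliance_encode("brown fox"))
```

Note that a plain-text query sent straight to the index, bypassing the appliance, finds nothing: `index.search(["brown", "fox"])` returns an empty set, because the index never held the original words.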

If our staff or any other outside party were to try to access the data, or even the search terms that a company uses, all they could access would be meaningless data that’s encrypted using the highest standards in encryption technology. All of that together means our customers don’t have to “trust us”. Instead, they can feel confident that the technology is in place to keep their data as secure as, if not more secure than, it is on their own corporate network.

©2012 Proofpoint, Inc.