Documentum Search – Lucene, FAST, Verity, Google and upcoming DSS

Since the new Documentum Search Services beta program just started last week, we thought we would share some of TSG’s thoughts on full-text search and our plans to add Lucene capabilities to our open source offerings.

Documentum Search Services (DSS) was tentatively called Enterprise Search Services (ESS) early in product development. DSS promises to be “the next generation of search in EMC and will be built upon xDB with Apache Lucene as the underlying indices”. Specific highlights from EMC World included:

  • Relevance Sorting
  • Advanced Query Processing
    • Parallel, Native Facet computation, XQuery for structured and unstructured search
    • Lower Hardware and Storage Costs
  • Native VMWare, NAS, SAN support and Advanced Data Placement

At the present time, DSS is targeted for heavy testing through the end of 2009 with a release in 2010.

TSG Thoughts on DSS

At the present time, we are very encouraged by the progress and direction of DSS. We have been using Lucene for a couple of clients and can safely say that the tool will address many of the shortcomings of FAST, including index rebuilds, overall performance and server requirements. That being said, the scope of DSS needs to encompass all of the Documentum API-level functionality that FAST or Verity have addressed in the past. As the beta progresses, the “devil is in the details” of how DSS evolves, so we will withhold our final thoughts until the beta is complete.

Other Tools (Autonomy, Google Appliance, SearchBlox, Vivisimo….)

As an integrator, we do get asked to integrate different search tools. We began working with Autonomy for EMC on an internal Documentum project (pre-Documentum purchase) back in the late 90s. Overall, most search tools meet full-text needs but are typically built as “crawlers” focused on the web. As a crawler, the tool needs to scan a directory or website for changes and then update the full-text index. We have found this approach difficult when Documentum clients want to do true “Documentum Searches” that combine attributes, security and full text. For example, one client wanted to search on secure documents from a certain plant (attribute), with a given create date (attribute) and containing a given part number (full-text).

Also, a couple of clients have had concerns about the latency between when a document is stored in Documentum and when it is indexed (after the crawler runs) in the full-text search engine. One client complained that with FAST, sometimes the latency was 2 minutes and other times it was 2 hours.

Our last concern with the crawler approach is how to get the attribute and security data added to the index. Without them, we have to run one query against Documentum (plant, create date, security) and another against the full-text index (part number), and then display only the results that appear on both lists.
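To make the cost of that two-list pattern concrete, here is a minimal sketch of it. Everything in it is an assumption for illustration: the in-memory dictionaries stand in for the repository and the full-text engine, and names such as `repository_query`, `fulltext_query` and the sample ACL names are hypothetical, not Documentum or FAST APIs.

```python
from datetime import date

# Hypothetical stand-in for the repository: attributes and ACL name per document.
REPOSITORY = {
    "doc1": {"plant": "Chicago", "created": date(2009, 3, 1), "acl": "acl_engineering"},
    "doc2": {"plant": "Chicago", "created": date(2008, 6, 15), "acl": "acl_engineering"},
    "doc3": {"plant": "Detroit", "created": date(2009, 4, 10), "acl": "acl_engineering"},
    "doc4": {"plant": "Chicago", "created": date(2009, 5, 20), "acl": "acl_finance"},
}

# Hypothetical stand-in for a crawler-built full-text index: content only.
FULLTEXT = {
    "doc1": "Replacement gasket for part-number 4471-B",
    "doc2": "Assembly notes for part-number 4471-B",
    "doc3": "Inspection report for part-number 4471-B",
    "doc4": "Invoice for part-number 4471-B",
}


def repository_query(plant, created_on_or_after, user_acls):
    """First pass: attribute and security filtering in the repository."""
    return {
        doc_id
        for doc_id, meta in REPOSITORY.items()
        if meta["plant"] == plant
        and meta["created"] >= created_on_or_after
        and meta["acl"] in user_acls
    }


def fulltext_query(term):
    """Second pass: content match in the full-text index (no attributes, no security)."""
    needle = term.lower()
    return {doc_id for doc_id, text in FULLTEXT.items() if needle in text.lower()}


def two_pass_search(plant, created_on_or_after, user_acls, term):
    """Run both queries and keep only the documents that appear on both lists."""
    allowed = repository_query(plant, created_on_or_after, user_acls)
    matching = fulltext_query(term)
    return sorted(allowed & matching)


print(two_pass_search("Chicago", date(2009, 1, 1), {"acl_engineering"}, "4471-B"))
```

In this toy data the full-text pass returns all four documents while only one survives the intersection, which is the wasted work the concern above describes.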

Native Lucene with Documentum?

One scenario we are building out for clients is a Documentum 5.3 or 6.5 application that indexes documents into Lucene from either Documentum or a cached copy (whitepaper here). To differentiate from DSS, our approach won’t provide support for inline DQL but will instead take a pure web services approach.

In the diagram below, both OpenMigrate and HPI use OpenContent web services to communicate with Lucene.  OpenMigrate is used to keep the Lucene index up to date, and HPI is used to query the index for full text searches and optionally metadata searches as well:

[Diagram: full_text_arch]

A couple of key factors:

  • 5.3 Support – we are focused on supporting 5.3, 6.0, 6.5 and future releases. Many of our clients have chosen to delay their upgrades for a variety of reasons. By implementing Lucene now, clients can remove FAST from their current environment and from an eventual D6.5 upgrade.
  • Attributes – we are focused on storing the content, attributes and security together in Lucene to avoid having to search both the Documentum attributes and the Lucene full-text index.
  • Indexing – we are leveraging OpenMigrate to index and delete content and metadata in Lucene on a real-time, multi-threaded push basis to avoid a crawler approach. We think the push approach can better control updates to the index, reduce server load on the full-text index and improve audit control to ensure everything is indexed.
  • Security – one issue we addressed was how to balance security concerns against high-performance search. Verifying that the user has access to browse each document retrieved from the search (a Documentum lookup per document) is expensive and would hurt performance, as identified in the crawler discussion above. One approach was to cache document ACL information with each document in Lucene and update it as ACLs are updated. Since Documentum ACLs don’t change often, we would use a single lookup to retrieve the user’s ACL access and add that information to the Lucene query.
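The attribute, indexing and security points above can be sketched together with a toy in-memory index. This is an illustration under stated assumptions, not the OpenMigrate, OpenContent or Lucene API: the `PushIndex` class, its method names, `lookup_user_acls` and the sample users and ACL names are all hypothetical.

```python
class PushIndex:
    """Toy index that stores content, attributes and the ACL name for each document."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, content, attributes, acl_name):
        """Push a new or changed document into the index (no crawling)."""
        self._docs[doc_id] = {
            "content": content.lower(),
            "attributes": dict(attributes),
            "acl": acl_name,
        }

    def delete(self, doc_id):
        """Push a delete so the index does not keep stale entries."""
        self._docs.pop(doc_id, None)

    def update_acl(self, doc_id, acl_name):
        """Refresh the cached ACL name when the document's ACL changes."""
        self._docs[doc_id]["acl"] = acl_name

    def search(self, term, attributes, user_acls):
        """Single query: full text, attributes and security are all checked in the index."""
        needle = term.lower()
        hits = []
        for doc_id, doc in self._docs.items():
            if doc["acl"] not in user_acls:
                continue  # security filter uses the cached ACL name
            if any(doc["attributes"].get(k) != v for k, v in attributes.items()):
                continue  # attribute filter
            if needle in doc["content"]:
                hits.append(doc_id)
        return sorted(hits)


# Hypothetical data: which ACLs grant each user browse access.
USER_ACLS = {"jsmith": {"acl_engineering"}, "mlee": {"acl_engineering", "acl_finance"}}


def lookup_user_acls(user):
    """Stand-in for the one repository lookup made per search, not per result."""
    return USER_ACLS.get(user, set())


idx = PushIndex()
idx.index("doc1", "Gasket for part-number 4471-B", {"plant": "Chicago"}, "acl_engineering")
idx.index("doc2", "Invoice for part-number 4471-B", {"plant": "Chicago"}, "acl_finance")

print(idx.search("4471-B", {"plant": "Chicago"}, lookup_user_acls("jsmith")))
```

The design choice being illustrated is that the user’s ACL set is fetched once and becomes part of the query, so no per-document repository call is needed after the search returns.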

So far our results have been favorable.  Please contact us if you are interested in this type of solution as we are looking for additional case studies.
