Documentum Search – Lucene versus FAST

As mentioned in a previous article, many clients are moving away from FAST in preparation for Documentum Search Services (DSS), slated for release in June, which leverages the open source product Apache Lucene.  This post shares the results from one client that ran a proof of concept to compare the two search engines.

Proof of Concept Approach – As we have mentioned before, many clients have decided to implement an external cache outside of Documentum to address business continuity, performance and licensing issues.  For a large pharmaceutical client, TSG was tasked with performing a proof of concept on 156,000 documents in an external data source indexed by Lucene.  The proof of concept compared the search results of FAST within Documentum (Webtop) against those of Lucene (HPI) outside of Documentum.  It additionally evaluated leveraging Lucene for metadata storage rather than storing the metadata in another database such as Oracle.

POC Findings – Lucene/HPI with the external repository was found to be considerably quicker than the existing FAST/Webtop implementation on most queries.

Specific results:

Query           FAST/Webtop    Lucene/HPI
1200 Results    90 seconds     3 seconds
8 Results       5 seconds      3 seconds
10 Results      8 seconds      4 seconds
76 Results      10 seconds     5 seconds
5100 Results    72 seconds     5 seconds
65 Results      6 seconds      3 seconds
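
To put the timings listed above in relative terms, here is a small sketch that computes how many times quicker Lucene/HPI was on each query. The numbers are transcribed from the results in this post; the names `timings` and `speedups` are just illustrative.

```python
# (result count, FAST/Webtop seconds, Lucene/HPI seconds), transcribed from the post
timings = [
    (1200, 90, 3),
    (8, 5, 3),
    (10, 8, 4),
    (76, 10, 5),
    (5100, 72, 5),
    (65, 6, 3),
]


def speedups(rows):
    """Return (result_count, fast_seconds / lucene_seconds) for each query."""
    return [(count, fast / lucene) for count, fast, lucene in rows]


for count, ratio in speedups(timings):
    print(f"{count:>5} results: Lucene/HPI was {ratio:.1f}x quicker")
```

The gap is widest on the two large result sets (1200 and 5100 results), while the small result sets differ by roughly a factor of two.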

A simple configuration of the Lucene index did a better job of returning a complete search result set than the standard FAST/Webtop configuration.  The additional hits included documents containing logical derivatives of the search terms.  For example – a search for “exception report” could return “exceptions report” or “exception reports”.  The proof of concept data set also included German documents, and Lucene demonstrated multilingual stemming capability.
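
To illustrate the idea behind that behavior, here is a rough sketch of stemming at match time. It is a deliberately naive plural stripper written for this post, not the analyzer used in the proof of concept, and the function names are hypothetical.

```python
def naive_stem(word):
    """Crude English plural stripping; an illustrative stand-in, not a real stemmer."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def stems(text):
    """Split on whitespace and stem every term."""
    return [naive_stem(w) for w in text.split()]


def matches(query, title):
    """True when every stemmed query term appears among the stemmed title terms."""
    title_terms = set(stems(title))
    return all(term in title_terms for term in stems(query))


for title in ["Exceptions Report", "Exception Reports", "Expense Summary"]:
    print(title, "->", matches("exception report", title))
```

Because both the query and the document text are reduced to the same stem, “exception report” finds the plural variants without the user having to type them.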

Key Stats – Lucene

  • 156,000 Documents – 31.6 Gigabytes
  • Total Index Space – 521 MB
  • Total Index Build Time – 10 hours – The client was very interested in the time it took to index the content and metadata in Lucene because they had experienced lengthy indexing times with FAST in their 5.3 upgrade. Build time was tracked as part of the proof of concept; however, the corresponding FAST data from the 5.3 upgrade is no longer available.
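
A quick back-of-the-envelope sketch of what those stats imply. The inputs come straight from the list above; the 1 GB = 1024 MB conversion is an assumption, since the post does not say which convention was used.

```python
DOCUMENTS = 156_000
CONTENT_GB = 31.6
INDEX_MB = 521
BUILD_HOURS = 10

# Indexing throughput over the whole build
docs_per_hour = DOCUMENTS / BUILD_HOURS
docs_per_second = docs_per_hour / 3600

# Assumption: 1 GB = 1024 MB and 1 MB = 1024 KB
content_mb = CONTENT_GB * 1024
index_ratio = INDEX_MB / content_mb
avg_doc_kb = content_mb * 1024 / DOCUMENTS

print(f"{docs_per_hour:,.0f} documents/hour ({docs_per_second:.1f} per second)")
print(f"index is {index_ratio:.1%} of the source content size")
print(f"average document size is about {avg_doc_kb:.0f} KB")
```

In other words, the build averaged a little over four documents per second, and the index occupied under 2% of the space of the content it covers.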

FAST and Lucene – Full Text Syntax Differences

  • FAST
    • “One Two” – will return documents with the exact phrase “One Two” in the document
    • One Two – will return documents with the words One OR Two in the document
    • One+Two – will return documents with the words One OR Two in the document
    • One and Two – will return documents with the words One AND Two in the document
  • Lucene – Based on the Proof of Concept’s configuration
    • “One Two” – will return documents with the exact phrase “One Two” in the document
    • One Two – will return documents with the words One AND Two in the document
    • One OR Two – will return documents with the words One OR Two in the document
    • One and Two – will return documents with the words One AND Two in the document
    • One+Two  – will return documents with the exact phrase “One Two” in the document
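
The most visible difference above is how a bare two-word query is combined: OR in FAST, AND in the proof of concept's Lucene configuration. Here is a toy in-memory sketch of that one difference. It is not Lucene's or FAST's query parser, and the `search` function and sample documents are made up for illustration.

```python
def tokens(text):
    """Lowercase and split on whitespace."""
    return set(text.lower().split())


def search(query, docs, default_operator="AND"):
    """Return the names of docs matching a bare multi-word query.

    default_operator decides how unquoted terms combine: "AND" mirrors the
    proof of concept's Lucene configuration, "OR" mirrors the FAST behavior
    listed above.
    """
    terms = tokens(query)
    combine = all if default_operator == "AND" else any
    return [name for name, body in docs.items()
            if combine(term in tokens(body) for term in terms)]


# Hypothetical sample documents
docs = {
    "a.txt": "one two three",
    "b.txt": "one four",
    "c.txt": "five six",
}

print(search("One Two", docs, default_operator="AND"))  # ['a.txt']
print(search("One Two", docs, default_operator="OR"))   # ['a.txt', 'b.txt']
```

The same query returns a broader set under OR, which is worth keeping in mind when comparing result counts between the two engines.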

Overall Thoughts

Overall, the client was very satisfied with the findings and is moving forward with the solution.  The flexibility of Lucene to index both the metadata and full-text values allowed the client to avoid adding an Oracle database to their external cache for attribute storage.  The client also liked the simpler, more intuitive search interface of HPI compared to the Webtop interface.

In addition to leveraging Lucene for searching an external cache, we are also working to leverage Lucene for internal Documentum/Webtop search.

If you have any questions or would like more detailed information, please contact us or comment below:

12 thoughts on “Documentum Search – Lucene versus FAST”

  1. Is that a valid comparison? A FAST search via Webtop also has to process the security applied by all the ACLs specified in Documentum for each result. Did the external repository also have this concept? What was the external repository?

  2. Mike – I would agree with your point, the comparison is not completely apples to apples and there are definitely processing tasks on the Webtop/FAST side of things that were not done on the Lucene/HPI side (e.g., ACL application). The client was frustrated with full text search performance and could expose documents out to an external read-only cache where all users have access. This approach enables consumers to quickly search for content without unnecessary Documentum overhead.

    We realize that this solution does not always apply – there are situations where ACL security needs to be honored. As mentioned in the post, we are also looking at integrating Lucene with Documentum which (to your point) would provide a more apples to apples comparison. Once we have more concrete findings, we will post on that.

  3. I agree that FAST is very bad with the results it provides, but trust me, it is FAST. It takes about 1-2 seconds on a 2 million document repository a few terabytes in size. What takes longer is the ACL verification. You can validate this yourself by going to the search interface provided by the index server and firing an FT-DQL query.

    Just saying, the comparison is not fair. But I like the community embracing open source search platforms.

    -Ramesh

    • Ramesh,

      You bring up a great point in regards to security and ACLs. The client was looking for something faster, and the combination of the cache and Lucene was definitely faster for what was loosely secured content. We have been doing the “web cache” approach for a while with just attributes, so adding Lucene wasn’t that much of a stretch. Like Bethany said – not really apples to apples.

      Look for another post here shortly on our approach to searching with Lucene in Documentum and preserving ACL security. I just saw the first draft so it should be out in a day or so.

      Dave
