Hadoop – Why Hadoop as a Content Store when Caching Content for ECM Consumers

Last week we posted on a publishing approach for enterprise search.  Beyond enterprise search, we have seen more and more ECM clients look to publish content out of the ECM repository for a variety of business reasons, including performance, business continuity and cost reduction.  This post highlights how Hadoop can be used within a publishing architecture and explains some of the benefits.

ECM publishing infrastructure – Reasons and Justifications

As presented last week, as well as in other posts, ECM customers look to publish from one or multiple ECM environments for a number of reasons including:

  • Business Continuity – Processes that rely on the documents managed by the ECM infrastructure might be interrupted if that system became unavailable.  Publishing the content to redundant infrastructures for consumption can avoid that interruption.
  • System Performance – ECM systems typically carry overhead (ex: security, extra interface elements for authors) that can slow down retrieval.  A published approach uses a simple consumer-only security model to speed up retrieval, and it moves expensive consumer queries (and the resulting performance restrictions) off the ECM repository used by authors and approvers.
  • User Training – ECM “do all” interfaces can be confusing for the average user.  A simplified consumer-only interface can typically be set up to require zero training.
  • Enterprise Search – As presented last week, content can be published from multiple sources to allow a true “enterprise” search experience of all systems.
  • User License and Maintenance – Additional consumers can be easily added to the published repository without adding and maintaining more users in the ECM authoring and approval repository.  This has been helpful as companies grow and add employees, as well as when they bring on third-party consumers/contractors.

ECM Publishing Infrastructure – Components

ECM publishing infrastructure has four major components as depicted below:

[Figure: OpenMigrate Publishing]

These components include:

  • Publishing Infrastructure – Above, this is pictured as OpenMigrate.  This component polls the ECM repository(ies) and publishes the content when it reaches a certain approved stage.  The publishing job retrieves the document (typically only PDF renditions) along with its meta-data, then posts the meta-data to the index and the document to the content store.
  • Index – The index maintains the information about the documents (meta-data) as well as potentially the full-text index for the documents.  Typically we recommend Lucene/Solr for its performance and cost (open source).
  • Interface – The interface allows access to the index to identify documents as part of a user search.
  • Content Store – The content store holds the document itself.  Typically the content store is a mounted file system or SAN.

Typically we will see components of the infrastructure replicated to multiple environments for quick access as well as business continuity.  Typical scenarios include geographic (ex: North America, Europe, Asia…) as well as business (ex: Plant A, Plant B…).  Clients accomplish the redundancy either with multiple publishing jobs or by leveraging other replication capabilities.
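
The publishing flow described above can be sketched in a few lines of Python.  This is only an illustration, not how OpenMigrate is implemented: the `Document` and `PublishingTarget` classes, the `publish_approved` function, the "Approved" status value and the sample documents are all hypothetical, and plain dictionaries stand in for the index and the content store.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical view of a document in the source ECM repository."""
    doc_id: str
    status: str           # lifecycle state in the source repository
    metadata: dict
    pdf_rendition: bytes


@dataclass
class PublishingTarget:
    """In-memory stand-ins for the index and the content store."""
    index: dict = field(default_factory=dict)          # doc_id -> meta-data
    content_store: dict = field(default_factory=dict)  # doc_id -> PDF bytes


def publish_approved(repository, target, approved_status="Approved"):
    """Copy meta-data and rendition of each approved document to the target."""
    published = []
    for doc in repository:
        if doc.status != approved_status:
            continue  # drafts and in-review documents stay in the ECM system
        target.index[doc.doc_id] = dict(doc.metadata)
        target.content_store[doc.doc_id] = doc.pdf_rendition
        published.append(doc.doc_id)
    return published


# Sample data (made up for illustration).
repository = [
    Document("SOP-001", "Approved", {"title": "Line clearance"}, b"%PDF-1.4 ..."),
    Document("SOP-002", "Draft", {"title": "Cleaning"}, b"%PDF-1.4 ..."),
]
target = PublishingTarget()
published_ids = publish_approved(repository, target)
print(published_ids)
```

Only the approved document reaches the consumer side, which is what keeps the consumer security model simple: anything in the published store is, by construction, fit for consumption.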

Hadoop as a content store for caching consumers

Hadoop has some great features that make it a strong fit as a content store for the publishing infrastructure.  These include:

  • Open Source – like other components of the architecture, Hadoop is open source and does not require an additional purchase.
  • Hadoop Distributed File System (HDFS) – HDFS is built around replication: each file is stored as blocks that are copied to multiple servers (three copies by default), and those servers could be geographically separated.  Typically, we will want to have servers close to the physical location of the consumers to speed content retrieval.  By utilizing separate servers, HDFS can provide quicker access by retrieving from the closest copy, rather than requiring a distributed SAN with duplicate copies that become difficult to maintain.
  • Reindexing Scenarios – Hadoop provides the ability to store not just the content file but also the meta-data in a redundant environment.  Clients will often rebuild the index of the publishing repository to meet new taxonomy requirements.  With most solutions, the publishing job (ex: OpenMigrate) would have to re-run against each separate ECM repository to perform the reindex.  With Hadoop, the reindex could be accomplished within the publishing environment with no need to access the source repositories.

Because of the ability to store meta-data in Hadoop, some clients have asked if Hadoop can be used to replace the index/search component as well.  We would recommend sticking with Lucene/Solr, as it provides both meta-data and full-text search capability and is tuned for search.
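
To make the reindexing scenario concrete, here is a minimal Python sketch of rebuilding the index from meta-data kept alongside the content.  It makes several assumptions: a dictionary stands in for the Hadoop content store, the `reindex` function and the `taxonomy_map` argument are hypothetical, and the "taxonomy change" is reduced to renaming a field.  A real job would read from HDFS and post the results to Lucene/Solr.

```python
def reindex(content_store, taxonomy_map):
    """Rebuild index entries from meta-data held in the content store.

    content_store: doc_id -> {"metadata": {...}, "content": bytes}
    taxonomy_map:  old field name -> new field name (hypothetical change)
    """
    new_index = {}
    for doc_id, entry in content_store.items():
        # Rename fields covered by the new taxonomy; keep the rest unchanged.
        new_index[doc_id] = {
            taxonomy_map.get(name, name): value
            for name, value in entry["metadata"].items()
        }
    return new_index


# Sample data (made up for illustration).
content_store = {
    "SOP-001": {"metadata": {"dept": "QA", "title": "Line clearance"}, "content": b"..."},
    "SOP-002": {"metadata": {"dept": "Mfg", "title": "Cleaning"}, "content": b"..."},
}
new_index = reindex(content_store, {"dept": "business_unit"})
print(new_index["SOP-001"])
```

The point of the sketch is that nothing in it touches the source ECM repositories: everything needed to rebuild the index already lives in the publishing environment.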


Hadoop can provide a more robust open source content store for ECM publishing infrastructures with added redundancy and performance as well as better support for reindexing requirements.  From a TSG perspective, we have added Hadoop support for this approach with:

  • Publishing – OpenMigrate supports all of the publishing as well as reindexing requirements.
  • Interface – HPI can store and retrieve documents from Hadoop as well as other ECM repositories.

If you have any thoughts, please add your comments below.
