Hadoop, a leading technology disruptor from the Big Data movement, is a massively scalable architecture capable of dealing with terabytes and more of unstructured data. As the cost of storage, memory, and CPU continues to move toward almost free, we are seeing clients looking to move away from traditional legacy ECM tools to the newer architecture and cost savings of managing their documents with Hadoop. One major advantage of Hadoop over traditional vendors is the ability to add attributes “on the fly”. This post will describe an ECM content model for Hadoop and walk through the process of adding and removing attributes “on the fly”, based on our experience with Hadoop ECM clients.
Legacy ECM – Schema on Write
Legacy ECM has traditionally required a highly structured schema of object types with metadata/properties/attributes before any content is written to the ECM repository. This approach is called “schema on write”: the document type and its properties need to be set up before any document is created. To add a new attribute, the repository needs to be updated to include the new property type, and when the attribute is added to the schema, all existing documents of that type pick it up with a blank value. Removing attributes or changing types is more of a problem. In many legacy systems, removing an attribute or moving attributes between parent and child types causes issues that may require re-indexing the entire repository.
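To illustrate (with a hypothetical table, column, and JDBC URL rather than any particular vendor’s schema), adding an attribute in a schema-on-write world means a database change must land before any document can carry the value:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class SchemaOnWriteExample {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection to the repository's underlying relational database
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://ecm-db/repo");
             Statement stmt = conn.createStatement()) {
            // The column must exist before any document can store the attribute;
            // every existing row of this type immediately carries a NULL value.
            stmt.execute("ALTER TABLE invoice_type ADD COLUMN po_number VARCHAR(64)");
        }
    }
}
```

Dropping that column later is where the pain typically starts, since database indexes, child types, and the full-text index may all depend on it.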
Hadoop ECM – Schema on Read
Unlike traditional databases that require a very structured layout for metadata (schema on write), Hadoop allows for “schema on read”, which lets optional, ad-hoc attributes be tagged on content as it is ingested into Hadoop. “Schema on read” puts the focus on the retrieving application, rather than the storing application, to process the schema. This flexibility gives applications storing content and metadata in the repository the ability to add a new piece of metadata to a document type they are importing without updating any reading applications. The program accessing the metadata just needs to be intelligent enough to (see the sketch after this list):
- look for any attributes and process based on attributes that are found.
- understand that some attributes may or may not be present.
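Here is a minimal sketch of that defensive pattern using HBase’s Java client; the “documents” table, “meta” column family, and “po_number” qualifier are assumptions for illustration:

```java
import java.io.IOException;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaOnReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("documents"))) {
            Result result = table.get(new Get(Bytes.toBytes("doc-123")));

            // Look for whatever attributes happen to be present on this document
            NavigableMap<byte[], byte[]> meta = result.getFamilyMap(Bytes.toBytes("meta"));
            if (meta != null) {
                for (Map.Entry<byte[], byte[]> attr : meta.entrySet()) {
                    System.out.println(Bytes.toString(attr.getKey()) + " = "
                            + Bytes.toString(attr.getValue()));
                }
            }

            // Treat any individual attribute as optional
            byte[] poNumber = result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("po_number"));
            if (poNumber != null) {
                System.out.println("PO number: " + Bytes.toString(poNumber));
            }
        }
    }
}
```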
Hadoop allows clients to leverage an unstructured “Data Lake” model: simply store all of the content in Hadoop and full-text index it for searching.
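As a rough sketch (assuming SolrJ, with a hypothetical core name and field names), indexing a document’s extracted text along with whatever metadata happens to accompany it might look like:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexDocumentExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-123");
            doc.addField("content_text", "Full text extracted from the document...");
            // Optional, ad-hoc metadata; documents without these fields index just as well
            doc.addField("doc_type", "invoice");
            doc.addField("po_number", "PO-889");
            solr.add(doc);
            solr.commit();
        }
    }
}
```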
By optionally adding some metadata to a document when it is uploaded/ingested into Hadoop, it is much easier to find content later. Since HBase (the Hadoop database) is a schemaless data store, Hadoop isn’t bound by some of the same issues that plague legacy ECM data models reliant on a relational DB for storage of metadata. Shortcomings of the relational database schema have always included (contrast with the HBase sketch after this list):
- Removal of existing attributes from a model was not possible without a scripting/migration effort
- Content model changes required a restart of the entire repository and/or restarts of individual applications
- Content model changes required a full reindex of the entire repository
- No GUI for changing object models – required editing of XML or proprietary config files
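By contrast, here is a minimal sketch (table, family, and qualifier names again hypothetical) of tagging a brand-new attribute onto a single document in HBase, with no DDL, no restart, and no repository-wide reindex:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AdHocAttributeExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("documents"))) {
            Put put = new Put(Bytes.toBytes("doc-123"));
            // "approval_status" was never declared anywhere; HBase column qualifiers
            // are created per row at write time, so this write is the whole change.
            put.addColumn(Bytes.toBytes("meta"),
                    Bytes.toBytes("approval_status"),
                    Bytes.toBytes("approved"));
            table.put(put);
        }
    }
}
```

Removing the attribute is just as local: a Delete against the same family and qualifier touches only that row.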
Lastly, Hadoop provides automatic clustering for high availability, something that is often difficult (and expensive) with the relational database components underneath legacy ECM vendor repositories.
TSG OpenContent – Legacy Support and Hadoop
In building our OpenContent web service layer, our design focused on providing a neutral method for an application to retrieve content from an ECM repository. By leveraging HBase in our OpenContent layer, we can search the Hadoop metadata on a document directly as well as against a Solr index of all metadata. Our High Performance Interface has always used a combination of full-text searching and metadata searching to give users the flexibility of running more targeted searches against a specific set of metadata.
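As a rough sketch of that combined search style (again assuming SolrJ; the core and field names are ours for illustration, not OpenContent’s actual API), a full-text query can be narrowed with metadata filter queries:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CombinedSearchExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {
            // Full-text clause plus metadata filters for a more targeted search
            SolrQuery query = new SolrQuery("content_text:\"purchase order\"");
            query.addFilterQuery("doc_type:invoice");
            query.addFilterQuery("po_number:PO-889");
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}
```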
Below is a screencam of our HPI Admin interface for creating a data model for objects stored in Hadoop. Exposing this lightweight, flexible schema in front of your Hadoop document repository allows us to capture metadata on documents without being tied down by some of the old rules that relational-database-driven ECM implementations were stuck with.
Let us know your thoughts on storing your structured content in Hadoop in the comments below.