We had a good conversation yesterday with a long-time and innovative TSG client. The client has a mix of technical and business skills that makes him a visionary in a highly regulated industry when it comes to Enterprise Content Management. In addition to our normal catch-up discussion about plans for the year and what we are seeing other clients do, we also talked about Hadoop and how it could disrupt traditional relational databases (RDBMS). This post presents highlights of that discussion from a business perspective.
Relational Databases – Built for Scarcity and Search
In the 1970s and 80s, hierarchical, network, and finally relational databases emerged. Built for the mainframe systems of the time, when CPU, memory and disk space were expensive, these database systems were designed for maximum efficiency with those limited resources. Extensive effort went into working within the constraints:
- Only relevant data was stored.
- Archived data might be pushed off to tape.
- The Database Administrator (DBA) was a critical role in the care and feeding of the database: adding indexes, normalizing the data, and otherwise making sure that the data maintained in the database could be quickly found and processed.
Disk, Memory and CPU – No Longer Scarce
Since relational databases came of age, Moore’s Law has radically changed the scarcity of computing resources. Some notable items to consider in regard to the function of databases:
- Inexpensive disk space – Disk prices went from $100,000/GB (early 1980s) to $0.20/GB (2013)
- Google – introduced the concept that a farm of inexpensive computers could support efficient searching
- Consumer computing – pushed additional cost savings in memory, disk and CPU
- Open Source Focus – companies like Google and Facebook are building technologies to offer better services for their clients, releasing and enhancing open source software rather than turning to traditional software (or database) vendors
It makes much more sense today to store as much data as you can possibly produce, since value can be squeezed out of every piece of data captured as part of a company’s business process. It is more expensive to throw data away and lose its potential value than to keep it lying around at 20 cents per GB. No longer do we have to be as concerned with normalizing the data and making sure every piece of data fits a well-defined schema when it is produced.
Hadoop versus Relational
This post won’t get into all of the different underlying technologies of Hadoop (MapReduce, NameNode, DataNode) but will instead focus on the use cases for ECM. For ECM customers, let’s examine the traditional story of storing a document with attributes.
- Relational Database – Would store the attributes in the columns/rows of the relational database, with a pointer to the document file’s location on a SAN.
- Hadoop – Would store the attributes in a Hadoop entry with tags/metadata that describe the document, along with the document content itself (as sketched below). There is no need for a SAN, as Hadoop provides its own distributed data store.
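To make the Hadoop side concrete, here is a minimal sketch in Java using the standard Hadoop FileSystem API. The HDFS paths, attribute names and local file name are hypothetical, chosen only for illustration; a real repository would layer its own object model on top of something like this.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HdfsDocumentStore {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would point at the cluster's NameNode in a real deployment
        FileSystem fs = FileSystem.get(conf);

        // Store the document content itself in HDFS -- no SAN required.
        byte[] content = Files.readAllBytes(Paths.get("SOP-1234.pdf"));
        Path docPath = new Path("/ecm/documents/SOP-1234.pdf");
        try (FSDataOutputStream out = fs.create(docPath)) {
            out.write(content);
        }

        // Store the attributes as a self-describing metadata "blob"
        // alongside the content, rather than as rows in normalized tables.
        String metadata = "{ \"title\": \"SOP-1234\", "
                + "\"doc_type\": \"SOP\", \"status\": \"Approved\" }";
        Path metaPath = new Path("/ecm/documents/SOP-1234.json");
        try (FSDataOutputStream out = fs.create(metaPath)) {
            out.write(metadata.getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```

Note that the metadata travels with the content as a self-describing blob; nothing had to be normalized or fitted to a schema before it was stored.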
To illustrate the differences, let’s now compare what happens when searching for a document with a title of “SOP-1234”.
- Relational Database – The database could be queried to find the row where the document title equals “SOP-1234”. Given a well-indexed database, this would be sub-second. Once on that row, it would be fairly easy to retrieve the file content from the SAN and related attributes from other tables.
- Hadoop – Hadoop would have to rely on parallel processing to scan ALL the nodes looking for entries where title = “SOP-1234” (very inefficient; the sketch below contrasts the two approaches). Once identified, all attributes could be quickly retrieved, as well as the document content itself.
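To see why the relational side wins this particular race, consider a sketch of the indexed lookup via plain JDBC. The connection string, table and column names are hypothetical; the point is that an index makes the lookup cheap, while Hadoop has no equivalent index and must scan.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DocumentLookup {

    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and schema, for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/ecm", "ecm_user", "secret");
             // With an index on dm_document(title), this lookup is sub-second
             // no matter how many million rows the table holds.
             PreparedStatement stmt = conn.prepareStatement(
                "SELECT doc_id, san_file_path FROM dm_document WHERE title = ?")) {
            stmt.setString(1, "SOP-1234");
            try (ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {
                    // The row carries a pointer to the content's SAN location.
                    System.out.println("Content at: " + rs.getString("san_file_path"));
                }
            }
        }
        // Hadoop, by contrast, would have to scan every metadata entry on
        // every DataNode to find title = "SOP-1234": a full MapReduce pass.
    }
}
```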
In the example above, it would appear that Hadoop, while a good fit for big data, doesn’t necessarily line up with the existing paradigm for search and retrieval in an ECM system.
Hadoop ECM Search – Search Appliance to the Rescue
To truly understand how Hadoop can disrupt the relational database, particularly in an ECM scenario, let’s adjust the scenario to take advantage of another disruptive technology, the search appliance. For the bulk of ECM vendors, an appliance approach (Solr/Lucene) that provides indexing/searching of the content in the relational database eliminates the need to query the relational database directly.
Let’s review the same search scenario coupled with a search appliance. Searching for “SOP-1234”:
- Relational Database with Search Appliance – Solr/Lucene would be used to perform a search for “SOP-1234” and would return a pointer to the row in the table containing the attributes. Once on that row, it would be fairly easy to retrieve the document content from the SAN. Legacy ECM vendors have moved to this paradigm over the years to escape the performance issues of running complex queries against million-row tables.
- Hadoop with Search Appliance – Solr/Lucene would be used to perform a search for “SOP-1234” and would return a pointer to the entry in Hadoop, making it quick to retrieve the document content from HDFS (see the sketch below).
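As a rough illustration of the Hadoop variant, here is a minimal sketch using the SolrJ client library. The Solr URL, core name and the hdfs_path field are assumptions for this example, not a reference to any particular product’s schema.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ApplianceSearch {

    public static void main(String[] args) throws Exception {
        // Hypothetical Solr core holding the ECM index.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/ecm").build();

        // The appliance answers the query from its own index --
        // no SQL against repository tables, no scan of HDFS.
        SolrQuery query = new SolrQuery("title:\"SOP-1234\"");
        QueryResponse response = solr.query(query);

        for (SolrDocument doc : response.getResults()) {
            // The indexed entry carries a pointer back to the content:
            // a table row for an RDBMS-backed store, or an HDFS path here.
            System.out.println("Content at: " + doc.getFieldValue("hdfs_path"));
        }
        solr.close();
    }
}
```

Either back end plugs in behind the same index; only the pointer stored in the search result changes.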
In the case above, users wouldn’t notice the difference between an ECM tool built on Hadoop and one built on a traditional relational database.
Hadoop ECM – What Else Can It Do?
In addition to being able to do whatever an RDBMS can do for legacy ECM vendors, Hadoop differentiates itself by providing the following:
- Cost – Removal of the RDBMS and the SAN for file storage
- Unstructured – Data can be dumped into Hadoop as it is captured rather than worrying about designing a schema or making it fit in an existing schema
- Unlimited – Can store an unlimited number of data objects. This could include the audit trail (such as auditing every content view) or other big data items that tend to break RDBMS structures (see the sketch after this list).
- Backup – Hadoop includes replication/clustering “out of the box” to remove the need to do a database backup.
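As a small illustration of the unstructured/unlimited points above, here is a rough sketch of appending schema-less audit events to HDFS as JSON lines. The paths and event fields are made up for the example, and HDFS append assumes a cluster configured to support it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class AuditTrailWriter {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical schema-less audit event, one JSON line per content view.
        // No table design or schema migration is needed to start capturing these.
        String event = "{ \"event\": \"content_view\", \"doc\": \"SOP-1234\", "
                + "\"user\": \"jdoe\", \"ts\": " + System.currentTimeMillis() + " }\n";

        Path auditLog = new Path("/ecm/audit/2013-06-01.jsonl");
        // Append keeps one growing file per day; billions of events are fine
        // in HDFS, where the same volume tends to break RDBMS audit tables.
        try (FSDataOutputStream out = fs.exists(auditLog)
                ? fs.append(auditLog) : fs.create(auditLog)) {
            out.write(event.getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```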
Summary
The combination of a search appliance with Hadoop has the capability of disrupting the RDBMS and SAN components of typical ECM systems. Other shifts in the technology landscape tell a similar story:
- Commodity servers disrupting proprietary hardware vendors
- Commodity storage disrupting proprietary SAN vendors
- Linux disrupting UNIX
- Apache and Tomcat disrupting proprietary application servers
- Solr/Lucene disrupting Verity/FAST/Autonomy
Let us know your thoughts in the comments below.
Thank you for all of this information about Hadoop. It looks like a very important new direction for content management, one that I believe will have a marked effect on technical direction going forward. Three things came to mind while I reviewed this article and your others that I feel are important to think about:
1) Retention/Destruction – if the metadata for a document is stored as a blob rather than intertwined in a multitude of database tables, it will be easier to destroy a document without damaging traditional audit trails and tables.
2) Movement of documents – See 1) above – it may be easier to migrate content to other systems if you can move the blob with it rather than having to extract data from tables and merge it with the content.
3) Data backup – traditional backups were full and incremental backups of servers, tied to database backups to keep the data in sync with the documents. It took a marrying of multiple technologies to ensure that the RDBMS layer and the data files were backed up in sync and could be restored together. Hadoop has native, inherent replication capabilities, but it also offers large-scale enterprises a technology footprint that aligns well with the latest enterprise block-level real-time replication tools built directly into SANs.