An important and often overlooked component of implementing a content management system is the lifecycle of content once it has been deleted in the repository. There are a number of things to consider when coming up with a plan for deleting content, including recoverability, performance, and system resource usage. This article outlines the way that Alfresco handles content deletion and will hopefully bring to light some key decision points that are often ignored during the initial implementation of an Alfresco repository.
It’s worth noting that this article was inspired by a client that has a relatively large-scale Alfresco implementation with a repository of over 10 million documents that occupy multiple terabytes of storage space. Over time, the client began to struggle with the scalability and performance of a very large database, as well as concerns about the rapidly growing demands for additional storage. After some investigation, it was discovered that the repository contained an excessive amount of content that had been deleted and was no longer visible from the user interface, but was still consuming valuable database and file system resources.
It probably doesn’t come as a surprise that when content is deleted from the user interface in Alfresco, the metadata and binary content are not immediately deleted from the database and file system. This provides a safety net so that if content is deleted by mistake, it can be recovered. It’s important for repository administrators to understand what happens to this deleted content over time and make adjustments as needed in order to ensure that the system will continue to perform well in the future, while using minimal system resources (memory, hard drive space, etc.). Below is a list of the stages of content deletion in an Alfresco repository:
Stage 1 – Move to Trash
- User deletes content from the user interface (e.g. Alfresco Share).
- Content is moved to the trash. In Alfresco 4.1.x and earlier, content could only be restored from the trash by an admin user. In Alfresco 4.2 and later, users are able to see their own trashcan and can restore any content that they’ve deleted.
- Behind the scenes, content is moved from the main content store to the archive store.
- Metadata is still stored in the database, and content is still stored on the file system in the contentstore directory.
- Content remains in the trash (archive store) until the trash is emptied. By default, there is no process that automatically purges content from the archive store.
Stage 2 – Empty Trash
- Content can be individually removed from the trash, or the trash can be emptied entirely.
- If the trash is not managed properly, it can grow in size very quickly. It’s important to note that for performance reasons, emptying the trash from the user interface will only delete the first 1000 items in the trash.
- When the trash is emptied, the metadata is still stored in the database, but is marked as deleted. Binary content is still stored on the file system in the contentstore directory.
- Content remains in the database and on the file system until jobs run to purge the content.
Stage 3 – Content Store Cleaner Job Runs
- Alfresco has a Content Store Cleaner job that runs daily. The purpose of this job is to move the binary content that has been deleted from the trash (archive store) from the contentstore directory on the file system to the contentstore.deleted directory.
- The Content Store Cleaner job runs daily at 4:00 a.m. by default. This schedule can be adjusted via configuration if desired. The job should run during off-peak hours.
- The Content Store Cleaner job does not move content to the contentstore.deleted directory until 14 days after it was removed from the trash. This provides another safety net in case content was inadvertently deleted. The 14 day window can be adjusted via configuration as well.
- Once content is moved to the contentstore.deleted directory, it remains there indefinitely, because Alfresco never purges this directory on its own. If desired, this directory can be safely purged manually by a system administrator.
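Before purging the contentstore.deleted directory by hand, it can help to see how much space it holds and which files fall outside a chosen retention window. The following is a minimal sketch, not an Alfresco tool. The directory path, the 30-day retention value, and the function names are all assumptions made for illustration. File modification time is used as a rough proxy for age, and whether it reflects the date the file was moved depends on your file system, so verify that on your own server first.

```python
import os
import time


def scan_deleted_store(root, older_than_days, now=None):
    """Return (total_bytes, stale_paths) for files under root.

    stale_paths lists files whose modification time is older than
    older_than_days. Nothing is deleted by this function.
    """
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400
    total_bytes = 0
    stale_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            total_bytes += stat.st_size
            if stat.st_mtime < cutoff:
                stale_paths.append(path)
    return total_bytes, sorted(stale_paths)


def purge(paths, dry_run=True):
    """Delete the given files; with dry_run=True nothing is removed."""
    if not dry_run:
        for path in paths:
            os.remove(path)
    return len(paths)


if __name__ == "__main__":
    # Hypothetical location; adjust to match your installation.
    store = "/opt/alfresco/alf_data/contentstore.deleted"
    if os.path.isdir(store):
        size, stale = scan_deleted_store(store, older_than_days=30)
        print(f"{size / 1e9:.2f} GB total, {len(stale)} files older than 30 days")
        print(f"would remove {purge(stale, dry_run=True)} files")
```

Keeping the scan separate from the deletion, with a dry run as the default, lets an administrator review what would be removed before any disk space is reclaimed.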
Stage 4 – Database Node Cleaner Job Runs
- Alfresco has a Database Node Cleaner job that runs daily. The purpose of this job is to remove all traces of a piece of content from the database once it’s been deleted from the trash.
- The Database Node Cleaner job runs daily at 9:00 p.m. by default. This schedule can be adjusted via configuration if desired. The job should run during off-peak hours.
- The Database Node Cleaner job does not remove an item until 30 days after it was removed from the trash. This 30 day window can be adjusted via configuration as well.
- Once the metadata is removed from the database, the metadata can be considered to be permanently deleted. The only way to recover the metadata would be to restore the database from backup.
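To make the timing of Stages 3 and 4 concrete, here is a small worked example that estimates when each cleaner job could first act on an item, using the default values described above (14 days and 4:00 a.m. for binary content, 30 days and 9:00 p.m. for metadata). It is a simplified model rather than Alfresco code: the function names are hypothetical, and it assumes both windows are measured from the moment the item was removed from the trash.

```python
from datetime import datetime, timedelta


def next_run(after, hour):
    """Return the first daily run at hour:00 that is not before `after`."""
    run = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    if run < after:
        run += timedelta(days=1)
    return run


def deletion_timeline(trash_emptied_at,
                      content_protect_days=14, content_job_hour=4,
                      node_purge_days=30, node_job_hour=21):
    """Estimate the earliest job runs that can act on an item."""
    content_eligible = trash_emptied_at + timedelta(days=content_protect_days)
    node_eligible = trash_emptied_at + timedelta(days=node_purge_days)
    return {
        "binary_moved_to_contentstore_deleted":
            next_run(content_eligible, content_job_hour),
        "metadata_removed_from_database":
            next_run(node_eligible, node_job_hour),
    }


# An item removed from the trash on 1 March 2014 at 10:30 a.m.
timeline = deletion_timeline(datetime(2014, 3, 1, 10, 30))
for event, when in timeline.items():
    print(event, when.isoformat())
# binary_moved_to_contentstore_deleted 2014-03-16T04:00:00
# metadata_removed_from_database 2014-03-31T21:00:00
```

In this example the binary content becomes eligible on 15 March at 10:30 a.m., which is after that day's 4:00 a.m. run, so the move happens on 16 March. With the defaults, an item therefore keeps consuming database and disk resources for weeks after the trash is emptied.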
As you can tell, there is more to deleting content in Alfresco than meets the eye. Most of the time, the default settings and configurations work well, but there are a couple of sticking points that should be considered:
- Content must be manually deleted from the trash in order for the other cleanup processes to kick in. Many clients are not even aware of the trash, since it was only available to admin users in Share until version 4.2 was released. If content is deleted frequently in your repository, it might be a good idea to implement a scheduled job to automatically purge content from the trash after a certain number of days to prevent buildup.
- Content must be manually deleted from the contentstore.deleted directory on the Alfresco server in order to recover the hard drive space that is consumed by deleted content. It is safe to delete the contents of the contentstore.deleted directory, provided there are no business rules that require deleted content to be retained for an extended period of time.
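The core logic of a scheduled trash purge can be sketched as follows. This is only an illustration of the selection and batching steps, not an Alfresco API. The item structure (`id`, `archived_at`), the function names, and the `delete_batch` callable are hypothetical, and the actual listing and deletion would have to go through whatever interface your Alfresco version provides. The batch size of 1000 simply mirrors the limit the user interface applies when emptying the trash.

```python
from datetime import datetime, timedelta


def select_expired(trash_items, now, max_age_days):
    """Return ids of trash items archived more than max_age_days ago."""
    cutoff = now - timedelta(days=max_age_days)
    return [item["id"] for item in trash_items if item["archived_at"] < cutoff]


def batches(ids, size=1000):
    """Yield successive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def purge_trash(trash_items, now, max_age_days, delete_batch, batch_size=1000):
    """Delete expired items batch by batch and return how many were selected.

    delete_batch is a caller-supplied callable (hypothetical) that
    performs the real deletion against the repository.
    """
    expired = select_expired(trash_items, now, max_age_days)
    for chunk in batches(expired, batch_size):
        delete_batch(chunk)
    return len(expired)


# In-memory data standing in for the contents of the trash.
now = datetime(2014, 6, 1)
items = [
    {"id": "a", "archived_at": datetime(2014, 1, 15)},
    {"id": "b", "archived_at": datetime(2014, 5, 25)},
    {"id": "c", "archived_at": datetime(2014, 3, 2)},
]
deleted = []
count = purge_trash(items, now, max_age_days=30, delete_batch=deleted.extend)
print(count, deleted)  # 2 ['a', 'c']
```

Working in bounded batches keeps each deletion small, which matters when the trash has been left to grow unmanaged for a long time.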
Hopefully this article has shed some light on how Alfresco handles deleted content by default, and what adjustments and manual intervention can be applied to modify the default process. Feel free to share your thoughts below.