Cleaning older documents from Elasticsearch

Elasticsearch is a search server built on Lucene. It stores data as JSON documents and provides scalable search over them. This makes it a handy place to store statistics such as counters and events and to retrieve them later for analysis. But because Elasticsearch keeps every reported document, your Elasticsearch server will eventually run out of disk space, and sooner if the statistics are generated at a high frequency.

The best solution, especially for time-based data such as logs, is to remove older documents that are no longer useful from Elasticsearch storage. The Elasticsearch guides recommend creating multiple indices, one per time frame (e.g. a separate index for each month), and then deleting the older indices completely. In some use cases, however, this multi-index approach makes data retrieval such as aggregations more difficult than with a single index. This post therefore discusses how to clean documents older than a given date from a single index.

This cleaning process is performed in the following three main steps.
  1. Collect the list of IDs of expired documents
  2. Perform a bulk delete of those documents
  3. Optimize the index

1. Collect the list of IDs of expired documents

The first step is to get the list of IDs of the expired documents to be deleted. Note that to use this method, the Elasticsearch records must have a field that represents the time of publication. For this tutorial, let's assume that field is named "publishTime", the Elasticsearch index is "my_index", the record type is "my_record_type" and the expiry time is 180 days. We'll also assume that the host name of the Elasticsearch server is "localhost" and the port is "9300". First of all we'll create a class and assign these values to constants.
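A minimal sketch of such a class might look like the one below. The class name, the constant names and the cutoffMillis helper are illustrative assumptions, and the helper further assumes that publishTime is stored as epoch milliseconds.

```java
import java.time.Duration;
import java.time.Instant;

/** Values assumed throughout this tutorial, gathered in one place (names are hypothetical). */
final class CleanupSettings {
    static final String HOST = "localhost";
    static final int PORT = 9300;
    static final String INDEX = "my_index";
    static final String RECORD_TYPE = "my_record_type";
    static final String TIME_FIELD = "publishTime";
    static final Duration EXPIRE_AFTER = Duration.ofDays(180);

    private CleanupSettings() {}

    /**
     * Returns the cutoff as epoch milliseconds: documents published
     * before this instant count as expired.
     * Assumes publishTime is stored as epoch milliseconds.
     */
    static long cutoffMillis(Instant now) {
        return now.minus(EXPIRE_AFTER).toEpochMilli();
    }
}
```

Calling CleanupSettings.cutoffMillis(Instant.now()) gives the timestamp to compare publishTime against.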


Then we'll add a method that will query the documents from Elasticsearch, filtered by their publishTime.


In this method we have used the Scan & Scroll API to avoid the issues that arise when retrieving a large number of documents. We filter all the documents whose publishTime is older than 180 days and print their IDs to standard output. For performance reasons, we have also invoked the setNoFields() method on the search request so that the response does not contain the source fields, which are unnecessary for our task.
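As a rough, client-independent sketch of the filter described above, the following builds the JSON body of a range query that matches documents whose time field is below a cutoff. The class and method names are hypothetical. It assumes that the time field holds epoch milliseconds and that the field name contains no characters needing JSON escaping. "_source": false is used here as the request-body way of asking for hits without their source, standing in for setNoFields().

```java
/** Builds a search body matching expired documents (illustrative sketch, not a full client). */
final class ExpiredQuery {
    private ExpiredQuery() {}

    /**
     * Returns a JSON search body with a range query on timeField.
     * "lt" selects documents strictly older than the cutoff.
     * Assumes timeField needs no JSON escaping.
     */
    static String body(String timeField, long cutoffMillis) {
        return "{\"_source\":false,"
                + "\"query\":{\"range\":{\"" + timeField + "\":{\"lt\":" + cutoffMillis + "}}}}";
    }
}
```

Each hit returned for such a query still carries its _id, which is all the deletion step needs.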

2. Perform a bulk delete of those documents

As the second step, we'll modify the above method to use the Bulk API to delete the documents identified in the previous step.
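To illustrate what a bulk delete carries, here is a small sketch that builds the newline-delimited body accepted by the REST _bulk endpoint, with one delete action per document ID. The class and method names are hypothetical, and the sketch assumes that the IDs, index and type contain no characters needing JSON escaping.

```java
import java.util.List;

/** Builds a newline-delimited bulk body of delete actions (illustrative sketch). */
final class BulkDeleteBody {
    private BulkDeleteBody() {}

    /**
     * Emits one {"delete": ...} action line per ID.
     * The bulk format requires every line, including the last, to end with '\n'.
     */
    static String build(String index, String type, List<String> ids) {
        StringBuilder sb = new StringBuilder();
        for (String id : ids) {
            sb.append("{\"delete\":{\"_index\":\"").append(index)
              .append("\",\"_type\":\"").append(type)
              .append("\",\"_id\":\"").append(id)
              .append("\"}}\n");
        }
        return sb.toString();
    }
}
```

Sending the IDs in batches, for example one bulk request per scroll page, keeps each request at a manageable size.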


3. Optimize the index

Now we have deleted the expired documents from Elasticsearch. However, this only marks those documents as deleted and omits them from query results. To actually remove them from disk and free up the space, we need to run an index optimization after the deletion. For that we'll add the following index optimization method and call it after the deletion operation.
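As a hedged sketch of the optimization step over REST, the code below builds the URI of the _optimize endpoint with only_expunge_deletes=true and sends it as a POST using only the JDK. The class and method names are hypothetical. The REST port (typically 9200) is an assumption and differs from the transport port 9300 used elsewhere in this tutorial. The endpoint name _optimize applies to the older Elasticsearch releases that this tutorial targets. Later releases renamed it to _forcemerge.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Asks Elasticsearch to expunge deleted documents from an index (illustrative sketch). */
final class OptimizeRequest {
    private OptimizeRequest() {}

    /** Builds the optimize URI; restPort is an assumption, commonly 9200. */
    static URI uri(String host, int restPort, String index) {
        return URI.create("http://" + host + ":" + restPort + "/" + index
                + "/_optimize?only_expunge_deletes=true");
    }

    /** Sends an empty POST to the given URI and returns the HTTP status code. */
    static int post(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

For example, OptimizeRequest.post(OptimizeRequest.uri("localhost", 9200, "my_index")) would be called once the bulk deletion has finished.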


This will delete all the expired documents and free up the disk space. You can view the statistics for a given index using the Indices Stats API.

The complete Java class for this tutorial can be found here.
