Today we reached more than 3,000 shards in our Elasticsearch clusters. Digging a little deeper, that is definitely too many. Since every shard (primary or replica) is a Lucene index, it consumes file handles, memory, and CPU. Each search request touches a copy of every shard in the index, which isn't a problem as long as the shards are spread across several nodes. Contention arises and performance drops when shards compete for the same hardware resources.

If you keep the Logstash daily default, you will run into this situation very soon. I have now switched to a monthly basis: the latest 2-3 months are kept, and older indices are deleted. The goal is to merge all daily indices of a month into one big index. For this, the Elasticsearch Reindex API is very useful.
A simple example
curl -XPOST "http://alpha:9200/_reindex" -d'
{
  "source": {
    "index": "metrics-2016.08.*"
  },
  "dest": {
    "index": "metrics-2016.08"
  }
}'
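To make the wildcard concrete: with the default Logstash-style daily naming, the pattern matches one index per day of the month. A quick sketch (the naming scheme is taken from the example above; the expansion itself is an illustration, not output from Elasticsearch):

```python
import calendar

# One daily index per day of August 2016, matching "metrics-2016.08.*"
year, month = 2016, 8
days_in_month = calendar.monthrange(year, month)[1]
indices = ["metrics-%d.%02d.%02d" % (year, month, day)
           for day in range(1, days_in_month + 1)]

print(len(indices), indices[0], indices[-1])
```

All 31 daily indices end up merged into the single `metrics-2016.08` destination index.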
The source index pattern contains an asterisk, so it matches one index per day. Depending on the amount of data, reindexing can take a while. You can check the current status of the reindexing with the Task API:
GET /_tasks/?pretty&detailed=true&actions=*reindex
{
  "nodes": {
    "VrNH--qmTBOr2bwqzXdHqQ": {
      "name": "delta",
      "transport_address": "10.25.23.47:9300",
      "host": "10.25.23.47",
      "ip": "10.25.23.47:9300",
      "attributes": {
        "master": "false"
      },
      "tasks": {
        "VrNH--qmTBOr2bwqzXdHqQ:754011": {
          "node": "VrNH--qmTBOr2bwqzXdHqQ",
          "id": 754011,
          "type": "transport",
          "action": "indices:data/write/reindex",
          "status": {
            "total": 4411100,
            "updated": 0,
            "created": 3162000,
            "deleted": 0,
            "batches": 3163,
            "version_conflicts": 0,
            "noops": 0,
            "retries": 0,
            "throttled_millis": 0,
            "requests_per_second": "unlimited",
            "throttled_until_millis": 0
          },
          "description": "",
          "start_time_in_millis": 1474309906870,
          "running_time_in_nanos": 440069372981
        }
      }
    }
  }
}
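The status block already contains everything needed for a rough progress estimate. A small sketch with the numbers copied from the response above (the ETA formula is my own back-of-the-envelope calculation, not something the API returns):

```python
# Values copied from the task status above
total = 4411100                   # documents to reindex
created = 3162000                 # documents written so far
running_s = 440069372981 / 1e9    # running_time_in_nanos -> seconds

progress = created / total        # fraction done
rate = created / running_s        # documents per second so far
eta_s = (total - created) / rate  # naive remaining-time estimate

print(f"{progress:.1%} done, ~{rate:.0f} docs/s, ~{eta_s / 60:.1f} min remaining")
```

At this snapshot the task is roughly 72% done, so only a few more minutes to wait.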
Each batch contains 1,000 documents (the default size). After the task has finished, the response looks like this:
{
  "took": 605281,
  "timed_out": false,
  "total": 4411100,
  "updated": 0,
  "created": 4411100,
  "batches": 4412,
  "version_conflicts": 0,
  "noops": 0,
  "retries": 0,
  "throttled_millis": 0,
  "requests_per_second": "unlimited",
  "throttled_until_millis": 0,
  "failures": []
}
The took value is in milliseconds, so the reindexing took approximately ten minutes.
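As a quick sanity check (my own arithmetic, not part of the API response), the final numbers also give a throughput figure:

```python
# Values copied from the final reindex response
took_ms = 605281      # total duration in milliseconds
created = 4411100     # documents written

minutes = took_ms / 1000 / 60
docs_per_sec = created / (took_ms / 1000)

print(f"{minutes:.1f} min, ~{docs_per_sec:.0f} docs/s")
```

Roughly 7,300 documents per second for the whole month, which is a useful baseline when planning the reindex runs for the remaining months.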