By cinhtau / 27 March 2018 / operations / est. 7 min read

Shard Allocation in a Elasticsearch Cluster

This post is older than a year. Consider some information might not be accurate anymore.

Used: elasticsearch v5.6.8

Shards are parts of an Apache Lucene Index, the storage unit of Elasticsearch. An index may consists of more than one shard. Elasticsearch distributes the storage to its nodes. In a regular case each shard (as primary) has a replica. Primary and Replica are never stored on the same node. If a node fails, the replica takes over as primary and Elasticsearch tries to allocate a replica shard in the remaining cluster nodes. Cluster Shard Allocation is a pretty decent mechanism to ensure high availability. This post gives some insights and recipes how to deal with cluster shard allocation in a hot-warm architecture.

Index Details

Since this is about an index, use the Index API to check how many primaries and replicas an index has.

GET _cat/indices/metricbeat-6.2.2-2018.03.27?v

The output as plain text with headers.

health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   metricbeat-6.2.2-2018.03.27 H_sYkWaiRmS2H4dJesu9Pg   1   1    2024496            0      5.2gb          2.6gb

Above example shows us one primary and one replica for index metricbeat-6.2.2-2018.03.27.

Cluster Scenario

Cluster Shard Allocation is a pretty cool feature of Elasticsearch. If you have a hot warm architecture you can e.g. prioritize data regarding IO. Hot and warm are just semantic values to describe Elasticsearch nodes regarding IO. For instance hot is super fast (SSD or SAN), warm slower (magnetic drives). My cluster setup regarding data nodes:

5 hot nodes
2 warm nodes
1 volatile node

The volatile node is for insignificant data, like Machine Learning (currently in evaluation ). You can define custom attributes in the elasticsearch.yml like

and use box_type as discriminator for data nodes in a hot-warm architecture.

Analyze Index Settings

If you wanna check if a allocation exists on the index

GET metricbeat-6.2.2-2018.03.27/_settings

The output

{
  "metricbeat-6.2.2-2018.03.27": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "box_type": "volatile,hot"
            }
          }
        },
        "mapping": {
          "total_fields": {
            "limit": "10000"
          }
        },
        "refresh_interval": "30s",
        "number_of_shards": "1",
        "provided_name": "metricbeat-6.2.2-2018.03.27",
        "creation_date": "1522101603804",
        "unassigned": {
          "node_left": {
            "delayed_timeout": "30m"
          }
        },
        "number_of_replicas": "1",
        "uuid": "H_sYkWaiRmS2H4dJesu9Pg",
        "version": {
          "created": "5060899"
        }
      }
    }
  }
}

"box_type": "volatile,hot" on the include part means that the index can reside on a volatile or hot node. The combination is a OR conjunction.

Explain API

Elasticsearch offers a powerful explain API, to check that:

GET /_cluster/allocation/explain
{
  "index": "metricbeat-6.2.2-2018.03.27",
  "shard": 0,
  "primary": true
}

Basically the shortened output tells you about the decisions the cluster (master node) has made. In the node_allocation_decisions you find all the details.

{
  "index": "metricbeat-6.2.2-2018.03.27",
  "shard": 0,
  "primary": true,
  "current_state": "started",
  "current_node": {
    "name": "machine-learning-master",
    "attributes": {
      "box_type": "hot"
    },
    "weight_ranking": 2
  },
  "can_remain_on_current_node": "yes",
  "can_rebalance_cluster": "throttled",
  "can_rebalance_cluster_decisions": [],
  "can_rebalance_to_other_node": "throttled",
  "rebalance_explanation": "rebalancing is throttled",
  "node_allocation_decisions": [
    {
      "node_name": "machine-learning-slave",
      "node_attributes": {
        "box_type": "volatile"
      },
      "node_decision": "throttled",
      "weight_ranking": 1,
      "deciders": [
        {
          "decider": "throttling",
          "decision": "THROTTLE",
          "explanation": "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_name": "hot-node",
      "node_attributes": {
        "box_type": "hot"
      },
      "node_decision": "no",
      "weight_ranking": 6,
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[metricbeat-6.2.2-2018.03.27][0], node[p87j8OP3R4GzTrgau_chsw], [R], s[STARTED], a[id=NeGf1la8TC-czqIMLspnXg]]"
        }
      ]
    },
    {
      "node_name": "warm-node",
      "node_attributes": {
        "box_type": "warm"
      },
      "node_decision": "no",
      "weight_ranking": 7,
      "deciders": [
        {
          "decider": "filter",
          "decision": "NO",
          "explanation": "node does not match index setting [index.routing.allocation.include] filters [box_type:\"volatile OR hot\"]"
        }
      ]
    }
    //..
  ]
}

Change Shard Allocation

If you want to have a different allocation, following template list all available options:

PUT metricbeat-6.2.2-2018.03.27/_settings
{
  "index": {
    "routing": {
      "allocation": {
        "include": {
          "box_type": "hot"
        },
        "exclude": {
          "box_type": "warm"
        },
        "require": {
          "box_type": ""
        }
      }
    }
  }
}

The use case for a custom routing is for instance a performance reason. For instance: You have to investigate 2 days/indices in the past, residing on the warm nodes, you can allocate them to the hot node during the investigation and let Elasticsearch Curator (the housekeeping tool) move them automatically back after the investigation.

Include

Above example would allocate the index on hot nodes only. If you add multiple values like volatile,hot, Elasticsearch will also use the volatile node.

Exclude

In this example the warm node is excluded. To unset it use "". As tested in v5.6.8 null does not work.

Require

If you choose require instead of include, multiple values are AND joined. If you use

PUT metricbeat-6.2.2-2018.03.27/_settings
{
  "index.routing.allocation.require.box_type": "volatile,hot"
}

a node must have both attributes. In my above scenario, that won’t work. To unset it:

PUT metricbeat-6.2.2-2018.03.27/_settings
{
  "index.routing.allocation.require.box_type": ""
}

Cluster Reroute

If you want to intervene in the master node decision, the reroute command allows to explicitly execute a cluster reroute allocation command including specific commands. A simple replica allocation to a specific data node. Elasticsearch allows to use the node names. Node ids are unique, but are hard to remember .

POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "metricbeat-6.2.2-2018.03.27",
        "shard": 0,
        "node": "hot-node-3"
      }
    }
  ]
}

Summary

Elasticsearch does a wonderful job keeping your indices distributed among data nodes.
If you need to check the Cluster Routing examine the index settings.
Specific Cluster Filtering with include, exclude and require allows a fine grained control on the distribution among data nodes.
The Cluster Allocation Explain gives you detailed information about the decisions.
You can adjust or intervene in the allocation with the Cluster Reroute commands.

A template for cluster filtering

elasticsearch