Overview

This check uses the default HTTP check to query the health of the Elasticsearch cluster at http://127.0.0.1:9200/_cluster/health?local=true&timeout=10s

Technical Details

Implementation

Global base check	No
Puppet profiles using this check	profile_elasticsearch

Check plugins

Name	check_http_json
Packages	nagios-plugins-http-json
CheckCommand name	check_http_json
Upstream source link	https://git.vshn.net/vshn/nagios-plugins-http-json
Documentation link	https://git.vshn.net/vshn/nagios-plugins-http-json

List of variables

Icinga2 variable	Default Value	Description
http_timeout	floor(1.1 * $_health_timeout)	Sets the timeout for the check
http_uri	$_local_health_url	Local URL for the health endpoint
http_json_element	'status'	JSON element to check
http_json_result_warn	'yellow'	Value of the JSON element that triggers a warning
http_json_result_crit	'red'	Value of the JSON element that triggers a critical
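
The warn and crit values above map the JSON status onto plugin exit codes. A minimal shell sketch of that mapping, assuming the usual Nagios/Icinga exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). es_status_exit_code is a hypothetical helper for illustration and not part of check_http_json:

```shell
# Hypothetical helper (not part of check_http_json): translate the value of
# the "status" element into the exit code a Nagios/Icinga plugin would use.
es_status_exit_code() {
  case "$1" in
    green)  echo 0 ;;  # OK
    yellow) echo 1 ;;  # WARNING, matches http_json_result_warn
    red)    echo 2 ;;  # CRITICAL, matches http_json_result_crit
    *)      echo 3 ;;  # UNKNOWN, e.g. empty or unexpected value
  esac
}
```

The helper could be fed with the status field taken from the health endpoint, for example extracted with jq -r .status if jq is installed.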

Troubleshooting/Known issues

Checks and Commands if Cerebro is Unavailable

Is it running?

(data-node-prod-elastic)foo.bar@server707:~$ curl localhost:9200
{
  "name" : "server707",
  "cluster_name" : "cust_prod",
  "cluster_uuid" : "ohchaephah8lahWoap",
  "version" : {
    "number" : "5.6.8",
    "build_hash" : "688ecce",
    "build_date" : "2018-02-16T16:46:30.010Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

Cluster health

This shows the health of the cluster as well as the number of shards that are initializing or relocating.

(data-node-prod-elastic)foo.bar@server707:~$ curl localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "cust_prod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 14,
  "active_primary_shards" : 255,
  "active_shards" : 844,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

The status field is what Icinga checks. It can have three values:

Green: Everything okay with the cluster

Yellow: Some shard replicas are missing, but all data available and the cluster is operational

Red: Some shard primaries and replicas are missing. The indices with the missing shards are not available anymore

The field unassigned_shards should show a value != 0 in the yellow and red cases.

To determine which indices are affected, see "Check indices status".

List shards and status

curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node

Lock and unlock shard allocation

Before rebooting a node, lock the shard allocation. This prevents the cluster from starting to rebalance shards while the node is down.

Lock:

curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'

After the reboot you can unlock it again:

curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'

It should start to reallocate the shards again. You can check the status of that with the cluster health command.
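
The lock and unlock calls differ only in the value of cluster.routing.allocation.enable, so they can be wrapped in one helper. This is a sketch under assumptions: allocation_payload and set_allocation are hypothetical names, and the URL is the same local endpoint used above.

```shell
# Hypothetical helper: build the settings payload for a given allocation mode.
# Only the values Elasticsearch accepts for this setting are allowed.
allocation_payload() {
  case "$1" in
    all|primaries|new_primaries|none)
      printf '{"transient":{"cluster.routing.allocation.enable":"%s"}}\n' "$1"
      ;;
    *)
      echo "invalid allocation mode: $1" >&2
      return 1
      ;;
  esac
}

# Hypothetical helper: send the payload to the local node.
set_allocation() {
  payload=$(allocation_payload "$1") || return 1
  curl -s -X PUT "http://localhost:9200/_cluster/settings" \
    -H 'Content-Type: application/json' -d "$payload"
}
```

With these helpers a reboot would be bracketed by set_allocation none before and set_allocation all afterwards.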

Check indices status

If the cluster status isn't green, you can use this to find out which index is causing the problem.

(data-node-prod-elastic)foo.bar@server707:~$ curl localhost:9200/_cat/indices?v
health status index                                                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size                               
green open  agentur-xyz_net-customer-service_label_de_20171214         OijMjGSeRyWIM7QrkukyDw  1 2       48        1  298.2kb   99.4kb
green open  agentur-smartive_ch-customer-jobs-api_de_20160808          QPbprI0lQFqj-baFxeJTww  1 2     1167      161   25.4mb    7.6mb
green open  agentur-xyz_net-customer-service_event_de_20171214         ST56g5irTIObmCOMVFoc8A  1 2      111        0  557.9kb  18
                                                                                         shards^ ^replications

A common cause of the yellow state is an index that asks for more replicas than the cluster can place. In the output above, pri is the number of primary shards of the index and rep is the number of replicas of each primary. Elasticsearch never puts two copies of the same shard on one node, so rep + 1 must be less than or equal to the number of data nodes in the cluster. To change the replicas see "Change index replication".

Most of the time this information has to be relayed to the customer/user who created the index, as the value can be set when the index is created. The default can be set with profile_elasticsearch::es_number_of_replicas, but this only controls the default; it can be overridden when an index is created.

Replicas Missing on Single Node

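Problem: The check reports "WARN: status matches yellow", but the indices themselves are OK. On a single node the replicas can never be assigned, so find the index with unassigned replica shards and set its number of replicas to 0: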

# check broken index
curl localhost:9200/_cat/shards?v
index                                                         shard prirep state       docs  store ip             node
.ds-.logs-deprecation.elasticsearch-default-2022.07.12-000003 0     p      STARTED                 172.16.11.11   elastic1
.ds-.logs-deprecation.elasticsearch-default-2022.07.12-000003 0     r      UNASSIGNED     

# replica set to 0
curl -XPUT "http://localhost:9200/.ds-.logs-deprecation.elasticsearch-default-2022.07.12-000003/_settings" -H 'Content-Type: application/json' -d '{"index":{"number_of_replicas": 0 }}'
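
On clusters with many indices, the shard listing can be filtered instead of read by eye. A minimal sketch, assuming the first four columns are index, shard, prirep and state as in the listing above; unassigned_replicas is a hypothetical helper name:

```shell
# Hypothetical helper: read _cat/shards output on stdin and print each index
# that has at least one unassigned replica shard, once per index.
unassigned_replicas() {
  awk '$3 == "r" && $4 == "UNASSIGNED" { print $1 }' | sort -u
}
```

It is meant to be used at the end of a pipe, for example: curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state' | unassigned_replicas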

Query node from other node

This is useful to find out if a node is in an unresponsive state. The command can take a few seconds to finish; if it takes longer than 60 seconds, the node is probably hanging.

(data-node-prod-elastic)foo.bar@server707:~$ curl -XGET 'localhost:9200/_nodes/server688/stats?pretty' #replace server688 with the node you'd like to query
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "cust_prod",
  "nodes" : {
    "hssrnZpdTdGWXzsrAja3fA" : {
      "timestamp" : 1523875403041,
      "name" : "server688",
<snip>

Get shard allocation states

This command explains the allocation state and also prints some suggestions on how to proceed. The example below shows the output on a cluster that doesn't have any issues. The response is an error (status 400) only because there are no unassigned shards to explain, which can be confusing.

curl localhost:9200/_cluster/allocation/explain?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[server686][164.14.194.5:9300][cluster:monitor/allocation/explain]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}

Retry shard allocation

If there are shards stuck in the UNASSIGNED state, the first thing to try is this. The command retries the allocation of shards whose allocation previously failed and distributes them automatically.

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed&pretty'

If this doesn't help you can try to move them manually.

Manually allocate shards

With this command you can force the allocation of a shard. Elasticsearch tries 5 times to allocate a shard; after 5 failures it is not retried anymore. This command allocates the replica to the specified node.

# Adjust the index name, the shard number and the node name before running.
# Comments must stay outside the JSON body, otherwise the request is rejected.
curl -XPOST 'localhost:9200/_cluster/reroute?pretty&retry_failed=true' -H 'Content-Type: application/json' -d '{
    "commands": [{
        "allocate_replica": {
            "index": "agentur-xyz_net-customer-mapi_product_it_20180305",
            "shard": 13,
            "node": "server687"
        }
    }]
}'

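Delete an index with stuck shards

With this command you can remove an index whose shards are stuck. Be aware that it deletes the whole index including all of its data, not just a single shard.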

curl -XDELETE "http://localhost:9200/INDEXNAME"

Change index replication

The number of replicas has to satisfy N >= replicas + 1, where N is the number of nodes. If this condition isn't met, not all replicas can be assigned and the cluster will stay in the yellow state.

curl -X PUT "localhost:9200/$INDEXNAME/_settings" -H 'Content-Type: application/json' -d'
{
    "index" : {
        "number_of_replicas" : 0
    }
}
'
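
The rule N >= replicas + 1 can be turned around to get the largest replica count a cluster can satisfy. A small sketch, assuming N is taken from number_of_data_nodes in the cluster health output, since only data nodes hold shards; max_replicas is a hypothetical helper name:

```shell
# Hypothetical helper: largest number_of_replicas that can still be fully
# assigned on a cluster with the given number of data nodes.
max_replicas() {
  nodes=$1
  if [ "$nodes" -lt 1 ]; then
    echo 0
  else
    echo $(( nodes - 1 ))
  fi
}
```

For the 14 data nodes shown in the cluster health example this gives 13, and for a single node it gives 0.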

Disable write lock

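If the disk usage on an Elasticsearch node reaches 95% (the default flood-stage watermark), Elasticsearch puts the indices with shards on that node into read-only mode by setting index.blocks.read_only_allow_delete. Once disk space has been freed, disable the block again: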

curl -X PUT "http://localhost:9200/_all/_settings?pretty" -H 'Content-Type: application/json' -d'
{
    "index.blocks.read_only_allow_delete": false
}
'

Snapshot / Recovery Commands

Snapshot:

Create a snapshot

curl -X PUT "localhost:9200/_snapshot/snapshots/$SNAPSHOTNAME?pretty"			# change $SNAPSHOTNAME

Get some status information

curl -s -X GET "localhost:9200/_snapshot/_status?pretty" | less

curl -s -X GET "localhost:9200/_snapshot/snapshots/*?pretty" | less

curl -s "localhost:9200/_snapshot/snapshots/*?pretty" | jq -r '.snapshots | .[] | .snapshot + " " + .state'

Recovery:

Restore a snapshot

# you can use wildcards for the indices name
# replace $SNAPSHOTNAME

curl -X POST "localhost:9200/_snapshot/snapshots/$SNAPSHOTNAME/_restore?pretty" -H 'Content-Type: application/json' -d'
{
  "indices": "graylog_*"
}'

Get Recovery Status

curl -s -X GET "localhost:9200/_cat/recovery?pretty"

Customer Knowledge Base

Monitoring Service Check: es_http_check