Welcome to the VSHN Knowledge Base
Monitoring Service Check: es_http_check
Overview
This check uses the default HTTP check to query the health of the Elasticsearch cluster at http://127.0.0.1:9200/_cluster/health?local=true&timeout=10s
Technical Details
Implementation
Global base check | No |
---|---|
Puppet profiles using this check | profile_elasticsearch |
Check plugins
Name | check_http_json |
---|---|
Packages | nagios-plugins-http-json |
CheckCommand name | check_http_json |
Upstream source link | https://git.vshn.net/vshn/nagios-plugins-http-json |
Documentation link | https://git.vshn.net/vshn/nagios-plugins-http-json |
List of variables
Icinga2 variable | Configured in | Default Value | Description |
---|---|---|---|
http_timeout | | floor(1.1 * $_health_timeout) | Sets the timeout for the check |
http_uri | | $_local_health_url | Local URL for the health endpoint |
http_json_element | | 'status' | JSON element to check |
http_json_result_warn | | 'yellow' | Value of the JSON element that triggers a warning |
http_json_result_crit | | 'red' | Value of the JSON element that triggers a critical |
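As a worked example of the http_timeout default, the sketch below evaluates floor(1.1 * $_health_timeout) with integer shell arithmetic. The function name is hypothetical, and the input of 10 seconds is an assumption taken from the timeout=10s parameter in the health URL.

```shell
#!/bin/sh
# Hypothetical helper: floor(1.1 * health_timeout) using integer arithmetic.
http_timeout() {
    health_timeout=$1   # seconds; 10 is assumed from "timeout=10s" in the URL
    # For non-negative integers, floor(1.1 * t) equals (t * 11) / 10.
    echo $(( health_timeout * 11 / 10 ))
}

# http_timeout 10   -> 11
```

The check timeout is thus slightly longer than the timeout passed to Elasticsearch, so the server-side timeout should fire first.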
Troubleshooting/Known issues
Checks and Commands if Cerebro is Unavailable
Is it running?
(data-node-prod-elastic)foo.bar@server707:~$ curl localhost:9200
{
  "name" : "server707",
  "cluster_name" : "cust_prod",
  "cluster_uuid" : "ohchaephah8lahWoap",
  "version" : {
    "number" : "5.6.8",
    "build_hash" : "688ecce",
    "build_date" : "2018-02-16T16:46:30.010Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}
Cluster health
This shows the health of the cluster as well as the number of shards that are in the initializing or relocating state.
(data-node-prod-elastic)foo.bar@server707:~$ curl localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "cust_prod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 14,
  "active_primary_shards" : 255,
  "active_shards" : 844,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
The status field is what Icinga checks. It can have three values:
Green: Everything is okay with the cluster
Yellow: Some shard replicas are missing, but all data is available and the cluster is operational
Red: Some shard primaries and replicas are missing. The indices with the missing shards are no longer available
The field unassigned_shards should show a value != 0 in the yellow and red cases.
To determine which indices are affected, go to "Check indices status".
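The mapping from the status field to a check result can be sketched as follows. This is an illustration only, not the actual check_http_json implementation; the function names are hypothetical, and the exit codes follow the usual Nagios/Icinga convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical helper: pull the "status" value out of a cluster health response
# read from stdin (works on both compact and ?pretty output).
extract_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Hypothetical helper: translate the status into a Nagios/Icinga-style exit code.
status_to_exit_code() {
    case "$1" in
        green)  echo 0 ;;   # OK
        yellow) echo 1 ;;   # WARNING
        red)    echo 2 ;;   # CRITICAL
        *)      echo 3 ;;   # UNKNOWN (empty or unexpected value)
    esac
}

# Usage (assumes the node answers on localhost:9200):
#   status_to_exit_code "$(curl -s localhost:9200/_cluster/health | extract_status)"
```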
List shards and status
curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node
Lock and unlock shard allocation
If you want to reboot a node, please lock the shard allocation beforehand. This prevents the cluster from starting to rebalance shards while the node is down.
Lock:
curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'
After the reboot you can unlock it again:
curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'
It should start to reallocate the shards again. You can check the status of that with the cluster health command.
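Since the lock and unlock commands differ only in one value, they can be wrapped in a small helper. This is a minimal sketch: the function names and the ES_URL variable are hypothetical, and only the two values used above ("none" and "all") are accepted.

```shell
#!/bin/sh
# Hypothetical helper: build the settings body for the given allocation mode.
allocation_payload() {
    case "$1" in
        none|all) ;;
        *) echo "usage: allocation_payload none|all" >&2; return 1 ;;
    esac
    printf '{"transient":{"cluster.routing.allocation.enable":"%s"}}\n' "$1"
}

# Hypothetical helper: send the body to the cluster settings endpoint.
# ES_URL is an assumed variable; the default matches the commands above.
set_allocation() {
    payload=$(allocation_payload "$1") || return 1
    curl -s -X PUT "${ES_URL:-http://localhost:9200}/_cluster/settings" \
        -H 'Content-Type: application/json' \
        -d "$payload"
}
```

With this in place, set_allocation none would be run before the reboot and set_allocation all afterwards.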
Check indices status
If the cluster status isn't green, you can use this to find out which index is causing the problem.
(data-node-prod-elastic)foo.bar@server707:~$ curl localhost:9200/_cat/indices?v
health status index                                              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   agentur-xyz_net-customer-service_label_de_20171214 OijMjGSeRyWIM7QrkukyDw   1   2         48            1    298.2kb         99.4kb
green  open   agentur-smartive_ch-customer-jobs-api_de_20160808  QPbprI0lQFqj-baFxeJTww   1   2       1167          161     25.4mb          7.6mb
green  open   agentur-xyz_net-customer-service_event_de_20171214 ST56g5irTIObmCOMVFoc8A   1   2        111            0    557.9kb             18
                                                                                    shards^   ^replications
A common cause of the yellow state is a cluster with too few nodes for the rep setting. pri is the number of primary shards of the index and rep is the number of replicas of each primary. Because a replica is never allocated on the same node as its primary, rep + 1 must be less than or equal to the number of nodes in the cluster. To change the replicas see "Change index replication". Most of the time this information has to be relayed to the customer/user who created the index, as the value can be set at creation time. The default can be set with profile_elasticsearch::es_number_of_replicas, but this only controls the default; it can be overridden when creating an index.
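On clusters with many indices it can help to filter the _cat/indices output down to the rows that are not green. A minimal sketch, assuming the column layout shown above (health first, index name third); the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<health> <index>" for every index that is not green.
# Expects `_cat/indices?v` output on stdin (header row included).
non_green_indices() {
    # NR > 1 skips the header; $1 is the health column, $3 the index name.
    awk 'NR > 1 && $1 != "green" { print $1, $3 }'
}

# Usage (assumes the node answers on localhost:9200):
#   curl -s 'localhost:9200/_cat/indices?v' | non_green_indices
```

Rows for closed indices may have an empty health column and would need separate handling.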
Replicas Missing on Single Node
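Problem: the check reports "WARN: status matches yellow", but the indices themselves are OK.
In that case, look for an unassigned replica shard and set the number of replicas of the affected index to 0: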
# check broken index
curl localhost:9200/_cat/shards?v
index                                                         shard prirep state      docs store ip           node
.ds-.logs-deprecation.elasticsearch-default-2022.07.12-000003 0     p      STARTED               172.16.11.11 elastic1
.ds-.logs-deprecation.elasticsearch-default-2022.07.12-000003 0     r      UNASSIGNED
# replica set to 0
curl -XPUT "http://localhost:9200/.ds-.logs-deprecation.elasticsearch-default-2022.07.12-000003/_settings" -H 'Content-Type: application/json' -d '{"index":{"number_of_replicas": 0 }}'
Query node from other node
This is useful to find out whether a node is in an unresponsive state. The command can take a few seconds to finish. If it takes longer than 60 seconds, the node is probably hanging.
(data-node-prod-elastic)foo.bar@server707:~$ curl -XGET 'localhost:9200/_nodes/server688/stats?pretty' #replace server688 with the node you'd like to query
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "cust_prod",
  "nodes" : {
    "hssrnZpdTdGWXzsrAja3fA" : {
      "timestamp" : 1523875403041,
      "name" : "server688",
      <snip>
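The 60-second rule of thumb can be enforced with curl's own timeout instead of watching the clock. A minimal sketch with hypothetical function names; it relies on curl exiting with code 28 when --max-time is exceeded.

```shell
#!/bin/sh
# Hypothetical helper: interpret the exit code of the curl probe.
classify_probe() {
    case "$1" in
        0)  echo "responsive" ;;
        28) echo "probably hanging (timed out)" ;;   # curl: operation timed out
        *)  echo "unreachable or failed (curl exit $1)" ;;
    esac
}

# Hypothetical helper: query the stats of one node and give up after 60 seconds.
probe_node() {
    curl -s -o /dev/null --max-time 60 "localhost:9200/_nodes/$1/stats"
    classify_probe $?
}

# Usage: probe_node server688
```

Note that exit code 0 only means the request completed in time. The response body still has to be inspected to see whether the node actually reported statistics.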
Get shard allocation states
This command explains the allocation state and also prints some suggestions on how to proceed. The example below shows the output of a cluster that doesn't have any issues. This can be confusing because the response is an error: it only means that there are no unassigned shards to explain.
curl localhost:9200/_cluster/allocation/explain?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[server686][164.14.194.5:9300][cluster:monitor/allocation/explain]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}
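To avoid misreading this error in scripts, the response can be checked for the "nothing to explain" message. A minimal sketch with a hypothetical function name; the matched string is taken from the sample output above, and other Elasticsearch versions may word it differently.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the allocation explain response on
# stdin is the harmless "no unassigned shards" error shown above.
explain_is_clean() {
    grep -q 'unable to find any unassigned shards to explain'
}

# Usage (assumes the node answers on localhost:9200):
#   curl -s 'localhost:9200/_cluster/allocation/explain' | explain_is_clean \
#       && echo "no unassigned shards"
```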
Retry shard allocation
If there are shards stuck in the UNASSIGNED state, this is the first thing to try. The command retries the allocation of all failed shards and distributes them automatically.
curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed&pretty'
If this doesn't help you can try to move them manually.
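To see whether the retry helped, the shard listing from "List shards and status" can be filtered for shards that are still unassigned. A minimal sketch, assuming the column order index,shard,prirep,state,unassigned.reason,node used by that command; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<index> <shard> <prirep> <reason>" for every
# unassigned shard. Expects _cat/shards output without a header on stdin.
unassigned_shards() {
    # $4 is the state column; for unassigned shards $5 holds unassigned.reason.
    awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}

# Usage (assumes the node answers on localhost:9200):
#   curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node' | unassigned_shards
```

An empty result means no shard is left in the UNASSIGNED state.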
Manually allocate shards
With this command you can force the allocation of a shard. Elasticsearch tries 5 times to allocate a shard. After 5 failed attempts it stops retrying. This command forces the allocation of the shard to the specified node.
# Adjust the index name, the shard number and the node name before running.
# Do not put comments inside the request body: JSON does not allow them.
curl -XPOST 'localhost:9200/_cluster/reroute?pretty&retry_failed=true' -H 'Content-Type: application/json' -d '{
  "commands": [{
    "allocate_replica": {
      "index": "agentur-xyz_net-customer-mapi_product_it_20180305",
      "shard": 13,
      "node": "server687"
    }
  }]
}'
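Because index, shard and node change with every use, the request body can be generated from arguments rather than edited by hand. A minimal sketch with a hypothetical function name; it does not escape special characters in the arguments.

```shell
#!/bin/sh
# Hypothetical helper: build the reroute body for one allocate_replica command.
# Arguments: index name, shard number, target node name.
allocate_replica_body() {
    printf '{"commands":[{"allocate_replica":{"index":"%s","shard":%d,"node":"%s"}}]}\n' \
        "$1" "$2" "$3"
}

# Usage (assumes the node answers on localhost:9200):
#   curl -XPOST 'localhost:9200/_cluster/reroute?pretty&retry_failed=true' \
#       -H 'Content-Type: application/json' \
#       -d "$(allocate_replica_body my-index 13 server687)"
```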
Delete a shard
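With this command you can get rid of a stuck shard. Be aware that it deletes the whole index INDEXNAME with all of its shards and data, not just a single shard, so use it only if the data is expendable or can be restored.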
curl -XDELETE "http://localhost:9200/INDEXNAME"
Change index replication
The number of replicas has to satisfy N >= replicas + 1, where N is the number of nodes. If this condition isn't satisfied, the cluster will stay in the yellow state.
curl -X PUT "localhost:9200/$INDEXNAME/_settings" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "number_of_replicas" : 0
  }
}'
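The rule relating nodes and replicas can be checked before changing the setting. A minimal sketch with hypothetical function names; N is passed in by hand here, for example taken from number_of_data_nodes in the cluster health output.

```shell
#!/bin/sh
# Hypothetical helper: highest replica count that still satisfies N >= replicas + 1.
max_replicas() {
    echo $(( $1 - 1 ))
}

# Hypothetical helper: succeeds (exit 0) if <nodes> can hold <replicas> replicas.
replicas_fit() {
    [ "$1" -ge $(( $2 + 1 )) ]
}

# Example: with 3 nodes, max_replicas 3 prints 2, and replicas_fit 3 2 succeeds.
# On a single node only 0 replicas fit, which is why the command above uses 0.
```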
Disable write lock
If an Elasticsearch node hits 95% disk usage (the flood-stage disk watermark), a safeguard is triggered that puts indices into read-only mode. Free up disk space first, then disable the block again:
curl -X PUT "http://localhost:9200/_all/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": false
}'
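Before removing the block it is worth confirming that disk usage is back below the threshold, otherwise the block may be applied again. A minimal sketch with hypothetical function names; the 95% threshold is the value mentioned above, and the data path in the usage comment is an assumption that may differ per installation.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P <path>` output on stdin and print the used
# percentage of the first file system as a plain number.
disk_used_percent() {
    # Row 2 is the data row; field 5 is the capacity, e.g. "96%".
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeeds (exit 0) if the usage is at or above 95 percent.
at_flood_stage() {
    [ "$1" -ge 95 ]
}

# Usage (the data path is an assumption):
#   used=$(df -P /var/lib/elasticsearch | disk_used_percent)
#   at_flood_stage "$used" && echo "still above the watermark: $used%"
```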
Snapshot / Recovery Commands
Snapshot:
curl -X PUT "localhost:9200/_snapshot/snapshots/$SNAPSHOTNAME?pretty" # change $SNAPSHOTNAME
curl -s -X GET "localhost:9200/_snapshot/_status?pretty" | less
curl -s -X GET "localhost:9200/_snapshot/snapshots/*?pretty" | less
curl -s "localhost:9200/_snapshot/snapshots/*?pretty" | jq -r '.snapshots | .[] | .snapshot + " " + .state'
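The "name state" list produced by the jq command can be narrowed down to the snapshots that need attention. A minimal sketch with a hypothetical function name; it treats every state other than SUCCESS as worth a look.

```shell
#!/bin/sh
# Hypothetical helper: print every "<snapshot> <state>" pair whose state is not
# SUCCESS. Expects one pair per line on stdin, as printed by the jq command.
unfinished_snapshots() {
    awk '$2 != "SUCCESS" { print $1, $2 }'
}

# Usage (assumes the repository is called "snapshots", as in the commands above):
#   curl -s "localhost:9200/_snapshot/snapshots/*?pretty" \
#       | jq -r '.snapshots | .[] | .snapshot + " " + .state' | unfinished_snapshots
```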
Recovery:
# you can use wildcards for the index names
# replace $SNAPSHOTNAME
curl -X POST "localhost:9200/_snapshot/snapshots/$SNAPSHOTNAME/_restore?pretty" -H 'Content-Type: application/json' -d'
{
  "indices": "graylog_*"
}'
curl -s -X GET "localhost:9200/_cat/recovery?pretty"