Overview

Checks the number of documents in the Elasticsearch cluster and provides performance data for InfluxDB/Grafana.

Technical Details

Implementation

Global base check	no
Puppet profiles using this check	profile_elasticsearch

Check plugins

Name	check_es_performance_metrics
Packages	nagios-plugin-elasticsearch, nagios-plugin-elasticsearch-config
CheckCommand name	check_es_performance_metrics
Upstream source link	https://git.vshn.net/vshn/nagios-plugins-elasticsearch
Documentation link

List of variables

Icinga2 variable	Default value	Description
es_heap_percentage_warn	85	Warning threshold for actually used heap space in percentage
es_heap_percentage_crit	90	Critical threshold for actually used heap space in percentage
es_bulk_queue_warn	175	Warning threshold for bulk import queue items
es_bulk_queue_crit	195	Critical threshold for bulk import queue items
es_index_queue_warn	175	Warning threshold for index queue items
es_index_queue_crit	195	Critical threshold for index queue items
es_search_queue_warn	900	Warning threshold for search queue items
es_search_queue_crit	975	Critical threshold for search queue items
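
The way a warning/critical pair from the table above turns into a check state can be sketched as follows. This is only an illustration, not the plugin's actual code: the "at or above" comparison, the function names and the DEFAULTS dictionary are assumptions. Only the numeric defaults come from the table.

```python
# Standard Nagios/Icinga plugin states.
OK, WARNING, CRITICAL = 0, 1, 2

# Default (warn, crit) pairs taken from the variable table.
# The dictionary keys are hypothetical names chosen for this sketch.
DEFAULTS = {
    "heap_percentage": (85, 90),
    "bulk_queue": (175, 195),
    "index_queue": (175, 195),
    "search_queue": (900, 975),
}


def evaluate(value, warn, crit):
    """Map one measured value to a state (assumes 'at or above' triggers)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def overall_state(measurements, thresholds=DEFAULTS):
    """Return the worst state across all measured metrics."""
    states = (evaluate(v, *thresholds[name]) for name, v in measurements.items())
    return max(states, default=OK)
```

With the defaults, a heap usage of 70 together with a search queue of 980 would yield CRITICAL, because the worst single metric decides the overall state.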

Description for the metrics

Heap Usage

The heap usage is the heap space that is actually in use. It differs greatly from the memory usage reported by the operating system, because the JVM keeps the heap allocated. In Grafana the used heap shows a sawtooth pattern: it grows until the garbage collector cleans it up and then drops again. If a node stays above 85% permanently, the heap setting for the node may be too low. You can increase the heap size to about half of the available memory, but not to more than 32 GB. If the heap is already set to 32 GB, consider adding more nodes to the cluster.
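
The sizing rule and the "permanently over 85%" condition can be expressed as a small sketch. The function names are hypothetical, and treating "permanently" as "even the lowest sample after a garbage collection is above the threshold" is an assumption made for illustration.

```python
def recommended_heap_gb(total_ram_gb, cap_gb=32):
    """About half of the available memory, but never more than the cap."""
    return min(total_ram_gb / 2, cap_gb)


def heap_permanently_high(samples, threshold=85):
    """True if every heap usage sample (in percent) is above the threshold.

    Because of the sawtooth pattern, single peaks above the threshold are
    normal; only the troughs after a garbage collection show whether the
    heap is really too small.
    """
    return bool(samples) and min(samples) > threshold
```

For example, a node with 16 GB of memory would get a heap of 8 GB, while a node with 128 GB would be capped at 32 GB.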

Queries and Fetches

Queries and fetches are equivalent to read operations in a relational database system. Every search query in Elasticsearch has two phases: the query phase and the fetch phase.

The query phase looks up the matching data in the cluster.

The fetch phase is the actual read of the data. The timings for this phase should normally be lower than those of the query phase. High fetch times can be caused by slow disks.
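
To compare the two phases, the average time per operation can be derived from cumulative counters. The sketch below assumes the input looks like the indices.search section of the Elasticsearch node stats API (query_total, query_time_in_millis, fetch_total, fetch_time_in_millis). Whether the check plugin reads exactly these fields is an assumption, and the function names are hypothetical.

```python
def average_ms(total_time_ms, count):
    """Average time per operation; 0.0 if nothing has been counted yet."""
    return total_time_ms / count if count else 0.0


def search_latencies(search_stats):
    """Average query and fetch time in milliseconds from cumulative counters."""
    return {
        "query_ms": average_ms(
            search_stats["query_time_in_millis"], search_stats["query_total"]
        ),
        "fetch_ms": average_ms(
            search_stats["fetch_time_in_millis"], search_stats["fetch_total"]
        ),
    }


# Made-up sample values, for illustration only.
sample = {
    "query_total": 1000,
    "query_time_in_millis": 5000,
    "fetch_total": 800,
    "fetch_time_in_millis": 1600,
}
latencies = search_latencies(sample)
```

In this sample the fetch average (2.0 ms) is lower than the query average (5.0 ms), which is the expected relation. A fetch average above the query average would be a reason to look at disk performance.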

Indexing and Flushes

Indexing and flushes are equivalent to write operations in a relational database system. As with search queries, the process has two phases:

The indexing phase collects the information in memory. At this point the documents are not yet available for search.

The flush phase writes the information to disk. As with the fetch phase, high flush times can hint at slow disks.
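
Since the write counters are cumulative, a recent slowdown is easier to spot when two samples are compared. The following sketch assumes counters shaped like the indices.flush section of the node stats API (total, total_time_in_millis). The helper name and the sample values are hypothetical.

```python
def interval_average_ms(prev, curr, count_key, time_key):
    """Average time per operation between two cumulative samples."""
    operations = curr[count_key] - prev[count_key]
    elapsed_ms = curr[time_key] - prev[time_key]
    return elapsed_ms / operations if operations > 0 else 0.0


# Two made-up flush samples taken some time apart.
flush_before = {"total": 10, "total_time_in_millis": 1000}
flush_after = {"total": 14, "total_time_in_millis": 1800}

flush_ms = interval_average_ms(
    flush_before, flush_after, "total", "total_time_in_millis"
)
```

Here four flushes took 800 ms in total, so the average is 200.0 ms per flush for that interval. The same helper works for the indexing counters by passing index_total and index_time_in_millis as keys.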

Garbage Collection

These metrics show how often and for how long the garbage collection (GC) runs. They can indicate whether the heap is too large or too small. If the heap is too large, the GC times will be too high. In the worst case, long GC pauses can cause the cluster to mistakenly treat the node as having dropped out. If the GC runs very frequently, the heap size may be too low.
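
Both signals, frequency and duration, can be derived from two samples of the GC counters. The sketch assumes counters shaped like one entry of jvm.gc.collectors in the node stats API (collection_count, collection_time_in_millis). The function name and the numbers are made up for illustration.

```python
def gc_summary(prev, curr, interval_seconds):
    """GC frequency and average pause between two cumulative samples."""
    runs = curr["collection_count"] - prev["collection_count"]
    time_ms = curr["collection_time_in_millis"] - prev["collection_time_in_millis"]
    return {
        # Many runs per minute hint at a heap that is too small.
        "runs_per_minute": runs * 60 / interval_seconds,
        # Long pauses hint at a heap that is too large.
        "avg_pause_ms": time_ms / runs if runs > 0 else 0.0,
    }


before = {"collection_count": 100, "collection_time_in_millis": 5000}
after = {"collection_count": 112, "collection_time_in_millis": 5600}
summary = gc_summary(before, after, interval_seconds=60)
```

For these samples the result is 12 runs per minute with an average pause of 50 ms. Which values count as "too frequent" or "too long" depends on the cluster and is not defined here.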

Queues and Rejects

Bulk, search and indexing operations each have their own queues and rejection counts. If a queue is full and a request cannot be added, the request is rejected. These metrics help to size the queues and the thread pools correctly.
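
Rejection counts are cumulative, so what matters for sizing is how many requests were rejected since the last sample. A minimal sketch, assuming per-pool data with queue and rejected fields as found in the thread_pool section of the node stats API. The pool names follow this page, and the values are invented.

```python
def new_rejections(prev_pools, curr_pools):
    """Rejections per thread pool since the previous sample."""
    return {
        name: curr_pools[name]["rejected"] - prev_pools[name]["rejected"]
        for name in curr_pools
        if name in prev_pools
    }


prev_pools = {
    "bulk": {"queue": 3, "rejected": 10},
    "search": {"queue": 0, "rejected": 0},
}
curr_pools = {
    "bulk": {"queue": 190, "rejected": 25},
    "search": {"queue": 12, "rejected": 0},
}
rejected = new_rejections(prev_pools, curr_pools)
```

In this example the bulk pool rejected 15 requests while its queue was close to the critical threshold of 195, whereas the search pool rejected none.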

Troubleshooting/Known issues

CRITICAL: Heap usage error

The metrics in the monitoring report an error.

Solution

The metrics start working once a client connects to Elasticsearch. As long as the service is still being set up, the check reports CRITICAL.

Customer Knowledge Base

Monitoring Service Check: es_performance_metrics