Monitoring Service Check: es_performance_metrics
Overview
Checks the number of documents in the Elasticsearch cluster and provides performance data for InfluxDB/Grafana.
Technical Details
Implementation
Global base check | no |
---|---|
Puppet profiles using this check | profile_elasticsearch |
Check plugins
Name | check_es_performance_metrics |
---|---|
Packages | nagios-plugin-elasticsearch, nagios-plugin-elasticsearch-config |
CheckCommand name | check_es_performance_metrics |
Upstream source link | https://git.vshn.net/vshn/nagios-plugins-elasticsearch |
Documentation link |
List of variables
Icinga2 variable | Configured in | Default value | Description |
---|---|---|---|
es_heap_percentage_warn | | 85 | Warning threshold for actually used heap space in percentage |
es_heap_percentage_crit | | 90 | Critical threshold for actually used heap space in percentage |
es_bulk_queue_warn | | 175 | Warning threshold for bulk import queue items |
es_bulk_queue_crit | | 195 | Critical threshold for bulk import queue items |
es_index_queue_warn | | 175 | Warning threshold for index queue items |
es_index_queue_crit | | 195 | Critical threshold for index queue items |
es_search_queue_warn | | 900 | Warning threshold for search queue items |
es_search_queue_crit | | 975 | Critical threshold for search queue items |
Description for the metrics
Heap Usage
The heap usage is the heap space that is actually in use. It differs greatly from the memory usage reported by the OS, because the heap stays allocated. The graph in Grafana shows a sawtooth pattern because the garbage collector periodically frees used heap. If a node stays above 85% permanently, the heap setting for that node may be too low. You can increase the heap size to about half of the available memory, but not to more than 32 GB. If the heap is already set to 32 GB, consider adding more nodes to the cluster.
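The threshold logic described above can be sketched in a few lines. This is a minimal illustration, not the plugin's actual code: it assumes the input is shaped like an Elasticsearch node stats response (where `jvm.mem.heap_used_percent` is reported per node), that thresholds are compared with "greater than or equal", and the sample data and function names are hypothetical.

```python
# Nagios-style states (assumption: the check maps thresholds to these codes)
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warn, crit):
    """Return a Nagios-style state for a value against warn/crit thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def heap_states(nodes_stats, warn=85, crit=90):
    """Map each node name to the state of its used-heap percentage."""
    return {
        node["name"]: classify(node["jvm"]["mem"]["heap_used_percent"], warn, crit)
        for node in nodes_stats["nodes"].values()
    }


# Hypothetical sample, shaped like a node stats response
sample = {
    "nodes": {
        "a1": {"name": "es-1", "jvm": {"mem": {"heap_used_percent": 62}}},
        "b2": {"name": "es-2", "jvm": {"mem": {"heap_used_percent": 88}}},
        "c3": {"name": "es-3", "jvm": {"mem": {"heap_used_percent": 93}}},
    }
}

print(heap_states(sample))
```

The defaults mirror `es_heap_percentage_warn` and `es_heap_percentage_crit` from the variable list. A single reading above the threshold is expected at the top of the sawtooth; only a node that stays there points to an undersized heap.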
Queries and Fetches
Queries and fetches are equivalent to read operations in a relational database system. Every search request in Elasticsearch has two phases: the query phase and the fetch phase.
The query phase looks for the matching data in the cluster.
The fetch phase is the actual read of the data. The timings for this phase should normally be lower than the query phase. High fetch times can be caused by slow disks.
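To compare the two phases, the cumulative counters can be turned into an average time per operation. The sketch below assumes counters named as in the `indices.search` section of the Elasticsearch node stats (`query_total`, `query_time_in_millis`, `fetch_total`, `fetch_time_in_millis`); the sample values and helper names are made up for illustration, and the check plugin may compute its values differently.

```python
def average_ms(total_ms, count):
    """Average milliseconds per operation; 0.0 when nothing has run yet."""
    return total_ms / count if count else 0.0


def search_latency(search_stats):
    """Average duration of the query and fetch phases."""
    return {
        "query_ms": average_ms(
            search_stats["query_time_in_millis"], search_stats["query_total"]
        ),
        "fetch_ms": average_ms(
            search_stats["fetch_time_in_millis"], search_stats["fetch_total"]
        ),
    }


# Hypothetical counters from one node
sample = {
    "query_total": 2000,
    "query_time_in_millis": 9000,
    "fetch_total": 1500,
    "fetch_time_in_millis": 1200,
}

latency = search_latency(sample)
print(latency)

# Fetch is normally the cheaper phase; the opposite can hint at slow disks.
if latency["fetch_ms"] > latency["query_ms"]:
    print("fetch phase slower than query phase: check disk performance")
```

These are lifetime averages since the node started. For a graph over time, the difference between two consecutive samples is more telling.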
Indexing and Flushes
Indexing and flushes are equivalent to write operations in a relational database system. As with search queries, the process has two phases:
The indexing phase collects the information in memory; the documents are not yet available for search.
The flush writes the information to disk. As with the fetch phase, high flush times can hint at slow disks.
Garbage Collection
These metrics show how often and how long the garbage collection (GC) runs. They can indicate whether the heap is too large or too small. If the heap is too large, the GC times will be too high. In the worst case this could lead your cluster to mistakenly register your node as having dropped off the grid. If the GC runs very frequently, the heap size may be too low.
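Frequency and duration can both be derived from two consecutive samples of the cumulative GC counters. The following sketch assumes fields named like those of a collector entry in the Elasticsearch node stats (`collection_count`, `collection_time_in_millis`); the function name and sample values are hypothetical.

```python
def gc_summary(previous, current, interval_seconds):
    """Compare two cumulative GC samples taken interval_seconds apart."""
    runs = current["collection_count"] - previous["collection_count"]
    spent_ms = (
        current["collection_time_in_millis"]
        - previous["collection_time_in_millis"]
    )
    return {
        # How often the collector ran, normalised to one minute
        "runs_per_minute": runs * 60 / interval_seconds,
        # How long a single run took on average
        "avg_pause_ms": spent_ms / runs if runs else 0.0,
    }


# Hypothetical samples taken 60 seconds apart
previous = {"collection_count": 100, "collection_time_in_millis": 5000}
current = {"collection_count": 130, "collection_time_in_millis": 6500}

summary = gc_summary(previous, current, interval_seconds=60)
print(summary)
```

A high `runs_per_minute` points towards a heap that is too small, while a high `avg_pause_ms` points towards a heap that is too large. What counts as "high" depends on the cluster and is not defined here.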
Queues and Rejects
Bulk, search and indexing operations each have their own queue and rejection count. If a queue is full and a request can't be added, the request is rejected. These metrics help to correctly size the queues and the thread pools.
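As an illustration of how queue lengths and rejections relate to the thresholds in the variable list, the sketch below evaluates two samples of thread pool statistics. It assumes pools named `bulk`, `index` and `search` with `queue` and `rejected` fields, as exposed by older Elasticsearch versions; pool names differ between versions, and the sample data and function name are hypothetical.

```python
# Default (warn, crit) thresholds taken from the variable list above
THRESHOLDS = {
    "bulk": (175, 195),
    "index": (175, 195),
    "search": (900, 975),
}


def queue_report(previous, current, thresholds=THRESHOLDS):
    """Return (state, new rejections) per thread pool."""
    report = {}
    for pool, (warn, crit) in thresholds.items():
        queued = current[pool]["queue"]
        if queued >= crit:
            state = "CRITICAL"
        elif queued >= warn:
            state = "WARNING"
        else:
            state = "OK"
        # "rejected" is cumulative; clamp at 0 in case the counter was reset
        rejected = max(current[pool]["rejected"] - previous[pool]["rejected"], 0)
        report[pool] = (state, rejected)
    return report


# Hypothetical samples from two consecutive check runs
previous = {
    "bulk": {"queue": 3, "rejected": 0},
    "index": {"queue": 0, "rejected": 0},
    "search": {"queue": 40, "rejected": 9},
}
current = {
    "bulk": {"queue": 180, "rejected": 0},
    "index": {"queue": 2, "rejected": 0},
    "search": {"queue": 980, "rejected": 14},
}

report = queue_report(previous, current)
print(report)
```

A queue that repeatedly reaches the warning threshold without rejections suggests the queue size is adequate but the node is busy; rising rejections mean requests are actually being dropped.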
Troubleshooting/Known issues
CRITICAL: Heap usage error
The metrics in the monitoring report an error.
Solution
The metrics start working once a client connects to Elasticsearch. While the service is still being set up, the check reports CRITICAL.