Monitoring ArangoDB using collectd

Problem

The ArangoDB web interface shows a nice summary of the current state. I want to see similar numbers in my monitoring system so I can analyze the system usage post mortem or send alarms on failure.

Solution

Collectd is an excellent tool for gathering all kinds of metrics from a system and delivering them to a central monitoring system such as Graphite and/or Nagios.

Ingredients

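For this recipe you need to install the following tools:

  • collectd: the daemon that gathers the metrics
  • kcollectd: a viewer for inspecting the collected data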

Configuring collectd

To gather the values we will use the cURL-JSON plug-in. We will store them with the Round-Robin-Database (RRD) writer, whose files kcollectd can later display.

We assume your collectd comes from your distribution and reads its config from /etc/collectd/collectd.conf. Since this file tends to become pretty unreadable quickly, we use the include mechanism:

<Include "/etc/collectd/collectd.conf.d">
  Filter "*.conf"
</Include>

This way each metric group gets its own compact config file. Such a file consists of three components:

  • loading the plug-in
  • adding metrics to the TypesDB
  • the configuration for the plug-in itself

rrdtool

We will use the Round-Robin-Database as the storage backend for now. It creates its own database files of a fixed size for each time range. Later you may choose more advanced writer plug-ins, which can distribute your metrics over the network or integrate with the above-mentioned Graphite or your already established monitoring.

For the RRD we will mostly stick to the defaults:

# Load the plug-in:
LoadPlugin rrdtool
<Plugin rrdtool>
   DataDir "/var/lib/collectd/rrd"
#  CacheTimeout 120
#  CacheFlush 900
#  WritesPerSecond 30
#  CreateFilesAsync false
#  RandomTimeout 0
#
# The following settings are rather advanced
# and should usually not be touched:
#   StepSize 10
#   HeartBeat 20
#   RRARows 1200
#   RRATimespan 158112000
#   XFF 0.1
</Plugin>

cURL JSON

Collectd comes with a wide range of plug-ins for gathering metrics. Many tools today use JSON as their data format; so does ArangoDB.

Therefore a plug-in that fetches JSON documents via HTTP is the perfect match for querying ArangoDB's administrative statistics interface:

# Load the plug-in:
LoadPlugin curl_json
# We need our own types to generate individual names for our gauges:
TypesDB "/etc/collectd/arangodb_types.db"
<Plugin curl_json>
  # Adjust the URL so collectd can reach your arangod:
  <URL "http://localhost:8529/_db/_system/_admin/statistics">
    # Set your authentication to Aardvark here:
    User "root"
    # Password "bar"
    <Key "http/requestsTotal"> 
       Type "gauge"
    </Key> 
    <Key "http/requestsPatch"> 
       Type "gauge"
    </Key> 
    <Key "http/requestsPut"> 
       Type "gauge"
    </Key> 
    <Key "http/requestsOther"> 
       Type "gauge"
    </Key> 
    <Key "http/requestsAsync"> 
       Type "gauge"
    </Key> 
    <Key "http/requestsPost"> 
       Type "gauge"
    </Key> 
    <Key "http/requestsOptions"> 
       Type "gauge"
    </Key> 
    <Key "http/requestsHead"> 
       Type "gauge"
    </Key> 
    <Key "http/requestsGet"> 
       Type "gauge"
    </Key> 
    <Key "http/requestsDelete"> 
       Type "gauge"
    </Key> 
    
    
    <Key "system/minorPageFaults"> 
       Type "gauge"
    </Key> 
    <Key "system/majorPageFaults"> 
       Type "gauge"
    </Key> 
    <Key "system/userTime"> 
       Type "gauge"
    </Key> 
    <Key "system/systemTime"> 
       Type "gauge"
    </Key> 
    <Key "system/numberOfThreads"> 
       Type "gauge"
    </Key> 
    <Key "system/virtualSize"> 
       Type "gauge"
    </Key> 
    <Key "system/residentSize"> 
       Type "gauge"
    </Key> 
    <Key "system/residentSizePercent"> 
       Type "gauge"
    </Key> 
    
    <Key "server/threads/running"> 
       Type "gauge"
    </Key> 
    <Key "server/threads/queued"> 
       Type "gauge"
    </Key> 
    <Key "server/threads/working"> 
       Type "gauge"
    </Key> 
    <Key "server/threads/blocked"> 
       Type "gauge"
    </Key> 
    <Key "server/uptime"> 
       Type "gauge"
    </Key> 
    <Key "server/physicalMemory"> 
       Type "gauge"
    </Key> 
    
    <Key "server/v8Context/available"> 
       Type "gauge"
    </Key> 
    <Key "server/v8Context/max"> 
       Type "gauge"
    </Key> 
    <Key "server/v8Context/busy"> 
       Type "gauge"
    </Key> 
    <Key "server/v8Context/dirty"> 
       Type "gauge"
    </Key> 
    <Key "server/v8Context/free"> 
       Type "gauge"
    </Key> 
    
    <Key "client/totalTime/count"> 
       Type "client_totalTime_count"
    </Key> 
    <Key "client/totalTime/sum"> 
       Type "client_totalTime_sum"
    </Key> 
    <Key "client/totalTime/counts/0"> 
       Type "client_totalTime_counts0"
    </Key> 
    
    <Key "client/bytesReceived/count"> 
       Type "client_bytesReceived_count"
    </Key> 
    <Key "client/bytesReceived/sum"> 
       Type "client_bytesReceived_sum"
    </Key> 
    <Key "client/bytesReceived/counts/0"> 
       Type "client_bytesReceived_counts0"
    </Key> 
    
    <Key "client/requestTime/count"> 
       Type "client_requestTime_count"
    </Key> 
    <Key "client/requestTime/sum"> 
       Type "client_requestTime_sum"
    </Key> 
    <Key "client/requestTime/counts/0"> 
       Type "client_requestTime_counts0"
    </Key> 
    
    <Key "client/connectionTime/count"> 
       Type "client_connectionTime_count"
    </Key> 
    <Key "client/connectionTime/sum"> 
       Type "client_connectionTime_sum"
    </Key> 
    <Key "client/connectionTime/counts/0"> 
       Type "client_connectionTime_counts0"
    </Key> 
    
    <Key "client/queueTime/count"> 
       Type "client_queueTime_count"
    </Key> 
    <Key "client/queueTime/sum"> 
       Type "client_queueTime_sum"
    </Key> 
    <Key "client/queueTime/counts/0"> 
       Type "client_queueTime_counts0"
    </Key> 
    
    <Key "client/bytesSent/count"> 
       Type "client_bytesSent_count"
    </Key> 
    <Key "client/bytesSent/sum"> 
       Type "client_bytesSent_sum"
    </Key> 
    <Key "client/bytesSent/counts/0"> 
       Type "client_bytesSent_counts0"
    </Key> 
    
    <Key "client/ioTime/count"> 
       Type "client_ioTime_count"
    </Key> 
    <Key "client/ioTime/sum"> 
       Type "client_ioTime_sum"
    </Key> 
    <Key "client/ioTime/counts/0"> 
       Type "client_ioTime_counts0"
    </Key> 
    
    <Key "client/httpConnections"> 
       Type "gauge"
    </Key>
  </URL> 
</Plugin> 
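
Before relying on a long key list like the one above, it can help to check that every configured path actually exists in the statistics document your server returns. The following is a minimal sketch of such a check; it is not part of collectd or ArangoDB. The helper names (configured_keys, has_key, missing_keys, fetch_statistics) and the config file name in the usage comment are assumptions made for illustration, and the lookup only imitates the plain "a/b/0" addressing used in this recipe.

```python
import base64
import json
import re
import urllib.request

# Matches the opening line of a <Key "..."> block in a collectd config.
KEY_PATTERN = re.compile(r'<Key\s+"([^"]+)"\s*>')


def configured_keys(config_text):
    """Return every key path named in <Key "..."> blocks, in file order."""
    return KEY_PATTERN.findall(config_text)


def has_key(document, key_path):
    """Check whether a slash-separated path leads to a value in parsed JSON."""
    node = document
    for part in key_path.split("/"):
        if isinstance(node, list):
            # Array elements are addressed by their numeric index.
            if not part.isdigit() or int(part) >= len(node):
                return False
            node = node[int(part)]
        elif isinstance(node, dict):
            if part not in node:
                return False
            node = node[part]
        else:
            # A scalar was reached although path elements remain.
            return False
    return True


def missing_keys(document, key_paths):
    """Return the configured paths that the document does not contain."""
    return [path for path in key_paths if not has_key(document, path)]


def fetch_statistics(url, user, password):
    """Fetch the statistics document using HTTP basic authentication."""
    request = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Example usage (needs a reachable arangod, so it is left commented out;
# the file name arangodb.conf is hypothetical):
# stats = fetch_statistics(
#     "http://localhost:8529/_db/_system/_admin/statistics", "root", "")
# with open("/etc/collectd/collectd.conf.d/arangodb.conf") as handle:
#     print(missing_keys(stats, configured_keys(handle.read())))
```

An empty result means every configured key was found; any path that is printed is likely misspelled or not provided by your ArangoDB version.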

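The curl_json plug-in takes only the last path element as the name of a metric, so keys such as client/totalTime/count and client/bytesSent/count would end up with the same name. To work around this shortcoming, we give each of them an individual name using our own types.db file in /etc/collectd/arangodb_types.db: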

client_totalTime_count           value:GAUGE:0:9223372036854775807
client_totalTime_sum             value:GAUGE:U:U
client_totalTime_counts0         value:GAUGE:U:U

client_bytesReceived_count       value:GAUGE:0:9223372036854775807
client_bytesReceived_sum         value:GAUGE:U:U
client_bytesReceived_counts0     value:GAUGE:U:U

client_requestTime_count         value:GAUGE:0:9223372036854775807
client_requestTime_sum           value:GAUGE:U:U
client_requestTime_counts0       value:GAUGE:U:U

client_connectionTime_count      value:GAUGE:0:9223372036854775807
client_connectionTime_sum        value:GAUGE:U:U
client_connectionTime_counts0    value:GAUGE:U:U

client_queueTime_count           value:GAUGE:0:9223372036854775807
client_queueTime_sum             value:GAUGE:U:U
client_queueTime_counts0         value:GAUGE:U:U

client_bytesSent_count           value:GAUGE:0:9223372036854775807
client_bytesSent_sum             value:GAUGE:U:U
client_bytesSent_counts0         value:GAUGE:U:U

client_ioTime_count              value:GAUGE:0:9223372036854775807
client_ioTime_sum                value:GAUGE:U:U
client_ioTime_counts0            value:GAUGE:U:U
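
Because the type definitions and the matching Key blocks follow a strict pattern, they can also be generated instead of typed by hand. The script below is a sketch under that assumption; the names CLIENT_METRICS, FIELDS, types_db_lines and key_blocks are made up for this example, and the output is meant to reproduce the entries shown in this recipe rather than to cover every statistic ArangoDB offers.

```python
# The seven client statistics groups used in this recipe.
CLIENT_METRICS = [
    "totalTime", "bytesReceived", "requestTime", "connectionTime",
    "queueTime", "bytesSent", "ioTime",
]

# (JSON path suffix, type name suffix, data source definition)
FIELDS = [
    ("count", "count", "value:GAUGE:0:9223372036854775807"),
    ("sum", "sum", "value:GAUGE:U:U"),
    ("counts/0", "counts0", "value:GAUGE:U:U"),
]


def types_db_lines(metrics=CLIENT_METRICS):
    """Build the lines for the custom types.db file."""
    lines = []
    for metric in metrics:
        for _path, suffix, definition in FIELDS:
            name = f"client_{metric}_{suffix}"
            # Pad the type name so the definitions line up in one column.
            lines.append(f"{name:<33}{definition}")
    return lines


def key_blocks(metrics=CLIENT_METRICS):
    """Build the <Key> blocks that refer to the custom types."""
    blocks = []
    for metric in metrics:
        for path, suffix, _definition in FIELDS:
            blocks.append(
                f'<Key "client/{metric}/{path}">\n'
                f'   Type "client_{metric}_{suffix}"\n'
                f"</Key>"
            )
    return blocks


print("\n".join(types_db_lines()))
print()
print("\n".join(key_blocks()))
```

The first part of the output belongs in the types file, the second part inside the URL block of the curl_json configuration.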

Please note that once a TypesDB option is set, collectd no longer loads its default types file on its own. You therefore probably need to uncomment this line in the main collectd.conf:

# TypesDB "/usr/share/collectd/types.db" "/etc/collectd/my_types.db"

so that the main types definition file is still loaded.

Rolling your own

You may want to monitor your own metrics from ArangoDB. Here is a simple example of a JSON document and how to address its values in the config:

{
 "testArray":[1,2],
 "testArrayInbetween":[{"blarg":3},{"blub":4}],
 "testDirectHit":5,
 "testSubLevelHit":{"oneMoreLevel":6}
}

This config snippet will parse the JSON above:

<Key "testArray/0">
  Type "gauge"
  # Expect: 1
</Key>
<Key "testArray/1">
  Type "gauge"
  # Expect: 2
</Key>
<Key "testArrayInbetween/0/blarg">
  Type "gauge"
  # Expect: 3
</Key>
<Key "testArrayInbetween/1/blub">
  Type "gauge"
  # Expect: 4
</Key>
<Key "testDirectHit">
  Type "gauge"
  # Expect: 5
</Key>
<Key "testSubLevelHit/oneMoreLevel">
  Type "gauge"
  # Expect: 6
</Key>
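
To try out key paths before putting them into the config, the addressing scheme can be imitated in a few lines of Python. This is only a sketch of the idea: resolve_key is a hypothetical helper, not the plug-in's real implementation, and it ignores any further matching features the plug-in may offer. It walks the sample document from above, using numbers to index arrays and names to select object members.

```python
import json


def resolve_key(document, key_path):
    """Follow a slash-separated key path ("a/0/b") through parsed JSON."""
    node = document
    for part in key_path.split("/"):
        if isinstance(node, list):
            node = node[int(part)]  # array elements are addressed by index
        else:
            node = node[part]       # object members are addressed by name
    return node


# The sample document from this recipe.
SAMPLE = json.loads("""
{
 "testArray":[1,2],
 "testArrayInbetween":[{"blarg":3},{"blub":4}],
 "testDirectHit":5,
 "testSubLevelHit":{"oneMoreLevel":6}
}
""")

# The key paths used in the config snippet above.
KEYS = [
    "testArray/0",
    "testArray/1",
    "testArrayInbetween/0/blarg",
    "testArrayInbetween/1/blub",
    "testDirectHit",
    "testSubLevelHit/oneMoreLevel",
]

for key in KEYS:
    print(key, "->", resolve_key(SAMPLE, key))
```

Each printed value should match the "Expect" comment of the corresponding Key block.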

Get it served

Now we will (re)start collectd so it picks up our configuration:

/etc/init.d/collectd restart

We will inspect the syslog to verify that nothing went wrong:

Mar  3 13:59:52 localhost collectd[11276]: Starting statistics collection and monitoring daemon: collectd.
Mar  3 13:59:52 localhost systemd[1]: Started LSB: manage the statistics collection daemon.
Mar  3 13:59:52 localhost collectd[11283]: Initialization complete, entering read-loop.

Collectd adds the hostname to the directory path, so we should now have files like this one:

 -rw-r--r-- 1 root root 154888 Mar  2 16:53 /var/lib/collectd/rrd/localhost/curl_json-default/gauge-numberOfThreads15M.rrd
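
The file path encodes where a value came from: below the data directory, collectd lays files out as host/plugin-plugin_instance/type-type_instance.rrd. The following sketch takes such a path apart again; parse_rrd_path is a hypothetical helper written for this recipe, and it assumes that neither the plug-in name nor the type name contains a hyphen.

```python
from pathlib import PurePosixPath


def parse_rrd_path(path):
    """Split an RRD file path into collectd's identifier components."""
    rrd = PurePosixPath(path)
    # Directory name: "<plugin>-<plugin instance>"
    plugin, _, plugin_instance = rrd.parent.name.partition("-")
    # File name without ".rrd": "<type>-<type instance>"
    type_name, _, type_instance = rrd.stem.partition("-")
    return {
        "host": rrd.parent.parent.name,
        "plugin": plugin,
        "plugin_instance": plugin_instance,
        "type": type_name,
        "type_instance": type_instance,
    }


info = parse_rrd_path(
    "/var/lib/collectd/rrd/localhost/curl_json-default/"
    "gauge-numberOfThreads15M.rrd"
)
for field, value in info.items():
    print(f"{field}: {value}")
```

For the file listed above, this separates the host (localhost), the plug-in (curl_json), the type (gauge) and the metric name taken from the last element of the key path.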

Now we start kcollectd to view the values in the RRD file:

[Image: Kcollectd screenshot]

Since we have only just started recording values, we need to choose ‘last hour’ and zoom in a little more to inspect them.

That completes this dish. More metrics will follow in other recipes.