Configuration Bischeck cache directives | bischeck – dynamic and adaptive monitoring

Yesterday we got the following question on the mailing list that I think many others are also struggling with.

Hi,

I am trying to setup the bischeck plugin for our organization. I have configured most part of it except for the cache retention period. Here is what I want – I want to store every value which has been generated during the past 1 month. The reason being my threshold is currently calculated as the average of the metric value during the past 4 weeks at the same time of the day.

So, how do I define the cache template for this? If I don’t define any cache template, for how many days is the data kept?

Also, how does the aggregate function work and what does the purge maxcount signify?

I’ve gone through the documentation but it wasn’t clear. Looking forward to a response.

Bischeck is one awesome plugin. Keep up the great work. Regards, Rahul.

It’s always great to hear that someone thinks it’s a great product, but now back to the question.

In 1.0.0 we introduced the concept of individual cache purging and aggregations. Even though the two are related from a configuration perspective, they are really two independent features.

Cache purging

Let’s start with cache purging. Collected monitoring data, the metrics, are kept in the cache (Redis from 1.0.0) as linked lists. There is one linked list per service definition, like host1-service1-serviceitem1. Prior to 1.0.0 all the linked lists had the same size, set with the property lastStatusCacheSize. In 1.0.0 we made the size configurable so it can be defined per service definition.

To enable individual cache configurations we added a section called <cache> in the serviceitem section of bischeck.xml. Like many other configuration options in 1.0.0, the cache section can either hold the specific values or point to a template that can be shared.

To manage the size of the cache, or to be more specific the linked list size, we defined the <purge> section. The purge section can have two different configurations. The first defines the max size of the cache linked list.

<cache>
<purge>
<maxcount>1000</maxcount>
</purge>
</cache>

The second option defines the “time to live” for the metrics in the cache. The time to live has nothing to do with Redis ttl.

<cache>
<purge>
<offset>10</offset>
<period>D</period>
</purge>
</cache>
In the above example we set the time to live to 10 days, so any metric older than this period will be removed. The period can have the following values:

H – hours
D – days
W – weeks
Y – years

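The two options are mutually exclusive. You have to choose one for each serviceitem, or use a cache template.

If no cache directive is defined for a serviceitem, the property lastStatusCacheSize will be used. Its default value is 500. Note that this is a number of entries, not a time period, so how many days of data it covers depends on how often the service is run.

Hopefully this explains the cache purging.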
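Applied to the question at the top of this post, a time-based purge is probably the better fit, since the goal is to keep everything from the past month regardless of how often the service runs. A minimal sketch, assuming five weeks gives enough margin for a threshold that looks four weeks back (the value 5 is only an illustration, not a recommendation from the Bischeck documentation):

```xml
<cache>
  <purge>
    <!-- Assumption: 5 weeks, a little more than the 4 weeks the threshold looks back -->
    <offset>5</offset>
    <period>W</period>
  </purge>
</cache>
```

As with the other examples, this section can be placed directly in the serviceitem or shared through a cache template.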
Aggregations
The next question was about aggregations, which have nothing to do with purging but are configured in the same <cache> section. The idea with aggregations is to provide an automatic way to aggregate metrics on the level of an hour, day, week and month. The aggregation functions currently supported are average, max and min.
Let’s say you have a service definition of the format host1-service1-serviceitem1. When you enable an average (avg) aggregation you will automatically get the following new service definitions:

host1-service1/H/avg-serviceitem1
host1-service1/D/avg-serviceitem1
host1-service1/W/avg-serviceitem1
host1-service1/M/avg-serviceitem1

The configuration you need to achieve the above average aggregations is:
<cache>
<aggregate>
<method>avg</method>
</aggregate>
</cache>
If you would like to combine this with the purging described above, your configuration would look like:
<cache>
<aggregate>
<method>avg</method>
</aggregate>
<purge>
 <offset>10</offset>
<period>D</period>
</purge>
</cache>
The new aggregated service definitions, host1-service1/H/avg-serviceitem1 etc., will have their own cache entries and can be used in threshold configurations and virtual services like any other service definition. For example, in a threshold hours section we could define:
<hours hoursID="2"> 
<hourinterval>
<from>09:00</from>
<to>12:00</to>
<threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
</hourinterval>
 ...
This means that the threshold is the average value of host1-service1-serviceitem1 for the last hour, multiplied by 0.8. If that hourly average is 100, for example, the threshold becomes 80.
Aggregations are calculated hourly, daily, weekly and monthly.
By default, weekend metrics are not included in the aggregation calculation. They can be included by setting <useweekend> to true:
<cache>
<aggregate>
<method>avg</method>
<useweekend>true</useweekend>
</aggregate>
</cache>
This will create aggregated service definitions with the following naming standard:

host1-service1/H/avg/weekend-serviceitem1
host1-service1/D/avg/weekend-serviceitem1
host1-service1/W/avg/weekend-serviceitem1
host1-service1/M/avg/weekend-serviceitem1

You can also have multiple entries like:
<cache>
<aggregate>
<method>avg</method>
<useweekend>true</useweekend>
</aggregate>
<aggregate>
<method>max</method>
</aggregate>
</cache>
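As a sketch of how the pieces fit together, the fragment below combines an aggregation with the maxcount style of purging instead of the time-based one. The method value min is assumed by analogy with avg and max, and the names listed in the comment simply follow the naming pattern shown earlier for host1-service1-serviceitem1; neither has been verified against a running Bischeck.

```xml
<cache>
  <aggregate>
    <!-- Assumption: "min" is the method name for the min aggregation -->
    <method>min</method>
  </aggregate>
  <purge>
    <!-- Keep at most 1000 entries in the linked list -->
    <maxcount>1000</maxcount>
  </purge>
</cache>
<!-- Expected aggregated service definitions, following the naming pattern:
     host1-service1/H/min-serviceitem1
     host1-service1/D/min-serviceitem1
     host1-service1/W/min-serviceitem1
     host1-service1/M/min-serviceitem1 -->
```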
So how long will the aggregated values be kept in the cache? By default we save:

Hourly aggregations for 25 hours
Daily aggregations for 7 days
Weekly aggregations for 5 weeks
Monthly aggregations for 1 month

These values can be overridden, but they cannot be lower than the default. Below is an example where we save the aggregations for 168 hours, 60 days and 53 weeks.
<cache>
<aggregate>
<method>avg</method>
<useweekend>true</useweekend>
<retention>
<period>H</period>
<offset>168</offset>
</retention>
<retention>
<period>D</period>
<offset>60</offset>
</retention> 
<retention> 
<period>W</period>
<offset>53</offset>
</retention>
</aggregate>
...
</cache>
I hope this clarifies the configuration of cache and aggregation. What is clear is that we need to improve the documentation in this area.
Looking forward to your feedback.
