Yesterday we got the following question on the mailing list that I think many others are also struggling with.
Hi,
I am trying to setup the bischeck plugin for our organization. I have configured most part of it except for the cache retention period. Here is what I want – I want to store every value which has been generated during the past 1 month. The reason being my threshold is currently calculated as the average of the metric value during the past 4 weeks at the same time of the day.
So, how do I define the cache template for this? If I don’t define any cache template, for how many days is the data kept?
Also, how does the aggregate function work and what does the purge maxcount signify?
I’ve gone through the documentation but it wasn’t clear. Looking forward to a response.
Bischeck is one awesome plugin. Keep up the great work. Regards, Rahul.
It’s always great to hear that someone thinks it’s a great product, but now back to the question.
In 1.0.0 we introduced the concept of individual cache purging and aggregations. Even if the two are related from a configuration perspective, they are really two independent features.
Cache purging
Let’s start with cache purging. Collected monitoring data, the metrics, are kept in the cache (Redis from 1.0.0) as linked lists. There is one linked list per service definition, like host1-service1-serviceitem1. Prior to 1.0.0 all the linked lists had the same size, set with the property lastStatusCacheSize. In 1.0.0 we made the size configurable per service definition.
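To enable individual cache configurations we added a section called <cache>.
To manage the size of the cache, or to be more specific the linked list size, we defined the <purge> element, which supports two options. The first option, <maxcount>, sets the maximum number of metrics kept in the linked list: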
<cache>
<purge>
<maxcount>1000</maxcount>
</purge>
</cache>
The second option defines the “time to live” for the metrics in the cache. This time to live is unrelated to the Redis TTL:
<cache>
<purge>
<offset>10</offset>
<period>D</period>
</purge>
</cache>
In the above example we set the time to live to 10 days, so any metric older than this will be removed. The period can have the following values:
- H – hours
- D – days
- W – weeks
- Y – years
The two options are mutually exclusive. You have to choose one for each serviceitem or use a cache template.
If no cache directive is defined for a serviceitem, the property lastStatusCacheSize will be used. Its default value is 500.
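Coming back to the question above: to keep every metric from roughly the past month, a time-based purge looks like the natural fit. Here is a minimal sketch, assuming that five weeks leaves enough margin for a four-week lookback. The value 5 is only an illustration, not a recommendation taken from the documentation.
<cache>
<purge>
<!-- assumption: 5 weeks, to leave some margin beyond a 4-week lookback -->
<offset>5</offset>
<period>W</period>
</purge>
</cache>
If you prefer maxcount instead, the right number depends on how often the service runs. For example, a service scheduled every 5 minutes produces 288 metrics per day, so four weeks would need about 8064 entries.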
Hopefully this explains the cache purging.
Aggregations
The next question was related to aggregations, which have nothing to do with purging but are configured in the same <cache> section.
Let’s say you have a service definition of the format host1-service1-serviceitem1. When you enable an average (avg) aggregation you will automatically get the following new service definitions:
- host1-service1/H/avg-serviceitem1
- host1-service1/D/avg-serviceitem1
- host1-service1/W/avg-serviceitem1
- host1-service1/M/avg-serviceitem1
The configuration you need to achieve the above average aggregations is:
<cache>
<aggregate>
<method>avg</method>
</aggregate>
</cache>
If you would like to combine this with the purging described above, your configuration would look like:
<cache>
<aggregate>
<method>avg</method>
</aggregate>
<purge>
<offset>10</offset>
<period>D</period>
</purge>
</cache>
The new aggregated service definitions, host1-service1/H/avg-serviceitem1, etc., will have their own cache entries and can be used in threshold configurations and virtual services like any other service definition. For example, in a threshold hours section we could define:
<hours hoursID="2">
<hourinterval>
<from>09:00</from>
<to>12:00</to>
<threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
</hourinterval>
...
This would mean that we use the average value for host1-service1-serviceitem1 for the period of the last hour.
Aggregations are calculated hourly, daily, weekly and monthly.
By default, weekend metrics are not included in the aggregation calculation. They can be included by setting <useweekend> to true:
<cache>
<aggregate>
<method>avg</method>
<useweekend>true</useweekend>
</aggregate>
</cache>
This will create aggregated service definitions with the following naming standard:
- host1-service1/H/avg/weekend-serviceitem1
- host1-service1/D/avg/weekend-serviceitem1
- host1-service1/W/avg/weekend-serviceitem1
- host1-service1/M/avg/weekend-serviceitem1
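These names can be used in thresholds in the same way as the ones without weekends. As a sketch, here is the hour interval from the earlier threshold example, pointing at the daily average that includes weekends. The 0.8 factor is simply carried over for illustration.
<hourinterval>
<from>09:00</from>
<to>12:00</to>
<!-- assumption: same expression format as the hourly example, with the weekend-inclusive daily name -->
<threshold>host1-service1/D/avg/weekend-serviceitem1[0]*0.8</threshold>
</hourinterval>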
You can also have multiple entries like:
<cache>
<aggregate>
<method>avg</method>
<useweekend>true</useweekend>
</aggregate>
<aggregate>
<method>max</method>
</aggregate>
</cache>
So how long will the aggregated values be kept in the cache? By default we save:
- Hour aggregation for 25 hours
- Daily aggregations for 7 days
- Weekly aggregations for 5 weeks
- Monthly aggregations for 1 month
These values can be overridden, but they cannot be lower than the defaults. Below is an example where we save the aggregations for 168 hours, 60 days and 53 weeks.
<cache>
<aggregate>
<method>avg</method>
<useweekend>true</useweekend>
<retention>
<period>H</period>
<offset>168</offset>
</retention>
<retention>
<period>D</period>
<offset>60</offset>
</retention>
<retention>
<period>W</period>
<offset>53</offset>
</retention>
</aggregate>
...
</cache>
I hope this clarifies the configuration of cache and aggregation. What is clear is that we need to improve the documentation in this area.
Looking forward to your feedback.