Playing with Prometheus…

Short story of playing with Prometheus metrics and functions to measure, observe, then optimize.

Alok Kumar Singh
May 14, 2021
All loads running
Load of the product being optimized

Timeline Metric

We wanted to visualize and optimize the number of load tasks running against Redshift. The Redshift UI gives a similar view for query runtime.

The Prometheus Gauge metric made this possible. Sweet and simple!

rsk_loader_running{product="communicator", topic="sms"} 1
rsk_loader_running{product="book", topic="appointment"} 0

Whenever a task runs, we set the Gauge to 1, and we set it back to 0 when the task finishes. Our scrape interval for this metric is always shorter than the time it takes to complete one task, so every run is seen by at least one scrape and no load is missed.

We can easily visualize the timeline in Grafana using the queries below:
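The set-to-1, reset-to-0 pattern can be sketched roughly as follows. This is only an illustration in standard-library Python: the `Gauge` class is a hypothetical stand-in for a real Prometheus client gauge, and the `running` helper is an assumed name, not the loader's actual code.

```python
from contextlib import contextmanager


class Gauge:
    """Hypothetical stand-in for a Prometheus client gauge (not a real library class)."""

    def __init__(self, name):
        self.name = name
        self.values = {}  # maps a sorted tuple of (label, value) pairs to the gauge value

    def set(self, value, **labels):
        self.values[tuple(sorted(labels.items()))] = value

    def get(self, **labels):
        return self.values.get(tuple(sorted(labels.items())), 0)

    def expose(self):
        """Render the samples in the Prometheus text exposition style."""
        lines = []
        for labels, value in sorted(self.values.items()):
            rendered = ", ".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{self.name}{{{rendered}}} {value}")
        return "\n".join(lines)


RUNNING = Gauge("rsk_loader_running")


@contextmanager
def running(**labels):
    """Mark a load as running for the duration of the with-block."""
    RUNNING.set(1, **labels)
    try:
        yield
    finally:
        # Reset even if the load raises, so a failed run does not look stuck at 1.
        RUNNING.set(0, **labels)


with running(product="communicator", topic="sms"):
    during = RUNNING.get(product="communicator", topic="sms")  # gauge is 1 here
after = RUNNING.get(product="communicator", topic="sms")  # gauge is back to 0
print(RUNNING.expose())
```

Resetting the gauge in a `finally` block is a design choice in this sketch: a load that fails halfway still drops back to 0 instead of showing up as running forever.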

sum by (rsk) (rsk_loader_running > 0)
sum(rsk_loader_running > 0)

Duration Metrics

Load Speed (bytes/second)

Notice the vertical lines: they mark the times of the loads, and the graph simply connects the points!

Next, we wanted to measure the speed: bytes loaded per second to Redshift.

For this, we used the Histogram metric.

(rate(rsk_loader_bytes_loaded_sum[5m]) / rate(rsk_loader_bytes_loaded_count[5m]))
/
(rate(rsk_loader_seconds_sum[5m]) / rate(rsk_loader_seconds_count[5m]))
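The arithmetic behind this query is average bytes per load divided by average seconds per load. Here is a small Python sketch of it with made-up rate values (the function name `load_speed` and all numbers are assumptions for illustration only). It also shows why the two ratios have to be grouped: division in PromQL, as in Python, is evaluated left to right.

```python
def load_speed(bytes_sum_rate, bytes_count_rate, seconds_sum_rate, seconds_count_rate):
    """Average bytes per load divided by average seconds per load."""
    avg_bytes_per_load = bytes_sum_rate / bytes_count_rate
    avg_seconds_per_load = seconds_sum_rate / seconds_count_rate
    return avg_bytes_per_load / avg_seconds_per_load


# Made-up per-second rates, standing in for the four rate(...[5m]) terms.
bytes_sum_rate = 1_000_000.0   # bytes loaded per second of wall-clock time
bytes_count_rate = 0.125       # loads observed per second
seconds_sum_rate = 0.5         # seconds spent loading per second of wall-clock time
seconds_count_rate = 0.125     # loads observed per second

speed = load_speed(bytes_sum_rate, bytes_count_rate, seconds_sum_rate, seconds_count_rate)

# Without grouping, the chain divides by each term in turn and gives a different number.
ungrouped = bytes_sum_rate / bytes_count_rate / seconds_sum_rate / seconds_count_rate

print(speed)      # bytes per second while a load is running
print(ungrouped)  # not a speed: the units no longer work out
```

With these numbers each load moves 8,000,000 bytes in 4 seconds, so the grouped form gives 2,000,000 bytes/second.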

Similarly, we measure the number of messages, data loaded per load, duration per load, messages processed per load, etc.

Counting the number of loads that happened in the last 15 minutes

This shows the loads happening for a task in the last 30 minutes

When a load starts, the metric becomes one, and when it finishes, it becomes zero again. So how do you measure how many loads happened in the last 15 minutes using a Prometheus query?

We can use the Prometheus changes() function to find the count here!

changes(rsk_loader_running{task="task1"}[15m])

This returns the number of times the value of this metric changed within the window. It would be 6 in this case: the value changed 6 times, either from 0 to 1 or from 1 to 0.

But only 3 loads ran, right? Yes!

So we divide it by 2 and take the ceil. That's it, we get our total runs.

ceil(changes(rsk_loader_running{task="task1"}[15m])/2)
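To see why dividing by 2 and taking the ceil works, here is a small Python simulation of the same counting on lists of scraped samples. The `changes` function below only mimics the idea of the PromQL function (counting value changes between consecutive samples), and the sample windows are invented for illustration.

```python
import math


def changes(samples):
    """Count how many times a sample differs from the one before it."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def count_loads(samples):
    """Each load contributes a 0->1 and a 1->0 change, so halve and round up."""
    return math.ceil(changes(samples) / 2)


# Window that contains three complete loads.
full_window = [0, 1, 0, 1, 0, 1, 0]

# Window that opens while the first load is already running, so its 0->1 edge is missing.
cut_window = [1, 0, 1, 0, 1, 0]

print(changes(full_window), count_loads(full_window))
print(changes(cut_window), count_loads(cut_window))
```

In the second window the number of changes is odd because one edge of a load falls outside the window; rounding up still counts that partially visible load.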

Thanks for reading.
