The Grafana trust problem


Disclaimer: This describes my personal experience with Grafana products. There are some facts in here, but your experience may be completely different, and I would love to hear your take.


I started my working life in a small software company near my university. They developed and ran websites and provided web services for multiple clients. Everyone had many responsibilities, and they relied heavily on apprentices and newcomers – which could be both bad and good.
It was good for me because I learned a lot.

At some point we needed a monitoring solution, and Zabbix didn’t fit well into the new, declarative world of containers and Docker. I was tasked with finding a replacement. I looked at Loki/Prometheus with Grafana and at Elastic with Kibana. Elastic was a beast: heavy, difficult to operate, resource-hungry and complex. Loki and Prometheus were a perfect fit for the moment.

So I wrote a single docker-compose.yaml with Loki, Prometheus and Grafana. Since they all shared an internal Docker network, we didn’t need any authentication between them. Grafana was only exposed over an SSH tunnel. After a static scrape configuration and the Docker Loki logging plugin, we had our observability stack. For non-Docker logs, we used Promtail. Loki and Prometheus lived on the same machine, so a local volume mount was all we needed. The load was minimal.
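That setup can be sketched in a few lines of Compose. Image tags, service names and ports below are illustrative assumptions, not the exact file I used:

```yaml
# Minimal single-host observability stack (illustrative sketch).
# All three services share the default Compose network, so they can
# reach each other by service name without any authentication.
services:
  loki:
    image: grafana/loki:2.9.0
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - loki-data:/loki            # local volume, no object storage needed

  prometheus:
    image: prom/prometheus:v2.48.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # static scrape config
      - prom-data:/prometheus

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "127.0.0.1:3000:3000"      # bound to localhost, reached via SSH tunnel

volumes:
  loki-data:
  prom-data:
```

Binding Grafana to 127.0.0.1 keeps it off the public interface; the SSH tunnel is the only way in.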

ℹ️ This is when I learned that you shouldn’t turn every log field into a label just to make it easier to select in the Grafana UI. A label like latency with unbounded values creates a separate stream – and separate chunk files – for every value, which will eventually exhaust your disk’s inodes, a consequence of how the Cortex-style storage bin-packs chunks per stream.
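To make the failure mode concrete, here is a hypothetical Promtail pipeline (field names made up for illustration) that turns an unbounded value into a label, and with it every log line into its own stream:

```yaml
# Illustrative Promtail pipeline fragment – do NOT do this.
pipeline_stages:
  - json:
      expressions:
        latency: latency_ms
  # BAD: latency_ms is effectively unique per request, so every unique
  # value becomes its own stream with its own chunk files on disk.
  - labels:
      latency:
  # Better: keep latency in the log body and filter at query time, e.g.
  #   {app="api"} | json | latency_ms > 500
```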

I also learned that Grafana Labs has a cloud offering with a good free tier, so I used it for personal projects too. My experience with it was good.

Time passed and I changed jobs. Now we had Kubernetes.
The Prometheus container was now moving between nodes. At the time, storage that followed the pod around was a problem, and our workload had also grown a lot. We also needed long-term storage (13 months). So I looked around and found Thanos and Mimir.
My previous experiences with Grafana products were good, so I chose Mimir. Loki should behave similarly, since both are based on Cortex. We didn’t really need Prometheus anymore; we were only using Prometheus’s remote_write. Grafana had the solution: with the Grafana Agent, a single binary can send both logs and metrics to a remote location. It seemed like a no-brainer.
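The remote_write part is a small piece of Prometheus configuration. The URL below is a placeholder, not a real endpoint:

```yaml
# prometheus.yml fragment: ship all scraped samples to Mimir.
remote_write:
  - url: https://mimir.example.internal/api/v1/push
    # Mimir is multi-tenant; the tenant ID travels as a request header.
    headers:
      X-Scope-OrgID: my-tenant
```

Everything else – service discovery, scraping, local TSDB – stays in Prometheus; Mimir only receives the samples. The Grafana Agent promised to replace even that local Prometheus.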

Time passed, and Grafana changed the Grafana Agent setup to Grafana Agent Flow mode – some adjustments, but OK, software changes. And man, did Grafana love to change things.

They started building their own observability platform to win some of Datadog’s customers. They built Grafana OnCall, their own incident-notification system. Not only that, they invested heavily in Helm charts and general starter templates – basically two commands to install metric/log shippers and point them at Grafana Cloud. And if you don’t want to, or can’t, use Grafana Cloud, here are the Helm charts for installing Mimir/Loki/Tempo yourself. To make things even easier, let’s put it all into one umbrella chart (it renders 6k lines in its default state). Or use their Grafana Operator to manage a Grafana installation – or at least parts of it.

As many people may have experienced, the cost of software maintenance starts to show with age.
Grafana OnCall has been deprecated, and Grafana Agent and Agent Flow were deprecated within 2–3 years of their creation. Some of the easy-to-use Helm charts are no longer maintained. They also removed Angular from Grafana and switched to React for dashboards, which broke most existing dashboards.

The same day they deprecated Grafana Agent, they announced Grafana Alloy: an all-in-one replacement. It can do logs, metrics, traces (Zipkin and Jaeger) and OTEL. A solution for everything!
The beginning was rough, and Alloy was a little buggy, but it got better with time. The Alloy Operator also entered the game, because why not.

ℹ️ They chose to use their own configuration language for Alloy – something that looks like HCL. I can understand why you wouldn’t want to use YAML, but I’m still not a fan. Not everything needs its own DSL.
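For reference, Alloy’s DSL looks roughly like this – a sketch that scrapes Alloy’s own metrics and forwards them to a remote endpoint. The block labels and the URL are made up for illustration:

```hcl
// Expose Alloy's own metrics as a scrape target.
prometheus.exporter.self "local" { }

// Scrape those targets and hand the samples to the remote_write component.
prometheus.scrape "self" {
  targets    = prometheus.exporter.self.local.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

// Ship everything to Mimir (placeholder URL).
prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.example.internal/api/v1/push"
  }
}
```

Components are wired together by referencing each other’s exports, which is genuinely elegant – it just isn’t YAML, HCL, or anything your existing tooling already speaks.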

Happy ending, right? – Not quite.
All-in-one solutions don’t support everything. While Grafana built its own monitoring empire, the kube-prometheus community grew steadily and organically. With the Prometheus Operator, the ServiceMonitor and PodMonitor CRDs became the de facto standard. So Alloy also supports the monitoring.coreos.com API group CRDs – at least parts of it. It basically works with ServiceMonitor and PodMonitor, but PrometheusRule requires additional configuration, and AlertmanagerConfig, which would need to be implemented in Mimir, is not supported – because Mimir brings its own Alertmanager, at least kind of. There are version differences and minor incompatibilities.
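For context, a ServiceMonitor is a small Kubernetes object; the names and labels below are illustrative:

```yaml
# Prometheus Operator ServiceMonitor (monitoring.coreos.com API group).
# Alloy's prometheus.operator.servicemonitors component can discover
# objects like this and turn them into scrape targets.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    team: backend
spec:
  selector:
    matchLabels:
      app: my-app        # matches the Service's labels
  endpoints:
    - port: metrics      # named port on the Service
      interval: 30s
```

Because application charts ship objects like this by default, whatever scrapes your cluster has to understand them – which is exactly why Alloy’s partial support matters.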

But I got it all working. Now I can finally stop explaining to my boss why we need to rework the monitoring stack every year.

Grafana recently released Mimir 3.0. They reworked the ingestion path for scalability, and now they use a message broker. Yes, Mimir 3.0 requires Apache Kafka to work.
None of these things alone would be a reason to abandon Grafana products. Set aside the fact that they have made it incredibly difficult to find the ingest endpoints for Grafana Cloud because they want to push users toward their new fleet-config management service. But all of this together makes me uneasy about recommending Grafana’s stack.
I just don’t know what will change next.

I want my monitoring to be stable; I want it to be boring, and that’s something Grafana is not offering. It seems like the pace inside Grafana is too fast for many companies, and I know for a fact that this pace is partly driven by career-driven growth. There are some smart people at Grafana, but not every customer is, nor has the capacity to make Grafana their number-one priority. Complexity kills – we’ve seen it.

ℹ️ Don’t get me wrong: Mimir, Loki and Grafana are technically really good software products, and I (mostly) still like them, but the way these products are managed makes me question them.

Sometimes I wonder how things would have looked if I had chosen the ELK stack at my first job. I also wonder whether the OpenShift approach – Thanos on top of kube-prometheus-stack – is the more time-stable solution for long-term storage. I just hope that OTEL settles down, becomes increasingly stable and boring, and lets me pick whatever backend I want – because then my monitoring would simply be done. I just want to support our application and not revisit the monitoring setup every week, because monitoring is a requirement, not a product. At least for most companies.


