Observability for tech teams is like the film room for professional athletes. In the film room, players review footage to find opportunities to improve their game.

Tools like Grafana, Prometheus, and Loki collect, store, and visualize metrics and logs for your workloads, and a container engine like Podman gives you a convenient way to run them. Together they contribute significantly to improving observability.

If you are a systems engineer, site reliability engineer, DevOps engineer, or developer (in short, anyone in charge of tech systems), you’ll need to monitor and observe your workloads. You need a film room.

“Software engineering teams use observability to understand the health, performance, and status of software systems, including when and why errors occur.” (What is Observability? | New Relic)

Numerous tools can help engineers collect metrics from their resources. Choosing which one to use in your environment can be a challenge in itself because, as with anything in tech, there are many choices and many tradeoffs.

For example, some tools may offer more features but require more resources, while others may be simpler but less powerful. It’s important to consider your specific needs and constraints when making this decision.

By setting up a system to pull metrics from the environment, I aimed to empower the team to improve the overall observability of our work. While some monitoring existed in other environments, this particular one needed our attention more.

I listed a few things I hoped to accomplish for the organization by setting this system up. Some of the goals were to:

Improve visibility into existing systems without adding much overhead to the process.
Use containers to increase portability and decouple dependencies from the OS.
Set up dashboards to present the metrics in a digestible way.
Set up alarms to alert the team when something is unhealthy.
Use the metrics to validate assumptions and optimize the architecture of existing systems by updating requirements.

The proposed architecture, featuring Prometheus, Grafana, Telegraf, InfluxDB, and Podman, was chosen for its cost-effectiveness, familiarity, and ability to centralize metrics, all of which are key to enhancing observability.

Podman was chosen as the container engine because it uses a daemonless fork-exec architecture, integrates with SELinux for security, and makes it easy to run containers rootless. Podman also introduced Quadlet, a new way of managing containers with systemd.

The Quadlet strategy made sense because it provides a familiar process for engineers on the team who may not yet have much container experience. Those engineers can use systemd to restart containers when required. A win for the team!

In the snippet below, systemctl --user status grafana.service shows the status of the container. The usual systemctl commands are how you restart or stop the containers.

[monitor@monitors ~]$ systemctl --user status grafana.service --no-pager
● grafana.service - Podman Grafana container
     Loaded: loaded (/home/monitor/.config/containers/systemd/grafana.container; generated)
     Active: active (running) since Fri 2024-05-24 16:15:47 EDT; 2 weeks 0 days ago
       Docs: man:podman-generate-systemd(1)
   Main PID: 1826 (conmon)
      Tasks: 27 (limit: 48921)
     Memory: 346.2M
        CPU: 21min 52.337s
     CGroup: /user.slice/user-1006.slice/user@1006.service/app.slice/grafana.service
             ├─libpod-payload-377e44df70a05f120f2bc048f98d84cdd84e3b31375a88ddf0c9187909b1caaf
             │ └─1829 grafana server --homepath=/usr/share/grafana --config=/etc/grafana/grafana.ini --packaging=docker cfg:def…
             └─runtime
               ├─1763 rootlessport
               ├─1789 rootlessport-child
               └─1826 /usr/bin/conmon --api-version 1 -c 377e44df70a05f120f2bc048f98d84cdd84e3b31375a88ddf0c9187909b1caaf -u 37…
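
As a rough sketch of that day-to-day workflow, the commands below restart, stop, and start the Grafana unit and then read its recent logs. The run helper and the RUN variable are assumptions added here so the snippet can be pasted safely: by default it only prints each command, and it executes them when RUN=1 is set.

```shell
#!/usr/bin/env bash
# Manage the Grafana container through its generated systemd user unit.
unit=grafana.service

# Hypothetical helper: print each command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

run systemctl --user restart "$unit"   # restart the container
run systemctl --user stop "$unit"      # stop the container
run systemctl --user start "$unit"     # start it again
run journalctl --user -u "$unit" --no-pager -n 50   # last 50 log lines
```

Because the container is an ordinary user unit, nothing here is Podman-specific, which is the point: anyone comfortable with systemctl can operate it.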

Quadlet lets you define containers as systemd-style unit files, as shown below:

[Unit]
Description=Podman Grafana container
Documentation=man:podman-systemd.unit(5)

[Container]
ContainerName=grafana
Image=docker.io/grafana/grafana
PublishPort=127.0.0.1:3000:3000
User=grafana
Volume=/var/opt/monitors/grafana:/var/lib/grafana:U,Z
Network=monitor-network.network


[Service]
Restart=always

[Install]
WantedBy=multi-user.target default.target

For more info see man podman-systemd.unit
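
The Grafana unit points at monitor-network.network, which Quadlet resolves to a separate .network file in the same directory. The original file is not shown, so the contents below are a hypothetical minimal version; the script writes it to the rootless Quadlet directory and asks systemd to regenerate its units.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rootless Quadlet files live under the user's config directory.
quadlet_dir="${XDG_CONFIG_HOME:-$HOME/.config}/containers/systemd"
mkdir -p "$quadlet_dir"

# Hypothetical contents: the article shows only the file name, not its body.
cat > "$quadlet_dir/monitor-network.network" <<'EOF'
[Unit]
Description=Shared network for the monitoring containers

[Network]
# Without NetworkName=, Quadlet would name it systemd-monitor-network.
NetworkName=monitor-network
EOF

# Regenerate units from the Quadlet files where systemd is available.
if command -v systemctl >/dev/null 2>&1; then
    systemctl --user daemon-reload ||
        echo "daemon-reload failed; is a user session running?" >&2
fi
```

Containers that list the same Network= value join this network, so Grafana, Prometheus, and the other services can reach each other without publishing every port on the host.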

I used Grafana, Telegraf, InfluxDB, and Prometheus to centralize the collected metrics. Below are short snippets from each tool’s About page.

Prometheus collects and stores its metrics as time series data.
Grafana is a multi-platform open-source analytics and interactive visualization web application. When connected to supported data sources, it can produce charts, graphs, and alerts for the web.
Telegraf is a server-based agent that collects and sends all metrics and events from databases, systems, and IoT sensors.
InfluxDB is a database that stores and analyzes time series data from any source in real-time. It offers high performance, low cost, native SQL support, and interoperability with other data systems.
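
The other services can follow the same Quadlet pattern as Grafana. The sketch below writes a Prometheus unit modeled on the Grafana one; it is an illustration, not the configuration used in this project. The host paths under /var/opt/monitors/prometheus and the loopback port binding are assumptions, while the image name, port 9090, and the container-side paths are the defaults of the official Prometheus image.

```shell
#!/usr/bin/env bash
set -euo pipefail

quadlet_dir="${XDG_CONFIG_HOME:-$HOME/.config}/containers/systemd"
mkdir -p "$quadlet_dir"

# Hypothetical unit modeled on the Grafana one; host paths are assumptions.
cat > "$quadlet_dir/prometheus.container" <<'EOF'
[Unit]
Description=Podman Prometheus container
Documentation=man:podman-systemd.unit(5)

[Container]
ContainerName=prometheus
Image=docker.io/prom/prometheus
PublishPort=127.0.0.1:9090:9090
# Scrape configuration, mounted read-only.
Volume=/var/opt/monitors/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro,Z
# Time series storage.
Volume=/var/opt/monitors/prometheus/data:/prometheus:U,Z
Network=monitor-network.network

[Service]
Restart=always

[Install]
WantedBy=default.target
EOF

echo "Wrote $quadlet_dir/prometheus.container"
```

After a systemctl --user daemon-reload, the unit would appear as prometheus.service. With both containers on the same Podman network, Grafana can typically reach Prometheus by its container name rather than through the published port.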

I provisioned the RHEL 9 server and deployed the Quadlet files to it with Ansible. To accelerate the process, I imported a couple of community dashboards found on Grafana’s site, e.g., Dashboard. The community dashboards give this project a quick return while we develop our custom dashboards.

Conclusion

Implementing observability solutions may seem daunting at first, but with the technology available today, it’s relatively straightforward. There are numerous ways to set up observability, and choosing the right solution may take some time, but the setup itself is manageable.

If a demo on how to set this up would be helpful, let me know, and I’ll prepare one.
