The way out of the CloudWatch trap: a flat-rate log stack, and the math
I was a bit annoyed after the last post. I’d explained how CloudWatch Logs Insights quietly bills you per gigabyte scanned — how an always-on dashboard becomes a metered query loop that can cost more than the servers it watches — and then I stopped at “move it to a flat-rate store.” That’s the right answer, but it’s exactly the kind of hand-wave I complain about in other people’s writing. So I went and worked out the actual setup: the hardware, the software, the migration, and — because it’s the whole point — the savings.
Here’s how you make the alternative actually work.
The principle, in one line
CloudWatch charges you every time a query looks at your logs. A box you rent flat doesn’t. So you move the storing and the querying onto a machine you pay a fixed monthly price for, and a dashboard that refreshes every ten seconds costs exactly the same as one that never refreshes, because it’s just CPU you already bought. Everything below is how to do that without it becoming a second job.
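To make the metered-versus-flat difference concrete, here is a rough sketch. It assumes the Logs Insights list price of $0.005 per GB scanned (check your region), and the dashboard numbers (4 panels, 1 GB scanned per query) are hypothetical, as are the function names.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, for round numbers


def metered_monthly_cost(gb_per_query, panels, refresh_seconds, price_per_gb=0.005):
    """Scan charges for a dashboard left open all month.

    price_per_gb is the assumed Logs Insights list price; check your region.
    """
    queries_per_month = panels * SECONDS_PER_MONTH / refresh_seconds
    return queries_per_month * gb_per_query * price_per_gb


def flat_monthly_cost(box_rent, refresh_seconds=None):
    """A rented box costs the same however often the dashboard refreshes."""
    return box_rent


if __name__ == "__main__":
    # Hypothetical dashboard: 4 panels, each query scanning 1 GB.
    for refresh in (300, 60, 10):
        metered = metered_monthly_cost(1.0, panels=4, refresh_seconds=refresh)
        flat = flat_monthly_cost(15, refresh_seconds=refresh)
        print(f"refresh every {refresh:>3}s: metered ${metered:,.2f}/mo, flat ${flat:,.2f}/mo")
```

The metered figure scales inversely with the refresh interval; the flat figure ignores it entirely, which is the whole argument.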
The stack: Grafana + Loki
Three moving parts (plus an optional fourth). Crucially, you keep Grafana, so your dashboards barely change — you just repoint them.
| Piece | Role | Replaces |
|---|---|---|
| Grafana Alloy (or Vector / Fluent Bit) | Runs on your hosts/containers, tails logs, ships them out. Can also pull existing CloudWatch groups during cutover. | the CloudWatch agent |
| Loki | Stores + queries logs. Indexes only labels (service, host, level), keeps bodies as compressed chunks, and does not bill per GB scanned — queries are just compute on your box. | CloudWatch Logs + Logs Insights |
| Grafana | The dashboards you already have. Swap each panel’s datasource to Loki (LogQL). Refresh as fast as you like; it’s free now. | CloudWatch dashboards |
| Object storage (optional) | Point Loki’s chunk store at MinIO on the box, or Backblaze B2 / Hetzner Object Storage, so long retention is cheap and the box stays disposable. | CloudWatch Logs retention |
Why Loki specifically: it was built to be cheap. It doesn’t full-text-index every byte the way OpenSearch does — it indexes labels and greps compressed chunks — so it’s light on RAM and disk, and querying carries no per-scan meter.
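A small sketch of what "index labels, grep chunks" looks like from the query side. The label matchers inside the braces select streams through the index; the `|=` filter is the grep over the chunk contents. The helper names and the label values are hypothetical; `/loki/api/v1/query_range` is Loki's HTTP range-query endpoint, and the base URL assumes the Compose service name used later in this post.

```python
from urllib.parse import urlencode


def logql_selector(labels, contains=None):
    """Build a LogQL query: label matchers pick streams, |= filters lines.

    Values are not escaped here, so keep quotes out of them in this sketch.
    """
    matchers = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    query = "{" + matchers + "}"
    if contains is not None:
        query += f' |= "{contains}"'
    return query


def query_range_url(base_url, query, start_ns, end_ns, limit=100):
    """URL for Loki's range query API; start/end are Unix epoch nanoseconds."""
    params = urlencode({"query": query, "start": start_ns, "end": end_ns, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    q = logql_selector({"service": "api", "level": "error"}, contains="timeout")
    print(q)
    print(query_range_url("http://loki:3100", q, 1700000000000000000, 1700000600000000000))
```

Narrow label matchers keep the set of chunks Loki has to read small, so the selector matters more for query speed than the line filter does.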
The hardware: one flat-rate box
The entire point is a fixed monthly price with generous, unmetered traffic — so Hetzner or OVH. Size by log volume:
| Volume / retention | Box | ~Cost/mo |
|---|---|---|
| A few GB/day, weeks of retention | Hetzner Cloud CPX31 (4 vCPU, 8 GB, 160 GB NVMe) | ~$15 |
| Moderate, want NVMe + headroom | Hetzner EX44 / AX42 dedicated (64 GB RAM, 2× NVMe) | ~$45 |
| Big / long retention (100s of GB–TBs) | Dedicated + Loki chunks on B2 / Object Storage (or a Hetzner SX box for big HDDs) | ~$45–100 |
Loki is light — 8–16 GB RAM handles a lot. Keep hot, recent chunks on NVMe for snappy dashboards and push older chunks to object storage for cheap retention.
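A back-of-the-envelope way to pick a row from the table, as a sketch. The compression ratio is an assumption you should measure on your own logs (it is not a Loki guarantee), the headroom factor is arbitrary, and the function names are made up for illustration.

```python
def disk_needed_gb(ingest_gb_per_day, retention_days, compression_ratio, headroom=1.3):
    """Estimate on-disk chunk storage for a given retention window.

    compression_ratio: raw bytes / stored bytes, measured on your own logs.
    headroom: multiplier for index, WAL and growth (an arbitrary assumption).
    """
    raw_gb = ingest_gb_per_day * retention_days
    return raw_gb / compression_ratio * headroom


def fits(box_disk_gb, needed_gb):
    """True if the estimate fits on the box's local disk."""
    return needed_gb <= box_disk_gb


if __name__ == "__main__":
    # Hypothetical workload: 5 GB/day kept for 30 days, assumed 5x compression.
    needed = disk_needed_gb(5, 30, compression_ratio=5.0)
    print(f"need about {needed:.0f} GB; fits on a 160 GB disk: {fits(160, needed)}")
```

With those hypothetical inputs the estimate is about 39 GB, comfortably inside the 160 GB on the smallest box. If the estimate does not fit, that is the signal to move older chunks to object storage rather than to buy a bigger box.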
The setup: one docker-compose
The whole footprint is a single Compose file behind nginx + certbot, backed up with Restic. Roughly:
    services:
      loki:
        image: grafana/loki:3.0.0
        command: -config.file=/etc/loki/config.yml
        volumes: ["./loki-config.yml:/etc/loki/config.yml", "loki-data:/loki"]
        restart: unless-stopped
      grafana:
        image: grafana/grafana:latest
        ports: ["127.0.0.1:3000:3000"] # nginx terminates TLS in front
        environment: ["GF_SECURITY_ADMIN_PASSWORD=change-me"]
        volumes: ["grafana-data:/var/lib/grafana"]
        depends_on: ["loki"]
        restart: unless-stopped
      alloy:
        image: grafana/alloy:latest
        command: run /etc/alloy/config.alloy
        volumes: ["./alloy.alloy:/etc/alloy/config.alloy", "/var/log:/var/log:ro"]
        depends_on: ["loki"]
        restart: unless-stopped
    volumes:
      loki-data:
      grafana-data:
alloy.alloy says what to ship and where (tail these files / scrape these containers → push to http://loki:3100); loki-config.yml sets retention and, if you want it, the object-storage backend. Add the Loki datasource in Grafana pointed at http://loki:3100, then rewrite each panel’s Logs Insights query in LogQL; the panels themselves stay as they are.
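Before wiring up real shippers, it helps to push one hand-made line and see it appear in Grafana. A minimal smoke-test sketch using Loki's HTTP push endpoint (`/loki/api/v1/push`); the label values and function names are hypothetical, and the URL only resolves from inside the Compose network, since the Compose file above does not publish Loki's port.

```python
import json
import time
import urllib.request


def build_push_payload(labels, line, ts_ns=None):
    """JSON body for Loki's push API: one stream, one [timestamp, line] entry.

    Timestamps are Unix epoch nanoseconds, sent as strings.
    """
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {"streams": [{"stream": labels, "values": [[str(ts_ns), line]]}]}


def push(base_url, payload, timeout=5):
    """POST the payload to Loki and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status  # a 2xx status means Loki accepted the entry


if __name__ == "__main__":
    body = build_push_payload({"service": "smoke-test", "level": "info"}, "hello from the new box")
    print(push("http://loki:3100", body))
```

If the line shows up in Grafana's Explore view under `{service="smoke-test"}`, the datasource and the storage path are both working, and anything that goes missing later is a shipper problem.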
The migration: don’t rip out CloudWatch — split it
Low-risk because you run both side by side until you trust the new one:
- Stand up the stack on the box.
- Ship logs in parallel: run Alloy shipping to Loki while the existing CloudWatch shipping stays on for a week, so you can compare the two and trust the new one.
- Repoint the dashboards — switch your Grafana operational panels to the Loki datasource.
- Cut over app/host logs; keep a thin slice of CloudWatch for AWS-native alarms and metrics you can’t get elsewhere.
- Cut CloudWatch retention on the moved log groups (retention is itself billed) and watch the next invoice drop.
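The last step can be scripted. A sketch assuming boto3 (third-party, not part of the standard library) and its CloudWatch Logs `put_retention_policy` call; the log group name and the 7-day value are placeholders, and the API only accepts specific day counts, so check the documented list before picking one. The client is passed in so the function can be exercised without AWS credentials.

```python
def shorten_retention(logs_client, log_groups, days):
    """Apply one retention period to every moved log group.

    logs_client: a boto3 CloudWatch Logs client (or anything with the same
    put_retention_policy method). Returns the groups that were updated.
    """
    updated = []
    for name in log_groups:
        logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)
        updated.append(name)
    return updated


if __name__ == "__main__":
    import boto3  # third-party; needs AWS credentials in the environment

    client = boto3.client("logs")
    # Placeholder group name: substitute the groups you actually moved.
    print(shorten_retention(client, ["/app/example-service"], days=7))
```

Run it only after the parallel week, since shortening retention deletes the older CloudWatch copies you were comparing against.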
The savings — the actual math
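Take the example from the last post: a 30 GB log group re-queried by an always-on dashboard runs roughly $650/month in Logs Insights scan charges. Assuming the $0.005 per GB scanned list price, that is about 130,000 GB scanned a month, or the whole group read end to end more than 4,000 times. The bill grows every time you add a panel or speed up a refresh. On the box, that same querying is $0 at the margin; you pay the rent.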
| | CloudWatch (this pattern) | Grafana + Loki on a box |
|---|---|---|
| Storing the logs | ~$1/mo | included |
| Querying them (always-on dashboard) | ~$650+/mo, and rising | $0 at the margin |
| The box | — | ~$15–50/mo flat |
| 5-year cost of the query bill | ~$39,000+ | ~$900–3,000 (box) |
The honest footnote: that box is not free to run. Someone configures the stack, sets retention, and keeps it patched. Price that at engineer rates and a self-run log stack is realistically $200–500/month all-in. That still comes in under a $650-and-climbing scan bill; the point is to compare against the honest number, not a fantasy “$0.” And keep Loki’s chunks in object storage so the box itself is disposable: if it dies, you redeploy the Compose file and re-attach the data.
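The five-year figures are plain multiplication, spelled out here so you can swap in your own numbers. The inputs are the ones quoted in this section; the scan bill is held flat at $650, which understates it if the dashboard keeps growing. Names are illustrative.

```python
MONTHS = 5 * 12  # five years

SCAN_BILL = 650         # $/month, CloudWatch Logs Insights example
BOX_RENT = (15, 50)     # $/month, low and high, rent only
ALL_IN = (200, 500)     # $/month, rent plus engineer time


def five_year(monthly):
    """Total over five years for a flat monthly cost."""
    return monthly * MONTHS


def five_year_range(monthly_range):
    """Five-year totals for a (low, high) monthly range."""
    low, high = monthly_range
    return five_year(low), five_year(high)


if __name__ == "__main__":
    print("scan bill:", five_year(SCAN_BILL))
    print("box rent: ", five_year_range(BOX_RENT))
    print("all-in:   ", five_year_range(ALL_IN))
```

With these inputs the scan bill totals $39,000, the rent $900–3,000, and the all-in figure $12,000–30,000. Measured against the all-in number, the five-year margin is $9,000–27,000, not the $36,000+ the rent-only row suggests.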
When to keep some CloudWatch
Don’t rip it all out. Keep a thin slice for AWS-native alarms (auto-scaling triggers, managed-service internals you can’t get elsewhere) and, if your volume is genuinely small, the free tier. And if nobody will own the stack, a managed Grafana Cloud tier may beat a self-host you’ll let rot: a log stack nobody maintains is worse than a bill you understand.
The alternatives, briefly
- VictoriaLogs — leaner than Loki, very low resource use; great if you want minimal ops.
- ClickHouse — the heavyweight for high volume + SQL-style analytics; blazing fast, cheap per GB, wants more RAM. Pair with Grafana.
- OpenSearch + Dashboards — if you truly need full-text search / Kibana UX; the most resource-hungry to run.
That’s the whole thing: one box, one Compose file, and querying that no longer bills you by the gigabyte. If you’d like the “what am I paying to scan?” line broken out of your own bill — and a target stack sized to your team and volume — send me a recent cloud bill and I’ll send back a one-page teardown within a business day. For the short version, see the CloudWatch Logs Insights vs Loki comparison.