Grafana: Creating Advanced Monitoring Dashboards
Grafana is one of the most widely used platforms for metric visualization and observability. This guide covers the full path from initial setup to building complex enterprise dashboards.
Installation and Configuration
Docker Compose Setup
# docker-compose.yml
version: '3.8'
services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource,grafana-worldmap-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    restart: unless-stopped
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
volumes:
  grafana_data:
  prometheus_data:
Automated Configuration with Provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
  - name: MySQL
    type: mysql
    url: mysql-server:3306
    database: monitoring
    user: grafana
    secureJsonData:
      password: 'grafana_password'
Data Sources Configuration
Prometheus Data Source
# Advanced Prometheus configuration
# grafana/provisioning/datasources/prometheus-advanced.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      # httpMethod is a jsonData setting, not a top-level field
      httpMethod: POST
      httpHeaderName1: 'Authorization'
      timeInterval: '30s'
      queryTimeout: '60s'
      disableMetricsLookup: false
      customQueryParameters: 'max_source_resolution=5m&partial_response=true'
    secureJsonData:
      httpHeaderValue1: 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'
InfluxDB Configuration
# grafana/provisioning/datasources/influxdb.yml
apiVersion: 1
datasources:
  - name: InfluxDB
    type: influxdb
    access: proxy
    url: http://influxdb:8086
    database: metrics
    user: grafana
    secureJsonData:
      password: 'influx_password'
    jsonData:
      timeInterval: '10s'
      httpMode: GET
Creating Dashboards
Dashboard Infrastructure Overview
{
  "dashboard": {
    "id": null,
    "title": "Infrastructure Overview",
    "tags": ["infrastructure", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "System Load",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(node_load1)",
            "legendFormat": "Load 1m",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "thresholds"
            },
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.7},
                {"color": "red", "value": 1.0}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)",
            "legendFormat": "{{instance}}",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 6, "y": 0}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
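The threshold steps in the System Load panel can be read as "use the color of the last step whose value the reading has reached". The following is a minimal sketch of that lookup, not Grafana's own code; `ThresholdStep` and `colorForValue` are hypothetical names introduced for illustration, and the sketch assumes absolute thresholds sorted in ascending order with the base (`null`) step first.

```typescript
// One threshold step as it appears in fieldConfig.defaults.thresholds.steps.
// A null value marks the base step, which matches every reading.
interface ThresholdStep {
  color: string;
  value: number | null;
}

// Return the color of the last step whose value is <= the reading.
// Assumes steps are sorted ascending with the base (null) step first.
function colorForValue(steps: ThresholdStep[], reading: number): string {
  let color = "transparent"; // fallback when no steps are defined
  for (const step of steps) {
    if (step.value === null || reading >= step.value) {
      color = step.color;
    }
  }
  return color;
}

// The steps from the System Load panel.
const systemLoadSteps: ThresholdStep[] = [
  { color: "green", value: null },
  { color: "yellow", value: 0.7 },
  { color: "red", value: 1.0 },
];
```

With these steps, a load of 0.3 maps to green, 0.85 to yellow, and anything from 1.0 upward to red.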
Application Performance Dashboard
{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}",
            "refId": "A"
          }
        ],
        "options": {
          "tooltip": {
            "mode": "multi",
            "sort": "desc"
          },
          "legend": {
            "displayMode": "table",
            "placement": "right",
            "values": ["last", "max"]
          }
        }
      },
      {
        "id": 2,
        "title": "Response Time Percentiles",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}} p50",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}} p95",
            "refId": "B"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}} p99",
            "refId": "C"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s"
          }
        }
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 1},
                {"color": "red", "value": 5}
              ]
            }
          }
        }
      }
    ]
  }
}
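The three targets of the Response Time Percentiles panel differ only in the quantile, so they can be generated rather than copied by hand when dashboards are kept as code. This is one possible sketch; `PromTarget` and `percentileTargets` are hypothetical helpers, not part of any Grafana SDK, and the sketch assumes quantiles with at most two decimals and no more than 26 targets.

```typescript
// Shape of a Prometheus target as used in the panel JSON.
interface PromTarget {
  expr: string;
  legendFormat: string;
  refId: string;
}

// Build one histogram_quantile target per quantile.
// refIds are assigned A, B, C, ... in order.
function percentileTargets(
  bucketMetric: string,
  quantiles: number[],
  groupBy: string
): PromTarget[] {
  return quantiles.map((q, i) => ({
    expr: `histogram_quantile(${q.toFixed(2)}, sum(rate(${bucketMetric}[5m])) by (le, ${groupBy}))`,
    legendFormat: `{{${groupBy}}} p${Math.round(q * 100)}`,
    refId: String.fromCharCode(65 + i),
  }));
}

// Reproduces the p50, p95 and p99 targets of the panel.
const responseTimeTargets = percentileTargets(
  "http_request_duration_seconds_bucket",
  [0.5, 0.95, 0.99],
  "service"
);
```

The resulting array can be assigned to the panel's `targets` field before the dashboard JSON is serialized.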
Advanced PromQL Queries
Infrastructure Queries
# CPU usage per instance (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Available memory (%)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
# Disk usage per filesystem (%)
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)
# Network receive throughput (bits per second)
rate(node_network_receive_bytes_total[5m]) * 8
# Disk I/O (bytes per second, reads plus writes)
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])
# Load average per CPU core; on(instance) is needed because the two sides carry different label sets
node_load1 / on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"})
Application Queries
# Total request rate
sum(rate(http_requests_total[5m]))
# Request rate per service
sum(rate(http_requests_total[5m])) by (service)
# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Response time, 95th percentile
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Throughput per endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
# Database connections in use (%)
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100
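The percentile queries above rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket where the target rank falls. The sketch below is a simplified illustration of that idea for classic (cumulative) buckets, not Prometheus's implementation; `Bucket` and `histogramQuantile` are hypothetical names, and it assumes 0 < q <= 1, buckets sorted by upper bound, and a final `+Inf` bucket.

```typescript
// A cumulative histogram bucket: `count` observations were <= `le`.
interface Bucket {
  le: number; // upper bound; Infinity for the +Inf bucket
  count: number;
}

// Estimate the q-quantile by linear interpolation within one bucket.
// Assumes buckets are sorted by `le`, end with +Inf, and hold cumulative counts.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = buckets.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (idx === buckets.length - 1) return buckets[buckets.length - 2].le;

  const lowerBound = idx === 0 ? 0 : buckets[idx - 1].le;
  const countBelow = idx === 0 ? 0 : buckets[idx - 1].count;
  const countInBucket = buckets[idx].count - countBelow;
  return lowerBound + (buckets[idx].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Made-up request durations in seconds, for illustration only.
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];
```

For the sample data, the 95th percentile lands in the 0.5 to 1 second bucket, so the estimate lies between those bounds. This is why bucket boundaries should be chosen close to the latencies that matter: the estimate can never be more precise than the bucket width.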
Alerting and Notifications
Alert Rules Configuration
# grafana/provisioning/alerting/alerts.yml
apiVersion: 1
groups:
  - name: infrastructure
    folder: Infrastructure
    interval: 1m
    rules:
      - uid: high_cpu_usage
        title: High CPU Usage
        condition: C
        data:
          - refId: A
            queryType: ''
            relativeTimeRange:
              from: 300
              to: 0
            # Hypothetical uid: set it to the uid of your Prometheus data source
            datasourceUid: prometheus
            model:
              expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              refId: A
          - refId: C
            queryType: ''
            relativeTimeRange:
              from: 0
              to: 0
            # Expressions run on the built-in expression data source
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: [80]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [A]
                  reducer:
                    type: last
                  type: query
              refId: C
              type: classic_conditions
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        annotations:
          description: "CPU usage is above 80% for more than 5 minutes"
          summary: "High CPU usage detected"
        labels:
          severity: warning
          team: infrastructure
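The `for: 5m` setting means the condition has to hold continuously before the rule fires: the first breach moves the rule to a pending state, and it only becomes alerting once the breach has lasted the full period. The sketch below replays that behavior under simplifying assumptions (no NoData or error handling, evenly spaced evaluations); `AlertState` and `replayStates` are hypothetical names, not Grafana APIs.

```typescript
type AlertState = "Normal" | "Pending" | "Alerting";

// Replay a series of evaluations and return the state after each one.
// breaches[i] is true when the condition (CPU > 80) held at evaluation i.
function replayStates(
  breaches: boolean[],
  intervalSec: number,
  forSec: number
): AlertState[] {
  const states: AlertState[] = [];
  let pendingSince: number | null = null; // time of the first consecutive breach

  for (let i = 0; i < breaches.length; i++) {
    const now = i * intervalSec;
    if (!breaches[i]) {
      // Any healthy evaluation resets the pending timer.
      pendingSince = null;
      states.push("Normal");
      continue;
    }
    if (pendingSince === null) {
      pendingSince = now;
    }
    states.push(now - pendingSince >= forSec ? "Alerting" : "Pending");
  }
  return states;
}

// One evaluation per minute with a 5 minute pending period, as in the rule above.
const replayed = replayStates(
  [true, true, false, true, true, true, true, true, true],
  60,
  300
);
```

In this replay the short dip at the third evaluation resets the timer, so the rule only reaches the alerting state at the last evaluation, five minutes after the second run of breaches began.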
Notification Channels
# grafana/provisioning/notifiers/slack.yml
# Note: `notifiers` is the legacy alerting provisioning format. Grafana versions
# that only ship unified alerting provision contact points under
# provisioning/alerting instead.
apiVersion: 1
notifiers:
  - name: slack-alerts
    type: slack
    uid: slack_uid
    settings:
      url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
      channel: '#alerts'
      username: 'Grafana'
      title: 'Grafana Alert'
      text: |
        **Alert:** {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
        **Description:** {{ range .Alerts }}{{ .Annotations.description }}{{ end }}
        **Status:** {{ .Status }}
  - name: email-alerts
    type: email
    uid: email_uid
    settings:
      addresses: 'admin@empresa.com;devops@empresa.com'
      subject: '[GRAFANA] {{ .GroupLabels.alertname }}'
      body: |
        Alert Details:
        {{ range .Alerts }}
        Summary: {{ .Annotations.summary }}
        Description: {{ .Annotations.description }}
        Status: {{ .Status }}
        Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
        {{ end }}
Variables and Templating
Dashboard Variables
{
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(node_cpu_seconds_total, instance)",
        "refresh": 1,
        "includeAll": true,
        "allValue": ".*",
        "multi": true,
        "current": {
          "text": "All",
          "value": "$__all"
        }
      },
      {
        "name": "service",
        "type": "query",
        "query": "label_values(http_requests_total, service)",
        "refresh": 1,
        "includeAll": true,
        "multi": true
      },
      {
        "name": "time_range",
        "type": "interval",
        "query": "1m,5m,10m,30m,1h,6h,12h,1d,7d,14d,30d",
        "current": {
          "text": "5m",
          "value": "5m"
        }
      },
      {
        "name": "environment",
        "type": "custom",
        "query": "production,staging,development",
        "current": {
          "text": "production",
          "value": "production"
        }
      }
    ]
  }
}
Advanced Templating
{
  "templating": {
    "list": [
      {
        "name": "datacenter",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(node_boot_time_seconds, datacenter)",
        "refresh": 2,
        "sort": 1
      },
      {
        "name": "server",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(node_boot_time_seconds{datacenter=\"$datacenter\"}, instance)",
        "refresh": 2,
        "regex": "/([^:]+):.*/",
        "sort": 1
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "30s,1m,5m,10m,15m,30m,1h,6h,12h",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s"
      }
    ]
  }
}
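The `regex` on the `server` variable strips the port from each `instance` label, so `host:port` values collapse into host names. A rough sketch of that filtering step follows; `applyVariableRegex` is a hypothetical helper that mimics the observable behavior (first capture group wins, non-matching values are dropped, duplicates are removed) rather than reproducing Grafana's code, and the host names are made up.

```typescript
// Apply a "/pattern/" style variable regex to raw label values.
// When the pattern has a capture group, the first group becomes the value.
function applyVariableRegex(values: string[], regexLiteral: string): string[] {
  // Strip the leading and trailing slash of the "/.../" notation.
  const pattern = new RegExp(regexLiteral.replace(/^\/|\/$/g, ""));
  const seen = new Set<string>();
  for (const value of values) {
    const match = pattern.exec(value);
    if (match === null) continue; // values that do not match are dropped
    seen.add(match[1] ?? match[0]);
  }
  return Array.from(seen);
}

// Example instance labels as an exporter might report them.
const servers = applyVariableRegex(
  ["web-01:9100", "db-01:9100", "db-01:9104", "localhost"],
  "/([^:]+):.*/"
);
```

Here `servers` ends up with `web-01` and `db-01`: the two `db-01` exporters collapse into one entry, and `localhost` is dropped because it has no port to match.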
Plugins and Extensions
Panel Plugins Configuration
# grafana/grafana.ini
[plugins]
enable_alpha = true
plugin_admin_enabled = true
allow_loading_unsigned_plugins = grafana-worldmap-panel,grafana-clock-panel
[panels]
disable_sanitize_html = true
Custom Panel Creation
// custom-panel/module.ts
import { PanelPlugin } from '@grafana/data';
import { CustomPanel } from './CustomPanel';
import { CustomOptions } from './types';

// Register the panel component and declare the options shown in the panel editor
export const plugin = new PanelPlugin<CustomOptions>(CustomPanel)
  .setPanelOptions(builder => {
    return builder
      .addTextInput({
        path: 'title',
        name: 'Panel Title',
        description: 'The title of the panel',
        defaultValue: 'Custom Panel'
      })
      .addColorPicker({
        path: 'color',
        name: 'Background Color',
        defaultValue: 'blue'
      })
      .addBooleanSwitch({
        path: 'showLegend',
        name: 'Show Legend',
        defaultValue: true
      });
  });
Performance Optimization
Dashboard Performance Best Practices
{
  "dashboard": {
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [
      {
        "maxDataPoints": 1000,
        "targets": [
          {
            "expr": "rate(metric[5m])",
            "interval": "30s",
            "format": "time_series"
          }
        ],
        "options": {
          "reduceOptions": {
            "calcs": ["lastNotNull"]
          }
        }
      }
    ]
  }
}
Query Optimization
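# Bad: grouping by many labels returns a high-cardinality result that is slow to transfer and render
sum(rate(http_requests_total[5m])) by (instance, service, method, status)
# Better: group only by the labels the panel actually shows
sum(rate(http_requests_total[5m])) by (service)
# Use recording rules for complex queries. An expression such as
sum(rate(http_requests_total[5m])) by (service)
# is precomputed by Prometheus and then queried as a plain series:
job:http_requests:rate5m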
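The difference between the two groupings is easiest to see as arithmetic: the number of series a `sum(...) by (...)` can return is bounded by the product of the distinct values of each grouping label. The sketch below computes that bound; `maxGroupedSeries` is a hypothetical helper and the label counts are invented for illustration, not measured.

```typescript
// Upper bound on the series returned by `sum(...) by (labels)`:
// the product of the distinct-value counts of each grouping label.
function maxGroupedSeries(
  distinctValues: Record<string, number>,
  groupBy: string[]
): number {
  return groupBy.reduce(
    (product, label) => product * (distinctValues[label] ?? 1),
    1
  );
}

// Hypothetical label cardinalities, for illustration only.
const cardinalities: Record<string, number> = {
  instance: 50,
  service: 12,
  method: 5,
  status: 8,
};

const wideGrouping = maxGroupedSeries(cardinalities, ["instance", "service", "method", "status"]);
const narrowGrouping = maxGroupedSeries(cardinalities, ["service"]);
```

With these assumed counts, the wide grouping can return up to 24000 series while grouping by `service` returns at most 12. Real label combinations are usually sparser than the bound, but the gap still shows up in transfer and rendering time.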
Multi-Tenancy and Security
Organization Management
# Create organizations
# Basic auth uses the admin password set via GF_SECURITY_ADMIN_PASSWORD in docker-compose.yml
curl -X POST \
  http://admin:admin123@localhost:3000/api/orgs \
  -H 'Content-Type: application/json' \
  -d '{"name": "Development Team"}'
curl -X POST \
  http://admin:admin123@localhost:3000/api/orgs \
  -H 'Content-Type: application/json' \
  -d '{"name": "Production Team"}'
# Add users to an organization (here the organization with id 2)
curl -X POST \
  http://admin:admin123@localhost:3000/api/orgs/2/users \
  -H 'Content-Type: application/json' \
  -d '{"loginOrEmail": "dev@empresa.com", "role": "Editor"}'
RBAC Configuration
# grafana/grafana.ini
[auth]
disable_login_form = false
disable_signout_menu = false
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = false
[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
verify_email_enabled = false
Backup and Disaster Recovery
Dashboard Backup Script
#!/bin/bash
# backup-dashboards.sh
set -euo pipefail

GRAFANA_URL="http://localhost:3000"
API_KEY="your_api_key"
BACKUP_DIR="/backups/grafana/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Export all dashboards, one JSON file per uid
curl -s -H "Authorization: Bearer $API_KEY" \
  "$GRAFANA_URL/api/search?type=dash-db" | \
  jq -r '.[] | .uid' | \
  while read -r uid; do
    echo "Backing up dashboard: $uid"
    curl -s -H "Authorization: Bearer $API_KEY" \
      "$GRAFANA_URL/api/dashboards/uid/$uid" | \
      jq '.dashboard' > "$BACKUP_DIR/$uid.json"
  done

# Export data sources
curl -s -H "Authorization: Bearer $API_KEY" \
  "$GRAFANA_URL/api/datasources" > "$BACKUP_DIR/datasources.json"

# Export alert rules
curl -s -H "Authorization: Bearer $API_KEY" \
  "$GRAFANA_URL/api/v1/provisioning/alert-rules" > "$BACKUP_DIR/alert-rules.json"

echo "Backup completed in $BACKUP_DIR"
Restore Script
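#!/bin/bash
# restore-dashboards.sh
set -euo pipefail

GRAFANA_URL="http://localhost:3000"
API_KEY="your_api_key"
BACKUP_DIR="/backups/grafana/20240210"

# Restore data sources. The API creates one data source per request, so each
# element of the exported array is posted separately. Secrets are not part of
# the export and have to be set again afterwards.
jq -c '.[]' "$BACKUP_DIR/datasources.json" | \
  while IFS= read -r datasource; do
    curl -s -X POST \
      -H "Authorization: Bearer $API_KEY" \
      -H "Content-Type: application/json" \
      -d "$datasource" \
      "$GRAFANA_URL/api/datasources"
  done

# Restore dashboards
for dashboard in "$BACKUP_DIR"/*.json; do
  if [[ $dashboard != *"datasources"* ]] && [[ $dashboard != *"alert-rules"* ]]; then
    echo "Restoring $(basename "$dashboard")"
    # Clear the instance-specific numeric id; the uid identifies the dashboard
    jq '{dashboard: (. + {id: null}), overwrite: true}' "$dashboard" | \
      curl -s -X POST \
        -H "Authorization: Bearer $API_KEY" \
        -H "Content-Type: application/json" \
        -d @- \
        "$GRAFANA_URL/api/dashboards/db"
  fi
done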
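The restore step wraps each saved dashboard in the body that `POST /api/dashboards/db` expects. The same transformation can be sketched in code, which is convenient when dashboards are restored from a script or CI job rather than from the shell. `DashboardModel`, `ImportPayload` and `toImportPayload` are hypothetical names introduced here; the sketch assumes the backup file contains only the dashboard object, and it clears the numeric `id` because that value is specific to the instance the backup came from.

```typescript
// Minimal view of a saved dashboard; other fields pass through untouched.
interface DashboardModel {
  id?: number | null;
  uid?: string;
  title: string;
  [key: string]: unknown;
}

// Request body for POST /api/dashboards/db.
interface ImportPayload {
  dashboard: DashboardModel;
  overwrite: boolean;
}

// Wrap a backed-up dashboard for import. The uid is kept so that an
// existing dashboard with the same uid is overwritten instead of duplicated.
function toImportPayload(backup: DashboardModel): ImportPayload {
  return { dashboard: { ...backup, id: null }, overwrite: true };
}

// Example with made-up values.
const payload = toImportPayload({
  id: 42,
  uid: "infra-overview",
  title: "Infrastructure Overview",
  tags: ["infrastructure", "overview"],
});
```

Serializing `payload` with `JSON.stringify` gives the request body; the original `backup` object is left unmodified.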
Conclusion
An effective Grafana deployment requires:
- Automated configuration through provisioning
- Well-structured, well-organized dashboards
- Queries optimized for performance
- A properly configured alerting system
- Variables and templating for flexibility
- Security and multi-tenancy in place
- Solid backup and recovery strategies
- Monitoring of Grafana itself
These practices help keep dashboards robust, scalable, and maintainable for teams of any size.