Grafana: Building Advanced Monitoring Dashboards

Grafana is a leading platform for metrics visualization and observability. This guide covers everything from initial setup to building complex enterprise dashboards.

Installation and Configuration

Docker Compose Setup

# docker-compose.yml
version: '3.8'

services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      # Example credential only; set a strong password or inject a secret in production
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource,grafana-worldmap-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

volumes:
  grafana_data:
  prometheus_data:

Automated Configuration with Provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
    
  - name: MySQL
    type: mysql
    url: mysql-server:3306
    database: monitoring
    user: grafana
    secureJsonData:
      password: 'grafana_password'

Data Sources Configuration

Prometheus Data Source

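# Advanced Prometheus configuration
# (use this instead of the basic Prometheus entry, not alongside it: both are named Prometheus and marked isDefault)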
# grafana/provisioning/datasources/prometheus-advanced.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      # httpMethod is a jsonData field for the Prometheus data source
      httpMethod: POST
      httpHeaderName1: 'Authorization'
      timeInterval: '30s'
      queryTimeout: '60s'
      disableMetricsLookup: false
      customQueryParameters: 'max_source_resolution=5m&partial_response=true'
    secureJsonData:
      httpHeaderValue1: 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'

InfluxDB Configuration

# grafana/provisioning/datasources/influxdb.yml
apiVersion: 1

datasources:
  - name: InfluxDB
    type: influxdb
    access: proxy
    url: http://influxdb:8086
    user: grafana
    secureJsonData:
      password: 'influx_password'
    jsonData:
      # The top-level database field is deprecated in favor of jsonData.dbName
      dbName: metrics
      timeInterval: '10s'
      httpMode: GET
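
Grafana allows only one default data source per organization, and the provisioning files in this section are easy to load together by accident. The following is a minimal Python sketch of a pre-deployment check, assuming the YAML files have already been parsed into plain dictionaries; the function name `default_conflicts` is hypothetical and not part of any Grafana tooling.

```python
def default_conflicts(datasources):
    """Return {orgId: [names]} for organizations with more than one default data source."""
    defaults_by_org = {}
    for ds in datasources:
        if ds.get("isDefault", False):
            # Provisioning falls back to organization 1 when orgId is omitted
            org_id = ds.get("orgId", 1)
            defaults_by_org.setdefault(org_id, []).append(ds["name"])
    # Keep only organizations where the "one default" rule is violated
    return {org: names for org, names in defaults_by_org.items() if len(names) > 1}


# Entries mirroring the provisioning files in this section
provisioned = [
    {"name": "Prometheus", "type": "prometheus", "isDefault": True},
    {"name": "Loki", "type": "loki"},
    {"name": "MySQL", "type": "mysql"},
    {"name": "Prometheus", "type": "prometheus", "isDefault": True},
    {"name": "InfluxDB", "type": "influxdb"},
]
conflicts = default_conflicts(provisioned)
```

Running a check like this in CI before copying files into `grafana/provisioning/datasources` catches the conflict earlier than a failed Grafana start would.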

Creating Dashboards

Dashboard Infrastructure Overview

{
  "dashboard": {
    "id": null,
    "title": "Infrastructure Overview",
    "tags": ["infrastructure", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "System Load",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(node_load1)",
            "legendFormat": "Load 1m",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "thresholds"
            },
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.7},
                {"color": "red", "value": 1.0}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Memory Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)",
            "legendFormat": "{{instance}}",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 6, "y": 0}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
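
The `gridPos` values in the dashboard above place panels on Grafana's 24-column grid. When dashboards are generated rather than hand-written, the positions can be computed. This is a minimal sketch under that assumption; `layout_panels` is a hypothetical helper, not a Grafana API.

```python
GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def layout_panels(sizes):
    """Place panels left to right, wrapping to a new row when the grid is full.

    sizes is a list of (width, height) tuples, one per panel.
    """
    positions = []
    x = y = row_height = 0
    for width, height in sizes:
        if width > GRID_COLUMNS:
            raise ValueError(f"panel width {width} exceeds {GRID_COLUMNS} columns")
        if x + width > GRID_COLUMNS:
            # Panel does not fit in the current row: start a new one below it
            x = 0
            y += row_height
            row_height = 0
        positions.append({"h": height, "w": width, "x": x, "y": y})
        x += width
        row_height = max(row_height, height)
    return positions


# The two panels above, plus a third that no longer fits in the first row
grid = layout_panels([(6, 8), (12, 8), (12, 8)])
```

The first two results reproduce the `gridPos` entries of the System Load and Memory Usage panels; the third panel wraps to `x: 0, y: 8`.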

Application Performance Dashboard

{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}",
            "refId": "A"
          }
        ],
        "options": {
          "tooltip": {
            "mode": "multi",
            "sort": "desc"
          },
          "legend": {
            "displayMode": "table",
            "placement": "right",
            "values": ["last", "max"]
          }
        }
      },
      {
        "id": 2,
        "title": "Response Time Percentiles",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}} p50",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}} p95",
            "refId": "B"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}} p99",
            "refId": "C"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s"
          }
        }
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 1},
                {"color": "red", "value": 5}
              ]
            }
          }
        }
      }
    ]
  }
}
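
The three percentile targets in the Response Time Percentiles panel differ only in the quantile, so they are a natural candidate for generation. The sketch below builds the same `targets` list in Python; the function name and its default arguments are illustrative assumptions.

```python
import string


def percentile_targets(quantiles,
                       metric="http_request_duration_seconds_bucket",
                       window="5m",
                       by="service"):
    """Build one Prometheus target per quantile, with refIds A, B, C, ..."""
    targets = []
    for ref_id, q in zip(string.ascii_uppercase, quantiles):
        expr = (
            f"histogram_quantile({q:.2f}, "
            f"sum(rate({metric}[{window}])) by (le, {by}))"
        )
        targets.append({
            "expr": expr,
            # Doubled braces in the f-string produce Grafana's {{label}} syntax
            "legendFormat": f"{{{{{by}}}}} p{round(q * 100)}",
            "refId": ref_id,
        })
    return targets


targets = percentile_targets([0.50, 0.95, 0.99])
```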

Advanced Queries with PromQL

Infrastructure Queries

# CPU usage per instance (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Available memory (%)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# Disk usage per filesystem (%)
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

# Network receive throughput (bits/s)
rate(node_network_receive_bytes_total[5m]) * 8

# Disk I/O
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])

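# Load average per CPU core (on(instance) is needed because the two sides carry different label sets)
node_load1 / on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"})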
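
To make the CPU usage query at the top of this list less opaque, here is a simplified Python sketch of the arithmetic behind it: a per-second rate of the idle counter per core, averaged, then subtracted from 100. It is deliberately not Prometheus's real implementation: it ignores counter resets and extrapolation, and it uses the first and last sample in the window (like `rate`), whereas `irate` uses only the last two. The sample values are made up.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_per_core):
    """100 minus the average idle fraction across cores, as a percentage."""
    rates = [simple_rate(samples) for samples in idle_samples_per_core.values()]
    return 100 - (sum(rates) / len(rates)) * 100


# Hypothetical idle-seconds counters for two cores over a 60-second window
idle = {
    "cpu0": [(0, 100.0), (60, 154.0)],  # idle 54 s out of 60 s
    "cpu1": [(0, 200.0), (60, 242.0)],  # idle 42 s out of 60 s
}
usage = cpu_usage_percent(idle)
```

With these samples the cores are idle 90% and 70% of the time, so the instance is about 20% busy.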

Application Queries

# Request rate total
sum(rate(http_requests_total[5m]))

# Request rate per service
sum(rate(http_requests_total[5m])) by (service)

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Response time percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Throughput per endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Database connections
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100
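
The `histogram_quantile` queries above estimate a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following Python sketch illustrates that idea for classic histograms; it is a simplification for intuition, not Prometheus's source code, and the bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets (seconds, cumulative count)
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
p50 = histogram_quantile(0.50, buckets)
```

The interpolation is why percentile accuracy depends on bucket boundaries: here the p95 lands halfway through the 0.5 to 1.0 second bucket regardless of where requests inside that bucket actually fell.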

Alerting and Notifications

Alert Rules Configuration

# grafana/provisioning/alerting/alerts.yml
apiVersion: 1

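groups:
  - orgId: 1
    name: infrastructure
    folder: Infrastructure
    interval: 1m
    rules:
      - uid: high_cpu_usage
        title: High CPU Usage
        condition: C
        data:
          - refId: A
            queryType: ''
            # Assumed UID; it must match the uid of the provisioned Prometheus data source
            datasourceUid: prometheus
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              refId: A
          - refId: C
            queryType: ''
            # Built-in UID of Grafana's server-side expression engine
            datasourceUid: __expr__
            relativeTimeRange:
              from: 0
              to: 0
            model:
              type: classic_conditions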
              conditions:
                - evaluator:
                    params: [80]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [A]
                  reducer:
                    type: last
                  type: query
              refId: C
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        annotations:
          description: "CPU usage is above 80% for more than 5 minutes"
          summary: "High CPU usage detected"
        labels:
          severity: warning
          team: infrastructure
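
The rule above combines `interval: 1m` with `for: 5m`: the condition is evaluated every minute, and the alert only fires after it has held continuously for the pending period. This Python sketch models that behavior as a small state machine. It is a simplified illustration of the Normal, Pending, and Alerting states, not Grafana's scheduler, and it ignores the NoData and Error states.

```python
def alert_states(breaches, for_intervals):
    """Return the alert state after each evaluation.

    breaches: one boolean per evaluation (True when the condition is met).
    for_intervals: the `for` duration expressed in evaluation intervals.
    """
    states = []
    consecutive = 0
    for breached in breaches:
        if not breached:
            # Any healthy evaluation resets the pending timer
            consecutive = 0
            states.append("Normal")
            continue
        consecutive += 1
        # The first breach starts the pending period; the alert fires once
        # the condition has held for the full `for` duration
        states.append("Alerting" if consecutive > for_intervals else "Pending")
    return states


# interval: 1m and for: 5m -> five intervals of pending before firing
timeline = alert_states([True] * 6 + [False], for_intervals=5)
```

A short CPU spike of two or three minutes therefore never leaves Pending, which is the point of the `for` clause.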

Notification Channels

# grafana/provisioning/alerting/contact-points.yml
# Unified alerting provisions contact points under provisioning/alerting;
# the legacy notifiers format does not apply to the alert rules above.
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack_uid
        type: slack
        settings:
          url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
          recipient: '#alerts'
          username: 'Grafana'
          title: 'Grafana Alert'
          text: |
            *Alert:* {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
            *Description:* {{ range .Alerts }}{{ .Annotations.description }}{{ end }}
            *Status:* {{ .Status }}

  - orgId: 1
    name: email-alerts
    receivers:
      - uid: email_uid
        type: email
        settings:
          addresses: 'admin@empresa.com;devops@empresa.com'
          subject: '[GRAFANA] {{ .GroupLabels.alertname }}'
          message: |
            Alert Details:
            {{ range .Alerts }}
            Summary: {{ .Annotations.summary }}
            Description: {{ .Annotations.description }}
            Status: {{ .Status }}
            Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
            {{ end }}

Variables and Templating

Dashboard Variables

{
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(node_cpu_seconds_total, instance)",
        "refresh": 1,
        "includeAll": true,
        "allValue": ".*",
        "multi": true,
        "current": {
          "text": "All",
          "value": "$__all"
        }
      },
      {
        "name": "service",
        "type": "query",
        "query": "label_values(http_requests_total, service)",
        "refresh": 1,
        "includeAll": true,
        "multi": true
      },
      {
        "name": "time_range",
        "type": "interval",
        "query": "1m,5m,10m,30m,1h,6h,12h,1d,7d,14d,30d",
        "current": {
          "text": "5m",
          "value": "5m"
        }
      },
      {
        "name": "environment",
        "type": "custom",
        "query": "production,staging,development",
        "current": {
          "text": "production",
          "value": "production"
        }
      }
    ]
  }
}
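
Multi-value variables such as `instance` and `service` above only work in a query when it uses a regex matcher, for example `http_requests_total{service=~"$service"}`, because Grafana expands the selected values into a regex alternation for Prometheus. The Python sketch below approximates that expansion to show why `=` would break with more than one selection. Grafana's real escaping rules differ in detail, so treat this as an illustration only.

```python
import re


def expand_multi_value(values):
    """Join selected variable values into a regex usable with the =~ matcher."""
    escaped = [re.escape(value) for value in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def interpolate(query, name, values):
    """Replace $name in a query with the expanded selection."""
    return query.replace(f"${name}", expand_multi_value(values))


query = interpolate(
    'sum(rate(http_requests_total{service=~"$service"}[5m])) by (service)',
    "service",
    ["api", "web"],
)
```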

Advanced Templating

{
  "templating": {
    "list": [
      {
        "name": "datacenter",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(node_boot_time_seconds, datacenter)",
        "refresh": 2,
        "sort": 1
      },
      {
        "name": "server",
        "type": "query",
        "datasource": "Prometheus", 
        "query": "label_values(node_boot_time_seconds{datacenter=\"$datacenter\"}, instance)",
        "refresh": 2,
        "regex": "/([^:]+):.*/",
        "sort": 1
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "30s,1m,5m,10m,15m,30m,1h,6h,12h",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s"
      }
    ]
  }
}
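
The `regex` field on the `server` variable keeps only the first capture group of each value and drops values that do not match, which turns `host:port` instance labels into bare host names. The same pattern can be tried in Python before putting it in a dashboard; the instance names here are made up.

```python
import re

# Same pattern as the variable's "regex" field, without the surrounding slashes
INSTANCE_HOST = re.compile(r"([^:]+):.*")


def extract_hosts(instances):
    """Keep the first capture group of matching values and drop the rest."""
    hosts = []
    for instance in instances:
        match = INSTANCE_HOST.search(instance)
        if match:
            hosts.append(match.group(1))
    return hosts


hosts = extract_hosts(["web-01:9100", "db-01:9100", "no-port"])
```

Note that `no-port` is dropped because it contains no colon, so a fleet with mixed label formats would silently lose entries from the dropdown.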

Plugins and Extensions

Panel Plugins Configuration

# grafana/grafana.ini
[plugins]
enable_alpha = true
plugin_admin_enabled = true
allow_loading_unsigned_plugins = grafana-worldmap-panel,grafana-clock-panel

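[panels]
; Disabling HTML sanitization allows scripts in text panels (XSS risk); enable only for trusted editors
disable_sanitize_html = true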

Custom Panel Creation

// custom-panel/module.ts (TypeScript: the generic type parameter below is not valid in plain JavaScript)
import { PanelPlugin } from '@grafana/data';
import { CustomPanel } from './CustomPanel';
import { CustomOptions } from './types';

export const plugin = new PanelPlugin<CustomOptions>(CustomPanel)
  .setPanelOptions(builder => {
    return builder
      .addTextInput({
        path: 'title',
        name: 'Panel Title',
        description: 'The title of the panel',
        defaultValue: 'Custom Panel'
      })
      .addColorPicker({
        path: 'color',
        name: 'Background Color',
        defaultValue: 'blue'
      })
      .addBooleanSwitch({
        path: 'showLegend',
        name: 'Show Legend',
        defaultValue: true
      });
  });

Performance Optimization

Dashboard Performance Best Practices

{
  "dashboard": {
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "panels": [
      {
        "maxDataPoints": 1000,
        "targets": [
          {
            "expr": "rate(metric[5m])",
            "interval": "30s",
            "format": "time_series"
          }
        ],
        "options": {
          "reduceOptions": {
            "calcs": ["lastNotNull"]
          }
        }
      }
    ]
  }
}

Query Optimization

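# Bad: grouping by many labels produces a high-cardinality, slow query
sum(rate(http_requests_total[5m])) by (instance, service, method, status)

# Better: aggregate only by the labels the panel needs
sum(rate(http_requests_total[5m])) by (service)

# Use recording rules for complex or frequently evaluated queries
sum(rate(http_requests_total[5m])) by (service)
# is precomputed by Prometheus and queried as (level:metric:operations naming):
service:http_requests:rate5m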
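
The cost difference between the two groupings comes from cardinality: the number of series a query can return is bounded by the product of the distinct values of every label it groups by. A quick back-of-the-envelope calculation in Python, with label counts that are purely hypothetical:

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on result series: product of distinct values per grouping label."""
    return prod(label_cardinalities.values())


# Hypothetical number of distinct values for each label
wide = max_series({"instance": 50, "service": 10, "method": 5, "status": 8})
narrow = max_series({"service": 10})
```

With these numbers the wide grouping can return up to 20,000 series per evaluation, against 10 for the grouping by `service` alone.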

Multi-Tenancy and Security

Organization Management

# Create organizations (credentials match GF_SECURITY_ADMIN_PASSWORD from docker-compose.yml)
curl -X POST \
  http://admin:admin123@localhost:3000/api/orgs \
  -H 'Content-Type: application/json' \
  -d '{"name": "Development Team"}'

curl -X POST \
  http://admin:admin123@localhost:3000/api/orgs \
  -H 'Content-Type: application/json' \
  -d '{"name": "Production Team"}'

# Assign users to organizations (org ID 2 assumes "Development Team" was the first organization created after the default one)
curl -X POST \
  http://admin:admin123@localhost:3000/api/orgs/2/users \
  -H 'Content-Type: application/json' \
  -d '{"loginOrEmail": "dev@empresa.com", "role": "Editor"}'

RBAC Configuration

# grafana/grafana.ini
[auth]
disable_login_form = false
disable_signout_menu = false

[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = false

[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
verify_email_enabled = false

Backup and Disaster Recovery

Dashboard Backup Script

#!/bin/bash
# backup-dashboards.sh
set -euo pipefail

GRAFANA_URL="http://localhost:3000"
API_KEY="your_api_key"
BACKUP_DIR="/backups/grafana/$(date +%Y%m%d)"

mkdir -p "$BACKUP_DIR"

# Export all dashboards (the search API returns at most 1000 results by default)
curl -s -H "Authorization: Bearer $API_KEY" \
  "$GRAFANA_URL/api/search?type=dash-db" | \
  jq -r '.[] | .uid' | \
  while read -r uid; do
    echo "Backing up dashboard: $uid"
    curl -s -H "Authorization: Bearer $API_KEY" \
      "$GRAFANA_URL/api/dashboards/uid/$uid" | \
      jq '.dashboard' > "$BACKUP_DIR/$uid.json"
  done

# Export data sources
curl -s -H "Authorization: Bearer $API_KEY" \
  "$GRAFANA_URL/api/datasources" > "$BACKUP_DIR/datasources.json"

# Export alert rules
curl -s -H "Authorization: Bearer $API_KEY" \
  "$GRAFANA_URL/api/v1/provisioning/alert-rules" > "$BACKUP_DIR/alert-rules.json"

echo "Backup completed in $BACKUP_DIR"

Restore Script

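#!/bin/bash
# restore-dashboards.sh
set -euo pipefail

GRAFANA_URL="http://localhost:3000"
API_KEY="your_api_key"
BACKUP_DIR="/backups/grafana/20240210"

# Restore data sources one per request: the API accepts a single object, not the exported array.
# Secrets (secureJsonData) are not part of the export and must be set again afterwards.
jq -c '.[] | del(.id)' "$BACKUP_DIR/datasources.json" | \
  while IFS= read -r datasource; do
    curl -s -X POST \
      -H "Authorization: Bearer $API_KEY" \
      -H "Content-Type: application/json" \
      -d "$datasource" \
      "$GRAFANA_URL/api/datasources"
  done

# Restore dashboards
for dashboard in "$BACKUP_DIR"/*.json; do
  case "$(basename "$dashboard")" in
    datasources.json|alert-rules.json) continue ;;
  esac
  echo "Restoring $(basename "$dashboard")"
  # Reset the numeric id so the target instance matches the dashboard by uid
  jq '{dashboard: (.id = null), overwrite: true}' "$dashboard" | \
  curl -s -X POST \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d @- \
    "$GRAFANA_URL/api/dashboards/db"
done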
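
The restore step depends on wrapping each exported dashboard in the envelope that `/api/dashboards/db` expects. For teams that script restores in Python rather than with `jq`, the same transformation might look like the sketch below; `import_payload` is a hypothetical helper, and clearing the numeric `id` reflects the assumption that the target instance should match dashboards by `uid`.

```python
import json


def import_payload(dashboard, overwrite=True):
    """Wrap an exported dashboard model in the envelope used by /api/dashboards/db."""
    model = dict(dashboard)  # shallow copy so the caller's object is left untouched
    model["id"] = None       # numeric ids are instance-specific; uid identifies the dashboard
    return {"dashboard": model, "overwrite": overwrite}


exported = {"id": 42, "uid": "infra-overview", "title": "Infrastructure Overview"}
body = json.dumps(import_payload(exported))
```

The resulting `body` string can be sent as the request body with any HTTP client, using the same `Authorization` and `Content-Type` headers as the shell script.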

Conclusion

An effective Grafana deployment requires:

  • Automated configuration with provisioning
  • Well-structured, organized dashboards
  • Queries optimized for performance
  • A properly configured alerting system
  • Variables and templating for flexibility
  • Security and multi-tenancy in place
  • Solid backup and recovery strategies
  • Monitoring of Grafana itself

These practices produce robust, scalable, and maintainable dashboards for teams of any size.