Prometheus: Infrastructure Monitoring Configuration
Prometheus has become the de facto standard for monitoring modern systems. This guide covers everything from a basic installation to advanced high-availability configurations.
Installation and Initial Configuration
Installation with Docker Compose
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.max-block-duration=2h'
      - '--storage.tsdb.min-block-duration=2h'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped

volumes:
  prometheus_data:
  alertmanager_data:
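Because --web.enable-lifecycle is set above, Prometheus can reload its configuration without a restart by receiving an HTTP POST on /-/reload. A minimal sketch, assuming Prometheus is reachable on localhost:9090 (the script name is illustrative):

# reload_config.py - trigger a live config reload (hypothetical helper script)
import requests

# POST /-/reload makes Prometheus re-read prometheus.yml and the rule files
response = requests.post('http://localhost:9090/-/reload')
response.raise_for_status()
print('Prometheus configuration reloaded')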
Main Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 'node_exporter:9100'
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: instance
        replacement: '${1}'

  # Application monitoring
  - job_name: 'web_app'
    static_configs:
      - targets: ['app1.empresa.com:8080', 'app2.empresa.com:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s

  # Database monitoring
  - job_name: 'postgres_exporter'
    static_configs:
      - targets: ['db-exporter.empresa.com:9187']

  # Service Discovery with Consul
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.empresa.com:8500'
        services: ['web', 'api', 'database']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_node]
        target_label: node
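After a reload it is worth confirming that every scrape target is actually up. The /api/v1/targets endpoint reports the health of each active target; a minimal sketch, again assuming localhost:9090:

# check_targets.py - list scrape targets and their health (hypothetical helper script)
import requests

data = requests.get('http://localhost:9090/api/v1/targets').json()
for target in data['data']['activeTargets']:
    # health is 'up', 'down', or 'unknown'
    print(target['labels']['job'], target['scrapeUrl'], target['health'])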
Exporter Configuration
Advanced Node Exporter
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes \
    --collector.interrupts \
    --collector.tcpstat \
    --collector.meminfo_numa \
    --web.listen-address=0.0.0.0:9100

[Install]
WantedBy=multi-user.target
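To activate the unit, create the service account and enable it: useradd --no-create-home --shell /bin/false node_exporter, then systemctl daemon-reload && systemctl enable --now node_exporter. Note that the systemd collector talks to systemd over D-Bus, so the node_exporter user may need additional permissions for that collector to work.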
PostgreSQL Exporter
# postgres_exporter.yml
version: '3.8'
services:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter:latest
    environment:
      DATA_SOURCE_NAME: "postgresql://prometheus_user:password@postgres:5432/postgres?sslmode=disable"
      PG_EXPORTER_EXTEND_QUERY_PATH: "/etc/postgres_exporter/queries.yaml"
    ports:
      - "9187:9187"
    volumes:
      - ./postgres_queries.yaml:/etc/postgres_exporter/queries.yaml:ro
    restart: unless-stopped
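The file referenced by PG_EXPORTER_EXTEND_QUERY_PATH defines custom SQL queries that the exporter turns into metrics. A minimal sketch; the pg_database metric prefix and the query itself are illustrative:

# postgres_queries.yaml - custom queries for postgres_exporter (illustrative)
pg_database:
  query: "SELECT datname, pg_database_size(datname) AS size_bytes FROM pg_database"
  master: true
  metrics:
    - datname:
        usage: "LABEL"
        description: "Database name"
    - size_bytes:
        usage: "GAUGE"
        description: "Database size in bytes"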
Custom Application Metrics
# app_metrics.py - Flask application example
from flask import Flask, Response, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time
import psutil

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')
CPU_USAGE = Gauge('system_cpu_usage_percent', 'System CPU usage percentage')
MEMORY_USAGE = Gauge('system_memory_usage_bytes', 'System memory usage in bytes')

@app.before_request
def before_request():
    # Stash the start time on the request object for the latency histogram
    request.start_time = time.time()

@app.after_request
def after_request(response):
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.observe(time.time() - request.start_time)
    return response

@app.route('/metrics')
def metrics():
    # Update system metrics on each scrape
    CPU_USAGE.set(psutil.cpu_percent())
    MEMORY_USAGE.set(psutil.virtual_memory().used)
    return Response(generate_latest(), mimetype='text/plain')

@app.route('/health')
def health():
    return {'status': 'healthy', 'timestamp': time.time()}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
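With flask, prometheus_client, and psutil installed (pip install flask prometheus_client psutil), run the script and check the output at http://localhost:8080/metrics; the web_app job defined earlier scrapes that same path and port every 10 seconds.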
Alerting Rules
Infrastructure Rules
# rules/infrastructure.yml
groups:
  - name: infrastructure
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
      # High memory usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% on {{ $labels.instance }}"
      # Disk space warning
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is above 80% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service {{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"
Application Rules
# rules/application.yml
groups:
  - name: application
    rules:
      # High error rate (aggregated by job so the 5xx series divide cleanly
      # against the total; without the sum, vector matching on the status
      # label would pair each 5xx series only with itself)
      - alert: HighErrorRate
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m])) * 100 > 5
        for: 2m
        labels:
          severity: critical
          team: development
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is above 5% for {{ $labels.job }}"
      # High response time (buckets must be aggregated preserving the le label)
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, sum by(job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
          team: development
        annotations:
          summary: "High response time on {{ $labels.job }}"
          description: "95th percentile response time is above 500ms for {{ $labels.job }}"
      # Database connection issues
      - alert: DatabaseConnectionHigh
        expr: pg_stat_activity_count{state="active"} > 80
        for: 5m
        labels:
          severity: warning
          team: database
        annotations:
          summary: "High number of database connections"
          description: "Active database connections are above 80 on {{ $labels.instance }}"
Alertmanager Configuration
Main Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.empresa.com:587'
  smtp_from: 'alertas@empresa.com'
  smtp_auth_username: 'alertas@empresa.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        team: infrastructure
      receiver: 'infrastructure-team'
    - match:
        team: development
      receiver: 'development-team'

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@empresa.com'
        subject: '[ALERT] {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/WEBHOOK_URL'
        channel: '#alerts-critical'
        title: '🚨 CRITICAL ALERT'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}
    webhook_configs:
      - url: 'https://api.pagerduty.com/integration/webhook'
        send_resolved: true
  - name: 'infrastructure-team'
    email_configs:
      - to: 'infra-team@empresa.com'
        subject: '[INFRA] {{ .GroupLabels.alertname }}'
  - name: 'development-team'
    email_configs:
      - to: 'dev-team@empresa.com'
        subject: '[DEV] {{ .GroupLabels.alertname }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
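Routing is easy to get wrong, so it pays to test it with a synthetic alert before a real incident does. A minimal sketch that posts a fake alert to the Alertmanager v2 API, assuming it runs on localhost:9093 (script name and label values are illustrative):

# test_routing.py - fire a synthetic alert at Alertmanager (hypothetical helper script)
from datetime import datetime, timedelta, timezone
import requests

now = datetime.now(timezone.utc)
alerts = [{
    'labels': {'alertname': 'RoutingTest', 'severity': 'critical', 'team': 'infrastructure'},
    'annotations': {'summary': 'Synthetic alert to exercise the routing tree'},
    'startsAt': now.isoformat(),
    'endsAt': (now + timedelta(minutes=5)).isoformat(),
}]
response = requests.post('http://localhost:9093/api/v2/alerts', json=alerts)
response.raise_for_status()

With severity: critical and team: infrastructure, this alert should reach both critical-alerts (thanks to continue: true) and infrastructure-team.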
High Availability and Clustering
Prometheus Cluster Configuration
# prometheus-ha.yml
version: '3.8'
services:
  prometheus-1:
    image: prom/prometheus:latest
    container_name: prometheus-1
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus1_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.time=15d'
      - '--web.external-url=http://prometheus-1:9090'
    restart: unless-stopped

  prometheus-2:
    image: prom/prometheus:latest
    container_name: prometheus-2
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus2_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.time=15d'
      - '--web.external-url=http://prometheus-2:9090'
    restart: unless-stopped

  # HAProxy for load balancing
  haproxy:
    image: haproxy:latest
    container_name: prometheus-lb
    ports:
      - "8080:8080"
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    depends_on:
      - prometheus-1
      - prometheus-2
    restart: unless-stopped

volumes:
  prometheus1_data:
  prometheus2_data:
HAProxy Configuration
# haproxy.cfg
global
    daemon

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend prometheus_frontend
    bind *:8080
    default_backend prometheus_backend

backend prometheus_backend
    balance roundrobin
    option httpchk GET /api/v1/status/config
    server prometheus-1 prometheus-1:9090 check
    server prometheus-2 prometheus-2:9090 check
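Both replicas scrape the same targets independently, so their TSDBs are equivalent but not byte-identical; consecutive queries through the load balancer can therefore return slightly different values depending on which replica answers. If that matters for dashboards, HAProxy's balance source keeps each client pinned to one replica. Duplicate notifications are not a problem: both instances send the same alerts to Alertmanager, which deduplicates alerts with identical label sets.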
Optimization and Performance
Retention and Storage Configuration
# Advanced storage configuration
command:
  - '--storage.tsdb.retention.time=30d'
  - '--storage.tsdb.retention.size=50GB'
  - '--storage.tsdb.max-block-duration=2h'
  - '--storage.tsdb.min-block-duration=2h'
  - '--storage.tsdb.wal-compression'
  - '--web.enable-admin-api'
  - '--web.enable-lifecycle'
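With retention.time and retention.size both set, whichever limit is reached first triggers the cleanup. The admin API enabled above also allows on-demand TSDB snapshots, useful before upgrades; a minimal sketch assuming localhost:9090:

# snapshot.py - take a TSDB snapshot via the admin API (hypothetical helper script)
import requests

# The snapshot is written under <storage.tsdb.path>/snapshots/<name>
response = requests.post('http://localhost:9090/api/v1/admin/tsdb/snapshot')
response.raise_for_status()
print(response.json()['data']['name'])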
Recording Rules for Performance
# rules/recording.yml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # CPU usage per instance
      - record: instance:node_cpu_utilization:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Memory usage per instance
      - record: instance:node_memory_utilization:ratio
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
      # Application request rate
      - record: job:http_requests:rate5m
        expr: sum by(job) (rate(http_requests_total[5m]))
      # Application error rate
      - record: job:http_errors:rate5m
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m]))
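These precomputed series also simplify the alert expressions. For example, the HighCPUUsage alert from earlier could be rewritten against the recorded series (a sketch with the same threshold, to be placed in the infrastructure group):

# rules/infrastructure.yml (alternative form using the recorded series)
- alert: HighCPUUsage
  expr: instance:node_cpu_utilization:rate5m > 80
  for: 5m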
Service Discovery Integration
Kubernetes Configuration
# kubernetes-config.yml
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        target_label: __address__
        replacement: '${1}:9100'

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
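Under this convention a pod opts into scraping through its annotations. A sketch of the pod-template fragment, plus the extra relabel rule the job above would need to honor a prometheus.io/port annotation (these annotations are a community convention, not something Kubernetes enforces):

# pod template fragment (illustrative)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"

# additional relabel rule for the port annotation
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2
  target_label: __address__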
Conclusion
A robust monitoring system built on Prometheus requires:
- Properly configured exporters and metrics
- Well-defined, well-organized alerting rules
- High availability through clustering
- Storage and performance tuning
- Service discovery integration
- Intelligent alerting with Alertmanager
This configuration provides a solid foundation for monitoring complex infrastructures efficiently and reliably.