
Guide to Monitoring Linux Servers with Prometheus and Grafana

Introduction

Monitoring is key to keeping a system stable and catching problems early. In this guide, we build a production-ready monitoring stack with:

  • Prometheus: time-series database and metric collector
  • Grafana: visualization and dashboards
  • Node Exporter: collects system metrics
  • Alertmanager: manages alerts and notifications

System architecture:

┌─────────────────────────────────────────────────────┐
│                  Grafana Dashboard                  │
│                (Visualization Layer)                │
└─────────────────┬───────────────────────────────────┘
                  │ Query
┌─────────────────▼───────────────────────────────────┐
│                  Prometheus Server                  │
│           (Metrics Collection & Storage)            │
└─────┬───────────────────────────────────────────────┘
      │ Scrape Metrics
      │
      ├──► Node Exporter (Server 1) - System metrics
      ├──► Node Exporter (Server 2) - System metrics
      ├──► Application Metrics
      └──► Custom Exporters

┌─────────────────▼───────────────────────────────────┐
│                    Alertmanager                     │
│           (Alert Routing & Notification)            │
└─────────────────────────────────────────────────────┘

![Prometheus architecture diagram - place image at /static/img/system/prometheus-architecture.png]

Estimated time: 90-120 minutes
Difficulty: Intermediate to advanced
Requirements: Ubuntu 22.04, 4GB RAM, 20GB disk

Part 1: Installing Prometheus

Step 1: Create the User and Directories

# Create the prometheus user (no login)
sudo useradd --no-create-home --shell /bin/false prometheus

# Create directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus

# Set ownership
sudo chown prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

Step 2: Download and Install Prometheus

# Check the latest version at: https://prometheus.io/download/
PROM_VERSION="2.48.0"

# Download
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz

# Extract
tar -xvf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROM_VERSION}.linux-amd64

# Copy binaries
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/

# Set ownership
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool

# Copy config files
sudo cp -r consoles /etc/prometheus/
sudo cp -r console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles
sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries

# Cleanup
cd /tmp
rm -rf prometheus-${PROM_VERSION}.linux-amd64*

Verify installation:

prometheus --version
# Output: prometheus, version 2.48.0 (branch: HEAD, revision: ...)

Step 3: Configure Prometheus

sudo nano /etc/prometheus/prometheus.yml

Basic configuration:

# Global config
global:
  scrape_interval: 15s      # Scrape targets every 15s
  evaluation_interval: 15s  # Evaluate rules every 15s
  scrape_timeout: 10s

  # External labels (for Alertmanager)
  external_labels:
    cluster: 'production'
    environment: 'prod'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules
rule_files:
  - "alerts/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          alias: 'prometheus-server'

  # Node Exporter (local)
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          alias: 'local-server'

  # Node Exporter (remote servers)
  - job_name: 'remote_servers'
    static_configs:
      - targets:
          - '192.168.1.101:9100'
          - '192.168.1.102:9100'
        labels:
          datacenter: 'dc1'

Set permissions:

sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

Step 4: Create the Systemd Service

sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=30d \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090 \
--web.enable-lifecycle

Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Start Prometheus:

# Reload systemd
sudo systemctl daemon-reload

# Start service
sudo systemctl start prometheus
sudo systemctl enable prometheus

# Check status
sudo systemctl status prometheus

# Check logs
sudo journalctl -u prometheus -f

Verify Prometheus UI:

# Access: http://your-server-ip:9090
curl http://localhost:9090/-/healthy
# Output: Prometheus Server is Healthy.

![Screenshot of the Prometheus UI - place image at /static/img/system/prometheus-ui.png]

Part 2: Installing Node Exporter

Node Exporter collects hardware and OS metrics and exposes them over HTTP for Prometheus to scrape.

Step 1: Download Node Exporter

# Check version: https://prometheus.io/download/#node_exporter
NODE_VERSION="1.7.0"

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_VERSION}/node_exporter-${NODE_VERSION}.linux-amd64.tar.gz

# Extract
tar -xvf node_exporter-${NODE_VERSION}.linux-amd64.tar.gz

# Copy binary
sudo cp node_exporter-${NODE_VERSION}.linux-amd64/node_exporter /usr/local/bin/

# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Cleanup
rm -rf node_exporter-${NODE_VERSION}.linux-amd64*

Step 2: Create the Systemd Service

sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Documentation=https://prometheus.io/docs/guides/node-exporter/
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--collector.filesystem.mount-points-exclude='^/(sys|proc|dev|host|etc)($$|/)' \
--collector.netclass.ignored-devices='^(veth.*|docker.*|br-.*)$$' \
--collector.netdev.device-exclude='^(veth.*|docker.*|br-.*)$$' \
--web.listen-address=:9100

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Start Node Exporter:

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# Check status
sudo systemctl status node_exporter

# Test metrics endpoint
curl http://localhost:9100/metrics | head -n 20
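
The /metrics endpoint returns plain text in the Prometheus exposition format: comment lines starting with # (HELP and TYPE), followed by one sample per line. As a rough illustration of what Prometheus reads on each scrape, the sketch below parses a few such lines. The sample text is made up for the example, and the parser is deliberately simplified (it ignores timestamps and does not unescape label values); it is not the parser Prometheus itself uses.

```python
import re

# Matches lines such as:
#   node_load1 0.52
#   node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical excerpt of Node Exporter output, for illustration only.
SAMPLE_OUTPUT = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.52
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 56.25
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE_OUTPUT):
        print(name, labels, value)
```

Each parsed tuple corresponds to one time series sample: the metric name plus its label set identifies the series, and the float is the value stored at scrape time.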

Part 3: Installing Grafana

Step 1: Install Grafana

# Install dependencies
sudo apt install -y software-properties-common wget

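# Add Grafana GPG key (apt-key is deprecated, so store the key in a dedicated keyring)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

# Add repository
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Update and install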
sudo apt update
sudo apt install grafana -y

# Start service
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

# Check status
sudo systemctl status grafana-server

Default credentials: username admin, password admin. Grafana asks you to set a new password on first login.

Step 2: Configure Grafana

sudo nano /etc/grafana/grafana.ini

Key configurations:

[server]
# Protocol (http, https, h2, socket)
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = monitoring.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/

[security]
# Change default admin password
admin_user = admin
admin_password = YourStrongPassword123!
# Keep creation of the initial admin user enabled
disable_initial_admin_creation = false

[users]
# Disable user signup and organization creation
allow_sign_up = false
allow_org_create = false

[auth.anonymous]
# Enable anonymous access (for public dashboards)
enabled = false

[smtp]
enabled = true
host = smtp.gmail.com:587
user = [email protected]
password = your-app-password
from_address = [email protected]
from_name = Grafana

[alerting]
enabled = true

Restart Grafana:

sudo systemctl restart grafana-server

Step 3: Add the Prometheus Data Source

  1. Login to Grafana: http://your-server-ip:3000
  2. Go to Configuration → Data Sources
  3. Click Add data source
  4. Select Prometheus
  5. Configure:
    • Name: Prometheus
    • URL: http://localhost:9090
    • Access: Server (default)
  6. Click Save & Test

![Screenshot of the Grafana data source - place image at /static/img/system/grafana-datasource.png]

Step 4: Import a Dashboard

Method 1: Import from Grafana.com

Popular dashboards:

  • Node Exporter Full: Dashboard ID 1860
  • Node Exporter for Prometheus: Dashboard ID 11074
  • Prometheus Stats: Dashboard ID 3662

Steps:

  1. Go to Dashboards → Import
  2. Enter dashboard ID: 1860
  3. Click Load
  4. Select Prometheus data source
  5. Click Import

Method 2: Custom Dashboard

Create a dashboard JSON file:

sudo nano /var/lib/grafana/dashboards/system-overview.json
{
  "dashboard": {
    "title": "System Overview",
    "tags": ["linux", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ]
      }
    ]
  }
}
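
Writing panel JSON by hand gets repetitive once a dashboard has more than a couple of panels. One possible approach, sketched here, is to generate the same structure from a mapping of panel titles to PromQL expressions. The helper names (make_panel, make_dashboard) are hypothetical, and the output only mirrors the fields used in the JSON above; a real Grafana dashboard usually carries more fields (panel ids, grid positions, data source references).

```python
import json


def make_panel(title, expr):
    """Return one panel dict with a single PromQL target."""
    return {
        "title": title,
        "type": "graph",
        "targets": [{"expr": expr, "legendFormat": "{{instance}}"}],
    }


def make_dashboard(title, queries):
    """Build the dashboard document from a {panel title: PromQL} mapping."""
    return {
        "dashboard": {
            "title": title,
            "tags": ["linux", "monitoring"],
            "timezone": "browser",
            "panels": [make_panel(name, expr) for name, expr in queries.items()],
        }
    }


# Panel titles mapped to queries taken from this guide.
QUERIES = {
    "CPU Usage": '100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
    "Memory Usage": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
}

if __name__ == "__main__":
    # json.dumps takes care of escaping the quotes inside the PromQL strings.
    print(json.dumps(make_dashboard("System Overview", QUERIES), indent=2))
```

Generating the file also avoids the manual \" escaping that PromQL label matchers need inside JSON strings.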

![Screenshot of the Grafana dashboard - place image at /static/img/system/grafana-dashboard.png]

Part 4: Configuring Alerting

Step 1: Install Alertmanager

# Download
ALERT_VERSION="0.26.0"
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERT_VERSION}/alertmanager-${ALERT_VERSION}.linux-amd64.tar.gz

# Extract
tar -xvf alertmanager-${ALERT_VERSION}.linux-amd64.tar.gz

# Copy binary
sudo cp alertmanager-${ALERT_VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-${ALERT_VERSION}.linux-amd64/amtool /usr/local/bin/

# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager

# Create directories
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager

# Set ownership
sudo chown -R alertmanager:alertmanager /etc/alertmanager
sudo chown -R alertmanager:alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
sudo chown alertmanager:alertmanager /usr/local/bin/amtool

# Cleanup
rm -rf alertmanager-${ALERT_VERSION}.linux-amd64*

Step 2: Configure Alertmanager

sudo nano /etc/alertmanager/alertmanager.yml
global:
  # SMTP config for email alerts
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

# Templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Route alerts
route:
  # Default receiver
  receiver: 'default-receiver'

  # Group alerts
  group_by: ['alertname', 'cluster', 'service']

  # Timing
  group_wait: 10s       # Wait 10s before sending the first notification for a new group
  group_interval: 10s   # Wait 10s before notifying about new alerts added to an existing group
  repeat_interval: 12h  # Resend a notification every 12h while the alert is still firing

  # Routes for specific alerts
  routes:
    # Critical alerts
    - match:
        severity: critical
      receiver: critical-receiver
      continue: true

    # Warning alerts
    - match:
        severity: warning
      receiver: warning-receiver

    # Database alerts
    - match:
        service: database
      receiver: dba-team

# Inhibit rules (suppress alerts)
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

# Receivers
receivers:
  # Default receiver
  - name: 'default-receiver'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: '[Monitoring] {{ .GroupLabels.alertname }}'

  # Critical alerts
  - name: 'critical-receiver'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
    # Slack webhook
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-critical'
        title: 'Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    # PagerDuty (optional)
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'

  # Warning alerts
  - name: 'warning-receiver'
    email_configs:
      - to: '[email protected]'

  # DBA team
  - name: 'dba-team'
    email_configs:
      - to: '[email protected]'
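
The order of the child routes and the continue flag decide which receivers are notified, and that is easy to misread in YAML. The sketch below is a simplified model of the routing walk for the three child routes above: routes are checked in order, the first match stops the walk unless continue is true, and the default receiver is used when nothing matches. It models only flat equality matches, not nested routes or regex matchers, and the function names are hypothetical.

```python
# Child routes from the alertmanager.yml in this guide, in the same order.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "critical-receiver", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "warning-receiver", "continue": False},
    {"match": {"service": "database"}, "receiver": "dba-team", "continue": False},
]
DEFAULT_RECEIVER = "default-receiver"


def matches(route, labels):
    """A route matches when every label in its match block equals the alert's label."""
    return all(labels.get(key) == value for key, value in route["match"].items())


def route_alert(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receivers that would be notified for an alert's label set."""
    receivers = []
    for route in routes:
        if not matches(route, labels):
            continue
        receivers.append(route["receiver"])
        if not route["continue"]:
            break  # first match wins unless continue is true
    return receivers or [default]


if __name__ == "__main__":
    print(route_alert({"severity": "critical", "service": "database"}))
    print(route_alert({"severity": "warning", "service": "database"}))
    print(route_alert({"severity": "info"}))
```

Under this model a critical database alert reaches both critical-receiver and dba-team, while a warning database alert stops at warning-receiver and never reaches dba-team. If the DBA team should also see warnings, the warning route would need continue: true or the database route would have to come first.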

Step 3: Create the Systemd Service

sudo nano /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
Documentation=https://prometheus.io/docs/alerting/alertmanager/
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/ \
--web.listen-address=:9093

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Start Alertmanager:

sudo systemctl daemon-reload
sudo systemctl start alertmanager
sudo systemctl enable alertmanager
sudo systemctl status alertmanager

# Access UI: http://your-server-ip:9093

Step 4: Create Alert Rules

sudo mkdir -p /etc/prometheus/alerts
sudo nano /etc/prometheus/alerts/system-alerts.yml
groups:
  - name: system_alerts
    interval: 30s
    rules:
      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."

      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      # Critical CPU usage
      - alert: CriticalCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 95% (current value: {{ $value }}%)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 80% (current value: {{ $value }}%)"

      # Low disk space
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has less than 20% free space (current: {{ $value }}%)"

      # Critical disk space
      - alert: CriticalDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has less than 10% free space (current: {{ $value }}%)"

      # High network traffic
      - alert: HighNetworkTraffic
        expr: rate(node_network_receive_bytes_total[5m]) > 100000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High network traffic on {{ $labels.instance }}"
          description: "Network interface {{ $labels.device }} receiving more than 100MB/s"

      # System load
      - alert: HighSystemLoad
        expr: node_load15 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.instance }}"
          description: "15-minute load average is {{ $value }} (threshold: 2 per CPU core)"
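
The for: clause means an alert does not fire the moment its expression becomes true. It first sits in a pending state and only fires once the expression has stayed true for the whole duration; a single evaluation where the expression is false resets the timer. The sketch below models that behaviour for the InstanceDown rule (for: 2m, group interval 30s). It is a simplified state machine written for illustration, not Prometheus code, and it ignores details such as resolved notifications.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (True = expr returned a sample).
    eval_interval_s:   seconds between evaluations (the group's interval).
    for_s:             the rule's "for" duration in seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for index, is_true in enumerate(condition_results):
        now = index * eval_interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


if __name__ == "__main__":
    # Target down for five consecutive evaluations (0s, 30s, 60s, 90s, 120s).
    print(alert_states([True] * 5, eval_interval_s=30, for_s=120))
    # Target flaps once: the brief recovery restarts the 2-minute wait.
    print(alert_states([True, True, False, True, True, True, True, True], 30, 120))
```

In the first run the alert is pending for four evaluations and fires on the fifth. In the second, the recovery at the third evaluation pushes firing out to the eighth, which is why a longer for: value filters out short blips at the cost of slower notification.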

Set permissions:

sudo chown prometheus:prometheus /etc/prometheus/alerts/system-alerts.yml

# Reload Prometheus config
curl -X POST http://localhost:9090/-/reload

Verify alerts:

Part 5: Advanced Queries and Dashboards

Useful PromQL Queries

# CPU Usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk Usage (%)
(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"})) * 100

# Network In (MB/s)
rate(node_network_receive_bytes_total[5m]) / 1024 / 1024

# Network Out (MB/s)
rate(node_network_transmit_bytes_total[5m]) / 1024 / 1024

# Disk I/O (reads per second)
rate(node_disk_reads_completed_total[5m])

# Disk I/O (writes per second)
rate(node_disk_writes_completed_total[5m])

# System uptime (days)
(time() - node_boot_time_seconds) / 86400

# Number of CPUs
count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode)

# Load average (1m, 5m, 15m)
node_load1
node_load5
node_load15
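
The same queries can be run outside the web UI through the Prometheus HTTP API (GET /api/v1/query), which is handy for scripts and quick checks. The following is a minimal sketch using only the Python standard library. It assumes Prometheus is reachable at http://localhost:9090 without authentication; the helper names are hypothetical, and parse_vector keys results by the instance label only, so it suits queries that return one series per instance.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumed local instance


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """URL-encode a PromQL expression for the instant-query endpoint."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_vector(payload):
    """Map each instance label to its float value from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def query(expr):
    """Run an instant query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(expr), timeout=10) as response:
        return parse_vector(json.load(response))


if __name__ == "__main__":
    memory_expr = "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
    print(build_query_url(memory_expr))
    # Uncomment when a Prometheus server is running:
    # print(query(memory_expr))
```

Encoding the expression matters because PromQL is full of characters such as {, " and spaces that are not valid in a raw URL.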

Custom Exporter Example

Create custom metrics for an application:

#!/usr/bin/env python3
# app-exporter.py
# Requires: pip install prometheus-client psutil

from prometheus_client import start_http_server, Counter, Gauge
import time
import psutil

# Define metrics
cpu_usage = Gauge('app_cpu_usage_percent', 'Application CPU usage percentage')
mem_usage = Gauge('app_memory_usage_bytes', 'Application memory usage in bytes')
# A value that only ever increases is a Counter, not a Gauge
request_count = Counter('app_request_total', 'Total number of requests')

def collect_metrics():
    """Collect application metrics"""
    while True:
        # Get CPU usage (system-wide, sampled over 1s)
        cpu_usage.set(psutil.cpu_percent(interval=1))

        # Get memory usage
        mem = psutil.virtual_memory()
        mem_usage.set(mem.used)

        # Increment request count (example)
        request_count.inc()

        time.sleep(15)  # Collect roughly every 15s

if __name__ == '__main__':
    # Start HTTP server on port 9101
    start_http_server(9101)
    print("Exporter running on http://localhost:9101/metrics")
    collect_metrics()

Add to Prometheus:

scrape_configs:
  - job_name: 'custom_app'
    static_configs:
      - targets: ['localhost:9101']

Testing and Troubleshooting

Test Alert Rules

# Validate rules file
promtool check rules /etc/prometheus/alerts/system-alerts.yml

# Test alert expression
curl 'http://localhost:9090/api/v1/query?query=up==0'

# Trigger test alert (manual); uses the v2 API because v1 is removed in Alertmanager 0.27+
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname":"TestAlert","severity":"critical"},
    "annotations": {"summary":"This is a test alert"}
  }]'
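
A manually posted alert stays active until Alertmanager considers it resolved, so for repeated tests it can help to set an explicit end time. The sketch below builds the same payload in Python and adds startsAt and endsAt timestamps. It assumes Alertmanager's v2 endpoint (/api/v2/alerts) on http://localhost:9093; the helper names and the 5-minute default lifetime are choices made for this example.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

ALERTMANAGER_URL = "http://localhost:9093"  # assumed local instance


def build_test_alert(name, severity, summary, ttl_minutes=5, now=None):
    """Build a one-element alert list that expires after ttl_minutes."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": summary},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }]


def send_alerts(alerts, base_url=ALERTMANAGER_URL):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    alerts = build_test_alert("TestAlert", "critical", "This is a test alert")
    print(json.dumps(alerts, indent=2))
    # Uncomment when Alertmanager is running:
    # print(send_alerts(alerts))
```

With severity set to critical, the test alert should follow the critical route from the Alertmanager configuration, which makes it a quick end-to-end check of email and Slack delivery.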

Common Issues

Issue 1: Prometheus cannot scrape metrics

# Check targets status
curl http://localhost:9090/api/v1/targets | jq

# Check firewall
sudo ufw status
sudo ufw allow 9100/tcp

# Test connection
curl http://target-server:9100/metrics

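Issue 2: Alertmanager does not send email

# Check Alertmanager logs
sudo journalctl -u alertmanager -f

# Test SMTP connection
telnet smtp.gmail.com 587

# Verify the config file, then show the config the running instance loaded
amtool check-config /etc/alertmanager/alertmanager.yml
amtool config show --alertmanager.url=http://localhost:9093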

Issue 3: Grafana cannot connect to Prometheus

# Test from Grafana server (quote the URL so the shell does not interpret "?")
curl 'http://prometheus-server:9090/api/v1/query?query=up'

# Check Grafana logs
sudo tail -f /var/log/grafana/grafana.log

Security Best Practices

1. Enable Authentication

Prometheus Basic Auth:

# Install htpasswd
sudo apt install apache2-utils -y

# Generate a bcrypt hash for the admin user (prints admin:<hash>).
# Prometheus does not read .htpasswd files; the hash goes into web-config.yml.
htpasswd -nB admin

# Add to ExecStart in prometheus.service
--web.config.file=/etc/prometheus/web-config.yml

# /etc/prometheus/web-config.yml
basic_auth_users:
  admin: $2y$10$...hashed_password...
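
Once basic auth is enabled, every client that talks to Prometheus has to send credentials, including scripts that call the HTTP API (the Grafana data source needs its basic auth settings filled in as well). As a minimal sketch, this is how a script could attach the Authorization header using only the standard library. The username admin comes from the example above; the password "secret" and the helper names are placeholders for illustration.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


def authed_request(url, username, password):
    """Build a request object that carries basic auth credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


if __name__ == "__main__":
    # "secret" is a placeholder; use the password whose hash is in web-config.yml.
    request = authed_request("http://localhost:9090/-/healthy", "admin", "secret")
    print(request.get_header("Authorization"))
    # Uncomment when Prometheus is running with basic auth enabled:
    # with urllib.request.urlopen(request, timeout=10) as response:
    #     print(response.read().decode())
```

Basic auth only encodes the credentials, it does not encrypt them, so it should be combined with TLS as described in the next section.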

2. TLS/SSL Configuration

# Generate self-signed certificate
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout /etc/prometheus/prometheus.key \
-out /etc/prometheus/prometheus.crt

# Update web-config.yml
tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key

3. Reverse Proxy with Nginx

server {
    listen 80;
    server_name prometheus.example.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name prometheus.example.com;

    ssl_certificate /etc/letsencrypt/live/prometheus.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/prometheus.example.com/privkey.pem;

    location / {
        proxy_pass http://localhost:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
}

Backup and Maintenance

Backup Prometheus Data

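#!/bin/bash
# backup-prometheus.sh
# Requires Prometheus to run with --web.enable-admin-api, and jq to be installed.
set -euo pipefail

BACKUP_DIR="/backup/prometheus"
DATE=$(date +%Y%m%d_%H%M%S)
sudo mkdir -p "$BACKUP_DIR"

# Create snapshot; the API response contains the snapshot directory name
SNAPSHOT=$(curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')

# Copy snapshot
sudo cp -r "/var/lib/prometheus/snapshots/$SNAPSHOT" "$BACKUP_DIR/prometheus-$DATE"

# Backup configs
sudo tar -czf "$BACKUP_DIR/prometheus-config-$DATE.tar.gz" /etc/prometheus

# Keep only 7 days of backups (rm -rf because snapshots are directories)
sudo find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} +

echo "Backup completed: $DATE"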

Maintenance Tasks

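# Clean old data: adjust the retention flag in ExecStart of prometheus.service
--storage.tsdb.retention.time=30d

# Check TSDB status
curl http://localhost:9090/api/v1/status/tsdb

# Analyze data blocks (label cardinality and churn); compaction itself runs automatically
promtool tsdb analyze /var/lib/prometheus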

Conclusion

You have now built a production-ready monitoring system with Prometheus and Grafana. This system can:

✅ Collect metrics from multiple servers
✅ Visualize data with Grafana dashboards
✅ Alert when incidents occur
✅ Scale to monitor hundreds of servers

Next steps:

  • Integrate more exporters (MySQL, PostgreSQL, Redis, etc.)
  • Configure high availability for Prometheus
  • Implement long-term storage (Thanos, Cortex)
  • Set up distributed tracing (Jaeger, Tempo)

Tags: #prometheus #grafana #monitoring #alerting #observability #devops #metrics

Last updated: 19/12/2025
Author: BacPV Docs Team