Implement comprehensive system metrics collection with real-time monitoring
## System Metrics Collection
- Add SystemMetrics module with CPU, memory, disk, and system info collection
- Integrate Erlang :os_mon application (cpu_sup, memsup, disksup)
- Collect and format active system alarms with structured JSON output
- Replace simple "Hello" messages with rich system data in MQTT payloads
## MQTT Integration
- Update MqttClient to publish comprehensive metrics every 30 seconds
- Add :os_mon to application dependencies for system monitoring
- Maintain backward compatibility with existing dashboard consumption
## Documentation Updates
- Update CLAUDE.md with Phase 1 completion status and implementation details
- Completely rewrite README.md to reflect current project capabilities
- Document alarm format, architecture, and development workflow
## Technical Improvements
- Graceful error handling for metrics collection failures
- Clean alarm formatting: {severity, path/details, id}
- Dashboard automatically receives and displays real-time system data and alerts
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
parent
4a928b7067
commit
eff32b3233
37
CLAUDE.md
37
CLAUDE.md
@ -98,7 +98,40 @@ The project includes a Phoenix LiveView dashboard (`dashboard/`) that provides r
|
||||
- **Topic Parsing**: Topics may arrive as lists or strings, handle both formats
|
||||
- **Client ID Conflicts**: Use unique client IDs to prevent connection instability
|
||||
|
||||
## Development Roadmap
|
||||
|
||||
### Phase 1: System Metrics Collection (Completed)
|
||||
- ✅ **SystemMetrics Module**: `server/lib/systant/system_metrics.ex` - Comprehensive metrics collection
|
||||
- ✅ **CPU Metrics**: Load averages (1/5/15min) and utilization via `:cpu_sup`
|
||||
- ✅ **Memory Metrics**: System memory data and monitoring via `:memsup`
|
||||
- ✅ **Disk Metrics**: Disk usage and capacity for all mounted drives via `:disksup`
|
||||
- ✅ **System Info**: Uptime, Erlang/OTP versions, scheduler info
|
||||
- ✅ **System Alarms**: Active os_mon alarms (disk_almost_full, memory_high_watermark, etc.)
|
||||
- ✅ **MQTT Integration**: Real metrics published every 30 seconds replacing simple messages
|
||||
- 🔄 **Network Metrics**: TODO - Interface statistics, bandwidth utilization
|
||||
- 🔄 **GPU Metrics**: TODO - NVIDIA/AMD GPU utilization, temperatures, memory usage
|
||||
|
||||
#### Implementation Details
|
||||
- Uses Erlang's built-in `:os_mon` application (cpu_sup, memsup, disksup)
|
||||
- Collects active system alarms from `:alarm_handler` with structured format
|
||||
- Graceful error handling with fallbacks when metrics unavailable
|
||||
- JSON payload structure: `{timestamp, hostname, cpu, memory, disk, system, alarms}`
|
||||
- Dashboard automatically receives and displays real-time system data and alerts
|
||||
- Alarm format: `{severity, path/details, id}` for clean consumption
|
||||
|
||||
### Phase 2: Command System
|
||||
- Subscribe to `systant/+/commands` in MqttClient
|
||||
- Implement secure command execution framework with validation/whitelisting
|
||||
- Support commands like: restart services, update packages, system queries
|
||||
- Response mechanism to send command results back via MQTT
|
||||
|
||||
### Phase 3: Home Assistant Integration
|
||||
- Custom MQTT integration following Home Assistant patterns
|
||||
- Auto-discovery of systant hosts via MQTT discovery protocol
|
||||
- Create entities for metrics (sensors) and commands (buttons/services)
|
||||
- Dashboard cards and automation support
|
||||
|
||||
### Future Plans
|
||||
- Integration with Home Assistant via custom MQTT integration
|
||||
- Expandable command handling for host-specific automation
|
||||
- Multi-host deployment for comprehensive system monitoring
|
||||
- Advanced alerting and threshold monitoring
|
||||
- Historical data retention and trending
|
||||
123
README.md
123
README.md
@ -1,60 +1,79 @@
|
||||
# Systant
|
||||
|
||||
An Elixir application that runs as a systemd daemon to:
|
||||
1. Publish system stats to MQTT every 30 seconds
|
||||
2. Listen for commands over MQTT and log them to syslog
|
||||
A comprehensive Elixir-based system monitoring solution with real-time dashboard, designed for deployment across multiple NixOS hosts.
|
||||
|
||||
## Configuration
|
||||
## Components
|
||||
|
||||
Edit `config/config.exs` to configure MQTT connection:
|
||||
|
||||
```elixir
|
||||
config :systant, Systant.MqttClient,
|
||||
host: "localhost",
|
||||
port: 1883,
|
||||
client_id: "systant",
|
||||
username: nil,
|
||||
password: nil,
|
||||
stats_topic: "system/stats",
|
||||
command_topic: "system/commands",
|
||||
publish_interval: 30_000
|
||||
```
|
||||
|
||||
## Building
|
||||
|
||||
```bash
|
||||
mix deps.get
|
||||
mix compile
|
||||
```
|
||||
|
||||
## Running
|
||||
|
||||
```bash
|
||||
# Development
|
||||
mix run --no-halt
|
||||
|
||||
# Production release
|
||||
MIX_ENV=prod mix release
|
||||
_build/prod/rel/systant/bin/systant start
|
||||
```
|
||||
|
||||
## Systemd Installation
|
||||
|
||||
1. Build production release
|
||||
2. Copy binary to `/usr/local/bin/`
|
||||
3. Copy `systant.service` to `/etc/systemd/system/`
|
||||
4. Enable and start:
|
||||
|
||||
```bash
|
||||
sudo systemctl enable systant
|
||||
sudo systemctl start systant
|
||||
```
|
||||
- **Server** (`server/`): Elixir OTP application that collects and publishes system metrics via MQTT
|
||||
- **Dashboard** (`dashboard/`): Phoenix LiveView web dashboard for real-time monitoring
|
||||
- **Nix Integration**: Complete NixOS module and packaging for easy deployment
|
||||
|
||||
## Features
|
||||
|
||||
- Publishes "Hello from systant" stats every 30 seconds to `system/stats` topic
|
||||
- Listens on `system/commands` topic and logs received messages
|
||||
- Configurable MQTT connection settings
|
||||
- Runs as systemd daemon with auto-restart
|
||||
- Logs to system journal
|
||||
### System Metrics Collection
|
||||
- **CPU**: Load averages (1/5/15min) and utilization monitoring
|
||||
- **Memory**: System memory usage and swap monitoring
|
||||
- **Disk**: Usage statistics and capacity monitoring for all drives
|
||||
- **System Alarms**: Real-time alerts for disk space, memory pressure, etc.
|
||||
- **System Info**: Uptime, Erlang/OTP versions, scheduler information
|
||||
|
||||
### Real-time Dashboard
|
||||
- Phoenix LiveView interface showing all connected hosts
|
||||
- Live system metrics and alert monitoring
|
||||
- Automatic reconnection and error handling
|
||||
|
||||
### MQTT Integration
|
||||
- Publishes comprehensive system metrics every 30 seconds
|
||||
- Uses hostname-based topics: `systant/${hostname}/stats`
|
||||
- Structured JSON payloads with full system data
|
||||
- Configurable MQTT broker connection
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Development Environment
|
||||
```bash
|
||||
# Enter Nix development shell
|
||||
nix develop
|
||||
|
||||
# Run the server
|
||||
cd server && mix run --no-halt
|
||||
|
||||
# Run the dashboard (separate terminal)
|
||||
just dashboard
|
||||
# or: cd dashboard && mix phx.server
|
||||
```
|
||||
|
||||
### Production Deployment (NixOS)
|
||||
```bash
|
||||
# Build and install via Nix
|
||||
nix build
|
||||
sudo nixos-rebuild switch --flake .
|
||||
|
||||
# Or use the NixOS module in your configuration:
|
||||
# imports = [ ./path/to/systant/nix/nixos-module.nix ];
|
||||
# services.systant.enable = true;
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
Default MQTT configuration (customizable via environment variables):
|
||||
- **Host**: `mqtt.home:1883`
|
||||
- **Topics**: `systant/${hostname}/stats` and `systant/${hostname}/commands`
|
||||
- **Interval**: 30 seconds
|
||||
- **Client ID**: `systant_${random}` (auto-generated to avoid conflicts)
|
||||
|
||||
## Architecture
|
||||
|
||||
- **Server**: `server/lib/systant/mqtt_client.ex` - MQTT publishing and command handling
|
||||
- **Metrics**: `server/lib/systant/system_metrics.ex` - System data collection using `:os_mon`
|
||||
- **Dashboard**: `dashboard/lib/dashboard/mqtt_subscriber.ex` - Real-time MQTT data consumption
|
||||
- **Nix**: `nix/package.nix` and `nix/nixos-module.nix` - Complete packaging and deployment
|
||||
|
||||
## Roadmap
|
||||
|
||||
- ✅ **Phase 1**: System metrics collection with real-time dashboard
|
||||
- 🔄 **Phase 2**: Command system for remote host management
|
||||
- 🔄 **Phase 3**: Home Assistant integration for automation
|
||||
|
||||
See `CLAUDE.md` for detailed development context and implementation notes.
|
||||
|
||||
|
||||
@ -30,13 +30,9 @@ defmodule Systant.MqttClient do
|
||||
{:ok, _pid} ->
|
||||
Logger.info("MQTT client connected successfully")
|
||||
|
||||
# Send immediate hello message
|
||||
hello_msg = %{
|
||||
message: "Hello - systant started",
|
||||
timestamp: DateTime.utc_now() |> DateTime.to_iso8601(),
|
||||
hostname: get_hostname()
|
||||
}
|
||||
Tortoise.publish(client_id, config[:stats_topic], Jason.encode!(hello_msg), qos: 0)
|
||||
# Send immediate system metrics on startup
|
||||
startup_stats = Systant.SystemMetrics.collect_metrics()
|
||||
Tortoise.publish(client_id, config[:stats_topic], Jason.encode!(startup_stats), qos: 0)
|
||||
|
||||
schedule_stats_publish(config[:publish_interval])
|
||||
{:ok, %{config: config, client_id: client_id}}
|
||||
@ -57,23 +53,19 @@ defmodule Systant.MqttClient do
|
||||
{:noreply, state}
|
||||
end
|
||||
|
||||
def terminate(reason, state) do
|
||||
def terminate(reason, _state) do
|
||||
Logger.info("MQTT client terminating: #{inspect(reason)}")
|
||||
:ok
|
||||
end
|
||||
|
||||
defp publish_stats(config, client_id) do
|
||||
stats = %{
|
||||
message: "Hello from systant",
|
||||
timestamp: DateTime.utc_now() |> DateTime.to_iso8601(),
|
||||
hostname: get_hostname()
|
||||
}
|
||||
stats = Systant.SystemMetrics.collect_metrics()
|
||||
|
||||
payload = Jason.encode!(stats)
|
||||
|
||||
case Tortoise.publish(client_id, config[:stats_topic], payload, qos: 0) do
|
||||
:ok ->
|
||||
Logger.info("Published stats: #{payload}")
|
||||
Logger.info("Published system metrics for #{stats.hostname}")
|
||||
{:error, reason} ->
|
||||
Logger.error("Failed to publish stats: #{inspect(reason)}")
|
||||
end
|
||||
@ -82,11 +74,4 @@ defmodule Systant.MqttClient do
|
||||
defp schedule_stats_publish(interval) do
|
||||
Process.send_after(self(), :publish_stats, interval)
|
||||
end
|
||||
|
||||
defp get_hostname do
|
||||
case :inet.gethostname() do
|
||||
{:ok, hostname} -> List.to_string(hostname)
|
||||
_ -> "unknown"
|
||||
end
|
||||
end
|
||||
end
|
||||
203
server/lib/systant/system_metrics.ex
Normal file
203
server/lib/systant/system_metrics.ex
Normal file
@ -0,0 +1,203 @@
|
||||
defmodule Systant.SystemMetrics do
|
||||
@moduledoc """
|
||||
Collects system metrics using Erlang's built-in :os_mon application.
|
||||
Provides CPU, memory, disk, and network statistics.
|
||||
"""
|
||||
|
||||
require Logger
|
||||
|
||||
@doc """
|
||||
Collect all available system metrics
|
||||
"""
|
||||
def collect_metrics do
|
||||
%{
|
||||
timestamp: DateTime.utc_now() |> DateTime.to_iso8601(),
|
||||
hostname: get_hostname(),
|
||||
cpu: collect_cpu_metrics(),
|
||||
memory: collect_memory_metrics(),
|
||||
disk: collect_disk_metrics(),
|
||||
system: collect_system_info(),
|
||||
alarms: collect_active_alarms()
|
||||
}
|
||||
end
|
||||
|
||||
@doc """
|
||||
Collect CPU metrics using cpu_sup
|
||||
"""
|
||||
def collect_cpu_metrics do
|
||||
try do
|
||||
# Get CPU utilization (average over all cores)
|
||||
cpu_util = case :cpu_sup.util() do
|
||||
{:badrpc, _} -> nil
|
||||
util when is_number(util) -> util
|
||||
_ -> nil
|
||||
end
|
||||
|
||||
# Get load averages (1, 5, 15 minutes)
|
||||
load_avg = case :cpu_sup.avg1() do
|
||||
{:badrpc, _} -> nil
|
||||
avg when is_number(avg) ->
|
||||
%{
|
||||
avg1: avg,
|
||||
avg5: safe_call(:cpu_sup, :avg5, []),
|
||||
avg15: safe_call(:cpu_sup, :avg15, [])
|
||||
}
|
||||
_ -> nil
|
||||
end
|
||||
|
||||
%{
|
||||
utilization: cpu_util,
|
||||
load_average: load_avg
|
||||
}
|
||||
rescue
|
||||
_ ->
|
||||
Logger.warning("CPU metrics collection failed")
|
||||
%{utilization: nil, load_average: nil}
|
||||
end
|
||||
end
|
||||
|
||||
@doc """
|
||||
Collect memory metrics using memsup
|
||||
"""
|
||||
def collect_memory_metrics do
|
||||
try do
|
||||
# Get system memory data
|
||||
system_memory = case :memsup.get_system_memory_data() do
|
||||
{:badrpc, _} -> nil
|
||||
data when is_list(data) -> Enum.into(data, %{})
|
||||
_ -> nil
|
||||
end
|
||||
|
||||
# Get memory check interval
|
||||
check_interval = safe_call(:memsup, :get_check_interval, [])
|
||||
|
||||
%{
|
||||
system: system_memory,
|
||||
check_interval: check_interval
|
||||
}
|
||||
rescue
|
||||
_ ->
|
||||
Logger.warning("Memory metrics collection failed")
|
||||
%{system: nil, check_interval: nil}
|
||||
end
|
||||
end
|
||||
|
||||
@doc """
|
||||
Collect disk metrics using disksup
|
||||
"""
|
||||
def collect_disk_metrics do
|
||||
try do
|
||||
# Get disk data for all disks
|
||||
disks = case :disksup.get_disk_data() do
|
||||
{:badrpc, _} -> []
|
||||
data when is_list(data) ->
|
||||
Enum.map(data, fn {path, kb_size, capacity} ->
|
||||
%{
|
||||
path: List.to_string(path),
|
||||
size_kb: kb_size,
|
||||
capacity_percent: capacity
|
||||
}
|
||||
end)
|
||||
_ -> []
|
||||
end
|
||||
|
||||
%{disks: disks}
|
||||
rescue
|
||||
_ ->
|
||||
Logger.warning("Disk metrics collection failed")
|
||||
%{disks: []}
|
||||
end
|
||||
end
|
||||
|
||||
@doc """
|
||||
Collect active system alarms from alarm_handler
|
||||
"""
|
||||
def collect_active_alarms do
|
||||
try do
|
||||
# Get all active alarms from the alarm handler
|
||||
case :alarm_handler.get_alarms() do
|
||||
alarms when is_list(alarms) ->
|
||||
Enum.map(alarms, fn {alarm_id, alarm_desc} ->
|
||||
format_alarm(alarm_id, alarm_desc)
|
||||
end)
|
||||
_ -> []
|
||||
end
|
||||
rescue
|
||||
_ ->
|
||||
Logger.warning("Alarm collection failed")
|
||||
[]
|
||||
end
|
||||
end
|
||||
|
||||
@doc """
|
||||
Collect general system information
|
||||
"""
|
||||
def collect_system_info do
|
||||
try do
|
||||
%{
|
||||
uptime_seconds: get_uptime(),
|
||||
erlang_version: System.version(),
|
||||
otp_release: System.otp_release(),
|
||||
schedulers: System.schedulers(),
|
||||
logical_processors: System.schedulers_online()
|
||||
}
|
||||
rescue
|
||||
_ ->
|
||||
Logger.warning("System info collection failed")
|
||||
%{}
|
||||
end
|
||||
end
|
||||
|
||||
# Private helper functions
|
||||
|
||||
defp get_hostname do
|
||||
case :inet.gethostname() do
|
||||
{:ok, hostname} -> List.to_string(hostname)
|
||||
_ -> "unknown"
|
||||
end
|
||||
end
|
||||
|
||||
defp get_uptime do
|
||||
try do
|
||||
# Get system uptime in milliseconds, convert to seconds
|
||||
:erlang.statistics(:wall_clock) |> elem(0) |> div(1000)
|
||||
rescue
|
||||
_ -> nil
|
||||
end
|
||||
end
|
||||
|
||||
defp safe_call(module, function, args) do
|
||||
try do
|
||||
apply(module, function, args)
|
||||
catch
|
||||
:exit, _ -> nil
|
||||
:error, _ -> nil
|
||||
end
|
||||
end
|
||||
|
||||
defp format_alarm(alarm_id, _alarm_desc) do
|
||||
case alarm_id do
|
||||
{:disk_almost_full, path} when is_list(path) ->
|
||||
%{
|
||||
severity: "warning",
|
||||
path: List.to_string(path),
|
||||
id: "disk_almost_full"
|
||||
}
|
||||
{:system_memory_high_watermark, _} ->
|
||||
%{
|
||||
severity: "critical",
|
||||
id: "system_memory_high_watermark"
|
||||
}
|
||||
atom when is_atom(atom) ->
|
||||
%{
|
||||
severity: "info",
|
||||
id: Atom.to_string(atom)
|
||||
}
|
||||
_ ->
|
||||
%{
|
||||
severity: "info",
|
||||
id: inspect(alarm_id)
|
||||
}
|
||||
end
|
||||
end
|
||||
end
|
||||
@ -15,7 +15,7 @@ defmodule SystemStatsDaemon.MixProject do
|
||||
# Run "mix help compile.app" to learn about applications.
|
||||
def application do
|
||||
[
|
||||
extra_applications: [:logger],
|
||||
extra_applications: [:logger, :os_mon],
|
||||
mod: {Systant.Application, []}
|
||||
]
|
||||
end
|
||||
|
||||
Loading…
Reference in New Issue
Block a user