diff --git a/CLAUDE.md b/CLAUDE.md index 7c0c3cd..c7cea4d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -98,7 +98,40 @@ The project includes a Phoenix LiveView dashboard (`dashboard/`) that provides r - **Topic Parsing**: Topics may arrive as lists or strings, handle both formats - **Client ID Conflicts**: Use unique client IDs to prevent connection instability +## Development Roadmap + +### Phase 1: System Metrics Collection (Completed) +- ✅ **SystemMetrics Module**: `server/lib/systant/system_metrics.ex` - Comprehensive metrics collection +- ✅ **CPU Metrics**: Load averages (1/5/15min) and utilization via `:cpu_sup` +- ✅ **Memory Metrics**: System memory data and monitoring via `:memsup` +- ✅ **Disk Metrics**: Disk usage and capacity for all mounted drives via `:disksup` +- ✅ **System Info**: Uptime, Erlang/OTP versions, scheduler info +- ✅ **System Alarms**: Active os_mon alarms (disk_almost_full, memory_high_watermark, etc.) +- ✅ **MQTT Integration**: Real metrics published every 30 seconds replacing simple messages +- 🔄 **Network Metrics**: TODO - Interface statistics, bandwidth utilization +- 🔄 **GPU Metrics**: TODO - NVIDIA/AMD GPU utilization, temperatures, memory usage + +#### Implementation Details +- Uses Erlang's built-in `:os_mon` application (cpu_sup, memsup, disksup) +- Collects active system alarms from `:alarm_handler` with structured format +- Graceful error handling with fallbacks when metrics unavailable +- JSON payload structure: `{timestamp, hostname, cpu, memory, disk, system, alarms}` +- Dashboard automatically receives and displays real-time system data and alerts +- Alarm format: `{severity, path/details, id}` for clean consumption + +### Phase 2: Command System +- Subscribe to `systant/+/commands` in MqttClient +- Implement secure command execution framework with validation/whitelisting +- Support commands like: restart services, update packages, system queries +- Response mechanism to send command results back via MQTT + +### Phase 3: Home 
Assistant Integration +- Custom MQTT integration following Home Assistant patterns +- Auto-discovery of systant hosts via MQTT discovery protocol +- Create entities for metrics (sensors) and commands (buttons/services) +- Dashboard cards and automation support + ### Future Plans -- Integration with Home Assistant via custom MQTT integration -- Expandable command handling for host-specific automation -- Multi-host deployment for comprehensive system monitoring \ No newline at end of file +- Multi-host deployment for comprehensive system monitoring +- Advanced alerting and threshold monitoring +- Historical data retention and trending \ No newline at end of file diff --git a/README.md b/README.md index bc6ed11..07d8f11 100644 --- a/README.md +++ b/README.md @@ -1,60 +1,79 @@ # Systant -An Elixir application that runs as a systemd daemon to: -1. Publish system stats to MQTT every 30 seconds -2. Listen for commands over MQTT and log them to syslog +A comprehensive Elixir-based system monitoring solution with real-time dashboard, designed for deployment across multiple NixOS hosts. -## Configuration +## Components -Edit `config/config.exs` to configure MQTT connection: - -```elixir -config :systant, Systant.MqttClient, - host: "localhost", - port: 1883, - client_id: "systant", - username: nil, - password: nil, - stats_topic: "system/stats", - command_topic: "system/commands", - publish_interval: 30_000 -``` - -## Building - -```bash -mix deps.get -mix compile -``` - -## Running - -```bash -# Development -mix run --no-halt - -# Production release -MIX_ENV=prod mix release -_build/prod/rel/systant/bin/systant start -``` - -## Systemd Installation - -1. Build production release -2. Copy binary to `/usr/local/bin/` -3. Copy `systant.service` to `/etc/systemd/system/` -4. 
Enable and start: - -```bash -sudo systemctl enable systant -sudo systemctl start systant -``` +- **Server** (`server/`): Elixir OTP application that collects and publishes system metrics via MQTT +- **Dashboard** (`dashboard/`): Phoenix LiveView web dashboard for real-time monitoring +- **Nix Integration**: Complete NixOS module and packaging for easy deployment ## Features -- Publishes "Hello from systant" stats every 30 seconds to `system/stats` topic -- Listens on `system/commands` topic and logs received messages -- Configurable MQTT connection settings -- Runs as systemd daemon with auto-restart -- Logs to system journal +### System Metrics Collection +- **CPU**: Load averages (1/5/15min) and utilization monitoring +- **Memory**: System memory usage and swap monitoring +- **Disk**: Usage statistics and capacity monitoring for all drives +- **System Alarms**: Real-time alerts for disk space, memory pressure, etc. +- **System Info**: Uptime, Erlang/OTP versions, scheduler information + +### Real-time Dashboard +- Phoenix LiveView interface showing all connected hosts +- Live system metrics and alert monitoring +- Automatic reconnection and error handling + +### MQTT Integration +- Publishes comprehensive system metrics every 30 seconds +- Uses hostname-based topics: `systant/${hostname}/stats` +- Structured JSON payloads with full system data +- Configurable MQTT broker connection + +## Quick Start + +### Development Environment +```bash +# Enter Nix development shell +nix develop + +# Run the server +cd server && mix run --no-halt + +# Run the dashboard (separate terminal) +just dashboard +# or: cd dashboard && mix phx.server +``` + +### Production Deployment (NixOS) +```bash +# Build and install via Nix +nix build +sudo nixos-rebuild switch --flake . 
+ +# Or use the NixOS module in your configuration: +# imports = [ ./path/to/systant/nix/nixos-module.nix ]; +# services.systant.enable = true; +``` + +## Configuration + +Default MQTT configuration (customizable via environment variables): +- **Host**: `mqtt.home:1883` +- **Topics**: `systant/${hostname}/stats` and `systant/${hostname}/commands` +- **Interval**: 30 seconds +- **Client ID**: `systant_${random}` (auto-generated to avoid conflicts) + +## Architecture + +- **Server**: `server/lib/systant/mqtt_client.ex` - MQTT publishing and command handling +- **Metrics**: `server/lib/systant/system_metrics.ex` - System data collection using `:os_mon` +- **Dashboard**: `dashboard/lib/dashboard/mqtt_subscriber.ex` - Real-time MQTT data consumption +- **Nix**: `nix/package.nix` and `nix/nixos-module.nix` - Complete packaging and deployment + +## Roadmap + +- ✅ **Phase 1**: System metrics collection with real-time dashboard +- 🔄 **Phase 2**: Command system for remote host management +- 🔄 **Phase 3**: Home Assistant integration for automation + +See `CLAUDE.md` for detailed development context and implementation notes. 
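The topic and client ID conventions described in the README can be sketched as follows. This is a minimal illustration, not code from this change: the `TopicSketch` module and its function names are hypothetical, and the 4-byte random suffix is only an assumed way of producing the `${random}` part of the client ID. The parser accepts topics as either strings or lists of segments, since `CLAUDE.md` notes that both forms can arrive.

```elixir
defmodule TopicSketch do
  # Hypothetical helper module for illustration; not part of the diff.

  # Builds the per-host stats and command topics.
  def topics(hostname) when is_binary(hostname) do
    %{
      stats: "systant/#{hostname}/stats",
      commands: "systant/#{hostname}/commands"
    }
  end

  # Generates a client ID with a random hex suffix (assumed scheme)
  # so that two instances do not share one ID.
  def client_id do
    suffix = :crypto.strong_rand_bytes(4) |> Base.encode16(case: :lower)
    "systant_#{suffix}"
  end

  # Extracts the hostname from a topic given as a string or as a list
  # of already-split segments.
  def host_from_topic(topic) when is_binary(topic) do
    host_from_topic(String.split(topic, "/"))
  end

  def host_from_topic(["systant", host, _kind]), do: {:ok, host}
  def host_from_topic(_other), do: :error
end
```

A subscriber listening on a wildcard such as `systant/+/stats` could call `TopicSketch.host_from_topic/1` on each incoming topic to decide which host a payload belongs to.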
diff --git a/server/lib/systant/mqtt_client.ex b/server/lib/systant/mqtt_client.ex index 9c832e3..c18171c 100644 --- a/server/lib/systant/mqtt_client.ex +++ b/server/lib/systant/mqtt_client.ex @@ -30,13 +30,9 @@ defmodule Systant.MqttClient do {:ok, _pid} -> Logger.info("MQTT client connected successfully") - # Send immediate hello message - hello_msg = %{ - message: "Hello - systant started", - timestamp: DateTime.utc_now() |> DateTime.to_iso8601(), - hostname: get_hostname() - } - Tortoise.publish(client_id, config[:stats_topic], Jason.encode!(hello_msg), qos: 0) + # Send immediate system metrics on startup + startup_stats = Systant.SystemMetrics.collect_metrics() + Tortoise.publish(client_id, config[:stats_topic], Jason.encode!(startup_stats), qos: 0) schedule_stats_publish(config[:publish_interval]) {:ok, %{config: config, client_id: client_id}} @@ -57,23 +53,19 @@ defmodule Systant.MqttClient do {:noreply, state} end - def terminate(reason, state) do + def terminate(reason, _state) do Logger.info("MQTT client terminating: #{inspect(reason)}") :ok end defp publish_stats(config, client_id) do - stats = %{ - message: "Hello from systant", - timestamp: DateTime.utc_now() |> DateTime.to_iso8601(), - hostname: get_hostname() - } + stats = Systant.SystemMetrics.collect_metrics() payload = Jason.encode!(stats) case Tortoise.publish(client_id, config[:stats_topic], payload, qos: 0) do :ok -> - Logger.info("Published stats: #{payload}") + Logger.info("Published system metrics for #{stats.hostname}") {:error, reason} -> Logger.error("Failed to publish stats: #{inspect(reason)}") end @@ -82,11 +74,4 @@ defmodule Systant.MqttClient do defp schedule_stats_publish(interval) do Process.send_after(self(), :publish_stats, interval) end - - defp get_hostname do - case :inet.gethostname() do - {:ok, hostname} -> List.to_string(hostname) - _ -> "unknown" - end - end end \ No newline at end of file diff --git a/server/lib/systant/system_metrics.ex 
b/server/lib/systant/system_metrics.ex new file mode 100644 index 0000000..bd8646d --- /dev/null +++ b/server/lib/systant/system_metrics.ex @@ -0,0 +1,203 @@ +defmodule Systant.SystemMetrics do + @moduledoc """ + Collects system metrics using Erlang's built-in :os_mon application. + Provides CPU, memory, and disk statistics plus active system alarms. + """ + + require Logger + + @doc """ + Collect all available system metrics + """ + def collect_metrics do + %{ + timestamp: DateTime.utc_now() |> DateTime.to_iso8601(), + hostname: get_hostname(), + cpu: collect_cpu_metrics(), + memory: collect_memory_metrics(), + disk: collect_disk_metrics(), + system: collect_system_info(), + alarms: collect_active_alarms() + } + end + + @doc """ + Collect CPU metrics using cpu_sup + """ + def collect_cpu_metrics do + try do + # Get CPU utilization (average over all cores) + cpu_util = case :cpu_sup.util() do + {:badrpc, _} -> nil + util when is_number(util) -> util + _ -> nil + end + + # Get load averages (1, 5, 15 minutes); cpu_sup reports them scaled so that 256 means a load of 1.00 + load_avg = case :cpu_sup.avg1() do + {:badrpc, _} -> nil + avg when is_number(avg) -> + %{ + avg1: avg, + avg5: safe_call(:cpu_sup, :avg5, []), + avg15: safe_call(:cpu_sup, :avg15, []) + } + _ -> nil + end + + %{ + utilization: cpu_util, + load_average: load_avg + } + rescue + _ -> + Logger.warning("CPU metrics collection failed") + %{utilization: nil, load_average: nil} + end + end + + @doc """ + Collect memory metrics using memsup + """ + def collect_memory_metrics do + try do + # Get system memory data + system_memory = case :memsup.get_system_memory_data() do + {:badrpc, _} -> nil + data when is_list(data) -> Enum.into(data, %{}) + _ -> nil + end + + # Get memory check interval + check_interval = safe_call(:memsup, :get_check_interval, []) + + %{ + system: system_memory, + check_interval: check_interval + } + rescue + _ -> + Logger.warning("Memory metrics collection failed") + %{system: nil, check_interval: nil} + end + end + + @doc """ + Collect disk metrics using
disksup + """ + def collect_disk_metrics do + try do + # Get disk data for all disks + disks = case :disksup.get_disk_data() do + {:badrpc, _} -> [] + data when is_list(data) -> + Enum.map(data, fn {path, kb_size, capacity} -> + %{ + path: List.to_string(path), + size_kb: kb_size, + capacity_percent: capacity + } + end) + _ -> [] + end + + %{disks: disks} + rescue + _ -> + Logger.warning("Disk metrics collection failed") + %{disks: []} + end + end + + @doc """ + Collect active system alarms from alarm_handler + """ + def collect_active_alarms do + try do + # Get all active alarms from the alarm handler + case :alarm_handler.get_alarms() do + alarms when is_list(alarms) -> + Enum.map(alarms, fn {alarm_id, alarm_desc} -> + format_alarm(alarm_id, alarm_desc) + end) + _ -> [] + end + rescue + _ -> + Logger.warning("Alarm collection failed") + [] + end + end + + @doc """ + Collect general system information + """ + def collect_system_info do + try do + %{ + uptime_seconds: get_uptime(), + elixir_version: System.version(), + otp_release: System.otp_release(), + schedulers: System.schedulers(), + schedulers_online: System.schedulers_online() + } + rescue + _ -> + Logger.warning("System info collection failed") + %{} + end + end + + # Private helper functions + + defp get_hostname do + case :inet.gethostname() do + {:ok, hostname} -> List.to_string(hostname) + _ -> "unknown" + end + end + + defp get_uptime do + try do + # Wall-clock time since the Erlang VM started (not host uptime), in milliseconds, converted to seconds + :erlang.statistics(:wall_clock) |> elem(0) |> div(1000) + rescue + _ -> nil + end + end + + defp safe_call(module, function, args) do + try do + apply(module, function, args) + catch + :exit, _ -> nil + :error, _ -> nil + end + end + + defp format_alarm(alarm_id, _alarm_desc) do + case alarm_id do + {:disk_almost_full, path} when is_list(path) -> + %{ + severity: "warning", + path: List.to_string(path), + id: "disk_almost_full" + } + :system_memory_high_watermark -> + %{ + severity: "critical",
+ id: "system_memory_high_watermark" + } + atom when is_atom(atom) -> + %{ + severity: "info", + id: Atom.to_string(atom) + } + _ -> + %{ + severity: "info", + id: inspect(alarm_id) + } + end + end +end \ No newline at end of file diff --git a/server/mix.exs b/server/mix.exs index 6a4bee2..570279f 100644 --- a/server/mix.exs +++ b/server/mix.exs @@ -15,7 +15,7 @@ defmodule SystemStatsDaemon.MixProject do # Run "mix help compile.app" to learn about applications. def application do [ - extra_applications: [:logger], + extra_applications: [:logger, :os_mon], mod: {Systant.Application, []} ] end
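Two details of `:os_mon` are easy to miss when consuming these metrics. First, `cpu_sup` load averages are integers in which 256 corresponds to a load of 1.00. Second, `:alarm_handler.get_alarms/0` returns `{alarm_id, description}` tuples: the disk alarm ID is `{:disk_almost_full, mount_point}`, while the memory alarm ID is the bare atom `:system_memory_high_watermark`. The sketch below illustrates both points under those assumptions. The `MetricsSketch` module and its function names are hypothetical, and rounding to two decimals is an arbitrary choice.

```elixir
defmodule MetricsSketch do
  # Hypothetical helper module for illustration; not part of the diff.

  # Converts a raw cpu_sup load value (256 == load of 1.00) to a float.
  # Anything that is not an integer (nil, error tuples) becomes nil.
  def scale_load(raw) when is_integer(raw), do: Float.round(raw / 256, 2)
  def scale_load(_other), do: nil

  # Formats one {alarm_id, description} tuple as reported by alarm_handler.
  def format_alarm({{:disk_almost_full, path}, _desc}) when is_list(path),
    do: %{severity: "warning", id: "disk_almost_full", path: List.to_string(path)}

  def format_alarm({:system_memory_high_watermark, _desc}),
    do: %{severity: "critical", id: "system_memory_high_watermark"}

  # Any other atom alarm ID, for example :process_memory_high_watermark.
  def format_alarm({id, _desc}) when is_atom(id),
    do: %{severity: "info", id: Atom.to_string(id)}

  # Fallback for alarm IDs of unknown shape.
  def format_alarm({id, _desc}),
    do: %{severity: "info", id: inspect(id)}
end
```

Matching on the whole `{alarm_id, description}` tuple in the function head keeps each alarm shape visible in one place, and the final clause ensures an unexpected alarm ID still produces a JSON-encodable map.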