Alarms & notifications

CORE-M models alarms as first-class, tenant-scoped entities — not just log lines. An alarm has a lifecycle, an owner, a history of comments and state changes, and it drives notifications. This page covers the alarm entity and its lifecycle, then the notification system that routes alarms to people and systems.

Alarms are created by rule chains, device state changes, protocol adapters, or system monitors. They are stored in the Aerospike namespace "rules", set "alarms", keyed by {tenant_id}:{alarm_id}.

An alarm record carries alarm_id, tenant_id, originator_type, originator_id, type, severity, status, assignee_user_id, start_ts, end_ts, acknowledged_at, cleared_at, details_json, latest_event_hash, created_at, updated_at, and version. Secondary indexes support filtering by tenant_id, originator_id, severity, status, type, and the timestamps — which powers the alarm-center filters.
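As a rough illustration, the record and its key could be modeled as below. This is a minimal sketch, not the platform's schema: the field names come from the list above, but the Python types, the defaults (for example version starting at 1), and the helper name alarm_key are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alarm:
    # Field names follow the record description; types are assumed.
    alarm_id: str
    tenant_id: str
    originator_type: str
    originator_id: str
    type: str
    severity: str
    status: str = "active_unack"
    assignee_user_id: Optional[str] = None
    start_ts: Optional[int] = None          # assumed: epoch milliseconds
    end_ts: Optional[int] = None
    acknowledged_at: Optional[int] = None
    cleared_at: Optional[int] = None
    details_json: str = "{}"
    latest_event_hash: Optional[str] = None
    created_at: Optional[int] = None
    updated_at: Optional[int] = None
    version: int = 1                        # assumed starting value


def alarm_key(tenant_id: str, alarm_id: str) -> str:
    """Build the {tenant_id}:{alarm_id} record key (hypothetical helper)."""
    return f"{tenant_id}:{alarm_id}"


alarm = Alarm(alarm_id="A1", tenant_id="T1", originator_type="device",
              originator_id="D1", type="high_temperature", severity="critical")
print(alarm_key(alarm.tenant_id, alarm.alarm_id))
```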

When a create-alarm node fires (say device D1 in tenant T1 reports temperature 95), an alarm is created with status="active_unack", the configured severity, originator_id="D1", and type from the rule. The platform then:

  • Publishes alarm.created.T1.{alarm_id}.
  • Pushes alarm_update to WebSocket clients subscribed to tenant alarms.
  • Updates corem_alarms_total{severity,status}.
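The three side effects above can be sketched as a list of actions derived from the new alarm. This is only an illustration: the subject layout is inferred from the alarm.created.T1.{alarm_id} example (with T1 read as the tenant ID), and the function names and the tuple format are hypothetical.

```python
def event_subject(entity, event, tenant_id, record_id):
    """Build a subject such as alarm.created.T1.A1 (layout inferred from the examples)."""
    return f"{entity}.{event}.{tenant_id}.{record_id}"


def on_alarm_created(alarm):
    """Return the side effects for a newly created alarm as plain tuples."""
    return [
        ("publish", event_subject("alarm", "created",
                                  alarm["tenant_id"], alarm["alarm_id"])),
        ("websocket", "alarm_update"),
        ("metric", "corem_alarms_total",
         {"severity": alarm["severity"], "status": alarm["status"]}),
    ]


actions = on_alarm_created({"tenant_id": "T1", "alarm_id": "A1",
                            "severity": "critical", "status": "active_unack"})
```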

An alarm has two orthogonal dimensions: active vs cleared (the condition) and unacknowledged vs acknowledged (operator attention). The four status values combine them:

| Status | Condition | Operator |
| --- | --- | --- |
| active_unack | Still active | Not yet acknowledged |
| active_ack | Still active | Acknowledged |
| cleared_unack | Cleared | Not yet acknowledged |
| cleared_ack | Cleared | Acknowledged |

Conceptually this is the familiar triggered → acknowledged → cleared flow, with the twist that a cleared alarm can reopen if the condition recurs.

stateDiagram-v2
  [*] --> active_unack: alarm created
  active_unack --> active_ack: acknowledge
  active_unack --> cleared_unack: clear
  active_ack --> cleared_ack: clear
  cleared_unack --> cleared_ack: acknowledge
  cleared_ack --> active_unack: condition recurs (reopen)
  cleared_unack --> active_unack: condition recurs (reopen)

  1. Acknowledge. A user with the alarms permission acknowledges the alarm, optionally with a comment (“Investigating”). Status moves from active_unack to active_ack (or from cleared_unack to cleared_ack if the alarm has already cleared); acknowledged_by and acknowledged_at are set; the comment is appended to history; audit event alarm.acknowledged is written.

  2. Clear. Clearing with a resolution (“Fan replaced”) moves an acknowledged alarm to cleared_ack (an alarm that was never acknowledged moves to cleared_unack instead), records cleared_by, cleared_at, and resolution, and publishes alarm.cleared.T1.{alarm_id}.

  3. Reopen. If the same alarm condition triggers again within the deduplication window, the alarm reopens to active_unack, reopened_count increments, and the previous clear metadata remains in history.
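The lifecycle above amounts to a small state machine. The sketch below encodes the six transitions from the diagram as a lookup table; the action names ("acknowledge", "clear", "reopen") and the function next_status are illustrative, not platform API.

```python
# (current status, action) -> next status, copied from the state diagram.
TRANSITIONS = {
    ("active_unack", "acknowledge"): "active_ack",
    ("active_unack", "clear"): "cleared_unack",
    ("active_ack", "clear"): "cleared_ack",
    ("cleared_unack", "acknowledge"): "cleared_ack",
    ("cleared_ack", "reopen"): "active_unack",
    ("cleared_unack", "reopen"): "active_unack",
}


def next_status(status, action):
    """Return the status after applying an action, or raise if not allowed."""
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an alarm in status {status}") from None


status = "active_unack"
for action in ("acknowledge", "clear", "reopen"):
    status = next_status(status, action)
# status is back at "active_unack" after the reopen
```

Keeping the transitions in one table makes it easy to reject actions the diagram does not allow, such as acknowledging an alarm that is already acknowledged.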

Alarms are deduplicated by originator and type. If an alarm is already active for (tenant, device, type), a repeat trigger updates the existing alarm rather than spawning a new one: latest_event_hash and updated_at change, repeat_count increments, and alarm.updated.T1.{alarm_id} is published. This keeps a noisy sensor from flooding the alarm center with duplicates.

Severity escalates on repeat. If an active warning alarm receives a new trigger at critical severity, the alarm’s severity becomes critical, escalation_count increments, and the critical-alarm notification rules are re-evaluated.
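A minimal in-memory sketch of deduplication and severity escalation follows. It assumes a dict of active alarms keyed by (tenant, originator, type), a severity ordering with only the two levels named above, and that an escalating trigger also counts as a repeat; apply_trigger and the returned labels are hypothetical.

```python
# Assumed ordering; only the two severities mentioned in the text are listed.
SEVERITY_RANK = {"warning": 1, "critical": 2}


def apply_trigger(active, tenant_id, originator_id, alarm_type,
                  severity, event_hash, now):
    """Create a new alarm, or update the existing active one for the same key."""
    key = (tenant_id, originator_id, alarm_type)
    alarm = active.get(key)
    if alarm is None:
        alarm = {
            "status": "active_unack",
            "severity": severity,
            "latest_event_hash": event_hash,
            "updated_at": now,
            "repeat_count": 0,
            "escalation_count": 0,
        }
        active[key] = alarm
        return alarm, "created"

    # Repeat trigger: refresh the existing alarm instead of creating another.
    alarm["latest_event_hash"] = event_hash
    alarm["updated_at"] = now
    alarm["repeat_count"] += 1
    if SEVERITY_RANK[severity] > SEVERITY_RANK[alarm["severity"]]:
        alarm["severity"] = severity
        alarm["escalation_count"] += 1
        return alarm, "escalated"   # caller would re-evaluate critical rules
    return alarm, "updated"


active = {}
apply_trigger(active, "T1", "D1", "high_temperature", "warning", "h1", 100)
apply_trigger(active, "T1", "D1", "high_temperature", "warning", "h2", 110)
alarm, outcome = apply_trigger(active, "T1", "D1", "high_temperature",
                               "critical", "h3", 120)
```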

Every lifecycle transition — acknowledge, clear, reopen, comment, assignment — is recorded in the alarm’s history, and notable transitions emit audit events. This gives operators a complete, append-only record of who did what and when.

Lifecycle transitions use Aerospike generation checks (compare-and-swap) to prevent lost updates. If two operators load alarm A1 at version 4 and both try to acknowledge:

  • The first write succeeds.
  • The second, still holding stale version 4, is rejected with ABORTED / HTTP 409, and the response returns the current version and status so the UI can refresh.

The same per-record CAS applies to bulk actions: acknowledging 25 alarms at once applies CAS per alarm and reports failures per alarm_id, so the UI can show partial success when some alarms changed concurrently.
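The version check and the per-alarm bulk result can be illustrated with an in-memory stand-in. This sketch does not call the Aerospike client; the dict store, the VersionConflict exception, and the shape of the bulk response are assumptions chosen to mirror the behavior described above.

```python
class VersionConflict(Exception):
    """Raised when the caller's version is stale (maps to ABORTED / HTTP 409)."""

    def __init__(self, alarm_id, current_version, current_status):
        super().__init__(f"stale version for {alarm_id}")
        self.alarm_id = alarm_id
        self.current_version = current_version
        self.current_status = current_status


ACK_TARGET = {"active_unack": "active_ack", "cleared_unack": "cleared_ack"}


def acknowledge(store, alarm_id, expected_version, user_id):
    """Acknowledge only if the stored version still matches (compare-and-swap)."""
    alarm = store[alarm_id]
    if alarm["version"] != expected_version:
        raise VersionConflict(alarm_id, alarm["version"], alarm["status"])
    alarm["status"] = ACK_TARGET.get(alarm["status"], alarm["status"])
    alarm["acknowledged_by"] = user_id
    alarm["version"] += 1
    return alarm


def bulk_acknowledge(store, requests, user_id):
    """Apply the check per alarm; requests is a list of (alarm_id, version)."""
    succeeded, failed = [], {}
    for alarm_id, expected_version in requests:
        try:
            acknowledge(store, alarm_id, expected_version, user_id)
            succeeded.append(alarm_id)
        except VersionConflict as conflict:
            failed[alarm_id] = {
                "http_status": 409,
                "current_version": conflict.current_version,
                "current_status": conflict.current_status,
            }
    return {"succeeded": succeeded, "failed": failed}


store = {"A1": {"status": "active_unack", "version": 4},
         "A2": {"status": "active_unack", "version": 7}}
acknowledge(store, "A1", 4, "alice")            # first operator wins
result = bulk_acknowledge(store, [("A1", 4), ("A2", 7)], "bob")
```

Here the second request for A1 still carries version 4, so it lands in the failed map with the current version and status, while A2 succeeds.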

Alarms drive notifications: messages delivered to people or systems. Notification targets, templates, and rules are tenant-scoped and stored in the Aerospike namespace "rules", set "notifications", keyed by {tenant_id}:{notification_id}.

| Target type | Notes |
| --- | --- |
| email | SMTP delivery |
| webhook | Generic HTTP POST |
| Slack-compatible webhook | Incoming-webhook style |
| SMS provider | Via configured SMS gateway |
| in-app notification | Surfaced inside CORE-M |
| Redpanda event | Published for downstream consumers |

Templates render the message body with variable substitution from alarm and device fields. Templates are versioned: editing a template keeps the previous content as an earlier version and writes an audit event with template_id, old_version, new_version, and actor_user_id.

A notification rule ties it together: a matcher (e.g. severity="critical" and status="active_unack"), one or more targets, a template, plus quiet hours, escalation, suppression, and a deduplication window. When an alarm matches a rule, a notification record is created with status="scheduled", notification.scheduled.T1.{notification_id} is published, and the worker renders the template.
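Rule matching and template rendering might look like the sketch below. It assumes a matcher is a flat set of field equality checks and uses Python's string.Template with $name placeholders as a stand-in; the actual CORE-M matcher and template syntax are not specified here, and the function names are hypothetical.

```python
from string import Template


def rule_matches(matcher, alarm):
    """True when every matcher field equals the alarm's value for that field."""
    return all(alarm.get(field) == value for field, value in matcher.items())


def schedule(rule, alarm):
    """Create one scheduled notification record per target if the rule matches."""
    if not rule_matches(rule["matcher"], alarm):
        return []
    return [{"target": target, "status": "scheduled", "alarm": alarm}
            for target in rule["targets"]]


def render(template_text, alarm):
    """Substitute alarm fields; unknown placeholders are left untouched."""
    return Template(template_text).safe_substitute(alarm)


rule = {
    "matcher": {"severity": "critical", "status": "active_unack"},
    "targets": ["email", "webhook"],
    "template": "[$severity] $type on $originator_id",
}
alarm = {"severity": "critical", "status": "active_unack",
         "type": "high_temperature", "originator_id": "D1"}

records = schedule(rule, alarm)
body = render(rule["template"], alarm)
```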

flowchart TD
  match([Alarm matches rule]) --> sched["status = scheduled"]
  sched --> send{Deliver to target}
  send -->|accepted| delivered["status = delivered<br/>record delivered_at,<br/>provider_message_id"]
  send -->|error| retry{Retries left?}
  retry -->|yes| backoff["Exponential backoff,<br/>retry per rule policy"]
  backoff --> send
  retry -->|no| failed["status = failed<br/>store failure_reason"]

On success the notification becomes delivered, recording delivered_at and the provider_message_id, and publishing notification.delivered.T1.{notification_id}. On failure the worker retries with exponential backoff per the rule’s policy; after max attempts the status becomes failed, failure_reason stores the final provider error, and corem_notification_delivery_failures_total{target_type} increments.
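The delivery loop can be sketched as follows. The send callable, the max_attempts and base_delay parameters, and the doubling schedule are assumptions standing in for the rule's retry policy; sleep is injectable so the loop can be exercised without waiting.

```python
import time


def deliver(notification, send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try to deliver; retry with exponential backoff, then mark as failed."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            provider_message_id = send(notification)
        except Exception as exc:
            last_error = str(exc)
            # Back off only if another attempt will follow.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
            continue
        notification["status"] = "delivered"
        notification["delivered_at"] = time.time()
        notification["provider_message_id"] = provider_message_id
        return notification

    notification["status"] = "failed"
    notification["failure_reason"] = last_error  # final provider error
    return notification


def always_down(_notification):
    raise RuntimeError("provider unavailable")


delays = []
failed = deliver({"status": "scheduled"}, always_down,
                 max_attempts=3, base_delay=1.0, sleep=delays.append)
```

With three attempts and a one-second base delay, the worker waits 1 s and then 2 s before giving up; a real worker would also increment the failure counter at that point.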

Notification rules support several controls to keep alerting useful rather than noisy:

  • Quiet hours. Defer non-critical notifications during a window (e.g. 22:00–07:00 tenant local time). A warning alarm at 23:00 produces a deferred notification that is held until 07:00, with no provider call made in the meantime.
  • Critical bypass. A rule can allow critical alarms to bypass quiet hours — a critical alarm during the quiet window is scheduled immediately.
  • Escalation. If an alarm stays active_unack past a threshold (e.g. 15 minutes), the escalation scheduler schedules the next level’s notification (e.g. level 2 → “oncall-manager”) and records escalation_level in the alarm history.
  • Suppression and dedup windows. Per-target suppression and deduplication windows keep repeated triggers from generating repeated notifications.
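Quiet hours, critical bypass, and the escalation threshold can be sketched with plain datetime arithmetic. The example assumes naive datetimes already expressed in tenant local time; the function names and default values are illustrative and simply reuse the figures from the bullets above.

```python
from datetime import datetime, time, timedelta


def in_quiet_hours(now_t, start, end):
    """True if now_t falls in [start, end), including windows that span midnight."""
    if start <= end:
        return start <= now_t < end
    return now_t >= start or now_t < end


def send_time(now, severity, start=time(22, 0), end=time(7, 0),
              critical_bypass=True):
    """Return when the notification may be sent (now, or the end of quiet hours)."""
    if severity == "critical" and critical_bypass:
        return now
    if not in_quiet_hours(now.time(), start, end):
        return now
    release = datetime.combine(now.date(), end)
    if release <= now:
        release += timedelta(days=1)   # window ends on the next calendar day
    return release


def escalation_due(status, active_since, now, threshold=timedelta(minutes=15)):
    """True once an alarm has stayed active_unack for at least the threshold."""
    return status == "active_unack" and now - active_since >= threshold


at_23 = datetime(2024, 1, 1, 23, 0)
deferred_until = send_time(at_23, "warning")
immediate = send_time(at_23, "critical")
```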

Where tenant policy allows, users can set personal notification preferences — for example opting out of email while keeping in-app notifications — and the change is audited (notification.preference.updated). Every notification rule change and delivery state change produces an audit event, so the full alerting configuration and its delivery outcomes are traceable.