Alarms & notifications

CORE-M models alarms as first-class, tenant-scoped entities — not just log lines. An alarm has a lifecycle, an owner, a history of comments and state changes, and it drives notifications. This page covers the alarm entity and its lifecycle, then the notification system that routes alarms to people and systems.

Alarms are created by rule chains, device state changes, protocol adapters, or system monitors, and are stored in Aerospike namespace rules, set alarms, keyed by {tenant_id}:{alarm_id}.

The alarm entity

An alarm record carries alarm_id, tenant_id, originator_type, originator_id, type, severity, status, assignee_user_id, start_ts, end_ts, acknowledged_at, cleared_at, details_json, latest_event_hash, created_at, updated_at, and version. Secondary indexes support filtering by tenant_id, originator_id, severity, status, type, and the timestamps — which powers the alarm-center filters.
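As a rough illustration, the record and its storage key could be modelled as below. This is a sketch only: the field names come from the list above, but the Python types, the defaults, the use of integer timestamps, and the `alarm_key` helper are assumptions for the example, not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alarm:
    # Field names follow the record described above; types are assumed.
    alarm_id: str
    tenant_id: str
    originator_type: str
    originator_id: str
    type: str
    severity: str
    status: str
    start_ts: int
    created_at: int
    updated_at: int
    version: int = 1
    assignee_user_id: Optional[str] = None
    end_ts: Optional[int] = None
    acknowledged_at: Optional[int] = None
    cleared_at: Optional[int] = None
    details_json: str = "{}"
    latest_event_hash: Optional[str] = None


def alarm_key(tenant_id: str, alarm_id: str) -> str:
    """Build the {tenant_id}:{alarm_id} record key (hypothetical helper)."""
    return f"{tenant_id}:{alarm_id}"
```

Prefixing the key with the tenant keeps one tenant's alarm IDs from colliding with another's, even if the IDs themselves are not globally unique.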

When a create-alarm node fires (say device D1 reports temperature 95), an alarm is created with status="ActiveUnack", the configured severity, originator_id="D1", and type from the rule. The platform then:

Publishes alarm.created.T1.{alarm_id}.
Pushes alarm_update to WebSocket clients subscribed to tenant alarms.
Updates corem_alarms_total{severity,status}.

Lifecycle

An alarm has two orthogonal dimensions: active vs cleared (the condition) and unacknowledged vs acknowledged (operator attention). The four status values combine them:

Status	Condition	Operator
`ActiveUnack`	Still active	Not yet acknowledged
`ActiveAck`	Still active	Acknowledged
`ClearedUnack`	Cleared	Not yet acknowledged
`ClearedAck`	Cleared	Acknowledged

Conceptually this is the familiar triggered → acknowledged → cleared flow, with the twist that a cleared alarm can reopen if the condition recurs.

stateDiagram-v2
  [*] --> ActiveUnack: alarm created
  ActiveUnack --> ActiveAck: acknowledge
  ActiveUnack --> ClearedUnack: clear
  ActiveAck --> ClearedAck: clear
  ClearedUnack --> ClearedAck: acknowledge
  ClearedAck --> ActiveUnack: condition recurs (reopen)
  ClearedUnack --> ActiveUnack: condition recurs (reopen)

Acknowledge. A user with the alarms permission acknowledges the alarm, optionally with a comment (“Investigating”). An active alarm moves from ActiveUnack to ActiveAck (a cleared one from ClearedUnack to ClearedAck); acknowledged_by and acknowledged_at are set; the comment is appended to history; and the audit event alarm.acknowledged is written.
Clear. Clearing with a resolution (“Fan replaced”) moves an acknowledged alarm from ActiveAck to ClearedAck, or an unacknowledged one from ActiveUnack to ClearedUnack. It records cleared_by, cleared_at, and resolution, and publishes alarm.cleared.T1.{alarm_id}.
Reopen. If the same alarm condition triggers again within the deduplication window, the alarm reopens to ActiveUnack, reopened_count increments, and the previous clear metadata remains in history.
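The state diagram can be expressed as a small transition table. The sketch below is illustrative: the status names and edges are taken from the diagram, while the action names (`acknowledge`, `clear`, `reopen`) and the `next_status` helper are assumptions for the example.

```python
# (current status, action) -> next status, mirroring the state diagram.
TRANSITIONS = {
    ("ActiveUnack", "acknowledge"): "ActiveAck",
    ("ActiveUnack", "clear"): "ClearedUnack",
    ("ActiveAck", "clear"): "ClearedAck",
    ("ClearedUnack", "acknowledge"): "ClearedAck",
    ("ClearedAck", "reopen"): "ActiveUnack",
    ("ClearedUnack", "reopen"): "ActiveUnack",
}


def next_status(status: str, action: str) -> str:
    """Return the status after `action`, or raise if the edge does not exist."""
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise ValueError(f"illegal transition: {action} from {status}") from None
```

For example, `next_status("ActiveUnack", "clear")` yields `"ClearedUnack"`, while acknowledging an alarm that is already `ActiveAck` raises `ValueError` because the diagram has no such edge.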

Deduplication and severity on re-fire

Alarms are deduplicated by originator and type. If an alarm is already active for (tenant, device, type), a repeat trigger updates the existing alarm rather than spawning a new one: latest_event_hash and updated_at change, repeat_count increments, and alarm.updated.T1.{alarm_id} is published. This keeps a noisy sensor from flooding the alarm center with duplicates.

Severity is one of Critical, Major, Minor, Warning, or Indeterminate. On a re-fire the alarm’s severity is overwritten unconditionally with the severity of the new trigger — there is no escalation counter, and severity can move down as well as up. If a Major alarm re-fires at Minor, the alarm becomes Minor; if it re-fires at Critical, it becomes Critical. The notification rules are then re-evaluated against the new severity.
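The re-fire behaviour can be sketched with an in-memory list standing in for the alarm store. The `on_trigger` function, the dict layout, and the initial `repeat_count` of 0 are assumptions for the example; reopening a cleared alarm and publishing alarm.updated are deliberately left out.

```python
ACTIVE_STATUSES = {"ActiveUnack", "ActiveAck"}


def on_trigger(alarms, tenant_id, originator_id, alarm_type, severity, event_hash, now):
    """Apply one trigger to `alarms` (a list of dicts); return (action, alarm)."""
    for alarm in alarms:
        same = (alarm["tenant_id"], alarm["originator_id"], alarm["type"]) == (
            tenant_id, originator_id, alarm_type)
        if same and alarm["status"] in ACTIVE_STATUSES:
            # Repeat trigger: update in place. Severity is overwritten
            # unconditionally, so it may go down as well as up.
            alarm["severity"] = severity
            alarm["latest_event_hash"] = event_hash
            alarm["updated_at"] = now
            alarm["repeat_count"] += 1
            return "updated", alarm
    # No active alarm for (tenant, originator, type): create a new one.
    alarm = {
        "tenant_id": tenant_id, "originator_id": originator_id, "type": alarm_type,
        "severity": severity, "status": "ActiveUnack",
        "latest_event_hash": event_hash, "created_at": now, "updated_at": now,
        "repeat_count": 0,
    }
    alarms.append(alarm)
    return "created", alarm
```

Firing the same (tenant, device, type) three times at Major, Minor, and Critical leaves a single alarm whose severity is Critical and whose `repeat_count` is 2.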

Comments, assignment, and history

Every change to an alarm (acknowledge, clear, reopen, comment, assignment) is recorded in the alarm’s history, and notable transitions emit audit events. This gives operators a complete, append-only record of who did what and when.

Concurrent acknowledgement (CAS)

Lifecycle transitions use Aerospike generation checks (compare-and-swap) to prevent lost updates. If two operators load alarm A1 at version 4 and both try to acknowledge:

The first write succeeds.
The second, still holding stale version 4, is rejected with ABORTED / HTTP 409, and the response returns the current version and status so the UI can refresh.

The same per-record CAS applies to bulk actions: acknowledging 25 alarms at once applies CAS per alarm and reports failures per alarm_id, so the UI can show partial success when some alarms changed concurrently.
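The version check and the per-alarm reporting for bulk actions can be sketched as follows. This models the behaviour with a plain dict and does not call Aerospike; `VersionConflict`, `acknowledge`, `bulk_acknowledge`, and the result layout are hypothetical names chosen for the example.

```python
class VersionConflict(Exception):
    """Raised when the caller's version is stale (maps to ABORTED / HTTP 409)."""

    def __init__(self, alarm_id, current_version, current_status):
        super().__init__(f"{alarm_id}: stale version, current is {current_version}")
        self.alarm_id = alarm_id
        self.current_version = current_version
        self.current_status = current_status


ACK = {"ActiveUnack": "ActiveAck", "ClearedUnack": "ClearedAck"}


def acknowledge(store, alarm_id, expected_version):
    """Compare-and-swap: write only if the stored version matches."""
    alarm = store[alarm_id]
    if alarm["version"] != expected_version:
        raise VersionConflict(alarm_id, alarm["version"], alarm["status"])
    alarm["status"] = ACK.get(alarm["status"], alarm["status"])
    alarm["version"] += 1
    return alarm


def bulk_acknowledge(store, requests):
    """Apply CAS per alarm and report success or conflict per alarm_id."""
    results = {}
    for alarm_id, expected_version in requests:
        try:
            acknowledge(store, alarm_id, expected_version)
            results[alarm_id] = {"ok": True}
        except VersionConflict as conflict:
            results[alarm_id] = {
                "ok": False,
                "http_status": 409,
                "current_version": conflict.current_version,
                "current_status": conflict.current_status,
            }
    return results
```

Returning the current version and status with each conflict lets the UI refresh just the alarms that changed, instead of failing the whole batch.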

Notifications

Alarms drive notifications: messages delivered to people or systems. Notification targets, templates, and rules are tenant-scoped and stored in Aerospike namespace rules, set notifications, keyed by {tenant_id}:{notification_id}.

Targets

Target type	Notes
smtp	Email delivery over SMTP
slack	Slack incoming-webhook style
teams	Microsoft Teams incoming webhook

Templates

Templates render the message body with variable substitution from alarm and device fields. Templates are versioned: editing a template keeps the previous content as an earlier version and writes an audit event with template_id, old_version, new_version, and actor_user_id.
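A minimal sketch of rendering and versioning follows. The page does not specify the template syntax, so the `${alarm_...}` and `${device_...}` placeholders (Python's `string.Template`) are an assumption, as are the `render` and `edit_template` helpers and the in-memory template layout.

```python
from string import Template


def render(body, alarm, device):
    """Substitute alarm and device fields into a template body."""
    fields = {f"alarm_{name}": value for name, value in alarm.items()}
    fields.update({f"device_{name}": value for name, value in device.items()})
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(body).safe_substitute(fields)


def edit_template(template, new_body, actor_user_id):
    """Keep the old content as a previous version; return the audit payload."""
    old_version = template["version"]
    template["history"].append({"version": old_version, "body": template["body"]})
    template["body"] = new_body
    template["version"] = old_version + 1
    return {
        "template_id": template["template_id"],
        "old_version": old_version,
        "new_version": template["version"],
        "actor_user_id": actor_user_id,
    }
```

With these helpers, `render("${alarm_severity} alarm on ${device_name}", {"severity": "Critical"}, {"name": "D1"})` produces `Critical alarm on D1`.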

Rules

A notification rule ties these pieces together: a matcher (e.g. severity="Critical" and status="ActiveUnack"), one or more targets, a template, plus quiet hours, escalation, suppression, and a deduplication window. When an alarm matches a rule, a notification record is created with status="scheduled", notification.scheduled.T1.{notification_id} is published, and the worker renders the template.
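Rule matching can be sketched as an equality check over alarm fields. The matcher format (a dict of field to required value), the rule layout, and the `matches` and `schedule_notifications` helpers are assumptions for the example; real matchers may well support richer expressions.

```python
def matches(matcher, alarm):
    """True when every field named in the matcher equals the alarm's value."""
    return all(alarm.get(field) == expected for field, expected in matcher.items())


def schedule_notifications(rules, alarm):
    """Create one 'scheduled' notification record per matching rule."""
    return [
        {
            "rule_id": rule["rule_id"],
            "targets": list(rule["targets"]),
            "template_id": rule["template_id"],
            "status": "scheduled",
        }
        for rule in rules
        if matches(rule["matcher"], alarm)
    ]
```

A rule with the matcher `{"severity": "Critical", "status": "ActiveUnack"}` would schedule a notification for a new Critical alarm but not for the same alarm once it has been acknowledged.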

Delivery and retry

flowchart TD
  match([Alarm matches rule]) --> sched["status = scheduled"]
  sched --> send{Deliver to target}
  send -->|accepted| delivered["status = delivered<br/>record delivered_at,<br/>provider_message_id"]
  send -->|error| retry{Retries left?}
  retry -->|yes| backoff["Exponential backoff,<br/>retry per rule policy"]
  backoff --> send
  retry -->|no| failed["status = failed<br/>store failure_reason"]

On success the notification becomes delivered, recording delivered_at and the provider_message_id, and publishing notification.delivered.T1.{notification_id}. On failure the worker retries with exponential backoff per the rule’s policy; after max attempts the status becomes failed, failure_reason stores the final provider error, and corem_notification_delivery_failures_total{target_type} increments.
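The delivery loop in the flowchart can be sketched as below. The `deliver_with_retry` function, its default policy values (3 attempts, 1 second base delay, factor 2), and the convention that `send` returns a provider message ID or raises are assumptions for the example; in practice the policy comes from the rule.

```python
import time


def deliver_with_retry(send, max_attempts=3, base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call `send` until it succeeds or attempts run out; return the record."""
    record = {"status": "scheduled", "attempts": 0}
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        record["attempts"] = attempt
        try:
            provider_message_id = send()
        except Exception as exc:
            # Remember the latest provider error; it becomes final on the last try.
            record["failure_reason"] = str(exc)
            if attempt == max_attempts:
                record["status"] = "failed"
                return record
            sleep(delay)      # wait before the next attempt
            delay *= factor   # exponential backoff
        else:
            record["status"] = "delivered"
            record["provider_message_id"] = provider_message_id
            record["delivered_at"] = time.time()
            record.pop("failure_reason", None)
            return record
    return record  # only reached when max_attempts < 1
```

Passing `sleep` as a parameter keeps the sketch testable: a test can record the requested delays instead of actually waiting.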

Quiet hours, escalation, and suppression

Notification rules support several controls to keep alerting useful rather than noisy:

Quiet hours. Defer non-critical notifications during a window (e.g. 22:00–07:00 tenant local time). A Warning alarm at 23:00 produces a deferred notification that is held until 07:00, with no provider call made in the meantime.
Critical bypass. A rule can allow Critical alarms to bypass quiet hours — a Critical alarm during the quiet window is scheduled immediately.
Escalation. If an alarm stays ActiveUnack past a threshold (e.g. 15 minutes), the escalation scheduler schedules the next level’s notification (e.g. level 2 → “oncall-manager”) and records escalation_level in the alarm history.
Suppression and dedup windows. Per-target suppression and deduplication windows keep repeated triggers from generating repeated notifications.
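The quiet-hours, critical-bypass, and escalation controls can be sketched as three small functions. The function names and defaults are assumptions for the example; datetimes are treated as naive values already in tenant local time, and daylight-saving changes are ignored.

```python
from datetime import datetime, time, timedelta


def in_quiet_hours(now, start, end):
    """True if `now` falls in [start, end), including windows that cross midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def next_send_time(now, severity, start=time(22, 0), end=time(7, 0), critical_bypass=True):
    """Return when a notification may be sent: now, or the end of quiet hours."""
    if severity == "Critical" and critical_bypass:
        return now
    if not in_quiet_hours(now, start, end):
        return now
    release = now.replace(hour=end.hour, minute=end.minute, second=0, microsecond=0)
    if release <= now:
        release += timedelta(days=1)  # window ends tomorrow morning
    return release


def escalation_due(status, unack_since, now, threshold=timedelta(minutes=15)):
    """True once an alarm has stayed ActiveUnack for at least `threshold`."""
    return status == "ActiveUnack" and now - unack_since >= threshold
```

With the default 22:00 to 07:00 window, a Warning alarm at 23:00 is released at 07:00 the next day, while a Critical alarm at the same moment is released immediately.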

Preferences and audit

Where tenant policy allows, users can set personal notification preferences — for example opting out of email while keeping in-app notifications — and the change is audited (notification.preference.updated). Every notification rule change and delivery state change produces an audit event, so the full alerting configuration and its delivery outcomes are traceable.

Where to go next

Rule chains: The create-alarm and schedule-notification nodes that feed this subsystem.

Entity model: Entity views that scope which alarms a customer user can see.

Dashboards: The alarm-table widget and the live alarm center.