Building a Scalable Anomaly Detection System with OpenSearch

Figure: OpenSearch anomaly detection architecture

This document provides a comprehensive guide to setting up an anomaly detection system using OpenSearch, Prometheus, and Vector. While implemented on-premises, the solution is infrastructure-agnostic and easily adaptable to any environment. It covers the end-to-end process, from metric collection to anomaly alerting via Slack. The aim is to automate the detection of anomalies in server metrics, reducing reliance on static thresholds that often result in false positives.

Problem Statement

Static alarm thresholds are prone to false positives and cannot adapt to workloads whose normal range shifts over time. Monitoring system health manually is inefficient and error-prone. To address this, we propose an automated system that uses the machine learning-based anomaly detection in OpenSearch in place of traditional static threshold alarms.

Solution

Overview

The proposed solution consists of the following steps:

  1. Prometheus Scraping: Collect system metrics from monitored servers.
  2. Sending Metrics to OpenSearch: Forward only relevant metrics to OpenSearch.
  3. Configuring Anomaly Detection in OpenSearch: Create an anomaly detector over the indexed metrics.
  4. Slack Notifications: Notify relevant teams via Slack when anomalies are detected.

Step-by-Step Implementation

1. Prometheus Configuration

Modify prometheus.yml

To enable Prometheus to load additional metric processing rules, add the following configuration to your prometheus.yml file:

rule_files:
  - "/etc/prometheus/alert.yml"

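Why is this necessary?

Prometheus only evaluates rule files that are listed under rule_files. The recording rules defined in alert.yml precompute the metrics used for anomaly detection and tag them with type: recording_rule, the label that Vector later uses to select which series to forward.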

Create or Modify alert.yml

Define the necessary metrics for monitoring and anomaly detection:

groups:
  - name: prometheus_calculated_metrics
    interval: 30s # How often the rules in this group are evaluated
    rules:
      - record: node_memory_usage_percentage
        expr: ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)
        labels:
          type: recording_rule
      - record: kube_pod_crash_looping
        expr: increase(kube_pod_container_status_restarts_total{job="kubernetes-service-endpoints"}[5m])
        labels:
          type: recording_rule
      - record: up_down
        expr: up{job!~"blackbox|tempo|alertmanager"}
        labels:
          type: recording_rule
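
To make the first recording rule concrete, the sketch below reproduces its arithmetic in Python. The function name and the sample byte counts are illustrative assumptions, not part of the original setup; Prometheus evaluates the real expression itself.

```python
def memory_usage_percentage(mem_total_bytes: float, mem_available_bytes: float) -> float:
    """Mirror of the node_memory_usage_percentage recording rule expression."""
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes * 100


# Hypothetical host: 16 GiB of memory in total, 4 GiB still available.
GIB = 1024 ** 3
usage = memory_usage_percentage(16 * GIB, 4 * GIB)
print(f"{usage:.1f}%")  # prints "75.0%"
```

Because the rule stores a percentage rather than raw byte counts, the detector later sees a value that is comparable across hosts with different amounts of memory.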

Apply Configuration and Restart Prometheus

After making these changes, restart Prometheus to apply the new configuration:

sudo systemctl restart prometheus

If running Prometheus in Docker, use:

docker restart prometheus_container_name

2. Vector Setup and Configuration

Install Vector (see the official Vector documentation)

curl --proto '=https' --tlsv1.2 -sSfL https://sh.vector.dev | bash
vector --version

Configure vector.toml

[sources.prometheus] # 1. Prometheus Source Configuration
type = "prometheus_scrape"
endpoints = ["http://[Prometheus Server IP]:9090/federate?match[]={type=~%27recording_rule%27}"]
scrape_interval_secs = 60
scrape_timeout_secs = 30

[transforms.remove_build_date] # 2. Data Transformation: Removing Unnecessary Fields
type = "remap"
inputs = ["prometheus"]
source = '''
del(.tags.build_date)
'''

[sinks.opensearch] # 3. Sending Data to OpenSearch
type = "elasticsearch"
inputs = ["remove_build_date"]
endpoint = "https://[OpenSearch Server IP]:9200"
bulk.index = "prometheus-metrics-%Y.%m.%d"
auth.strategy = "basic"
auth.user = "[username]" # Your OpenSearch username
auth.password = "[password]" # Your OpenSearch password
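
The federate endpoint in the source above carries a percent-encoded selector, which is easy to mistype by hand. The following is a minimal sketch, using only the Python standard library, of how such a URL could be generated and checked. The helper name and the host prometheus.example are hypothetical placeholders. Note that it also encodes the brackets and braces that the hand-written URL leaves literal.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(prometheus_base: str, selector: str) -> str:
    """Build a /federate URL whose match[] parameter is fully percent-encoded."""
    query = urlencode({"match[]": selector})
    return f"{prometheus_base.rstrip('/')}/federate?{query}"


# Placeholder host; substitute your own Prometheus address.
url = federate_url("http://prometheus.example:9090", "{type=~'recording_rule'}")
print(url)

# Decoding the query string recovers the selector that Prometheus receives.
decoded = parse_qs(urlsplit(url).query)["match[]"][0]
print(decoded)
```

The decoded selector matches only series that carry the type label set by the recording rules, so raw node and Kubernetes metrics are not forwarded.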

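Here is what each section of the config file does:

  1. sources.prometheus: Scrapes the Prometheus /federate endpoint every 60 seconds (scrape_interval_secs) and gives up on a scrape after 30 seconds (scrape_timeout_secs). The match[] selector limits the scrape to series labeled type="recording_rule", which are the recording rules defined in alert.yml.
  2. transforms (remap): Runs a short VRL program that deletes the build_date tag from every event before it is indexed.
  3. sinks.opensearch: Uses Vector's elasticsearch sink to bulk-index events into a daily index named prometheus-metrics-YYYY.MM.DD, authenticating with basic auth.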

Start and Enable Vector Service

sudo systemctl enable --now vector

3. OpenSearch Anomaly Detection Setup

Before proceeding, confirm that the Prometheus metrics are arriving in OpenSearch. If indices named prometheus-metrics-YYYY.MM.DD appear in the index list (for example, under Index Management in OpenSearch Dashboards), you are ready to configure anomaly detection.
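
As a rough aid for this check, the sketch below derives the index name that the bulk.index template should produce for a given day and filters a list of index names down to the daily metric indices. The helper names and the sample index list are assumptions for illustration; obtain the real list from your cluster.

```python
import re
from datetime import date

INDEX_TEMPLATE = "prometheus-metrics-%Y.%m.%d"  # same strftime pattern as bulk.index
INDEX_PATTERN = re.compile(r"^prometheus-metrics-\d{4}\.\d{2}\.\d{2}$")


def index_for(day: date) -> str:
    """Name of the daily index expected for the given date."""
    return day.strftime(INDEX_TEMPLATE)


def metric_indices(index_names: list[str]) -> list[str]:
    """Keep only names that look like daily metric indices, sorted by date."""
    return sorted(name for name in index_names if INDEX_PATTERN.match(name))


# Hypothetical index listing copied from a cluster.
listing = [".kibana_1", "prometheus-metrics-2025.01.05", "prometheus-metrics-2025.01.04"]
print(index_for(date(2025, 1, 5)))
print(metric_indices(listing))
```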

Step 1: Navigate to Anomaly Detection


Step 2: Create a New Detector


Step 3: Configure Index and Data Filtering

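
For readers who prefer the REST API to the UI, the index and filter choices from this step correspond to the indices, time_field, and filter_query fields of a detector definition. The sketch below builds those fields as a plain dictionary. It assumes that Vector writes the metric name to a field called name and the sample time to timestamp; check your index mapping before relying on either, and use a keyword sub-field if name is mapped as text.

```python
import json


def detector_source(index_pattern: str, metric_name: str, name_field: str = "name") -> dict:
    """Index selection plus a filter restricting the detector to one metric."""
    return {
        "indices": [index_pattern],
        "time_field": "timestamp",  # assumed timestamp field written by Vector
        "filter_query": {
            "bool": {"filter": [{"term": {name_field: metric_name}}]}
        },
    }


source = detector_source("prometheus-metrics-*", "node_memory_usage_percentage")
print(json.dumps(source, indent=2))
```

Filtering to a single metric matters because every recording rule lands in the same daily index; without a filter, one feature would aggregate unrelated series together.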

Step 4: Set Detection Parameters

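
In the detector REST API, the two main detection parameters are detection_interval and window_delay, each expressed as a period. The sketch below shows that shape. The 5-minute interval and 1-minute delay are hypothetical values chosen only so that each interval spans several 60-second Vector scrapes; they are not recommendations from the original setup.

```python
SCRAPE_INTERVAL_SECS = 60  # scrape_interval_secs from vector.toml


def period(minutes: int) -> dict:
    """Period object in the shape used by the anomaly detection API."""
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    return {"period": {"interval": minutes, "unit": "Minutes"}}


def detection_parameters(interval_minutes: int, window_delay_minutes: int) -> dict:
    """Detector interval plus the extra delay allowed for late-arriving data."""
    return {
        "detection_interval": period(interval_minutes),
        "window_delay": period(window_delay_minutes),
    }


params = detection_parameters(interval_minutes=5, window_delay_minutes=1)
samples_per_interval = 5 * 60 // SCRAPE_INTERVAL_SECS
print(params)
print(samples_per_interval)  # scrapes aggregated into each detection interval
```

An interval shorter than the scrape interval would leave some intervals without data, and the window delay gives the pipeline time to deliver samples before an interval is evaluated.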

Step 5: Define Features

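
A feature is an aggregation over one field, and the detector models how that aggregated value behaves over time. As a minimal sketch, the helper below builds one entry of the feature_attributes list used by the detector API. The field name gauge.value is an assumption about how Vector encodes gauge metrics; verify it against a sample document in your index.

```python
ALLOWED_AGGREGATIONS = {"avg", "sum", "min", "max"}


def feature(feature_name: str, field: str, aggregation: str = "avg") -> dict:
    """One feature_attributes entry: a single-field aggregation."""
    if aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {aggregation}")
    return {
        "feature_name": feature_name,
        "feature_enabled": True,
        "aggregation_query": {feature_name: {aggregation: {"field": field}}},
    }


# Hypothetical feature: average memory usage per detection interval.
memory_feature = feature("avg_memory_usage", "gauge.value")
print(memory_feature)
```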

Step 6: Enable Real-time Detection


Step 7: Review and Create Detector

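
The same detector can also be created without the UI by posting a JSON definition to the anomaly detection plugin. The sketch below only prepares the request and does not send it. The host opensearch.example, the detector name, and every field value are placeholders, and the field names timestamp and gauge.value are assumptions about the indexed documents.

```python
import json
import urllib.request

DETECTORS_PATH = "/_plugins/_anomaly_detection/detectors"


def create_detector_request(base_url: str, detector: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST that creates a detector."""
    return urllib.request.Request(
        base_url.rstrip("/") + DETECTORS_PATH,
        data=json.dumps(detector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical detector; every name and value below is a placeholder.
detector = {
    "name": "memory-usage-detector",
    "description": "Flags unusual memory usage",
    "time_field": "timestamp",
    "indices": ["prometheus-metrics-*"],
    "feature_attributes": [
        {
            "feature_name": "avg_memory_usage",
            "feature_enabled": True,
            "aggregation_query": {"avg_memory_usage": {"avg": {"field": "gauge.value"}}},
        }
    ],
    "detection_interval": {"period": {"interval": 5, "unit": "Minutes"}},
    "window_delay": {"period": {"interval": 1, "unit": "Minutes"}},
}

request = create_detector_request("https://opensearch.example:9200", detector)
print(request.get_method(), request.full_url)
# Sending the request needs credentials and TLS settings for your cluster.
# The response carries the new detector's ID; a POST to
# <DETECTORS_PATH>/<detector_id>/_start then begins real-time detection.
```

Keeping the definition in code makes it possible to review and version the detector alongside the Prometheus and Vector configuration.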

4. Alerting and Notifications

Next, create a monitor that raises an alert whenever the detector reports an anomaly. Open the Alerting plugin under OpenSearch Plugins.

Step 1: Navigate to Alerting


Step 2: Create a Monitor


Step 3: Define Triggers

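
A trigger for an anomaly detector monitor typically fires when both the anomaly grade and the confidence exceed chosen thresholds. The sketch below models that decision in Python so the thresholds can be reasoned about outside the UI. The 0.7 values are illustrative assumptions; the real condition is evaluated by OpenSearch against the monitor's query results.

```python
def should_trigger(
    anomaly_grade: float,
    confidence: float,
    grade_threshold: float = 0.7,
    confidence_threshold: float = 0.7,
) -> bool:
    """True when an anomaly is both severe enough and trusted enough to alert on."""
    return anomaly_grade > grade_threshold and confidence > confidence_threshold


# Hypothetical detector results.
print(should_trigger(0.9, 0.95))  # severe and confident
print(should_trigger(0.9, 0.40))  # severe, but the model is not yet confident
print(should_trigger(0.2, 0.99))  # confident, but only a mild deviation
```

Requiring both conditions keeps early, low-confidence results from paging anyone while the model is still learning the metric's normal behavior.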

Step 4: Configure Slack Notifications

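
Before pointing OpenSearch at a Slack incoming webhook, it can help to confirm that the webhook itself works. The sketch below prepares a test message using only the standard library. The webhook URL is a placeholder, and the request is deliberately not sent so that the example has no side effects.

```python
import json
import urllib.request


def slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Prepare a Slack incoming-webhook POST carrying a plain text message."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; use the webhook created for your own Slack channel.
req = slack_request(
    "https://hooks.slack.com/services/REPLACE/WITH/YOURS",
    "Anomaly detection test message",
)
print(req.get_method(), json.loads(req.data))
# urllib.request.urlopen(req) would deliver the message to the channel.
```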

Step 5: Define Message Format

Copy and paste the following message template into the action configuration. Mustache addresses array elements with dot notation (ctx.results.0), not brackets. The result paths assume the default query of an anomaly detector monitor, which returns the top anomaly result as the first hit together with a max_anomaly_grade aggregation; adjust them if your monitor's query differs:

🚨 **Anomaly Detected** 🚨

Monitor: {{ctx.monitor.name}}
Trigger: {{ctx.trigger.name}}
Severity: {{ctx.trigger.severity}}
Period Start: {{ctx.periodStart}}
Period End: {{ctx.periodEnd}}

🔍 **Anomaly Details**
- Anomaly Grade: {{ctx.results.0.aggregations.max_anomaly_grade.value}}
- Confidence: {{ctx.results.0.hits.hits.0._source.confidence}}
- Detector ID: {{ctx.results.0.hits.hits.0._source.detector_id}}

📊 **Anomaly Result Fields**
{{#ctx.results.0.hits.hits}}
- Anomaly Grade: {{_source.anomaly_grade}}
- Data Start: {{_source.data_start_time}}
- Data End: {{_source.data_end_time}}
{{/ctx.results.0.hits.hits}}

⚠️ Please investigate the anomaly and take the necessary actions.

Step 6: Enable Action Throttling and Create Monitor

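
Action throttling limits how often the same action may send a notification, so that a long-running anomaly does not flood the channel. The class below is a conceptual model of that behavior, not the OpenSearch implementation, and the 10-minute window is a hypothetical setting.

```python
from datetime import datetime, timedelta
from typing import Optional


class ActionThrottle:
    """Allow at most one notification per throttle window."""

    def __init__(self, minutes: int) -> None:
        self.window = timedelta(minutes=minutes)
        self.last_sent: Optional[datetime] = None

    def allow(self, now: datetime) -> bool:
        """Return True and record the send time if a notification may go out."""
        if self.last_sent is not None and now - self.last_sent < self.window:
            return False
        self.last_sent = now
        return True


throttle = ActionThrottle(minutes=10)
start = datetime(2025, 1, 5, 12, 0)
decisions = [
    throttle.allow(start),                          # first alert is sent
    throttle.allow(start + timedelta(minutes=5)),   # suppressed, inside the window
    throttle.allow(start + timedelta(minutes=10)),  # window elapsed, sent again
]
print(decisions)  # prints [True, False, True]
```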

Conclusion

Once the detector identifies an anomaly, a message will be sent to your Slack channel. This ensures that anomalies are promptly reported, allowing for quick investigation and resolution.


Notes


Further Information