AWSTerraformDevOpsCloudWatchSSMEventBridgeInfrastructure as CodeSelf-HealingAutomation

Building Self-Healing Infrastructure on AWS with Terraform

March 10, 2026

The user just wants to format the markdown content — no file creation needed. Here's the cleaned-up version:

The Problem

It's 3 AM. Your web server crashes. Your monitoring sends you an alert. You wake up, SSH into the server, restart the service, and go back to sleep. Sound familiar?

What if AWS could detect the problem, fix it automatically, and just send you an email saying "something broke, but we already fixed it"?

That's exactly what I built.

What This Project Does

This is a fully automated self-healing infrastructure on AWS. An EC2 instance runs Apache HTTPD, and when the service goes down — whether from a crash, high CPU, or a status check failure — the system:

Detects the failure via CloudWatch Alarms
Alerts via SNS email notification
Remediates automatically via SSM Automation (restart httpd → check status → reboot if needed)
Verifies recovery and sends a follow-up OK email

No human intervention required. The entire loop runs in under 5 minutes.

Architecture

EC2 (httpd + CloudWatch Agent + Health Check Cron)
     │
     ├── metrics ──→ CloudWatch Alarms
     │                    │
     │                    ├──→ SNS ──→ Email (ALARM + OK)
     │                    │
     │                    └──→ EventBridge ──→ SSM Automation
     │                                            │
     │                                            ├─ Step 1: restart httpd
     │                                            ├─ Step 2: check status
     │                                            └─ Step 3: reboot EC2 (last resort)
     │
     └── recovers ──→ alarm returns to OK

AWS Services Used

Service	Role
EC2	Amazon Linux 2023 running Apache HTTPD
CloudWatch Agent	Collects CPU, memory, disk metrics and forwards logs
CloudWatch Alarms	Three alarms monitoring HTTP health, CPU usage, and EC2 status checks
CloudWatch Dashboard	Real-time visibility with 7 widgets
EventBridge	Routes alarm state changes to SSM Automation
SSM Automation	Executes a multi-step runbook to restart or reboot
SSM Parameter Store	Stores CloudWatch Agent configuration
SNS	Email notifications for alarm and recovery events
IAM	Least-privilege roles for EC2, SSM, and EventBridge

Terraform Module Structure

infra/
├── main.tf                    # Module orchestration
├── modules/
│   ├── networking/            # VPC, subnet, IGW, security group
│   ├── iam/                   # EC2 role, instance profile
│   ├── compute/               # EC2 instance, key pair
│   ├── monitoring/            # CW agent config, alarms, dashboard
│   ├── notification/          # SNS topic + email subscription
│   └── remediation/           # SSM runbook, EventBridge rule
└── scripts/
    └── cloudwatch_ssm.sh      # EC2 bootstrap script

Each module has a single responsibility. The main.tf wires them together — networking outputs feed into compute, compute outputs feed into monitoring, monitoring outputs feed into remediation.

The SSM Automation Runbook

The runbook SelfHealing-RestartOrReboot has three steps:

Step 1 — Restart httpd: Runs systemctl restart httpd via SSM RunCommand. If it fails, continues to Step 2.

Step 2 — Check httpd status: Runs systemctl is-active httpd. If httpd is active, healing is complete. If it fails, escalates to Step 3.

Step 3 — Reboot EC2 (last resort): Calls the EC2 RebootInstances API. The instance reboots, user data runs again, and everything comes back online.

This escalation pattern — try the least disruptive fix first, then escalate — is a common pattern in production incident response. This project automates it.

The Custom Health Check

CloudWatch doesn't monitor HTTP application health by default. A simple bash script runs every minute via cron:

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/)

if [ "$HTTP_CODE" -eq 200 ]; then VALUE=0; else VALUE=1; fi

aws cloudwatch put-metric-data \
  --namespace "SelfHealingInfra" \
  --metric-name "HTTPCheckFailed" \
  --dimensions "InstanceId=$INSTANCE_ID" \
  --value "$VALUE"

It hits localhost, checks the HTTP status code, and pushes a custom metric HTTPCheckFailed (0 = healthy, 1 = down) to CloudWatch. The alarm watches this metric and fires after 2 consecutive failures.

Note: This uses IMDSv2 (token-based metadata) — the AWS-recommended approach over the older IMDSv1.

Three Alarms, One Runbook

Alarm	What It Watches	Threshold
HTTP Check Failed	Custom `HTTPCheckFailed` metric	> 0 for 2 min
High CPU	`cpu_usage_idle` from CW Agent	< 20% for 5 min
EC2 Status Check	AWS native `StatusCheckFailed`	> 0 for 2 min

All three alarms are wired to the same EventBridge rule. Any alarm entering ALARM state triggers the same SSM runbook — whether the web server crashes, the CPU spikes, or the instance itself has issues.

One important configuration: treat_missing_data = "breaching". If the CloudWatch Agent crashes and stops sending metrics, the alarm still fires. Missing data means something is wrong.

Testing It

Simulate HTTP failure:

sudo systemctl stop httpd

Within 3–4 minutes: alarm fires → email arrives → SSM restarts httpd → alarm returns to OK → recovery email arrives. All automatic.

Simulate unrepairable failure:

sudo mv /usr/sbin/httpd /usr/sbin/httpd.bak
sudo systemctl stop httpd

Now the restart fails, so SSM escalates to rebooting the entire instance.

Lessons Learned

CloudWatch Agent dimensions matter. The agent publishes metrics with dimensions like host, cpu, device. If your alarms and dashboard query different dimensions, you get "No data available." Use append_dimensions and aggregation_dimensions to normalize everything to InstanceId.

SSM Session Manager > SSH. No need to manage SSH keys or open port 22. SSM sessions work through the IAM role — more secure, fully audited.

Cron needs cronie on AL2023. Amazon Linux 2023 doesn't ship with crond installed. Add yum install -y cronie to the bootstrap script.

IMDSv2 is the way. The older ec2-metadata command isn't always available. The token-based IMDS endpoint works everywhere and is more secure.

What's Next

The detect → alert → remediate → verify pattern extends well beyond EC2:

ECS tasks that restart failed containers
Lambda functions triggered by CloudWatch alarms for custom remediation logic
Auto Scaling Groups that replace unhealthy instances entirely
Multi-region failover using Route 53 health checks