Project Resilience
High-Availability 3-Tier AWS Infrastructure
The Problem
Standard single-server deployments represent a Single Point of Failure (SPOF) If a single AWS Availability Zone (AZ)
down or a manual configuration error occurs, the entire application stays offline, leading to significant downtime and potential data loss.
I needed to move away from "manual, fragile setups" toward a system that is decoupled, secure, and self-healing.
The Solution
I engineered a production-grade 3-tier architecture in the ap-southeast-1 (Singapore) region.
The system is designed to "survive" by distributing workloads across multiple AZs and automating the recovery process.
Key solutions included:
- High Availability: Distributing traffic via an ALB to an Auto Scaling Group.
- Data Durability: Utilizing RDS Multi-AZ for synchronous replication.
- Secure Networking: Isolating the Database and App tiers in Private Subnets.
Infrastructure as Code
Terraform configuration for the Security Group Nesting:
# The "Chain of Trust": Only the App Tier can talk to the Database
resource "aws_security_group" "db_sg" {
name = "database-layer-security-group"
description = "Allow MySQL traffic from App Tier only"
vpc_id = aws_vpc.main.id
ingress {
from_port = 3306
to_port = 3306
protocol = "tcp"
# This is nesting: No IP addresses, just the App SG ID
security_groups = [aws_security_group.app_sg.id]
}
}
The Results
90%
MTTR Reduction
Automated instance replacement via ASG eliminates manual rebooting.
<60s
Failover
RDS Multi-AZ ensures rapid database failover with zero data loss.
99.9%
Uptime
Multi-AZ deployment ensures regional resilience against AZ outages.
0
Interventions
Self-healing infrastructure automatically maintains desired instance count.
Technologies Used
- Cloud Provider: AWS (Region: ap-southeast-1).
- Infrastructure as Code: Terraform (~> 5.0).
- Database: MySQL 8.0 (RDS Multi-AZ).
- Compute: Amazon Linux 2023 + Apache HTTP Server.
Auto Remediation
Event-Driven Self-Healing Infrastructure
The Problem
In modern cloud environments, relying on manual SSH to fix a crashed service is slow and does not scale.
Manual intervention creates a bottleneck, increases Mean Time To Recovery (MTTR), and often requires open SSH ports,
which increases the security attack surface.
The Solution
I designed a pipeline that automates the entire detection-to-remediation lifecycle:
- Monitoring & Detection: CloudWatch Alarms monitor instance health; if a service crashes, the alarm state change triggers EventBridge.
- The "Brain" (Lambda): A Python-based Lambda function validates the environment, checking for a Maintenance tag to ensure safety.
- Secure Remediation: The system triggers AWS Systems Manager (SSM) to run a custom Command Document, restarting the service securely over the AWS backbone without needing open SSH ports.
- Real-time Observability: Automated logs are sent via Discord Webhooks, providing the specific cause and duration of downtime.
Infrastructure as Code
Here is the Python logic that serves as the system's circuit breaker, combined with the Terraform trigger:
# IAM Policy for Lambda Hardening
resource "aws_iam_role_policy" "lambda_policy" {
name = "lambda_janitor_permissions"
role = aws_iam_role.lambda_role.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = ["ssm:SendCommand", "ec2:DescribeTags"],
Effect = "Allow",
Resource = "*"
},
{
Action = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
Effect = "Allow",
Resource = "arn:aws:logs:*:*:*"
}
]
})
}
# Intelligence Layer Configuration
resource "aws_lambda_function" "janitor_brain" {
function_name = "CloudJanitor_Remediation_Brain"
handler = "lambda_function.lambda_handler"
runtime = "python3.11"
environment {
variables = {
DISCORD_WEBHOOK_URL = var.discord_webhook_url
SSM_DOCUMENT_NAME = aws_ssm_document.remediate_nginx.name
}
}
}
# Cross-Service Permission
resource "aws_lambda_permission" "allow_cloudwatch" {
statement_id = "AllowExecutionFromCloudWatch"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.janitor_brain.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.remediation_rule.arn
}
import boto3
import os
def lambda_handler(event, context):
ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')
# Validation: Maintenance Check
detail = event.get('detail', {})
metrics = detail.get('configuration', {}).get('metrics', [{}])
instance_id = metrics[0].get('metricStat', {}).get('metric', {}).get('dimensions', {}).get('InstanceId', 'Unknown')
tags = ec2.describe_tags(Filters=[{'Name': 'resource-id', 'Values': [instance_id]}])['Tags']
is_maintenance = any(t['Key'] == 'Maintenance' and t['Value'].lower() == 'true' for t in tags)
if is_maintenance:
return {"status": "skipped", "reason": "Maintenance Mode Active"}
# Secure Remediation via SSM
ssm.send_command(
InstanceIds=[instance_id],
DocumentName=os.environ.get('SSM_DOCUMENT_NAME')
)
return {"status": "success"}
The Results
90%
Reduction in MTTR
System identifies and fixes failures in under 30 seconds, down from 15 minutes of manual detection.
0
Zero-Trust Security
Remediation occurs via SSM, allowing for the complete removal of public SSH access.
24/7
Dynamic Observability
Real-time Discord notifications provide instant clarity on why a service failed.
Technologies Used
- Cloud Provider: AWS (Region: ap-southeast-1)
- Automation & IaC: Terraform, Python (Boto3), Bash, HCL.
- AWS Services: EC2, Lambda, EventBridge, CloudWatch, Systems Manager (SSM), IAM.
- Integrations: Discord API for real-time webhooks.