AWS Backup & Disaster Recovery Strategies

Forrict Team

Protect your business with comprehensive backup and disaster recovery strategies using AWS Backup, cross-region replication, and automated recovery procedures.

Introduction

Data loss and service disruptions can be catastrophic for any business. Whether caused by human error, cyberattacks, natural disasters, or system failures, the impact can range from temporary inconvenience to complete business failure. A robust backup and disaster recovery (DR) strategy is not just a technical requirement – it’s a business imperative.

This comprehensive guide explores AWS backup and disaster recovery strategies, from basic backup automation to multi-region disaster recovery architectures. You’ll learn how to design solutions that meet your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements while optimizing costs.

What You’ll Learn:

  • Understanding RTO, RPO, and recovery strategies
  • AWS Backup service for centralized backup management
  • Cross-region replication strategies for S3, RDS, and DynamoDB
  • Disaster recovery patterns: Backup & Restore, Pilot Light, Warm Standby, Hot Standby
  • Automating backup policies with AWS CDK and Python
  • Testing and validating recovery procedures
  • Cost optimization for backup and DR solutions

Understanding RTO and RPO

Recovery Time Objective (RTO)

RTO is the maximum acceptable time that an application can be down after a disaster occurs. It answers: “How quickly must we recover?”

Examples:

  • E-commerce site: RTO of 1 hour (lost sales, reputation damage)
  • Internal reporting system: RTO of 24 hours (less critical)
  • Banking application: RTO of minutes (regulatory requirements)

Recovery Point Objective (RPO)

RPO is the maximum acceptable amount of data loss measured in time. It answers: “How much data can we afford to lose?”

Examples:

  • Financial trading system: RPO of seconds (every transaction matters)
  • Blog website: RPO of 24 hours (daily backups acceptable)
  • Data warehouse: RPO of 1 hour (hourly snapshots)

RTO vs RPO Cost Trade-offs

Higher Availability = Higher Cost

RPO/RTO    Strategy           AWS Services                 Monthly Cost (estimate)
──────────────────────────────────────────────────────────────────────────────────
Days       Backup & Restore   S3, Glacier, AWS Backup      $50-500
Hours      Pilot Light        S3, RDS standby, Route53     $500-2000
Minutes    Warm Standby       Multi-AZ, Read Replica       $2000-10000
Seconds    Hot Standby        Active-Active, Multi-Region  $10000+
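
Before choosing a row in this table, it helps to measure what you actually achieve today. Below is a minimal boto3 sketch, reusing the vault and database names from the examples later in this guide, that reports the achieved RPO as the age of the newest recovery point:

# scripts/rpo_check.py
import boto3
from datetime import datetime, timezone

backup = boto3.client('backup', region_name='eu-central-1')

def achieved_rpo_hours(vault_name: str, resource_arn: str) -> float:
    """Age in hours of the newest recovery point, i.e. the data lost today."""
    points = backup.list_recovery_points_by_backup_vault(
        BackupVaultName=vault_name,
        ByResourceArn=resource_arn,
    ).get('RecoveryPoints', [])
    if not points:
        raise ValueError(f"No recovery points for {resource_arn}")
    newest = max(p['CreationDate'] for p in points)  # tz-aware datetime
    return (datetime.now(timezone.utc) - newest).total_seconds() / 3600

print(achieved_rpo_hours(
    'primary-backup-vault',
    'arn:aws:rds:eu-central-1:123456789012:db:production-db'
))

If the newest point came from last night's 2 AM run, this prints roughly the hours elapsed since then: your worst-case data loss if the region failed right now.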

AWS Backup Service

Introduction to AWS Backup

AWS Backup is a fully managed service that centralizes and automates data protection across AWS services including:

  • Amazon EC2 (EBS volumes)
  • Amazon RDS and Aurora
  • Amazon DynamoDB
  • Amazon EFS
  • Amazon FSx
  • AWS Storage Gateway
  • Amazon S3

Key Benefits:

  • Centralized management: Single console for all backups
  • Automated backup scheduling: Policy-based backup plans
  • Cross-region backup: Automatic replication to different regions
  • Compliance reporting: Track backup compliance requirements
  • Lifecycle management: Transition to cold storage automatically
  • Encryption: All backups encrypted at rest and in transit
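
A quick way to see what AWS Backup is already covering is to list the protected resources. A minimal boto3 sketch (the region is illustrative):

# scripts/list_protected.py
import boto3

backup = boto3.client('backup', region_name='eu-central-1')

# Inventory every resource AWS Backup holds at least one recovery point for
paginator = backup.get_paginator('list_protected_resources')
for page in paginator.paginate():
    for resource in page['Results']:
        print(
            resource['ResourceType'],
            resource['ResourceArn'],
            resource.get('LastBackupTime'),
        )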

Creating Backup Plans with AWS CDK

// lib/backup-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as backup from 'aws-cdk-lib/aws-backup';
import * as events from 'aws-cdk-lib/aws-events';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class BackupStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create backup vault
    const backupVault = new backup.BackupVault(this, 'PrimaryBackupVault', {
      backupVaultName: 'primary-backup-vault',
      // Enable encryption with KMS
      encryptionKey: new cdk.aws_kms.Key(this, 'BackupKey', {
        description: 'KMS key for backup encryption',
        enableKeyRotation: true,
        removalPolicy: cdk.RemovalPolicy.RETAIN,
      }),
      // Prevent accidental deletion
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    // Reference the backup vault in the DR region (eu-west-1). Cross-region
    // copy needs a destination vault that already exists in that region, so it
    // is created by a separate stack deployed there and imported here by ARN.
    const drBackupVault = backup.BackupVault.fromBackupVaultArn(
      this,
      'DRBackupVault',
      `arn:aws:backup:eu-west-1:${this.account}:backup-vault:dr-backup-vault`
    );

    // Daily backup plan with lifecycle rules
    const dailyBackupPlan = new backup.BackupPlan(this, 'DailyBackupPlan', {
      backupPlanName: 'daily-backup-plan',
      backupPlanRules: [
        new backup.BackupPlanRule({
          ruleName: 'DailyBackups',
          backupVault,
          // Schedule: Daily at 2 AM UTC
          scheduleExpression: events.Schedule.cron({
            hour: '2',
            minute: '0',
          }),
          // Start backup within 1 hour of schedule
          startWindow: cdk.Duration.hours(1),
          // Complete backup within 2 hours
          completionWindow: cdk.Duration.hours(2),
          // Lifecycle: Transition to cold storage after 30 days
          moveToColdStorageAfter: cdk.Duration.days(30),
          // Delete after 120 days: AWS Backup requires recovery points to
          // remain in cold storage for at least 90 days after the transition
          deleteAfter: cdk.Duration.days(120),
          // Copy to DR region
          copyActions: [
            {
              destinationBackupVault: drBackupVault,
              moveToColdStorageAfter: cdk.Duration.days(30),
              deleteAfter: cdk.Duration.days(120),
            },
          ],
        }),
      ],
    });

    // Hourly backup plan for critical databases
    const hourlyBackupPlan = new backup.BackupPlan(this, 'HourlyBackupPlan', {
      backupPlanName: 'hourly-critical-backup-plan',
      backupPlanRules: [
        new backup.BackupPlanRule({
          ruleName: 'HourlyBackups',
          backupVault,
          // Schedule: Every hour
          scheduleExpression: events.Schedule.rate(cdk.Duration.hours(1)),
          startWindow: cdk.Duration.minutes(30),
          // Completion window must exceed the start window by at least an hour
          completionWindow: cdk.Duration.hours(2),
          // Keep hourly backups for 7 days
          deleteAfter: cdk.Duration.days(7),
          // Copy to DR region immediately
          copyActions: [
            {
              destinationBackupVault: drBackupVault,
              deleteAfter: cdk.Duration.days(7),
            },
          ],
        }),
      ],
    });

    // Weekly backup plan with long retention
    const weeklyBackupPlan = new backup.BackupPlan(this, 'WeeklyBackupPlan', {
      backupPlanName: 'weekly-long-term-backup-plan',
      backupPlanRules: [
        new backup.BackupPlanRule({
          ruleName: 'WeeklyBackups',
          backupVault,
          // Schedule: Every Sunday at 3 AM UTC
          scheduleExpression: events.Schedule.cron({
            weekDay: 'SUN',
            hour: '3',
            minute: '0',
          }),
          startWindow: cdk.Duration.hours(2),
          completionWindow: cdk.Duration.hours(4),
          // Transition to cold storage after 7 days
          moveToColdStorageAfter: cdk.Duration.days(7),
          // Keep for 1 year
          deleteAfter: cdk.Duration.days(365),
          copyActions: [
            {
              destinationBackupVault: drBackupVault,
              moveToColdStorageAfter: cdk.Duration.days(7),
              deleteAfter: cdk.Duration.days(365),
            },
          ],
        }),
      ],
    });

    // Backup selection: Tag-based resource selection
    // Daily backups for all production resources
    dailyBackupPlan.addSelection('ProductionResources', {
      backupSelectionName: 'production-daily-backup',
      role: new iam.Role(this, 'BackupRole', {
        assumedBy: new iam.ServicePrincipal('backup.amazonaws.com'),
        managedPolicies: [
          iam.ManagedPolicy.fromAwsManagedPolicyName(
            'service-role/AWSBackupServiceRolePolicyForBackup'
          ),
          iam.ManagedPolicy.fromAwsManagedPolicyName(
            'service-role/AWSBackupServiceRolePolicyForRestores'
          ),
        ],
      }),
      resources: [
        // Select resources by tag
        backup.BackupResource.fromTag('Environment', 'production'),
        backup.BackupResource.fromTag('Backup', 'daily'),
      ],
    });

    // Hourly backups for critical databases
    hourlyBackupPlan.addSelection('CriticalDatabases', {
      resources: [
        backup.BackupResource.fromTag('Criticality', 'critical'),
        backup.BackupResource.fromTag('Backup', 'hourly'),
      ],
    });

    // Weekly backups for long-term retention
    weeklyBackupPlan.addSelection('LongTermRetention', {
      resources: [
        backup.BackupResource.fromTag('Backup', 'weekly'),
      ],
    });

    // CloudWatch alarm for failed backups
    const backupFailureAlarm = new cdk.aws_cloudwatch.Alarm(
      this,
      'BackupFailureAlarm',
      {
        alarmName: 'backup-job-failure',
        metric: new cdk.aws_cloudwatch.Metric({
          namespace: 'AWS/Backup',
          metricName: 'NumberOfBackupJobsFailed',
          statistic: 'Sum',
          period: cdk.Duration.hours(1),
        }),
        threshold: 1,
        evaluationPeriods: 1,
        comparisonOperator:
          cdk.aws_cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
        treatMissingData: cdk.aws_cloudwatch.TreatMissingData.NOT_BREACHING,
      }
    );

    // SNS topic for backup notifications
    const backupTopic = new cdk.aws_sns.Topic(this, 'BackupNotifications', {
      displayName: 'Backup Job Notifications',
    });

    backupFailureAlarm.addAlarmAction(
      new cdk.aws_cloudwatch_actions.SnsAction(backupTopic)
    );

    // Outputs
    new cdk.CfnOutput(this, 'BackupVaultArn', {
      value: backupVault.backupVaultArn,
      description: 'Primary backup vault ARN',
    });

    new cdk.CfnOutput(this, 'DailyBackupPlanId', {
      value: dailyBackupPlan.backupPlanId,
      description: 'Daily backup plan ID',
    });
  }
}

Python Script for Backup Automation

# scripts/backup_automation.py
import boto3
import json
from datetime import datetime, timedelta, timezone
from typing import List, Dict

class AWSBackupManager:
    """Manage AWS Backup operations programmatically"""

    def __init__(self, region: str = 'eu-central-1'):
        self.backup_client = boto3.client('backup', region_name=region)
        self.ec2_client = boto3.client('ec2', region_name=region)
        self.rds_client = boto3.client('rds', region_name=region)
        self.region = region

    def create_on_demand_backup(
        self,
        resource_arn: str,
        backup_vault_name: str,
        iam_role_arn: str,
        backup_name: str = None
    ) -> str:
        """
        Create an on-demand backup for a specific resource

        Args:
            resource_arn: ARN of the resource to backup
            backup_vault_name: Name of the backup vault
            iam_role_arn: IAM role ARN for backup service
            backup_name: Optional custom backup name

        Returns:
            Backup job ID
        """
        if not backup_name:
            timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
            backup_name = f"on-demand-backup-{timestamp}"

        try:
            response = self.backup_client.start_backup_job(
                BackupVaultName=backup_vault_name,
                ResourceArn=resource_arn,
                IamRoleArn=iam_role_arn,
                IdempotencyToken=backup_name,
                StartWindowMinutes=60,
                CompleteWindowMinutes=120,
                Lifecycle={
                    'MoveToColdStorageAfterDays': 30,
                    # Cold storage requires at least 90 more days of retention
                    'DeleteAfterDays': 120
                }
            )

            backup_job_id = response['BackupJobId']
            print(f"Backup job started: {backup_job_id}")
            return backup_job_id

        except Exception as e:
            print(f"Error creating backup: {str(e)}")
            raise

    def list_backup_jobs(
        self,
        backup_vault_name: str = None,
        state: str = None,
        max_results: int = 100
    ) -> List[Dict]:
        """
        List backup jobs with optional filters

        Args:
            backup_vault_name: Filter by backup vault
            state: Filter by job state (CREATED, PENDING, RUNNING, COMPLETED, FAILED, ABORTED)
            max_results: Maximum number of results

        Returns:
            List of backup jobs
        """
        params = {'MaxResults': max_results}

        if backup_vault_name:
            params['ByBackupVaultName'] = backup_vault_name

        if state:
            params['ByState'] = state

        try:
            response = self.backup_client.list_backup_jobs(**params)
            return response.get('BackupJobs', [])
        except Exception as e:
            print(f"Error listing backup jobs: {str(e)}")
            raise

    def restore_from_backup(
        self,
        recovery_point_arn: str,
        iam_role_arn: str,
        metadata: Dict,
        resource_type: str
    ) -> str:
        """
        Restore a resource from a backup recovery point

        Args:
            recovery_point_arn: ARN of the recovery point
            iam_role_arn: IAM role ARN for restore service
            metadata: Restore metadata (varies by resource type)
            resource_type: Type of resource (EBS, RDS, etc.)

        Returns:
            Restore job ID
        """
        try:
            response = self.backup_client.start_restore_job(
                RecoveryPointArn=recovery_point_arn,
                IamRoleArn=iam_role_arn,
                Metadata=metadata,
                ResourceType=resource_type
            )

            restore_job_id = response['RestoreJobId']
            print(f"Restore job started: {restore_job_id}")
            return restore_job_id

        except Exception as e:
            print(f"Error starting restore: {str(e)}")
            raise

    def get_recovery_points(
        self,
        backup_vault_name: str,
        resource_arn: str = None
    ) -> List[Dict]:
        """
        List available recovery points in a backup vault

        Args:
            backup_vault_name: Name of the backup vault
            resource_arn: Optional filter by resource ARN

        Returns:
            List of recovery points
        """
        params = {'BackupVaultName': backup_vault_name}

        if resource_arn:
            params['ByResourceArn'] = resource_arn

        try:
            response = self.backup_client.list_recovery_points_by_backup_vault(
                **params
            )
            return response.get('RecoveryPoints', [])
        except Exception as e:
            print(f"Error listing recovery points: {str(e)}")
            raise

    def tag_resources_for_backup(
        self,
        resource_ids: List[str],
        resource_type: str,
        backup_frequency: str = 'daily'
    ) -> None:
        """
        Tag resources for automated backup

        Args:
            resource_ids: Resource IDs to tag (full ARNs for RDS, since
                add_tags_to_resource expects an ARN)
            resource_type: Type of resource (instance, volume, db-instance)
            backup_frequency: Backup frequency (hourly, daily, weekly)
        """
        tags = [
            {'Key': 'Backup', 'Value': backup_frequency},
            {'Key': 'ManagedBy', 'Value': 'AWSBackup'}
        ]

        try:
            if resource_type == 'instance':
                self.ec2_client.create_tags(
                    Resources=resource_ids,
                    Tags=tags
                )
            elif resource_type == 'db-instance':
                for db_id in resource_ids:
                    self.rds_client.add_tags_to_resource(
                        ResourceName=db_id,
                        Tags=tags
                    )

            print(f"Tagged {len(resource_ids)} {resource_type}(s) for {backup_frequency} backup")

        except Exception as e:
            print(f"Error tagging resources: {str(e)}")
            raise

    def generate_backup_report(
        self,
        backup_vault_name: str,
        days: int = 7
    ) -> Dict:
        """
        Generate backup compliance report

        Args:
            backup_vault_name: Name of the backup vault
            days: Number of days to look back

        Returns:
            Report dictionary with statistics
        """
        cutoff_date = datetime.now(timezone.utc) - timedelta(days=days)

        # Get all backup jobs in timeframe
        jobs = self.list_backup_jobs(backup_vault_name=backup_vault_name)

        # Filter by date (boto3 returns CreationDate as a tz-aware datetime,
        # not a string)
        recent_jobs = [
            job for job in jobs
            if job['CreationDate'] > cutoff_date
        ]

        # Calculate statistics
        total_jobs = len(recent_jobs)
        completed = len([j for j in recent_jobs if j['State'] == 'COMPLETED'])
        failed = len([j for j in recent_jobs if j['State'] == 'FAILED'])
        running = len([j for j in recent_jobs if j['State'] == 'RUNNING'])

        total_size = sum(
            job.get('BackupSizeInBytes', 0)
            for job in recent_jobs
            if job['State'] == 'COMPLETED'
        )

        report = {
            'vault_name': backup_vault_name,
            'period_days': days,
            'total_jobs': total_jobs,
            'completed_jobs': completed,
            'failed_jobs': failed,
            'running_jobs': running,
            'success_rate': f"{(completed/total_jobs*100):.2f}%" if total_jobs > 0 else "0%",
            'total_backup_size_gb': f"{total_size / (1024**3):.2f}",
            'generated_at': datetime.now().isoformat()
        }

        return report

# Example usage
if __name__ == '__main__':
    # Initialize backup manager
    backup_mgr = AWSBackupManager(region='eu-central-1')

    # Create on-demand backup for RDS instance
    rds_arn = "arn:aws:rds:eu-central-1:123456789012:db:production-db"
    iam_role = "arn:aws:iam::123456789012:role/AWSBackupServiceRole"

    job_id = backup_mgr.create_on_demand_backup(
        resource_arn=rds_arn,
        backup_vault_name='primary-backup-vault',
        iam_role_arn=iam_role,
        backup_name='production-db-manual-backup'
    )

    # Generate backup report
    report = backup_mgr.generate_backup_report(
        backup_vault_name='primary-backup-vault',
        days=7
    )

    print("\n=== Backup Report ===")
    print(json.dumps(report, indent=2))

    # Tag EC2 instances for daily backup
    instance_ids = ['i-1234567890abcdef0', 'i-0987654321fedcba0']
    backup_mgr.tag_resources_for_backup(
        resource_ids=instance_ids,
        resource_type='instance',
        backup_frequency='daily'
    )

Cross-Region Replication Strategies

S3 Cross-Region Replication

// lib/s3-replication-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class S3ReplicationStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Source bucket in primary region (eu-central-1)
    const sourceBucket = new s3.Bucket(this, 'SourceBucket', {
      bucketName: 'forrict-production-data-primary',
      versioned: true, // Required for replication
      encryption: s3.BucketEncryption.S3_MANAGED,
      lifecycleRules: [
        {
          // Transition old versions to Glacier after 30 days
          noncurrentVersionTransitions: [
            {
              storageClass: s3.StorageClass.GLACIER,
              transitionAfter: cdk.Duration.days(30),
            },
          ],
          // Delete old versions after 90 days
          noncurrentVersionExpiration: cdk.Duration.days(90),
        },
      ],
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    // Destination bucket lives in the DR region (eu-west-1). S3 buckets are
    // regional, so it is created by a separate stack deployed there (with
    // versioning enabled, which replication requires) and referenced by ARN.
    const destinationBucket = s3.Bucket.fromBucketArn(
      this,
      'DestinationBucket',
      'arn:aws:s3:::forrict-production-data-dr'
    );

    // IAM role for replication
    const replicationRole = new iam.Role(this, 'ReplicationRole', {
      assumedBy: new iam.ServicePrincipal('s3.amazonaws.com'),
      path: '/service-role/',
    });

    // Grant the replication-specific permissions S3 requires; plain
    // read/write grants do not include the Replicate* actions
    replicationRole.addToPolicy(
      new iam.PolicyStatement({
        actions: [
          's3:GetReplicationConfiguration',
          's3:ListBucket',
          's3:GetObjectVersionForReplication',
          's3:GetObjectVersionAcl',
          's3:GetObjectVersionTagging',
        ],
        resources: [sourceBucket.bucketArn, sourceBucket.arnForObjects('*')],
      })
    );

    replicationRole.addToPolicy(
      new iam.PolicyStatement({
        actions: ['s3:ReplicateObject', 's3:ReplicateDelete', 's3:ReplicateTags'],
        resources: [destinationBucket.arnForObjects('*')],
      })
    );

    // Add replication configuration
    const cfnSourceBucket = sourceBucket.node.defaultChild as s3.CfnBucket;
    cfnSourceBucket.replicationConfiguration = {
      role: replicationRole.roleArn,
      rules: [
        {
          id: 'replicate-all-objects',
          status: 'Enabled',
          priority: 1,
          filter: {
            // Replicate all objects
            prefix: '',
          },
          destination: {
            bucket: destinationBucket.bucketArn,
            // Replicate storage class
            storageClass: 'STANDARD',
            // Enable replication time control (15 minutes SLA)
            replicationTime: {
              status: 'Enabled',
              time: {
                minutes: 15,
              },
            },
            // Enable metrics for monitoring
            metrics: {
              status: 'Enabled',
              eventThreshold: {
                minutes: 15,
              },
            },
          },
          // Delete marker replication
          deleteMarkerReplication: {
            status: 'Enabled',
          },
        },
      ],
    };

    // CloudWatch alarm for replication lag
    const replicationAlarm = new cdk.aws_cloudwatch.Alarm(
      this,
      'ReplicationLagAlarm',
      {
        alarmName: 's3-replication-lag',
        metric: new cdk.aws_cloudwatch.Metric({
          namespace: 'AWS/S3',
          metricName: 'ReplicationLatency',
          dimensionsMap: {
            SourceBucket: sourceBucket.bucketName,
            DestinationBucket: destinationBucket.bucketName,
          },
          statistic: 'Maximum',
          period: cdk.Duration.minutes(5),
        }),
        // Alert if replication takes longer than 30 minutes
        threshold: 1800, // seconds
        evaluationPeriods: 2,
      }
    );

    // Outputs
    new cdk.CfnOutput(this, 'SourceBucketName', {
      value: sourceBucket.bucketName,
    });

    new cdk.CfnOutput(this, 'DestinationBucketName', {
      value: destinationBucket.bucketName,
    });
  }
}
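
Once replication is running, you can spot-check individual objects: S3 stamps each source object with a replication status (PENDING, COMPLETED, or FAILED; the copy in the destination bucket reports REPLICA). A small sketch, with an illustrative object key:

# scripts/check_replication.py
import boto3

s3 = boto3.client('s3', region_name='eu-central-1')

def replication_status(bucket: str, key: str) -> str:
    """Read the x-amz-replication-status header via HeadObject."""
    head = s3.head_object(Bucket=bucket, Key=key)
    return head.get('ReplicationStatus', 'NOT_CONFIGURED')

print(replication_status('forrict-production-data-primary', 'data/example.json'))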

RDS Cross-Region Read Replica

// lib/rds-replica-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as rds from 'aws-cdk-lib/aws-rds';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import { Construct } from 'constructs';

export class RDSReplicaStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // VPC for database
    const vpc = ec2.Vpc.fromLookup(this, 'VPC', {
      vpcId: 'vpc-xxxxx', // Your VPC ID
    });

    // Create database credentials secret
    const databaseCredentials = new secretsmanager.Secret(
      this,
      'DatabaseCredentials',
      {
        secretName: 'production-db-credentials',
        generateSecretString: {
          secretStringTemplate: JSON.stringify({ username: 'admin' }),
          generateStringKey: 'password',
          excludePunctuation: true,
          includeSpace: false,
          passwordLength: 32,
        },
      }
    );

    // Primary RDS instance in eu-central-1
    const primaryDatabase = new rds.DatabaseInstance(this, 'PrimaryDatabase', {
      engine: rds.DatabaseInstanceEngine.postgres({
        version: rds.PostgresEngineVersion.VER_15_4,
      }),
      instanceType: ec2.InstanceType.of(
        ec2.InstanceClass.R6G,
        ec2.InstanceSize.XLARGE
      ),
      credentials: rds.Credentials.fromSecret(databaseCredentials),
      vpc,
      vpcSubnets: {
        subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
      },
      multiAz: true, // Multi-AZ for high availability
      allocatedStorage: 100,
      maxAllocatedStorage: 500, // Enable storage autoscaling
      storageType: rds.StorageType.GP3,
      storageEncrypted: true,
      databaseName: 'production',
      backupRetention: cdk.Duration.days(7),
      preferredBackupWindow: '03:00-04:00', // 3-4 AM UTC
      preferredMaintenanceWindow: 'Sun:04:00-Sun:05:00',
      deletionProtection: true,
      cloudwatchLogsExports: ['postgresql'],
      parameterGroup: new rds.ParameterGroup(this, 'ParameterGroup', {
        engine: rds.DatabaseInstanceEngine.postgres({
          version: rds.PostgresEngineVersion.VER_15_4,
        }),
        parameters: {
          // Enforce SSL and enable connection/statement logging
          'rds.force_ssl': '1',
          'log_connections': '1',
          'log_disconnections': '1',
          'log_duration': '1',
        },
      }),
      removalPolicy: cdk.RemovalPolicy.SNAPSHOT,
    });

    // Read replica in the DR region (eu-west-1). A true cross-region replica
    // lives in a stack deployed to eu-west-1 that imports the primary via
    // fromDatabaseInstanceAttributes (see the Pilot Light example below);
    // it is shown inline here to keep the example compact.
    const readReplica = new rds.DatabaseInstanceReadReplica(
      this,
      'ReadReplica',
      {
        sourceDatabaseInstance: primaryDatabase,
        instanceType: ec2.InstanceType.of(
          ec2.InstanceClass.R6G,
          ec2.InstanceSize.XLARGE
        ),
        vpc,
        // AZ within the DR region
        availabilityZone: 'eu-west-1a',
        storageEncrypted: true,
        // Read replica can have different backup retention
        backupRetention: cdk.Duration.days(7),
        deletionProtection: true,
        removalPolicy: cdk.RemovalPolicy.SNAPSHOT,
      }
    );

    // CloudWatch alarms
    const cpuAlarm = new cdk.aws_cloudwatch.Alarm(this, 'CPUAlarm', {
      alarmName: 'rds-high-cpu',
      metric: primaryDatabase.metricCPUUtilization(),
      threshold: 80,
      evaluationPeriods: 2,
    });

    const replicationLagAlarm = new cdk.aws_cloudwatch.Alarm(
      this,
      'ReplicationLagAlarm',
      {
        alarmName: 'rds-replication-lag',
        metric: readReplica.metric('ReplicaLag', {
          statistic: 'Average',
          period: cdk.Duration.minutes(1),
        }),
        // Alert if replication lag exceeds 60 seconds
        threshold: 60,
        evaluationPeriods: 3,
      }
    );

    // Outputs
    new cdk.CfnOutput(this, 'PrimaryDatabaseEndpoint', {
      value: primaryDatabase.dbInstanceEndpointAddress,
      description: 'Primary database endpoint',
    });

    new cdk.CfnOutput(this, 'ReadReplicaEndpoint', {
      value: readReplica.dbInstanceEndpointAddress,
      description: 'Read replica endpoint',
    });

    new cdk.CfnOutput(this, 'DatabaseSecretArn', {
      value: databaseCredentials.secretArn,
      description: 'Database credentials secret ARN',
    });
  }
}
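
DynamoDB Global Tables

For DynamoDB (listed in the roadmap at the start of this guide), cross-region replication is built in: adding a replica turns a table into a global table. A minimal sketch, assuming an illustrative table name and that DynamoDB Streams is enabled with NEW_AND_OLD_IMAGES:

# scripts/dynamodb_global_table.py
import boto3

dynamodb = boto3.client('dynamodb', region_name='eu-central-1')

# Add a replica in the DR region (global tables version 2019.11.21)
dynamodb.update_table(
    TableName='production-orders',  # illustrative table name
    ReplicaUpdates=[
        {'Create': {'RegionName': 'eu-west-1'}},
    ],
)

# Check replica status; each replica reports ACTIVE once it is serving
table = dynamodb.describe_table(TableName='production-orders')['Table']
for replica in table.get('Replicas', []):
    print(replica['RegionName'], replica['ReplicaStatus'])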

Disaster Recovery Patterns

Pattern 1: Backup and Restore (RPO: Hours, RTO: Hours)

Use Case: Non-critical applications where several hours of downtime is acceptable

Architecture:

  • Regular backups to S3/Glacier
  • No resources running in DR region
  • Lowest cost option

Cost: $50-500/month

Recovery Process:

  1. Detect outage in primary region
  2. Retrieve backups from S3
  3. Launch infrastructure in DR region using IaC
  4. Restore data from backups
  5. Update DNS to point to the DR region (scripted in the sketch below)
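
Steps 1-4 are handled by your monitoring, IaC, and restore tooling; step 5 is a single Route 53 call. A minimal sketch, using an illustrative DR endpoint and the hosted zone placeholder used elsewhere in this guide:

# scripts/failover_dns.py
import boto3

route53 = boto3.client('route53')

# Repoint app.forrict.nl at the DR endpoint (zone ID and endpoint illustrative)
route53.change_resource_record_sets(
    HostedZoneId='Z123456',
    ChangeBatch={
        'Comment': 'Failover to DR region',
        'Changes': [{
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': 'app.forrict.nl',
                'Type': 'CNAME',
                'TTL': 60,
                'ResourceRecords': [{'Value': 'dr-endpoint.eu-west-1.example.com'}],
            },
        }],
    },
)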

Pattern 2: Pilot Light (RPO: Minutes, RTO: 10-30 minutes)

Use Case: Production applications with moderate RTO/RPO requirements

Architecture:

  • Core infrastructure running in DR region (databases, minimal compute)
  • Data continuously replicated
  • Additional resources provisioned during disaster

Cost: $500-2000/month

// lib/pilot-light-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as rds from 'aws-cdk-lib/aws-rds';
import * as route53 from 'aws-cdk-lib/aws-route53';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

export class PilotLightStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Primary region: Full application stack
    // DR region: Only database replica + minimal infrastructure

    const vpc = ec2.Vpc.fromLookup(this, 'VPC', {
      isDefault: false,
    });

    // Database read replica (always running in DR region)
    const drDatabase = new rds.DatabaseInstanceReadReplica(
      this,
      'DRDatabase',
      {
        sourceDatabaseInstance: rds.DatabaseInstance.fromDatabaseInstanceAttributes(
          this,
          'SourceDB',
          {
            instanceIdentifier: 'primary-database',
            instanceEndpointAddress: 'primary-db.xxxxx.eu-central-1.rds.amazonaws.com',
            port: 5432,
            securityGroups: [],
          }
        ),
        instanceType: ec2.InstanceType.of(
          ec2.InstanceClass.T4G,
          ec2.InstanceSize.SMALL // Smaller instance in DR region
        ),
        vpc,
      }
    );

    // Launch template for the DR fleet; the Auto Scaling group below keeps
    // zero instances running day-to-day and is scaled out during a disaster
    const launchTemplate = new ec2.LaunchTemplate(this, 'DRLaunchTemplate', {
      instanceType: ec2.InstanceType.of(
        ec2.InstanceClass.T4G,
        ec2.InstanceSize.MEDIUM
      ),
      machineImage: ec2.MachineImage.latestAmazonLinux2023(),
      userData: ec2.UserData.forLinux(),
    });

    const asg = new cdk.aws_autoscaling.AutoScalingGroup(this, 'DRASG', {
      vpc,
      launchTemplate,
      minCapacity: 0, // No instances normally
      maxCapacity: 10,
      desiredCapacity: 0, // Raised (e.g. to 5) during a disaster
    });

    // Route53 health check and failover
    const hostedZone = route53.HostedZone.fromLookup(this, 'HostedZone', {
      domainName: 'forrict.nl',
    });

    // Primary region record. In production this carries a FAILOVER routing
    // policy (PRIMARY) plus an associated health check; the L2 ARecord
    // construct does not expose failover, so that is set on the underlying
    // CfnRecordSet escape hatch.
    const primaryRecord = new route53.ARecord(this, 'PrimaryRecord', {
      zone: hostedZone,
      recordName: 'app',
      target: route53.RecordTarget.fromIpAddresses('1.2.3.4'),
      ttl: cdk.Duration.seconds(60),
    });

    // DR region record (FAILOVER SECONDARY in production; named app-dr here so
    // it does not collide with the primary record in this simplified example).
    // ASGs expose no load balancer of their own, so the alias points at the DR
    // Application Load Balancer's hosted zone ID and DNS name.
    const drRecord = new route53.ARecord(this, 'DRRecord', {
      zone: hostedZone,
      recordName: 'app-dr',
      target: route53.RecordTarget.fromAlias({
        bind: () => ({
          hostedZoneId: 'Z123456', // the ALB's canonical hosted zone ID
          dnsName: 'dr-alb-123456.eu-west-1.elb.amazonaws.com', // DR ALB DNS
        }),
      }),
    });

    // Lambda function to promote read replica to standalone
    const promoteReplicaFunction = new cdk.aws_lambda.Function(
      this,
      'PromoteReplica',
      {
        runtime: cdk.aws_lambda.Runtime.PYTHON_3_11,
        handler: 'index.handler',
        code: cdk.aws_lambda.Code.fromInline(`
import boto3
import os

rds = boto3.client('rds')

def handler(event, context):
    db_instance_id = os.environ['DB_INSTANCE_ID']

    # Promote read replica to standalone instance
    response = rds.promote_read_replica(
        DBInstanceIdentifier=db_instance_id
    )

    print(f"Promoted read replica: {db_instance_id}")

    return {
        'statusCode': 200,
        'body': f"Promoted {db_instance_id} to standalone instance"
    }
        `),
        environment: {
          DB_INSTANCE_ID: drDatabase.instanceIdentifier,
        },
        timeout: cdk.Duration.minutes(5),
      }
    );

    // IAM permission to promote the replica. (grantConnect is for IAM database
    // authentication and is not needed for promotion.)
    promoteReplicaFunction.addToRolePolicy(
      new cdk.aws_iam.PolicyStatement({
        actions: ['rds:PromoteReadReplica'],
        resources: [drDatabase.instanceArn],
      })
    );
  }
}

Pattern 3: Warm Standby (RPO: Seconds, RTO: Minutes)

Use Case: Business-critical applications requiring minimal downtime

Architecture:

  • Scaled-down version of full environment running in DR region
  • Data synchronized in real-time
  • Can handle some production traffic

Cost: $2000-10000/month
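
Failover for a warm standby is mostly a scale-up exercise: promote the database replica (as in the Pilot Light Lambda above) and raise the standby fleet to production size. A minimal sketch, assuming an illustrative Auto Scaling group name and capacities:

# scripts/warm_standby_failover.py
import boto3

autoscaling = boto3.client('autoscaling', region_name='eu-west-1')

# Raise the ceiling first if production capacity exceeds the standby maximum
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='dr-web-asg',  # illustrative group name
    MaxSize=20,
)

# Scale the standby fleet from its reduced size to production capacity
autoscaling.set_desired_capacity(
    AutoScalingGroupName='dr-web-asg',
    DesiredCapacity=10,
    HonorCooldown=False,
)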

Pattern 4: Hot Standby / Active-Active (RPO: Near Zero, RTO: Seconds)

Use Case: Mission-critical applications with zero tolerance for downtime

Architecture:

  • Full production environment in multiple regions
  • Active-active configuration
  • Real-time data replication
  • Global load balancing

Cost: $10000+/month
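
With active-active, failover is a routing decision rather than a provisioning one: health-checked DNS (or AWS Global Accelerator) drops an unhealthy region automatically. A minimal Route 53 sketch, with illustrative endpoints, creating a health check and one half of a weighted record pair:

# scripts/active_active_dns.py
import boto3
import uuid

route53 = boto3.client('route53')

# Health check against the eu-west-1 endpoint (endpoint is illustrative)
health = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        'FullyQualifiedDomainName': 'app-eu-west-1.forrict.nl',
        'Port': 443,
        'Type': 'HTTPS',
        'ResourcePath': '/health',
        'RequestInterval': 30,
        'FailureThreshold': 3,
    },
)

# Weighted record for the eu-west-1 side; a twin record with a different
# SetIdentifier and its own health check covers eu-central-1
route53.change_resource_record_sets(
    HostedZoneId='Z123456',
    ChangeBatch={'Changes': [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': 'app.forrict.nl',
            'Type': 'CNAME',
            'SetIdentifier': 'eu-west-1',
            'Weight': 50,
            'TTL': 60,
            'HealthCheckId': health['HealthCheck']['Id'],
            'ResourceRecords': [{'Value': 'app-eu-west-1.forrict.nl'}],
        },
    }]},
)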

Testing and Validation

Automated DR Testing Script

# scripts/dr_testing.py
import boto3
import time
from typing import Dict
from datetime import datetime

class DRTester:
    """Automated disaster recovery testing"""

    def __init__(self, primary_region: str, dr_region: str):
        self.primary_region = primary_region
        self.dr_region = dr_region
        self.backup_client = boto3.client('backup', region_name=dr_region)
        self.rds_client = boto3.client('rds', region_name=dr_region)
        self.ec2_client = boto3.client('ec2', region_name=dr_region)

    def test_backup_restore(
        self,
        recovery_point_arn: str,
        test_instance_name: str
    ) -> Dict:
        """
        Test restoring from backup

        Returns test results and metrics
        """
        print("Starting DR restore test...")
        start_time = datetime.now()

        try:
            # Start restore job
            response = self.backup_client.start_restore_job(
                RecoveryPointArn=recovery_point_arn,
                Metadata={
                    'DBInstanceIdentifier': test_instance_name,
                    'DBInstanceClass': 'db.t3.small',
                    'Engine': 'postgres',
                },
                IamRoleArn='arn:aws:iam::123456789012:role/AWSBackupServiceRole',
                ResourceType='RDS'
            )

            restore_job_id = response['RestoreJobId']
            print(f"Restore job started: {restore_job_id}")

            # Monitor restore progress (add a timeout for production use)
            while True:
                job_status = self.backup_client.describe_restore_job(
                    RestoreJobId=restore_job_id
                )

                status = job_status['Status']
                print(f"Restore status: {status}")

                if status == 'COMPLETED':
                    break
                elif status in ['FAILED', 'ABORTED']:
                    raise Exception(f"Restore failed with status: {status}")

                time.sleep(30)

            # Calculate RTO
            end_time = datetime.now()
            rto_seconds = (end_time - start_time).total_seconds()

            # Verify restored instance
            restored_resource = job_status['CreatedResourceArn']
            verification_result = self._verify_database(test_instance_name)

            # Clean up test instance
            self._cleanup_test_resources(test_instance_name)

            return {
                'success': True,
                'rto_seconds': rto_seconds,
                'rto_minutes': round(rto_seconds / 60, 2),
                'restored_resource': restored_resource,
                'verification': verification_result,
                'timestamp': datetime.now().isoformat()
            }

        except Exception as e:
            return {
                'success': False,
                'error': str(e),
                'timestamp': datetime.now().isoformat()
            }

    def _verify_database(self, db_instance_id: str) -> Dict:
        """Verify restored database is accessible"""
        try:
            response = self.rds_client.describe_db_instances(
                DBInstanceIdentifier=db_instance_id
            )

            db = response['DBInstances'][0]

            return {
                'status': db['DBInstanceStatus'],
                'endpoint': db.get('Endpoint', {}).get('Address'),
                'engine': db['Engine'],
                'storage_gb': db['AllocatedStorage'],
                'accessible': db['DBInstanceStatus'] == 'available'
            }
        except Exception as e:
            return {
                'accessible': False,
                'error': str(e)
            }

    def _cleanup_test_resources(self, db_instance_id: str) -> None:
        """Clean up test resources"""
        try:
            self.rds_client.delete_db_instance(
                DBInstanceIdentifier=db_instance_id,
                SkipFinalSnapshot=True,
                DeleteAutomatedBackups=True
            )
            print(f"Cleaned up test instance: {db_instance_id}")
        except Exception as e:
            print(f"Error cleaning up: {str(e)}")

    def test_failover(self, hosted_zone_id: str, record_name: str) -> Dict:
        """Test DNS failover"""
        route53 = boto3.client('route53')

        try:
            # Get current DNS records
            response = route53.list_resource_record_sets(
                HostedZoneId=hosted_zone_id,
                StartRecordName=record_name,
                MaxItems='1'
            )

            original_records = response['ResourceRecordSets']

            # Simulate failover by updating health check
            # In production, health check would fail automatically

            return {
                'success': True,
                'original_records': original_records,
                'failover_tested': True
            }
        except Exception as e:
            return {
                'success': False,
                'error': str(e)
            }

# Example usage
if __name__ == '__main__':
    tester = DRTester(
        primary_region='eu-central-1',
        dr_region='eu-west-1'
    )

    # Test backup restore
    result = tester.test_backup_restore(
        recovery_point_arn='arn:aws:backup:eu-west-1:123456789012:recovery-point:xxxxx',
        test_instance_name='dr-test-instance'
    )

    print("\n=== DR Test Results ===")
    print(f"Success: {result['success']}")
    print(f"RTO: {result.get('rto_minutes', 'N/A')} minutes")
    print(f"Verification: {result.get('verification', {})}")

Best Practices

1. Backup Strategy

  • Implement 3-2-1 rule: 3 copies, 2 different media, 1 offsite
  • Automate backups with AWS Backup
  • Test restores regularly (monthly minimum; the schedule itself can be automated, see the sketch after this list)
  • Tag resources for automated backup
  • Implement lifecycle policies to control costs
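
The restore-test cadence can itself be automated: an EventBridge rule can invoke the DR testing script from the previous section on a schedule. A minimal sketch (rule name, Lambda ARN, and cron expression are illustrative):

# scripts/schedule_dr_test.py
import boto3

events = boto3.client('events', region_name='eu-west-1')

# Fire at 06:00 UTC on the 1st of every month
events.put_rule(
    Name='monthly-dr-test',
    ScheduleExpression='cron(0 6 1 * ? *)',
    State='ENABLED',
)

# Target a Lambda that wraps the DRTester class shown earlier
events.put_targets(
    Rule='monthly-dr-test',
    Targets=[{
        'Id': 'dr-test-lambda',
        'Arn': 'arn:aws:lambda:eu-west-1:123456789012:function:dr-test',
    }],
)

# The Lambda also needs a resource-based permission allowing
# events.amazonaws.com to invoke it (lambda add-permission).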

2. Data Protection

  • Enable versioning on S3 buckets
  • Use cross-region replication for critical data
  • Encrypt all backups at rest and in transit
  • Implement backup retention policies aligned with compliance
  • Monitor backup job success rates

3. Recovery Planning

  • Document RTO and RPO requirements for each workload
  • Create detailed runbooks for recovery procedures
  • Automate recovery where possible
  • Practice failure scenarios (chaos engineering)
  • Maintain up-to-date contact lists

4. Cost Optimization

  • Use Glacier for long-term retention (see the lifecycle sketch after this list)
  • Implement intelligent tiering for S3
  • Right-size DR infrastructure
  • Use spot instances for DR testing
  • Delete old backups automatically
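
The Glacier and automatic-deletion items translate directly into an S3 lifecycle configuration. A minimal sketch, reusing the source bucket from the replication example (the prefix is illustrative):

# scripts/backup_lifecycle.py
import boto3

s3 = boto3.client('s3', region_name='eu-central-1')

s3.put_bucket_lifecycle_configuration(
    Bucket='forrict-production-data-primary',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'backup-retention',
            'Filter': {'Prefix': 'backups/'},
            'Status': 'Enabled',
            # Move to Glacier after 90 days, delete after a year
            'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
            'Expiration': {'Days': 365},
        }],
    },
)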

5. Monitoring and Alerting

  • Monitor backup job completion
  • Alert on replication lag
  • Track RTO/RPO compliance
  • Monitor cross-region bandwidth costs
  • Set up recovery time tracking

Conclusion

A comprehensive backup and disaster recovery strategy is essential for business continuity. AWS provides powerful tools like AWS Backup, cross-region replication, and multi-region architectures to protect your data and applications. The key is choosing the right DR pattern based on your RTO/RPO requirements and budget constraints.

Key Takeaways:

  • Define clear RTO and RPO requirements for each workload
  • Automate backups using AWS Backup service
  • Implement cross-region replication for critical data
  • Choose appropriate DR pattern (Backup & Restore, Pilot Light, Warm Standby, Hot Standby)
  • Test recovery procedures regularly
  • Monitor backup compliance and replication lag

Ready to implement a robust disaster recovery strategy? Forrict can help you design and implement backup and DR solutions tailored to your business requirements and compliance needs.
