Memos

Technical memos and in-depth research documents

Overview

This section contains technical memos and in-depth research documents covering architectural decisions, technical analyses, and comprehensive guides for complex topics.

Browse Memos

Navigate through the sidebar to explore individual memos, or use the search feature to find specific documents.

1 - Sample Memo

A sample memo demonstrating the structure and format

Executive Summary

This is a sample memo that demonstrates the recommended structure and format for technical memos. It provides a template that can be used as a starting point for creating new memos.

Introduction

Purpose

Explain the purpose of the memo and what problems or questions it addresses.

Scope

Define the scope of the document, including what is covered and what is explicitly out of scope.

Audience

Identify the intended audience and any prerequisite knowledge required.

Background

Provide necessary background information and context for understanding the memo content.

Current State

Describe the current situation, challenges, or problems that motivated this memo.

Requirements

List any requirements or constraints that influenced the analysis or recommendations.

Technical Analysis

Approach

Describe the methodology or approach used in the analysis.

Findings

Present the key findings from the technical analysis.

Trade-offs

Discuss the trade-offs considered and how different options were evaluated.

Recommendations

Provide specific recommendations based on the analysis.

Implementation Considerations

Discuss practical considerations for implementing the recommendations.

Risk Assessment

Identify potential risks and mitigation strategies.

Conclusion

Summarize the key points and recommendations.

References

  • List relevant documentation
  • External resources
  • Related memos

2 - Viitata Tenancy Infrastructure

Migration from single-tenant to multi-tenant architecture

Executive Summary

This memo documents the strategic architecture migration for Viitata from a single-tenant-per-instance model to a multi-tenant architecture on Heroku. This migration addresses critical operational inefficiencies, enables deployment of the new Viitata version with its required worker architecture, and significantly reduces both current costs and the cost of scaling while eliminating DevOps friction for client onboarding.

Key Changes:

  • Architecture: Single-tenant-per-instance → Multi-tenant shared infrastructure
  • Platform: Heroku (no change)
  • Application Version: Current (single worker) → New version (3 workers required)
  • Cost Impact: $96/month currently → $288/month if upgraded on single-tenant → $130/month on multi-tenant
  • Cost Savings: 55% reduction vs. deploying new version on single-tenant architecture

Introduction

Purpose

This document outlines the rationale, technical approach, and benefits of migrating Viitata from a distributed single-tenant-per-instance model to a consolidated multi-tenant architecture on Heroku.

Scope

This memo covers:

  • Current single-tenant-per-instance architecture on Heroku
  • New Viitata version requirements (3-worker architecture)
  • Proposed multi-tenant architecture on Heroku
  • Cost analysis and operational benefits
  • Technical considerations and trade-offs

Out of scope:

  • Detailed application code changes for multi-tenancy
  • Specific Heroku configuration details
  • Data migration procedures and implementation timeline

Audience

This document is intended for technical leadership, DevOps engineers, and stakeholders involved in infrastructure planning and decision-making.

Background

Current State: Single-Tenant-Per-Instance on Heroku

Viitata currently operates with a single-tenant-per-instance model on Heroku, consisting of:

Infrastructure per Tenant:

  • 1 Heroku application instance (single worker)
  • 1 PostgreSQL database
  • 1 Redis cache instance
  • Cost: ~$16/month per tenant

Current Deployment:

  • 6 production instances running current Viitata version
  • Each instance operates as a single-worker application
  • Total monthly cost: ~$96 (6 instances × $16)
  • Each instance requires independent CI/CD pipeline
  • Each instance requires separate DevOps configuration

Important Note: The current architecture runs an older version of Viitata that does not require the 3-worker architecture. The new version, however, cannot be deployed economically without the infrastructure change described in this memo.

Challenges with Current Architecture

1. Cost Scalability Concerns

With 6 production tenants at $16/month each, the current architecture costs approximately $96/month. While manageable at this scale, the cost scales linearly with each new tenant ($16 per additional tenant). More critically, the new version of Viitata requires a 3-worker architecture that would triple costs to approximately $288/month for the same 6 tenants.

2. DevOps Friction

Each new client onboarding requires:

  • Provisioning new Heroku application
  • Configuring new PostgreSQL database
  • Setting up new Redis cache
  • Configuring CI/CD pipeline
  • Managing environment variables and secrets
  • Setting up monitoring and logging

This creates substantial friction and delays in client onboarding.

3. CI/CD Maintenance Overhead

Maintaining 6 separate CI/CD pipelines creates:

  • Increased complexity in deployment processes
  • Higher risk of configuration drift
  • Difficulty in applying updates uniformly
  • Additional testing burden across instances

4. Blocking Issue: New Viitata Version Requirements

The new version of Viitata fundamentally requires three distinct worker types to function:

  • Web worker: Handles HTTP requests
  • Celery worker: Processes asynchronous tasks
  • Celery Beat worker: Manages scheduled tasks and periodic jobs

This is not optional - the new Viitata version cannot be deployed without all three workers running.

Under the single-tenant model, deploying the new version would require:

  • 18 total worker processes (6 instances × 3 workers)
  • Tripling of infrastructure costs per tenant (from $16 to ~$48 per tenant)
  • Total monthly cost increase from $96 to approximately $288/month
  • 18 separate processes to monitor and manage

Critical Impact: The single-tenant architecture makes it economically and operationally prohibitive to deploy the new version of Viitata. Without migrating to multi-tenant, the platform cannot evolve.

Technical Analysis

Proposed Architecture: Multi-Tenant on Heroku

The new architecture consolidates all tenants into a single shared Heroku infrastructure:

Shared Infrastructure:

  • 1 Heroku application (supporting 3 worker types)
  • 1 Heroku PostgreSQL database (with tenant isolation)
  • 1 Heroku Redis cache (with tenant namespacing)
  • Estimated cost: ~$130/month total

Worker Configuration:

  • 1 web worker (serving all tenants)
  • 1 Celery worker (processing tasks for all tenants)
  • 1 Celery Beat worker (managing schedules for all tenants)
  • Total: 3 workers supporting all tenants

Cost Analysis

| Architecture Model | Viitata Version | Tenants | Workers | Monthly Cost | Cost per Tenant |
|---|---|---|---|---|---|
| Current (Single-Tenant) | Old | 6 | 6 (1 per instance) | $96 | $16.00 |
| Single-Tenant Upgraded | New | 6 | 18 (3 per instance) | $288 | $48.00 |
| Multi-Tenant (Proposed) | New | 6 | 3 (shared) | $130 | $21.67 |
| Savings vs. Upgraded | - | - | -83% | -$158/month | -55% |

Key Insights:

  • Current architecture cannot run the new Viitata version without significant cost increase
  • New version’s 3-worker requirement would triple single-tenant costs ($96 → $288)
  • Multi-tenant architecture enables new version deployment at 55% lower cost than single-tenant upgrade
  • Marginal cost advantage: Adding tenant #7 costs $0/month (vs. $48/month in single-tenant)
  • Cost efficiency improves with scale: 10 tenants = $13/tenant, 20 tenants = $6.50/tenant
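
The last two insights are simple division of the roughly flat shared cost. A quick check of the arithmetic, using the memo's own figures:

# Cost per tenant on shared multi-tenant infrastructure (~$130/month,
# roughly flat until the scaling threshold is reached).
MONTHLY_COST = 130
for tenants in (6, 10, 20):
    print(f"{tenants} tenants -> ${MONTHLY_COST / tenants:.2f}/tenant")
# 6 tenants -> $21.67/tenant
# 10 tenants -> $13.00/tenant
# 20 tenants -> $6.50/tenant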

Benefits

1. Cost Reduction

  • 55% reduction in infrastructure costs vs. the single-tenant upgrade ($288 → $130/month)
  • Costs remain flat as tenant count grows (until scaling threshold)
  • Predictable cost model

2. Operational Efficiency

  • Single CI/CD pipeline for all tenants
  • Unified deployment process
  • Consistent configuration across all tenants
  • Reduced maintenance overhead

3. Client Onboarding

  • Near-instant tenant provisioning (database record vs. full infrastructure)
  • Minimal DevOps involvement
  • Faster time-to-value for new clients
  • Reduced onboarding friction

4. Enables New Viitata Version Deployment

  • Supports required 3-worker architecture (web, Celery, Celery Beat)
  • 3 shared workers support all tenants (vs. 18 separate workers in single-tenant)
  • Makes new version economically viable to deploy
  • Simplified monitoring and management
  • Better resource utilization
  • Easier to scale horizontally when needed

Technical Considerations

Data Isolation

  • Tenant identification at application layer
  • Row-level security in PostgreSQL
  • Redis key namespacing by tenant ID
  • Careful query design to prevent data leakage
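
As a concrete illustration of the first and third points above, a minimal sketch of application-layer tenant context and Redis key namespacing. It assumes a Django application and subdomain-based tenant resolution; all names here are hypothetical, not Viitata's actual code:

import threading

# Request-scoped tenant context (hypothetical; the real resolution
# strategy could be header- or path-based instead of subdomain-based).
_tenant_context = threading.local()

def get_current_tenant_id():
    return getattr(_tenant_context, "tenant_id", None)

class TenantMiddleware:
    """Resolve the tenant once per request so queries and cache keys can be scoped."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # e.g. acme.viitata.example -> tenant "acme" (assumption)
        _tenant_context.tenant_id = request.get_host().split(".")[0]
        try:
            return self.get_response(request)
        finally:
            _tenant_context.tenant_id = None

def tenant_cache_key(key: str) -> str:
    """Namespace every Redis key by tenant ID to prevent cross-tenant reads."""
    return f"tenant:{get_current_tenant_id()}:{key}"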

Performance

  • Shared resources require proper resource allocation
  • Connection pooling for database efficiency
  • Caching strategies to prevent tenant interference
  • Monitoring to identify tenant-specific performance issues

Security

  • Tenant isolation at application and data layers
  • Secure tenant context management
  • Audit logging for compliance
  • Regular security reviews of multi-tenant code paths

Scalability

  • Horizontal scaling when single instance reaches capacity
  • Database sharding if needed for very large tenant counts
  • CDN and edge caching for static assets
  • Load balancing across multiple application instances

Trade-offs

Advantages

  • Dramatic cost reduction
  • Simplified operations
  • Faster client onboarding
  • Better resource utilization
  • Easier maintenance and updates

Disadvantages

  • Tenant isolation complexity in application code
  • Potential “noisy neighbor” issues
  • Database restore impact: Currently, database snapshots can be restored per tenant without affecting other clients. In a multi-tenant architecture a database restore affects all tenants simultaneously, so a single client’s data cannot be rolled back after a bug or data issue
  • More complex deployment rollback scenarios
  • Requires careful tenant-aware code design
  • Less isolation between tenants compared to separate instances

Risk Mitigation

  • Comprehensive testing of tenant isolation
  • Resource limits per tenant
  • Monitoring and alerting for anomalies
  • Gradual migration approach
  • Ability to isolate problematic tenants if needed
  • Database restore mitigation:
    • Implement application-level point-in-time recovery per tenant
    • Maintain granular database backups with tenant-specific restore capabilities
    • Use transaction logs to selectively restore tenant data
    • Establish procedures for tenant-specific data rollback without full database restore
    • More rigorous testing and staging processes to prevent production data issues
    • Consider automated daily tenant-level logical backups (pg_dump per tenant)
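
A hedged sketch of the last mitigation, a tenant-level logical backup: since a literal per-tenant pg_dump operates on whole tables, each tenant-scoped table can instead be exported with a filtered COPY. The table names, tenant_id column, and connection details are illustrative assumptions:

import psycopg2

# Assumption: every tenant-scoped table carries a tenant_id column.
TENANT_TABLES = ["accounts", "documents", "events"]

def backup_tenant(dsn: str, tenant_id: int, out_dir: str) -> None:
    """Export one tenant's rows as CSV so a single client can be
    restored without a full database restore."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for table in TENANT_TABLES:
                query = (
                    f"COPY (SELECT * FROM {table} "
                    f"WHERE tenant_id = {int(tenant_id)}) "
                    "TO STDOUT WITH CSV HEADER"
                )
                with open(f"{out_dir}/{table}_tenant_{tenant_id}.csv", "w") as f:
                    cur.copy_expert(query, f)
    finally:
        conn.close()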

Conclusion

The migration from a single-tenant-per-instance architecture to a multi-tenant architecture on Heroku represents a strategic necessity for Viitata’s evolution. This change delivers:

  • Deployment of the new Viitata version with its required 3-worker architecture
  • 55% cost reduction vs. deploying the new version on single-tenant ($288/month → $130/month)
  • Dramatic reduction in operational complexity (6 CI/CD pipelines → 1, 18 workers → 3)
  • Near-zero marginal cost for new tenants ($0 vs. $48/tenant in single-tenant)
  • Improving cost efficiency at scale: cost per tenant decreases as the platform grows
  • Elimination of DevOps friction in client onboarding

Without this migration, deploying the new version of Viitata would nearly triple costs while adding significant operational burden. The multi-tenant architecture not only makes the new version economically viable but also positions Viitata for sustainable growth with costs that improve with scale.

While multi-tenancy introduces complexity in application design around tenant isolation and data security, the alternative—remaining on single-tenant architecture—would either block the platform’s evolution or make it financially unsustainable. The operational benefits, cost savings, and improved scalability make this migration essential for Viitata’s future.

3 - Heroku to AWS Migration

Migration from Heroku to AWS for improved compliance, cost, and control

Executive Summary

This memo documents the strategic platform migration for Viitata from Heroku to AWS (Amazon Web Services). This migration addresses critical compliance requirements around UK data residency, reduces infrastructure costs, provides greater operational flexibility and control, and enables better performance and integration with additional AWS services.

Key Changes:

  • Platform: Heroku → AWS ECS (Elastic Container Service)
  • Region: EU-West-1 (Ireland) → EU-West-2 (London, UK)
  • Primary Driver: Compliance - UK data residency and backup retention
  • Additional Benefits: Cost reduction, greater control, performance improvements, AWS service ecosystem

Critical Compliance Issue: On Heroku, the primary database runs in EU-West-1 (Ireland), but database backups are retained in the USA. This creates compliance risk for UK data residency requirements. AWS enables full infrastructure and data containment within EU-West-2 (London).

Introduction

Purpose

This document outlines the rationale, technical approach, and benefits of migrating Viitata’s multi-tenant infrastructure from Heroku to AWS, with a focus on achieving UK data sovereignty and compliance requirements while improving operational capabilities.

Scope

This memo covers:

  • Current multi-tenant architecture on Heroku
  • Compliance and data residency challenges
  • Proposed multi-tenant architecture on AWS ECS
  • Cost analysis and operational benefits
  • Technical considerations and trade-offs

Out of scope:

  • Detailed AWS infrastructure-as-code configurations
  • Specific containerization implementation details
  • Data migration procedures and implementation timeline
  • Application code changes required for AWS

Audience

This document is intended for technical leadership, compliance officers, DevOps engineers, and stakeholders involved in infrastructure planning and decision-making.

Background

Current State: Multi-Tenant on Heroku

Viitata currently operates with a multi-tenant architecture on Heroku, consisting of:

Infrastructure:

  • 1 Heroku application (3 dynos: web, Celery worker, Celery Beat)
  • 1 Heroku PostgreSQL database in EU-West-1 (Ireland)
  • 1 Heroku Redis cache
  • Current cost: ~$130/month

Current Deployment:

  • Multi-tenant architecture supporting 6 production tenants
  • Single CI/CD pipeline
  • Heroku-managed infrastructure and scaling
  • Automatic SSL, DNS, and platform maintenance

Critical Issues with Current Platform

1. Compliance and Data Residency (Primary Driver)

Database Backup Location:

  • Primary database: EU-West-1 (Ireland, EU)
  • Database backups: Stored in USA (Heroku’s backup infrastructure)

This creates significant compliance risks:

  • UK data residency requirements cannot be met
  • Backup data crosses international boundaries
  • Potential violations of data protection regulations
  • Risk for clients requiring UK-only data storage
  • Audit and compliance reporting challenges

Regional Limitation:

  • Application and database in Ireland (EU-West-1), not UK
  • No option for UK-specific region on Heroku
  • Cannot guarantee UK data sovereignty

2. Cost Considerations

While Heroku provides managed services, the cost includes:

  • Premium for managed platform (~30-40% over raw compute)
  • Limited ability to optimize resource allocation
  • Dyno pricing model less flexible than AWS instance types
  • Add-on costs (PostgreSQL, Redis) with limited customization

3. Limited Control and Flexibility

Infrastructure Control:

  • Cannot customize underlying OS or runtime environment
  • Limited control over networking and security groups
  • Restricted access to infrastructure-level monitoring
  • Cannot implement custom security controls

Resource Optimization:

  • Fixed dyno sizes with limited granularity
  • Cannot right-size resources for specific workloads
  • Limited ability to use spot instances or reserved capacity
  • Cannot separate worker resources by type

4. Performance Constraints

Heroku Limitations:

  • Shared infrastructure with potential noisy neighbor issues
  • Limited database connection pooling options
  • Router timeout constraints (30 seconds)
  • Limited control over caching layers
  • Cannot implement custom CDN configurations

5. AWS Service Integration

Current limitations for integrating with AWS services:

  • External network calls to AWS services (S3, SES, etc.)
  • Additional latency for AWS service integration
  • Cannot use VPC peering or private networking
  • Limited IAM role-based security
  • Cannot leverage AWS-native monitoring and logging

Technical Analysis

Proposed Architecture: Multi-Tenant on AWS ECS

The new architecture migrates the multi-tenant application to AWS infrastructure with a fully containerized, role-based security model.

Architecture Overview

The following diagram illustrates the proposed AWS architecture:

graph TB
    subgraph Internet
        Users[Users/Clients]
    end

    subgraph "AWS EU-West-2 (London)"
        subgraph "VPC"
            subgraph "Public Subnets"
                ALB[Application Load Balancer<br/>HTTPS:443]
                NAT[NAT Gateway]
            end

            subgraph "Private Subnets"
                subgraph "ECS Fargate Cluster"
                    Web[Web Tasks<br/>nginx + gunicorn<br/>Auto-scaling]
                    Worker[Celery Worker Tasks<br/>Auto-scaling]
                    Beat[Celery Beat Task<br/>Single instance]
                end

                RDS[(RDS PostgreSQL<br/>Multi-AZ<br/>Automated Backups)]
                Cache[(ElastiCache Valkey<br/>Redis-compatible<br/>Cache & Results)]
            end
        end

        SQS[Amazon SQS<br/>Celery Message Broker<br/>Task Queue]
        S3[S3 Bucket<br/>Media Storage]
        CW[CloudWatch<br/>Logs & Metrics]
        SM[Secrets Manager<br/>Credentials]
    end

    Users -->|HTTPS| ALB
    ALB -->|Routes traffic| Web

    Web -->|IAM Role| S3
    Web -->|Read/Write| RDS
    Web -->|Cache/Sessions| Cache
    Web -->|Send tasks| SQS
    Web -->|Logs| CW
    Web -->|Get secrets| SM

    Worker -->|IAM Role| S3
    Worker -->|Read/Write| RDS
    Worker -->|Receive/Delete tasks| SQS
    Worker -->|Store results| Cache
    Worker -->|Logs| CW
    Worker -->|Get secrets| SM

    Beat -->|Send scheduled tasks| SQS
    Beat -->|Read/Write| RDS
    Beat -->|Logs| CW

    Web -.->|Outbound via| NAT
    Worker -.->|Outbound via| NAT
    Beat -.->|Outbound via| NAT

    classDef public fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef private fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef data fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef compute fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef aws fill:#fff9c4,stroke:#f57f17,stroke-width:2px

    class ALB,NAT public
    class Web,Worker,Beat compute
    class RDS,Cache data
    class S3,CW,SM,SQS aws

Core Infrastructure Components:

  1. Database Layer

    • Amazon RDS for PostgreSQL in EU-West-2 (London)
    • Multi-AZ deployment for high availability
    • Automated backups retained in EU-West-2
    • Point-in-time recovery capabilities
    • All tenant data with row-level isolation
  2. Cache Layer

    • Amazon ElastiCache with Valkey (Redis-compatible) in EU-West-2
    • Used for session storage, application caching, and Celery result backend
    • Tenant-namespaced keys for data isolation
    • High-performance in-memory data store
  3. Message Queue Layer

    • Amazon SQS (Simple Queue Service) in EU-West-2
    • Celery message broker for task distribution
    • Fully managed, serverless message queue
    • No infrastructure to maintain or scale
    • Automatic message retention and delivery
    • Dead letter queue for failed tasks
    • FIFO queues for task ordering if needed
    • Cost-effective: Pay only for messages processed
  4. Compute Layer - ECS Fargate

    Three separate ECS task definitions running on Fargate:

    Task Definition 1: Web Application

    • Container: nginx + gunicorn
    • Receives traffic from Application Load Balancer (ALB)
    • Handles HTTPS requests routed by ALB
    • Auto-scaling based on CPU/memory and request count
    • Appropriate resources: ~0.5-1 vCPU, 1-2GB memory
    • Multiple tasks for high availability and load distribution
    • ALB distributes traffic across all healthy web tasks

    Task Definition 2: Celery Worker

    • Container: Celery worker process
    • Consumes tasks from Amazon SQS queue
    • Auto-scaling based on SQS queue depth (ApproximateNumberOfMessagesVisible) and CPU utilization
    • Right-sized resources: ~0.25-0.5 vCPU, 0.5-1GB memory
    • Can scale independently based on task backlog in SQS

    Task Definition 3: Celery Beat

    • Container: Celery Beat scheduler
    • Manages periodic and scheduled tasks
    • Publishes scheduled tasks to SQS queue
    • Fixed scaling: Single task (Beat requires single instance)
    • Minimal resources: ~0.25 vCPU, 0.5GB memory
    • Auto-restart on failure

    Rationale for Separate Task Definitions:

    • Each workload has different resource requirements
    • Independent scaling policies per service type
    • Web scales with traffic, workers scale with queue depth
    • Cost optimization: Right-size each workload separately
    • Isolation: Issues in one service don’t affect others

    Auto-Scaling Configuration:

    ECS provides automatic scaling that adjusts the number of running tasks based on demand, with both scale-up and scale-down capabilities (a brief API sketch follows the component list, after "Region and Compliance"):

    Web Tasks Auto-Scaling:

    • Metrics: CPU utilization, memory utilization, ALB request count per target
    • Scale-up triggers:
      • CPU > 70% for 2 minutes → Add tasks
      • Requests per task > 1000/min → Add tasks
    • Scale-down triggers:
      • CPU < 30% for 5 minutes → Remove tasks
      • Requests per task < 200/min → Remove tasks
    • Min/Max tasks: 2 minimum (HA), 10 maximum
    • Benefits: Handles traffic spikes from multiple tenants, scales down during low usage to save costs

    Celery Worker Auto-Scaling:

    • Metrics: CPU utilization, SQS ApproximateNumberOfMessagesVisible (native CloudWatch metric)
    • Scale-up triggers:
      • SQS queue depth > 100 messages → Add workers
      • CPU > 80% for 3 minutes → Add workers
    • Scale-down triggers:
      • SQS queue depth < 10 messages for 10 minutes → Remove workers
      • CPU < 20% for 10 minutes → Remove workers
    • Min/Max tasks: 1 minimum, 5 maximum
    • Benefits: SQS provides native queue metrics for accurate scaling decisions; efficiently processes task backlog, reduces to minimum during idle periods

    Celery Beat Scaling:

    • Fixed at 1 task (Beat scheduler requires single instance)
    • Auto-restart on failure for reliability

    Multi-Tenant Scaling Benefits:

    Auto-scaling is particularly valuable for multi-tenant architecture:

    • Unpredictable tenant activity: Different tenants have different usage patterns and peak times
    • Cost efficiency: Automatically scales down during low-usage periods (nights, weekends)
    • Spike handling: Automatically scales up when multiple tenants become active simultaneously
    • Resource optimization: Pays only for resources actually needed at any given time
    • Example scenario:
      • During business hours (9am-5pm): 6-8 web tasks handle peak multi-tenant load
      • During nights (11pm-6am): Scales down to 2 web tasks, saving ~$40-60/month
      • Weekend spikes: Auto-scales up to handle unexpected tenant activity

    Comparison to Heroku:

    | Scaling Feature | Heroku | AWS ECS |
    |---|---|---|
    | Scale-up | Manual or via add-ons | Automatic based on metrics |
    | Scale-down | Manual only | Automatic (saves costs) |
    | Scaling metrics | Limited (response time, throughput) | Extensive (CPU, memory, custom CloudWatch metrics, ALB metrics, queue depth) |
    | Per-service scaling | Requires multiple apps | Built-in per task definition |
    | Cost during low usage | Fixed (pays for min dynos) | Dynamic (scales to minimum) |
    | Multi-tenant optimization | Limited | Excellent - handles variable tenant load patterns |

    Cost Impact:

    • Scale-down capability can reduce compute costs by 40-50% during off-peak hours
    • For multi-tenant with variable load, average monthly compute cost drops significantly
    • Example: Instead of running 6 web tasks 24/7, average 4 tasks/hour = 33% cost reduction
  5. Networking and Load Balancing

    • Application Load Balancer (ALB) as entry point for all web traffic
      • Sits in public subnets
      • Terminates HTTPS/SSL connections
      • Routes traffic to web task definition only
      • Health checks on web tasks
      • Automatically distributes load across multiple web task instances
    • VPC with public and private subnets across multiple Availability Zones
    • Private subnets for ECS tasks, RDS, and ElastiCache (no direct internet access)
    • Public subnets for ALB only
    • NAT Gateway for outbound internet access from private subnets
    • Security groups for service-level network isolation
      • ALB security group: Allow inbound 443 from internet
      • Web task security group: Allow inbound from ALB only
      • Worker task security groups: No inbound internet traffic
      • RDS/ElastiCache security groups: Allow access from ECS tasks only
  6. Security Model: Role-Based Authentication

    Shift from IAM Users to IAM Roles:

    • Current (Heroku): IAM user credentials stored as environment variables for AWS service access (S3, SES, etc.)
    • Proposed (AWS): ECS task IAM roles with least-privilege permissions

    IAM Role-Based Security Benefits:

    • No stored credentials: Tasks assume roles automatically via ECS Task Role
    • Dramatically reduced credential leakage risk: No long-lived access keys in environment variables or code
    • Automatic credential rotation: AWS STS provides temporary credentials (auto-expire and rotate)
    • Least-privilege access: Each task definition gets only required permissions
    • Audit trail: CloudTrail logs all role assumption and service access
    • Infrastructure-as-code: IAM roles defined in CloudFormation templates
    • Centralized security: All permissions defined and version-controlled

    Example Security Architecture:

    • Web task role: Read S3 (media), write CloudWatch Logs
    • Celery worker task role: Read/Write S3, SES send email, SQS receive/delete messages, CloudWatch Logs
    • Celery Beat task role: SQS send messages, CloudWatch Logs only
    • RDS access: PostgreSQL username/password from Secrets Manager (accessed via IAM role)
    • SQS access: Fully controlled via IAM roles (no credentials needed)
    • No IAM user access keys anywhere in the system
  7. Additional AWS Services

    • Amazon SQS for Celery message broker (fully managed queue)
    • CloudWatch Logs for centralized logging
    • CloudWatch Metrics for monitoring and alerting (includes native SQS metrics)
    • AWS Secrets Manager for database credentials and API keys
    • S3 for media file storage (UK region)
    • CloudFormation for infrastructure-as-code deployment
    • ECR (Elastic Container Registry) for Docker image storage

Region and Compliance:

  • All resources in EU-West-2 (London, UK)
  • No data transfer outside UK jurisdiction
  • Estimated cost: ~$90-115/month (12-30% savings, includes SQS)
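
The auto-scaling policies described for the compute layer can be expressed through AWS's Application Auto Scaling API. A minimal sketch for the web service, using target tracking as the simplest equivalent of the CPU thresholds above; the cluster and service names are hypothetical, and a real deployment would define this in the CloudFormation templates instead:

import boto3

autoscaling = boto3.client("application-autoscaling", region_name="eu-west-2")

# Scale the web service between 2 tasks (HA minimum) and 10 tasks.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/viitata-cluster/viitata-web",  # hypothetical names
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Target tracking keeps average CPU near 70%: tasks are added above the
# target and removed (after a cooldown) when utilization falls below it.
autoscaling.put_scaling_policy(
    PolicyName="viitata-web-cpu",
    ServiceNamespace="ecs",
    ResourceId="service/viitata-cluster/viitata-web",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 120,
    },
)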

Connectivity Flow

Internet (HTTPS:443)
  ↓
Application Load Balancer (Public Subnet)
  ↓ (Routes to web tasks only)
ECS Web Tasks (nginx + gunicorn) (Private Subnet)
  ↓
  ├─→ RDS PostgreSQL (Private Subnet)
  ├─→ ElastiCache Valkey (Private Subnet - cache/sessions)
  ├─→ Amazon SQS (Send tasks via IAM role)
  └─→ S3 (via IAM role)

ECS Celery Workers (Private Subnet)
  ↓
  ├─→ RDS PostgreSQL (Private Subnet)
  ├─→ ElastiCache Valkey (Private Subnet - result backend)
  ├─→ Amazon SQS (Receive/Delete tasks via IAM role)
  └─→ S3 (via IAM role)

ECS Celery Beat (Private Subnet)
  ↓
  ├─→ RDS PostgreSQL (Private Subnet)
  └─→ Amazon SQS (Send scheduled tasks via IAM role)

Key Points:

  • Only web tasks receive traffic from ALB - Workers have no inbound internet traffic
  • All three task definitions connect to RDS
  • Celery uses Amazon SQS as message broker - fully managed, serverless queue
  • ElastiCache Valkey used for caching, sessions, and Celery result backend
  • All ECS tasks connect to AWS services (S3, SQS, CloudWatch) via IAM roles
  • No stored credentials required anywhere
  • SQS provides native CloudWatch metrics for auto-scaling
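
A minimal sketch of the Celery configuration these points imply, using the standard SQS transport. With an IAM task role attached, the broker URL carries no credentials; the app name, queue prefix, and cache endpoint are illustrative assumptions:

from celery import Celery

app = Celery("viitata")  # hypothetical app name
app.conf.update(
    broker_url="sqs://",  # credentials come from the ECS task's IAM role
    broker_transport_options={
        "region": "eu-west-2",            # keep queues in London
        "queue_name_prefix": "viitata-",  # hypothetical prefix
        "visibility_timeout": 3600,       # must exceed the longest task runtime
        "polling_interval": 1,
    },
    # ElastiCache Valkey (Redis-compatible) as the result backend
    result_backend="redis://viitata-cache.internal.example:6379/0",  # hypothetical host
)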

Infrastructure-as-Code

The entire architecture is defined in CloudFormation templates:

  • VPC, subnets, route tables, security groups
  • RDS database configuration
  • ElastiCache cluster configuration
  • SQS queues (standard and dead letter queues)
  • ECS cluster, task definitions, and services
  • Application Load Balancer and target groups
  • IAM roles and policies
  • CloudWatch alarms and dashboards

Benefits:

  • Version-controlled infrastructure
  • Reproducible environments (staging = production)
  • No manual configuration or drift
  • Peer-reviewed infrastructure changes via Git
  • Disaster recovery: Rebuild from templates
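
As one hedged illustration of driving these templates programmatically (a real pipeline might equally use the AWS CLI or a GitHub Actions step), a boto3 sketch; the stack name and template path are hypothetical:

import boto3

cfn = boto3.client("cloudformation", region_name="eu-west-2")

with open("infrastructure/viitata-stack.yaml") as f:  # hypothetical path
    template_body = f.read()

cfn.create_stack(
    StackName="viitata-production",  # hypothetical name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack defines IAM roles
    Tags=[{"Key": "compliance-region", "Value": "eu-west-2"}],
)

# Block until creation finishes; the waiter raises if the stack fails.
cfn.get_waiter("stack_create_complete").wait(StackName="viitata-production")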

CI/CD Pipeline

GitHub Actions Workflow:

The deployment pipeline is automated via GitHub Actions, providing consistent and reliable deployments:

graph LR
    A[Code Push to GitHub] --> B[GitHub Actions Triggered]
    B --> C[Build Docker Image]
    C --> D[Push to ECR]
    D --> E[Update ECS Task Definitions]
    E --> F[Deploy Web Tasks]
    E --> G[Deploy Celery Workers]
    E --> H[Deploy Celery Beat]

    style A fill:#e8f5e9
    style B fill:#fff3e0
    style C fill:#e1f5ff
    style D fill:#f3e5f5
    style E fill:#fff9c4
    style F fill:#e8f5e9
    style G fill:#e8f5e9
    style H fill:#e8f5e9

Deployment Process:

  1. Trigger: Code pushed to main branch or pull request merged
  2. Build: GitHub Actions workflow executes
    • Builds single Docker image containing application code
    • Runs tests (optional: can block deployment on failure)
    • Tags image with commit SHA and/or semantic version
  3. Push to ECR: Docker image pushed to Elastic Container Registry in EU-West-2
    • ECR provides secure, private Docker registry
    • Images stored in same region as deployment (UK)
    • Automatic image scanning for vulnerabilities (optional)
  4. Update Task Definitions: GitHub Actions updates ECS task definitions
    • All three task definitions reference the same Docker image
    • Only the container command/entrypoint differs per service:
      • Web: gunicorn command
      • Celery Worker: celery worker command
      • Celery Beat: celery beat command
  5. Deploy: ECS performs rolling updates
    • Web tasks: Rolling deployment with health checks via ALB
    • Celery Workers: Rolling update, new tasks pick up from queue
    • Celery Beat: Stop old task, start new task (single instance)

Single Image, Multiple Services:

All three ECS task definitions use the same Docker image from ECR. The service type is determined by the command executed:

# Example task definition differences
Web Task Definition:
  Image: <ECR_URI>:latest
  Command: ["gunicorn", "app.wsgi:application"]

Celery Worker Task Definition:
  Image: <ECR_URI>:latest  # Same image!
  Command: ["celery", "-A", "app", "worker"]

Celery Beat Task Definition:
  Image: <ECR_URI>:latest  # Same image!
  Command: ["celery", "-A", "app", "beat"]

Benefits:

  • Single build: One Docker image for all services (faster builds)
  • Consistency: All services run identical application code
  • Simplified versioning: Single image tag tracks deployment
  • Reduced storage: ECR stores one image instead of three
  • Atomic deployments: All services deployed from same code version

Comparison to Heroku:

| Aspect | Heroku | AWS ECS |
|---|---|---|
| Deployment trigger | git push heroku | GitHub Actions workflow |
| Build process | Heroku buildpacks | Docker image build |
| Artifact storage | Heroku slug storage | ECR (version-controlled) |
| Deployment control | Limited (auto-deploy) | Full control (approval gates, rollback) |
| Multi-service | Separate apps or Procfile | Task definitions with same image |
| Rollback | heroku releases:rollback | ECS task definition revision or redeploy previous image tag |

Additional CI/CD Capabilities:

  • Environment-specific deployments: Separate workflows for staging and production
  • Approval gates: Require manual approval before production deployment
  • Automated testing: Run integration tests against staging before production
  • Blue-green deployments: Deploy new version alongside old, switch traffic
  • Canary deployments: Gradually shift traffic to new version
  • Automated rollback: Detect failures via CloudWatch alarms and auto-rollback

Cost Analysis

| Component | Heroku (Current) | AWS (Proposed) | Notes |
|---|---|---|---|
| Web Application | ~$25-50 | ~$20-35 | ECS Fargate or EC2 instances |
| Celery Worker | ~$25-50 | ~$20-35 | Right-sized for workload |
| Celery Beat | ~$25-50 | ~$10-15 | Smaller instance for scheduler |
| PostgreSQL | ~$15-20 | ~$20-25 | RDS with backups in UK |
| Redis Cache | ~$15-20 | ~$10-15 | ElastiCache (cache + result backend) |
| SQS Message Queue | Included in dyno | ~$1-3 | Pay per million requests, negligible cost |
| Load Balancer | Included | ~$15-20 | ALB costs |
| Total | ~$130/month | ~$90-115/month | 12-30% savings |

Cost Optimization Opportunities:

  • Auto-scaling with scale-down: Automatically reduce running tasks during low-usage periods (40-50% compute savings during off-peak)
  • Reserved instances for baseline workloads (up to 50% additional savings)
  • Spot instances for Celery workers (up to 70% savings on compute)
  • S3 storage tiers for media files
  • CloudWatch log retention policies
  • Right-sizing based on actual usage patterns

Multi-Tenant Auto-Scaling Impact: The auto-scaling capability is particularly valuable for multi-tenant architecture where tenant usage patterns vary throughout the day. Instead of paying for peak capacity 24/7 (as with Heroku), ECS automatically scales down during low-usage periods, significantly reducing average compute costs.

Important Note: Cost savings are secondary to compliance requirements. Even if costs were equivalent, the migration would be necessary for UK data residency.

Benefits

1. Compliance and Data Sovereignty (Primary Benefit)

  • UK data residency: All infrastructure and data in EU-West-2 (London)
  • Backup compliance: Database backups remain in UK
  • Audit trail: Full control and visibility over data location
  • Regulatory compliance: Meets UK data protection requirements
  • Client confidence: Can guarantee UK-only data storage
  • Reduced legal risk: Eliminates cross-border data transfer concerns

2. Cost Efficiency

  • 12-30% immediate cost reduction (~$130 → ~$90-115/month)
  • Automatic scale-down during low usage: 40-50% additional compute savings during off-peak hours
  • Multi-tenant load optimization: Auto-scaling handles variable tenant usage patterns efficiently
  • Additional savings opportunities with reserved/spot instances
  • More granular resource allocation (no paying for unused capacity)
  • Flexible pricing models (on-demand, reserved, spot)
  • Pay only for actual resource consumption, not fixed capacity

3. Greater Control and Flexibility

Infrastructure Control:

  • Full control over container images and runtime environment
  • Custom networking and security group configuration
  • Direct access to infrastructure-level metrics
  • Ability to implement custom security controls
  • VPC configuration for network isolation

Operational Flexibility:

  • Choose instance types optimized for workload
  • Separate scaling policies per service
  • Custom monitoring and alerting
  • Advanced deployment strategies (blue/green, canary)

Infrastructure-as-Code:

  • 100% version-controlled infrastructure using CloudFormation templates
  • Entire infrastructure stack defined as code (VPC, ECS, RDS, ElastiCache, ALB, etc.)
  • Git-based workflow for infrastructure changes (review, approve, deploy)
  • Reproducible environments (staging matches production exactly)
  • Disaster recovery: Rebuild entire infrastructure from templates
  • Change tracking and audit trail for infrastructure modifications
  • Team collaboration on infrastructure changes via pull requests
  • No manual ClickOps or undocumented configuration drift

Heroku Limitation: On Heroku, infrastructure is configured via web UI or CLI commands that aren’t easily version-controlled. App configuration can be tracked, but the underlying platform infrastructure (databases, dynos, add-ons) requires manual provisioning and documentation.

4. Performance Improvements

AWS Performance Advantages:

  • Dedicated compute resources (no noisy neighbors)
  • Auto-scaling for traffic spikes: Automatically adds capacity when multiple tenants become active
  • Dynamic resource allocation: Scales up/down based on actual demand (CPU, memory, request count, queue depth)
  • Advanced connection pooling (RDS Proxy)
  • No 30-second timeout constraints
  • Custom CDN configuration (CloudFront)
  • Private networking between services (reduced latency)
  • Better database performance tuning options
  • Independent scaling per service type (web vs. workers)

5. AWS Service Ecosystem

Native Integration:

  • Amazon SQS for Celery message broker: Fully managed, serverless queue with no infrastructure to maintain
  • S3 for media and static file storage (same region)
  • SES for email services
  • Lambda for serverless functions
  • CloudWatch for comprehensive monitoring (includes native SQS metrics for auto-scaling)
  • AWS Secrets Manager for credential management
  • IAM roles for secure, passwordless service access
  • VPC endpoints for private AWS service access

SQS-Specific Benefits:

  • Zero infrastructure management: No Redis/ElastiCache broker to maintain or scale
  • Native CloudWatch metrics: ApproximateNumberOfMessagesVisible metric for accurate worker auto-scaling
  • Reliability: Built-in message durability and delivery guarantees
  • Dead letter queues: Automatic handling of failed tasks
  • Cost-effective: Pay only for messages processed (~$0.40 per million requests)
  • Unlimited scalability: No capacity planning required
  • IAM-based security: No broker credentials to manage

Technical Considerations

Containerization

ECS Container Setup:

  • Dockerize Django application (web, Celery, Beat)
  • Use ECR (Elastic Container Registry) for image storage
  • Multi-stage builds for optimized image sizes
  • Health checks for container orchestration
  • Environment-based configuration

Database Migration

RDS PostgreSQL:

  • Similar to Heroku Postgres (based on PostgreSQL)
  • Enhanced monitoring and performance insights
  • Automated backups with configurable retention
  • Multi-AZ deployment for high availability
  • Parameter groups for fine-tuned configuration
  • RDS Proxy for connection pooling
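
A hedged sketch of the Django database settings this implies, pointing at an RDS Proxy endpoint with TLS enforced; the host name is an illustrative placeholder:

import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ["DB_NAME"],
        "USER": os.environ["DB_USER"],
        "PASSWORD": os.environ["DB_PASSWORD"],  # injected from Secrets Manager
        "HOST": "viitata-proxy.example.eu-west-2.rds.amazonaws.com",  # hypothetical
        "PORT": "5432",
        "CONN_MAX_AGE": 60,  # persistent connections complement proxy-side pooling
        "OPTIONS": {"sslmode": "require"},  # encrypt in transit
    }
}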

Networking and Security

VPC Configuration:

  • Private subnets for database and cache
  • Public subnets for load balancer
  • Security groups for service isolation
  • NAT Gateway for outbound traffic from private subnets
  • VPC endpoints for AWS services (S3, Secrets Manager)

Security Enhancements:

  • IAM roles instead of static credentials
  • Secrets Manager for sensitive configuration
  • AWS WAF (Web Application Firewall) for web protection
  • CloudTrail for audit logging
  • Encryption at rest and in transit
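
A short sketch of what "IAM roles instead of static credentials" looks like from application code: boto3 resolves temporary credentials from the ECS task role automatically, and secrets are fetched at runtime rather than stored in the environment. The bucket and secret names are hypothetical:

import json

import boto3

# No access keys anywhere: credentials come from the ECS task role.
s3 = boto3.client("s3", region_name="eu-west-2")
s3.upload_file("/tmp/report.pdf", "viitata-media-eu-west-2", "reports/report.pdf")

# Database credentials live in Secrets Manager, read via the same role.
secrets = boto3.client("secretsmanager", region_name="eu-west-2")
db_secret = json.loads(
    secrets.get_secret_value(SecretId="viitata/production/db")["SecretString"]
)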

Monitoring and Observability

CloudWatch Integration:

  • Application logs aggregation
  • Custom metrics and dashboards
  • Alerting for critical events
  • Performance monitoring
  • Cost tracking and budgets
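
For the custom metrics mentioned above, a hedged example publishing a per-tenant datapoint; the namespace, metric, and dimension names are illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")
cloudwatch.put_metric_data(
    Namespace="Viitata/Tenants",  # hypothetical namespace
    MetricData=[
        {
            "MetricName": "DocumentsProcessed",  # hypothetical metric
            "Dimensions": [{"Name": "TenantId", "Value": "acme"}],
            "Value": 42.0,
            "Unit": "Count",
        }
    ],
)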

X-Ray (Optional):

  • Distributed tracing
  • Performance bottleneck identification
  • Request flow visualization

High Availability

AWS HA Features:

  • Multi-AZ RDS deployment
  • ECS service auto-recovery
  • Application Load Balancer health checks
  • Automated backups and snapshots
  • Cross-AZ redundancy

Trade-offs

Advantages

  • UK data residency and compliance (eliminates critical blocker)
  • Cost savings (12-30%)
  • 100% infrastructure-as-code with CloudFormation (version control, reproducibility, no drift)
  • Greater infrastructure control
  • Better performance and scalability
  • AWS service ecosystem integration
  • More flexible resource allocation
  • Enhanced security capabilities
  • Better monitoring and observability

Disadvantages

  • Increased operational responsibility (less managed than Heroku)
  • Steeper learning curve for AWS services
  • More complex infrastructure management
  • Need for AWS expertise in team
  • Infrastructure-as-code maintenance overhead
  • More components to monitor and maintain
  • Deployment complexity (ECS vs. git push heroku)

Risk Mitigation

  • Training and documentation: Invest in AWS training for team
  • Infrastructure-as-code: Use the CloudFormation templates described above for reproducibility
  • Gradual migration: Test thoroughly in staging environment
  • Monitoring from day one: Comprehensive CloudWatch setup before go-live
  • AWS support plan: Consider AWS Business Support for expert guidance
  • Runbooks and procedures: Document common operational tasks
  • Disaster recovery testing: Regular restore testing from backups

Compliance Requirements

UK Data Residency

Requirements Met:

  • All compute resources in EU-West-2 (London)
  • Database and backups in EU-West-2 (London)
  • Redis cache in EU-West-2 (London)
  • Application logs in EU-West-2 (CloudWatch Logs)
  • S3 storage in EU-West-2 (if used)

Audit Trail:

  • CloudTrail logs all API calls and data access
  • Resource tagging for compliance tracking
  • IAM policies enforce regional restrictions
  • AWS Organizations for governance controls

Data Protection Compliance

GDPR/UK GDPR Alignment:

  • Data processing within UK jurisdiction
  • No cross-border data transfers to USA
  • Right to erasure (tenant deletion capabilities)
  • Data encryption at rest and in transit
  • Audit logs for data access

Conclusion

The migration from Heroku to AWS ECS represents a strategic necessity for Viitata, driven primarily by UK data residency and compliance requirements. The current Heroku architecture creates unacceptable compliance risk due to database backups being retained in the USA, making it impossible to guarantee UK-only data storage.

This migration delivers:

  • Resolution of the critical compliance issue: UK data residency with backups in EU-West-2 (London)
  • Elimination of USA data transfer: all data and backups remain in the UK
  • Cost savings: 12-30% base reduction, plus 40-50% additional compute savings from auto-scaling during off-peak hours
  • Auto-scaling with scale-down: handles multi-tenant traffic spikes automatically while cutting costs during low-usage periods
  • 100% version-controlled infrastructure: CloudFormation templates for the entire stack, eliminating manual configuration and drift
  • Greater operational control: full infrastructure flexibility and customization
  • Performance improvements: dedicated resources, auto-scaling for spikes, and advanced AWS features
  • AWS service ecosystem: native integration with SQS (message broker), S3, CloudWatch, IAM, and other services
  • Serverless message queue: SQS eliminates the need to manage message broker infrastructure
  • Enhanced security: VPC isolation, IAM roles, Secrets Manager, and encryption

While AWS requires greater operational expertise compared to Heroku’s managed platform, the compliance requirements make this migration essential. The additional benefits of cost savings, performance improvements, and operational flexibility provide further justification beyond the compliance imperative.

Without this migration, Viitata cannot serve clients with UK data residency requirements and remains exposed to compliance risks associated with cross-border backup storage.
