Memos

Technical memos and in-depth research documents

Overview

This section contains technical memos and in-depth research documents covering architectural decisions, technical analyses, and comprehensive guides for complex topics.

Browse Memos

Navigate through the sidebar to explore individual memos, or use the search feature to find specific documents.

1 - Sample Memo

A sample memo demonstrating the structure and format

Executive Summary

This is a sample memo that demonstrates the recommended structure and format for technical memos. It provides a template that can be used as a starting point for creating new memos.

Introduction

Purpose

Explain the purpose of the memo and what problems or questions it addresses.

Scope

Define the scope of the document, including what is covered and what is explicitly out of scope.

Audience

Identify the intended audience and any prerequisite knowledge required.

Background

Provide necessary background information and context for understanding the memo content.

Current State

Describe the current situation, challenges, or problems that motivated this memo.

Requirements

List any requirements or constraints that influenced the analysis or recommendations.

Technical Analysis

Approach

Describe the methodology or approach used in the analysis.

Findings

Present the key findings from the technical analysis.

Trade-offs

Discuss the trade-offs considered and how different options were evaluated.

Recommendations

Provide specific recommendations based on the analysis.

Implementation Considerations

Discuss practical considerations for implementing the recommendations.

Risk Assessment

Identify potential risks and mitigation strategies.

Conclusion

Summarize the key points and recommendations.

References

  • List relevant documentation
  • External resources
  • Related memos

2 - Viitata Tenancy Infrastructure

Migration from single-tenant to multi-tenant architecture

Executive Summary

This memo documents the strategic architecture migration for Viitata from a single-tenant-per-instance model to a multi-tenant architecture on Heroku. This migration addresses critical operational inefficiencies, enables deployment of the new Viitata version with its required worker architecture, and significantly reduces both current costs and the cost of scaling while eliminating DevOps friction for client onboarding.

Key Changes:

  • Architecture: Single-tenant-per-instance → Multi-tenant shared infrastructure
  • Platform: Heroku (no change)
  • Application Version: Current (single worker) → New version (3 workers required)
  • Cost Impact: $96/month currently → $288/month if upgraded on single-tenant → $130/month on multi-tenant
  • Cost Savings: 55% reduction vs. deploying new version on single-tenant architecture

Introduction

Purpose

This document outlines the rationale, technical approach, and benefits of migrating Viitata from a distributed single-tenant-per-instance model to a consolidated multi-tenant architecture on Heroku.

Scope

This memo covers:

  • Current single-tenant-per-instance architecture on Heroku
  • New Viitata version requirements (3-worker architecture)
  • Proposed multi-tenant architecture on Heroku
  • Cost analysis and operational benefits
  • Technical considerations and trade-offs

Out of scope:

  • Detailed application code changes for multi-tenancy
  • Specific Heroku configuration details
  • Data migration procedures and implementation timeline

Audience

This document is intended for technical leadership, DevOps engineers, and stakeholders involved in infrastructure planning and decision-making.

Background

Current State: Single-Tenant-Per-Instance on Heroku

Viitata currently operates with a single-tenant-per-instance model on Heroku, consisting of:

Infrastructure per Tenant:

  • 1 Heroku application instance (single worker)
  • 1 PostgreSQL database
  • 1 Redis cache instance
  • Cost: ~$16/month per tenant

Current Deployment:

  • 6 production instances running current Viitata version
  • Each instance operates as a single-worker application
  • Total monthly cost: ~$96 (6 instances × $16)
  • Each instance requires independent CI/CD pipeline
  • Each instance requires separate DevOps configuration

Important Note: The current architecture runs an older version of Viitata that does not require the 3-worker architecture. The new version, however, cannot be deployed economically without the infrastructure change described in this memo.

Challenges with Current Architecture

1. Cost Scalability Concerns

With 6 production tenants at $16/month each, the current architecture costs approximately $96/month. While manageable at this scale, the cost scales linearly with each new tenant ($16 per additional tenant). More critically, the new version of Viitata requires a 3-worker architecture that would triple costs to approximately $288/month for the same 6 tenants.

2. DevOps Friction

Each new client onboarding requires:

  • Provisioning new Heroku application
  • Configuring new PostgreSQL database
  • Setting up new Redis cache
  • Configuring CI/CD pipeline
  • Managing environment variables and secrets
  • Setting up monitoring and logging

This creates substantial friction and delays in client onboarding.

3. CI/CD Maintenance Overhead

Maintaining 6 separate CI/CD pipelines creates:

  • Increased complexity in deployment processes
  • Higher risk of configuration drift
  • Difficulty in applying updates uniformly
  • Additional testing burden across instances

4. Blocking Issue: New Viitata Version Requirements

The new version of Viitata fundamentally requires three distinct worker types to function:

  • Web worker: Handles HTTP requests
  • Celery worker: Processes asynchronous tasks
  • Celery Beat worker: Manages scheduled tasks and periodic jobs

This is not optional - the new Viitata version cannot be deployed without all three workers running.

Under the single-tenant model, deploying the new version would require:

  • 18 total worker processes (6 instances × 3 workers)
  • Tripling of infrastructure costs per tenant (from $16 to ~$48 per tenant)
  • Total monthly cost increase from $96 to approximately $288/month
  • 18 separate processes to monitor and manage

Critical Impact: The single-tenant architecture makes it economically and operationally prohibitive to deploy the new version of Viitata. Without migrating to multi-tenant, the platform cannot evolve.

Technical Analysis

Proposed Architecture: Multi-Tenant on Heroku

The new architecture consolidates all tenants into a single shared Heroku infrastructure:

Shared Infrastructure:

  • 1 Heroku application (supporting 3 worker types)
  • 1 Heroku PostgreSQL database (with tenant isolation)
  • 1 Heroku Redis cache (with tenant namespacing)
  • Estimated cost: ~$130/month total

Worker Configuration:

  • 1 web worker (serving all tenants)
  • 1 Celery worker (processing tasks for all tenants)
  • 1 Celery Beat worker (managing schedules for all tenants)
  • Total: 3 workers supporting all tenants

Cost Analysis

| Architecture Model | Viitata Version | Tenants | Workers | Monthly Cost | Cost per Tenant |
|---|---|---|---|---|---|
| Current (Single-Tenant) | Old | 6 | 6 (1 per instance) | $96 | $16.00 |
| Single-Tenant Upgraded | New | 6 | 18 (3 per instance) | $288 | $48.00 |
| Multi-Tenant (Proposed) | New | 6 | 3 (shared) | $130 | $21.67 |
| Savings vs. Upgraded | - | - | -83% | -$158/month | -55% |

Key Insights:

  • Current architecture cannot run the new Viitata version without significant cost increase
  • New version’s 3-worker requirement would triple single-tenant costs ($96 → $288)
  • Multi-tenant architecture enables new version deployment at 55% lower cost than single-tenant upgrade
  • Marginal cost advantage: Adding tenant #7 costs $0/month (vs. $48/month in single-tenant)
  • Cost efficiency improves with scale: 10 tenants = $13/tenant, 20 tenants = $6.50/tenant
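
The last two insights are simple division of the roughly flat shared cost. A quick check of the arithmetic, using the memo's own figures:

# Cost per tenant on shared multi-tenant infrastructure (~$130/month,
# roughly flat until the scaling threshold is reached).
MONTHLY_COST = 130
for tenants in (6, 10, 20):
    print(f"{tenants} tenants -> ${MONTHLY_COST / tenants:.2f}/tenant")
# 6 tenants -> $21.67/tenant
# 10 tenants -> $13.00/tenant
# 20 tenants -> $6.50/tenant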

Benefits

1. Cost Reduction

  • 55% reduction in infrastructure costs vs. the single-tenant upgrade ($288 → $130/month)
  • Costs remain flat as tenant count grows (until scaling threshold)
  • Predictable cost model

2. Operational Efficiency

  • Single CI/CD pipeline for all tenants
  • Unified deployment process
  • Consistent configuration across all tenants
  • Reduced maintenance overhead

3. Client Onboarding

  • Near-instant tenant provisioning (database record vs. full infrastructure)
  • Minimal DevOps involvement
  • Faster time-to-value for new clients
  • Reduced onboarding friction

4. Enables New Viitata Version Deployment

  • Supports required 3-worker architecture (web, Celery, Celery Beat)
  • 3 shared workers support all tenants (vs. 18 separate workers in single-tenant)
  • Makes new version economically viable to deploy
  • Simplified monitoring and management
  • Better resource utilization
  • Easier to scale horizontally when needed

Technical Considerations

Data Isolation

  • Tenant identification at application layer
  • Row-level security in PostgreSQL
  • Redis key namespacing by tenant ID
  • Careful query design to prevent data leakage
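
As a concrete illustration of the first and third points above, a minimal sketch of application-layer tenant context and Redis key namespacing. It assumes a Django application and subdomain-based tenant resolution; all names here are hypothetical, not Viitata's actual code:

import threading

# Request-scoped tenant context (hypothetical; the real resolution
# strategy could be header- or path-based instead of subdomain-based).
_tenant_context = threading.local()

def get_current_tenant_id():
    return getattr(_tenant_context, "tenant_id", None)

class TenantMiddleware:
    """Resolve the tenant once per request so queries and cache keys can be scoped."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # e.g. acme.viitata.example -> tenant "acme" (assumption)
        _tenant_context.tenant_id = request.get_host().split(".")[0]
        try:
            return self.get_response(request)
        finally:
            _tenant_context.tenant_id = None

def tenant_cache_key(key: str) -> str:
    """Namespace every Redis key by tenant ID to prevent cross-tenant reads."""
    return f"tenant:{get_current_tenant_id()}:{key}"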

Performance

  • Shared resources require proper resource allocation
  • Connection pooling for database efficiency
  • Caching strategies to prevent tenant interference
  • Monitoring to identify tenant-specific performance issues

Security

  • Tenant isolation at application and data layers
  • Secure tenant context management
  • Audit logging for compliance
  • Regular security reviews of multi-tenant code paths

Scalability

  • Horizontal scaling when single instance reaches capacity
  • Database sharding if needed for very large tenant counts
  • CDN and edge caching for static assets
  • Load balancing across multiple application instances

Trade-offs

Advantages

  • Dramatic cost reduction
  • Simplified operations
  • Faster client onboarding
  • Better resource utilization
  • Easier maintenance and updates

Disadvantages

  • Tenant isolation complexity in application code
  • Potential “noisy neighbor” issues
  • Database restore impact: Currently, database snapshots can be restored per tenant without affecting other clients. In a multi-tenant architecture a database restore affects all tenants simultaneously, so a single client’s data cannot be rolled back after a bug or data issue
  • More complex deployment rollback scenarios
  • Requires careful tenant-aware code design
  • Less isolation between tenants compared to separate instances

Risk Mitigation

  • Comprehensive testing of tenant isolation
  • Resource limits per tenant
  • Monitoring and alerting for anomalies
  • Gradual migration approach
  • Ability to isolate problematic tenants if needed
  • Database restore mitigation:
    • Implement application-level point-in-time recovery per tenant
    • Maintain granular database backups with tenant-specific restore capabilities
    • Use transaction logs to selectively restore tenant data
    • Establish procedures for tenant-specific data rollback without full database restore
    • More rigorous testing and staging processes to prevent production data issues
    • Consider automated daily tenant-level logical backups (pg_dump per tenant)
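
A hedged sketch of the last mitigation, a tenant-level logical backup: since a literal per-tenant pg_dump operates on whole tables, each tenant-scoped table can instead be exported with a filtered COPY. The table names, tenant_id column, and connection details are illustrative assumptions:

import psycopg2

# Assumption: every tenant-scoped table carries a tenant_id column.
TENANT_TABLES = ["accounts", "documents", "events"]

def backup_tenant(dsn: str, tenant_id: int, out_dir: str) -> None:
    """Export one tenant's rows as CSV so a single client can be
    restored without a full database restore."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for table in TENANT_TABLES:
                query = (
                    f"COPY (SELECT * FROM {table} "
                    f"WHERE tenant_id = {int(tenant_id)}) "
                    "TO STDOUT WITH CSV HEADER"
                )
                with open(f"{out_dir}/{table}_tenant_{tenant_id}.csv", "w") as f:
                    cur.copy_expert(query, f)
    finally:
        conn.close()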

Conclusion

The migration from a single-tenant-per-instance architecture to a multi-tenant architecture on Heroku represents a strategic necessity for Viitata’s evolution. This change delivers:

  • Deployment of the new Viitata version with its required 3-worker architecture
  • 55% cost reduction vs. deploying the new version on single-tenant ($288/month → $130/month)
  • Dramatic reduction in operational complexity (6 CI/CD pipelines → 1, 18 workers → 3)
  • Near-zero marginal cost for new tenants ($0 vs. $48/tenant in single-tenant)
  • Improving cost efficiency at scale: cost per tenant decreases as the platform grows
  • Elimination of DevOps friction in client onboarding

Without this migration, deploying the new version of Viitata would nearly triple costs while adding significant operational burden. The multi-tenant architecture not only makes the new version economically viable but also positions Viitata for sustainable growth with costs that improve with scale.

While multi-tenancy introduces complexity in application design around tenant isolation and data security, the alternative—remaining on single-tenant architecture—would either block the platform’s evolution or make it financially unsustainable. The operational benefits, cost savings, and improved scalability make this migration essential for Viitata’s future.

3 - Heroku to AWS Migration

Migration from Heroku to AWS for improved compliance, cost, and control

Executive Summary

This memo documents the strategic platform migration for Viitata from Heroku to AWS (Amazon Web Services). This migration addresses critical compliance requirements around UK data residency, reduces infrastructure costs, provides greater operational flexibility and control, and enables better performance and integration with additional AWS services.

Key Changes:

  • Platform: Heroku → AWS ECS (Elastic Container Service)
  • Region: EU-West-1 (Ireland) → EU-West-2 (London, UK)
  • Primary Driver: Compliance - UK data residency and backup retention
  • Additional Benefits: Cost reduction, greater control, performance improvements, AWS service ecosystem

Critical Compliance Issue: On Heroku, the primary database runs in EU-West-1 (Ireland), but database backups are retained in the USA. This creates compliance risk for UK data residency requirements. AWS enables full infrastructure and data containment within EU-West-2 (London).

Introduction

Purpose

This document outlines the rationale, technical approach, and benefits of migrating Viitata’s multi-tenant infrastructure from Heroku to AWS, with a focus on achieving UK data sovereignty and compliance requirements while improving operational capabilities.

Scope

This memo covers:

  • Current multi-tenant architecture on Heroku
  • Compliance and data residency challenges
  • Proposed multi-tenant architecture on AWS ECS
  • Cost analysis and operational benefits
  • Technical considerations and trade-offs

Out of scope:

  • Detailed AWS infrastructure-as-code configurations
  • Specific containerization implementation details
  • Data migration procedures and implementation timeline
  • Application code changes required for AWS

Audience

This document is intended for technical leadership, compliance officers, DevOps engineers, and stakeholders involved in infrastructure planning and decision-making.

Background

Current State: Multi-Tenant on Heroku

Viitata currently operates with a multi-tenant architecture on Heroku, consisting of:

Infrastructure:

  • 1 Heroku application (3 dynos: web, Celery worker, Celery Beat)
  • 1 Heroku PostgreSQL database in EU-West-1 (Ireland)
  • 1 Heroku Redis cache
  • Current cost: ~$130/month

Current Deployment:

  • Multi-tenant architecture supporting 6 production tenants
  • Single CI/CD pipeline
  • Heroku-managed infrastructure and scaling
  • Automatic SSL, DNS, and platform maintenance

Critical Issues with Current Platform

1. Compliance and Data Residency (Primary Driver)

Database Backup Location:

  • Primary database: EU-West-1 (Ireland, EU)
  • Database backups: Stored in USA (Heroku’s backup infrastructure)

This creates significant compliance risks:

  • UK data residency requirements cannot be met
  • Backup data crosses international boundaries
  • Potential violations of data protection regulations
  • Risk for clients requiring UK-only data storage
  • Audit and compliance reporting challenges

Regional Limitation:

  • Application and database in Ireland (EU-West-1), not UK
  • No option for UK-specific region on Heroku
  • Cannot guarantee UK data sovereignty

2. Cost Considerations

While Heroku provides managed services, the cost includes:

  • Premium for managed platform (~30-40% over raw compute)
  • Limited ability to optimize resource allocation
  • Dyno pricing model less flexible than AWS instance types
  • Add-on costs (PostgreSQL, Redis) with limited customization

3. Limited Control and Flexibility

Infrastructure Control:

  • Cannot customize underlying OS or runtime environment
  • Limited control over networking and security groups
  • Restricted access to infrastructure-level monitoring
  • Cannot implement custom security controls

Resource Optimization:

  • Fixed dyno sizes with limited granularity
  • Cannot right-size resources for specific workloads
  • Limited ability to use spot instances or reserved capacity
  • Cannot separate worker resources by type

4. Performance Constraints

Heroku Limitations:

  • Shared infrastructure with potential noisy neighbor issues
  • Limited database connection pooling options
  • Router timeout constraints (30 seconds)
  • Limited control over caching layers
  • Cannot implement custom CDN configurations

5. AWS Service Integration

Current limitations for integrating with AWS services:

  • External network calls to AWS services (S3, SES, etc.)
  • Additional latency for AWS service integration
  • Cannot use VPC peering or private networking
  • Limited IAM role-based security
  • Cannot leverage AWS-native monitoring and logging

Technical Analysis

Proposed Architecture: Multi-Tenant on AWS ECS

The new architecture migrates the multi-tenant application to AWS infrastructure with a fully containerized, role-based security model.

Architecture Overview

The following diagram illustrates the proposed AWS architecture:

graph TB
    subgraph Internet
        Users[Users/Clients]
    end

    subgraph "AWS EU-West-2 (London)"
        subgraph "VPC"
            subgraph "Public Subnets"
                ALB[Application Load Balancer<br/>HTTPS:443]
                NAT[NAT Gateway]
            end

            subgraph "Private Subnets"
                subgraph "ECS Fargate Cluster"
                    Web[Web Tasks<br/>nginx + gunicorn<br/>Auto-scaling]
                    Worker[Celery Worker Tasks<br/>Auto-scaling]
                    Beat[Celery Beat Task<br/>Single instance]
                end

                RDS[(RDS PostgreSQL<br/>Multi-AZ<br/>Automated Backups)]
                Cache[(ElastiCache Valkey<br/>Redis-compatible<br/>Cache & Results)]
            end
        end

        SQS[Amazon SQS<br/>Celery Message Broker<br/>Task Queue]
        S3[S3 Bucket<br/>Media Storage]
        CW[CloudWatch<br/>Logs & Metrics]
        SM[Secrets Manager<br/>Credentials]
    end

    Users -->|HTTPS| ALB
    ALB -->|Routes traffic| Web

    Web -->|IAM Role| S3
    Web -->|Read/Write| RDS
    Web -->|Cache/Sessions| Cache
    Web -->|Send tasks| SQS
    Web -->|Logs| CW
    Web -->|Get secrets| SM

    Worker -->|IAM Role| S3
    Worker -->|Read/Write| RDS
    Worker -->|Receive/Delete tasks| SQS
    Worker -->|Store results| Cache
    Worker -->|Logs| CW
    Worker -->|Get secrets| SM

    Beat -->|Send scheduled tasks| SQS
    Beat -->|Read/Write| RDS
    Beat -->|Logs| CW

    Web -.->|Outbound via| NAT
    Worker -.->|Outbound via| NAT
    Beat -.->|Outbound via| NAT

    classDef public fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef private fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef data fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef compute fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef aws fill:#fff9c4,stroke:#f57f17,stroke-width:2px

    class ALB,NAT public
    class Web,Worker,Beat compute
    class RDS,Cache data
    class S3,CW,SM,SQS aws

Core Infrastructure Components:

  1. Database Layer

    • Amazon RDS for PostgreSQL in EU-West-2 (London)
    • Multi-AZ deployment for high availability
    • Automated backups retained in EU-West-2
    • Point-in-time recovery capabilities
    • All tenant data with row-level isolation
  2. Cache Layer

    • Amazon ElastiCache with Valkey (Redis-compatible) in EU-West-2
    • Used for session storage, application caching, and Celery result backend
    • Tenant-namespaced keys for data isolation
    • High-performance in-memory data store
  3. Message Queue Layer

    • Amazon SQS (Simple Queue Service) in EU-West-2
    • Celery message broker for task distribution
    • Fully managed, serverless message queue
    • No infrastructure to maintain or scale
    • Automatic message retention and delivery
    • Dead letter queue for failed tasks
    • FIFO queues for task ordering if needed
    • Cost-effective: Pay only for messages processed
  4. Compute Layer - ECS Fargate

    Three separate ECS task definitions running on Fargate:

    Task Definition 1: Web Application

    • Container: nginx + gunicorn
    • Receives traffic from Application Load Balancer (ALB)
    • Handles HTTPS requests routed by ALB
    • Auto-scaling based on CPU/memory and request count
    • Appropriate resources: ~0.5-1 vCPU, 1-2GB memory
    • Multiple tasks for high availability and load distribution
    • ALB distributes traffic across all healthy web tasks

    Task Definition 2: Celery Worker

    • Container: Celery worker process
    • Consumes tasks from Amazon SQS queue
    • Auto-scaling based on SQS queue depth (ApproximateNumberOfMessagesVisible) and CPU utilization
    • Right-sized resources: ~0.25-0.5 vCPU, 0.5-1GB memory
    • Can scale independently based on task backlog in SQS

    Task Definition 3: Celery Beat

    • Container: Celery Beat scheduler
    • Manages periodic and scheduled tasks
    • Publishes scheduled tasks to SQS queue
    • Fixed scaling: Single task (Beat requires single instance)
    • Minimal resources: ~0.25 vCPU, 0.5GB memory
    • Auto-restart on failure

    Rationale for Separate Task Definitions:

    • Each workload has different resource requirements
    • Independent scaling policies per service type
    • Web scales with traffic, workers scale with queue depth
    • Cost optimization: Right-size each workload separately
    • Isolation: Issues in one service don’t affect others

    Auto-Scaling Configuration:

    ECS provides automatic scaling that adjusts the number of running tasks based on demand, with both scale-up and scale-down capabilities (a brief API sketch follows the component list, after "Region and Compliance"):

    Web Tasks Auto-Scaling:

    • Metrics: CPU utilization, memory utilization, ALB request count per target
    • Scale-up triggers:
      • CPU > 70% for 2 minutes → Add tasks
      • Requests per task > 1000/min → Add tasks
    • Scale-down triggers:
      • CPU < 30% for 5 minutes → Remove tasks
      • Requests per task < 200/min → Remove tasks
    • Min/Max tasks: 2 minimum (HA), 10 maximum
    • Benefits: Handles traffic spikes from multiple tenants, scales down during low usage to save costs

    Celery Worker Auto-Scaling:

    • Metrics: CPU utilization, SQS ApproximateNumberOfMessagesVisible (native CloudWatch metric)
    • Scale-up triggers:
      • SQS queue depth > 100 messages → Add workers
      • CPU > 80% for 3 minutes → Add workers
    • Scale-down triggers:
      • SQS queue depth < 10 messages for 10 minutes → Remove workers
      • CPU < 20% for 10 minutes → Remove workers
    • Min/Max tasks: 1 minimum, 5 maximum
    • Benefits: SQS provides native queue metrics for accurate scaling decisions; efficiently processes task backlog, reduces to minimum during idle periods

    Celery Beat Scaling:

    • Fixed at 1 task (Beat scheduler requires single instance)
    • Auto-restart on failure for reliability

    Multi-Tenant Scaling Benefits:

    Auto-scaling is particularly valuable for multi-tenant architecture:

    • Unpredictable tenant activity: Different tenants have different usage patterns and peak times
    • Cost efficiency: Automatically scales down during low-usage periods (nights, weekends)
    • Spike handling: Automatically scales up when multiple tenants become active simultaneously
    • Resource optimization: Pays only for resources actually needed at any given time
    • Example scenario:
      • During business hours (9am-5pm): 6-8 web tasks handle peak multi-tenant load
      • During nights (11pm-6am): Scales down to 2 web tasks, saving ~$40-60/month
      • Weekend spikes: Auto-scales up to handle unexpected tenant activity

    Comparison to Heroku:

    | Scaling Feature | Heroku | AWS ECS |
    |---|---|---|
    | Scale-up | Manual or via add-ons | Automatic based on metrics |
    | Scale-down | Manual only | Automatic (saves costs) |
    | Scaling metrics | Limited (response time, throughput) | Extensive (CPU, memory, custom CloudWatch metrics, ALB metrics, queue depth) |
    | Per-service scaling | Requires multiple apps | Built-in per task definition |
    | Cost during low usage | Fixed (pays for min dynos) | Dynamic (scales to minimum) |
    | Multi-tenant optimization | Limited | Excellent - handles variable tenant load patterns |

    Cost Impact:

    • Scale-down capability can reduce compute costs by 40-50% during off-peak hours
    • For multi-tenant with variable load, average monthly compute cost drops significantly
    • Example: Instead of running 6 web tasks 24/7, average 4 tasks/hour = 33% cost reduction
  5. Networking and Load Balancing

    • Application Load Balancer (ALB) as entry point for all web traffic
      • Sits in public subnets
      • Terminates HTTPS/SSL connections
      • Routes traffic to web task definition only
      • Health checks on web tasks
      • Automatically distributes load across multiple web task instances
    • VPC with public and private subnets across multiple Availability Zones
    • Private subnets for ECS tasks, RDS, and ElastiCache (no direct internet access)
    • Public subnets for ALB only
    • NAT Gateway for outbound internet access from private subnets
    • Security groups for service-level network isolation
      • ALB security group: Allow inbound 443 from internet
      • Web task security group: Allow inbound from ALB only
      • Worker task security groups: No inbound internet traffic
      • RDS/ElastiCache security groups: Allow access from ECS tasks only
  6. Security Model: Role-Based Authentication

    Shift from IAM Users to IAM Roles:

    • Current (Heroku): IAM user credentials stored as environment variables for AWS service access (S3, SES, etc.)
    • Proposed (AWS): ECS task IAM roles with least-privilege permissions

    IAM Role-Based Security Benefits:

    • No stored credentials: Tasks assume roles automatically via ECS Task Role
    • Dramatically reduced credential leakage risk: No long-lived access keys in environment variables or code
    • Automatic credential rotation: AWS STS provides temporary credentials (auto-expire and rotate)
    • Least-privilege access: Each task definition gets only required permissions
    • Audit trail: CloudTrail logs all role assumption and service access
    • Infrastructure-as-code: IAM roles defined in CloudFormation templates
    • Centralized security: All permissions defined and version-controlled

    Example Security Architecture:

    • Web task role: Read S3 (media), write CloudWatch Logs
    • Celery worker task role: Read/Write S3, SES send email, SQS receive/delete messages, CloudWatch Logs
    • Celery Beat task role: SQS send messages, CloudWatch Logs only
    • RDS access: PostgreSQL username/password from Secrets Manager (accessed via IAM role)
    • SQS access: Fully controlled via IAM roles (no credentials needed)
    • No IAM user access keys anywhere in the system
  7. Additional AWS Services

    • Amazon SQS for Celery message broker (fully managed queue)
    • CloudWatch Logs for centralized logging
    • CloudWatch Metrics for monitoring and alerting (includes native SQS metrics)
    • AWS Secrets Manager for database credentials and API keys
    • S3 for media file storage (UK region)
    • CloudFormation for infrastructure-as-code deployment
    • ECR (Elastic Container Registry) for Docker image storage

Region and Compliance:

  • All resources in EU-West-2 (London, UK)
  • No data transfer outside UK jurisdiction
  • Estimated cost: ~$90-115/month (12-30% savings, includes SQS)
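
The auto-scaling policies described for the compute layer can be expressed through AWS's Application Auto Scaling API. A minimal sketch for the web service, using target tracking as the simplest equivalent of the CPU thresholds above; the cluster and service names are hypothetical, and a real deployment would define this in the CloudFormation templates instead:

import boto3

autoscaling = boto3.client("application-autoscaling", region_name="eu-west-2")

# Scale the web service between 2 tasks (HA minimum) and 10 tasks.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/viitata-cluster/viitata-web",  # hypothetical names
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Target tracking keeps average CPU near 70%: tasks are added above the
# target and removed (after a cooldown) when utilization falls below it.
autoscaling.put_scaling_policy(
    PolicyName="viitata-web-cpu",
    ServiceNamespace="ecs",
    ResourceId="service/viitata-cluster/viitata-web",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 120,
    },
)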

Connectivity Flow

Internet (HTTPS:443)
  ↓
Application Load Balancer (Public Subnet)
  ↓ (Routes to web tasks only)
ECS Web Tasks (nginx + gunicorn) (Private Subnet)
  ↓
  ├─→ RDS PostgreSQL (Private Subnet)
  ├─→ ElastiCache Valkey (Private Subnet - cache/sessions)
  ├─→ Amazon SQS (Send tasks via IAM role)
  └─→ S3 (via IAM role)

ECS Celery Workers (Private Subnet)
  ↓
  ├─→ RDS PostgreSQL (Private Subnet)
  ├─→ ElastiCache Valkey (Private Subnet - result backend)
  ├─→ Amazon SQS (Receive/Delete tasks via IAM role)
  └─→ S3 (via IAM role)

ECS Celery Beat (Private Subnet)
  ↓
  ├─→ RDS PostgreSQL (Private Subnet)
  └─→ Amazon SQS (Send scheduled tasks via IAM role)

Key Points:

  • Only web tasks receive traffic from ALB - Workers have no inbound internet traffic
  • All three task definitions connect to RDS
  • Celery uses Amazon SQS as message broker - fully managed, serverless queue
  • ElastiCache Valkey used for caching, sessions, and Celery result backend
  • All ECS tasks connect to AWS services (S3, SQS, CloudWatch) via IAM roles
  • No stored credentials required anywhere
  • SQS provides native CloudWatch metrics for auto-scaling
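
A minimal sketch of the Celery configuration these points imply, using the standard SQS transport. With an IAM task role attached, the broker URL carries no credentials; the app name, queue prefix, and cache endpoint are illustrative assumptions:

from celery import Celery

app = Celery("viitata")  # hypothetical app name
app.conf.update(
    broker_url="sqs://",  # credentials come from the ECS task's IAM role
    broker_transport_options={
        "region": "eu-west-2",            # keep queues in London
        "queue_name_prefix": "viitata-",  # hypothetical prefix
        "visibility_timeout": 3600,       # must exceed the longest task runtime
        "polling_interval": 1,
    },
    # ElastiCache Valkey (Redis-compatible) as the result backend
    result_backend="redis://viitata-cache.internal.example:6379/0",  # hypothetical host
)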

Infrastructure-as-Code

The entire architecture is defined in CloudFormation templates:

  • VPC, subnets, route tables, security groups
  • RDS database configuration
  • ElastiCache cluster configuration
  • SQS queues (standard and dead letter queues)
  • ECS cluster, task definitions, and services
  • Application Load Balancer and target groups
  • IAM roles and policies
  • CloudWatch alarms and dashboards

Benefits:

  • Version-controlled infrastructure
  • Reproducible environments (staging = production)
  • No manual configuration or drift
  • Peer-reviewed infrastructure changes via Git
  • Disaster recovery: Rebuild from templates
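
As one hedged illustration of driving these templates programmatically (a real pipeline might equally use the AWS CLI or a GitHub Actions step), a boto3 sketch; the stack name and template path are hypothetical:

import boto3

cfn = boto3.client("cloudformation", region_name="eu-west-2")

with open("infrastructure/viitata-stack.yaml") as f:  # hypothetical path
    template_body = f.read()

cfn.create_stack(
    StackName="viitata-production",  # hypothetical name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack defines IAM roles
    Tags=[{"Key": "compliance-region", "Value": "eu-west-2"}],
)

# Block until creation finishes; the waiter raises if the stack fails.
cfn.get_waiter("stack_create_complete").wait(StackName="viitata-production")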

CI/CD Pipeline

GitHub Actions Workflow:

The deployment pipeline is automated via GitHub Actions, providing consistent and reliable deployments:

graph LR
    A[Code Push to GitHub] --> B[GitHub Actions Triggered]
    B --> C[Build Docker Image]
    C --> D[Push to ECR]
    D --> E[Update ECS Task Definitions]
    E --> F[Deploy Web Tasks]
    E --> G[Deploy Celery Workers]
    E --> H[Deploy Celery Beat]

    style A fill:#e8f5e9
    style B fill:#fff3e0
    style C fill:#e1f5ff
    style D fill:#f3e5f5
    style E fill:#fff9c4
    style F fill:#e8f5e9
    style G fill:#e8f5e9
    style H fill:#e8f5e9

Deployment Process:

  1. Trigger: Code pushed to main branch or pull request merged
  2. Build: GitHub Actions workflow executes
    • Builds single Docker image containing application code
    • Runs tests (optional: can block deployment on failure)
    • Tags image with commit SHA and/or semantic version
  3. Push to ECR: Docker image pushed to Elastic Container Registry in EU-West-2
    • ECR provides secure, private Docker registry
    • Images stored in same region as deployment (UK)
    • Automatic image scanning for vulnerabilities (optional)
  4. Update Task Definitions: GitHub Actions updates ECS task definitions
    • All three task definitions reference the same Docker image
    • Only the container command/entrypoint differs per service:
      • Web: gunicorn command
      • Celery Worker: celery worker command
      • Celery Beat: celery beat command
  5. Deploy: ECS performs rolling updates
    • Web tasks: Rolling deployment with health checks via ALB
    • Celery Workers: Rolling update, new tasks pick up from queue
    • Celery Beat: Stop old task, start new task (single instance)

Single Image, Multiple Services:

All three ECS task definitions use the same Docker image from ECR. The service type is determined by the command executed:

# Example task definition differences
Web Task Definition:
  Image: <ECR_URI>:latest
  Command: ["gunicorn", "app.wsgi:application"]

Celery Worker Task Definition:
  Image: <ECR_URI>:latest  # Same image!
  Command: ["celery", "-A", "app", "worker"]

Celery Beat Task Definition:
  Image: <ECR_URI>:latest  # Same image!
  Command: ["celery", "-A", "app", "beat"]

Benefits:

  • Single build: One Docker image for all services (faster builds)
  • Consistency: All services run identical application code
  • Simplified versioning: Single image tag tracks deployment
  • Reduced storage: ECR stores one image instead of three
  • Atomic deployments: All services deployed from same code version

Comparison to Heroku:

| Aspect | Heroku | AWS ECS |
|---|---|---|
| Deployment trigger | git push heroku | GitHub Actions workflow |
| Build process | Heroku buildpacks | Docker image build |
| Artifact storage | Heroku slug storage | ECR (version-controlled) |
| Deployment control | Limited (auto-deploy) | Full control (approval gates, rollback) |
| Multi-service | Separate apps or Procfile | Task definitions with same image |
| Rollback | heroku releases:rollback | ECS task definition revision or redeploy previous image tag |

Additional CI/CD Capabilities:

  • Environment-specific deployments: Separate workflows for staging and production
  • Approval gates: Require manual approval before production deployment
  • Automated testing: Run integration tests against staging before production
  • Blue-green deployments: Deploy new version alongside old, switch traffic
  • Canary deployments: Gradually shift traffic to new version
  • Automated rollback: Detect failures via CloudWatch alarms and auto-rollback

Cost Analysis

| Component | Heroku (Current) | AWS (Proposed) | Notes |
|---|---|---|---|
| Web Application | ~$25-50 | ~$20-35 | ECS Fargate or EC2 instances |
| Celery Worker | ~$25-50 | ~$20-35 | Right-sized for workload |
| Celery Beat | ~$25-50 | ~$10-15 | Smaller instance for scheduler |
| PostgreSQL | ~$15-20 | ~$20-25 | RDS with backups in UK |
| Redis Cache | ~$15-20 | ~$10-15 | ElastiCache (cache + result backend) |
| SQS Message Queue | Included in dyno | ~$1-3 | Pay per million requests, negligible cost |
| Load Balancer | Included | ~$15-20 | ALB costs |
| Total | ~$130/month | ~$90-115/month | 12-30% savings |

Cost Optimization Opportunities:

  • Auto-scaling with scale-down: Automatically reduce running tasks during low-usage periods (40-50% compute savings during off-peak)
  • Reserved instances for baseline workloads (up to 50% additional savings)
  • Spot instances for Celery workers (up to 70% savings on compute)
  • S3 storage tiers for media files
  • CloudWatch log retention policies
  • Right-sizing based on actual usage patterns

Multi-Tenant Auto-Scaling Impact: The auto-scaling capability is particularly valuable for multi-tenant architecture where tenant usage patterns vary throughout the day. Instead of paying for peak capacity 24/7 (as with Heroku), ECS automatically scales down during low-usage periods, significantly reducing average compute costs.

Important Note: Cost savings are secondary to compliance requirements. Even if costs were equivalent, the migration would be necessary for UK data residency.

Benefits

1. Compliance and Data Sovereignty (Primary Benefit)

  • UK data residency: All infrastructure and data in EU-West-2 (London)
  • Backup compliance: Database backups remain in UK
  • Audit trail: Full control and visibility over data location
  • Regulatory compliance: Meets UK data protection requirements
  • Client confidence: Can guarantee UK-only data storage
  • Reduced legal risk: Eliminates cross-border data transfer concerns

2. Cost Efficiency

  • 12-30% immediate cost reduction (~$130 → ~$90-115/month)
  • Automatic scale-down during low usage: 40-50% additional compute savings during off-peak hours
  • Multi-tenant load optimization: Auto-scaling handles variable tenant usage patterns efficiently
  • Additional savings opportunities with reserved/spot instances
  • More granular resource allocation (no paying for unused capacity)
  • Flexible pricing models (on-demand, reserved, spot)
  • Pay only for actual resource consumption, not fixed capacity

3. Greater Control and Flexibility

Infrastructure Control:

  • Full control over container images and runtime environment
  • Custom networking and security group configuration
  • Direct access to infrastructure-level metrics
  • Ability to implement custom security controls
  • VPC configuration for network isolation

Operational Flexibility:

  • Choose instance types optimized for workload
  • Separate scaling policies per service
  • Custom monitoring and alerting
  • Advanced deployment strategies (blue/green, canary)

Infrastructure-as-Code:

  • 100% version-controlled infrastructure using CloudFormation templates
  • Entire infrastructure stack defined as code (VPC, ECS, RDS, ElastiCache, ALB, etc.)
  • Git-based workflow for infrastructure changes (review, approve, deploy)
  • Reproducible environments (staging matches production exactly)
  • Disaster recovery: Rebuild entire infrastructure from templates
  • Change tracking and audit trail for infrastructure modifications
  • Team collaboration on infrastructure changes via pull requests
  • No manual ClickOps or undocumented configuration drift

Heroku Limitation: On Heroku, infrastructure is configured via web UI or CLI commands that aren’t easily version-controlled. App configuration can be tracked, but the underlying platform infrastructure (databases, dynos, add-ons) requires manual provisioning and documentation.

4. Performance Improvements

AWS Performance Advantages:

  • Dedicated compute resources (no noisy neighbors)
  • Auto-scaling for traffic spikes: Automatically adds capacity when multiple tenants become active
  • Dynamic resource allocation: Scales up/down based on actual demand (CPU, memory, request count, queue depth)
  • Advanced connection pooling (RDS Proxy)
  • No 30-second timeout constraints
  • Custom CDN configuration (CloudFront)
  • Private networking between services (reduced latency)
  • Better database performance tuning options
  • Independent scaling per service type (web vs. workers)

5. AWS Service Ecosystem

Native Integration:

  • Amazon SQS for Celery message broker: Fully managed, serverless queue with no infrastructure to maintain
  • S3 for media and static file storage (same region)
  • SES for email services
  • Lambda for serverless functions
  • CloudWatch for comprehensive monitoring (includes native SQS metrics for auto-scaling)
  • AWS Secrets Manager for credential management
  • IAM roles for secure, passwordless service access
  • VPC endpoints for private AWS service access

SQS-Specific Benefits:

  • Zero infrastructure management: No Redis/ElastiCache broker to maintain or scale
  • Native CloudWatch metrics: ApproximateNumberOfMessagesVisible metric for accurate worker auto-scaling
  • Reliability: Built-in message durability and delivery guarantees
  • Dead letter queues: Automatic handling of failed tasks
  • Cost-effective: Pay only for messages processed (~$0.40 per million requests)
  • Unlimited scalability: No capacity planning required
  • IAM-based security: No broker credentials to manage

Technical Considerations

Containerization

ECS Container Setup:

  • Dockerize Django application (web, Celery, Beat)
  • Use ECR (Elastic Container Registry) for image storage
  • Multi-stage builds for optimized image sizes
  • Health checks for container orchestration
  • Environment-based configuration

Database Migration

RDS PostgreSQL:

  • Similar to Heroku Postgres (based on PostgreSQL)
  • Enhanced monitoring and performance insights
  • Automated backups with configurable retention
  • Multi-AZ deployment for high availability
  • Parameter groups for fine-tuned configuration
  • RDS Proxy for connection pooling
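
A hedged sketch of the Django database settings this implies, pointing at an RDS Proxy endpoint with TLS enforced; the host name is an illustrative placeholder:

import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ["DB_NAME"],
        "USER": os.environ["DB_USER"],
        "PASSWORD": os.environ["DB_PASSWORD"],  # injected from Secrets Manager
        "HOST": "viitata-proxy.example.eu-west-2.rds.amazonaws.com",  # hypothetical
        "PORT": "5432",
        "CONN_MAX_AGE": 60,  # persistent connections complement proxy-side pooling
        "OPTIONS": {"sslmode": "require"},  # encrypt in transit
    }
}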

Networking and Security

VPC Configuration:

  • Private subnets for database and cache
  • Public subnets for load balancer
  • Security groups for service isolation
  • NAT Gateway for outbound traffic from private subnets
  • VPC endpoints for AWS services (S3, Secrets Manager)

Security Enhancements:

  • IAM roles instead of static credentials
  • Secrets Manager for sensitive configuration
  • AWS WAF (Web Application Firewall) for web protection
  • CloudTrail for audit logging
  • Encryption at rest and in transit
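
A short sketch of what "IAM roles instead of static credentials" looks like from application code: boto3 resolves temporary credentials from the ECS task role automatically, and secrets are fetched at runtime rather than stored in the environment. The bucket and secret names are hypothetical:

import json

import boto3

# No access keys anywhere: credentials come from the ECS task role.
s3 = boto3.client("s3", region_name="eu-west-2")
s3.upload_file("/tmp/report.pdf", "viitata-media-eu-west-2", "reports/report.pdf")

# Database credentials live in Secrets Manager, read via the same role.
secrets = boto3.client("secretsmanager", region_name="eu-west-2")
db_secret = json.loads(
    secrets.get_secret_value(SecretId="viitata/production/db")["SecretString"]
)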

Monitoring and Observability

CloudWatch Integration:

  • Application logs aggregation
  • Custom metrics and dashboards
  • Alerting for critical events
  • Performance monitoring
  • Cost tracking and budgets
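
For the custom metrics mentioned above, a hedged example publishing a per-tenant datapoint; the namespace, metric, and dimension names are illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")
cloudwatch.put_metric_data(
    Namespace="Viitata/Tenants",  # hypothetical namespace
    MetricData=[
        {
            "MetricName": "DocumentsProcessed",  # hypothetical metric
            "Dimensions": [{"Name": "TenantId", "Value": "acme"}],
            "Value": 42.0,
            "Unit": "Count",
        }
    ],
)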

X-Ray (Optional):

  • Distributed tracing
  • Performance bottleneck identification
  • Request flow visualization

High Availability

AWS HA Features:

  • Multi-AZ RDS deployment
  • ECS service auto-recovery
  • Application Load Balancer health checks
  • Automated backups and snapshots
  • Cross-AZ redundancy

Trade-offs

Advantages

  • UK data residency and compliance (eliminates critical blocker)
  • Cost savings (12-30%)
  • 100% infrastructure-as-code with CloudFormation (version control, reproducibility, no drift)
  • Greater infrastructure control
  • Better performance and scalability
  • AWS service ecosystem integration
  • More flexible resource allocation
  • Enhanced security capabilities
  • Better monitoring and observability

Disadvantages

  • Increased operational responsibility (less managed than Heroku)
  • Steeper learning curve for AWS services
  • More complex infrastructure management
  • Need for AWS expertise in team
  • Infrastructure-as-code maintenance overhead
  • More components to monitor and maintain
  • Deployment complexity (ECS vs. git push heroku)

Risk Mitigation

  • Training and documentation: Invest in AWS training for team
  • Infrastructure-as-code: Use the CloudFormation templates described above for reproducibility
  • Gradual migration: Test thoroughly in staging environment
  • Monitoring from day one: Comprehensive CloudWatch setup before go-live
  • AWS support plan: Consider AWS Business Support for expert guidance
  • Runbooks and procedures: Document common operational tasks
  • Disaster recovery testing: Regular restore testing from backups

Compliance Requirements

UK Data Residency

Requirements Met:

  • All compute resources in EU-West-2 (London)
  • Database and backups in EU-West-2 (London)
  • Redis cache in EU-West-2 (London)
  • Application logs in EU-West-2 (CloudWatch Logs)
  • S3 storage in EU-West-2 (if used)

Audit Trail:

  • CloudTrail logs all API calls and data access
  • Resource tagging for compliance tracking
  • IAM policies enforce regional restrictions
  • AWS Organizations for governance controls

Data Protection Compliance

GDPR/UK GDPR Alignment:

  • Data processing within UK jurisdiction
  • No cross-border data transfers to USA
  • Right to erasure (tenant deletion capabilities)
  • Data encryption at rest and in transit
  • Audit logs for data access

Conclusion

The migration from Heroku to AWS ECS represents a strategic necessity for Viitata, driven primarily by UK data residency and compliance requirements. The current Heroku architecture creates unacceptable compliance risk due to database backups being retained in the USA, making it impossible to guarantee UK-only data storage.

This migration delivers:

  • Resolution of the critical compliance issue: UK data residency with backups in EU-West-2 (London)
  • Elimination of USA data transfer: all data and backups remain in the UK
  • Cost savings: 12-30% base reduction, plus 40-50% additional compute savings from auto-scaling during off-peak hours
  • Auto-scaling with scale-down: handles multi-tenant traffic spikes automatically while cutting costs during low-usage periods
  • 100% version-controlled infrastructure: CloudFormation templates for the entire stack, eliminating manual configuration and drift
  • Greater operational control: full infrastructure flexibility and customization
  • Performance improvements: dedicated resources, auto-scaling for spikes, and advanced AWS features
  • AWS service ecosystem: native integration with SQS (message broker), S3, CloudWatch, IAM, and other services
  • Serverless message queue: SQS eliminates the need to manage message broker infrastructure
  • Enhanced security: VPC isolation, IAM roles, Secrets Manager, and encryption

While AWS requires greater operational expertise compared to Heroku’s managed platform, the compliance requirements make this migration essential. The additional benefits of cost savings, performance improvements, and operational flexibility provide further justification beyond the compliance imperative.

Without this migration, Viitata cannot serve clients with UK data residency requirements and remains exposed to compliance risks associated with cross-border backup storage.
