EPA Quote Forms — End-to-End Testing Approach
Overview of Automated End-to-End Testing for EPA Insurance Quote Forms
Purpose
This memo outlines the automated end-to-end (E2E) testing strategy in place for the EPA insurance quote platform. The test suite provides continuous quality assurance across all quote form journeys,
ensuring that customers can complete quotes reliably and that business-critical outcomes are correctly handled.
Separation of Concerns — Unit Tests vs End-to-End Tests
The EPA platform has two distinct layers of automated testing, each with a different purpose and owned by a different part of the codebase.
Unit Tests — Inside the Product
Unit tests live within the flux-epa product repository itself. They test the application’s internal logic in isolation — individual functions, business rules, data transformations, and component
rendering — without launching a browser or navigating a real form. Unit tests are:
- Fast — They run in milliseconds because they don’t involve a browser or network
- Narrow — Each test covers a single piece of logic (e.g. “does this function calculate the premium correctly?”)
- Developer-facing — They catch regressions in code as it is written, typically running on every commit
Unit tests answer the question: does each piece of the application work correctly on its own?
End-to-End Tests — Outside the Product
This E2E test suite is a separate project, deliberately kept outside the product codebase. It tests the application as a deployed whole — launching a real browser, navigating to a real
environment, and completing quote journeys exactly as a customer would. E2E tests are:
- Slow — Each test takes seconds to minutes because it drives a real browser through multiple form pages
- Wide — A single test exercises the full stack: frontend rendering, form validation, API calls, backend processing, and quote outcome routing
- Customer-facing — They verify the experience from the customer’s perspective, not the developer’s
E2E tests answer the question: can a customer actually complete a quote and reach the right outcome?
Why Both Are Needed
Neither layer can replace the other. They catch fundamentally different types of problems:
| | Unit Tests | E2E Tests |
|---|---|---|
| Catches | Logic errors in individual functions | Broken journeys, wrong destinations, missing content |
| Misses | Integration failures between components | Internal implementation bugs |
| Speed | Milliseconds | Seconds to minutes |
| Runs against | Code in isolation | A deployed environment |
| Owned by | Product repository (flux-epa) | Test repository (epa-tests-e2e) |
A unit test might confirm that a premium calculation function returns the correct value, but it cannot tell you whether the customer actually sees that value on screen, or whether clicking “Get a
Quote” sends them to the right page. Conversely, an E2E test can confirm the full journey works, but if it fails, it cannot pinpoint whether the issue is in the frontend, the API, or the business
logic — that is where unit tests provide the detail.
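The contrast can be sketched in a few lines of Python. The premium function, values, and journey below are invented for illustration; only the division of labour between the two layers reflects this memo:

```python
# Unit test (lives in the product repo): one function, no browser, milliseconds.
def calculate_premium(base_rate: float, risk_multiplier: float) -> float:
    """Toy premium rule, invented for this illustration."""
    return round(base_rate * risk_multiplier, 2)

def test_premium_calculation():
    # Fails only if the calculation logic itself is wrong.
    assert calculate_premium(500.0, 1.2) == 600.0

# E2E test (lives in the separate test repo): a real browser against a deployed
# environment, seconds to minutes. Shown as a comment because it needs a running
# environment; the URL and step names here are hypothetical.
#
# def test_full_quote_journey(page):
#     page.goto("https://uat.example.com/car-quote")
#     VehicleLookupStep(page).fill()
#     PersonalDetailsStep(page).fill()
#     expect(page).to_have_url(re.compile(r"/quote/buy"))
```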
Ownership Model
Each product has a product owner — a single developer who is responsible for that product’s E2E test suite. They maintain the component classes, step classes, test methods, and documentation, and
they evolve the E2E coverage as the product grows. The product owner is the person who understands the customer journeys end to end and ensures the test suite reflects them accurately.
Other developers working on the product — building features, fixing bugs, refactoring code — are responsible for writing unit tests to cover their changes within the product repository. They are
not expected to write or maintain E2E tests. However, the E2E suite is available to them as a tool:
- Validation — After completing a feature or fix, a developer can run the E2E tests against their deployed changes to confirm that the customer-facing journeys still work correctly
- Sanity checking — A quick smoke run provides confidence that a change has not broken anything obvious before handing work over for review
- Regression testing — The full suite can be run to verify that a change has not introduced side effects in other parts of the form or in other brands
This creates a clear division of responsibility:
| | Product Owner | Feature/Bug Fix Developers |
|---|---|---|
| Writes | E2E tests (components, steps, test methods) | Unit tests (within the product repo) |
| Maintains | E2E test suite, documentation, parametrised rules | Product code, unit test coverage |
| Uses E2E tests to | Validate developer work, verify releases, expand coverage | Sanity check their own changes, run regression tests |
The E2E tests become a quality gate that the product owner controls. When a developer submits work, the product owner can run the relevant E2E tests to verify that customer journeys are intact —
without needing to manually click through forms. This frees developers to focus on building and unit-testing their code, while the product owner has an automated tool to confirm the end result works
from the customer’s perspective.
The Separation Is Deliberate
Keeping the E2E tests in their own repository and distributing them as a package reinforces this ownership model:
- Independent release cycles — The product owner can update the E2E suite without changing the product, and vice versa
- Environment flexibility — E2E tests run against any deployed environment, not just a local development setup
- No coupling — The E2E tests interact with the application purely through its UI, the same way a customer does. They have no knowledge of internal code, database schemas, or API contracts. If a
developer refactors the product’s internals but the UI behaviour remains the same, the E2E tests continue to pass without changes
- Low barrier for developers — Developers do not need to understand the E2E test codebase to benefit from it. They run the tests, review the report, and act on any failures
Scope of Testing
Products Covered
The test suite covers 6 insurance quote forms across 3 brands:
| Brand | Product | Test Cases |
|---|---|---|
| Adrian Flux | Car Insurance | 32 |
| Adrian Flux | Learner Driver Insurance | 21 |
| Sterling Insurance | Car Insurance | 34 |
| Sterling Insurance | Learner Driver Insurance | 16 |
| Bikesure | Motorcycle Insurance | 34 |
| Bikesure | Short-term Motorcycle Insurance | 32 |
| Total | | 169 |
Each form consists of up to 12 steps that a customer completes to receive a quote, including vehicle details, personal information, driving history, and policy preferences.
What the Tests Verify
Tests are organised into the following categories:
- Smoke tests — Confirm that every page of the quote form loads correctly, displays the right content (titles, disclaimers, legal text), and that the core happy-path journey works end to end.
- Validation tests — Verify that form fields reject invalid or missing input and display the correct error messages to the customer.
- Quote outcome tests — Complete the full quote journey and confirm that customers are directed to the correct outcome page (buy online, request a callback, or call us) based on their details.
- Source code tests — Verify that marketing source codes are correctly passed through the entire journey and appear in callback URLs and quote outcomes, ensuring accurate attribution.
- Multi-party tests — Test the additional driver flows, including adding, editing, and removing named drivers from a policy.
These tests are scoped to the quote form itself — the customer-facing frontend application. They do not directly test connected backend services such as the quoting engine, pricing APIs, CRM
systems, or payment gateways. The tests interact only with what the customer sees in the browser.
However, because the form depends on backend services to function, certain backend behaviours can be inferred from the test results. When a test completes a full quote journey and asserts the
outcome, it is implicitly validating that the backend processed the submission correctly:
- Destination page assertions — When a test submits a quote and verifies the customer is redirected to the “buy” page, this confirms that the backend quoting engine received the submission,
processed it, returned a quotable result, and the form routed the customer to the correct destination. A failure here could indicate a backend issue (quoting engine down, pricing rules
misconfigured) even though the test itself only checks the URL the customer lands on.
- Quote reference assertions — When a test verifies that a quote reference number is displayed on the outcome page, this confirms that the backend successfully generated and returned a quote
reference. The presence of this reference implies the quote was persisted in the backend system.
- Content assertions on outcome pages — Tests verify specific content on outcome pages: premium amounts, policy benefits, insurer names, excess values, and callback phone numbers. This content is
populated by backend responses. If the backend returns incorrect data, these assertions will fail — surfacing a backend problem through a frontend test.
- Source code propagation — Tests verify that marketing source codes passed via URL parameters survive the full journey and appear in callback URLs and outcome pages. This validates that the form
correctly passes source codes to the backend and that the backend includes them in its response data.
In this way, the E2E tests act as an early warning system for backend issues, even though they are not backend tests. A pattern of destination or content failures across multiple EPAs can indicate
a shared backend service problem, while a failure isolated to a single EPA points to a product-specific configuration issue.
What these tests will not tell you is why a backend service failed — only that something in the chain produced an unexpected result from the customer’s perspective. Diagnosing the root cause
requires backend logs, monitoring, and the product’s own unit and integration tests.
Architecture
The test suite is built on a three-layer architecture that separates concerns and promotes reuse. Each layer has a distinct responsibility:
┌─────────────────────────────────────────────────────┐
│ Tests │
│ Orchestrate steps into full scenarios and assert │
│ expected outcomes. │
│ e.g. test_AFC004_simple_vehicle_lookup │
├─────────────────────────────────────────────────────┤
│ Steps │
│ Represent a single page/step of the form. │
│ Compose element actions into complete scenarios │
│ for that page. │
│ e.g. VehicleLookupStep.fill() │
├─────────────────────────────────────────────────────┤
│ Components │
│ Represent a reusable form component. Expose │
│ individual user actions on that component. │
│ e.g. F058VehicleLookupCar.reg_lookup() │
└─────────────────────────────────────────────────────┘
Components — Individual Actions
At the lowest level, component classes represent a single reusable form element such as a vehicle registration lookup or a purchase details panel. Each method on a component maps to one discrete
user action: entering a registration number, clicking “Find Car”, selecting a manufacturer from a dropdown, and so on.
Because components are self-contained, the same component can be shared across multiple brands and products. For example, the car vehicle lookup component is used by both Adrian Flux Car and Sterling
Car forms without duplication.
Critically, the component numbering in the test suite mirrors the component numbering used in the flux-epa product itself and its documentation. The test class F058VehicleLookupCar corresponds
directly to component F058 in the EPA platform; F066VehiclePurchaseDetails corresponds to component F066, and so on. This shared naming convention means:
- Traceability — When a component is changed in the product, it is immediately clear which test component covers it
- Common language — Developers, testers, and product documentation all refer to the same component by the same number
- Incremental coverage — New actions and scenarios can be added to a component class over time without touching any existing tests. For example,
F058VehicleLookupCar currently covers
registration lookup, manual entry, and vehicle changes. As the product evolves, additional actions (e.g. a new vehicle data source, a different lookup flow) can be added to the same class and then
consumed by new or existing step scenarios. Coverage grows gradually, component by component, without requiring large-scale rewrites
The current component library is small — F001 (motorcycle lookup), F058 (car lookup), and F066 (purchase details) — but it is designed to expand. As new components are built in the test suite, they
follow the same F-number convention, keeping the test layer aligned with the product layer.
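Under these conventions a component class might look like the sketch below. The selectors and method bodies are assumptions; only the class name and the one-action-per-method convention come from this memo. Because the class depends only on an object offering Playwright's `fill` and `click` methods, its composition can be checked without launching a browser:

```python
class F058VehicleLookupCar:
    """Sketch of a component class (selectors are hypothetical)."""

    def __init__(self, page):
        # `page` is a Playwright Page in real use; any object with the same
        # fill/click methods works, which keeps this sketch browser-free.
        self.page = page

    def enter_registration(self, reg: str) -> None:
        # One discrete user action: typing into the registration field.
        self.page.fill("#vehicle-reg", reg)  # hypothetical selector

    def click_find_car(self) -> None:
        # Another discrete action: triggering the lookup.
        self.page.click("text=Find Car")

    def reg_lookup(self, reg: str) -> None:
        # A composed lookup: exactly the actions a customer performs, in order.
        self.enter_registration(reg)
        self.click_find_car()
```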
Steps — Page Scenarios
The middle layer contains step classes, one per page of the quote form. A step class composes the actions from one or more components into complete scenarios for that page. For example, the entry
page step brings together the vehicle lookup component and the purchase details component, calling their actions in the right order and advancing to the next page.
Each step exposes different scenarios:
- fill — the standard happy-path completion of that page
- fill_complex — an advanced path that exercises more of the page’s functionality (manual entry, changing selections, toggling options)
- error methods — deliberately trigger validation errors to test that the form rejects bad input correctly
This means the details of how a form page works are defined in one place. If the form changes, only the step class needs updating — not every test that uses it.
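A step class in this style might look like the following sketch. The selectors, the stand-in component, and the scenario bodies are assumptions; only the fill/error naming pattern follows this memo:

```python
class _VehicleLookup:
    """One-method stand-in for a component class such as F058VehicleLookupCar."""
    def __init__(self, page):
        self.page = page
    def reg_lookup(self, reg: str) -> None:
        self.page.fill("#vehicle-reg", reg)   # hypothetical selector
        self.page.click("text=Find Car")

class VehicleLookupStep:
    """Sketch of a step class; all selectors and bodies are assumptions."""
    def __init__(self, page):
        self.page = page
        self.lookup = _VehicleLookup(page)

    def fill(self, reg: str = "AB12 CDE") -> None:
        # Happy path: complete the page and advance. How the page works lives
        # here, so a form change means updating this class, not every test.
        self.lookup.reg_lookup(reg)
        self.page.click("text=Continue")

    def fill_error_empty_reg(self) -> None:
        # Deliberately submit without a registration to exercise validation.
        self.page.click("text=Find Car")
```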
Tests — Orchestrated Journeys
At the top level, test methods orchestrate steps into meaningful scenarios and assert expected outcomes. A test might call a single step to verify one page works in isolation, or chain all steps
together to complete a full end-to-end quote journey.
Tests are concerned with what should happen, not how to interact with the form. For example, a full quote journey test reads as a simple sequence — fill vehicle details, fill personal details,
fill driving history, get a quote, verify the outcome — with each step handling its own interactions internally.
Why This Matters
This layered separation provides several practical benefits:
- Reduced duplication — Common form components are written once and reused across brands. The vehicle lookup is defined in a single component class, not copied into every test.
- Easier maintenance — When a form page changes, updates are made in one step class rather than across dozens of individual tests.
- Readability — Tests read as business-level scenarios (“complete personal details, then get a quote”) rather than low-level browser interactions (“click this button, fill this field”).
- Faster test development — New tests for existing forms can be composed from the library of steps and components that already exist.
Technology Stack
Playwright — The Engine
Playwright is the core of the test suite. It is a browser automation framework developed by Microsoft that launches and controls a real web browser (Chromium) programmatically. Playwright is
responsible for everything the tests do:
- Navigating to quote form URLs
- Interacting with form elements — clicking buttons, filling text fields, selecting dropdowns, toggling radio buttons — exactly as a customer would
- Waiting intelligently for pages to load, network requests to complete, and elements to become visible before proceeding
- Asserting that the page is in the expected state — checking text content, element visibility, URL changes, and page metadata
- Recording video of every test session and capturing screenshots on failure
- Tracing detailed execution logs (DOM snapshots, network requests, console output) that can be replayed step-by-step when investigating failures
The EpaStep class wraps Playwright’s Page object with helpers specific to EPA quote forms — navigating between form steps, verifying step titles, checking field-level validation errors, and
confirming quote outcome destinations. Every component and step class ultimately delegates to Playwright for all browser interactions.
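A minimal sketch of such a wrapper, assuming hypothetical selectors and method signatures (the memo names the helper responsibilities but not their exact API):

```python
class EpaStep:
    """Sketch of an EpaStep-style wrapper; selectors and signatures assumed."""

    def __init__(self, page):
        self.page = page  # a Playwright Page in real use

    def verify_step_title(self, expected: str) -> None:
        # Delegates the DOM read to Playwright, then asserts on the result.
        actual = self.page.inner_text("h1")  # hypothetical heading selector
        assert expected in actual, f"expected title {expected!r}, got {actual!r}"

    def verify_destination(self, path_fragment: str) -> None:
        # Quote outcome routing is checked purely from the customer-visible URL.
        assert path_fragment in self.page.url, (
            f"{path_fragment!r} not in {self.page.url}"
        )
```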
pytest — The Runner
pytest serves as the test runner and orchestration layer. It does not interact with the browser directly — that is entirely Playwright’s domain. pytest’s responsibilities are:
- Discovery — Automatically finding and collecting all test methods across the 6 product suites
- Fixtures — Managing setup and teardown (browser contexts, cookies, environment configuration, form URLs) so that each test starts in a clean, correctly configured state
- Markers — Providing the tagging system (@pytest.mark.smoke, @pytest.mark.adrianflux, etc.) that allows selective test execution
- Parametrisation — Driving data-driven tests, such as running the same quote journey with 21 different source codes or multiple email/price combinations to verify different outcomes
- Reporting — Generating the HTML test report with pass/fail results, embedded screenshots, video links, and links to test case documentation
- Plugin system — The test suite is packaged as a pytest plugin, meaning consuming projects get all fixtures, markers, and configuration automatically just by installing the package
How They Work Together
pytest                                  Playwright
──────                                  ──────────
Discovers tests
Resolves fixtures (URLs, cookies)
                                        Launches browser
                                        Sets viewport, video recording
Calls test method
                                        Navigates to quote form
                                        Fills fields, clicks buttons
                                        Waits for pages to load
                                        Asserts page content
Collects pass/fail result
                                        Captures screenshot (on failure)
                                        Saves video recording
Generates HTML report
Attaches screenshots & video links
In short: Playwright does the work, pytest organises and reports on it.
Automatic Evidence Capture
Every test run automatically produces:
- Video recordings of each test, viewable directly from the HTML report
- Screenshots captured at the point of any failure
- Playwright trace files for failed tests, providing a step-by-step replay of DOM state, network activity, and console output
These artifacts are compiled into a self-contained HTML report that can be opened in any browser and shared without special tooling. Each test result links directly to its video, failure
screenshot, and corresponding test case documentation on GitHub.
Reports from CI runs are published to a CDN and are accessible via a browser.
Test Environments
Tests can be targeted at any environment (development, UAT, staging, production) by changing a single configuration value. Each brand can also be pointed at a different environment independently,
allowing testing to proceed in parallel across teams.
Test Execution
On-Demand via GitHub
Tests are triggered on demand through GitHub Actions. The person running the tests selects:
- The target environment (e.g. UAT, staging)
- Optionally, a subset of tests to run (e.g. only smoke tests, only a specific brand)
Results and artifacts are retained for 30 days.
Selective Execution
Tests are tagged with descriptive markers, making it straightforward to run targeted subsets:
| Marker | What It Runs |
|---|---|
| smoke | Core happy-path tests across all brands |
| validation | Input validation and error handling |
| quoting | Full end-to-end quote journeys |
| adrianflux | All Adrian Flux tests only |
| sterling | All Sterling tests only |
| bikesure | All Bikesure tests only |
| car | Car insurance forms only |
| learner | Learner driver forms only |
| bike | Motorcycle forms only |
Markers can be combined — for example, running only smoke tests for Bikesure motorcycle forms.
Parametrised Tests — Validating Destination Rules at Scale
One of the most powerful features of the test suite is its use of parametrisation to validate EPA destination rules comprehensively. Rather than writing a separate test for each combination of
inputs and expected outcomes, a single test method is written once and then driven by a data table. pytest automatically generates and runs a distinct test for every row in that table.
The Problem: Destination Rules Vary by EPA
Each EPA has its own business rules that determine where a customer is directed after completing a quote. Depending on the combination of quote status, premium amount, and source code, the customer
may be sent to:
- Buy online — A quote was issued and the customer can purchase immediately
- Callback — The quote needs further underwriting; the customer is offered a callback
- Call us — The quote cannot be processed online; the customer is directed to call
These rules differ between EPAs. For example, a quoted customer with a premium of £1,000 might be directed to the buy page on Sterling Car, but to the callback page on Adrian Flux Car. Getting
these rules wrong means customers end up on the wrong page — either unable to buy when they should be able to, or seeing incorrect messaging.
How Parametrisation Solves This
Each EPA’s destination rules are expressed as a simple data table directly in the test code:
| Email Prefix | Price | Expected Destination |
|---|---|---|
| quoted | £0 | call-us |
| quoted | £1 | callback |
| quoted | £1,999 | callback |
| quoted | £2,000 | callback |
| quoted | £3,999 | callback |
| quoted | £4,000 | call-us |
| rejected | £0 | call-us |
| rejected | £1 | call-us |
| rejected | £1,999 | call-us |
| … | … | … |
(Example: Adrian Flux Car — 12 combinations from a single test definition)
From this one table, pytest generates 12 independent tests — each running the full quote journey end to end with different inputs and verifying the customer lands on the correct outcome page. The
report shows each combination as a separate pass or fail, making it immediately clear which specific rule has broken.
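A hedged sketch of the mechanism: the table and routing function below are a toy stand-in for the real end-to-end journey, but the @pytest.mark.parametrize wiring is the standard pytest feature this section describes.

```python
import pytest

# The rules table, one row per combination (values mirror the example above).
DESTINATION_RULES = [
    ("quoted",   0,    "call-us"),
    ("quoted",   1,    "callback"),
    ("quoted",   3999, "callback"),
    ("quoted",   4000, "call-us"),
    ("rejected", 0,    "call-us"),
    ("rejected", 1999, "call-us"),
]

def route(status: str, price: int) -> str:
    """Toy routing stand-in; the real rules live in each EPA's backend."""
    if status != "quoted" or price == 0 or price >= 4000:
        return "call-us"
    return "callback"

@pytest.mark.parametrize("status, price, expected", DESTINATION_RULES)
def test_destination(status: str, price: int, expected: str) -> None:
    # pytest generates one independent, individually reported test per row.
    assert route(status, price) == expected
```

In the real suite the body of `test_destination` would drive the full browser journey via the step classes rather than call a local function; the parametrisation pattern is identical.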
Scaling Across EPAs
The same pattern is applied across all 6 products, with each EPA’s table reflecting its own destination rules. This means:
- Adding a new destination rule is as simple as adding a row to the table — no new test code to write
- Changing a rule (e.g. moving the callback threshold from £2,000 to £3,000) means updating one value in the table
- Each EPA’s rules are visible at a glance in the test file, serving as living documentation of the business logic
Beyond Destinations: Source Code Attribution
The same parametrisation approach is used to validate marketing source code handling across the quote journey. A second data table combines source codes, “where did you hear” selections, expected
destinations, and expected callback URL parameters — generating up to 21 test runs from a single test method per EPA.
This covers scenarios such as:
- A source code passed via URL appearing correctly in callback links
- Short-form source codes being resolved to their canonical form
- Missing or empty source codes falling back to the correct default
- Source codes surviving the full journey through all form steps and appearing on the outcome page
The Net Effect
Across all 6 products, parametrisation generates a large number of test runs from a relatively small number of test definitions. One test method with a 12-row table and another with a 21-row table
produces 33 full end-to-end journeys per EPA — nearly 200 destination and source code validations across the suite, each running independently and reporting individually. Adding a new EPA’s rules
means defining its data table; the test logic is already written.
Reusability — Packaged as a Shared Library
The test suite is not a standalone script — it is built and distributed as a Python package (flux-epa-e2e-tests) that can be installed into any project. This means the tests, fixtures, step
classes, and component classes are all reusable across multiple consuming applications.
How It Works
The package is registered as a pytest plugin via a standard entry point. When a consuming project installs the package, pytest automatically discovers and loads:
- All test fixtures (browser configuration, cookie setup, environment resolution)
- All test markers (smoke, quoting, adrianflux, etc.)
- All default configuration (HTML reporting, video recording, tracing, output directories)
- All test suites, step classes, and component classes
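Registration as a pytest plugin uses pytest's standard pytest11 entry-point group. The module path and exact metadata below are assumptions; only the package name and version come from this memo:

```toml
[project]
name = "flux-epa-e2e-tests"
version = "0.12.0"

[project.entry-points.pytest11]
flux_epa_e2e = "flux_epa_e2e.plugin"  # hypothetical module path
```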
A consuming project needs only two things to run the full test suite:
- pip install flux-epa-e2e-tests (or add it to their dependencies)
- A .env file with the target environment URLs and QA bypass cookie
No additional pytest configuration, fixture definitions, or test imports are required — the plugin handles all of it. The consuming project can also override any default (report path, output
directory, target URLs) through its own configuration or command line flags.
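A minimal .env might look like the following; every variable name and value here is hypothetical, since the memo does not list the actual keys:

```ini
# Hypothetical .env — illustrative names only
ADRIANFLUX_BASE_URL=https://uat.example.com/car-quote
STERLING_BASE_URL=https://staging.example.com/car-quote
BIKESURE_BASE_URL=https://uat.example.com/bike-quote
QA_BYPASS_COOKIE=changeme
```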
Versioned Releases
The package follows semantic versioning (currently v0.12) and is published to a private package registry (Gemfury). This ensures:
- Consuming projects pin to a known version and upgrade deliberately
- Test changes are tracked through a changelog
- Rollback to a previous version is straightforward if needed
CI/CD Pipelines
The test suite has two GitHub Actions workflows that automate execution and distribution.
1. Test Execution Pipeline
A manually triggered workflow that runs the test suite against a configured environment:
- Provisions an Ubuntu runner with Python 3.12
- Installs the test package and Playwright’s Chromium browser
- Runs tests — optionally filtered by a marker expression (e.g. smoke, adrianflux and quoting)
- Captures the pytest summary line (passed/failed counts)
- Uploads the full results directory (HTML report, videos, screenshots, traces) as a GitHub artifact retained for 30 days
- Uploads the HTML report to Cloudflare R2 cloud storage for easy sharing via a public URL
Environment URLs and secrets (QA bypass cookie, R2 credentials) are managed through GitHub repository variables and secrets, keeping sensitive values out of the codebase.
2. Package Publishing Pipeline
Triggered automatically when a GitHub release is created:
- Builds the Python package (wheel)
- Uploads it to the Gemfury private package registry
This means the release process is: tag a version, create a GitHub release, and the package is available for consuming projects to install within minutes.
Pipeline Summary
┌──────────────────────┐
Manual trigger ──►│ Run E2E Tests │
(select markers) │ - pytest on CI │
│ - HTML report │──► GitHub Artifacts (30 days)
│ - Videos & traces │──► Cloudflare R2 (shareable URL)
└──────────────────────┘
┌──────────────────────┐
GitHub release ──►│ Publish Package │
│ - Build wheel │──► Gemfury Registry
│ - Upload to Gemfury │ (pip install in other projects)
└──────────────────────┘
Test Case Numbering, Documentation, and Traceability
The Numbering System
Every test method carries a unique, sequentially numbered identifier that ties together the test code, its documentation, and its result in the HTML report. The identifier is embedded directly in the
method name and docstring:
- Method name: test_AFC009_complex_vehicle_lookup
- Docstring: AFC-009: Verify complex vehicle lookup with manual entry advances to car details.
Each product suite has its own prefix (AFC for Adrian Flux Car, BKSB for Bikesure Bike, SL for Sterling Learner, etc.) and numbers run sequentially from 001 upwards in the order the tests appear in
the file. This means the case number reflects the test’s position in the suite — there are no gaps or out-of-order numbers.
Documentation with Claude Code
Test case documentation is generated and maintained using Claude Code (Anthropic’s AI coding assistant) through a set of custom slash commands built specifically for this project:
/new-test-case — When a new test is written, this command:
- Reads the test method, the step classes it calls, and the element classes those steps use
- Understands what the test does at the UI level by following the full call chain
- Assigns the next available case number and renames the test method accordingly
- Generates a plain-English documentation file describing every step the test performs
- Adds the new case to the suite’s index file
/check-test-case-order — Audits a test suite to verify that all case numbers are sequential, all documentation files exist, and all cross-references between documents are valid. If numbers have
drifted (e.g. after tests were reordered), it re-indexes the entire suite — renaming test methods, updating doc files, and correcting cross-references in a single operation.
/remove-test-case — Removes a test from the suite, deletes its documentation, and re-indexes all subsequent cases so the numbering remains gapless.
This approach means documentation is never written from scratch by hand. Claude reads the actual test code — including the step and component layers — and produces accurate, up-to-date descriptions of
what each test does. When tests change, the documentation can be regenerated from the code rather than manually updated.
What a Test Case Document Looks Like
Each test case has its own markdown file (e.g. docs/adrianflux_car/AFC-009.md) containing:
- Category — Smoke, Validation, or Parametrised
- Markers — The pytest markers applied to this test
- Test method — The exact function name for traceability back to the code
- Steps — A numbered list describing every action in plain English
- Test data (for parametrised tests) — A table of input combinations and expected outcomes
There are currently 174 individual test case documents plus a summary index for each product that lists all cases in a single table.
Linked in the HTML Report
The test case documentation is not just a separate reference — it is linked directly from the HTML test report. A pytest hook inspects each test result, extracts the case ID from the test method
name (e.g. AFC009 from test_AFC009_complex_vehicle_lookup), and appends a clickable link to the corresponding documentation file on GitHub.
This means that when reviewing test results, anyone can click through from a pass or failure directly to the full plain-English description of what that test was verifying — without needing to read
the test code.
HTML Report Row
┌──────────────────────────────────────────────────────────────┐
│ test_AFC009_complex_vehicle_lookup PASSED │
│ [Video] [Screenshot] [Test Case ↗] │
│ │ │
│ └─► docs/adrianflux_car/AFC-009.md
│ on GitHub │
└──────────────────────────────────────────────────────────────┘
This creates a full traceability chain: test result → test case documentation → test code → step classes → component classes, all connected by the case ID.
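The core of such a hook can be sketched as follows. Only the case-ID pattern follows the naming convention described above; the docs filename mapping and the hook wiring are assumptions:

```python
import re

# test_AFC009_complex_vehicle_lookup -> prefix "AFC", number "009"
CASE_ID = re.compile(r"test_([A-Z]+)(\d{3})_")

def case_doc_file(test_name: str):
    """Map a test method name to its doc filename, or None if no case ID."""
    m = CASE_ID.search(test_name)
    if not m:
        return None
    prefix, number = m.groups()
    return f"{prefix}-{number}.md"

# In conftest.py, a pytest-html hook could then append the GitHub link to each
# report row (hypothetical wiring, shown as a comment):
#
# def pytest_html_results_table_html(report, data):
#     doc = case_doc_file(report.nodeid)
#     if doc:
#         data.append(f'<a href="{DOCS_BASE_URL}/{doc}">Test Case</a>')
```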
Quality Controls on the Test Suite Itself
The test codebase is held to the same engineering standards as production code:
- 100% documentation coverage — Every module, class, and method has a descriptive docstring, enforced by automated checks.
- Automated linting and formatting — Code style is enforced consistently across the entire suite.
- Versioned releases — The test suite is versioned (currently v0.12) and distributed as a package, ensuring reproducible test runs.
Key Benefits
- Confidence in releases — Tests can be run against any environment before or after a deployment to verify that quote journeys work correctly.
- Rapid feedback — Issues are caught early with clear evidence (video, screenshots, traces) to support diagnosis.
- Cross-brand consistency — The same testing patterns are applied uniformly across all 3 brands and 6 products.
- Marketing attribution assurance — Source code tests verify that campaign tracking remains intact through the full customer journey.
- Scalability — Adding a new brand or product follows an established template, keeping the approach consistent as the platform grows.
Applying This Architecture to Other Products
The architecture behind this test suite is not specific to insurance quote forms. The three-layer pattern (components, steps, tests), the tooling (Playwright, pytest, Claude Code documentation), and
the infrastructure (CI pipelines, CDN-hosted reports, package distribution) can be applied to any multi-step web application. Adopting it for a new product does not require starting from scratch — the
patterns and infrastructure are already proven and can be replicated.
What Transfers Directly
The following elements are product-agnostic and can be reused as-is or with minimal adaptation:
- The EpaStep wrapper — The helper class that wraps Playwright’s page with convenience methods (navigate, verify titles, check errors, wait for pages) is not EPA-specific. A similar wrapper could be created for any product, or the existing one extended.
- The three-layer pattern — Components (individual UI actions), steps (page scenarios), and tests (orchestrated journeys) work for any multi-page application: onboarding flows, checkout processes, claims forms, account management, or any wizard-style UI.
- Parametrised destination/outcome testing — Any product with branching outcomes based on user input (approval/rejection, pricing tiers, eligibility checks) can use the same data-table approach to
validate all paths from a single test definition.
- The CI pipeline — The GitHub Actions workflows for running tests on demand, uploading reports to cloud storage, and publishing the package require only configuration changes (environment URLs,
secrets) to point at a different product.
- The CDN-hosted report index — The R2 upload and index page pattern works for any test suite, giving stakeholders a single URL to find any historical test run.
- Claude Code slash commands — The /new-test-case, /check-test-case-order, and /remove-test-case commands are driven by naming conventions (prefix + sequential number + markdown docs). Adapting them for a new product means defining a new prefix — the commands themselves handle discovery, numbering, documentation generation, and re-indexing.
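The convention-driven discovery that these commands rely on is simple to sketch. The helper below is a hypothetical illustration (the function name, directory layout, and three-digit numbering are assumptions based on the AFC-009.md naming shown earlier), not the actual slash-command implementation:

```python
import re
from pathlib import Path

def next_case_id(docs_dir: Path, prefix: str) -> str:
    """Return the next sequential case ID (e.g. 'HI-003') for a given
    prefix by scanning existing docs named '<PREFIX>-<NNN>.md'."""
    pattern = re.compile(rf"{re.escape(prefix)}-(\d+)\.md$")
    numbers = [
        int(m.group(1))
        for p in docs_dir.glob(f"{prefix}-*.md")
        if (m := pattern.search(p.name))
    ]
    # An empty directory yields 0, so a brand-new prefix starts at 001.
    return f"{prefix}-{max(numbers, default=0) + 1:03d}"
```

Because the convention carries all the information, pointing the same logic at a new product is just a matter of supplying a new prefix.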
What a New Product Needs
To bring a new product into the same testing architecture, the product-specific work is:
- Component classes — One class per reusable UI component in the new product, following the F-number convention if the product uses it, or a suitable naming scheme otherwise. Each class exposes
the actions a user can perform on that component.
- Step classes — One class per page or screen, composing component actions into scenarios (happy path, complex path, error paths).
- Test methods — Orchestrating steps into the journeys that matter for that product, tagged with appropriate markers.
- Fixtures — Product-specific configuration: base URLs, environment overrides, expected content (titles, disclaimers, legal text).
- Parametrised data tables — The product’s business rules expressed as input/outcome tables for destination testing.
The infrastructure, reporting, documentation tooling, and package distribution are already in place. A new product plugs into the existing system rather than building its own.
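The parametrised data tables mentioned above can be sketched as follows. This is a toy example: the field names, the £1,000 threshold, and the outcome labels are invented for illustration, and the plain loop stands in for the pytest.mark.parametrize decorator the real suite would use:

```python
# Hypothetical data table: each row pairs quote inputs with the
# outcome the customer should reach.
OUTCOME_TABLE = [
    ({"status": "quoted", "premium": 800}, "buy_online"),
    ({"status": "quoted", "premium": 1200}, "callback"),
    ({"status": "referred", "premium": None}, "callback"),
]

def expected_outcome(status, premium):
    """Toy routing rule: quoted premiums at or under £1,000 can be
    bought online; everything else routes to a callback."""
    if status == "quoted" and premium is not None and premium <= 1000:
        return "buy_online"
    return "callback"

# In the real suite the same table would drive a single test via
# @pytest.mark.parametrize, so every branch is validated from one
# test definition.
for inputs, expected in OUTCOME_TABLE:
    assert expected_outcome(**inputs) == expected
```

Adding a new business rule then means adding a row to the table, not writing a new test.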
A Practical Example
Consider a hypothetical home insurance quote form. Applying this architecture would look like:
| Layer | EPA (existing) | Home Insurance (new) |
|---|---|---|
| Components | F058VehicleLookupCar | F040PropertyLookup |
| | F066VehiclePurchaseDetails | F041PropertyDetails |
| Steps | S01_entry_page (vehicle reg) | S01_entry_page (postcode lookup) |
| | S04_policyholder_details | S03_policyholder_details |
| Tests | test_AFC024_quote_outcomes | test_HI024_quote_outcomes |
| Parametrised rules | quoted + £1,000 → callback | standard + £500 → buy |
| Docs | AFC-024.md | HI-024.md |
The component and step classes are new — they reflect the new product’s UI. But the test structure, documentation tooling, CI pipeline, and report infrastructure are identical. The investment made in
the EPA suite pays forward into every subsequent product.
Summary
The EPA E2E test suite provides structured, automated coverage of all customer-facing quote journeys across Adrian Flux, Sterling Insurance, and Bikesure. With 169 documented test cases, automatic
evidence capture, and flexible execution options, the suite gives stakeholders confidence that quote forms are functioning correctly and that customers are reaching the right outcomes.
3 - Viitata Tenancy Infrastructure
Migration from single-tenant to multi-tenant architecture
Executive Summary
This memo documents the strategic architecture migration for Viitata from a single-tenant-per-instance model to a multi-tenant architecture on Heroku. This migration addresses critical operational inefficiencies, enables deployment of the new Viitata version with its required worker architecture, and significantly reduces both current costs and the cost of scaling while eliminating DevOps friction for client onboarding.
Key Changes:
- Architecture: Single-tenant-per-instance → Multi-tenant shared infrastructure
- Platform: Heroku (no change)
- Application Version: Current (single worker) → New version (3 workers required)
- Cost Impact: $96/month currently → $288/month if upgraded on single-tenant → $130/month on multi-tenant
- Cost Savings: 55% reduction vs. deploying new version on single-tenant architecture
Introduction
Purpose
This document outlines the rationale, technical approach, and benefits of migrating Viitata from a distributed single-tenant-per-instance model to a consolidated multi-tenant architecture on Heroku.
Scope
This memo covers:
- Current single-tenant-per-instance architecture on Heroku
- New Viitata version requirements (3-worker architecture)
- Proposed multi-tenant architecture on Heroku
- Cost analysis and operational benefits
- Technical considerations and trade-offs
Out of scope:
- Detailed application code changes for multi-tenancy
- Specific Heroku configuration details
- Data migration procedures and implementation timeline
Audience
This document is intended for technical leadership, DevOps engineers, and stakeholders involved in infrastructure planning and decision-making.
Background
Current State: Single-Tenant-Per-Instance on Heroku
Viitata currently operates with a single-tenant-per-instance model on Heroku, consisting of:
Infrastructure per Tenant:
- 1 Heroku application instance (single worker)
- 1 PostgreSQL database
- 1 Redis cache instance
- Cost: ~$16/month per tenant
Current Deployment:
- 6 production instances running current Viitata version
- Each instance operates as a single-worker application
- Total monthly cost: ~$96 (6 instances × $16)
- Each instance requires independent CI/CD pipeline
- Each instance requires separate DevOps configuration
Important Note: The current architecture runs an older version of Viitata that does not require the 3-worker architecture. However, the new version of Viitata cannot be deployed without this infrastructure change.
Challenges with Current Architecture
1. Cost Scalability Concerns
With 6 production tenants at $16/month each, the current architecture costs approximately $96/month. While manageable at this scale, the cost scales linearly with each new tenant ($16 per additional tenant). More critically, the new version of Viitata requires a 3-worker architecture that would triple costs to approximately $288/month for the same 6 tenants.
2. DevOps Friction
Each new client onboarding requires:
- Provisioning new Heroku application
- Configuring new PostgreSQL database
- Setting up new Redis cache
- Configuring CI/CD pipeline
- Managing environment variables and secrets
- Setting up monitoring and logging
This creates substantial friction and delays in client onboarding.
3. CI/CD Maintenance Overhead
Maintaining 6 separate CI/CD pipelines creates:
- Increased complexity in deployment processes
- Higher risk of configuration drift
- Difficulty in applying updates uniformly
- Additional testing burden across instances
4. Blocking Issue: New Viitata Version Requirements
The new version of Viitata fundamentally requires three distinct worker types to function:
- Web worker: Handles HTTP requests
- Celery worker: Processes asynchronous tasks
- Celery Beat worker: Manages scheduled tasks and periodic jobs
This is not optional - the new Viitata version cannot be deployed without all three workers running.
Under the single-tenant model, deploying the new version would require:
- 18 total worker processes (6 instances × 3 workers)
- Tripling of infrastructure costs per tenant (from $16 to ~$48 per tenant)
- Total monthly cost increase from $96 to approximately $288/month
- 18 separate processes to monitor and manage
Critical Impact: The single-tenant architecture makes it economically and operationally prohibitive to deploy the new version of Viitata. Without migrating to multi-tenant, the platform cannot evolve.
Technical Analysis
Proposed Architecture: Multi-Tenant on Heroku
The new architecture consolidates all tenants into a single shared Heroku infrastructure:
Shared Infrastructure:
- 1 Heroku application (supporting 3 worker types)
- 1 Heroku PostgreSQL database (with tenant isolation)
- 1 Heroku Redis cache (with tenant namespacing)
- Estimated cost: ~$130/month total
Worker Configuration:
- 1 web worker (serving all tenants)
- 1 Celery worker (processing tasks for all tenants)
- 1 Celery Beat worker (managing schedules for all tenants)
- Total: 3 workers supporting all tenants
Cost Analysis
| Architecture Model | Viitata Version | Tenants | Workers | Monthly Cost | Cost per Tenant |
|---|---|---|---|---|---|
| Current (Single-Tenant) | Old | 6 | 6 (1 per instance) | $96 | $16.00 |
| Single-Tenant Upgraded | New | 6 | 18 (3 per instance) | $288 | $48.00 |
| Multi-Tenant (Proposed) | New | 6 | 3 (shared) | $130 | $21.67 |
| Savings vs. Upgraded | - | - | -83% | -$158/month | -55% |
Key Insights:
- Current architecture cannot run the new Viitata version without significant cost increase
- New version’s 3-worker requirement would triple single-tenant costs ($96 → $288)
- Multi-tenant architecture enables new version deployment at 55% lower cost than single-tenant upgrade
- Marginal cost advantage: Adding tenant #7 costs $0/month (vs. $48/month in single-tenant)
- Cost efficiency improves with scale: 10 tenants = $13/tenant, 20 tenants = $6.50/tenant
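The per-tenant figures above follow from the flat shared-infrastructure cost, which can be checked with a one-line calculation:

```python
def cost_per_tenant(monthly_total: float, tenants: int) -> float:
    """Shared-infrastructure cost divided across tenants. The total stays
    flat as tenants are added, until a scaling threshold is reached."""
    return round(monthly_total / tenants, 2)

# Figures from the memo: ~$130/month shared multi-tenant infrastructure.
assert cost_per_tenant(130, 6) == 21.67   # current 6 tenants
assert cost_per_tenant(130, 10) == 13.0   # at 10 tenants
assert cost_per_tenant(130, 20) == 6.5    # at 20 tenants
```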
Benefits
1. Cost Reduction
- 55% reduction in infrastructure costs vs. the single-tenant upgrade path ($288/month → $130/month)
- Costs remain flat as tenant count grows (until scaling threshold)
- Predictable cost model
2. Operational Efficiency
- Single CI/CD pipeline for all tenants
- Unified deployment process
- Consistent configuration across all tenants
- Reduced maintenance overhead
3. Client Onboarding
- Near-instant tenant provisioning (database record vs. full infrastructure)
- Minimal DevOps involvement
- Faster time-to-value for new clients
- Reduced onboarding friction
4. Enables New Viitata Version Deployment
- Supports required 3-worker architecture (web, Celery, Celery Beat)
- 3 shared workers support all tenants (vs. 18 separate workers in single-tenant)
- Makes new version economically viable to deploy
- Simplified monitoring and management
- Better resource utilization
- Easier to scale horizontally when needed
Technical Considerations
Data Isolation
- Tenant identification at application layer
- Row-level security in PostgreSQL
- Redis key namespacing by tenant ID
- Careful query design to prevent data leakage
Performance
- Shared resources require proper resource allocation
- Connection pooling for database efficiency
- Caching strategies to prevent tenant interference
- Monitoring to identify tenant-specific performance issues
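The Redis key-namespacing point above can be sketched as a thin wrapper around the cache client. This is a minimal illustration, not the Viitata implementation: a plain dict stands in for the Redis client, and the `tenant:<id>:` prefix scheme is an assumption:

```python
import json

class TenantCache:
    """Prefix every cache key with the tenant ID so tenants can never
    read each other's entries."""
    def __init__(self, client, tenant_id: str):
        self.client = client
        self.prefix = f"tenant:{tenant_id}:"

    def set(self, key: str, value) -> None:
        self.client[self.prefix + key] = json.dumps(value)

    def get(self, key: str):
        raw = self.client.get(self.prefix + key)
        return json.loads(raw) if raw is not None else None

store = {}  # stands in for redis.Redis in this sketch
a = TenantCache(store, "acme")
b = TenantCache(store, "beta")
a.set("session:42", {"user": "alice"})
assert b.get("session:42") is None               # no cross-tenant reads
assert a.get("session:42") == {"user": "alice"}  # own data still visible
```

Centralising the prefix in one wrapper is what makes "careful query design" enforceable rather than a per-call-site convention.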
Security
- Tenant isolation at application and data layers
- Secure tenant context management
- Audit logging for compliance
- Regular security reviews of multi-tenant code paths
Scalability
- Horizontal scaling when single instance reaches capacity
- Database sharding if needed for very large tenant counts
- CDN and edge caching for static assets
- Load balancing across multiple application instances
Trade-offs
Advantages
- Dramatic cost reduction
- Simplified operations
- Faster client onboarding
- Better resource utilization
- Easier maintenance and updates
Disadvantages
- Tenant isolation complexity in application code
- Potential “noisy neighbor” issues
- Database restore impact: Currently, database snapshots can be restored per-tenant without affecting other clients. In multi-tenant architecture, a database restore would affect all tenants simultaneously, making it impossible to roll back a single client’s data due to a bug or data issue
- More complex deployment rollback scenarios
- Requires careful tenant-aware code design
- Less isolation between tenants compared to separate instances
Risk Mitigation
- Comprehensive testing of tenant isolation
- Resource limits per tenant
- Monitoring and alerting for anomalies
- Gradual migration approach
- Ability to isolate problematic tenants if needed
- Database restore mitigation:
- Implement application-level point-in-time recovery per tenant
- Maintain granular database backups with tenant-specific restore capabilities
- Use transaction logs to selectively restore tenant data
- Establish procedures for tenant-specific data rollback without full database restore
- More rigorous testing and staging processes to prevent production data issues
- Consider automated daily tenant-level logical backups (pg_dump per tenant)
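The per-tenant logical backup idea could be automated along these lines. Note the hedge built into the sketch: it assumes a schema-per-tenant layout (schema named `tenant_<id>`), which is one possible design; with purely row-level tenancy a per-table `COPY ... WHERE tenant_id = ...` export would be needed instead:

```python
import shlex

def tenant_backup_command(database: str, tenant_id: str) -> list[str]:
    """Build a pg_dump invocation for one tenant's logical backup.

    Assumes a schema-per-tenant layout; adjust for row-level tenancy.
    """
    return [
        "pg_dump",
        "--format=custom",               # compressed, restorable with pg_restore
        f"--schema=tenant_{tenant_id}",  # limit the dump to one tenant's schema
        f"--file=tenant_{tenant_id}.dump",
        database,
    ]

cmd = tenant_backup_command("viitata", "acme")
print(shlex.join(cmd))
```

A daily job iterating over tenant IDs would then give each client an independently restorable snapshot without a full-database restore.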
Conclusion
The migration from a single-tenant-per-instance architecture to a multi-tenant architecture on Heroku represents a strategic necessity for Viitata’s evolution. This change delivers:
- Deployment of the new Viitata version with its required 3-worker architecture
- 55% cost reduction vs. deploying the new version on single-tenant ($288/month → $130/month)
- Dramatic reduction in operational complexity (6 CI/CD pipelines → 1, 18 workers → 3)
- Near-zero marginal cost for new tenants ($0 vs. $48/tenant in single-tenant)
- Improved cost efficiency at scale: cost per tenant decreases as the platform grows
- Elimination of DevOps friction in client onboarding
Without this migration, deploying the new version of Viitata would nearly triple costs while adding significant operational burden. The multi-tenant architecture not only makes the new version economically viable but also positions Viitata for sustainable growth with costs that improve with scale.
While multi-tenancy introduces complexity in application design around tenant isolation and data security, the alternative—remaining on single-tenant architecture—would either block the platform’s evolution or make it financially unsustainable. The operational benefits, cost savings, and improved scalability make this migration essential for Viitata’s future.
4 - Heroku to AWS Migration
Migration from Heroku to AWS for improved compliance, cost, and control
Executive Summary
This memo documents the strategic platform migration for Viitata from Heroku to AWS (Amazon Web Services). This migration addresses critical compliance requirements around UK data residency, reduces infrastructure costs, provides greater operational flexibility and control, and enables better performance and integration with additional AWS services.
Key Changes:
- Platform: Heroku → AWS ECS (Elastic Container Service)
- Region: EU-West-1 (Ireland) → EU-West-2 (London, UK)
- Primary Driver: Compliance - UK data residency and backup retention
- Additional Benefits: Cost reduction, greater control, performance improvements, AWS service ecosystem
Critical Compliance Issue:
Currently on Heroku, while the primary database is in EU-West-1 (Ireland), database backups are retained in the USA. This creates compliance risks for UK data residency requirements. AWS enables full infrastructure and data containment within EU-West-2 (London).
Introduction
Purpose
This document outlines the rationale, technical approach, and benefits of migrating Viitata’s multi-tenant infrastructure from Heroku to AWS, with a focus on achieving UK data sovereignty and compliance requirements while improving operational capabilities.
Scope
This memo covers:
- Current multi-tenant architecture on Heroku
- Compliance and data residency challenges
- Proposed multi-tenant architecture on AWS ECS
- Cost analysis and operational benefits
- Technical considerations and trade-offs
Out of scope:
- Detailed AWS infrastructure-as-code configurations
- Specific containerization implementation details
- Data migration procedures and implementation timeline
- Application code changes required for AWS
Audience
This document is intended for technical leadership, compliance officers, DevOps engineers, and stakeholders involved in infrastructure planning and decision-making.
Background
Current State: Multi-Tenant on Heroku
Viitata currently operates with a multi-tenant architecture on Heroku, consisting of:
Infrastructure:
- 1 Heroku application (3 dynos: web, Celery worker, Celery Beat)
- 1 Heroku PostgreSQL database in EU-West-1 (Ireland)
- 1 Heroku Redis cache
- Current cost: ~$130/month
Current Deployment:
- Multi-tenant architecture supporting 6 production tenants
- Single CI/CD pipeline
- Heroku-managed infrastructure and scaling
- Automatic SSL, DNS, and platform maintenance
Challenges with Current Architecture
1. Compliance and Data Residency (Primary Driver)
Database Backup Location:
- Primary database: EU-West-1 (Ireland, EU)
- Database backups: Stored in USA (Heroku’s backup infrastructure)
This creates significant compliance risks:
- UK data residency requirements cannot be met
- Backup data crosses international boundaries
- Potential violations of data protection regulations
- Risk for clients requiring UK-only data storage
- Audit and compliance reporting challenges
Regional Limitation:
- Application and database in Ireland (EU-West-1), not UK
- No option for UK-specific region on Heroku
- Cannot guarantee UK data sovereignty
2. Cost Considerations
While Heroku provides managed services, the cost includes:
- Premium for managed platform (~30-40% over raw compute)
- Limited ability to optimize resource allocation
- Dyno pricing model less flexible than AWS instance types
- Add-on costs (PostgreSQL, Redis) with limited customization
3. Limited Control and Flexibility
Infrastructure Control:
- Cannot customize underlying OS or runtime environment
- Limited control over networking and security groups
- Restricted access to infrastructure-level monitoring
- Cannot implement custom security controls
Resource Optimization:
- Fixed dyno sizes with limited granularity
- Cannot right-size resources for specific workloads
- Limited ability to use spot instances or reserved capacity
- Cannot separate worker resources by type
Heroku Limitations:
- Shared infrastructure with potential noisy neighbor issues
- Limited database connection pooling options
- Router timeout constraints (30 seconds)
- Limited control over caching layers
- Cannot implement custom CDN configurations
4. AWS Service Integration
Current limitations for integrating with AWS services:
- External network calls to AWS services (S3, SES, etc.)
- Additional latency for AWS service integration
- Cannot use VPC peering or private networking
- Limited IAM role-based security
- Cannot leverage AWS-native monitoring and logging
Technical Analysis
Proposed Architecture: Multi-Tenant on AWS ECS
The new architecture migrates the multi-tenant application to AWS infrastructure with a fully containerized, role-based security model.
Architecture Overview
The following diagram illustrates the proposed AWS architecture:
graph TB
subgraph Internet
Users[Users/Clients]
end
subgraph "AWS EU-West-2 (London)"
subgraph "VPC"
subgraph "Public Subnets"
ALB[Application Load Balancer<br/>HTTPS:443]
NAT[NAT Gateway]
end
subgraph "Private Subnets"
subgraph "ECS Fargate Cluster"
Web[Web Tasks<br/>nginx + gunicorn<br/>Auto-scaling]
Worker[Celery Worker Tasks<br/>Auto-scaling]
Beat[Celery Beat Task<br/>Single instance]
end
RDS[(RDS PostgreSQL<br/>Multi-AZ<br/>Automated Backups)]
Cache[(ElastiCache Valkey<br/>Redis-compatible<br/>Cache & Results)]
end
end
SQS[Amazon SQS<br/>Celery Message Broker<br/>Task Queue]
S3[S3 Bucket<br/>Media Storage]
CW[CloudWatch<br/>Logs & Metrics]
SM[Secrets Manager<br/>Credentials]
end
Users -->|HTTPS| ALB
ALB -->|Routes traffic| Web
Web -->|IAM Role| S3
Web -->|Read/Write| RDS
Web -->|Cache/Sessions| Cache
Web -->|Send tasks| SQS
Web -->|Logs| CW
Web -->|Get secrets| SM
Worker -->|IAM Role| S3
Worker -->|Read/Write| RDS
Worker -->|Receive/Delete tasks| SQS
Worker -->|Store results| Cache
Worker -->|Logs| CW
Worker -->|Get secrets| SM
Beat -->|Send scheduled tasks| SQS
Beat -->|Read/Write| RDS
Beat -->|Logs| CW
Web -.->|Outbound via| NAT
Worker -.->|Outbound via| NAT
Beat -.->|Outbound via| NAT
classDef public fill:#e1f5ff,stroke:#01579b,stroke-width:2px
classDef private fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
classDef data fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef compute fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
classDef aws fill:#fff9c4,stroke:#f57f17,stroke-width:2px
class ALB,NAT public
class Web,Worker,Beat compute
class RDS,Cache data
class S3,CW,SM,SQS aws
Core Infrastructure Components:
Database Layer
- Amazon RDS for PostgreSQL in EU-West-2 (London)
- Multi-AZ deployment for high availability
- Automated backups retained in EU-West-2
- Point-in-time recovery capabilities
- All tenant data with row-level isolation
Cache Layer
- Amazon ElastiCache with Valkey (Redis-compatible) in EU-West-2
- Used for session storage, application caching, and Celery result backend
- Tenant-namespaced keys for data isolation
- High-performance in-memory data store
Message Queue Layer
- Amazon SQS (Simple Queue Service) in EU-West-2
- Celery message broker for task distribution
- Fully managed, serverless message queue
- No infrastructure to maintain or scale
- Automatic message retention and delivery
- Dead letter queue for failed tasks
- FIFO queues for task ordering if needed
- Cost-effective: Pay only for messages processed
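Wiring Celery to SQS and Valkey is mostly configuration. The snippet below is a hypothetical settings fragment, not the production config: the ElastiCache endpoint is a placeholder, and the timeout/polling values are illustrative. Leaving credentials out of the broker URL lets Celery's SQS transport fall back to the ambient IAM task role, consistent with the security model described later:

```python
# Hypothetical Celery settings for the SQS broker and Valkey
# result backend.
broker_url = "sqs://"  # no embedded credentials: use the ECS task IAM role
broker_transport_options = {
    "region": "eu-west-2",       # keep the queue in the London region
    "visibility_timeout": 3600,  # seconds before an unacked task reappears
    "polling_interval": 1,       # long-poll SQS roughly once per second
}
# Valkey is Redis-compatible, so the standard redis:// scheme applies.
result_backend = "redis://<elasticache-endpoint>:6379/0"
```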
Compute Layer - ECS Fargate
Three separate ECS task definitions running on Fargate:
Task Definition 1: Web Application
- Container: nginx + gunicorn
- Receives traffic from Application Load Balancer (ALB)
- Handles HTTPS requests routed by ALB
- Auto-scaling based on CPU/memory and request count
- Appropriate resources: ~0.5-1 vCPU, 1-2GB memory
- Multiple tasks for high availability and load distribution
- ALB distributes traffic across all healthy web tasks
Task Definition 2: Celery Worker
- Container: Celery worker process
- Consumes tasks from Amazon SQS queue
- Auto-scaling based on SQS queue depth (ApproximateNumberOfMessagesVisible) and CPU utilization
- Right-sized resources: ~0.25-0.5 vCPU, 0.5-1GB memory
- Can scale independently based on task backlog in SQS
Task Definition 3: Celery Beat
- Container: Celery Beat scheduler
- Manages periodic and scheduled tasks
- Publishes scheduled tasks to SQS queue
- Fixed scaling: Single task (Beat requires single instance)
- Minimal resources: ~0.25 vCPU, 0.5GB memory
- Auto-restart on failure
Rationale for Separate Task Definitions:
- Each workload has different resource requirements
- Independent scaling policies per service type
- Web scales with traffic, workers scale with queue depth
- Cost optimization: Right-size each workload separately
- Isolation: Issues in one service don’t affect others
Auto-Scaling Configuration:
ECS provides automatic scaling that adjusts the number of running tasks based on demand, with both scale-up and scale-down capabilities:
Web Tasks Auto-Scaling:
- Metrics: CPU utilization, memory utilization, ALB request count per target
- Scale-up triggers:
- CPU > 70% for 2 minutes → Add tasks
- Requests per task > 1000/min → Add tasks
- Scale-down triggers:
- CPU < 30% for 5 minutes → Remove tasks
- Requests per task < 200/min → Remove tasks
- Min/Max tasks: 2 minimum (HA), 10 maximum
- Benefits: Handles traffic spikes from multiple tenants, scales down during low usage to save costs
Celery Worker Auto-Scaling:
- Metrics: CPU utilization, SQS ApproximateNumberOfMessagesVisible (native CloudWatch metric)
- Scale-up triggers:
- SQS queue depth > 100 messages → Add workers
- CPU > 80% for 3 minutes → Add workers
- Scale-down triggers:
- SQS queue depth < 10 messages for 10 minutes → Remove workers
- CPU < 20% for 10 minutes → Remove workers
- Min/Max tasks: 1 minimum, 5 maximum
- Benefits: SQS provides native queue metrics for accurate scaling decisions; efficiently processes task backlog, reduces to minimum during idle periods
Celery Beat Scaling:
- Fixed at 1 task (Beat scheduler requires single instance)
- Auto-restart on failure for reliability
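The worker scaling rules above amount to a simple step function. The sketch below is a toy model of the policy (in reality CloudWatch alarms and ECS Application Auto Scaling evaluate these thresholds, and the sustained-duration checks are elided here):

```python
def worker_scaling_decision(queue_depth: int, current: int,
                            minimum: int = 1, maximum: int = 5) -> int:
    """Add a worker when the SQS backlog exceeds 100 messages; remove
    one when it falls below 10; otherwise hold steady."""
    if queue_depth > 100:
        return min(current + 1, maximum)
    if queue_depth < 10:
        return max(current - 1, minimum)
    return current

assert worker_scaling_decision(250, current=2) == 3   # backlog: scale up
assert worker_scaling_decision(5, current=2) == 1     # idle: scale down
assert worker_scaling_decision(50, current=2) == 2    # steady state
assert worker_scaling_decision(500, current=5) == 5   # capped at maximum
```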
Multi-Tenant Scaling Benefits:
Auto-scaling is particularly valuable for multi-tenant architecture:
- Unpredictable tenant activity: Different tenants have different usage patterns and peak times
- Cost efficiency: Automatically scales down during low-usage periods (nights, weekends)
- Spike handling: Automatically scales up when multiple tenants become active simultaneously
- Resource optimization: Pays only for resources actually needed at any given time
- Example scenario:
- During business hours (9am-5pm): 6-8 web tasks handle peak multi-tenant load
- During nights (11pm-6am): Scales down to 2 web tasks, saving ~$40-60/month
- Weekend spikes: Auto-scales up to handle unexpected tenant activity
Comparison to Heroku:
| Scaling Feature | Heroku | AWS ECS |
|---|---|---|
| Scale-up | Manual or via add-ons | Automatic based on metrics |
| Scale-down | Manual only | Automatic (saves costs) |
| Scaling metrics | Limited (response time, throughput) | Extensive (CPU, memory, custom CloudWatch metrics, ALB metrics, queue depth) |
| Per-service scaling | Requires multiple apps | Built-in per task definition |
| Cost during low usage | Fixed (pays for min dynos) | Dynamic (scales to minimum) |
| Multi-tenant optimization | Limited | Excellent - handles variable tenant load patterns |
Cost Impact:
- Scale-down capability can reduce compute costs by 40-50% during off-peak hours
- For multi-tenant with variable load, average monthly compute cost drops significantly
- Example: Instead of running 6 web tasks 24/7, average 4 tasks/hour = 33% cost reduction
Networking and Load Balancing
- Application Load Balancer (ALB) as entry point for all web traffic
- Sits in public subnets
- Terminates HTTPS/SSL connections
- Routes traffic to web task definition only
- Health checks on web tasks
- Automatically distributes load across multiple web task instances
- VPC with public and private subnets across multiple Availability Zones
- Private subnets for ECS tasks, RDS, and ElastiCache (no direct internet access)
- Public subnets for ALB only
- NAT Gateway for outbound internet access from private subnets
- Security groups for service-level network isolation
- ALB security group: Allow inbound 443 from internet
- Web task security group: Allow inbound from ALB only
- Worker task security groups: No inbound internet traffic
- RDS/ElastiCache security groups: Allow access from ECS tasks only
Security Model: Role-Based Authentication
Shift from IAM Users to IAM Roles:
- Current (Heroku): IAM user credentials stored as environment variables for AWS service access (S3, SES, etc.)
- Proposed (AWS): ECS task IAM roles with least-privilege permissions
IAM Role-Based Security Benefits:
- No stored credentials: Tasks assume roles automatically via ECS Task Role
- Dramatically reduced credential leakage risk: No long-lived access keys in environment variables or code
- Automatic credential rotation: AWS STS provides temporary credentials (auto-expire and rotate)
- Least-privilege access: Each task definition gets only required permissions
- Audit trail: CloudTrail logs all role assumption and service access
- Infrastructure-as-code: IAM roles defined in CloudFormation templates
- Centralized security: All permissions defined and version-controlled
Example Security Architecture:
- Web task role: Read S3 (media), write CloudWatch Logs
- Celery worker task role: Read/Write S3, SES send email, SQS receive/delete messages, CloudWatch Logs
- Celery Beat task role: SQS send messages, CloudWatch Logs only
- RDS access: PostgreSQL username/password from Secrets Manager (accessed via IAM role)
- SQS access: Fully controlled via IAM roles (no credentials needed)
- No IAM user access keys anywhere in the system
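As an illustration of the least-privilege principle, the Celery Beat role above could be expressed as a policy like this. The statement structure follows the standard IAM policy format, but the queue name, log group, and account ID are placeholders invented for the example:

```python
import json

# Hypothetical least-privilege policy for the Celery Beat task role:
# it may only enqueue scheduled tasks and write its own logs.
beat_task_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sqs:SendMessage", "sqs:GetQueueUrl"],
            "Resource": "arn:aws:sqs:eu-west-2:ACCOUNT_ID:viitata-tasks",
        },
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "arn:aws:logs:eu-west-2:ACCOUNT_ID:log-group:/ecs/viitata-beat:*",
        },
    ],
}

print(json.dumps(beat_task_policy, indent=2))
```

Note what is absent: no S3 access, no SQS receive/delete, no database permissions. Each task definition's role is scoped to exactly the services that service touches.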
Additional AWS Services
- Amazon SQS for Celery message broker (fully managed queue)
- CloudWatch Logs for centralized logging
- CloudWatch Metrics for monitoring and alerting (includes native SQS metrics)
- AWS Secrets Manager for database credentials and API keys
- S3 for media file storage (UK region)
- CloudFormation for infrastructure-as-code deployment
- ECR (Elastic Container Registry) for Docker image storage
Region and Compliance:
- All resources in EU-West-2 (London, UK)
- No data transfer outside UK jurisdiction
- Estimated cost: ~$90-115/month (12-30% savings, includes SQS)
Connectivity Flow
Internet (HTTPS:443)
↓
Application Load Balancer (Public Subnet)
↓ (Routes to web tasks only)
ECS Web Tasks (nginx + gunicorn) (Private Subnet)
↓
├─→ RDS PostgreSQL (Private Subnet)
├─→ ElastiCache Valkey (Private Subnet - cache/sessions)
├─→ Amazon SQS (Send tasks via IAM role)
└─→ S3 (via IAM role)
ECS Celery Workers (Private Subnet)
↓
├─→ RDS PostgreSQL (Private Subnet)
├─→ ElastiCache Valkey (Private Subnet - result backend)
├─→ Amazon SQS (Receive/Delete tasks via IAM role)
└─→ S3 (via IAM role)
ECS Celery Beat (Private Subnet)
↓
├─→ RDS PostgreSQL (Private Subnet)
└─→ Amazon SQS (Send scheduled tasks via IAM role)
Key Points:
- Only web tasks receive traffic from ALB - Workers have no inbound internet traffic
- All three task definitions connect to RDS
- Celery uses Amazon SQS as message broker - fully managed, serverless queue
- ElastiCache Valkey used for caching, sessions, and Celery result backend
- All ECS tasks connect to AWS services (S3, SQS, CloudWatch) via IAM roles
- No stored credentials required anywhere
- SQS provides native CloudWatch metrics for auto-scaling
Infrastructure-as-Code
The entire architecture is defined in CloudFormation templates:
- VPC, subnets, route tables, security groups
- RDS database configuration
- ElastiCache cluster configuration
- SQS queues (standard and dead letter queues)
- ECS cluster, task definitions, and services
- Application Load Balancer and target groups
- IAM roles and policies
- CloudWatch alarms and dashboards
Benefits:
- Version-controlled infrastructure
- Reproducible environments (staging = production)
- No manual configuration or drift
- Peer-reviewed infrastructure changes via Git
- Disaster recovery: Rebuild from templates
CI/CD Pipeline
GitHub Actions Workflow:
The deployment pipeline is automated via GitHub Actions, providing consistent and reliable deployments:
graph LR
A[Code Push to GitHub] --> B[GitHub Actions Triggered]
B --> C[Build Docker Image]
C --> D[Push to ECR]
D --> E[Update ECS Task Definitions]
E --> F[Deploy Web Tasks]
E --> G[Deploy Celery Workers]
E --> H[Deploy Celery Beat]
style A fill:#e8f5e9
style B fill:#fff3e0
style C fill:#e1f5ff
style D fill:#f3e5f5
style E fill:#fff9c4
style F fill:#e8f5e9
style G fill:#e8f5e9
style H fill:#e8f5e9
Deployment Process:
- Trigger: Code pushed to main branch or pull request merged
- Build: GitHub Actions workflow executes
- Builds single Docker image containing application code
- Runs tests (optional: can block deployment on failure)
- Tags image with commit SHA and/or semantic version
- Push to ECR: Docker image pushed to Elastic Container Registry in EU-West-2
- ECR provides secure, private Docker registry
- Images stored in same region as deployment (UK)
- Automatic image scanning for vulnerabilities (optional)
- Update Task Definitions: GitHub Actions updates ECS task definitions
- All three task definitions reference the same Docker image
- Only the container command/entrypoint differs per service:
  - Web: gunicorn command
  - Celery Worker: celery worker command
  - Celery Beat: celery beat command
- Deploy: ECS performs rolling updates
- Web tasks: Rolling deployment with health checks via ALB
- Celery Workers: Rolling update, new tasks pick up from queue
- Celery Beat: Stop old task, start new task (single instance)
Single Image, Multiple Services:
All three ECS task definitions use the same Docker image from ECR. The service type is determined by the command executed:
```yaml
# Example task definition differences
Web Task Definition:
  Image: <ECR_URI>:latest
  Command: ["gunicorn", "app.wsgi:application"]

Celery Worker Task Definition:
  Image: <ECR_URI>:latest   # Same image!
  Command: ["celery", "-A", "app", "worker"]

Celery Beat Task Definition:
  Image: <ECR_URI>:latest   # Same image!
  Command: ["celery", "-A", "app", "beat"]
```
Benefits:
- Single build: One Docker image for all services (faster builds)
- Consistency: All services run identical application code
- Simplified versioning: Single image tag tracks deployment
- Reduced storage: ECR stores one image instead of three
- Atomic deployments: All services deployed from same code version
Comparison to Heroku:
| Aspect | Heroku | AWS ECS |
|---|---|---|
| Deployment trigger | git push heroku | GitHub Actions workflow |
| Build process | Heroku buildpacks | Docker image build |
| Artifact storage | Heroku slug storage | ECR (version-controlled) |
| Deployment control | Limited (auto-deploy) | Full control (approval gates, rollback) |
| Multi-service | Separate apps or Procfile | Task definitions with same image |
| Rollback | heroku releases:rollback | ECS task definition revision or redeploy previous image tag |
Additional CI/CD Capabilities:
- Environment-specific deployments: Separate workflows for staging and production
- Approval gates: Require manual approval before production deployment
- Automated testing: Run integration tests against staging before production
- Blue-green deployments: Deploy new version alongside old, switch traffic
- Canary deployments: Gradually shift traffic to new version
- Automated rollback: Detect failures via CloudWatch alarms and auto-rollback
Cost Analysis
| Component | Heroku (Current) | AWS (Proposed) | Notes |
|---|---|---|---|
| Web Application | ~$25-50 | ~$20-35 | ECS Fargate or EC2 instances |
| Celery Worker | ~$25-50 | ~$20-35 | Right-sized for workload |
| Celery Beat | ~$25-50 | ~$10-15 | Smaller instance for scheduler |
| PostgreSQL | ~$15-20 | ~$20-25 | RDS with backups in UK |
| Redis Cache | ~$15-20 | ~$10-15 | ElastiCache (cache + result backend) |
| SQS Message Queue | Included in dyno | ~$1-3 | Pay per million requests, negligible cost |
| Load Balancer | Included | ~$15-20 | ALB costs |
| Total | ~$130/month | ~$90-110/month | ~15-30% savings |
Cost Optimization Opportunities:
- Auto-scaling with scale-down: Automatically reduce running tasks during low-usage periods (40-50% compute savings during off-peak)
- Reserved instances for baseline workloads (up to 50% additional savings)
- Spot instances for Celery workers (up to 70% savings on compute)
- S3 storage tiers for media files
- CloudWatch log retention policies
- Right-sizing based on actual usage patterns
Multi-Tenant Auto-Scaling Impact:
The auto-scaling capability is particularly valuable for multi-tenant architecture where tenant usage patterns vary throughout the day. Instead of paying for peak capacity 24/7 (as with Heroku), ECS automatically scales down during low-usage periods, significantly reducing average compute costs.
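As an illustration of this scale-down behavior, the desired worker count can be derived from queue depth. The sketch below is an assumption-laden model of the scaling math, not AWS code: the backlog-per-worker target and the min/max bounds are tuning parameters chosen for the example.

```python
import math


def desired_worker_count(visible_messages: int,
                         backlog_per_worker: int = 100,
                         min_workers: int = 1,
                         max_workers: int = 10) -> int:
    """Translate SQS queue depth into an ECS worker task count.

    visible_messages: the CloudWatch ApproximateNumberOfMessagesVisible value.
    backlog_per_worker: target messages each worker should absorb
    (a tuning assumption, not an AWS default).
    """
    if visible_messages <= 0:
        return min_workers  # idle queue: scale down to the floor
    needed = math.ceil(visible_messages / backlog_per_worker)
    return max(min_workers, min(max_workers, needed))
```

During a quiet overnight period the queue is empty and the cluster runs the single-task floor; a burst of tenant activity raises the backlog and the count climbs toward the ceiling, which is the behavior a target-tracking scaling policy automates.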
Important Note: Cost savings are secondary to compliance requirements. Even if costs were equivalent, the migration would be necessary for UK data residency.
Benefits
1. Compliance and Data Sovereignty (Primary Benefit)
- UK data residency: All infrastructure and data in EU-West-2 (London)
- Backup compliance: Database backups remain in UK
- Audit trail: Full control and visibility over data location
- Regulatory compliance: Meets UK data protection requirements
- Client confidence: Can guarantee UK-only data storage
- Reduced legal risk: Eliminates cross-border data transfer concerns
2. Cost Efficiency
- 15-30% immediate cost reduction (~$130 → ~$90-110/month)
- Automatic scale-down during low usage: 40-50% additional compute savings during off-peak hours
- Multi-tenant load optimization: Auto-scaling handles variable tenant usage patterns efficiently
- Additional savings opportunities with reserved/spot instances
- More granular resource allocation (no paying for unused capacity)
- Flexible pricing models (on-demand, reserved, spot)
- Pay only for actual resource consumption, not fixed capacity
3. Greater Control and Flexibility
Infrastructure Control:
- Full control over container images and runtime environment
- Custom networking and security group configuration
- Direct access to infrastructure-level metrics
- Ability to implement custom security controls
- VPC configuration for network isolation
Operational Flexibility:
- Choose instance types optimized for workload
- Separate scaling policies per service
- Custom monitoring and alerting
- Advanced deployment strategies (blue/green, canary)
Infrastructure-as-Code:
- 100% version-controlled infrastructure using CloudFormation templates
- Entire infrastructure stack defined as code (VPC, ECS, RDS, ElastiCache, ALB, etc.)
- Git-based workflow for infrastructure changes (review, approve, deploy)
- Reproducible environments (staging matches production exactly)
- Disaster recovery: Rebuild entire infrastructure from templates
- Change tracking and audit trail for infrastructure modifications
- Team collaboration on infrastructure changes via pull requests
- No manual ClickOps or undocumented configuration drift
Heroku Limitation: On Heroku, infrastructure is configured via web UI or CLI commands that aren’t easily version-controlled. App configuration can be tracked, but the underlying platform infrastructure (databases, dynos, add-ons) requires manual provisioning and documentation.
4. AWS Performance Advantages
- Dedicated compute resources (no noisy neighbors)
- Auto-scaling for traffic spikes: Automatically adds capacity when multiple tenants become active
- Dynamic resource allocation: Scales up/down based on actual demand (CPU, memory, request count, queue depth)
- Advanced connection pooling (RDS Proxy)
- No 30-second timeout constraints
- Custom CDN configuration (CloudFront)
- Private networking between services (reduced latency)
- Better database performance tuning options
- Independent scaling per service type (web vs. workers)
5. AWS Service Ecosystem
Native Integration:
- Amazon SQS for Celery message broker: Fully managed, serverless queue with no infrastructure to maintain
- S3 for media and static file storage (same region)
- SES for email services
- Lambda for serverless functions
- CloudWatch for comprehensive monitoring (includes native SQS metrics for auto-scaling)
- AWS Secrets Manager for credential management
- IAM roles for secure, passwordless service access
- VPC endpoints for private AWS service access
SQS-Specific Benefits:
- Zero infrastructure management: No Redis/ElastiCache broker to maintain or scale
- Native CloudWatch metrics: ApproximateNumberOfMessagesVisible metric for accurate worker auto-scaling
- Reliability: Built-in message durability and delivery guarantees
- Dead letter queues: Automatic handling of failed tasks
- Cost-effective: Pay only for messages processed (~$0.40 per million requests)
- Unlimited scalability: No capacity planning required
- IAM-based security: No broker credentials to manage
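Tying the SQS metric to worker scaling can be sketched as a CloudFormation target-tracking policy like the one below. The resource names, queue name, and target value are illustrative assumptions, not values from the actual stack.

```yaml
WorkerScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: celery-worker-backlog-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref WorkerScalableTarget   # placeholder scalable target
    TargetTrackingScalingPolicyConfiguration:
      TargetValue: 100          # target backlog per task (tuning assumption)
      CustomizedMetricSpecification:
        MetricName: ApproximateNumberOfMessagesVisible
        Namespace: AWS/SQS
        Statistic: Average
        Dimensions:
          - Name: QueueName
            Value: celery-task-queue   # placeholder queue name
```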
Technical Considerations
Containerization
ECS Container Setup:
- Dockerize Django application (web, Celery, Beat)
- Use ECR (Elastic Container Registry) for image storage
- Multi-stage builds for optimized image sizes
- Health checks for container orchestration
- Environment-based configuration
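The health checks mentioned above need an endpoint for the ALB target group to probe. A minimal sketch, assuming a `/health` path convention (the real application would expose this as a Django view served by gunicorn):

```python
def app(environ, start_response):
    """Minimal WSGI application: /health returns 200 for ALB
    target-group checks; any other path returns 404."""
    if environ.get("PATH_INFO") == "/health":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

A deeper check might also ping the database and cache, so that ECS replaces tasks whose dependencies are unreachable rather than just tasks whose process has died.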
Database Migration
RDS PostgreSQL:
- Similar to Heroku Postgres (based on PostgreSQL)
- Enhanced monitoring and performance insights
- Automated backups with configurable retention
- Multi-AZ deployment for high availability
- Parameter groups for fine-tuned configuration
- RDS Proxy for connection pooling
Networking and Security
VPC Configuration:
- Private subnets for database and cache
- Public subnets for load balancer
- Security groups for service isolation
- NAT Gateway for outbound traffic from private subnets
- VPC endpoints for AWS services (S3, Secrets Manager)
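Service isolation via security groups can be expressed in CloudFormation roughly as follows; the resource names are placeholders for this sketch. The point is that the database accepts connections only from the ECS task security group, never from arbitrary addresses.

```yaml
DatabaseSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Allow PostgreSQL only from ECS tasks
    VpcId: !Ref AppVpc                 # placeholder VPC reference
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 5432
        ToPort: 5432
        SourceSecurityGroupId: !Ref EcsTaskSecurityGroup  # placeholder
```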
Security Enhancements:
- IAM roles instead of static credentials
- Secrets Manager for sensitive configuration
- AWS WAF (Web Application Firewall) for web protection
- CloudTrail for audit logging
- Encryption at rest and in transit
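In practice, Secrets Manager values are injected into containers at launch rather than baked into the image or stored as plain environment variables. A sketch of the relevant task-definition fragment (the secret ARN and names are placeholders):

```yaml
ContainerDefinitions:
  - Name: web
    Image: <ECR_URI>:latest
    Secrets:
      - Name: DATABASE_URL   # surfaced to the app as an env var
        ValueFrom: arn:aws:secretsmanager:eu-west-2:123456789012:secret:app/database-url
```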
Monitoring and Observability
CloudWatch Integration:
- Application logs aggregation
- Custom metrics and dashboards
- Alerting for critical events
- Performance monitoring
- Cost tracking and budgets
X-Ray (Optional):
- Distributed tracing
- Performance bottleneck identification
- Request flow visualization
High Availability
AWS HA Features:
- Multi-AZ RDS deployment
- ECS service auto-recovery
- Application Load Balancer health checks
- Automated backups and snapshots
- Cross-AZ redundancy
Trade-offs
Advantages
- UK data residency and compliance (eliminates critical blocker)
- Cost savings (15-30%)
- 100% infrastructure-as-code with CloudFormation (version control, reproducibility, no drift)
- Greater infrastructure control
- Better performance and scalability
- AWS service ecosystem integration
- More flexible resource allocation
- Enhanced security capabilities
- Better monitoring and observability
Disadvantages
- Increased operational responsibility (less managed than Heroku)
- Steeper learning curve for AWS services
- More complex infrastructure management
- Need for AWS expertise in team
- Infrastructure-as-code maintenance overhead
- More components to monitor and maintain
- Deployment complexity (ECS vs. git push heroku)
Risk Mitigation
- Training and documentation: Invest in AWS training for team
- Infrastructure-as-code: Use Terraform or CloudFormation for reproducibility
- Gradual migration: Test thoroughly in staging environment
- Monitoring from day one: Comprehensive CloudWatch setup before go-live
- AWS support plan: Consider AWS Business Support for expert guidance
- Runbooks and procedures: Document common operational tasks
- Disaster recovery testing: Regular restore testing from backups
Compliance Requirements
UK Data Residency
Requirements Met:
- All compute resources in EU-West-2 (London)
- Database and backups in EU-West-2 (London)
- Redis cache in EU-West-2 (London)
- Application logs in EU-West-2 (CloudWatch Logs)
- S3 storage in EU-West-2 (if used)
Audit Trail:
- CloudTrail logs all API calls and data access
- Resource tagging for compliance tracking
- IAM policies enforce regional restrictions
- AWS Organizations for governance controls
Data Protection Compliance
GDPR/UK GDPR Alignment:
- Data processing within UK jurisdiction
- No cross-border data transfers to USA
- Right to erasure (tenant deletion capabilities)
- Data encryption at rest and in transit
- Audit logs for data access
Conclusion
The migration from Heroku to AWS ECS represents a strategic necessity for Viitata, driven primarily by UK data residency and compliance requirements. The current Heroku architecture creates unacceptable compliance risk due to database backups being retained in the USA, making it impossible to guarantee UK-only data storage.
This migration:
- Resolves critical compliance issue: UK data residency with backups in EU-West-2 (London)
- Eliminates USA data transfer: All data and backups remain in UK
- Cost savings: 15-30% base reduction + 40-50% additional savings from auto-scaling during off-peak
- Auto-scaling with scale-down: Handles multi-tenant traffic spikes automatically while reducing costs during low-usage periods
- 100% version-controlled infrastructure: CloudFormation templates for entire stack, eliminating manual configuration and drift
- Greater operational control: Full infrastructure flexibility and customization
- Performance improvements: Dedicated resources, auto-scaling for spikes, and advanced AWS features
- AWS service ecosystem: Native integration with SQS (message broker), S3, CloudWatch, IAM, and other services
- Serverless message queue: SQS eliminates need to manage message broker infrastructure
- Enhanced security: VPC isolation, IAM roles, Secrets Manager, and encryption
While AWS requires greater operational expertise compared to Heroku’s managed platform, the compliance requirements make this migration essential. The additional benefits of cost savings, performance improvements, and operational flexibility provide further justification beyond the compliance imperative.
Without this migration, Viitata cannot serve clients with UK data residency requirements and remains exposed to compliance risks associated with cross-border backup storage.
References