EPA Quote Forms — End-to-End Testing Approach
Overview of Automated End-to-End Testing for EPA Insurance Quote Forms
Purpose
This memo outlines the automated end-to-end (E2E) testing strategy in place for the EPA insurance quote platform. The test suite provides continuous quality assurance across all quote form journeys,
ensuring that customers can complete quotes reliably and that business-critical outcomes are correctly handled.
Separation of Concerns — Unit Tests vs End-to-End Tests
The EPA platform has two distinct layers of automated testing, each with a different purpose and owned by a different part of the codebase.
Unit Tests — Inside the Product
Unit tests live within the flux-epa product repository itself. They test the application’s internal logic in isolation — individual functions, business rules, data transformations, and component
rendering — without launching a browser or navigating a real form. Unit tests are:
- Fast — They run in milliseconds because they don’t involve a browser or network
- Narrow — Each test covers a single piece of logic (e.g. “does this function calculate the premium correctly?”)
- Developer-facing — They catch regressions in code as it is written, typically running on every commit
Unit tests answer the question: does each piece of the application work correctly on its own?
End-to-End Tests — Outside the Product
This E2E test suite is a separate project, deliberately kept outside the product codebase. It tests the application as a deployed whole — launching a real browser, navigating to a real
environment, and completing quote journeys exactly as a customer would. E2E tests are:
- Slow — Each test takes seconds to minutes because it drives a real browser through multiple form pages
- Wide — A single test exercises the full stack: frontend rendering, form validation, API calls, backend processing, and quote outcome routing
- Customer-facing — They verify the experience from the customer’s perspective, not the developer’s
E2E tests answer the question: can a customer actually complete a quote and reach the right outcome?
Why Both Are Needed
Neither layer can replace the other. They catch fundamentally different types of problems:
| | Unit Tests | E2E Tests |
|---|---|---|
| Catches | Logic errors in individual functions | Broken journeys, wrong destinations, missing content |
| Misses | Integration failures between components | Internal implementation bugs |
| Speed | Milliseconds | Seconds to minutes |
| Runs against | Code in isolation | A deployed environment |
| Owned by | Product repository (flux-epa) | Test repository (epa-tests-e2e) |
A unit test might confirm that a premium calculation function returns the correct value, but it cannot tell you whether the customer actually sees that value on screen, or whether clicking “Get a
Quote” sends them to the right page. Conversely, an E2E test can confirm the full journey works, but if it fails, it cannot pinpoint whether the issue is in the frontend, the API, or the business
logic — that is where unit tests provide the detail.
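The contrast can be sketched in a few lines of Python. The premium function, values, and journey below are invented for illustration; only the division of labour between the two layers reflects this memo:

```python
# Unit test (lives in the product repo): one function, no browser, milliseconds.
def calculate_premium(base_rate: float, risk_multiplier: float) -> float:
    """Toy premium rule, invented for this illustration."""
    return round(base_rate * risk_multiplier, 2)

def test_premium_calculation():
    # Fails only if the calculation logic itself is wrong.
    assert calculate_premium(500.0, 1.2) == 600.0

# E2E test (lives in the separate test repo): a real browser against a deployed
# environment, seconds to minutes. Shown as a comment because it needs a running
# environment; the URL and step names here are hypothetical.
#
# def test_full_quote_journey(page):
#     page.goto("https://uat.example.com/car-quote")
#     VehicleLookupStep(page).fill()
#     PersonalDetailsStep(page).fill()
#     expect(page).to_have_url(re.compile(r"/quote/buy"))
```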
Ownership Model
Each product has a product owner — a single developer who is responsible for that product’s E2E test suite. They maintain the component classes, step classes, test methods, and documentation, and
they evolve the E2E coverage as the product grows. The product owner is the person who understands the customer journeys end to end and ensures the test suite reflects them accurately.
Other developers working on the product — building features, fixing bugs, refactoring code — are responsible for writing unit tests to cover their changes within the product repository. They are
not expected to write or maintain E2E tests. However, the E2E suite is available to them as a tool:
- Validation — After completing a feature or fix, a developer can run the E2E tests against their deployed changes to confirm that the customer-facing journeys still work correctly
- Sanity checking — A quick smoke run provides confidence that a change has not broken anything obvious before handing work over for review
- Regression testing — The full suite can be run to verify that a change has not introduced side effects in other parts of the form or in other brands
This creates a clear division of responsibility:
| | Product Owner | Feature/Bug Fix Developers |
|---|---|---|
| Writes | E2E tests (components, steps, test methods) | Unit tests (within the product repo) |
| Maintains | E2E test suite, documentation, parametrised rules | Product code, unit test coverage |
| Uses E2E tests to | Validate developer work, verify releases, expand coverage | Sanity check their own changes, run regression tests |
The E2E tests become a quality gate that the product owner controls. When a developer submits work, the product owner can run the relevant E2E tests to verify that customer journeys are intact —
without needing to manually click through forms. This frees developers to focus on building and unit-testing their code, while the product owner has an automated tool to confirm the end result works
from the customer’s perspective.
The Separation Is Deliberate
Keeping the E2E tests in their own repository and distributing them as a package reinforces this ownership model:
- Independent release cycles — The product owner can update the E2E suite without changing the product, and vice versa
- Environment flexibility — E2E tests run against any deployed environment, not just a local development setup
- No coupling — The E2E tests interact with the application purely through its UI, the same way a customer does. They have no knowledge of internal code, database schemas, or API contracts. If a
developer refactors the product’s internals but the UI behaviour remains the same, the E2E tests continue to pass without changes
- Low barrier for developers — Developers do not need to understand the E2E test codebase to benefit from it. They run the tests, review the report, and act on any failures
Scope of Testing
Products Covered
The test suite covers 6 insurance quote forms across 3 brands:
| Brand | Product | Test Cases |
|---|---|---|
| Adrian Flux | Car Insurance | 32 |
| Adrian Flux | Learner Driver Insurance | 21 |
| Sterling Insurance | Car Insurance | 34 |
| Sterling Insurance | Learner Driver Insurance | 16 |
| Bikesure | Motorcycle Insurance | 34 |
| Bikesure | Short-term Motorcycle Insurance | 32 |
| Total | | 169 |
Each form consists of up to 12 steps that a customer completes to receive a quote, including vehicle details, personal information, driving history, and policy preferences.
What the Tests Verify
Tests are organised into the following categories:
- Smoke tests — Confirm that every page of the quote form loads correctly, displays the right content (titles, disclaimers, legal text), and that the core happy-path journey works end to end.
- Validation tests — Verify that form fields reject invalid or missing input and display the correct error messages to the customer.
- Quote outcome tests — Complete the full quote journey and confirm that customers are directed to the correct outcome page (buy online, request a callback, or call us) based on their details.
- Source code tests — Verify that marketing source codes are correctly passed through the entire journey and appear in callback URLs and quote outcomes, ensuring accurate attribution.
- Multi-party tests — Test the additional driver flows, including adding, editing, and removing named drivers from a policy.
These tests are scoped to the quote form itself — the customer-facing frontend application. They do not directly test connected backend services such as the quoting engine, pricing APIs, CRM
systems, or payment gateways. The tests interact only with what the customer sees in the browser.
However, because the form depends on backend services to function, certain backend behaviours can be inferred from the test results. When a test completes a full quote journey and asserts the
outcome, it is implicitly validating that the backend processed the submission correctly:
- Destination page assertions — When a test submits a quote and verifies the customer is redirected to the “buy” page, this confirms that the backend quoting engine received the submission,
processed it, returned a quotable result, and the form routed the customer to the correct destination. A failure here could indicate a backend issue (quoting engine down, pricing rules
misconfigured) even though the test itself only checks the URL the customer lands on.
- Quote reference assertions — When a test verifies that a quote reference number is displayed on the outcome page, this confirms that the backend successfully generated and returned a quote
reference. The presence of this reference implies the quote was persisted in the backend system.
- Content assertions on outcome pages — Tests verify specific content on outcome pages: premium amounts, policy benefits, insurer names, excess values, and callback phone numbers. This content is
populated by backend responses. If the backend returns incorrect data, these assertions will fail — surfacing a backend problem through a frontend test.
- Source code propagation — Tests verify that marketing source codes passed via URL parameters survive the full journey and appear in callback URLs and outcome pages. This validates that the form
correctly passes source codes to the backend and that the backend includes them in its response data.
In this way, the E2E tests act as an early warning system for backend issues, even though they are not backend tests. A pattern of destination or content failures across multiple EPAs can indicate
a shared backend service problem, while a failure isolated to a single EPA points to a product-specific configuration issue.
What these tests will not tell you is why a backend service failed — only that something in the chain produced an unexpected result from the customer’s perspective. Diagnosing the root cause
requires backend logs, monitoring, and the product’s own unit and integration tests.
Architecture
The test suite is built on a three-layer architecture that separates concerns and promotes reuse. Each layer has a distinct responsibility:
┌─────────────────────────────────────────────────────┐
│ Tests │
│ Orchestrate steps into full scenarios and assert │
│ expected outcomes. │
│ e.g. test_AFC004_simple_vehicle_lookup │
├─────────────────────────────────────────────────────┤
│ Steps │
│ Represent a single page/step of the form. │
│ Compose element actions into complete scenarios │
│ for that page. │
│ e.g. VehicleLookupStep.fill() │
├─────────────────────────────────────────────────────┤
│ Components │
│ Represent a reusable form component. Expose │
│ individual user actions on that component. │
│ e.g. F058VehicleLookupCar.reg_lookup() │
└─────────────────────────────────────────────────────┘
Components — Individual Actions
At the lowest level, component classes represent a single reusable form element such as a vehicle registration lookup or a purchase details panel. Each method on a component maps to one discrete
user action: entering a registration number, clicking “Find Car”, selecting a manufacturer from a dropdown, and so on.
Because components are self-contained, the same component can be shared across multiple brands and products. For example, the car vehicle lookup component is used by both Adrian Flux Car and Sterling
Car forms without duplication.
Critically, the component numbering in the test suite mirrors the component numbering used in the flux-epa product itself and its documentation. The test class F058VehicleLookupCar corresponds
directly to component F058 in the EPA platform; F066VehiclePurchaseDetails corresponds to component F066, and so on. This shared naming convention means:
- Traceability — When a component is changed in the product, it is immediately clear which test component covers it
- Common language — Developers, testers, and product documentation all refer to the same component by the same number
- Incremental coverage — New actions and scenarios can be added to a component class over time without touching any existing tests. For example,
F058VehicleLookupCar currently covers
registration lookup, manual entry, and vehicle changes. As the product evolves, additional actions (e.g. a new vehicle data source, a different lookup flow) can be added to the same class and then
consumed by new or existing step scenarios. Coverage grows gradually, component by component, without requiring large-scale rewrites
The current component library is small — F001 (motorcycle lookup), F058 (car lookup), and F066 (purchase details) — but it is designed to expand. As new components are built in the test suite, they
follow the same F-number convention, keeping the test layer aligned with the product layer.
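Under these conventions a component class might look like the sketch below. The selectors and method bodies are assumptions; only the class name and the one-action-per-method convention come from this memo. Because the class depends only on an object offering Playwright's `fill` and `click` methods, its composition can be checked without launching a browser:

```python
class F058VehicleLookupCar:
    """Sketch of a component class (selectors are hypothetical)."""

    def __init__(self, page):
        # `page` is a Playwright Page in real use; any object with the same
        # fill/click methods works, which keeps this sketch browser-free.
        self.page = page

    def enter_registration(self, reg: str) -> None:
        # One discrete user action: typing into the registration field.
        self.page.fill("#vehicle-reg", reg)  # hypothetical selector

    def click_find_car(self) -> None:
        # Another discrete action: triggering the lookup.
        self.page.click("text=Find Car")

    def reg_lookup(self, reg: str) -> None:
        # A composed lookup: exactly the actions a customer performs, in order.
        self.enter_registration(reg)
        self.click_find_car()
```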
Steps — Page Scenarios
The middle layer contains step classes, one per page of the quote form. A step class composes the actions from one or more components into complete scenarios for that page. For example, the entry
page step brings together the vehicle lookup component and the purchase details component, calling their actions in the right order and advancing to the next page.
Each step exposes different scenarios:
- fill — the standard happy-path completion of that page
- fill_complex — an advanced path that exercises more of the page’s functionality (manual entry, changing selections, toggling options)
- error methods — deliberately trigger validation errors to test that the form rejects bad input correctly
This means the details of how a form page works are defined in one place. If the form changes, only the step class needs updating — not every test that uses it.
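A step class in this style might look like the following sketch. The selectors, the stand-in component, and the scenario bodies are assumptions; only the fill/error naming pattern follows this memo:

```python
class _VehicleLookup:
    """One-method stand-in for a component class such as F058VehicleLookupCar."""
    def __init__(self, page):
        self.page = page
    def reg_lookup(self, reg: str) -> None:
        self.page.fill("#vehicle-reg", reg)   # hypothetical selector
        self.page.click("text=Find Car")

class VehicleLookupStep:
    """Sketch of a step class; all selectors and bodies are assumptions."""
    def __init__(self, page):
        self.page = page
        self.lookup = _VehicleLookup(page)

    def fill(self, reg: str = "AB12 CDE") -> None:
        # Happy path: complete the page and advance. How the page works lives
        # here, so a form change means updating this class, not every test.
        self.lookup.reg_lookup(reg)
        self.page.click("text=Continue")

    def fill_error_empty_reg(self) -> None:
        # Deliberately submit without a registration to exercise validation.
        self.page.click("text=Find Car")
```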
Tests — Orchestrated Journeys
At the top level, test methods orchestrate steps into meaningful scenarios and assert expected outcomes. A test might call a single step to verify one page works in isolation, or chain all steps
together to complete a full end-to-end quote journey.
Tests are concerned with what should happen, not how to interact with the form. For example, a full quote journey test reads as a simple sequence — fill vehicle details, fill personal details,
fill driving history, get a quote, verify the outcome — with each step handling its own interactions internally.
Why This Matters
This layered separation provides several practical benefits:
- Reduced duplication — Common form components are written once and reused across brands. The vehicle lookup is defined in a single component class, not copied into every test.
- Easier maintenance — When a form page changes, updates are made in one step class rather than across dozens of individual tests.
- Readability — Tests read as business-level scenarios (“complete personal details, then get a quote”) rather than low-level browser interactions (“click this button, fill this field”).
- Faster test development — New tests for existing forms can be composed from the library of steps and components that already exist.
Technology Stack
Playwright — The Engine
Playwright is the core of the test suite. It is a browser automation framework developed by Microsoft that launches and controls a real web browser (Chromium) programmatically. Playwright is
responsible for everything the tests do:
- Navigating to quote form URLs
- Interacting with form elements — clicking buttons, filling text fields, selecting dropdowns, toggling radio buttons — exactly as a customer would
- Waiting intelligently for pages to load, network requests to complete, and elements to become visible before proceeding
- Asserting that the page is in the expected state — checking text content, element visibility, URL changes, and page metadata
- Recording video of every test session and capturing screenshots on failure
- Tracing detailed execution logs (DOM snapshots, network requests, console output) that can be replayed step-by-step when investigating failures
The EpaStep class wraps Playwright’s Page object with helpers specific to EPA quote forms — navigating between form steps, verifying step titles, checking field-level validation errors, and
confirming quote outcome destinations. Every component and step class ultimately delegates to Playwright for all browser interactions.
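A minimal sketch of such a wrapper, assuming hypothetical selectors and method signatures (the memo names the helper responsibilities but not their exact API):

```python
class EpaStep:
    """Sketch of an EpaStep-style wrapper; selectors and signatures assumed."""

    def __init__(self, page):
        self.page = page  # a Playwright Page in real use

    def verify_step_title(self, expected: str) -> None:
        # Delegates the DOM read to Playwright, then asserts on the result.
        actual = self.page.inner_text("h1")  # hypothetical heading selector
        assert expected in actual, f"expected title {expected!r}, got {actual!r}"

    def verify_destination(self, path_fragment: str) -> None:
        # Quote outcome routing is checked purely from the customer-visible URL.
        assert path_fragment in self.page.url, (
            f"{path_fragment!r} not in {self.page.url}"
        )
```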
pytest — The Runner
pytest serves as the test runner and orchestration layer. It does not interact with the browser directly — that is entirely Playwright’s domain. pytest’s responsibilities are:
- Discovery — Automatically finding and collecting all test methods across the 6 product suites
- Fixtures — Managing setup and teardown (browser contexts, cookies, environment configuration, form URLs) so that each test starts in a clean, correctly configured state
- Markers — Providing the tagging system (@pytest.mark.smoke, @pytest.mark.adrianflux, etc.) that allows selective test execution
- Parametrisation — Driving data-driven tests, such as running the same quote journey with 21 different source codes or multiple email/price combinations to verify different outcomes
- Reporting — Generating the HTML test report with pass/fail results, embedded screenshots, video links, and links to test case documentation
- Plugin system — The test suite is packaged as a pytest plugin, meaning consuming projects get all fixtures, markers, and configuration automatically just by installing the package
How They Work Together
pytest                                  Playwright
──────                                  ──────────
Discovers tests
Resolves fixtures (URLs, cookies)
                                        Launches browser
                                        Sets viewport, video recording
Calls test method
                                        Navigates to quote form
                                        Fills fields, clicks buttons
                                        Waits for pages to load
                                        Asserts page content
Collects pass/fail result
                                        Captures screenshot (on failure)
                                        Saves video recording
Generates HTML report
Attaches screenshots & video links
In short: Playwright does the work, pytest organises and reports on it.
Automatic Evidence Capture
Every test run automatically produces:
- Video recordings of each test, viewable directly from the HTML report
- Screenshots captured at the point of any failure
- Playwright trace files for failed tests, providing a step-by-step replay of DOM state, network activity, and console output
These artifacts are compiled into a self-contained HTML report that can be opened in any browser and shared without special tooling. Each test result links directly to its video, failure
screenshot, and corresponding test case documentation on GitHub.
Reports from CI runs are published to a CDN and are accessible via a browser.
Test Environments
Tests can be targeted at any environment (development, UAT, staging, production) by changing a single configuration value. Each brand can also be pointed at a different environment independently,
allowing testing to proceed in parallel across teams.
Test Execution
On-Demand via GitHub
Tests are triggered on demand through GitHub Actions. The person running the tests selects:
- The target environment (e.g. UAT, staging)
- Optionally, a subset of tests to run (e.g. only smoke tests, only a specific brand)
Results and artifacts are retained for 30 days.
Selective Execution
Tests are tagged with descriptive markers, making it straightforward to run targeted subsets:
| Marker | What It Runs |
|---|---|
| smoke | Core happy-path tests across all brands |
| validation | Input validation and error handling |
| quoting | Full end-to-end quote journeys |
| adrianflux | All Adrian Flux tests only |
| sterling | All Sterling tests only |
| bikesure | All Bikesure tests only |
| car | Car insurance forms only |
| learner | Learner driver forms only |
| bike | Motorcycle forms only |
Markers can be combined — for example, running only smoke tests for Bikesure motorcycle forms.
Parametrised Tests — Validating Destination Rules at Scale
One of the most powerful features of the test suite is its use of parametrisation to validate EPA destination rules comprehensively. Rather than writing a separate test for each combination of
inputs and expected outcomes, a single test method is written once and then driven by a data table. pytest automatically generates and runs a distinct test for every row in that table.
The Problem: Destination Rules Vary by EPA
Each EPA has its own business rules that determine where a customer is directed after completing a quote. Depending on the combination of quote status, premium amount, and source code, the customer
may be sent to:
- Buy online — A quote was issued and the customer can purchase immediately
- Callback — The quote needs further underwriting; the customer is offered a callback
- Call us — The quote cannot be processed online; the customer is directed to call
These rules differ between EPAs. For example, a quoted customer with a premium of £1,000 might be directed to the buy page on Sterling Car, but to the callback page on Adrian Flux Car. Getting
these rules wrong means customers end up on the wrong page — either unable to buy when they should be able to, or seeing incorrect messaging.
How Parametrisation Solves This
Each EPA’s destination rules are expressed as a simple data table directly in the test code:
| Email Prefix | Price | Expected Destination |
|---|---|---|
| quoted | £0 | call-us |
| quoted | £1 | callback |
| quoted | £1,999 | callback |
| quoted | £2,000 | callback |
| quoted | £3,999 | callback |
| quoted | £4,000 | call-us |
| rejected | £0 | call-us |
| rejected | £1 | call-us |
| rejected | £1,999 | call-us |
| … | … | … |
(Example: Adrian Flux Car — 12 combinations from a single test definition)
From this one table, pytest generates 12 independent tests — each running the full quote journey end to end with different inputs and verifying the customer lands on the correct outcome page. The
report shows each combination as a separate pass or fail, making it immediately clear which specific rule has broken.
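A hedged sketch of the mechanism: the table and routing function below are a toy stand-in for the real end-to-end journey, but the @pytest.mark.parametrize wiring is the standard pytest feature this section describes.

```python
import pytest

# The rules table, one row per combination (values mirror the example above).
DESTINATION_RULES = [
    ("quoted",   0,    "call-us"),
    ("quoted",   1,    "callback"),
    ("quoted",   3999, "callback"),
    ("quoted",   4000, "call-us"),
    ("rejected", 0,    "call-us"),
    ("rejected", 1999, "call-us"),
]

def route(status: str, price: int) -> str:
    """Toy routing stand-in; the real rules live in each EPA's backend."""
    if status != "quoted" or price == 0 or price >= 4000:
        return "call-us"
    return "callback"

@pytest.mark.parametrize("status, price, expected", DESTINATION_RULES)
def test_destination(status: str, price: int, expected: str) -> None:
    # pytest generates one independent, individually reported test per row.
    assert route(status, price) == expected
```

In the real suite the body of `test_destination` would drive the full browser journey via the step classes rather than call a local function; the parametrisation pattern is identical.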
Scaling Across EPAs
The same pattern is applied across all 6 products, with each EPA’s table reflecting its own destination rules. This means:
- Adding a new destination rule is as simple as adding a row to the table — no new test code to write
- Changing a rule (e.g. moving the callback threshold from £2,000 to £3,000) means updating one value in the table
- Each EPA’s rules are visible at a glance in the test file, serving as living documentation of the business logic
Beyond Destinations: Source Code Attribution
The same parametrisation approach is used to validate marketing source code handling across the quote journey. A second data table combines source codes, “where did you hear” selections, expected
destinations, and expected callback URL parameters — generating up to 21 test runs from a single test method per EPA.
This covers scenarios such as:
- A source code passed via URL appearing correctly in callback links
- Short-form source codes being resolved to their canonical form
- Missing or empty source codes falling back to the correct default
- Source codes surviving the full journey through all form steps and appearing on the outcome page
The Net Effect
Across all 6 products, parametrisation generates a large number of test runs from a relatively small number of test definitions. One test method with a 12-row table and another with a 21-row table
produces 33 full end-to-end journeys per EPA — nearly 200 destination and source code validations across the suite, each running independently and reporting individually. Adding a new EPA’s rules
means defining its data table; the test logic is already written.
Reusability — Packaged as a Shared Library
The test suite is not a standalone script — it is built and distributed as a Python package (flux-epa-e2e-tests) that can be installed into any project. This means the tests, fixtures, step
classes, and component classes are all reusable across multiple consuming applications.
How It Works
The package is registered as a pytest plugin via a standard entry point. When a consuming project installs the package, pytest automatically discovers and loads:
- All test fixtures (browser configuration, cookie setup, environment resolution)
- All test markers (smoke, quoting, adrianflux, etc.)
- All default configuration (HTML reporting, video recording, tracing, output directories)
- All test suites, step classes, and component classes
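Registration as a pytest plugin uses pytest's standard pytest11 entry-point group. The module path and exact metadata below are assumptions; only the package name and version come from this memo:

```toml
[project]
name = "flux-epa-e2e-tests"
version = "0.12.0"

[project.entry-points.pytest11]
flux_epa_e2e = "flux_epa_e2e.plugin"  # hypothetical module path
```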
A consuming project needs only two things to run the full test suite:
- pip install flux-epa-e2e-tests (or add it to their dependencies)
- A .env file with the target environment URLs and QA bypass cookie
No additional pytest configuration, fixture definitions, or test imports are required — the plugin handles all of it. The consuming project can also override any default (report path, output
directory, target URLs) through its own configuration or command line flags.
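A minimal .env might look like the following; every variable name and value here is hypothetical, since the memo does not list the actual keys:

```ini
# Hypothetical .env — illustrative names only
ADRIANFLUX_BASE_URL=https://uat.example.com/car-quote
STERLING_BASE_URL=https://staging.example.com/car-quote
BIKESURE_BASE_URL=https://uat.example.com/bike-quote
QA_BYPASS_COOKIE=changeme
```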
Versioned Releases
The package follows semantic versioning (currently v0.12) and is published to a private package registry (Gemfury). This ensures:
- Consuming projects pin to a known version and upgrade deliberately
- Test changes are tracked through a changelog
- Rollback to a previous version is straightforward if needed
CI/CD Pipelines
The test suite has two GitHub Actions workflows that automate execution and distribution.
1. Test Execution Pipeline
A manually triggered workflow that runs the test suite against a configured environment:
- Provisions an Ubuntu runner with Python 3.12
- Installs the test package and Playwright’s Chromium browser
- Runs tests — optionally filtered by a marker expression (e.g. smoke, adrianflux and quoting)
- Captures the pytest summary line (passed/failed counts)
- Uploads the full results directory (HTML report, videos, screenshots, traces) as a GitHub artifact retained for 30 days
- Uploads the HTML report to Cloudflare R2 cloud storage for easy sharing via a public URL
Environment URLs and secrets (QA bypass cookie, R2 credentials) are managed through GitHub repository variables and secrets, keeping sensitive values out of the codebase.
2. Package Publishing Pipeline
Triggered automatically when a GitHub release is created:
- Builds the Python package (wheel)
- Uploads it to the Gemfury private package registry
This means the release process is: tag a version, create a GitHub release, and the package is available for consuming projects to install within minutes.
Pipeline Summary
┌──────────────────────┐
Manual trigger ──►│ Run E2E Tests │
(select markers) │ - pytest on CI │
│ - HTML report │──► GitHub Artifacts (30 days)
│ - Videos & traces │──► Cloudflare R2 (shareable URL)
└──────────────────────┘
┌──────────────────────┐
GitHub release ──►│ Publish Package │
│ - Build wheel │──► Gemfury Registry
│ - Upload to Gemfury │ (pip install in other projects)
└──────────────────────┘
Test Case Numbering, Documentation, and Traceability
The Numbering System
Every test method carries a unique, sequentially numbered identifier that ties together the test code, its documentation, and its result in the HTML report. The identifier is embedded directly in the
method name and docstring:
- Method name: test_AFC009_complex_vehicle_lookup
- Docstring: AFC-009: Verify complex vehicle lookup with manual entry advances to car details.
Each product suite has its own prefix (AFC for Adrian Flux Car, BKSB for Bikesure Bike, SL for Sterling Learner, etc.) and numbers run sequentially from 001 upwards in the order the tests appear in
the file. This means the case number reflects the test’s position in the suite — there are no gaps or out-of-order numbers.
Documentation with Claude Code
Test case documentation is generated and maintained using Claude Code (Anthropic’s AI coding assistant) through a set of custom slash commands built specifically for this project:
/new-test-case — When a new test is written, this command:
- Reads the test method, the step classes it calls, and the element classes those steps use
- Understands what the test does at the UI level by following the full call chain
- Assigns the next available case number and renames the test method accordingly
- Generates a plain-English documentation file describing every step the test performs
- Adds the new case to the suite’s index file
/check-test-case-order — Audits a test suite to verify that all case numbers are sequential, all documentation files exist, and all cross-references between documents are valid. If numbers have
drifted (e.g. after tests were reordered), it re-indexes the entire suite — renaming test methods, updating doc files, and correcting cross-references in a single operation.
/remove-test-case — Removes a test from the suite, deletes its documentation, and re-indexes all subsequent cases so the numbering remains gapless.
This approach means documentation is never written from scratch by hand. Claude reads the actual test code — including the step and component layers — and produces accurate, up-to-date descriptions of
what each test does. When tests change, the documentation can be regenerated from the code rather than manually updated.
What a Test Case Document Looks Like
Each test case has its own markdown file (e.g. docs/adrianflux_car/AFC-009.md) containing:
- Category — Smoke, Validation, or Parametrised
- Markers — The pytest markers applied to this test
- Test method — The exact function name for traceability back to the code
- Steps — A numbered list describing every action in plain English
- Test data (for parametrised tests) — A table of input combinations and expected outcomes
There are currently 174 individual test case documents plus a summary index for each product that lists all cases in a single table.
Linked in the HTML Report
The test case documentation is not just a separate reference — it is linked directly from the HTML test report. A pytest hook inspects each test result, extracts the case ID from the test method
name (e.g. AFC009 from test_AFC009_complex_vehicle_lookup), and appends a clickable link to the corresponding documentation file on GitHub.
This means that when reviewing test results, anyone can click through from a pass or failure directly to the full plain-English description of what that test was verifying — without needing to read
the test code.
HTML Report Row
┌──────────────────────────────────────────────────────────────┐
│ test_AFC009_complex_vehicle_lookup PASSED │
│ [Video] [Screenshot] [Test Case ↗] │
│ │ │
│ └─► docs/adrianflux_car/AFC-009.md
│ on GitHub │
└──────────────────────────────────────────────────────────────┘
This creates a full traceability chain: test result → test case documentation → test code → step classes → component classes, all connected by the case ID.
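The core of such a hook can be sketched as follows. Only the case-ID pattern follows the naming convention described above; the docs filename mapping and the hook wiring are assumptions:

```python
import re

# test_AFC009_complex_vehicle_lookup -> prefix "AFC", number "009"
CASE_ID = re.compile(r"test_([A-Z]+)(\d{3})_")

def case_doc_file(test_name: str):
    """Map a test method name to its doc filename, or None if no case ID."""
    m = CASE_ID.search(test_name)
    if not m:
        return None
    prefix, number = m.groups()
    return f"{prefix}-{number}.md"

# In conftest.py, a pytest-html hook could then append the GitHub link to each
# report row (hypothetical wiring, shown as a comment):
#
# def pytest_html_results_table_html(report, data):
#     doc = case_doc_file(report.nodeid)
#     if doc:
#         data.append(f'<a href="{DOCS_BASE_URL}/{doc}">Test Case</a>')
```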
Quality Controls on the Test Suite Itself
The test codebase is held to the same engineering standards as production code:
- 100% documentation coverage — Every module, class, and method has a descriptive docstring, enforced by automated checks.
- Automated linting and formatting — Code style is enforced consistently across the entire suite.
- Versioned releases — The test suite is versioned (currently v0.12) and distributed as a package, ensuring reproducible test runs.
Key Benefits
- Confidence in releases — Tests can be run against any environment before or after a deployment to verify that quote journeys work correctly.
- Rapid feedback — Issues are caught early with clear evidence (video, screenshots, traces) to support diagnosis.
- Cross-brand consistency — The same testing patterns are applied uniformly across all 3 brands and 6 products.
- Marketing attribution assurance — Source code tests verify that campaign tracking remains intact through the full customer journey.
- Scalability — Adding a new brand or product follows an established template, keeping the approach consistent as the platform grows.
Applying This Architecture to Other Products
The architecture behind this test suite is not specific to insurance quote forms. The three-layer pattern (components, steps, tests), the tooling (Playwright, pytest, Claude Code documentation), and
the infrastructure (CI pipelines, CDN-hosted reports, package distribution) can be applied to any multi-step web application. Adopting it for a new product does not require starting from scratch — the
patterns and infrastructure are already proven and can be replicated.
What Transfers Directly
The following elements are product-agnostic and can be reused as-is or with minimal adaptation:
- The EpaStep wrapper — The helper class that wraps Playwright’s page with convenience methods (navigate, verify titles, check errors, wait for pages) is not EPA-specific. A similar wrapper could be created for any product, or the existing one extended.
- The three-layer pattern — Components (individual UI actions), steps (page scenarios), and tests (orchestrated journeys) work for any multi-page application: onboarding flows, checkout processes, claims forms, account management, or any wizard-style UI.
- Parametrised destination/outcome testing — Any product with branching outcomes based on user input (approval/rejection, pricing tiers, eligibility checks) can use the same data-table approach to
validate all paths from a single test definition.
- The CI pipeline — The GitHub Actions workflows for running tests on demand, uploading reports to cloud storage, and publishing the package require only configuration changes (environment URLs,
secrets) to point at a different product.
- The CDN-hosted report index — The R2 upload and index page pattern works for any test suite, giving stakeholders a single URL to find any historical test run.
- Claude Code slash commands — The /new-test-case, /check-test-case-order, and /remove-test-case commands are driven by naming conventions (prefix + sequential number + markdown docs). Adapting them for a new product means defining a new prefix — the commands themselves handle discovery, numbering, documentation generation, and re-indexing.
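The convention-driven discovery that these commands rely on is simple to sketch. The helper below is a hypothetical illustration (the function name, directory layout, and three-digit numbering are assumptions based on the AFC-009.md naming shown earlier), not the actual slash-command implementation:

```python
import re
from pathlib import Path

def next_case_id(docs_dir: Path, prefix: str) -> str:
    """Return the next sequential case ID (e.g. 'HI-003') for a given
    prefix by scanning existing docs named '<PREFIX>-<NNN>.md'."""
    pattern = re.compile(rf"{re.escape(prefix)}-(\d+)\.md$")
    numbers = [
        int(m.group(1))
        for p in docs_dir.glob(f"{prefix}-*.md")
        if (m := pattern.search(p.name))
    ]
    # An empty directory yields 0, so a brand-new prefix starts at 001.
    return f"{prefix}-{max(numbers, default=0) + 1:03d}"
```

Because the convention carries all the information, pointing the same logic at a new product is just a matter of supplying a new prefix.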
What a New Product Needs
To bring a new product into the same testing architecture, the product-specific work is:
- Component classes — One class per reusable UI component in the new product, following the F-number convention if the product uses it, or a suitable naming scheme otherwise. Each class exposes
the actions a user can perform on that component.
- Step classes — One class per page or screen, composing component actions into scenarios (happy path, complex path, error paths).
- Test methods — Orchestrating steps into the journeys that matter for that product, tagged with appropriate markers.
- Fixtures — Product-specific configuration: base URLs, environment overrides, expected content (titles, disclaimers, legal text).
- Parametrised data tables — The product’s business rules expressed as input/outcome tables for destination testing.
The infrastructure, reporting, documentation tooling, and package distribution are already in place. A new product plugs into the existing system rather than building its own.
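The parametrised data tables mentioned above can be sketched as follows. This is a toy example: the field names, the £1,000 threshold, and the outcome labels are invented for illustration, and the plain loop stands in for the pytest.mark.parametrize decorator the real suite would use:

```python
# Hypothetical data table: each row pairs quote inputs with the
# outcome the customer should reach.
OUTCOME_TABLE = [
    ({"status": "quoted", "premium": 800}, "buy_online"),
    ({"status": "quoted", "premium": 1200}, "callback"),
    ({"status": "referred", "premium": None}, "callback"),
]

def expected_outcome(status, premium):
    """Toy routing rule: quoted premiums at or under £1,000 can be
    bought online; everything else routes to a callback."""
    if status == "quoted" and premium is not None and premium <= 1000:
        return "buy_online"
    return "callback"

# In the real suite the same table would drive a single test via
# @pytest.mark.parametrize, so every branch is validated from one
# test definition.
for inputs, expected in OUTCOME_TABLE:
    assert expected_outcome(**inputs) == expected
```

Adding a new business rule then means adding a row to the table, not writing a new test.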
A Practical Example
Consider a hypothetical home insurance quote form. Applying this architecture would look like:
| Layer | EPA (existing) | Home Insurance (new) |
|---|---|---|
| Components | F058VehicleLookupCar | F040PropertyLookup |
| | F066VehiclePurchaseDetails | F041PropertyDetails |
| Steps | S01_entry_page (vehicle reg) | S01_entry_page (postcode lookup) |
| | S04_policyholder_details | S03_policyholder_details |
| Tests | test_AFC024_quote_outcomes | test_HI024_quote_outcomes |
| Parametrised rules | quoted + £1,000 → callback | standard + £500 → buy |
| Docs | AFC-024.md | HI-024.md |
The component and step classes are new — they reflect the new product’s UI. But the test structure, documentation tooling, CI pipeline, and report infrastructure are identical. The investment made in
the EPA suite pays forward into every subsequent product.
Summary
The EPA E2E test suite provides structured, automated coverage of all customer-facing quote journeys across Adrian Flux, Sterling Insurance, and Bikesure. With 169 documented test cases, automatic
evidence capture, and flexible execution options, the suite gives stakeholders confidence that quote forms are functioning correctly and that customers are reaching the right outcomes.
3 - Viitata Tenancy Infrastructure
Migration from single-tenant to multi-tenant architecture
Executive Summary
This memo documents the strategic architecture migration for Viitata from a single-tenant-per-instance model to a multi-tenant architecture on Heroku. This migration addresses critical operational inefficiencies, enables deployment of the new Viitata version with its required worker architecture, and significantly reduces both current costs and the cost of scaling while eliminating DevOps friction for client onboarding.
Key Changes:
- Architecture: Single-tenant-per-instance → Multi-tenant shared infrastructure
- Platform: Heroku (no change)
- Application Version: Current (single worker) → New version (3 workers required)
- Cost Impact: $96/month currently → $288/month if upgraded on single-tenant → $130/month on multi-tenant
- Cost Savings: 55% reduction vs. deploying new version on single-tenant architecture
Introduction
Purpose
This document outlines the rationale, technical approach, and benefits of migrating Viitata from a distributed single-tenant-per-instance model to a consolidated multi-tenant architecture on Heroku.
Scope
This memo covers:
- Current single-tenant-per-instance architecture on Heroku
- New Viitata version requirements (3-worker architecture)
- Proposed multi-tenant architecture on Heroku
- Cost analysis and operational benefits
- Technical considerations and trade-offs
Out of scope:
- Detailed application code changes for multi-tenancy
- Specific Heroku configuration details
- Data migration procedures and implementation timeline
Audience
This document is intended for technical leadership, DevOps engineers, and stakeholders involved in infrastructure planning and decision-making.
Background
Current State: Single-Tenant-Per-Instance on Heroku
Viitata currently operates with a single-tenant-per-instance model on Heroku, consisting of:
Infrastructure per Tenant:
- 1 Heroku application instance (single worker)
- 1 PostgreSQL database
- 1 Redis cache instance
- Cost: ~$16/month per tenant
Current Deployment:
- 6 production instances running current Viitata version
- Each instance operates as a single-worker application
- Total monthly cost: ~$96 (6 instances × $16)
- Each instance requires independent CI/CD pipeline
- Each instance requires separate DevOps configuration
Important Note: The current architecture runs an older version of Viitata that does not require the 3-worker architecture. However, the new version of Viitata cannot be deployed without this infrastructure change.
Challenges with Current Architecture
1. Cost Scalability Concerns
With 6 production tenants at $16/month each, the current architecture costs approximately $96/month. While manageable at this scale, the cost scales linearly with each new tenant ($16 per additional tenant). More critically, the new version of Viitata requires a 3-worker architecture that would triple costs to approximately $288/month for the same 6 tenants.
2. DevOps Friction
Each new client onboarding requires:
- Provisioning new Heroku application
- Configuring new PostgreSQL database
- Setting up new Redis cache
- Configuring CI/CD pipeline
- Managing environment variables and secrets
- Setting up monitoring and logging
This creates substantial friction and delays in client onboarding.
3. CI/CD Maintenance Overhead
Maintaining 6 separate CI/CD pipelines creates:
- Increased complexity in deployment processes
- Higher risk of configuration drift
- Difficulty in applying updates uniformly
- Additional testing burden across instances
4. Blocking Issue: New Viitata Version Requirements
The new version of Viitata fundamentally requires three distinct worker types to function:
- Web worker: Handles HTTP requests
- Celery worker: Processes asynchronous tasks
- Celery Beat worker: Manages scheduled tasks and periodic jobs
This is not optional - the new Viitata version cannot be deployed without all three workers running.
Under the single-tenant model, deploying the new version would require:
- 18 total worker processes (6 instances × 3 workers)
- Tripling of infrastructure costs per tenant (from $16 to ~$48 per tenant)
- Total monthly cost increase from $96 to approximately $288/month
- 18 separate processes to monitor and manage
Critical Impact: The single-tenant architecture makes it economically and operationally prohibitive to deploy the new version of Viitata. Without migrating to multi-tenant, the platform cannot evolve.
Technical Analysis
Proposed Architecture: Multi-Tenant on Heroku
The new architecture consolidates all tenants into a single shared Heroku infrastructure:
Shared Infrastructure:
- 1 Heroku application (supporting 3 worker types)
- 1 Heroku PostgreSQL database (with tenant isolation)
- 1 Heroku Redis cache (with tenant namespacing)
- Estimated cost: ~$130/month total
Worker Configuration:
- 1 web worker (serving all tenants)
- 1 Celery worker (processing tasks for all tenants)
- 1 Celery Beat worker (managing schedules for all tenants)
- Total: 3 workers supporting all tenants
Cost Analysis
| Architecture Model | Viitata Version | Tenants | Workers | Monthly Cost | Cost per Tenant |
|---|---|---|---|---|---|
| Current (Single-Tenant) | Old | 6 | 6 (1 per instance) | $96 | $16.00 |
| Single-Tenant Upgraded | New | 6 | 18 (3 per instance) | $288 | $48.00 |
| Multi-Tenant (Proposed) | New | 6 | 3 (shared) | $130 | $21.67 |
| Savings vs. Upgraded | - | - | -83% | -$158/month | -55% |
Key Insights:
- Current architecture cannot run the new Viitata version without significant cost increase
- New version’s 3-worker requirement would triple single-tenant costs ($96 → $288)
- Multi-tenant architecture enables new version deployment at 55% lower cost than single-tenant upgrade
- Marginal cost advantage: Adding tenant #7 costs $0/month (vs. $48/month in single-tenant)
- Cost efficiency improves with scale: 10 tenants = $13/tenant, 20 tenants = $6.50/tenant
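The per-tenant figures above follow from the flat shared-infrastructure cost, which can be checked with a one-line calculation:

```python
def cost_per_tenant(monthly_total: float, tenants: int) -> float:
    """Shared-infrastructure cost divided across tenants. The total stays
    flat as tenants are added, until a scaling threshold is reached."""
    return round(monthly_total / tenants, 2)

# Figures from the memo: ~$130/month shared multi-tenant infrastructure.
assert cost_per_tenant(130, 6) == 21.67   # current 6 tenants
assert cost_per_tenant(130, 10) == 13.0   # at 10 tenants
assert cost_per_tenant(130, 20) == 6.5    # at 20 tenants
```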
Benefits
1. Cost Reduction
- 55% reduction in infrastructure costs vs. the single-tenant upgrade path ($288/month → $130/month)
- Costs remain flat as tenant count grows (until scaling threshold)
- Predictable cost model
2. Operational Efficiency
- Single CI/CD pipeline for all tenants
- Unified deployment process
- Consistent configuration across all tenants
- Reduced maintenance overhead
3. Client Onboarding
- Near-instant tenant provisioning (database record vs. full infrastructure)
- Minimal DevOps involvement
- Faster time-to-value for new clients
- Reduced onboarding friction
4. Enables New Viitata Version Deployment
- Supports required 3-worker architecture (web, Celery, Celery Beat)
- 3 shared workers support all tenants (vs. 18 separate workers in single-tenant)
- Makes new version economically viable to deploy
- Simplified monitoring and management
- Better resource utilization
- Easier to scale horizontally when needed
Technical Considerations
Data Isolation
- Tenant identification at application layer
- Row-level security in PostgreSQL
- Redis key namespacing by tenant ID
- Careful query design to prevent data leakage
Performance
- Shared resources require proper resource allocation
- Connection pooling for database efficiency
- Caching strategies to prevent tenant interference
- Monitoring to identify tenant-specific performance issues
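The Redis key-namespacing point above can be sketched as a thin wrapper around the cache client. This is a minimal illustration, not the Viitata implementation: a plain dict stands in for the Redis client, and the `tenant:<id>:` prefix scheme is an assumption:

```python
import json

class TenantCache:
    """Prefix every cache key with the tenant ID so tenants can never
    read each other's entries."""
    def __init__(self, client, tenant_id: str):
        self.client = client
        self.prefix = f"tenant:{tenant_id}:"

    def set(self, key: str, value) -> None:
        self.client[self.prefix + key] = json.dumps(value)

    def get(self, key: str):
        raw = self.client.get(self.prefix + key)
        return json.loads(raw) if raw is not None else None

store = {}  # stands in for redis.Redis in this sketch
a = TenantCache(store, "acme")
b = TenantCache(store, "beta")
a.set("session:42", {"user": "alice"})
assert b.get("session:42") is None               # no cross-tenant reads
assert a.get("session:42") == {"user": "alice"}  # own data still visible
```

Centralising the prefix in one wrapper is what makes "careful query design" enforceable rather than a per-call-site convention.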
Security
- Tenant isolation at application and data layers
- Secure tenant context management
- Audit logging for compliance
- Regular security reviews of multi-tenant code paths
Scalability
- Horizontal scaling when single instance reaches capacity
- Database sharding if needed for very large tenant counts
- CDN and edge caching for static assets
- Load balancing across multiple application instances
Trade-offs
Advantages
- Dramatic cost reduction
- Simplified operations
- Faster client onboarding
- Better resource utilization
- Easier maintenance and updates
Disadvantages
- Tenant isolation complexity in application code
- Potential “noisy neighbor” issues
- Database restore impact: Currently, database snapshots can be restored per-tenant without affecting other clients. In multi-tenant architecture, a database restore would affect all tenants simultaneously, making it impossible to roll back a single client’s data due to a bug or data issue
- More complex deployment rollback scenarios
- Requires careful tenant-aware code design
- Less isolation between tenants compared to separate instances
Risk Mitigation
- Comprehensive testing of tenant isolation
- Resource limits per tenant
- Monitoring and alerting for anomalies
- Gradual migration approach
- Ability to isolate problematic tenants if needed
- Database restore mitigation:
- Implement application-level point-in-time recovery per tenant
- Maintain granular database backups with tenant-specific restore capabilities
- Use transaction logs to selectively restore tenant data
- Establish procedures for tenant-specific data rollback without full database restore
- More rigorous testing and staging processes to prevent production data issues
- Consider automated daily tenant-level logical backups (pg_dump per tenant)
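The per-tenant logical backup idea could be automated along these lines. Note the hedge built into the sketch: it assumes a schema-per-tenant layout (schema named `tenant_<id>`), which is one possible design; with purely row-level tenancy a per-table `COPY ... WHERE tenant_id = ...` export would be needed instead:

```python
import shlex

def tenant_backup_command(database: str, tenant_id: str) -> list[str]:
    """Build a pg_dump invocation for one tenant's logical backup.

    Assumes a schema-per-tenant layout; adjust for row-level tenancy.
    """
    return [
        "pg_dump",
        "--format=custom",               # compressed, restorable with pg_restore
        f"--schema=tenant_{tenant_id}",  # limit the dump to one tenant's schema
        f"--file=tenant_{tenant_id}.dump",
        database,
    ]

cmd = tenant_backup_command("viitata", "acme")
print(shlex.join(cmd))
```

A daily job iterating over tenant IDs would then give each client an independently restorable snapshot without a full-database restore.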
Conclusion
The migration from a single-tenant-per-instance architecture to a multi-tenant architecture on Heroku represents a strategic necessity for Viitata’s evolution. This change delivers:
- Deployment of the new Viitata version with its required 3-worker architecture
- 55% cost reduction vs. deploying the new version on single-tenant ($288/month → $130/month)
- Dramatic reduction in operational complexity (6 CI/CD pipelines → 1, 18 workers → 3)
- Near-zero marginal cost for new tenants ($0 vs. $48/tenant in single-tenant)
- Improved cost efficiency at scale: cost per tenant decreases as the platform grows
- Elimination of DevOps friction in client onboarding
Without this migration, deploying the new version of Viitata would nearly triple costs while adding significant operational burden. The multi-tenant architecture not only makes the new version economically viable but also positions Viitata for sustainable growth with costs that improve with scale.
While multi-tenancy introduces complexity in application design around tenant isolation and data security, the alternative—remaining on single-tenant architecture—would either block the platform’s evolution or make it financially unsustainable. The operational benefits, cost savings, and improved scalability make this migration essential for Viitata’s future.
4 - Heroku to AWS Migration
Migration from Heroku to AWS for improved compliance, cost, and control
Executive Summary
This memo documents the strategic platform migration for Viitata from Heroku to AWS (Amazon Web Services). This migration addresses critical compliance requirements around UK data residency, reduces infrastructure costs, provides greater operational flexibility and control, and enables better performance and integration with additional AWS services.
Key Changes:
- Platform: Heroku → AWS ECS (Elastic Container Service)
- Region: EU-West-1 (Ireland) → EU-West-2 (London, UK)
- Primary Driver: Compliance - UK data residency and backup retention
- Additional Benefits: Cost reduction, greater control, performance improvements, AWS service ecosystem
Critical Compliance Issue:
Currently on Heroku, while the primary database is in EU-West-1 (Ireland), database backups are retained in the USA. This creates compliance risks for UK data residency requirements. AWS enables full infrastructure and data containment within EU-West-2 (London).
Introduction
Purpose
This document outlines the rationale, technical approach, and benefits of migrating Viitata’s multi-tenant infrastructure from Heroku to AWS, with a focus on achieving UK data sovereignty and compliance requirements while improving operational capabilities.
Scope
This memo covers:
- Current multi-tenant architecture on Heroku
- Compliance and data residency challenges
- Proposed multi-tenant architecture on AWS ECS
- Cost analysis and operational benefits
- Technical considerations and trade-offs
Out of scope:
- Detailed AWS infrastructure-as-code configurations
- Specific containerization implementation details
- Data migration procedures and implementation timeline
- Application code changes required for AWS
Audience
This document is intended for technical leadership, compliance officers, DevOps engineers, and stakeholders involved in infrastructure planning and decision-making.
Background
Current State: Multi-Tenant on Heroku
Viitata currently operates with a multi-tenant architecture on Heroku, consisting of:
Infrastructure:
- 1 Heroku application (3 dynos: web, Celery worker, Celery Beat)
- 1 Heroku PostgreSQL database in EU-West-1 (Ireland)
- 1 Heroku Redis cache
- Current cost: ~$130/month
Current Deployment:
- Multi-tenant architecture supporting 6 production tenants
- Single CI/CD pipeline
- Heroku-managed infrastructure and scaling
- Automatic SSL, DNS, and platform maintenance
Challenges with Current Architecture
1. Compliance and Data Residency (Primary Driver)
Database Backup Location:
- Primary database: EU-West-1 (Ireland, EU)
- Database backups: Stored in USA (Heroku’s backup infrastructure)
This creates significant compliance risks:
- UK data residency requirements cannot be met
- Backup data crosses international boundaries
- Potential violations of data protection regulations
- Risk for clients requiring UK-only data storage
- Audit and compliance reporting challenges
Regional Limitation:
- Application and database in Ireland (EU-West-1), not UK
- No option for UK-specific region on Heroku
- Cannot guarantee UK data sovereignty
2. Cost Considerations
While Heroku provides managed services, the cost includes:
- Premium for managed platform (~30-40% over raw compute)
- Limited ability to optimize resource allocation
- Dyno pricing model less flexible than AWS instance types
- Add-on costs (PostgreSQL, Redis) with limited customization
3. Limited Control and Flexibility
Infrastructure Control:
- Cannot customize underlying OS or runtime environment
- Limited control over networking and security groups
- Restricted access to infrastructure-level monitoring
- Cannot implement custom security controls
Resource Optimization:
- Fixed dyno sizes with limited granularity
- Cannot right-size resources for specific workloads
- Limited ability to use spot instances or reserved capacity
- Cannot separate worker resources by type
Heroku Limitations:
- Shared infrastructure with potential noisy neighbor issues
- Limited database connection pooling options
- Router timeout constraints (30 seconds)
- Limited control over caching layers
- Cannot implement custom CDN configurations
4. AWS Service Integration
Current limitations for integrating with AWS services:
- External network calls to AWS services (S3, SES, etc.)
- Additional latency for AWS service integration
- Cannot use VPC peering or private networking
- Limited IAM role-based security
- Cannot leverage AWS-native monitoring and logging
Technical Analysis
Proposed Architecture: Multi-Tenant on AWS ECS
The new architecture migrates the multi-tenant application to AWS infrastructure with a fully containerized, role-based security model.
Architecture Overview
The following diagram illustrates the proposed AWS architecture:
graph TB
subgraph Internet
Users[Users/Clients]
end
subgraph "AWS EU-West-2 (London)"
subgraph "VPC"
subgraph "Public Subnets"
ALB[Application Load Balancer<br/>HTTPS:443]
NAT[NAT Gateway]
end
subgraph "Private Subnets"
subgraph "ECS Fargate Cluster"
Web[Web Tasks<br/>nginx + gunicorn<br/>Auto-scaling]
Worker[Celery Worker Tasks<br/>Auto-scaling]
Beat[Celery Beat Task<br/>Single instance]
end
RDS[(RDS PostgreSQL<br/>Multi-AZ<br/>Automated Backups)]
Cache[(ElastiCache Valkey<br/>Redis-compatible<br/>Cache & Results)]
end
end
SQS[Amazon SQS<br/>Celery Message Broker<br/>Task Queue]
S3[S3 Bucket<br/>Media Storage]
CW[CloudWatch<br/>Logs & Metrics]
SM[Secrets Manager<br/>Credentials]
end
Users -->|HTTPS| ALB
ALB -->|Routes traffic| Web
Web -->|IAM Role| S3
Web -->|Read/Write| RDS
Web -->|Cache/Sessions| Cache
Web -->|Send tasks| SQS
Web -->|Logs| CW
Web -->|Get secrets| SM
Worker -->|IAM Role| S3
Worker -->|Read/Write| RDS
Worker -->|Receive/Delete tasks| SQS
Worker -->|Store results| Cache
Worker -->|Logs| CW
Worker -->|Get secrets| SM
Beat -->|Send scheduled tasks| SQS
Beat -->|Read/Write| RDS
Beat -->|Logs| CW
Web -.->|Outbound via| NAT
Worker -.->|Outbound via| NAT
Beat -.->|Outbound via| NAT
classDef public fill:#e1f5ff,stroke:#01579b,stroke-width:2px
classDef private fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
classDef data fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef compute fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
classDef aws fill:#fff9c4,stroke:#f57f17,stroke-width:2px
class ALB,NAT public
class Web,Worker,Beat compute
class RDS,Cache data
class S3,CW,SM,SQS aws
Core Infrastructure Components:
Database Layer
- Amazon RDS for PostgreSQL in EU-West-2 (London)
- Multi-AZ deployment for high availability
- Automated backups retained in EU-West-2
- Point-in-time recovery capabilities
- All tenant data with row-level isolation
Cache Layer
- Amazon ElastiCache with Valkey (Redis-compatible) in EU-West-2
- Used for session storage, application caching, and Celery result backend
- Tenant-namespaced keys for data isolation
- High-performance in-memory data store
Message Queue Layer
- Amazon SQS (Simple Queue Service) in EU-West-2
- Celery message broker for task distribution
- Fully managed, serverless message queue
- No infrastructure to maintain or scale
- Automatic message retention and delivery
- Dead letter queue for failed tasks
- FIFO queues for task ordering if needed
- Cost-effective: Pay only for messages processed
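Wiring Celery to SQS and Valkey is mostly configuration. The snippet below is a hypothetical settings fragment, not the production config: the ElastiCache endpoint is a placeholder, and the timeout/polling values are illustrative. Leaving credentials out of the broker URL lets Celery's SQS transport fall back to the ambient IAM task role, consistent with the security model described later:

```python
# Hypothetical Celery settings for the SQS broker and Valkey
# result backend.
broker_url = "sqs://"  # no embedded credentials: use the ECS task IAM role
broker_transport_options = {
    "region": "eu-west-2",       # keep the queue in the London region
    "visibility_timeout": 3600,  # seconds before an unacked task reappears
    "polling_interval": 1,       # long-poll SQS roughly once per second
}
# Valkey is Redis-compatible, so the standard redis:// scheme applies.
result_backend = "redis://<elasticache-endpoint>:6379/0"
```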
Compute Layer - ECS Fargate
Three separate ECS task definitions running on Fargate:
Task Definition 1: Web Application
- Container: nginx + gunicorn
- Receives traffic from Application Load Balancer (ALB)
- Handles HTTPS requests routed by ALB
- Auto-scaling based on CPU/memory and request count
- Appropriate resources: ~0.5-1 vCPU, 1-2GB memory
- Multiple tasks for high availability and load distribution
- ALB distributes traffic across all healthy web tasks
Task Definition 2: Celery Worker
- Container: Celery worker process
- Consumes tasks from Amazon SQS queue
- Auto-scaling based on SQS queue depth (ApproximateNumberOfMessagesVisible) and CPU utilization
- Right-sized resources: ~0.25-0.5 vCPU, 0.5-1GB memory
- Can scale independently based on task backlog in SQS
Task Definition 3: Celery Beat
- Container: Celery Beat scheduler
- Manages periodic and scheduled tasks
- Publishes scheduled tasks to SQS queue
- Fixed scaling: Single task (Beat requires single instance)
- Minimal resources: ~0.25 vCPU, 0.5GB memory
- Auto-restart on failure
Rationale for Separate Task Definitions:
- Each workload has different resource requirements
- Independent scaling policies per service type
- Web scales with traffic, workers scale with queue depth
- Cost optimization: Right-size each workload separately
- Isolation: Issues in one service don’t affect others
Auto-Scaling Configuration:
ECS provides automatic scaling that adjusts the number of running tasks based on demand, with both scale-up and scale-down capabilities:
Web Tasks Auto-Scaling:
- Metrics: CPU utilization, memory utilization, ALB request count per target
- Scale-up triggers:
- CPU > 70% for 2 minutes → Add tasks
- Requests per task > 1000/min → Add tasks
- Scale-down triggers:
- CPU < 30% for 5 minutes → Remove tasks
- Requests per task < 200/min → Remove tasks
- Min/Max tasks: 2 minimum (HA), 10 maximum
- Benefits: Handles traffic spikes from multiple tenants, scales down during low usage to save costs
Celery Worker Auto-Scaling:
- Metrics: CPU utilization, SQS ApproximateNumberOfMessagesVisible (native CloudWatch metric)
- Scale-up triggers:
- SQS queue depth > 100 messages → Add workers
- CPU > 80% for 3 minutes → Add workers
- Scale-down triggers:
- SQS queue depth < 10 messages for 10 minutes → Remove workers
- CPU < 20% for 10 minutes → Remove workers
- Min/Max tasks: 1 minimum, 5 maximum
- Benefits: SQS provides native queue metrics for accurate scaling decisions; efficiently processes task backlog, reduces to minimum during idle periods
Celery Beat Scaling:
- Fixed at 1 task (Beat scheduler requires single instance)
- Auto-restart on failure for reliability
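The worker scaling rules above amount to a simple step function. The sketch below is a toy model of the policy (in reality CloudWatch alarms and ECS Application Auto Scaling evaluate these thresholds, and the sustained-duration checks are elided here):

```python
def worker_scaling_decision(queue_depth: int, current: int,
                            minimum: int = 1, maximum: int = 5) -> int:
    """Add a worker when the SQS backlog exceeds 100 messages; remove
    one when it falls below 10; otherwise hold steady."""
    if queue_depth > 100:
        return min(current + 1, maximum)
    if queue_depth < 10:
        return max(current - 1, minimum)
    return current

assert worker_scaling_decision(250, current=2) == 3   # backlog: scale up
assert worker_scaling_decision(5, current=2) == 1     # idle: scale down
assert worker_scaling_decision(50, current=2) == 2    # steady state
assert worker_scaling_decision(500, current=5) == 5   # capped at maximum
```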
Multi-Tenant Scaling Benefits:
Auto-scaling is particularly valuable for multi-tenant architecture:
- Unpredictable tenant activity: Different tenants have different usage patterns and peak times
- Cost efficiency: Automatically scales down during low-usage periods (nights, weekends)
- Spike handling: Automatically scales up when multiple tenants become active simultaneously
- Resource optimization: Pays only for resources actually needed at any given time
- Example scenario:
- During business hours (9am-5pm): 6-8 web tasks handle peak multi-tenant load
- During nights (11pm-6am): Scales down to 2 web tasks, saving ~$40-60/month
- Weekend spikes: Auto-scales up to handle unexpected tenant activity
Comparison to Heroku:
| Scaling Feature | Heroku | AWS ECS |
|---|---|---|
| Scale-up | Manual or via add-ons | Automatic based on metrics |
| Scale-down | Manual only | Automatic (saves costs) |
| Scaling metrics | Limited (response time, throughput) | Extensive (CPU, memory, custom CloudWatch metrics, ALB metrics, queue depth) |
| Per-service scaling | Requires multiple apps | Built-in per task definition |
| Cost during low usage | Fixed (pays for min dynos) | Dynamic (scales to minimum) |
| Multi-tenant optimization | Limited | Excellent - handles variable tenant load patterns |
Cost Impact:
- Scale-down capability can reduce compute costs by 40-50% during off-peak hours
- For multi-tenant with variable load, average monthly compute cost drops significantly
- Example: Instead of running 6 web tasks 24/7, average 4 tasks/hour = 33% cost reduction
Networking and Load Balancing
- Application Load Balancer (ALB) as entry point for all web traffic
- Sits in public subnets
- Terminates HTTPS/SSL connections
- Routes traffic to web task definition only
- Health checks on web tasks
- Automatically distributes load across multiple web task instances
- VPC with public and private subnets across multiple Availability Zones
- Private subnets for ECS tasks, RDS, and ElastiCache (no direct internet access)
- Public subnets for ALB only
- NAT Gateway for outbound internet access from private subnets
- Security groups for service-level network isolation
- ALB security group: Allow inbound 443 from internet
- Web task security group: Allow inbound from ALB only
- Worker task security groups: No inbound internet traffic
- RDS/ElastiCache security groups: Allow access from ECS tasks only
Security Model: Role-Based Authentication
Shift from IAM Users to IAM Roles:
- Current (Heroku): IAM user credentials stored as environment variables for AWS service access (S3, SES, etc.)
- Proposed (AWS): ECS task IAM roles with least-privilege permissions
IAM Role-Based Security Benefits:
- No stored credentials: Tasks assume roles automatically via ECS Task Role
- Dramatically reduced credential leakage risk: No long-lived access keys in environment variables or code
- Automatic credential rotation: AWS STS provides temporary credentials (auto-expire and rotate)
- Least-privilege access: Each task definition gets only required permissions
- Audit trail: CloudTrail logs all role assumption and service access
- Infrastructure-as-code: IAM roles defined in CloudFormation templates
- Centralized security: All permissions defined and version-controlled
Example Security Architecture:
- Web task role: Read S3 (media), write CloudWatch Logs
- Celery worker task role: Read/Write S3, SES send email, SQS receive/delete messages, CloudWatch Logs
- Celery Beat task role: SQS send messages, CloudWatch Logs only
- RDS access: PostgreSQL username/password from Secrets Manager (accessed via IAM role)
- SQS access: Fully controlled via IAM roles (no credentials needed)
- No IAM user access keys anywhere in the system
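As an illustration of the least-privilege principle, the Celery Beat role above could be expressed as a policy like this. The statement structure follows the standard IAM policy format, but the queue name, log group, and account ID are placeholders invented for the example:

```python
import json

# Hypothetical least-privilege policy for the Celery Beat task role:
# it may only enqueue scheduled tasks and write its own logs.
beat_task_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sqs:SendMessage", "sqs:GetQueueUrl"],
            "Resource": "arn:aws:sqs:eu-west-2:ACCOUNT_ID:viitata-tasks",
        },
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "arn:aws:logs:eu-west-2:ACCOUNT_ID:log-group:/ecs/viitata-beat:*",
        },
    ],
}

print(json.dumps(beat_task_policy, indent=2))
```

Note what is absent: no S3 access, no SQS receive/delete, no database permissions. Each task definition's role is scoped to exactly the services that service touches.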
Additional AWS Services
- Amazon SQS for Celery message broker (fully managed queue)
- CloudWatch Logs for centralized logging
- CloudWatch Metrics for monitoring and alerting (includes native SQS metrics)
- AWS Secrets Manager for database credentials and API keys
- S3 for media file storage (UK region)
- CloudFormation for infrastructure-as-code deployment
- ECR (Elastic Container Registry) for Docker image storage
Region and Compliance:
- All resources in EU-West-2 (London, UK)
- No data transfer outside UK jurisdiction
- Estimated cost: ~$90-115/month (12-30% savings, includes SQS)
Connectivity Flow
Internet (HTTPS:443)
↓
Application Load Balancer (Public Subnet)
↓ (Routes to web tasks only)
ECS Web Tasks (nginx + gunicorn) (Private Subnet)
↓
├─→ RDS PostgreSQL (Private Subnet)
├─→ ElastiCache Valkey (Private Subnet - cache/sessions)
├─→ Amazon SQS (Send tasks via IAM role)
└─→ S3 (via IAM role)
ECS Celery Workers (Private Subnet)
↓
├─→ RDS PostgreSQL (Private Subnet)
├─→ ElastiCache Valkey (Private Subnet - result backend)
├─→ Amazon SQS (Receive/Delete tasks via IAM role)
└─→ S3 (via IAM role)
ECS Celery Beat (Private Subnet)
↓
├─→ RDS PostgreSQL (Private Subnet)
└─→ Amazon SQS (Send scheduled tasks via IAM role)
Key Points:
- Only web tasks receive traffic from ALB - Workers have no inbound internet traffic
- All three task definitions connect to RDS
- Celery uses Amazon SQS as message broker - fully managed, serverless queue
- ElastiCache Valkey used for caching, sessions, and Celery result backend
- All ECS tasks connect to AWS services (S3, SQS, CloudWatch) via IAM roles
- No stored credentials required anywhere
- SQS provides native CloudWatch metrics for auto-scaling
Infrastructure-as-Code
The entire architecture is defined in CloudFormation templates:
- VPC, subnets, route tables, security groups
- RDS database configuration
- ElastiCache cluster configuration
- SQS queues (standard and dead letter queues)
- ECS cluster, task definitions, and services
- Application Load Balancer and target groups
- IAM roles and policies
- CloudWatch alarms and dashboards
Benefits:
- Version-controlled infrastructure
- Reproducible environments (staging = production)
- No manual configuration or drift
- Peer-reviewed infrastructure changes via Git
- Disaster recovery: Rebuild from templates
CI/CD Pipeline
GitHub Actions Workflow:
The deployment pipeline is automated via GitHub Actions, providing consistent and reliable deployments:
graph LR
A[Code Push to GitHub] --> B[GitHub Actions Triggered]
B --> C[Build Docker Image]
C --> D[Push to ECR]
D --> E[Update ECS Task Definitions]
E --> F[Deploy Web Tasks]
E --> G[Deploy Celery Workers]
E --> H[Deploy Celery Beat]
style A fill:#e8f5e9
style B fill:#fff3e0
style C fill:#e1f5ff
style D fill:#f3e5f5
style E fill:#fff9c4
style F fill:#e8f5e9
style G fill:#e8f5e9
style H fill:#e8f5e9
Deployment Process:
- Trigger: Code pushed to main branch or pull request merged
- Build: GitHub Actions workflow executes
- Builds single Docker image containing application code
- Runs tests (optional: can block deployment on failure)
- Tags image with commit SHA and/or semantic version
- Push to ECR: Docker image pushed to Elastic Container Registry in EU-West-2
- ECR provides secure, private Docker registry
- Images stored in same region as deployment (UK)
- Automatic image scanning for vulnerabilities (optional)
- Update Task Definitions: GitHub Actions updates ECS task definitions
- All three task definitions reference the same Docker image
- Only the container command/entrypoint differs per service:
  - Web: gunicorn command
  - Celery Worker: celery worker command
  - Celery Beat: celery beat command
- Deploy: ECS performs rolling updates
- Web tasks: Rolling deployment with health checks via ALB
- Celery Workers: Rolling update, new tasks pick up from queue
- Celery Beat: Stop old task, start new task (single instance)
Single Image, Multiple Services:
All three ECS task definitions use the same Docker image from ECR. The service type is determined by the command executed:
```yaml
# Example task definition differences
Web Task Definition:
  Image: <ECR_URI>:latest
  Command: ["gunicorn", "app.wsgi:application"]

Celery Worker Task Definition:
  Image: <ECR_URI>:latest   # Same image!
  Command: ["celery", "-A", "app", "worker"]

Celery Beat Task Definition:
  Image: <ECR_URI>:latest   # Same image!
  Command: ["celery", "-A", "app", "beat"]
```
Benefits:
- Single build: One Docker image for all services (faster builds)
- Consistency: All services run identical application code
- Simplified versioning: Single image tag tracks deployment
- Reduced storage: ECR stores one image instead of three
- Atomic deployments: All services deployed from same code version
Comparison to Heroku:
| Aspect | Heroku | AWS ECS |
|---|---|---|
| Deployment trigger | git push heroku | GitHub Actions workflow |
| Build process | Heroku buildpacks | Docker image build |
| Artifact storage | Heroku slug storage | ECR (version-controlled) |
| Deployment control | Limited (auto-deploy) | Full control (approval gates, rollback) |
| Multi-service | Separate apps or Procfile | Task definitions with same image |
| Rollback | heroku releases:rollback | ECS task definition revision or redeploy previous image tag |
Additional CI/CD Capabilities:
- Environment-specific deployments: Separate workflows for staging and production
- Approval gates: Require manual approval before production deployment
- Automated testing: Run integration tests against staging before production
- Blue-green deployments: Deploy new version alongside old, switch traffic
- Canary deployments: Gradually shift traffic to new version
- Automated rollback: Detect failures via CloudWatch alarms and auto-rollback
Cost Analysis
| Component | Heroku (Current) | AWS (Proposed) | Notes |
|---|---|---|---|
| Web Application | ~$25-50 | ~$20-35 | ECS Fargate or EC2 instances |
| Celery Worker | ~$25-50 | ~$20-35 | Right-sized for workload |
| Celery Beat | ~$25-50 | ~$10-15 | Smaller instance for scheduler |
| PostgreSQL | ~$15-20 | ~$20-25 | RDS with backups in UK |
| Redis Cache | ~$15-20 | ~$10-15 | ElastiCache (cache + result backend) |
| SQS Message Queue | Included in dyno | ~$1-3 | Pay per million requests, negligible cost |
| Load Balancer | Included | ~$15-20 | ALB costs |
| Total | ~$130/month | ~$90-110/month | ~15-30% savings |
Cost Optimization Opportunities:
- Auto-scaling with scale-down: Automatically reduce running tasks during low-usage periods (40-50% compute savings during off-peak)
- Reserved instances for baseline workloads (up to 50% additional savings)
- Spot instances for Celery workers (up to 70% savings on compute)
- S3 storage tiers for media files
- CloudWatch log retention policies
- Right-sizing based on actual usage patterns
Multi-Tenant Auto-Scaling Impact:
The auto-scaling capability is particularly valuable for multi-tenant architecture where tenant usage patterns vary throughout the day. Instead of paying for peak capacity 24/7 (as with Heroku), ECS automatically scales down during low-usage periods, significantly reducing average compute costs.
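As an illustration of this scale-down behavior, the desired worker count can be derived from queue depth. The sketch below is an assumption-laden model of the scaling math, not AWS code: the backlog-per-worker target and the min/max bounds are tuning parameters chosen for the example.

```python
import math


def desired_worker_count(visible_messages: int,
                         backlog_per_worker: int = 100,
                         min_workers: int = 1,
                         max_workers: int = 10) -> int:
    """Translate SQS queue depth into an ECS worker task count.

    visible_messages: the CloudWatch ApproximateNumberOfMessagesVisible value.
    backlog_per_worker: target messages each worker should absorb
    (a tuning assumption, not an AWS default).
    """
    if visible_messages <= 0:
        return min_workers  # idle queue: scale down to the floor
    needed = math.ceil(visible_messages / backlog_per_worker)
    return max(min_workers, min(max_workers, needed))
```

During a quiet overnight period the queue is empty and the cluster runs the single-task floor; a burst of tenant activity raises the backlog and the count climbs toward the ceiling, which is the behavior a target-tracking scaling policy automates.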
Important Note: Cost savings are secondary to compliance requirements. Even if costs were equivalent, the migration would be necessary for UK data residency.
Benefits
1. Compliance and Data Sovereignty (Primary Benefit)
- UK data residency: All infrastructure and data in EU-West-2 (London)
- Backup compliance: Database backups remain in UK
- Audit trail: Full control and visibility over data location
- Regulatory compliance: Meets UK data protection requirements
- Client confidence: Can guarantee UK-only data storage
- Reduced legal risk: Eliminates cross-border data transfer concerns
2. Cost Efficiency
- 15-30% immediate cost reduction (~$130 → ~$90-110/month)
- Automatic scale-down during low usage: 40-50% additional compute savings during off-peak hours
- Multi-tenant load optimization: Auto-scaling handles variable tenant usage patterns efficiently
- Additional savings opportunities with reserved/spot instances
- More granular resource allocation (no paying for unused capacity)
- Flexible pricing models (on-demand, reserved, spot)
- Pay only for actual resource consumption, not fixed capacity
3. Greater Control and Flexibility
Infrastructure Control:
- Full control over container images and runtime environment
- Custom networking and security group configuration
- Direct access to infrastructure-level metrics
- Ability to implement custom security controls
- VPC configuration for network isolation
Operational Flexibility:
- Choose instance types optimized for workload
- Separate scaling policies per service
- Custom monitoring and alerting
- Advanced deployment strategies (blue/green, canary)
Infrastructure-as-Code:
- 100% version-controlled infrastructure using CloudFormation templates
- Entire infrastructure stack defined as code (VPC, ECS, RDS, ElastiCache, ALB, etc.)
- Git-based workflow for infrastructure changes (review, approve, deploy)
- Reproducible environments (staging matches production exactly)
- Disaster recovery: Rebuild entire infrastructure from templates
- Change tracking and audit trail for infrastructure modifications
- Team collaboration on infrastructure changes via pull requests
- No manual ClickOps or undocumented configuration drift
Heroku Limitation: On Heroku, infrastructure is configured via web UI or CLI commands that aren’t easily version-controlled. App configuration can be tracked, but the underlying platform infrastructure (databases, dynos, add-ons) requires manual provisioning and documentation.
4. AWS Performance Advantages
- Dedicated compute resources (no noisy neighbors)
- Auto-scaling for traffic spikes: Automatically adds capacity when multiple tenants become active
- Dynamic resource allocation: Scales up/down based on actual demand (CPU, memory, request count, queue depth)
- Advanced connection pooling (RDS Proxy)
- No 30-second timeout constraints
- Custom CDN configuration (CloudFront)
- Private networking between services (reduced latency)
- Better database performance tuning options
- Independent scaling per service type (web vs. workers)
5. AWS Service Ecosystem
Native Integration:
- Amazon SQS for Celery message broker: Fully managed, serverless queue with no infrastructure to maintain
- S3 for media and static file storage (same region)
- SES for email services
- Lambda for serverless functions
- CloudWatch for comprehensive monitoring (includes native SQS metrics for auto-scaling)
- AWS Secrets Manager for credential management
- IAM roles for secure, passwordless service access
- VPC endpoints for private AWS service access
SQS-Specific Benefits:
- Zero infrastructure management: No Redis/ElastiCache broker to maintain or scale
- Native CloudWatch metrics: ApproximateNumberOfMessagesVisible metric for accurate worker auto-scaling
- Reliability: Built-in message durability and delivery guarantees
- Dead letter queues: Automatic handling of failed tasks
- Cost-effective: Pay only for messages processed (~$0.40 per million requests)
- Unlimited scalability: No capacity planning required
- IAM-based security: No broker credentials to manage
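Tying the SQS metric to worker scaling can be sketched as a CloudFormation target-tracking policy like the one below. The resource names, queue name, and target value are illustrative assumptions, not values from the actual stack.

```yaml
WorkerScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: celery-worker-backlog-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref WorkerScalableTarget   # placeholder scalable target
    TargetTrackingScalingPolicyConfiguration:
      TargetValue: 100          # target backlog per task (tuning assumption)
      CustomizedMetricSpecification:
        MetricName: ApproximateNumberOfMessagesVisible
        Namespace: AWS/SQS
        Statistic: Average
        Dimensions:
          - Name: QueueName
            Value: celery-task-queue   # placeholder queue name
```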
Technical Considerations
Containerization
ECS Container Setup:
- Dockerize Django application (web, Celery, Beat)
- Use ECR (Elastic Container Registry) for image storage
- Multi-stage builds for optimized image sizes
- Health checks for container orchestration
- Environment-based configuration
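The health checks mentioned above need an endpoint for the ALB target group to probe. A minimal sketch, assuming a `/health` path convention (the real application would expose this as a Django view served by gunicorn):

```python
def app(environ, start_response):
    """Minimal WSGI application: /health returns 200 for ALB
    target-group checks; any other path returns 404."""
    if environ.get("PATH_INFO") == "/health":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

A deeper check might also ping the database and cache, so that ECS replaces tasks whose dependencies are unreachable rather than just tasks whose process has died.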
Database Migration
RDS PostgreSQL:
- Similar to Heroku Postgres (based on PostgreSQL)
- Enhanced monitoring and performance insights
- Automated backups with configurable retention
- Multi-AZ deployment for high availability
- Parameter groups for fine-tuned configuration
- RDS Proxy for connection pooling
Networking and Security
VPC Configuration:
- Private subnets for database and cache
- Public subnets for load balancer
- Security groups for service isolation
- NAT Gateway for outbound traffic from private subnets
- VPC endpoints for AWS services (S3, Secrets Manager)
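Service isolation via security groups can be expressed in CloudFormation roughly as follows; the resource names are placeholders for this sketch. The point is that the database accepts connections only from the ECS task security group, never from arbitrary addresses.

```yaml
DatabaseSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Allow PostgreSQL only from ECS tasks
    VpcId: !Ref AppVpc                 # placeholder VPC reference
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 5432
        ToPort: 5432
        SourceSecurityGroupId: !Ref EcsTaskSecurityGroup  # placeholder
```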
Security Enhancements:
- IAM roles instead of static credentials
- Secrets Manager for sensitive configuration
- AWS WAF (Web Application Firewall) for web protection
- CloudTrail for audit logging
- Encryption at rest and in transit
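In practice, Secrets Manager values are injected into containers at launch rather than baked into the image or stored as plain environment variables. A sketch of the relevant task-definition fragment (the secret ARN and names are placeholders):

```yaml
ContainerDefinitions:
  - Name: web
    Image: <ECR_URI>:latest
    Secrets:
      - Name: DATABASE_URL   # surfaced to the app as an env var
        ValueFrom: arn:aws:secretsmanager:eu-west-2:123456789012:secret:app/database-url
```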
Monitoring and Observability
CloudWatch Integration:
- Application logs aggregation
- Custom metrics and dashboards
- Alerting for critical events
- Performance monitoring
- Cost tracking and budgets
X-Ray (Optional):
- Distributed tracing
- Performance bottleneck identification
- Request flow visualization
High Availability
AWS HA Features:
- Multi-AZ RDS deployment
- ECS service auto-recovery
- Application Load Balancer health checks
- Automated backups and snapshots
- Cross-AZ redundancy
Trade-offs
Advantages
- UK data residency and compliance (eliminates critical blocker)
- Cost savings (15-30%)
- 100% infrastructure-as-code with CloudFormation (version control, reproducibility, no drift)
- Greater infrastructure control
- Better performance and scalability
- AWS service ecosystem integration
- More flexible resource allocation
- Enhanced security capabilities
- Better monitoring and observability
Disadvantages
- Increased operational responsibility (less managed than Heroku)
- Steeper learning curve for AWS services
- More complex infrastructure management
- Need for AWS expertise in team
- Infrastructure-as-code maintenance overhead
- More components to monitor and maintain
- Deployment complexity (ECS vs. git push heroku)
Risk Mitigation
- Training and documentation: Invest in AWS training for team
- Infrastructure-as-code: Use Terraform or CloudFormation for reproducibility
- Gradual migration: Test thoroughly in staging environment
- Monitoring from day one: Comprehensive CloudWatch setup before go-live
- AWS support plan: Consider AWS Business Support for expert guidance
- Runbooks and procedures: Document common operational tasks
- Disaster recovery testing: Regular restore testing from backups
Compliance Requirements
UK Data Residency
Requirements Met:
- All compute resources in EU-West-2 (London)
- Database and backups in EU-West-2 (London)
- Redis cache in EU-West-2 (London)
- Application logs in EU-West-2 (CloudWatch Logs)
- S3 storage in EU-West-2 (if used)
Audit Trail:
- CloudTrail logs all API calls and data access
- Resource tagging for compliance tracking
- IAM policies enforce regional restrictions
- AWS Organizations for governance controls
Data Protection Compliance
GDPR/UK GDPR Alignment:
- Data processing within UK jurisdiction
- No cross-border data transfers to USA
- Right to erasure (tenant deletion capabilities)
- Data encryption at rest and in transit
- Audit logs for data access
Conclusion
The migration from Heroku to AWS ECS represents a strategic necessity for Viitata, driven primarily by UK data residency and compliance requirements. The current Heroku architecture creates unacceptable compliance risk due to database backups being retained in the USA, making it impossible to guarantee UK-only data storage.
This migration:
- Resolves critical compliance issue: UK data residency with backups in EU-West-2 (London)
- Eliminates USA data transfer: All data and backups remain in UK
- Cost savings: 15-30% base reduction + 40-50% additional savings from auto-scaling during off-peak
- Auto-scaling with scale-down: Handles multi-tenant traffic spikes automatically while reducing costs during low-usage periods
- 100% version-controlled infrastructure: CloudFormation templates for entire stack, eliminating manual configuration and drift
- Greater operational control: Full infrastructure flexibility and customization
- Performance improvements: Dedicated resources, auto-scaling for spikes, and advanced AWS features
- AWS service ecosystem: Native integration with SQS (message broker), S3, CloudWatch, IAM, and other services
- Serverless message queue: SQS eliminates need to manage message broker infrastructure
- Enhanced security: VPC isolation, IAM roles, Secrets Manager, and encryption
While AWS requires greater operational expertise compared to Heroku’s managed platform, the compliance requirements make this migration essential. The additional benefits of cost savings, performance improvements, and operational flexibility provide further justification beyond the compliance imperative.
Without this migration, Viitata cannot serve clients with UK data residency requirements and remains exposed to compliance risks associated with cross-border backup storage.
References