Day 9 - II : Deploying a Real FastAPI App on AWS with Terraform Modules


The module from Day 9 runs a toy Python HTTP server. Let's replace it with a real FastAPI backend — and fix the module gaps that only show up when you move past Hello World.

The versioned module setup from Day 9 works. But the app it runs is embarrassing:

nohup python3 -m http.server 8080 &

That's not a web service — it's a file browser. There's no routing, no JSON, no health endpoint worth monitoring, and if it crashes you have no idea until the ALB starts returning 502s.

This post replaces it with a real FastAPI application: structured endpoints, a proper /health check, and a systemd service that auto-restarts on failure. Along the way I'll fix a gap in the module itself — right now user_data is baked in, which means the module only works for one specific app. That has to change.

Why This Matters in the Industry

Real teams don't deploy "Hello World." They deploy APIs — user services, product catalogs, internal tools — and the infrastructure that runs them has to handle a few things the toy example skips entirely:

Startup time. Installing Python packages takes 20–40 seconds. The ALB health check doesn't know or care — it starts probing immediately. If your health check is too aggressive, the instance gets marked unhealthy before the app is even running and the ASG replaces it. That loop never ends.
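As a sketch of where those timing knobs live — the resource shape below follows the AWS provider's aws_lb_target_group, and the values are illustrative, not the module's actual settings — the probe cadence is what decides how aggressive the health check is:

```hcl
resource "aws_lb_target_group" "web" {
  name     = "web-app-dev"
  port     = 8000
  protocol = "HTTP"
  vpc_id   = data.aws_vpc.default.id

  health_check {
    path                = "/health"
    interval            = 15   # probe every 15 seconds
    healthy_threshold   = 2    # 2 consecutive passes => healthy (~30s best case)
    unhealthy_threshold = 3    # 3 consecutive failures => unhealthy (~45s)
    matcher             = "200"
  }
}
```

With a 15-second interval and a 3-failure threshold, an instance that needs 40 seconds to install packages is already two failures deep before the app even starts — which is exactly why the grace period in the ASG matters.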

Process management. nohup ... & backgrounded processes don't restart if they crash. A production service needs something watching it. On Amazon Linux 2, that something is systemd.

A real health endpoint. Returning 200 OK from the root path tells the ALB the server is reachable. Returning 200 from /health with instance metadata tells you it's the right server, running the right code.

Getting these right is not optional. They're the difference between a deployment that works and one that silently misbehaves.

Prerequisites: Remote State Backend (One-Time Setup)

Before running terraform init, the S3 bucket and DynamoDB table for remote state must already exist. Terraform cannot create its own backend — if the bucket isn't there, init fails before any resources are evaluated.

This is a one-time setup per AWS account. If you went through the Day 9 setup, these already exist. If you're starting fresh, create them now with the AWS CLI:

# Create the bucket — names are globally unique, pick one tied to your account
aws s3api create-bucket \
  --bucket mnourdine-tf-state \
  --region us-east-1

# Enable versioning so you can recover a previous state if an apply goes wrong
aws s3api put-bucket-versioning \
  --bucket mnourdine-tf-state \
  --versioning-configuration Status=Enabled

# Encrypt state at rest — state files can contain secrets (DB passwords, tokens, etc.)
aws s3api put-bucket-encryption \
  --bucket mnourdine-tf-state \
  --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Block all public access
aws s3api put-public-access-block \
  --bucket mnourdine-tf-state \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

# Create the DynamoDB table for state locking
# LockID is the required key — Terraform writes to it when acquiring a lock
aws dynamodb create-table \
  --table-name terraform-state-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

Once these exist, every environment's terraform init can reference the same bucket and table. The state files are isolated by the key path inside the bucket — dev/web-app/terraform.tfstate, staging/web-app/terraform.tfstate, etc.
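For reference, the matching backend block each environment would carry looks like this — bucket and table names taken from the commands above, with only the key changing per environment:

```hcl
terraform {
  backend "s3" {
    bucket         = "mnourdine-tf-state"
    key            = "dev/web-app/terraform.tfstate"  # staging/..., prod/... elsewhere
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
```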

The Problem with Hardcoded user_data

The current module bakes the app startup directly into main.tf:

user_data = base64encode(<<-EOF
  #!/bin/bash
  mkdir -p /var/www/html
  echo "Hello from ${var.environment}" > /var/www/html/index.html
  cd /var/www/html && nohup python3 -m http.server ${var.server_port} &
EOF
)

This makes the module useless for anything else. A module should describe how to run an instance — not what app to run on it. Those are different concerns and they belong in different places.

The fix is a user_data input variable. The module handles the infrastructure. The caller handles the application.

Step 1: Update the Module — Add a user_data Variable

This is module v1.2.0. Two changes from v1.1.0: user_data moves out of main.tf into variables.tf, and the ASG health check grace period becomes configurable.

modules/web-app/variables.tf — add this variable

variable "user_data" {
  description = "Shell script to run on instance launch. Installs and starts the application."
  type        = string
  sensitive   = true
}

modules/web-app/main.tf — update the launch template

resource "aws_launch_template" "web" {
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = var.instance_type

  vpc_security_group_ids = [aws_security_group.instance.id]

  # Caller provides the script — module doesn't care what app runs
  user_data = base64encode(var.user_data)

  lifecycle {
    create_before_destroy = true
  }
}

Also add health_check_grace_period to the ASG. This is how long the ASG waits before it starts trusting health check results on a new instance. The default is 300 seconds — which sounds like a lot, but if your startup script installs packages from the internet, it can take longer on a cold instance. Setting it explicitly makes the behavior predictable and keeps it in code.

modules/web-app/main.tf — update the ASG

resource "aws_autoscaling_group" "web" {
  min_size                  = var.min_size
  max_size                  = var.max_size
  desired_capacity          = var.min_size
  health_check_grace_period = var.health_check_grace_period

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  vpc_zone_identifier = data.aws_subnets.default.ids
  target_group_arns   = [aws_lb_target_group.web.arn]
  health_check_type   = "ELB"

  tag {
    key                 = "Name"
    value               = "${local.name_prefix}-web"
    propagate_at_launch = true
  }
}

modules/web-app/variables.tf — add grace period variable

variable "health_check_grace_period" {
  description = "Seconds the ASG waits before checking health on a new instance. Set high enough to cover your startup script."
  type        = number
  default     = 300
}

Tag and push:

git add .
git commit -m "feat: accept user_data as input variable, expose health_check_grace_period"
git tag v1.2.0
git push origin main --tags

Step 2: The FastAPI Application

Here's the application. It's small enough to read in two minutes, realistic enough to be useful as a starting point.

# main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import socket
import datetime

app = FastAPI(title="Items API", version="1.0.0")


class Item(BaseModel):
    id: int
    name: str
    price: float


# In-memory store — fine for a demo.
# In a real service this would be a database call.
_items: List[Item] = [
    Item(id=1, name="Widget", price=9.99),
    Item(id=2, name="Gadget", price=24.99),
    Item(id=3, name="Doohickey", price=4.99),
]


@app.get("/health")
def health():
    """
    Health check endpoint for the ALB.
    Returns the hostname so you can verify which instance responded.
    """
    return {
        "status": "healthy",
        "hostname": socket.gethostname(),
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    }


@app.get("/items", response_model=List[Item])
def list_items():
    return _items


@app.get("/items/{item_id}", response_model=Item)
def get_item(item_id: int):
    for item in _items:
        if item.id == item_id:
            return item
    raise HTTPException(status_code=404, detail=f"Item {item_id} not found")

Three endpoints:

Endpoint           What it does
GET /health        ALB health check target. Returns hostname + timestamp.
GET /items         Returns the full item list as JSON.
GET /items/{id}    Returns a single item, or 404 if not found.
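The 404 branch is the only real logic in the app. Stripped of FastAPI, it's a linear scan — fine for three items; a dict keyed by id is the usual move once the list grows. A plain-Python sketch (no framework, dataclasses standing in for the pydantic model):

```python
from dataclasses import dataclass

@dataclass
class Item:
    id: int
    name: str
    price: float

# dict keyed by id gives O(1) lookup instead of the linear scan in the handler
_items = {i.id: i for i in [
    Item(1, "Widget", 9.99),
    Item(2, "Gadget", 24.99),
    Item(3, "Doohickey", 4.99),
]}

def get_item(item_id: int) -> Item:
    try:
        return _items[item_id]
    except KeyError:
        # in the FastAPI version this becomes HTTPException(status_code=404)
        raise LookupError(f"Item {item_id} not found")

print(get_item(2).name)  # → Gadget
```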

Step 3: The user_data Script

This is the script that runs on each EC2 instance at launch. It installs the app and registers it as a systemd service so it auto-restarts on crash or reboot.

#!/bin/bash
set -e  # exit immediately if any command fails

# ── System packages ──────────────────────────────────────────────────────────
yum update -y
yum install -y python3 python3-pip

# ── Python dependencies ──────────────────────────────────────────────────────
pip3 install fastapi "uvicorn[standard]" pydantic

# ── Application ──────────────────────────────────────────────────────────────
mkdir -p /opt/api

cat > /opt/api/main.py << 'PYEOF'
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import socket
import datetime

app = FastAPI(title="Items API", version="1.0.0")

class Item(BaseModel):
    id: int
    name: str
    price: float

_items = [
    Item(id=1, name="Widget", price=9.99),
    Item(id=2, name="Gadget", price=24.99),
    Item(id=3, name="Doohickey", price=4.99),
]

@app.get("/health")
def health():
    return {
        "status": "healthy",
        "hostname": socket.gethostname(),
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    }

@app.get("/items", response_model=List[Item])
def list_items():
    return _items

@app.get("/items/{item_id}", response_model=Item)
def get_item(item_id: int):
    for item in _items:
        if item.id == item_id:
            return item
    raise HTTPException(status_code=404, detail=f"Item {item_id} not found")
PYEOF

# ── systemd service ──────────────────────────────────────────────────────────
# Using systemd instead of `nohup ... &` means:
# - the process restarts automatically if it crashes
# - it starts on reboot
# - logs go to journald (readable with: journalctl -u api -f)
cat > /etc/systemd/system/api.service << 'EOF'
[Unit]
Description=FastAPI Items API
After=network.target

[Service]
User=ec2-user
WorkingDirectory=/opt/api
ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable api
systemctl start api

Two things worth noting here:

set -e at the top means the script stops immediately if any command fails. If pip3 install fails because of a network issue, the instance doesn't come up half-configured. Without it, later commands can silently run against a broken environment.
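A two-line demo of the difference (not from the post, just bash behavior):

```shell
#!/bin/bash
# Without set -e, execution continues past a failing command:
bash -c 'false; echo "continued"'           # prints "continued"

# With set -e, the shell aborts at the first failure and the echo never runs:
bash -c 'set -e; false; echo "continued"'   # prints nothing, exits 1
echo "exit code: $?"                        # → exit code: 1
```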

The systemd service runs as ec2-user rather than root. This is a minimal precaution: if the app has a vulnerability, the blast radius is limited to what ec2-user can access.

Step 4: The Infrastructure Repo — Calling the Updated Module

dev/main.tf

provider "aws" {
  region = "us-east-1"
}

locals {
  fastapi_user_data = <<-EOF
    #!/bin/bash
    set -e

    yum update -y
    yum install -y python3 python3-pip
    pip3 install fastapi "uvicorn[standard]" pydantic

    mkdir -p /opt/api

    cat > /opt/api/main.py << 'PYEOF'
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel
    from typing import List
    import socket, datetime

    app = FastAPI(title="Items API", version="1.0.0")

    class Item(BaseModel):
        id: int
        name: str
        price: float

    _items = [
        Item(id=1, name="Widget", price=9.99),
        Item(id=2, name="Gadget", price=24.99),
        Item(id=3, name="Doohickey", price=4.99),
    ]

    @app.get("/health")
    def health():
        return {"status": "healthy", "hostname": socket.gethostname(),
                "timestamp": datetime.datetime.utcnow().isoformat() + "Z"}

    @app.get("/items")
    def list_items():
        return _items

    @app.get("/items/{item_id}")
    def get_item(item_id: int):
        for item in _items:
            if item.id == item_id:
                return item
        raise HTTPException(status_code=404, detail=f"Item {item_id} not found")
    PYEOF

    cat > /etc/systemd/system/api.service << 'SVCEOF'
    [Unit]
    Description=FastAPI Items API
    After=network.target
    [Service]
    User=ec2-user
    WorkingDirectory=/opt/api
    ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8000
    Restart=always
    RestartSec=5
    [Install]
    WantedBy=multi-user.target
    SVCEOF

    systemctl daemon-reload
    systemctl enable api
    systemctl start api
  EOF
}
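Embedding ~60 lines of script in a heredoc works, but it gets unwieldy as the script grows. One alternative — sketched here with an illustrative file name, not part of this repo — is Terraform's built-in templatefile() function, which keeps the script in its own file with syntax highlighting and lets you interpolate variables:

```hcl
locals {
  # userdata.sh.tpl would live next to main.tf; ${server_port} inside it
  # gets substituted by templatefile() at plan time
  fastapi_user_data = templatefile("${path.module}/userdata.sh.tpl", {
    server_port = 8000
  })
}
```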

module "web_app" {
  source = "git::https://github.com/mohamednourdine/terraform-modules.git//modules/web-app?ref=v1.2.0"

  environment               = var.environment
  instance_type             = var.instance_type
  min_size                  = var.min_size
  max_size                  = var.max_size
  server_port               = 8000          # uvicorn default
  health_check_path         = "/health"     # FastAPI endpoint, not "/"
  health_check_grace_period = 360           # gives the startup script time to finish
  user_data                 = local.fastapi_user_data
}

output "url" {
  value       = "http://${module.web_app.alb_dns_name}"
  description = "Base URL of the API"
}

dev/terraform.tfvars

environment   = "dev"
instance_type = "t2.micro"
min_size      = 1
max_size      = 2

dev/variables.tf

variable "environment"   { type = string }
variable "instance_type" { type = string }
variable "min_size"      { type = number }
variable "max_size"      { type = number }

Staging and prod use the same main.tf (different tfvars) — the script is identical across environments. That's the point: same code, different scale.

staging/terraform.tfvars

environment   = "staging"
instance_type = "t2.small"
min_size      = 1
max_size      = 3

prod/terraform.tfvars

environment   = "prod"
instance_type = "t3.small"
min_size      = 2
max_size      = 6

Step 5: Deploy and Test

cd dev/
terraform init -upgrade   # pulls module v1.2.0
terraform apply -var-file="terraform.tfvars"

Wait for the ALB to finish health checks — this takes 2–3 minutes after apply completes. The startup script is still running on each instance during that window. Watch the Target Group in the AWS console: instances move from initial to healthy once the health check passes.

Once healthy, test the endpoints:

BASE="http://web-app-dev-alb-xxxxxxxx.us-east-1.elb.amazonaws.com"

# Health check — also shows which instance responded
curl -s $BASE/health | python3 -m json.tool
{
    "status": "healthy",
    "hostname": "ip-172-31-24-87.ec2.internal",
    "timestamp": "2026-04-16T09:42:11Z"
}
# Full item list
curl -s $BASE/items | python3 -m json.tool
[
    {"id": 1, "name": "Widget", "price": 9.99},
    {"id": 2, "name": "Gadget", "price": 24.99},
    {"id": 3, "name": "Doohickey", "price": 4.99}
]
# Single item
curl -s $BASE/items/2 | python3 -m json.tool
{"id": 2, "name": "Gadget", "price": 24.99}
# 404 response
curl -s $BASE/items/99 | python3 -m json.tool
{"detail": "Item 99 not found"}

Run the health check a few times — you'll see the hostname rotate between instances as the ALB load balances across them. That's the ASG working correctly.
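Eyeballing the rotation works; counting it is nicer. A small sketch (not from the post) that tallies which hostnames answer /health — the fetch function is injectable so the logic runs without a live ALB; swap in urllib.request.urlopen for real use:

```python
import json
from collections import Counter

def count_instances(fetch, url, n=10):
    """Call the /health endpoint n times and tally responding hostnames."""
    hosts = Counter()
    for _ in range(n):
        body = json.loads(fetch(url))
        hosts[body["hostname"]] += 1
    return hosts

# Fake fetch alternating between two instances, purely for illustration:
responses = iter([json.dumps({"hostname": h, "status": "healthy"})
                  for h in ["ip-a", "ip-b"] * 5])
tally = count_instances(lambda _: next(responses), "http://alb/health")
print(dict(tally))  # → {'ip-a': 5, 'ip-b': 5}
```

An even split across instances is what a healthy round-robin ALB should produce; one hostname dominating usually means the other target is unhealthy or deregistering.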

Bugs I Hit During Testing (And the Fixes)

After running terraform apply, all targets in the ALB target group showed Unhealthy and EC2 Instance Connect failed with "Error establishing SSH connection." Here's what was wrong and what fixed it.

Bug 1: Missing egress rule on the instance security group

The instance security group only had an ingress rule for port 8000. That lets the ALB reach the app — but it doesn't give the instance any outbound internet access.

Without outbound access, yum update, yum install python3, and pip3 install fastapi in the user_data script all silently hang or fail. The app never starts. The ALB health check probes /health, gets nothing back, marks the target unhealthy, and the ASG eventually replaces it — then the same thing happens again on the new instance.

This is the important thing to know: Terraform does not add a default egress rule when you define a security group in code. The AWS console does add one automatically (allow all outbound), so if you're used to working in the console this will catch you off guard. In Terraform, if you don't declare it, it doesn't exist.

Fix — add this to aws_security_group "instance" in the module:

egress {
  from_port   = 0
  to_port     = 0
  protocol    = "-1"
  cidr_blocks = ["0.0.0.0/0"]
}

Bug 2: Missing port 22 ingress rule

EC2 Instance Connect — the browser-based SSH in the AWS console — requires port 22 to be open on the instance security group. It wasn't, so every connection attempt failed before it could establish.

Fix — add this ingress rule alongside the port 8000 rule:

ingress {
  from_port   = 22
  to_port     = 22
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}

In a real production setup you'd restrict this to a known CIDR (your office IP, a bastion host, or a VPN range) rather than 0.0.0.0/0. For a dev environment it's fine.
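A sketch of the restricted version — the variable name here is illustrative, not part of the module:

```hcl
variable "ssh_allowed_cidrs" {
  description = "CIDR ranges allowed to SSH in (office IP, bastion, VPN range)"
  type        = list(string)
  default     = []  # closed by default; empty list means no SSH ingress
}

ingress {
  from_port   = 22
  to_port     = 22
  protocol    = "tcp"
  cidr_blocks = var.ssh_allowed_cidrs
}
```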

Bug 3: Local fixes not being picked up

The module source in dev/main.tf was pointing to a remote git tag:

source = "git::https://github.com/mohamednourdine/terraform-modules.git//modules/web-app?ref=v1.2.0"

Every fix I made locally had no effect — Terraform kept pulling the tagged version from GitHub. While iterating on bugs, switch to a local path so changes apply immediately:

source = "../../terraform-modules/modules/web-app"

Remember to run terraform init after changing the source. Once the bugs are fixed and the module is tagged, switch back to the versioned remote source.

Result

After applying these three fixes, instances got outbound internet access, the bootstrap script completed, FastAPI started on port 8000, and the health check returned:

{"status": "healthy", "hostname": "ip-172-31-78-60.ec2.internal", "timestamp": "2026-04-16T09:51:03Z"}

All targets moved to Healthy in the target group.

Debugging Startup Issues

If instances stay in initial or flip to unhealthy, the startup script is the first place to check. SSH into an instance via EC2 Instance Connect (requires port 22 open — see Bug 2 above) and read the service logs:

# See if the service is running
systemctl status api

# Follow live output from the service
journalctl -u api -f

# Read the cloud-init log — this is where user_data errors show up
cat /var/log/cloud-init-output.log

Common culprits:

  • Missing egress rule → pip3 install can't reach the internet → script hangs → health check never passes → ASG loops replacing instances indefinitely.
  • Port mismatch between uvicorn (--port 8000) and the security group / target group port (server_port variable). These need to match exactly.
  • health_check_grace_period too short → ASG marks instances unhealthy before startup finishes. Increase it or pre-bake the AMI.

What Changed from the Hello World Version

Aspect               Before (python3 -m http.server)     After (FastAPI + systemd)
Process management   nohup, no restart on crash          systemd, restarts automatically
Health check         GET / returns an HTML file          GET /health returns structured JSON
Startup failure      Script continues past errors        set -e stops immediately
App code             Hardcoded in module                 Passed in as user_data variable
Grace period         Implicit 300s default               Explicit 360 in config
Logs                 Lost on process exit                journalctl -u api
Instance SG egress   Not considered                      Explicit allow-all for outbound access
SSH access           Not considered                      Port 22 ingress for Instance Connect

Where I'm At

Moving from a toy server to a real application exposed more gaps than expected. The module needed a user_data input, an explicit grace period, and a meaningful health check path — those were planned. What wasn't planned: the instance security group had no egress rule (so the bootstrap script couldn't reach the internet to install packages), and no port 22 rule (so Instance Connect couldn't be used to investigate). Both came from a habit of working in the AWS console, which silently adds a default egress rule for you. Terraform doesn't.

The FastAPI example is still simple — there's no database, no auth, no persistent state. But the infrastructure pattern is real: a stateless API tier behind an ALB, managed by an ASG, deployed from a versioned module, with proper process management and health checks. That pattern scales to production.

Next up: adding an RDS database behind this API and managing the connection securely.


This post is part of a 30-day Terraform learning journey.
