The versioned module setup from Day 9 works. But the app it runs is embarrassing:
```bash
nohup python3 -m http.server 8080 &
```
That's not a web service — it's a file browser. There's no routing, no JSON, no health endpoint worth monitoring, and if it crashes you have no idea until the ALB starts returning 502s.
This post replaces it with a real FastAPI application: structured endpoints, a proper /health check, and a systemd service that auto-restarts on failure. Along the way I'll fix a gap in the module itself — right now user_data is baked in, which means the module only works for one specific app. That has to change.
Why This Matters in the Industry
Real teams don't deploy "Hello World." They deploy APIs — user services, product catalogs, internal tools — and the infrastructure that runs them has to handle a few things the toy example skips entirely:
Startup time. Installing Python packages takes 20–40 seconds. The ALB health check doesn't know or care — it starts probing immediately. If your health check is too aggressive, the instance gets marked unhealthy before the app is even running and the ASG replaces it. That loop never ends.
Process management. Processes backgrounded with `nohup ... &` don't restart if they crash. A production service needs something watching it. On Amazon Linux 2, that something is systemd.
A real health endpoint. Returning 200 OK from the root path tells the ALB the server is reachable. Returning 200 from /health with instance metadata tells you it's the right server, running the right code.
Getting these right is not optional. They're the difference between a deployment that works and one that silently misbehaves.
Prerequisites: Remote State Backend (One-Time Setup)
Before running terraform init, the S3 bucket and DynamoDB table for remote state must already exist. Terraform cannot create its own backend — if the bucket isn't there, init fails before any resources are evaluated.
This is a one-time setup per AWS account. If you went through the Day 9 setup, these already exist. If you're starting fresh, create them now with the AWS CLI:
```bash
# Create the bucket — names are globally unique, pick one tied to your account
# (Note: outside us-east-1 you must also pass
#  --create-bucket-configuration LocationConstraint=<region>)
aws s3api create-bucket \
  --bucket mnourdine-tf-state \
  --region us-east-1

# Enable versioning so you can recover a previous state if an apply goes wrong
aws s3api put-bucket-versioning \
  --bucket mnourdine-tf-state \
  --versioning-configuration Status=Enabled

# Encrypt state at rest — state files can contain secrets (DB passwords, tokens, etc.)
aws s3api put-bucket-encryption \
  --bucket mnourdine-tf-state \
  --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Block all public access
aws s3api put-public-access-block \
  --bucket mnourdine-tf-state \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

# Create the DynamoDB table for state locking
# LockID is the required key — Terraform writes to it when acquiring a lock
aws dynamodb create-table \
  --table-name terraform-state-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1
```
Once these exist, every environment's terraform init can reference the same bucket and table. The state files are isolated by the key path inside the bucket — dev/web-app/terraform.tfstate, staging/web-app/terraform.tfstate, etc.
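For reference, the piece that connects an environment to this bucket is its backend block. Here's a minimal sketch of generating one — the `backend.tf` file name and the `dev/web-app` key path are illustrative choices, while the bucket and table names match the ones created above:

```shell
# Per-environment state key — this is the part that changes between dev/staging/prod
ENV_KEY="dev/web-app/terraform.tfstate"

cat > backend.tf <<EOF
terraform {
  backend "s3" {
    bucket         = "mnourdine-tf-state"
    key            = "${ENV_KEY}"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
EOF
```

With the file in place, `terraform init` in that directory connects to the shared bucket and lock table.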
The Problem with Hardcoded user_data
The current module bakes the app startup directly into main.tf:
```hcl
  user_data = base64encode(<<-EOF
    #!/bin/bash
    mkdir -p /var/www/html
    echo "Hello from ${var.environment}" > /var/www/html/index.html
    cd /var/www/html && nohup python3 -m http.server ${var.server_port} &
  EOF
  )
```
This makes the module useless for anything else. A module should describe how to run an instance — not what app to run on it. Those are different concerns and they belong in different places.
The fix is a user_data input variable. The module handles the infrastructure. The caller handles the application.
Step 1: Update the Module — Add a user_data Variable
This is module v1.2.0. The only change from v1.1.0 is pulling user_data out of main.tf and into variables.tf.
modules/web-app/variables.tf — add this variable
```hcl
variable "user_data" {
  description = "Shell script to run on instance launch. Installs and starts the application."
  type        = string
  sensitive   = true
}
```
modules/web-app/main.tf — update the launch template
```hcl
resource "aws_launch_template" "web" {
  image_id               = data.aws_ami.amazon_linux.id
  instance_type          = var.instance_type
  vpc_security_group_ids = [aws_security_group.instance.id]

  # Caller provides the script — module doesn't care what app runs
  user_data = base64encode(var.user_data)

  lifecycle {
    create_before_destroy = true
  }
}
```
Also add health_check_grace_period to the ASG. This is how long the ASG waits before it starts trusting health check results on a new instance. The default is 300 seconds — which sounds like a lot, but if your startup script installs packages from the internet, it can take longer on a cold instance. Setting it explicitly makes the behavior predictable and keeps it in code.
modules/web-app/main.tf — update the ASG
```hcl
resource "aws_autoscaling_group" "web" {
  min_size         = var.min_size
  max_size         = var.max_size
  desired_capacity = var.min_size

  health_check_grace_period = var.health_check_grace_period

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  vpc_zone_identifier = data.aws_subnets.default.ids
  target_group_arns   = [aws_lb_target_group.web.arn]
  health_check_type   = "ELB"

  tag {
    key                 = "Name"
    value               = "${local.name_prefix}-web"
    propagate_at_launch = true
  }
}
```
modules/web-app/variables.tf — add grace period variable
```hcl
variable "health_check_grace_period" {
  description = "Seconds the ASG waits before checking health on a new instance. Set high enough to cover your startup script."
  type        = number
  default     = 300
}
```
Tag and push:
```bash
git add .
git commit -m "feat: accept user_data as input variable, expose health_check_grace_period"
git tag v1.2.0
git push origin main --tags
```
Step 2: The FastAPI Application
Here's the application. It's small enough to read in two minutes, realistic enough to be useful as a starting point.
```python
# main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import socket
import datetime

app = FastAPI(title="Items API", version="1.0.0")

class Item(BaseModel):
    id: int
    name: str
    price: float

# In-memory store — fine for a demo.
# In a real service this would be a database call.
_items: List[Item] = [
    Item(id=1, name="Widget", price=9.99),
    Item(id=2, name="Gadget", price=24.99),
    Item(id=3, name="Doohickey", price=4.99),
]

@app.get("/health")
def health():
    """
    Health check endpoint for the ALB.
    Returns the hostname so you can verify which instance responded.
    """
    return {
        "status": "healthy",
        "hostname": socket.gethostname(),
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    }

@app.get("/items", response_model=List[Item])
def list_items():
    return _items

@app.get("/items/{item_id}", response_model=Item)
def get_item(item_id: int):
    for item in _items:
        if item.id == item_id:
            return item
    raise HTTPException(status_code=404, detail=f"Item {item_id} not found")
```
Three endpoints:
| Endpoint | What it does |
|---|---|
| `GET /health` | ALB health check target. Returns hostname + timestamp. |
| `GET /items` | Returns the full item list as JSON. |
| `GET /items/{id}` | Returns a single item, or 404 if not found. |
Step 3: The user_data Script
This is the script that runs on each EC2 instance at launch. It installs the app and registers it as a systemd service so it auto-restarts on crash or reboot.
```bash
#!/bin/bash
set -e  # exit immediately if any command fails

# ── System packages ──────────────────────────────────────────────────────────
yum update -y
yum install -y python3 python3-pip

# ── Python dependencies ──────────────────────────────────────────────────────
pip3 install fastapi "uvicorn[standard]" pydantic

# ── Application ──────────────────────────────────────────────────────────────
mkdir -p /opt/api
cat > /opt/api/main.py << 'PYEOF'
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import socket
import datetime

app = FastAPI(title="Items API", version="1.0.0")

class Item(BaseModel):
    id: int
    name: str
    price: float

_items = [
    Item(id=1, name="Widget", price=9.99),
    Item(id=2, name="Gadget", price=24.99),
    Item(id=3, name="Doohickey", price=4.99),
]

@app.get("/health")
def health():
    return {
        "status": "healthy",
        "hostname": socket.gethostname(),
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    }

@app.get("/items", response_model=List[Item])
def list_items():
    return _items

@app.get("/items/{item_id}", response_model=Item)
def get_item(item_id: int):
    for item in _items:
        if item.id == item_id:
            return item
    raise HTTPException(status_code=404, detail=f"Item {item_id} not found")
PYEOF

# ── systemd service ──────────────────────────────────────────────────────────
# Using systemd instead of `nohup ... &` means:
#   - the process restarts automatically if it crashes
#   - it starts on reboot
#   - logs go to journald (readable with: journalctl -u api -f)
cat > /etc/systemd/system/api.service << 'EOF'
[Unit]
Description=FastAPI Items API
After=network.target

[Service]
User=ec2-user
WorkingDirectory=/opt/api
ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable api
systemctl start api
```
Two things worth noting here:

- `set -e` at the top means the script stops immediately if any command fails — if `pip3 install` fails because of a network issue, the instance doesn't come up half-configured. Without it, later commands can silently run against a broken environment.
- The systemd service runs as `ec2-user` rather than root. This is a minimal precaution: if the app has a vulnerability, the blast radius is limited to what `ec2-user` can access.
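If the difference `set -e` makes isn't obvious, here's a tiny standalone demonstration, unrelated to the deploy itself — `false` stands in for a failed install step:

```shell
# Without set -e: the failing command is ignored and the script keeps going.
cat > /tmp/no_set_e.sh <<'SH'
false              # stand-in for a failed pip3 install
echo "kept going"
SH

# With set -e: the script dies at the first failure.
cat > /tmp/with_set_e.sh <<'SH'
set -e
false
echo "never reached"
SH

bash /tmp/no_set_e.sh                                     # prints: kept going
bash /tmp/with_set_e.sh || echo "stopped at the failure"  # prints: stopped at the failure
```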
Step 4: The Infrastructure Repo — Calling the Updated Module
dev/main.tf
```hcl
provider "aws" {
  region = "us-east-1"
}

locals {
  fastapi_user_data = <<-EOF
    #!/bin/bash
    set -e
    yum update -y
    yum install -y python3 python3-pip
    pip3 install fastapi "uvicorn[standard]" pydantic
    mkdir -p /opt/api
    cat > /opt/api/main.py << 'PYEOF'
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel
    from typing import List
    import socket, datetime
    app = FastAPI(title="Items API", version="1.0.0")
    class Item(BaseModel):
        id: int
        name: str
        price: float
    _items = [
        Item(id=1, name="Widget", price=9.99),
        Item(id=2, name="Gadget", price=24.99),
        Item(id=3, name="Doohickey", price=4.99),
    ]
    @app.get("/health")
    def health():
        return {"status": "healthy", "hostname": socket.gethostname(),
                "timestamp": datetime.datetime.utcnow().isoformat() + "Z"}
    @app.get("/items")
    def list_items():
        return _items
    @app.get("/items/{item_id}")
    def get_item(item_id: int):
        for item in _items:
            if item.id == item_id:
                return item
        raise HTTPException(status_code=404, detail=f"Item {item_id} not found")
    PYEOF
    cat > /etc/systemd/system/api.service << 'SVCEOF'
    [Unit]
    Description=FastAPI Items API
    After=network.target
    [Service]
    User=ec2-user
    WorkingDirectory=/opt/api
    ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8000
    Restart=always
    RestartSec=5
    [Install]
    WantedBy=multi-user.target
    SVCEOF
    systemctl daemon-reload
    systemctl enable api
    systemctl start api
  EOF
}

module "web_app" {
  source = "git::https://github.com/mohamednourdine/terraform-modules.git//modules/web-app?ref=v1.2.0"

  environment               = var.environment
  instance_type             = var.instance_type
  min_size                  = var.min_size
  max_size                  = var.max_size
  server_port               = 8000      # uvicorn default
  health_check_path         = "/health" # FastAPI endpoint, not "/"
  health_check_grace_period = 360       # gives the startup script time to finish

  user_data = local.fastapi_user_data
}

output "url" {
  value       = "http://${module.web_app.alb_dns_name}"
  description = "Base URL of the API"
}
```
dev/terraform.tfvars
```hcl
environment   = "dev"
instance_type = "t2.micro"
min_size      = 1
max_size      = 2
```
dev/variables.tf
```hcl
variable "environment" { type = string }
variable "instance_type" { type = string }
variable "min_size" { type = number }
variable "max_size" { type = number }
```
Staging and prod use the same main.tf (different tfvars) — the script is identical across environments. That's the point: same code, different scale.
staging/terraform.tfvars
```hcl
environment   = "staging"
instance_type = "t2.small"
min_size      = 1
max_size      = 3
```
prod/terraform.tfvars
```hcl
environment   = "prod"
instance_type = "t3.small"
min_size      = 2
max_size      = 6
```
Step 5: Deploy and Test
```bash
cd dev/
terraform init -upgrade     # pulls module v1.2.0
terraform apply -var-file="terraform.tfvars"
```
Wait for the ALB to finish health checks — this takes 2–3 minutes after apply completes. The startup script is still running on each instance during that window. Watch the Target Group in the AWS console: instances move from initial → healthy once the health check passes.
Once healthy, test the endpoints:
```bash
BASE="http://web-app-dev-alb-xxxxxxxx.us-east-1.elb.amazonaws.com"

# Health check — also shows which instance responded
curl -s $BASE/health | python3 -m json.tool
{
    "status": "healthy",
    "hostname": "ip-172-31-24-87.ec2.internal",
    "timestamp": "2026-04-16T09:42:11Z"
}

# Full item list
curl -s $BASE/items | python3 -m json.tool
[
    {"id": 1, "name": "Widget", "price": 9.99},
    {"id": 2, "name": "Gadget", "price": 24.99},
    {"id": 3, "name": "Doohickey", "price": 4.99}
]

# Single item
curl -s $BASE/items/2 | python3 -m json.tool
{"id": 2, "name": "Gadget", "price": 24.99}

# 404 response
curl -s $BASE/items/99 | python3 -m json.tool
{"detail": "Item 99 not found"}
```
Run the health check a few times — you'll see the hostname rotate between instances as the ALB load balances across them. That's the ASG working correctly.
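To watch that rotation without reading full JSON bodies each time, pull out just the hostname field. The `health_hostname` helper below is my own convenience, not part of the app; the canned response shows the extraction, and against the live ALB you'd pipe `curl -s $BASE/health` into it instead:

```shell
# health_hostname: read a /health JSON response on stdin, print only the hostname
health_hostname() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["hostname"])'
}

# Live usage would be:
#   for i in $(seq 5); do curl -s "$BASE/health" | health_hostname; done
# Demonstration with a canned response:
echo '{"status": "healthy", "hostname": "ip-172-31-24-87.ec2.internal", "timestamp": "2026-04-16T09:42:11Z"}' \
  | health_hostname
# prints: ip-172-31-24-87.ec2.internal
```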
Bugs I Hit During Testing (And the Fixes)
After running terraform apply, all targets in the ALB target group showed Unhealthy and EC2 Instance Connect failed with "Error establishing SSH connection." Here's what was wrong and what fixed it.
Bug 1: Missing egress rule on the instance security group
The instance security group only had an ingress rule for port 8000. That lets the ALB reach the app — but it doesn't give the instance any outbound internet access.
Without outbound access, yum update, yum install python3, and pip3 install fastapi in the user_data script all silently hang or fail. The app never starts. The ALB health check probes /health, gets nothing back, marks the target unhealthy, and the ASG eventually replaces it — then the same thing happens again on the new instance.
This is the important thing to know: Terraform does not add a default egress rule when you define a security group in code. The AWS console does add one automatically (allow all outbound), so if you're used to working in the console this will catch you off guard. In Terraform, if you don't declare it, it doesn't exist.
Fix — add this to aws_security_group "instance" in the module:
```hcl
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
```
Bug 2: Missing port 22 ingress rule
EC2 Instance Connect — the browser-based SSH in the AWS console — requires port 22 to be open on the instance security group. It wasn't, so every connection attempt failed before it could establish.
Fix — add this ingress rule alongside the port 8000 rule:
```hcl
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
```
In a real production setup you'd restrict this to a known CIDR (your office IP, a bastion host, or a VPN range) rather than 0.0.0.0/0. For a dev environment it's fine.
Bug 3: Local fixes not being picked up
The module source in dev/main.tf was pointing to a remote git tag:
```hcl
source = "git::https://github.com/mohamednourdine/terraform-modules.git//modules/web-app?ref=v1.2.0"
```
Every fix I made locally had no effect — Terraform kept pulling the tagged version from GitHub. While iterating on bugs, switch to a local path so changes apply immediately:
```hcl
source = "../../terraform-modules/modules/web-app"
```
Remember to run terraform init after changing the source. Once the bugs are fixed and the module is tagged, switch back to the versioned remote source.
Result
After applying these three fixes, instances got outbound internet access, the bootstrap script completed, FastAPI started on port 8000, and the health check returned:
```json
{"status": "healthy", "hostname": "ip-172-31-78-60.ec2.internal", "timestamp": "2026-04-16T09:51:03Z"}
```
All targets moved to Healthy in the target group.
Debugging Startup Issues
If instances stay in initial or flip to unhealthy, the startup script is the first place to check. SSH into an instance via EC2 Instance Connect (requires port 22 open — see Bug 2 above) and read the service logs:
```bash
# See if the service is running
systemctl status api

# Follow live output from the service
journalctl -u api -f

# Read the cloud-init log — this is where user_data errors show up
cat /var/log/cloud-init-output.log
```
Common culprits:
- Missing egress rule → `pip3 install` can't reach the internet → script hangs → health check never passes → ASG loops replacing instances indefinitely.
- Port mismatch between uvicorn (`--port 8000`) and the security group / target group port (`server_port` variable). These need to match exactly.
- `health_check_grace_period` too short → ASG marks instances unhealthy before startup finishes. Increase it or pre-bake the AMI.
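A quick way to tell "app never started" apart from "app is on the wrong port" is to probe the port locally on the instance. Here's a sketch using a throwaway Python server as a stand-in for uvicorn — on a real instance you'd skip the server lines and just run the `curl`:

```shell
# Stand-in for the app: a throwaway server bound to the expected port (8000)
python3 -m http.server 8000 --bind 127.0.0.1 >/dev/null 2>&1 &
SRV=$!
sleep 1

# If something is listening on 8000, this prints an HTTP status code;
# if nothing is listening, curl exits with an error instead.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8000/
# prints: 200

kill $SRV
```

If this succeeds on the instance but the ALB still reports unhealthy, the problem is in the security group or target group configuration, not the app.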
What Changed from the Hello World Version
| | Before (`python3 -m http.server`) | After (FastAPI + systemd) |
|---|---|---|
| Process management | `nohup` — no restart on crash | systemd — restarts automatically |
| Health check | `GET /` returns an HTML file | `GET /health` returns structured JSON |
| Startup failure | Script continues past errors | `set -e` stops immediately |
| App code | Hardcoded in module | Passed in as `user_data` variable |
| Grace period | Implicit 300s default | Explicit 360 in config |
| Logs | Lost on process exit | `journalctl -u api` |
| Instance SG egress | Not considered | Explicit allow-all required for outbound access |
| SSH access | Not considered | Port 22 ingress rule required for Instance Connect |
Where I'm At
Moving from a toy server to a real application exposed more gaps than expected. The module needed a user_data input, an explicit grace period, and a meaningful health check path — those were planned. What wasn't planned: the instance security group had no egress rule (so the bootstrap script couldn't reach the internet to install packages), and no port 22 rule (so Instance Connect couldn't be used to investigate). Both came from a habit of working in the AWS console, which silently adds a default egress rule for you. Terraform doesn't.
The FastAPI example is still simple — there's no database, no auth, no persistent state. But the infrastructure pattern is real: a stateless API tier behind an ALB, managed by an ASG, deployed from a versioned module, with proper process management and health checks. That pattern scales to production.
Next up: adding an RDS database behind this API and managing the connection securely.
This post is part of a 30-day Terraform learning journey.