AWS Lambda vs AWS Fargate for SSO Service
Date: 2025-02-12
Status: accepted
Context
The remix-sso application provides Single Sign-On (SSO) functionality for our platform. It is currently hosted on AWS Lambda. An outage on 2025-02-11 that affected the entire platform was traced back to a surge in authentication requests, which caused us to hit multiple AWS rate limits:
- The number of concurrent AWS Lambda executions.
- The maximum request rate to AWS Secrets Manager/Parameter Store.
Detailed Outage Analysis
The Accounts Overview page occasionally experiences traffic spikes, and one such spike caused this incident.
The Account Overview includes global components, which in turn use the remix-sso-library to make calls to the /sso/* endpoints. During these traffic spikes:
- The warm Lambdas cannot handle the increased load, leading AWS to spin up new Lambda instances.
- These new Lambdas fetch secrets from AWS Parameter Store, which results in a high volume of requests.
- AWS Parameter Store enforces rate limits, leading to failures in the running Lambdas.
- The failed Lambda instances are terminated, so subsequent requests require new cold starts.
- The cold-started Lambdas again attempt to read from Parameter Store, quickly hitting the rate limit again.
- This cycle continues, with AWS Parameter Store returning "Rate exceeded" errors that cause further Lambda failures.
- Additionally, we hit the concurrent execution limit for AWS Lambda, leading to many Lambdas being started and immediately terminated without producing log output.
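The feedback loop above can be illustrated with a deliberately simplified model. The sketch below is not derived from incident data: the function name, the discrete "ticks", and all numbers are assumptions chosen for illustration. It also assumes the best case, in which the shared read limit is spent on whole instances; with interleaved reads, real behavior would likely be worse.

```python
from typing import List, Tuple


def simulate_cold_start_storm(
    pending: int,
    reads_per_start: int,
    read_limit_per_tick: int,
    max_ticks: int,
) -> List[Tuple[int, int, int]]:
    """Toy model of cold starts competing for a shared read limit.

    Returns one (tick, started_ok, throttled) tuple per simulated tick.
    All parameters are hypothetical; none come from AWS documentation.
    """
    if reads_per_start <= 0:
        raise ValueError("reads_per_start must be positive")

    history: List[Tuple[int, int, int]] = []
    for tick in range(max_ticks):
        # Every pending instance cold-starts and issues its reads in this tick.
        # Best case: the limit lets this many instances finish all their reads.
        started_ok = min(pending, read_limit_per_tick // reads_per_start)
        throttled = pending - started_ok
        history.append((tick, started_ok, throttled))

        # Throttled instances fail, are terminated, and cold-start again.
        pending = throttled
        if pending == 0:
            break
    return history


if __name__ == "__main__":
    # Hypothetical numbers: 100 instances needed, 5 reads each, 40 reads per tick.
    for tick, ok, throttled in simulate_cold_start_storm(100, 5, 40, 50):
        print(f"tick={tick} started_ok={ok} throttled={throttled}")
```

Even in this optimistic model, the number of throttled cold starts stays high for many ticks whenever the reads needed per tick exceed the limit, which matches the repeated "Rate exceeded" errors described above.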
Given the critical nature of the SSO service, and as an outcome of the incident, we are exploring alternative infrastructure options to improve reliability and scalability under load.
Decision
We will migrate the SSO service from AWS Lambda to AWS Fargate, running as a long-lived containerized service.
Rationale
- Better control over resource scaling:
  - AWS Lambda scales automatically but can spawn too many instances under high load, leading to increased pressure on downstream dependencies (e.g., Secrets Manager, Parameter Store).
  - AWS Fargate provides a steady-state service with predictable scaling characteristics, allowing us to implement rate-limiting strategies and connection pooling more effectively.
- Reduced reliance on ephemeral execution:
  - Lambda’s cold starts can introduce unpredictable latency, especially during traffic spikes.
  - A long-running Fargate service will maintain warm execution environments, improving response times.
- More predictable interactions with AWS dependencies:
  - Instead of each new Lambda instance making independent requests to Secrets Manager and Parameter Store, a Fargate service can cache secrets in memory, reducing load on AWS services.
  - This reduces the likelihood of hitting AWS-imposed request limits.
- Improved observability and debugging:
  - AWS Lambda logs are fragmented per execution, making debugging complex workflows difficult.
  - A Fargate service allows for centralized logging and monitoring, improving visibility into request patterns and failures.
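The in-memory secret caching mentioned in the rationale could look roughly like the sketch below. It is a minimal illustration, not the service's implementation: the `SecretCache` class, its TTL default, and the injected `fetch` callable are all hypothetical names introduced here. In a real service, `fetch` might wrap a Parameter Store or Secrets Manager read.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class SecretCache:
    """In-memory cache with a time-to-live, shared by all requests in one process.

    Hypothetical helper for illustration; `fetch` is whatever function
    actually reads a secret from the backing store.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,  # assumed default, tune to rotation policy
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, name: str) -> str:
        # Holding the lock while fetching means concurrent requests for an
        # expired secret trigger one upstream read instead of many.
        with self._lock:
            now = self._clock()
            entry = self._entries.get(name)
            if entry is not None and now < entry[1]:
                return entry[0]  # still fresh: no upstream call
            value = self._fetch(name)  # cache miss or expired entry
            self._entries[name] = (value, now + self._ttl)
            return value


if __name__ == "__main__":
    upstream_calls = []

    def fake_fetch(name: str) -> str:
        # Stand-in for a real secret store read.
        upstream_calls.append(name)
        return f"secret-for-{name}"

    cache = SecretCache(fake_fetch, ttl_seconds=60)
    for _ in range(1000):
        cache.get("sso/signing-key")  # hypothetical parameter name
    print(f"upstream reads: {len(upstream_calls)}")
```

With a cache like this, the number of upstream reads depends on the number of running tasks and the TTL rather than on the request rate, which is the property the rationale relies on.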
Trade-offs
- Operational Overhead:
  - Running a long-lived service requires managing task definitions, networking, auto-scaling policies, and service health checks.
  - Lambda abstracts much of this operational complexity, but at the cost of less predictability in high-load scenarios.
- Scaling Behavior:
  - AWS Lambda scales automatically based on incoming requests with little intervention.
  - AWS Fargate requires tuning of scaling policies, which could lead to under-provisioning (causing latency) or over-provisioning (increasing costs).
- Cost Considerations:
  - Lambda charges based on execution time and invocations, which can be cost-effective for sporadic workloads but expensive for high and sustained traffic.
  - Fargate has fixed compute and memory allocations, which may be more cost-efficient under heavy and consistent loads but can be inefficient if traffic is highly variable.
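The cost trade-off can be made concrete with a rough comparison of the two billing models. The sketch below only encodes their general shape (per request plus per GB-second for Lambda, per vCPU-hour plus per GB-hour for Fargate). The function names are hypothetical, and every price is passed in as a parameter because the values in the usage example are placeholders, not actual AWS prices.

```python
def lambda_monthly_cost(
    requests: int,
    avg_duration_s: float,
    memory_gb: float,
    price_per_request: float,
    price_per_gb_second: float,
) -> float:
    """Estimate Lambda cost: a per-request fee plus a per-GB-second compute fee."""
    request_cost = requests * price_per_request
    compute_cost = requests * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost


def fargate_monthly_cost(
    tasks: int,
    hours: float,
    vcpu: float,
    memory_gb: float,
    price_per_vcpu_hour: float,
    price_per_gb_hour: float,
) -> float:
    """Estimate Fargate cost: each running task is billed for its vCPU and memory."""
    hourly_per_task = vcpu * price_per_vcpu_hour + memory_gb * price_per_gb_hour
    return tasks * hours * hourly_per_task


if __name__ == "__main__":
    # Placeholder prices and traffic figures for illustration only.
    lam = lambda_monthly_cost(
        requests=50_000_000, avg_duration_s=0.2, memory_gb=0.5,
        price_per_request=0.000001, price_per_gb_second=0.00002,
    )
    far = fargate_monthly_cost(
        tasks=4, hours=730, vcpu=0.5, memory_gb=1.0,
        price_per_vcpu_hour=0.04, price_per_gb_hour=0.004,
    )
    print(f"Lambda estimate:  {lam:.2f}")
    print(f"Fargate estimate: {far:.2f}")
```

The Lambda estimate grows linearly with request volume, while the Fargate estimate depends only on how many tasks run and for how long. That is why the comparison flips depending on whether traffic is sustained or sporadic.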
Risks and Mitigations
- Increased Management Complexity:
  - We will need to set up proper auto-scaling policies and monitoring to ensure reliability and cost efficiency.
  - Mitigation: Use AWS Auto Scaling and Datadog monitoring to adjust the number of running tasks dynamically.
- Potential for Under-Provisioning:
  - If we do not scale the Fargate service appropriately, it may not handle traffic spikes as well as Lambda.
  - Mitigation: Implement proactive scaling policies and load tests to determine optimal service sizing.
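Once load tests have produced a per-task throughput figure, service sizing reduces to simple arithmetic. The sketch below is one way to express it, assuming a hypothetical `required_tasks` helper and a target utilization that leaves headroom for spikes. The numbers in the usage example are illustrative and not measured values.

```python
import math


def required_tasks(
    peak_rps: float,
    per_task_rps: float,
    target_utilization: float = 0.5,
    min_tasks: int = 2,
) -> int:
    """Return how many tasks are needed to serve peak_rps with headroom.

    per_task_rps is the throughput one task sustained in load tests.
    target_utilization (0 < x <= 1) is the fraction of that throughput
    we are willing to use at peak; the rest is headroom.
    """
    if per_task_rps <= 0:
        raise ValueError("per_task_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    usable_rps_per_task = per_task_rps * target_utilization
    needed = math.ceil(peak_rps / usable_rps_per_task)
    # Never drop below the floor, so a single task failure is survivable.
    return max(needed, min_tasks)


if __name__ == "__main__":
    # Illustrative inputs only: 900 requests/s at peak, 100 requests/s per task.
    print(required_tasks(peak_rps=900, per_task_rps=100, target_utilization=0.5))
```

A lower target utilization trades cost for spike tolerance, which is the same under-provisioning versus over-provisioning tension described in the trade-offs.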
Alternatives Considered
- Continue with AWS Lambda but optimize execution patterns: This could involve implementing connection pooling, caching secrets externally, or requesting AWS rate limit increases. However, these optimizations may not fully mitigate the risk of exceeding AWS constraints during extreme load spikes.
Next Steps
- Define and implement Fargate task configurations (CPU/memory allocation, auto-scaling policies).
- Test and benchmark performance under different traffic conditions.
- Implement logging, monitoring, and alerting to ensure observability.