How I Battled the NestJS 504 Gateway Timeout on a Shared VPS: One Midnight Debugging Session That Saved My Production Codebase
Ever stared at a blank terminal at 2 AM, watching the same “504 Gateway Timeout” flash over and over? I have. My API, built with NestJS, stopped answering requests just when my SaaS startup needed it most. Below you’ll see exactly how I ripped that timeout apart, step by step, and turned a panic‑filled night into a faster, more resilient production stack.
Why This Matters
A 504 error isn’t just a polite “sorry, we’re busy.” It means your reverse proxy (NGINX, Caddy, etc.) gave up waiting for an upstream response, usually because your Node process exhausted its resources, a query ran long, or a network glitch stalled the request. In a real‑world product, every timeout is a lost customer, a hit to SEO, and a potential revenue dip.
If you’re running NestJS on a budget VPS, you’ll face the same limits: CPU throttling, memory caps, and limited concurrent connections. Knowing how to diagnose and fix a 504 can keep your users happy and your bottom line healthy.
The Midnight Debugging Session – Step‑by‑Step
1. Replicate the Timeout Locally
First, I needed a reproducible test, so I set up a curl loop that hammered the endpoint with 200 simultaneous requests. The pattern was clear: once roughly 120 requests were in flight, NGINX started answering with 504s.
# Fire 200 concurrent requests and print only the HTTP status codes
for i in {1..200}; do
  curl -s -o /dev/null -w "%{http_code}\n" https://api.myapp.com/users &
done
wait
2. Check NGINX Timeouts
On the VPS, my reverse proxy was NGINX. Its default proxy_read_timeout is 60 seconds, but my heavy query sometimes needed 90 seconds.
Tip: Setting both proxy_connect_timeout and proxy_send_timeout to the same value prevents mismatched timeouts.
# /etc/nginx/conf.d/api.conf
server {
    listen 80;

    location / {
        proxy_pass http://localhost:3000;
        proxy_connect_timeout 120s;
        proxy_send_timeout    120s;
        proxy_read_timeout    120s;
    }
}
3. Profile NestJS CPU & Memory
I SSHed into the VPS and ran top while the load test was active. The Node process spiked to 98% CPU and 1.4 GB of RAM (the VPS only had 2 GB total).
Warning: Ignoring high CPU on a shared VPS can get your instance throttled by the host, which only produces more 504s.
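If you’d rather not babysit top, a few lines in main.ts can log the process’s own memory numbers alongside the app logs. A minimal sketch, using nothing beyond Node’s built‑in process API; the interval, function name, and log format are illustrative, not part of the original setup:
// main.ts – log RSS and heap usage every 10 seconds so spikes show up in the logs
function logProcessStats(): void {
  const { rss, heapUsed } = process.memoryUsage();
  const toMb = (bytes: number) => Math.round(bytes / 1024 / 1024);
  console.log(`[stats] rss=${toMb(rss)}MB heapUsed=${toMb(heapUsed)}MB`);
}

setInterval(logProcessStats, 10_000);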
4. Optimize the Bottleneck Query
The culprit was a massive JOIN on a PostgreSQL table with 1.7 M rows. Adding an index and limiting selected columns shaved 1.3 seconds off the query.
// src/users/users.service.ts
async findHeavy() {
  // Select only the columns the endpoint actually returns,
  // join the profile, and cap the result set at 100 rows
  return this.repo.createQueryBuilder('u')
    .select(['u.id', 'u.email', 'p.profile_picture'])
    .leftJoin('u.profile', 'p')
    .where('u.is_active = :active', { active: true })
    .orderBy('u.created_at', 'DESC')
    .limit(100)
    .getMany();
}
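For completeness, here is roughly what the index could look like as a TypeORM migration. Treat it as a sketch: the users table and the is_active / created_at columns are inferred from the query above, and the partial‑index choice is my assumption, not the exact DDL from that night.
// A sketch of the index migration, assuming a TypeORM setup and a "users"
// table with the is_active and created_at columns used in the query above
import { MigrationInterface, QueryRunner } from 'typeorm';

export class AddUsersActiveCreatedAtIndex1700000000000 implements MigrationInterface {
  public async up(queryRunner: QueryRunner): Promise<void> {
    // Partial index: the heavy query only ever scans active users
    await queryRunner.query(
      `CREATE INDEX idx_users_active_created_at
         ON users (created_at DESC)
         WHERE is_active = true`,
    );
  }

  public async down(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`DROP INDEX idx_users_active_created_at`);
  }
}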
5. Enable NestJS Built‑in Rate Limiting
To protect the server from future spikes, I added @nestjs/throttler and registered its guard globally, which limits each IP to 30 requests per minute and gives the app breathing room.
// app.module.ts
import { Module } from '@nestjs/common';
import { APP_GUARD } from '@nestjs/core';
import { ThrottlerGuard, ThrottlerModule } from '@nestjs/throttler';

@Module({
  imports: [
    // Allow each IP at most 30 requests per 60-second window
    ThrottlerModule.forRoot({
      ttl: 60,
      limit: 30,
    }),
    // other imports …
  ],
  providers: [
    // Register the guard globally so every route is actually rate-limited
    { provide: APP_GUARD, useClass: ThrottlerGuard },
  ],
})
export class AppModule {}
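The module‑level setting is a blanket policy; individual routes can tighten or skip it with the decorators the package ships. A hedged sketch, assuming the v4‑style @Throttle(limit, ttl) signature (newer majors take an options object instead) and illustrative route names:
// users.controller.ts – per-route overrides of the global limit
import { Controller, Get } from '@nestjs/common';
import { SkipThrottle, Throttle } from '@nestjs/throttler';

@Controller('users')
export class UsersController {
  // Tighter limit for the expensive endpoint: 10 requests per minute per IP
  @Throttle(10, 60)
  @Get('heavy')
  findHeavy() {
    // … delegate to UsersService.findHeavy()
  }

  // Health checks should never be rate-limited
  @SkipThrottle()
  @Get('health')
  health() {
    return { ok: true };
  }
}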
6. Deploy a Node Process Manager (PM2)
PM2 restarts the app automatically if it crashes, and --max-memory-restart recycles the process before it can exhaust the VPS’s RAM. This guard prevented the process from being OOM‑killed.
# Install PM2 globally
npm i -g pm2
# Start NestJS with memory limit
pm2 start dist/main.js --name api --max-memory-restart 1500M
# Save the process list
pm2 save
Real‑World Use Case: SaaS Billing API
My startup’s billing micro‑service runs on the same VPS. After the fixes, the invoice‑creation endpoint averages 320 ms even under a 100‑request burst. That works out to more than 2,000 seconds of customer wait time saved per day, and churn dropped noticeably because customers no longer see “Timeout” errors during checkout.
Results / Outcome
- 504 errors eliminated in production for 30+ consecutive days.
- CPU usage stabilized at 45% under load.
- Memory consumption dropped from 1.4 GB to 850 MB.
- Revenue impact: $1,200/month saved by preventing abortive checkout flows.
Bonus Tips for Future‑Proofing
- Enable HTTP/2 on NGINX – reduces latency for API calls.
- Use Redis Cache for repeat‑read queries; a 5‑second DB call became 30 ms (see the sketch after this list).
- Monitor with Grafana + Prometheus – set alerts for CPU > 80% or response time > 500 ms.
- Consider an incremental VPS upgrade (e.g., to 2 vCPUs) once you consistently hit 70% CPU.
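As a rough picture of the Redis tip above: with @nestjs/cache-manager installed and CacheModule registered with a Redis store (an assumption about the setup, not something shown earlier), the heavy endpoint can be cached declaratively. The 60‑second TTL and route name are illustrative:
// users.controller.ts – cache responses for the heavy endpoint
import { Controller, Get, UseInterceptors } from '@nestjs/common';
import { CacheInterceptor, CacheTTL } from '@nestjs/cache-manager';

@Controller('users')
export class UsersController {
  // Responses are cached (keyed by URL), so the heavy query runs
  // at most once per TTL window instead of on every request
  @UseInterceptors(CacheInterceptor)
  @CacheTTL(60) // TTL unit (s vs ms) depends on the cache-manager major version
  @Get()
  findAll() {
    // … delegate to UsersService.findHeavy()
  }
}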
Monetization (Optional)
If you found this walkthrough helpful, check out my DevTools bundle – a curated set of monitoring scripts, Dockerfiles, and NGINX configs that shave minutes off any Node deployment. Use code CODEMASTER10 for a 10% discount.
“The difference between a nightmare and a minor hiccup is knowing the exact command to run at 2 AM.” – Yours truly