Saturday, May 2, 2026

How I Fixed the Mysterious 502 Bad Gateway Crash on My NestJS App Running on a Shared VPS – One Dev’s 3‑Hour Debug Marathon Revealed the Hidden Memory Leak and Configuration Overlooked by All Hosts

If you’ve ever watched a production‑grade NestJS service collapse behind a 502 error while your users stare at a blank page, you know the panic feels like a punch to the gut. This story is about how I turned that nightmare crash into a stable API that now responds almost 30% faster, without changing the hosting plan.

Why This Matters

A shared VPS is cheap, but it comes with a hidden set of traps: limited RAM, conservative nginx timeouts, and a process manager with no memory ceiling matched to your Node.js heap. When those limits collide with a memory‑hungry NestJS process, the result is the dreaded “502 Bad Gateway” that even seasoned ops engineers struggle to explain.

Understanding the root cause not only saves you hours of downtime but also protects your revenue stream, especially if you’re selling API‑driven SaaS or a paid mobile backend.

The 3‑Hour Debug Marathon – Step‑by‑Step

  1. Reproduce the 502 on a Staging Clone

    I duplicated the production folder onto a spare sub‑domain, pointed it at the same VPS IP, and ran npm run start:prod. The 502 appeared after about 15 minutes of simulated traffic.

  2. Check Server Logs – Nginx vs. Node

    First I inspected /var/log/nginx/error.log and saw:

    2024/04/28 13:42:01 [error] 12345#0: *12 upstream prematurely closed connection while reading response header from upstream, client: 203.0.113.5, server: api.example.com, request: "GET /users HTTP/1.1", upstream: "http://127.0.0.1:3000/users", host: "api.example.com"

    That meant the Node process crashed or was killed. pm2 logs showed “Memory limit exceeded” after the 15‑minute mark.
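
    In case it helps, the equivalent commands look like this (the app name and line count are just examples; use whatever name you registered with PM2):

    # follow nginx errors and the Node process output side by side
    tail -f /var/log/nginx/error.log
    pm2 logs api --lines 200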

  3. Spot the Leak – Profiling with Clinic.js

    I installed clinic on the VPS and ran:

    clinic doctor -- node dist/main.js

    The profile showed the heap climbing steadily, driven by Array.prototype.push calls inside a custom CacheService that never evicted anything.
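
    If Doctor only tells you that memory keeps climbing, Clinic’s heap profiler (a separate subcommand of the same tool) narrows it down to the allocating call sites; this is the follow‑up run I’d suggest:

    # optional follow-up: attribute the heap growth to specific call sites
    clinic heapprofiler -- node dist/main.js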

  4. Fix the Leak – Add a TTL-Based Cache

    I rewrote the cache around node-cache with a 60-second TTL, so stale entries are evicted automatically instead of piling up in an array. Below is the before/after snippet.

    // before – memory hog: every set() appends, nothing is ever evicted
    import { Injectable } from '@nestjs/common';

    @Injectable()
    export class CacheService {
      private store: { key: string; value: any }[] = [];
      get(key: string) {
        return this.store.find(item => item.key === key)?.value;
      }
      set(key: string, value: any) {
        this.store.push({ key, value });
      }
    }

    // after – safe & fast: entries expire 60 s after they are written
    import { Injectable } from '@nestjs/common';
    import NodeCache from 'node-cache';

    @Injectable()
    export class CacheService {
      private cache = new NodeCache({ stdTTL: 60, checkperiod: 120 });
      get(key: string) {
        return this.cache.get(key);
      }
      set(key: string, value: any) {
        this.cache.set(key, value);
      }
    }
  5. Adjust Nginx & PM2 Settings

    Two server‑side tweaks sealed the deal:

    • Increase proxy_read_timeout to 120s in /etc/nginx/sites-available/api.conf.
    • Set max_memory_restart to 200M in the PM2 ecosystem.config.js (sketched after the nginx config below) so PM2 recycles the process before the OOM killer does.
    # /etc/nginx/sites-available/api.conf
    server {
      listen 80;
      server_name api.example.com;
    
      location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_read_timeout 120;
      }
    }
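
    A minimal ecosystem.config.js sketch for the PM2 side looks like this; the app name and script path are placeholders, so adapt them to your build output:

    // ecosystem.config.js – minimal sketch; name and script are illustrative
    module.exports = {
      apps: [
        {
          name: 'api',
          script: 'dist/main.js',
          instances: 1,
          // recycle the process before the VPS OOM killer steps in
          max_memory_restart: '200M',
          // cap the V8 old space so GC kicks in well before that threshold
          node_args: '--max-old-space-size=200',
          // optional: restart every night at 04:00 to clear residual leaks
          cron_restart: '0 4 * * *',
          env: { NODE_ENV: 'production' },
        },
      ],
    };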
    
  6. Verify with Load Test

    I used autocannon -c 100 -d 30 http://api.example.com/users. The app stayed under 150ms latency for the full 30 seconds, and memory usage never crossed 180 MB.

Tip: Always enable pm2 status alerts on your VPS monitoring dashboard. A sudden spike in “restart” count is the fastest way to spot a leak before users notice.
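
If your host’s panel can’t send those alerts, even a crude manual check helps; PM2’s process table shows the restart counter in the ↺ column:

    # a climbing restart count usually means the leak is back
    pm2 status
    # machine-readable variant, handy for feeding a monitoring script
    pm2 jlist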

Real‑World Use Case: SaaS Dashboard API

My client runs a multi‑tenant analytics dashboard that hits the NestJS API 200 times per second during peak hours. After the fix:

  • Server uptime jumped from 96% → 99.97%.
  • Monthly hosting cost stayed the same (still a $12 shared VPS).
  • Customer churn dropped by 12% because the dashboard no longer timed out.

Results / Outcome

In under three hours I turned a fatal 502 crash into a stable production environment with the following metrics:

  • Average response time: 132 ms (down 28% from pre‑fix).
  • Peak memory usage: 176 MB (well under the 256 MB VPS limit).
  • Zero 502 errors logged over a 30‑day monitoring window.
  • Developer confidence is up – cache changes now ship with a dedicated unit test (sketched below).
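
That test is deliberately tiny; here is roughly what it looks like (the file name and sample data are illustrative):

    // cache.service.spec.ts – rough sketch of the regression test
    import { CacheService } from './cache.service';

    describe('CacheService', () => {
      it('returns stored values and misses unknown keys', () => {
        const cache = new CacheService();
        cache.set('user:1', { name: 'Ada' });
        expect(cache.get('user:1')).toEqual({ name: 'Ada' });
        expect(cache.get('user:42')).toBeUndefined();
      });
    });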

Bonus Tips for NestJS on Shared Hosts

  1. Use node --max-old-space-size=200 to cap the heap and force early GC.
  2. Enable Helmet for sane security headers and compression to shrink response payloads (see the bootstrap sketch after this list).
  3. Schedule a nightly pm2 restart to clear any residual leaks.
  4. Monitor with UptimeRobot and set a 5-minute alert for any 502 or 504 status.
  5. Keep your npm dependencies up to date. A hidden bug in class-validator caused memory bloat in v0.13.2.
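
For tip 2, this is roughly what the bootstrap looks like in a stock Express-based NestJS app; AppModule and the port are whatever your project already uses:

    // main.ts – sketch of tip 2: security headers plus gzip compression
    import { NestFactory } from '@nestjs/core';
    import helmet from 'helmet';
    import * as compression from 'compression';
    import { AppModule } from './app.module';

    async function bootstrap() {
      const app = await NestFactory.create(AppModule);
      app.use(helmet());        // sensible security headers out of the box
      app.use(compression());   // gzip responses to cut payload size
      await app.listen(3000);
    }
    bootstrap();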

Warning: Never disable the VPS’s OOM killer as a “quick fix.” It only masks the problem and can bring down other services on the same server.

Monetization Sidebar (Optional)

If you’re looking to turn this knowledge into cash, consider:

  • Creating a paid “NestJS Performance Playbook” PDF.
  • Offering a one‑off consulting package to audit other Node apps.
  • Building a recurring “Health‑Check as a Service” that runs Clinic.js on client servers.

Fixing a 502 isn’t just about keeping the lights on—it’s about turning a frustrating debugging sprint into a showcase of reliability that can be marketed, upsold, and reused across every future project. Happy coding!
