Monday, May 4, 2026

Why My NestJS App Crashes on VPS After Zero‑Downtime Deploy—Unveiling the Hidden Memory Leak in PM2 Configs

It’s 2 AM. Your production logs are screaming “out of memory”, and the fresh zero‑downtime deploy you just pushed with PM2 died on arrival. You’re staring at a blank screen, a frantic Slack channel, and a VPS about to restart itself. Sound familiar? You’re not alone: most Node.js developers hit this wall when they combine NestJS, PM2, and automated deployments without a proper memory‑management plan.

Why This Matters

Every minute of downtime costs revenue, hurts SEO rankings, and erodes customer trust. In the SaaS world, a single crash can rip $5,000‑$10,000 straight from your bottom line. Understanding why PM2’s cluster mode silently inflates RAM usage (every worker runs its own V8 heap, so memory scales with the instance count) is the difference between a thriving API and a crashed VPS that needs a manual reboot.

Step‑by‑Step Tutorial: Fix the Hidden Memory Leak

  1. Confirm the Symptom

    Run pm2 list and pm2 logs. Look for repeated “heap out of memory” messages right after a pm2 reload. If the same app keeps getting killed and restarted (the restart counter in pm2 list climbs every few minutes), you’re dealing with a leak.

  2. Check Your PM2 Ecosystem File

    Open ecosystem.config.js. A typical cluster config omits max_memory_restart, so nothing stops a worker from ballooning. Add it, and cap each worker at a safe threshold (e.g., 300 MB).

    // ecosystem.config.js
    module.exports = {
      apps: [{
        name: "api",
        script: "dist/main.js",
        instances: "max",
        exec_mode: "cluster",
        max_memory_restart: "300M",  // ← add this line
        env: {
          NODE_ENV: "production"
        }
      }]
    };
    
  3. Cap the V8 Heap per Worker

    Pair max_memory_restart with Node’s own heap limit so each worker’s V8 heap tops out around 300 MB, aligned with the restart threshold. Put the flag in node_args inside ecosystem.config.js; PM2 generally ignores per‑process CLI flags such as --node-args when you start from a config file, so the file is the reliable place for it. Adding --inspect as well opens the inspector so you can capture heap snapshots later (step 6); remove it once profiling is done.

    // ecosystem.config.js (inside the same app entry)
    node_args: "--inspect --max-old-space-size=300",

    Then apply it with pm2 reload ecosystem.config.js.
  4. Add a Graceful Shutdown Hook in NestJS

    On a reload, PM2 sends SIGINT and, if the process hasn’t exited within kill_timeout (1600 ms by default), follows up with SIGKILL. If NestJS is still lingering on open handles, connections die mid‑flight and their memory stays orphaned. Hook into beforeApplicationShutdown to close DB pools, message queues, and any open file handles; the main.ts tweak that activates the hook follows the module below.

    // app.module.ts
    import { Module, OnModuleInit, BeforeApplicationShutdown } from '@nestjs/common';
    import { PrismaService } from './prisma.service';
    
    @Module({
      providers: [PrismaService],
    })
    export class AppModule implements OnModuleInit, BeforeApplicationShutdown {
      constructor(private readonly prisma: PrismaService) {}
    
      async onModuleInit() {
        // Warm‑up connections, cache, etc.
      }
    
      async beforeApplicationShutdown(signal?: string) {
        await this.prisma.$disconnect();   // close DB pool
        // close other services like RabbitMQ, Redis, etc.
        console.log(`NestJS shutting down gracefully (signal: ${signal})`);
      }
    }
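
    Heads‑up: NestJS ships with shutdown hooks disabled, so the hook above never fires unless you opt in during bootstrap. A minimal main.ts, assuming the standard bootstrap layout:

    // main.ts
    import { NestFactory } from '@nestjs/core';
    import { AppModule } from './app.module';

    async function bootstrap() {
      const app = await NestFactory.create(AppModule);
      app.enableShutdownHooks(); // without this, beforeApplicationShutdown never runs on SIGINT
      await app.listen(3000);
    }
    bootstrap();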
    
  5. Monitor Real‑Time Memory with PM2 Plus or pm2 monit

    Deploy the changes, then run pm2 monit. Watch the “MEM” column settle below your limit. If you still see a steady upward trend, you have a genuine application‑level leak that needs profiling.
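
    If you want an alert instead of a live dashboard, a small watchdog can poll pm2 jlist (PM2’s JSON process list) and warn when a worker creeps toward the cap. A rough sketch; the api app name and the 250 MB warning threshold are assumptions:

    // check-memory.js: hypothetical watchdog, run from cron or as its own PM2 app
    const { execSync } = require('child_process');

    const WARN_BYTES = 250 * 1024 * 1024; // warn before the 300 MB restart cap kicks in
    const procs = JSON.parse(execSync('pm2 jlist').toString());

    for (const proc of procs) {
      if (proc.name !== 'api') continue;              // app name from ecosystem.config.js
      const mem = proc.monit ? proc.monit.memory : 0; // resident memory in bytes
      if (mem > WARN_BYTES) {
        console.warn(`${proc.name} (pm_id ${proc.pm_id}) is at ${(mem / 1048576).toFixed(0)} MB`);
      }
    }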

  6. Profile the Leak (Optional but Recommended)

    Use clinic.js or Chrome DevTools (attach to the --inspect port from step 3) to capture heap snapshots after a few minutes of traffic. Look for growing arrays, unclosed streams, or cached objects that never get GC’d.
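
    A lightweight alternative, sketched below: Node’s built‑in v8.writeHeapSnapshot() dumps a .heapsnapshot file on demand, which you can open in DevTools’ Memory tab. Wiring it to SIGUSR2 lets you snapshot a live worker twice, a few minutes apart, and diff the results:

    // heap-dump.js: require this once at the top of main.ts
    const v8 = require('v8');

    process.on('SIGUSR2', () => {
      // Blocks the event loop briefly while the snapshot is written to the working directory
      const file = v8.writeHeapSnapshot();
      console.log(`Heap snapshot written to ${file}`);
    });

    Trigger it with kill -USR2 <pid> against a single worker; whatever keeps growing between the two snapshots is your leak.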

  7. Automate the Fix in CI/CD

    Add a step to your pipeline that validates max_memory_restart exists and fails the build if it’s missing. This prevents the same mistake from slipping back in.

    # .github/workflows/deploy.yml
    - name: Validate PM2 config
      run: |
        node -e "
          const cfg = require('./ecosystem.config.js');
          for (const app of cfg.apps) {
            if (!app.max_memory_restart) {
              console.error('❌ max_memory_restart is required on: ' + app.name);
              process.exit(1);
            }
          }
        "
    
Tip: If you run multiple micro‑services on the same VPS, allocate each a unique --max-old-space-size to avoid one service stealing memory from another.
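
For example, a minimal sketch of a two‑service split (names and sizes are illustrative, not a recommendation):

    // ecosystem.config.js: hypothetical two-service split on one 2 GB VPS
    module.exports = {
      apps: [
        { name: "api",    script: "dist/main.js",   node_args: "--max-old-space-size=512", max_memory_restart: "600M" },
        { name: "worker", script: "dist/worker.js", node_args: "--max-old-space-size=256", max_memory_restart: "300M" }
      ]
    };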

Real‑World Use Case: Scaling a SaaS Dashboard

A fintech startup ran a NestJS dashboard on a $15 /mo DigitalOcean droplet. After enabling zero‑downtime deploys with pm2 reload all, the app crashed every 2‑3 hours. By applying the steps above, they cut RAM usage from 1.2 GB to a stable 800 MB across 4 cluster workers, eliminated the nightly crashes, and avoided a $120 /month droplet upgrade.

Results / Outcome

  • Zero‑downtime deployments stay truly zero‑downtime—no “service unavailable” spikes.
  • Memory stays under the defined ceiling; PM2 restarts a runaway worker before the OS OOM killer steps in.
  • Application shutdowns are graceful, keeping DB transactions clean and preventing orphaned jobs.
  • Overall uptime climbs from 96 % to 99.97 % (roughly 2.6 hours of downtime per year instead of two weeks).

Bonus Tips

  • Run pm2 save after a successful reload so the current process list survives a reboot (pair it with pm2 startup so PM2 itself comes back up).
  • Set --log-date-format="YYYY-MM-DD HH:mm Z" for easier log correlation with deployment timestamps (see the ecosystem sketch after this list).
  • Consider pm2-runtime for Docker containers; it respects the same memory limits.
  • Set NODE_NO_WARNINGS=1 in production to keep deprecation warnings from cluttering your logs.
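
Several of these tips live naturally in the ecosystem file; a short sketch (log_date_format is the config‑file equivalent of the CLI flag):

    // ecosystem.config.js: optional extras from the tips above
    module.exports = {
      apps: [{
        name: "api",
        script: "dist/main.js",
        log_date_format: "YYYY-MM-DD HH:mm Z", // timestamps that line up with deploy logs
        env: {
          NODE_ENV: "production",
          NODE_NO_WARNINGS: "1" // silence deprecation noise
        }
      }]
    };
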
Warning: max_memory_restart is a per‑process limit, so budget for the whole cluster: four workers capped at 300 MB can still consume 1.2 GB between them. Keep that combined ceiling comfortably below your VPS’s total RAM; if it reaches the droplet’s full 4 GB, the kernel OOM killer fires before PM2 can react and you’ll see “SIGKILL” in the PM2 logs.

Monetization (Optional)

If you run a consultancy, package this “Zero‑Downtime NestJS Blueprint” as a paid audit. Clients love a one‑page PDF that lists the exact ecosystem.config.js tweaks, a Dockerfile snippet, and a monitoring checklist. Charge $300‑$500 per audit and you’ll quickly offset the VPS costs.

Ready to stop mysterious crashes and keep your NestJS API humming 24/7? Apply the steps, lock down your PM2 config, and watch your uptime chart shoot up.
