Why My NestJS App Crashes on a VPS After a Zero‑Downtime Deploy: Unveiling the Hidden Memory Leak in PM2 Configs
It’s 2 AM. Your production logs are screaming “out of memory”, and the fresh zero‑downtime deploy you just pushed with PM2 fell flat on its face. You’re staring at a blank screen, a frantic Slack channel, and a VPS that’s about to restart itself. Sound familiar? You’re not alone—most Node.js developers hit this wall when they combine NestJS, PM2, and automated deployments without a proper memory‑management plan.
Why This Matters
Every minute of downtime costs revenue, hurts SEO rankings, and erodes customer trust. In the SaaS world, a single crash can rip $5,000‑$10,000 straight from your bottom line. Understanding why PM2’s cluster mode silently inflates RAM usage is the difference between a thriving API and a crashed VPS that needs a manual reboot.
Step‑by‑Step Tutorial: Fix the Hidden Memory Leak
Step 1: Confirm the Symptom

Run `pm2 list` and `pm2 logs`. Look for repeated “heap out of memory” messages right after a `pm2 reload`. If you see the same process ID being killed and restarted, you’re dealing with a leak.
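A quick terminal check (assuming your app is registered under the name `api`, as in the ecosystem file in Step 2):

```bash
# Watch the restart counter climb after each reload
pm2 list

# Tail the last 200 log lines for the api process
pm2 logs api --lines 200
```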
Step 2: Check Your PM2 Ecosystem File

Open `ecosystem.config.js`. The default cluster config often forgets to set `max_memory_restart`. Add it, and limit each fork to a safe threshold (e.g., 300 MB).

```js
// ecosystem.config.js
module.exports = {
  apps: [{
    name: "api",
    script: "dist/main.js",
    instances: "max",
    exec_mode: "cluster",
    max_memory_restart: "300M", // ← add this line
    env: { NODE_ENV: "production" }
  }]
};
```
Step 3: Cap Node’s Heap and Enable Heap Snapshots

Start PM2 with `--node-args="--inspect --max-old-space-size=300"`. This caps each worker’s V8 heap at 300 MB, matching the `max_memory_restart` threshold, while `--inspect` exposes the debugger endpoint you’ll use for heap snapshots in Step 6.

```bash
pm2 start ecosystem.config.js --node-args="--inspect --max-old-space-size=300"
```
Step 4: Add a Graceful Shutdown Hook in NestJS

When PM2 sends a `SIGINT`, NestJS can linger on pending promises, leaving RAM orphaned. Hook into `beforeApplicationShutdown` to close DB pools, message queues, and any open file handles.

```ts
// app.module.ts
import { Module, OnModuleInit, BeforeApplicationShutdown } from '@nestjs/common';
import { PrismaService } from './prisma.service';

@Module({
  providers: [PrismaService],
})
export class AppModule implements OnModuleInit, BeforeApplicationShutdown {
  constructor(private readonly prisma: PrismaService) {}

  async onModuleInit() {
    // Warm-up connections, cache, etc.
  }

  async beforeApplicationShutdown(signal?: string) {
    await this.prisma.$disconnect(); // close DB pool
    // close other services like RabbitMQ, Redis, etc.
    console.log(`NestJS shutting down gracefully (signal: ${signal})`);
  }
}
```
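One prerequisite the snippet above relies on: Nest does not listen for termination signals unless you opt in. If `beforeApplicationShutdown` never seems to fire, make sure shutdown hooks are enabled in your bootstrap file (a minimal sketch, assuming the standard `dist/main.js` entry point):

```ts
// main.ts
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  app.enableShutdownHooks(); // without this, beforeApplicationShutdown is never called
  await app.listen(3000);
}
bootstrap();
```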
Step 5: Monitor Real‑Time Memory with PM2 Plus or pm2 monit

Deploy the changes, then run `pm2 monit`. Watch the “MEM” column settle below your limit. If you still see a steady upward trend, you have a genuine application‑level leak that needs profiling.
Step 6: Profile the Leak (Optional but Recommended)

Use Clinic.js or Chrome DevTools (attach to the `--inspect` port from Step 3) to capture a heap snapshot after a few minutes of traffic. Look for growing arrays, unclosed streams, or cached objects that never get GC’d.
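If you go the Clinic.js route, a heap‑profiling session could look like the sketch below (assuming your compiled entry point is `dist/main.js`; run it against a staging copy, not the live cluster):

```bash
# One-time install of the Clinic.js toolchain
npm install -g clinic

# Profile heap allocations; stop with Ctrl+C after a few minutes of traffic
clinic heapprofiler -- node dist/main.js
```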
Step 7: Automate the Fix in CI/CD

Add a step to your pipeline that validates `max_memory_restart` exists and fails the build if it’s missing. This prevents the same mistake from slipping back in.

```yaml
# .github/workflows/deploy.yml
- name: Validate PM2 config
  run: |
    node -e "
      const cfg = require('./ecosystem.config.js');
      if (!cfg.apps[0].max_memory_restart) {
        console.error('❌ max_memory_restart is required!');
        process.exit(1);
      }
    "
```
Pro tip: if you run several Node services on the same VPS, give each one its own `max_memory_restart` and `--max-old-space-size` to avoid one service stealing memory from another.
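A hypothetical two‑app split might look like this (the `worker` entry and all limits are illustrative assumptions, not part of the original config):

```js
// ecosystem.config.js: one VPS, two services, each with its own memory budget
module.exports = {
  apps: [
    {
      name: "api",
      script: "dist/main.js",
      exec_mode: "cluster",
      instances: 2,
      max_memory_restart: "300M",
      node_args: "--max-old-space-size=300" // per-worker heap cap
    },
    {
      name: "worker", // hypothetical background-job process
      script: "dist/worker.js",
      exec_mode: "fork",
      max_memory_restart: "200M",
      node_args: "--max-old-space-size=200"
    }
  ]
};
```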
Real‑World Use Case: Scaling a SaaS Dashboard
A fintech startup ran a NestJS dashboard on a $15/mo DigitalOcean droplet. After enabling zero‑downtime deploys with pm2 reload all, the app crashed every 2‑3 hours. By applying the steps above, they reduced RAM usage from 1.2 GB to a stable 800 MB across 4 cluster workers, eliminated the nightly crashes, and saved the $120/month they were about to spend on a bigger droplet.
Results / Outcome
- Zero‑downtime deployments stay truly zero‑downtime—no “service unavailable” spikes.
- Memory stays under the defined ceiling; PM2 automatically restarts a runaway worker before the OS OOM killer steps in.
- Application shutdowns are graceful, keeping DB transactions clean and preventing orphaned jobs.
- Overall uptime climbs from 96 % to 99.97 % (≈ 2.6 hours of downtime per year instead of roughly two weeks).
Bonus Tips
- Use `pm2 save` after a successful reload so the running process list is persisted across reboots.
- Set `--log-date-format="YYYY-MM-DD HH:mm Z"` for easier log correlation with deployment timestamps.
- Consider `pm2-runtime` for Docker containers; it respects the same memory limits (see the sketch after this list).
- Turn on `NODE_NO_WARNINGS=1` in production to avoid noisy deprecation warnings that clutter logs.
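For the Docker route, a minimal sketch of a `pm2-runtime` container (base image, build steps, and paths are assumptions; adapt them to your project):

```dockerfile
# Dockerfile: pm2-runtime keeps PM2 in the foreground as PID 1
FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm ci && npm run build && npm install -g pm2
CMD ["pm2-runtime", "ecosystem.config.js"]
```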
Common pitfall: never set `max_memory_restart` higher than the total RAM of your VPS. If you allocate 4 GB to a 4 GB droplet, the kernel will still kill the process, and you’ll see “SIGKILL” in the PM2 logs.
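Before picking a number, it’s worth sketching the budget; the overhead figure below is a rough assumption, not a measurement:

```js
// Hypothetical memory budget for a 4 GB droplet running 4 cluster workers
const totalRamMb = 4096;
const osAndPm2OverheadMb = 700; // OS, PM2 daemon, and page-cache headroom (estimate)
const workers = 4;

// Per-worker ceiling that keeps the whole fleet under total RAM
const perWorkerLimitMb = Math.floor((totalRamMb - osAndPm2OverheadMb) / workers);
console.log(`max_memory_restart: "${perWorkerLimitMb}M"`); // → "849M"
```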
Monetization (Optional)
If you run a consultancy, package this “Zero‑Downtime NestJS Blueprint” as a paid audit. Clients love a one‑page PDF that lists the exact ecosystem.config.js tweaks, a Dockerfile snippet, and a monitoring checklist. Charge $300‑$500 per audit and you’ll quickly offset the VPS costs.
Ready to stop mysterious crashes and keep your NestJS API humming 24/7? Apply the steps, lock down your PM2 config, and watch your uptime chart shoot up.