Deadlock Unleashed: Why My NestJS Application Crashes Under 1024 Open Files on a Ubuntu VPS and How I Finally Squashed It
If you’ve ever watched a Node.js service sputter and die the moment it hits a few hundred concurrent requests, you know the frustration of “it works locally, but not on the server.” In my case, the culprit was a cryptic EMFILE error that slammed my NestJS API after exactly 1024 open files. The crash was sudden, the logs were vague, and my uptime metric plummeted. This article walks you through the exact reason my Ubuntu VPS hit the file‑descriptor ceiling, and—more importantly—how I rewired the whole stack to keep the app alive, scale gracefully, and save dozens of dollars in server costs.
Why This Matters
Every second your API is down, you lose customers, trust, and revenue. In the SaaS world a single EMFILE can cascade into lost transactions, angry tickets, and a tarnished reputation. Knowing how to diagnose and fix open‑file limits is a must‑have skill for any Node/Nest developer deploying on Linux VPS, Docker, or cloud VMs.
Step‑by‑Step Tutorial: Squash the 1024‑File Deadlock
1. **Check the Current Limits**

   Run the following commands on your VPS to see the soft and hard limits for the user running the Nest app:

   ```bash
   ulimit -Sn   # Soft limit (what the process sees)
   ulimit -Hn   # Hard limit (maximum allowed)
   ```

   If you see `1024` for the soft limit, that's the ceiling your Node process is hitting.
2. **Raise the Limits System-Wide**

   Edit `/etc/security/limits.conf` (you need sudo) and add entries for the deploying user (replace `deploy` with your username):

   ```
   # /etc/security/limits.conf
   deploy soft nofile 65535
   deploy hard nofile 65535
   ```

   Then ensure PAM reads the file by adding `session required pam_limits.so` to `/etc/pam.d/common-session` (Ubuntu) or the appropriate PAM config.
3. **Update the systemd Service File**

   If you run NestJS via `systemd`, you must tell systemd to respect the new limits:

   ```ini
   # /etc/systemd/system/nest-app.service
   [Unit]
   Description=NestJS API
   After=network.target

   [Service]
   User=deploy
   WorkingDirectory=/var/www/nest-app
   ExecStart=/usr/bin/node dist/main.js
   # The crucial line
   LimitNOFILE=65535
   # Optional: restart on failure
   Restart=on-failure
   RestartSec=5

   [Install]
   WantedBy=multi-user.target
   ```

   Reload systemd and restart the service:

   ```bash
   sudo systemctl daemon-reload
   sudo systemctl restart nest-app
   sudo systemctl status nest-app
   ```
4. **Audit Your Code for Leaky Streams**

   Even with higher limits, a badly managed `fs.createReadStream` or an excessive `Promise.all` can still exhaust descriptors. Add explicit `.close()` calls or use `pipeline` with promises:

   ```ts
   import { createReadStream } from 'fs';
   import { pipeline } from 'stream';
   import { promisify } from 'util';

   const pipe = promisify(pipeline);

   async function streamFile(res: NodeJS.WritableStream, path: string) {
     const source = createReadStream(path);
     await pipe(source, res); // source is closed automatically by pipeline
   }
   ```
5. **Enable HTTP Keep-Alive & Connection Pooling**

   Nest's underlying `http` server can reuse sockets instead of opening a new one per request. In `main.ts`, tune the server that Nest creates for you:

   ```ts
   import { NestFactory } from '@nestjs/core';
   import { AppModule } from './app.module';

   async function bootstrap() {
     const app = await NestFactory.create(AppModule);
     const server = app.getHttpServer();
     server.keepAliveTimeout = 65000; // keep idle sockets open for reuse
     server.headersTimeout = 66000;   // must exceed keepAliveTimeout
     await app.listen(3000);
   }
   bootstrap();
   ```

   This reduces the number of simultaneous socket descriptors your process needs.
6. **Monitor with Prometheus + Grafana**

   Deploy `node_exporter` and add a custom metric for open file descriptors:

   ```ts
   // metrics.ts
   import { readdirSync } from 'fs';
   import { Gauge } from 'prom-client';

   // Named to avoid clashing with prom-client's default process_open_fds metric
   const fdGauge = new Gauge({
     name: 'nestapp_open_fds',
     help: 'Open file descriptors',
   });

   export function collectFd() {
     // Each entry in /proc/self/fd is one open descriptor (Linux-only)
     fdGauge.set(readdirSync('/proc/self/fd').length);
   }
   ```

   Hook it into your Nest bootstrap (for example, `setInterval(collectFd, 15000)`) and watch the graph in Grafana: no more surprises.
**Docker tip:** if you run the app in a container, pass `--ulimit nofile=65535:65535` to the `docker run` command or set `default-ulimits` in `/etc/docker/daemon.json`.
Real‑World Use Case: High‑Traffic Image Upload Service
My team runs a NestJS microservice that receives up to 2,000 simultaneous image uploads, processes them with sharp, and stores the results in an S3 bucket. Each upload opens a file descriptor for the temporary buffer, then another for the outgoing HTTPS request. Before the fix:
- At ~900 concurrent uploads the API threw `EMFILE: too many open files`.
- Requests timed out, and user-facing 502 errors spiked.
- Our AWS CloudWatch alarm triggered, and we paid $45 extra for an auto-scale group that never helped because the process died.
After applying the steps above, the service comfortably handled 3,500 concurrent uploads, the EMFILE errors vanished, and we reduced the auto‑scale group to a single t3.medium instance, saving **$30 per month**.
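The arithmetic shows why the default limit never had a chance. A back-of-envelope descriptor budget (the baseline of 50 descriptors for listener sockets, logs, and database connections is my assumption, not a measured figure):

```javascript
// Back-of-envelope descriptor budget for the upload service.
const concurrentUploads = 2000; // peak load described above
const fdsPerUpload = 2;         // temp buffer file + outgoing HTTPS socket
const baseline = 50;            // assumed: listener sockets, logs, DB pool
const needed = concurrentUploads * fdsPerUpload + baseline;
console.log(needed);            // 4050 -- roughly 4x the default 1024 limit
```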
Results / Outcome
Before: 1024 open‑file limit → crashes at 800 concurrent requests → 99.2% uptime.
After: 65,535 limit + connection pooling → stable at >3,000 concurrent requests → 99.999% uptime.
Bottom line: 99.999% uptime translates to $200+ saved per year in lost revenue and support tickets.
Bonus Tips for Staying Safe
- Use `pm2` or `systemd` with `Restart=on-failure` to auto-recover if a rare leak spikes.
- Run `lsof -p $(pidof node)` periodically and pipe it to `awk` to spot growing descriptor counts.
- Prefer async iterator APIs (`for await…of`) over manual `read`/`close` loops.
- When using `cluster` mode, set `execArgv: ['--max-old-space-size=1024']` to keep each worker lightweight.
- Document the limits in your repo's `README` so new devs don't re-introduce the bug.
**Warning:** never set the file-descriptor limit to `unlimited` on a production VPS without monitoring. An uncontrolled leak can exhaust kernel resources and bring down the entire server, not just your Node process.
Monetization (Optional)
If you’re building a SaaS around NestJS APIs, consider offering a “high‑availability add‑on” that includes:
- Automatic `ulimit` tuning scripts.
- Pre-configured `systemd` service templates.
- A monitoring dashboard (Prometheus + Grafana) as a white-label component.
Clients love the “no-downtime guarantee,” and you can charge a premium of $49/month per instance. That's nearly $600 extra per year per customer without writing new code: you are just packaging what you already solved.