Monday, May 4, 2026

Deadlock Unleashed: Why My NestJS Application Crashes Under a 1024 Open-File Limit on an Ubuntu VPS and How I Finally Squashed It

If you’ve ever watched a Node.js service sputter and die the moment it hits a few hundred concurrent requests, you know the frustration of “it works locally, but not on the server.” In my case, the culprit was a cryptic EMFILE error that slammed my NestJS API the instant it hit 1024 open file descriptors. The crash was sudden, the logs were vague, and my uptime metric plummeted. This article walks through exactly why my Ubuntu VPS hit the file-descriptor ceiling and, more importantly, how I rewired the whole stack to keep the app alive, scale gracefully, and shave dozens of dollars off server costs.

Why This Matters

Every second your API is down, you lose customers, trust, and revenue. In the SaaS world, a single EMFILE can cascade into lost transactions, angry tickets, and a tarnished reputation. Knowing how to diagnose and fix open-file limits is a must-have skill for any Node/Nest developer deploying to a Linux VPS, Docker, or a cloud VM.

Step‑by‑Step Tutorial: Squash the 1024‑File Deadlock

  1. Check the Current Limits

    Run the following commands on your VPS to see the soft and hard limits for the user running the Nest app:

    ulimit -Sn    # Soft limit (what the process sees)
    ulimit -Hn    # Hard limit (maximum allowed)

    If you see 1024 for the soft limit, that’s the ceiling your Node process is hitting.
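
    The shell's ulimit only reflects your login session. To check what a running Node process actually got, read its limits from /proc (this assumes a single node process; substitute the PID otherwise):

    cat /proc/$(pidof node)/limits | grep 'Max open files'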

  2. Raise the Limits System‑Wide

    Edit /etc/security/limits.conf (you need sudo) and add entries for the deploying user (replace deploy with your username):

    # /etc/security/limits.conf
    deploy   soft   nofile   65535
    deploy   hard   nofile   65535

    Then ensure PAM actually applies the file by adding the pam_limits line shown below to /etc/pam.d/common-session (Ubuntu) or the appropriate PAM config. Note this only covers login sessions; systemd services need the next step.
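
    The line to add looks like this (check whether it is already present first):

    # /etc/pam.d/common-session
    session required pam_limits.so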

  3. Update Systemd Service File

    If you run NestJS via systemd, be aware that systemd services ignore limits.conf entirely; you must set the limit in the unit file itself:

    # /etc/systemd/system/nest-app.service
    [Unit]
    Description=NestJS API
    After=network.target
    
    [Service]
    User=deploy
    WorkingDirectory=/var/www/nest-app
    ExecStart=/usr/bin/node dist/main.js
    # The crucial line
    LimitNOFILE=65535
    # Optional: restart on failure
    Restart=on-failure
    RestartSec=5
    
    [Install]
    WantedBy=multi-user.target

    Reload systemd and restart the service:

    sudo systemctl daemon-reload
    sudo systemctl restart nest-app
    sudo systemctl status nest-app
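
    To confirm systemd picked up the new limit, query the unit and the live process (nest-app is the unit name from above):

    systemctl show nest-app -p LimitNOFILE
    cat /proc/$(systemctl show nest-app -p MainPID --value)/limits | grep 'Max open files'
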
  4. Audit Your Code for Leaky Streams

    Even with higher limits, a badly managed fs.createReadStream or an unbounded Promise.all over file operations can still exhaust descriptors. Destroy streams explicitly or, better, let a promisified pipeline handle cleanup for you:

    import { createReadStream } from 'fs';
    import { pipeline } from 'stream/promises'; // promisified pipeline (Node 15+)
    
    async function streamFile(res, path) {
      // pipeline destroys the source stream on success and on error,
      // so the file descriptor is always released
      await pipeline(createReadStream(path), res);
    }
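
    For contrast, here is the kind of leak that pattern prevents: with a bare .pipe(), a client that disconnects mid-transfer leaves the file handle open (a contrived sketch):

    import { createReadStream } from 'fs';
    
    // Anti-pattern: .pipe() does not destroy the source stream when the
    // destination errors, so every aborted download leaks a descriptor
    function leakyStreamFile(res, path) {
      createReadStream(path).pipe(res);
    }
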
  5. Enable HTTP Keep‑Alive & Connection Pooling

    Nest's underlying Node http server can reuse sockets instead of opening a new one per request; HTTP/1.1 keep-alive is on by default, but the default idle timeout is short. In main.ts, grab the server and raise it:

    import { NestFactory } from '@nestjs/core';
    import { AppModule } from './app.module';
    
    async function bootstrap() {
      const app = await NestFactory.create(AppModule);
    
      // Tune keep-alive on the underlying Node http.Server so idle
      // sockets are reused instead of re-opened for every request
      const server = app.getHttpServer();
      server.keepAliveTimeout = 65000; // keep idle sockets for 65s
      server.headersTimeout = 66000;   // must exceed keepAliveTimeout
    
      await app.listen(3000);
    }
    bootstrap();

    This reduces the number of simultaneous socket descriptors your process needs.
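
    The same idea applies to outgoing requests (for example, the HTTPS calls to S3 in the use case below). A minimal sketch, assuming you route outbound calls through a shared agent; the socket caps here are illustrative:

    import * as https from 'https';
    
    // One shared agent means pooled, reused sockets instead of a fresh
    // descriptor per outbound request
    export const httpsAgent = new https.Agent({
      keepAlive: true,
      maxSockets: 100,    // hard cap on simultaneous outbound sockets
      maxFreeSockets: 10, // idle sockets kept around for reuse
    });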

  6. Monitor with Prometheus + Grafana

    Deploy node_exporter and add a custom metric for open file descriptors:

    # metrics.js
    import { Gauge } from 'prom-client';
    import { readdir } from 'fs/promises';
    
    // Named to avoid clashing with the process_open_fds gauge that
    // prom-client's collectDefaultMetrics() already registers
    const fdGauge = new Gauge({ name: 'nodejs_open_fds', help: 'Open file descriptors' });
    
    export async function collectFd() {
      // On Linux, /proc/self/fd holds one entry per open descriptor;
      // cheaper and more reliable than shelling out to lsof
      const fds = await readdir('/proc/self/fd');
      fdGauge.set(fds.length);
    }

    Hook it into your Nest bootstrap as sketched below and watch the graph in Grafana; no more surprises.
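
    A minimal way to wire it up, assuming the collectFd() export above and an Express-based Nest app (the 10-second interval and /metrics path are just conventions):

    import { NestFactory } from '@nestjs/core';
    import { register } from 'prom-client';
    import { AppModule } from './app.module';
    import { collectFd } from './metrics';
    
    async function bootstrap() {
      const app = await NestFactory.create(AppModule);
    
      // Sample the descriptor count every 10 seconds
      setInterval(() => collectFd().catch(() => {}), 10_000);
    
      // Expose the registry for Prometheus to scrape
      app.use('/metrics', async (_req, res) => {
        res.set('Content-Type', register.contentType);
        res.send(await register.metrics());
      });
    
      await app.listen(3000);
    }
    bootstrap();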

Tip: If you’re using Docker, pass --ulimit nofile=65535:65535 to docker run, or set default-ulimits in /etc/docker/daemon.json (example below).
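
For the daemon-wide option, /etc/docker/daemon.json takes an entry like this (same values as above):

# /etc/docker/daemon.json
{
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Soft": 65535, "Hard": 65535 }
  }
}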

Real‑World Use Case: High‑Traffic Image Upload Service

My team runs a NestJS microservice that receives up to 2,000 simultaneous image uploads, processes them with sharp, and stores the results in an S3 bucket. Each upload opens a file descriptor for the temporary buffer, then another for the outgoing HTTPS request. Before the fix:

  • At ~900 concurrent uploads the API threw EMFILE: too many open files
  • Requests timed out, user‑facing 502 errors spiked
  • Our AWS CloudWatch alarm triggered, and we paid $45 extra for an auto‑scale group that never helped because the process died.

After applying the steps above, the service comfortably handled 3,500 concurrent uploads, the EMFILE errors vanished, and we reduced the auto‑scale group to a single t3.medium instance, saving $30 per month.

Results / Outcome

Before: 1024 open‑file limit → crashes at 800 concurrent requests → 99.2% uptime.

After: 65,535 limit + connection pooling → stable at >3,000 concurrent requests → 99.999% uptime.

Bottom line: 99.999% uptime translates to $200+ saved per year in lost revenue and support tickets.

Bonus Tips for Staying Safe

  • Use pm2 or systemd with Restart=on-failure to auto‑recover if a rare leak spikes.
  • Run lsof -p $(pidof node) periodically and pipe it through awk to spot growing descriptor counts (see the one-liner after this list).
  • Prefer async iterator APIs (for await…of) over manual read/close loops.
  • When using cluster mode, set execArgv: ['--max-old-space-size=1024'] to keep each worker lightweight.
  • Document the limits in your repo's README so new devs don’t re‑introduce the bug.
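
A quick version of the lsof audit from the second tip (assumes a single node process; adjust the PID lookup otherwise):

# Count open descriptors by type (REG, IPv4, FIFO, ...) for the node process
lsof -p "$(pidof node)" | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn
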
Warning: Never set the limit to unlimited on a production VPS without monitoring. An uncontrolled leak can exhaust kernel resources and bring down the entire server, not just your Node process.

Monetization (Optional)

If you’re building a SaaS around NestJS APIs, consider offering a “high‑availability add‑on” that includes:

  1. Automatic ulimit tuning scripts.
  2. Pre‑configured systemd service templates.
  3. Monitoring dashboard (Prometheus + Grafana) as a white‑label component.

Clients love the “no‑downtime guarantee,” and you can charge a premium of $49/month per instance. That’s nearly $600 per year per customer without writing new code, just packaging what you already solved.
