Saturday, April 18, 2026

"Frustrated with 'NestJS Too Many Open Files' on VPS? Here's How to Fix It Now!"

Frustrated with NestJS Too Many Open Files on VPS? Here's How to Fix It Now!

We were running a critical SaaS application on an Ubuntu VPS managed through aaPanel: a Laravel application powering the Filament admin panel, with real-time tasks handled by a NestJS queue worker. The deployment pipeline was supposed to be seamless. But one night, right after a routine dependency update, the entire system seized up. Not a 500 error, but a complete process freeze, followed by a catastrophic failure: the server was running out of file descriptors and the Node.js processes were crashing one after another.

This wasn't a local issue. This was production. The system was unresponsive, the queue worker stopped processing critical jobs, and the Filament interface became completely inaccessible. We were staring at a wall of logs, realizing the error wasn't in the NestJS code itself, but in how the environment was managing the application lifecycle.

The Real Error We Faced

The primary symptom was resource exhaustion, which eventually escalated into fatal system errors. The NestJS application logs began emitting confusing errors that pointed to resource contention at the OS level rather than a standard application crash.

Here is an excerpt from the NestJS logs just before the process crash:

[2024-06-15T03:15:22.101Z] NestJS Error: EMFILE: too many open files (4096 of 4096 descriptors in use)
[2024-06-15T03:15:22.102Z] Process exiting: worker crash detected, PID 1234 killed.
[2024-06-15T03:15:22.103Z] Uncaught Error: Error: EMFILE: too many open files, open '...'
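If you want to see this failure class outside production, EMFILE is easy to reproduce locally by lowering the descriptor limit in a throwaway subshell (a sketch; it assumes node is on the PATH, and the limit of 64 is an arbitrary example):

```shell
# Reproduce EMFILE in a subshell so the lowered limit does not leak out.
(
  ulimit -n 64   # cap this subshell (and its children) at 64 descriptors
  node -e '
    const fs = require("fs");
    const fds = [];
    try {
      // keep opening /dev/null until the process hits its descriptor limit
      while (true) fds.push(fs.openSync("/dev/null", "r"));
    } catch (e) {
      console.log(e.code);   // EMFILE: too many open files
    }
  '
)
```

This is exactly what was happening to the worker in production, except the limit was being hit by handles that were never released between deploys.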

Root Cause Analysis: Why Open Files?

The developers immediately assumed a memory leak in the queue worker. That was the wrong assumption. The root cause was a confluence of corrupted deployment artifacts and improper resource handling across the combined environment: PHP-FPM serving the Laravel/Filament side, and Supervisor managing the NestJS worker.

Specifically, during the deployment pipeline triggered by aaPanel, the build process was mishandling the caching of `node_modules` and the associated Composer autoload files. When the deployment script restarted the workers, the old processes never released the file handles they were holding, while Supervisor's restart loop kept spawning replacements on top of them. The application wasn't leaking memory in the traditional sense; it was failing to release file handles held by stale module state, which drove the processes straight into the descriptor limit.

The core technical issue was a stale dependency cache coupled with a corrupted Composer autoloader in the production environment, exacerbated by the way Supervisor was managing worker restarts.

Step-by-Step Debugging Process

We had to move beyond looking at the NestJS logs and investigate the underlying Linux system state:

1. Initial System Health Check

  • Check Process Status: We used htop to confirm that the Node.js processes were stuck or rapidly exiting, and observed an excessive number of open file descriptors on the parent process.
  • Check Service Status: We verified the state of the process managers with systemctl status php-fpm and systemctl status supervisor. This confirmed that the worker had crashed and that Supervisor was restarting it in a tight loop.

2. Deep Dive into Logs

  • Journalctl Inspection: We used journalctl -u supervisor -r -n 500 to review the system journal in detail. This showed the exact sequence: the worker repeatedly failing to initialize because it could not allocate new file descriptors.
  • Process Tracing: We used lsof -p on the hung Node process to identify which specific files were being held open, confirming the massive number of stale or inaccessible file descriptors.
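The same inspection works with nothing but /proc when lsof is not installed on a minimal VPS image. A sketch, using this shell's own PID ($$) as a stand-in for the hung worker's PID:

```shell
# Substitute the hung Node worker's PID for $$ in production.
PID=$$
ls /proc/$PID/fd | wc -l                 # how many descriptors are open right now
grep "Max open files" /proc/$PID/limits  # the soft/hard limit for that process
ls -l /proc/$PID/fd | head               # what the first few descriptors point to
```

Comparing the first two numbers tells you immediately how close the process is to exhaustion, and the symlink targets in /proc/PID/fd show which files or sockets are being hoarded.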

3. Dependency and Environment Check

  • Composer Integrity: On the Laravel/Filament side, we ran composer dump-autoload --no-dev --optimize manually to ensure the autoload files were valid and had not been corrupted by the deployment.
  • Permission Review: We checked the permissions on the application directory and the Node.js execution environment to rule out permission-based file access issues, which are common in aaPanel setups.

The Real Fix: Actionable Commands

The solution required a hard reset of the deployment artifacts and a corrective configuration of the process management:

1. Clean Module and Autoload Cache

We forced a clean slate for the Node dependencies, and re-ran the optimized Composer install for the Laravel side in its own directory:

cd /var/www/my-nestjs-app
rm -rf node_modules
npm ci --omit=dev
# In the Laravel application directory:
composer install --no-dev --optimize-autoloader --no-interaction

2. Restart and Stabilize Services

Instead of relying solely on the deployment script, we explicitly restarted the process managers, letting Supervisor bring the NestJS worker back up under controlled conditions:

sudo systemctl restart php-fpm
sudo systemctl restart supervisor

3. Implement Resource Limits (The Safety Net)

To prevent this failure mode from recurring, we raised the file descriptor limits for the application environment. Note that Node's EMFILE error is raised against the per-process nofile limit, not the system-wide fs.file-max, so both need attention (the values below are examples; size them to your workload):

sudo sysctl -w fs.file-max=65536
# Ensure this setting persists across reboots
echo "fs.file-max = 65536" | sudo tee -a /etc/sysctl.conf
# Raise the per-process limit in /etc/security/limits.conf
# (replace www-data with the user your workers run as):
#   www-data soft nofile 16384
#   www-data hard nofile 65536
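After applying the changes, it is worth confirming which limits are actually in effect, since EMFILE is measured against the per-process soft limit rather than fs.file-max:

```shell
ulimit -Sn                   # soft per-process limit (what EMFILE is measured against)
ulimit -Hn                   # hard ceiling the soft limit may be raised to
cat /proc/sys/fs/file-max    # system-wide ceiling managed by sysctl
```

Run the ulimit checks as the user the workers run under; limits.conf entries only apply to new sessions for that user.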

Why This Happens in VPS / aaPanel Environments

This isn't just a NestJS bug; it’s an environment management flaw common in containerized or panel-managed VPS setups:

  • Stale Deployment Artifacts: aaPanel's deployment scripts, while convenient, often fail to properly clean up or synchronize the Composer cache and `node_modules` directory during updates, leaving behind corrupted file handles when the Node process restarts.
  • Supervisor Overload: When Supervisor attempts rapid restarts, the underlying OS and the application's file handle caching struggle to keep up, leading to descriptor starvation before the new process can initialize correctly.
  • Resource Constraints: Default VPS configurations often impose stricter limits on file descriptors than a local machine, making applications highly sensitive to file handle management errors.

Prevention: Future-Proofing Your Deployment

To eliminate this class of deployment-related failure, implement these strict patterns:

  1. Dedicated Build Environment: Never rely on the deployment script alone. Use a dedicated CI/CD step to run dependency installation and artifact cleanup separately.
  2. Explicit Cache Clearing: Ensure your deployment script explicitly deletes and rebuilds the node_modules directory and Composer cache before restarting the FPM service.
  3. Resource Guardrails: Always configure the kernel limits (like fs.file-max) higher than the typical baseline for your expected workload, giving the application necessary breathing room.
  4. Supervisor Fine-Tuning: Review the Supervisor configuration to ensure it has appropriate restart policies and watchdog timeouts, preventing chaotic rapid re-initializations that exacerbate resource contention.
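The patterns above can be folded into a single explicit deploy step. A minimal sketch as a shell function; the path, the Supervisor program name (nestjs-worker), and the health check URL are assumptions to adapt to your setup:

```shell
# Hypothetical deploy routine; nothing here is aaPanel-specific.
deploy() {
  set -e
  local app_dir=${1:-/var/www/my-nestjs-app}
  cd "$app_dir"
  rm -rf node_modules                        # explicit cache clearing (pattern 2)
  npm ci --omit=dev                          # reproducible install from package-lock.json
  sudo supervisorctl restart nestjs-worker   # one controlled restart (pattern 4)
  # Verify before declaring success, instead of letting Supervisor loop on a bad build:
  curl -fsS --retry 5 --retry-delay 2 http://127.0.0.1:3000/health
}
```

Running the install before the restart, and gating success on a health check, removes the window where a half-deployed artifact gets restarted into a crash loop.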

Conclusion

Debugging production resource issues requires looking beyond the application code and digging into the OS and process management layer. The NestJS error wasn't a memory leak; it was a file descriptor management failure caused by stale deployment artifacts in a resource-constrained environment. Mastering the interaction between your Node application and the VPS environment is the true mark of a senior DevOps engineer.
