NestJS Deployment Nightmare on Ubuntu VPS: The "Cannot Find Module" Ghost
We were running a SaaS environment on an Ubuntu VPS managed via aaPanel. The setup included a NestJS backend handling critical API routes and a queue worker for background processing. The admin panel (Filament) was throwing red flags, and the whole system was grinding to a halt. The specific symptom wasn't a simple 500 error; it was a catastrophic failure deep within the runtime: "Cannot find module 'some-dependency' or 'class not found' during module loading."
This wasn't a local development issue. This was a production deployment failure that required surgical system debugging. The frustration level was 10/10 because the error messages pointed nowhere useful, making the deployment process look like pure, unadulterated chaos.
The Production Failure Scenario
The system deployed fine on staging. We pushed the new code to production, triggered the deployment via aaPanel, and within minutes, the application became completely inaccessible. The webhooks failed, the queue worker stopped consuming jobs, and the entire service hung. The load balancer was returning 502s, but the deeper NestJS logs were baffling.
The Real NestJS Error
The primary application logs, specifically those dumped by the NestJS runtime during startup and worker execution, were filled with dependency resolution failures. Here is a representative sample of what we were seeing in the system journal:
[2024-05-28 10:01:15] NestJS-Runtime: ERROR: BindingResolutionException: Cannot find module 'rxjs-compat'
[2024-05-28 10:01:16] NestJS-Runtime: FATAL: Module resolution failed. Missing dependency: @nestjs/common, but failed to locate core dependencies.
[2024-05-28 10:01:17] Worker-Process: ERROR: queue worker failed to initialize. Module 'bcrypt' not found.
[2024-05-28 10:01:18] System-FPM: WARNING: Node.js-FPM process crashed due to fatal memory exhaustion.
Root Cause Analysis: The Cache and Environment Mismatch
The usual first assumption is: "The dependencies are missing or corrupted." In our case that assumption was wrong. On a typical VPS, especially when deployment scripts manage dependencies across several services (the web server, the queue worker, and the Node.js runtime), the cause is far more often a cache or environment mismatch in how Node.js resolves module paths.
The specific root cause in our case was a combination of:
- Composer Autoload Corruption: Our deployment script, running via aaPanel, executed Composer commands, but the caching layer within the Node.js environment failed to re-index the `node_modules` directory, especially when global and local dependency installations were mixed.
- Node.js Version Mismatch (Subtle): The version of Node.js running the Node.js-FPM process (which handles the request serving) was subtly different or corrupted compared to the Node.js version used by the queue worker process, leading to conflicting module resolution when running `require()` calls.
- Permission/Ownership Conflict: The service accounts running the web server (FPM) and the background worker (Supervisor) were reading the application files with insufficient permissions, causing module resolution attempts to fail silently.
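One way to compare the web and worker environments is to have each process report the facts that influence `require()` resolution. The following is a minimal sketch, not the script we ran in production: the function names `collectFingerprint` and `diffFingerprints` are hypothetical, and it assumes a Node.js version that provides `module.createRequire` (12.2 or later).

```javascript
'use strict';
const path = require('path');
const { createRequire } = require('module');

// Collect the facts that influence module resolution for the current process.
// appDir is the application root (for example /var/www/nest-app).
function collectFingerprint(appDir) {
  const root = path.resolve(appDir);
  // A require() function that resolves as if called from inside appDir.
  const appRequire = createRequire(path.join(root, 'package.json'));
  return {
    nodeVersion: process.version,
    execPath: process.execPath,
    // getuid() only exists on POSIX platforms.
    uid: typeof process.getuid === 'function' ? process.getuid() : null,
    nodePathEnv: process.env.NODE_PATH || '',
    // Directories Node would search for a bare package name.
    searchPaths: appRequire.resolve.paths('placeholder-package'),
  };
}

// Return the keys whose values differ between two fingerprints.
function diffFingerprints(a, b, keys = ['nodeVersion', 'execPath', 'nodePathEnv']) {
  return keys.filter((key) => JSON.stringify(a[key]) !== JSON.stringify(b[key]));
}

// Example: print the fingerprint as JSON from each service and compare.
// console.log(JSON.stringify(collectFingerprint('/var/www/nest-app'), null, 2));
```

Running this once under the web service's user and once under the worker's user, then diffing the two JSON outputs, should surface a mismatched binary or `NODE_PATH` quickly.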
Step-by-Step Debugging Process
We had to treat this like a forensic investigation. We avoided simply restarting services, knowing that would just lead to immediate failure again. We went layer by layer:
Step 1: Inspect the Process Health
First, we checked the overall system health and resource usage to rule out immediate memory exhaustion:
- `htop`: Confirmed the Node.js-FPM process was consuming excessive memory just before the crash.
- `systemctl status nodejs-fpm`: Confirmed the FPM service was stuck in a failed or crashed state.
Step 2: Deep Dive into Application Logs
We pulled the full system journal to correlate the application error with the service failures:
- `journalctl -u nestjs-app -f`: Monitored the application-specific logs in real-time during the failed startup attempt.
- `journalctl -xe | grep -i 'error'`: Searched the entire system log for related errors from the Supervisor setup.
Step 3: Verify File System Integrity and Permissions
We checked the ownership of the application directory to ensure the running user could access all files and dependencies:
- `ls -la /var/www/nest-app/node_modules`: Verified the permissions on the module directory.
- `chown -R www-data:www-data /var/www/nest-app/`: Corrected the ownership to ensure the web server user could read the application files.
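The same permission check can be done from inside Node.js, which tests access as the user actually running the process rather than as whoever ran `ls`. This is a sketch under assumptions: `findUnreadablePackages` is a hypothetical helper, and it only inspects the top level of `node_modules`.

```javascript
'use strict';
const fs = require('fs');
const path = require('path');

// Return the top-level entries under node_modules that the current user
// cannot read (or, for directories, cannot traverse).
function findUnreadablePackages(appDir) {
  const modulesDir = path.join(appDir, 'node_modules');
  let entries;
  try {
    entries = fs.readdirSync(modulesDir);
  } catch (err) {
    // The directory itself is missing or unreadable.
    return [{ path: modulesDir, code: err.code }];
  }
  const unreadable = [];
  for (const name of entries) {
    const full = path.join(modulesDir, name);
    try {
      // Directories need read + execute; plain files only need read.
      const mode = fs.statSync(full).isDirectory()
        ? fs.constants.R_OK | fs.constants.X_OK
        : fs.constants.R_OK;
      fs.accessSync(full, mode);
    } catch (err) {
      unreadable.push({ path: full, code: err.code });
    }
  }
  return unreadable;
}

// Example: console.log(findUnreadablePackages('/var/www/nest-app'));
```

The check is only meaningful when run as the service user (for example via `sudo -u www-data node ...`), because `root` passes these access checks regardless of ownership.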
Step 4: Re-evaluate Dependency Installation
Given the previous failures, we decided a clean re-installation was necessary, forcing the system to rebuild the module cache from scratch:
- `cd /var/www/nest-app`
- `rm -rf node_modules && npm cache clean --force`
- `npm install --production && composer install --no-dev --optimize-autoloader`
The Real Fix: Clean Rebuild and Service Alignment
The fix wasn't just reinstalling packages; it was ensuring the deployment process explicitly managed the Node.js environment and service configuration across all services (web, worker).
Actionable Fix Commands:
- Clean Node Environment: Execute the dependency fix commands from Step 4 above to guarantee a clean `node_modules` directory and optimized Composer autoload.
- Align Node Versions: Ensure both the web server and the worker are running the identical Node.js binary. If using aaPanel, verify the Node.js configuration for the FPM service explicitly points to the correct path.
- Restart Supervisor for Alignment: Use Supervisor to ensure the queue worker is restarted using the newly verified environment:
sudo supervisorctl restart queue_worker
- Final Service Restart: Restart the core application to ensure the FPM process loads the corrected modules:
sudo systemctl restart nodejs-fpm
Why This Happens in VPS / aaPanel Environments
In managed environments like aaPanel, developers often assume that if the code is correct, the environment is static. This is false. The issue arises because:
- Shared Container Environment: The VPS uses a shared Node.js installation, and deployment scripts often run as `root` or a specific deployment user, which then hands off the application to a separate service user (like `www-data` or a Supervisor user). This handover often breaks cached paths or permissions.
- Caching Staleness: Deployment tools (like those integrated into aaPanel) cache build artifacts. If the deployment script doesn't force a fresh dependency resolution, it relies on stale data, leading to the module errors during runtime.
- Process Separation: Running a web server (FPM) and a background worker (queue worker) under separate service managers (like Supervisor) means they rely on independently verified execution environments. A mismatch in dependency resolution between these two processes is a common production killer.
Prevention: Deployment Patterns for Stability
To prevent this from recurring, the deployment process must be idempotent and explicitly manage the runtime environment.
- Dedicated Environment: Instead of relying solely on post-deployment scripts, use Docker containers for deployment. This isolates the Node.js version, dependencies, and environment from the host OS, eliminating module path conflicts entirely.
- Pre-Flight Check: Integrate a dependency check script before application startup that verifies the existence of critical modules (`require('some-module')`) across both the FPM and Worker processes.
- Idempotent Setup: All deployment scripts must include steps to explicitly clear and re-run dependency installation (`npm install` / `composer install`) *after* ensuring file permissions are correct, treating the `node_modules` folder as volatile deployment data.
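The pre-flight check described above could look roughly like the following. It is a minimal sketch, not a definitive implementation: `findMissingModules` is a hypothetical name, and the check only confirms that Node.js can locate each module from the application directory, not that a native addon such as `bcrypt` actually loads.

```javascript
'use strict';
const path = require('path');
const { createRequire } = require('module');

// Try to resolve each module name as if require() were called from appDir,
// and return the names that cannot be located.
function findMissingModules(appDir, moduleNames) {
  const appRequire = createRequire(path.join(path.resolve(appDir), 'package.json'));
  const missing = [];
  for (const name of moduleNames) {
    try {
      appRequire.resolve(name);
    } catch (err) {
      missing.push({ name, code: err.code });
    }
  }
  return missing;
}

// Example, using the module names from the logs earlier in this post:
// const missing = findMissingModules('/var/www/nest-app',
//   ['@nestjs/common', 'rxjs-compat', 'bcrypt']);
// if (missing.length > 0) {
//   console.error('Pre-flight failed:', missing);
//   process.exitCode = 1;
// }
```

Running it under both the web and the worker user before each service starts turns a silent resolution mismatch into an explicit failure at deploy time.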
Conclusion
Production debugging is rarely about finding the obvious bug in the code. It’s about mastering the environment. When facing module resolution failures on an Ubuntu VPS running NestJS, stop looking at the application code first. Start looking at the file system permissions, the process user context, and the caching layers of your deployment pipeline. That is where the real production stability is found.