Apr 28 2022 9:35 – 10:35 AM Downtime Analysis

At 9:44 AM EST a report came in that tenants could not log in to production.

The live back-end (BE) server on EC2 was unresponsive when we tried to log in to Jenkins.

Rebooted the EC2 instance after viewing the monitoring reports, which showed a CPU spike to nearly 100% at approximately 9:35 AM.

After the reboot, login still was not working, but Jenkins was responsive. There were no notable issues in the Jenkins console logs to indicate build problems. The reboot at midnight had also worked normally.

Started a rebuild process on Jenkins to force the API to do a complete reset. The build took nearly 15 minutes, twice as long as normal, likely because multiple people were trying to log in during the build. The build completed with no errors. It also logged “no changes”, meaning the code is the same as what has been running since the 1.70 deployment.

RDS server logs looked normal, with no zombie connections. Restarted the RDS server to see if that would kick the process back in line. The reboot took less than 3 minutes. No luck.

The nginx access and error logs on the API EC2 server show no obvious problems. EC2 monitoring shows no notable traffic, so this does not appear to be a DDoS or other attack crippling the app or network.

Staging and develop can still log in as normal, so the problem appears to be specific to the production API server.
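
One quick way to back up a “no obvious problems” reading of the access log is to count server-side errors. The sketch below assumes nginx's default combined log format, where the HTTP status code is the ninth whitespace-separated field, and the stock Ubuntu log path /var/log/nginx/access.log. Both are assumptions about this server, and count_5xx is a hypothetical helper name.

```shell
# Hypothetical helper: count responses with a 5xx status in an nginx access log.
# Assumes the default "combined" log format, where the status code is field 9.
count_5xx() {
  log_file="${1:-/var/log/nginx/access.log}"
  awk '$9 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }' "$log_file"
}
```

Running count_5xx with no argument reads the default path. A count near zero during the outage window would match what the logs showed here: the requests were not failing at the nginx layer.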

Checked the /var/lib/jenkins/.pm2/logs directory, which sometimes catches Node and OB app errors. Nothing notable, though there are probably some random errors to clean up at some point. They are non-critical and not the source of the issue.

Despite the rebuild/restart logs being “clean”, decided to do a rebuild from the command line on the API server, as opposed to the full clean build that Jenkins runs (fetch all the code from git, re-install all node libs, etc.).
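
The log check can be made a little less manual. As a minimal sketch, the helper below prints, for each .log file in a PM2 log directory, how many lines mention “error” (case-insensitive). The function name scan_pm2_logs is hypothetical, and matching on the word “error” is only a rough heuristic, not how PM2 itself classifies output.

```shell
# Hypothetical helper: per-file count of lines containing "error" in a PM2 log directory.
# Defaults to the directory mentioned in this post; pass another path to override.
scan_pm2_logs() {
  log_dir="${1:-/var/lib/jenkins/.pm2/logs}"
  for f in "$log_dir"/*.log; do
    # Skip the unexpanded glob when the directory has no .log files.
    [ -e "$f" ] || continue
    printf '%s %s\n' "$(grep -ci 'error' "$f")" "$f"
  done
}
```

Piping the output through sort -rn puts the noisiest file first, which is a reasonable place to start when cleaning up the random errors.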

SSH into the API server using the pem key for the ubuntu user.

cd /var/lib/jenkins/workspace/dev_backend then run the pm2 commands.

sudo pm2 delete all

sudo yarn start

This build takes about 4 minutes. It got the login process going again.

Possible Long Term Solutions

Code Fixes

The trigger is likely a code loop in the API (BE) code that caused the initial CPU consumption and downward spiral of the API server.

There is another underlying issue in how Jenkins talks to the PM2 service for Node and how the PM2 service manages the underlying API app. PM2 has always been inconsistent about which memory space it uses and what runs where, depending on how it is invoked. Vanilla PM2 commands in a Jenkins deployment script (configuration) run in one memory space (most likely the jenkins user). Running the same command from yarn (yarn pm2 start, for example) runs in a different memory space. Running the commands from an SSH login as ubuntu runs in yet another memory space (most likely the ubuntu user), and sudo pm2 addresses another workspace (likely root, which also means the sudo pm2 delete all command probably didn’t do a damn thing). Finally, sudo yarn start eventually runs the pm2 command in yet another workspace. It is not clear where the yarn commands attach PM2. It should be the global /var/lib pm2 workspace, but that does not seem to be consistent.
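
PM2 keeps its daemon state under the directory named by the PM2_HOME environment variable, which defaults to .pm2 in the invoking user's home directory. That would explain why jenkins, ubuntu, and root each see a different process list. One possible way to make every caller talk to the same daemon is to pin both the user and PM2_HOME on every invocation. The sketch below assumes the jenkins user owns the daemon and that its PM2 directory is /var/lib/jenkins/.pm2 (inferred from the log path above, not verified). The pm2_pinned name and the PM2_USER, PM2_HOME_DIR, and DRY_RUN variables are hypothetical.

```shell
# Hypothetical wrapper: always run pm2 as one user with one PM2_HOME.
# Assumptions: the jenkins user owns the daemon; its PM2 dir is /var/lib/jenkins/.pm2.
PM2_USER="${PM2_USER:-jenkins}"
PM2_HOME_DIR="${PM2_HOME_DIR:-/var/lib/jenkins/.pm2}"

pm2_pinned() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it, to see what would be executed.
    echo "sudo -u $PM2_USER env PM2_HOME=$PM2_HOME_DIR pm2 $*"
  else
    sudo -u "$PM2_USER" env "PM2_HOME=$PM2_HOME_DIR" pm2 "$@"
  fi
}
```

With this in place, pm2_pinned list from the ubuntu SSH session and from a Jenkins script should address the same daemon, provided pm2 is on the PATH that sudo uses.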

Fixing the PM2 Environment

This may help if we stay on the EC2 instances, but it is a deep, complex problem due to the myriad of options for configuring and running PM2. It has been OK up to now, but it is not the most robust solution, as we can see from the problems getting the app restarted today.

We likely need a Node + PM2 expert or need to do a lot more homework internally, but there is likely a better solution…

Move API service to Amplify

Amplify will manage CDN distribution and basic scalability under load for JavaScript/Node apps such as our API server.

The obstacle up until the 1.69 build was environment variables. Our API service requires them to point the code at the right database server with security credentials attached (you don’t want that in code). On our EC2 API servers these live in manually created .env files, which Amplify does not support. Instead, per-deployment environment variables are set outside “disk storage”, in the AWS console configuration for the Amplify instance.

Unfortunately our app requires a dozen variables and FOUR happened to be named with a standard AWS_<something> tag for the AWS S3 connections. Turns out Amplify reserves AWS_ environment variables to do some “special AWS Amplify magic”. So these had to be renamed in the API codebase (they are now named THE_AWS_<something>) in preparation for making it possible to put the API servers up on Amplify.
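
A check like the following could catch reserved names before a deployment. It is a minimal sketch that lists every variable in a .env file whose name starts with AWS_, skipping comments and allowing an optional leading export. The function name find_reserved_env_names is hypothetical, and the variable names in the usage example are made up for illustration; they are not the four real variables.

```shell
# Hypothetical helper: print names from a .env file that start with the AWS_ prefix.
# Commented-out lines and names such as THE_AWS_... are not reported.
find_reserved_env_names() {
  env_file="$1"
  sed -n -E 's/^[[:space:]]*(export[[:space:]]+)?(AWS_[A-Za-z0-9_]*)[[:space:]]*=.*/\2/p' "$env_file"
}
```

For example, given a file containing AWS_EXAMPLE_ONE=x and THE_AWS_EXAMPLE_TWO=y, only AWS_EXAMPLE_ONE is printed. Empty output means nothing in the file collides with the reserved prefix.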

Now that the code is ready (since OB 1.69) we can explore putting the API server on Amplify and dropping PM2 from the services list.

July 2022 Update

A test was done with the new environment variables in place in the OmniBlocks® API app to support Amplify. Amplify was configured to use the new variables and the app was spun up on a dedicated Amplify API node.

Amplify is essentially a CI/CD builder service that constructs static web hosting packages served through a CDN backed by the S3 service. This design works well for traditional static websites with basic HTML + CSS + JavaScript. The front-end OmniBlocks® application is a perfect candidate, as users access the app from a static URL with dynamic routes built within the React application, a setup known as a “single page application” (SPA).

The API server, however, needs to answer multiple routes. To do this properly with the firewalls and access controls put in place via the Amplify AWS stack, you need helpers to route traffic. Turns out there is a subsystem of services that needs to be configured to work with the REST API for our back-end server. This uses AWS Lambda and requires some changes to our back-end code to support a slightly different version of the current traffic router built into our application. The Server Express module needs to be added to our API code and set up to work with Lambda before we can put the API server on an auto-scaling service like Amplify.
