Cause
Live BE server stopped responding.
Actions
RDS server was still responding, but showing 20 open connections.
EC2 server was not responding, SSH unresponsive, Jenkins (http) unresponsive.
AWS reboot of the EC2 instance never finished (5m).
AWS stop instance never finished (5m).
AWS force stop took 3m+, eventually stopped.
Instance manually restarted via AWS console.
Forced a PM2 restart (logged in via SSH, sudo su - jenkins, cd workspace/dev_backend, yarn pm2 stop all, yarn start); the command sequence is sketched below.
Server online @10:27AM
Response Time: 15m from report to online
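For reference, the manual restart sequence used above, written out as a runbook snippet (the SSH user/host are placeholders, not confirmed values; the commands themselves are the ones listed in the action item):

ssh <user>@<api-server-host>   # placeholder user/host
sudo su - jenkins
cd ~/workspace/dev_backend
yarn pm2 stop all
yarn start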
Forensic Analysis
Jul 11 Timeline
Note: server logs are typically in UTC not EST.
14:00 = 10AM EST
nginx/access.log = /var/log/nginx/access.log
nginx/error.log = /var/log/nginx/error.log
ob_api_server_out.log = /var/lib/jenkins/workspace/dev_backend/logs/ob_api_server_out.log
14:01:27 | nginx/error.log | [error] 8934#8934: *220129 upstream prematurely closed connection while reading response header from upstream, client: 107.167.202.154 GET /v1/location?id=4704 from: 107.167.202.154 |
14:01:57 | nginx/error.log | [error] 8934#8934: *220135 upstream prematurely closed connection while reading response header from upstream, client: 107.167.202.154 GET /v1/inventory/get/?current=256 from: 107.167.202.154 |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/unit_measure |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET v1/location |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:19 | ob_api_server_out.log | GET /v1/inventory/get/?id=12493 |
ob_api_server_out.log | >>>> Server started… | |
14:03:37 | nginx/access.log | POST /v1/mbr_legacy/ from: 107.167.202.154 |
13:59:43 | /var/log/syslog | systemd-udevd[450]: seq 26654 '/kernel/slab/:a-0000104/cgroup/buffer_head(531:systemd-resolved.service)' is taking a long time |
14:15:37 | /var/log/syslog | systemd[1]: snapd.service: Watchdog timeout (limit 5min)! |
---------- REBOOT ---------- | ||
14:13:13 | /var/log/cloud-init.log | util.py[DEBUG]: Cloud-init v. 21.4-0ubuntu1~18.04.1 running 'init-local' at Mon, 11 Jul 2022 14:19:13 +0000. Up 21.52 seconds. |
14:19:19 | /var/log/kern.log | Linux version 5.4.0-1060-aws (buildd@lgw01-amd64-054) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) first entry (reboot) |
ob_api_server_out.log | Server started… | |
14:26:50 | ob_api_server_out.log | GET /v1/unit_measure?page_size=500 |
EC2 Basic System Analysis
EC2 instance type: t3.medium (at the time of the incident; see the upgrade under Actions Taken)
Memory: 3.8G available, 2 vCPU
A first glance at the summary report does not show resource allocation problems.
CPU hit a maximum of 64.2% @ 14:10.
API Server PM2 Logs
PM2 Snapshot
Error Log Summary
⚠️ NOTE: need to investigate if PM2 can be configured to email these log entries or post some kind of notification that is NOT buried in Jenkins ~/workspace/dev_backend/logs/ob_api_server_error.log
Bad Request Sent To OB FE App
This error is repeated 24 times at the end of the log.
sendBadRequest: Error: Insufficient number of lots selected to complete batch.
at checkForInsufficientAllocations (/var/lib/jenkins/workspace/dev_backend/app/modules/allocations/helper.ts:521:13)
at /var/lib/jenkins/workspace/dev_backend/app/modules/allocations/helper.ts:430:59
at Generator.next (<anonymous>)
at /var/lib/jenkins/workspace/dev_backend/build/app/modules/allocations/helper.js:8:71
at new Promise (<anonymous>)
at __awaiter (/var/lib/jenkins/workspace/dev_backend/build/app/modules/allocations/helper.js:4:12)
at throwCompletabilityErrorIfNeeded (/var/lib/jenkins/workspace/dev_backend/app/modules/allocations/helper.ts:417:20)
at MbrProcess.<anonymous> (/var/lib/jenkins/workspace/dev_backend/app/modules/mbrLegacy/MbrProcessLegacy.ts:681:47)
at Generator.next (<anonymous>)
at fulfilled (/var/lib/jenkins/workspace/dev_backend/build/app/modules/mbrLegacy/MbrProcessLegacy.js:5:58) {
customSubMessage: 'Insufficient number of lots selected to complete batch.',
bulletPoints: [
{
catNum: 'IN#5406',
missingString: '269.6552 kg of 272.1552 kg (99.08%) missing'
}
],
skipRollback: true
}
Invalid String Length Error / Unhandled Rejection
This should be addressed. If it is not caught and handled properly it can lead to indeterminate behavior.
You have triggered an unhandledRejection, you may have forgotten to catch a Promise rejection:
RangeError: Invalid string length
at JSON.stringify (<anonymous>)
at stringify (/var/lib/jenkins/workspace/dev_backend/node_modules/express/lib/response.js:1128:12)
at ServerResponse.json (/var/lib/jenkins/workspace/dev_backend/node_modules/express/lib/response.js:260:14)
at InventoriesController.sendOk (/var/lib/jenkins/workspace/dev_backend/app/utils/base-controller.ts:199:32)
at InventoriesController.<anonymous> (/var/lib/jenkins/workspace/dev_backend/app/modules/inventories/controller.ts:810:17)
at Generator.next (<anonymous>)
at fulfilled (/var/lib/jenkins/workspace/dev_backend/build/app/modules/inventories/controller.js:5:58)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
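A process-level handler would at least record these rejections with a clear marker before PM2 recycles the worker. A minimal sketch, assuming it is registered once at API server startup (the real fix is to catch the rejection at its source, for example by paginating the oversized inventory response before JSON.stringify runs):

// Register once during bootstrap (illustrative, not the app's actual startup code).
process.on("unhandledRejection", (reason: unknown) => {
  // Lands in the PM2 error log with a greppable tag.
  console.error("[unhandledRejection]", reason instanceof Error ? reason.stack : reason);
});

process.on("uncaughtException", (err: Error) => {
  console.error("[uncaughtException]", err.stack);
  // Exit so PM2 restarts a clean worker instead of continuing in an unknown state.
  process.exit(1);
});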
skipUndefined deprecation notice (warning)
This message should be addressed or the logging turned off. It is not hurting application execution, but error logging does consume system resources and slows the app as it writes to a plain text file with no caching or I/O optimization.
skipUndefined() is deprecated and will be removed in objection 4.0
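The usual replacement is to drop undefined filter values explicitly before building the query instead of calling skipUndefined(). A minimal sketch; the model and filter names are illustrative, not the app's actual code:

// Keep only the filters the caller actually supplied.
function definedOnly(filters: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(filters).filter(([, value]) => value !== undefined)
  );
}

// Before: Customer.query().where(filters).skipUndefined()
// After:  Customer.query().where(definedOnly(filters))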
API Server Request Log
API Service Autorestarts
⚠️ Note: Why is the API service restarting so often? PM2 manages the node server. If one of the two service instances fails (one copy of the app runs on each CPU, and we have 2 CPUs on the current server), PM2 will restart it automatically. Something in these requests seems to be triggering a resource or other failure.
Nick noted that PM2 will restart a worker any time it drains all active requests, so the frequent restarts look normal here.
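For context, this per-CPU worker behavior is driven by the PM2 ecosystem config. A hypothetical ecosystem.config.js sketch along these lines; the app name, entry point, and memory threshold are assumptions, not the current production file (the log file names match the paths referenced in this report):

// ecosystem.config.js (illustrative sketch)
module.exports = {
  apps: [
    {
      name: "ob_api_server",
      script: "./build/app/server.js",   // assumed entry point
      exec_mode: "cluster",
      instances: 2,                       // one worker per vCPU
      max_memory_restart: "2G",           // recycle a worker before it starves the host
      out_file: "./logs/ob_api_server_out.log",
      error_file: "./logs/ob_api_server_error.log",
    },
  ],
};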
PM2 Service Restart Log Analysis
The PM2 request log lines immediately before each PM2 service restart:
11/Jul/2022:15:26:54 | PUT /v1/inventory/attach_location/14176 |
11/Jul/2022:15:32:50 | GET /v1/customer/customers?id=[81] |
11/Jul/2022:15:33:51 | GET /v1/mmr/ |
11/Jul/2022:15:36:52 | GET /v1/customer?company_name=aliv |
11/Jul/2022:15:36:54 | GET /v1/customer |
11/Jul/2022:15:41:22 | POST /v1/orders/order/?order_id=3090 |
11/Jul/2022:15:41:51 | GET /v1/orders/order/get |
11/Jul/2022:15:42:00 | GET /v1/orders/order/get |
11/Jul/2022:15:43:55 | GET /v1/inventory_metrics/need_to_order |
11/Jul/2022:15:46:23 | PUT /v1/inventory/attach_location/14176 |
11/Jul/2022:15:50:51 | GET /v1/mmr/ |
11/Jul/2022:15:52:20 | GET /v1/customer/customers |
11/Jul/2022:15:52:40 | PUT /v1/batch/userLock/ |
11/Jul/2022:15:53:22 | GET /v1/mbr_legacy/index_all |
11/Jul/2022:15:53:42 | GET /v1/customer/customers?id=[81] |
11/Jul/2022:15:54:19 | PUT /v1/formula/formula_ingredient/10979 |
11/Jul/2022:15:55:50 | GET /v1/customer/customers?id=[81] Note: restarted twice in a row here, no requests in between |
11/Jul/2022:15:56:22 | GET /v1/orders/order/get |
11/Jul/2022:16:00:07 | GET /v1/customer/customers?id=[81] |
11/Jul/2022:16:34:51 | PUT /v1/inventory/attach_location/14177 |
11/Jul/2022:16:34:59 | PUT /v1/inventory/attach_location/14177 |
11/Jul/2022:16:39:08 | PUT /v1/inventory/attach_location/14177 |
Endpoint Counts Before Failures
batch/userLock | 1 (1 Put) |
orders/order | 4 (1 Post, 3 Get) |
customer | 2 (2 Get) |
customer/customers | 5 (5 Get) |
formula/formula_ingredient | 1 (1 Put) |
inventory/attach_location | 5 (5 Put) |
inventory_metrics/need_to_order | 1 (1 Get) |
mbr_legacy/index_all | 1 (1 Get) |
mmr | 2 (2 Get) |
Redis Service on EC2 API Server Review
Redis is the in-memory caching database that the OB API server connects to for session caching (and it was likely not set up properly).
There are definitely some configuration issues here.
⚠️ Note: this needs to be fixed…
WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
✅ Production and develop servers updated to fix this.
See https://docs.omniblocks.app/internal-note/configuring-ec2-for-redis/
⚠️ Note: this needs to be fixed…
WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
✅ Production and develop servers updated to fix this.
See https://docs.omniblocks.app/internal-note/configuring-ec2-for-redis/
RDS System Analysis
The RDS server now serves only production. There is no indication of a resource issue on the RDS server itself, with the exception of DB connections.
At 10:03 AM the minimum DB connections spiked to 20 and stayed there until the server was shut down. This is abnormal behavior.
This is indicative of a code logic problem where connections (very likely SQL transactions) are not finishing properly, possibly getting into a wait/lock state that could even be self-referential. It could also indicate a configuration limit on the EC2 node server (OB BE) that caps the total connections to the RDS server.
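If the cap is on the API side, it would live in the Knex pool configuration. A minimal sketch of the relevant knobs; the values and env var names are assumptions for illustration, not the current settings:

import { knex } from "knex";

const db = knex({
  client: "pg",
  connection: {
    host: process.env.DB_HOST,        // RDS endpoint (placeholder env vars)
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
  },
  pool: {
    min: 2,
    max: 10,                          // hard ceiling on RDS connections per node process
  },
  acquireConnectionTimeout: 60000,    // fail the request instead of queueing indefinitely
});

export default db;

Worth noting: if the app is running Knex's default pool (max 10) under two PM2 workers, the ceiling would be exactly 20 connections, which matches the plateau observed above; this is worth verifying.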
High Read IOPS
@13:23 the RDS server had a notable spike in Read IOPS.
Spikes are not unusual for the app, but these large ones likely indicate coding/logic issues, since they do not happen at routine intervals and are likely triggered by data operations in the code logic.
Unusual Data Queries
These are in the performance insights logs…
Who Is 71.130.164.222 — Nick Fields
These are coming from a pg_dump command…
@14:05
COPY public.batches (id, created_by, updated_by, formula_id, user_id, tenant_id, created_at, updated_at, status, batch_size, cap_or_bottle, value, template_id, qr_link, qr_code, product_name, batch_num, physical_inventory, location_id, produced_cat_num, production_output_unit, production_start_date, expiration_date, locked, current_step_name, "activeUID", completability_status_int, actual_yield, estimated_overage, expiration_offset, "customerIds", cert_marks_id) TO stdout;
@16:16
COPY public.mmr (id, status, created_at, updated_at, created_by, updated_by, tenant_id, formula_id, template_id, catalog_id, production_output_unit, batch_num, value, product_name, produced_cat_num, batch_size, cap_or_bottle, locked, expiration_offset, "customerIds") TO stdout;
Stop Gap Remedies
✅ Increase EC2 Instance Size
Increase the size of the EC2 instance to allow more CPU/memory, and thus more active data connections, to be configured. This is basically "using a bigger hammer" to space out potential downtime; however, if the "nail is bent" this is not going to be a great solution. We need to figure out the root cause.
Long Term Remedies
Isolate and Resolve Possible Coding Issue
This requires more information to first determine whether we are dealing with a code-based issue or a configuration issue.
Primary Findings
These things are possibly related to the root cause and need to be addressed soon.
- Uncaught promise / invalid string length noted above.
- Routes that seem to be related to the node server restarting:
- v1/customer searches
- v1/orders/order (get and post operations)
Secondary Findings
These things may not be the root cause but need to be addressed.
- Multiple calls to /v1/me/tenant from the front end app; this should be done once per session
- Debounce the search call for customer searches (see the sketch after this list):
/v1/customer?company_name=a
/v1/customer?company_name=al
/v1/customer?company_name=ali
- Replace skipUndefined (deprecation notice)
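A minimal front-end debounce sketch for the customer search; the function name and delay are illustrative, not the app's actual code:

// Collapse rapid keystrokes into a single request after a short pause.
function debounce<A extends unknown[]>(fn: (...args: A) => void, delayMs = 300) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A): void => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

const searchCustomers = debounce((term: string) => {
  fetch(`/v1/customer?company_name=${encodeURIComponent(term)}`);
}, 300);

// Typing "aliv" now triggers one request instead of one per keystroke.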
Isolate and Resolve Possible BE Server Configuration
This requires more information to first determine whether we are dealing with a code-based issue or a configuration issue.
Primary Findings
- ✅ ⚠️ Redis needs to have a max memory limit assigned (July 12th 11:30AM)
This has never been set up properly. The original dev team (TP) made configuration changes but completely ignored this key and critical setting.
Resolved 6:10PM EST July 12th 2022.
Secondary Findings
These things may not be the root cause but need to be addressed.
- Review PM2 and nginx configuration.
See: https://pm2.keymetrics.io/docs/tutorials/pm2-nginx-production-setup
- Review Redis configuration and setup on the API server.
The app must be stateless and thus must share login state between nodes (even on the same server / different CPUs), so something like Redis must be in place to share state.
Note: Redis may be running TWO instances on the production server (one per CPU) without sharing credentials. The auth library relies heavily on Redis for user auth, and a non-shared "in-memory database" (which is essentially what Redis is) would cause sessions to disconnect when moving between "node server on CPU1" and "node server on CPU2". Needs further investigation; a minimal sketch of shared session storage follows this list.
- Most likely need to employ Redis via AWS, not local copies.
https://aws.amazon.com/memorydb/pricing/
- ✅ DEFINITELY need to fix the server to set vm.overcommit_memory = 1 and restart.
- ✅ Fix Huge Pages (disable).
- ✅ Fix the memory limit, it is MIA (set to 1GB on an 8GB server)
- ✅ Look into getting PM2 error logs to send a notification — need to subscribe if we want this.
PM2.io is a paid service to get Slack/Email notifications
$39/month
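A minimal sketch of the shared-session idea referenced in the list above, using one Redis endpoint that every PM2 worker connects to. The key layout, TTL, and client library are assumptions, not the app's actual auth code:

import Redis from "ioredis";

// One Redis endpoint shared by both workers (host/port are placeholders).
const redis = new Redis({ host: "127.0.0.1", port: 6379 });

export async function saveSession(token: string, payload: object, ttlSeconds = 3600) {
  await redis.set(`session:${token}`, JSON.stringify(payload), "EX", ttlSeconds);
}

export async function loadSession<T = unknown>(token: string): Promise<T | null> {
  const raw = await redis.get(`session:${token}`);
  return raw ? (JSON.parse(raw) as T) : null;
}

With this arrangement it does not matter which CPU's worker handles a request; both read the same session keys. Pointing the same code at AWS MemoryDB/ElastiCache would only change the host and credentials.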
Get API Server On Fault Tolerant Cluster (High Priority)
The Production API Server MUST be put on a fault tolerant cluster, but again, this only means “more hammers” if we don’t find the root cause. HOWEVER, this should be in place as sometimes servers fail and we want a fault tolerant system with auto-failover.
The better solution will be to update the BE app to run on Amplify services with Lambda (expressed in this article: ). It will take a bit longer to implement and be a little harder to manage; however, it is the most flexible option and requires fewer IT resources overall in the long run. This would get the API app off Jenkins and use the Amplify CI/CD processes, which are a bit more sophisticated. Amplify is a container-based system that also manages the edge (CDN) networks.
Resources: (1) IT sysadmin x 1 week, (1) BE engineer x 1 week
We can create an EC2 load-balanced cluster and auto-scaling group as another option. This keeps us on the older operating system with Jenkins, which requires manual updates. It does not include a CDN without explicit configuration. The advantage is that we know this system and it can be spun up in a few days; HOWEVER, we do not have experience using Jenkins in this environment, where every node in the cluster would need to be updated at the exact same time. This could turn into a complex implementation and take longer than expected to configure.
Resource: (1) IT sysadmin x 1 week (possibly longer)
Actions Taken
20:15 EST 07/11/2022 AWS EC2 Instance Upgrade
Upgraded the OB BE Production server from AWS EC2 t3.medium (2 CPU/4GB RAM/5GB Network) => m6.large (2 CPU/8GB RAM/10GB Network).
20:45 EST 07/11/2022 Redis Server Configuration Fixes
Updated the develop server to test the proposed Redis changes noted above for memory overcommit and huge pages. Rebooted; the service appears to be working fine.
Updated the production server with the same changes. Restarted and logged in to production.
✅ Also check the /var/log/redis/redis-server.log file to ensure no errors.
18:12 EST 07/12/2022 Redis Configuration Fixes
Added the maxmemory setting to Redis. With this setting missing, memory growth can lead to catastrophic server failure.
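For reference, the relevant redis.conf directives look roughly like this. The 1GB value comes from the notes above; the eviction policy shown is an assumption and should be chosen to match how the app uses the cache:

# /etc/redis/redis.conf
maxmemory 1gb
maxmemory-policy allkeys-lru   # evict least-recently-used keys instead of failing writes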
21:15 EST 07/18/2022 OmniBlocks 1.84 Release
An update to the API server code was put on the production server. The update does not include a patch that resolves the root issue with 100% confidence as we have yet to be able to reproduce the runaway server resources in any environment. This update includes several patches to help increase our confidence that the middleware application will be stable as well as hopefully improve logging when something does go wrong.
New features include additional error logging that presents notifications to the front end application when the user is interacting with the system and something goes wrong. The goal is "no unmanaged warnings/failures" and a more graceful user experience when something does go wrong. The hope is that users report these issues to RBT Inc. when the new warnings come up.
Catching more errors also reduces the likelihood of a node server thread "hard crashing" due to a warning/error that was not caught. The theory is that when an uncaught error causes the node application to crash, it is unable to call the underlying Knex library handler to close the database connections. We have seen lingering open data connections at irregular intervals that build up on the RDS server over time. Stopping uncaught errors and crash-restarts may help mitigate this issue, especially given the PM2 wrapper around the node service that auto-restarts the API server if something goes wrong (and as routine processing to keep memory stacks clear, such as when all connections are drained).
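As a complement, a shutdown hook can make sure pooled connections are released even when a worker is being recycled. A minimal sketch, assuming `db` is the shared Knex instance; the signal handling details are illustrative:

// Drain the Knex pool before the process exits so RDS connections are not orphaned.
async function shutdown(signal: string): Promise<void> {
  console.log(`[shutdown] received ${signal}, draining database pool...`);
  try {
    await db.destroy();   // knex.destroy() closes all pooled RDS connections
  } finally {
    process.exit(0);
  }
}

process.on("SIGINT", () => void shutdown("SIGINT"));
process.on("SIGTERM", () => void shutdown("SIGTERM"));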
Additional updates to the Express router also work to catch more errors, in this case any uncaught (non-fatal) errors that may not have been handled with explicit try/catch coding. This is an attempt to at least log errors to the log files gracefully BEFORE the server hard crashes.
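The usual pattern for this kind of router-level safety net is an async wrapper plus a final error middleware. A hedged sketch of the idea, not the actual 1.84 diff:

import express, { NextFunction, Request, RequestHandler, Response } from "express";

// Forward rejected promises from async handlers into Express error handling.
const asyncHandler = (fn: RequestHandler): RequestHandler =>
  (req, res, next) => Promise.resolve(fn(req, res, next)).catch(next);

const app = express();

app.get("/v1/example", asyncHandler(async (_req, res) => {
  res.json({ ok: true });   // any throw/reject above lands in the error middleware
}));

// Final error middleware: log and return a controlled 500 instead of crashing the worker.
app.use((err: Error, _req: Request, res: Response, _next: NextFunction) => {
  console.error("[route error]", err.stack);
  res.status(500).json({ error: "Internal server error" });
});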
While this update does not address the root cause of the server crash with 100% confidence, it moves us another step in the direction of stability.
Work on preparing the codebase for a proper fault-tolerant scalable cluster is next on the list.
Follow Up & Monitoring
July 12th 2022
The new server configuration uses the same number of CPU cores, but it is on a much newer generation (gen 6 vs. gen 3). As such, the CPU utilization is less than HALF the average the t3.medium instance showed.
Memory usage remains low, about 1GB.
PM2 monitoring looks normal; the 400M+ of RAM for a single API server thread is a bit high. The PM2 config needs to be updated for the 8GB of RAM on the new server vs. the 4GB on the old one.
July 13th 2022
Memory check – well under the 7.8GB available, at 1.4GB usage (instantaneous)
PM2 is showing 400MB+ memory usage again; not "out of range" but definitely higher than average.
July 18th 2022
Nothing new in the review of server logs and services. The 8GB of memory on the new server appears to be handling the spikes in memory consumption; however, we still have no direct leads on what is triggering the memory consumption spikes when they occur.
Overall network I/O and CPU usage are far below capacity, operating at < 10% maximum capacity.
A slightly different view of the server resources on the production API server (htop) shows nothing concerning. Real-time monitoring does show an occasional spike in CPU and memory usage, but the load average shows that it is well within normal operating parameters:
Behind the EC2 API server, the RDS server is operating well within configured parameters despite running on a t3.small burstable instance (note: this should be upgraded to an m5 instance at some point in the near future).
RDS CPU
DB connections, driven by the production API server, are within normal range and well within what the current RDS configuration can handle:
July 19th 2022 : Post OB 1.84 Update
Both API worker processes (one per CPU core) are running at 500M of RAM; that is unusual.
htop is also showing unusual memory consumption for the API app. Maybe this is the "new normal" for 1.84, but in the past anything over 150M of memory consumption was a concern:
That said, the 1.84 build DOES include an updated PM2 configuration that bumps the "restart the API node server" memory threshold from 200M to 2G, which would indicate the node server is now holding a fairly large cache in either shareable code segments or data I/O caching.
Redis itself has been configured for nearly a week to allow 1GB max usage, so if that were the trigger for the new pattern, this memory consumption pattern should have shown up earlier this week or at the end of last week.
Conclusions
The API server outage has been an ongoing problem. It is very clear that time needs to be allocated in the tech services schedule to put the OmniBlocks® API server onto a scalable cluster. This effort will take a full week of IT/sysadmin work followed by a full week of development work. Unfortunately we do not have a dedicated IT/sysadmin, so much of the initial effort will likely take more than a week to implement AFTER the options analysis and specification documents are finished.
Over the past few outages we've eliminated some key problem areas, but they did not resolve the issue fully: the RDS server was upgraded to better meet capacity, the RDS server was made fault tolerant, the API server was updated to better meet capacity, SQL query wait times were notably reduced by fixing some poorly implemented order module queries, and data indexes were added to improve performance and reduce wait times. These all helped, but some unknown series of interactions still triggers data I/O and resource runaway (we think memory related) from the code. The server does not handle the resource shortage well and hangs all processes, including SSH logins and AWS fabric requests (such as the reboot request).
The problem is multi-layered, involving deep code and architecture issues. The code issues could involve underlying supporting library issues (Knex, Objection, Redis, PM2). RBT Inc. algorithms have been updated along the way; however, nearly 50% of the codebase on the API server still utilizes some form of the inherited TP code structures. The most blatant problem areas have been resolved.
In addition, there are server and cloud configuration issues that are causing the API server to become non-responsive. The server runs out of resources so quickly when this issue happens that it does not send problem notifications as the services fail. Extensive work needs to be done with AWS to get early-warning notifications sent from either the AWS fabric or the server itself. Some of this will be mitigated by taking the time to move the API server to a new services architecture outside of EC2, or within EC2 as part of a proper monitoring and scaling group.
Update (11:30AM July 12th) — just discovered a key setting in the Redis configuration file that the original dev team never applied. They customized the configuration but left out a critical piece — the memory limit (details above).
This is a complex situation and highlights the need to get an IT/Sysadmin with cloud experience on staff.