Cause
Live BE server stopped responding.
Actions
RDS server was still responding, but showing 20 open connections.
EC2 server was not responding, SSH unresponsive, Jenkins (http) unresponsive.
AWS reboot of the EC2 instance never finished (5m).
AWS stop instance never finished (5m).
AWS force stop took 3m+, eventually stopped.
Instance manually restarted via AWS console.
Forced a PM2 restart (logged in via SSH, sudo su - jenkins, cd workspace/dev_backend, yarn pm2 stop all, yarn start); the command sequence is sketched below.
Server online @10:27AM
Response Time: 15m from report to online
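For reference, the manual restart sequence used above, written out as a runbook snippet (the SSH user/host are placeholders, not confirmed values; the commands themselves are the ones listed in the action item):

ssh <user>@<api-server-host>   # placeholder user/host
sudo su - jenkins
cd ~/workspace/dev_backend
yarn pm2 stop all
yarn start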
Forensic Analysis
Jul 11 Timeline
Note: server logs are typically in UTC not EST.
14:00 = 10AM EST
nginx/access.log = /var/log/nginx/access.log
nginx/error.log = /var/log/nginx/error.log
ob_api_server_out.log = /var/lib/jenkins/workspace/dev_backend/logs/ob_api_server_out.log
14:01:27 | nginx/error.log | [error] 8934#8934: *220129 upstream prematurely closed connection while reading response header from upstream, client: 107.167.202.154 GET /v1/location?id=4704 from: 107.167.202.154 |
14:01:57 | nginx/error.log | [error] 8934#8934: *220135 upstream prematurely closed connection while reading response header from upstream, client: 107.167.202.154 GET /v1/inventory/get/?current=256 from: 107.167.202.154 |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/unit_measure |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET v1/location |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:17 | ob_api_server_out.log | GET /v1/me/tenant |
14:02:19 | ob_api_server_out.log | GET /v1/inventory/get/?id=12493 |
ob_api_server_out.log | >>>> Server started… | |
14:03:37 | nginx/access.log | POST /v1/mbr_legacy/ from: 107.167.202.154 |
13:59:43 | /var/log/syslog | systemd-udevd[450]: seq 26654 '/kernel/slab/:a-0000104/cgroup/buffer_head(531:systemd-resolved.service)' is taking a long time |
14:15:37 | /var/log/syslog | systemd[1]: snapd.service: Watchdog timeout (limit 5min)! |
---------- REBOOT ---------- | ||
14:13:13 | /var/log/cloud-init.log | util.py[DEBUG]: Cloud-init v. 21.4-0ubuntu1~18.04.1 running 'init-local' at Mon, 11 Jul 2022 14:19:13 +0000. Up 21.52 seconds. |
14:19:19 | /var/log/kern.log | Linux version 5.4.0-1060-aws (buildd@lgw01-amd64-054) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) first entry (reboot) |
ob_api_server_out.log | Server started… | |
14:26:50 | ob_api_server_out.log | GET /v1/unit_measure?page_size=500 |
EC2 Basic System Analysis
EC2 instance type: t3.medium (at the time of the incident; see the upgrade under Actions Taken)
Memory: 3.8G available, 2 vCPU
A first glance at the summary report does not show resource allocation problems.
CPU hit a maximum of 64.2% @ 14:10.
API Server PM2 Logs
PM2 Snapshot
Error Log Summary
⚠️ NOTE: need to investigate if PM2 can be configured to email these log entries or post some kind of notification that is NOT buried in Jenkins ~/workspace/dev_backend/logs/ob_api_server_error.log
Bad Request Sent To OB FE App
This error is repeated 24 times at the end of the log.
sendBadRequest: Error: Insufficient number of lots selected to complete batch.
at checkForInsufficientAllocations (/var/lib/jenkins/workspace/dev_backend/app/modules/allocations/helper.ts:521:13)
at /var/lib/jenkins/workspace/dev_backend/app/modules/allocations/helper.ts:430:59
at Generator.next (<anonymous>)
at /var/lib/jenkins/workspace/dev_backend/build/app/modules/allocations/helper.js:8:71
at new Promise (<anonymous>)
at __awaiter (/var/lib/jenkins/workspace/dev_backend/build/app/modules/allocations/helper.js:4:12)
at throwCompletabilityErrorIfNeeded (/var/lib/jenkins/workspace/dev_backend/app/modules/allocations/helper.ts:417:20)
at MbrProcess.<anonymous> (/var/lib/jenkins/workspace/dev_backend/app/modules/mbrLegacy/MbrProcessLegacy.ts:681:47)
at Generator.next (<anonymous>)
at fulfilled (/var/lib/jenkins/workspace/dev_backend/build/app/modules/mbrLegacy/MbrProcessLegacy.js:5:58) {
customSubMessage: 'Insufficient number of lots selected to complete batch.',
bulletPoints: [
{
catNum: 'IN#5406',
missingString: '269.6552 kg of 272.1552 kg (99.08%) missing'
}
],
skipRollback: true
}
Invalid String Length Error / Unhandled Rejection
This should be addressed. If it is not caught and handled properly it can lead to indeterminate behavior.
You have triggered an unhandledRejection, you may have forgotten to catch a Promise rejection:
RangeError: Invalid string length
at JSON.stringify (<anonymous>)
at stringify (/var/lib/jenkins/workspace/dev_backend/node_modules/express/lib/response.js:1128:12)
at ServerResponse.json (/var/lib/jenkins/workspace/dev_backend/node_modules/express/lib/response.js:260:14)
at InventoriesController.sendOk (/var/lib/jenkins/workspace/dev_backend/app/utils/base-controller.ts:199:32)
at InventoriesController.<anonymous> (/var/lib/jenkins/workspace/dev_backend/app/modules/inventories/controller.ts:810:17)
at Generator.next (<anonymous>)
at fulfilled (/var/lib/jenkins/workspace/dev_backend/build/app/modules/inventories/controller.js:5:58)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
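A process-level handler would at least record these rejections with a clear marker before PM2 recycles the worker. A minimal sketch, assuming it is registered once at API server startup (the real fix is to catch the rejection at its source, for example by paginating the oversized inventory response before JSON.stringify runs):

// Register once during bootstrap (illustrative, not the app's actual startup code).
process.on("unhandledRejection", (reason: unknown) => {
  // Lands in the PM2 error log with a greppable tag.
  console.error("[unhandledRejection]", reason instanceof Error ? reason.stack : reason);
});

process.on("uncaughtException", (err: Error) => {
  console.error("[uncaughtException]", err.stack);
  // Exit so PM2 restarts a clean worker instead of continuing in an unknown state.
  process.exit(1);
});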
skipUndefined deprecation notice (warning)
This message should be addressed or the logging turned off. It is not hurting application execution, but error logging does consume system resources and slows the app as it writes to a plain text file with no caching or I/O optimization.
skipUndefined() is deprecated and will be removed in objection 4.0
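The usual replacement is to drop undefined filter values explicitly before building the query instead of calling skipUndefined(). A minimal sketch; the model and filter names are illustrative, not the app's actual code:

// Keep only the filters the caller actually supplied.
function definedOnly(filters: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(filters).filter(([, value]) => value !== undefined)
  );
}

// Before: Customer.query().where(filters).skipUndefined()
// After:  Customer.query().where(definedOnly(filters))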
API Server Request Log
API Service Autorestarts
⚠️ Note: Why is the API service restarting so often? PM2 manages the node server. If one of the two service instances fails (one copy of the app runs on each CPU, and we have 2 CPUs on the current server), PM2 will restart it automatically. Something in these requests seems to be triggering a resource or other failure.
Nick noted that PM2 will restart a worker any time it drains all active requests, so the frequent restarts look normal here.
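For context, this per-CPU worker behavior is driven by the PM2 ecosystem config. A hypothetical ecosystem.config.js sketch along these lines; the app name, entry point, and memory threshold are assumptions, not the current production file (the log file names match the paths referenced in this report):

// ecosystem.config.js (illustrative sketch)
module.exports = {
  apps: [
    {
      name: "ob_api_server",
      script: "./build/app/server.js",   // assumed entry point
      exec_mode: "cluster",
      instances: 2,                       // one worker per vCPU
      max_memory_restart: "2G",           // recycle a worker before it starves the host
      out_file: "./logs/ob_api_server_out.log",
      error_file: "./logs/ob_api_server_error.log",
    },
  ],
};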
PM2 Service Restart Log Analysis
The PM2 request log lines immediately before each PM2 service restart:
11/Jul/2022:15:26:54 | PUT /v1/inventory/attach_location/14176 |
11/Jul/2022:15:32:50 | GET /v1/customer/customers?id=[81] |
11/Jul/2022:15:33:51 | GET /v1/mmr/ |
11/Jul/2022:15:36:52 | GET /v1/customer?company_name=aliv |
11/Jul/2022:15:36:54 | GET /v1/customer |
11/Jul/2022:15:41:22 | POST /v1/orders/order/?order_id=3090 |
11/Jul/2022:15:41:51 | GET /v1/orders/order/get |
11/Jul/2022:15:42:00 | GET /v1/orders/order/get |
11/Jul/2022:15:43:55 | GET /v1/inventory_metrics/need_to_order |
11/Jul/2022:15:46:23 | PUT /v1/inventory/attach_location/14176 |
11/Jul/2022:15:50:51 | GET /v1/mmr/ |
11/Jul/2022:15:52:20 | GET /v1/customer/customers |
11/Jul/2022:15:52:40 | PUT /v1/batch/userLock/ |
11/Jul/2022:15:53:22 | GET /v1/mbr_legacy/index_all |
11/Jul/2022:15:53:42 | GET /v1/customer/customers?id=[81] |
11/Jul/2022:15:54:19 | PUT /v1/formula/formula_ingredient/10979 |
11/Jul/2022:15:55:50 | GET /v1/customer/customers?id=[81] Note: restarted twice in a row here, no requests in between |
11/Jul/2022:15:56:22 | GET /v1/orders/order/get |
11/Jul/2022:16:00:07 | GET /v1/customer/customers?id=[81] |
11/Jul/2022:16:34:51 | PUT /v1/inventory/attach_location/14177 |
11/Jul/2022:16:34:59 | PUT /v1/inventory/attach_location/14177 |
11/Jul/2022:16:39:08 | PUT /v1/inventory/attach_location/14177 |
Endpoint Counts Before Failures
batch/userLock | 1 (1 Put) |
orders/order | 4 (1 Post, 3 Get) |
customer | 2 (2 Get) |
customer/customers | 5 (5 Get) |
formula/formula_ingredient | 1 (1 Put) |
inventory/attach_location | 5 (5 Put) |
inventory_metrics/need_to_order | 1 (1 Get) |
mbr_legacy/index_all | 1 (1 Get) |
mmr | 2 (2 Get) |
Redis Service on EC2 API Server Review
Redis is the in-memory caching database that the OB API server connects to for session caching (and it was likely not set up properly).
There are definitely some configuration issues here.
⚠️ Note: this needs to be fixed…
WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
✅ Production and develop servers updated to fix this.
See https://docs.omniblocks.app/internal-note/configuring-ec2-for-redis/
⚠️ Note: this needs to be fixed…
WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
✅ Production and develop servers updated to fix this.
See https://docs.omniblocks.app/internal-note/configuring-ec2-for-redis/
RDS System Analysis
The RDS server now serves only production. There is no indication of a resource issue on the RDS server itself, with the exception of DB connections.
At 10:03 AM the minimum DB connections spiked to 20 and stayed there until the server was shut down. This is abnormal behavior.
This is indicative of a code logic problem where connections (very likely SQL transactions) are not finishing properly, possibly getting into a wait/lock state that could even be self-referential. It could also indicate a configuration limit on the EC2 node server (OB BE) that caps the total connections to the RDS server.
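If the cap is on the API side, it would live in the Knex pool configuration. A minimal sketch of the relevant knobs; the values and env var names are assumptions for illustration, not the current settings:

import { knex } from "knex";

const db = knex({
  client: "pg",
  connection: {
    host: process.env.DB_HOST,        // RDS endpoint (placeholder env vars)
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
  },
  pool: {
    min: 2,
    max: 10,                          // hard ceiling on RDS connections per node process
  },
  acquireConnectionTimeout: 60000,    // fail the request instead of queueing indefinitely
});

export default db;

Worth noting: if the app is running Knex's default pool (max 10) under two PM2 workers, the ceiling would be exactly 20 connections, which matches the plateau observed above; this is worth verifying.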
High Read IOPS
@13:23 the RDS server had a notable spike in Read IOPS.
Spikes are not unusual for the app, but these large ones likely indicate coding/logic issues, since they do not happen at routine intervals and are likely triggered by data operations in the code logic.
Unusual Data Queries
These are in the performance insights logs…
Who Is 71.130.164.222 — Nick Fields
These are coming from a pg_dump command…
@14:05
COPY public.batches (id, created_by, updated_by, formula_id, user_id, tenant_id, created_at, updated_at, status, batch_size, cap_or_bottle, value, template_id, qr_link, qr_code, product_name, batch_num, physical_inventory, location_id, produced_cat_num, production_output_unit, production_start_date, expiration_date, locked, current_step_name, "activeUID", completability_status_int, actual_yield, estimated_overage, expiration_offset, "customerIds", cert_marks_id) TO stdout;
@16:16
COPY public.mmr (id, status, created_at, updated_at, created_by, updated_by, tenant_id, formula_id, template_id, catalog_id, production_output_unit, batch_num, value, product_name, produced_cat_num, batch_size, cap_or_bottle, locked, expiration_offset, "customerIds") TO stdout;
Stop Gap Remedies
✅ Increase EC2 Instance Size
Increase the size of the EC2 instance to allow more CPU/memory, and thus more active data connections, to be configured. This is basically "using a bigger hammer" to space out potential downtime; however, if the "nail is bent" this is not going to be a great solution. We need to figure out the root cause.
Long Term Remedies
Isolate and Resolve Possible Coding Issue
This requires more information to first determine whether we are dealing with a code-based issue or a configuration issue.
Primary Findings
These things are possibly related to the root cause and need to be addressed soon.
- Uncaught promise / invalid string length noted above.
- Routes that seem to be related to the node server restarting:
- v1/customer searches
- v1/orders/order (get and post operations)
Secondary Findings
These things may not be the root cause but need to be addressed.
- Multiple calls to /v1/me/tenant from the front end app; this should be done once per session
- Debounce the search call for customer searches (see the sketch after this list):
/v1/customer?company_name=a
/v1/customer?company_name=al
/v1/customer?company_name=ali
- Replace skipUndefined (deprecation notice)
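A minimal front-end debounce sketch for the customer search; the function name and delay are illustrative, not the app's actual code:

// Collapse rapid keystrokes into a single request after a short pause.
function debounce<A extends unknown[]>(fn: (...args: A) => void, delayMs = 300) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A): void => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

const searchCustomers = debounce((term: string) => {
  fetch(`/v1/customer?company_name=${encodeURIComponent(term)}`);
}, 300);

// Typing "aliv" now triggers one request instead of one per keystroke.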
Isolate and Resolve Possible BE Server Configuration
This requires more information to first determine whether we are dealing with a code-based issue or a configuration issue.
Primary Findings
- ✅ ⚠️ Redis needs to have a max memory limit assigned (July 12th 11:30AM)
This has never been set up properly. The original dev team (TP) made configuration changes but completely ignored this key and critical setting.
Resolved 6:10PM EST July 12th 2022.
Secondary Findings
These things may not be the root cause but need to be addressed.
- Review PM2 and nginx configuration.
See: https://pm2.keymetrics.io/docs/tutorials/pm2-nginx-production-setup
- Review Redis configuration and setup on the API server.
The app must be stateless and thus must share login state between nodes (even on the same server / different CPUs), so something like Redis must be in place to share state.
Note: Redis may be running TWO instances on the production server (one per CPU) without sharing credentials. The auth library relies heavily on Redis for user auth, and a non-shared "in-memory database" (which is essentially what Redis is) would cause sessions to disconnect when moving between "node server on CPU1" and "node server on CPU2". Needs further investigation; a minimal sketch of shared session storage follows this list.
- Most likely need to employ Redis via AWS, not local copies.
https://aws.amazon.com/memorydb/pricing/
- ✅ DEFINITELY need to fix the server to set vm.overcommit_memory = 1 and restart.
- ✅ Fix Huge Pages (disable).
- ✅ Fix the memory limit, it is MIA (set to 1GB on an 8GB server)
- ✅ Look into getting PM2 error logs to send a notification — need to subscribe if we want this.
PM2.io is a paid service to get Slack/Email notifications
$39/month
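A minimal sketch of the shared-session idea referenced in the list above, using one Redis endpoint that every PM2 worker connects to. The key layout, TTL, and client library are assumptions, not the app's actual auth code:

import Redis from "ioredis";

// One Redis endpoint shared by both workers (host/port are placeholders).
const redis = new Redis({ host: "127.0.0.1", port: 6379 });

export async function saveSession(token: string, payload: object, ttlSeconds = 3600) {
  await redis.set(`session:${token}`, JSON.stringify(payload), "EX", ttlSeconds);
}

export async function loadSession<T = unknown>(token: string): Promise<T | null> {
  const raw = await redis.get(`session:${token}`);
  return raw ? (JSON.parse(raw) as T) : null;
}

With this arrangement it does not matter which CPU's worker handles a request; both read the same session keys. Pointing the same code at AWS MemoryDB/ElastiCache would only change the host and credentials.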
Get API Server On Fault Tolerant Cluster (High Priority)
The Production API Server MUST be put on a fault tolerant cluster, but again, this only means “more hammers” if we don’t find the root cause. HOWEVER, this should be in place as sometimes servers fail and we want a fault tolerant system with auto-failover.
The better solution will be to update the BE app to run on Amplify services with Lambda (expressed in this article: ). It will take a bit longer to implement and be a little harder to manage; however, it is the most flexible option and requires fewer IT resources overall in the long run. This would get the API app off Jenkins and use the Amplify CI/CD processes, which are a bit more sophisticated. Amplify is a container-based system that also manages the edge (CDN) networks.
Resources: (1) IT sysadmin x 1 week, (1) BE engineer x 1 week
We can create an EC2 load-balanced cluster and auto-scaling group as another option. This keeps us on the older operating system with Jenkins, which requires manual updates. It does not include a CDN without explicit configuration. The advantage is that we know this system and it can be spun up in a few days; HOWEVER, we do not have experience using Jenkins in this environment, where every node in the cluster would need to be updated at the exact same time. This could turn into a complex implementation and take longer than expected to configure.
Resource: (1) IT sysadmin x 1 week (possibly longer)
Actions Taken
20:15 EST 07/11/2022 AWS EC2 Instance Upgrade
Upgraded the OB BE Production server from AWS EC2 t3.medium (2 CPU/4GB RAM/5GB Network) => m6.large (2 CPU/8GB RAM/10GB Network).
20:45 EST 07/11/2022 Redis Server Configuration Fixes
Updated the develop server to test the proposed Redis changes noted above for memory overcommit and huge pages. Rebooted; the service appears to be working fine.
Updated the production server with the same changes. Restarted and logged in to production.
✅ Also check the /var/log/redis/redis-server.log file to ensure no errors.
18:12 EST 07/12/2022 Redis Configuration Fixes
Added the maxmemory setting to Redis. With this setting missing, memory growth can lead to catastrophic server failure.
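For reference, the relevant redis.conf directives look roughly like this. The 1GB value comes from the notes above; the eviction policy shown is an assumption and should be chosen to match how the app uses the cache:

# /etc/redis/redis.conf
maxmemory 1gb
maxmemory-policy allkeys-lru   # evict least-recently-used keys instead of failing writes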
21:15 EST 07/18/2022 OmniBlocks 1.84 Release
An update to the API server code was put on the production server. The update does not include a patch that resolves the root issue with 100% confidence as we have yet to be able to reproduce the runaway server resources in any environment. This update includes several patches to help increase our confidence that the middleware application will be stable as well as hopefully improve logging when something does go wrong.
New features include additional error logging that presents notifications to the front end application when the user is interacting with the system and something goes wrong. The goal is "no unmanaged warnings/failures" and a more graceful user experience when something does go wrong. The hope is that users report these issues to RBT Inc. when the new warnings come up.
Catching more errors also reduces the likelihood of a node server thread "hard crashing" due to a warning/error that was not caught. The theory is that when an uncaught error causes the node application to crash, it is unable to call the underlying Knex library handler to close the database connections. We have seen lingering open data connections at irregular intervals that build up on the RDS server over time. Stopping uncaught errors and crash-restarts may help mitigate this issue, especially given the PM2 wrapper around the node service that auto-restarts the API server if something goes wrong (and as routine processing to keep memory stacks clear, such as when all connections are drained).
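As a complement, a shutdown hook can make sure pooled connections are released even when a worker is being recycled. A minimal sketch, assuming `db` is the shared Knex instance; the signal handling details are illustrative:

// Drain the Knex pool before the process exits so RDS connections are not orphaned.
async function shutdown(signal: string): Promise<void> {
  console.log(`[shutdown] received ${signal}, draining database pool...`);
  try {
    await db.destroy();   // knex.destroy() closes all pooled RDS connections
  } finally {
    process.exit(0);
  }
}

process.on("SIGINT", () => void shutdown("SIGINT"));
process.on("SIGTERM", () => void shutdown("SIGTERM"));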
Additional updates to the Express router also work to catch more errors, in this case any uncaught (non-fatal) errors that may not have been handled with explicit try/catch coding. This is an attempt to at least log errors to the log files gracefully BEFORE the server hard crashes.
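The usual pattern for this kind of router-level safety net is an async wrapper plus a final error middleware. A hedged sketch of the idea, not the actual 1.84 diff:

import express, { NextFunction, Request, RequestHandler, Response } from "express";

// Forward rejected promises from async handlers into Express error handling.
const asyncHandler = (fn: RequestHandler): RequestHandler =>
  (req, res, next) => Promise.resolve(fn(req, res, next)).catch(next);

const app = express();

app.get("/v1/example", asyncHandler(async (_req, res) => {
  res.json({ ok: true });   // any throw/reject above lands in the error middleware
}));

// Final error middleware: log and return a controlled 500 instead of crashing the worker.
app.use((err: Error, _req: Request, res: Response, _next: NextFunction) => {
  console.error("[route error]", err.stack);
  res.status(500).json({ error: "Internal server error" });
});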
While this update does not address the root cause of the server crash with 100% confidence, it moves us another step in the direction of stability.
Work on preparing the codebase for a proper fault-tolerant scalable cluster is next on the list.
Follow Up & Monitoring
July 12th 2022
The new server configuration uses the same number of CPU cores, but it is on a much newer generation (gen 6 vs. gen 3). As such, the CPU utilization is less than HALF the average the t3.medium instance showed.
Memory usage remains low, about 1GB.
PM2 monitoring looks normal; the 400M+ of RAM for a single API server thread is a bit high. The PM2 config needs to be updated for the 8GB of RAM on the new server vs. the 4GB on the old one.
July 13th 2022
Memory check – well under the 7.8GB available, at 1.4GB usage (instantaneous)
PM2 is showing 400MB+ memory usage again; not "out of range" but definitely higher than average.
July 18th 2022
Nothing new in the review of server logs and services. The 8GB of memory on the new server appears to be handling the spikes in memory consumption; however, we still have no direct leads on what is triggering the memory consumption spikes when they occur.
Overall network I/O and CPU usage are far below capacity, operating at < 10% maximum capacity.
A slightly different view of the server resources on the production API server (htop) shows nothing concerning. Real-time monitoring does show an occasional spike in CPU and memory usage, but the load average shows that it is well within normal operating parameters:
Behind the EC2 API server, the RDS server is operating well within configured parameters despite running on a t3.small burstable instance (note: this should be upgraded to an m5 instance at some point in the near future).
RDS CPU
DB connections, driven by the production API server, are within normal range and well within what the current RDS configuration can handle:
July 19th 2022 : Post OB 1.84 Update
Both API worker processes (one per CPU core) are running at 500M of RAM; that is unusual.
htop is also showing unusual memory consumption for the API app. Maybe this is the "new normal" for 1.84, but in the past anything over 150M of memory consumption was a concern:
That said, the 1.84 build DOES include an updated PM2 configuration that bumps the "restart the API node server" memory threshold from 200M to 2G, which would indicate the node server is now holding a fairly large cache in either shareable code segments or data I/O caching.
Redis itself has been configured for nearly a week to allow 1GB max usage, so if that were the trigger for the new pattern, this memory consumption pattern should have shown up earlier this week or at the end of last week.
Conclusions
The API server outage has been an ongoing problem. It is very clear that time needs to be allocated in the tech services schedule to put the OmniBlocks® API server onto a scalable cluster. This effort will take a full week of IT/sysadmin work followed by a full week of development work. Unfortunately we do not have a dedicated IT/sysadmin, so much of the initial effort will likely take more than a week to implement AFTER the options analysis and specification documents are finished.
Over the past few outages we've eliminated some key problem areas, but they did not resolve the issue fully: the RDS server was upgraded to better meet capacity, the RDS server was made fault tolerant, the API server was updated to better meet capacity, SQL query wait times were notably reduced by fixing some poorly implemented order module queries, and data indexes were added to improve performance and reduce wait times. These all helped, but some unknown series of interactions still triggers data I/O and resource runaway (we think memory related) from the code. The server does not handle the resource shortage well and hangs all processes, including SSH logins and AWS fabric requests (such as the reboot request).
The problem is multi-layered, involving deep code and architecture issues. The code issues could involve underlying supporting library issues (Knex, Objection, Redis, PM2). RBT Inc. algorithms have been updated along the way; however, nearly 50% of the codebase on the API server still utilizes some form of the inherited TP code structures. The most blatant problem areas have been resolved.
In addition, there are server and cloud configuration issues that are causing the API server to become non-responsive. The server runs out of resources so quickly when this issue happens that it does not send problem notifications as the services fail. Extensive work needs to be done with AWS to get early-warning notifications sent from either the AWS fabric or the server itself. Some of this will be mitigated by taking the time to move the API server to a new services architecture outside of EC2, or within EC2 as part of a proper monitoring and scaling group.
Update (11:30AM July 12th) — just discovered a key setting in the Redis configuration file that the original dev team never applied. They customized the configuration but left out a critical piece — the memory limit (details above).
This is a complex situation and highlights the need to get an IT/Sysadmin with cloud experience on staff.