Initial Review

Reference: original write-up of the API Server crash in April 2022

Synopsis

Problem

The production API server went offline from 9:35 AM to 10:35 AM on Thursday, April 28th, 2022.

Resolution Process

The production API EC2 instance was non-responsive and not accepting SSH connections; the server was rebooted.

Connectivity was restored, but the REST API service was not running.

An API rebuild via the standard Jenkins CI/CD tool did not restart the application.

A command-line clean-up and restart of the PM2 process was needed.

Recommendations

Code fixes (refactor) on back-end (BE) processes to isolate and resolve runaway code.

Update the PM2 process manager and/or move to a managed service like AWS Amplify.


July 2022 Executive Report

Cause

The root cause of this issue is a combination of code logic problems and data structure problems. Many of those issues were inherited from the original tech team (TP); others were introduced in new code added to the system since the original team was replaced.

Data Structures

Many of the data structures inherited from TP are non-performant and non-scalable. Fixing these data structures has been an ongoing process. It is a significant undertaking that would require several months to address all of the known data structure issues while maintaining access to legacy data and not breaking the current interfaces. Unfortunately, the original TP data structures are not isolated from the data processor or the front end application. As such, a “simple change” to address many data structure issues requires a rewrite of the data model, data processor, REST controller, and even the front end application. This puts a significant load on the dev team when addressing known technical issues.

We do not know specifically which data structures were involved in the performance issue in this case, as a half-dozen data structures were indicated in the data queries in play at the time of the failure. The primary culprits were the production model batches table and the inventories table.

A major contributor to performance and reliability problems is the TP-designed architecture that stores virtually every actionable data point in a single massive JSONB field. We term this field and all the related API controllers and processors the “Heart of the Flying Spaghetti Monster” (HoFSM). It is an incredibly horrid design that impacts the form builder (and related sections/templates modules), MMR, and production modules. It has been cited as a primary concern since the first full review of the architecture, dating back to Q3 2020.
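The exact keys and nesting of this field are not documented in this report; purely as an illustration, a single row's “value” blob under this design might look something like the sketch below, with layout, defaults, user input, and the numbers inventory is derived from all living in one field.

```js
// Purely illustrative sketch of a single "value" JSONB blob under the HoFSM
// design (the real keys and nesting are not documented in this report).
// Layout, defaults, user input, and the numbers inventory is derived from
// all live in one field on one row.
const hypotheticalValueBlob = {
  layout:    { sections: [{ id: 's1', fields: ['qty', 'lot'] }] }, // UX/layout
  defaults:  { qty: 0, uom: 'kg' },                                // default values
  input:     { qty: 12.5, lot: 'LOT-0042', weighedBy: 'user-17' }, // final user input
  inventory: { consumed: [{ sku: 'ING-001', qty: 12.5 }] }         // values the API must "do the math" on
};
```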

This has not been fully addressed because it is a major functional component that underpins much of the OmniBlocks application. Changing it en masse means a nearly full rewrite of the main components of the application. Unfortunately, the resources have not been available to undertake this endeavor fully. In addition, the tech team was told the final MVP was imminent and therefore needed to focus on stabilizing the existing system and meeting customer requests rather than pausing development for roughly three months to work on this refactor.

Data Queries

The poor data architecture noted above forces equally poor data queries, processors, and controllers to be written to work around the notable shortcomings of this “HoFSM” single-field-holds-all-info design. The design “technically works”, but it requires insanely complex code in the data processors and related endpoint controllers in order to “do all the math and related data juggling” needed to read and update that data structure. This makes the controller and processor logic overly complex and difficult to maintain.
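A minimal sketch of the kind of data juggling this forces is below, assuming a Node backend using the node-postgres (“pg”) client and a blob shaped like the hypothetical example above; the real processors, tables, and paths will differ.

```js
const { Pool } = require('pg');
const pool = new Pool(); // connection settings come from the standard PG* environment variables

// HoFSM style: pull the whole blob, then walk and aggregate it in the processor.
async function consumedQtyFromBlob(batchId) {
  const { rows } = await pool.query('SELECT value FROM batches WHERE id = $1', [batchId]);
  const consumed = rows[0]?.value?.inventory?.consumed ?? [];
  return consumed.reduce((sum, line) => sum + Number(line.qty || 0), 0);
}

// Discrete-column style (hypothetical batch_consumption table): the database
// answers directly, can use indexes, and the controller stays trivial.
async function consumedQtyFromColumns(batchId) {
  const { rows } = await pool.query(
    'SELECT COALESCE(SUM(qty), 0) AS total FROM batch_consumption WHERE batch_id = $1',
    [batchId]
  );
  return Number(rows[0].total);
}
```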

The Sales Order Data Logic

This situation precipitated another problem that contributed to the shutdown: a workable but “not the best architecture” implementation of inventory status queries. A request was made by executive staff, on behalf of our clients, to show them the status of sales orders and provide a visual indicator of “what was ready to be produced, shipped, etc.”. Essentially, this is an “at-a-glance” visual icon and status code that provides the information needed for decision making and planning of the manufacturer’s production and shipping pipelines.

The architecture was designed with a rudimentary “get the people what they want as quickly as possible” approach. Several elements of the design, presented to the rest of the dev team after several weeks of coding were complete, were deemed “functional but potentially problematic under load”. Areas of concern included not using an aggregate meta table to store already-calculated inventory and status codes, as well as an overly complex data collation query that the core PostgreSQL engine could not optimize in real time.
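As a hedged sketch of the aggregate meta table approach that was flagged as missing (the table and column names here are hypothetical), the idea is to pay the calculation cost at write time so the sales order screen reads one small, indexed row per order.

```js
const { Pool } = require('pg');
const pool = new Pool();

// Hypothetical aggregate table, refreshed whenever inventory or order state changes:
//   CREATE TABLE sales_order_status_meta (
//     sales_order_id integer PRIMARY KEY,
//     status_code    text NOT NULL,      -- e.g. 'READY_TO_PRODUCE', 'READY_TO_SHIP'
//     qty_available  numeric NOT NULL,
//     refreshed_at   timestamptz NOT NULL DEFAULT now()
//   );

// Write path: recompute a single order's status when its inputs change.
async function refreshOrderStatus(orderId, statusCode, qtyAvailable) {
  await pool.query(
    `INSERT INTO sales_order_status_meta (sales_order_id, status_code, qty_available)
     VALUES ($1, $2, $3)
     ON CONFLICT (sales_order_id) DO UPDATE
       SET status_code = EXCLUDED.status_code,
           qty_available = EXCLUDED.qty_available,
           refreshed_at = now()`,
    [orderId, statusCode, qtyAvailable]
  );
}

// Read path: the "at-a-glance" endpoint becomes a trivial indexed lookup
// instead of a heavy real-time collation query.
async function getOrderStatuses(orderIds) {
  const { rows } = await pool.query(
    `SELECT sales_order_id, status_code, qty_available
       FROM sales_order_status_meta
      WHERE sales_order_id = ANY($1)`,
    [orderIds]
  );
  return rows;
}
```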

This particular sales order query was a concern on its own, but the data I/O volume from the current customer base was low enough that the application would keep functioning until we had time to go back and revise the architecture. It did not directly cause the April downtime, but it was one of several poorly written data queries and structures that led to the API server becoming overloaded.

Non-Scalable API Server Architecture

The original TP architecture stood up the entire production system using a single-point-of-failure front end application server, a single-point-of-failure middleware server, and a single-point-of-failure database server. Of these, the API server remains in a single-point-of-failure configuration.

As such, when the data structure and query problems started to spiral out of control under new load from the customer base, the API server was unable to keep up. The PM2 process manager that was employed, one of the few good decisions by the TP team, could not keep up with the service load despite running two processes on two cores.

The fact that TP also misconfigured the server by running several PM2 managers on the same API server made the problems worse. Unfortunately, the misconfiguration was not discovered until the service failed.


Resolution

Actions Taken

HoFSM Phase 1 Refactor

Time was allocated in the development schedule for refactoring. After this failure, the main focus was the batches table that supports production and the insanity of the “value” field. Four weeks of the development schedule were spent on what we called the “Heart of the Flying Spaghetti Monster Phase 1” refactor.

This reduced the database load and increased the stability of the MMR/Production modules. It allowed for more reliable and faster calculation of inventory levels, and it simplified the code so it is easier to follow and maintain. It also ensured that processes like reducing inventory and marking special “unusable inventory states”, such as mixed ingredients, are set reliably. This significantly improved the stability and reliability of the production module and its impact on inventory levels, directly affecting the formula and smart weighing components of the production process as well as inventory level reliability.
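As an illustration of what “set reliably” means here (the column names are hypothetical, not the exact Phase 1 schema), an inventory reduction plus an “unusable” flag can now be plain column updates inside one transaction, rather than a read-modify-write of a JSON blob.

```js
const { Pool } = require('pg');
const pool = new Pool();

// Hypothetical post-Phase-1 shape: inventory quantity and unusable state live
// in discrete columns, so the whole adjustment is one atomic transaction.
async function consumeInventory(inventoryId, qtyUsed, markUnusable) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query(
      'UPDATE inventories SET qty_on_hand = qty_on_hand - $2 WHERE id = $1',
      [inventoryId, qtyUsed]
    );
    if (markUnusable) {
      // e.g. mixed ingredients that can no longer be returned to stock
      await client.query(
        "UPDATE inventories SET unusable_state = 'MIXED' WHERE id = $1",
        [inventoryId]
      );
    }
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```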

This is only Phase 1 of a multi-stage refactor to address the horrid “store all data in a single field” architecture inherited from TP, as noted above.

PM2 Service Reconfigured

The misconfigured PM2 services on the API server were cleaned up soon after they were discovered. The extra PM2 services running under the Ubuntu (OS admin) and Jenkins (CI/CD service) users were permanently removed, and the correct node-managed PM2 service bound to the OmniBlocks API server application was left intact.

The PM2 service was also updated to include a PM2 configuration file that takes into account the number of processors on the server, memory restrictions, and automatic restart of failed processes. None of this was configured prior to this update.
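A minimal sketch of what that configuration file can look like is below; the file name, entry point, and limits shown here are assumptions, not the exact production values.

```js
// ecosystem.config.js (illustrative; real paths, names, and limits will differ)
module.exports = {
  apps: [
    {
      name: 'omniblocks-api',
      script: './dist/server.js',   // assumed entry point
      exec_mode: 'cluster',         // run multiple processes that share the port
      instances: 'max',             // PM2 sizes the cluster to the CPU count
      autorestart: true,            // restart a process that crashes
      max_memory_restart: '1G',     // recycle a process that grows past this limit
      env: {
        NODE_ENV: 'production'
      }
    }
  ]
};
```

The file is loaded with `pm2 start ecosystem.config.js`; cluster mode with `instances: 'max'` is what lets PM2 size the process pool to the server's cores.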

This allows an overloaded API process to be restarted automatically and allows multiple processes to run in parallel, with PM2 distributing requests across the running processes.

This is still not the best option, as the “cluster” is still running on a single server instance, with the limitations on CPU, memory, and network I/O that come along with a “single hardware box” setup.


Remaining Areas of Concern

Non-Scalable API Server

The API server remains configured as a “single-box” service. This needs to be addressed by IT, working with a cloud services expert, to reconfigure it as a horizontally scalable cluster: if a single server fails, the other servers take up the load, with AWS automatically adding instances as needed.

There are several paths that can be taken. One option is using the current EC2 instance to derive a baseline image and then building a proper auto-scaling group. It would work very much like the current configuration; however, there are likely to be API request routing concerns that need to be tested to ensure any shared assets are managed properly (S3, RDS, or other shared “static asset stores”). Another option to be considered is putting the API server on the AWS-managed Amplify service. A decent amount of research has been done on this option and it is feasible.
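For the auto-scaling-group path, a rough AWS CDK sketch (JavaScript) of the target shape is below; the VPC, instance size, baseline AMI, and API port are placeholders, and the real build would start from an image of the current EC2 instance.

```js
// Rough CDK sketch of the auto-scaling-group option (names, sizes, and the
// baseline AMI are placeholders, not the real production values).
const cdk = require('aws-cdk-lib');
const ec2 = require('aws-cdk-lib/aws-ec2');
const autoscaling = require('aws-cdk-lib/aws-autoscaling');
const elbv2 = require('aws-cdk-lib/aws-elasticloadbalancingv2');

class ApiScalingStack extends cdk.Stack {
  constructor(scope, id, props) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'ApiVpc', { maxAzs: 2 });

    // Auto Scaling Group built from a baseline image of the current API server.
    const asg = new autoscaling.AutoScalingGroup(this, 'ApiAsg', {
      vpc,
      instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MEDIUM),
      machineImage: ec2.MachineImage.genericLinux({ 'us-east-1': 'ami-0123456789abcdef0' }), // placeholder baseline AMI
      minCapacity: 2,   // no more single point of failure
      maxCapacity: 6,   // AWS adds instances under load
    });
    asg.scaleOnCpuUtilization('ScaleOnCpu', { targetUtilizationPercent: 60 });

    // A load balancer in front of the group replaces the single EC2 endpoint.
    const alb = new elbv2.ApplicationLoadBalancer(this, 'ApiAlb', { vpc, internetFacing: true });
    const listener = alb.addListener('Http', { port: 80 });
    listener.addTargets('ApiTargets', { port: 3000, targets: [asg] }); // assumed API port
  }
}

module.exports = { ApiScalingStack };
```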

Either option will take a couple of weeks of dedicated effort from a qualified AWS cloud and scalability expert.

AWS support and internal contacts have been contacted as well to assist in planning this server migration.

Resources have yet to be prioritized/allocated to address this concern.

HoFSM Phase 2 Refactor

The data structure updates and the corresponding controller and processor updates have yet to be started for the MMR and Production modules. The underlying massive value field and its supporting complex logic were significantly reduced by the Phase 1 refactor, but many non-performant and hard-to-maintain pieces remain.

Focus was shifted, per executive decision, to work on the “QC Module”. During the design and planning of this module it was determined that a full refactor of the Form Builder field subsystem is needed. While not directly related to this refactor, much of that work can be leveraged in a future update to alleviate some of the problems inherent in carrying around the remaining weight of the “HoFSM”. That will make the Phase 2 refactor less urgent, but still necessary.


Recommended Actions

1. Make the API Server Horizontally Scalable

Resources need to be allocated from the IT team to spend the couple of weeks necessary to replace the existing Jenkins + EC2 configuration on the API server with a proper Amplify or EC2 Auto Scaling group configuration. This will also require a few days, possibly a week or more, of development team resources to add code support for the new environment.

Resource Estimate: (1) IT/System Admin for 2 weeks, (1) backend developer for 1 week

2. Finish HoFSM Phase 2

Finishing the next major piece of the HoFSM refactor should be a priority. Its impact goes beyond code maintenance: it continues to add excess burden on the database server and, by extension, the API service that crashed per this incident report. The focus here is taking more information out of the single “value” JSONB field and moving it into discrete, actionable fields for things like “UX/layout”, “default values”, and “final user input”.
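The general shape of that split, sketched with hypothetical column and key names (the agreed Phase 2 schema will be driven by the MMR/Production data model), is roughly:

```js
const { Pool } = require('pg');
const pool = new Pool();

// Hypothetical Phase 2 sketch (illustrative column names, not the agreed schema):
// pull UX/layout, default values, and final user input out of the single
// "value" JSONB blob into discrete, queryable fields, then backfill from the blob.
async function splitValueField() {
  await pool.query(`
    ALTER TABLE batches
      ADD COLUMN IF NOT EXISTS layout_config  jsonb,
      ADD COLUMN IF NOT EXISTS default_values jsonb,
      ADD COLUMN IF NOT EXISTS user_input     jsonb;

    UPDATE batches SET
      layout_config  = value -> 'layout',
      default_values = value -> 'defaults',
      user_input     = value -> 'input'
    WHERE value IS NOT NULL;
  `);
}
```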

Resource Estimate: (1) front end developer for 1 week, (1) backend developer for 4 weeks, (1) db admin/architect for 1 week

3. Continue The “Fields Subsystem” Work

The fields subsystem will have a direct impact on the MMR and Production modules, and may be the last piece needed to fully eliminate the HoFSM single-field implementation.

Resource Estimate: (1) backend developer for 3 weeks, (1) db admin/architect for 1 week