Summary
The OmniBlocks® application has been the main service offered by RBT to the current customer base. It has been the primary focus for the tech team since April 2020. The lightweight MRP system was implemented to provide a lower-cost solution for the initial customer base that was largely using “paper-based” systems to document and track production. There was a direct need to get their production data into managed digital storage in order to facilitate the planned blockchain and supply chain tracking.
Unfortunately the OmniBlocks® application was not architected to be scalable or performant. RBT inherited significant technical debt that has yet to be fully mitigated.
History
The original application was architected and coded for months prior to April 2020 by an outsourced team from Pakistan we’ve labelled “Team Pakistan” (TP). This original application was taken over by an internal team at RBT in Q4 2020 after several months of overlap with the original development team.
Original Assessment 2020
Before taking over the application an assessment was performed in Q2 2020. It was determined that the best path forward would be a complete rewrite of the application, using the current application as a template for the required feature set. The projected resources needed to do the rewrite in 2020 were $1M and a half-dozen technical staff including software engineers, data specialists, and IT support.
The executive decision at that point in time was to focus on a refactor versus a complete rewrite. RBT Inc. did not have the resources to do the rewrite at that time. In addition, a new client coming onboard (Naomi Whittel) was anticipated to need a fully operational system in less than 3 months.
RBT Inc. decided it would pursue funding for the rewrite while taking a “best effort” approach to patching the application shortcomings. A $1M development budget was created for this purpose. That funding goal has yet to be met, and thus the full rewrite has yet to be addressed.
The “best effort” refactor that has been underway since 2020 has been slow-going at best. This was expected to be a stop-gap solution as “funding was just around the corner”. The lead developer/interim CTO was coding part-time while taking on other tasks. The other developer was inexperienced at the start of the project and was learning advanced techniques and best practices while “on the job”; expected results for a low-cost intern/junior developer. A second outsourced team brought on a year later failed to fulfill its promises; that team generated less than 70% of the anticipated throughput and the oversight/management component never materialized. The outsource contract was terminated after several months due to performance as well as budget concerns.
Revised Assessment Q2’22
Today the tech team (a junior developer + our CTO) continues to evaluate options. Despite two years of ongoing efforts to do inline refactoring while “keeping the application running”, significant portions of the application remain on the “needs refactoring” list. In addition, numerous issues with the IT infrastructure and security processes have been uncovered.
To this day the tech team recommends a complete rewrite versus a refactor. Not only does the current application continue to carry a large portion of the original technical debt, but the scope of the application has also increased dramatically. Entire new modules have been added alongside extensions to features in the current application. While most of the newer modules use industry best practices, these modules often interface with legacy components that force compromises to be made to maintain compatibility.
The original $1M budget may no longer cover the full rewrite due to economic and industry changes in the past 2 years as well as the increased scope of the codebase.
Areas Of Concern
Summary
High Risk
Middleware Server Scalability
The current configuration hosts the REST API listener on a single AWS EC2 instance. This middleware server is a single-point-of-failure through which all data requests are passed. It should be configured as part of a horizontally scalable cluster with at least 2 nodes in place for fault tolerance. This requires several underlying architecture changes for the REST API backend involving code changes, application deployment process changes, security updates, and setup/configuration of several cloud services.
Estimated resource requirement: 1x IT/Sysadmin + 1x Backend Engineer for 3 weeks
BE Production Module
Non-normalized data structure with “full record in a single field” combined with over 2 years of legacy data. This module rework involves complex data migration, data architecture, and supporting logic that will result in mirrored changes on the front end. The refactor would need to happen in multiple phases. Touches on a wide array of back end issues detailed below including memory caching, error handling, API consistency, algorithm complexity, among others.
Estimated resource requirement: 1x Full Stack Engineer + 1x Backend/Data Engineer for 8-10 weeks (refactor); this is a rewrite candidate
Sales Order Module
Complex data processing logic and a lack of supporting meta tables for the production, inventory, and orders data require an extensive build out of data tables and the models plus logic to support them. Touches on a wide array of back end issues detailed below including memory caching, error handling, API consistency, algorithm complexity, among others.
Estimated resource requirement: 1x Full Stack Engineer + 1x Backend/Data Engineer for 6 weeks
User Security/Passwords
The username/password security in the OmniBlocks application is very weak. The request to address this ASAP has been pushed down the priority list repeatedly. Weak user passwords are still in play on a public SaaS application, and the lack of true password security rules and MFA is an issue. It is only a matter of time before a malicious actor finds the site and executes a successful brute force attack on a customer account.
Estimated resource requirement: 1x Full Stack Engineer + 1x Backend/Security Engineer for 3 weeks
Moderate Risk
Backup Services
The EC2 middleware server should have routine snapshots configured.
The WordPress sites should all have an active Jetpack subscription with backup services enabled.
Estimated resource requirement: 1x System Admin for 1 day
Low Risk
QC Module
The original module was unusable and has been discarded. New architecture was started. This needs to be significantly extended. To get the most value this should be designed with the existing form builder + MMR + Production module in mind so the tech can later be leveraged in the refactor/rewrite of the high priority BE Production Module noted above — this adds complexity but is a good ROI for the effort.
Estimated resource requirement: 2 Full Stack Engineers x 6 weeks
Front End Data Connections
The intersection of the back end services and the front end is inconsistent, with legacy architecture poorly or incorrectly implemented. This task involves several items noted below, such as the Redux/Saga and Data API calls problem areas. Some work can be done solely on the front end; however, much of it needs to be done in tandem with backend development, increasing the complexity of this task.
Estimated resource requirement: 1x Front End Engineer and 1x Back End Engineer x 4 weeks
Front End User Experience (UX)
Multiple modules retain an incorrect implementation of the Ant Design framework and remain to be refactored. This leads to inconsistent search results, pagination, sorting, and other basic experience differences between modules. Time needs to be allocated to ensuring consistency across the user experience. This includes the timing loops detail item noted below as well as various UX issues noted throughout this document.
Estimated resource requirement: 2x Front End Engineers x 6 weeks
Detailed Review of the Application Architecture
Front End
Incorrect Redux/Saga Implementation
The original architecture (TP version) utilized Redux/Saga for managing data flow between the front-end application and the REST API server. Redux/Saga is a common toolkit used with React applications for managing data flow. When implemented properly it can alleviate the rendering workload on the front end while simultaneously reducing data requests from the backend.
Best practices call for a shared singular Redux data store for most applications with possible limited deviations for specific use cases. For example, one module of the application may require more detailed real-time data queries and rendering to provide the best overall user experience. In those cases a subset of the application-level data store may be implemented as a module-specific saga with a corresponding Redux data store.
The application that was inherited had over two dozen individual data stores with corresponding sagas. Essentially one independent data store per React container. In addition, these stores often did not maintain state when switching between modules. A user looking at inventory data, then switching to the production view and then back to inventory would cause the application to re-request data and re-update the data store. While there are some cases where this is preferred, it should be handled on a case-by-case basis.
To this day many modules still use this less-than-optimal implementation of Redux/Saga. This causes undue burden on the API server. That leads to a requirement for larger API servers with more CPU and memory to handle the load. It increases network I/O and overall cost of operation of the application. In addition it can cause slow response times on the front end which leads to a subpar user experience.
- Multiple API requests for same data
- Siloed data stores for every page/module
- Incorrect parameter passing
- Pagination
- Sorting
- Searching
Modules impacted: AddTemplate, AuditTrail, BuilderForm, Conversions, Customer, DragDrop, Formula, HomePage, Inventory, LanguageProvider, Login, Measures, PurchaseOrdersForm, Settings, Signup, SOPs, Statements, Tenant, TenantType, Unit, User
Remediation: replace all Redux/Saga stores with React Context or other shared state/data store management
Impact: Poor user experience, slow front end performance, excess data request from API server, excess data volume to/from API server
Customer satisfaction impact: Moderate
Risk of downtime: Low
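To make the remediation above concrete, here is a minimal sketch of a shared React Context store that several containers could read from instead of holding their own Redux/Saga store. The provider, hook, endpoint, and data shape are illustrative assumptions, not code from the OmniBlocks codebase.

```tsx
import React, { createContext, useContext, useEffect, useState } from "react";

// Illustrative record shape only; the real inventory data has more fields.
interface InventoryItem {
  id: number;
  name: string;
  quantity: number;
}

interface InventoryStore {
  items: InventoryItem[];
  loading: boolean;
  refresh: () => Promise<void>;
}

const InventoryContext = createContext<InventoryStore | null>(null);

// One provider mounted near the application root replaces the per-container
// Redux/Saga stores: state survives switching between modules, and data is
// fetched once instead of on every container mount.
export function InventoryProvider({ children }: { children: React.ReactNode }) {
  const [items, setItems] = useState<InventoryItem[]>([]);
  const [loading, setLoading] = useState(false);

  const refresh = async () => {
    setLoading(true);
    try {
      const res = await fetch("/api/inventory"); // hypothetical endpoint
      setItems(await res.json());
    } finally {
      setLoading(false);
    }
  };

  useEffect(() => {
    refresh(); // initial load only; returning to the module reuses cached state
  }, []);

  return (
    <InventoryContext.Provider value={{ items, loading, refresh }}>
      {children}
    </InventoryContext.Provider>
  );
}

// Any container (Inventory, Production, etc.) reads the same shared store.
export function useInventory(): InventoryStore {
  const store = useContext(InventoryContext);
  if (!store) throw new Error("useInventory must be used inside InventoryProvider");
  return store;
}
```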
Incorrect Data API Calls and Processing
Directly related to the poor Redux/Saga implementation is the incorrect structure of API calls and response processing. This is most notable in how it impacts the pagination, sorting, and filtering of data across several modules. Virtually all of the original Redux/Saga calls were architected to perform data filtering, sorting and pagination AFTER data was returned from the API.
The proper implementation and best practice is to send the parameters such as sort order, data filters, or page info to the API server. The API server processes the incoming parameters and structures a data query that represents the request. The data set that is returned is filtered, sorted, then paginated to represent the original intent of the query.
The TP architecture, on the other hand, sends a generic “get me everything” query to the API server. This query is then processed by the FRONT END after the API response is received. Records are removed from the data set AFTER they have been returned; that data is then sorted locally and paginated. As a result the user experience often shows pages with no records, or fewer than the desired number of records on a page. In essence it behaves as if all the records were rendered for the user at the start, then rows that don’t match a filter are deleted from the displayed table. This creates a user experience that often renders tables with no records on the first page or two, a third page with one record, a fourth page with five records, and so on. This is unexpected and atypical behavior for a well-crafted application.
Modules impacted: many of those listed in the Redux/Saga modules list. SOME of the Redux/Saga calls perform proper parameter passing due to limited refactoring.
Remediation: remove all front end filter/pagination/sorting processing, implement proper query passing to the API server
Recommendations: Continue the migration to React Context based data I/O management, combine with SQL query management and caching libraries
Impact: Extremely poor user experience, excess data I/O on the database server, excess data requests and resource consumption on the middleware server.
Customer satisfaction impact: Moderate
Risk of downtime: Low
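For illustration, a minimal sketch of the remediated contract: the front end sends the query intent (page, size, search, sort) and the API applies it in SQL. The endpoint path, table name, and parameter names are assumptions for the sketch, not the actual OmniBlocks API.

```typescript
// Two sides of the same contract, shown together for brevity; in practice
// these live in the front end and the API server respectively.
import type { Request, Response } from "express";
import knexFactory from "knex";

// --- Front end: send the query intent instead of filtering locally ---
export async function fetchInventoryPage(page: number, pageSize: number, search: string, sortBy: string) {
  const params = new URLSearchParams({
    page: String(page),
    pageSize: String(pageSize),
    search,
    sortBy,
  });
  const res = await fetch(`/api/inventory?${params.toString()}`); // hypothetical endpoint
  return res.json(); // the server returns only the requested page
}

// --- Back end: apply filter, sort, and pagination in SQL ---
const knex = knexFactory({
  client: "pg",
  connection: process.env.DATABASE_URL ?? "postgres://localhost/dev", // placeholder
});

export async function getInventory(req: Request, res: Response) {
  const page = Number(req.query.page ?? 1);
  const pageSize = Math.min(Number(req.query.pageSize ?? 25), 100);
  const search = String(req.query.search ?? "");
  const sortBy = String(req.query.sortBy ?? "name"); // whitelist sortable columns in real code

  const rows = await knex("inventory") // hypothetical table name
    .modify((qb) => {
      if (search) qb.where("name", "ilike", `%${search}%`);
    })
    .orderBy(sortBy)
    .limit(pageSize)
    .offset((page - 1) * pageSize);

  res.json({ rows });
}
```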
Lack of state update timing loops
While not causing errors in the existing application, the lack of a timing loop on the front end application is a notable shortcoming. The lack of such a system makes it impossible to have modern real-time notifications or other visual indicators within the application. At the moment the entire front end application has no mechanism to present the user with real-time information updates unless the user explicitly interacts with the system. The current architecture also makes it nearly impossible to post notifications across modules; in other words, if you are looking at the inventory module it would be extremely difficult to get a notification about an ongoing production state.
Recommendations: Implement one of the many available React notification libraries or message processors throughout the application.
Impact: Degraded user experience that does not provide real-time updates per modern application design standards
Customer satisfaction impact: Moderate
Risk of downtime: None
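As a rough illustration of the missing timing loop, a minimal polling hook is sketched below. The endpoint, payload, and interval are placeholders; a websocket or one of the off-the-shelf React notification libraries mentioned in the recommendation would fill the same role.

```tsx
import { useEffect, useState } from "react";

// Polls a lightweight status endpoint so any module (inventory, production,
// etc.) can surface cross-module notifications without user interaction.
export function useProductionNotifications(pollMs = 15000) {
  const [notifications, setNotifications] = useState<string[]>([]);

  useEffect(() => {
    let cancelled = false;

    const tick = async () => {
      try {
        const res = await fetch("/api/notifications"); // hypothetical endpoint
        if (!cancelled) setNotifications(await res.json());
      } catch {
        // Ignore transient failures; the next tick will retry.
      }
    };

    tick();
    const timer = setInterval(tick, pollMs);
    return () => {
      cancelled = true;
      clearInterval(timer);
    };
  }, [pollMs]);

  return notifications;
}
```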
Back End
The back end application is a REST API endpoint processor written in TypeScript that handles API requests from various sources. The service runs on a Node.js server, managed by PM2, and communicates with an AWS PostgreSQL database on the RDS service. The application employs the Knex library with an Objection layer to provide a standard data interface that is decoupled from the underlying SQL implementation. A memory cache database, Redis, serves to speed up data requests.
Incomplete implementation of memory caching (Redis)
The TP-designed application employs Redis to reduce data queries to the SQL persistent data store and increase performance of oft-requested SQL queries. Unfortunately the implementation stopped at an overly-complex custom authentication module, also designed by TP. Thus the only data queries that benefit from a performance boost are the initial user authentication queries and follow-on “is the request still from a valid user” queries that come in via REST requests. These queries are poorly constructed, often leading to stale or mismatched key/value pair entries and reducing the effectiveness of the cache. In addition this can lead to inconsistent user state experiences where users may be logged out if the session keys are flushed prematurely from memory.
Modules impacted: User authentication, which precedes all REST API requests.
Remediation: Review and revise the Redis requests to ensure consistency and stability.
Recommendations: Redis can generate notable performance increases in application responses while significantly reducing read-only requests from the persistent data store. Redis should be implemented across more REST endpoints, especially those with heavy read-only data (units, tenants, users, settings).
Impact: Inconsistent user experience with unintentional session disconnects, increased data I/O requests to the SQL server
Customer satisfaction impact: Moderate
Risk of downtime: Low
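To illustrate the recommendation of extending Redis to read-heavy endpoints, here is a minimal read-through cache sketch using ioredis with deterministic keys and an explicit TTL; the key format, TTL, and loader are assumptions.

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379");

// Read-through cache for heavy read-only data (units, tenants, users, settings).
// A deterministic key plus a TTL avoids the stale/mismatched key issues noted above.
export async function cached<T>(
  key: string,
  ttlSeconds: number,
  loader: () => Promise<T>
): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit) as T;

  const value = await loader();
  // "EX" sets the expiry in seconds so entries age out predictably.
  await redis.set(key, JSON.stringify(value), "EX", ttlSeconds);
  return value;
}

// Example usage (hypothetical loader):
//   const units = await cached(`tenant:${tenantId}:units`, 300, () => loadUnitsFromSql(tenantId));
```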
Invalid/Incomplete error catching and reporting
Best practices employed by most applications use granular try/catch procedures to provide not only graceful error handling but also actionable information when something does fail. The original backend processor was designed throughout with generic “try to do everything, and if anything fails throw a generic error” processes. Worse, in some cases the “throw an error” was not employed, instead opting to log a hidden message and continue on as if nothing was wrong. It is the coding equivalent of having a tire blow out on a highway but continuing to drive to your destination on the rim. The result was inconsistent data being stored or returned to the front end application, random application crashes, or in the best case scenario excess resource usage while covering up the errors.
Much of the application has been revised to narrow down the scope of the try/catch error handling to send back a single actionable message. Messages are now action-specific such as “could not update this particular record” or “a variable in the code was not defined”. More importantly, most of these messages are now handled where they occur: processing stops immediately and the error is sent to the front end application where it is displayed for the user.
There are some modules that remain which are still using the legacy architecture. While most instances of “pretend nothing went wrong” have been removed, a few may still remain. This is being addressed as part of the latest API server stability improvements but is not fully implemented at the time of this writing.
The entire application needs to be updated to not only add the granular error messages, but to also catch any rogue errors that happen outside of explicit try/catch logic within the application. This requires updates to multiple modules in the backend as well as to the global data routing architecture of the REST API controller.
Modules impacted: Most/all modules at some level, the most-used modules have already been refactored to address this issue.
Remediation: Remove all “ignore and continue” implementations, replace generic “do everything/report general error” implementations with granular try/catch with specific actionable reporting, update data routing requests to “bubble up” uncaught errors and return them before continuing processing.
Impact: Random error messages on the front end application, invalid data entries in some cases, untraceable application crashes on the API server.
Customer satisfaction impact: Moderate
Risk of downtime: Moderate
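A minimal sketch of the granular error handling described in the remediation, using an Express-style route plus a catch-all error middleware; the route, helpers, and messages are illustrative, not the existing controller code.

```typescript
import express, { NextFunction, Request, Response } from "express";

const app = express();
app.use(express.json());

// Hypothetical stand-ins for the real Objection-based data access helpers.
async function loadRecord(id: string): Promise<{ id: string } | null> {
  return { id };
}
async function updateRecord(record: { id: string }, patch: unknown): Promise<void> {
  /* persist changes */
}

// Granular: each failure point reports a specific, actionable message instead
// of one generic "try everything, throw a generic error" wrapper.
app.patch("/api/records/:id", async (req: Request, res: Response, next: NextFunction) => {
  try {
    const record = await loadRecord(req.params.id);
    if (!record) {
      return res.status(404).json({ message: `Record ${req.params.id} not found` });
    }

    try {
      await updateRecord(record, req.body);
    } catch (err) {
      // Stop processing and report the specific failure; never log-and-continue.
      return res.status(422).json({ message: `Could not update record ${req.params.id}` });
    }

    res.json({ ok: true });
  } catch (err) {
    next(err); // bubble anything unexpected to the global handler below
  }
});

// Catch-all for rogue errors thrown outside explicit try/catch logic.
app.use((err: Error, _req: Request, res: Response, _next: NextFunction) => {
  console.error(err); // central logging instead of hidden messages
  res.status(500).json({ message: "Unexpected server error" });
});
```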
Remaining inconsistent API response controls
Consistent API response controls are necessary to reduce code complexity on both the front end and back end. They also ensure that message handling, including errors like those above, is consistent. Getting the entire application to use a standard and shared HTTP response library is the first step to building over-arching code modules that can manage all data I/O requests between the front end and backend. This significantly reduces code complexity and facilitates better message handling and user interactions. The resulting application will be far less fragile as the number of custom data I/O components on the front end and back end are reduced. It also means using HTTP standard communication protocols in place of many of the “TP custom” protocols built into the application, such as checking the data payload for a status code versus the HTTP standard response header.
Remediation: Replace all custom HTTP responses with the already-existing base response class send* responses on the backend; this will trigger some updates on the front end where custom “data status” controls remain.
Impact: Better message handling, smaller codebase, ability to implement standard shared classes on the front end and back end for data processing.
Customer satisfaction impact: Moderate
Risk of downtime: Low
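The base response class with send* helpers already exists in the codebase; as a hedged illustration of the pattern only (not the actual class), a standardized pair of helpers might look like this, with the status carried in the HTTP response rather than buried in the payload.

```typescript
import type { Response } from "express";

// Success and failure travel through the HTTP status line and one consistent
// body shape, so front end data handling can be shared across modules.
export function sendOk<T>(res: Response, data: T, status = 200): Response {
  return res.status(status).json({ data });
}

export function sendError(res: Response, message: string, status = 400): Response {
  return res.status(status).json({ error: { message } });
}

// Usage in a controller:
//   return sendOk(res, rows);
//   return sendError(res, "Sales order not found", 404);
```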
Poorly constructed SQL queries
The REST API server code is built on top of the Knex and Objection libraries that are used to provide the interface to the PostgreSQL database. Many of the original data models that are used to generate the SQL queries are inconsistently implemented. Some do not follow best practices as the underlying data structures are not properly architected. Most of the original architecture opted to use brute-force raw queries or poorly constructed Objection-based queries to link data sets together or to provide filters.
Ongoing refactoring has replaced many of those queries by utilizing the tools provided by Knex and Objection to build better relationship mapping, filters, and automated data handler methods that live closer to the data definition. This provides a consistent data interaction and tends to produce better auto-generated SQL queries. Even though progress has been made there are many places where raw SQL queries are implemented in ways that do not allow the database engine to optimize the query.
In addition to some poorly formed raw SQL queries, many of the data queries (ALL of them originally) use a “graph fetch” model versus a “graph join” model when the graph join is a better option. Often data filters and ordering perform better with a join when a single query is built. The fetch model, on the other hand, builds multiple data queries on the JavaScript side and links those disparate data sets together in an “after the SQL engine has returned results” fashion. This can increase data connections and memory consumption on the middleware server. That is not necessarily a bad thing, but very little thought has gone into the decisions on WHICH method to use and why. Many queries will benefit from replacing fetch-based methods with join methods.
Better use of proper relationship definitions and filters is also necessary across a vast majority of the existing models.
Remediation: Replace raw SQL queries with well defined relationships, models and filters when appropriate. Analyze fetch vs. joined relationship query execution to determine if a join is a better fit, especially when filters and sorting are being utilized. Some of this will require better data structures as specified elsewhere in this review.
Impact: Slower performance of data queries, complex data queries leading to errors. Both can impact server load increasing memory consumption and response times.
Customer satisfaction impact: Moderate
Risk of downtime: Low
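A small sketch of the fetch-versus-join distinction in Objection; the models, relation, and columns are hypothetical and assume Model.knex() has already been bound to the application's Knex instance.

```typescript
import { Model } from "objection";

// Hypothetical models; the real OmniBlocks models and relations differ.
class BatchLine extends Model {
  static tableName = "batch_lines";
}

class Batch extends Model {
  static tableName = "batches";
  static relationMappings = {
    lines: {
      relation: Model.HasManyRelation,
      modelClass: BatchLine,
      join: { from: "batches.id", to: "batch_lines.batch_id" },
    },
  };
}

export async function compareGraphStrategies() {
  // Graph fetch: one query per relation, stitched together in Node. Filters on
  // the child rows cannot help PostgreSQL prune or order the parent rows.
  const fetched = await Batch.query()
    .withGraphFetched("lines")
    .modifyGraph("lines", (qb) => qb.where("status", "released"));

  // Graph join: a single SQL join, so filtering and ordering on the relation
  // happens inside PostgreSQL where indexes can be used.
  const joined = await Batch.query()
    .withGraphJoined("lines")
    .where("lines.status", "released")
    .orderBy("lines.created_at");

  return { fetched, joined };
}
```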
Non-performant algorithms / Overly complex architecture
Multiple modules in the API controller and processor are non-performant. Some of the modules are original legacy architecture designed by TP, others were built by our outsourced agency that copied TP design patterns despite being told repeatedly to NOT use those designs as a template, and some were newly created to meet tight timelines when developing new feature requests.
Form Builder / MMR / Production “HoFSM”
Known areas of non-performance and potential fragility include, but are not limited to, the Form Builder/MMR/Production architecture revolving around the monolithic “entire data record in a field” value field. We have assigned the internal code name “Heart of the Flying Spaghetti Monster” (HoFSM) to this field and all logic that surrounds it. This may be the single most impactful architectural element of the entire application. Supporting it required extremely complex code on the front end and back end. This design element has a notable impact on performance and code complexity.
Virtually every element of the Form Builder, MMR module, and Production module is impacted by this design. Thousands of lines of code were originally created to support this configuration. Months of effort have been spent mitigating the effect of this design through multiple inline refactors. An entire month was dedicated in Q2’22 to remedy one performance-and-stability vector related to formulations and the constituent math that had a direct impact on inventory.
This “feature” alone accounts for several months of refactoring effort on the development schedule. Only phase 1 of 3 month-long phases has been completed.
The refactoring of this part of the application is extremely complex due to the necessity of supporting all legacy data, including completed production data from months of real world customer data created on this architecture. Instructions were provided to the tech team that we must keep ALL data from January 2020 forward. This requirement to support legacy data is an additional reason to work on a full rewrite while archiving the legacy app and database for reference purposes only.
Modules impacted: Form Builder, MMR, Production. Indirectly, Inventory and Formulas.
Remediation: Multiple migrations to extract key data from this massive monolithic JSON field into easily actionable independent data points. This includes several new data columns to properly support the various states of data for form builder templates, MMRs, and batch records. Additional data tables need to be created to continue the refactoring and properly normalize the data field. Controllers and processors for batches and form builder templates need to be rewritten to support a proper architecture represented by these changes.
Impact: Slow response times through several key functional areas of MMR and especially batch production, inconsistent results with data processing and “inventory math” due to issues buried in the overly complex front end and back end codebase, code fragility, difficulty managing the code, notable performance impacts on the front end due to massive data volume going to/from the API server, inability to extend and refine the data processes and algorithms without breaking backwards compatibility.
Customer satisfaction impact: High
Risk of downtime: High
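To make the phased approach concrete, here is a hedged sketch of a Knex migration that lifts individual data points out of the monolithic JSON field into real, indexable columns. The table, column, and JSON key names are hypothetical and assume the field is stored as json/jsonb.

```typescript
import type { Knex } from "knex";

// Phase-style migration: add real columns, backfill them from the monolithic
// JSON "value" field, then index them. Later phases move repeated structures
// into their own normalized tables.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable("batch_records", (table) => {
    table.string("status");          // previously buried at value->>'status'
    table.timestamp("completed_at"); // previously value->>'completedAt'
  });

  // Backfill so legacy records (January 2020 forward) remain queryable.
  await knex.raw(`
    UPDATE batch_records
       SET status       = value->>'status',
           completed_at = NULLIF(value->>'completedAt', '')::timestamptz
  `);

  await knex.schema.alterTable("batch_records", (table) => {
    table.index(["status"]);
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable("batch_records", (table) => {
    table.dropIndex(["status"]);
    table.dropColumn("status");
    table.dropColumn("completed_at");
  });
}
```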
Sales Order Module Complexity
The Sales Order module was significantly reworked and extended to add new features demanded by the customer base in late 2021. The nature of the orders module features that were requested meant working with the poor HoFSM architecture noted above. This led to multiple design decisions that were not best-of-class, as compatibility with the existing poorly-written controllers and processors for the production data modules needed to be maintained.
On top of the decisions to support the older architecture, the new data structures, related data models, and data processing logic were created on a shortened timeframe. A full specification and peer review of the new features was not employed given the tight timeline and limited resources. As such several poor decisions were made regarding the data architecture and SQL query design needed to support new materials status indicators requested for the front end applications.
Some of these designs were mitigated in June 2022 with a minor refactor after a server failure analysis identified the overly complex CASE/WHEN + COALESCE SQL query as a significant performance problem for the API middleware and PostgreSQL data queries. Data indexes were added to multiple segments of an extremely complex dozen-level-deep SQL query and subquery structure; this improved performance significantly in some cases. The SQL COALESCE statement was simplified, which also assisted in improving performance.
However, the underlying architecture has been largely untouched given the “hot fix” nature of the request and the need to bring the Sales Order module back online on the front end “ASAP”. Only a few days were given to resolve that situation. The development team made it clear the fixes were only a band-aid and a major refactor over several weeks of effort was needed to truly address the underlying issues.
Remediation: Revise multiple existing data structures and corresponding models and processors for the Orders as well as the relations with customers, production, and inventory. Evaluate the relationships and the data queries to create supporting meta data tables along with their models, processors, and controllers. Implement data migrations to reformat the data as needed and populate meta tables.
Impact: Instability in the sales order module, high memory consumption, notable impact on RDS data queries and performance, potential to overrun the middleware server
Customer satisfaction impact: High
Risk of downtime: High
Unusable QC Module
This module has been missing from the application since the beginning. It is required for proper FDA compliance. It is also critical for allowing the OmniBlocks® MRP system to be able to drive supply chain data to the blockchain, a key tenet of the RBT supply chain & validation objective.
TP architected an extremely complex and overly-normalized set of modules to support QC work on the backend. None of this was properly connected to the front end. In addition, these data structures are overly specific to the data tracking and recording required, going to the exact opposite extreme of the monolithic “entire database in a single field” design of the MMR/Production module.
The TP design reflects the “words and phrases” related to a proper QC Specification and Test Results system as required by RBT for the OmniBlocks application. Unfortunately NONE of the work that was done is usable in a functional system.
A new QC module has been started as this has been deemed a “critical feature” that is missing from the OmniBlocks application. Work was started but only partially completed, with 2 weeks of development work performed on what is a 6 week endeavor to get the “basic functionality” in place.
This is a new module and thus does not impact current systems.
Data Architecture
A solid database architecture is the key to a performant scalable application. The data architecture for a proper MRP or ERP system is extensive and OmniBlocks® is no exception. Unfortunately much of the data architecture remains based on the original development design which would be ranked by most data analysts as poor-to-fair.
This architecture remains largely in place throughout the application. The data architecture decisions are tightly tied to data models which drive the data processors, where the complexity of queries is ultimately reflected in the REST API endpoint controllers. In addition, the legacy architecture has tightly tied the front end data rendering to the underlying data model and data architecture on the back end; an unusual and highly fragile design. As such even minor “corrections” to data structures result in a much bigger impact on development resources than usual; often an architectural fix on the data structure results in a half-dozen code modules being reworked on the back end and front end of the application.
While several new data structures have been introduced to support new modules, most of the existing application relies on the original data architecture inherited from TP. The legacy architecture has been extended and partly refactored based on bug reports and feature requests; however, this effort has been done in an ad-hoc manner versus a concerted effort to remedy notable data structure issues. As such much of the legacy data architecture remains along with the corresponding technical debt in related algorithms.
Lack of data normalization in many tables
Data normalization typically results in a mix of data structures that prevents most data from being stored in multiple tables — the data equivalent of “don’t repeat yourself” (DRY) coding standards. Consideration is given to multiple factors when normalizing a data set such as “proximity of data” and the time/resource cost of fetching data through relations. A data architect should analyze the use cases and data structures to ensure “copies of data” exist only where there are substantial performance benefits, and that the risk of “data sync” issues is conveyed clearly to the dev team so data is updated at all points when the “proximity performance” option has been put into place.
The existing architecture has substantial “data copies” that are not always updated properly. In many cases the data copy should be replaced with a relationship pointer, and in most cases the performance would improve. This is an unusual result, as data copies are typically kept for performance; however, this architecture doubles down on bad decisions by further storing those copies in complex JSON fields. As a result a proper normalization of the database combined with proper table relations will not only increase stability, reduce or eliminate data inconsistency, and reduce code complexity (for the handful of modules that properly update all data copies), but also improve performance — which is rare.
The current architecture has some data normalization that works well for the application; however, there are many cases where further normalization is required. To do this properly, data migrations need to be written to extract data elements buried in existing fields. The extracted data needs to be used to generate new data columns and in some cases entire new joiner tables and metadata tables.
Remediation: Creation of a full application data map, analysis of data flow and use cases, review of application performance and data query I/O, building data migrations to move data toward a better normalized architecture, creation of joiner tables, and updating data models and processors to reflect the new structure.
Impact: Lower data performance, data inconsistencies, additional backend code complexity
Customer satisfaction impact: Moderate
Risk of downtime: Low
Poor data relationship architecture
Some of the original data architecture connects data tables using non-indexed string fields. This is non-performant at the base SQL layer. It also leads to inconsistent and unpredictable data relations in some cases where the linkage is made through a data field that can be changed by users. Best practice for SQL relationships is to set all primary relationships to work through the data-engine-assigned unique ID fields; this mechanism has significant performance and stability advantages and is well documented. Some of these relationships have been reworked, but a few remain in less-used modules.
Remediation: Creation of a full application data map, review of data query structures and SQL statements, migrating non-primary-key relationships to primary key when appropriate, updating related data models, indexing relationship keys.
Impact: Lower data performance, additional backend code complexity
Customer satisfaction impact: Moderate
Risk of downtime: Low
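A hedged sketch of the relationship remediation: replace a user-editable string link with an indexed foreign key on the engine-assigned id. Table and column names are hypothetical.

```typescript
import type { Knex } from "knex";

export async function up(knex: Knex): Promise<void> {
  // Add an indexed foreign key on the engine-assigned primary key.
  await knex.schema.alterTable("orders", (table) => {
    table.integer("customer_id").references("id").inTable("customers").index();
  });

  // Backfill once from the legacy string field, then stop relying on it.
  await knex.raw(`
    UPDATE orders o
       SET customer_id = c.id
      FROM customers c
     WHERE c.name = o.customer_name
  `);
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable("orders", (table) => {
    table.dropColumn("customer_id");
  });
}
```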
Lack of proper indexes in many tables/queries
The original data architecture used very few indexes outside of the default primary key index automatically created by the underlying data libraries. Additional indexes have been created in an ad-hoc fashion based largely on system response time issues or user complaints when the front end experience has been subpar. A full data query analysis should be performed along with real-world index performance analysis to determine which indexes will bolster application performance.
A proper indexing strategy ensures data updates remain performant while significantly improving data query performance. Indexing data sets properly can be the easiest way to bolster application performance.
The OmniBlocks® application has seen several data queries improve response times 10-20x with the right indexes in place.
Remediation: Mapping of data queries throughout the application, creation of use cases and test case simulation of data queries, full explain/analyze of SQL queries and the potential impact of indexes, data migrations to attach indexes to existing tables.
Impact: Lower data performance — sometimes significantly
Customer satisfaction impact: Moderate
Risk of downtime: Low
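As an illustration of the analyze-then-index workflow, the sketch below captures a real query plan before adding a covering index; the table, columns, and query are hypothetical.

```typescript
import knexFactory from "knex";

const knex = knexFactory({
  client: "pg",
  connection: process.env.DATABASE_URL ?? "postgres://localhost/dev", // placeholder
});

export async function reviewAuditTrailQuery(tenantId: number) {
  // 1. Capture the actual plan for a slow, frequently-run query.
  const plan = await knex.raw(
    `EXPLAIN ANALYZE
     SELECT * FROM audit_trail
      WHERE tenant_id = ?
      ORDER BY created_at DESC
      LIMIT 50`,
    [tenantId]
  );
  console.log(plan.rows); // look for sequential scans and sort costs

  // 2. If the plan shows a sequential scan, add a covering index and re-check.
  await knex.schema.alterTable("audit_trail", (table) => {
    table.index(["tenant_id", "created_at"]);
  });
}
```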
Incorrect implementation and overuse of JSON field types
A common issue with the inherited technical debt is the original architect’s overuse of JSON data stores. These single fields are used in multiple places in the application and are the primary design flaw behind months of refactoring related to the “Heart of the Flying Spaghetti Monster” (HoFSM) issue noted several times in this document. Essentially the original design has stored entire data records or even full data sets within a single encoded text field (JSON) within the database.
While PostgreSQL has advanced features that do a great job manipulating and querying JSON fields, there are several key requirements that have not been implemented properly. As such these well-performing functions are largely unavailable to better manage the data. Often the data in these fields is stored in a semi-random array of data entries, eliminating the ability to utilize a vast majority of the “process by key” JSON functions provided by the data engine. That design decision alone drove hundreds of lines of excess code on the front end and back end applications in order to support the unstructured “entire data set in a single field” design.
The determination that all legacy data needed to be supported while refactoring this data design has meant extensive work writing data migrations to start normalizing these unstructured JSON fields. Entire data models, processors, and controllers have to be rewritten for every phase of JSON field decomposition. While there are ways to mitigate this issue, none of the current in-place refactoring options are quick or easy to implement.
Remediation: Creating data migrations to distill key data out of the “data set in a field” JSON fields, updating data models to reflect the new “distilled data structures” along with the controllers and processors to support them.
Impact: Lower data performance — sometimes significantly, overly complex code implementation on the front end and back end increasing memory consumption, code maintenance issues, and code performance.
Customer satisfaction impact: Moderate
Risk of downtime: Low
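As a small illustration of why keyed JSON matters: when the field is keyed jsonb rather than a semi-random array, PostgreSQL can filter on an extracted key, and an expression index can keep that lookup fast while normalization proceeds. Table, column, and key names are hypothetical.

```typescript
import knexFactory from "knex";

const knex = knexFactory({
  client: "pg",
  connection: process.env.DATABASE_URL ?? "postgres://localhost/dev", // placeholder
});

export async function releasedBatchRecords() {
  // Keyed access works only when the JSON is structured by key (and stored as
  // json/jsonb, or cast with ::jsonb if it is a plain text column).
  return knex("batch_records").whereRaw(`value->>'status' = ?`, ["released"]);
}

export async function addStatusExpressionIndex() {
  // An expression index on the extracted key keeps the lookup above fast
  // while the longer-term normalization work proceeds.
  await knex.raw(`
    CREATE INDEX IF NOT EXISTS batch_records_value_status_idx
      ON batch_records ((value->>'status'))
  `);
}
```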
Use of SQL tables vs. NoSQL or Data stores
The entire data structure for the OmniBlocks® application is built using a single PostgreSQL relational database. Several key components of the application would be better served using persistent data stores outside of the main pgSQL relational database. Two primary examples are the current Audit Trail table and the yet-to-be-developed “blockchain candidates”.
The Audit Trail is currently built as a single table to which every single notable data event is logged. In fact the current implementation is still missing 25-30% of the data log entries that should be tracked for better FDA compliance and history tracking. This table is already massive, quickly becoming the largest table in the SQL database, and will continue to grow at an increasing rate. In the current architecture, this table is a prime candidate for architecture refactoring to either create views or split the single massive audit trail into several separate operation-specific or even table-specific audit trails. However a better solution would be to consider other data store solutions such as NoSQL options or large data stores such as Hadoop data lakes with processing to generate data warehouses.
The “blockchain candidates” is a new feature that requires extraction of key materials and production data points that support supply chain validation and authenticity checks. While much of the data that is needed exists in the PostgreSQL database there will be a need to create data sets that live outside of the traditional relational database to facilitate blockchain logging. This can be done using various NoSQL or data lake technologies while relegating the data parsing and blockchain entry management to blockchain specific tools.
Remediation: Creating custom data migration tools to replace the SQL audit trail tables with NoSQL or data lake solutions, writing new API interfaces for the new data store, writing data store controllers and business intelligence tools for the new data set. Blockchain data sets and tools have yet to be developed. In the meantime, starting to plan and create data views will be an important interim step.
Impact: Potential notable increases in data storage costs with larger RDS servers and data stores needed on a regular basis; over time audit trail data queries will slow significantly. Not addressing this issue in the next 12 months or less has the potential for significant performance impacts in the future.
Customer satisfaction impact: Low (for now) increasing over time.
Risk of downtime: Low
Server Architecture
Incorrect Redis Configuration
The Redis application was installed and configured by TP. The system images and configuration have not been inspected at a deep level as the Redis service was running, not generating errors or logs in the main server log or application log, and appeared to be answering data queries correctly.
The July 11th 2022 failure and subsequent analysis uncovered several primary issues with the setup and configuration of ALL API middleware servers. Redis recommends several key features be implemented for all production systems, none of which had been implemented properly. These shortcomings were found when the protected redis-server-log file was discovered.
Overcommit memory was not enabled. This is a feature Redis recommends for all production servers. It impacts how Redis behaves when the memory store is full. Without this setting enabled it can lead to inconsistencies, including login key/value pairs for active sessions not getting created. This directly impacts the user authentication module in the API codebase and can lead to an inconsistent login experience. This has been remedied (July 11th 2022).
Huge Pages was not disabled. Redis is not designed to read large memory pages by default and prefers to operate within a 32-bit address space. The Ubuntu servers on which the API middleware runs were never configured to disable the Huge Pages setting, which is flagged as a potential problem area for Redis. The impact is unknown, but the incorrect configuration has been reported by Redis on every startup of the service. This has been remedied (July 11th 2022).
Despite custom configuring several key parameters in the Redis configuration file on the production server, one key setting was missed: the maximum memory limit, one of the most important settings on a Redis server. When this is not set, the Redis application will try to consume any available memory on the server. If the underlying API code happens to start consuming excess memory, Redis will consume all the memory that remains. This will hang the server, making it completely non-responsive. Since Redis AND the API codebase run on the same server (a questionable decision, but not necessarily incorrect), the entire REST API server stops responding. The end result is the entire OmniBlocks® application becomes non-responsive. Adding the memory limit will cause Redis writes to fail when the limit is reached, but not crash the server. Instead it will serve read-only requests, which MAY lead to excess data queries to the persistent data store, but should allow the API service to continue running and serving requests.
Remediation: Update the operating system configuration to meet Redis requirements (done), update the Redis configuration for the memory constraint (in progress July 12th 2022), and review and revise the Redis requests to ensure consistency and stability.
Recommendations: Bring in a full time IT/sysadmin with cloud experience to review all servers and configuration settings as well as assist with migrating Redis to an AWS memcache service.
Impact: Non-responsive API server, slow API requests, inconsistent authentication results
Customer satisfaction impact: High
Risk of downtime: High
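The permanent fix for the missing limit belongs in redis.conf, but as a hedged illustration, a startup guard in the API codebase could verify and apply a bound so a full cache degrades gracefully instead of starving the co-located API process. The 512mb limit and eviction policy are illustrative values, not settings taken from the production server.

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379");

export async function ensureRedisMemoryLimit(): Promise<void> {
  // CONFIG GET returns pairs like ["maxmemory", "0"]; "0" means unbounded.
  const reply = (await redis.call("CONFIG", "GET", "maxmemory")) as string[];
  const current = reply[1];

  if (current === "0") {
    // Illustrative values only; the permanent setting belongs in redis.conf.
    await redis.call("CONFIG", "SET", "maxmemory", "512mb");
    await redis.call("CONFIG", "SET", "maxmemory-policy", "allkeys-lru");
  }
}
```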
Middleware (API) Server Scalability
Redis needs to be moved OFF SERVER, and not reside on the middleware server where the API codebase lives. In its current on-box configuration it is not a shared memory cache and thus is NOT viable in a horizontally scalable cluster architecture. It also competes with the API codebase for middleware server memory.
Remediation: Create a new cloud-based memcache/Redis installation and update the API codebase to work with the new server.
Impact: Potential critical downtime should the single server go offline.
Customer satisfaction impact: High
Risk of downtime: High
Systems Monitoring
Routine system monitoring needs to be put in place. All servers and systems need to be monitored on a regular basis for potential issues. Resource metrics such as CPU, memory, disk space and network I/O need to be tracked. Log files need to be monitored for potential service compromise attempts, resource restrictions, application failures, and system-level failures. This needs to be a cross-discipline effort that involves classic Linux server analysis, cloud service monitoring, and general service monitoring. Most of the IT stack is largely unmonitored outside of high water mark alarms set across all services. Nothing is in place on the classic EC2 based Linux servers for lower-level notifications such as OS level resource flags, which are not handled well by the AWS fabric layers.
Remediation: Hiring full time IT or sysadmin staff to routinely monitor and update servers. Adding custom tools to reduce the workload and provide a cross-service dashboard report.
Impact: Potential for uncaught security issues or intrusion, reduced stability, lack of insight into resource issues.
Customer satisfaction impact: Low-to-high
Risk of downtime: Low
Backup and Recovery Services
Multiple services now have routine backup and recovery, however several key sites are missing any form of routine backup service.
The database service is running AWS RDS on multizone replication with nightly backups and 1-minute point-in-time recovery available. The database may benefit from an SQL specific backup/recovery service but this is non-critical.
The OmniBlocks and TruInsights front end apps are single page applications that can be easily recovered from the git repositories. The Amplify service makes for quick recovery of the services or for new services to be started in the case of catastrophic failure.
The OmniBlocks middleware app is also easily replicated, albeit with a 15 minute downtime window, HOWEVER the single-server EC2 instance does not have routine snapshots or backups. Currently these are created manually to provide a snapshot of the OS and accompanying software stack which would be used to create a new EC2 server in the case of catastrophic failure. This can be a slow process taking up to an hour to fully recover in the case of a major failure. The move to a scalable cluster will mitigate much of this and will directly impact the backup/recovery service.
All of the web presence via WordPress is currently hosted on Lightsail. These instances have ZERO backup and recovery inherent in the IT stack. These sites should all be running a WordPress specific backup such as the JetPack backup service.
Remediation: Setup routine snapshots on the EC2 server for the middleware OB app, subscribe to Jetpack with backup services for the RBT website, the OB docs website, and the TruInsights website.
Impact: Potential loss of marketing and documentation content if the WordPress servers fail, extended recovery time in the case of a catastrophic OB middleware server failure.
Customer satisfaction impact: Low-to-high
Risk of downtime: Moderate
PostgreSQL Database Update
The PostgreSQL database is running an outdated version (11) of the database engine. This limits some of the SQL features that are available to the R&D team. It also limits the options for upgrading the database service to more scalable and better-fitting options available from AWS cloud services; for example, moving to an Aurora PostgreSQL serverless configuration is likely a great fit for the OmniBlocks application with its routine spikes in workload followed by significant dormant periods on nights & weekends.
Moving to an Aurora PostgreSQL Serverless configuration also means far less IT and administrative overhead. The serverless option provides automatic resource scaling without the need for manual configuration that is required with the current legacy PostgreSQL v11 database server.
Remediation: Backup the database server, upgrade to the latest version of PgSQL, and convert to an Aurora PostgreSQL Compatible server on a serverless configuration
Impact: More complex SQL queries with lack of modern SQL functions, more IT staff and time requirements to manually manage the existing RDS database cluster.
Customer satisfaction impact: Low
Risk of downtime: Low
Security
Cloud Server Access
Some of the legacy security configuration remains intact. The original team hard-coded several AWS/IAM credentials within the application. New non-personnel-based IAM keys with the proper roles & permissions need to be put in place to replace those keys with something that can be better restricted. The keys are virtually inaccessible to the original team; however, there is a risk that they have the keys cached or stored in a backup that could be used to gain access through a yet-unknown attack vector.
The EC2 middleware servers still have their original SSH keys installed and active. Those keys are used by the current dev & sysadmin staff to access the dev/staging/production EC2 middleware server. Those should be replaced.
The entire VPC and related security groups and rules need to be reviewed and tightened. While many rules have been improved (virtually all access was wide open 18 months ago), the access controls and subnets can benefit from much tighter access rules.
MFA should be added to ALL cloud based access including general Amazon logins, all WordPress sites, and even the OmniBlocks application itself. Currently only Lance’s AWS account has been configured with MFA.
Remediation: A full security assessment across all services, password updates for all username/password access, replacing all SSH keys on all servers and services (git repos too), and a full review and update of AWS security groups and access rules.
Impact: Potential for security breach from third parties, including the original development team or past contractors and employees.
Customer satisfaction impact: Low
Risk of downtime: Low
OmniBlocks Application
The main OmniBlocks application has the most rudimentary of username/password security protection in place. It should be improved dramatically on nearly every potential attack vector. MFA should be required. Password strength and replacement rules should be added. Forcing users to change weak passwords should be required. A lot more attention should be placed on basic security of customer access.
There is notable resistance to making the security more advanced, even for things like more complex passwords. Users are sharing tablets on the production floor and routinely have to log out and log in on different devices. The customer base is requesting biometric login or RFID/NFC login options. This reduces security.
To address the need for better security and easier login there will need to be a concerted effort from R&D to develop a viable solution. This may involve writing a custom mobile application which can delay implementing stronger security by months.
Areas of concern:
- Single username/password system
- Sent in clear text
- No MFA in place
- No password rotation or strength testing
- Users still have basic passwords in place
- Need to add bioidentification logins
Remediation: A full set of application updates in OmniBlocks related to the username/password system including password strength tests, password longevity checks, removing clear text password transmission, adding MFA support, reviewing web-based bioidentification options.
Impact: Potential for theft of critical customer data or customer data tampering by malicious actors.
Customer satisfaction impact: Potentially high
Risk of downtime: Low
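As one hedged example of the “password strength tests” item in the remediation, a server-side check using the zxcvbn strength estimator is sketched below; the length threshold, score cutoff, and the library choice itself are assumptions, not an existing OmniBlocks dependency.

```typescript
import zxcvbn from "zxcvbn";

// Reject weak passwords at registration/reset time. zxcvbn scores 0-4;
// requiring 3+ blocks dictionary words and trivial variations, which is the
// brute force / dictionary attack vector called out above.
export function validatePasswordStrength(
  password: string,
  userInputs: string[] = [] // e.g. email and username, so they can't be reused
): string | null {
  if (password.length < 12) {
    return "Password must be at least 12 characters long";
  }
  const result = zxcvbn(password, userInputs);
  if (result.score < 3) {
    return result.feedback.warning || "Password is too easy to guess";
  }
  return null; // null means the password is acceptable
}
```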
No Firewalls
The entire AWS stack is mostly devoid of firewalls or basic protection from routine attacks. Web application firewall services need to be employed on top of the basic network security noted above. Currently only a handful of network access rules are on the web app stack to prevent access from countries such as China, Ukraine, and Russia; however, these rules are easily bypassed. An intelligent firewall is necessary.
Remediation: Proper Web Application Firewalls should be put in place on all public facing web services. With the current customer base, most foreign access points can be disabled en masse.
Impact: A DDoS attack can bring the entire service offline in seconds. Leaving the application “wide open” also means more potential brute force attacks, which increases the chances of a dictionary password attack against the already-weak username/password system for OmniBlocks.
Customer satisfaction impact: Low
Risk of downtime: Moderate