Sullexis helped an ML startup fight the COVID-19 pandemic by rearchitecting its data engineering processes so data scientists could build outbreak prediction models, helping the US government deploy relief and assistance where and when it was needed.
The Challenge
An ML startup had developed a cloud-based ML platform for developing, training, and deploying ML models. During the height of the COVID-19 pandemic, the startup was approached by the federal government to help predict where limited medical and support resources should be deployed. Properly training a model to predict infection rates geographically meant utilizing dozens of disparate data sources from state, local, and federal departments, as well as economic and supply chain data. None of this data followed a standard reporting format; it arrived in varying shapes from multiple governmental bodies and required heavy cleansing and transformation before it was useful for model building.
The client had developed a proprietary storage platform to efficiently store, index, and expose data for research at petabyte scale. At this point, however, the startup had no unified approach to loading data into the platform, and with the massive influx of disparate data sources, it needed a more generalized approach for efficiently loading data at scale.
The Solution
Sullexis was called in to bring its engineering expertise, its ability to quickly onboard and evaluate existing technology stacks, and its prior MLOps experience to the creation of a fully integrated data ingestion framework.
Key components of the solution included:
- An enterprise data lake, utilizing Hadoop for file and data storage of incoming sources
- A NoSQL database for storing alert-tracking and data-management metadata (see the sketch after this list)
- Apache NiFi, a data flow tool, for generalized data loading processes capable of processing terabytes of data
- A JavaScript front end for administration, management, and monitoring
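The case study does not specify which NoSQL product or schema the team used, so the following Python sketch is purely illustrative: it assumes a MongoDB document store, and the connection string, collection name, and every field name are invented to show the kind of per-ingestion metadata record an alert-tracking design like this typically maintains.

```python
# Illustrative only: the case study does not name the NoSQL product or schema.
# This sketch assumes a MongoDB document store and invented field names.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
db = client["ingestion_metadata"]                  # hypothetical database name

record = {
    "source": "state_health_dept_daily_cases",  # hypothetical source identifier
    "format": "csv",                            # flat file, archive, JSON, etc.
    "hdfs_path": "/datalake/raw/2020-04-01/cases.csv",  # landing spot in the data lake
    "received_at": datetime.now(timezone.utc),
    "status": "loaded",                         # e.g. received / loaded / failed
    "alerts": [],                               # populated by the data inspector
}

# One document per ingestion run gives the front end a single place to query
# for monitoring and alert history.
db.ingestion_runs.insert_one(record)
```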
The Result
The startup was able to effectively leverage the massive processing power of its ML platform, confident that the data being provided had been loaded correctly and on time. The project delivered:
- A standardized ingestion platform capable of receiving flat files, compressed archives, JSON, and other popular data formats
- An automated data inspector utilizing custom-developed algorithms to alert on suspicious or inconsistent formatting (a minimal sketch follows this list)
- A front end with self-service capabilities for data scientists to load additional third-party data sources without aid from data engineering
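The inspection algorithms themselves are proprietary and not described in this case study, but a minimal Python sketch can illustrate the general idea behind such a check: validating an incoming CSV's header against an expected schema and flagging columns with suspiciously high empty-value rates. The column names and threshold below are assumptions for illustration, not the startup's actual rules.

```python
# Illustrative only: the actual inspection algorithms are not described in the
# case study. This shows one plausible check of the kind mentioned: comparing
# an incoming CSV's header and null rates against an expected profile.
import csv
from pathlib import Path

EXPECTED_COLUMNS = ["state", "county", "date", "new_cases"]  # hypothetical schema
MAX_NULL_RATE = 0.05  # hypothetical threshold for "suspicious" emptiness

def inspect(path: Path) -> list[str]:
    """Return a list of human-readable alerts; an empty list means the file looks clean."""
    alerts = []
    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            alerts.append(f"unexpected header: {reader.fieldnames}")
            return alerts  # no point checking rows against an unknown layout
        rows = list(reader)
    if not rows:
        alerts.append("file contains a header but no data rows")
        return alerts
    for col in EXPECTED_COLUMNS:
        null_rate = sum(1 for r in rows if not r[col].strip()) / len(rows)
        if null_rate > MAX_NULL_RATE:
            alerts.append(f"column '{col}' is {null_rate:.0%} empty")
    return alerts
```

In a design like the one described, alerts produced by such a check would plausibly be recorded against the ingestion run's metadata record in the NoSQL store and surfaced through the monitoring front end.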