At Noonum, we are building a knowledge graph that captures important material relationships about companies and their themes, locations, people, and products. To handle both the volume of data we process and the number of relationships we extract and store in the knowledge graph, our machine learning (ML) pipeline must process new data in near real-time. To achieve this, we considered a few key requirements in our architecture:
- Massive data sets and very large ML models
- Need to regularly rerun models over data
- Data stores that have limits on how fast new data can be inserted
- Web application that is performant over a broad set of inferences
Leverage Big Data and Big ML Models Through Functions and Distribution
Following the guidance of clean-code software engineering, we developed an information and processing architecture for ML models that is functional in nature, which has made our system more performant, easier to manage, and reusable in several places. Our common design patterns:
- Data that enters a function is never mutated
- Each function performs a single core action
- Functions are named according to the action they perform
- Function returns are kept small to avoid memory issues in serverless systems
- Functions avoid reading external data sources
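These patterns can be sketched in a few lines of Python. The names and the entity heuristic below are purely illustrative assumptions, not our production code; the point is the shape: immutable input, one action, a small return.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: data that enters a function is never mutated
class Text:
    doc_id: str
    body: str

def extract_entities(text: Text) -> list[str]:
    """Single-purpose function named for the action it performs.

    Uses a naive capitalized-token heuristic purely for illustration.
    Returns only a small list of mentions, not the whole document,
    keeping serverless payloads and memory use light.
    """
    return [tok for tok in text.body.split() if tok[:1].isupper()]

doc = Text(doc_id="sec-001", body="Apple expands in Austin")
entities = extract_entities(doc)  # doc itself is left untouched
```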
We currently process up to 100K texts daily; a text may be a news article, a blog post, or a filing with the SEC. Several machine learning models, including foundational language models, run on each text: NLP models to extract entities, heuristics to link those entities to various entity or concept dictionaries, several relationship-extraction classifiers that both predict the type of a relationship and score it for materiality, a sentiment classifier, and additional heuristics that decide whether a relationship is kept for our graph. We use proven distributed-processing systems such as Apache Spark on Amazon Elastic MapReduce (EMR), connected by Amazon Kinesis streams, Lambda functions, and S3 buckets. By choosing mature technologies, we benefit from large online repositories of bug reports and workarounds, since newer cloud systems often lack adequate documentation beyond simple use cases. Along the way, we have learned a great deal about configuring and managing the cloud environment and distributed systems.
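The per-text stage chain can be sketched as a composition of small functions. Every function and concept ID below is an illustrative stand-in for a real model or heuristic, not our actual interfaces:

```python
from itertools import combinations

# Hypothetical stand-ins for the real models and heuristics in the chain.
CONCEPTS = {"Apple": "company/apple", "Austin": "location/austin"}

def find_mentions(text: str) -> list[str]:
    # Stand-in for NLP entity extraction: naive capitalized-token heuristic.
    return [t.strip(".,") for t in text.split() if t[:1].isupper()]

def link_mentions(mentions: list[str]) -> list[str]:
    # Dictionary linking: keep only mentions that resolve to a known concept.
    return [CONCEPTS[m] for m in mentions if m in CONCEPTS]

def classify_relationship(head: str, tail: str) -> dict:
    # Stand-in for the relationship-type and materiality classifiers.
    return {"head": head, "tail": tail, "type": "related_to", "materiality": 0.8}

def keep_relationship(rel: dict, threshold: float = 0.5) -> bool:
    # Final heuristic gate before a relationship enters the graph.
    return rel["materiality"] >= threshold

def run_pipeline(text: str) -> list[dict]:
    concepts = link_mentions(find_mentions(text))
    rels = [classify_relationship(h, t) for h, t in combinations(concepts, 2)]
    return [r for r in rels if keep_relationship(r)]
```

Because each stage is a pure function with a small return value, stages can be distributed independently across Spark executors or Lambda invocations.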
Enable Historical and Iterative Improvement Through Elasticity
When developing our system at Noonum, we often make key improvements to our machine learning models, heuristics, data quality, and data processing methods. In some cases, we deploy those changes to the production pipeline after testing in our development environment, impacting only current and future data. In other cases, the changes are significant enough that we want to improve all of our historical data, requiring us to reprocess a massive amount of it. Here we benefit from the AWS cloud and our scalable pipeline: by significantly increasing the cluster size and tuning key distributed-system parameters, we can reprocess data efficiently. Because we carefully balance the load of data flowing into and out of the pipeline, discussed more below, we can continue to ingest new data and capture outputs into the data stores while a reprocess runs. Achieving this scalability was not simple, however: it required us to revise our architecture several times and to deeply understand the capabilities and configurations of the pipeline and storage technologies.
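The interplay between cluster elasticity and store limits reduces to simple capacity arithmetic. This sketch, with entirely made-up numbers and a hypothetical function name, shows why scaling the cluster alone is not enough:

```python
def plan_backfill(total_docs: int, docs_per_hour_per_node: int,
                  nodes: int, store_limit_per_hour: int) -> dict:
    """Back-of-the-envelope backfill estimate (illustrative only).

    The cluster can be scaled up at will, but the downstream data
    stores cap the sustainable insert rate, so the effective rate
    is the minimum of the two.
    """
    cluster_rate = docs_per_hour_per_node * nodes
    effective_rate = min(cluster_rate, store_limit_per_hour)
    hours = -(-total_docs // effective_rate)  # ceiling division
    return {"effective_rate": effective_rate, "hours": hours}

# With 40 nodes the cluster could push 20K docs/hour, but a store
# limit of 15K/hour makes that the real ceiling: ~67 hours for 1M docs.
plan = plan_backfill(1_000_000, 500, 40, 15_000)
```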
Avoid Information Flow Bottlenecks with Queues and Notifications
As discussed above, in both the real-time and the historical cases, processing large amounts of data into and out of the pipeline can easily flood the pipeline and data stores past their limits, causing data loss or system downtime as memory or CPU limits are reached. For example, a common mistake when ingesting data is to send too much of it to the distribution framework at once; while the system runs, data waiting to be processed is lost to job timeouts or other system failures. Conversely, when a large-scale distributed job finishes and begins emitting results, the functions intended to insert those results into the various stores can easily exceed memory or time limits. We therefore use a combination of queuing systems to send data and notification systems to alert downstream systems or functions that data is ready.
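The pattern can be shown with the standard library alone. In production the queue and notification roles are played by managed services (e.g. a Kinesis or SQS stream and S3/SNS event notifications); these stand-ins only approximate that behavior:

```python
import queue

def produce(q: queue.Queue, records, timeout: float = 0.05) -> int:
    # Bounded put: a full queue blocks the producer (backpressure)
    # instead of silently dropping data past the system's limits.
    accepted = 0
    for rec in records:
        try:
            q.put(rec, timeout=timeout)
            accepted += 1
        except queue.Full:
            break  # caller can back off and retry instead of losing data
    return accepted

def drain_and_notify(q: queue.Queue, notify) -> list:
    # Consumer empties the queue, then fires a notification so
    # downstream functions know a batch of data is ready.
    batch = []
    while True:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    if batch:
        notify(len(batch))
    return batch
```

The key property is that the producer learns it is going too fast (and keeps the unsent records) rather than discovering the loss after a timeout.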
Achieve Performance via Inference Snapshots
With massive amounts of data feeding our graph, we must consider query and inference performance for web applications and batch analytic jobs. We use the Amazon Neptune property graph for our knowledge graph, storing the many predictions about entities and their relationships that we extract every minute. Because our inferences are time-sensitive (recent inferences are more valuable than older ones, and insights about companies, themes, and products change over time), we create time-interval snapshots that let us both avoid out-of-date inferences and isolate the relevant relationships for specific queries. Data that does not need to live in the graph, including very large time-series data used to capture graph relationships, is stored in PostgreSQL on RDS, whose capacity we have easily increased as our processed data volume has grown. Our knowledge graph is therefore stored across different databases, giving us a good representation of knowledge that is also performant for our ML pipeline and for the inferences serving customers.
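A time-interval snapshot is, at its core, a window filter over timestamped inferences. This minimal sketch assumes a record shape (`"rel"`, `"ts"`) that is purely illustrative, not our actual schema:

```python
from datetime import datetime, timedelta

def snapshot(inferences: list[dict], start: datetime, end: datetime) -> list[dict]:
    # Serve queries only the inferences whose timestamps fall inside
    # the requested window, so stale predictions never reach the caller.
    return [i for i in inferences if start <= i["ts"] < end]

now = datetime(2021, 6, 1)
inferences = [
    {"rel": "apple-austin", "ts": now - timedelta(days=2)},   # recent
    {"rel": "apple-iphone", "ts": now - timedelta(days=30)},  # out of date
]
recent = snapshot(inferences, now - timedelta(days=7), now)
```

Materializing these windows ahead of time, rather than filtering at query time, is what keeps graph queries fast as the volume of historical inferences grows.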