What data and methods are used to build a Financial Knowledge Graph?


At Noonum, the dynamic, evolving knowledge graph database we create represents point-in-time relationships between companies, stocks, and market entities. Many powerful inferences can be made directly from it, further inferences can be made from those inferences, and so on. That is the power of knowledge represented in a graph: one insightful connection leads to another, and exposing these connections surfaces new knowledge that can be leveraged in decision-making.

The power of a knowledge graph derives from the specialized and unique engineering and data science at its foundation. These processes include data mastering, data integration, countless machine learning models, quality control methods, and domain knowledge to define the types and schemas.  Some of these specific challenges can be solved through isolated, existing solutions, but Noonum’s proprietary process offers the flexibility to control and customize each component to maximize accuracy and efficiency.

A Graph of Things and Their Relationships is More Resilient

A knowledge graph structures data around entities, such as companies or securities, and classifies the relationships between them. Are they competitors? Or trading partners? A financial knowledge graph is a great way to represent knowledge, and also a more natural mechanism for querying it. Compared to a traditional database, a knowledge graph often aligns more closely with how humans think about knowledge, so queries tend to be simpler and easier to construct. We chose a knowledge graph at Noonum for its flexibility in representing and querying knowledge. To support the growing need to integrate thematic analysis about securities into various financial decision-making workflows, a knowledge graph offers a more natural and resilient data structure while also supporting more powerful, ad-hoc queries over the data.
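To make the querying point concrete, here is a minimal sketch of a two-hop question ("who competes with this company's trade partners?") over a tiny in-memory graph. The company names, the relationship labels, and the helper functions are all hypothetical and chosen for illustration; this is not Noonum's actual storage or query layer.

```python
from collections import defaultdict

# Hypothetical toy data: (source, relationship, target) triples.
TRIPLES = [
    ("AcmeCo", "competes_with", "BetaCorp"),
    ("AcmeCo", "trade_partner_with", "GammaLtd"),
    ("GammaLtd", "competes_with", "DeltaInc"),
    ("GammaLtd", "trade_partner_with", "BetaCorp"),
]


def build_index(triples):
    """Index triples by (source, relationship) for direct neighbor lookups."""
    index = defaultdict(set)
    for source, relation, target in triples:
        index[(source, relation)].add(target)
    return index


def neighbors(index, entity, relation):
    """Return the entities reachable from `entity` over one `relation` edge."""
    return index.get((entity, relation), set())


def competitors_of_partners(index, company):
    """Two-hop query: who competes with this company's trade partners?"""
    result = set()
    for partner in neighbors(index, company, "trade_partner_with"):
        result |= neighbors(index, partner, "competes_with")
    return result


index = build_index(TRIPLES)
print(sorted(competitors_of_partners(index, "AcmeCo")))  # ['DeltaInc']
```

The query reads as a walk along named relationships rather than a chain of table joins. Note that the sketch treats edges as directed; a symmetric relationship such as competition would need to be stored in both directions.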

To build a knowledge graph, you typically need two sources of data. First, you need data that defines the ‘structure’ of the graph: the types of nodes or entities, as well as the types of relationships between them. In our graph, we use the basic nodes of ‘company’, ‘security’, ‘location’, ‘person’, ‘product’, and others. We also define relationships between companies such as ‘competes with’, ‘trade partner with’, or ‘has legal issue with’. Relationships are defined amongst all nodes, and each node and relationship type can have various properties.

Second, you need data from which you extract the knowledge that ‘fills’ the graph. For example, we subscribe to various sources of financial market data that generate our security node entities, such as stocks and the companies that issue them. We also ingest disparate data sources that cover actively traded securities from various markets, current geopolitical entities and their relationships (Seattle is a city in Washington, which is a state in the United States of America), and the codes used to classify companies by sector. For example, Apple is part of the technology sector, but also belongs to the sub-sector of technology hardware. This data is structured and therefore easily loaded into a graph, but it does require substantial cleaning and quality control. For example, maintaining company names and hierarchies requires an understanding of company mergers and acquisitions.
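The split between ‘structure’ and ‘fill’ can be sketched in a few lines. The node types and the three company relationships below come from the text; the ‘issues’ and ‘located in’ relationships, the property names, and the `Graph` class itself are assumptions made for illustration, not Noonum's schema.

```python
from dataclasses import dataclass, field

# Structure: node types named in the article.
NODE_TYPES = {"company", "security", "location", "person", "product"}

# Structure: allowed (source type, relationship, target type) combinations.
RELATIONSHIP_TYPES = {
    ("company", "competes with", "company"),
    ("company", "trade partner with", "company"),
    ("company", "has legal issue with", "company"),
    ("company", "issues", "security"),       # assumed relationship name
    ("location", "located in", "location"),  # assumed relationship name
}


@dataclass
class Node:
    node_id: str
    node_type: str
    properties: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        """Reject nodes whose type is not part of the schema."""
        if node.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node.node_type}")
        self.nodes[node.node_id] = node

    def add_edge(self, source_id, relation, target_id, **properties):
        """Reject edges whose type combination is not part of the schema."""
        source, target = self.nodes[source_id], self.nodes[target_id]
        key = (source.node_type, relation, target.node_type)
        if key not in RELATIONSHIP_TYPES:
            raise ValueError(f"relationship not allowed by schema: {key}")
        self.edges.append((source_id, relation, target_id, properties))


# Fill: load structured facts mentioned in the article.
graph = Graph()
graph.add_node(Node("apple", "company", {
    "name": "Apple, Inc.",
    "sector": "technology",
    "sub_sector": "technology hardware",
}))
graph.add_node(Node("AAPL", "security"))
graph.add_node(Node("seattle", "location", {"kind": "city"}))
graph.add_node(Node("washington", "location", {"kind": "state"}))
graph.add_node(Node("usa", "location", {"kind": "country"}))

graph.add_edge("apple", "issues", "AAPL")
graph.add_edge("seattle", "located in", "washington")
graph.add_edge("washington", "located in", "usa")
```

Validating every edge against the schema at load time is one simple form of the quality control mentioned above: a malformed record fails loudly instead of silently polluting the graph.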

Differentiation and Challenges with a Knowledge Graph

The most interesting data for building a useful and differentiated knowledge graph is typically more difficult to capture. Structured, clean data often persists in many databases and is relatively commoditized, and thus far less likely to provide unique insights. Noonum's core differentiation is in capturing and exposing the material relationships that exist between companies, locations, people, themes, and products. We’ve found that the most accurate and comprehensive source of these relationships is unstructured data.

80-90% of the world’s data is unstructured and 90% of it was created in the last two years.  

Our core technology extracts knowledge from an ever-expanding corpus of text (news, corporate filings, earnings transcripts, and patents) and structures it so that it can be stored and understood through our knowledge graph. This extracted information is then enriched through the relationships and exposures we generate. Noonum's graph lets us generate proprietary analyses of sentiment, thematic uniqueness, and performance attribution to further understand what’s driving companies and markets. The data acquisition costs, storage, and performance required to maintain a dynamic knowledge graph of this nature are significant.

How big is the graph? Very big. We measure its size primarily by the number of relationships we load into it. Because it is a time series of knowledge captured from a variety of sources, at any point in time we may be inferring over 3 billion relationships. We currently cover over 5 million themes and 2 million entities, including companies, places, and products. To build and query such a comprehensive graph, some of the data is stored in systems better suited to time series. When we query the graph, we are actually querying this hybrid of sources.
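One common way to model a time series of relationships is to attach a validity interval to each edge and filter by an as-of date. The sketch below assumes that approach; the `TimedEdge` class, its field names, and the sample companies and dates are hypothetical, and the article does not describe how Noonum's hybrid stores are actually organized.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TimedEdge:
    source: str
    relation: str
    target: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the edge is still valid

    def active_on(self, as_of):
        """True if the edge holds on `as_of` (the end date is exclusive)."""
        started = self.valid_from <= as_of
        not_ended = self.valid_to is None or as_of < self.valid_to
        return started and not_ended


def edges_as_of(edges, as_of):
    """Return the point-in-time view of the graph for one date."""
    return [edge for edge in edges if edge.active_on(as_of)]


# Hypothetical sample data.
edges = [
    TimedEdge("AcmeCo", "competes_with", "BetaCorp",
              date(2020, 1, 1), date(2022, 6, 1)),
    TimedEdge("AcmeCo", "trade_partner_with", "GammaLtd",
              date(2021, 3, 15)),
]

print(len(edges_as_of(edges, date(2021, 12, 31))))  # 2
print(len(edges_as_of(edges, date(2023, 1, 1))))    # 1
```

Because closed intervals are kept rather than deleted, the same data answers both "what is true now?" and "what was true on a given date?".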

One of the most challenging aspects of building a knowledge graph is identifying the entities and keeping them accurate. What does this mean? Let’s consider a node in the graph for a company. We are not interested in generalizations. We are interested in specific knowledge about specific companies at specific points in time. Because companies come and go, we also need to understand when they existed and when they ceased to exist. If we know that the stock AAPL is issued by a company called Apple, Inc. today, we can create a company node with the name ‘Apple, Inc.’, link it to the security ‘AAPL’, and update any date properties from today’s data if they were not previously known. Now, if tomorrow’s security information says ‘AAPL’ was delisted, what do we do? Did the company named Apple, Inc. cease to exist, or did the relationship between Apple, Inc. and AAPL cease to exist? Also, if we read some text that says ‘Apple is a leader in smartphones’, is this Apple the same as ‘Apple, Inc.’? For every decision in creating and mastering entities in a knowledge graph, you need to consider:

  • What is the source of the entity?
  • How has the entity changed over time?
  • Which sources should be trusted?
  • What are the rules for creating or removing an entity?

Lastly, you also need to think carefully about identifiers (IDs) and the names of entities.  Many decisions must be made when mastering entities for a knowledge graph.
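The AAPL thought experiment can be sketched as follows. This illustrates one possible rule set, not Noonum's mastering process: a delisting closes the company-to-security link and leaves the company untouched, entities get a stable internal ID that is independent of their name, and text mentions are resolved by exact alias match. The `EntityMaster` class, the ID format, the alias list, and the dates are all hypothetical.

```python
import itertools
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Company:
    entity_id: str                 # stable internal ID, independent of name
    name: str
    aliases: set = field(default_factory=set)
    first_seen: Optional[date] = None
    ceased: Optional[date] = None  # set only by an explicit rule


@dataclass
class Listing:
    company_id: str
    ticker: str
    start: date
    end: Optional[date] = None


class EntityMaster:
    def __init__(self):
        self._ids = itertools.count(1)
        self.companies = {}
        self.listings = []

    def upsert_company(self, name, seen_on, aliases=()):
        """Reuse the company with this exact name, or create a new one."""
        for company in self.companies.values():
            if company.name == name:
                company.aliases.update(aliases)
                return company
        company = Company(f"C{next(self._ids):06d}", name,
                          set(aliases), first_seen=seen_on)
        self.companies[company.entity_id] = company
        return company

    def link_security(self, company, ticker, seen_on):
        self.listings.append(Listing(company.entity_id, ticker, seen_on))

    def delist(self, ticker, on):
        """Close the open listing; the company itself is left untouched."""
        for listing in self.listings:
            if listing.ticker == ticker and listing.end is None:
                listing.end = on

    def resolve(self, mention):
        """Return companies whose name or alias equals the mention."""
        key = mention.casefold()
        return [c for c in self.companies.values()
                if key == c.name.casefold()
                or key in {a.casefold() for a in c.aliases}]


master = EntityMaster()
apple = master.upsert_company("Apple, Inc.", date(2024, 1, 2),
                              aliases={"Apple"})
master.link_security(apple, "AAPL", date(2024, 1, 2))
master.delist("AAPL", date(2024, 1, 3))  # the article's hypothetical event

print(apple.ceased)            # None: the company still exists
print(master.listings[0].end)  # 2024-01-03: only the link ended
print(master.resolve("apple") == [apple])  # True
```

Exact alias matching is deliberately naive. Real text mentions are ambiguous, so a production system would need context and source trust to decide whether ‘Apple’ refers to this entity at all.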

Today, in both industry and academia, there is massive interest in using large data sources to train state-of-the-art language, image, and graph models. Building the data sources on which such models can be trained is very challenging, often more difficult than training the models themselves. Noonum’s evolving graph has great potential to learn the sequence in which market relationships emerge and evolve, shaping the way we understand and predict financial and business market dynamics.