Paolo Di Prodi

OmnibusCyber: a schema-ready strongly typed database to model all cyber security objects (pdf, video)

The global cyber security market was valued at USD 184.93 billion in 2021 and is expected to expand at a compound annual growth rate (CAGR) of 12.0% from 2022 to 2030, according to a market research survey. With the exponential growth of commercial products and revenue, there is a rich fabric of companies, researchers and public bodies such as NIST, SANS, MITRE and OASIS that are working together to create standards and protocols for interoperability. The combination of those efforts has created many inter-related knowledge silos, including (this is not an exhaustive list) CVE, CAPEC, CWE, CVSS, EPSS, Cocoa, OWASP, DML, MITRE ATT&CK and D3FEND, VERIS, STIX and MAEC. To unify this proliferation of frameworks, standards and protocols, researchers have proposed various ontologies with different levels of granularity, from specific use cases like defence exercises to more comprehensive ones like the UCO project. The adoption of such ontologies based on OWL and similar languages has been very low for a variety of reasons, the primary one being the steep learning curve required to learn and deploy such systems. Prompted by this, I surveyed more than 20 companies within the Cyber Threat Alliance and my network and found that all of them use bespoke database schemas to store and index both internal and external security data. This creates an impedance mismatch between the internal and external representations, which surfaces when sharing datasets, typically via the STIX exchange format. Although the STIX format cannot represent all those concepts, it is now the only practically established solution for exchanging (mostly) threat intelligence objects. The approach we are taking with this project is instead to offer a gold standard for the internal representation of entities. This will also facilitate the exchange of threat intelligence via STIX (although that is not the primary goal), which we already support in the related TypeDB CTI project.
In essence, the OmnibusCyber project offers a ready-made solution based on TypeDB, an open source strongly typed database, which offers the expressivity, safety and inference properties required to implement a knowledge graph without the complexity associated with the OWL/RDF semantic frameworks. The advantage of using TypeDB is that we avoid the need for any normalization, as TypeQL lets us map the ER diagram directly onto entities, relations, attributes and roles. This differs from SQL (which most database engineers are most familiar with), where we need to impose a tabular structure: the TypeQL model offers a logical layer that can implement the ER diagram as is. With regard to the OWL/RDF frameworks, there are three main differences. First, TypeDB reduces complexity while maintaining a high degree of expressivity: we avoid having to learn several Semantic Web standards, each with a high level of complexity, which lowers the barrier to entry. Second, TypeDB provides a higher-level abstraction for working with complex data than the Semantic Web standards: with RDF we model the world in triples, a lower-level data model than TypeDB’s entity-relation concept-level schema, whereas modelling and querying higher-order relationships and complex data is native in TypeDB. Third, the Semantic Web standards are built for the Web, while TypeDB works for closed-world systems with private data: the former were designed for linked data on an open web with incomplete data, while TypeDB works like a traditional database management system in a closed-world environment. Within Omnibus we are building a rich schema to represent the most commonly used concepts, a toolbox of utilities for importing from and exporting to a variety of common sources, and a set of example queries that are useful for data mining, machine learning and data science tasks.
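The contrast between a typed entity-relation model and triples can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use TypeDB or TypeQL, and the schema, the role names (`attacker`, `exploited`, `target`), the entity names and the helpers `relate` and `to_triples` are all hypothetical rather than part of OmnibusCyber. It shows a three-role relation held as one typed object whose roles are checked against a schema, and the same fact flattened into triples around an extra node.

```python
from dataclasses import dataclass

# Hypothetical miniature schema: relation type -> {role name: required entity type}.
SCHEMA = {
    "exploitation": {
        "attacker": "threat-actor",
        "exploited": "vulnerability",
        "target": "software",
    }
}


@dataclass(frozen=True)
class Entity:
    type: str
    name: str


@dataclass(frozen=True)
class Relation:
    type: str
    players: tuple  # ((role, Entity), ...)


def relate(rel_type, **players):
    """Build a relation, rejecting roles or player types the schema does not allow."""
    roles = SCHEMA[rel_type]
    for role, entity in players.items():
        if role not in roles:
            raise TypeError(f"{rel_type} has no role {role!r}")
        if entity.type != roles[role]:
            raise TypeError(f"role {role!r} requires a {roles[role]}, got a {entity.type}")
    return Relation(rel_type, tuple(players.items()))


def to_triples(relation, node_id):
    """Flatten an n-ary relation into subject-predicate-object triples via an extra node."""
    triples = [(node_id, "type", relation.type)]
    triples += [(node_id, role, entity.name) for role, entity in relation.players]
    return triples


# Placeholder entities; the names are illustrative only.
actor = Entity("threat-actor", "example-actor")
vuln = Entity("vulnerability", "example-vulnerability")
app = Entity("software", "example-software")

# One typed relation with three role players...
exploitation = relate("exploitation", attacker=actor, exploited=vuln, target=app)
# ...versus one triple per role plus a typing triple, all hanging off an extra node.
triples = to_triples(exploitation, "_:r1")
```

In this sketch the relation stays a single object and a mistyped role player is rejected when the relation is built, whereas the triple form needs an intermediate node and carries no role typing on its own.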
This means that a team of engineers can easily set up a central repository to manage all their cyber security data and metadata in one location, creating what is commonly referred to as a data mart that can be used to perform OLAP queries. These include the traditional roll-up, drill-down, slice, dice and pivot operations, as well as more powerful operations such as inference, link prediction and graph embeddings that are offered in our engine. This gives teams a powerful tool to discover new facts and relations within their dataset without the need to write complex SQL or graph queries. Our current aim is to receive feedback from the community and to reach a version that could be considered for inclusion by the OASIS group.
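For readers less familiar with the OLAP vocabulary, a roll-up and a slice can be sketched over a handful of records in plain Python. This is a minimal illustration, not the OmnibusCyber engine or a TypeQL query: the records, the dimension names and the helpers `roll_up` and `slice_` are all made up for the example.

```python
from collections import Counter

# Illustrative records only: (year, severity, product) for hypothetical findings.
findings = [
    (2021, "high", "product-a"),
    (2021, "low", "product-a"),
    (2021, "high", "product-b"),
    (2022, "high", "product-a"),
    (2022, "medium", "product-b"),
]
DIMENSIONS = ("year", "severity", "product")


def roll_up(rows, keep):
    """Aggregate counts over the dimensions in `keep`, summing out all others."""
    idx = [DIMENSIONS.index(d) for d in keep]
    return Counter(tuple(row[i] for i in idx) for row in rows)


def slice_(rows, dimension, value):
    """Fix one dimension to a single value and keep only the matching rows."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]


# Roll-up: findings per year, ignoring severity and product.
by_year = roll_up(findings, keep=("year",))

# Slice on severity, then roll up the remaining rows by product.
high_only = slice_(findings, "severity", "high")
high_by_product = roll_up(high_only, keep=("product",))
```

Drill-down is the reverse of the roll-up shown here: calling `roll_up` with more dimensions in `keep` returns the finer-grained counts.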