Harmonization and Knowledge Management with a Semantic Single Source of Truth (SSOT)

June 14, 2024
by Wolfgang Klemm

Harmonizing heterogeneous data sources with semantic graph databases.

In large companies, especially those that have grown through acquisitions, highly diverse tool landscapes often coexist. There are various business areas and competencies, individual processes, and historically grown environments that in practice cannot be replaced, migrated, or standardized in the short term or at reasonable cost. This concerns the technologies, but also the people who use them.

Current State

From demand and requirements management, product and application lifecycle management, development and version control, through test, deployment, and continuous integration tools, to operations and monitoring tools: In practice, developers are confronted with complex heterogeneous environments and a variety of distributed data sources and data volumes with widely differing interfaces, data formats, quality levels, and availability.

When the marketing department with its CRM system, or the support, human resources, legal, or finance departments, also come on board, the challenges quickly multiply: data must be consolidated across departments to create a basis for business intelligence (BI), for example, for reports and key performance indicators (KPIs).

With sometimes manual, often repetitive preparations, transformations, and transfers, it is not only data but also existing established processes that come under scrutiny. For management, transparency, reusability, automation, and cost reduction may be strategic goals. After all, measures such as the formalization and persistence of knowledge or the analysis of practices and processes using data analytics, semantics, artificial intelligence (AI), process mining (PM), or robotic process automation (RPA) open up entirely new, sometimes still unimagined possibilities here.

What promises new perspectives on the one hand raises questions for employees about their future role on the other. In addition to providing insight into architecture and technology, this article aims to show how this evolution can become less of a risk and more of an opportunity for everyone involved.

In contrast to the approaches of Data Warehouse and Data Lake, the word semantics comes first in a semantic Single Source of Truth (SSOT) - and thus the meaning of the information it contains. This makes it a central, ideally company-wide source of knowledge. Metadata ensures a common understanding of stored information, a central terminology for a common language. The goal: to make collaboration more efficient, between people as well as between machines.

Persisting knowledge means making it sharable and available. Formalizing it makes it understandable and thus reusable. Enriching information with semantics makes it possible to automatically infer new knowledge from existing knowledge, thus creating a self-learning knowledge database. Because a Single Source of Truth manages not only the data itself, but especially its meaning, semantics is an important basis for intelligent integration of information.

For management, the smart harmonization of data offers higher transparency and quality as well as better analyses for their business decisions - across the boundaries of tools and departments. For us, it offers a common understanding and the reusability of knowledge, thus freeing up more space and time for creativity and innovation - across personal skills, different working models, and cultures.

Challenges

Many tasks in merging information are obvious and can usually be solved by so-called Extract-Transform-Load (ETL) tools. These include downloading data from the primary sources, transforming it into a standard format, optionally validating it, and finally making it available via APIs.
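The steps named above can be sketched in a few lines. This is a minimal illustration, not a real ETL tool: the two in-memory exports, all field names, and the functions extract, transform, validate, and load are assumptions made up for this example.

```python
import json

# Hypothetical raw exports from two primary sources (assumed shapes).
SOURCE_A = '[{"id": 1, "title": "Login fails", "prio": 2}]'
SOURCE_B = '[{"key": "B-7", "summary": "Crash on save", "priority": "High"}]'


def extract(raw: str) -> list[dict]:
    """Extract: parse the raw export of one source."""
    return json.loads(raw)


def transform(record: dict, source: str) -> dict:
    """Transform: bring a source record into one standard format."""
    if source == "A":
        return {"id": f"A-{record['id']}", "title": record["title"],
                "priority": str(record["prio"]), "source": "A"}
    return {"id": record["key"], "title": record["summary"],
            "priority": record["priority"], "source": "B"}


def validate(record: dict) -> bool:
    """Validate: reject records with missing or empty mandatory fields."""
    return all(record.get(field) for field in ("id", "title", "priority"))


def load(records: list[dict], store: dict) -> None:
    """Load: make the records available under their id, e.g. behind an API."""
    for record in records:
        store[record["id"]] = record


store: dict = {}
for name, raw in (("A", SOURCE_A), ("B", SOURCE_B)):
    rows = [transform(r, name) for r in extract(raw)]
    load([r for r in rows if validate(r)], store)

print(sorted(store))  # ids of all loaded records
```

Note that the priorities "2" and "High" end up side by side in the store: the format is unified, the meaning is not.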

However, when incompatibilities, different terminologies, and thus ambiguities - i.e., semantic differences between the tools - come into play, a simple one-to-one transformation quickly reaches its limits. Possible inconsistencies - i.e., discrepancies between multiple environments - can make things even more difficult, for example, errors in synchronization between systems or uncontrolled updates of individual tools.

Especially when it comes to business analytics, reports, and KPIs, aspects such as correctness, completeness, and consistency of information are essential. The SSOT discussed in this article is based on GraphDB from Ontotext, a semantic graph database that is compatible with the W3C standards. The current free version can be downloaded from Ontotext.

Support not only in centralizing data but also in improving data quality in the primary systems is another crucial factor for the success of a Single Source of Truth (SSOT).

Many tasks are not purely technical in nature. For example, fine-grained access permissions to the primary sources should not simply be bypassed by redundant central data storage à la data lake. In addition to rights and roles, personal data may need to be anonymized. Possibly, company agreements or the GDPR completely exclude the use of certain data. Other information may be subject to confidentiality restrictions, such as export controls or customer agreements.

Once security and legal concerns are resolved, the question of individual data sovereignty remains. Only when this is settled as well, and there is a willingness to harmonize on all sides, can the effort to create a common understanding of the available information continue.

Harmonization versus Standardization

Complex heterogeneous environments pose diverse challenges. Applications, like people, often mean the same thing - but speak different languages, using not only different data but also different terminologies. And that leads to friction at the interfaces, both technical and human.

Try generating a common report from the data of two applications or departments. The call for standardization quickly grows loud, with the argument that it reduces costs. Thinking this through a bit further, however, a department-wide or even company-wide standardization, for example, through the technical unification of workflows that differ in practice, would not only be an unpleasant straitjacket for employees but would probably also be as impractical as it is uneconomical.

With a Single Source of Truth, it is therefore not about standardizing information and processes, but about harmonizing them, with the goal of creating compatibility instead of conformity and thus ultimately supporting strategic corporate goals.

Compatibility versus Conformity

As an example, let's consider the use of different task tracking systems, for simplicity's sake, GitLab and Jira. If you want to show across the board in a reporting tool for your project management how many tasks with which priority are open, this assumes that both systems have the same configurations and values for specifying the priority - this is rarely the case in practice.

One system may have priorities 1 to 5 as numeric values, while the other may have Critical, High, Medium, and Low as text values. The goal should not be to force all systems into a standardized corset, but to harmonize the information in such a way that both systems can continue to exist but are still compatible for cross-system reporting.

Not only are the values different, but their number also does not allow for an unambiguous one-to-one mapping. Establishing conformity would mean defining a standard and adapting the primary systems to have identical values. Even if this is technically possible, it would at least mean intervening in the workflows of the affected users. Not to mention whether and how a migration - especially of historical data (legacy) - can be done.

Compatibility, on the other hand, means leaving the primary systems as they are and using a semantic reference model to give the values of the respective environments a meaning that is equally understandable for humans and machines and thus semantically comparable.

Compatibility does not necessarily mean achieving a lossless bidirectional transformation at the pure field level. Considered in isolation, five values cannot be mapped to four values and then back to five values in the other direction. Graphs and semantics cannot eliminate this problem per se, but they can significantly reduce it.
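A short sketch makes the loss visible. The two mapping tables are assumptions chosen for illustration; any mapping from five values onto four has to merge at least two of them.

```python
# Hypothetical field-level mapping from a five-value scale to a four-value scale.
FIVE_TO_FOUR = {1: "Critical", 2: "High", 3: "Medium", 4: "Low", 5: "Low"}
# The reverse direction has to pick exactly one numeric value per text value.
FOUR_TO_FIVE = {"Critical": 1, "High": 2, "Medium": 3, "Low": 4}

# Map every value to the other scale and back again.
round_trip = {value: FOUR_TO_FIVE[FIVE_TO_FOUR[value]] for value in FIVE_TO_FOUR}
lost = [value for value, back in round_trip.items() if value != back]

print(round_trip)
print("values not restored:", lost)
```

In this sketch, priority 5 comes back as priority 4. The field alone does not carry enough information to tell the two apart again.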

Graphs are characterized by the linking of information. A task tracking system may have additional fields such as the type of task, for example, feature or bug. A bug in turn may have a severity and a frequency. Semantically, the priority field should therefore not be predefined but calculated or inferred.

If a reference model has four values for priority, for example, a mapping that includes additional fields can make a semantically unambiguous statement even with different values in the primary sources.

A priority 2 in system A is mapped to Critical or High in the reference model, depending on the frequency. A priority B of a bug in system B is mapped to High or Medium in the reference model, depending on the severity. Since the reference model knows both terms, frequency and severity, a semantic mapping and thus the harmonization of different systems is possible.
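The mapping described above could look roughly like the following sketch. The class Task, the function reference_priority, and the concrete field values "often", "rare", "major", and "minor" are assumptions for illustration; only the two mapping rules themselves come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    system: str                       # "A" or "B"
    priority: str                     # raw value from the primary system
    frequency: Optional[str] = None   # additional field used for system A
    severity: Optional[str] = None    # additional field used for system B


def reference_priority(task: Task) -> str:
    """Map a raw priority to the reference model using an additional field."""
    if task.system == "A" and task.priority == "2":
        # Assumed rule: frequent occurrences raise the reference priority.
        return "Critical" if task.frequency == "often" else "High"
    if task.system == "B" and task.priority == "B":
        # Assumed rule: severe bugs raise the reference priority.
        return "High" if task.severity == "major" else "Medium"
    raise ValueError(f"no mapping rule for {task}")


tasks = [
    Task("A", "2", frequency="often"),
    Task("A", "2", frequency="rare"),
    Task("B", "B", severity="major"),
    Task("B", "B", severity="minor"),
]
for task in tasks:
    print(task.system, task.priority, "->", reference_priority(task))
```

The same raw value yields different reference priorities depending on the linked field, which is exactly what a pure one-to-one field mapping cannot express.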

With the knowledge about the derivation of the values, not only can a tool-independent report be realized, but it can also be explained to the viewer with the help of the links in the semantic model. What still seems quite simple in this example can be extended to far more complex chains of argumentation using semantics. If an error has occurred at a customer, the importance of that customer can influence the priority. The importance of the customer, in turn, can be determined by the revenue generated with them or by the duration of the business relationship, and so on. In practice, evaluation criteria that would otherwise be confusing can be merged with the help of linking and semantics, which makes decisions easier to objectify. At this point, it already becomes clear how semantic graphs as assistance systems can facilitate our daily work.

Now the question arises whether a priority field that can be manually maintained by the user is semantically useful at all. After all, it can be derived from the type of an entry, the severity of an error, its frequency of occurrence, and perhaps also from the importance of the customer, which in turn can be derived from the revenue. It is actually more about determining an order, a ranking, in which the pending tasks should be processed, depending on defined criteria.

The linking of information in a semantic graph creates attractive automation potentials, but only on the basis of the defined criteria; exceptional cases remain unconsidered. An assistance system should therefore not mean taking over control of all decisions, but rather it should support us. To stay with the priority, a manual intervention to control the ranking within the framework of valid values should certainly be maintained.
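As a rough sketch of this idea, the ranking below is derived from type, severity, frequency, and customer revenue, while a manually set position still takes precedence. All weights, the revenue threshold, and the names Entry, score, and ranking are assumptions for illustration and not part of any real product.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed weights; a real model would derive them from the semantic graph.
TYPE_WEIGHT = {"bug": 2, "feature": 1}
SEVERITY_WEIGHT = {"major": 3, "minor": 1}
FREQUENCY_WEIGHT = {"often": 2, "rare": 1}
KEY_CUSTOMER_REVENUE = 100_000  # assumed threshold for an important customer


@dataclass
class Entry:
    name: str
    kind: str
    severity: str = "minor"
    frequency: str = "rare"
    customer_revenue: float = 0.0
    manual_position: Optional[int] = None  # optional human override, 1-based


def score(entry: Entry) -> int:
    """Derive a score from the linked criteria instead of a hand-set priority."""
    customer_importance = 2 if entry.customer_revenue >= KEY_CUSTOMER_REVENUE else 1
    return (TYPE_WEIGHT[entry.kind] * SEVERITY_WEIGHT[entry.severity]
            * FREQUENCY_WEIGHT[entry.frequency] * customer_importance)


def ranking(entries: list[Entry]) -> list[str]:
    """Order entries by derived score, then place manually pinned entries."""
    derived = sorted((e for e in entries if e.manual_position is None),
                     key=score, reverse=True)
    pinned = sorted((e for e in entries if e.manual_position is not None),
                    key=lambda e: e.manual_position)
    for entry in pinned:
        index = min(max(entry.manual_position, 1), len(derived) + 1) - 1
        derived.insert(index, entry)
    return [e.name for e in derived]


entries = [
    Entry("typo", "bug", frequency="often"),
    Entry("crash", "bug", severity="major", frequency="often", customer_revenue=250_000),
    Entry("export", "feature", customer_revenue=250_000),
    Entry("audit", "feature", manual_position=1),
]
print(ranking(entries))
```

The entry "audit" would rank last by its derived score, but the manual position moves it to the front. The assistance system proposes, and the user keeps control.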

Common Language, Terminology

A strategic goal in large companies is often to improve internal collaboration, and one measure toward it is better communication. A uniform terminology helps here: a common language or, technically speaking, a common semantic definition of identifiers and their synonyms.

In the area of Application Lifecycle Management (ALM), for example, there are terms such as issues, items or elements, tasks, features or functions, bugs, errors or defects, which can have different meanings in their respective contexts, but do not have to. This creates ambiguities, resulting in misunderstandings and ultimately high costs to clarify them again.

Eliminating or at least reducing ambiguities is one of the goals of an SSOT. A glossary is a useful feature here. It represents a central reference of identifiers including a human-understandable description, the actual meaning. You can find important identifiers for understanding the terminology in the Glossary box at the bottom of this article.
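A glossary with synonyms can be pictured as a simple lookup structure. The following sketch is only an illustration: the descriptions, the dictionary layout, and the function canonical are assumptions, while the synonym groups are taken from the ALM terms mentioned above.

```python
# Hypothetical glossary: canonical identifier -> description and synonyms.
GLOSSARY = {
    "Bug": {
        "description": "A deviation of the actual from the expected behavior.",
        "synonyms": {"error", "defect"},
    },
    "Feature": {
        "description": "A new capability requested for a product.",
        "synonyms": {"function"},
    },
}

# Reverse index so that every known term resolves to one canonical identifier.
CANONICAL = {
    term.lower(): identifier
    for identifier, entry in GLOSSARY.items()
    for term in {identifier, *entry["synonyms"]}
}


def canonical(term: str) -> str:
    """Resolve a tool- or department-specific term to its canonical identifier."""
    return CANONICAL.get(term.strip().lower(), "unknown")


print(canonical("Defect"), canonical("function"), canonical("ticket"))
```

Terms that the glossary does not know are reported as unknown instead of being guessed, so gaps in the terminology become visible.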

Semantics

A semantic graph, also called an ontology, essentially consists of a class hierarchy, the so-called taxonomy or T-Box, and the individuals, that is, the instances of the classes, the so-called A-Box. Multiple inheritance is also supported. In addition, there are the properties, which are divided into data properties and object properties, as well as the annotations.

While data properties contain the values for fields of an individual, object properties describe the relations between different individuals. Both kinds of properties can be provided with meta-information, which contributes to the actual strength of semantics. Properties can also be structured hierarchically and thus inherit their meaning.

Annotations are additional information that can be attached to all mentioned entities but are not considered by the so-called reasoner. They can be used for comments, version numbers, the author of the information, or for the internationalization of an ontology.
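The building blocks described above can be pictured with plain Python data structures. This is a simplified sketch and not how a graph database stores an ontology; all class, individual, and property names are assumptions for illustration.

```python
# T-Box: class hierarchy, including multiple inheritance for CloudService.
SUBCLASS_OF = {
    "Service": {"Thing"},
    "Product": {"Thing"},
    "CloudService": {"Service", "Product"},
}

# A-Box: individuals with class membership, data properties, object properties.
TYPES = {"storage": {"CloudService"}, "shop": {"Service"}}
DATA_PROPERTIES = [("shop", "hasVersion", "2.1")]    # value of a field
OBJECT_PROPERTIES = [("shop", "uses", "storage")]    # relation between individuals

# Annotations: attached to entities, but ignored by the inference below.
ANNOTATIONS = [("CloudService", "comment", "Service operated by a cloud provider")]


def superclasses(cls: str) -> set[str]:
    """Collect all direct and indirect superclasses of a class."""
    found: set[str] = set()
    for parent in SUBCLASS_OF.get(cls, set()):
        found |= {parent} | superclasses(parent)
    return found


def classes_of(individual: str) -> set[str]:
    """All classes an individual belongs to, asserted or inherited."""
    asserted = TYPES[individual]
    return asserted | {s for cls in asserted for s in superclasses(cls)}


print(sorted(classes_of("storage")))
```

Because CloudService inherits from both Service and Product, the individual storage is a member of both branches of the hierarchy without any additional assertion.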

The reasoner is an engine for logical conclusions, the so-called inference, and thus a core component of every semantic graph database. Some useful examples here are the following meta-information for object properties:

Transitive properties state that if an individual A references another individual B, and B references C, then A also references C. A transitive property "uses" would thus show not only the direct dependencies of an app's own high-level services but also the low-level services of the cloud provider used by them.

Inverse properties describe relations in the opposite direction. A property "usedBy" for a service would be a reversal of "uses". This supports top-down queries like "Which app uses which services" as well as bottom-up queries like "Which services are used by which app" - and this in an arbitrarily complex graph as a representation of the microservice infrastructure.

Symmetric properties like "interactsWith" state that if one service interacts with another, the same is true for the other service, without this having to be explicitly defined for the latter. This is exactly the task and function of the reasoner, which automatically makes these logical conclusions for the stored information.
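To illustrate the three kinds of meta-information, here is a toy forward-chaining sketch over triples. It is far simpler than the rule sets of a real reasoner, and the service names as well as the function infer are assumptions for illustration.

```python
# Explicit facts as (subject, predicate, object) triples; all names are assumed.
FACTS = {
    ("app", "uses", "orderService"),
    ("orderService", "uses", "cloudStorage"),
    ("orderService", "interactsWith", "billingService"),
}

TRANSITIVE = {"uses"}
SYMMETRIC = {"interactsWith"}
INVERSE = {"uses": "usedBy"}


def infer(facts: set[tuple[str, str, str]]) -> set[tuple[str, str, str]]:
    """Apply the three rules repeatedly until no new triple appears."""
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            if p in SYMMETRIC:
                new.add((o, p, s))
            if p in INVERSE:
                new.add((o, INVERSE[p], s))
            if p in TRANSITIVE:
                # Chain s -p-> o with every o -p-> o2 to get s -p-> o2.
                new |= {(s, p, o2) for s2, p2, o2 in known if p2 == p and s2 == o}
        if new <= known:
            return known
        known |= new


inferred = infer(FACTS) - FACTS
for triple in sorted(inferred):
    print(triple)
```

From three explicit facts, the sketch derives further statements such as app uses cloudStorage and cloudStorage usedBy app, none of which were stated explicitly.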

That's it for now. In the following article, we'll delve into more details of a reference model as well as the mapping in the reference model.

Glossary

The most important terms for understanding graph databases:

A-Box: The Assertional Box is part of an ontology and contains knowledge about the concrete instances (individuals, objects) of a domain. It contains facts (see Axioms) about individuals and their properties, as well as their relationships to each other. The A-Box represents the state of a world modeled in the T-Box.
Annotation: Annotations are statements that can be attached to any entity, i.e., any class, individual, or property, without affecting its semantics. Annotations can be used for comments, specifying authors or version numbers, and for internationalizing ontologies, i.e., translating knowledge into different languages. Annotations are treated as data but not considered by the reasoner.
Assertions: Assertions are claims in an ontology that do not necessarily have to be true, complete, or consistent. For example, a person's age could be specified as negative or with two contradictory statements. The ontology does not reject invalid claims per se. However, the reasoner of a graph database identifies rule violations and semantic inconsistencies in the form of explanations and supports correction at a time chosen by the user, while the entire ontology remains available externally despite internal inconsistencies.
Axioms: Axioms are statements in an ontology that determine what is true in a domain. Example: Human is a subclass of Living Being. If Max is a human, he is also a living being. Axioms determine, among other things, classes, data and object properties, data types, or annotations. Axioms about individuals are often also referred to as facts.
Data Warehouse: A data warehouse is a central repository for structured data, prepared, filtered, validated, and transformed for a purpose. A schema is used when writing to the warehouse (Schema on Write). Contents are easy to understand, but changes are more complex, as consumers such as business intelligence tools use them directly, for example, for dashboards. The target group is more business professionals, and the purpose is quick analysis results.
Data Lake: A data lake is the consolidation of raw data in a central location, whose purpose and use are not yet determined, dynamic, unfiltered, extensive, and less organized. Contents are ideal for machine learning, harder to understand but easier to change. Navigation, data quality, and data governance are more difficult. A schema is only applied when reading from the lake (Schema on Read). The target group is more data scientists.
Domain: In the semantic web, a domain is referred to as a content-related or logically intertwined area of knowledge, an area of interest, or a collection of resources, people, or machines. It is identified by a name and should not be confused with an Internet domain.
Graph Database: A graph essentially consists of nodes and the connections between them, the so-called edges. A graph database is ideal for representing networked information and thus managing knowledge. RDF is the best-known concept for a semantic graph database. Here, statements are formulated in the form of so-called triples consisting of subject, predicate, and object (see RDF). Graph databases can usually contain multiple ontologies in different contexts (graphs) within a single database schema. Such a combination of ontologies, especially together with a lot of instance information, is often also referred to as a knowledge graph.
Individuals: In the context of the semantic web and ontologies, objects are usually referred to as individuals, technically understood as instances of classes, which are occasionally also referred to as concepts. Individuals can contain data and object properties and can be assigned to one or more classes. While in the object-oriented world, an instance of a particular class is usually created - for example, var Max = new Person() - in the semantic web, an individual can also be created without a class, but the class assignment(s) can be inferred via properties. Example: If an individual has the property hasWheels and hasWheels has the domain Vehicle, the individual is automatically a member of the class Vehicle without this having to be explicitly defined for it.
Inference: Inference means logically deriving new statements based on existing ones. Example: If A is equivalent to B and B is equivalent to C, it can be inferred that A is also equivalent to C. While the first two statements are explicitly formulated, the third is inferred - that is, implicit, automatically generated knowledge. The reasoner is responsible for the inference using sets of rules, so-called profiles. The W3C defines the scope of the profiles for OWL, but not how they are to be implemented. The major vendors of graph databases usually allow the adaptation of existing or the definition of custom rule sets in addition to the predefined profiles.
Ontology: An ontology includes the formal definition of concepts (classes), properties, and relationships between the entities of a domain. Ontologies are particularly suitable for sharing knowledge using a common vocabulary. They consist of a T-Box, the taxonomy (class hierarchy), and an A-Box, the individuals (instances). Reasoners are responsible for inference (logical conclusions) in ontologies. Annotations can be used for comments or internationalization of ontologies. Ontologies can reference each other and thus grow into extensive, self-learning knowledge databases.
Open World Assumption: The Closed World Assumption (CWA) states that everything that is not known to be true must be false. If a train schedule states that a train runs at 10 and 14 o'clock, this implies in CWA that it does not run at 12 o'clock. In contrast, the Open World Assumption (OWA) states that everything that is not known to be true is simply unknown. If a phone directory lists the numbers of two subscribers, this does not mean that no other subscribers exist. Ontologies in the Semantic Web work according to the Open World Assumption. This is advantageous because incorrect, contradictory, or rule-violating information entered does not restrict the functionality of the ontology as a whole, but can be identified, reported, and explained by the reasoner.
OWL: OWL stands for Web Ontology Language, a language designed for the Semantic Web to represent knowledge about objects and classes (as groups of objects) and their relationships to each other. Ontologies are based on RDF and OWL and can be read and modified with SPARQL as query language. Reasoners support RDF and OWL.
Profiles: OWL 2 (http://www.w3.org/TR/owl2-profiles) defines sets of inference rules, so-called profiles, with different expressiveness and efficiency for specific use cases. Prominent examples are EL for ontologies with many classes and properties, QL for ontologies with many instances, and RL for applications that require a balanced trade-off between scalable reasoning (inference) and expressiveness.
Properties: Properties describe the characteristics of individuals, the instances in a graph database. Data properties contain concrete values of various data types (www.w3.org/TR/owl2-syntax/#Datatype_Maps), while object properties describe the relations between individuals. The expressiveness of ontologies is determined, among other things, by so-called restrictions and property characteristics. Transitive properties: if a customer A uses app B and app B uses service C, it can be inferred that customer A also uses service C. Symmetric properties: if a service A interacts with a service B, then service B also interacts with service A. Inverse properties: for the statement App A uses Service B, an inverse property would be, for example, usedBy, so a query for Service B usedBy App A would be answered by the reasoner with true.
Reasoner: A reasoner is a so-called inference engine, that is, software that draws new logical conclusions from axioms and assertions. It does not have to persist these conclusions explicitly but can provide them implicitly in SPARQL queries: A = B (explicit) and B = C (explicit) => A = C (implicit). Because of this dynamically generated knowledge, ontologies can require considerably more space in memory than on disk. W3C-compliant semantic graph databases offer different sets of inference rules, called profiles in OWL 2. Reasoners are also responsible for reporting and explaining inferences and inconsistencies in an ontology.
RDF: RDF is the abbreviation for Resource Description Framework (http://www.w3.org/RDF), a modeling concept of the Semantic Web standardized by the W3C. It uses triples of subject, predicate, and object to formulate simple logical statements in directed graphs that can be easily read, understood, and visualized by machines. Examples: Max hasAge 32 (data property) or Josef hasSpouse Maria (object property). A common serialization format is RDF/XML.
Semantic Web: According to Tim Berners-Lee, the inventor of the World Wide Web, the Semantic Web is "the web of data that can be processed by machines". The W3C defines it as follows: "The Semantic Web provides a framework that allows data to be shared and reused across application, enterprise, and community boundaries." The Semantic Web is therefore also seen as an integrator across different content, information applications, and systems.
SPARQL: SPARQL is the name of an HTTP-based protocol (http://www.w3.org/TR/sparql11-protocol) and at the same time the abbreviation for SPARQL Protocol And RDF Query Language (http://www.w3.org/TR/sparql11-overview), a query language (SPARQL 1.0) and manipulation language (SPARQL 1.1) for RDF graphs in the Semantic Web, standardized by the W3C. SPARQL is similar to SQL, supports the integration of multiple, even external RDF graphs, and is optimized for RDF triple stores. With SPARQL, not only persisted explicit knowledge can be queried, but also knowledge generated implicitly by reasoners through inference. Implementations exist for almost all programming languages.
Taxonomy: In terms of ontologies and the Semantic Web, a taxonomy is a model for a hierarchical classification of objects, or in simple terms: a class hierarchy with classes and subclasses. It enables simple aggregations at the class level. Example: accident statistics of an insurance company for cars and trucks as subclasses of all vehicles.
T-Box: The Terminological Box is the conceptual component of a knowledge base, also called schema or vocabulary. It contains the knowledge about the classes of a domain as well as their hierarchy (taxonomy) and their characteristics (properties). There are powerful ontologies that consist only of a T-Box but contain no instances.
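
Two of the glossary examples, the domain-based class membership from the Individuals entry and the subclass axiom from the Axioms entry, can be sketched as follows. This is a simplified illustration with plain dictionaries, not the behavior of a specific reasoner; the individual myCar and the function inferred_classes are assumptions.

```python
# Axioms (T-Box): a subclass relation and a property domain from the glossary examples.
SUBCLASS_OF = {"Human": "LivingBeing"}
PROPERTY_DOMAIN = {"hasWheels": "Vehicle"}

# Assertions (A-Box): explicit facts about individuals.
TYPE_ASSERTIONS = {"Max": {"Human"}}
PROPERTY_ASSERTIONS = [("myCar", "hasWheels", 4)]


def inferred_classes(individual: str) -> set[str]:
    """Combine asserted classes with classes inferred from domains and subclass axioms."""
    classes = set(TYPE_ASSERTIONS.get(individual, set()))
    # Domain rule: using a property makes the subject a member of its domain.
    for subject, prop, _value in PROPERTY_ASSERTIONS:
        if subject == individual and prop in PROPERTY_DOMAIN:
            classes.add(PROPERTY_DOMAIN[prop])
    # Subclass rule: follow the hierarchy upwards.
    for cls in list(classes):
        while cls in SUBCLASS_OF:
            cls = SUBCLASS_OF[cls]
            classes.add(cls)
    return classes


print(inferred_classes("Max"))
print(inferred_classes("myCar"))
```

The individual myCar was never assigned a class explicitly. Its membership in Vehicle follows solely from the domain of hasWheels.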

