Model-driven Enterprise Data Architecture

In this article we present an agile, model-driven Enterprise Data Architecture.  The architecture is driven from a semantically-rich conceptual model described in FIML, a Fact-Based Modeling language developed by Factil.  The architecture is based on three stages of data refinement and consolidation, implemented according to different relational schema approaches:

  • The Data Hub, the initial landing platform for all data, where it is refined into tabular form
  • The Data Warehouse, the consolidated repository for the data assets of the enterprise
  • The Data Marts, which may be virtual or persistent and contain extracts from the Data Warehouse used for reporting purposes

The agility of the architecture is based on several concepts:

  1. Using a Data Vault approach for defining the schema of the Data Warehouse, which allows it to be incrementally built and expanded without re-engineering.
  2. Using schema and migration code generation from the conceptual model into the Data Hub, the Data Warehouse and Data Marts schema patterns, which provides both developmental productivity and logical consistency across the architecture.

Introduction

Enterprises today are facing a wide range of challenges in managing and capitalising on their internal and external data in order to gain actionable business insights.  Many organisations are still struggling to get full value from their traditional internal enterprise data, while the new era of Big Data has ushered in a flood of new data sources characterised by the three V’s of Volume, Velocity and Variety.

Many commentators refer to Value as the fourth V of Big Data.  The business value of data is maximised by bringing both Big Data and Enterprise Data together into a single consistent architecture.

[Figure: Big Data and Enterprise Data combining to deliver business Value]

A modern Enterprise Data Architecture must bring Big Data and Enterprise Data together in order to maximise business Value.

Variety makes integration difficult

There are a number of major challenges to bringing both Big Data and Enterprise Data together, but the greatest challenge comes from the variety of formats of the different data sources.  Combining data from different sources is fundamentally based on having metadata to describe the data so as to specify how data items are related and to provide guidance on how the data is to be transformed so that it can be combined.  The different data sources provide data in different formats, with widely differing metadata capabilities, and differing challenges for data combination.

Source                            Format and challenge
In-house applications             RDBMS, bespoke schema
Vendor applications               RDBMS, often highly abstracted
Hosted applications               Web API, often only changes
Machine Data                      Log files, semi-structured
External Data (Open/Official)     XLS, XML, JSON files, semi-structured
External Data (Social media)      Text files, images, unstructured

The challenge of integrating data from the many different sources now required within an enterprise data architecture has never been greater.

Big Data vs Data Warehouse?

Even within Enterprise Data, traditional data warehousing projects have experienced a number of problems, including:

  • Slow time to build
  • Lack of agility due to expensive and extensive re-engineering
  • Inability to effectively integrate disparate systems

This has caused some people in the Big Data community to suggest that Big Data can replace the Data Warehouse.  They introduce the notion of a “schema-less” Data Lake, where all data is loaded in source form as the repository of enterprise data, and where information is gleaned from the Data Lake by running hand-written data mining code.  The underlying premise seems to be that the traditional data management disciplines of metadata and master data management can be ignored.  This creates the mentality of Big Data versus the Data Warehouse.  Unfortunately, this separation is also often reflected in the organisational structure of the enterprise, with separate teams responsible for each environment.

Our view is that doing away with the traditional data management disciplines of metadata and master data management will cause a great deal of wasted effort within a Data Lake and will make it virtually impossible to get an enterprise-wide view of crucial business concepts.  Although there may be some short-term wins in building some information islands within the Data Lake, overall this approach will fail to deliver sustained business value.

At the same time, the problems associated with traditional Data Warehousing projects are serious and continuing to build Data Warehouses in the same manner is not consistent with the business requirements for agility and rapid incremental delivery of new capability.

Traditionally, Data Warehouses have been built using either a 3rd Normal Form (3NF) or a Dimensional schema approach.  A different approach, called Data Vault, provides a structured way to integrate disparate systems, and can be built incrementally and expanded without re-engineering.  It is these characteristics that make it very appealing for an Enterprise Data Architecture that brings together the strengths of both Big Data and the Data Warehouse.  So rather than seeing Big Data and the Data Warehouse as adversaries, we see them as complementary and compatible.  A modern Enterprise Data Architecture brings together both Big Data and the Data Warehouse:

EDA = BD + DW

The next section gives a short introduction to Data Vault.  More information can be found in our Fact-Based Modelling and Data Vault article and the references at the end of this article.

Data Vault Overview

The underlying premise behind Data Vault is the notion of Unified Decomposition.

Data Vault breaks things into their component parts for flexibility, adaptability and agility, and generally to facilitate the capture of things that are either interpreted in different ways or change independently of each other – to decompose.

At the same time, Data Vault moves to a common standard view of unified concepts – to unify.

Data Vault models the component parts of a core business concept as an ensemble consisting of:

A Hub, which holds the business key but no descriptive data.  There is a single hub in each ensemble.  A hub can only be connected to links and satellites within its ensemble.

Links, which represent business relationships but hold no descriptive data.  A link can have binary or n-ary connections to hubs.  Links can be connected to satellites.

Satellites, which contain all context, descriptive data and history.  A satellite must have exactly one connection, which can be to a hub or a link.
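
To make these roles concrete, here is a minimal SQL sketch of one possible ensemble, using the Product and Sale concepts that appear in the example later in this article.  The table and column names and the surrogate key style are illustrative assumptions, not the output of any particular tool:

-- Hub: holds only the business key for the Product concept
CREATE TABLE product_hub (
    product_hid   BIGINT PRIMARY KEY,           -- surrogate key
    product_name  VARCHAR(80) NOT NULL UNIQUE   -- business key
);

-- Hub for the Sale concept
CREATE TABLE sale_hub (
    sale_hid      BIGINT PRIMARY KEY,
    sale_nr       BIGINT NOT NULL UNIQUE
);

-- Link: records the relationship between the hubs, no descriptive data
CREATE TABLE sale_li_link (
    sale_li_lid   BIGINT PRIMARY KEY,
    sale_hid      BIGINT NOT NULL REFERENCES sale_hub,
    product_hid   BIGINT NOT NULL REFERENCES product_hub
);

-- Satellite: all descriptive data and history, exactly one connection (here, to the link)
CREATE TABLE sale_li_sat (
    sale_li_lid   BIGINT NOT NULL REFERENCES sale_li_link,
    load_ts       TIMESTAMP NOT NULL,            -- history is kept by load timestamp
    quantity      INTEGER,
    unit_price    DECIMAL(18,2),
    PRIMARY KEY (sale_li_lid, load_ts)
);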

Diagrammatically, the information that would normally be housed in a single 3NF table or dimensional star schema is represented in multiple tables in a Data Vault ensemble.

[Figure: the same information held in a single 3NF or dimensional table compared with a Data Vault ensemble]

In this and the following diagrams, hubs are shown as blue boxes, links as green boxes and satellites as yellow boxes.  The ensemble is shown as a red circle.  There is one hub per ensemble.  Links are shown on the edge of an ensemble because they are linked to other hubs.

In Data Vault the hubs and links form the backbone of the schema.

[Figure: hubs and links forming the backbone of a Data Vault schema]

The purpose of hubs and links is to identify the business concepts and the relationships between business concepts.

A full Data Vault model might look like this:

[Figure: a full Data Vault model showing hubs, links and satellites]

Link relationships can be read as a phrase.  In this example:

Product was sold with Sale according to Sale LI

where the naming of Product, Sale and Sale LI is given by their respective business keys.

As we explain in our Fact-Based Modelling and Data Vault article, Data Vault has a number of important advantages for data modeling in the data warehouse including:

  • Handling change flexibly
  • Managing history efficiently
  • Loading data efficiently

Taken together the agility, flexibility and efficiency advantages of Data Vault make it a compelling data modeling approach for the Data Warehouse.

Enterprise Data Architecture

We propose an Enterprise Data Architecture divided into four major data processing areas:

  • A Data Hub that receives data from a wide range of data sources and formats, and applies minimal data processing. Data is merely processed into tabular form for change data detection and time-stamping before loading into the Data Warehouse.
  • A Data Warehouse concerned with enterprise-wide data integration, data alignment and reconciliation, providing a historical record of changed data, and audit trail of the changes.
  • A number of Data Marts used for specialised reporting.
  • An Analysis area where a wide range of ad-hoc, batch analytics and data mining can be carried out from data in the Data Hub.

The variety of data sources to be handled by the Data Hub, and the volume and velocity of some of that data, suggest that it would be implemented with Big Data technology, although it may be a hybrid of relational and Big Data technologies.
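
As an illustration of the minimal processing in the Data Hub described above, a landing table might carry the source data in tabular form together with the columns needed for change data detection and time-stamping.  This is a hedged sketch; the hash-based change detection and all names are our assumptions:

-- A landing table for product records arriving in the Data Hub.
CREATE TABLE landing_product (
    product_name   VARCHAR(80) NOT NULL,
    description    VARCHAR(200),
    row_hash       CHAR(32)    NOT NULL,  -- hash of the payload columns, compared with the previous load to detect changes
    load_ts        TIMESTAMP   NOT NULL,  -- time-stamping before loading into the Data Warehouse
    record_source  VARCHAR(48) NOT NULL   -- which source system supplied the row
);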

The agility, flexibility and efficiency advantages of Data Vault make it a compelling data modeling approach for the Data Warehouse of the Enterprise Data Architecture.

The Data Marts are modelled dimensionally.  Efficient transformation code from the Data Warehouse would enable many data marts to be implemented non-persistently, as views into the Data Warehouse.

Reporting and Visualisation can be ubiquitous, fetching data directly from any of the areas of the Enterprise Data Architecture.

Diagrammatically, the Enterprise Data Architecture can be depicted as follows:

[Figure: the Enterprise Data Architecture]

Data Modeling in Enterprise Data Architecture

Compared to current Enterprise Data Architectures, the structure presented here is more ambitious because there are three distinct schema approaches to be used simultaneously and consistently:

  • 3NF in the Data Hub, across both Big Data and Relational Database technologies.
  • Data Vault in the Data Warehouse
  • Dimensional in the Data Marts

Maintaining consistency is particularly important.  Data Vault creates a considerable volume of schema definitions and, as a result, requires a considerable amount of migration code when mapping from the Data Hub to the Data Vault.  Typically Data Vault creates between 5 and 8 Data Vault tables for each 3NF table in the Data Hub.  In addition, Data Marts require specialised views or migration code from the Data Vault schema to create the dimensional data in the Data Mart.

One concern with Data Vault modelling is that there is little to no built-in support for it in current Entity-Relationship / Dimensional modeling tools.  In many environments, manual data modeling processes are followed that are slow and inconsistent, leading to the development of bespoke solutions.

A follow-on concern from manual data modeling is that the architectural and change management processes in many enterprises require extensive testing and change management approvals before the deployment of these changes, causing deployment bottlenecks.

Finally in a manual data modeling environment, it is costly to produce and maintain documentation and design artefacts that are consistent with the solution under development or deployed.

Rolling up these concerns, we pose the following challenge for Data Modeling for the Enterprise Data Architecture.

The Data Modeling Challenge

If only there was a conceptual modeling approach that could:

  • Be easily understood and verified by business domain experts (preferably in natural language)
  • Formally express business concepts, relationships and associated data (mandatory, uniqueness, cardinality, etc)
  • Generate schemas for 3NF, Data Vault and Dimensional databases, and associated migration code

A Model-driven Enterprise Data Architecture

If we could address this challenge then we could envisage a Model-driven Enterprise Data Architecture where there is an easily understood, semantically-rich conceptual data model that could be used to generate schemas and migration between the three schema approaches.

Ideally, a significant proportion of the conceptual model could be inferred from source data systems.  Business domain experts could then refine the data model with additional relationships, cardinalities and constraints not obvious from the raw source data.  Schema and code generation would produce consistent database schemas and migration code across all three schema approaches for the Data Hub, Data Warehouse and Data Marts.

This architecture could be depicted as follows:

[Figure: the Model-driven Enterprise Data Architecture]

As we will show in the following sections, we believe that Fact-Based Modeling in general, and the FIML language in particular, answers this data modeling challenge, and provides a sound theoretical and practical basis for the implementation of an Enterprise Data Architecture.

Fact-Based Modeling

Fact-Based Modeling is a semantically-rich conceptual approach for modeling, querying and transforming information.  It is based on the following principles:

  • All facts of interest are conceptually represented in terms of attribute-free structures known as fact types.
  • Conceptual models are developed from concrete examples of the required information.
  • Conceptual models are verbalised in a controlled natural language.

Fact-Based models do not presuppose the entity/attribute mapping underpinning Entity-Relationship Modeling (ER), the Unified Modeling Language (UML) or the majority of other modeling approaches.  It is this characteristic that makes them ideal as the basis for conceptual modeling and for schema generation into different relational forms that amount to different entity/attribute mappings.

Fact-Based Modeling originated in Europe in the 1970s and has since evolved into a family of related approaches.  One of the best known Fact-Based Modeling approaches is Object-Role Modeling (ORM), developed by Terry Halpin and his colleagues in the 1990s.  ORM provides both a textual modeling language and an equivalent graphical modeling language.

Constellation Query Language (CQL) was originally developed in 2008 by Clifford Heath, one of Factil’s co-founders.  CQL is a plain-text representation of ORM, extended to cover queries.  Some ideas were also adopted from the Semantics of Business Vocabulary and Business Rules (SBVR).  One of the design objectives of CQL was to create a concise and precise structured nearly-natural language for conceptual modeling that is readily understood and verified by domain experts.

FIML has been developed by Factil as a subset of CQL, but with extensions in the area of integration logic.

Fact-based Information Modeling Language (FIML)

The Fact-based Information Modeling Language (FIML) is a language for constructing fact-based information models.  It is based on the graphical Object-Role Modeling (ORM) language, with some terminology adopted from the Semantics of Business Vocabulary and Business Rules (SBVR).

A FIML model is comprised of the following kinds of definitions:

  • Object Type definitions, each designated by a name (which may be multiple words, conventionally using initial upper-case).  An object type is either:
    • a value type, which is the type of a single value (e.g., number, name, date, etc)
    • an entity type, which identifies a class of things of interest (e.g., country, product, asset)
    • a named fact type, which is a named relationship (e.g., employment, production forecast)
  • Fact Types that declare a relationship between object types, or a (boolean) characteristic of a single object type.  A fact type is designated by one or more readings, which give a verbal description of the relationship between the object types.
  • Constraints that state conditions which restrict the allowed object instances and facts in the model.
  • Units that are used to automate value conversion.
  • Instances of object types and facts conforming to fact types, which are used as examples, reference data, or for test scenarios.
  • Queries which can be asked in the modeling language, and return either true/false or a result set answer. A query can also be projected as a Derived Fact Type, akin to a view.

FIML is written as a series of controlled natural language statements that are readily understood and verified by business domain experts.  There are a number of word sequences that have special meaning in the language, but the majority of statements are written in close to conventional natural language.  Here is a short summary of the syntax of value types, entity types and fact types in FIML, which will be used in the following examples.

Value Type

A value type identifies a concept in the model with a single value (e.g., number, name, date, etc).  A new value type is based on a parent value type with possible range restrictions and parameters.  A value type is introduced with an “is written as” statement:

each VALUE_TYPE_ID is written as a BASE_VALUE_TYPE_ID(PARAMS)

For example:

each Quantity is written as an Unsigned Integer(32);

FIML itself has no built-in value types (you can use whatever names you like), but certain types are special when mapping to database or programming technologies, including String, Signed and Unsigned Integer, Real, Boolean, Date. Each value instance is identified by its canonical written form (so Real 2.00 is the same value as Real 2).

Value Type Restriction

Value type constraints are introduced inside curly brackets, and are defined by a list of values or value ranges (including ranges with either end open), introduced with a “restricted to” key phrase:

each VALUE_TYPE_ID is written as a BASE_VALUE_TYPE_ID restricted to { RESTRICTION }

For example:

each Season is written as a String restricted to {'Autumn', 'Spring', 'Summer', 'Winter'};

Entity Type

An entity type identifies a named concept in the model (e.g., country, product, asset).  An entity type is introduced with an “is identified by” statement.  Entity type statements have a short form and a long form.  The short form is introduced by “its” and provides implied readings for the entity and its name:

each ENTITY_TYPE_ID is identified by its OBJECT_TYPE_ID

For example:

each Product is identified by its Name;

In the long form, multiple identifying roles can be defined and the fact type readings associated with the roles are stated explicitly.

Subtyping

An entity type may be declared to be a subtype of one or more supertype entity types.  A subtype inherits all the characteristics of its supertypes.  Subtyping is introduced with an “is a kind of” word sequence:

each ENTITY_TYPE_ID is a kind of SUPERTYPE_ENTITY_TYPE_ID

For example:

each Person is a kind of Party

Fact Type

A fact type declares a relationship between two or more object types, or a boolean property of a single object type.  A fact type is designated by one or more readings, which give a verbal description of the relationship between the roles taken by the object types in the fact type.  (Object roles will be explained below.)  A binary fact type is a relationship between two object types, written with a connecting verb or verb phrase:

OBJECT_TYPE_ROLE verb OBJECT_TYPE_ROLE

For example:

Customer raises Order

A ternary or higher fact type is a relationship between three or more object types, and will typically be written with a verb between the first two objects and prepositions before the later objects:

OBJECT_TYPE_ROLE verb OBJECT_TYPE_ROLE preposition OBJECT_TYPE_ROLE 

For example:

Person plays Sport for Country

A unary fact type declares a boolean property of a single object type:

OBJECT_TYPE_ROLE verb

For example:

Person is deceased

Multiple readings and quantifiers

For binary and greater fact types, multiple readings can be given for the different object role orderings to denote the natural way of expressing the relationship in different contexts.

The final object role in a fact type may be preceded by a quantifier to constrain the cardinality of the final object role for a given set of values of all the other object roles.  Valid quantifiers include:

one
exactly QUANTITY
at least QUANTITY
at most QUANTITY
at least QUANTITY and at most QUANTITY
from QUANTITY to QUANTITY

where QUANTITY can be the word “one” or a number.

Here is an example of multiple readings for a fact type concerned with oil supply:

Refinery in Supply Period will make Product in one Quantity,
Refinery will make Product in Supply Period in Quantity,
Refinery will make Quantity of Product in Supply Period;

Named Fact Type

A fact type (including its multiple readings) may be named so it can be used as an object type in the same way as a value or entity type.  Each instance of that fact type is regarded as an object, so we say the fact type is objectified.   A named fact type is introduced by the “is where” key phrase:

each NAMED_FACT_TYPE_ID is where FACT_TYPE

For example:

each Production Forecast is where
        Refinery in Supply Period will make Product in one Quantity,
        Refinery will make Product in Supply Period in Quantity,
        Refinery will make Quantity of Product in Supply Period;

Named Roles in Fact Types

Each fact type definition includes roles, each one played by one object type.  An object type may appear more than once in the same fact type, representing different roles in the relationship.  For example, the Party object type appears twice in a Related Party relationship.  To distinguish them in natural language they might be known as, say, the First Party and the Second Party.  FIML offers several ways of indicating the different roles for the same object type in a fact type: as a named role, as an adjectival role, or by subscripting.

A named object role is introduced by a “( as … )” key phrase:

OBJECT_TYPE_ID ( as OBJECT_TYPE_ROLE )

For example:

Celebration was organised by one Organiser,
    Person (as Organiser) organised Celebration;

An adjectival object role is introduced by a hyphenated adjective either before or after the object type id:

adjective- OBJECT_TYPE_ID
OBJECT_TYPE_ID -adjective

Both forms are allowed in FIML to accommodate languages and customs that place the adjective before or after the noun in an expression.

The first instance of an object type role in a fact type can be an object type id (i.e. the role is implicitly named), but the second and subsequent instances of an object type must take a named role to avoid ambiguity.  Each reading of a fact type has the same role naming for all its object roles.

For example, here are two readings of a fact type where the role of the first instance of Product is implicitly named, and the second instance is adjectivally named:

Product may be substituted by alternate-Product in Season,
alternate-Product is an acceptable substitute for Product in Season

FIML code and ORM diagram

To give a more complete example, here is the FIML code and equivalent ORM diagram for a data model concerned with oil supply production and delivery planning.

Here is the description of the model:

A model of the supply and demand for refined oil. A populated database can be used to optimise profitability by minimising transport costs, maximise supply by allowing substitution of compatible products (with compatibility depending on season) and also to predict shortages.

Here is the full FIML for this model:

each Cost is written as Money;
each Month Nr is written as Signed Integer(32);
each Product Name is written as String;
each Quantity is written as Unsigned Integer(32);
each Refinery Name is written as String(80);
each Region Name is written as String;
each Season is written as String(6) restricted to {'Autumn', 'Spring', 'Summer', 'Winter'};
each Transport Method is written as String restricted to {'Rail', 'Road', 'Sea'};
each Year Nr is written as Signed Integer(32);

each Month [static] is identified by its Nr restricted to {1..12};
Month is in one Season;

each Product is independent identified by its Name;

each Refinery is independent identified by its Name;

each Region is independent identified by its Name;
each Transport Route is where
 Transport Method transportation is available from Refinery to Region,
 Transport Method transportation is available to Region from Refinery;
Transport Route incurs at most one Cost per kl;

each Year is identified by its Nr;

each Acceptable Substitution is where
 Product may be substituted by alternate-Product in Season [acyclic, intransitive],
 alternate-Product is an acceptable substitute for Product in Season;

each Supply Period [separate, static] is identified by Year and Month where
 Supply Period is in one Year,
 Supply Period is in one Month;

each Production Forecast is where
 Refinery in Supply Period will make Product in one Quantity,
 Refinery will make Product in Supply Period in Quantity,
 Refinery will make Quantity of Product in Supply Period;
Production Forecast predicts at most one Cost;

each Regional Demand is where
 Region in Supply Period will need Product in one Quantity,
 Region will need Product in Supply Period in Quantity,
 Region will need Quantity of Product in Supply Period;

FIML and ORM are closely related Fact-Based Modeling languages.  Almost every concept in a FIML model can be represented in an ORM diagram and vice versa.  We will use an ORM diagram to help illustrate the strong conceptual overlap between Fact-Based Modeling and Data Vault modeling.

In the equivalent ORM diagram for this example you can see that the main business concepts (Refinery, Region and Product) are represented as individual nodes with associated naming (their business keys).  The business relationships (named Production Forecast, Transport Route and Regional Demand) are represented as nodes with an internal box structure identifying the elements of the relationship, and a reading of the relationship where the ellipses (…) in the reading correspond to the object roles of the relationship.  Object roles of a relationship can be business concepts or associated data.  The mauve annotations in the diagram indicate uniqueness and mandatory constraints on the fact types, and an acyclic, intransitive constraint on the Product roles of the Acceptable Substitution fact type.

[Figure: ORM diagram of the Oil Supply model]

We have shown the ORM diagram to give a graphical way of visualising how the fully decomposed model is mapped to 3NF, Data Vault and Dimensional schemas from one conceptual model.  In practice, Factil’s toolset allows a FIML model to be directly compiled and generated as an SQL schema following any of these three schema approaches.

Fact-Based Modeling and Enterprise Data Architecture

An essential characteristic of a Fact-Based Model is that it is attribute-free – it does not presuppose any entity/attribute mapping.  A Fact-Based Model can also be described as fully decomposed.  The operation of generating a 3NF, Data Vault or Dimensional model from a Fact-Based Model is a composition operation, where each modeling pattern has a different set of rules to absorb related entities into tables to obtain a schema that is appropriate for the pattern.  In the sections below, we outline these rules for each modeling pattern.

FIML allows annotations embedded in square brackets “[ … ]” to be attached to FIML statements to provide guidance in schema generation and to supplement standard mapping rules.

3NF Schema Generation

To assist with 3NF schema generation, entity and fact types can be annotated as follows.

Annotation   Meaning
[separate]   This entity type should be generated as a separate table

To obtain a 3NF Schema from a Fact-Based Model an iterative generation procedure is performed:

3NF High-Level Generation Procedure:

  1. Assign each object type a tentative table/not-table status based on several heuristics:
    • An n-ary fact type must always be a table
    • An independent or separate object type must be a table
  2. For each object type that still has tentative status, look at the status of all object types which play a role in its identifier to see if a decision can now be made:
    • A unary fact type is always absorbed
    • An entity type whose identifier is a single table and has no other functional references to it can be absorbed
    • An entity type that has more than one possible absorption path but also has other functional dependencies must be a table
    • Some entity types can be fully absorbed in more than one place (e.g. Address)
    • One-to-one references can sometimes be flipped in absorption direction
  3. Repeat until all possible decisions have been made, then make remaining tentative decisions final.

For the Oil Supply example the logical model of the generated 3NF schema looks like:

[Figure: logical model of the generated 3NF schema for the Oil Supply example]
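
As a textual companion to the diagram, here is a sketch of two of the tables such a procedure might emit for this model.  Column names and types are illustrative assumptions; only the value restrictions come directly from the FIML above:

-- Transport Route is an n-ary fact type, so it must be a table;
-- its optional Cost is absorbed as a nullable column.
CREATE TABLE transport_route (
    transport_method VARCHAR(4)  NOT NULL CHECK (transport_method IN ('Rail', 'Road', 'Sea')),
    refinery_name    VARCHAR(80) NOT NULL,
    region_name      VARCHAR(80) NOT NULL,
    cost             DECIMAL(18,2),      -- at most one Cost per kl
    PRIMARY KEY (transport_method, refinery_name, region_name)
);

-- Month is a table in its own right, and the binary fact type
-- "Month is in one Season" is absorbed as a column.
CREATE TABLE months (
    month_nr INTEGER    PRIMARY KEY CHECK (month_nr BETWEEN 1 AND 12),
    season   VARCHAR(6) NOT NULL CHECK (season IN ('Autumn', 'Spring', 'Summer', 'Winter'))
);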

Data Vault Schema Generation

To assist with Data Vault schema generation, entity types can be annotated as follows.

Annotation           Meaning
[satellite <name>]   This entity type should be kept in a satellite table named <name>, together with other entity types tagged with the same satellite name
[static]             This entity type should be generated as a reference or dimension table

To obtain a Data Vault Schema from a Fact-Based Model the following generation procedure is performed:

Data Vault High-Level Generation Procedure:

  1. Generate a normal 3NF schema using the Rmap algorithm, taking note of [separate] annotations
  2. Partition these tables into reference/dimension, hub and link groups. Links have an identifier consisting of references to two or more other tables.
  3. For each hub and link table, create satellite tables to store all functionally related data values
  4. Add timestamps and data source identifier columns
  5. Emit all hub, link and satellite tables, omitting reference tables

For the Oil Supply example the logical model of the generated Data Vault schema looks like:

[Figure: logical model of the generated Data Vault schema for the Oil Supply example]

In this example, no annotations are required in order to get an acceptable Data Vault schema for this conceptual model.
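
For illustration, here is a sketch of the kind of tables the procedure might emit for the Refinery hub and the Production Forecast link of this model, including the timestamp and data source columns of step 4.  Table and column names are illustrative assumptions:

-- Step 2: Refinery is a hub (its identifier is a single business key)
CREATE TABLE refinery_hub (
    refinery_hid   BIGINT PRIMARY KEY,
    refinery_name  VARCHAR(80) NOT NULL UNIQUE,
    load_ts        TIMESTAMP   NOT NULL,   -- step 4
    record_source  VARCHAR(48) NOT NULL    -- step 4
);

-- Step 2: Production Forecast is a link (it references several other tables)
CREATE TABLE production_forecast_link (
    production_forecast_lid BIGINT PRIMARY KEY,
    refinery_hid      BIGINT NOT NULL REFERENCES refinery_hub,
    product_hid       BIGINT NOT NULL,     -- would reference product_hub (not shown)
    supply_period_hid BIGINT NOT NULL,     -- would reference supply_period_hub (not shown)
    load_ts           TIMESTAMP   NOT NULL,
    record_source     VARCHAR(48) NOT NULL
);

-- Step 3: the functionally related values move to a satellite,
-- with history kept by load timestamp
CREATE TABLE production_forecast_sat (
    production_forecast_lid BIGINT NOT NULL REFERENCES production_forecast_link,
    load_ts   TIMESTAMP NOT NULL,
    quantity  INTEGER   NOT NULL,
    cost      DECIMAL(18,2),               -- Production Forecast predicts at most one Cost
    PRIMARY KEY (production_forecast_lid, load_ts)
);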

Dimensional Schema Generation

In our Enterprise Data Architecture, Data Marts to be used for business reporting and analytics purposes are generated from the aligned and integrated data in the data warehouse.  For ease of reporting Data Marts are modeled using the Dimensional schema approach.  Data Marts are essentially dimensionally modeled views or queries of data in the data warehouse.  Data Marts may be virtual, computed on request, or materialised, where they are saved as star schema tables in a relational database or as data cubes in an OLAP database.

FIML has a powerful query language, which is used as the basis for generating the virtual or materialised views from the data warehouse that present information in the enterprise Data Marts.  The FIML query language is described in more detail in our Introduction to FIML article.
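
As a sketch of what a virtual Data Mart could look like at the SQL level, a dimensional view might join a link back to its hubs and pick the current satellite row.  This assumes the Data Vault tables sketched in the previous section, plus analogous product_hub and supply_period_hub tables (all names are illustrative):

-- A virtual Data Mart: a dimensional view computed over Data Vault tables
CREATE VIEW production_forecast_mart AS
SELECT rh.refinery_name,
       ph.product_name,
       sp.year_nr,
       sp.month_nr,
       pfs.quantity,
       pfs.cost
FROM production_forecast_link pfl
JOIN refinery_hub      rh ON rh.refinery_hid = pfl.refinery_hid
JOIN product_hub       ph ON ph.product_hid = pfl.product_hid
JOIN supply_period_hub sp ON sp.supply_period_hid = pfl.supply_period_hid
JOIN production_forecast_sat pfs
  ON pfs.production_forecast_lid = pfl.production_forecast_lid
 AND pfs.load_ts = (SELECT MAX(s.load_ts)   -- current satellite row per link
                    FROM production_forecast_sat s
                    WHERE s.production_forecast_lid = pfl.production_forecast_lid);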

Conclusion and Further Materials

We have presented an Enterprise Data Architecture based on a consistent and semantically-rich conceptual model used to generate the 3NF, Data Vault and Dimensional schemas of the Data Hub, Data Warehouse and Data Marts of the architecture.  Fact-Based Modeling is an ideal approach for conceptual modeling in this architecture because it is attribute-free and can be expressed in a language that is both precise and easily understood by domain experts.

We have introduced the relevant aspects of the FIML language for supporting this architecture, and given examples illustrating the power and flexibility of the fully decomposed modeling approach.

Factil is developing the collaborative modeling tools to support this data architecture approach.

References

[1] Halpin, T., 2015, Object-Role Modeling Fundamentals, Technics Publications.

[2] Hultgren, H., 2012, Modeling the Agile Data Warehouse with Data Vault, New Hamilton.

[3] Linstedt, D., 2010, Super Charge your Data Warehouse, Dan Linstedt.
