This article introduces and presents the strong conceptual overlap between Fact-Based Modeling and the Data Vault data warehouse modeling, and the practical results that can be achieved for Data Vault schema generation from a fact-based model.

Fact-Based Modeling is a semantically-rich conceptual approach for modeling, querying and transforming information. It is based on the following principles:

  • All facts of interest are conceptually represented in terms of attribute-free structures known as fact types.
  • Conceptual models are developed from concrete examples of the required information.
  • Conceptual models are verbalised in a controlled natural language.

In this article we will be concerned with two fact-based modeling approaches: Object-Role Modeling (ORM) provides both a textual modeling language and an equivalent graphical modeling language, and Factil’s Fact-based Information Modeling Language (FIML). A hallmark of FIML it is that concise and precise structured nearly-natural language for conceptual modeling that is readily understood and verified by domain experts. This article includes a short introduction to FIML. A more comprehensive introduction can be found in the Introduction to FIML article.

Data Vault is a relational data modeling approach that is optimised for data warehouses. It is distinguished from Third Normal Form (3NF) modeling which is suited for operational systems and Star Schema (Dimensional) modeling which works well for Data Marts.

The underlying premise behind Data Vault modeling is that data associated with business keys, relationships and associated information is decomposed in a specific way so to provide flexibility, adaptability and agility in a data warehouse context.

At a high-level we see Data Vault as a partially decomposed modelling approach, compared to Fact-Based Modeling which is fully decomposed.

Fact-Based Modeling tools have well establishing composition rules for generating a 3NF schema from a fact-based model. This article introduces composition rules for generating a Data Vault schema from a fact-based model.

This article provides an overview of the Data Vault and subset of the FIML language concerned with Data Vault schema generation, and a summary with examples of Data Vault schema generation from fact-based models. Links to further resources can be found at the end of the article.

Data Vault Overview

Unified Decomposition

Data Vault breaks things into their component parts for flexibility, adaptability, agility, and generally to facilitate the capture of things that are either interpreted in different ways or changing independently of each other – to decompose.

At the same time, Data Vault moves to a common standard view of unified concepts – to unify.

Data Vault models the component parts of a core business concept as an ensemble consisting of:

Hub, which holds the business key but no descriptive data. There is a single hub in each ensemble. A hub can only be connected to links and satellites within its ensemble.

Links, which represent the Business Relationships, but no descriptive data. A link can have binary or n-ary connections to hubs. Links can be connected to satellites.

Satellites contain all context, descriptive data and history. A satellite must have only one connection, which can be to a hub or link.

TabDVDim.jpg

In this and the following diagrams, hubs are shown as blue boxes, links as green boxes and satellites as yellow boxes. The ensemble is shown as a red circle. There is one hub per ensemble. Links are shown on the edge of an ensemble because they are linked to other hubs.

In Data Vault the hubs and links form the backbone of the schema.

DVBackbone.jpg

The purpose of hubs and links is to identify the business concepts and the relationships between business concepts. The physical schema of hubs and links is tightly controlled.

A hub consists of:

  • Hub Surrogate Key (primary key)
  • Business Key (one or more fields)
  • Creation date/time field
  • Record source field
A link consists of:

  • Link Surrogate Key (primary key)
  • Two or more foreign keys to hubs
  • Creation date/time field
  • Record source field
The physical schema of satellites is more flexible:

  • Single foreign key to either link or hub
  • Creation date/time field
  • One or more data fields
  • Record source field
A full Data Vault model might look like this:

DVBackboneConcept.jpg

Link relationships can be read as a phase. In this example:

Product was sold with Sale according to Sale LI

Architectural Advantages of Data Vault for Data Warehouse

  • Handling change flexibility
  • Managing history efficiently
  • Data loading efficiencies

Handling Change Flexibly

1. Backbone nodes are never updated or destroyed

2. New concepts added incrementally

This example shows the addition of the Employee ensemble. Because of the structural rules of Data Vault this new ensemble can be added with requiring any change to the existing schema.

DVConceptEmployee.jpg

3. Attribute value changes

Descriptive attribute value changes are localised to adding a new record to a small satellite table. There is never the need to change any records in the Data Vault backbone.

In contrast in a 3NF or Dimensional schema this would require a record update, often bracketed by a transactional lock.

4. New attribute changes

A new attribute change is localised to a schema changes to a small satellite table or addition of a new satellite table. There is no impact on hubs, links or other satellites

In contrast in a 3NF or Dimensional schema this would require a large table schema change.

In the example below, suppose a new Loyalty Rating attribute is to be added. In 3NF or Dimensional schemas this would cause a major change to the Cust_Table or Cust_DIM tables. But within Data Vault this could be accommodated with either a new satellite or a change to a small satellite table containing related data.

TabDVDimAttrib.jpg

Managing History Efficiently

Hubs and Links form the backbone of a Data Vault schema. Records in Hub and Link tables can be created and read, but they are not updated or deleted. Descriptive data which can change of time is held in Satellite tables. Descriptive attribute value changes are localised to adding a new record to a small satellite table. Existing Satellite records are kept so as to maintain a historical record of the values of this descriptive data. The current value of the descriptive data can be found by selecting the most recent record for a particular Hub or Link foreign key.

Some Data Vault implementations maintain a current flag to improve the efficiency for making the search for the most recent record for a particular Hub or Link foreign key. When this approach is taken there is an additional cost when inserting a new record into a Satellite to check whether there is an existing the record with this foreign key, and updating the record to switch off this flag before the new record is added.

Apart from the exception for maintaining current flags, records in the Data Vault schema are not updated or deleted. Updates and deletes are expensive operations and are common in 3NF and Dimensional models.

DVBackboneHistory.jpg

Data Loading efficiencies

  1. Load new Hubs. All new Hubs can be loaded in parallel because there are no key dependencies.
  2. Load new Links and Hub Satellite updates. All new Links and Hub Satellite updates can be loaded in parallel because they are only dependent on Hub foreign keys.
  3. Load Link Satellite updates. All Link Satellite updates can be loaded in parallel because they are only dependent on Link foreign keys.

 

Fact-Based Modeling and FIML

Within in Fact-Based Modeling there is no division between entities and attributes -- they are modeled in the same way. Whereas the traditional model approaches treat establishing the distinction between entities and attributes as a step toward implementation (i.e. moving from the conceptual to the logical to the physical model), Fact-Based Modeling focuses on declaring the relationships, cardinalities and constraints between the concepts. Fact-based models create a semantically rich conceptual model of domain of discourse. The physical schemas that then be obtained directly by applying composition rules to the conceptual model.

Different relational forms, 3NF, Data Vault and Dimensional, can be generated from the one conceptual model by applying different composition rules.

In order to describe the composition rules for Data Vault, we will give a shortened introduction to FIML. A more comprehensive introduction can be found in the Introduction to FIML article.

Fact-based Information Modeling Language (FIML)

A FIML model is composed of the following kinds of statements:
  • Object Type definitions, each designated by a name (which may be multiple words, conventionally using initial upper-case). An object type is either:

    • a value type, which is the type of a single value (e.g., number, name, date, etc)
    • an entity type, which identifies a class of things of interest (e.g., country, product, asset)
    • a named fact type, which is a named relationship (e.g., employment, production forecast)
  • Fact Types that declare the relationship between object types or a (boolean) characteristic of a single object type. A fact type is designated by one or more readings which gives a verbal description of the object types' relationship.
  • Constraints that state conditions which restrict the allowed object instances and facts in the model.
  • Units that are used to automate value conversion.
  • Instances of object types and facts conforming to fact types, which are used as examples, reference data, or for test scenarios.
  • Queries which can be asked in the modeling language, and return either true/false or a result set answer. A query can also be projected as a Derived Fact Type, akin to a view.
Here is a short summary of the syntax of value types, entity types and fact types FIML which is relevant for discussing Data Vault composition.

Value Type

each VALUE_TYPE_ID is written as a BASE_VALUE_TYPE_ID(PARAMS)

each Quantity is written as an Unsigned Integer(32);

Value Type Restriction

each VALUE_TYPE_ID is written as a BASE_VALUE_TYPE_ID restricted to { RESTRICTION }

each Season is written as a String restricted to {'Autumn', 'Spring', 'Summer', 'Winter'};

Entity Type

each ENTITY_TYPE_ID is identified by its OBJECT_TYPE_ID

each Product is identified by its Name;

Subtyping

each ENTITY_TYPE_ID is a kind of SUPERTYPE_ENTITY_TYPE_ID

each Person is a kind of Party

Fact Type

binary fact type

OBJECT_TYPE_ROLE verb OBJECT_TYPE_ROLE

Customer raises Order

ternary

OBJECT_TYPE_ROLE verb OBJECT_TYPE_ROLE preposition OBJECT_TYPE_ROLE ...

Person plays Sport for Country

unary

OBJECT_TYPE_ROLE verb

Person is deceased

Multiple readings and quantifiers

The final object role in a fact type may be preceded by a quantifier to a constraint to the cardinality of the final object role for a given set of all but the last object role. Valid quantifiers include:

one exactly QUANTITY at least QUANTITY at most QUANTITY at least QUANTITY and at most QUANTITY from QUANTITY to QUANTITY

one Here is an example of multiple readings for a fact type concerned with oil supply:
Refinery in Supply Period will make Product in one Quantity,
Refinery will make Product in Supply Period in Quantity,
Refinery will make Quantity of Product in Supply Period;

Named Fact Type

objectified

each NAMED_FACT_TYPE_ID is where FACT_TYPE

each Production Forecast is where
        Refinery in Supply Period will make Product in one Quantity,
        Refinery will make Product in Supply Period in Quantity,
        Refinery will make Quantity of Product in Supply Period;

Named Roles in Fact Types

A named object role is introduced by a "( as ... )" key phrase:

OBJECT_TYPE_ID ( as OBJECT_TYPE_ROLE )

Celebration was organised by one Organiser,
    Person (as Organiser) organised Celebration;

adjectival object role

adjective- OBJECT_TYPE_ID OBJECT_TYPE_ID -adjective

The first instance of an object type role in a fact type can be a object type id (i.e. the role is implicitly named), but the second and subsequent instances of an object type must take a named role to avoid ambiguity. Each reading of a fact type has the same role naming for all its object roles. For example, here are two readings of a fact type where the role of the first instance of Product is implicitly named, and the second instance is adjectivally named:
Product may be substituted by alternate-Product in Season,
alternate-Product is an acceptable substitute for Product in Season

FIML code and ORM diagram

Here is the description of the model:

A model of the supply and demand for refined oil. A populated database can be used to optimise profitability by minimising transport costs, maximise supply by allowing substitution of compatible products (with compatibility depending on season) and also to predict shortages.

each Cost is written as Money;
each Month Nr is written as Signed Integer(32);
each Product Name is written as String;
each Quantity is written as Unsigned Integer(32);
each Refinery Name is written as String(80);
each Region Name is written as String;
each Season is written as String(6) restricted to {'Autumn', 'Spring', 'Summer', 'Winter'};
each Transport Method is written as String restricted to {'Rail', 'Road', 'Sea'};
each Year Nr is written as Signed Integer(32);

each Month [static] is identified by its Nr restricted to {1..12}; Month is in one Season;

each Product is independent identified by its Name;

each Refinery is independent identified by its Name;

each Region is independent identified by its Name; each Transport Route is where Transport Method transportation is available from Refinery to Region, Transport Method transportation is available to Region from Refinery; Transport Route incurs at most one Cost per kl;

each Year is identified by its Nr;

each Acceptable Substitution is where Product may be substituted by alternate-Product in Season [acyclic, intransitive], alternate-Product is an acceptable substitute for Product in Season;

each Supply Period [separate, static] is identified by Year and Month where Supply Period is in one Year, Supply Period is in one Month;

each Production Forecast is where Refinery in Supply Period will make Product in one Quantity, Refinery will make Product in Supply Period in Quantity, Refinery will make Quantity of Product in Supply Period; Production Forecast predicts at most one Cost;

each Regional Demand is where Region in Supply Period will need Product in one Quantity, Region will need Product in Supply Period in Quantity, Region will need Quantity of Product in Supply Period;

In the equivalent ORM diagram for this example you can see that the main business concepts (Refinery, Region and Product) are represented as individual nodes with associated naming (their business keys). The business relationships (named as Production Forecast, Transport Route and Region Demand) are represented as nodes with an internal box structure identifying the elements of the relationship, and a reading of the relationship where the ellipsis (...) in the reading correspond to the object roles of the relationship. Object roles of a relationship can be business concepts or associated data. The mauve annotations in the diagram indicate uniqueness constraints and mandatory constraints on the fact types, and an acyclic, intransitive constraint on Product roles of the the Acceptable Substitution fact type.

OilrRefineryORM.jpg

We have shown the ORM diagram to give a graphical way of visualising how the fully decomposed ORM diagram could be mapped to a Data Vault schema. In practice, Factil's toolset allows a FIML model to be directly compiled and generated as an Data Vault SQL schema.

Data Vault schema generation

The operation of generating a Data Vault schema from a Fact-Based Model is a composition operation, where there are composition rules to absorb related entities into tables to obtain a schema. In the sections below, we outline these set of rules for each modeling pattern.

FIML allows annotations embedded in square brackets "[ ... ]" to be attached to FIML statements to provide guidance in schema generation and to supplement standard mapping rules.

Data Vault Schema Generation

To assist with Data Vault schema generation, entity types can be annotated as follows.

Annotation Meaning
[ satellite <name> ] This entity type should be kept in a satellite table with name <name> with other entities tagged with the same satellite name
[separate] This entity type should be generated as a separate table
[static] This entity type would be a reference or dimension table
 

To obtain a Data Vault Schema from a Fact-Based Model the following generation procedure is performed:

Data Vault High-Level Generation Procedure:

  1. Generate a normal 3NF schema, using the 3NF Generation Procedure, taking note of [separate] annotation.
  2. Partition these tables into reference/dimension, hub and link groups. Links have an identifier consisting of references to two or more other tables.
  3. For each hub and link table, create satellite tables to store all functionally related data values
  4. Add timestamps and data source identifier columns
  5. Emit all hub, link and satellite tables, omitting reference tables
And here is the 3NF generation procedure:

3NF High-Level Generation Procedure:

  1. Assign each object type a tentative table/not-table status based on several heuristics:

    • A n-ary fact type must always be a table
    • An independent or separate object type must be a table
  2. For each object type that still has tentative status, look at the status of all object types which play a role in its identifier to see if a decision can now be made:

    • A unary fact type is always absorbed
    • An entity type whose identifier is a single table and has no other functional references to it can be absorbed
    • An entity type that has more than one possible absorption path but has other functional dependencies, it must be a table
    • Some entity types can be fully absorbed in more than one place (e.g. Address)
    • One-to-one references can sometimes be flipped in absorption direction
  3. Repeat until all possible decisions have been made, then make remaining tentative decisions final.

Using the CQL compiler to generate Data Vault schema

CQL was originally developed as part of the Active Facts project principally developed by Clifford Heath. This project includes the CQL compiler afgen, which reads a CQL or ORM data model and can generate any of a range of application and systems development outputs, including a Data Vault SQL schema.

For the Oil Supply example, the generated Data Vault Schema looks like:

OilRefineryDV.jpg

In this example, no annotations are required to obtain a Data Vault schema backbone.

The CQL compiler command line to generate Data Vault SQL schema is:

afgen --transform/datavault --sql/server OilSupply.cql > OilSupply-DV.sql
This file can be directly incorporated into your standard data modeling process, including being read into standard ER modeling tools. Here is an ER diagram generated by such a tool directly from the OilSupply Data Vault SQL file:

OilRefineryDVSchema.jpg

Although the font size of this picture is probably too small to see in detail, it is clear from the diagram structure that all tables, attributes and their types, primary and foreign keys and associated constraints have been defined in this Data Vault SQL file.

Full details about downloading and using the open source FIML complier can be found in Introduction to FIML.

Conclusion and Further Resources

This article has introduced the strong conceptual overlap between Fact-Based Modeling and the Data Vault data warehouse modeling, and the practical results that can be achieved for Data Vault schema generation from a fact-based model.

Both Data Vault and Fact-Based Modeling are based on the notion of decomposition of the data model. Data Vault's decomposition is partial, the with guidelines for creating a Data Vault model directed by the practicalities of implementing a flexible, adaptable and agile data warehouse.

Fact-based models are fully decomposed but semantically rich in terms of the relationships, cardinality and constraints between all concepts. The richness of Fact-Based Modeling allows the definition of composition rules that can create high-quality consistent 3NF and Data Vault schemas.

These composition rules have been implemented and released in the open source FIML compiler.

References

[1] Halpin, T., 2015, Object-Role Modeling Fundamentals, Technics Publications.

[2] Hultgren, H., 2012, Modeling the Agile Data Warehouse with Data Vault, New Hamilton.

[3] Linstedt, D., 2010. Super Charge your Data Warehouse. Dan Linstedt.