Building data warehouse
I recently had experience of building a warehouse. Here I have described OLTP, OLAP, DWH approaches and various ways of designing schema.
Online Transaction Processing (OLTP)
OLTP is a class of software program capable of supporting transaction oriented applications on internet. OLTP systems are designed to optimize transaction processing by storing the data which is easy to maintain and retrieve for transaction use cases. The schema of data in OLTP databases are highly normalized.
OLTP database schema is highly normalized, below is an example of normalized database schema.
Below is few characteristics of normalized schema.
- Insert Optimized
- Redundancy is reduced
- Changes have less impact on database. If product category changed, then only single record in product table needs modification.
- Query complexity is increased. Many tables have to joined together to answer simple question. i. e. what is the sum of sale amount in each state?
Online Analytical Processing (OLAP)
OLAP is a class of software program capable of supporting business intelligence, data discovery, report viewing. OLAP systems are designed to optimize data aggregation, rollup, drill, slice, and dice. The schema of data in OLAP databases are often semi-normalized (Star Schema) or de-normalized. Below is example semi-normalized schema, also called Dimensional schema or Star schema.
Below is few characteristics of Dimensional schema (Star schema).
- The central fact table is surrounded by dimension tables in Star schema.
- Table relationships are only one level deep, No more than two tables need to be joined together for common business questions and aggregations. i.e. what is the sum of sale amount for each state?
Data Warehouse (DWH)
DWH is a database, storing organization/department wide data. DWH have to be good not just in storing but also retrieving of the data. Depending on use case, data schema can be modelled in such way that retrieving data is very efficient.
There are two school of thoughts when it comes to how you can create DWH.
- Inmon (top-down approach)
- Kimball(bottom-up approach)
Inmon way of DWH design-
Focuses on high level business entity to organize data around subject. All organization level data is stored in single DWH and then department wise separate database are created. These separate department wise databases are called Data marts. In this approach, DWH schema resembles to OLTP system and Data marts schema resembles to OLAP system. Data marts are used to perform BI queries.
Following is the data follow direction in Inmon model.
External Data Source->ETL->DWH->Data Marts
Pro:
- Segregated reporting stores (DB) department wise, gives ownership to department and greater flexibility department wise in a big Enterprise. Security and performance of separate department data mart remains independent.
Cons:
- Costly to develop EDW which requires a lot of effort to develop as it’s based on ER. Also effort to develop Data marts for all the departments from EDW.
- Complex to design.
- End user accessibility is lower compare to Kimball Approach
Kimball way of DWH design
All data is collected into a single DWH. All BI queries are directed to this DWH. The schema of data in DWH is modeled as per OLAP system. DWH is the sum of all the data marts, each representing a business process in organization by a means of a star schema, or a family of star schemas of different granularity. Unlike Inmon design, Kimball design have no physical data marts but only logical ones residing inside DWH itself.
Pro:
- Lower cost and time to production.
Cons:
- Can absorb limited change from source system. As DWH is modelled in OLAP way rather than OLTP way.
Kimball vs Inmon
DWH Schema
Majorly there are two types of system. OLTP based and OLAP based. Database in both of these system differ the way database schema is structured. OLTP database follow Entity Relationship Diagram (ERD) model which are highly normalized. OLAP database follow Dimensional Model (Star schema and variants) which are semi-normalized and are made up of Fact Tables and Dimension Tables.
Fact Table — Data which is a quantity, is a measurement or is a number, typically qualifies as Fact table.
Dimensional Table — Data which is qualitative, is a description or is a text typically qualifies as Dimensional Table.
The link between the fact and dimension table is through the referential integrity. The dimension table has the primary key while the fact table has the foreign key. The referential integrity is a concept of establishing a parent-child relationship between two tables with the purpose of ensuring that every row in the child table has a corresponding parent entry in the parent table. We can enforce the referential integrity as either hard referential integrity or soft referential integrity. The former is enforced at the table level while the later can be enforcing through the ETL.
Following are the different type of schema units in Dimensional Modelling.
1. Star Schema: Fact Table is linked with dimension table in star like structure. Dimension tables are de-normalized.
Pro
- Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
- Provide highly optimized performance for typical star queries.
- Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables
- Beneficial for large size data with limited dimension table rows
Cons
- Dimension table has data redundancy
2. Snowflake Schema: Similar to Star Schema except that some Dimension tables are normalized.
Pro
- Reduced redundancy compare to Star schema.
- Beneficial for small size data with huge dimension table rows
Cons
- Fact table query is complex and requires more number of joins.
DWH Schema design
There are three major steps while designing DWH schema.
- Conceptual Data Modeling — Fact and Dimension mapping
- Logical Data Modeling — Columns (Tables attributes) identification
- Physical Data Modeling — executable scripts(i.e. SQL scripts)
It’s very crucial to identify fact table and dimension table while modeling Star schema.
If data is answer to the any of below queries, then its most likely qualifies as dimension of Star schema.
- When(time dim)
- Where(Location Dim)
- How (online/offline)
- Which(product)
- Who(user/customer)
Below are few specific type of Dimension tables-
Conformed Dimension:
Dimension that can be shared between multiple Fact Tables. This is the dimension through which multiple Star schema can be joined together.
Fig- Sales and Inventory are two Star schema system, both share Date, Product and Store dimension. These shared dimension are Conformed dimension.
Degenerate Dimension:
Dimensional key in Fact table. For example order id, bill no etc. in a fact table are typically qualifies as Degenerate Dimension because they don’t have any other attribute to make them as a separate dimension. Also these Degenerated Dimension if separated will be as huge as fact table itself and hence its good idea to keep them in fact table itself. Can be used to drill down across multiple fact table also if required.
Inferred Dimension:
New column (Inferred Flag) in Fact table to track if we expect a new data in future.
Junk Dimension:
Attributes which together don’t form a logical entity and can be grouped together to form a separate dimension.
Monster Dimension:
Millions of rows in dimension. Typically a dimension which have huge number of rows qualifies as Monster Dimension.
Mini Dimension:
When Monster Dimension has many Type 2 field(Type 2 fields are explained later here), i.e. 5 Mil rows,100 attribute and if 10% of attribute change yearly, then type-2 nature of dimension will create 50 million more rows for the table which is not scalable. Hence group related attribute of the Dimension and create another Mini Dimension.
Outrigger Dimensions:
Dimension table references another dimension table.
Role playing Dimension:
Multiple Foreign key in Fact table pointing to single Dimension Table. For example multiple dates (order date, shipping date) in fact table pointing to Date dimension.
Bridge Dimension Table:
Identifying an employee’s entire management chain or drilling down to identify the activity for all employees who directly or indirectly work for a given manager.
In below figure, the bridge table has one row for each manager and each employee who is directly or indirectly in their management chain, plus an additional row for the manager to himself. The bridge joins shown in Figure below enable you to drill down within a manager’s chain of command.
Now we have reviewed few specific type of Dimension table. As we have seen these dimensional table store some of the attribute about the fact table and some these attribute may change with time. Dimensional table which change with time are called Slowly Changing Dimension (SCD).
Below are different type of SCD table.
SCD (Slowly Changing Dimension):
- SCD Type 0: after column updated in OLTP , no change in OLAP column
- SCD Type 1: after column updated in OLTP, change in OLAP column. Used when historical data is not maintained in OLAP db.
- SCD Type 2: after column updated in OLTP, new record added in OLAP db.
- Hybrid: mix of type 0,1,2 for different column
- SCD Type 3: column to track previous and current value in OLAP for update in OLTP.
- SCD Type 4: Mini Dimension(mentioned earlier)
- SCD Type 5: include Mini Dimension primary key in Monster Dimension.
- SCD Type 6: Type 1 attribute in Type-2 dimension. Which mean we have a type 2 dimension but one of the attribute in type-2 dimension table behaves as type-1
Now that we have looked into Dimension table in details. Let’s dive deep into characteristics of Fact table and few Specific type of Fact table.
Fact Table Types
- Transaction Fact Table:
Aggregation is possible and additive
- Periodic Snapshot:
Data span over time. Semi-additive
- Accumulated Fact:
Span over business process, i.e. processing of an order. Fact table will contain columns i.e. order date, process date, ship date, close date
- Fact less Fact:
Only surrogate keys in Fact table. Surrogate keys are just numbers and don’t have any business value and are typically used as PK.
- Aggregate Fact Tables or OLAP Cubes –
Fact data sometime can be huge. Aggregation of fact data by rolling up one or more dimension reduces the size.
- Consolidated Fact Tables :
It is often convenient to combine facts from multiple processes together into a single consolidated fact table if they can be expressed at the same grain. For example, sales actuals can be consolidated with sales forecasts in a single fact table to make the task of analysing actuals versus forecast simple and fast, as compared to assembling a drill-across application using separate fact tables.
- Centipede Fact Table :
Many to one dimension table split in to many dimension such as a date dimension, month dimension, quarter dimension and year dimension and then include all these foreign key into fact tables. This results is a centipede fact table with dozens of hierarchically related dimensions. Centipede fact tables should be avoided.
Let me know if it has been helpful in comments. Thanks!