Designing and implementing a Data Warehouse: Part 4

By: John Charles (Juan Carlos) Olamendy Turruellas

Introduction

Business Intelligence has become a buzzword in recent years as a support to decision making. Today we can find several database systems that include data warehousing, online analytical processing (OLAP), and data mining technologies. Data warehousing provides efficient storage, maintenance, and retrieval of data. OLAP is a service that provides a way to create ad hoc queries against the data warehouse in order to answer important business questions. Data mining is a discipline comprising several algorithms for discovering knowledge in large volumes of data.

In order to build a data warehouse solution, we need to model a consistent architecture in which the operational data fits well into an integrated, enterprise-wide view, as well as take into consideration a handful of implementation strategies that help deliver a high-quality application. The design and implementation of a data warehouse solution is sometimes a very complex challenge in theory and practice. In this article, I will cover the main principles and techniques for designing and implementing a data warehouse, drawing on my own experience with such an overwhelming challenge.

This is the fourth article in a series of articles that I am writing in order to share my knowledge and experience in this subject matter. In this part and the previous one, we see how to implement the technical solution for a data warehouse.

Implementing the technical solution

In this section, we’ll continue working on the technical solution.

After the steps in the previous article, click on the Next button, and the Select Measures page appears, displaying the measures associated with the fact table. Click on the Next button, and the Detecting Hierarchies page processes each dimension in order to find potential hierarchies among the attributes inside the dimension. Click on the Next button, and you can review the hierarchies found for each dimension. In this case, you can see that two hierarchies were found for the TimePeriodDim dimension (see Figure 1).

Figure 1

Click on the Finish button in order to create the cube. The Cube editor will then be displayed (see Figure 2).

Figure 2

You can add new measures to the cube by right-clicking the corresponding measure group in the Measures window and selecting the New Measure option (see Figure 3).

Figure 3

The New Measure dialog box then lets you select the attribute and the type of measure operation to apply. In this case, we add a new measure to get the maximum values of the SalesVolumes attribute in the SalesFact fact table.

You can also edit a dimension by right-clicking the particular dimension in the Dimensions window of the Cube editor, or by navigating the tree in the Solution Explorer to the particular dimension node under the Dimensions parent node. The Dimension editor will then be displayed.

Let's add a Subcategory-Color hierarchy to the Product dimension by opening the Dimension editor on the Product dimension. First, drag and drop the Subcategory attribute from the Attributes window onto the Hierarchies and Levels pane in the Dimension editor (see Figure 4).

Figure 4

Then drag and drop the Color attribute inside the new hierarchy, below the Subcategory attribute (see Figure 5).

Figure 5

Now it's time to deploy the solution from the development environment to the production environment so that end users can consume reports from the cube. First, we need to configure the deployment scenario by going into the project properties via the Project | ProjectName Properties menu. Then select the Deployment node and set the Server property to the name of your production server (see Figure 6).

Figure 6

Then go to the Build | Deploy ProjectName menu to invoke the deployment process.

Once you have deployed your solution, you can create custom reports to display the information to your users. You have several options for the data warehouse presentation layer. You can create reports using Microsoft technologies such as Excel pivot tables for Windows applications and Office Web Components for Web applications. You can also find several third-party tools from vendors such as Cognos, ProClarity, MicroStrategy, and Business Objects.

You can also write your own queries against the cubes using a language named Multidimensional Expressions (MDX). MDX is to OLAP databases what SQL is to relational databases. Using this language, you can build your own data warehouse front-end components and customize the views produced by third-party tools.
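Before moving on to Excel, here is a minimal MDX sketch against the TestDW cube built in this series. The measure and dimension member names are assumptions based on what the Cube Wizard typically generates from the underlying column names, so you may need to adjust them to match your cube:

// Show both measures broken down by customer type (member names assumed).
SELECT
  { [Measures].[Sales Volumes], [Measures].[Shipped Units] } ON COLUMNS,
  [Customer Dim].[Customer Type].Members ON ROWS
FROM [TestDW]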

Now we're going to discuss how to use pivot tables in Microsoft Excel 2007 as the main client tool to produce custom OLAP reports. You can think of a pivot table as a way to create summary reports for analyzing your data. The main structure of a pivot table is:

  1. PivotTable Field List. This section in the top right displays the available fields. You may check a field or drag it to a quadrant in the lower section.
  2. The lower right quadrants. This area defines where and how the data shows on the pivot table. You can have a field show in either a column or a row. You may indicate whether the data should be counted, summed, averaged, filtered, and so on.
  3. The outlined area to the left is the result of your selections from (1) and (2).

Let's create an Excel pivot table to display the TestDW cube created before. The first step is to create a connection to the Analysis Services multidimensional database. On the Data ribbon, choose the From Analysis Services option in the From Other Sources drop-down (see Figure 7). Excel will display the Data Connection Wizard so that you can configure your connection string and establish a link to the server.

Figure 7

In the first page of the Wizard, enter the name of the server as well as the authentication information (see Figure 8).

Figure 8

Click on the Next button, and select the target database from the drop-down box. When you select the TestAS database, a list of the cubes it contains is shown below the drop-down. Choose the TestDW cube (see Figure 9).

Figure 9

Click on the Next button in order to add some descriptive information related to the data connection (see Figure 10).

Figure 10

Finally, click on the Finish button. You will immediately see the Import Data dialog box; select the PivotTable Report option, and the PivotTable object will be displayed in the current worksheet (see Figure 11).

Figure 11

As you can see, the measures are listed under the Sigma icon node, and the dimensions and their attributes are listed under the table icon nodes. In the case of the Product dimension, you can see attributes such as Color, Product Business ID, Product Dim, and Subcategory Name, as well as the defined Subcategory Name - Color hierarchy (see Figure 12).

Figure 12

Now it's time to analyze the sales volumes and shipped units by Customer, Product, and Region. In the PivotTable Field List window, click on the Sales Volumes and ShippedUnits measures under the Sales Fact node. You can see in the Values pane, in the lower portion of the window, that these measures are added. Now click on the CustomerType attribute of the Customer dimension; this attribute is added to the Row Labels pane in the lower portion of the window. The worksheet also displays the measures grouped by the CustomerType attribute (see Figure 13).

Figure 13

Now let's enhance the report by also comparing across the product Subcategory Name-Color hierarchy and the region, by clicking on the SubcategoryName-Color hierarchy folder under the Product dimension and on the Region Name attribute under the Region dimension. The hierarchy and attribute are added to the Row Labels pane in the lower region of the window. Now you can drill down or roll up to see the summaries of the measures (see Figure 14).

Figure 14

Conclusion

In this series of articles, I've covered the principles, methodology, techniques, and main technologies for the development of a data warehouse solution using Microsoft technologies such as SQL Server 2005 Analysis Services and Microsoft Excel 2007.

Designing and implementing a Data Warehouse: Part 3

Introduction

Business Intelligence has become a buzzword in recent years as a support to decision making. Today we can find several database systems that include data warehousing, online analytical processing (OLAP), and data mining technologies. Data warehousing provides efficient storage, maintenance, and retrieval of data. OLAP is a service that provides a way to create ad hoc queries against the data warehouse in order to answer important business questions. Data mining is a discipline comprising several algorithms for discovering knowledge in large volumes of data.

In order to build a data warehouse solution, we need to model a consistent architecture in which the operational data fits well into an integrated, enterprise-wide view, as well as take into consideration a handful of implementation strategies that help deliver a high-quality application. The design and implementation of a data warehouse solution is sometimes a very complex challenge in theory and practice. In this article, I will cover the main principles and techniques for designing and implementing a data warehouse, drawing on my own experience with such an overwhelming challenge.

This is the third article in a series of articles that I am writing in order to share my knowledge and experience in this subject matter. In this part and the next one, we will see how to implement the technical solution for a data warehouse.

Implementing the technical solution

In this section, we'll build the relational tables that support the dimensional data model designed in the previous article. Then we'll see how to load these relational tables with data from the AdventureWorks database as the main data source, and finally we're going to build the Analysis Services cubes to present the information to end users.

Let's talk about the implementation strategies for the dimensional data model. Open SQL Server Management Studio and create a new database; specify a name such as TestDW, and click OK to create it.

In order to create the dimension and fact tables, run the SQL DDL statements in Listing 1. We have added a unique key constraint on each table's business key as a clustered index in order to improve performance when we need to fetch dimension records based on the business key, which is a very common operation in a data warehouse. The clustered indexes physically arrange the data in the dimension tables based on the values of the business key.

CREATE TABLE ProductDim(
ProductDim_ID        int             IDENTITY(1,1),
ProductBusinessID    int             NOT NULL,
ProductName          nvarchar(50)    NOT NULL,
Color                nvarchar(15)    NULL,
SubcategoryName      nvarchar(50)    NOT NULL,
CONSTRAINT "ProductDim.PK" PRIMARY KEY NONCLUSTERED (ProductDim_ID)
)
go

CREATE UNIQUE CLUSTERED INDEX "ProductDim.ProductBusinessID" ON ProductDim(ProductBusinessID)
go

CREATE TABLE CustomerDim(
CustomerDim_ID        int         IDENTITY(1,1),
CustomerBusinessId    int         NOT NULL,
CustomerType          char(10)    NULL,
CONSTRAINT "CustomerDim.PK" PRIMARY KEY NONCLUSTERED (CustomerDim_ID)
)
go

CREATE UNIQUE CLUSTERED INDEX "CustomerDim.CustomerBusinessID" ON CustomerDim(CustomerBusinessId)
go

CREATE TABLE RegionDim(
RegionDim_ID        int             IDENTITY(1,1),
RegionBusinessID    int             NOT NULL,
RegionName          nvarchar(50)    NOT NULL,
CONSTRAINT "RegionDim.PK" PRIMARY KEY NONCLUSTERED (RegionDim_ID)
)
go

CREATE UNIQUE CLUSTERED INDEX "RegionDim.RegionBusinessID" ON RegionDim(RegionBusinessID)
go

CREATE TABLE TimePeriodDim(
TimePeriodDim_ID    int         IDENTITY(1,1),
Calendar_Date       datetime    NOT NULL,
Calendar_Year       int         NOT NULL,
Calendar_Month      int         NOT NULL,
Calendar_Quarter    int         NOT NULL,
Calendar_Week       int         NOT NULL,
CONSTRAINT "TimePeriodDim.PK" PRIMARY KEY NONCLUSTERED (TimePeriodDim_ID)
)
go

CREATE UNIQUE CLUSTERED INDEX "TimePeriodDim.Calendar_Date" ON TimePeriodDim(Calendar_Date)
go

CREATE TABLE SalesFact(
SalesFact_ID        int      IDENTITY(1,1),
ProductDim_ID       int      NOT NULL,
CustomerDim_ID      int      NOT NULL,
RegionDim_ID        int      NOT NULL,
TimePeriodDim_ID    int      NOT NULL,
SalesVolumes        money    NOT NULL,
ShippedUnits        int      NOT NULL,
CONSTRAINT "SalesFact.PK" PRIMARY KEY NONCLUSTERED (SalesFact_ID),
CONSTRAINT "SalesFact.RefProductDim" FOREIGN KEY (ProductDim_ID)
REFERENCES ProductDim(ProductDim_ID),
CONSTRAINT "SalesFact.RefCustomerDim" FOREIGN KEY (CustomerDim_ID)
REFERENCES CustomerDim(CustomerDim_ID),
CONSTRAINT "SalesFact.RefRegionDim" FOREIGN KEY (RegionDim_ID)
REFERENCES RegionDim(RegionDim_ID),
CONSTRAINT "SalesFact.RefTimeDim" FOREIGN KEY (TimePeriodDim_ID)
REFERENCES TimePeriodDim(TimePeriodDim_ID)
)
go
CREATE INDEX "SalesFact.SalesFact_ID" ON SalesFact (ProductDim_ID, CustomerDim_ID, RegionDim_ID, TimePeriodDim_ID)
go

Listing 1

Now it's time to load the tables with data from the data sources. Although in the real world we can find multiple source systems and different ways of representing information within most businesses, such as spreadsheets, text files, and relational databases, we're going to use the AdventureWorks database shipped with SQL Server 2005 as the only data source for the data warehouse. This objective is achieved by an extract, transform, and load (ETL) process using technologies such as Data Transformation Services (DTS) packages in SQL Server 2000 or Integration Services (SSIS) in SQL Server 2005.

In this case, we're going to populate the dimension and fact tables using INSERT ... SELECT statements.

The first step is to load the data into the dimension tables from the Sales.Customer, Sales.SalesTerritory, Production.Product, and Production.ProductSubcategory tables in the AdventureWorks database (see Listing 2).

INSERT INTO dbo.CustomerDim(CustomerBusinessId, CustomerType)
SELECT CustomerID AS CustomerBusinessID,CustomerType
FROM AdventureWorks.Sales.Customer;
go

INSERT INTO dbo.RegionDim(RegionBusinessID, RegionName)
SELECT TerritoryID AS RegionBusinessID, [Name]+'-'+CountryRegionCode AS RegionName
FROM AdventureWorks.Sales.SalesTerritory;
go

INSERT INTO dbo.ProductDim(ProductBusinessID, ProductName, Color, SubcategoryName)
SELECT p.ProductID AS ProductBusinessID, p.[Name] AS ProductName, p.Color, s.[Name] AS SubcategoryName
FROM AdventureWorks.Production.Product p inner join AdventureWorks.Production.ProductSubcategory s
ON p.ProductSubcategoryID=s.ProductSubcategoryID;
go

DECLARE @dtStartDate datetime;
DECLARE @dtEndDate datetime;

SET @dtStartDate = '1/1/2000';
SET @dtEndDate = '1/1/2012';

WHILE(@dtStartDate <= @dtEndDate)
BEGIN
INSERT INTO dbo.TimePeriodDim(Calendar_Date, Calendar_Year, Calendar_Month, Calendar_Quarter, Calendar_Week)
SELECT @dtStartDate, DATEPART(YYYY,@dtStartDate) AS Calendar_Year,
DATEPART(M,@dtStartDate) AS Calendar_Month,
DATEPART(Q,@dtStartDate) AS Calendar_Quarter,
DATEPART(WK,@dtStartDate) AS Calendar_Week;
SET @dtStartDate = @dtStartDate + 1;
END;
go

Listing 2

The next step is to load the data into the fact table. One important point is that dimension tables use surrogate keys, and these keys are specific to the data warehouse; thus, when we're loading data into the fact tables, a lookup step is always required to map business keys to surrogate keys.

In our case, we're integrating data within the same database system, so we're going to map business keys to surrogate keys using inner join operations while extracting the sales-related data from the Sales.SalesOrderHeader and Sales.SalesOrderDetail tables in the AdventureWorks database. The statement that loads the SalesFact fact table is shown in Listing 3.

INSERT INTO SalesFact(ProductDim_ID, CustomerDim_ID, RegionDim_ID, TimePeriodDim_ID, SalesVolumes, ShippedUnits)
SELECT p.ProductDim_ID AS ProductDim_ID, c.CustomerDim_ID AS CustomerDim_ID, r.RegionDim_ID AS RegionDim_ID, t.TimePeriodDim_ID AS TimePeriodDim_ID, TotalDue AS SalesVolumes, sum(OrderQty) AS ShippedUnits
FROM AdventureWorks.Sales.SalesOrderHeader sh inner join AdventureWorks.Sales.SalesOrderDetail sd
ON sh.SalesOrderID=sd.SalesOrderID inner join dbo.CustomerDim c
ON sh.CustomerID=c.CustomerBusinessId inner join dbo.RegionDim r
ON sh.TerritoryID=r.RegionBusinessID inner join dbo.ProductDim p
ON sd.ProductID=p.ProductBusinessID inner join dbo.TimePeriodDim t
ON sh.OrderDate = t.Calendar_Date
GROUP BY p.ProductDim_ID, c.CustomerDim_ID, r.RegionDim_ID, t.TimePeriodDim_ID, TotalDue;

go

Listing 3

Now that we have built the data warehouse and loaded the required data through the integration process, we need to add the final component of the architecture: the multidimensional database. The relational data warehouse consolidates the underlying data sources, but end users need a flexible analytical capability that is easy to use. The solution is to build an Analysis Services database from the relational data warehouse. Next, we'll adjust the dimensional model created before to better support cubes.

Let's start the solution by using BI Development Studio to create a new Analysis Services project (see Figure 1). This creates a new local working directory which contains the definitions of the underlying objects. Later, when we have finished designing the objects, we can deploy the project to a server running an Analysis Services instance.



Figure 1

The first step is to add a data source and specify the type of database and the connection information. In the Solution Explorer window, navigate to the Data Sources node in the tree, then right-click and select the New Data Source option in order to launch the Data Source Wizard. On the Select how to define the connection page, click on the New button, and the Connection Manager window appears. Set the server name, select the data warehouse database, and click on the OK button (see Figure 2).



Figure 2

Next, in the Data Source Wizard, click on the Next button, and on the Impersonation Information page, select the Use the credentials of the current user option. Click on the Next button and finish the wizard (see Figure 3).



Figure 3

Now we'll select the parts of the data source to be analyzed by defining a logical view called a data source view (DSV). Using data source views, you can also define relationships between the objects as well as add calculated columns. The DSV acts as an abstraction layer between the multidimensional database and the data source when you're building the cubes.

Let's define a data source view that selects the five dimension and fact tables from the data source. In the Solution Explorer window, navigate to the Data Source Views node in the tree, then right-click and select the New Data Source View option in order to launch the Data Source View Wizard. Click on the Next button; on the Select Data Source page, choose the previously created data source, and click on the Next button, and the Select Tables and Views page appears. Select all the tables from the data source by clicking on the >> button (see Figure 4).



Figure 4

Then click on the Next and Finish buttons. You can see in the data source view (see Figure 5) that we don't need to create the relationships between objects, because the wizard detects these relationships based on the primary keys and any referential integrity constraints. If we need more relationships, the creation process is very simple: you drag and drop a foreign key column from the source table onto the corresponding primary key in the destination table.



Figure 5

Now it's time to create the cube. In the Solution Explorer window, navigate to the Cubes node in the tree, then right-click and select the New Cube option in order to launch the Cube Wizard. Click on the Next button; on the Select Build Method page, select the Build the cube using a data source option and click on the Next button (see Figure 6).



Figure 6

On the Select Data Source View page, select the previously defined data source view and click on the Next button. The wizard then analyzes the data source view in order to identify potential dimension and fact tables. Click on the Next button, and on the Identify Fact and Dimension Tables page you can see that the wizard detects CustomerDim, ProductDim, RegionDim, and TimePeriodDim as dimension tables, as well as SalesFact and TimePeriodDim as fact tables. You must uncheck the fact option for the TimePeriodDim table. You must also manually configure the TimePeriodDim table as the time dimension (see Figure 7).



Figure 7

Click on the Next button, and the Select Time Periods page appears, which enables mapping the property names of the time dimension in the cube to the column names of the time dimension table (see Figure 8).



Figure 8

Conclusion

In this article, I've started to show how to build the data warehouse using Microsoft technologies. In the next article, I will finish the solution.

Designing and implementing a Data Warehouse: Part 2

Introduction

Business Intelligence has become a buzzword in recent years as a support to decision making. Today we can find several database systems that include data warehousing, online analytical processing (OLAP), and data mining technologies. Data warehousing provides efficient storage, maintenance, and retrieval of data. OLAP is a service that provides a way to create ad hoc queries against the data warehouse in order to answer important business questions. Data mining is a discipline comprising several algorithms for discovering knowledge in large volumes of data.

In order to build a data warehouse solution, we need to model a consistent architecture in which the operational data fits well into an integrated, enterprise-wide view, as well as take into consideration a handful of implementation strategies that help deliver a high-quality application. The design and implementation of a data warehouse solution is sometimes a very complex challenge in theory and practice. In this article, I will cover the main principles and techniques for designing and implementing a data warehouse, drawing on my own experience with such an overwhelming challenge.

This is the second article in a series of articles that I am writing in order to share my knowledge and experience in this subject matter.

Getting started with the data warehouse solution

A data warehouse has many functions to perform, such as data extraction, data loading, data transformation, data storage, and the provision of an interface. Figure 1 illustrates a high-level architecture of the data warehouse.



Figure 1

There are several technologies and components that support the functions of a data warehouse. Figure 2 shows a detailed view of the architecture of a data warehouse.



Figure 2

On the left side, you can see the Data Sources, which enable extracting data from operational systems and importing additional data from external sources. All the data from the Data Sources is then integrated in the Data Staging component in order to remove inconsistencies, detect missing values, and transform the data into a format suitable for easy access and analysis, in this case structurally consistent with the star schema dimensional model. This step is very important for maintaining data quality. Once the data is prepared in the Data Staging component of the solution, it is loaded into the data warehouse storage medium; the Data Storage component manages the storage of the data and also keeps track of information about the data, known as metadata. It's worth noting that the data in the Data Storage component is stable and de-normalized (following the star schema), representing snapshots at specified periods, although the data may change periodically. Choosing between a relational database and a multidimensional database for the implementation of the data warehouse is also a critical decision. Finally, the Information Delivery component provides an interface through which the data is accessed and displayed.
Regarding the Data Storage component, we have three physical mechanisms to store data in a dimensional database: ROLAP, MOLAP, and HOLAP. ROLAP stands for Relational OLAP; in this case the dimensional model is mapped to a relational database engine such as Oracle or SQL Server. MOLAP stands for Multidimensional OLAP; in this case the dimensional model is mapped to cubes, for example in SQL Server Analysis Services. And finally, we have HOLAP, which stands for Hybrid OLAP, a combination of ROLAP and MOLAP.

Building the data warehouse solution

This section is intended to take everything we have discussed so far and begin the process of putting it into practice by describing and demonstrating the full process of analyzing, designing, and building a dimensional model, and the implementation of the underlying data warehouse. A data warehouse is typically developed one subject area at a time, and in this section we're going to develop a department-specific data warehouse, or data mart, for sales analysis. The resulting model will be in star schema format to enable multidimensional analysis.

Designing the dimensional data model

Let's suppose that our client is a manufacturing company whose transactional database is AdventureWorks, shipped with Microsoft SQL Server 2005. This company is finding it difficult to give the sales manager meaningful information to answer several business questions in a changing and challenging environment, and report execution is adversely affecting the performance of the transaction processing systems during peak reporting periods. The main business questions to answer concern sales volumes and shipped units by product, customer, region, and time period.

The first step is the development of the staging database that receives data from the underlying data sources (the data is usually stored in separate transactional systems), transforms the data, and loads it through an integration process in order to build the multidimensional database.

Once the requirements are gathered, all we need to do is establish what the facts are (the transactional data) and what the dimensions are (reference data or master data).
The next diagram (Figure 3) shows a simple star schema containing a central fact table and several dimension tables surrounding it. You can see that every dimension has a surrogate key, generated within the data warehouse, as the primary key of the dimension table, as well as a business key which represents the different primary keys of the underlying tables to be consolidated into the dimension (for example, a product may have a product code in one data source that differs semantically and in data type from the other data sources). All the surrogate keys are declared as integer fields with the IDENTITY property turned on. In the case of the fact table, the primary key is usually the combination of all the dimension keys. However, depending on the semantics of the problem and the level of summarization, you can declare another attribute on the fact table as the primary key, so that it does not matter whether we have multiple records with the same combination of dimension keys.



Figure 3
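To make this pattern concrete, here is a minimal DDL sketch of one dimension table following the design above; the complete DDL for the whole schema appears in the Part 3 section of this series:

-- ProductDim_ID is the warehouse-generated surrogate key (IDENTITY), while
-- ProductBusinessID carries the primary key coming from the source system.
CREATE TABLE ProductDim(
ProductDim_ID        int             IDENTITY(1,1),
ProductBusinessID    int             NOT NULL,
ProductName          nvarchar(50)    NOT NULL,
CONSTRAINT "ProductDim.PK" PRIMARY KEY NONCLUSTERED (ProductDim_ID)
)
go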

The dimension tables such as ProductDim, CustomerDim, RegionDim, and TimePeriodDim describe the SalesFact fact table. One interesting area in the model is Customers; the Sales Manager is only interested in the CustomerType attribute, because this manufacturer does not sell directly to consumers but only to retailers. The simple SQL statement to load the data for this dimension from the relational schema is shown in Listing 1.

select CustomerID as CustomerBusinessID,CustomerType
from Sales.Customer;

Listing 1

In order for the Sales Manager to get more insight about the customers, the RegionDim table was introduced. This table reflects the fact that most companies divide geographical areas into "sales territories", and these territories might not translate directly to a political division such as a state or province. The data for the RegionDim table may be extracted directly from a table in the relational schema using the SQL statement shown in Listing 2.

Select TerritoryID as RegionBusinessID, [Name]+'-'+CountryRegionCode as RegionName
from Sales.SalesTerritory;

Listing 2

The ProductDim table represents the dimension related to the area of products. Products can be grouped into subcategories and categories, and each product can contain descriptive attributes useful for the final reports. This dimension is characterized by the product name, color, and subcategory name. The data for this dimension may be extracted directly from the Product and ProductSubcategory tables in the relational schema as shown in Listing 3.

Select p.ProductID as ProductBusinessID, p.[Name] as ProductName, p.Color, s.[Name] as SubcategoryName
from Production.Product p inner join Production.ProductSubcategory s
on p.ProductSubcategoryID=s.ProductSubcategoryID;

Listing 3

And finally, we have to deal with the Time dimension, represented by the TimePeriodDim table. It's worth noting that most star schema designs will have a time dimension, because we often need to analyze the information across different time periods. For example, the Sales Manager wants to see the sales volumes in the current month, or compare the value of the current period with the same period last year, and so on. This dimension includes information about the year, day, quarter, month, and week. In this way, we can access and summarize data at any of these granularities, not just a single period such as a year. In order to define the time dimension, we need to understand the level of detail required by the fact table. Sometimes we think we don't need a date-time table because modern database systems include a lot of functions to process date and time data types. But if we include a date-time table containing a record for each day, we can semantically extend the logic associated with the date-time data type by flagging days as holidays or other special days such as Christmas or the 4th of July. You can also accommodate the calendar to reflect fiscal years.

As you can see, the time dimension is a special dimension, and most of the time we need to populate it with calendar dates and date parts such as year, month, quarter, and week. In order to generate this dimension, you can come up with SQL business logic (see Listing 4).

declare @dtStartDate datetime;
declare @dtEndDate datetime;

set @dtStartDate = '1/1/2008';
set @dtEndDate = '1/1/2010';

while(@dtStartDate <= @dtEndDate)
begin
select @dtStartDate, DATEPART(YYYY,@dtStartDate) as Calendar_Year,
DATEPART(M,@dtStartDate) as Calendar_Month,
DATEPART(Q,@dtStartDate) as Calendar_Quarter,
DATEPART(WK,@dtStartDate) as Calendar_Week;
set @dtStartDate = @dtStartDate + 1;

end;

Listing 4
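Following the holiday-flagging idea mentioned earlier, here is a minimal sketch of how the time dimension could be extended; the IsWeekend and IsHoliday columns are illustrative assumptions and are not part of the schema built elsewhere in this series:

-- Add illustrative flag columns to the time dimension.
alter table TimePeriodDim add IsWeekend bit not null default 0;
alter table TimePeriodDim add IsHoliday bit not null default 0;
go

-- Flag Saturdays and Sundays (assuming the default DATEFIRST setting).
update TimePeriodDim
set IsWeekend = 1
where DATEPART(DW, Calendar_Date) in (1, 7);

-- Flag the 4th of July as a holiday.
update TimePeriodDim
set IsHoliday = 1
where DATEPART(M, Calendar_Date) = 7 and DATEPART(D, Calendar_Date) = 4;
go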

The fact table SalesFact contains all the keys to the dimensions, such as ProductDim_ID, CustomerDim_ID, RegionDim_ID, and TimePeriodDim_ID, as well as the columns that hold the measures to be analyzed, such as SalesVolumes and ShippedUnits. The data in this table is the result of summarizing data from the SalesOrderHeader and SalesOrderDetail tables in the relational schema through the SQL statement in Listing 5.

select sd.ProductID as ProductDim_ID, sh.CustomerID as CustomerDim_ID, sh.TerritoryID as RegionDim_ID, OrderDate as TimeDim, TotalDue as SalesVolumes, sum(OrderQty) as ShippedUnits
from Sales.SalesOrderHeader sh inner join Sales.SalesOrderDetail sd
on sh.SalesOrderID=sd.SalesOrderID
group by sd.ProductID, sh.CustomerID, sh.TerritoryID, OrderDate, TotalDue;

Listing 5

In this case, the measures SalesVolumes and ShippedUnits are fully additive, meaning that the values of these attributes may be summed up by simple addition along any dimension. A measure that can be summed along some dimensions but not others is semi-additive; for example, an inventory balance can be summed across products, but summing it across time periods does not produce a meaningful total. It's worth noting that cubes are great at handling additive measures.
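To illustrate the distinction, the following sketch contrasts an additive rollup with the handling a semi-additive measure would require; the InventoryFact table and its columns are hypothetical, introduced here only for illustration:

-- Additive: ShippedUnits can be summed along any dimension, here by region.
select RegionDim_ID, sum(ShippedUnits) as TotalShippedUnits
from SalesFact
group by RegionDim_ID;

-- Semi-additive (hypothetical InventoryFact holding period-end balances):
-- balances can be summed across products within a single period, but summing
-- them across periods would double-count stock, so the query fixes one period.
select ProductDim_ID, sum(BalanceQty) as TotalBalance
from InventoryFact
where TimePeriodDim_ID = 42   -- a single, hypothetical period key
group by ProductDim_ID;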

Sometimes you need to add columns to fact tables that are neither dimension keys nor numeric measures, to help users drill down to a level of detail that identifies individual business entities. Such a column is commonly known as a degenerate dimension, because it's basically a dimension with only one column, which is the business key, and no separate dimension table.
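As a sketch of this idea, the source order number could be carried on the SalesFact table itself; the column and query below are illustrative assumptions rather than part of the schema built in this series:

-- Carry the source order number on the fact table as a degenerate dimension;
-- no separate dimension table is created for it.
alter table SalesFact add SalesOrderNumber nvarchar(25) null;
go

-- Users can then drill down from summaries to individual orders.
select SalesOrderNumber, sum(SalesVolumes) as OrderTotal
from SalesFact
group by SalesOrderNumber;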

Conclusion

In this second article, I’ve explained how to get started with a data warehouse solution.

Designing and implementing a Data Warehouse: Part 1

Introduction
Business Intelligence has become a buzzword in recent years as a support to decision making. Today we can find several database systems that include data warehousing, online analytical processing (OLAP), and data mining technologies. Data warehousing provides efficient storage, maintenance, and retrieval of data. OLAP is a service that provides a way to create ad hoc queries against the data warehouse in order to answer important business questions. Data mining is a discipline comprising several algorithms for discovering knowledge in large volumes of data.
In order to build a data warehouse solution, we need to model a consistent architecture in which the operational data fits well into an integrated, enterprise-wide view, as well as take into consideration a handful of implementation strategies that help deliver a high-quality application. The design and implementation of a data warehouse solution is sometimes a very complex challenge in theory and practice. In this article, I will cover the main principles and techniques for designing and implementing a data warehouse, drawing on my own experience with such an overwhelming challenge.
This is the first article in a series of articles that I am writing in order to share my knowledge and experience in this subject matter.
The design of a data warehouse
Although many of the principles and techniques used to design and implement a relational data model for an operational system are adaptable to data warehouse modeling, they cannot be carried across in a natural way; thus, data warehouse modeling is a new discipline that is evolving every day.
Operational and decision support systems
Most operational data is stored in relational database structures such as tables and their underlying relationships. These schemas are highly normalized and optimized to support the business processes in the enterprise. They are excellent for operational systems with several users working concurrently and executing transactions (where the most important criterion is data consistency), but they are not adequate for reporting applications (where the common operation is data extraction and the most important criterion is the performance of the underlying query) that use a huge amount of data but serve relatively few users, because you normally have to join several tables in the execution of the query.
Decision support systems deal with huge amounts of historical data representing time slices of the operational data. A data warehouse is the storage medium for decision support systems, and it requires periodic updates to load new data from the operational data sources. Operations on a data warehouse are characterized by read-only, ad hoc queries (less predictable) over a high volume of data, which might be integrated, aggregated, and summarized for decision support purposes. Table 1 summarizes the basic differences between operational systems (using relational database technologies) and decision support systems (using data warehouse technologies).

Relational Database Systems                        | Data Warehouse Systems
Transaction oriented                               | Business process oriented
Thousands of users concurrently                    | Few users
Generally small in size (hundreds of MB up to GB)  | Very large (hundreds of GB up to several TB)
Current data                                       | Historical data
Normalized data                                    | De-normalized data
Continuous updates                                 | Batch updates
Simple to complex queries                          | Very complex queries

Table 1
Dimensional modeling
Dimensional modeling is a modeling technique used to structure business dimensions and the metrics that are analyzed along those dimensions, in order to execute high-performance queries. At a high level of interpretation, the data warehouse contains an integrated view of data derived from the data in the operational systems supporting different business processes. Between the operational systems and the data warehouse there is an important component known as the staging area. In this area, the operational data gets cleansed and transformed into a format suitable to be placed in the data warehouse storage.
A data warehouse database is a highly de-normalized structure with two main components: the first is a central table, also known as the fact table, which contains transactional data; it is surrounded by the second type of component, the dimension tables, which contain referential static data or master data. The conceptual relationship between these two components is that the dimensions describe the facts. This specific data model of dimension and fact tables is known as the dimensional model, and it can be graphically described as a star schema (a single fact table surrounded by dimensions) which specifies a multidimensional database. A multidimensional database can be built using a conventional relational DBMS (referred to as a ROLAP database) or a specialized multidimensional DBMS optimized for such structures (referred to as a multidimensional database composed of cubes). The data model of a data warehouse is practically made up of data marts, or a set of star schemas, where each data mart is a single fact table surrounded by dimension tables containing data for a different department or functional area.
Transformation from the enterprise data model to the dimensional model
The enterprise data model is the specification of the business entities as well as their underlying relationships, keys, attributes, subtypes, and business rules, using entity relationship (ER) and normalization techniques. The enterprise data model is a global view of the data models of the database systems to be integrated into the data warehouse. This important model corresponds to the subject areas supported by the underlying operational information systems. The degree of completion of the larger enterprise data model is of little concern to the development of the data warehouse.
Once the enterprise data model is specified, you can make several transformations in order to build the dimensional model of the data warehouse such as:
  • Removal of purely operational data
  • Addition of appropriate derived data
  • Transformation of relationships
  • Creation of arrays of data
It's worth noting that these transformations are a guideline for the design of a data warehouse, but their use is mainly determined by the business requirements of the decision-support systems.
Removal of purely operational data
This transformation deals with examining the enterprise data model and removing all data that is purely operational. These attributes are not removed from the enterprise data model; they are simply not carried into the dimensional model. For example, the definition of the Product entity (see Figure 1) contains a lot of operational attributes.
Figure 1
Now let's transform this definition into a new one that fits the dimensional model by removing the attributes ProductNumber, MakeFlag, FinishedGoodsFlag, SafetyStockLevel, ReorderPoint, StandardCost, ListPrice, SizeUnitMeasureCode, WeightUnitMeasureCode, Weight, DaysToManufacture, ProductLine, Style, ProductSubcategoryID, ProductModelID, SellStartDate, SellEndDate, DiscontinuedDate, rowguid, and ModifiedDate. The final schema is shown in Figure 2.
Figure 2
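Expressed as a query, this removal is simply a projection that keeps only the analytically useful attributes. Here is a minimal sketch against the AdventureWorks Production.Product table; the exact surviving attribute set is an assumption based on the removal list above:

-- Keep only the descriptive attributes that survive the removal.
select ProductID, [Name] as ProductName, Color, Size, Class
from Production.Product;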
Addition of appropriate derived data
The next transformation is the addition of derived data (this is a de-normalization process) that is frequently accessed and calculated, in order to reduce the amount of processing required to get the data. Let's illustrate this transformation with an example of the Sales Order entity's data model (see Figure 3).
Figure 3
Now let's transform the relational model in Figure 3 into a dimensional model (see Figure 4).
Figure 4
Another case is when we have physically designed the Product subject as two relational tables, such as the Product and ProductSubcategory tables in the AdventureWorks database (see Figure 5), and we want to create a dimension for the Product subject. In this case, it's not descriptive to include the ProductSubcategoryID integer; instead it's better to include the name of the product subcategory (see Figure 6).
Figure 5
Figure 6
Another interesting case of this transformation concerns the one-to-many relationship in the relational data model. Let's suppose that we have a non-identifying mandatory relationship between a Product and its Supplier. The underlying semantics dictate that a supplier may have many products, but a given product must have an associated supplier (see Figure 7).
Figure 7
In order to turn this data model into a dimensional model, we design an entity that represents the Supplier and Product entities and the associated relationship. We can also apply the removal of purely operational data technique in order to keep only the most important attributes. This transformation is shown in Figure 8.
Figure 8
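A minimal sketch of this flattening, assuming hypothetical Product and Supplier source tables shaped as in Figure 7:

-- Flatten the one-to-many relationship into a single de-normalized record,
-- keeping only the most important attributes (table and column names assumed).
select p.ProductID as ProductBusinessID,
p.[Name] as ProductName,
s.[Name] as SupplierName
from Product p inner join Supplier s
on p.SupplierID = s.SupplierID;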
Creation of Arrays of Data
As the enterprise data model is usually normalized, repeating groups are not shown as part of the model. But under some conditions, the data warehouse, as an analytical tool, needs to represent these repeating groups.
Let's suppose that our data model designs the Budget business entity. The budget occurs on a month-by-month basis. In order to normalize this structure, we have the Budget and BudgetItem entities associated by an identifying relationship (see Figure 9).
Figure 9
When we apply the transformation technique, we get the result shown in Figure 10.
Figure 10
It's worth noting that this transformation is not a general-purpose option. There are several scenarios where this strategy may be applied, such as:
  • When the number of occurrences of data is predictable.
  • When the occurrences of data are relatively small.
  • When the occurrences of data are frequently used together.
  • When the pattern of insertion and deletion is stable.
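A minimal sketch of the Budget transformation from Figures 9 and 10, assuming hypothetical Budget and BudgetItem source tables: the monthly BudgetItem rows are pivoted into an array of monthly columns on a single record.

-- Pivot monthly rows into one row per budget with a column per month
-- (three months shown; the remaining columns follow the same pattern).
select b.BudgetID,
sum(case when i.BudgetMonth = 1 then i.Amount else 0 end) as JanAmount,
sum(case when i.BudgetMonth = 2 then i.Amount else 0 end) as FebAmount,
sum(case when i.BudgetMonth = 3 then i.Amount else 0 end) as MarAmount
from Budget b inner join BudgetItem i
on i.BudgetID = b.BudgetID
group by b.BudgetID;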
Conclusion
In this article, I’ve shown the principles of a data warehouse and the underlying modeling methodology.