Wednesday, August 1, 2018

How to make Cognos report work with historical data snapshot

1 Context
There are many business cases where you need to run report based on historical data and produce the same report generated before. If the data amount is very big, then this requirement is very challenge. There are difficult ways to archive historical data, such as save the whole database, save the whole table. However, these solutions may not be very efficient in term of capacity and maintenance. This document is intended to share one solution based on assumptions below:
1) Only portion of data are changed
2) Cognos framework manager is used

2 Overall solutionThe diagram below illustrates the architecture about how the data flow to report based on the selection.


3 Host historical data in separated table with snapshot column
It is proposed to host two tables below:

Current FactTable
Static Fields
Dynamic Fields
SF1
SFn
DF1
DFn
XX
XX
XX
XX
XX
XX
XX
XX
XX
XX
XX
XX


Historical  FactTable  
Snapshot 1 
Snapshot n
Snapshot
DF1
DFn
Snapshot
DF1
DFn
Snapshot 1
XX
XX
XX
Snapshot n
XX
XX
XX
Snapshot 1
XX
XX
XX
Snapshot n
XX
XX
XX

Please note that Current FactTable holds all current data with all columns containing two sets of data:
Static data – data is not changed
Dynamic data – data is changed month by month
Historical  FactTable  holds all historical snapshots, for example, each month has one snapshot. The snapshot only saves columns that data are changed.

4 Make report generation based on snapshot
A new parameter is introduced to allow report users to choose snapshot. This parameter is built into historical FactTable with filter Snapshot = #Prompt()# with default value. If it is default, then there won’t be any data for report to use. But if users choose a snapshot, then these dynamic columns will be used for report.

5 Introduce a generic reportable Fact Table
Reportable fact table is dynamic generated based on snapshot. If there is snapshot specified, then these columns will be replaced from historical data based on selected snapshot

Reportable FactTable
Static Fields
Dynamic Fields
SF1
SFn
DF1
DFn
XX
XX
XX
XX
XX
XX
XX
XX
XX
XX
XX
XX

Basically, if new parameter is introduced to allow report users to choose snapshot. This parameter is built into historical FactTable with filter Snapshot = #Prompt()# with default value. If it is default, then there won’t be any data for report to use. But if users choose a snapshot, then these dynamic columns will be used for report.

6 Make left join relationship to accomplish replace current data with a historical snapshot
The generic reportable Fact Table is the join result from both current Fact table and historical Fact table. There are two cases:
Case 1: when no snapshot selected, then the left join simply get no data from historical snapshot. The dynamic columns in reportable fact table is actually from current fact table.
Case 2: when a snapshot is selected, then the left join get data from historical snapshot. The dynamic columns in reportable fact table will take data from snapshot over the original data from current fact table.
All the logic can be built in Cognos framework manager for business tier. The advantage is that all existing reports will not be needed to change at all. we can simply change column definition to achieve this goal, such as NVL ([Historical FactTable].[DFi], ([Currental FactTable].[DFi],))

Sunday, July 1, 2018

Is SCD2 really needed?

SCD 2 is naturally applied in data warehousing design, also Bitemporal design approach is also sometimes applied. When talking about high level picture, all these concepts are intended to resolve the relationship between dimension and fact.  Based on my experiences from many practical data warehouse projects, I am trying to ask myself whether the SCD 2, or even surrogate keys is really necessary or not.  This document is to argue that surrogate key, SCD2 and Bitemporal are not needed.

Referring back to the two design goals of a dimensional model per Kimball; create an easy to understand data structure while providing fast query response times.

First of all, let address the performance issue. Fast response times often mean efficient joins between facts and dimensions.  By using surrogate keys we can select a data type, integer, which offers high performance. However, the performance is insignificant given that many serious data warehousing use Teradata or Netezza.  To apply for the surrogate keys and SCD2, the development and maintenance cost are much higher than simple relationship. We need either to write complicated SQL, and to use ETL tool to generate both dimensions and facts during development. It is also difficult to query fact data as we need to join dimension tables to get data understandable. Because of  incremental IT personnel costs, and additional soft costs, there is no justification for using surrogate key and SCD2 for better performance.

Secondly, with respect to create an easy to understand data structure, the actually data structure without surrogate keys and SCD2 is still illustrated as star scheme, which still can be easily modeled with different BI tools. 

Below are a few methods to model data structures:

1. Simple attribute(s) ( or SCD0): 
we can simply consider these attributes as fact dimension. normally, the fact table has primary key, we can either leave these attributes in fact table, or separate them as fact dimension without any changes. If modeling it in Cognos, we can either model an individual attribute as a corresponded dimension, or make the all attribute as one fact dimension. The latter is better approach.

2. Dimension (SCD1):
 If we don’t use surrogate key, then we don’t need to process fact and dimension, or put surrogate key in fact and insert surrogate key in dimension.  We can simply keep the dimension related column as it in fact. With respect to dimension, there are two different cases:

2.1 Create a static reference table like dimension, with or without different hierarchies.

2.2 If this dimension is dynamic, then we can consider this column as distinct value as independent dimension. the dimension will be automatically extracted from fact.

3. Dimension (SCD2):
For the sake of discussion, we use customers dimension as sample. This table has snapshot month and customer ID with a lot of dimension attributes.  All change history of attributes in customer are automatically preserved. There are two ways to proceed:

3.1 Keep the customer dimension as is, then create relationship between customer and fact with two columns joined: customer ID and Snapshot month.
  • Advantage: once modeled, you will have all possible columns from customer for reports, and t is fully flexible
  • Disadvantage: the customer dimension could be huge, but only small percentage of customer data is used, this will result in a poor performance.

3.2 Bring only applied customer related attribute into fact table. for example, marriage status. We can add this attribute into fact table, and then apply the concept SCD0 and SCD1
  • Advantage: Very good performance and easy to understand.
  • Disadvantage:  It is not flexible, we need to bring new columns into fact table if we need to add new customer related new attribute.

Friday, June 1, 2018

how to debug a complicated DMR based report

This document will have walk through to demonstrate how to debug a complicated DMR report.
1. Problem
Below is Discontinuous Crosstab reports with problem in the marked area. However, when checking data with Cognos workspace advanced, we get correct data, and therefore we consider it as report problem.

This report above is Discontinuous crosstab, which has many measures with calculation on X side, and many different dimensions on Y side 

2. Analysis
it is very difficult to pin down the problem. We get correct result when removing most of measures and all other dimensions.

 This report above is Discontinuous crosstab, which has many measures with calculation on X side, and many different dimensions on Y side . the only solution is to remove measures and dimensions one by one, respectively.

3. Debug
Step 1. Remove measure 1, 2 , 3, to save report as X1, X2, X3, ….

Step 2. Define a new job and run all these reports and then identify until which measure get this problem

Step 3. Remove this identified measure(s), but the result still NOT correct. in this sample, it is almost impossible to pin down the issue, as there are two many combinations.

Step 4. However, when looking into generated SQL between report with problem and report without problem, we find the generated SQL is very different, from JOIN to UNION. but it is very difficult to see the real problem.

Step 5. Remove dimension 1, 2 , 3, to save report as Y1, Y2, Y3, ….

Step 6. Define a new job and run all these reports and then identify until which dimension get this problem

Step 7. Define a new job and run all these reports and then identify until which dimension get this problem

Step 8. Remove this identified dimension, check report and confirm that the generated report is correct. then we can be sure that this dimension has problem.

Step 9. Generate two SQLs below and compare them
- SQL generated with correct result, aka GOOD SQL
- SQL generated with wrong result, aka  BAD SQL
For this case, it is INNER JOIN with this identified dimension.  There are some dimension items that exist in FACT table, but does NOT exist in this dimension, therefore remove some data from FACT table as long as this dimension is used.

Tuesday, May 1, 2018

A pragmatic approach for reference data management: Cognos BI Modelling and implementation issues

 
Context
- Netezza as databases
- Cognos as report tool
 
Goal
- Make development easier – All ETLs will use the centralized table to get dimension business key, also all these map will be used to validate dimension tables as needed
- Make maintenance easier – when some changes applies duo to business logic change, we don’t need to change ETL and Cognos model
 
Assumptions
- Focused area is from staging data to data mart data
- SCD concept with surrogate key is NOT applied

Detail assumption of staging table
- All these dimension related columns are not NULL.
- If the column is NULL, then setup as ‘-99’ for string, or as -99 for number.
- If the column cannot be found from reference table, then setup as ‘-1’ for string, or as -1 for number.
 
Capabilities
- Handle single value mapping
- Handle data range mapping
- Handle default
- Generate exception between map and dimensions
 
Mapping table
-- DDL
DROP TABLE MPL_BI.REFERENCE_DATA_MAP IF EXISTS;
CREATE TABLE MPL_BI.REFERENCE_DATA_MAP 
(
 SOURCE CHARACTER VARYING(100) NOT NULL,
 PARAMETER_NAME CHARACTER VARYING(100) NOT NULL,
 SINGLE_VALUE CHARACTER VARYING(100),
 LOW_BOUND_VALUE DECIMAL(20,2), 
 LOW_BOUND_INCL_VALUE DECIMAL(20,2),
 HIGH_BOUND_VALUE DECIMAL(20,2),
 HIGH_BOUND_INCL_VALUE DECIMAL(20,2),
 DIMENSION_NAME CHARACTER VARYING(100) NOT NULL,
 DIMENSION_ITEM_NAME CHARACTER VARYING(100) NOT NULL,
 NOTE CHARACTER VARYING(1000),
 LAST_UPDATE_USER CHARACTER VARYING(100) ,
 LAST_MOFIFIED_DATE  DATE ,
 START_DATE DATE NOT NULL,
 END_DATE DATE
);
Sample Data
INSERT INTO MPL_BI.REFERENCE_DATA_MAP values('ALL','MAP_SINGLE_VALUE','Current (0)',NULL,NULL,NULL,NULL,'DIM_SINGLE_VALUE','NOT DELINQUENT',NULL,NULL,NULL,'2018-03-01',NULL);
.............
select * from MPL_BI.REFERENCE_DATA_MAP;
 
SOURCEPARAMETER_NAMESINGLE_VALUELOW_BOUND_VALUELOW_BOUND_INCL_VALUEHIGH_BOUND_VALUEHIGH_BOUND_INCL_VALUEDIMENSION_NAMEDIMENSION_ITEM_NAMENOTELAST_UPDATE_USERLAST_MOFIFIED_DATESTART_DATEEND_DATE
ALLMAP_SINGLE_VALUECurrent (0)NULLNULLNULLNULLDIM_SINGLE_VALUENOT DELINQUENTNULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_SINGLE_VALUECYC1. 1-30NULLNULLNULLNULLDIM_SINGLE_VALUECLASS 0NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_SINGLE_VALUECYC2. 31-60NULLNULLNULLNULLDIM_SINGLE_VALUECLASS 1NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_SINGLE_VALUECYC3. 61-90NULLNULLNULLNULLDIM_SINGLE_VALUECLASS 2NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_SINGLE_VALUECYC4. 90+NULLNULLNULLNULLDIM_SINGLE_VALUECLASS 3NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_SINGLE_VALUENPNANULLNULLNULLNULLDIM_SINGLE_VALUENPNANULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_SINGLE_VALUEDWONULLNULLNULLNULLDIM_SINGLE_VALUEDWONULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_SINGLE_VALUEDEFAULTNULLNULLNULLNULLDIM_SINGLE_VALUENOT IDENTIFIEDNULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL-1.00NULL-999.00DIM_BANDMISSINGNULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL0.00NULL0.00DIM_BANDZERONULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL1.00NULL50.00DIM_BANDFORCEDNULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL51.00NULL599.00DIM_BAND<600NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL600.00NULL619.00DIM_BAND600-619NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL620.00NULL639.00DIM_BAND620-639NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL640.00NULL659.00DIM_BAND640-659NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL660.00NULL679.00DIM_BAND660-679NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL680.00NULL699.00DIM_BAND680-699NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL700.00NULL719.00DIM_BAND700-719NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL720.00NULL739.00DIM_BAND720-739NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL740.00NULL759.00DIM_BAND740-759NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL760.00NULL779.00DIM_BAND760-779NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL780.00NULL799.00DIM_BAND780-799NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL800.00NULL819.00DIM_BAND800-819NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL820.00NULL839.00DIM_BAND820-839NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL840.00NULL859.00DIM_BAND840-859NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDNULLNULL860.00NULL1000.00DIM_BAND860+NULLNULLNULL2018-03-01 00:00:00NULL
ALLMAP_BANDDEFAULTNULLNULLNULLNULLDIM_BANDMISSINGNULLNULLNULL2018-03-01 00:00:00NULL

Constraints
SELECT PARAMETER_NAME,SINGLE_VALUE, count(*) FROM MPL_BI.REFERENCE_DATA_MAP
WHERE END_DATE IS NULL AND SINGLE_VALUE IS NOT NULL
GROUP BY PARAMETER_NAME,SINGLE_VALUE HAVING COUNT(*) > 1;

 
SELECT PARAMETER_NAME,LOW_BOUND_VALUE,LOW_BOUND_INCL_VALUE,HIGH_BOUND_VALUE,HIGH_BOUND_INCL_VALUE, count(*) FROM MPL_BI.REFERENCE_DATA_MAP
WHERE END_DATE IS NULL  AND (LOW_BOUND_VALUE IS NOT NULL OR LOW_BOUND_INCL_VALUE IS NOT NULL OR HIGH_BOUND_VALUE IS NOT NULL OR HIGH_BOUND_INCL_VALUE IS NOT NULL)
GROUP BY PARAMETER_NAME,LOW_BOUND_VALUE,LOW_BOUND_INCL_VALUE,HIGH_BOUND_VALUE,HIGH_BOUND_INCL_VALUE HAVING COUNT(*) > 1;

Apply for ETL
DROP TABLE MPL_BI.STG_TABLE IF EXISTS;
CREATE TABLE MPL_BI.STG_TABLE
(
 ACCOUNT int,
 DELQ CHARACTER VARYING(100),
 ORIGINAL_BUREAU_SCORE int
);
INSERT INTO MPL_BI.STG_TABLE VALUES (1,'Current (0)',-1);
INSERT INTO MPL_BI.STG_TABLE VALUES (2,'CYC1. 1-30',610);
INSERT INTO MPL_BI.STG_TABLE VALUES (3,'Other Value',NULL);
SELECT
 ACCOUNT,
 NVL(MAP_SINGLE_VALUE.DIMENSION_ITEM_NAME,MAP_DELQ_STATUS_COPS_0.DIMENSION_ITEM_NAME) AS MEMBER_DELQ,
 NVL(MAP_BAND.DIMENSION_ITEM_NAME,MAP_BUREAU_SCORE_20PT_BAND_INCLUDING_FORCED_0.DIMENSION_ITEM_NAME) AS MEMBER_ORIGINAL_BUREAU_SCORE
FROM MPL_BI.STG_TABLE S
LEFT JOIN MPL_BI.REFERENCE_DATA_MAP MAP_SINGLE_VALUE ON (MAP_SINGLE_VALUE.PARAMETER_NAME ='MAP_SINGLE_VALUE' AND S.DELQ=MAP_SINGLE_VALUE.SINGLE_VALUE)
LEFT JOIN MPL_BI.REFERENCE_DATA_MAP MAP_DELQ_STATUS_COPS_0 ON (MAP_DELQ_STATUS_COPS_0.PARAMETER_NAME ='MAP_SINGLE_VALUE' AND MAP_DELQ_STATUS_COPS_0.SINGLE_VALUE = 'DEFAULT')
LEFT JOIN MPL_BI.REFERENCE_DATA_MAP MAP_BAND ON (MAP_BAND.PARAMETER_NAME ='MAP_BAND' AND S.ORIGINAL_BUREAU_SCORE BETWEEN MAP_BAND.LOW_BOUND_INCL_VALUE AND MAP_BAND.HIGH_BOUND_INCL_VALUE)
LEFT JOIN MPL_BI.REFERENCE_DATA_MAP MAP_BUREAU_SCORE_20PT_BAND_INCLUDING_FORCED_0 ON (MAP_BUREAU_SCORE_20PT_BAND_INCLUDING_FORCED_0.PARAMETER_NAME ='MAP_BAND' AND MAP_BUREAU_SCORE_20PT_BAND_INCLUDING_FORCED_0.SINGLE_VALUE = 'DEFAULT');
 
 Apply for Cognos Dimensions
select * FROM MPL_BI.REFERENCE_DATA_MAP
WHERE PARAMETER_NAME IN ('MAP_SINGLE_VALUE')
AND DIMENSION_ITEM_NAME NOT IN (SELECT DELQ_STATUS from MPL_BI.DIM_CONFORMED_DELQ_STATUS);
select * FROM MPL_BI.REFERENCE_DATA_MAP
WHERE PARAMETER_NAME IN ('MAP_BAND')
AND DIMENSION_ITEM_NAME NOT IN (SELECT CURR_BUREAU_SCORE___20PT_BAND_INCLUDING_FORCED from MPL_BI.DIM_CONFORMED_BUREAU_SCORE_20PT_BAND_INCLUDING_FORCED);