DataWashroom - Database Clean-up Engine


Project Description
Imagine what human society would be like without washrooms. In the era of information proliferation, data likewise needs its excreta localized and its defecation controlled.
DataWashroom provides the filter core of a data-purification system: it covers the common functions of data isolation and waste separation through a simple instruction set (7 instructions). The instruction engine handles pre-compiling, building, publishing (deploying), version control and documentation generation. (It compiles rather than interprets.) The compiler's input is metadata in the wash-instruction tables; its output is a stored procedure.

For more on the intentions behind this project, see also the [Data Quality] section on the related project's wiki page at http://view.codeplex.com.

Instruction Set
There are 7 built-in wash instructions, shown below:

Instruction Set

Clean-up is not just about deleting historical data that is no longer needed or garbage data. The clean-up here focuses particularly on the treatment of bad data. In order to distill a generalized common solution from countless symptomatic treatments of bad data, the design of the built-in wash instructions above is based on the following conventions:

The duplicate-key treatment is subdivided into two alternative methods:
 - Check Unique Key (CHK_UK)
   Only adopt qualified rows and exclude all duplicate rows.
 - Rank Duplicate Key (RNK_DK)
   Assign a sequential number to every row within each partition of the supposed unique key according to a specified ordering rule, so that the first (top 1) row of each supposed unique key can be picked out directly.

Wash Cycle
All 7 wash instructions inherit from the base WASH_ISTR. They are organized in sequence under a wash cycle.

Washing Cycle Diagram

WASH_CYCLE
A wash cycle is a sequential workflow of wash instructions.
CYCLE_ID: The unique mnemonic identifier of a wash cycle; consider a naming convention within the enterprise (like a short namespace).
PROCEDURE_NAME: Defines the name of the stored procedure to be generated for the wash cycle, just as an executable file name is to an application. The schema name must be included in the stored procedure name (e.g. ABC.PRETREATMENT).
DESCRIPTION_: An arbitrary introduction of the wash cycle; it can be as brief as an application name.
OWNER_: The owner of the wash cycle.
LATEST_VERSION: A maintenance-free column initialized to 0; it is used internally by version control.
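
For illustration, registering a wash cycle could be as simple as inserting a row into WASH_CYCLE. This is only a sketch: the cycle ID ABC_PRETREAT, the owner value and the assumption that the table accepts plain INSERTs are hypothetical; adjust to the actual schema.

INSERT INTO WASH_CYCLE
    (CYCLE_ID, PROCEDURE_NAME, DESCRIPTION_, OWNER_, LATEST_VERSION)
VALUES
    ('ABC_PRETREAT', 'ABC.PRETREATMENT', 'Pretreatment of ABC basis data', 'DATA_TEAM', 0);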

Wash Instructions

WASH_ISTR
This is the base table of all 7 wash-instruction tables. Each row of this table is a wash-instruction declaration. For a top-down design, this table can also be used to store the requirements or outline design of every wash step of a wash cycle. To the compiler, this base table is treated like a header file in C++.
ISTR_ID: Defines the unique mnemonic identifier of a wash instruction; consider a naming convention within the enterprise.
CYCLE_ID: The wash cycle to which the instruction belongs.
ISTR_TYPE: The type of instruction; one of DELETE, COPY, MERGE, CHK_UK, RNK_DK, MUP_MK, MUP_NA. It indicates which derived instruction table holds the definition of the instruction.
ISTR_ORDER: The step ordinal of the instruction within its wash cycle.
DESCRIPTION_: A brief introduction of what this step is going to do. It is optional, but it is useful for generating the progress status of every step and for documentation generation.
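
Continuing the sketch above, the steps of a cycle could be declared in the base table as follows (the step IDs, order and descriptions are hypothetical):

INSERT INTO WASH_ISTR (ISTR_ID, CYCLE_ID, ISTR_TYPE, ISTR_ORDER, DESCRIPTION_)
VALUES ('DEL_WORK', 'ABC_PRETREAT', 'DELETE', 1, 'Clear the work table');
INSERT INTO WASH_ISTR (ISTR_ID, CYCLE_ID, ISTR_TYPE, ISTR_ORDER, DESCRIPTION_)
VALUES ('CP_SRC', 'ABC_PRETREAT', 'COPY', 2, 'Copy source rows into the work table');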

7 derived wash instruction tables:

WASH_ISTR_DELETE
Removes rows from a table. For instance, to refresh data from a source system, a DELETE step may need to run before a COPY step.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
DST_TABLE: Specifies the table from which the rows are to be deleted. Schema name is always required in the table name. (E.g. ABC.WORK_TABLE)
DST_FILTER: Specifies the conditions used to limit the number of rows that are deleted (E.g. ID_TYPE = 'ISIN'). If this column is NULL, DELETE removes all the rows from the table.
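
Conceptually, a DELETE instruction with DST_TABLE = ABC.WORK_TABLE and DST_FILTER = ID_TYPE = 'ISIN' compiles into a statement of the following shape (a sketch only; the engine's actual generated code may differ):

DELETE FROM ABC.WORK_TABLE
WHERE  ID_TYPE = 'ISIN';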
 
WASH_ISTR_COPY
Copies all matching columns (by column name) of data from a source table or view to a destination table.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
SRC_VIEW: Specifies the source table or view to be copied from. Schema name is always required in the table or view name. (E.g. ABC.VIEW_SRC)
SRC_FILTER: Specifies the search conditions used to limit the number of rows to be copied (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the source will be copied.
DST_TABLE: Specifies the destination table to which the rows are to be copied. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
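
Conceptually, a COPY step boils down to an INSERT ... SELECT over the columns whose names match in both the source and the destination. The column names POS_ID, DATE_ and PRICE below are hypothetical:

INSERT INTO ABC.TABLE_DST (POS_ID, DATE_, PRICE)
SELECT POS_ID, DATE_, PRICE
FROM   ABC.VIEW_SRC
WHERE  ID_TYPE = 'ISIN';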
 
WASH_ISTR_MERGE
Merges specified matching columns (by column name) of data from a source table or view into a target table - updating the specified columns in the target table if a matching row exists, or inserting the data as a new row if no matching row exists.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
SRC_VIEW: Specifies the source table or view to be merged from. Schema name is always required in the table or view name. (E.g. ABC.VIEW_SRC)
SRC_FILTER: Specifies the search conditions used to limit the number of rows to be merged (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the source will be consumed as the source data.
DST_TABLE: Specifies the destination table to which the rows are to be copied. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
JOIN_COLUMNS: Specifies the list of columns on which source table/view is joined with the target table to determine where they match. Multiple columns must be delimited by commas. (E.g. POS_ID, GRP_ID, DATE_) The column names must exist in both source table/view and destination table.
UPDATE_COLUMNS: Specifies a comma-separated list of columns to be updated in the destination table by matching rows from the source table/view (matched by JOIN_COLUMNS). A column that is referenced in the JOIN_COLUMNS list cannot be included in the UPDATE_COLUMNS list.
INSERT_COLUMNS: Specifies a comma-separated list of columns that will be copied from the source table/view to the destination table when no matching row exists. A column in the JOIN_COLUMNS list can also be included in the INSERT_COLUMNS list.
UPDATE_COLUMNS and INSERT_COLUMNS are optional, but at least one of them must be specified; both may be specified.
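
As a rough sketch (the columns PRICE and QTY are hypothetical, and the engine's actual generated code may differ), a MERGE instruction with JOIN_COLUMNS = POS_ID, GRP_ID, DATE_ corresponds to a statement of this shape:

MERGE INTO ABC.TABLE_DST d
USING (SELECT * FROM ABC.VIEW_SRC WHERE ID_TYPE = 'ISIN') s
ON (d.POS_ID = s.POS_ID AND d.GRP_ID = s.GRP_ID AND d.DATE_ = s.DATE_)
WHEN MATCHED THEN
    UPDATE SET d.PRICE = s.PRICE, d.QTY = s.QTY                 -- UPDATE_COLUMNS
WHEN NOT MATCHED THEN
    INSERT (POS_ID, GRP_ID, DATE_, PRICE, QTY)                  -- INSERT_COLUMNS
    VALUES (s.POS_ID, s.GRP_ID, s.DATE_, s.PRICE, s.QTY);
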
WASH_ISTR_CHK_UK
Check Unique Key – Checks uniqueness by a supposed business key and tags each row according to whether its key is unique or duplicate.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
DST_TABLE: Specifies the target table to be checked. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
DST_FILTER: Specifies the search conditions used to limit the number of rows to be checked (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the table will be checked.
KEY_COLUMNS: Specifies a column or a list of columns which is supposed to be a unique key. A composite key (includes two or more columns) must be delimited by commas (E.g. DATE_, POS_ID).
SET_EXPR_IF_UNIQUE: Specifies a comma-separated list of clause(s) {column_name = expression}[ ,...n ] for tagging a row if its supposed key is unique. For example: IS_QUALIFIED = 'Y', REDIRECT_CODE = ORIG_CODE.
SET_EXPR_IF_DUPLICATE: Specifies a comma-separated list of clause(s) {column_name = expression}[ ,...n ] for tagging a row if its supposed key is duplicate. For example: IS_QUALIFIED = 'N', REDIRECT_CODE = 'N/A'.
At least one of SET_EXPR_IF_UNIQUE and SET_EXPR_IF_DUPLICATE must be specified; normally both should be specified.
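
For illustration only (the actual generated SQL may differ, and a real implementation would also apply DST_FILTER to the key counting), the effect of a CHK_UK step with KEY_COLUMNS = DATE_, POS_ID and the example tag expressions above is roughly:

UPDATE ABC.TABLE_DST t
SET    t.IS_QUALIFIED = CASE
                            WHEN (SELECT COUNT(*)
                                  FROM   ABC.TABLE_DST d
                                  WHERE  d.DATE_  = t.DATE_
                                     AND d.POS_ID = t.POS_ID) = 1
                            THEN 'Y'    -- SET_EXPR_IF_UNIQUE
                            ELSE 'N'    -- SET_EXPR_IF_DUPLICATE
                        END
WHERE  t.ID_TYPE = 'ISIN';              -- DST_FILTER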
 
WASH_ISTR_RNK_DK
Rank Duplicate Key – Checks uniqueness by a supposed business key, ranks every row within its partition of the supposed key, and assigns a sequential number to every row, starting at 1 for the first row in each partition.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
DST_TABLE: Specifies the target table to be checked. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
DST_FILTER: Specifies the search conditions used to limit the number of rows to be checked (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the table will be checked.
KEY_COLUMNS: Specifies a column or a list of columns which is supposed to be a unique key. A composite key (includes two or more columns) must be delimited by commas (E.g. DATE_, POS_ID).
ORDER_BY: The ORDER_BY clause determines the sequence in which the rows are assigned their unique ROW_NUMBER within a specified partition (E.g. TRAN_NO DESC, PRICE). It is required.
RN_COLUMN: Specifies the column to be filled with the ROW_NUMBER. It is required; the column type must be NUMBER or a compatible type.
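
A hedged sketch of what an RNK_DK step does (RN_COL stands for the configured RN_COLUMN; the engine's actual generated code may differ):

MERGE INTO ABC.TABLE_DST d
USING (SELECT ROWID AS RID,
              ROW_NUMBER() OVER (PARTITION BY DATE_, POS_ID          -- KEY_COLUMNS
                                 ORDER BY TRAN_NO DESC, PRICE) AS RN -- ORDER_BY
       FROM   ABC.TABLE_DST
       WHERE  ID_TYPE = 'ISIN') s                                    -- DST_FILTER
ON (d.ROWID = s.RID)
WHEN MATCHED THEN
    UPDATE SET d.RN_COL = s.RN;
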
WASH_ISTR_MUP_MK
Make up Missing Keys – a compensation step that inserts the unique rows that exist in the source table/view (selecting the distinct supposed foreign key and copying values) but are missing from the target table (by its supposed primary key).
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
SRC_VIEW: The source table/view which references the universal set of supposed foreign key. Schema name is always required in the table or view name. (E.g. ABC.VIEW_SRC)
SRC_FILTER: Specifies the search conditions used to limit the number of rows to be matched (E.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the source table/view will be taken as the universal set.
SRC_KEY_COLUMNS: Specifies a column or a comma-separated list of columns which is supposed to be a foreign key and will be used to join with DST_KEY_COLUMNS (supposed primary key of target table).
DST_TABLE: The target table to be checked and to be made up. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
DST_KEY_COLUMNS: Specifies a column or a comma-separated list of columns which is supposed to be a primary key of target table and can be used to join with SRC_KEY_COLUMNS of SRC_VIEW.
DST_VAL_COLUMNS: (Optional) When inserting a new compensating row into the target table, some non-key columns may need to be assigned a special value (such as ‘Dummy’, ‘Unknown’, ‘N/A’, -1, ‘1900-01-01’…). DST_VAL_COLUMNS specifies a column or a comma-separated list of columns (E.g. GRP_ID, ROW_SRC) to be assigned. A column that is listed in DST_KEY_COLUMNS cannot be included in DST_VAL_COLUMNS.
SRC_VALUES: (Optional) If DST_VAL_COLUMNS is specified, then SRC_VALUES is required. SRC_VALUES specifies a value expression or a comma-separated list of value expressions that will be loaded into DST_VAL_COLUMNS while making up. An expression can be a constant (E.g. -1, 'Made-up') or an expression referencing SRC_VIEW (E.g. -1, LAST_UPD). Naturally, only unique rows (by the key columns) will be inserted into the target table if all expressions contain only constants; otherwise, if an expression references a column of SRC_VIEW, SRC_VIEW is responsible for ensuring the uniqueness of the result set.
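
Conceptually, an MUP_MK step with SRC_KEY_COLUMNS = POS_ID, DST_KEY_COLUMNS = POS_ID, DST_VAL_COLUMNS = GRP_ID, ROW_SRC and SRC_VALUES = -1, 'Made-up' behaves like the sketch below (other columns of the target table are assumed to be nullable or defaulted):

INSERT INTO ABC.TABLE_DST (POS_ID, GRP_ID, ROW_SRC)
SELECT DISTINCT s.POS_ID, -1, 'Made-up'
FROM   ABC.VIEW_SRC s
WHERE  s.ID_TYPE = 'ISIN'                                     -- SRC_FILTER
  AND  NOT EXISTS (SELECT 1
                   FROM   ABC.TABLE_DST d
                   WHERE  d.POS_ID = s.POS_ID);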
 
WASH_ISTR_MUP_NA
Make up an N/A key – inserts a special primary key into the target table as a reserved row for substituting N/A cases (such as null references, exception replacement …) if it does not already exist.
ISTR_ID: The ISTR_ID embodies the inheritance from the base WASH_ISTR.
DST_TABLE: The target table to be made up. Schema name is always required in the table name. (E.g. ABC.TABLE_DST)
DST_KEY_COLUMNS: Specifies a column or a comma-separated list of columns which is the primary key of the target table. (E.g. REF_ID)
DST_VAL_COLUMNS: (Optional) When inserting the reserved row into the target table, some non-key columns may need to be assigned special attributes (such as ‘Dummy’, ‘Unknown’, ‘N/A’, -1, ‘1900-01-01’…). DST_VAL_COLUMNS specifies a column or a comma-separated list of columns (E.g. GRP_ID, ROW_SRC) to be assigned. A column that is listed in DST_KEY_COLUMNS cannot be included in DST_VAL_COLUMNS.
MAKE_UP_KEYS: Introduces a constant or a comma-separated list of constants for the primary-key columns, to be inserted if the key does not already exist (E.g. -1). There must be one value for each column in the DST_KEY_COLUMNS list, in the same order. If the same key already exists, nothing is updated on that row.
MAKE_UP_VALUES: (Optional) If DST_VAL_COLUMNS is specified, then MAKE_UP_VALUES is required. MAKE_UP_VALUES introduces a constant or a comma-separated list of constants that will be assigned to the columns of DST_VAL_COLUMNS. The values in MAKE_UP_VALUES must be in the same order as the columns in the DST_VAL_COLUMNS list. (E.g. -1, 'N/A')
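
A minimal sketch of an MUP_NA step with DST_KEY_COLUMNS = REF_ID, MAKE_UP_KEYS = -1, DST_VAL_COLUMNS = GRP_ID, ROW_SRC and MAKE_UP_VALUES = -1, 'N/A' (again, the actual generated code may differ):

INSERT INTO ABC.TABLE_DST (REF_ID, GRP_ID, ROW_SRC)
SELECT -1, -1, 'N/A'
FROM   DUAL
WHERE  NOT EXISTS (SELECT 1 FROM ABC.TABLE_DST WHERE REF_ID = -1);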

The above covers all the metadata that may need to be filled into the instruction tables. The triggers behind these tables check the validity of the input metadata.

Build and Deploy

WASH_DEPLOY.BUILD_AND_PUBLISH
The all-in-one function for building and publishing a wash cycle.
Parameters:
  inCycle_ID     (required)     The wash cycle to be built and deployed.
  inVer_Comment  (default '')   Comment for this version.
  inGen_Progress (default 'Y')  'Y': generate progress status for every step; 'N': do not generate progress-status code.
  inStep_Commit  (default 'N')  'Y': commit after every step (consider for mass data); 'N': commit only once at the end of the whole wash cycle.
Return Value: The new version number of this publish (each publish generates a new version).
Output: Along with the installed stored procedure of the wash cycle, a version snapshot is automatically generated and tracked for every publish. All version history is stored in the table WASH_VERSION (with deployment status or error messages).
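
A typical call looks like the sketch below (Oracle PL/SQL; the cycle ID is hypothetical and the return type is assumed to be an integer version number):

DECLARE
    tVersion    PLS_INTEGER;   -- return type assumed
BEGIN
    tVersion := WASH_DEPLOY.BUILD_AND_PUBLISH(
                    inCycle_ID     => 'ABC_PRETREAT',
                    inVer_Comment  => 'Initial build',
                    inGen_Progress => 'Y',
                    inStep_Commit  => 'N');
    DBMS_OUTPUT.PUT_LINE('Published version ' || tVersion);
END;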

In many instances, developers may need to run individual steps for trial and debugging convenience. The following diagram shows the internal procedures of the whole build and deployment management.
(Labels in gray indicate the corresponding names of tables, views, stored procedures or functions)

Build and Deploy

WASH_ENGINE.PRECOMPILE
This procedure precompiles every wash-instruction step of a wash cycle into DML SQL.
Parameters:
  inCycle_ID     (required)     The wash cycle to be precompiled.
Output: All planned DML SQL statements are stored in the table WASH_DML_PLAN.

After precompilation, VIEW_WASH_CYCLE_CODE can be used to preview the complete code of the stored procedure that will be generated for the wash cycle.
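
For step-by-step trial and debugging, a precompile-then-preview session could look like this sketch (the cycle ID is hypothetical):

BEGIN
    WASH_ENGINE.PRECOMPILE(inCycle_ID => 'ABC_PRETREAT');
END;
/
SELECT * FROM VIEW_WASH_CYCLE_CODE;   -- preview the code of the stored procedure to be generated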

WASH_DEPLOY.PUBLISH_PROCEDURE
This function publishes a precompiled wash cycle as a stored procedure.
Parameters:
  inCycle_ID     (required)     The wash cycle to be published (deployed).
  inVer_Comment  (default '')   Comment for this version.
  inGen_Progress (default 'Y')  'Y': generate progress status for every step; 'N': do not generate progress-status code.
  inStep_Commit  (default 'N')  'Y': commit after every step (consider for mass data); 'N': commit only once at the end of the whole wash cycle.
Return Value: The new version number of this publish (each publish generates a new version).
Output: Along with the installed stored procedure of the wash cycle, a version snapshot is automatically generated and tracked for every publish. All version history is stored in the table WASH_VERSION (with deployment status or error messages).

As shown in the diagram above, there are two methods of production deployment:

- Method 1 (recommended)
Distribute a manifest of metadata to production if the production database has the same DataWashroom environment set up as a deployment manager.

- Method 2 (left side)
Distribute the final stored procedure to production if the production database is not allowed to host DataWashroom for any reason, or if you don't want people to think you are too lazy to write that boring code yourself. No one else cares whether your code was written at your desk or generated in the washroom.

Utilities by Product

All database packages of this project are organized in the package dependency graph below.

Packages Dependence

The bottom 3 blue packages PROGRESS_TRACK, DEPLOY_UTILITY and CACHE_UTILITY were designed to be used independently.

- PROGRESS_TRACK.REGISTER
- PROGRESS_TRACK.GO_STEP
- PROGRESS_TRACK.ON_ERROR
- PROGRESS_TRACK.POLLING

A common usage pattern looks like the sample below (Oracle version):
    ...
    tProgress_ID    PLS_INTEGER;
BEGIN
    ...
    tProgress_ID := VPI.PROGRESS_TRACK.REGISTER(tTotal_Steps);
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information 1 ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information 2 ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information 3 ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information n ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Done.');
EXCEPTION
WHEN OTHERS THEN
    VPI.PROGRESS_TRACK.ON_ERROR(tProgress_ID, SQLERRM);
    ...
END;
Refer to the generated code of a wash stored procedure (with the inGen_Progress option turned on) for more information about binding progress tracking into a business flow.
The view VIEW_PROGRESS_LOG or VIEW_WASH_LAST_PROGRESS_LOG can then be used to monitor the ongoing progress in the background.
Alternatively, a UI can call the procedure PROGRESS_TRACK.POLLING to display a progress bar.

DEPLOY_UTILITY.EXPORT_INSERT_SQL
This function exports the rows of a table as a script of INSERT SQL statements. It works like the [Export Query Results... to SQL file] toolbar button of a PL/SQL IDE, but is provided as an API and handles the reserved characters [&] and ['].
Parameters:
  inTable_Name   (required)     The table whose rows are to be exported.
  inSrc_Filter   (default NULL) (Optional) Specifies the search conditions used to limit the rows to be exported (E.g. ID_TYPE = 'ISIN'). If this parameter is NULL, all rows of the table will be exported.
  inOrder_By     (default NULL) (Optional) Specifies the order of the rows generated in the INSERT script. (E.g. TRAN_NO DESC, PRICE)
Return Value: The complete script of INSERT SQL statements.
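
A usage sketch (the CLOB return type and the example filter are assumptions):

DECLARE
    tScript    CLOB;   -- return type assumed
BEGIN
    tScript := DEPLOY_UTILITY.EXPORT_INSERT_SQL(
                   inTable_Name => 'WASH_ISTR',
                   inSrc_Filter => 'CYCLE_ID = ''ABC_PRETREAT''',
                   inOrder_By   => 'ISTR_ORDER');
    DBMS_OUTPUT.PUT_LINE(tScript);   -- for brevity only; spool large scripts to a file instead
END;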

CACHE_CONTROL is the central control table, which defines a unique identifier and an expiration period for each cache.
Cache Periods
 
The CACHE_UTILITY package provides global cache control similar to reader-writer locks (refer to MSDN).
A refresh lock differs from a write lock in two ways:
- It only takes effect during the critical period while a cache is expiring, as shown above.
- Only one refresher can obtain the token for exclusive access to perform the refresh; other blocked refreshers will not refresh again once they enter the lock, as shown in the circuit diagram below.
 
Read-Write Lock

It is not recommended to use these APIs directly at the application level.
The next section, [Singleton], will explain the risk of deadlocks and the corresponding precautions in a higher-level encapsulation.

Extensibility and Applicability

Refer to the source code of the 7 built-in wash instructions; extended instructions can also be customized when necessary.

DataWashroom was originally designed for data pretreatment, not for entire processing chains: it simplifies the design of subsequent processing systems by ensuring data quality in a central pretreatment stage.
So the generated stored procedures have the following characteristics:

Prevention over Treatment

A note like this is often seen in public washrooms:
Hand Washing
(See: the Before and After Triggers)

An example for data: [START_DATE, END_DATE] is one of the common patterns for storing slowly changing data in back-office systems, and the date range is error-prone (overlapping or fragmentary) under manual maintenance without validation. However, introducing a checksum can prevent such mistakes as easily as washing hands. E.g.

CREATE TABLE ABC.SEC_HOLDINGS
(
    SEC_ID      VARCHAR2(16)                                     NOT NULL,
    HOLDINGS    NUMBER(19)                                       NOT NULL,
    START_DATE  DATE DEFAULT TO_DATE('1990-01-01', 'yyyy-mm-dd') NOT NULL,
    END_DATE    DATE DEFAULT TO_DATE('9999-12-31', 'yyyy-mm-dd') NOT NULL,

    CONSTRAINT PK_SEC_HOLDINGS PRIMARY KEY (SEC_ID, START_DATE),
    CONSTRAINT UK_SEC_HOLDINGS UNIQUE (END_DATE, SEC_ID),
    CONSTRAINT CK_SEC_HOLDINGS_START_END CHECK (START_DATE <= END_DATE)
);
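
-- Statement-level trigger: after any change to the date columns, verify that for each
-- SEC_ID the [START_DATE, END_DATE] ranges tile exactly (no overlaps and no gaps).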
CREATE OR REPLACE TRIGGER ABC.TRG_SEC_HOLDINGS
AFTER INSERT OR DELETE OR UPDATE OF START_DATE, END_DATE ON ABC.SEC_HOLDINGS
DECLARE
    tSec_ID VARCHAR2(16);
BEGIN
    SELECT
        MIN(SEC_ID) INTO tSec_ID
    FROM
    (
        SELECT
            SEC_ID,
            SUM(END_DATE - START_DATE + 1) - 
               (MAX(END_DATE) - MIN(START_DATE) + 1) AS CHECKSUM
        FROM
            ABC.SEC_HOLDINGS
        GROUP BY
            SEC_ID
    )
    WHERE   ROWNUM    = 1
        AND CHECKSUM <> 0;

    IF tSec_ID IS NOT NULL THEN
        RAISE_APPLICATION_ERROR(-20021,
            'Some [START_DATE, END_DATE]s are overlapped or fragmentary!
             First issue SEC_ID: ' || tSec_ID);
    END IF;
END TRG_SEC_HOLDINGS;
This is just a sample (for a small table) to show the CHECKSUM algorithm. In practice, for an Oracle solution, a row-level trigger should be used in conjunction with the statement-level trigger so that only the changed IDs are validated when the table holds a large amount of data; for a SQL Server solution it is even simpler. The main point here is to illustrate that it is always much easier to prevent a mistake at the source input. It should be noted, however, that this sample does not recommend abusing triggers - mysophobia! After all, life is not only about eating, and data is not only about inputting.
However, if the same validation processing had to be repeated in every subsequent consumer's work, it would become a calamity for the performance and clarity of the whole system. This is like a car factory that does not establish an acceptable vendor quality-assurance system (before parts reach the assembly line), but instead asks every workstation on the assembly line to inspect every accessory/part while assembling it into a car - repairing the part if it is faulty and marking a special flag … you can imagine how complex and inefficient the assembly line would be!
Cleaning the runway of an airfield is the airport company's responsibility. Airline companies should not be held responsible for runway clearance.
 

Data Pollution Report

Distributing a complete list of bad data (based on known rules) every day to every data producer may help people track down the accurate pathogeny. Sooner or later, every problematic source application (design) will be laid bare. No matter how arrogant or humble, strong or weak an application is, all data sits equal on the toilet.

A utility for generating these data-pollution report queries (views) will be introduced in a later release.

Requirement on requirement
In order to produce a compact and well-structured requirement for a clean system, it is recommended to keep some basic principles in mind during requirements analysis.

  1. Disentangle functional dependencies between attributes of source systems based on their true business meanings.
    As an isolation layer for basis data, the target system needs to be in 3NF or BCNF. Some normalized relationships might already be safeguarded in the source systems; the requirement specification must list the rest of them, and some redundant attributes can be dropped.
  2. Unify different units of measurement into a standardized system of units.
    For example: converting all units of measurement into international standard units (miles/h => km/h, yards => metres) if the source systems use multiple different systems of units; converting all local currencies into the settlement currency; converting mixed ISO code systems (two-letter and three-letter codes) into a single ISO code system (either one of them); etc.
  3. Act as a global cache of basis data from the source systems. Providing downstream back-office systems with a clean data source optimized for query performance is a key service purpose of DataWashroom. It is the requirement specification's responsibility to investigate:
    • Which attributes are often used as keys to look up others?
    • Which attributes are usually coupled together in consumers' queries?
    • What is the volume of data and its growth rate?
    • What are the probability distributions of the data partitions used in most queries?
    These investigations provide substantial arguments for the physical design of the target schema.
     
  4. Summarize data quality requirements into 3 basic parts:
    • Regular rules within the same row.
      Such as: non-null constraints, reasonable value ranges and comparisons, etc.
    • Business-key uniqueness and relationship conflicts.
    • Relation-chain directivity and connectivity.

Sufficient analysis is intended to simplify solutions through induction, trimming and summarization, not to complicate matters.

Supported Databases

Contribute