Project Description
Imagine what human society would be like without washrooms. In the era of information proliferation, data likewise produces waste that must be contained and whose disposal must be controlled.
DataWashroom provides the filter core of a data-purification system: the common functions of data isolation and waste separation, driven by a simple instruction set (7 instructions). The instruction engine handles pre-compiling, building, publishing
(deploying), version control and documentation generation. (It compiles; it does not interpret.) The input of the compiler is metadata in the wash-instruction tables; the output is a stored procedure.
For more on the intentions behind this project, see also the [Data Quality] section on the related project wiki page at http://view.codeplex.com.
Instruction Set
There are 7 built-in wash instructions, shown as follows:
Clean-up is not just about deleting historical data that is no longer needed, or garbage data. The clean-up here focuses particularly on the treatment of bad data. In order to distill a generalized, common solution from countless symptomatic treatments of bad data, the design of the built-in wash instructions is based on the following conventions:
The duplicate key treatment is subdivided into two alternative methods:
- Check Unique Key (CHK_UK)
Only qualified rows are adopted; all duplicate rows are excluded.
- Sort Duplicate Key (RNK_DK)
Assigns a sequential number to every row within each partition of the supposed unique key, by a specified ordering rule, so that the first (top 1) row of each supposed unique key can be picked out directly.
Wash Cycle
All 7 wash instructions inherit from the base WASH_ISTR. They are organized in sequence under a wash cycle.
WASH_CYCLE

A wash cycle is a sequential workflow of wash instructions.

| Column | Description |
| --- | --- |
| CYCLE_ID | The unique mnemonic identifier of a wash cycle; consider a naming convention within the enterprise (like a short namespace). |
| PROCEDURE_NAME | Defines the name of the stored procedure to be generated for the wash cycle, just as an executable file name is to an application. The schema name must be included in the stored procedure name (e.g. ABC.PRETREATMENT). |
| DESCRIPTION_ | An arbitrary introduction to the wash cycle; it can be as brief as an application name. |
| OWNER_ | The owner of the wash cycle. |
| LATEST_VERSION | A maintenance-free column initialized to 0; it is used internally by version control. |
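For example, registering a wash cycle could be as simple as inserting one metadata row. A minimal sketch, using the columns documented above; the sample values are hypothetical:

```sql
-- A sketch of registering a wash cycle; 'DPT.WASH_A' and the
-- other sample values are hypothetical.
INSERT INTO WASH_CYCLE (CYCLE_ID, PROCEDURE_NAME, DESCRIPTION_, OWNER_, LATEST_VERSION)
VALUES ('DPT.WASH_A', 'ABC.PRETREATMENT', 'Pretreatment of the ABC feeds', 'DPT team', 0);
```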
Wash Instructions
WASH_ISTR

This is the base table of all 7 wash-instruction tables. Each row of this table is a wash-instruction declaration. For a top-down design, this table can also be used for storing the requirement or outline design of every wash step of a wash cycle. To the compiler, this base table is like a header file in C++.

| Column | Description |
| --- | --- |
| ISTR_ID | Defines the unique mnemonic identifier of a wash instruction; consider a naming convention within the enterprise. |
| CYCLE_ID | The wash cycle to which the wash instruction belongs. |
| ISTR_TYPE | The type of the instruction, one of: DELETE, COPY, MERGE, CHK_UK, RNK_DK, MUP_MK, MUP_NA. It indicates which derived instruction table holds the definition of the instruction. |
| ISTR_ORDER | The step ordinal of the instruction within its wash cycle. |
| DESCRIPTION_ | A brief introduction of what this step is going to do. It is optional, but it is useful for generating the progress status of every step and for documentation generation. |
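Each wash step is then declared by one row in the base table; a sketch, with hypothetical IDs and descriptions:

```sql
-- A sketch of declaring two steps of a wash cycle; the IDs,
-- ordinals and descriptions are hypothetical.
INSERT INTO WASH_ISTR (ISTR_ID, CYCLE_ID, ISTR_TYPE, ISTR_ORDER, DESCRIPTION_)
VALUES ('DPT.CLR_WORK', 'DPT.WASH_A', 'DELETE', 1, 'Clear the work table');

INSERT INTO WASH_ISTR (ISTR_ID, CYCLE_ID, ISTR_TYPE, ISTR_ORDER, DESCRIPTION_)
VALUES ('DPT.CP_SRC', 'DPT.WASH_A', 'COPY', 2, 'Copy the source feed into the work table');
```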
The 7 derived wash-instruction tables:
WASH_ISTR_DELETE

Removes rows from a table. For instance, to refresh data from a source system, a DELETE step may need to run before a COPY.

| Column | Description |
| --- | --- |
| ISTR_ID | The ISTR_ID embodies the inheritance from the base WASH_ISTR. |
| DST_TABLE | Specifies the table from which the rows are to be deleted. The schema name is always required in the table name (e.g. ABC.WORK_TABLE). |
| DST_FILTER | Specifies the conditions used to limit the rows that are deleted (e.g. ID_TYPE = 'ISIN'). If this column is NULL, DELETE removes all the rows from the table. |
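For illustration, a DELETE instruction with the example values above might compile into DML along these lines (a sketch; the actual generated code may differ):

```sql
-- Sketch of the DML a DELETE instruction could compile into;
-- table and filter are taken from the examples above.
DELETE FROM ABC.WORK_TABLE
WHERE  ID_TYPE = 'ISIN';  -- the WHERE clause is omitted when DST_FILTER is NULL
```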
WASH_ISTR_COPY

Copies all matching columns (by column name) of data from a source table or view to a destination table.

| Column | Description |
| --- | --- |
| ISTR_ID | The ISTR_ID embodies the inheritance from the base WASH_ISTR. |
| SRC_VIEW | Specifies the source table or view to be copied from. The schema name is always required in the table or view name (e.g. ABC.VIEW_SRC). |
| SRC_FILTER | Specifies the search conditions used to limit the rows to be copied (e.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the source are copied. |
| DST_TABLE | Specifies the destination table to which the rows are to be copied. The schema name is always required in the table name (e.g. ABC.TABLE_DST). |
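A sketch of the DML a COPY instruction could compile into; the column list shown here is hypothetical, since the compiler resolves the matching columns by name:

```sql
-- Sketch of a compiled COPY step; POS_ID, GRP_ID and DATE_ stand in
-- for the columns shared by source and destination.
INSERT INTO ABC.TABLE_DST (POS_ID, GRP_ID, DATE_)
SELECT POS_ID, GRP_ID, DATE_
FROM   ABC.VIEW_SRC
WHERE  ID_TYPE = 'ISIN';  -- omitted when SRC_FILTER is NULL
```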
WASH_ISTR_MERGE

Merges specified matching columns (by column name) of data from a source table or view into a target table: updating the specified columns of the target table if a matching row exists, or inserting the data as a new row if no matching row exists.

| Column | Description |
| --- | --- |
| ISTR_ID | The ISTR_ID embodies the inheritance from the base WASH_ISTR. |
| SRC_VIEW | Specifies the source table or view to be merged from. The schema name is always required in the table or view name (e.g. ABC.VIEW_SRC). |
| SRC_FILTER | Specifies the search conditions used to limit the rows to be merged (e.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows from the source are consumed as the source data. |
| DST_TABLE | Specifies the destination table into which the rows are to be merged. The schema name is always required in the table name (e.g. ABC.TABLE_DST). |
| JOIN_COLUMNS | Specifies the list of columns on which the source table/view is joined with the target table to determine where they match. Multiple columns must be delimited by commas (e.g. POS_ID, GRP_ID, DATE_). The column names must exist in both the source table/view and the destination table. |
| UPDATE_COLUMNS | Specifies a comma-separated list of columns to be updated in the destination table from matching rows of the source table/view (matched by JOIN_COLUMNS). A column referenced in the JOIN_COLUMNS list cannot be included in the UPDATE_COLUMNS list. |
| INSERT_COLUMNS | Specifies a comma-separated list of columns to be copied from the source table/view to the destination table when no matching row exists. A column in the JOIN_COLUMNS list can also be included in the INSERT_COLUMNS list. |

UPDATE_COLUMNS and INSERT_COLUMNS are optional, but at least one of them must be specified; both may be specified.
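For illustration, a MERGE instruction with the example join columns above, and a hypothetical PRICE column as the UPDATE_COLUMNS, might compile into DML along these lines (a sketch; the actual generated code may differ):

```sql
-- Sketch of a compiled MERGE step; PRICE is a hypothetical column.
MERGE INTO ABC.TABLE_DST D
USING (SELECT * FROM ABC.VIEW_SRC WHERE ID_TYPE = 'ISIN') S
ON (D.POS_ID = S.POS_ID AND D.GRP_ID = S.GRP_ID AND D.DATE_ = S.DATE_)  -- JOIN_COLUMNS
WHEN MATCHED THEN UPDATE SET
     D.PRICE = S.PRICE                                                  -- UPDATE_COLUMNS
WHEN NOT MATCHED THEN INSERT
     (POS_ID, GRP_ID, DATE_, PRICE)                                     -- INSERT_COLUMNS
VALUES (S.POS_ID, S.GRP_ID, S.DATE_, S.PRICE);
```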
WASH_ISTR_CHK_UK

Check Unique Key – checks uniqueness by a supposed business key, applying one tag expression to a row whose key is unique and another to a row whose key is duplicated.

| Column | Description |
| --- | --- |
| ISTR_ID | The ISTR_ID embodies the inheritance from the base WASH_ISTR. |
| DST_TABLE | Specifies the target table to be checked. The schema name is always required in the table name (e.g. ABC.TABLE_DST). |
| DST_FILTER | Specifies the search conditions used to limit the rows to be checked (e.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows of the table are checked. |
| KEY_COLUMNS | Specifies a column, or a list of columns, which is supposed to be a unique key. A composite key (two or more columns) must be delimited by commas (e.g. DATE_, POS_ID). |
| SET_EXPR_IF_UNIQUE | Specifies a comma-separated list of clauses {column_name = expression}[ ,...n ] for tagging a row whose supposed key is unique. For example: IS_QUALIFIED = 'Y', REDIRECT_CODE = ORIG_CODE. |
| SET_EXPR_IF_DUPLICATE | Specifies a comma-separated list of clauses {column_name = expression}[ ,...n ] for tagging a row whose supposed key is duplicated. For example: IS_QUALIFIED = 'N', REDIRECT_CODE = 'N/A'. |

At least one of SET_EXPR_IF_UNIQUE and SET_EXPR_IF_DUPLICATE must be specified; normally both should be specified.
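One plausible implementation sketch of a CHK_UK step, using the example key and tag expressions above (the actual generated DML may differ):

```sql
-- Sketch of a compiled CHK_UK step: count the rows per supposed key,
-- then tag each row according to whether its key is unique.
MERGE INTO ABC.TABLE_DST D
USING
(
    SELECT ROWID AS RID,
           COUNT(*) OVER (PARTITION BY DATE_, POS_ID) AS KEY_CNT  -- KEY_COLUMNS
    FROM   ABC.TABLE_DST
    WHERE  ID_TYPE = 'ISIN'                                       -- DST_FILTER
) S
ON (D.ROWID = S.RID)
WHEN MATCHED THEN UPDATE SET
    D.IS_QUALIFIED  = CASE WHEN S.KEY_CNT = 1 THEN 'Y' ELSE 'N' END,
    D.REDIRECT_CODE = CASE WHEN S.KEY_CNT = 1 THEN D.ORIG_CODE ELSE 'N/A' END;
```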
WASH_ISTR_RNK_DK

Rank Duplicate Key – checks uniqueness by a supposed business key, ranks the rows within each partition of the supposed key, and assigns a sequential number to every row, starting at 1 for the first row in each partition.

| Column | Description |
| --- | --- |
| ISTR_ID | The ISTR_ID embodies the inheritance from the base WASH_ISTR. |
| DST_TABLE | Specifies the target table to be checked. The schema name is always required in the table name (e.g. ABC.TABLE_DST). |
| DST_FILTER | Specifies the search conditions used to limit the rows to be checked (e.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows of the table are checked. |
| KEY_COLUMNS | Specifies a column, or a list of columns, which is supposed to be a unique key. A composite key (two or more columns) must be delimited by commas (e.g. DATE_, POS_ID). |
| ORDER_BY | The ORDER_BY clause determines the sequence in which the rows are assigned their unique ROW_NUMBER within each partition (e.g. TRAN_NO DESC, PRICE). It is required. |
| RN_COLUMN | Specifies the column to be filled with the ROW_NUMBER. It is required; the column type must be NUMBER or a compatible type. |
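A plausible sketch of an RNK_DK step, using the example values above; DUP_RANK stands in for a hypothetical RN_COLUMN:

```sql
-- Sketch of a compiled RNK_DK step: number the rows within each
-- partition of the supposed key, then store the number in RN_COLUMN.
MERGE INTO ABC.TABLE_DST D
USING
(
    SELECT ROWID AS RID,
           ROW_NUMBER() OVER (PARTITION BY DATE_, POS_ID          -- KEY_COLUMNS
                              ORDER BY TRAN_NO DESC, PRICE) AS RN -- ORDER_BY
    FROM   ABC.TABLE_DST
) S
ON (D.ROWID = S.RID)
WHEN MATCHED THEN UPDATE SET D.DUP_RANK = S.RN;  -- DUP_RANK is hypothetical
```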
WASH_ISTR_MUP_MK

Make up Missing Keys – a compensation step: inserts the unique rows that are contained in the source table/view (selecting the distinct supposed foreign key and accompanying values) but were originally missing from the target table (by its supposed primary key).

| Column | Description |
| --- | --- |
| ISTR_ID | The ISTR_ID embodies the inheritance from the base WASH_ISTR. |
| SRC_VIEW | The source table/view that references the universal set of the supposed foreign key. The schema name is always required in the table or view name (e.g. ABC.VIEW_SRC). |
| SRC_FILTER | Specifies the search conditions used to limit the rows to be matched (e.g. ID_TYPE = 'ISIN'). If this column is NULL, all the rows of the source table/view are taken as the universal set. |
| SRC_KEY_COLUMNS | Specifies a column, or a comma-separated list of columns, which is supposed to be a foreign key and will be joined with DST_KEY_COLUMNS (the supposed primary key of the target table). |
| DST_TABLE | The target table to be checked and made up. The schema name is always required in the table name (e.g. ABC.TABLE_DST). |
| DST_KEY_COLUMNS | Specifies a column, or a comma-separated list of columns, which is supposed to be the primary key of the target table and can be joined with the SRC_KEY_COLUMNS of SRC_VIEW. |
| DST_VAL_COLUMNS | (Optional) When a new compensating row is inserted into the target table, some non-key columns may need to be assigned a special value (such as 'Dummy', 'Unknown', 'N/A', -1, '1900-01-01' …). DST_VAL_COLUMNS specifies a column, or a comma-separated list of columns (e.g. GRP_ID, ROW_SRC), to be assigned. A column listed in DST_KEY_COLUMNS cannot be included in DST_VAL_COLUMNS. |
| SRC_VALUES | (Optional) If DST_VAL_COLUMNS is specified, then SRC_VALUES is required. SRC_VALUES specifies a value expression, or a comma-separated list of value expressions, to be loaded into DST_VAL_COLUMNS while making up. An expression can be a constant (e.g. -1, 'Made-up') or an expression referencing SRC_VIEW (e.g. -1, LAST_UPD). Naturally, if all expressions contain only constants, only unique rows (by the key columns) will be inserted into the target table. Otherwise, when an expression references columns of SRC_VIEW, SRC_VIEW is responsible for ensuring the uniqueness of the result set. |
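A sketch of what a MUP_MK compensation could look like, assuming POS_ID plays the SRC_KEY_COLUMNS/DST_KEY_COLUMNS role and the SRC_VALUES are constants (the actual generated DML may differ):

```sql
-- Sketch of a compiled MUP_MK step: insert the distinct keys found in
-- the source but missing from the target, with constant make-up values.
INSERT INTO ABC.TABLE_DST (POS_ID, GRP_ID, ROW_SRC)     -- keys + DST_VAL_COLUMNS
SELECT DISTINCT S.POS_ID, -1, 'Made-up'                 -- constant SRC_VALUES
FROM   ABC.VIEW_SRC S
WHERE  S.ID_TYPE = 'ISIN'                               -- SRC_FILTER
AND    NOT EXISTS (SELECT 1 FROM ABC.TABLE_DST D WHERE D.POS_ID = S.POS_ID);
```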
WASH_ISTR_MUP_NA

Make up an N/A key – inserts a special primary key into the target table, if it does not already exist, as a reserved row for substituting N/A cases (such as null references, exception replacement …).

| Column | Description |
| --- | --- |
| ISTR_ID | The ISTR_ID embodies the inheritance from the base WASH_ISTR. |
| DST_TABLE | The target table to be made up. The schema name is always required in the table name (e.g. ABC.TABLE_DST). |
| DST_KEY_COLUMNS | Specifies a column, or a comma-separated list of columns, which is the primary key of the target table (e.g. REF_ID). |
| DST_VAL_COLUMNS | (Optional) When the reserved row is inserted into the target table, some non-key columns may need to be assigned special attributes (such as 'Dummy', 'Unknown', 'N/A', -1, '1900-01-01' …). DST_VAL_COLUMNS specifies a column, or a comma-separated list of columns (e.g. GRP_ID, ROW_SRC), to be assigned. A column listed in DST_KEY_COLUMNS cannot be included in DST_VAL_COLUMNS. |
| MAKE_UP_KEYS | Introduces a constant, or a comma-separated list of constants, for the primary key columns to be inserted if the key does not already exist (e.g. -1). There must be one data value for each column in the DST_KEY_COLUMNS list, in the same order. If the same key already exists, nothing on that row is updated. |
| MAKE_UP_VALUES | (Optional) If DST_VAL_COLUMNS is specified, then MAKE_UP_VALUES is required. MAKE_UP_VALUES introduces a constant, or a comma-separated list of constants, to be assigned to the columns of DST_VAL_COLUMNS. The values in MAKE_UP_VALUES must be in the same order as the columns in the DST_VAL_COLUMNS list (e.g. -1, 'N/A'). |
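A sketch of what a MUP_NA make-up could look like, assuming REF_ID as the DST_KEY_COLUMNS and -1 as the MAKE_UP_KEYS:

```sql
-- Sketch of a compiled MUP_NA step: insert the reserved N/A row only
-- if the make-up key does not already exist.
INSERT INTO ABC.TABLE_DST (REF_ID, GRP_ID, ROW_SRC)  -- key + DST_VAL_COLUMNS
SELECT -1, -1, 'N/A' FROM DUAL                       -- MAKE_UP_KEYS, MAKE_UP_VALUES
WHERE NOT EXISTS (SELECT 1 FROM ABC.TABLE_DST WHERE REF_ID = -1);
```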
The above covers all the metadata that may need to be filled into the instruction tables. The triggers behind these tables check the validity of the input metadata.
Build and Deploy
WASH_DEPLOY.BUILD_AND_PUBLISH

The combined function for building and publishing a wash cycle.

| Parameter | Default | Description |
| --- | --- | --- |
| inCycle_ID | | The wash cycle to be built and deployed. |
| inVer_Comment | '' | Comment for this version. |
| inGen_Progress | 'Y' | 'Y': generate progress status for every step; 'N': do not generate the code for progress status. |
| inStep_Commit | 'N' | 'Y': commit after every step (consider this for mass data); 'N': commit only once at the end of the whole wash cycle. |

Return value: the new version number of this publish (every publish generates a new version).

Output: alongside the stored procedure of the wash cycle being installed, a version snapshot is automatically generated and tracked for every publish. All version history is stored in the table WASH_VERSION (with deployment status or error messages).
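For example, a complete build-and-publish call could look like the sketch below; the cycle ID is hypothetical and the numeric return type is an assumption:

```sql
-- A sketch of building and publishing a wash cycle in one call;
-- 'DPT.WASH_A' is a hypothetical cycle ID, and PLS_INTEGER is an
-- assumed type for the returned version number.
DECLARE
    tVersion PLS_INTEGER;
BEGIN
    tVersion := WASH_DEPLOY.BUILD_AND_PUBLISH(
                    inCycle_ID     => 'DPT.WASH_A',
                    inVer_Comment  => 'Initial release',
                    inGen_Progress => 'Y',
                    inStep_Commit  => 'N');
    DBMS_OUTPUT.PUT_LINE('Published as version ' || tVersion);
END;
```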
In many instances, a developer may need to run individual steps for trial and debugging convenience. The following diagram shows the internal procedures of the whole build and deployment management.
(Labels in gray indicate the corresponding names of tables, views, stored procedures, or functions.)
WASH_ENGINE.PRECOMPILE

The procedure precompiles every wash-instruction step of a wash cycle into DML SQL.

| Parameter | Default | Description |
| --- | --- | --- |
| inCycle_ID | | The wash cycle to be precompiled. |

Output: all generated DML SQL is staged in the table WASH_DML_PLAN.
After precompilation, VIEW_WASH_CYCLE_CODE can be used to preview the complete code of the stored procedure that will be generated for the wash cycle.
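A sketch of this trial workflow, using a hypothetical cycle ID:

```sql
-- Precompile one cycle, then inspect the staged output.
BEGIN
    WASH_ENGINE.PRECOMPILE(inCycle_ID => 'DPT.WASH_A');
END;
/
SELECT * FROM WASH_DML_PLAN;         -- the DML planned for every step
SELECT * FROM VIEW_WASH_CYCLE_CODE;  -- preview of the procedure to be generated
```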
WASH_DEPLOY.PUBLISH_PROCEDURE

The function publishes a precompiled wash cycle as a stored procedure.

| Parameter | Default | Description |
| --- | --- | --- |
| inCycle_ID | | The wash cycle to be published (deployed). |
| inVer_Comment | '' | Comment for this version. |
| inGen_Progress | 'Y' | 'Y': generate progress status for every step; 'N': do not generate the code for progress status. |
| inStep_Commit | 'N' | 'Y': commit after every step (consider this for mass data); 'N': commit only once at the end of the whole wash cycle. |

Return value: the new version number of this publish (every publish generates a new version).

Output: alongside the stored procedure of the wash cycle being installed, a version snapshot is automatically generated and tracked for every publish. All version history is stored in the table WASH_VERSION (with deployment status or error messages).
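The precompiled cycle can then be published separately, e.g. (a sketch; as above, the numeric return type is an assumption):

```sql
-- A sketch of publishing an already-precompiled cycle.
DECLARE
    tVersion PLS_INTEGER;
BEGIN
    tVersion := WASH_DEPLOY.PUBLISH_PROCEDURE(
                    inCycle_ID    => 'DPT.WASH_A',
                    inVer_Comment => 'Rebuilt after trial runs');
END;
```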
As shown in the diagram above, there are two ways of production deployment:
- Method 1 (recommended)
Distribute a manifest of metadata to production, provided the production database has the same DataWashroom environment set up as the deployment manager.
- Method 2 (left side)
Distribute the final stored procedure to production, if the production database is not allowed to host DataWashroom for any reason, or if you don't want people to think you were too lazy to write that boring code. No one will care whether your code was written at your desk or generated in the washroom.
Utilities by Product
All database packages of this project are organized in the package dependency graph below.
The bottom 3 blue packages, PROGRESS_TRACK, DEPLOY_UTILITY and CACHE_UTILITY, were designed to be usable independently.
```sql
...
tProgress_ID  PLS_INTEGER;
BEGIN
    ...
    tProgress_ID := VPI.PROGRESS_TRACK.REGISTER(tTotal_Steps);
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information 1 ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information 2 ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information 3 ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Status information n ...');
    ...
    VPI.PROGRESS_TRACK.GO_STEP(tProgress_ID, NULL, 'Done.');
EXCEPTION
    WHEN OTHERS THEN
        VPI.PROGRESS_TRACK.ON_ERROR(tProgress_ID, SQLERRM);
        ...
END;
```
DEPLOY_UTILITY.EXPORT_INSERT_SQL

This function exports the rows of a table into a script of INSERT SQL statements. It works like the toolbar button [Export Query Results... to SQL file] of a PL/SQL IDE, but is provided as an API and handles the reserved characters [&] and ['].

| Parameter | Default | Description |
| --- | --- | --- |
| inTable_Name | | The table whose rows are to be exported. |
| inSrc_Filter | NULL | (Optional) Specifies the search conditions used to limit the rows to be exported (e.g. ID_TYPE = 'ISIN'). If this parameter is NULL, all the rows of the table are exported. |
| inOrder_By | NULL | (Optional) Specifies the order of the rows generated in the INSERT script (e.g. TRAN_NO DESC, PRICE). |

Return value: the complete script of INSERT SQL statements.
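A usage sketch; the CLOB return type is an assumption (the page only says "the complete script"), and the table and filter are illustrative:

```sql
-- A sketch of exporting filtered rows as an INSERT script.
DECLARE
    tScript CLOB;  -- assumed return type
BEGIN
    tScript := DEPLOY_UTILITY.EXPORT_INSERT_SQL(
                   inTable_Name => 'ABC.SEC_HOLDINGS',
                   inSrc_Filter => 'SEC_ID = ''US0378331005''',
                   inOrder_By   => 'START_DATE');
    DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(tScript, 4000, 1));
END;
```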
CACHE_CONTROL is the central control table, which defines a unique identifier and an expiration for each cache.
The CACHE_UTILITY package provides a global cache control similar to Reader-Writer Locks (refer to MSDN).
A refresh lock differs from a write lock:
- It only works in the critical period while the cache is expiring, as shown above.
- Only one refresher can obtain the token for exclusive access to refresh; the other blocked refreshers will not refresh again when they enter the lock, as seen in the circuit diagram below.
It is not recommended to use these APIs directly at the application level.
The next section, [Singleton], explains the risk of deadlocks and the corresponding precautions in a higher-level encapsulation.
Extensibility and Applicability
Referring to the source code of the 7 built-in wash instructions, extended instructions can also be customized when necessary.
DataWashroom was originally designed for data pretreatment, not for entire processing pipelines: the aim is to simplify the designs of subsequent processing systems by ensuring data quality in the central pretreatment stage.
So the generated stored procedure has the following characteristics:
The following diagram shows the internal implementation of the cache lock mechanism.
In practice, a dedicated schema would be allocated to a cache to organize its target tables. This helps consumers be clear about which cache they are using. Combined with a naming convention (e.g. a suffix in the schema name, a prefix in the table names), whether a schema contains a single cache or multiple caches, all roads lead to clarity, and unnecessary configuration is reduced.
Risks of deadlock

```sql
PROCEDURE MY_DM_FROM_CACHEA(...)
AS
BEGIN
    ...
    INSERT INTO XYZ.LOCAL_TABLE1 (...)
    SELECT ... FROM ABC.GLOBAL_CACHE_TABLE1, ... WHERE ...;
    ...
    UPDATE XYZ.LOCAL_TABLE2 L
    SET    COL1 = (SELECT G.VAL1 FROM ABC.GLOBAL_CACHE_TABLE2 G WHERE ...);
    ...
END MY_DM_FROM_CACHEA;
```

```sql
PROCEDURE PUB_DM_FROM_CACHEA(...)
AS
    tCycle_ID CONSTANT VARCHAR2(32) := 'DPT.WASH_A';  -- The Cycle ID is also the Cache ID.
BEGIN
    VPI.WASH_GLOBAL_CACHE.ENTER_LOCK(tCycle_ID);
    MY_DM_FROM_CACHEA(...);
    VPI.WASH_GLOBAL_CACHE.EXIT_LOCK(tCycle_ID);
EXCEPTION
    WHEN OTHERS THEN
        VPI.WASH_GLOBAL_CACHE.EXIT_LOCK(tCycle_ID);
        RAISE;
END PUB_DM_FROM_CACHEA;
```
OK:

```sql
ENTER_LOCK('CACHEA');
DM_CACHEA(...);
EXIT_LOCK('CACHEA');
ENTER_LOCK('CACHEB');
DM_CACHEB(...);
EXIT_LOCK('CACHEB');
```

Throws an exception (nested locks):

```sql
ENTER_LOCK('CACHEA');
ENTER_LOCK('CACHEB');
DM_CACHEAB(...);
EXIT_LOCK('CACHEB');
EXIT_LOCK('CACHEA');
```

Throws an exception (overlapping locks):

```sql
ENTER_LOCK('CACHEA');
DM_CACHEA(...);
ENTER_LOCK('CACHEB');
DM_CACHEAB(...);
EXIT_LOCK('CACHEA');
DM_CACHEB(...);
EXIT_LOCK('CACHEB');
```
To access multiple caches within one lock scope, pass an array of cache IDs instead:

```sql
PROCEDURE PUB_DM_FROM_CACHEAB(...)
AS
    tCycle_IDs CONSTANT VCH_ID_ARY := VCH_ID_ARY('DPT.WASH_A', 'MKT.WASH_B' ...);  -- An array of Cache IDs.
BEGIN
    VPI.WASH_GLOBAL_CACHE.ENTER_LOCK(tCycle_IDs);
    MY_DM_FROM_CACHEAB(...);
    VPI.WASH_GLOBAL_CACHE.EXIT_LOCK(tCycle_IDs);
EXCEPTION
    WHEN OTHERS THEN
        VPI.WASH_GLOBAL_CACHE.EXIT_LOCK(tCycle_IDs);
        RAISE;
END PUB_DM_FROM_CACHEAB;
```
Prevention over Treatment
Such notes are often seen in public washrooms:
(See: the Before and After Triggers)
An example for data: [START_DATE, END_DATE] ranges are a common mode of storing slowly changing data in back-office systems, and the date ranges are error-prone (overlapping or fragmentary) under manual maintenance without validation. Introducing a checksum, however, can prevent such mistakes as easily as washing hands. E.g.
```sql
CREATE TABLE ABC.SEC_HOLDINGS
(
    SEC_ID      VARCHAR2(16)  NOT NULL,
    HOLDINGS    NUMBER(19)    NOT NULL,
    START_DATE  DATE          DEFAULT TO_DATE('1990-01-01', 'yyyy-mm-dd') NOT NULL,
    END_DATE    DATE          DEFAULT TO_DATE('9999-12-31', 'yyyy-mm-dd') NOT NULL,
    CONSTRAINT PK_SEC_HOLDINGS PRIMARY KEY (SEC_ID, START_DATE),
    CONSTRAINT UK_SEC_HOLDINGS UNIQUE (END_DATE, SEC_ID),
    CONSTRAINT CK_SEC_HOLDINGS_START_END CHECK (START_DATE <= END_DATE)
);
```
```sql
CREATE OR REPLACE TRIGGER ABC.TRG_SEC_HOLDINGS
    AFTER INSERT OR DELETE OR UPDATE OF START_DATE, END_DATE
    ON ABC.SEC_HOLDINGS
DECLARE
    tSec_ID VARCHAR2(16);
BEGIN
    SELECT MIN(SEC_ID) INTO tSec_ID
    FROM
    (
        SELECT SEC_ID,
               SUM(END_DATE - START_DATE + 1) - (MAX(END_DATE) - MIN(START_DATE) + 1) AS CHECKSUM
        FROM   ABC.SEC_HOLDINGS
        GROUP  BY SEC_ID
    )
    WHERE ROWNUM = 1 AND CHECKSUM <> 0;

    IF tSec_ID IS NOT NULL THEN
        RAISE_APPLICATION_ERROR(-20021,
            'Some [START_DATE, END_DATE]s are overlapped or fragmentary! First issue SEC_ID: ' || tSec_ID);
    END IF;
END TRG_SEC_HOLDINGS;
```
Data Pollution Report
Distributing a complete list of bad data (based on known rules) every day to all data producers may help people track down the precise pathogeny. Sooner or later, every problematic source application (or design) will be laid bare. No matter how arrogant or humble, strong or weak an application is, all data sits equal on the toilet.
A utility for generating these queries (views) of the data pollution report will be introduced in a later release.
Requirements on Requirements
In order to have a compact and well-structured requirement for a clean system, some basic principles are recommended to keep in mind during requirements analysis.
Sufficient analysis is intended to simplify solutions through induction, trimming, and summarization, not to complicate matters.
Supported Databases
Contribute