Is your data fit for use?


If you purchase a data governance policy course with Nicola Askham, you can join a conference call with her once a month. I remember one of those calls like it was yesterday. Back then, I was consulting for SMEs and shared my challenge of creating a data governance framework for them. She said something along these lines (not a direct quote): “Data governance policies should not be created for the sake of creating policies. The data governance policy should provide an overall framework for sustainable data quality improvements.” “How is (was) the data quality level in the organizations you worked with?” she asked. “It has been good,” I replied confidently. Little did I know… After that conversation, time and again, I got the feeling that there is more to data quality issues than meets the eye. That is the reasoning behind this data quality endeavor, this post, and the upcoming ones.

What?

I had my ‘Aha’ moment while reading the DMBoK (Data Management Body of Knowledge), which says that a data quality evaluation is an evaluation of ‘fitness for use’, i.e. of the extent to which data serves the purposes of its use. Data is of high quality as long as it meets the expectations and needs of data consumers [1]. As such, does the set of records below have a data quality problem?

[Image: example set of records]

It depends.

How or What are data quality dimensions?

When you start reading materials on data quality, you will inevitably come across the term ‘data quality dimensions’.

Data quality can be measured through data quality dimensions. In their article ‘Data Quality Dimensions, Metrics, and Improvement Techniques’, Menna Ibrahim Gabr, Yehia Mostafa Helmy, and Doaa Saad Elzanfaly list 68 traditional data quality dimensions, as well as the dimensions and techniques used to evaluate the quality of big data and resolve potential quality issues [2]. You can view an adapted version of their summary in full-screen mode at https://datawrapper.dwcdn.net/x0PO2/5/ or search it from the current page.

https://datawrapper.dwcdn.net/x0PO2/5/

For example, to find the dimensions used to evaluate data quality at the record level (a row, in tabular data), the attribute level (a column, in tabular data), the dataset level (a collection of data), or the cross-data-source level (several data sources, e.g. databases, SaaS applications, flat files, etc.), you can use the global search filter. Feel free to leave your comments in the original Google Sheet at https://docs.google.com/spreadsheets/d/1PfJa-TOztErtfLADAKTzINT6QDw9L7jMTLLJ4EN6Qo0/edit?usp=sharing so that I can re-publish the table to reflect them.

The Data Management Association UK (DAMA UK) [3] recommends using six dimensions: ‘Completeness’, ‘Uniqueness’, ‘Timeliness’, ‘Validity’, ‘Accuracy’, and ‘Consistency’.

If we were to illustrate the data quality problems in our example, it would look like this:

[Image: data quality problems in the example records. Inspired by info@idwbi.com]

How or What are data quality measures?

To assess data quality dimensions, different metrics are used. The most commonly used are ValidDQ(r), or Valid Data Quality rule, and InvalidDQ(r), or Invalid Data Quality rule [3].

ValidDQ(r) = (TestExecutions - ExceptionsFound) / TestExecutions

InvalidDQ(r) = ExceptionsFound / TestExecutions

where TestExecutions is the total number of records tested and ExceptionsFound is the number of incorrect values.
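The two ratios can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the sample email values, and the rule itself (a value must contain "@") are hypothetical and do not come from the cited sources.

```python
def valid_dq(test_executions: int, exceptions_found: int) -> float:
    """Share of tested records that passed the data quality rule."""
    if test_executions <= 0:
        raise ValueError("test_executions must be positive")
    return (test_executions - exceptions_found) / test_executions


def invalid_dq(test_executions: int, exceptions_found: int) -> float:
    """Share of tested records that failed the data quality rule."""
    if test_executions <= 0:
        raise ValueError("test_executions must be positive")
    return exceptions_found / test_executions


# Hypothetical sample data and rule: an email value must contain "@".
emails = ["a@example.com", "b@example.com", "not-an-email", "c@example.com"]
exceptions = sum(1 for value in emails if "@" not in value)

valid_share = valid_dq(len(emails), exceptions)      # 3 of 4 records pass
invalid_share = invalid_dq(len(emails), exceptions)  # 1 of 4 records fails
print(f"ValidDQ = {valid_share:.2f}, InvalidDQ = {invalid_share:.2f}")
```

The two values always add up to 1 for the same rule, so in practice you would report whichever one reads better for your audience.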

For example, to estimate the percentage of missing data in a dataset, you could divide the number of missing values by the total number of records.
Apart from percentage calculations, some data quality dimensions have their own metrics; for example, timeliness could be evaluated based on the latest timestamp (date).
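Here is a rough sketch of both ideas, the share of missing values and a timeliness check based on the latest date. The records, column names, helper functions, and the reference date are all made up for illustration, and treating an empty string as missing is my own assumption.

```python
from datetime import date

# Hypothetical records; the column names are assumptions for this example.
records = [
    {"customer": "Ann", "email": "ann@example.com", "updated": date(2023, 1, 10)},
    {"customer": "Bob", "email": None, "updated": date(2023, 3, 5)},
    {"customer": "Cy", "email": "", "updated": date(2022, 12, 1)},
    {"customer": "Di", "email": "di@example.com", "updated": date(2023, 2, 20)},
]


def missing_share(rows, column):
    """Fraction of rows where the column is None or an empty string."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def days_since_latest(rows, column, as_of):
    """Days between the most recent date in the column and a reference date."""
    latest = max(row[column] for row in rows)
    return (as_of - latest).days


email_missing = missing_share(records, "email")
staleness = days_since_latest(records, "updated", date(2023, 3, 15))
print(f"Missing emails: {email_missing:.0%}, days since last update: {staleness}")
```

Whether 50% missing emails or a 10-day-old dataset is a problem comes back to fitness for use: the thresholds depend on what the data consumers need.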

In the next step, I will review the ‘how’, i.e. the practical implementation of data validation.

Sources:

[1] DAMA-DMBOK: Data Management Body of Knowledge

[2] Menna Ibrahim Gabr, Yehia Mostafa Helmy, Doaa Saad Elzanfaly. (2021). Data Quality Dimensions, Metrics, and Improvement Techniques. https://digitalcommons.aaru.edu.jo/cgi/viewcontent.cgi?article=1141&context=fcij

[3] DAMA UK. The Six Primary Dimensions for Data Quality Assessment: Defining Data Quality Dimensions



Written by Eka Ponkratova (@thatdatabackpacker)

I’m a data consultant, interacting closely with you to get data to work for you www.linkedin.com/in/eponkratova
