If tools could smell, data profiling might have ‘thrift store smell’
It still puzzles me that, “at a time when our spaceships are plowing the expanses of the universe” (as the old Soviet comedy quip goes), we are still using data management tools developed back in the 1990s. And as you can imagine, companies of those days rarely gave anything away as an open-source or freemium product. A case in point? Data profiling tools. Don’t get me wrong, there are plenty of articles on the Internet about how good data profiling is: a simple search on “data profiling” returns 490K results; “data profiling tools”, 14K+ results. Things get more interesting if you search for “data profiling open source”, which brings back 100+ results. If tools could smell, I feel data profiling, along with data mapping and the rest of that list, might have ‘thrift store smell’.
After exploring the options on the Internet, asking data management practitioners on LinkedIn (which brought suggestions to pay N-digit sums for commercial software and comments that companies develop their own in-house data profiling tools, ha?), and playing around with open-source tools (fortunately, or rather unfortunately, there is just a handful), I turned to pandas_profiling (GitHub), an open-source Python library. Let me say from the outset that this won’t be another article on how useful pandas_profiling is for EDA; you can find all those articles here. The audience of this article is data management practitioners and anyone doing data profiling at scale.
Let the story begin
‘Data profiling is a technology for discovering and investigating data quality issues, such as duplication, lack of consistency, and lack of accuracy and completeness. This is accomplished by analyzing one or multiple data sources and collecting metadata that shows the condition of the data and enables the data steward to investigate the origin of data errors. The tools provide data statistics, such as degree of duplication and ratios of attribute values, both in tabular and graphical formats.’ (Gartner)
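To make those terms concrete, here is a small, hand-rolled sketch of the same kinds of statistics computed with plain pandas; the toy columns and values are invented purely for illustration:

import pandas as pd

# a toy dataset, invented purely for illustration
df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, None],
    'country': ['BE', 'BE', 'BE', 'NL', None],
})

# completeness: share of non-null values per column
completeness = df.notna().mean()

# degree of duplication: share of fully duplicated rows
duplication = df.duplicated().mean()

# ratios of attribute values: relative frequency of each value in a column
country_ratios = df['country'].value_counts(normalize=True)

print(completeness, duplication, country_ratios, sep='\n')

A data profiling tool packages exactly these kinds of statistics, for every column at once, into a tabular and graphical report.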
Pandas Profiling was initiated by Jos Polfliet, a data scientist based in Antwerp, Belgium and is currently developed by Simon Brugman.
The code is very straightforward: you load your file into a pandas DataFrame, create a ProfileReport object, and then either save it with to_file() or display it in Jupyter using df.profile_report(style={'full_width': True}). The output helps you evaluate the dataset's completeness, i.e. complete versus blank or null values; uniqueness, i.e. how many duplicated values appear in a column; value distribution, i.e. how records are distributed for a given attribute; and range, i.e. the min and max values for a given attribute, among others.
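To make that workflow concrete, here is a minimal sketch of those steps; the file name your_data.csv and the output name your_data_profile.html are placeholders, not from any particular dataset:

import pandas as pd
import pandas_profiling

# load the data into a pandas DataFrame (the CSV path is a placeholder)
df = pd.read_csv("your_data.csv")

# build the profiling report
report = pandas_profiling.ProfileReport(df)

# save the report as a standalone HTML file
report.to_file("your_data_profile.html")

# in a Jupyter notebook, df.profile_report(style={'full_width': True})
# renders the same report inline instead of writing a file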
Twist ending for data practitioners
In case you are like me and want everything at once, here is a tweaked version that generates a report for every table in your database (or a subset of tables) and combines them into a complete_report.html file.
# importing libraries
import os
import time

import pandas as pd
import pandas_profiling
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://yourUsername:yourPassword@URL/db')

# listing the table names in the database
lst = engine.table_names()

# looping through the tables to get each table's profile
for x in lst:
    time.sleep(60)  # small pause between tables so the database is not hammered
    df = pd.read_sql_query('SELECT * FROM ' + x, engine)
    report = pandas_profiling.ProfileReport(df)
    report.to_file(x + ".html")

# pulling the reports together into a single index page
html = '<html><body>'
subset = []
for file in os.listdir():  # getting all files in the script directory
    if file.endswith(".html"):  # keeping only those with the .html extension
        subset.append(file)
for r in subset:
    html += '<a href="%s">' % r
    html += '<h3>%s</h3></a>' % r
html += '</body></html>'

# saving the combined report index as complete_report.html
with open("complete_report.html", "w") as fp:
    fp.write(html)
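If you only need a subset of tables rather than the whole database, one simple option is to filter the list before the loop; the table names below are placeholders:

# keep only the tables you care about (names are placeholders)
tables_of_interest = ['customers', 'orders']
lst = [t for t in lst if t in tables_of_interest]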
What has been your experience with data profiling so far? What tools do you use?