Data Vault in Redshift
Imagine you have an ERP system with the market prices of the products you are selling. An example of such data is the Simusolar Kina sold at USD 1,500 (the prices are fictional) — see table 1. At the same time, you have the inventory data with all the technical characteristics of the Kina, such as the power supply type and PV array size, among others — see table 2.
The issue is that this data lives in different source systems, with each record having its own key and a slightly different name. But what if you want to create a complete picture of the entire stock, including the technical characteristics and the market price, capture all the changes in the market price, and be able to get the latest record for each pump?
+--------+-----------------------------+------------+
| ERP id | Brand | Cash price |
+--------+-----------------------------+------------+
| 1001 | Simusolar (Kina) | 1500 |
| 1002 | Sunculture (2SP) | 1000 |
| 1003 | Amped Innovation (WOWsolar) | 600 |
| 1004 | Ecozen Solutions (HP- 50 m) | 1000 |
+--------+-----------------------------+------------+
+--------+--------------------------------+------------------+---------------+-------------------+-------------------+----------------+-----------------------+
| Inv id | Brand | Product Type | Model Number | Power Supply Type | PV Array Size (W) | Head Value (m) | Water Volume (m³/day) |
+--------+--------------------------------+------------------+---------------+-------------------+-------------------+----------------+-----------------------+
| PU1 | Simusolar-Kina | Submersible Pump | Kina | DC | 2250 | 80 | 13.5 |
| PU2 | Sunculture-2SP | Surface Pump | 2SP | DC | 420 | 10 | 11.5 |
| PU3 | Amped Innovation-WOWsolar Pump | Surface Pump | WOWsolar Pump | DC | 80 | 6 | 5.9 |
| PU4 | Solartech | Submersible Pump | SPM600HS | DC | 600 | 40 | 14.2 |
+--------+--------------------------------+------------------+---------------+-------------------+-------------------+----------------+-----------------------+
You have several options for how you can tie different systems together and track the changes to the market price — see table 3 — by:
(1) creating master data (MDM) using one of the traditional MDM architecture models — read more about it in the article “Too many names for one customer or Master Data Management, part I” at https://eponkratova.medium.com/too-many-names-for-one-customer-or-master-data-management-in-aws-c993c2824af7;
(2) using slowly changing dimensions and generating surrogate keys — read the article “Is anybody loading data in fact and dimension tables the pure-Python way?” at https://medium.com/@eponkratova/is-anybody-doing-dimensional-modeling-the-pure-python-way-f4d2388cdfb4;
(3) using the hub-and-spoke model (Data Vault).
+----+--------+--------+--------------------------------+------------+
| Id | Inv id | ERP id | Brand | Cash price |
+----+--------+--------+--------------------------------+------------+
| 1 | PU1 | 1001 | Simusolar-Kina | 1500 |
| 2 | PU2 | 1002 | Sunculture-2SP | 1000 |
| 3 | PU3 | 1003 | Amped Innovation-WOWsolar Pump | 600 |
| 4 | PU4 | | | |
| 5 | | 1004 | Ecozen Solutions | 1000 |
| 6 | PU4 | | PU4 | |
+----+--------+--------+--------------------------------+------------+
What is it all about?
As per the Data Vault methodology, you separate your tables into hubs (business keys), links (relationships), and satellites (descriptive attributes) to store the history of attribute changes and link different systems together.
In our example, we will have two hub tables, hub_product_inventory and hub_product_erp, each containing an internal product key and the business key from its source system, so that external business keys are mapped to one internal product record. Then, we will build a link table (link_product) that connects the internal key from hub_product_inventory to the internal key from hub_product_erp. Finally, we will have sat_inventory with the technical characteristics from the inventory system, and sat_market with the market prices, currency, effective date, and other price-related details.
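To make this concrete, here is a minimal sketch of what these tables could look like as Redshift DDL. The column names (inv_pk, erp_pk, ldts for the load datetime, rsrc for the record source, hashdiff for change detection) mirror those used in the query later in the article, but they are illustrative assumptions rather than the project's exact DDL.
-- Hubs: one row per business key, with a hashed surrogate key and load metadata.
CREATE TABLE processing.hub_product_inventory (
    inv_pk CHAR(32)     NOT NULL,   -- e.g. MD5 hash of inv_id
    inv_id VARCHAR(50)  NOT NULL,   -- business key from the inventory system
    ldts   TIMESTAMP    NOT NULL,   -- load datetime
    rsrc   VARCHAR(100) NOT NULL    -- record source
);
CREATE TABLE processing.hub_product_erp (
    erp_pk CHAR(32)     NOT NULL,
    erp_id VARCHAR(50)  NOT NULL,   -- business key from the ERP
    ldts   TIMESTAMP    NOT NULL,
    rsrc   VARCHAR(100) NOT NULL
);
-- Link: relates the two hubs.
CREATE TABLE processing.link_product (
    link_pk CHAR(32)     NOT NULL,  -- hash over both hub keys
    inv_pk  CHAR(32)     NOT NULL,
    erp_pk  CHAR(32)     NOT NULL,
    ldts    TIMESTAMP    NOT NULL,
    rsrc    VARCHAR(100) NOT NULL
);
-- Satellites: descriptive attributes with history, one row per detected change.
CREATE TABLE processing.sat_inventory (
    inv_pk              CHAR(32) NOT NULL,
    brand               VARCHAR(100),
    product_type        VARCHAR(50),
    model_number        VARCHAR(50),
    power_supply_type   VARCHAR(10),
    pv_array_size_w     INTEGER,
    head_value_m        NUMERIC(6,1),
    water_volume_m3_day NUMERIC(6,1),
    hashdiff            CHAR(32),   -- hash of the attributes, used to detect changes
    ldts                TIMESTAMP NOT NULL
);
CREATE TABLE processing.sat_market (
    erp_pk     CHAR(32) NOT NULL,
    cash_price NUMERIC(10,2),
    currency   VARCHAR(3),
    hashdiff   CHAR(32),
    ldts       TIMESTAMP NOT NULL
);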
How to approach a Data Vault, or tools to build a Data Vault
When it comes to implementation, you can build a Data Vault from scratch, but you would quickly realize that it is challenging for multiple reasons. For example, when transforming data, you need to separate it into hubs, links, and satellites while handling historical tracking and incremental loads, and you need to generate and manage the surrogate keys that link data from different sources.
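For instance, a hand-rolled approach to surrogate keys typically derives a deterministic hash from each business key, plus a second hash over the descriptive attributes to detect changes between loads. The sketch below shows the idea in plain Redshift SQL; the source table and column names are illustrative assumptions, not the project's actual code.
-- Derive a deterministic surrogate key from the business key and a change-detection
-- hash (hashdiff) from the descriptive attributes. Source table name is illustrative.
SELECT
    MD5(UPPER(TRIM(inv_id)))                            AS inv_pk,    -- hashed surrogate key
    inv_id,                                                           -- business key
    MD5(COALESCE(UPPER(TRIM(brand)), '') || '|' ||
        COALESCE(UPPER(TRIM(product_type)), '') || '|' ||
        COALESCE(CAST(pv_array_size_w AS VARCHAR), '')) AS hashdiff,  -- changes when attributes change
    GETDATE()                                           AS ldts,      -- load datetime
    'inventory_system'                                  AS rsrc       -- record source
FROM inventory.products;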
Alternatively, you can use tools that support the specialized nature of the hub-and-spoke model — see some of the tools below.
If you use dbt (Data Build Tool) as your ELT tool — read more at https://www.getdbt.com/ — you can take advantage of the AutomateDV and datavault4dbt packages, open-source dbt packages that contain macros to generate and execute the ELT code to build a data warehouse following the Data Vault methodology (see a sketch of such a model after the list below).
- AutomateDV (formerly known as dbtvault) from DataVault UK, a data engineering, governance & analytics consultancy.
- datavault4dbt from ScalefreeCOM, a consultancy specializing in Data Vault 2.0 solutions for modern businesses.
- You might also be interested in dq_vault from Infinite Lambda, a data and AI engineering consultancy, to ensure that you have tests on the core entities.
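To give a feel for how much boilerplate these packages remove, below is a sketch of a hub model written in the AutomateDV style, following the pattern shown in its documentation. The model, column, and staging names are this article's assumptions, and the exact arguments can vary between package versions, so treat it as illustrative rather than copy-paste ready.
-- hub_product_inventory.sql, written with AutomateDV (illustrative sketch).
-- src_pk is the hashed key, src_nk the business key, src_ldts the load datetime,
-- src_source the record source; stg_stock is an assumed staging model that already
-- carries these columns.
{{ config(materialized='incremental') }}

{%- set source_model = "stg_stock" -%}
{%- set src_pk = "inv_pk" -%}
{%- set src_nk = "inv_id" -%}
{%- set src_ldts = "ldts" -%}
{%- set src_source = "rsrc" -%}

{{ automate_dv.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
                   src_source=src_source, source_model=source_model) }}
The single macro call compiles into the full SELECT and insert logic for the hub, including handling of incremental runs, which is exactly the boilerplate you would otherwise write by hand.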
Data Vault in Redshift with datavault4dbt
Google Cloud Platform (GCP), Amazon Web Services (AWS), Microsoft Azure, and Oracle were the leaders in the Gartner Magic Quadrant for Strategic Cloud Platform Services as of 2023. For this project, I will be using Amazon Redshift as the data warehouse. As I already mentioned, Data Vault models can result in large tables, especially satellites, because these tables store the historical changes to attributes. Redshift's ability to define distribution and sort keys helps improve query performance on such large tables.
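With dbt on Redshift, distribution and sort keys can be set in a model's config block, since the dbt-redshift adapter supports dist and sort configurations. The values below are illustrative choices for a satellite keyed on erp_pk and loaded by ldts:
-- Example config at the top of a satellite model such as sat_market.sql.
-- dist='erp_pk' co-locates rows that join to hub_product_erp on the same slices;
-- sort='ldts' orders rows by load datetime to speed up history and latest-record scans.
-- Both values are illustrative; choose keys that match your join and filter patterns.
{{ config(
    materialized='incremental',
    dist='erp_pk',
    sort='ldts'
) }}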
Step 1 — Create an Amazon Redshift cluster, which involves specifying an appropriate node type for your CPU, memory, and storage requirements, and configuring networking and security settings for access control, using the AWS Management Console, CLI, or API. For more details on the Redshift setup, refer to the ‘Materials used’ section at the end of the article.
Step 2 — Prepare schemas and tables in the newly created Redshift data warehouse and insert sample data. You can find the DDL statements used to prepare the datasets at https://github.com/eponkratova/articles/blob/master/data_vault/preparing_dwh.sql.
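For reference, the sketch below shows the general shape of those statements; the schema, table, and column names are simplified assumptions for this article, and the exact DDL lives in the linked repository.
-- Illustrative only: one schema and one table per source system, plus a sample row each.
CREATE SCHEMA IF NOT EXISTS erp;
CREATE SCHEMA IF NOT EXISTS inventory;

CREATE TABLE erp.product_prices (
    erp_id     INTEGER,
    brand      VARCHAR(100),
    cash_price NUMERIC(10,2)
);

CREATE TABLE inventory.products (
    inv_id              VARCHAR(10),
    brand               VARCHAR(100),
    product_type        VARCHAR(50),
    model_number        VARCHAR(50),
    power_supply_type   VARCHAR(10),
    pv_array_size_w     INTEGER,
    head_value_m        NUMERIC(6,1),
    water_volume_m3_day NUMERIC(6,1)
);

INSERT INTO erp.product_prices VALUES (1001, 'Simusolar (Kina)', 1500);
INSERT INTO inventory.products VALUES
    ('PU1', 'Simusolar-Kina', 'Submersible Pump', 'Kina', 'DC', 2250, 80, 13.5);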
Step 3 — Install dbt Core, initialize a dbt project, and define a profile to connect to Redshift. If you don't have dbt Core installed, follow the article in the ‘Materials used’ section, which includes setting up a virtual environment and installing dbt with Python's pip package manager by running a command like pip install dbt-core dbt-redshift and then dbt init. While configuring the profiles.yml file, you need to define the Redshift connection details (including database, schema, username, password, and host) — refer to the ‘Materials used’ section.
Step 4 — Define the dbt project structure — see the example of the folder structure for my use case; populate the sources.yml file that lists your source tables; create and update the packages.yml file; and create the generate_schema_names.sql macro under the macros folder (see a sketch of that macro after the folder structure).
data_vault/
├─ macros/
│  └─ generate_schema_names.sql
├─ models/
│  ├─ preprocessing/
│  │  ├─ prep_product.sql
│  │  └─ prep_stock.sql
│  ├─ staging/
│  │  ├─ stg_stock.sql
│  │  └─ stg_product.sql
│  ├─ data_vault/
│  │  ├─ hubs/
│  │  │  ├─ hub_product_inventory.sql
│  │  │  └─ hub_product_erp.sql
│  │  ├─ satellites/
│  │  │  ├─ sat_inventory.sql
│  │  │  └─ sat_market.sql
│  │  └─ links/
│  │     └─ link_product.sql
│  └─ sources.yml
├─ dbt_project.yml
└─ packages.yml
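By default, dbt concatenates the target schema with any custom schema name; overriding generate_schema_name lets models land in exactly the schemas you specify (for example, processing). The variant below follows the pattern from the dbt documentation; adjust it to your own naming rules.
-- macros/generate_schema_names.sql
-- Use the custom schema name as-is when one is set; otherwise fall back to the target schema.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}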
Step 5 — Define the Data Vault data models. The staging tables serve as the initial data extraction layer, where we perform the initial data cleaning and merging. The data from the staging models feeds the preprocessing models and, later, the hub models, where hub_product_inventory.sql uses the inventory-specific key and hub_product_erp.sql uses the ERP key to create a unique business key for each product. The satellites track the changing descriptive attributes for each hub. Finally, the link model (link_product.sql) joins these hubs by matching common fields.
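The dbt packages generate the actual loading code, but conceptually the link resolves to something like the sketch below, which pairs the inventory key with the ERP key by matching a normalized brand value carried through staging. The schema, table, and column names are this article's assumptions, and, as noted at the end of the article, the brand matching itself is the part that benefits most from better cleaning.
-- Conceptual sketch of the logic behind link_product; the package macros generate the
-- real code. Schema, table, and column names are assumptions for this article.
SELECT
    MD5(s.inv_pk || '|' || p.erp_pk) AS link_pk,   -- link hash key over both hub keys
    s.inv_pk,                                      -- hashed inventory key
    p.erp_pk,                                      -- hashed ERP key
    GETDATE()                        AS ldts,      -- load datetime
    'inventory+erp'                  AS rsrc       -- record source
FROM staging.stg_stock   AS s                      -- assumed to carry inv_pk and a cleaned brand
JOIN staging.stg_product AS p                      -- assumed to carry erp_pk and a cleaned brand
  ON UPPER(TRIM(s.brand)) = UPPER(TRIM(p.brand));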
Even though the raw Data Vault tables are usually not queried directly for reporting, you can run the query below to review the outcome.
WITH unified_products AS (
SELECT
l.inv_pk,
s.brand,
product_type,
model_number,
power_supply_type,
pv_array_size_w,
head_value_m,
water_volume_m3_day,
cash_price,
COALESCE(i.ldts, e.ldts) AS load_datetime
FROM processing.link_product l
LEFT JOIN processing.hub_product_inventory i
ON l.inv_pk = i.inv_pk
LEFT JOIN processing.sat_inventory s
ON l.inv_pk = s.inv_pk
LEFT JOIN processing.hub_product_erp e
ON l.erp_pk = e.erp_pk
LEFT JOIN processing.sat_market m
ON l.erp_pk = m.erp_pk
)
SELECT * FROM unified_products;
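Since the satellites keep the full history of changes, getting the latest record per product (one of the goals stated at the beginning) comes down to ranking satellite rows by load datetime. A sketch, assuming sat_market stores one row per price change keyed by erp_pk and ldts:
-- Latest market price per product: rank price rows per erp_pk by load datetime.
WITH ranked_prices AS (
    SELECT
        erp_pk,
        cash_price,
        ldts,
        ROW_NUMBER() OVER (PARTITION BY erp_pk ORDER BY ldts DESC) AS rn
    FROM processing.sat_market
)
SELECT erp_pk, cash_price, ldts
FROM ranked_prices
WHERE rn = 1;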
The shared dbt models can be improved further — for example, by enhancing the cleaning process for the brand column in the raw data — but for now, they have achieved the intended objective of forming a unified view across the two systems.
Materials used:
(1) VeraSol product data at https://data.verasol.org/products/swp
(2) Data Vault 2.0 with dbt Cloud at https://docs.getdbt.com/blog/data-vault-with-dbt-cloud
(3) The AutomateDV package information at https://hub.getdbt.com/Datavault-UK/automate_dv/latest/
(4) The datavault4dbt package information at https://hub.getdbt.com/ScalefreeCOM/datavault4dbt/latest/
(5) The dq_vault package information at https://hub.getdbt.com/infinitelambda/dq_vault/latest/
(6) Gartner Magic Quadrant for Strategic Cloud Platform Services as of 2023 — https://www.cxtoday.com/customer-data-platform/gartner-magic-quadrant-for-strategic-cloud-platform-services-2023/
(7) Installing dbt Core at https://docs.getdbt.com/docs/core/pip-install
(8) Provisioning a Redshift cluster at https://docs.aws.amazon.com/redshift/latest/mgmt/create-cluster.html
(9) Setting up dbt Core for Redshift at https://docs.getdbt.com/docs/core/connect-data-platform/redshift-setup
(10) AutomateDV documentation at https://automate-dv.readthedocs.io/en/latest/
(11) Data vault templates (AutomateDV and datavault4dbt) at https://github.com/infinitelambda/dbt-data-vault-template/tree/main
(12) The datavault4dbt demo at https://github.com/ScalefreeCOM/datavault4dbt-snowflake-demo