To continue with a metrics store, part II
'Why do we have X as revenue in report A and Y in application B?' You start digging and realize that in the first case the analyst used the invoice date to recognize revenue, while the finance department used the first payment date.
There are different ways to align data and metrics across data products, e.g. creating summary tables with pre-calculated measures, or using a metrics store / headless BI / semantic layer. The metrics store is the focus of this article.
I already made a first attempt with Cube dev in Part I at https://dev.to/eponkratova/getting-started-with-metrics-store-547i, but I was eager to try a different platform to see another way of building relations between datasets. After playing with Metriql, I decided to focus on MetricFlow, an open-source metric creation framework launched in 2022 by Transform, which was later acquired by dbt Labs.
Getting ready
1. To be able to compare 'apples with apples', I would use Redshift, a cloud-based data warehouse solution offered by AWS (read more on setting up Redshift in the Getting started guide).
2. Create a group and a user in your Redshift cluster. Note that the user should be able to create tables and query tables in your data warehouse.
create group transform;
create user yourUser password 'yourPassword' in group transform;
grant create on database dev to group transform;
grant select on all tables in schema information_schema to group transform;
grant select on all tables in schema pg_catalog to group transform;
grant select on all tables in schema public to group transform;
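Looking ahead: MetricFlow will later write its tables into a dedicated schema (see the dwh_schema setting below), so it can save you a permission error to create that schema and grant access to it up front. A small sketch, assuming the schema is named metricflow to match the config used later:
-- optional: pre-create the schema MetricFlow will write to and open it up to the group
create schema if not exists metricflow;
grant all on schema metricflow to group transform;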
3. Redshift provides a public ticket dataset (TICKIT) that you can use for this project. As per https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html, the dataset contains two fact tables (sales and listing) and five dimension tables (users, venue, category, date, and event). I already identified the metrics to be computed: revenue and the number of tickets sold, both extracted from the sales table and split by username, which comes from the users table.
How to do it
Transform's website states, 'At its core, it [MetricFlow] is a Python library that includes a metric spec, model parsing from YAML files, and APIs to create and execute metric logic.' For this project, I created a conda virtual environment, personalproject, to keep all project library dependencies together.
1. Run pip install metricflow in any command line/terminal. Lesson(s) learned: you need a Python version >= 3.8 and < 3.10.
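For reference, the environment setup boiled down to something like the following; the environment name and the pinned Python 3.9 are my own choices, with 3.9 simply satisfying the version constraint above:
conda create -n personalproject python=3.9
conda activate personalproject
pip install metricflow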
2. Then, run mf setup and select the data platform you want to connect to: bigquery, databricks, duckdb, postgresql, redshift, or snowflake. Depending on the selected dialect, a template config file will be created in {HOME}/.metricflow/config.yml; in my case, it was located in C:\Users\katep\.metricflow\config.yml.
Usually, you grant permissions in Redshift on a per-schema basis, and this will bite you when you run mf health-checks in your CLI to validate the data warehouse connection, because MetricFlow will try to create a random test schema with test tables. Lesson(s) learned: specify the desired schema name, e.g. dwh_schema: 'metricflow', and then grant the permissions on that schema (as in the sketch under 'Getting ready' above) to avoid the permission error. My C:\Users\katep\.metricflow\config.yml:
dbt_cloud_service_token: ''  # The dbt service token to access the metadata for the dbt cloud job. Needs a minimum of Metadata API access for the desired dbt job's project.
dbt_cloud_job_id: ''  # The ID of a dbt cloud job to build the model from
dbt_repo: ''  # If set to `True`, MetricFlow will interpret the value of `model_path` to point to dbt configs
dwh_schema: 'metricflow'
model_path: C:\Users\katep/.metricflow/sample_models  # Path to directory containing defined models (Leave until after DWH setup)
email: ''  # Optional
dwh_dialect: redshift
dwh_host: 'yourHost.redshift.amazonaws.com'  # Host name
dwh_port: '5439'
dwh_user: 'yourUser'  # Username for the data warehouse
dwh_password: 'yourPassword'  # Password associated with the provided user
dwh_database: 'dev'
3. Once you are done, test the connection to the data warehouse with mf health-checks in your CLI.
4. If you are not sure where to go next, you could perform your first run with mf tutorial to get the sample metrics populated.
5. There are two main objects in MetricFlow, both specified in .yaml files: (1) data sources, the tables to be used in your metric calculations, and (2) metrics, the metric definitions themselves. As with Cube dev, MetricFlow translates incoming requests into SQL and executes them against a database/data warehouse.
In the end, I made the 'metrics' part of the data source files, following the structure of their test scripts. The first script was saved in C:\Users\katep\.metricflow\sample_models\users.yaml and the second in C:\Users\katep\.metricflow\sample_models\sales.yaml.
data_source:
  name: users
  description: Dataset that contains info on the users
  identifiers:
    - name: user
      type: primary
      expr: userid
  dimensions:
    - name: city
      type: categorical
    - name: state
      type: categorical
    - name: username
      type: categorical
  sql_table: public.users
  mutability:
    type: immutable
data_source:
  name: sales
  description: Raw ticket-level data
  identifiers:
    - name: sales_info
      type: primary
      expr: salesid
    - name: user
      type: foreign
      expr: sellerid
    - name: sales_user
      type: foreign
      identifiers:
        - ref: user
        - name: sellerid
  measures:
    - name: revenue
      description: Price paid for tickets
      expr: pricepaid
      agg: SUM
      create_metric: true
    - name: num_sold
      description: Number of tickets sold
      expr: qtysold
      agg: SUM
      create_metric: true
  dimensions:
    - name: saletime
      type: time
      type_params:
        is_primary: True
        time_granularity: day
  sql_table: public.sales
  mutability:
    type: immutable
---
metric:
  name: revenue
  type: measure_proxy
  type_params:
    measure: revenue
---
metric:
  name: num_sold
  type: measure_proxy
  type_params:
    measure: num_sold
---
metric:
  name: revenue_filtered
  type: measure_proxy
  type_params:
    measure: revenue
  constraint: |
    user__username = 'TFY09PKU'
From my Cube experience, I was especially curious about how MetricFlow joins data, because a left join might give completely different results from an inner join, and I just didn't want to leave the join decision in the users' hands. The impression I got from testing is that if your foreign and primary keys are named the same way, e.g. userid in sales vs. userid in users, joins are easy:
data_source:
  name: sales_simple
  description: Raw ticket-level data
  identifiers:
    - name: user
      type: foreign
      expr: userid
However, because I had userid as the primary key in the users table and sellerid in the sales table, I had to do the following gymnastics: I defined the foreign key first and then created another entry referring to the user identifier defined in C:\Users\katep\.metricflow\sample_models\users.yaml.
data_source:
  name: sales
  description: Raw ticket-level data
  identifiers:
    - name: user
      type: foreign
      expr: sellerid
    - name: sales_user
      type: foreign
      identifiers:
        - ref: user
        - name: sellerid
It is time to test by running mf list-metrics to see all metrics. Success: all three defined metrics are on the list.
Now let's get some data with mf query --metrics revenue --dimensions user__username.
If you want to see which query in particular hit Redshift, add --explain to your command.
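I won't paste the full --explain output here, but the generated query has roughly the following shape. This is a simplified sketch rather than MetricFlow's verbatim SQL (the real output is wrapped in CTEs and aliases), so confirm the join type with --explain on your own setup:
-- approximate shape of `mf query --metrics revenue --dimensions user__username`
select
  u.username as user__username,
  sum(s.pricepaid) as revenue
from public.sales s
left outer join public.users u
  on s.sellerid = u.userid
group by u.username;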
6. Let's test it in action… Or not. If you remember, one could connect to Cube dev from different BI tools and data platforms, such as Google Data Studio, Power BI, Jupyter Notebook, etc., using the Cube SQL API. MetricFlow, however, has only a CLI and a Python SDK. The hosted version, Transform, comes with a SQL and GraphQL interface, which was not part of my review. What a bummer! Materializations, which are 'pre-computed and used to "warm-up" the cache for the most common metrics and dimensions', aka pre-aggregations in Cube, came to my rescue, as I could connect to the materializations with a BI tool (for the Python SDK, check this article).
To test the materialization, run mf materialize --materialization-name ticket_summary --start-time 1900-01-01T00:00:00 --end-time 2050-01-01T00:00:00 in your CLI. A new table, metricflow.ticket_summary, gets created in your Redshift data warehouse.
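For completeness, the materialization itself has to be declared in the model files. I am not reproducing my exact definition here, but a minimal sketch along the lines of the MetricFlow materialization spec would look roughly as follows; treat the field set as an assumption and check the MetricFlow docs, since the destination/output options are omitted:
materialization:
  name: ticket_summary
  description: Revenue and tickets sold by user
  metrics:
    - revenue
    - num_sold
  dimensions:
    - user__username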
Unlike Cube dev, which has refresh workers that trigger pre-aggregations to be refreshed periodically, MetricFlow doesn't allow setting up a scheduled refresh. The website states, 'If you are using Airflow, you can use Transform's Airflow operator to schedule materializations.' As I am using Windows, Task Scheduler it is.
Because I created the conda virtual environment personalproject to isolate the packages installed for this project from other projects, I need to run the script inside that virtual environment from Task Scheduler.
7. Create a batch file, e.g. schedule_metricflow.bat, and add: call C:\Users\katep\miniconda3\Scripts\activate.bat personalproject && mf materialize --materialization-name ticket_summary --start-time 1900-01-01T00:00:00 --end-time 2050-01-01T00:00:00 (the call ensures control returns to the batch file so the chained mf command still runs).
8. Create the actual task by opening Start -> Control Panel -> System and Security -> Administrative Tools -> Task Scheduler. Then go to Action -> Create Basic Task, type a name, and click Next. Specify the desired frequency in the Trigger window and point to schedule_metricflow.bat in the Action window.
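If you prefer the command line to the GUI, the same task can be registered with schtasks; a rough sketch, where the task name, schedule, and path to the batch file are placeholders of my own choosing:
schtasks /Create /TN "MetricFlow ticket_summary refresh" /SC DAILY /ST 06:00 /TR "C:\path\to\schedule_metricflow.bat"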
Great, my pre-aggregation summary now gets refreshed at regular intervals.
9. As the BI tool is not connecting to the metrics store to perform aggregations and retrieve data on the fly, but instead connects to the materialized ticket_summary table in the warehouse, I would use AWS QuickSight. As in the case of Cube dev with Power BI/Google Data Studio, ticket_summary got exposed as a table where both measures and dimensions are columns.
Lesson(s) learned: the part I am not sure about in a metrics store is pre-calculated ratio columns in materializations, especially when you have several dimensions to slice your data on, because once the data lands in a BI tool you would end up averaging the pre-calculated ratios, which is not correct in all cases. So, as with Cube and Google Data Studio, I defined the avgRevenue column, calculated as revenue/num_sold, in QuickSight itself. The final report page looks fairly similar to the Google Data Studio dashboard.
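As a side note, the ratio could also live in the metric layer itself: the MetricFlow spec includes a ratio metric type, so a definition roughly like the sketch below (my own naming; verify the exact fields against the MetricFlow docs) would compute revenue per ticket at whatever grain is queried, although once it is frozen into a materialization the same re-aggregation caveat applies.
metric:
  name: avg_revenue_per_ticket
  type: ratio
  type_params:
    numerator: revenue
    denominator: num_sold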
Summary
At this stage, I have looked into Cube dev, Metriql and MetricFlow. Although it is difficult to get all the pros and cons of each tool from this small project, initiated as part of #100daysoflearning, I do see that a metrics store can come in handy when you have different applications consuming the same metrics in different ways, e.g. by using an API to get the data, by connecting different BI tools, or by running scripting tools such as Python/R. At the same time, if you only have a BI tool (or tools) consuming the metrics, I wonder whether a metrics store adds much value, or whether a simple summary table/view, aka pre-aggregations in Cube or materializations in MetricFlow, would do the job.