Exporting BigQuery tables to one or more files on Google Storage
If you follow the Modern Data Stack space, you have most probably come across the term ‘Reverse ETL’, which can be defined as the process of copying data from your central data warehouse to your operational tools.
You might think: wait a moment, is it not just another data pipeline with the source and target reversed? One reason for establishing Reverse ETL as a stand-alone approach is the ability to delegate the handling of API schema changes to Reverse ETL vendors. Regardless of whether you consider Reverse ETL a new approach or just another data pipeline, you might have a use case where you need to export a BigQuery table to Google Cloud Storage with dbt, or unload a Redshift table to S3 with dbt. As I could not find a solution on the Internet for how to achieve this, I decided to do a quick write-up of the potential options.
Option 1 or Reverse ETL tools
Based on the listing at https://www.moderndatastack.xyz/companies/reverse-etl-tools (extracted on 11/16/2022) and a review of the vendors’ websites, the following Reverse ETL tools were checked for a dbt integration and for the ability to save output to Google Cloud Storage.
| service | site | dbt integration | output to GCS |
|-------------|------------------------------|-----------------|---------------|
| RudderStack | https://www.rudderstack.com/ | yes | yes |
| hightouch | https://hightouch.com | yes | yes |
| grouparoo | https://www.grouparoo.com | yes | no |
| omnata | https://omnata.com/ | yes | no |
| Census | https://www.getcensus.com/ | yes | yes |
Option 2 or post-hook/pre-hook with dbt
As per the dbt documentation, a pre-hook/post-hook is ‘a SQL statement (or list of SQL statements) to be run before or after a model, seed, or snapshot is built.’ Since in our case we want to export raw data from one of the daily jobs to GCS, we set up a post-hook, with the model looking as follows.
{{ config(
    post_hook = "EXPORT DATA OPTIONS(
        uri='gs://<bucket_name>/<file_path_inside_bucket>_'||current_date()||'*.csv',
        format='CSV',
        overwrite=true,
        header=true,
        field_delimiter=',') AS (select * from {{ source('project-name.dataset', 'table-name') }})"
) }}
select * from {{ source('project-name.dataset', 'table-name') }}
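Since the model already selects from the source, the post-hook can instead read from the model’s own output via {{ this }}, which dbt resolves in hooks at run time, so the exported files always match what the model has just built. Below is a minimal variant of the same model, assuming it is materialized as a table and reusing the same placeholder bucket and path:
-- hypothetical model file, e.g. models/export_to_gcs.sql
-- the post-hook exports the model's own output table to GCS after the model is rebuilt
{{ config(
    materialized = 'table',
    post_hook = "EXPORT DATA OPTIONS(
        uri='gs://<bucket_name>/<file_path_inside_bucket>_'||current_date()||'*.csv',
        format='CSV',
        overwrite=true,
        header=true,
        field_delimiter=',') AS (select * from {{ this }})"
) }}
select * from {{ source('project-name.dataset', 'table-name') }}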
Keep in mind that for the post-hook to work, you need to grant the appropriate GCS permissions on the target bucket, e.g. the Storage Object Creator role (Storage Object Viewer alone only allows reading), to the service account you use to authenticate BigQuery and dbt Cloud.
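For example, a minimal sketch of granting that role on the bucket with gsutil, where <sa-email> is a placeholder for your service account’s email address:
gsutil iam ch serviceAccount:<sa-email>:objectCreator gs://<bucket_name>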
If you are interested in exporting data from Redshift to AWS S3, check out https://github.com/dbt-labs/redshift/tree/0.2.3/#unload_table-source and https://docs.getdbt.com/reference/resource-configs/pre-hook-post-hook
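For illustration, here is a rough Redshift equivalent of the post-hook above, written with a plain UNLOAD statement rather than the package’s unload_table macro; the bucket, prefix, IAM role ARN, and model name are placeholders, and the UNLOAD options should be adjusted to your needs:
-- hypothetical model file, e.g. models/export_to_s3.sql
-- the post-hook unloads the model's output table to S3 after the model is rebuilt
{{ config(
    materialized = 'table',
    post_hook = "UNLOAD ('select * from {{ this }}')
        TO 's3://<bucket_name>/<file_path_inside_bucket>_'
        IAM_ROLE 'arn:aws:iam::<account_id>:role/<role_name>'
        FORMAT AS CSV
        HEADER
        ALLOWOVERWRITE"
) }}
select * from {{ source('project-name.dataset', 'table-name') }}
Note that the IAM role passed to UNLOAD needs write access to the target S3 bucket, analogous to the GCS permissions above.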