Rittman Analytics

View Original

Bringing dbt and Analytics Engineering to Oracle Autonomous Data Warehouse

dbt (Data Build Tool) is an open source command-line tool for managing and automating data transformation workflows. It enables a standardised, modular approach to building analytics code and is typically used as part of an increasingly popular method of developing data warehouses and data platforms called analytics engineering.

The analytics engineering approach combines software engineering principles with analytics to create more maintainable, trustworthy data pipelines built on transformation logic that is modular, scalable, tested, and well-documented.

With dbt, analytics code progresses from messy, ad hoc scripts to structured engineering workflows developed using a software development lifecycle. In this post, I’ll introduce dbt, explain its benefits for Oracle Database and Autonomous Data Warehouse developers and walk through getting started with dbt and the new dbt-oracle adapter.

What is dbt?

ELT (Extract, Load, Transform) is a modern data warehousing pattern very familiar to Oracle data warehouse developers where raw data is first extracted and loaded into the target warehouse, then transformed within the warehouse itself.

This is where dbt comes in — it provides the “T” for managing transformations in ELT, with the E (extract) and L (load) typically handled by services such as Fivetran, Airbyte or Stitch.

Originally developed to help data analysts within the VC-backed startup world develop production-ready data transformation code for Amazon Redshift and other cloud data warehouses, dbt is now more of an ecosystem than single tool and comprises of:

  • An open source core (dbt Core), originally developed by Fishtown Analytics (now dbt Labs) that you run from the command-line and can schedule using cron jobs or orchestrators such as Dagster

  • a commercial SaaS service (dbt Cloud) that provides features for developing, testing, scheduling and investigating data models

  • community-provided utilities and content packages (such as our RA Attribution package for multi-touch marketing attribution) along with official and related Slack groups

Key capabilities of dbt include:

  • Dependency management — dbt handles model dependencies automatically

  • Testing — add tests to models to ensure data quality

  • Refactoring — easily refactor models without breaking dependencies

  • Documentation — auto-generate docs for all models in a project

  • CI/CD Integration — build, test, and deploy projects on every code change

Whereas tools such as Oracle Data Integrator (ODI) and Oracle Warehouse Builder provide a visual, GUI-based mapping interface for designing data flows, dbt instead takes a code-centric approach with transformations encapsulated in modular SQL SELECT statements called “models”, either written and run using a text editor or tool such as Microsoft VS Code.

Developing transformation logic using a GUI has the benefit of opening-up this task to a wider audience, but when the business requires complex data transformations such an approach can become unwieldy.

More of a concern though is the ungoverned shadow SQL or Python code that drag-and-drop tool users end-up adding to their visual transformation nodes, making it hard to track changes and ensure those transformations and embedded are tested and reliable in a scalable and automated way.

A key difference between ODI and dbt is therefore in its approach to modularity and reusable transformation logic. With ODI, code reuse is achieved by copying and modifying mappings across projects; dbt, however achieves this through the use of those modular, configurable SQL models that can be packaged up into shared libraries and invoked repeatedly.

Also, crucially, dbt models exist as simple SQL files making them straightforward to version control via Git. ODI’s visual mappings are stored within its repository, requiring use of ODI lifecycle tools.

With dbt, developers can utilize the full power of Git for branching strategies, pull requests, and decentralized development.

dbt models also provide visibility into history and differences between versions.

Getting Started with dbt-oracle

The dbt-oracle adapter (originally created by Indicium, now supported and maintained directly by Oracle) is a python package that contains the core dbt libraries together with an adapter for connecting to on-premises and cloud Oracle databases. This, along with an install of Python 3 and the oracledb client drivers package are what’s required to install dbt on your Mac, for example.

python3 -m venv dbt-oracle-venv source dbt-oracle-venv/bin/activate pip install --upgrade pip python -m pip install oracledb --upgrade python -m pip install dbt-oracle

The dbt-oracle Github repo at https://github.com/oracle/dbt-oracle contains a demo dbt package that transforms data from the SH (Sales History) example schema shipped with all Oracle database versions.

Looking at an example dbt model within that demo package, you can see examples of model configuration settings, references to other tables in the form of sources and some Jinja templating code that adds a further WHERE clause to the SQL statement when the model is compiled in incremental load mode.

See this content in the original post

Other model definitions use ref() Jinja functions instead of table names to define dependencies between tables in a transformation DAG, and use Oracle-specific configuration settings to tell dbt how specifically to materialize the database object that is the result of compiling this particular model.

See this content in the original post

dbt Core data source connections are defined using a configuration file in YAML format stored by default at $HOME/.dbt/profiles.yml, outside of the git repository files so database credentials don’t get published with your code by accident.

See this content in the original post

There are various ways that dbt can connect to an Oracle database; using thin or thick drivers from the python-oracledb Oracle client libraries package along with the choice of either TLS (Transport Layer Security) or mutual TLS (mTLS) when connecting to Oracle Autonomous Data Warehouse or Autonomous Transaction Processing.

I used the thin client mode and was able to connect to Oracle Autonomous Data Warehouse Cloud without using a wallet file, using just my database username, password and host/port/service name.

dbt Core packages are run from the command-line and when executed, compile your SQL and Jinja model code into SQL SELECT and MERGE statements that transforms, tests and runs utilities to transform your data within the Oracle database.

See this content in the original post

dbt packages typically include descriptions for fields, schema definitions and other metadata that, together with the DAG dependency graph defined by all of the ref() and source() definitions within your models can be served-up from the command line using a built-in webserver, like this:

See this content in the original post

and then visualized as a data dictionary portal and transformation dependency graph, like this:

Interested? Find Out More!

Rittman Analytics is a boutique data analytics consultancy that’s both an Oracle Partner with over 25 years experience working with Oracle technology and Oracle customers, and a long standing partner of dbt Labs, the software company that sponsors dbt Core development.

You can read more about our work with dbt on our blog, see examples of our dbt development approach in our Github repo and listen to episodes of our podcast, Drill to Detail, where we discuss dbt and the analytics engineering approach with one of dbt’s original authors, Tristan Handy.

If you’re looking for some help and assistance adopting dbt and the analytics engineering approach to data warehouse development, or to help build-out your analytics capabilities and data team using a modern, flexible and modular data stack, contact us now to organise a 100%-free, no-obligation call — we’d love to hear from you!