The dbt Semantic Layer, Data Orchestration and the Modern Enterprise Data Stack

We’re currently over in New Orleans for the dbt Labs-sponsored Coalesce 2022: The Analytics Engineering Conference and I have to say, it’s great to be back at an in-person event after the pandemic lockdowns of the past couple of years.

The conversations in the hallways, the opinions being shared and the friendships built by everyone being in the same place for a few days make the time and expense involved incredibly worthwhile. Well done to dbt Labs from us and everyone in the community for putting on such a great event.

There were two big product announcements from dbt Labs on Monday, the first of which was the addition of Python language support into dbt Core.

Intended to extend dbt’s appeal to data scientists, who typically use Python as their language of choice for building statistical and ML models, adding Python language support to dbt and dbt Cloud makes it easier to create data pipelines that involve both SQL and Python steps and that typically run on data lake-style platforms hosted on Databricks, Snowflake and Google Cloud Platform.

The second announcement was the Public Preview release of the dbt Semantic Layer, a layer of business metadata and metrics definitions that aims to place dbt Labs squarely at the centre of the emerging modern (enterprise) data stack, sherlocks headless-BI startups Cube.dev and Metriql, and makes dbt the direct competitor and alternative to today’s de-facto standard, Looker’s LookML-based universal semantic model.

If you’re familiar with the basic concept of a BI tool semantic layer, the approach taken by dbt Labs will look familiar: a layer of business logic and terminology mappings that sits on top of your data warehouse and turns the IT-centric world of tables, joins, columns and SQL functions into a trusted and accessible catalog of business objects that end-users can use to create reports and dashboards.

Not everyone is an analytics engineer, and a semantic layer is essential in delivering the “data democratization” that BI tool vendors from Business Objects and Oracle through to Looker today have promised end-users.


The idea behind the dbt Semantic Layer is to create a layer of infrastructure that contains a versioned store of business metrics, along with their dimensionality and rules for time-series and other calculations, expressed using a query such as the one below:

select *
from {{ metrics.calculate(
   metric('dbt_cloud_weekly_active_users'),
   dimensions=['country'],
   grain='day'
) }}

This catalog of metrics and calculation rules is accompanied by infrastructure that, crucially, rather than being closed and tied to one particular BI tool, is instead made available to the whole ecosystem of BI tools and other data consumers within the modern data stack.

What this in effect does is unify the data model and shared understanding of the until-now interoperable, but otherwise silo’d, components of the modern data stack, evolving it into the beginnings of a modern enterprise data stack.

As such it places dbt Labs in direct competition with Google Cloud Platform and their Looker universal semantic model, an opening-up of the LookML modeling layer to make it consumable by a number of top-tier, enterprise-class data visualization tools including Tableau, Microsoft Power BI and Google’s own Google Data Studio, now renamed Looker Studio.

So how does the dbt Semantic Layer work, both at a technical level and commercially? How does dbt, a traditionally developer-centric toolkit designed for batch-transforming data in the warehouse before users get to see it, turn itself into a language capable of expressing complex analytical queries, and a query engine capable of supporting fast, ad-hoc analysis?

The dbt Semantic Layer consists of a number of components, as shown in the diagram below from the product keynote.


1. Metric definitions, created in your dbt project using the new metrics: extension to dbt Core and sitting alongside all of your existing data model definitions and transformation logic, which specify the metric name, type, dimensionality, time grain and the filters that can be applied to that metric:

# models/marts/product/schema.yml
version: 2

models:
  - name: dim_customers

metrics:
  - name: rolling_new_customers
    label: New Customers
    model: ref('dim_customers')
    description: "The 14 day rolling count of paying customers using the product"
    type: count_distinct
    sql: user_id
    timestamp: signup_date
    time_grains: [day, week, month, quarter, year]
    dimensions:
      - plan
      - country
    filters:
      - field: is_paying
        operator: 'is'
        value: 'true'
      - field: lifetime_value
        operator: '>='
        value: '100'
      - field: company_name
        operator: '!='
        value: "'Acme, Inc'"
      - field: signup_date
        operator: '>='
        value: "'2020-01-01'"

2. The dbt-metrics package extension to dbt Core, used to define the rules for calculating the metric defined above, including time-series, period-over-period and rolling aggregate calculations, and which can, if used in isolation, materialize those calculations as a static warehouse “metrics layer” table (see the sketch after the query below).

select *
from {{ metrics.calculate(
    metric('rolling_new_customers'),
    grain='week',
    dimensions=['plan', 'country'],
    secondary_calculations=[
        metrics.period_over_period(comparison_strategy="ratio", interval=1, alias="pop_1wk"),
        metrics.period_over_period(comparison_strategy="difference", interval=1),
        metrics.period_to_date(aggregate="average", period="month", alias="this_month_average"),
        metrics.period_to_date(aggregate="sum", period="year"),
        metrics.rolling(aggregate="average", interval=4, alias="avg_past_4wks"),
        metrics.rolling(aggregate="min", interval=4)
    ],
    start_date='2022-01-01',
    end_date='2022-12-31',
    where="some_column='filter_value'"
) }}
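
Used in isolation like this, the dbt-metrics package is a build-time metrics table generator. A minimal sketch of how you might materialize one of these metric queries as that static “metrics layer” table, assuming an illustrative model file path of models/metrics/rolling_new_customers_weekly.sql, could look like this:

-- models/metrics/rolling_new_customers_weekly.sql (illustrative path)
-- Materializes the metric query as a static warehouse table that is
-- rebuilt on each dbt run, rather than being computed on-the-fly
{{ config(materialized='table') }}

select *
from {{ metrics.calculate(
    metric('rolling_new_customers'),
    grain='week',
    dimensions=['plan', 'country']
) }}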

3. Where the magic happens, though, is in the new and foundational data infrastructure introduced with the dbt Semantic Layer, which takes what is essentially a metrics table generator for developers and turns it into a ROLAP query engine designed to support ad-hoc query workloads. A new hosted and always-available dbt Server component (source-available and free for non-commercial use), together with a rewriting SQL proxy server, does this by accepting query requests that reference dbt metrics and live-compiling them into native SQL en-route to your data warehouse, limited initially to Snowflake but with Google BigQuery and Redshift support coming soon.
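
To make that rewriting step concrete, the sketch below shows the round-trip: the query a BI tool might send through the proxy, followed by an illustrative guess at the kind of native SQL that could arrive at the warehouse. The compiled SQL and the analytics.dim_customers table name are assumptions for illustration; the SQL the server actually generates will differ, and will also handle the rolling window, date spine and the remaining metric filters.

-- What the BI tool sends through the proxy server
select *
from {{ metrics.calculate(
    metric('rolling_new_customers'),
    grain='week',
    dimensions=['plan']
) }}

-- An illustrative, simplified sketch of the native SQL the proxy
-- might forward to the warehouse (assumed table name; most filters
-- and the rolling-window logic omitted for brevity)
select
    date_trunc('week', signup_date) as date_week,
    plan,
    count(distinct user_id) as rolling_new_customers
from analytics.dim_customers
where is_paying is true
group by 1, 2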

4. Finally, the Metadata API component provides access to the Semantic Layer for the BI tools, along with the data integration, data observability and data science tools, that have announced integrations with it; they can either import the metrics it contains into their own data catalogs, or send queries using those metrics to dbt’s proxy server, which then translates those queries on-the-fly into native SQL queries against the data warehouse.

dbt Labs won’t necessarily be able to translate their success in owning the modern data stack’s transformation layer into doing the same for the semantic layer; there are already established players in that space such as Cube.dev, who have a working implementation of their product compared to dbt’s Public Preview, and who have extended their offering to include caching and features aimed at the embedded analytics market.

And a semantic model of this type comes with two fundamental assumptions: that the semantic model is centralized, and that it is owned by your development and engineering team rather than by the business users who use it and who have the domain knowledge to make it relevant and useful.

But fundamentally, what this move by dbt Labs does is divide the modern data stack world into two rival “world views”, not unlike the competition for ideas and influence today between the US and China. One is for a well-planned, efficient but tightly-controlled semantic model that works well today, but is ultimately there to drive more sales of Looker and depends on Google continuing to invest in it (and not kill it off).

The other worldview, the US world view if you’ll follow my tortured analogy, is a coalition of the willing centred around dbt Labs: an open alliance of BI tool and other vendors that choose to invest in an open standard that, should it succeed, raises the collective value of the modern data stack ecosystem and enables its transition from a disruptive technology appealing only to early adopters into a serious competitor to the monolithic BI suites that I used to write 1000-page books about.

The significance of dbt Labs adding support for Python workloads to dbt Core is the move it represents into the data orchestration market currently served by vendors such as Elementl (Dagster), Astronomer (Apache Airflow) and Prefect, and the transformation it represents in the value and fungibility of the product they sell now, dbt Cloud.

Right now, dbt Labs’ valuation of $4.2bn on revenue of around $60m comes largely from sales of their dbt Cloud service to large enterprise customers whose list of regulatory requirements is matched only by the depth of their pockets and their willingness to pay, but who also have the IT and engineering support to build their own schedulers and release pipelines out of stock AWS or GCP cloud services.

Adding “good-enough” data orchestration capabilities to dbt Cloud will, over time, make it a compelling alternative to paying for products such as Dagster (Elementl), Airflow (Astronomer) and Prefect when budgets are under pressure, and the extension in the user personas it serves increases the potential audience for the service.

More significantly though, the dbt Semantic Layer is well-placed to become the universal semantic model and unifying infrastructure for the modern data stack as that ecosystem of products, like most disruptive technologies, extends its appeal from early adopters up into the enterprise space.

And being the software company that realises that vision, and that builds the software essential to delivering it and making it relevant to the enterprise market, is presumably the driver behind the 70x multiple of dbt Labs’ current revenue that the $4.2bn valuation represents.

INTERESTED? FIND OUT MORE

Rittman Analytics is a boutique analytics consultancy specializing in the modern data stack that can help you centralise your data sources and enable your end-users and data team with best practices and a modern analytics workflow.

If you’re looking for some help and assistance understanding the impact and value delivered by your content or other marketing activity, or to help build out your analytics capabilities and data team using a modern, flexible and modular data stack, contact us now to organize a 100%-free, no-obligation call. We’d love to hear from you!

Mark Rittman

CEO of Rittman Analytics, host of the Drill to Detail Podcast, ex-product manager and twice company founder.

https://rittmananalytics.com