Analytics engineering for fact tables
How to model data for a fact table, based on real analytics use cases.
The goal of data modeling is to make retrieving data fast (for the engine that processes the queries) and easy (for the people who write those queries).
Most data warehousing practices are meant to emphasize speed. Analytics engineering (a term popularized by dbt, and sometimes bundled into the term full-stack analytics) is the process of modeling data for usability. Even if that’s not what you call it, you’re probably practicing analytics engineering whenever you need to put together a curated dataset, a segment or metric, or a dashboard for someone else.
This tutorial shows you how to apply an analytics engineering approach to a dataset at the data warehouse level—and more specifically, to a certain kind of dataset called a fact table.
Introduction
Dimension tables contain a snapshot of data at a point in time, such as the number of partially-finished mugs you have at the end of your work day.
| time | total mugs |
|---------------------|------------|
| 2022-08-16 17:30:00 | 3 |
Fact tables contain a history of information, such as the rate at which you’ve been drinking your coffee throughout the day.
| time | mug | coffee remaining |
|---------------------|----------|------------------|
| 2022-08-16 08:00:00 | 1 | 100% |
| 2022-08-16 08:01:00 | 1 | 0% |
| 2022-08-16 09:00:00 | 2 | 100% |
| 2022-08-16 12:00:00 | 3 | 100% |
| 2022-08-16 12:30:00 | 3 | 99% |
| 2022-08-16 17:30:00 | 3 | 98% |
Fact and dimension tables are used together in a star schema (or the closely related snowflake schema) to organize information in a data warehouse.
You may want to build a fact table if:
- Your data sources (the systems that generate data, such as your application database) only store a current snapshot of information by saving it over the previous snapshot.
- You’re creating a dataset to power embedded analytics for your customers. Standalone fact tables are great for self-serve analytics, because they can cover a wide range of use cases without relying on joins.
But before we get started, let’s add one more mug of caffeine to your daily total—we’ve got a lot ahead of us!
| time | total mugs |
|---------------------|------------|
| CURRENT_TIMESTAMP() | n+1 |
Overview
In this tutorial, we’ll be working with a dimension table that we’ll call `account`, like a dimension table that you might get from a CRM. Let’s assume this `account` dimension table stores the current state of our customers, with the current state being updated by our application.
The `account` table looks something like this:
| id | country | type | status |
|------------------|-------------|------------|-----------|
| 941bfb1b2fdab087 | Croatia | Agency | Active |
| dbb64fd5c56e7783 | Singapore | Partner | Active |
| 67aae9a2e3dccb4b | Egypt | Partner | Inactive |
| ced40b3838dd9f07 | Chile | Advertiser | Test |
| ab7a61d256fc8edd | New Zealand | Advertiser | Inactive |
Part 1: Designing the fact table
To design a fact table schema based on `account`, we’ll need to consider the kinds of analytics questions that people might ask about changes to customer accounts over time. Since the `account` table contains a `status` field, we can answer questions such as:
- How many new accounts were added each month?
- How many accounts churned (became inactive) each month?
- What’s the churn rate by customer cohort?
Part 2: Implementing the fact table
To create `fact_account` from the data stored in `account`, we’ll write a SQL script to:
- Initialize `fact_account` with today’s `account` data.
- Get a snapshot of the rows in `account` (assuming it gets updated by another system).
- Compare each day’s `account` snapshot against the historical data in `fact_account`.
- Insert a new row into `fact_account` for each account that has changed since the previous day’s snapshot.
Part 3: Testing the fact table with common use cases
To check if our fact table is useful in practice, we’ll set it up with Metabase and try to answer all three of our sample analytics questions.
Part 4: Improving fact table performance
The last section of this tutorial gives you an idea of what it’ll look like to iterate on your fact table as it scales to accommodate more history (and more questions!).
How to follow along with this tutorial
If you want to apply the steps below to your own data, we recommend working with a dimension table that gets updated regularly by your source system, and a database or data warehouse of your choice.
In this tutorial, we’re using Firebolt to test out their partner driver with Metabase. Firebolt is a data warehouse that employs some slightly modified SQL DDL to load data in a format that’s designed to make queries run faster.
If you’re following along with your own data, your SQL syntax probably won’t match the sample code exactly. For more information, you can check out the reference guides for common SQL dialects.
Designing the fact table
A basic fact schema
First, we’ll draft up a schema for our fact table, which we’ll call `fact_account`. Putting the schema in a visual reference like the table shown below can make it easier to validate whether `fact_account` will support the queries we want to make (i.e., the analytics questions that people want to answer). A visual reference also doubles as a helpful resource later on for anyone who is new to `fact_account`.
In this example, we’re going to keep all of the original columns from `account`. If we need to omit any columns, we can always hide them via the Data Model page in Metabase. Hiding columns in Metabase is less disruptive than excluding too many columns from our schema at the start, since we’d have to regenerate the table each time we needed to restore a column.
We’ll also include a new column called `updated_at` to record the timestamp at which a row was inserted into the table. In practice, `updated_at` can be used to approximate the date or time that a change was made to an account.
This addition is based on the assumption that all of the `account` attributes can change, except for `id`. For example, the `status` of a given account can change from Active to Inactive, or the `type` of the account from Partner to Advertiser.
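Once `updated_at` is being populated, it lets you reconstruct an account’s change history. As a quick sketch (the sample `id` comes from the `account` table above; exact date handling varies by warehouse):

```sql
-- List every recorded change for one account, oldest first.
-- Each row approximates the moment an attribute changed.
SELECT
    id,
    status,
    type,
    updated_at
FROM
    fact_account
WHERE
    id = '941bfb1b2fdab087'
ORDER BY
    updated_at;
```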
Basic `fact_account` schema example
| Column name | Data type | Description | Expected values |
|-----------------|-----------|--------------------------------------------------------------|---------------------------------------------------|
| id | varchar | The unique id of a customer account. | 16 character string |
| status | varchar | The current status of the account. | Active, Inactive, or Test |
| country | varchar | The country where the customer is located. | Any of the "English short names" used by the ISO. |
| type | varchar | The type of account. | Agency, Partner, or Advertiser |
| updated_at | datetime | The date a row was added to the table | |
A better fact schema
To check the usability of the schema, we’ll write out a pseudo-SQL query for one of our analytics questions:
-- How many new accounts have been added each month?
WITH new_account AS (
    SELECT
        id,
        MIN(updated_at) AS first_added_at -- Infer the account creation date
    FROM
        fact_account
    GROUP BY
        id
)
SELECT
    DATE_TRUNC('month', first_added_at) AS report_month,
    COUNT(DISTINCT id) AS new_accounts
FROM
    new_account
GROUP BY
    report_month;
The current `fact_account` schema needs an extra step to get (or estimate) the “creation” timestamp for each account (in this case, estimation is necessary for the accounts that were already active before we started keeping a history).
It’ll be much easier to answer questions about “new accounts” if we simply add a column for an account’s creation date to the `fact_account` schema. But adding columns will increase the complexity of the table (the time it takes for someone to understand and query it), as well as the complexity of the SQL script (the time it takes to update the table).
To help us decide whether it’s worth adding a column to our `fact_account` schema, we’ll consider whether the creation timestamp can be used for other types of analytics questions about accounts.
The creation timestamp of an account can also be used to calculate:
- The age of an account.
- The time to an important event (such as the number of days it takes for an account to churn or become inactive).
These metrics can be applied to interesting use cases, such as reducing customer churn or calculating LTV, so the creation timestamp is probably worth including in `fact_account`.
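As a sketch of those two calculations against the basic schema (assuming the creation timestamp is inferred as each account’s earliest `updated_at`; `DATE_DIFF` here stands in for whatever date-difference function your dialect provides):

```sql
-- Approximate each account's age and time-to-churn.
SELECT
    id,
    -- Days since the account's earliest recorded row.
    DATE_DIFF('day', MIN(updated_at), CURRENT_TIMESTAMP()) AS account_age_days,
    -- Days from creation to the first 'Inactive' record (NULL if never churned).
    DATE_DIFF(
        'day',
        MIN(updated_at),
        MIN(CASE WHEN status = 'Inactive' THEN updated_at END)
    ) AS days_to_churn
FROM
    fact_account
GROUP BY
    id;
```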
We’ll add the column `is_first_record` to keep our schema streamlined. This column will flag the row that corresponds to an account’s earliest entry in the fact table.
If you’re planning to make a fact table to simplify self-serve (such that the fact table includes information that’s usually captured in a dimension table), you can also add a column for `is_latest_record`. This column will help people filter `fact_account` for current data (in addition to historical data), so that they can use the same table to quickly answer questions such as: “How many active accounts do we have to date?”
Using this convention can make for slower queries, but easier adoption when rolling out self-serve for the first time (so people don’t have to remember the joins between fact and dimension tables).
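With `is_latest_record` in place, that “active accounts to date” question reduces to a simple filter. A minimal sketch, assuming status values that match the sample data above:

```sql
-- How many active accounts do we have to date?
SELECT
    COUNT(DISTINCT id) AS active_accounts
FROM
    fact_account
WHERE
    is_latest_record = TRUE
    AND status = 'Active';
```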
A better `fact_account` schema
| Column name | Data type | Description | Expected values |
|-----------------|-----------|---------------------------------------------------------------------|---------------------------------------------------|
| id | varchar | The unique id of a customer account. | 16 character string |
| status          | varchar   | The current status of the account.                                  | "Active", "Inactive", or "Test"                   |
| country | varchar | The country where the customer is located. | Any of the "English short names" used by the ISO. |
| type | varchar | The type of account. | "Agency", "Partner", or "Advertiser" |
| ... | ... | ... | |
| updated_at | datetime | The date a row was added to the table | |
| is_first_record | boolean | TRUE if this is the first record in the table for a given id | |
| is_latest_record| boolean | TRUE if this is the most current record in the table for a given id | |
Initializing the fact table
To implement the fact schema, we’ll start by creating an empty fact_account
table to store the account
table snapshots over time.
We’re working with a Firebolt data warehouse, so we’re going to create the fact table from the Firebolt console. We’ll select SQL Workspace > New Script, and write:
-- Create an empty fact_account table in your data warehouse.
CREATE FACT TABLE IF NOT EXISTS fact_account
(
    id varchar,
    status varchar,
    country varchar,
    type varchar,
    updated_at timestamp,
    is_first_record boolean,
    is_latest_record boolean
);
Note that Firebolt’s DDL includes the `FACT` keyword (it can be excluded in standard SQL DDL).
If you’re creating fact tables in the same SQL script that you’re using to ingest data, you can follow the well-commented SQL script templates from the Import Script button on the collapsible right sidebar.
Next, we’ll populate `fact_account` with everything that’s in the current `account` table. You can include these statements in the same SQL script that creates the fact table:
-- Put an initial snapshot of data from "account" into "fact_account".
-- Add "quality of life" columns to make the data model nicer to work with.
INSERT INTO fact_account
SELECT
    *,
    CURRENT_TIMESTAMP() AS updated_at,
    TRUE AS is_first_record,
    TRUE AS is_latest_record
FROM
    account;
Loading the fact table incrementally
To update `fact_account` with regular snapshots from `account`, we’ll write another SQL script to:
- Query `account` for a current snapshot of the data.
- Compare the current data against the most recent data in `fact_account`.
- Insert rows into `fact_account` for records that have changed since the last snapshot.
You’ll need to save and schedule this SQL script outside of your data warehouse, using tools like dbt or Dataform. For more information, check out the Transforming data section of our ETL, ELT, and Reverse ETLs tutorial in Learn.
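If you use dbt, the comparison-and-insert logic can live in an incremental model. The sketch below is illustrative only: the model and source names are hypothetical, and the `is_first_record`/`is_latest_record` bookkeeping from the schema above is omitted for brevity.

```sql
-- models/fact_account.sql (hypothetical dbt model)
{{ config(materialized='incremental') }}

SELECT
    a.*,
    CURRENT_TIMESTAMP() AS updated_at
FROM
    {{ source('crm', 'account') }} AS a
{% if is_incremental() %}
-- On incremental runs, only insert accounts whose current attributes
-- don't already appear in the model (a simplification of the snapshot comparison).
WHERE NOT EXISTS (
    SELECT 1
    FROM {{ this }} AS fa
    WHERE fa.id = a.id
      AND fa.status = a.status
      AND fa.country = a.country
      AND fa.type = a.type
)
{% endif %}
```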
-- Add the latest snapshot from the account table.
-- This assumes that account is regularly updated from the source system.
-- Insert a new row for each account that has changed since the last snapshot.
INSERT INTO fact_account
SELECT
    a.*,
    CURRENT_TIMESTAMP() AS updated_at,
    FALSE AS is_first_record,
    TRUE AS is_latest_record
FROM
    account a
    JOIN fact_account fa ON a.id = fa.id
WHERE
    fa.is_latest_record = TRUE
    AND (a.status <> fa.status
         OR a.country <> fa.country
         OR a.type <> fa.type);
-- Update the rows from the previous snapshot, if applicable.
-- After the insert, each changed account has two rows flagged as the
-- latest record; flip the flag to FALSE on the older of the two.
WITH previous_snapshot AS (
    SELECT
        id,
        updated_at,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_number
    FROM
        fact_account
    WHERE
        is_latest_record = TRUE)
UPDATE
    fact_account fa
SET
    is_latest_record = FALSE
FROM
    previous_snapshot ps
WHERE
    fa.id = ps.id
    AND fa.updated_at = ps.updated_at
    AND ps.row_number = 2;
Testing the fact table with common use cases
Here’s what we would expect `fact_account` to look like after it starts getting populated with daily snapshots of `account`:
| id | country | type | status | updated_at | is_first_record | is_latest_record |
|------------------|-----------|------------|-----------|---------------------|-----------------|------------------|
| 941bfb1b2fdab087 | Croatia | Agency | Active | 2022-02-04 09:02:09 | TRUE | FALSE |
| 941bfb1b2fdab087 | Croatia | Partner | Active | 2022-07-10 14:46:04 | FALSE | TRUE |
| dbb64fd5c56e7783 | Singapore | Partner | Active | 2022-05-10 02:42:07 | TRUE | FALSE |
| dbb64fd5c56e7783 | Singapore | Partner | Inactive | 2022-07-14 14:46:04 | FALSE | TRUE |
| ced40b3838dd9f07 | Chile | Advertiser | Test | 2022-07-02 06:22:34 | TRUE | TRUE |
Now, we can put our fact table into Metabase to see how it does at answering our example analytics questions:
- How many new accounts were added each month?
- How many accounts churned (became inactive) each month?
- What’s the churn rate by customer cohort?
Setting up Metabase
If you don’t have your database set up with Metabase already, you can set things up in just a few minutes:
- Download and install Metabase, or sign up for a free trial of Metabase Cloud.
- Add the database with your fact table.
If you’re following this tutorial using Firebolt, you’ll need the username and password that you use to log into the Firebolt console, as well as the database name (listed on the console homepage).
- From the top right of the Metabase home page, click New > Question.
New accounts
Let’s say we want to know the total number of new accounts that were added last month.
This kind of result would work with self-serve use cases such as:
- A “static number” visualization.
- A progress bar visualization to measure last month’s new accounts against a goal number.
People can self-serve a metric like “new accounts added in the past month” from Metabase’s query builder using these steps:
- Go to New > Question.
- Select `fact_account` as the starting data.
- From Pick the metric you want to see, select Number of distinct values of > ID.
- From the Filter button, click Is First Record and select “Is” (default setting), with the value “True”.
- From the Filter button, click Status and select “Is Not”, with the value “Test”.
- From the Filter button, click Updated At and select “Last Month”.
Or, they can self-serve the same value from any SQL IDE (including the Metabase SQL editor) using a snippet like this:
SELECT
    COUNT(DISTINCT id) AS new_accounts
FROM
    fact_account
WHERE
    is_first_record = TRUE
    AND status <> 'Test'
    AND DATE_TRUNC('month', updated_at) = DATE_TRUNC('month', CURRENT_TIMESTAMP) - INTERVAL '1 MONTH';
Churned accounts
In addition to new accounts that are added to our business, we also want to keep track of churned accounts that were lost. This time, instead of limiting our results to last month’s data, we’ll get a monthly summary table like this:
| report_month | churned_accounts |
|--------------|------------------|
| 2022-05-01 | 23 |
| 2022-06-01 | 21 |
| 2022-07-01 | 16 |
This kind of result can help people self-serve:
- A bar or line chart to plot the change in `churned_accounts` for each `report_month`.
- A “trend” visualization to show the percentage change in the number of churned accounts, month over month.
- A saved question or model that can be joined to other tables on `report_month`. This enables people to use the `churned_accounts` column in calculations with other columns that aren’t found in `fact_account`.
People can self-serve the “monthly churned accounts” summary table from Metabase’s query builder by following these steps:
- Go to New > Question.
- Select `fact_account` as the starting data.
- From Pick the metric you want to see, select Number of distinct values of > ID.
- From Pick a column to group by, select Updated At: Month.
- From the Filter button, click Status and select “Inactive”.
They can also get the results from any SQL IDE (including Metabase’s SQL editor) using a query like this:
SELECT
    DATE_TRUNC('month', updated_at) AS report_month,
    COUNT(DISTINCT id) AS churned_accounts
FROM
    fact_account
WHERE
    status = 'Inactive'
GROUP BY
    report_month;
Advanced use case: cohort table
A cohort table is one of the most complex use cases that can be powered by a well-designed fact table. These tables measure churn rate as a function of account age, and can be used to identify groups of customers that are especially successful or unsuccessful.
We want to get a result like this:
| age | churned_accounts | total_accounts | churn_rate |
| --- | ---------------- | -------------- | ---------- |
| 1 | 21 | 436 | = 21 / 436 |
| 2 | 26 | 470 | = 26 / 470 |
| 3 | 18 | 506 | = 18 / 506 |
Since this is an advanced use case, we’ll focus on showing you how the “shape” of the `fact_account` table can be changed into a cohort table. These steps can be done in Metabase by creating a series of saved SQL questions that build upon one another.
1. Create a saved question that gets `first_added_month` and `churned_month` for each account:

   Sample result

   | id               | first_added_month | churned_month |
   |------------------|-------------------|---------------|
   | 941bfb1b2fdab087 | 2022-02-01        | null          |
   | dbb64fd5c56e7783 | 2022-05-01        | 2022-07-01    |
   | 67aae9a2e3dccb4b | 2022-07-01        | null          |

   SQL snippet

   SELECT
       id,
       MIN(CASE WHEN is_first_record = TRUE THEN DATE_TRUNC('month', updated_at) END) AS first_added_month,
       MIN(CASE WHEN status = 'Inactive' THEN DATE_TRUNC('month', updated_at) END) AS churned_month
   FROM
       fact_account
   GROUP BY
       id;
2. Join the saved question from step 1 to a column that has one row per month. You can do this in SQL by generating a series (or you may be able to use an existing table in your data warehouse). Take note of the join conditions on the months.

   Sample result

   | id               | first_added_month | churned_month | report_month | age | is_churned |
   |------------------|-------------------|---------------|--------------|-----|------------|
   | dbb64fd5c56e7783 | 2022-05-01        | 2022-07-01    | 2022-05-01   | 1   | FALSE      |
   | dbb64fd5c56e7783 | 2022-05-01        | 2022-07-01    | 2022-06-01   | 2   | FALSE      |
   | dbb64fd5c56e7783 | 2022-05-01        | 2022-07-01    | 2022-07-01   | 3   | TRUE       |

   SQL snippet

   WITH date_series AS (
       SELECT *
       FROM GENERATE_SERIES('2022-01-01'::date, '2022-12-31'::date, '1 month'::interval) report_month
   )
   SELECT
       s.*,
       d.report_month,
       -- Age in months; the date-difference function varies by dialect.
       DATE_DIFF('month', s.first_added_month, d.report_month) + 1 AS age,
       CASE WHEN s.churned_month = d.report_month THEN TRUE ELSE FALSE END AS is_churned
   FROM
       step_1 s
       JOIN date_series d
           ON d.report_month >= s.first_added_month
           AND (d.report_month <= s.churned_month OR s.churned_month IS NULL)
           AND d.report_month <= CURRENT_TIMESTAMP::date;
3. Your result from step 2 can now be aggregated from the query builder into the final result (you can calculate the churn rate using a custom column).

   Sample result

   | age | churned_accounts | total_accounts | churn_rate |
   |-----|------------------|----------------|------------|
   | 1   | 21               | 436            | = 21 / 436 |
   | 2   | 26               | 470            | = 26 / 470 |
   | 3   | 18               | 506            | = 18 / 506 |

   SQL snippet

   SELECT
       age,
       COUNT(DISTINCT CASE WHEN is_churned = TRUE THEN id END) AS churned_accounts,
       COUNT(DISTINCT id) AS total_accounts,
       -- Most dialects can't reference the aliases above, so repeat the expressions.
       COUNT(DISTINCT CASE WHEN is_churned = TRUE THEN id END) * 1.0
           / COUNT(DISTINCT id) AS churn_rate
   FROM
       step_2
   GROUP BY
       age;
Improving fact table performance
Once we have a working fact table in production, we’ll want to pay attention to how it scales as:
- The table gets updated with more history.
- More people start running queries against the table in parallel.
Let’s say that the churn logic becomes very popular, such that our `fact_account` table becomes a dependency (and bottleneck) for many downstream dashboards and aggregations.
To improve the performance of queries against the fact table, we’ll want to pre-calculate aggregations against the columns most commonly used in churn calculations.
There are a few ways to do this in SQL databases:
- Add indexes to the columns that are used most often in `GROUP BY` statements.
- Create views of the summarized (pre-aggregated) data.
In our Firebolt data warehouse, we can combine both of these optimizations using aggregating indexes. Defining an aggregating index tells the Firebolt engine to create additional tables (under the hood) that should be referenced instead of the main fact table when a SQL query asks to apply a certain aggregation over a given column.
Aggregating indexes can also be included in the SQL script that you use to initialize and load the fact table (but it’s easier to choose the right indexes after you’ve had a chance to observe how people use the table in practice).
Here’s an example of a Firebolt aggregating index that helps speed up the count of cumulative and current churned accounts over different reporting periods:
CREATE AGGREGATING INDEX IF NOT EXISTS churned_accounts ON fact_account
(
    updated_at,
    DATE_TRUNC('day', updated_at),
    DATE_TRUNC('week', updated_at),
    DATE_TRUNC('month', updated_at),
    DATE_TRUNC('quarter', updated_at),
    COUNT(DISTINCT CASE WHEN status = 'Inactive' THEN id END),
    COUNT(DISTINCT CASE WHEN status = 'Inactive' AND is_latest_record = TRUE THEN id END)
);
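A query shaped like the index definition (same `DATE_TRUNC` grouping, same conditional `COUNT(DISTINCT …)`) can then be served from the aggregating index instead of a full scan of `fact_account`. For example (status value assumed to match the sample data above):

```sql
-- Monthly count of churned accounts.
-- Matches a month-level grouping and a conditional distinct count
-- from the aggregating index, so Firebolt can avoid scanning the fact table.
SELECT
    DATE_TRUNC('month', updated_at) AS report_month,
    COUNT(DISTINCT CASE WHEN status = 'Inactive' THEN id END) AS churned_accounts
FROM
    fact_account
GROUP BY
    report_month;
```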
Further reading
Learn more about data modeling, data warehousing, and working with SQL:
- Models in Metabase
- Common data model mistakes made by startups
- Best practices for data warehousing
- Which data warehouse should you use?
- Working with SQL
Next: Data normalization
What a normalized database looks like and why table structure matters.