Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Analytics Engineering Project Guidance #415

Merged
merged 36 commits into from
Dec 18, 2024
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
51f08ef
project structure
alex-pavlopoulos Jun 25, 2024
59617ff
some updates to staging
alex-pavlopoulos Jun 26, 2024
b1fafcb
spelling mistake
alex-pavlopoulos Jun 26, 2024
ec5d4f3
more additions
alex-pavlopoulos Jun 26, 2024
582d776
adding style guide
alex-pavlopoulos Jul 1, 2024
eb5a2ab
small changes
alex-pavlopoulos Jul 2, 2024
f6bd385
big update to style guide
alex-pavlopoulos Jul 3, 2024
1a3260f
adding intermediate guide
alex-pavlopoulos Jul 3, 2024
0994fb1
formatting
alex-pavlopoulos Jul 3, 2024
b00f2b9
more updates
alex-pavlopoulos Jul 4, 2024
b2f8658
Update data-modelling-concepts.md
alex-pavlopoulos Jul 11, 2024
5b788cd
edits
alex-pavlopoulos Jul 18, 2024
efa8e46
fix
alex-pavlopoulos Jul 18, 2024
1838093
Merge branch 'ae-style-guide' of github.com:moj-analytical-services/u…
alex-pavlopoulos Jul 18, 2024
364d0f6
slack
alex-pavlopoulos Jul 18, 2024
0c8eaa2
Rename data-modelling-concepts.md to data-modelling-style-guide.md
hollyfurniss-moj Jul 24, 2024
039ada5
minor restructure of content
hollyfurniss-moj Jul 24, 2024
083f336
add dbt guidance version
hollyfurniss-moj Jul 24, 2024
35cae24
add jinja & yaml
hollyfurniss-moj Jul 24, 2024
5a197b8
Update project-structure.md
hollyfurniss-moj Jul 24, 2024
280a707
fix some typos
hollyfurniss-moj Jul 24, 2024
40d71a8
Delete source/documentation/tools/create-a-derived-table/data-modelli…
hollyfurniss-moj Jul 24, 2024
2f78b1f
Update source/documentation/tools/create-a-derived-table/project-stru…
alex-pavlopoulos Aug 21, 2024
b327791
Making suggested edits (#437)
alex-pavlopoulos Sep 18, 2024
607c29e
adding marts discussion (#438)
alex-pavlopoulos Dec 12, 2024
eec625a
adding derived section and making suggested changes
Dec 12, 2024
48bd24c
Merge remote-tracking branch 'origin/main' into ae-style-guide
Dec 12, 2024
8015068
fixing some formatting and style issues
Dec 12, 2024
b198f27
adding some commentary
Dec 12, 2024
9dbe0a5
Update source/documentation/tools/create-a-derived-table/project-stru…
alex-pavlopoulos Dec 13, 2024
ecc7d3c
Update source/documentation/tools/create-a-derived-table/project-stru…
alex-pavlopoulos Dec 13, 2024
90853f5
Update source/documentation/tools/create-a-derived-table/project-stru…
alex-pavlopoulos Dec 13, 2024
3b578f2
suggested changes
Dec 13, 2024
a80f9af
adding tip blocks
Dec 17, 2024
d3051e3
Update source/documentation/tools/create-a-derived-table/project-stru…
alex-pavlopoulos Dec 17, 2024
5ac5ac8
Food preparation analogy (#477)
alex-pavlopoulos Dec 17, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth renaming this file / title - style guide

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remembered we'll need to update the contents of the CaDeT guidance overall so our new pages are discoverable. Think it's here: https://github.com/moj-analytical-services/user-guidance/blob/main/source/documentation/tools/create-a-derived-table/index.md

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,381 @@
# Data Modelling Style Guide

last updated: 23/07/2024

# How we style our create-a-derived-table projects

## Why does style matter?

Style might seem like a trivial, surface-level issue, but it's an important aspect of a well-built project. A consistent, clear style enhances readability and makes your project easier to understand and maintain. Highly readable code helps build clear mental models making it easier to debug and extend your project. It's not just a favor to yourself, though; equally importantly, it makes it less effort for others to understand and contribute to your project, which is essential for peer collaboration, open-source work, and onboarding new team members. [A style guide lets you focus on what matters](https://mtlynch.io/human-code-reviews-1/#settle-style-arguments-with-a-style-guide), the logic and impact of your project, rather than the superficialities of how it's written. This brings harmony and pace to your team's work, and makes reviews more enjoyable and valuable.

As a group of Analytics Engineers, we have thoroughly reviewed [guidance published by dbt](https://docs.getdbt.com/best-practices/how-we-style/0-how-we-style-our-dbt-projects) (the technology behind create-a-derived-table) and adapted it to suit our needs in this style guide.

Please take the time to familiarise yourself with this style guide, and our guidance on project structure [add link], before starting projects in create-a-derived-table.

## What's important about style?

There are two crucial tenets of code style:

- Clarity
- Consistency

Style your code in such a way that you can quickly read and understand it. It's also important to consider code review and git diffs. If you're making a change to a model, you want reviewers to see just the material changes you're making clearly.

Once you've established a clear style, stay consistent. This is the most important thing. Everybody on your team needs to have a unified style, which is why having a style guide is so crucial. If you're writing a model, you should be able to look at other models in the project that your teammates have written and read in the same style. If you're writing a macro or a test, you should see the same style as your models. Consistency is key.

# How we style our dbt models

## Fields and model names

- 👥 Models should be pluralized, for example, `customers`, `orders`, `products`. Alhough this is a good best practice, we accept that this may not work with the projects you are working on, so if you cannot keep to it then that is fine.
- 🔑 Each model should have a primary key and that primary key should be the first field in the table.
- 🔑 The primary key of a model should be named `<object>_id`, for example, `account_id`. This makes it easier to know what `id` is being referenced in downstream joined models.
- Use underscores for naming dbt models; avoid dots or camel case.
- ✅ `models_without_dots`
- ❌ `models.with.dots`
- ❌ `CamelCaseModels`
- Most data platforms use dots to separate `database.schema.object`, so using underscores instead of dots reduces your need for [quoting](/reference/resource-properties/quoting) as well as the risk of issues in certain parts of dbt Cloud. For more background, refer to [this GitHub issue](https://github.com/dbt-labs/dbt-core/issues/3246).
- 🔑 Keys should be string data types. Additionally we adives using the hash function `dbt_utils.generate_surrogate_key` ([guidance here](https://github.com/dbt-labs/dbt-utils?tab=readme-ov-file#generate_surrogate_key-source)) to create unique keys. This ensures there is a unique id for each row as well as making the ids uniform in length.
- 🔑 Consistency is key! Use the same field names across models where possible. For example, a key to the `customers` table should be named `customer_id` rather than `user_id` or 'id'.
- ❌ Do not use abbreviations or aliases. Emphasize readability over brevity. For example, do not use `cust` for `customer` or `o` for `orders`. Again, in AE we accept that in some cases this may not be possible. We want to prioritise readability, so try your best to be as descriptive as possible, however, if this is not practical then do not lose sleep over it.
- ❌ Avoid reserved words as column names. create-a-derived-table reserved words:
- something...
- ➕ Booleans should be prefixed with `is_` or `has_`.
- 🕰️ Timestamp columns should be named `<event>_at`(for example, `created_at`) and should be in UTC. If a different timezone is used, this should be indicated with a suffix (`created_at_pt`).
- 📆 Dates should be named `<event>_date`. For example, `created_date.`
- 🔙 DBT suggests event dates and times should be past tense, we do not beleive this is necessary for create-a-derived-table as there are many examples of fields that are well established and changing them would cause confusion. We do however suggest following this for meta data like &mdash; `created`, `updated`, or `deleted`.
- 💱 Price/revenue fields should be in decimal currency (`19.99` for £19.99; many app databases shop prices as integers in pence). If a non-decimal currency is used, indicate this with a suffix (`price_in_pence`).
- 🐍 Schema, table and column names should be in `snake_case`.
- 🏦 Use names based on the _business_ terminology, rather than the source terminology. For example, if the source database uses `user_id` but the business calls them `customer_id`, use `customer_id` in the model.
- 🔢 Versions of models should use the suffix `_v1`, `_v2`, etc for consistency (`customers_v1` and `customers_v2`).
- 🗄️ DBT suggest a consistant ordering of data types in your models, for our use case we do not see this as advantageous as it can be helpful to group fields based on their relevance to eachother, say a flag and the field it is referring to. We therefore advise that a consistent grouping is followed but it does not necessarily need to be based on field type. Where possible ids should be the first fields in a model and we expect the primary key to be **the first** field.

## Example model

Below is an example finance model, this follows the dbt style of grouping by field type, we will also include an example of a model where this is not the case

```sql
with

source as (

select * from {{ source('sop_finance_stg', 'base_hmpps_general_ledger_gl01') }}

),

renamed as (

select

---------- ids
id as cost_centre_id, -- primary key
a_id as analysis_code_id,
objective as cobjective_code_id,

---------- strings
version

---------- numerics
cast(debit_amount / 100.0) as float) as debit_amount,
cast(credit_amount / 100.0) as float) as credit_amount,
cast(total_amount / 100.0) as float) as total_amount,
---------- booleans
is_recoverable,

---------- dates
date(paid_date) as paid_date,

---------- timestamps
transction_at

from source

)

select * from renamed
```
```SQL
Another example
```

# How we style our SQL

## Basics

- 👻 Use Jinja comments (`{# #}`) for comments that should not be included in the compiled SQL. When dbt compiles your code it will include code using SQL comments like `/* */` and `--`
- ⏭️ Use trailing commas in lists. e.g.
```SQL
select
defendant_id,
court_id,
disposal_date,
ho_offence_code
...
```
rather than
```SQL
select
defendant_id
, court_id
, disposal_date
, ho_offence_code
...
```
- 4️⃣ Indents should be four spaces.
- 📏 Lines of SQL should be no longer than 80 characters. This is excluding model names as they can often be longer than 80 characters themselves. It is helpful to add a vertical line to your IDE (R, VS code or jupyter notebooks. See [here](https://stackoverflow.com/questions/29968499/how-can-i-have-multiple-vertical-rulers-in-vs-code) for a guide) to mark where 80 characters is.
- ⬇️ Field names, keywords, and function names (`select`, `as`, `group by`, etc...) should all be lowercase.
- 🫧 The `as` keyword should be used explicitly when aliasing a field or table. e.g. `id as defendnat_id` not `id defendant_id`


## Fields, aggregations, and grouping

- 🔙 Fields should be stated before aggregates and window functions.
- 🤏🏻 Aggregations should be executed as early as possible (on the smallest data set possible) before joining to another table to improve performance.
- 🔢 Grouping by a number (eg. group by 1, 2) is preferred over listing the column names (see [this classic rant](https://www.getdbt.com/blog/write-better-sql-a-defense-of-group-by-1) for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design.
- 🔢 Column names should be written out explicitly with the column names in order statements to avoid ambiguity.
Example:
```SQL
select
defendant_on_case_id,
result_priority,
custodial_period_days,
monetary,
final_offence_ho_code_priority,
final_offence_ho_code,
defendant_on_offence_id,
disposal2_id,
row_number()
over (
partition by
defendant_on_case_id
order by
result_priority asc nulls last,
custodial_period_days desc nulls last,
monetary desc nulls last,
final_offence_ho_code_priority asc nulls last,
final_offence_ho_code asc nulls last,
defendant_on_offence_id asc nulls last,
disposal2_id asc nulls last
) as most_serious_disposal_rank,
from disposals
group by 1, 2, 3, 4, 5, 6, 7, 8
order by defendant_on_case_id
```

## Joins

- 👭🏻 Prefer `union all` to `union` unless you explicitly want to remove duplicates.
- 👭🏻 If joining two or more tables, _always_ prefix your column names with the table name where that column is coming from. If only selecting from one table, prefixes are not needed.
- 👭🏻 Be explicit about your join type (i.e. write `inner join` instead of `join`).
- 🥸 Avoid table aliases in join conditions (especially initialisms) — it's harder to understand what the table called "c" is as compared to "customers".
- ➡️ Always move left to right to make joins easy to reason about - `right joins` often indicate that you should change which table you select `from` and which one you `join` to.

## 'Import' CTEs

- 🔝 All `{{ ref('...') }}` statements should be placed in CTEs at the top of the file.
- 📦 'Import' CTEs should be named after the table they are referencing.
- 'Import' CTEs should be short and concise to make it easy to read what tables are being read in. Try not to have lots of fields in your select statement as it makes reading the CTEs harder, `select *` should be sufficient in most cases.
- You can use a `where` clasue to filter out undeeded data.
- For example:

```sql
with

requirments as (
select * from {{ ref("derived_delius_stg__base_rqmnt") }}
),

reference_1 as (
select * from {{ ref("derived_delius_stg__stg_standard_reference_1") }}
where code_set_name = 'REQUIREMENT SUB CATEGORY'
),

reference_2 as (
select * from {{ ref("derived_delius_stg__stg_standard_reference_2") }}
where code_set_name = 'ADDITIONAL REQUIREMENT SUB CATEGORY'
),

reference_3 as (
select * from {{ ref("derived_delius_stg__stg_standard_reference_3") }}
where code_set_name = 'REQUIREMENT TERMINATION REASON'
),

reference_4 as (
select * from {{ ref("derived_delius_stg__stg_standard_reference_4") }}
where code_set_name = 'UNITS'
),

joined as (

select
requirments.rqmnt_id,
requirments.disposal_id,
requirments.start_date,
requirments.length,
requirments.rqmnt_notes,
requirments.commencement_date,
requirments.termination_date,
requirments.partition_area_id,
requirments.expected_start_date,
requirments.soft_deleted,
requirments.expected_end_date,

-- For rqmnt main type sub-cat
reference_1.code_value as rqmnt_type_sub_category_code,
reference_1.code_description as rqmnt_type_sub_category_desc,

-- For rqmnt main type sub-cat
reference_2.code_value as ad_rqmnt_type_sub_category_code,
reference_2.code_description as ad_rqmnt_type_sub_category_desc,

-- For terminations
reference_3.code_value as rqmnt_termination_reason_code,
reference_3.code_description as rqmnt_termination_reason_desc,

-- For units, bring code_value and code_description from standard_reference_list_id
reference_4.code_value as rqmnt_type_main_units_code,
reference_4.code_description as rqmnt_type_main_units_desc

from requirments

left join reference_1
on requirments.rqmnt_type_sub_category_id = reference_1.standard_reference_list_id

left join reference_2
on requirments.ad_rqmnt_type_sub_category_id = reference_2.standard_reference_list_id

left join reference_3
on requirments.rqmnt_termination_reason_id = reference_3.standard_reference_list_id

left join reference_4
on requirments.units_id = reference_4.standard_reference_list_id
)

select * from joined
```

## 'Functional' CTEs

- ☝🏻 Where performance permits, CTEs should perform a single, logical unit of work.
- 👭🏻 Where possible joins should be done separately to calculations. This is to make the code neater, we expect the table each columns comes from to be referenced and therefore can make calculations overly complicated if a join is also being performed. (see example at the end of the section)
- 📖 CTE names should be as verbose as needed to convey what they do e.g. `events_joined_to_users` instead of `user_events` (this could be a good model name, but does not describe a specific function or transformation).
- 🌉 CTEs that are duplicated across models should be pulled out into their own intermediate models. Look out for chunks of repeated logic that should be refactored into their own model.
- 🔚 The last line of a model should be a `select *` from your final output CTE. This makes it easy to materialise and audit the output from different steps in the model as you're developing it. You just change the CTE referenced in the `select` statement to see the output from that step.

## Model configuration

- 📝 Model-specific attributes (like sort/dist keys) should be specified in the model.
- 📂 If a particular configuration applies to all models in a directory, it should be specified in the `dbt_project.yml` file.
- 👓 In-model configurations should be specified like this for maximum readability:

```sql
{{
config(
materialized = 'table',
sort = 'id',
dist = 'id'
)
}}
```

## Example SQL

```sql
with

events as (

...

),

{# CTE comments go here #}
filtered_events as (

...

)

select * from filtered_events
```

### Example SQL

```sql
with

my_data as (

select
field_1,
field_2,
field_3,
cancellation_date,
expiration_date,
start_date

from {{ ref('my_data') }}

),

some_cte as (

select
id,
field_4,
field_5

from {{ ref('some_cte') }}

),

some_cte_agg as (

select
id,
sum(field_4) as total_field_4,
max(field_5) as max_field_5

from some_cte

group by 1

),

joined as (

select
my_data.field_1,
my_data.field_2,
my_data.field_3,

-- use line breaks to visually separate calculations into blocks
case
when my_data.cancellation_date is null
and my_data.expiration_date is not null
then expiration_date
when my_data.cancellation_date is null
then my_data.start_date + 7
else my_data.cancellation_date
end as cancellation_date,

some_cte_agg.total_field_4,
some_cte_agg.max_field_5

from my_data

left join some_cte_agg
on my_data.id = some_cte_agg.id

where my_data.field_1 = 'abc' and
(
my_data.field_2 = 'def' or
my_data.field_2 = 'ghi'
)

having count(*) > 1

)

select * from joined
```
Loading