Re-evaluate adding compile task when using ExecutionMode.AIRFLOW_ASYNC #1477

Open
tatiana opened this issue Jan 21, 2025 · 0 comments
Labels: area:execution, dbt:compile, execution:virtualenv
Milestone: Cosmos 1.9.0

tatiana commented Jan 21, 2025

In Cosmos 1.7.0, we introduced experimental support for ExecutionMode.AIRFLOW_ASYNC, as discussed in the announcement article and documentation page.

A fundamental characteristic of the approach implemented in 1.7.0 is that we pre-compute the SQL in a single initial setup task, called dbt_compile, so that the remaining tasks only need to run the SQL statements, as illustrated below:

[Image: DAG topology with the dbt_compile setup task preceding the run tasks]
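
For reference, a minimal sketch of what enabling the experimental mode looks like from the user's side (the project path, profile details and DAG settings below are placeholders, not taken from this issue; the mode also relies on the remote target path settings mentioned as "additional configuration" further down):

```python
# Minimal sketch of a DbtDag using the experimental ExecutionMode.AIRFLOW_ASYNC.
# Paths, profile details and scheduling are placeholders.
from datetime import datetime

from cosmos import DbtDag, ExecutionConfig, ProfileConfig, ProjectConfig
from cosmos.constants import ExecutionMode

my_async_dag = DbtDag(
    dag_id="my_async_dbt_dag",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="my_profile",
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dags/dbt/my_project/profiles.yml",
    ),
    # In 1.7.0 this mode renders a leading dbt_compile task followed by run
    # tasks that execute the pre-compiled SQL asynchronously.
    execution_config=ExecutionConfig(execution_mode=ExecutionMode.AIRFLOW_ASYNC),
)
```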

This approach had some problems:

  1. The SQL run by Cosmos was incorrect, as discussed in [bug] Fix ExecutionMode.AIRFLOW_ASYNC query #1260
  2. It required pre-compiling and uploading SQL statements to remote storage (additional configuration and possible latency)
  3. The compiled SQL files were not cleared from the remote storage afterwards

As part of fixing #1260 with a monkey patch in #1474, we noticed that the dbt_compile step did not have to happen beforehand, since we could monkey-patch each run statement. This led to a refactor that removes dbt_compile and runs the monkey-patched dbt command with dbtRunner in each task. While this is cleaner from a DAG topology perspective, it is unclear which approach is best moving forward, since running the patched dbt version in every run task requires the following (a rough sketch of this per-task pattern follows the list):

a) dbt and Airflow to be installed in the same Python environment on every worker node
b) accepting a possible memory/CPU overhead from running dbtRunner in every task
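
To make requirement (a) concrete, the per-task pattern essentially imports dbt inside the Airflow worker and drives it programmatically, roughly like this (the patch target and wrapper below are illustrative placeholders, not the actual patch Cosmos applies in #1474):

```python
# Illustrative sketch only: why dbt and Airflow must share one Python
# environment when every run task drives dbt programmatically.
# The patch target and wrapper are placeholders, not the real Cosmos patch.
from unittest import mock

from dbt.adapters.base import BaseAdapter   # requires dbt in the worker env
from dbt.cli.main import dbtRunner

_original_execute = BaseAdapter.execute


def _wrapped_execute(self, sql, *args, **kwargs):
    # Placeholder: the real patch intercepts/adjusts the SQL here (see #1260/#1474).
    return _original_execute(self, sql, *args, **kwargs)


def run_single_model(project_dir: str, profiles_dir: str, model: str) -> None:
    """Invoke dbt for one model from inside an Airflow task, with a patch applied."""
    with mock.patch.object(BaseAdapter, "execute", _wrapped_execute):
        result = dbtRunner().invoke(
            ["run", "--select", model,
             "--project-dir", project_dir, "--profiles-dir", profiles_dir]
        )
    if not result.success:
        raise RuntimeError(f"dbt run failed for {model}")
```

Every worker executing such a task needs dbt-core (plus the relevant adapter) importable alongside Airflow and pays the cost of loading the project in-process, which is exactly points (a) and (b).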

An alternative approach we could consider (sketched after the list below) would be to:

  • Re-introduce the dbt_compile task, but investigate the possibility of running it with Cosmos ExecutionMode.VIRTUALENV
  • Upload the compiled SQL to a remote object store
  • Drop the dependency on running dbtRunner in every run task
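
A rough sketch of what such a virtualenv-based dbt_compile setup task could do, assuming Airflow's ObjectStoragePath for the upload (bucket URL, connection ID, adapter package and paths are placeholders, and this is not the Cosmos implementation):

```python
# Rough sketch of a virtualenv-style dbt_compile setup task: compile the
# project in an isolated environment, then upload the compiled SQL to a
# remote object store for the run tasks to fetch. All names are placeholders.
import subprocess
import sys
import tempfile
from pathlib import Path

from airflow.io.path import ObjectStoragePath  # Airflow 2.8+


def compile_and_upload(project_dir: str, profiles_dir: str) -> None:
    with tempfile.TemporaryDirectory() as venv_dir:
        # Isolated environment, so dbt does not need to live alongside
        # Airflow's own dependencies on the worker.
        subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)
        pip = str(Path(venv_dir) / "bin" / "pip")
        dbt = str(Path(venv_dir) / "bin" / "dbt")
        subprocess.run([pip, "install", "dbt-core", "dbt-bigquery"], check=True)
        subprocess.run(
            [dbt, "compile", "--project-dir", project_dir,
             "--profiles-dir", profiles_dir],
            check=True,
        )

    # Upload the compiled SQL so the run tasks no longer need dbt at all.
    remote_root = ObjectStoragePath("s3://my-bucket/dbt/compiled/", conn_id="aws_default")
    compiled_dir = Path(project_dir) / "target" / "compiled"
    for sql_file in compiled_dir.rglob("*.sql"):
        remote_file = remote_root / sql_file.relative_to(compiled_dir).as_posix()
        remote_file.write_bytes(sql_file.read_bytes())
```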

The advantages of this approach would be:

  • dbt and Airflow would potentially no longer have to be installed in the same Python environment (some changes to how we monkey-patch may be needed)
  • Most worker nodes would not have to run dbt commands
  • Memory and CPU usage per worker node should go down, since the run tasks would only execute the transformation SQL

The downsides would be re-introducing the extra setup task and the remote storage upload and cleanup concerns listed as problems 2 and 3 above.

Ideally, we'd compare these two approaches with real dbt projects and evaluate the numbers before making a decision.
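
One way to get those numbers would be to build the same project as two DAGs, one per approach, and compare task durations from the Airflow metadata database (DAG ids below are placeholders; memory/CPU figures would have to come from worker-level metrics instead):

```python
# Quick comparison of task durations between two DAGs built with the two
# approaches. DAG ids are placeholders; this does not capture memory/CPU.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from sqlalchemy import func

with create_session() as session:
    rows = (
        session.query(
            TaskInstance.dag_id,
            func.count(TaskInstance.task_id),
            func.avg(TaskInstance.duration),
            func.sum(TaskInstance.duration),
        )
        .filter(TaskInstance.dag_id.in_(["async_with_compile", "async_per_task_runner"]))
        .group_by(TaskInstance.dag_id)
        .all()
    )

for dag_id, n_runs, avg_s, total_s in rows:
    print(f"{dag_id}: {n_runs} task runs, avg {avg_s or 0:.1f}s, total {total_s or 0:.1f}s")
```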

@tatiana tatiana added this to the Cosmos 1.9.0 milestone Jan 21, 2025
@dosubot dosubot bot added area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc dbt:compile Primarily related to dbt compile command or functionality execution:virtualenv Related to Virtualenv execution environment labels Jan 21, 2025
@pankajkoti pankajkoti self-assigned this Jan 22, 2025