Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SNOW-90] Introduce CI job to test schemachange updates against a clone #78

Open
wants to merge 87 commits into
base: dev
Choose a base branch
from

Conversation

jaymedina
Copy link
Contributor

@jaymedina jaymedina commented Oct 23, 2024

problem

Rather than have it be a manual process, automating the testing of schemachange updates in our branch before merging it into dev will speed up the development life cycle. A design document was made to tackle this problem and is available here.

solution

Untitled diagram-2024-10-29-144419

  • Introduce a new branch label, test_with_clone, to trigger the automated process of zero-copy cloning synapse_data_warehouse_dev and testing new/modified schemachange scripts against it
  • Introduce new CI jobs that reflect the design outlined in the design document
  • Update contribution guidelines to explain these new CI jobs and the corresponding PR labels

testing

  • Test that label create_clone_and_run_schemachange triggers the CI job to run
  • Test that a clone is made of the dev db and not the prod db
  • Test that a clone is named after the branch
  • Test that unlabeling and relabeling the branch just replaces the existing clone, and does not make a new one
  • Test that merging the PR triggers drop_clone to run
  • Test that merging the PR DOES NOT trigger create_clone_and_run_schemachange to run
  • Test that labeling the merged PR with create_clone_and_run_schemachange does not trigger the CI job to run

@jaymedina jaymedina added create_clone_and_run_schemachange Create a DB clone and run schemachange against it and removed create_clone_and_run_schemachange Create a DB clone and run schemachange against it labels Oct 23, 2024
@jaymedina jaymedina added the create_clone_and_run_schemachange Create a DB clone and run schemachange against it label Oct 23, 2024
@jaymedina jaymedina added create_clone_and_run_schemachange Create a DB clone and run schemachange against it and removed create_clone_and_run_schemachange Create a DB clone and run schemachange against it labels Oct 23, 2024
@jaymedina jaymedina added create_clone_and_run_schemachange Create a DB clone and run schemachange against it and removed create_clone_and_run_schemachange Create a DB clone and run schemachange against it labels Oct 23, 2024
@jaymedina jaymedina added the create_clone_and_run_schemachange Create a DB clone and run schemachange against it label Oct 23, 2024
@jaymedina jaymedina added create_clone_and_run_schemachange Create a DB clone and run schemachange against it and removed create_clone_and_run_schemachange Create a DB clone and run schemachange against it labels Oct 23, 2024
@jaymedina jaymedina added create_clone_and_run_schemachange Create a DB clone and run schemachange against it and removed create_clone_and_run_schemachange Create a DB clone and run schemachange against it labels Oct 23, 2024
@jaymedina jaymedina added create_clone_and_run_schemachange Create a DB clone and run schemachange against it and removed create_clone_and_run_schemachange Create a DB clone and run schemachange against it labels Oct 23, 2024
@jaymedina jaymedina added create_clone_and_run_schemachange Create a DB clone and run schemachange against it and removed create_clone_and_run_schemachange Create a DB clone and run schemachange against it labels Oct 23, 2024
@jaymedina jaymedina added create_clone_and_run_schemachange Create a DB clone and run schemachange against it and removed create_clone_and_run_schemachange Create a DB clone and run schemachange against it labels Oct 23, 2024
@thomasyu888
Copy link
Member

My main concern with the proposed workaround is that DATA_ENGINEER is quickly becoming a behemoth with no clear delineation between itself and SYSADMIN

@philerooski / @jaymedina , I think there should be another ticket to fully determine whether the DATA_ENGINEER role should just be the OWNER of synapse_data_warehouse and synapse_data_warehouse_dev. This way, the data_engineer role will be able to do all operations if needs to within those databases.

Similarly, since the DATA_ENGINEER role is rolled up into SYSADMIN, SYSADMIN should be able to do everything the DATA_ENGINEER role can do. Note: I'm not proposing DATA_ENGINEER have the same level of permissions as SYSADMIN, but what I'm saying is the DATA_ENGINEER role could have all the privileges the SYSADMIN role currently has WITHIN those two databases and clones

^ That said, I'm not tied to that, and I'll let Phil decide when we get to that design ticket.

snow sql -q "GRANT OWNERSHIP ON ALL TABLES IN SCHEMA ${SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE}.SYNAPSE TO ROLE DATA_ENGINEER REVOKE CURRENT GRANTS;"

# Transfer ownership of: dynamic tables
snow sql -q "GRANT OWNERSHIP ON ALL DYNAMIC TABLES IN SCHEMA ${SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE}.SYNAPSE TO ROLE DATA_ENGINEER REVOKE CURRENT GRANTS;"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to COPY CURRENT GRANTS rather than REVOKE so that we can imitate the source object as closely as possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I would say no, mainly because the only role that should be interacting with the clone and its objects is DATA_ENGINEER for the purpose of testing and validation anyway. Also because COPY CURRENT GRANTS would copy over the DATA_ENGINEERs previous grants before it became owner, so it would be redundant.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering how we might test changes to privileges. If we've already revoked current grants on an object, then how can we check that whatever changes we've made to privileges granted on an object reflect what we are expecting?

For example:

object_dev

  • (ownership privilege)
  • privilege a
  • privilege b
  • privilege c

object_clone (before schemachange)

  • (ownership privilege)

object_clone (after schemachange)

  • (ownership privilege)
  • privilege d

Assuming that schemachange applied some arbitrary change set to the privileges (like adding privilege d, revoking some other privileges), how can we test grants on object_dev + our change set = grants we expect on this object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point - If we plan to add tests for updates in object privileges, then I agree we should not revoke any grants. I'll do a test to make sure COPY CURRENT GRANTS doesn't override DATA_ENGINEERs new ownership, and make a commit for this. Thanks!

@jaymedina jaymedina added create_clone_and_run_schemachange Create a DB clone and run schemachange against it and removed create_clone_and_run_schemachange Create a DB clone and run schemachange against it labels Nov 8, 2024
@jaymedina
Copy link
Contributor Author

Something to note:

The clone is built off the latest dev DB because this DB has schema changes that prod may not, HOWEVER the newestsnapshots data flows directly into prod (see comment here). My thinking is the data should ingest into dev before prod to make sure that the dynamic table queries (for example) in dev get updated and reflect the data as expected.

@thomasyu888
Copy link
Member

Something to note:

The clone is built off the latest dev DB because this DB has schema changes that prod may not, HOWEVER the newestsnapshots data flows directly into prod (see comment here). My thinking is the data should ingest into dev before prod to make sure that the dynamic table queries (for example) in dev get updated and reflect the data as expected.

Just to add the dev database links to the dev deployment of Synapse, so the data that is hosted within the dev database contains all actions that occur within the dev or maybe staging deployment of synapse, whereas prod links to the production synapse stack.

@jaymedina
Copy link
Contributor Author

From the conversation here, we can clone off of the staging DB (when it's created) instead of the dev DB

@thomasyu888
Copy link
Member

From the conversation here, we can clone off of the staging DB (when it's created) instead of the dev DB

@jaymedina let's hold off on making changes until we can get all get our bearings on this. Ideally the dev snapshots provide enough information for us to do our transforms, but in the incidences that it doesn't, you're right we do need to leverage staging.

The point of cloning DEV is so that the queries run faster due to it having less data (but similar structure) as the prod data

Comment on lines +122 to +132
drop_clone:
runs-on: ubuntu-latest
if: github.event.pull_request.merged == true || github.event.action == 'closed'
environment: dev
env:
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWSQL_PWD }}
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWSQL_ACCOUNT }}
SNOWFLAKE_USER: ${{ secrets.SNOWSQL_USER }}
SNOWFLAKE_WAREHOUSE: ${{ secrets.SNOWSQL_WAREHOUSE }}
SNOWFLAKE_CLONE_ROLE: DATA_ENGINEER
CLONE_NAME: "${{ vars.SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE }}_${{ github.head_ref }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am understanding this will only drop the clone when the PR is merged or closed. However, a cloned DB is created on any push when the PR is labeled and open.

Would this mean that only the last github.head_ref is cleaned up? In the end we would want to cleanup both the head_ref when things are merged, AND, when a new push is added (and head_ref is updated) we'd also want to drop the previous cloned DB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this mean that only the last github.head_ref is cleaned up?

Good question - Yes, this means that the last github.head_ref is cleaned up. However, this is not a concern because it's the only cloned DB on snowflake, for a given feature branch. All predecessor clones just get replaced by a new clone.

So for example if head_ref is snow-90-my-branch, the procedure goes like this:

  1. an initial commit is made + job is triggered
  2. new clone is made named: synapse_data_warehouse_dev_snow_90_my_branch
  3. schemachange runs
  4. another commit is made
  5. old clone is replaced by the new clone of synapse_data_warehouse_dev with a CREATE OR REPLACE. the new clone has the same name as the last clone, so it does a REPLACE
  6. schemachange runs
  7. rinse and repeat
  8. when it's time to merge, the clone is dropped (if it still exists). the drop clone job assumes the name hasn't been changed since its initial create, so this is a potential edge case actually, but it's probably enough to just say "don't change the clone name" in the CONTRIBUTIONS doc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
create_clone_and_run_schemachange Create a DB clone and run schemachange against it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants