container: fix resourceManagerTags tests #12728

Open: wyardley wants to merge 10 commits into main from wyardley/fix/tag_resource_manager_test

wyardley
Contributor

@wyardley wyardley commented Jan 10, 2025

Rework of #12376 by @MaChenhao

This fixes some tests that may have started failing after my #12014.

Turned out to be a bigger can of worms than I expected, and is now updated to include autopilot and nodepool variants as well.

Fixes hashicorp/terraform-provider-google#19997
Fixes hashicorp/terraform-provider-google#20252
Closes #12376

Release Note Template for Downstream PRs (will be copied)


@github-actions github-actions bot requested a review from ScottSuarez January 10, 2025 09:02

Hello! I am a robot. Tests will require approval from a repository maintainer to run.

@ScottSuarez, a repository maintainer, has been assigned to review your changes. If you have not received review feedback within 2 business days, please leave a comment on this PR asking them to take a look.

You can help make sure that review is quick by doing a self-review and by running impacted tests locally.

@modular-magician modular-magician added the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Jan 10, 2025
@modular-magician modular-magician added service/container and removed awaiting-approval Pull requests that need reviewer's approval to run presubmit tests labels Jan 14, 2025
@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 1 file changed, 7 insertions(+), 45 deletions(-))
google-beta provider: Diff ( 1 file changed, 7 insertions(+), 45 deletions(-))

@modular-magician
Collaborator

Tests analytics

Total tests: 219
Passed tests: 205
Skipped tests: 12
Affected tests: 2

Affected service packages:
  • container

Action taken

Found 2 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Affected tests:
  • TestAccContainerCluster_resourceManagerTags
  • TestAccContainerCluster_withAutopilotResourceManagerTags


@modular-magician
Collaborator

🟢 Tests passed during RECORDING mode:
TestAccContainerCluster_resourceManagerTags [Debug log]

🟢 No issues found for passed tests after REPLAYING rerun.


🔴 Tests failed during RECORDING mode:
TestAccContainerCluster_withAutopilotResourceManagerTags [Error message] [Debug log]

🔴 Errors occurred during RECORDING mode. Please fix them to complete your PR.


@wyardley
Contributor Author

wyardley commented Jan 14, 2025

@ScottSuarez thanks for running these. Can you see what the error is - if it just needs a re-run after bootstrapping (though I'd have thought they'd already be in place from the other runs), or if you see a different error? The autopilot test worked for me, but I'll try again as well.

@ScottSuarez
Contributor

sure !

2025/01/14 18:50:48 [DEBUG] [transport] [server-transport 0xc000850340] loopyWriter exiting with error: transport closed by client 
    resource_container_cluster_test.go:3608: Step 3/6 error: Error running apply: exit status 1
        
        Error: Error waiting for updating GKE cluster node pool auto config resource manager tags: Google Compute Engine: Required 'resourcemanager.tagValueBindings.create' permission for 'tagValues/281481010616245'.
        
          with google_container_cluster.with_autopilot,
          on terraform_plugin_test.tf line 94, in resource "google_container_cluster" "with_autopilot":
          94: resource "google_container_cluster" "with_autopilot" {
        

@wyardley wyardley force-pushed the wyardley/fix/tag_resource_manager_test branch from e1022bf to 447b0c4 Compare January 15, 2025 07:10
@wyardley
Contributor Author

@ScottSuarez I was able to reproduce a similar failure locally on another run. I think the first pass maybe didn't go far enough, so made some more adjustments.

I don't think we can use acctest.BootstrapPSARole() for serviceAccount:${data.google_project.project.number}@cloudservices.gserviceaccount.com, at least not without changes to that function (for one thing, there is no iam on the right-hand side of that email). I do think we can use it to grant roles/resourcemanager.tagUser to the container-engine-robot service account, though. If you've got ideas or suggestions about the other one, just let me know.

I rebased, made some adjustments, and will push up the fix if that seems to help, but you may want to run a couple of times in CI, even if it works. There's an outside chance that it could still be a little flaky, but hopefully this helps.

I ran into some issues when running both tests at the same time, but I assume / hope that won't be an issue with the way these run in the actual test suite (is there a chance that the two tests run in parallel against the same project at any point)?

The autopilot test is a pain to work with because it takes a while and is kind of finicky. With the updated code, I'm getting these failures semi-consistently on the autopilot one (on the same step, which I think is the 2nd step, based on the requested tags and since it's 3/6 and there are two runs per step):

    resource_container_cluster_test.go:3481: Step 3/6 error: Error running apply: exit status 1
        
        Error: Error waiting for updating GKE cluster node pool auto config resource manager tags: Google Compute Engine: The instance's current status does not support update.
        
          with google_container_cluster.with_autopilot,
          on terraform_plugin_test.tf line 87, in resource "google_container_cluster" "with_autopilot":
          87: resource "google_container_cluster" "with_autopilot" {

Under the hood, this is what's happening (heavily snipped, but I think it gives the flavor):

PUT /v1/projects/xxx/locations/us-central1/clusters/tf-test-cluster-lmdz1yz3cz?alt=json&prettyPrint=false HTTP/1.1
[...]
{
 "update": {
  "desiredNodePoolAutoConfigResourceManagerTags": {
   "tags": {
    "tagKeys/281476493792018": "tagValues/281483533413074",
    "tagKeys/281484657323346": "tagValues/281482241168330"
   }
  }
 }
}
---[ RESPONSE ]--------------------------------------
HTTP/2.0 200 OK
[...]
{
 "name": "operation-1736919116983-c37bdb21-ed20-454d-b575-25a6e3e4a896",
 "zone": "us-central1",
 "operationType": "UPDATE_CLUSTER",
 "status": "RUNNING",
 "selfLink": "https://container.googleapis.com/v1/projects/xx/locations/us-central1/operations/operation-1736919116983-c37bdb21-ed20-454d-b575-25a6e3e4a896",
 "targetLink": "https://container.googleapis.com/v1/projects/xx/locations/us-central1/clusters/tf-test-cluster-lmdz1yz3cz",
 "detail": "Updating default-pool, done with 0 out of 1 nodes (0.0%): 1 being processed",
[...]
GET /v1/projects/xxx/locations/us-central1/operations/operation-1736919116983-c37bdb21-ed20-454d-b575-25a6e3e4a896?alt=json&prettyPrint=false HTTP/1.1
---[ RESPONSE ]--------------------------------------
HTTP/2.0 200 OK
{
 "name": "operation-1736919116983-c37bdb21-ed20-454d-b575-25a6e3e4a896",
 "zone": "us-central1",
 "operationType": "UPDATE_CLUSTER",
 "status": "DONE",
 "statusMessage": "Google Compute Engine: The instance's current status does not support update.",
 "selfLink": "https://container.googleapis.com/v1/projects/xx/locations/us-central1/operations/operation-1736919116983-c37bdb21-ed20-454d-b575-25a6e3e4a896",
 "targetLink": "https://container.googleapis.com/v1/projects/xx/locations/us-central1/clusters/tf-test-cluster-lmdz1yz3cz",
 "detail": "Google Compute Engine: The instance's current status does not support update.",
[...]
 },
 "clusterConditions": [
  {
   "message": "Google Compute Engine: The instance's current status does not support update.",
   "canonicalCode": "INVALID_ARGUMENT"
  }
 ],
 "error": {
  "code": 3,
  "message": "Google Compute Engine: The instance's current status does not support update."
 }
}

If I'm reading this right, the cluster thinks it's in a ready state, but the underlying GCE instances aren't actually ready to take the update yet? Not sure if this is expected behavior and/or if an issuetracker bug should be filed about the cluster reporting ready when it's not (I didn't find a public one, nor did I find any search results for this exact error message).
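To make the shape of that failure concrete, here is a minimal sketch of how a poller could read the final operation response. It only models the fields visible in the snipped output above; checkOperation and looksLikeInstanceNotReady are hypothetical names for illustration, not provider functions, and matching on the message text is an assumption about how one might spot this specific condition.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// operation mirrors only the fields visible in the snipped response above.
type operation struct {
	Name   string `json:"name"`
	Status string `json:"status"`
	Error  *struct {
		Code    int    `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// checkOperation reports whether the operation has finished and, if it
// finished with an error payload, returns that error. A DONE status alone
// does not mean success.
func checkOperation(raw []byte) (bool, error) {
	var op operation
	if err := json.Unmarshal(raw, &op); err != nil {
		return false, fmt.Errorf("decoding operation: %w", err)
	}
	if op.Status != "DONE" {
		return false, nil
	}
	if op.Error != nil {
		return true, fmt.Errorf("operation %s failed (code %d): %s", op.Name, op.Error.Code, op.Error.Message)
	}
	return true, nil
}

// looksLikeInstanceNotReady is a hypothetical string match on the message
// seen in this thread; it is an assumption, not a documented error contract.
func looksLikeInstanceNotReady(err error) bool {
	return err != nil && strings.Contains(err.Error(), "current status does not support update")
}

// sampleDone is a trimmed, made-up stand-in for the DONE response above.
const sampleDone = `{"name":"operation-1","status":"DONE","error":{"code":3,"message":"Google Compute Engine: The instance's current status does not support update."}}`

func main() {
	done, err := checkOperation([]byte(sampleDone))
	fmt.Println("done:", done)
	fmt.Println("instance not ready:", looksLikeInstanceNotReady(err))
}
```

The point of the sketch is only that the operation reaches DONE and still carries code 3 (INVALID_ARGUMENT), which is why the apply fails rather than waiting longer.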

The 120s sleep in the Terraform code won't apply after the first step, because it only applies to create actions. The next thing I'm trying is adding a sleep within step 2 of the test itself, though this will slow down test execution further, even in replaying mode. (Side note: I see VCR-enabled tests that use acctest.SleepInSecondsForTest(). Maybe there should be a condition that skips the sleep when VCR is enabled? I created an issue to ask about that.)
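For illustration, the "skip the sleep under VCR" idea could look roughly like the sketch below. It is standalone Go and does not use the real acctest package; sleepUnlessReplaying and shouldSleep are hypothetical names, and reading the mode from a VCR_MODE environment variable with the value REPLAYING is an assumption about how the mode is exposed.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// shouldSleep reports whether a real delay is needed for the given VCR mode.
// Assumption: replayed runs are marked with the mode string "REPLAYING".
func shouldSleep(vcrMode string) bool {
	return vcrMode != "REPLAYING"
}

// sleepUnlessReplaying is a hypothetical helper, not the existing
// acctest.SleepInSecondsForTest. It waits only when responses come from
// the live API, since recorded responses do not need time to settle.
func sleepUnlessReplaying(d time.Duration) {
	// Assumption: the mode is read from a VCR_MODE environment variable.
	if shouldSleep(os.Getenv("VCR_MODE")) {
		time.Sleep(d)
	}
}

func main() {
	start := time.Now()
	sleepUnlessReplaying(10 * time.Millisecond)
	fmt.Println("elapsed:", time.Since(start).Round(time.Millisecond))
}
```

Keeping the mode check in a small pure function makes the behavior easy to test without actually sleeping.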

@modular-magician modular-magician added the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Jan 15, 2025
@wyardley
Contributor Author

Another possibility for the earlier CI failure is, as mentioned above, that the two tests were both recording in parallel and the grant for [project-number]@cloudservices.gserviceaccount.com got deleted. The good news is that, if this is the case, the next run should succeed as long as the other test is already passing and thus running in replaying mode. The underlying race would still technically exist, though. To fix that, I think acctest.BootstrapPSARole() would have to be adjusted or extended (e.g., to allow suppressing the .iam in the domain), and/or a similar mechanism made available for bootstrapping that SA.

@@ -3635,6 +3643,9 @@ func TestAccContainerCluster_withAutopilotResourceManagerTags(t *testing.T) {
			{
				Config: testAccContainerCluster_withAutopilotResourceManagerTagsUpdate1(pid, clusterName, clusterNetName, clusterSubnetName, randomSuffix),
				Check: resource.ComposeTestCheckFunc(
					// Small sleep first, to avoid condition where cluster is ready but underlying GCE
					// resources apparently aren't.
					acctest.SleepInSecondsForTest(30),
Contributor Author


Other possibility is that I ran into some sort of shorter term problem while I was testing this and that it's not actually necessary. But this did reliably seem to get rid of the issue I mentioned in the PR comments.

See notes and linked issue about whether this is really needed in the case of VCR, though.


@ScottSuarez This PR has been waiting for review for 3 weekdays. Please take a look! Use the label disable-review-reminders to disable these notifications.

@modular-magician modular-magician removed the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Jan 15, 2025
@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 1 file changed, 20 insertions(+), 81 deletions(-))
google-beta provider: Diff ( 1 file changed, 20 insertions(+), 81 deletions(-))

@modular-magician
Collaborator

Tests analytics

Total tests: 219
Passed tests: 205
Skipped tests: 12
Affected tests: 2

Affected service packages:
  • container

Action taken

Found 2 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Affected tests:
  • TestAccContainerCluster_resourceManagerTags
  • TestAccContainerCluster_withAutopilotResourceManagerTags


@modular-magician
Collaborator

🟢 Tests passed during RECORDING mode:
TestAccContainerCluster_resourceManagerTags [Debug log]

🟢 No issues found for passed tests after REPLAYING rerun.


🔴 Tests failed during RECORDING mode:
TestAccContainerCluster_withAutopilotResourceManagerTags [Error message] [Debug log]

🔴 Errors occurred during RECORDING mode. Please fix them to complete your PR.


@wyardley
Contributor Author

@ScottSuarez so, assuming the error is the same as before, I think the problem may be the two parallel tests trying to manage the same permissions for [project-number]@cloudservices.gserviceaccount.com (with the permission getting deleted by one test while the other one is still running). Does that seem likely to you, and if so, do you or anyone on your team have any thoughts about the best way to approach this? If you give me some direction on whether and how to adjust acctest.BootstrapPSARole(), I could take that on (as a separate PR, or as part of this one), but I'm not sure that's the right way to go.

Separately, we could decide whether to keep or get rid of that extra sleep.

@ScottSuarez
Contributor

   resource_container_cluster_test.go:3610: Step 3/6 error: After applying this test step and performing a `terraform refresh`, the plan was not empty.
       stdout
       
       
       Terraform used the selected providers to generate the following execution
       plan. Resource actions are indicated with the following symbols:
         + create
       
       Terraform will perform the following actions:
       
         # google_project_iam_member.tag_user will be created
         + resource "google_project_iam_member" "tag_user" {
             + etag    = (known after apply)
             + id      = (known after apply)
             + member  = "serviceAccount:[email protected]"
             + project = "ci-test-project-yyyy"
             + role    = "roles/resourcemanager.tagUser"
           }
       
       Plan: 1 to add, 0 to change, 0 to destroy.

@ScottSuarez
Contributor

@ScottSuarez so, assuming the error is the same as before, I think the problem may be the two parallel tests trying to manage the same permissions for [project-number]@cloudservices.gserviceaccount.com (with the permission getting deleted by one test while the other one is still running). Does that seem likely to you, and if so, do you or anyone on your team have any thoughts about the best way to approach this? If you give me some direction on whether and how to adjust acctest.BootstrapPSARole(), I could take that on (as a separate PR, or as part of this one), but I'm not sure that's the right way to go.

Separately, we could decide whether to keep or get rid of that extra sleep.

Yeah, looks like that might be the case. The terraform refresh seems to suggest that we're modifying a singleton. I would say we should maintain separate values for these two tests if we can. I haven't looked into it in as much detail as you have, so I'm not sure whether that would be possible.
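As a rough illustration of why a shared, add-only bootstrap avoids the singleton problem, here is a toy in-memory model. It does not call any IAM API; projectPolicy and ensureMember are hypothetical names, and the project number in the member string is made up. The idea is that each test only ever ensures the binding exists and never removes it, so one test finishing cannot pull the grant out from under another.

```go
package main

import (
	"fmt"
	"sync"
)

// projectPolicy is an in-memory stand-in for a project's IAM policy.
type projectPolicy struct {
	mu       sync.Mutex
	bindings map[string]map[string]bool // role -> set of members
}

func newProjectPolicy() *projectPolicy {
	return &projectPolicy{bindings: make(map[string]map[string]bool)}
}

// ensureMember adds the member to the role if it is missing and never
// removes anything. It returns true only when the policy actually changed.
func (p *projectPolicy) ensureMember(role, member string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.bindings[role] == nil {
		p.bindings[role] = make(map[string]bool)
	}
	if p.bindings[role][member] {
		return false
	}
	p.bindings[role][member] = true
	return true
}

// hasMember reports whether the member currently holds the role.
func (p *projectPolicy) hasMember(role, member string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.bindings[role][member]
}

func main() {
	policy := newProjectPolicy()
	const role = "roles/resourcemanager.tagUser"
	// Made-up project number, for illustration only.
	const member = "serviceAccount:123456@cloudservices.gserviceaccount.com"

	// Two "tests" bootstrap the same binding concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			policy.ensureMember(role, member)
		}()
	}
	wg.Wait()
	fmt.Println("binding present:", policy.hasMember(role, member))
}
```

By contrast, a google_project_iam_member managed inside each test config is created and destroyed with that test, which is where the two runs can step on each other.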

@ScottSuarez
Contributor

Is there a reason we still maintain a google_project_iam_member in the test in question? Couldn't we also bootstrap that permission?

@wyardley
Contributor Author

wyardley commented Jan 15, 2025

Is there a reason we still maintain a google_project_iam_member in the test in question? Couldn't we also bootstrap that permission?

@ScottSuarez what I was getting at above is that .iam is baked in to the email domain in BootstrapAllPSARoles() and related functions currently:

members[i] = fmt.Sprintf("serviceAccount:%s%d@%s.iam.gserviceaccount.com", prefix, project.ProjectNumber, agentName)

So from what I can see, I don't believe it's currently possible to use BootstrapPSARoles() or BootstrapPSARole() to bootstrap permissions on [project-number]@cloudservices.gserviceaccount.com without code changes in those functions (because there's no .iam after the service hostname), and from a quick look I didn't see any other functions that seem like they'd work. If you or your team can give an example of how to bootstrap that account, or create a PR to support that, I'd be happy to implement it. Or, if there are clear instructions on how it should be implemented, I could put up a PR for that separately.

I think with Go not having default function parameters, and with the nesting of those functions, it could get a little messy. Maybe something like: if it's nil, use iam., and if it's "" suppress iam. entirely? Or a new, separate wrapper function that still calls BootstrapAllPSARoles(), with some modifications there to allow it to work?

I can switch the existing set of calls for 3 separate roles from BootstrapPSARole() to BootstrapPSARoles(), which at least tidies up the existing bootstrap logic.

I can also try taking that one permission out entirely and see if either / both tests work without it. I'm not sure of the exact use but I assume it's actually necessary in both cases.
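The nil-versus-empty-string idea above could be sketched roughly as follows. This is standalone Go, not the actual bootstrap code: memberFor is a hypothetical helper, the subdomain parameter is the assumed extension point, and the "service-" prefix and project number in the usage are illustrative values. Only the format string is modeled on the Sprintf line quoted earlier.

```go
package main

import "fmt"

// memberFor builds an IAM member string for a service account.
// subdomain is the hypothetical extension point: nil keeps the current
// "iam." behavior, and a pointer to "" suppresses it entirely.
func memberFor(prefix string, projectNumber int64, agentName string, subdomain *string) string {
	sub := "iam."
	if subdomain != nil {
		sub = *subdomain
	}
	return fmt.Sprintf("serviceAccount:%s%d@%s.%sgserviceaccount.com", prefix, projectNumber, agentName, sub)
}

func main() {
	none := ""
	// Default behavior: ".iam" stays in the domain.
	fmt.Println(memberFor("service-", 123456, "container-engine-robot", nil))
	// Suppressed subdomain, for accounts like the cloudservices one.
	fmt.Println(memberFor("", 123456, "cloudservices", &none))
}
```

A pointer parameter is one way to distinguish "not specified" from "explicitly empty" in Go; a separate wrapper function that passes the domain through would avoid changing existing call sites.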

@wyardley wyardley force-pushed the wyardley/fix/tag_resource_manager_test branch from 447b0c4 to f760199 Compare January 16, 2025 00:51
@modular-magician modular-magician added the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Jan 16, 2025
@ScottSuarez
Contributor

@ScottSuarez what I was getting at above is that .iam is baked in to the email domain in BootstrapAllPSARoles() and related functions currently:

Ah this makes sense. I think it would be reasonable to make the IAM function more general here and allow modification of the email. Just expose new wrapper functions.

@modular-magician modular-magician removed the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Jan 16, 2025
@wyardley wyardley force-pushed the wyardley/fix/tag_resource_manager_test branch from 9a24425 to 087a798 Compare January 16, 2025 23:04
@modular-magician modular-magician removed the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Jan 16, 2025
@wyardley wyardley force-pushed the wyardley/fix/tag_resource_manager_test branch from 087a798 to 9a24425 Compare January 16, 2025 23:05
@modular-magician modular-magician added awaiting-approval Pull requests that need reviewer's approval to run presubmit tests and removed awaiting-approval Pull requests that need reviewer's approval to run presubmit tests labels Jan 16, 2025
@wyardley
Contributor Author

wyardley commented Jan 16, 2025

Ok, I have a pretty simple implementation that is hacky, but works (tested in replaying mode, and testing running both concurrently in recording mode now locally), and that I think is somewhat flexible while not having too many changes.

The other big win here is that this massively improves speed in replaying mode, since the N x 120s sleeps (from the Terraform code that was removed) are no longer present.

I pushed it up in its own commit, and cherry-picked into #12784 - my suggestion is that if we can get fast review on this, it will be cleaner for the commit history if it's merged first, and then I can rebase this.

@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 2 files changed, 23 insertions(+), 138 deletions(-))
google-beta provider: Diff ( 2 files changed, 23 insertions(+), 138 deletions(-))

@modular-magician modular-magician added the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Jan 16, 2025
@wyardley
Contributor Author

Also @ScottSuarez I tried backing off that sleep to 15s, and got more failures locally. It seems to me like maybe this is a bug. I'm guessing this won't rank as very important, but I created https://issuetracker.google.com/issues/390456348 anyway, so that I can at least link to it in the code comment here.

@wyardley
Contributor Author

#12785 -- this is a potential solution to the sleep in REPLAYING mode, though as currently written, it would affect all ~4 uses of that function.

@wyardley
Contributor Author

In the meantime, I'll push up a fix for hashicorp/terraform-provider-google#19997

@modular-magician
Collaborator

Tests analytics

Total tests: 4449
Passed tests: 4018
Skipped tests: 426
Affected tests: 5

Affected service packages: all service packages are affected

Action taken

Found 5 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Affected tests:
  • TestAccApigeeEnvironmentAddonsConfig_apigeeEnvAddonsAnalyticsTestExample
  • TestAccApigeeEnvironment_apigeeEnvironmentUpdateTest
  • TestAccContainerCluster_withAutopilotResourceManagerTags
  • TestAccDataSourceGoogleGkeHubFeature_basic
  • TestAccEphemeralServiceAccountKey_basic


@modular-magician
Collaborator

🟢 Tests passed during RECORDING mode:
TestAccContainerCluster_withAutopilotResourceManagerTags [Debug log]
TestAccDataSourceGoogleGkeHubFeature_basic [Debug log]

🔴 Tests failed when rerunning REPLAYING mode:
TestAccDataSourceGoogleGkeHubFeature_basic [Error message] [Debug log]

Tests failed due to non-determinism or randomness when the VCR replayed the response after the HTTP request was made.

Please fix these to complete your PR. If you believe these test failures to be incorrect or unrelated to your change, or if you have any questions, please raise the concern with your reviewer.


🔴 Tests failed during RECORDING mode:
TestAccApigeeEnvironmentAddonsConfig_apigeeEnvAddonsAnalyticsTestExample [Error message] [Debug log]
TestAccApigeeEnvironment_apigeeEnvironmentUpdateTest [Error message] [Debug log]
TestAccEphemeralServiceAccountKey_basic [Error message] [Debug log]

🔴 Errors occurred during RECORDING mode. Please fix them to complete your PR.



@GoogleCloudPlatform/terraform-team @ScottSuarez This PR has been waiting for review for 1 week. Please take a look! Use the label disable-review-reminders to disable these notifications.

Contributor

@ScottSuarez ScottSuarez left a comment


waiting on IAM changes from out of band PR

@wyardley wyardley force-pushed the wyardley/fix/tag_resource_manager_test branch from 4bce450 to ecc4b37 Compare January 17, 2025 23:09
@github-actions github-actions bot requested a review from ScottSuarez January 17, 2025 23:10
@wyardley
Contributor Author

@ScottSuarez @melinath thanks for the quick fix.... ecc4b37 includes the changes from #12796, and seems to work for me locally. Only open question is whether it's Ok to share that bootstrap function (modeled after the example in there) across the container and node pool test files, or if I should duplicate it and / or extract out to another file. Happy to change the naming of that function if you'd like.

@melinath
Member

I don't have a strong feeling about it. Whatever @ScottSuarez thinks WFM.

Labels
awaiting-approval Pull requests that need reviewer's approval to run presubmit tests service/container
Projects
None yet
5 participants