Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sam delete gets confused about which template to delete from S3 #5254

Closed
garretwilson opened this issue Jun 3, 2023 · 27 comments
Closed

sam delete gets confused about which template to delete from S3 #5254

garretwilson opened this issue Jun 3, 2023 · 27 comments
Labels
area/delete platform/windows stage/waiting-for-release Fix has been merged to develop and is waiting for a release type/bug

Comments

@garretwilson
Copy link

I'm using SAM 1.84.0 on Windows 10.

I build a JAR file with Maven and then directly deploy it using sam deploy like this:

sam deploy \
    --profile $awsProfile \
    --template-file $SCRIPT_DIR/../foobar/sam-template.yaml \
    --stack-name my-stack \
    --s3-bucket my-sam-bucket \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameter-overrides Env=dev Ver=$ver

As expected SAM CLI renames the JAR file to a hash 11111111111111111111111111111111 and uploads it to the S3 bucket. Then it does something to the template, renames it to a hash 22222222222222222222222222222222.template, and uploads that to the S3 bucket as well. (These hashes are just contrived example values of course.)

        Uploading to 11111111111111111111111111111111  …  (100.00%)
…
        Uploading to 22222222222222222222222222222222.template  …  (100.00%)

And indeed if I look in the S3 bucket, those are the exact two files that were uploaded!

Then I immediately delete the stack using sam delete like this:

sam delete \
    --profile $awsProfile \
    --stack-name my-stack \
    --s3-bucket my-sam-bucket

Oddly SAM asks me:

        Are you sure you want to delete the stack my-stack in the region xx-xxx-xx ? [y/N]: y
        Do you want to delete the template file 33333333333333333333333333333333.template in S3? [y/N]:

Wait, what is 33333333333333333333333333333333.template? There is no 33333333333333333333333333333333.template on S3. SAM never uploaded a 33333333333333333333333333333333.template file. SAM never said it uploaded a 33333333333333333333333333333333.template. SAM told me (truthfully) that it uploaded a 22222222222222222222222222222222.template file.

So I say, "Sure, SAM, go ahead and delete whatever you want," and SAM goes away and then comes back to say:

        - Deleting S3 object with key 11111111111111111111111111111111
        - Could not find and delete the S3 object with the key 33333333333333333333333333333333.template
        - Deleting Cloudformation stack my-stack

It's not surprising SAM couldn't find 33333333333333333333333333333333.template, because it never uploaded such a file. And SAM left 22222222222222222222222222222222.template on S3.

Why did SAM get confused and try to delete the non-existent file 33333333333333333333333333333333.template, when it really had uploaded 22222222222222222222222222222222.template?

(I don't know if it's relevant, but I'm using AWS::LanguageExtensions as one of the transforms. I'll just make a wild guess here with nothing to base it on, but is it possible that SAM creates the first hash based upon some sort of transformation/interpolation, but then creates the second hash based upon the raw template source file? I'm just brainstorming in case it helps you track this down.)

@garretwilson garretwilson added the stage/needs-triage Automatically applied to new issues and PRs, indicating they haven't been looked at. label Jun 3, 2023
@jfuss
Copy link
Contributor

jfuss commented Jun 5, 2023

@garretwilson Can you provide a workable example. The details you have describe what is happening but without something I can actually validate with, it's hard to repro (effectively just guessing at your template and setup).

@jfuss jfuss added the blocked/more-info-needed More info is needed from the requester. If no response in 14 days, it will become stale. label Jun 5, 2023
@garretwilson
Copy link
Author

Can you provide a workable example.

I can give you basically everything I have. Here is basically what my template is:

AWSTemplateFormatVersion: '2010-09-09'
Transform:
  - AWS::LanguageExtensions
  - AWS::Serverless-2016-10-31

Description:
  Example functions.

Parameters:
  Ver:
    Description: The version of the service, such as "1.2.3".
    Type: String
    AllowedPattern: '[\w.+-]+'
    ConstraintDescription: The service version must use only word characters, dots, plus signs, and dashes.

Globals:
  Function:
    CodeUri: !Sub "s3://my-sam-bucket/foo-bar-functions-${Ver}.jar"
    Runtime: java17
    Architectures: [x86_64]
    MemorySize: 512
    Timeout: 300

Resources:
  ExampleFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: com.example.FooBar::foo

Here is my samconfig.toml:

version = 0.1

[default]
[default.global.parameters]
stack_name = "placeholder" # will be specified from the CLI

[default.build.parameters]
cached = true
parallel = true

[default.validate.parameters]
lint = true

[default.deploy.parameters]
capabilities = "CAPABILITY_IAM"
confirm_changeset = true

[default.package.parameters]

[default.sync.parameters]
watch = true

[default.local_start_api.parameters]
warm_containers = "EAGER"

[default.local_start_lambda.parameters]
warm_containers = "EAGER"

I'm afraid I don't have time to create some turnkey Zip package. I'm already delayed in this project; I haven't even got to the point of deploying actual functionality, as I spent so much time just getting the deployment working (#5249). 😅 Really there isn't much here in my configuration and code; I can't imagine it would be hard to reproduce.

If it's not happening for you, the first place I would look would be the !Ref and AWS::LanguageExtensions interactions. It sounds like something to do with creating a hash before or after variable substitution.

You might check into line endings as well. I'm on a Windows machine using CRLF in the templates. If your hashing function converts to LF when doing the upload and then compares with that in the local directory without normalizing line endings, then that could lead to mismatched hashes as well. Just giving you ideas to look into.

@jfuss
Copy link
Contributor

jfuss commented Jun 5, 2023

@garretwilson Thanks that is enough. I don't need a full example, I can fill in the details but with nothing to base on you are leaving us guessing, which leads to "works on my machine can't repro". Providing clear and concise issues with examples is the key to making interactions faster with the team.

@garretwilson
Copy link
Author

Sure, I understand. My configuration is so simple I would guess it will be super-easy to reproduce. But if you can't, just let me know and I'll think more about what further could be different between our environments. A step at a time I can put in more work for turn-key reproducibility if needed, but at this point I'm guessing it won't be needed. Good luck and have a good week.

@jfuss
Copy link
Contributor

jfuss commented Jun 5, 2023

I am not able to repro this on 1.85.0, using various methods. The likely place this could fail is https://github.com/aws/aws-sam-cli/blob/develop/samcli/commands/delete/delete_context.py#L273 as we try to compute the hash from the downloaded template but without being able to repro, it's hard to say for certain.

@garretwilson
Copy link
Author

OK. I'll put together something even more turnkey, although it might be a few days. (Fortunately this particular is nowhere near a blocker; I'm just reporting it to help tidy up loose ends.)

In the meantime, before I put a lot of time into this, can you just confirm that you have tested it from the command line on Windows 10?

@jfuss
Copy link
Contributor

jfuss commented Jun 5, 2023

I am not on Windows. It's possible this is Windows specific related but don't have direct access to a Windows machine to verify on you exact setup.

@garretwilson
Copy link
Author

garretwilson commented Jun 5, 2023

Can you at least confirm that you've tested this with a template using CRLF line endings? You should be able to do that on any platform. VS Code can even do the conversion for you if you ask it to.

@garretwilson
Copy link
Author

This is odd:

  • I switched to LF endings for sam-template.yaml
  • I changed CodeUri: !Sub "s3://my-sam-bucket/foo-bar-functions-${Ver}.jar" to CodeUri: "s3://my-sam-bucket/foo-bar-functions-1.2.3.jar" to remove any substitutions.
  • I removed AWS::LanguageExtensions, leaving only Transform: AWS::Serverless-2016-10-31

Still SAM CLI tries to delete a different hash-named template than the one it uploaded, brazenly showing me the hashes that have nothing to do with each other.

(In addition it appears that if I remove the !Sub so that the CodeUri is a literal URI to the JAR that I uploaded, SAM CLI takes it upon itself to delete that JAR when it tears down the CloudFormation stack! So SAM URI can't delete the template that it uploaded, but turns around and deletes the JAR that I uploaded. Fortunately using !Sub with interpolation seems to be enough to confuse SAM CLI so that it leaves my JAR alone.)

I am running out of ideas for why you can't reproduce this; I haven't found a way to not reproduce it on my end.

The only thing I could even possible think of is that I am invoking SAM from a Bash script in Git Bash for Windows. But that has a very low chance of being related.

I'll think of more ways I can narrow this down and provide you something to reproduce this, but it's not going to be very soon. I spent half of yesterday tracking down why Log4J (which the AWS libraries force me to use) was producing a warning. (Spoiler: it's because Log4J uses a multi-release POM, and AWS lambda doesn't know how to deal with multi-release POMs, so Lambda was using old pre-Java 9 classes. It may impact sam build, which I don't use so I don't know. If you want more info, ask me or read aws/aws-lambda-java-libs#204 .)

@jfuss
Copy link
Contributor

jfuss commented Jun 6, 2023

@garretwilson Are you using sam deploy or using some other way to deploy the template initially.

Leaving what I was testing below to not loose context in the future and could not repro even with subs and transforms.

template.yaml

AWSTemplateFormatVersion: '2010-09-09'
Transform:
  - AWS::LanguageExtensions
  - AWS::Serverless-2016-10-31

Description:
  Example functions.

Parameters:
  Ver:
    Description: The version of the service, such as "1.2.3".
    Type: String
    ConstraintDescription: The service version must use only word characters, dots, plus signs, and dashes.

Globals:
  Function:
    CodeUri: !Sub "s3://<my hardcoded bucket>/testing-5254-7/${Ver}"
    Runtime: python3.9
    Architectures: [x86_64]
    MemorySize: 512
    Timeout: 300

Resources:
  HelloWorldFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      Environment:
        Variables:
          Version: !Ref Ver

commands:

sam deploy --stack-name testing-5254-8 --guided --region us-west-2 --parameter-overrides "Ver=4d4e1a715ca32be7dcbbf7751dfb8e8a"
sam delete --stack-name testing-5254-8 --region us-west-2 --s3-bucket <my hardcoded bucket>

Notes:
I ran sam deploy --stack-name testing-5254-7 --guided --region us-west-2 as a cheap way to upload an artifact and then used that artifact in the template above. This replicates a direct S3 upload.

@garretwilson
Copy link
Author

Are you using sam deploy or using some other way to deploy the template initially.

I am using sam deploy from inside a Bash script inside Git Bash for Windows on Windows 10. See the description of this ticket for the specific format of the command I'm using.

… could not repro even with subs and transforms.

It happens for me even when I take out the !Sub and the AWS::LanguageExtensions.

… as a cheap way to upload an artifact and then used that artifact in the template above. This replicates a direct S3 upload.

OK. Or you can just drag and drop the JAR into the S3 console, and hard-code the s3://… CodeUri in the SAM template.

Just to make sure we're on the same page (sorry if you understood this, but it wasn't clear in your last comment), the problem is not with sam deploy, but that sam delete tries to delete a different template than what it uploaded.

Thanks for looking at this, but it's not my priority at the moment as I can live with an extra orphaned template scattered around on the server. Right now I'm writing a bug report for Eclipse 2023-06 RC1. It seems that ever 15 minutes some other tool breaks. You can imagine how much real work I've managed to do in the past few days with sam deploy not recognizing interpolation, AWS Lambda not working with multi-release JARs, Eclipse 2023-06 breaking with maven.compiler.release set to 8 … arg!!

Wait … is today tequila Tuesday? (Or was that "taco Tuesday" …)

@mildaniel mildaniel added stage/bug-repro The issue/bug needs to be reproduced and removed stage/needs-triage Automatically applied to new issues and PRs, indicating they haven't been looked at. labels Jun 9, 2023
@mndeveci
Copy link
Contributor

mndeveci commented Jul 5, 2023

Hey @garretwilson

I've tried steps you mentioned here, and wasn't able to re-produce the problem.

One thing that caught my attention though. Are you using a JAR file which is already uploaded to S3, or are you using a local file and let SAM CLI to upload it with deploy flow? Because your example contains a JAR file in S3, but you mention that it uploads the JAR file during deployment process.

As expected SAM CLI renames the JAR file to a hash 11111111111111111111111111111111 and uploads it to the S3 bucket.

Could there be additional steps that you are doing before or after deploy? Just trying to see where this is coming from.

Thanks!

@garretwilson
Copy link
Author

@mndeveci thank you for looking into this.

Are you using a JAR file which is already uploaded to S3, or are you using a local file and let SAM CLI to upload it with deploy flow?

I'm using a JAR file that my own script uploads to S3. Then I reference that S3 blob path from my SAM template, but using a !Ref. This is because SAM breaks if I tried to include a !Ref in the actual CodeUri using a so-called relative URI for SAM to upload the JAR. (And I want to use a !Ref because I pass in the version number via a parameter.) This is all explained in #5249.

In fact my actual solution is described in painstaking detail in a comment to that ticket. And it's working great. (Since I'm doing all the work to upload the JAR file myself, though, I'm seeing less and less of a need for SAM.)

So basically there will be a !Ref in the CodeUri, but since it's an absolute URI SAM will ignore it and let CloudFormation resolve the !Ref later in the pipeline, finding the JAR I already uploaded. And then when tearing down, SAM won't even try to delete the JAR (that's good, because I didn't ask it to delete the JAR and I don't want it to), because it can't understand the !Ref anyway.

@mndeveci
Copy link
Contributor

mndeveci commented Jul 6, 2023

I see your point. Unfortunately intrinsic support in SAM CLI is very limited since there is no library or API endpoint that we can use to resolve them inside the template.

I am trying to see any workarounds, you might look at custom builds (where you need to provide a Makefile for your function's build process) you might try following options (both of them will be creating a template file with CodeUri property without any intrinsic function;

  1. You can move all of your external build logic into Makefile and then you can use sam build and sam package/sam deploy
  2. If you want to keep the way you are building your functions, you can still use a Makefile but just copying the jar files from the previous external step that you build. And then again you can try using sam build and sam package/sam deploy as the next steps.

Going back to your question for this issue, I've checked the sam delete flow, and I find this where we try to estimate the S3 location of the deployed template (CFN doesn't have an API to return the template location), which basically downloads your deployed template and calculates hash and estimates S3 location;
https://github.com/aws/aws-sam-cli/blob/develop/samcli/lib/package/local_files_utils.py#L26-L66

So if you are deploying through SAM CLI, and then making some changes to your template and re-deploy again by another tool, it is possible that we can't estimate the location of the template file therefore that step fails during sam delete.

@garretwilson
Copy link
Author

garretwilson commented Jul 6, 2023

I am trying to see any workarounds, …

Thank you, but this ticket was not to request workarounds to #5249. I have given up on SAM in this area, and created my own workaround, which is working splendidly. This ticket was opened for a different reason.

And then again you can try using sam build

I am not using sam build at all. My Maven project builds the project just fine; I don't need SAM for that. (In fact using the Maven Shade plugin for creating uber JARs is a horrible approach for many reasons, such as conflicting files and a complete breaking of multirelease JAR dependencies — not because Shade can't create multirelease JARs, but because AWS Lambda doesn't understand multirelease uber JARs, and breaks Java 9+ functionality, etc.) My Maven build is creating a nice ZIP file. I just needed SAM to upload the ZIP file with a parameter as part of its path. It can't do that in #5249, and it gets confused with deleting the template in this ticket.

Going back to your question for this issue …

👍

… I find this where we try to estimate the S3 location of the deployed template (CFN doesn't have an API to return the template location), which basically downloads your deployed template and calculates hash and estimates S3 location …

I've indicated previously in this ticket that my gut instinct is that SAM is somehow getting confused with the hash of the template before and after variable substitution.

The whole CloudFormation template thing SAM is built on is very brittle to begin with, as illustrated by #5249, and in fact the very need to have the AWS::LanguageExtensions hack in the first place, which itself doesn't even work all the time; see aws-cloudformation/cfn-language-discussion#127 .

So if you are deploying through SAM CLI, and then making some changes to your template and re-deploy again by another tool, …

I am not doing this. I am deploying the template via SAM CLI, making zero changes to the template via any means, and then using SAM CLI to delete the stack.

Again this ticket has low priority; I really don't care so much at this point. There are much bigger usability concerns such as aws-cloudformation/cfn-language-discussion#127 .

But thanks again for reading it and thinking about it.

@garretwilson
Copy link
Author

garretwilson commented Jul 7, 2023

which basically downloads your deployed template and calculates hash and estimates S3 location

But where does the content of the "deployed template" come from? Here's a snippet from the source code you indicated:

def get_uploaded_s3_object_name(
    precomputed_md5: Optional[str] = None,
    file_content: Optional[str] = None,
    file_path: Optional[str] = None,
    extension: Optional[str] = None,
) -> str:

Where are you getting that file_content from? Is the "deployed template" an identical, byte-for-byte copy of what SAM uploaded to the S3 bucket before processing? I can't see how that could be the case. When I open a CloudFormation stack in the console and look under "Template", what I see is not identical to what I have locally. It appears the object tree has been stored somewhere, and what I see in the console is a reserialization of that, which is equivalent to what I have to the original template source file, but not identical. It certainly wouldn't generate the same hash, and I wouldn't expect it to.

So where is this "deployed template" that is byte-for-byte identical to what is in the S3 bucket, that you would expect to generate a hash of and have it match that of the original template source code?

@mndeveci
Copy link
Contributor

mndeveci commented Jul 7, 2023

Is the "deployed template" an identical, byte-for-byte copy of what SAM uploaded to the S3 bucket before processing?

Unfortunately we can't use what was deployed before, since you might deploy your stack in one place and might try to delete from another. So that is why we are getting the template from CFN, and then trying to estimate file location by getting its hash.

When I open a CloudFormation stack in the console and look under "Template", what I see is not identical to what I have locally

I think what we are getting is what you are seeing in CFN. We are using get_template API with TemplateStage=Original which should return un-transformed (aka original) template.

So where is this "deployed template" that is byte-for-byte identical to what is in the S3 bucket, that you would expect to generate a hash of and have it match that of the original template source code?

CFN doesn't have an API to return deployed template location, but it looks like, it might be recorded in CloudTrail (see). Can you check those events to see where the template is stored? It might also give you some clue about why it is different from the local one.

@garretwilson
Copy link
Author

So that is why we are getting the template from CFN, and then trying to estimate file location by getting its hash.

Why go to all that trouble in such a roundabout way? Why not just add the hash as a tag to the stack? Or better yet, the bucket name and path? Sure, the user could delete the tag. They can go into the S3 bucket and rename the staged, template too. The user can do anything. But there's no reason for them to muck with this tag, and if they know not to, then they probably won't, just like they don't rename the staged template in the SAM bucket. So why not do it the easy, direct way, instead of trying to work into it backwards by recalculating the hash and then looking up the filename?

Another question: why do you have to stage the SAM template to begin with? I can deploy a CF template directly without staging the template as far as I know. Why do we have to stage it in an S3 bucket with SAM? I'm honestly curious.

@mndeveci
Copy link
Contributor

mndeveci commented Jul 7, 2023

Why go to all that trouble in such a roundabout way? Why not just add the hash as a tag to the stack? Or better yet, the bucket name and path?

It is doable, but sam deploy was already there before we released sam delete, that is why we are doing an estimate for template's location. I will take this to team's attention about adding a tag for template's location and considering that as well while deleting all the resources.

Another question: why do you have to stage the SAM template to begin with? I can deploy a CF template directly without staging the template as far as I know. Why do we have to stage it in an S3 bucket with SAM? I'm honestly curious.

With sam deploy we are using create_change_set API with TemplateURL property. That property requires an S3 location or systems manager document.

@garretwilson
Copy link
Author

With sam deploy we are using create_change_set API with TemplateURL property. That property requires an S3 location or systems manager document.

I see. Thanks for explaining. That's good to know.

I'm looking at the API and I see that there is a TemplateBody option (I don't know if Boto supports it—I don't use it), although that's limited to ~50K in size.

@mndeveci mndeveci added type/feature Feature request and removed blocked/more-info-needed More info is needed from the requester. If no response in 14 days, it will become stale. stage/bug-repro The issue/bug needs to be reproduced labels Jul 7, 2023
@garretwilson
Copy link
Author

I still intend to put together a reproducible test (which shouldn't be hard because I only get this behavior), but as I think about this day-to-day, there are a few doubts I still have:

  1. If I deploy version A of the template, then deploy version B, then version C; what happens to versions A and B of the template? If SAM only tries (and fails, in this ticket) to delete the template when I delete the stack, it's only trying (and failing) to delete version C. What happens to version A and B? Even if we fix the bug in this ticket, won't versions A and B still be left cluttering the staging bucket? Or does SAM delete version A when it deploys the change set for version B, etc.?
  2. If my template is under ~50K in size, why does SAM even need to upload the template in the first place? Let me rephrase that—of course it doesn't need to. Why does it if it doesn't need to?
  3. After SAM deploys my template, is there any way to get the name of the uploaded template in the staging bucket so that my script can just immediately delete this junk, seeing that 1) we didn't need it in the first place, 2) it may be orphaned their anyway until SAM deletes old ones when it deploys change sets, and 3) in this ticket SAM can't even clean up the last one correclty?

@garretwilson
Copy link
Author

I still intend to put together a reproducible test …

Since most of my day has been wasted trying to do simple things and running into aws/serverless-application-model#3265 (trival), aws/serverless-application-model#3264 (huge), #5533 (irritating and time-wasting), and aws/serverless-application-model#3261 (pales in comparison to the others), I decided I might as well spend the rest of the day getting you a reproducible test case.

I started paring down my code to see what triggers this, and I ran into something else. I removed the -s3-bucket $s3BucketName from the scripts for both sam deploy and for sam delete. Surprisingly, SAM went and head and deployed the template, even without a staging bucket. And it didn't create one! I haven't used the default SAM staging bucket in months. I deleted it long ago, and I don't remember what it was named. Now I use separate staging buckets for each environment.

But then when I tried to do sam-delete without -s3-bucket, SAM failed:

Warning: Cannot resolve s3 bucket information from command options , local config file or cloudformation template. Please use --s3-bucket next time and delete s3 files manually if required.

Odd—no warning when deploying with no staging bucket name, but a warning when deleting without a staging bucket name.

(I had thought it would create a default staging bucket; I was trying to pare down the test case to as small as possible. I guess I am not remembering correctly. I'll use a hard-coded bucket name.)

Anyway, this is probably a separate bug, but I'm pretty tired from all the bug reporting today already. For now I'll just leave it in this comment. I don't want that to sidetrack us from the main issue for this ticket.

@garretwilson
Copy link
Author

OK, @jfuss and others, I've created a minimal test case case to reproduce this issue. It puzzles me greatly that you cannot reproduce it, because I have been unable to not reproduce this. (I have a fear that, after all the time I've spent, someone is finally going to try it on Windows and find that, "oh, this always happens on Windows"—but let's see ….) I've provided the files in a ZIP file which I've attached and linked in the instructions below.

Before going further, let me be clear about my configuration:

  • Windows 10
  • SAM CLI, version 1.88.0
  • aws-cli/2.11.23 Python/3.11.3 Windows/10 exe/AMD64 prompt/off
  • git version 2.40.1.windows.1
  • VS Code 1.80.1
  • I'm running these scripts from within Git Bash on Windows from the VS Code embedded terminal.

Note: The scripts in the ZIP file probably don't have their executable bit set. They don't need that when when running from Git Bash on Windows. If you test them in *nix, I'm sure you know that you need to do a chmod +x or whatever on them first.

  1. Unzip the files from aws-sam-cli-issue-5254.zip and place them in the same directory on Windows:
    • aws-sam-cli-issue-5254.sam.yaml: The SAM template.
    • samconfig.toml: The SAM configuration file.
    • deploy-aws-sam-cli-issue-5254.sh: The deployment script.
    • delete-aws-sam-cli-issue-5254.sh: The deletion script.
  2. Create a bucket named aws-sam-cli-issue-5254-staging on your AWS account for staging. You can use any bucket name, but you'll need to do a search+replace on the scripts if you use something else.
  3. Verify that the bucket name aws-sam-cli-issue-5254 is available; otherwise you'll need to update aws-sam-cli-issue-5254.sam.yaml with some other bucket name of your choice.
  4. Deploy the stack using ./deploy-aws-sam-cli-issue-5254.sh. Along with other output, I see:
            Uploading to 05ca0eabd6ebdeacd0836c6ec2826579.template  328 / 328  (100.00%)
    
  5. Immediately delete the stack using ./delete-aws-sam-cli-issue-5254.sh. I see the following output:
            - Could not find and delete the S3 object with the key d03c1af37631d000624ff6fee500c2ba.template
            - Deleting Cloudformation stack aws-sam-cli-issue-5254
    
    Deleted successfully
    

@mndeveci
Copy link
Contributor

mndeveci commented Jul 16, 2023

@garretwilson thanks for working on this (even on Saturday!).

Thanks to your step-by-step example, I was able to find the root cause of it (finally). As I mentioned above, we are getting deployed template and calculate its hash to find its location in S3, this works well on Linux and MacOS but it fails on Windows. The reason is, python does some extra stuff when writing strings to a file on Windows;

When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.

So what we get from CFN is this;

b"\nAWSTemplateFormatVersion: '2010-09-09'\nTransform:\n- AWS::LanguageExtensions\n- AWS::Serverless-2016-10-31\nDescription: 'Demonstration of AWS SAM CLI Issue aws/aws-sam-cli#5254.'\nResources:\n  TestBucket:\n    Type: AWS::S3::Bucket\n    Properties:\n      BucketName: aws-sam-cli-issue-5254\n    Metadata:\n      SamResourceId: TestBucket\n"

What is written into the file is this (\n becomes \r\n);

b"\r\nAWSTemplateFormatVersion: '2010-09-09'\r\nTransform:\r\n- AWS::LanguageExtensions\r\n- AWS::Serverless-2016-10-31\r\nDescription: 'Demonstration of AWS SAM CLI Issue aws/aws-sam-cli#5254.'\r\nResources:\r\n  TestBucket:\r\n    Type: AWS::S3::Bucket\r\n    Properties:\r\n      BucketName: aws-sam-cli-issue-5254\r\n    Metadata:\r\n      SamResourceId: TestBucket\r\n"

This auto conversion is why you get a different hash on Windows and that leads into the original problem of this issue where it can't find the S3 object for deletion.

I've tried with following options and both of them worked so far, I need to check with the team and to select one or the other.

  • Option 1: Calculate the hash of the template as string rather than writing to a tmp file.
  • Option 2: When opening tmp file for writing, we need to open it with newline="" so that it won't do the auto-conversion of line endings on windows.

Note: This is the part where it is creating this problem https://github.com/aws/aws-sam-cli/blob/develop/samcli/lib/package/local_files_utils.py#L53-L57

@garretwilson
Copy link
Author

I've tried with following options and both of them worked so far …

I'm coming in late at night and it's not wise for me to be commenting extensively on this at the moment, but I believe that the standard practices would probably be simply to normalize to LF before doing a hash. This is not a new type of problem. But I'll comment more when I'm less "tired" and hopefully find a way not to be grumpy tomorrow about the whole affair of having mentioned this possibility at the beginning. 😅 But I'm super-happy you found the problem. Good night and thanks.

@garretwilson
Copy link
Author

garretwilson commented Jul 16, 2023

Good morning! I want to say thanks again for continuing to look at this, and it's wonderful to finally get to the bottom of it.

Since we're all friends here, I do want to point out a couple of things, though. I filed this ticket well over a month ago. I described it concisely, precisely, and thoroughly. My first guess was that you were inconsistently hashing before/after the transform, but two days later I said:

You might check into line endings as well. I'm on a Windows machine using CRLF in the templates. If your hashing function converts to LF when doing the upload and then compares with that in the local directory without normalizing line endings, then that could lead to mismatched hashes as well. Just giving you ideas to look into.

And that was "purt near" (as the cowboys say) exactly what the problem was conceptually. All this back and forth and countless hours and $X was a complete waste—that is, we gained nothing of value from it, and we could have been doing something more productive with our time. It is my understanding that just a single test on Windows for any template a month ago would have reproduced this problem.

The other thing I want to point out is that this probably affected hundreds if not thousands of Windows users. Most people don't file tickets. Most people just see a glitch and think, "it's part of this rough-and-tumble Wild West of the cloud; I'll just ignore it." I on the other hand take the time to file tickets.

What I'm trying to get to is … if I ever sound grumpy when I get pushback on tickets I take the time to create, please remember this ticket, cut me a little slack, and don't hold it against me. 😁❤️

Again my sincere thanks for continuing to follow this and look into it.

@github-actions
Copy link
Contributor

Patch is released in v1.94.0. Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/delete platform/windows stage/waiting-for-release Fix has been merged to develop and is waiting for a release type/bug
Projects
None yet
Development

No branches or pull requests

4 participants