
Azure Deployment - Repeated Grpc.Core.RpcException: Status(StatusCode="XXX", Detail="Error connecting to subchannel.", "System.Net.Sockets.SocketException: An attempt was made to access a socket in a way forbidden by its access permissions" #2970

Open
dfaivre opened this issue Nov 19, 2024 · 5 comments

dfaivre commented Nov 19, 2024

Description

We get repeated, intermittent gRPC RpcExceptions in our deployed Durable Functions app, seemingly when the worker is trying to connect to the sidecar?

Expected behavior

A retry, or a reliable gRPC connection to the sidecar.

Actual behavior

Calls randomly fail with the RpcException below.

Relevant source code snippets

Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error connecting to subchannel.", DebugException="System.Net.Sockets.SocketException: An attempt was made to access a socket in a way forbidden by its access permissions.")
 ---> System.Net.Sockets.SocketException (10013): An attempt was made to access a socket in a way forbidden by its access permissions.
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport.TryConnectAsync(ConnectContext context)
   --- End of inner exception stack trace ---
   at Grpc.Net.Client.Balancer.Internal.ConnectionManager.PickAsync(PickContext context, Boolean waitForReady, CancellationToken cancellationToken)
   at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)
   at Microsoft.DurableTask.Client.Grpc.GrpcDurableTaskClient.GetInstancesAsync(String instanceId, Boolean getInputsAndOutputs, CancellationToken cancellation)
[internal code]...
--- End of stack trace from previous location ---
[internal code]...

Known workarounds

I think I'd basically need to create a custom DurableTaskClient wrapper that proxies all the calls and wraps them in retry logic?
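
Something like this minimal sketch is what I have in mind (assuming Polly is available; the class and method names are made up for illustration, and the policy settings and the single wrapped call are just an example, not an official API):

using System;
using System.Threading.Tasks;
using Grpc.Core;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;
using Polly;

public static class DurableClientRetry
{
    // Retry only the StatusCode=Unavailable failures seen in the stack trace above.
    private static readonly IAsyncPolicy RetryUnavailable = Policy
        .Handle<RpcException>(ex => ex.StatusCode == StatusCode.Unavailable)
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // Wraps a single DurableTaskClient call; every other call would need the same treatment.
    public static Task<string> ScheduleWithRetryAsync(
        DurableTaskClient client, TaskName orchestratorName, object? input = null)
        => RetryUnavailable.ExecuteAsync(
            () => client.ScheduleNewOrchestrationInstanceAsync(orchestratorName, input));
}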

App Details

  • Durable Functions extension version (e.g. v1.8.3): Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.1.7, Microsoft.Azure.Functions.Worker 1.23.0, App Setting FUNCTIONS_EXTENSION_VERSION = ~4
  • Azure Functions runtime version (1.0 or 2.0): 2.0
  • Programming language used: .NET 8.0 isolated (C#)

Screenshots

None

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

  • Timeframe issue observed: 2024-11-19T19:13:48.0385569Z
  • Function App name: (please email me if you need/want it)
  • Function name(s): Ops_OpInputPipelineStarterFunction
  • Azure region: Central
  • Orchestration instance ID(s): none (it's trying to start an orchestration)
  • Azure storage account name: (please email me if you need/want it)
  • Invocation ID: 023972b3-9b18-449c-9562-83abf5ae465e
  • HostInstance ID: cdf5caea-5b96-4052-ac69-46738ab06aa8
  • SDK Version: azurefunctions: 4.1036.3.23284
  • Operation ID: 6832f51c05c0bd19c117b7a64ce4bb93
  • Application ID: d430d61a-6e7c-4d9d-8a92-eb7f7dd63011
cgillum (Member) commented Nov 20, 2024

gRPC connection errors are generally not expected unless either the host process (which I think you're referring to as the sidecar) or the worker process crashes. Can you check to see if the Azure Functions host process might be recycling?

cgillum added the Needs: Author Feedback label and removed the Needs: Triage 🔍 label on Nov 20, 2024
dfaivre (Author) commented Nov 20, 2024

Thanks Chris! I'm not entirely sure how to check if the host process is recycling?

It might make sense that the host process is dying, since once the errors start, the only way to clear them is to restart the function app. So my retry workaround probably wouldn't work...

microsoft-github-policy-service bot added the Needs: Attention 👋 label and removed the Needs: Author Feedback label on Nov 20, 2024
cgillum (Member) commented Nov 20, 2024

@nytian mentioned a similar case to me yesterday. We suspect that this kind of problem might happen if the host process recycles and the worker process doesn't, in which case the host process starts listening on a different port than what the worker process is expecting.

I was able to find your app using the information you provided (thanks!), and while it's not easy for me to know whether there was a host restart (I can see the host starting, but not necessarily stopping), I do see a couple of cases where there were at least two host startup events on the same VM within a 10-minute window.

RoleInstance                             | TIMESTAMP                   | count_ | min_TIMESTAMP               | max_TIMESTAMP
pd0MediumDedicatedWebWorkerRole_IN_15289 | 2024-11-18 23:40:00.0000000 | 2      | 2024-11-18 23:40:11.2751149 | 2024-11-18 23:47:38.5501626
pd0MediumDedicatedWebWorkerRole_IN_15188 | 2024-11-18 23:40:00.0000000 | 2      | 2024-11-18 23:41:12.6843653 | 2024-11-18 23:48:30.2643996

They didn't quite match up with the timestamp you provided, however, so I'm not sure if these are related. By the way, are you running on the Consumption plan? It seems like your app is changing VMs pretty regularly.

If you haven't done so already, it might be worth opening an Azure Support request so that we can get more experts looking into this.

cgillum added the Needs: Author Feedback label and removed the Needs: Attention 👋 label on Nov 20, 2024
dfaivre (Author) commented Nov 20, 2024

@cgillum - thanks for taking the time to look into all of this.

  • We're on the Elastic Premium plan, so in theory it behaves like Consumption...
  • Looks like there was a small spike of errors around 2024-11-18T23:50:47.2385233Z, so pretty close to what you were seeing.
  • I have a support request open (created 11/7/2024) and have sent them this GitHub issue. I opened this issue because they seemed to be having a hard time making progress... :) Support request ID: 2411070040008596

microsoft-github-policy-service bot added the Needs: Attention 👋 label and removed the Needs: Author Feedback label on Nov 20, 2024
nytian self-assigned this on Nov 26, 2024
dfaivre (Author) commented Dec 1, 2024

cross ref: microsoft/durabletask-dotnet#353
