
Azure Deployment - Repeated Grpc.Core.RpcException: Status(StatusCode="XXX", Detail="Error connecting to subchannel.", "System.Net.Sockets.SocketException: An attempt was made to access a socket in a way forbidden by its access permissions" #2970

Open
dfaivre opened this issue Nov 19, 2024 · 5 comments

dfaivre commented Nov 19, 2024

Description

We get repeated, intermittent gRPC RpcExceptions in our deployed Durable Functions app, seemingly when the worker is trying to connect to the sidecar?

Expected behavior

A retry, or a reliable gRPC connection to the sidecar.

Actual behavior

Calls randomly fail with the RpcException below.

Relevant source code snippets

Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error connecting to subchannel.", DebugException="System.Net.Sockets.SocketException: An attempt was made to access a socket in a way forbidden by its access permissions.")
 ---> System.Net.Sockets.SocketException (10013): An attempt was made to access a socket in a way forbidden by its access permissions.
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport.TryConnectAsync(ConnectContext context)
   --- End of inner exception stack trace ---
   at Grpc.Net.Client.Balancer.Internal.ConnectionManager.PickAsync(PickContext context, Boolean waitForReady, CancellationToken cancellationToken)
   at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)
   at Microsoft.DurableTask.Client.Grpc.GrpcDurableTaskClient.GetInstancesAsync(String instanceId, Boolean getInputsAndOutputs, CancellationToken cancellation)
[internal code]...
--- End of stack trace from previous location ---
[internal code]...

Known workarounds

I think I'd basically need to create a custom DurableTaskClient wrapper that proxies all the calls and wraps them in retry logic?
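
Something like this minimal sketch is what I have in mind (assuming Polly is available; the class and method names are made up for illustration, and the policy settings and the single wrapped call are just an example, not an official API):

using System;
using System.Threading.Tasks;
using Grpc.Core;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;
using Polly;

public static class DurableClientRetry
{
    // Retry only the StatusCode=Unavailable failures seen in the stack trace above.
    private static readonly IAsyncPolicy RetryUnavailable = Policy
        .Handle<RpcException>(ex => ex.StatusCode == StatusCode.Unavailable)
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // Wraps a single DurableTaskClient call; every other call would need the same treatment.
    public static Task<string> ScheduleWithRetryAsync(
        DurableTaskClient client, TaskName orchestratorName, object? input = null)
        => RetryUnavailable.ExecuteAsync(
            () => client.ScheduleNewOrchestrationInstanceAsync(orchestratorName, input));
}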

App Details

  • Durable Functions extension version (e.g. v1.8.3): Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.1.7, Microsoft.Azure.Functions.Worker 1.23.0, App Setting FUNCTIONS_EXTENSION_VERSION = ~4
  • Azure Functions runtime version (1.0 or 2.0): 2.0
  • Programming language used: .NET 8.0 isolated (C#)

Screenshots

None

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

  • Timeframe issue observed: 2024-11-19T19:13:48.0385569Z
  • Function App name: (please email me if you need/want it)
  • Function name(s): Ops_OpInputPipelineStarterFunction
  • Azure region: Central
  • Orchestration instance ID(s): none (it's trying to start an orchestration)
  • Azure storage account name: (please email me if you need/want it)
  • Invocation ID: 023972b3-9b18-449c-9562-83abf5ae465e
  • HostInstance ID: cdf5caea-5b96-4052-ac69-46738ab06aa8
  • SDK Version: azurefunctions: 4.1036.3.23284
  • Operation ID: 6832f51c05c0bd19c117b7a64ce4bb93
  • Application ID: d430d61a-6e7c-4d9d-8a92-eb7f7dd63011
cgillum (Member) commented Nov 20, 2024

gRPC connection errors are generally not expected unless either the host process (which I think you're referring to as the sidecar) or the worker process crashes. Can you check to see if the Azure Functions host process might be recycling?

cgillum added the Needs: Author Feedback label and removed the Needs: Triage 🔍 label on Nov 20, 2024
dfaivre (Author) commented Nov 20, 2024

Thanks Chris! I'm not entirely sure how to check if the host process is recycling?

It might make sense that the host process is dying, since once the errors start, the only way to clear them is to restart the function app. So my retry workaround probably wouldn't work...

microsoft-github-policy-service bot added the Needs: Attention 👋 label and removed the Needs: Author Feedback label on Nov 20, 2024
cgillum (Member) commented Nov 20, 2024

@nytian mentioned a similar case to me yesterday. We suspect that this kind of problem might happen if the host process recycles and the worker process doesn't, in which case the host process starts listening on a different port than what the worker process is expecting.

I was able to find your app using the information you provided (thanks!), and while it's not easy for me to know whether there was a host restart (I can see the host starting, but not necessarily stopping), I do see a couple of cases where there were at least two host startup events on the same VM within a 10-minute window.

RoleInstance                             | TIMESTAMP                   | count_ | min_TIMESTAMP               | max_TIMESTAMP
pd0MediumDedicatedWebWorkerRole_IN_15289 | 2024-11-18 23:40:00.0000000 | 2      | 2024-11-18 23:40:11.2751149 | 2024-11-18 23:47:38.5501626
pd0MediumDedicatedWebWorkerRole_IN_15188 | 2024-11-18 23:40:00.0000000 | 2      | 2024-11-18 23:41:12.6843653 | 2024-11-18 23:48:30.2643996

They didn't quite match up with the timestamp you provided, however, so I'm not sure if these are related. By the way, are you running on the Consumption plan? It seems like your app is changing VMs pretty regularly.

If you haven't done so already, it might be worth opening an Azure Support request so that we can get more experts looking into this.

cgillum added the Needs: Author Feedback label and removed the Needs: Attention 👋 label on Nov 20, 2024
dfaivre (Author) commented Nov 20, 2024

@cgillum - thanks for taking the time to look into all of this.

  • We're on the Elastic Premium plan, so in theory it behaves like Consumption...
  • Looks like there was a small spike of errors around 2024-11-18T23:50:47.2385233Z, so pretty close to what you were seeing.
  • I have a support request open (created 11/7/2024) and have sent them this GitHub issue. I opened this issue because they seemed to be having a hard time making progress... :) Support request ID: 2411070040008596

microsoft-github-policy-service bot added the Needs: Attention 👋 label and removed the Needs: Author Feedback label on Nov 20, 2024
nytian self-assigned this on Nov 26, 2024
dfaivre (Author) commented Dec 1, 2024

cross ref: microsoft/durabletask-dotnet#353
