Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unhandled {ssl_closed, _} message when using async #464

Open
paulswartz opened this issue Jan 15, 2018 · 52 comments · May be fixed by #640
Open

unhandled {ssl_closed, _} message when using async #464

paulswartz opened this issue Jan 15, 2018 · 52 comments · May be fixed by #640
Assignees

Comments

@paulswartz
Copy link

I see the following message in my Elixir logs periodically:

Unexpected message: {:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.9056>, :tls_connection, :undefined}, #PID<0.10027.10>}}

That looks like a close message from the SSL socket, which I'd expect hackney to manage the same as a close of the TCP connection (which are handled properly). I'm using the async option to stream the response data as it comes in.

@matthewoden
Copy link

matthewoden commented Jan 23, 2018

I can reproduce this - I'm also using the async option with streaming. It seems that my chunk gets mixed in with it, as the last item in the tuple.

Unexpected message: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.8780>, :tls_connection, :undefined}, #PID<0.380.0>}, <<250, 148, 222, 99, 116, 39, 82, 167, 182, 77, 245, 137, 189, 33, 60, 168, 182, 240, 30, 55, 235, 210, 208, 103, 217, 48, 245, 8, 29, 92, 27, 76, 134, 173, 101, 105, 228, 74, 31, 239, 5, 67, 94, 174, 203, 215, 11, 155, ...>>}

Update/Edit: When I run the streaming process in an Elixir Task, the message doesn't occur. Which makes me think there's a mailbox issue?

Only occurs in our deployment environment, (cloudfoundry, ubuntu trusty). Locally on OSX 10.12.5, there's no issue.

Using hackney 1.11

code snippet:

   # options for request - [async: :once, stream_to: self()]

    defp handle_async(%{status: status} = conn, ref) do
        :ok = :hackney.stream_next(ref)

        receive do
            {:hackney_response, ^ref, {:status, code, _reason}} ->
                conn
                |> Plug.Conn.put_status(code)
                |> handle_async(ref)

            {:hackney_response, ^ref, {:headers, headers, }} ->
                downcased_headers =
                    headers
                    |> Stream.map(fn {k,v} -> {String.downcase(k), v} end)
                    |> Enum.filter(fn {k, _v} -> k !== "cache-control" end)

                conn
                |> Plug.Conn.merge_resp_headers(downcased_headers)
                |> Plug.Conn.delete_resp_header("content-length") #RFC spec dictates no content-length on chunked requests
                |> Plug.Conn.put_resp_header("connection", "close")
                |> Plug.Conn.send_chunked(status)
                |> handle_async(ref)

            {:hackney_response, ^ref, :done} ->
                conn

            {:hackney_response, ^ref, chunk} ->
                case Plug.Conn.chunk(conn, chunk) do
                    {:ok, new_conn} ->
                        handle_async(new_conn, ref)

                    {:error, :closed} ->
                        Logger.info("Client canceled request.")
                        :hackney.close(ref)
                        conn

                    {:error, reason } ->
                        Logger.error("Unable to complete request: #{inspect(reason)}")
                        :hackney.close(ref)
                        conn
                end

            message ->
                Logger.warn("Hackney.handle_async got an unexpected message: #{inspect(message)}")
                handle_async(conn, ref)
       end
    end

@benoitc benoitc self-assigned this Jan 23, 2018
@benoitc benoitc added the bug label Jan 23, 2018
@benoitc
Copy link
Owner

benoitc commented Jan 23, 2018

hrm it's probably happening on close. @paulswartz is your code similar to the one posted by @matthewoden ?

@paulswartz
Copy link
Author

I'm using hackney via httpoison, but here's the relevant code: https://github.com/edgurgel/httpoison/blob/master/lib/httpoison/base.ex#L369-L387

@matthewoden
Copy link

If it matters, I've done a quick port on my code over to :ibrowse, and the issue hasn't come up again. The receive block is basically the same (though ibrowse combines status and headers into a single message).

I'd prefer to use hackney, so just let me know if you need any other usage info.

@benoitc
Copy link
Owner

benoitc commented Jan 30, 2018

well the issue is that a message is still on the process heap after cloising the socket. I have a fix for it. Will take care of it later today. Thanks for the code!

@benoitc
Copy link
Owner

benoitc commented Feb 27, 2018

can you try #482?

@benoitc
Copy link
Owner

benoitc commented Mar 1, 2018

can anyone provides me a link on which I can reproduce the issue?

@matthewoden
Copy link

matthewoden commented Mar 1, 2018

I can't provide our source, but I'll see if I can get #482 out in a test environment today or tomorrow. (Edit: I've never been able to reproduce it locally, even after dockerizing the thing.)

@stephenmoloney
Copy link

stephenmoloney commented Mar 1, 2018

This https://github.com/stephenmoloney/fastimage/pull/24 reproduces it everytime - but on travis only.

Like @matthewoden, I also can't reproduce it locally.... probably something to do with loopbacks?
My tcp knowledge is not good enough...

This is in stdout on travis container startup...

127.0.0.1	localhost
::1	 ip6-localhost ip6-loopback
fe00::0	ip6-localnet
ff00::0	ip6-mcastprefix
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters

@benoitc
Copy link
Owner

benoitc commented Mar 1, 2018

@stephenmoloney do you mean the link? I need to reporduce it apart of the another lib ...

@stephenmoloney
Copy link

@benoitc sorry, I don't full understand... When the tests run, that error is printed...

https://travis-ci.org/stephenmoloney/fastimage/jobs/346210674

I still didn't get time to try the tracing.

@critch
Copy link

critch commented Mar 6, 2018

@benoitc Take your example of async receiving in the documentation, convert to require stream_next, and put a sleep of a second or more between requesting stream_next. This is how I have done it via elixir with only hackney. There is something possibly a race condition of receiving the last of the web request and the receiving a message to stream_next. The stream has fully closed, but the buffer has not been flushed out and the done message sent.

@benoitc
Copy link
Owner

benoitc commented Mar 6, 2018

@critch thanks. Quite possible indeed. I think that rather than pausing i should rather receive actively at some point. I will have a closer looks, thanks for the reproducible way.

@benoitc
Copy link
Owner

benoitc commented Apr 3, 2018

fixed in latest master.

@benoitc benoitc closed this as completed Apr 3, 2018
@rozap
Copy link
Contributor

rozap commented Jun 11, 2018

I still see this happening with hackney 1.12.1. It definitely seems to be a race condition, as it happens sometimes.

One thing I don't understand from d4c4f3930d633a675a4224eb76149646c1e6cd74 is that the changes all seem to be around closing the socket, but I see this error happen long before the socket has closed - just after the response stream starts.

I'm going to keep looking into it.

@rozap
Copy link
Contributor

rozap commented Jun 11, 2018

I've got two traces, one where it works and one where it crashes.

This is part of the working trace that differs from the broken one.

13:20:52:826799 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130707>,stream_next}
13:20:52:826812 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130707>,stream_next}
13:20:52:826826 (<0.931.0>) <0.929.0> ! {'$gen_call',
                                            {<0.931.0>,#Ref<0.3355659064.3830185990.130819>},
                                            {set_opts,[{active,once}]}}
13:20:52:826862 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130819>,ok}
13:20:52:826869 (<0.931.0>) <0.929.0> ! {'$gen_call',
                                            {<0.931.0>,#Ref<0.3355659064.3830185990.130820>},
                                            {set_opts,[{active,once}]}}
13:20:52:826901 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130820>,ok}
13:20:52:827039 (<0.931.0>) << {ssl,{sslsocket,{gen_tcp,#Port<0.27730>,tls_connection,undefined},
                                               <0.929.0>},

This is the broken one

13:06:37:122289 (<0.1064.0>) << {#Ref<0.280439725.1680343042.127415>,stream_next}
13:06:37:122300 (<0.1064.0>) << {#Ref<0.280439725.1680343042.127415>,stream_next}
13:06:37:122332 (<0.1064.0>) <0.1062.0> ! {'$gen_call',
                                              {<0.1064.0>,#Ref<0.280439725.1680343041.134667>},
                                              {set_opts,[{active,once}]}}
13:06:37:122368 (<0.1064.0>) << {#Ref<0.280439725.1680343041.134667>,ok}
13:06:37:122547 (<0.1064.0>) <0.1062.0> ! {'$gen_call',
                                              {<0.1064.0>,#Ref<0.280439725.1680343041.134668>},
                                              {set_opts,[{active,once}]}}
13:06:37:122562 (<0.1064.0>) << {ssl,{sslsocket,{gen_tcp,#Port<0.28349>,tls_connection,undefined},
                                                <0.1062.0>},
                                     <<"datadatadata">>}
13:06:37:122691 (<0.1064.0>) << {ssl,{sslsocket,{gen_tcp,#Port<0.28349>,tls_connection,undefined},
                                                <0.1062.0>},
                                     <<"datadatadata">>}
13:06:37:122696 (<0.1064.0>) << {#Ref<0.280439725.1680343041.134668>,ok}
13:06:37:122743 (<0.1064.0>) <0.635.0> ! {hackney_response,#Ref<0.280439725.1680343042.127415>,
                                             <<"datadatadata">>}
13:06:37:122753 (<0.1064.0>) error_logger ! {notify,
                                             {error,<0.324.0>,
                                              {<0.1064.0>,"Unexpected message: ~w~n",
                                               [{ssl,
                                                 {sslsocket,
                                                  {gen_tcp,#Port<0.28349>,tls_connection,
                                                   undefined},
                                                  <0.1062.0>},
                                                 <<"datadatadata"}]}}}

If you look at them side by side, we see the working one has

13:20:52:826869 (<0.931.0>) <0.929.0> ! {'$gen_call',
                                            {<0.931.0>,#Ref<0.3355659064.3830185990.130820>},
                                            {set_opts,[{active,once}]}}
13:20:52:826901 (<0.931.0>) << {#Ref<0.3355659064.3830185990.130820>,ok}

But the broken one has the ok response from setting the transport opts come after the first chunks of data. Its first message after setting transport_opts is actual data from the socket.

So I think the Transport:setopts(TSock, [{active, once}]), line in hackney_stream.erl or Transport:setopts(Socket, [{active, once}, {packet, 0}]), in hackney_pool.erl is not setting the socket up in time for the first chunk to get there? That was the only different between the traces that I found.

@rozap
Copy link
Contributor

rozap commented Jun 12, 2018

Just going to leave a comment here if anyone else encounters this behavior: the reason for this seemed to be calling stream_next twice in a row without waiting for an ack after the socket was set up. Removing the duplicate call I had seems to have fixed it.

@benoitc
Copy link
Owner

benoitc commented Jun 12, 2018

@rozap you mean 2 consecutive call without receiving the message? can you show a simple snippet for it?

just fyi right after the next release i will revisit the stream code to make it simpler to handle, but if we can fix that issue firsy/

@joshchernoff
Copy link

Seen with
hackney 1.15.1
elixir 1.9.0
erlang 22.0.5

@spieker
Copy link

spieker commented Jul 30, 2019

I get the same error with

hackney 1.15.1
elixir 1.9.0
erlang 22.0.7

It works with

hackney 1.15.1
elixir 1.6.5
erlang 20.3.8.22

@paveltyk
Copy link

paveltyk commented Aug 7, 2019

Leaking messages with:

Elixir 1.8.2
OTP 21.3.8.6
HTTPoison 1.5.1
Hackney 1.15.1

@apr94
Copy link

apr94 commented Oct 23, 2019

Leaking messages with:

Elixir 1.9.2
OTP 22.1
HTTPoison 1.5.0
Hackney 1.15.2

If it makes sense, I can take a look at it. Also, is there any pattern of usage that sidesteps the issue?

@astutecat
Copy link

We're getting this when using it with oc_reporter (see @bipthelin 's PR above from september) - the failures are happening frequently enough that it brings down the VM. We're keem to see a fix for this too.

@bipthelin
Copy link
Contributor

We're getting this when using it with oc_reporter (see @bipthelin 's PR above from september) - the failures are happening frequently enough that it brings down the VM. We're keem to see a fix for this too.

This commit fixed it for us: opencensus-beam/oc_google_reporter@bb578a5

It seems that if you don't close hackney either explicitly with hackney:close(Ref) or implicitly by reading the body the pool will eventually get exhausted. This seems to be a problem in how hackney get's invoked and it should probably be some info about it on the README

@utf4
Copy link

utf4 commented Mar 19, 2020

Hi @benoitc
any update about
Unexpected message: {:ssl_closed, {:sslsocket, {:gen_tcp, #Port<0.9056>, :tls_connection, :undefined}, #PID<0.10027.10>}}

@rbino
Copy link

rbino commented May 21, 2020

Hi @benoitc,
I just came across this bug in a production system while using HTTPoison, is it fixed in new hackney versions or is it still an open bug?

@rozap
Copy link
Contributor

rozap commented Jun 10, 2020

Looks like this is back in Erlang 23 and the newest version of hackney, but this time it happens for streaming and non-streaming interfaces.

For the streaming case, it is easy to reproduce it by running the following:

{:ok, ref} = :hackney.get("https://data.seattle.gov/api/views/tpvk-5fr3/rows.csv?accessType=DOWNLOAD", [{"Accept-Encoding", "gzip, deflate"}], <<>>, [
  async: :once
])

Enum.each(0..1000, fn _ ->
  # receive do
  #   {:hackney_response, ^ref, _} -> :ok
  # end
  :hackney.stream_next(ref)
end)

Uncommenting the receive block "fixes it" but it's clear there's a race. Reading the code, it looks like in hackney_stream.erl there is mutual recursion between maybe_continue and async_recv. IMO maybe_continue shouldn't have a catch-all case for messages that will be handled by async_recv, such as messages from the transport. This Commit seems to fix it, but I'm not super confident in that change, so I haven't made a pull request.

That change doesn't fix the non-streaming interface though. I'm going to continue to look into that.

@rozap
Copy link
Contributor

rozap commented Jun 10, 2020

I've traced the crash a bit for the non-async path. Here is a trace where things behave correctly.

iex(1)> (<10835.7576.1>) << {'$gen_call',
                        {<10835.7575.1>,
                         #Ref<10835.1445857441.4079484930.224997>},
                        {application_data,
                            [<<"POST /?delete HTTP/1.1\rUser-Agent: hackney/1.16.0\r\nContent-Type: application/octet-stream\r\nContent-Length: 275\r\n\r\n<?xml version=\"1.0\" encoding=\"UTF-8\"?><Delete><Object><Key>transf/CDE7038DE4-7656989_0</Key></Object><Object><Key>transf/F90CD44682-7656986_0</Key></Object><Object><Key>transf/9B5A8DD598-7656988_0</Key></Object><Object><Key>in_col/B4C3750951-6770726_0</Key></Object></Delete>">>]}}
iex(1)> (<10835.7576.1>) << {inet_reply,#Port<10835.16914>,ok}
iex(1)> (<10835.7576.1>) <10835.7575.1> ! {#Ref<10835.1445857441.4079484930.224997>,
                                   ok}
iex(1)> (<10835.7575.1>) << {#Ref<10835.1445857441.4079484930.224997>,ok}
iex(1)> (<10835.7575.1>) <10835.7577.1> ! {'$gen_call',
                                      {<10835.7575.1>,
                                       #Ref<10835.1445857441.4079747073.42893>},
                                      {recv,0,10000}}
iex(1)> (<10835.7575.1>) << {#Ref<10835.1445857441.4079747073.42893>,
                     {ok,<<"HTTP/1.1 200 OK\rConnection: close\r\nContent-Type: application/xml\r\nTransfer-Encoding: chunked\r\nServer: AmazonS3\r\n\r\n158\r\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<DeleteResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\"><Deleted><Key>in_col/B4C3750951-6770726_0</Key></Deleted><Deleted><Key>transf/CDE7038DE4-7656989_0</Key></Deleted><Deleted><Key>transf/9B5A8DD598-7656988_0</Key></Deleted><Deleted><Key>transf/F90CD44682-7656986_0</Key></Deleted></DeleteResult>\r\n">>}}
iex(1)> (<10835.7575.1>) <10835.7577.1> ! {'$gen_call',
                                      {<10835.7575.1>,
                                       #Ref<10835.1445857441.4079747073.42919>},
                                      {recv,0,10000}}
iex(1)> (<10835.7575.1>) << {#Ref<10835.1445857441.4079747073.42919>,
                     {ok,<<"0\r\n\r\n">>}}
iex(1)> (<10835.7575.1>) <10835.7577.1> ! {'$gen_call',
                                      {<10835.7575.1>,
                                       #Ref<10835.1445857441.4079484930.225001>},
                                      {close,5000}}
iex(1)> (<10835.7576.1>) << {'$gen_call',
                        {<10835.7578.1>,
                         #Ref<10835.1445857441.4079484930.225005>},
                        {ack_alert,
                            {alert,1,0,
                                #{file => "ssl_connection.erl",line => 2669,
                                  mfa => {ssl_connection,terminate_alert,1}},
                                undefined,undefined}}}
iex(1)> (<10835.7576.1>) << {inet_reply,#Port<10835.16914>,ok}
iex(1)> (<10835.7576.1>) <10835.7578.1> ! {#Ref<10835.1445857441.4079484930.225005>,
                                   ok}
iex(1)> (<10835.7575.1>) << {#Ref<10835.1445857441.4079484930.225001>,ok}
iex(1)> (<10835.7575.1>) << {'DOWN',#Ref<10835.1445857441.4079484930.225001>,process,
                            <10835.7577.1>,
                            {shutdown,normal}}
iex(1)> (<10835.7575.1>) hackney_manager ! {'$gen_cast',
                                    {cancel_request,
                                     #Ref<10835.1445857441.4079484930.224990>}}
iex(1)> (<10835.7575.1>) <10835.20328.0> ! {#Ref<10835.1445857441.4079747073.42878>,
                                    ok}
iex(1)> (<10835.20328.0>) << {#Ref<10835.1445857441.4079747073.42878>,ok}
iex(1)> (<10835.20328.0>) << {'DOWN',#Ref<10835.1445857441.4079747073.42878>,process,
                             <10835.7575.1>,normal}

And here is one where we don't handle the ssl_closed message correctly and it gets leaked to the calling process

(<10835.7580.1>) << {'$gen_call',
                        {<10835.20328.0>,
                         #Ref<10835.1445857441.4079747073.42984>},
                        {application_data,
                            [<<"POST /?delete HTTP/1.1\r\nUser-Agent: hackney/1.16.0\r\nContent-Type: application/octet-stream\r\nContent-Length: 149\r\n\r\n<?xml version=\"1.0\" encoding=\"UTF-8\"?><Delete><Object><Key>src-chunk-569482-0-WDfi2x4aa2xA2jB4d46evwbs3eBffN4m3aoOweud4Z4.bin</Key></Object></Delete>">>]}}
(<10835.7580.1>) << {inet_reply,#Port<10835.16915>,ok}
(<10835.7580.1>) <10835.20328.0> ! {#Ref<10835.1445857441.4079747073.42984>,
                                    ok}
(<10835.20328.0>) << {#Ref<10835.1445857441.4079747073.42984>,ok}
(<10835.20328.0>) <10835.7581.1> ! {'$gen_call',
                                       {<10835.20328.0>,
                                        #Ref<10835.1445857441.4079747073.42985>},
                                       {recv,0,10000}}
(<10835.20328.0>) << {#Ref<10835.1445857441.4079747073.42985>,
                      {ok,<<"HTTP/1.1 200 OK\r\nDate: Wed, 10 Jun 2020 20:59:28 GMT\r\nConnection: close\r\nContent-Type: application/xml\r\nTransfer-Encoding: chunked\r\nServer: AmazonS3\r\n\r\n">>}}
(<10835.20328.0>) <10835.7581.1> ! {'$gen_call',
                                       {<10835.20328.0>,
                                        #Ref<10835.1445857441.4079747073.43010>},
                                       {recv,0,10000}}
(<10835.20328.0>) << {#Ref<10835.1445857441.4079747073.43010>,
                      {ok,<<"d4\r\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<DeleteResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\"><Deleted><Key>src-chunk-569482-0.bin</Key></Deleted></DeleteResult>\r\n0\r\n\r\n">>}}
(<10835.20328.0>) << {ssl_closed,
                         {sslsocket,
                             {gen_tcp,#Port<10835.16915>,tls_connection,
                                 undefined},
                             [<10835.7581.1>,<10835.7580.1>]}}
(<10835.20328.0>) <10835.7581.1> ! {'$gen_call',
                                       {<10835.20328.0>,
                                        #Ref<10835.1445857441.4079484930.225023>},
                                       {close,5000}}
(<10835.7580.1>) << {'$gen_call',
                        {<10835.7582.1>,
                         #Ref<10835.1445857441.4079747073.43015>},
                        {ack_alert,
                            {alert,1,0,
                                #{file => "ssl_connection.erl",line => 2669,
                                  mfa => {ssl_connection,terminate_alert,1}},
                                undefined,undefined}}}
(<10835.7580.1>) << {inet_reply,#Port<10835.16915>,ok}
(<10835.7580.1>) <10835.7582.1> ! {#Ref<10835.1445857441.4079747073.43015>,ok}
(<10835.20328.0>) << {'DOWN',#Ref<10835.1445857441.4079484930.225023>,process,
                             <10835.7581.1>,
                             {shutdown,peer_close}}
(<10835.20328.0>) hackney_manager ! {'$gen_cast',
                                     {cancel_request,
                                      #Ref<10835.1445857441.4079747073.42966>}}

In the bad trace (2nd one), it looks like the sslsocket gets closed automatically, before we call close explicitly and get the alert. In the other instance, I see the genserver call to close and then the ssl closed alert (generated on ssl_connection.erl:2669).

When we do call close, we get back {shutdown, peer_close} instead of {shutdown, normal} in the first trace, suggesting it's already closed. The peer_close alert is coming from here https://github.com/erlang/otp/blob/9bca07b045b5fe6fe958f38b25585a0f50b326bb/lib/ssl/src/ssl_connection.erl#L396

So ultimately my understanding is that the server closed the connection before the client, which leads to this message not getting handled and leaking into the caller. But may not be interpreting it correctly.

@rozap
Copy link
Contributor

rozap commented Jun 12, 2020

Fix submitted as #640

peburrows added a commit to brexhq/statsig_ex that referenced this issue Sep 20, 2024
We're seeing something similar to
benoitc/hackney#464 that is causing the
genserver to crash, because it's not handling a message that seems to be
leaking from hackney / httpoison. This change should prevent that crash
from happening.
peburrows added a commit to brexhq/statsig_ex that referenced this issue Sep 26, 2024
We're seeing something similar to
benoitc/hackney#464 that is causing the
genserver to crash, because it's not handling a message that seems to be
leaking from hackney / httpoison. This change should prevent that crash
from happening.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment