
vibe.d bug with http requests (previously: memory corruption bug) #1395

Closed
AndrejMitrovic opened this issue Nov 30, 2020 · 48 comments
Labels
C.General An issue which doesn't fit in the other categories (not blockchain related) prio-URGENT Affects the current system and need immediate attention type-bug Things don't work as they were intended to, or the way they were intended to work doesn't make sense

Comments

@AndrejMitrovic
Contributor

AndrejMitrovic commented Nov 30, 2020

node-2_1 | 2020-11-30 09:24:40,909 Info [agora.network.NetworkManager] - Net time offsets: [{ key: GDNODE6ZXW2NNOOQIGN24MBEZRO5226LSMHGQA3MUAMYQSTJVR7XT6GH, node_time: 1, req_delay: 0, offset: -1606728279 }, { key: GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM, node_time: 1, req_delay: 0, offset: -1606728279 }, { key: GDNODE2IMTDH7SZHXWDS24EZCMYCEJMRZWB3S4HLRIUP6UNGKVVFLVHQ, node_time: 1606728279, req_delay: 0, offset: 0 }, { key: GDNODE4KTE7VQUHVBLXIGD7VEFY57X4XV547P72D37SDG7UEO7MWOSNY, node_time: 1606728280, req_delay: 0, offset: 0 }, { key: GDNODE6ZXW2NNOOQIGN24MBEZRO5226LSMHGQA3MUAMYQSTJVR7XT6GH, node_time: 1606728280, req_delay: 0, offset: 0 }, { key: GDNODE7J5EUK7T6HLEO2FDUBWZEXVXHJO7C4AF5VZAKZENGQ4WR3IX2U, node_time: 1606728280, req_delay: 0, offset: 0 }, { key: GDNODE7J5EUK7T6HLEO2FDUBWZEXVXHJO7C4AF5VZAKZENGQ4WR3IX2U, node_time: 1606728280, req_delay: 0, offset: 0 }]
node-2_1 | 2020-11-30 09:24:40,960 Info [agora.network.NetworkManager] - Retrieving blocks [2..1606728280] from http://node-6:6826..

Look closely: the newly added node_time logging emits the same number, 1606728280, that appears in the "Retrieving blocks" log line. There seems to be some memory corruption propagating everywhere.

I also added some additional logging and ran into this:

node-4_1  | RECEIVED INVALID PUBLIC KEY: '' from submodules/vibe.d/data/vibe/data/serialization.d(739)

This is in Key.fromString (scope const(char)[] str). There is probably a separate bug here: we're using assertions in place of exceptions inside this function. But it's also highly unusual that legitimate local nodes are sending empty JSON strings.
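
For illustration, a minimal sketch (type name and body invented, not the actual agora code) of validating the input with an exception instead of an assertion:

import std.exception : enforce;

struct Key
{
	ubyte[32] data;

	/// Hypothetical sketch: reject bad network input with an exception.
	/// An assert would be compiled out in release builds and would abort
	/// the whole node in debug builds.
	public static Key fromString (scope const(char)[] str)
	{
		enforce(str.length > 0, "Received an empty public key string");
		// ... base32-decode and checksum-verify `str` here ...
		return Key.init;
	}
}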

So I suspect there is a memory corruption bug somewhere.

@AndrejMitrovic
Contributor Author

Look closely: the newly added node_time logging emits the same number, 1606728280, that appears in the "Retrieving blocks" log line. There seems to be some memory corruption propagating everywhere.

One hint is that both code locations use static buffers:

https://github.com/bpfkorea/agora/blob/b55d0a6066da78b35fdb649ef86f2b71d272dd90/source/agora/network/NetworkManager.d#L841
https://github.com/bpfkorea/agora/blob/b55d0a6066da78b35fdb649ef86f2b71d272dd90/source/agora/network/NetworkManager.d#L536

I suspect it might be a re-entrancy problem. If the routine does a context switch and is re-entered, the new fiber will overwrite the static array of the old fiber. Meanwhile, the old fiber will later wake up and find its static array in an invalid state.
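
For illustration, a minimal self-contained sketch (names and numbers invented) of how a function-local static array can be clobbered across a fiber context switch; both fibers end up with interleaved contents:

import core.time : msecs;
import std.stdio : writeln;
import vibe.core.core : exitEventLoop, runEventLoop, runTask, sleep;

// One buffer shared by every fiber that enters this routine. The sleep()
// yields to the scheduler, so a second fiber can re-enter and overwrite
// the first fiber's partially built contents.
int[] collectOffsets (int tag)
{
	static int[] buffer;
	buffer.length = 0;
	buffer.assumeSafeAppend();

	foreach (i; 0 .. 3)
	{
		buffer ~= tag;
		sleep(1.msecs); // context switch: another fiber may run now
	}
	return buffer;
}

void main ()
{
	runTask({ writeln("first: ", collectOffsets(1)); });  // not just 1s
	runTask({ writeln("second: ", collectOffsets(2)); exitEventLoop(); });
	runEventLoop();
}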

@AndrejMitrovic
Contributor Author

We should probably avoid using any kind of static storage in a routine which is not re-entrant safe, in other words a routine which does any sort of context switch.

@kchulj kchulj added the type-bug Things don't work as they were intended to, or the way they were intended to work doesn't make sense label Nov 30, 2020
@AndrejMitrovic AndrejMitrovic added this to the 5. CoinNet release milestone Dec 1, 2020
@omerfirmak
Contributor

omerfirmak commented Dec 2, 2020

I feel like this is not a memory corruption bug but a protocol error in vibe.d. It is apparently doing what is called HTTP pipelining, but not handling it properly; Wireshark dumps suggest that. When the client asks for A and B without waiting for a response in between, the server should respond in that order, but the responses come back in reverse order. I can see that responses from getBlockHeight and getLocalTime are getting routed to the wrong waiters. Since they return the same value type, even the deserializer is still happy. When the getLocalTime return value is wrapped in a struct, it is serialized differently, so the error is probably masked somewhere.

The issue where we were seeing a large number of blocks being fetched could very well be because of this.
2020-12-02 22:38:21,173 Info [agora.network.NetworkManager] - Retrieving blocks [5..1606937900] from http://127.0.0.1:6826..

1606937900 is simply a timestamp.
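
To illustrate why the mix-up can go unnoticed, here is a standalone sketch (not agora code; type names invented): two endpoints that both return a bare integer produce JSON that deserializes interchangeably, while a struct-wrapped value has a different JSON shape and fails loudly.

import vibe.data.json : deserializeJson, parseJsonString;

struct WrappedTime { long time; }

void main ()
{
	auto payload = parseJsonString("1606937900"); // a getLocalTime-style reply

	// The same payload is happily accepted as a block height or a timestamp,
	// so a response routed to the wrong waiter deserializes without error.
	auto as_height = deserializeJson!long(payload);
	auto as_time = deserializeJson!long(payload);
	assert(as_height == as_time);

	// Wrapping the value in a struct changes the expected JSON shape, so a
	// mixed-up response throws during deserialization instead of silently
	// producing a bogus value:
	// deserializeJson!WrappedTime(payload); // would throw
}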

@AndrejMitrovic AndrejMitrovic changed the title Memory corruption bug vibe.d bug with http requests (previously: memory corruption bug) Dec 3, 2020
@AndrejMitrovic
Contributor Author

Great find @omerfirmak! Can you file the bug upstream?

I'm guessing the pipelining in vibe.d can't be disabled / worked around?

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I think this 'pipelining' happens because 'http-keepalive' is switched on by default for REST requests. If so, we can probably switch it off until the bug is fixed.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I think it is a bit more complicated than that; from my test I see that the TCP connection is only reused if the previous request has finished, otherwise a new connection is opened:

import core.time : seconds;
import std.stdio : writeln;
import vibe.core.core : runApplication, runTask, sleep;
import vibe.core.log : LogLevel, setLogLevel;
import vibe.http.router : URLRouter;
import vibe.http.server : HTTPServerSettings, listenHTTP;
import vibe.inet.url : URL;
import vibe.web.rest : RestInterfaceClient, RestInterfaceSettings, registerRestInterface;

struct TestStruct
{
	int i;
}

interface IService
{
	@safe TestStruct getTest(int sleepsecs);
}

class Service : IService
{
	@safe TestStruct getTest(int sleepsecs)
	{
		writeln("$$$$$$$$$$$ receiving request with wait: ", sleepsecs);
		sleep(sleepsecs.seconds);
		TestStruct test_struct = {sleepsecs};
		writeln("$$$$$$$$$$$ returning request with wait: ", sleepsecs);
		return test_struct;
	}
}

int testmain()
{
	setLogLevel(LogLevel.trace);
	auto settings = new HTTPServerSettings;
	settings.port = 5000;
	settings.bindAddresses = ["127.0.0.1"];
	auto router = new URLRouter;
	router.registerRestInterface(new Service);
	auto server_addr = listenHTTP(settings, router);
	writeln("*************** before running tasks");
	runTask({
		auto settings = new RestInterfaceSettings();
		settings.baseURL = URL("http://127.0.0.1:5000/");
		try
		{
			auto service_client = new RestInterfaceClient!IService(settings);
			runTask({
				auto res = service_client.getTest(5);
				writeln("############## return res: ", res);
				assert(res.i == 5);
			});
			writeln("*************** before main sleeping");
			sleep(1.seconds);
			writeln("*************** after main sleeping");
			runTask({
				auto res = service_client.getTest(1);
				writeln("############## return res: ", res);
				assert(res.i == 1);
			});

		}
		catch(Exception e)
		{
			writeln("exception in main task");
		}
	});

	runApplication();
	return 0;
}

Result:

./build/app
REST route: GET /test ["sleepsecs"]
add route GET /test
REST route: OPTIONS /test ["sleepsecs"]
add route OPTIONS /test
[main(----) INF] Listening for requests on http://127.0.0.1:5000/
*************** before running tasks
Creating new fiber...
Running event loop...
setup signal handler
Starting event loop.
Creating new fiber...
Create HTTP client pool 127.0.0.1:5000 false proxy :0
creating new HTTPClient connection, all 0 are in use
 ... 7F5281096DC0
Now got 1 connections
*************** before main sleeping
Creating new fiber...
wait shared 55CBCF9E7E58
Creating new fiber...
--------------------
HTTP client request:
--------------------
GET /test?sleepsecs=5 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
HTTP client waiting for response (client)
Socket 9, read 198 bytes: ok (probably server)
reading request.. (server)
HTTP server reading status line (server)
--------------------
HTTP server request:
--------------------
GET /test?sleepsecs=5 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
Got request header.
persist: true
handle request (body 0)
Disambiguate done: 22 nodes, 1 max stack size
Match tree has 7 (of 22 in the builder) nodes, 2 terminals
route match: /test -> GET /test []
$$$$$$$$$$$ receiving request with wait: 5
*************** after main sleeping
Creating new fiber...
creating new HTTPClient connection, all 1 are in use
 ... 7F528109A210
Now got 2 connections
emit shared 7F52810990B0
lw null
emit shared done
--------------------
HTTP client request:
--------------------
GET /test?sleepsecs=1 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
HTTP client waiting for response
Socket 11, read 198 bytes: ok
reading request..
HTTP server reading status line
--------------------
HTTP server request:
--------------------
GET /test?sleepsecs=1 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
Got request header.
persist: true
handle request (body 0)
route match: /test -> GET /test []
$$$$$$$$$$$ receiving request with wait: 1
$$$$$$$$$$$ returning request with wait: 1
---------------------
HTTP server response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:33:45 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
return true
Waiting for next request...
Socket 10, read 161 bytes: ok
HTTP client reading status line
stln: HTTP/1.1 200 OK
---------------------
HTTP client response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:33:45 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
Returning HTTPClientResponse for conn 7F528109A210
REST call: GET http://127.0.0.1:5000/test?sleepsecs=1 -> 200, 
############## return res: TestStruct(1)
emit shared 7F5281099AB0
lw null
emit shared done
$$$$$$$$$$$ returning request with wait: 5
---------------------
HTTP server response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:33:44 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
return true
Waiting for next request...
Socket 7, read 161 bytes: ok
HTTP client reading status line
stln: HTTP/1.1 200 OK
---------------------
HTTP client response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:33:44 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
Returning HTTPClientResponse for conn 7F5281096DC0
REST call: GET http://127.0.0.1:5000/test?sleepsecs=5 -> 200, 
############## return res: TestStruct(5)
emit shared 7F52810994B0
lw null
emit shared done
^C[main(----) INF] Received signal 2. Shutting down.
emit shared 55CBCF9E7E58
waiter 7F5281098C00
lw 7F5281098C00
notify task 7F527DBE6970 55CBCF7DCC80 7F527DBE6808
emit shared done
main thread exit
emit shared 7F52810996B0
lw null
emit shared done

It seems the client used sockets 7 and 10, but if you change the timing so that the first request finishes before the second is issued, then the TCP connection is reused.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

here is the result for non-overlapping requests:

REST route: GET /test ["sleepsecs"]
add route GET /test
REST route: OPTIONS /test ["sleepsecs"]
add route OPTIONS /test
[main(----) INF] Listening for requests on http://127.0.0.1:5000/
*************** before running tasks
Creating new fiber...
Running event loop...
setup signal handler
Starting event loop.
Creating new fiber...
Create HTTP client pool 127.0.0.1:5000 false proxy :0
creating new HTTPClient connection, all 0 are in use
 ... 7F68BCECADC0
Now got 1 connections
*************** before main sleeping
Creating new fiber...
wait shared 557A8D6C3E58
Creating new fiber...
--------------------
HTTP client request:
--------------------
GET /test?sleepsecs=1 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
HTTP client waiting for response
Socket 9, read 198 bytes: ok
reading request..
HTTP server reading status line
--------------------
HTTP server request:
--------------------
GET /test?sleepsecs=1 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
Got request header.
persist: true
handle request (body 0)
Disambiguate done: 22 nodes, 1 max stack size
Match tree has 7 (of 22 in the builder) nodes, 2 terminals
route match: /test -> GET /test []
$$$$$$$$$$$ receiving request with wait: 1
$$$$$$$$$$$ returning request with wait: 1
---------------------
HTTP server response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:52:12 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
return true
Waiting for next request...
Socket 7, read 161 bytes: ok
HTTP client reading status line
stln: HTTP/1.1 200 OK
---------------------
HTTP client response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:52:12 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
Returning HTTPClientResponse for conn 7F68BCECADC0
REST call: GET http://127.0.0.1:5000/test?sleepsecs=1 -> 200, 
############## return res: TestStruct(1)
emit shared 7F68BCECD4B0
lw null
emit shared done
*************** after main sleeping
returning HTTPClient connection 0 of 1
--------------------
HTTP client request:
--------------------
GET /test?sleepsecs=5 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
HTTP client waiting for response
emit shared 7F68BCECD0B0
lw null
emit shared done
Socket 9, read 198 bytes: ok
reading request..
HTTP server reading status line
--------------------
HTTP server request:
--------------------
GET /test?sleepsecs=5 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
Got request header.
persist: true
handle request (body 0)
route match: /test -> GET /test []
$$$$$$$$$$$ receiving request with wait: 5
$$$$$$$$$$$ returning request with wait: 5
---------------------
HTTP server response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:52:15 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
return true
Waiting for next request...
Socket 7, read 161 bytes: ok
HTTP client reading status line
stln: HTTP/1.1 200 OK
---------------------
HTTP client response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:52:15 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
Returning HTTPClientResponse for conn 7F68BCECADC0
REST call: GET http://127.0.0.1:5000/test?sleepsecs=5 -> 200, 
############## return res: TestStruct(5)
emit shared 7F68BCECD4B0
lw null
emit shared done

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I agree with Andrej that this might be a memory corruption issue: using assumeSafeAppend is very suspicious to me, and this can certainly cause memory corruption if getNetTimeOffset or getBlocksFrom are re-entered.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

startPeriodicCatchup calling getBlocksFrom can cause re-entrancy, for example.

@AndrejMitrovic
Contributor Author

I agree with Andrej that this might be a memory corruption issue: using assumeSafeAppend is very suspicious to me, and this can certainly cause memory corruption if getNetTimeOffset or getBlocksFrom are re-entered.

I've tried avoiding using buffers and always allocating, but the problem still persists.

@AndrejMitrovic
Contributor Author

However yes, there is also the case of re-entrancy. I think we're looking at two bugs here, one of which is symptomatic.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I agree with Andrej that this might be a memory corruption issue: using assumeSafeAppend is very suspicious to me, and this can certainly cause memory corruption if getNetTimeOffset or getBlocksFrom are re-entered.

I've tried avoiding using buffers and always allocating, but the problem still persists.

I suppose you removed the static from both the TimeInfo[] offsets; and Pair[] node_pairs;

@AndrejMitrovic
Contributor Author

Yep both of them.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

There are actually quite a lot of local static arrays in our codebase, if you search for 'static \w+\s*\[\]' in vscode; I'm just hoping none of them are in re-entrant code paths.

@AndrejMitrovic
Contributor Author

We'll have to fix many of those. But also try to minimize unnecessary reentrancy.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

@omerfirmak I think if you changed the return type of getBlockHeight from int to string temporarily, that would put your theory to the test: vibe.d would then fail at the serialization level.

From my limited test, I don't see vibe.d using pipelining, but it does reuse TCP connections.

@omerfirmak
Contributor

I changed getLocalTime return value to string and now getLocalTime returns public keys :)

2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:54,169 Info [agora.consensus.protocol.Nominator] - combineCandidates: 1
2020-12-03 16:44:54,915 Info [agora.network.NetworkManager] - glt_ret: GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I changed getLocalTime return value to string and now getLocalTime returns public keys :)

2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:54,169 Info [agora.consensus.protocol.Nominator] - combineCandidates: 1
2020-12-03 16:44:54,915 Info [agora.network.NetworkManager] - glt_ret: GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM

did you also remove the static from both array?

do you have some simplified code to reproduce the issue? I could probably fix it if it is in vibe.d.

@omerfirmak
Contributor

I changed getLocalTime return value to string and now getLocalTime returns public keys :)

2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:54,169 Info [agora.consensus.protocol.Nominator] - combineCandidates: 1
2020-12-03 16:44:54,915 Info [agora.network.NetworkManager] - glt_ret: GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM

did you also remove the static from both array?

do you have some simplified code to reproduce the issue? I could probably fix it if it is in vibe.d.

Tried it again with non-static arrays, still the same.
Sadly I dont have a minimal example, let me see what I can do.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

Sadly, this issue doesn't happen with the deterministic version of the system test (system_integration_test.d).

@omerfirmak
Contributor

omerfirmak commented Dec 3, 2020

[main(GGqk) INF] Rest http://127.0.0.1:2826//public_key GET Ret: "1607011945"
[main(o+0v) INF] Rest http://127.0.0.1:2826//local_time GET Ret: "GDNODE2IMTDH7SZHXWDS24EZCMYCEJMRZWB3S4HLRIUP6UNGKVVFLVHQ"

Logged at: rest.d executeClientMethod()

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

Nice... I think both PublicKey and now LocalTime are serialized like
{"ACTUAL STRING"}, so they can still be interchanged for REST.

I am also starting to think the messages are somehow being interchanged... when Andrej changed the return value to a struct, it finally had a different serialization format, so no more interchanging...

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

So previously the two REST methods both returning int were interchanged; now the two methods returning strings are interchanged.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I have a very wild theory of what is happening here:

  1. fiberA on the client side calls getPublicKey and gets assigned the TCP socket with id 5 (with port number 1111)
  2. the server answers, but fiberA is not scheduled yet, so it cannot read the answer
  3. after 10 seconds (the keepalive interval) the server closes the TCP connection
  4. this reaches the client OS, which frees socket id 5 (and port 1111)
  5. fiberA still sleeps
  6. fiberB comes along, issues getTime and gets assigned TCP socket id 5 AGAIN (with port number 2222)
  7. fiberB goes to sleep
  8. fiberA wakes up and reads data that was intended for fiberB

If the serialized format is the same, then it succeeds; otherwise it doesn't. This would explain Andrej's weird result that wrapping time_t into a struct seems to solve the issue. (I think if we switched on trace-level logging, we would see a lot of REST queries failing and being retried.)

This is just one scenario; here is another (same steps, except number 2):

  1. fiberA on the client side calls getPublicKey and gets assigned the TCP socket with id 5 (with port number 1111)
  2. a network error happens and the server closes the TCP connection
  3. this reaches the client OS, which frees socket id 5 (and port 1111)
  4. fiberA still sleeps
  5. fiberB comes along, issues getTime and gets assigned TCP socket id 5 AGAIN (with port number 2222)
  6. fiberB goes to sleep
  7. fiberA wakes up and reads data that was intended for fiberB

The key is that the same socket id is reused and fiberA is not woken up when the server side closes the connection (I have to verify both; it might just be some rubbish theory).

@bpalaggi bpalaggi added the C.General An issue which doesn't fit in the other categories (not blockchain related) label Dec 4, 2020
@omerfirmak
Contributor

@bpalaggi , Mathias asked @ferencdg to complete this. So I am assigning him too.

@omerfirmak
Contributor

omerfirmak commented Dec 4, 2020

@ferencdg , I have also seen cases where subsequent calls to getPublicKey and getLocalTime receive the same result. In fact, I think it is the more common case. If a fiber reads a value that was intended for another fiber, wouldn't it consume the data for good? The other fiber would not be able to read the same data.

REST call: GET http://127.0.0.1:3826/public_key -> 200, "GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM"
REST call: GET http://127.0.0.1:3826/local_time -> 200, "GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM"

@omerfirmak
Contributor

https://gist.github.com/omerfirmak/89d170556a83be4f514b9218ee3358c3

Towards the end, it can be seen that multiple requests receive unrelated JSON responses, not just the ones with matching return values.

@omerfirmak
Contributor

omerfirmak commented Dec 4, 2020

So this is the first request that gets a wrong response. It uses connection 1; if you look a couple of requests back, at this line, a getPublicKey request gets sent on the same connection but never returns. The response to this getPublicKey is treated as the response to the receive_envelope. Then, again here, another getPublicKey gets sent over the same connection and receives the response for the previous receive_envelope.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

So this is the first request that gets a wrong response. It uses connection 1; if you look a couple of requests back, at this line, a getPublicKey request gets sent on the same connection but never returns. The response to this getPublicKey is treated as the response to the receive_envelope. Then, again here, another getPublicKey gets sent over the same connection and receives the response for the previous receive_envelope.

Hmmm, I think this is very similar to my theory, but instead of reusing the raw socket IDs, we reuse the connections from the connection pool.

I looked at the timeouts and retries, everything in milliseconds:
read timeout: 100
wait between retries: 50
Additionally we have
max retry number: 25

So getPublicKey takes at most (100+50)*25 = 3750 msec, less than the 10 seconds needed for the server to close the TCP connection.

I think in our case the server took longer than 3750 msec to reply and the original client gave up trying.

I think if we push the total timeout on the client side above 10 secs this will make the issue go away, for example by changing the read timeout to 450. This can be a temporary solution until the issue is fixed.
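
For reference, a quick sanity check of that retry budget (figures taken from the numbers above; the 450 ms value is the proposed workaround):

// Worst case per call: (read timeout + retry delay) * retry count.
enum read_timeout_msecs = 100;
enum retry_delay_msecs = 50;
enum max_retries = 25;

static assert ((read_timeout_msecs + retry_delay_msecs) * max_retries == 3_750);

// Raising the read timeout to 450 msecs pushes the budget past the server's
// 10-second keep-alive, so the client keeps waiting instead of abandoning
// the connection:
static assert ((450 + retry_delay_msecs) * max_retries == 12_500);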

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

or we just switch off http-keepalive

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

@omerfirmak thanks a lot for running the tests; they never really ran correctly on my machine.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Setting defaultKeepAliveTimeout = 0 would work, I think.
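
A hedged sketch of what that could look like on the REST client side (field names are from vibe.d's HTTPClientSettings / RestInterfaceSettings; double-check them against the version we pin):

import core.time : seconds;
import vibe.http.client : HTTPClientSettings;
import vibe.inet.url : URL;
import vibe.web.rest : RestInterfaceSettings;

RestInterfaceSettings makeNoKeepAliveSettings (string base_url)
{
	// Zero keep-alive: the client drops the connection after each request,
	// so a stale response can never be handed to the next caller.
	auto http_settings = new HTTPClientSettings;
	http_settings.defaultKeepAliveTimeout = 0.seconds;

	auto rest_settings = new RestInterfaceSettings;
	rest_settings.baseURL = URL(base_url);
	rest_settings.httpClientSettings = http_settings;
	return rest_settings;
}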

@omerfirmak
Contributor

So this is the first request that gets a wrong response. It uses connection 1; if you look a couple of requests back, at this line, a getPublicKey request gets sent on the same connection but never returns. The response to this getPublicKey is treated as the response to the receive_envelope. Then, again here, another getPublicKey gets sent over the same connection and receives the response for the previous receive_envelope.

Hmmm, I think this is very similar to my theory, but instead of reusing the raw socket IDs, we reuse the connections from the connection pool.

I looked at the timeouts and retries, everything in milliseconds:
read timeout: 100
wait between retries: 50
Additionally we have
max retry number: 25

So getPublicKey takes at most (100+50)*25 = 3750 msec, less than the 10 seconds needed for the server to close the TCP connection.

I think in our case the server took longer than 3750 msec to reply and the original client gave up trying.

I think if we push the total timeout on the client side above 10 secs this will make the issue go away, for example by changing the read timeout to 450. This can be a temporary solution until the issue is fixed.

I think we should disable keepalive altogether; increasing the timeout is not really reliable IMO.
What do you think the ultimate fix should be? Would it work to not reuse connections on which a request has timed out?

@omerfirmak
Contributor

Running 2 versions of this (-version=node1 and -version=node2) simultaneously triggers the bug right away.

@omerfirmak
Contributor

omerfirmak commented Dec 4, 2020

diff --git a/http/vibe/http/client.d b/http/vibe/http/client.d
index ddb36978..0d38d23b 100644
--- a/http/vibe/http/client.d
+++ b/http/vibe/http/client.d
@@ -538,6 +538,7 @@ final class HTTPClient {
 			scope request_allocator = new RegionListAllocator!(shared(Mallocator), false)(1024, Mallocator.instance);
 		else
 			scope request_allocator = new RegionListAllocator!(shared(GCAllocator), true)(1024, GCAllocator.instance);
+		scope (failure) disconnect();
 
 		bool close_conn;
 		SysTime connected_time;
@@ -554,10 +555,7 @@ final class HTTPClient {
 		Exception user_exception;
 		while (true)
 		{
-			scope (failure) {
-				m_responding = false;
-				disconnect();
-			}
+			scope (failure) m_responding = false;
 			try responder(res);
 			catch (Exception e) {
 				logDebug("Error while handling response: %s", e.toString().sanitize());

This is the simplest fix I can think of: when a request times out (which throws an exception), disconnect and don't reuse the connection. There may be a need to do additional cleanup, though.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Sadly, I also don't have any better idea... The request that reuses the connection of an already timed-out request will have no idea whether the data it received is a response to its own request or to the request that timed out... unless we use some kind of ID in the request and response.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Maybe it is better if you or I raise the ticket on vibe.d, propose the solution, and see what Sonke suggests. Would you like to report it to vibe.d or should I do that? I think we would also need to post a minimal example to reproduce it.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Another solution would be to return the 'REST method name' in an HTTP response header from the REST server along with the body, so we know which method the response corresponds to. If we do this we don't need to modify the vibe-core project.
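
Roughly, the idea would look like this (header name and handler invented for illustration; the real change would live inside vibe.d's REST code):

import vibe.http.server : HTTPServerRequest, HTTPServerResponse;

// Server side: tag the response with the method that produced it.
void getLocalTimeHandler (HTTPServerRequest req, HTTPServerResponse res)
{
	res.headers["X-Rest-Method"] = "getLocalTime";
	res.writeJsonBody(1_607_003_093L);
}

// Client side (pseudocode): after reading a response, compare the header
// against the method that was actually called, and discard mismatches
// instead of deserializing them:
//
//     if (res.headers.get("X-Rest-Method", "") != expected_method)
//         throw new Exception("stale response from a reused connection");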

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

The advantage of the solution I proposed is that we don't have to close the TCP connection when we encounter a wrong message.

@omerfirmak
Contributor

@ferencdg , I think you should open the ticket; you have been working on vibe.d already. But I think this should definitely be fixed in vibe.d. This is clearly a bug on their end; I may even call it a big one :) Anything we do in Agora would be a 'hack' IMO.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Yeah, it seems like a really big one (and yes, we don't need to hack around it in Agora; even my suggested solution is entirely in vibe.d).

@omerfirmak
Contributor

omerfirmak commented Dec 4, 2020

unless we use some kind of ID in the request and response.

Another solution would be to return the 'REST method name' in an HTTP response header from the REST server along with the body, so we know which method the response corresponds to. If we do this we don't need to modify the vibe-core project.

Well these assume both server and client use vibe.d, right? If I am not mistaken, a vibe.d client with a non-vibe.d server could still be a problem. We should target the general case.

Let's just raise the issue and see what the vibe.d people say 😄 I think we can use the example I posted above.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

unless we use some kind of ID in the request and response.

Another solution would be to return the 'REST method name' in an HTTP response header from the REST server along with the body, so we know which method the response corresponds to. If we do this we don't need to modify the vibe-core project.

Well these assume both server and client use vibe.d, right? If I am not mistaken, a vibe.d client with a non-vibe.d server could still be a problem. We should target the general case.

Let's just raise the issue and see what the vibe.d people say 😄 I think we can use the example I posted above.

Yeah, we might end up with both solutions implemented, depending on how much people hate dropping TCP connections.

@omerfirmak
Contributor

Related: vibe-d/vibe.d#2342

@ferencdg
Contributor

ferencdg commented Jan 6, 2021

done by: vibe-d/vibe.d#2507

@ferencdg ferencdg closed this as completed Jan 6, 2021
@Geod24
Collaborator

Geod24 commented Jan 19, 2021

I was greeted by an old friend today:

node-0_1  | 2021-01-19 03:15:17,290 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,290 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,408 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,424 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,448 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:25,886 Error [agora.network.NetworkManager] - Couldn't retrieve blocks: Request failure to http://node-5:5826 after 25 attempts. Will try again later..

I'll see if I can reproduce reliably.

@omerfirmak
Contributor

Nothing stays fixed for too long

@omerfirmak
Contributor

I was greeted by an old friend today:

node-0_1  | 2021-01-19 03:15:17,290 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,290 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,408 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,424 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,448 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:25,886 Error [agora.network.NetworkManager] - Couldn't retrieve blocks: Request failure to http://node-5:5826 after 25 attempts. Will try again later..

I'll see if I can reproduce reliably.

18446744073709551615 is ulong.max, so I don't think it was related to this bug.
