
vibe.d bug with http requests (previously: memory corruption bug) #1395

Closed
AndrejMitrovic opened this issue Nov 30, 2020 · 48 comments
Labels
C.General An issue which doesn't fit in the other categories (not blockchain related) prio-URGENT Affects the current system and need immediate attention type-bug Things don't work as they were intended to, or the way they were intended to work doesn't make sense

Comments

@AndrejMitrovic
Contributor

AndrejMitrovic commented Nov 30, 2020

node-2_1 | 2020-11-30 09:24:40,909 Info [agora.network.NetworkManager] - Net time offsets: [{ key: GDNODE6ZXW2NNOOQIGN24MBEZRO5226LSMHGQA3MUAMYQSTJVR7XT6GH, node_time: 1, req_delay: 0, offset: -1606728279 }, { key: GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM, node_time: 1, req_delay: 0, offset: -1606728279 }, { key: GDNODE2IMTDH7SZHXWDS24EZCMYCEJMRZWB3S4HLRIUP6UNGKVVFLVHQ, node_time: 1606728279, req_delay: 0, offset: 0 }, { key: GDNODE4KTE7VQUHVBLXIGD7VEFY57X4XV547P72D37SDG7UEO7MWOSNY, node_time: 1606728280, req_delay: 0, offset: 0 }, { key: GDNODE6ZXW2NNOOQIGN24MBEZRO5226LSMHGQA3MUAMYQSTJVR7XT6GH, node_time: 1606728280, req_delay: 0, offset: 0 }, { key: GDNODE7J5EUK7T6HLEO2FDUBWZEXVXHJO7C4AF5VZAKZENGQ4WR3IX2U, node_time: 1606728280, req_delay: 0, offset: 0 }, { key: GDNODE7J5EUK7T6HLEO2FDUBWZEXVXHJO7C4AF5VZAKZENGQ4WR3IX2U, node_time: 1606728280, req_delay: 0, offset: 0 }]
node-2_1 | 2020-11-30 09:24:40,960 Info [agora.network.NetworkManager] - Retrieving blocks [2..1606728280] from http://node-6:6826..

Look closely: the newly added node_time logging emits the same number, 1606728280, that appears in the "Retrieving blocks" log line. There seems to be some memory corruption propagating everywhere.

I also added some additional logging and ran into this:

node-4_1  | RECEIVED INVALID PUBLIC KEY: '' from submodules/vibe.d/data/vibe/data/serialization.d(739)

This is in Key.fromString (scope const(char)[] str). There is probably a separate bug here: we're using assertions in place of exceptions inside this function. But it's also highly unusual that legitimate local nodes are sending empty JSON strings.
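
For illustration, a minimal sketch (type name and body invented, not the actual agora code) of validating the input with an exception instead of an assertion:

import std.exception : enforce;

struct Key
{
	ubyte[32] data;

	/// Hypothetical sketch: reject bad network input with an exception.
	/// An assert would be compiled out in release builds and would abort
	/// the whole node in debug builds.
	public static Key fromString (scope const(char)[] str)
	{
		enforce(str.length > 0, "Received an empty public key string");
		// ... base32-decode and checksum-verify `str` here ...
		return Key.init;
	}
}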

So I suspect there is a memory corruption bug somewhere.

@AndrejMitrovic
Contributor Author

Look closely: the newly added node_time logging emits the same number, 1606728280, that appears in the "Retrieving blocks" log line. There seems to be some memory corruption propagating everywhere.

One hint is that both code locations use static buffers:

https://github.com/bpfkorea/agora/blob/b55d0a6066da78b35fdb649ef86f2b71d272dd90/source/agora/network/NetworkManager.d#L841
https://github.com/bpfkorea/agora/blob/b55d0a6066da78b35fdb649ef86f2b71d272dd90/source/agora/network/NetworkManager.d#L536

I suspect it might be a re-entrancy problem. If the routine does a context switch and is re-entered, the new fiber will overwrite the static array of the old fiber. Meanwhile, the old fiber will later wake up and find its static array in an invalid state.
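
For illustration, a minimal self-contained sketch (names and numbers invented) of how a function-local static array can be clobbered across a fiber context switch; both fibers end up with interleaved contents:

import core.time : msecs;
import std.stdio : writeln;
import vibe.core.core : exitEventLoop, runEventLoop, runTask, sleep;

// One buffer shared by every fiber that enters this routine. The sleep()
// yields to the scheduler, so a second fiber can re-enter and overwrite
// the first fiber's partially built contents.
int[] collectOffsets (int tag)
{
	static int[] buffer;
	buffer.length = 0;
	buffer.assumeSafeAppend();

	foreach (i; 0 .. 3)
	{
		buffer ~= tag;
		sleep(1.msecs); // context switch: another fiber may run now
	}
	return buffer;
}

void main ()
{
	runTask({ writeln("first: ", collectOffsets(1)); });  // not just 1s
	runTask({ writeln("second: ", collectOffsets(2)); exitEventLoop(); });
	runEventLoop();
}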

@AndrejMitrovic
Contributor Author

We should probably avoid using any kind of static storage in a routine which is not re-entrant safe, in other words a routine which does any sort of context switch.

@kchulj kchulj added the type-bug Things don't work as they were intended to, or the way they were intended to work doesn't make sense label Nov 30, 2020
@AndrejMitrovic AndrejMitrovic added this to the 5. CoinNet release milestone Dec 1, 2020
@omerfirmak
Contributor

omerfirmak commented Dec 2, 2020

I feel like this is not a memory corruption bug but a protocol error in vibe.d. It is apparently doing what is called HTTP pipelining, but not handling it properly; Wireshark dumps suggest that. When the client asks for A and B without waiting for a response in between, the server should respond in that order, but the responses come back in reverse order. I can see that responses from getBlockHeight and getLocalTime are getting routed to the wrong waiters. Since they return the same value type, even the deserializer is still happy. When the getLocalTime return value is wrapped in a struct, it is serialized differently, so the error is probably masked somewhere.

The issue where we were seeing a large number of blocks being fetched could very well be because of this.
2020-12-02 22:38:21,173 Info [agora.network.NetworkManager] - Retrieving blocks [5..1606937900] from http://127.0.0.1:6826..

1606937900 is simply a timestamp.
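
To illustrate why the mix-up can go unnoticed, here is a standalone sketch (not agora code; type names invented): two endpoints that both return a bare integer produce JSON that deserializes interchangeably, while a struct-wrapped value has a different JSON shape and fails loudly.

import vibe.data.json : deserializeJson, parseJsonString;

struct WrappedTime { long time; }

void main ()
{
	auto payload = parseJsonString("1606937900"); // a getLocalTime-style reply

	// The same payload is happily accepted as a block height or a timestamp,
	// so a response routed to the wrong waiter deserializes without error.
	auto as_height = deserializeJson!long(payload);
	auto as_time = deserializeJson!long(payload);
	assert(as_height == as_time);

	// Wrapping the value in a struct changes the expected JSON shape, so a
	// mixed-up response throws during deserialization instead of silently
	// producing a bogus value:
	// deserializeJson!WrappedTime(payload); // would throw
}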

@AndrejMitrovic AndrejMitrovic changed the title Memory corruption bug vibe.d bug with http requests (previously: memory corruption bug) Dec 3, 2020
@AndrejMitrovic
Contributor Author

Great find @omerfirmak! Can you file the bug upstream?

I'm guessing the pipelining in vibe.d can't be disabled / worked around?

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I think this 'pipelining' happens because 'http-keepalive' is switched on by default for REST requests. If so, we can probably switch it off until the bug is fixed.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I think it is a bit more complicated than that; from my test I see that the TCP connection is only reused if the previous request has finished, otherwise a new connection is opened:

import core.time : seconds;
import std.stdio : writeln;
import vibe.core.core : runApplication, runTask, sleep;
import vibe.core.log : LogLevel, setLogLevel;
import vibe.http.router : URLRouter;
import vibe.http.server : HTTPServerSettings, listenHTTP;
import vibe.inet.url : URL;
import vibe.web.rest : RestInterfaceClient, RestInterfaceSettings, registerRestInterface;

struct TestStruct
{
	int i;
}

interface IService
{
	@safe TestStruct getTest(int sleepsecs);
}

class Service : IService
{
	@safe TestStruct getTest(int sleepsecs)
	{
		writeln("$$$$$$$$$$$ receiving request with wait: ", sleepsecs);
		sleep(sleepsecs.seconds);
		TestStruct test_struct = {sleepsecs};
		writeln("$$$$$$$$$$$ returning request with wait: ", sleepsecs);
		return test_struct;
	}
}

int testmain()
{
	setLogLevel(LogLevel.trace);
	auto settings = new HTTPServerSettings;
	settings.port = 5000;
	settings.bindAddresses = ["127.0.0.1"];
	auto router = new URLRouter;
	router.registerRestInterface(new Service);
	auto server_addr = listenHTTP(settings, router);
	writeln("*************** before running tasks");
	runTask({
		auto settings = new RestInterfaceSettings();
		settings.baseURL = URL("http://127.0.0.1:5000/");
		try
		{
			auto service_client = new RestInterfaceClient!IService(settings);
			runTask({
				auto res = service_client.getTest(5);
				writeln("############## return res: ", res);
				assert(res.i == 5);
			});
			writeln("*************** before main sleeping");
			sleep(1.seconds);
			writeln("*************** after main sleeping");
			runTask({
				auto res = service_client.getTest(1);
				writeln("############## return res: ", res);
				assert(res.i == 1);
			});

		}
		catch(Exception e)
		{
			writeln("exception in main task");
		}
	});

	runApplication();
	return 0;
}

Result:

./build/app
REST route: GET /test ["sleepsecs"]
add route GET /test
REST route: OPTIONS /test ["sleepsecs"]
add route OPTIONS /test
[main(----) INF] Listening for requests on http://127.0.0.1:5000/
*************** before running tasks
Creating new fiber...
Running event loop...
setup signal handler
Starting event loop.
Creating new fiber...
Create HTTP client pool 127.0.0.1:5000 false proxy :0
creating new HTTPClient connection, all 0 are in use
 ... 7F5281096DC0
Now got 1 connections
*************** before main sleeping
Creating new fiber...
wait shared 55CBCF9E7E58
Creating new fiber...
--------------------
HTTP client request:
--------------------
GET /test?sleepsecs=5 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
HTTP client waiting for response (client)
Socket 9, read 198 bytes: ok (probably server)
reading request.. (server)
HTTP server reading status line (server)
--------------------
HTTP server request:
--------------------
GET /test?sleepsecs=5 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
Got request header.
persist: true
handle request (body 0)
Disambiguate done: 22 nodes, 1 max stack size
Match tree has 7 (of 22 in the builder) nodes, 2 terminals
route match: /test -> GET /test []
$$$$$$$$$$$ receiving request with wait: 5
*************** after main sleeping
Creating new fiber...
creating new HTTPClient connection, all 1 are in use
 ... 7F528109A210
Now got 2 connections
emit shared 7F52810990B0
lw null
emit shared done
--------------------
HTTP client request:
--------------------
GET /test?sleepsecs=1 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
HTTP client waiting for response
Socket 11, read 198 bytes: ok
reading request..
HTTP server reading status line
--------------------
HTTP server request:
--------------------
GET /test?sleepsecs=1 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
Got request header.
persist: true
handle request (body 0)
route match: /test -> GET /test []
$$$$$$$$$$$ receiving request with wait: 1
$$$$$$$$$$$ returning request with wait: 1
---------------------
HTTP server response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:33:45 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
return true
Waiting for next request...
Socket 10, read 161 bytes: ok
HTTP client reading status line
stln: HTTP/1.1 200 OK
---------------------
HTTP client response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:33:45 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
Returning HTTPClientResponse for conn 7F528109A210
REST call: GET http://127.0.0.1:5000/test?sleepsecs=1 -> 200, 
############## return res: TestStruct(1)
emit shared 7F5281099AB0
lw null
emit shared done
$$$$$$$$$$$ returning request with wait: 5
---------------------
HTTP server response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:33:44 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
return true
Waiting for next request...
Socket 7, read 161 bytes: ok
HTTP client reading status line
stln: HTTP/1.1 200 OK
---------------------
HTTP client response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:33:44 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
Returning HTTPClientResponse for conn 7F5281096DC0
REST call: GET http://127.0.0.1:5000/test?sleepsecs=5 -> 200, 
############## return res: TestStruct(5)
emit shared 7F52810994B0
lw null
emit shared done
^C[main(----) INF] Received signal 2. Shutting down.
emit shared 55CBCF9E7E58
waiter 7F5281098C00
lw 7F5281098C00
notify task 7F527DBE6970 55CBCF7DCC80 7F527DBE6808
emit shared done
main thread exit
emit shared 7F52810996B0
lw null
emit shared done

It seems the client used sockets 7 and 10, but if you change the timing so that the first request finishes before the second is issued, then the TCP connection is reused.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

here is the result for non-overlapping requests:

REST route: GET /test ["sleepsecs"]
add route GET /test
REST route: OPTIONS /test ["sleepsecs"]
add route OPTIONS /test
[main(----) INF] Listening for requests on http://127.0.0.1:5000/
*************** before running tasks
Creating new fiber...
Running event loop...
setup signal handler
Starting event loop.
Creating new fiber...
Create HTTP client pool 127.0.0.1:5000 false proxy :0
creating new HTTPClient connection, all 0 are in use
 ... 7F68BCECADC0
Now got 1 connections
*************** before main sleeping
Creating new fiber...
wait shared 557A8D6C3E58
Creating new fiber...
--------------------
HTTP client request:
--------------------
GET /test?sleepsecs=1 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
HTTP client waiting for response
Socket 9, read 198 bytes: ok
reading request..
HTTP server reading status line
--------------------
HTTP server request:
--------------------
GET /test?sleepsecs=1 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
Got request header.
persist: true
handle request (body 0)
Disambiguate done: 22 nodes, 1 max stack size
Match tree has 7 (of 22 in the builder) nodes, 2 terminals
route match: /test -> GET /test []
$$$$$$$$$$$ receiving request with wait: 1
$$$$$$$$$$$ returning request with wait: 1
---------------------
HTTP server response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:52:12 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
return true
Waiting for next request...
Socket 7, read 161 bytes: ok
HTTP client reading status line
stln: HTTP/1.1 200 OK
---------------------
HTTP client response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:52:12 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
Returning HTTPClientResponse for conn 7F68BCECADC0
REST call: GET http://127.0.0.1:5000/test?sleepsecs=1 -> 200, 
############## return res: TestStruct(1)
emit shared 7F68BCECD4B0
lw null
emit shared done
*************** after main sleeping
returning HTTPClient connection 0 of 1
--------------------
HTTP client request:
--------------------
GET /test?sleepsecs=5 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
HTTP client waiting for response
emit shared 7F68BCECD0B0
lw null
emit shared done
Socket 9, read 198 bytes: ok
reading request..
HTTP server reading status line
--------------------
HTTP server request:
--------------------
GET /test?sleepsecs=5 HTTP/1.1
User-Agent: vibe.d/1.11.1 (HTTPClient, +http://vibed.org/)
Connection: keep-alive
Accept-Encoding: gzip, deflate
Host: 127.0.0.1:5000
Accept: application/json
--------------------
Got request header.
persist: true
handle request (body 0)
route match: /test -> GET /test []
$$$$$$$$$$$ receiving request with wait: 5
$$$$$$$$$$$ returning request with wait: 5
---------------------
HTTP server response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:52:15 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
return true
Waiting for next request...
Socket 7, read 161 bytes: ok
HTTP client reading status line
stln: HTTP/1.1 200 OK
---------------------
HTTP client response:
---------------------
HTTP/1.1 200 OK
Server: vibe.d/1.11.1
Date: Thu, 03 Dec 2020 11:52:15 GMT
Keep-Alive: timeout=10
Content-Type: application/json
Content-Length: 7
---------------------
Returning HTTPClientResponse for conn 7F68BCECADC0
REST call: GET http://127.0.0.1:5000/test?sleepsecs=5 -> 200, 
############## return res: TestStruct(5)
emit shared 7F68BCECD4B0
lw null
emit shared done

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I agree with Andrej that this might be a memory corruption issue: using assumeSafeAppend is very suspicious to me, and this can certainly cause memory corruption if getNetTimeOffset or getBlocksFrom are re-entered.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

startPeriodicCatchup calling getBlocksFrom can cause re-entrancy, for example.

@AndrejMitrovic
Contributor Author

I agree with Andrej that this might be a memory corruption issue: using assumeSafeAppend is very suspicious to me, and this can certainly cause memory corruption if getNetTimeOffset or getBlocksFrom are re-entered.

I've tried avoiding using buffers and always allocating, but the problem still persists.

@AndrejMitrovic
Contributor Author

However yes, there is also the case of re-entrancy. I think we're looking at two bugs here, one of which is symptomatic.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I agree with Andrej that this might be a memory corruption issue: using assumeSafeAppend is very suspicious to me, and this can certainly cause memory corruption if getNetTimeOffset or getBlocksFrom are re-entered.

I've tried avoiding using buffers and always allocating, but the problem still persists.

I suppose you removed the static from both the TimeInfo[] offsets; and Pair[] node_pairs;

@AndrejMitrovic
Contributor Author

Yep both of them.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

There are actually quite a lot of local static arrays in our codebase, if you search for 'static \w+\s*\[\]' in vscode; I'm just hoping none of them are in re-entrant code paths.

@AndrejMitrovic
Contributor Author

We'll have to fix many of those. But also try to minimize unnecessary reentrancy.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

@omerfirmak I think if you changed the return type of getBlockHeight from int to string temporarily, that would put your theory to the test: vibe.d would then fail at the serialization level.

From my limited test, I don't see vibe.d using pipelining, but it does reuse TCP connections.

@omerfirmak
Contributor

I changed getLocalTime return value to string and now getLocalTime returns public keys :)

2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:54,169 Info [agora.consensus.protocol.Nominator] - combineCandidates: 1
2020-12-03 16:44:54,915 Info [agora.network.NetworkManager] - glt_ret: GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I changed getLocalTime return value to string and now getLocalTime returns public keys :)

2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:54,169 Info [agora.consensus.protocol.Nominator] - combineCandidates: 1
2020-12-03 16:44:54,915 Info [agora.network.NetworkManager] - glt_ret: GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM

did you also remove the static from both array?

do you have some simplified code to reproduce the issue? I could probably fix it if it is in vibe.d.

@omerfirmak
Contributor

I changed getLocalTime return value to string and now getLocalTime returns public keys :)

2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,658 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - glt_ret: 1607003093
2020-12-03 16:44:53,659 Info [agora.network.NetworkManager] - Node time: 1607003093
2020-12-03 16:44:54,169 Info [agora.consensus.protocol.Nominator] - combineCandidates: 1
2020-12-03 16:44:54,915 Info [agora.network.NetworkManager] - glt_ret: GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM

did you also remove the static from both array?

do you have some simplified code to reproduce the issue? I could probably fix it if it is in vibe.d.

Tried it again with non-static arrays, still the same.
Sadly I dont have a minimal example, let me see what I can do.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

Sadly, this issue doesn't happen with the deterministic version of the system test (system_integration_test.d).

@omerfirmak
Contributor

omerfirmak commented Dec 3, 2020

[main(GGqk) INF] Rest http://127.0.0.1:2826//public_key GET Ret: "1607011945"
[main(o+0v) INF] Rest http://127.0.0.1:2826//local_time GET Ret: "GDNODE2IMTDH7SZHXWDS24EZCMYCEJMRZWB3S4HLRIUP6UNGKVVFLVHQ"

Logged at: rest.d executeClientMethod()

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

Nice... I think both PublicKey and now LocalTime are serialized like
{"ACTUAL STRING"}, so they can still be interchanged for REST.

I am also starting to think the messages are somehow being interchanged... when Andrej changed the return value to a struct, it finally had a different serialization format, so no more interchanging...

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

So previously the two REST methods both returning int were interchanged; now the two methods returning strings are interchanged.

@ferencdg
Contributor

ferencdg commented Dec 3, 2020

I have a very wild theory of what is happening here:

  1. fiberA on the client side calls getPublicKey and gets assigned the TCP socket with id 5 (with port number 1111)
  2. the server answers, but fiberA is not scheduled yet, so it cannot read the answer
  3. after 10 seconds (the keepalive interval) the server closes the TCP connection
  4. this reaches the client OS, which frees socket id 5 (and port 1111)
  5. fiberA still sleeps
  6. fiberB comes along, issues getTime and gets assigned TCP socket id 5 AGAIN (with port number 2222)
  7. fiberB goes to sleep
  8. fiberA wakes up and reads data that was intended for fiberB

If the serialized format is the same, then it succeeds; otherwise it doesn't. This would explain Andrej's weird result that wrapping time_t into a struct seems to solve the issue. (I think if we switched on trace-level logging, we would see a lot of REST queries failing and being retried.)

This is just one scenario; here is another (same steps, except number 2):

  1. fiberA on the client side calls getPublicKey and gets assigned the TCP socket with id 5 (with port number 1111)
  2. a network error happens and the server closes the TCP connection
  3. this reaches the client OS, which frees socket id 5 (and port 1111)
  4. fiberA still sleeps
  5. fiberB comes along, issues getTime and gets assigned TCP socket id 5 AGAIN (with port number 2222)
  6. fiberB goes to sleep
  7. fiberA wakes up and reads data that was intended for fiberB

The key is that the same socket id is reused and fiberA is not woken up when the server side closes the connection (I have to verify both; it might just be some rubbish theory).

@bpalaggi bpalaggi added the C.General An issue which doesn't fit in the other categories (not blockchain related) label Dec 4, 2020
@omerfirmak
Contributor

@bpalaggi , Mathias asked @ferencdg to complete this. So I am assigning him too.

@omerfirmak
Contributor

omerfirmak commented Dec 4, 2020

@ferencdg , I have also seen cases where subsequent calls to getPublicKey and getLocalTime receive the same result. In fact, I think it is the more common case. If a fiber reads a value that was intended for another fiber, wouldn't it consume the data for good? The other fiber would not be able to read the same data.

REST call: GET http://127.0.0.1:3826/public_key -> 200, "GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM"
REST call: GET http://127.0.0.1:3826/local_time -> 200, "GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM"

@omerfirmak
Contributor

https://gist.github.com/omerfirmak/89d170556a83be4f514b9218ee3358c3

Towards the end, it can be seen that multiple requests receive unrelated JSON responses, not just the ones with matching return values.

@omerfirmak
Contributor

omerfirmak commented Dec 4, 2020

So this is the first request that gets a wrong response. It uses connection 1; if you look a couple of requests back, at this line, a getPublicKey request gets sent on the same connection but never returns. The response to this getPublicKey is treated as the response to the receive_envelope. Then, again here, another getPublicKey gets sent over the same connection and receives the response for the previous receive_envelope.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

So this is the first request that gets a wrong response. It uses connection 1; if you look a couple of requests back, at this line, a getPublicKey request gets sent on the same connection but never returns. The response to this getPublicKey is treated as the response to the receive_envelope. Then, again here, another getPublicKey gets sent over the same connection and receives the response for the previous receive_envelope.

Hmmm, I think this is very similar to my theory, but instead of reusing the raw socket IDs, we reuse the connections from the connection pool.

I looked at the timeouts and retries, everything in milliseconds:
read timeout: 100
wait between retries: 50
Additionally we have
max retry number: 25

So getPublicKey takes at most (100+50)*25 = 3750 msec, less than the 10 seconds needed for the server to close the TCP connection.

I think in our case the server took longer than 3750 msec to reply and the original client gave up trying.

I think if we push the total timeout on the client side above 10 secs this will make the issue go away, for example by changing the read timeout to 450. This can be a temporary solution until the issue is fixed.
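
For reference, a quick sanity check of that retry budget (figures taken from the numbers above; the 450 ms value is the proposed workaround):

// Worst case per call: (read timeout + retry delay) * retry count.
enum read_timeout_msecs = 100;
enum retry_delay_msecs = 50;
enum max_retries = 25;

static assert ((read_timeout_msecs + retry_delay_msecs) * max_retries == 3_750);

// Raising the read timeout to 450 msecs pushes the budget past the server's
// 10-second keep-alive, so the client keeps waiting instead of abandoning
// the connection:
static assert ((450 + retry_delay_msecs) * max_retries == 12_500);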

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

or we just switch off http-keepalive

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

@omerfirmak thanks a lot for running the tests; they never really ran correctly on my machine.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Setting defaultKeepAliveTimeout = 0 would work, I think.
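
A hedged sketch of what that could look like on the REST client side (field names are from vibe.d's HTTPClientSettings / RestInterfaceSettings; double-check them against the version we pin):

import core.time : seconds;
import vibe.http.client : HTTPClientSettings;
import vibe.inet.url : URL;
import vibe.web.rest : RestInterfaceSettings;

RestInterfaceSettings makeNoKeepAliveSettings (string base_url)
{
	// Zero keep-alive: the client drops the connection after each request,
	// so a stale response can never be handed to the next caller.
	auto http_settings = new HTTPClientSettings;
	http_settings.defaultKeepAliveTimeout = 0.seconds;

	auto rest_settings = new RestInterfaceSettings;
	rest_settings.baseURL = URL(base_url);
	rest_settings.httpClientSettings = http_settings;
	return rest_settings;
}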

@omerfirmak
Contributor

So this is the first request that gets a wrong response. It uses connection 1; if you look a couple of requests back, at this line, a getPublicKey request gets sent on the same connection but never returns. The response to this getPublicKey is treated as the response to the receive_envelope. Then, again here, another getPublicKey gets sent over the same connection and receives the response for the previous receive_envelope.

Hmmm, I think this is very similar to my theory, but instead of reusing the raw socket IDs, we reuse the connections from the connection pool.

I looked at the timeouts and retries, everything in milliseconds:
read timeout: 100
wait between retries: 50
Additionally we have
max retry number: 25

So getPublicKey takes at most (100+50)*25 = 3750 msec, less than the 10 seconds needed for the server to close the TCP connection.

I think in our case the server took longer than 3750 msec to reply and the original client gave up trying.

I think if we push the total timeout on the client side above 10 secs this will make the issue go away, for example by changing the read timeout to 450. This can be a temporary solution until the issue is fixed.

I think we should disable keepalive altogether; increasing the timeout is not really reliable IMO.
What do you think the ultimate fix should be? Would it work to not reuse connections on which a request has timed out?

@omerfirmak
Contributor

Running 2 versions of this (-version=node1 and -version=node2) simultaneously triggers the bug right away.

@omerfirmak
Contributor

omerfirmak commented Dec 4, 2020

diff --git a/http/vibe/http/client.d b/http/vibe/http/client.d
index ddb36978..0d38d23b 100644
--- a/http/vibe/http/client.d
+++ b/http/vibe/http/client.d
@@ -538,6 +538,7 @@ final class HTTPClient {
 			scope request_allocator = new RegionListAllocator!(shared(Mallocator), false)(1024, Mallocator.instance);
 		else
 			scope request_allocator = new RegionListAllocator!(shared(GCAllocator), true)(1024, GCAllocator.instance);
+		scope (failure) disconnect();
 
 		bool close_conn;
 		SysTime connected_time;
@@ -554,10 +555,7 @@ final class HTTPClient {
 		Exception user_exception;
 		while (true)
 		{
-			scope (failure) {
-				m_responding = false;
-				disconnect();
-			}
+			scope (failure) m_responding = false;
 			try responder(res);
 			catch (Exception e) {
 				logDebug("Error while handling response: %s", e.toString().sanitize());

This is the simplest fix I can think of: when a request times out (which throws an exception), disconnect and don't reuse the connection. There may be a need to do additional cleanup, though.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Sadly, I also don't have any better idea... The request that reuses the connection of an already timed-out request will have no idea whether the data it received is a response to its own request or to the request that timed out... unless we use some kind of ID in the request and response.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Maybe it is better if you or I raise the ticket on vibe.d, propose the solution, and see what Sonke suggests. Would you like to report it to vibe.d or should I do that? I think we would also need to post a minimal example to reproduce it.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Another solution would be to return the 'REST method name' in an HTTP response header from the REST server along with the body, so we know which method the response corresponds to. If we do this we don't need to modify the vibe-core project.
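
Roughly, the idea would look like this (header name and handler invented for illustration; the real change would live inside vibe.d's REST code):

import vibe.http.server : HTTPServerRequest, HTTPServerResponse;

// Server side: tag the response with the method that produced it.
void getLocalTimeHandler (HTTPServerRequest req, HTTPServerResponse res)
{
	res.headers["X-Rest-Method"] = "getLocalTime";
	res.writeJsonBody(1_607_003_093L);
}

// Client side (pseudocode): after reading a response, compare the header
// against the method that was actually called, and discard mismatches
// instead of deserializing them:
//
//     if (res.headers.get("X-Rest-Method", "") != expected_method)
//         throw new Exception("stale response from a reused connection");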

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

The advantage of the solution I proposed is that we don't have to close the TCP connection when we encounter a wrong message.

@omerfirmak
Contributor

@ferencdg , I think you should open the ticket; you have been working on vibe.d already. But I think this should definitely be fixed in vibe.d. This is clearly a bug on their end; I may even call it a big one :) Anything we do in Agora would be a 'hack' IMO.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

Yeah, it seems like a really big one (and yes, we don't need to hack around it in Agora; even my suggested solution is entirely in vibe.d).

@omerfirmak
Contributor

omerfirmak commented Dec 4, 2020

unless we use some kind of ID in the request and response.

Another solution would be to return the 'REST method name' in an HTTP response header from the REST server along with the body, so we know which method the response corresponds to. If we do this we don't need to modify the vibe-core project.

Well these assume both server and client use vibe.d, right? If I am not mistaken, a vibe.d client with a non-vibe.d server could still be a problem. We should target the general case.

Let's just raise the issue and see what the vibe.d people say 😄 I think we can use the example I posted above.

@ferencdg
Contributor

ferencdg commented Dec 4, 2020

unless we use some kind of ID in the request and response.

Another solution would be to return the 'REST method name' in an HTTP response header from the REST server along with the body, so we know which method the response corresponds to. If we do this we don't need to modify the vibe-core project.

Well these assume both server and client use vibe.d, right? If I am not mistaken, a vibe.d client with a non-vibe.d server could still be a problem. We should target the general case.

Let's just raise the issue and see what the vibe.d people say 😄 I think we can use the example I posted above.

Yeah, we might end up with both solutions implemented, depending on how much people hate dropping TCP connections.

@omerfirmak
Contributor

Related: vibe-d/vibe.d#2342

@ferencdg
Contributor

ferencdg commented Jan 6, 2021

done by: vibe-d/vibe.d#2507

@ferencdg ferencdg closed this as completed Jan 6, 2021
@Geod24
Collaborator

Geod24 commented Jan 19, 2021

I was greeted by an old friend today:

node-0_1  | 2021-01-19 03:15:17,290 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,290 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,408 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,424 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,448 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:25,886 Error [agora.network.NetworkManager] - Couldn't retrieve blocks: Request failure to http://node-5:5826 after 25 attempts. Will try again later..

I'll see if I can reproduce reliably.

@omerfirmak
Contributor

Nothing stays fixed for too long

@omerfirmak
Contributor

I was greeted by an old friend today:

node-0_1  | 2021-01-19 03:15:17,290 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,290 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,408 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,424 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:17,448 Info [agora.network.NetworkManager] - Retrieving blocks [1..18446744073709551615] from http://node-5:5826..
node-0_1  | 2021-01-19 03:15:25,886 Error [agora.network.NetworkManager] - Couldn't retrieve blocks: Request failure to http://node-5:5826 after 25 attempts. Will try again later..

I'll see if I can reproduce reliably.

18446744073709551615 is ulong.max, so I don't think it was related to this bug.
