vibe.d bug with http requests (previously: memory corruption bug) #1395
One hint is that both code locations use a function-local static array (see https://github.com/bpfkorea/agora/blob/b55d0a6066da78b35fdb649ef86f2b71d272dd90/source/agora/network/NetworkManager.d#L841). I suspect it might be a re-entrance problem: if the routine does a context switch and is re-entered, the new fiber will overwrite the static array of the old fiber. Meanwhile, the old fiber will later wake up and find its static array in an invalid state. |
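As an illustration of that failure mode, here is a minimal sketch (made-up names, not the actual NetworkManager code), assuming vibe.d fibers: a function-local `static` array is shared by every fiber that enters the function, so a yielding call lets another fiber reset and overwrite it mid-use.

```d
import core.time : msecs;
import std.format : format;
import std.stdio : writeln;
import vibe.core.core : runApplication, runTask, sleep;

string[] collectNames(string tag, size_t count)
{
    // One buffer shared by ALL fibers entering this function.
    static string[] buffer;
    buffer.length = 0;
    buffer.assumeSafeAppend();   // reuse the old storage, mirroring the suspicious pattern

    foreach (i; 0 .. count)
    {
        buffer ~= format("%s-%s", tag, i);
        // The fiber yields here; another fiber may re-enter collectNames and
        // reset/overwrite the very same static buffer in the meantime.
        sleep(10.msecs);
    }
    return buffer;
}

void main()
{
    runTask({
        auto a = collectNames("A", 3);
        // Expected ["A-0", "A-1", "A-2"]; typically prints a mix of A-* and B-*
        // entries because fiber B clobbered the shared buffer.
        writeln("fiber A got: ", a);
    });
    runTask({
        auto b = collectNames("B", 3);
        writeln("fiber B got: ", b);
    });
    runApplication();
}
```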
We should probably avoid using any kind of static local buffer. |
I feel like this is not a memory corruption bug but a protocol error in vibe.d. It is apparently doing what is called HTTP pipelining, but not handling it properly; Wireshark dumps suggest that. When the client asks for A and B without waiting for a response in between, the server should respond in that order, but the responses come back in reverse order. I can see that responses from ... The issue where we were seeing a large number of blocks being fetched could very well be because of this. 1606937900 is simply a timestamp. |
Great find @omerfirmak! Can you file the bug upstream? I'm guessing the pipelining in vibe.d can't be disabled / worked around? |
I think this 'pipelining' happens because 'http-keepalive' is switched on by default for REST requests. If so, we can probably switch it off until the bug is fixed. |
I think it is a bit more difficult than that. From my test I see that the TCP connection is only reused if the previous request has finished; otherwise a new connection is opened:

```d
import core.time : seconds;
import std.stdio : writeln;

import vibe.core.core : runApplication, runTask, sleep;
import vibe.core.log : LogLevel, setLogLevel;
import vibe.http.router : URLRouter;
import vibe.http.server : HTTPServerSettings, listenHTTP;
import vibe.inet.url : URL;
import vibe.web.rest : RestInterfaceClient, RestInterfaceSettings, registerRestInterface;

struct TestStruct
{
    int i;
}

interface IService
{
    @safe TestStruct getTest(int sleepsecs);
}

class Service : IService
{
    @safe TestStruct getTest(int sleepsecs)
    {
        writeln("$$$$$$$$$$$ receiving request with wait: ", sleepsecs);
        sleep(sleepsecs.seconds);
        TestStruct test_struct = {sleepsecs};
        writeln("$$$$$$$$$$$ returning request with wait: ", sleepsecs);
        return test_struct;
    }
}

int testmain()
{
    setLogLevel(LogLevel.trace);

    auto settings = new HTTPServerSettings;
    settings.port = 5000;
    settings.bindAddresses = ["127.0.0.1"];

    auto router = new URLRouter;
    router.registerRestInterface(new Service);
    auto server_addr = listenHTTP(settings, router);

    writeln("*************** before running tasks");
    runTask({
        auto settings = new RestInterfaceSettings();
        settings.baseURL = URL("http://127.0.0.1:5000/");
        try
        {
            auto service_client = new RestInterfaceClient!IService(settings);
            // First request: the handler sleeps for 5 seconds.
            runTask({
                auto res = service_client.getTest(5);
                writeln("############## return res: ", res);
                assert(res.i == 5);
            });
            writeln("*************** before main sleeping");
            sleep(1.seconds);
            writeln("*************** after main sleeping");
            // Second request, issued while the first one is still in flight.
            runTask({
                auto res = service_client.getTest(1);
                writeln("############## return res: ", res);
                assert(res.i == 1);
            });
        }
        catch (Exception e)
        {
            writeln("exception in main task");
        }
    });
    runApplication();
    return 0;
}
```

Result:
It seems the client used sockets 7 and 10, but if you change the timing so that the first request finishes before the second one is issued, then the TCP connection is reused. |
Here is the result for non-overlapping requests: |
I agree with Andrei that this might be a memory corruption issue: using assumeSafeAppend is very suspicious to me, and this can certainly cause memory corruption if getNetTimeOffset or getBlocksFrom are re-entered. |
startPeriodicCatchup calling getBlocksFrom can cause re-entrancy for example |
I've tried avoiding using buffers and always allocating, but the problem still persists. |
However yes, there is also the case of re-entrancy. I think we're looking at two bugs here, one of which is symptomatic. |
I suppose you removed the `static` from both `TimeInfo[] offsets` and `Pair[] node_pairs`? |
Yep both of them. |
There are actually quite a lot of local static arrays in our codebase; you can find them by searching for 'static \w*\s*[] ' in vscode. I'm just hoping none of them is used re-entrantly. |
We'll have to fix many of those. But also try to minimize unnecessary reentrancy. |
@omerfirmak I think if you changed the return type of getBlockHeight from int to string temporarily, that would put your theory to the test: vibe.d would then fail at the serialization level. From my limited test, I don't see vibe.d using pipelining, but it does reuse TCP connections. |
I changed
|
Did you also remove the static from both arrays? Do you have some simplified code to reproduce the issue? I could probably fix it if it is in vibe.d. |
Tried it again with non-static arrays, still the same. |
Sadly this issue doesn't happen with the deterministic version of the system test (system_integration_test.d). |
Logged at: rest.d |
Nice... I think both PublicKey and now LocalTime are serialized the same way. I am also starting to think maybe the messages are somehow interchanged... when Andrei changed the return value to a struct, it finally had a different serialization format, so no interchanging... |
So previously the two REST methods that both returned int were interchanged, and now the two methods returning strings are interchanged. |
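A small sketch of why identical wire formats would hide the mix-up while a wrapper struct exposes it. The names are illustrative only; `TimeInfo` here merely stands in for the struct the value was wrapped in, this is not agora code.

```d
import vibe.data.json : deserializeJson, parseJsonString, serializeToJson;

struct TimeInfo { ulong time; }   // distinct wire shape, unlike a bare integer

void main()
{
    // Pretend this is the body of a getBlockHeight response that ends up being
    // read by the caller of another method which also returns a bare integer.
    auto swapped_body = serializeToJson(ulong(1606937900)).toString();

    // Same serialized shape: the wrong response still deserializes without error,
    // so the caller silently gets a bogus value (e.g. a timestamp as a block height).
    auto bogus = deserializeJson!ulong(parseJsonString(swapped_body));
    assert(bogus == 1606937900);

    // Different shape: deserializing the swapped body as TimeInfo throws, so the
    // mix-up is detected and the request fails loudly instead of silently succeeding.
    bool detected;
    try deserializeJson!TimeInfo(parseJsonString(swapped_body));
    catch (Exception) detected = true;
    assert(detected);
}
```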
I have a very wild theory of what is happening here:
If the serialized format is the same, then it succeeds; otherwise it doesn't. This would explain Andrei's weird result that wrapping time_t into a struct seems to solve the issue. (I think if we switched on trace-level logging, we would see a lot of REST queries failing and being retried.) This is just one scenario; here is another (same steps, except number 2):
The key is that the same socket id is reused and fiber A is not woken up when the server side closes the connection. (I have to verify both; it might just be some rubbish theory.) |
@ferencdg, I have also seen cases where subsequent calls to |
https://gist.github.com/omerfirmak/89d170556a83be4f514b9218ee3358c3 Towards the end, it can be seen that multiple requests receive unrelated JSON responses, not just the ones with matching return values. |
So this is the first request that gets a wrong response. It uses connection 1; if you look a couple of requests back at this line, a |
Hmmm, I think this is very similar to my theory, but instead of reusing raw socket IDs, we reuse the connections from the connection pool. I looked at the timeouts and retries (everything in milliseconds): getPublicKey takes at most (100+50)*25 = 3750 ms, less than the 10 seconds the server waits before closing the TCP connection. I think in our case the server took more than 3750 ms to reply and the original client gave up waiting. If we push the total timeout on the client side above 10 seconds, this should make the issue go away, for example by changing the read timeout to 450. This can be a temporary solution until the issue is fixed. |
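For reference, the arithmetic above as a compile-time check, using only the numbers quoted in that comment (100 ms request timeout, 50 ms retry delay, 25 retries, 10 s server keep-alive timeout):

```d
enum request_timeout_ms = 100;
enum retry_delay_ms     = 50;
enum max_retries        = 25;

// Worst case before the client gives up on getPublicKey entirely.
enum worst_case_ms = (request_timeout_ms + retry_delay_ms) * max_retries;
static assert(worst_case_ms == 3_750);

// The server only closes an idle keep-alive connection after 10 s, so the stale
// connection (with a late response still in flight) outlives the client's retries.
enum keep_alive_timeout_ms = 10_000;
static assert(worst_case_ms < keep_alive_timeout_ms);
```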
or we just switch off http-keepalive |
@omerfirmak thanks a lot for running the tests, it never really ran correctly on my machine |
We could set defaultKeepAliveTimeout=0, and I think that would work. |
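A sketch of that workaround on the REST client side, assuming the vibe.d version in use exposes `RestInterfaceSettings.httpClientSettings` and `HTTPClientSettings.defaultKeepAliveTimeout` (`IService` is the interface from the earlier test program):

```d
import core.time : seconds;
import vibe.http.client : HTTPClientSettings;
import vibe.inet.url : URL;
import vibe.web.rest : RestInterfaceClient, RestInterfaceSettings;

// IService: the REST interface from the test program above.
RestInterfaceClient!IService makeClient(string base_url)
{
    auto rest_settings = new RestInterfaceSettings;
    rest_settings.baseURL = URL(base_url);

    // Close the connection after every request instead of keeping it alive,
    // so a late response to a timed-out request can never be read by the
    // next request that would otherwise reuse the pooled connection.
    auto http_settings = new HTTPClientSettings;
    http_settings.defaultKeepAliveTimeout = 0.seconds;
    rest_settings.httpClientSettings = http_settings;

    return new RestInterfaceClient!IService(rest_settings);
}
```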
I think we should disable keepalive altogether; increasing the timeout is not really reliable imo. |
Running 2 versions of this ( |
```diff
diff --git a/http/vibe/http/client.d b/http/vibe/http/client.d
index ddb36978..0d38d23b 100644
--- a/http/vibe/http/client.d
+++ b/http/vibe/http/client.d
@@ -538,6 +538,7 @@ final class HTTPClient {
 			scope request_allocator = new RegionListAllocator!(shared(Mallocator), false)(1024, Mallocator.instance);
 		else
 			scope request_allocator = new RegionListAllocator!(shared(GCAllocator), true)(1024, GCAllocator.instance);
+		scope (failure) disconnect();
 
 		bool close_conn;
 		SysTime connected_time;
@@ -554,10 +555,7 @@ final class HTTPClient {
 			Exception user_exception;
 			while (true)
 			{
-				scope (failure) {
-					m_responding = false;
-					disconnect();
-				}
+				scope (failure) m_responding = false;
 				try responder(res);
 				catch (Exception e) {
 					logDebug("Error while handling response: %s", e.toString().sanitize());
```

Simplest fix I can think of: when a request times out, which throws an exception, disconnect and don't reuse the connection. There may be a need to do additional cleanup, though. |
Sadly I also don't have any better idea... The request that is reusing the connection of an already timed-out request has no way to tell whether the data it receives is a response to its own request or to the request that timed out... unless we use some kind of ID in the request and response. |
Maybe it is better if you or I raise the ticket on vibe.d, just propose the solution, and see what Sonke suggests. Would you like to report it to vibe.d or should I do that? I think we would also need to post a minimal example to reproduce it. |
Another solution would be to return the 'rest method name' in an HTTP response header from the REST server along with the body, so we will know which method the response corresponds to. If we do this, we don't need to modify the vibe-core project. |
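A sketch of that idea at the plain vibe.d HTTP level. The `X-Rest-Method` header name, the handler and the URL are made up for illustration; a real version would hook into the REST code generator so every method is tagged automatically.

```d
import vibe.core.core : runApplication, runTask;
import vibe.http.client : requestHTTP;
import vibe.http.router : URLRouter;
import vibe.http.server : HTTPServerRequest, HTTPServerResponse, HTTPServerSettings, listenHTTP;

enum methodHeader = "X-Rest-Method";   // hypothetical header name

// Server side: tag every response with the name of the method that produced it.
void getTestHandler(HTTPServerRequest req, HTTPServerResponse res)
{
    res.headers[methodHeader] = "getTest";
    res.writeBody(`{"i": 1}`, "application/json");
}

// Client side: reject any response that was produced by a different method,
// e.g. a late reply left over on a reused keep-alive connection, and let the
// usual retry logic re-issue the request.
void callGetTest()
{
    requestHTTP("http://127.0.0.1:5000/test",
        (scope req) { /* nothing extra to send */ },
        (scope res) {
            if (res.headers.get(methodHeader, "") != "getTest")
                throw new Exception("response belongs to a different request");
            // ... deserialize the body as usual ...
        });
}

void main()
{
    auto settings = new HTTPServerSettings;
    settings.port = 5000;
    settings.bindAddresses = ["127.0.0.1"];

    auto router = new URLRouter;
    router.get("/test", &getTestHandler);
    listenHTTP(settings, router);

    runTask({ callGetTest(); });
    runApplication();
}
```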
the advantage of the solution I proposed is that we don't have to close the TCP connection in case we encounter a wrong message |
@ferencdg, I think you should open the ticket; you have been working on |
Yeah, it seems like a really big one. (Yes, we don't need to hack in agora; even my suggested solution is entirely in vibe.d.) |
Well, these assume both server and client use vibe.d, right? If I am not mistaken, a vibe.d client with a non-vibe.d server could still be a problem. We should target the general case. Let's just raise the issue and see what the vibe.d people say 😄 I think we can use the example I posted above. |
Yeah, we might end up with both solutions implemented, depending on how much people hate dropping TCP connections. |
Related: vibe-d/vibe.d#2342 |
done by: vibe-d/vibe.d#2507 |
I was greeted by an old friend today:
I'll see if I can reproduce reliably. |
Nothing stays fixed for too long |
18446744073709551615 is ulong.max (i.e. a -1 stored in an unsigned 64-bit value). |
Look closely: the newly added `node_time` logging emits the same number `1606728280` as the `Retrieving blocks` log line. There seems to be some memory corruption which propagates everywhere.
I also added some additional logging and ran into this:
This is in `Key.fromString (scope const(char)[] str)`. There is probably a separate bug here in that we're using assertions in place of exceptions inside this function. But it's also highly unusual that legitimate local nodes are sending empty JSON strings. So I suspect there is a memory corruption bug somewhere.