Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mc_worker retries too many times and give up too quick when server restart #209

Open
wandermyz opened this issue Jan 24, 2019 · 3 comments

Comments

@wandermyz
Copy link

This might be related to #208 but the discussion there lacks of context so I'm trying to open a new one here.

My goal is to make sure that if I run sudo service mongod restart, my Erlang application can automatically recover from the short glitch. I need this because otherwise any scale in/out or failover will take the Erlang application down. I'm using mongoc as I need to connect to a sharded Mongo Atlas instance in production.

I wrote a mongo_sup.erl to supervise all connections. Each pid that mongoc:connect returns become a child of my supervisor. Also as suggested in Issue 208, I added timeout by myself.

The problem is, when I do sudo service mongod restart, all the mc_worker spawned by mc_pool_sup crashes, but the mc_topology gen_server stays there. So my timeout code won't help at all. And as mc_pool_sup as a spec of simple_one_for_one, 1000, 3600, it'll very quickly retry 1000 times (seems triggered somewhere that I didn't figure out) without any waiting or thread yield and give up, even before mongod restarts. Then after mongod starts again, the Erlang application will no longer retry connecting.

I have to add a timeout code inside mc_worker and then it seems working

init(Options) ->
  case mc_worker_logic:connect_to_database(Options) of
    {ok, Socket} ->
      ...;
    Error ->
      io:format("mc_worker Error ~p~n", [Error]),
      timer:sleep(5000),  %%%%%%%%%%%%%%%%

      proc_lib:init_ack(Error)
  end.

Am I misunderstanding something and is this the correct way to use this library with mongoc?

The supervisor code

init(Params) ->
    application:ensure_all_started(mongodb),

    Conn1 = #{
        id       => mongo_local,
        start    => {?MODULE, connect, [
            mongo_local,
            {unknown, [ 
                "localhost:27017"
            ]}, 
            [
                {register, mongo_local},
                {connectTimeoutMS, 3000},
                {socketTimeoutMS, 100},
                {serverSelectionTimeoutMS, 3000}
            ],
            [
                {database,<<"account_dev">>}
            ]
        ]},
        restart  => permanent,
        shutdown => 5000,
        type     => supervisor,
        modules  => [mongoc]
    },

    Conn2 = ...
    Conn3 = ...

    SupFlags = #{
        strategy => one_for_one, 
        intensity => 10, 
        period => 30},
    ChildSpecs = [Conn1, Conn2, Conn3],
    {ok, {SupFlags, ChildSpecs}}.

connect(Name, Seeds, Options, WorkerOptions) ->
    RetryInterval = 5000,

    case mongoc:connect(Seeds, Options, WorkerOptions) of
        {ok, Pid} ->
            try mongoc:status(Pid) of
                {ready, _, _, _} ->
                    State = mc_topology:get_state(Pid),
                    lager:info("Connected to ~p, State: ~p", [Name, State]),
                    {ok, Pid};
                Result ->
                    lager:info("Failed to connect to ~p, ~p", [Name, Result]),
                    timer:sleep(RetryInterval),
                    {stop, Result}
            catch 
                error:Reason -> 
                    lager:info("Failed to connect to ~p, ~p", [Name, Reason]),
                    timer:sleep(RetryInterval),
                    {stop, Reason}
            end;

        {error, Error} -> 
            lager:info("Failed to connect to ~p, error ~p", [Name, Error]),
            timer:sleep(RetryInterval),
            {stop, Error};

        Result ->
            lager:info("Failed to connect to ~p, ~p", [Name, Result]),
            timer:sleep(RetryInterval),
            {stop, Result}
    end.

The log when I run sudo service mongod restart, where w:tl() is a function try to use mongo_api:insert to insert some records.

11:57:41.549 [info] [<0.352.0>] Connected to mongo_local, State: {topology_state,["localhost:27017"],<0.354.0>,standalone,undefined,undefined,undefined,primary,[],98363,3000,200,[{database,<<"account_dev">>}],[{register,mongo_local},{connectTimeoutMS,3000},{socketTimeoutMS,100},{serverSelectionTimeoutMS,3000}],5000}

(accountserver@yang-mbp-local-skyvm)1>
(accountserver@yang-mbp-local-skyvm)1>
(accountserver@yang-mbp-local-skyvm)1>
(accountserver@yang-mbp-local-skyvm)1> 11:57:51.378 [error] [<0.363.0>] gen_server <0.363.0> terminated with reason: tcp_closed
11:57:51.379 [error] [<0.364.0>] gen_server <0.364.0> terminated with reason: tcp_closed
11:57:51.379 [error] [<0.365.0>] gen_server <0.365.0> terminated with reason: tcp_closed
11:57:51.379 [error] [<0.366.0>] gen_server <0.366.0> terminated with reason: tcp_closed
11:57:51.379 [error] [<0.367.0>] gen_server <0.367.0> terminated with reason: tcp_closed
11:57:51.379 [error] [<0.368.0>] gen_server <0.368.0> terminated with reason: tcp_closed
11:57:51.379 [error] [<0.369.0>] gen_server <0.369.0> terminated with reason: tcp_closed
11:57:51.379 [error] [<0.370.0>] gen_server <0.370.0> terminated with reason: tcp_closed
11:57:51.379 [error] [<0.371.0>] gen_server <0.371.0> terminated with reason: tcp_closed
11:57:51.379 [error] [<0.372.0>] gen_server <0.372.0> terminated with reason: tcp_closed
11:57:51.414 [error] [<0.363.0>] CRASH REPORT Process <0.363.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.414 [error] [<0.364.0>] CRASH REPORT Process <0.364.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.415 [error] [<0.365.0>] CRASH REPORT Process <0.365.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.415 [error] [<0.367.0>] CRASH REPORT Process <0.367.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.416 [error] [<0.368.0>] CRASH REPORT Process <0.368.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.417 [error] [<0.369.0>] CRASH REPORT Process <0.369.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.417 [error] [<0.366.0>] CRASH REPORT Process <0.366.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.418 [error] [<0.370.0>] CRASH REPORT Process <0.370.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.418 [error] [<0.371.0>] CRASH REPORT Process <0.371.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.419 [error] [<0.372.0>] CRASH REPORT Process <0.372.0> with 0 neighbours exited with reason: tcp_closed in gen_server:terminate/7 line 812
11:57:51.419 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.363.0> exit with reason tcp_closed in context child_terminated
11:57:51.420 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.364.0> exit with reason tcp_closed in context child_terminated
11:57:51.420 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.365.0> exit with reason tcp_closed in context child_terminated
11:57:51.421 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.367.0> exit with reason tcp_closed in context child_terminated
11:57:51.421 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.368.0> exit with reason tcp_closed in context child_terminated
11:57:51.422 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.369.0> exit with reason tcp_closed in context child_terminated
11:57:51.423 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.366.0> exit with reason tcp_closed in context child_terminated
11:57:51.424 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.370.0> exit with reason tcp_closed in context child_terminated
11:57:51.425 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.371.0> exit with reason tcp_closed in context child_terminated
11:57:51.425 [error] [<0.362.0>] Supervisor {<0.362.0>,poolboy_sup} had child mc_worker started with {mc_worker,start_link,undefined} at <0.372.0> exit with reason tcp_closed in context child_terminated
11:57:51.426 [error] [<0.361.0>] gen_server <0.361.0> terminated with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275
11:57:51.427 [error] [<0.361.0>] CRASH REPORT Process <0.361.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:terminate/7 line 812
11:57:51.428 [error] [<0.353.0>] Supervisor mc_pool_sup had child worker_pool started with poolboy:start_link([{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,undefined}], [{host,"localhost"},{port,27017},{database,<<"account_dev">>}]) at <0.361.0> exit with reason no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in context child_terminated
11:57:51.429 [error] [<0.376.0>] CRASH REPORT Process <0.376.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:init_it/6 line 352
11:57:51.430 [error] [<0.353.0>] Supervisor mc_pool_sup had child worker_pool started with poolboy:start_link([{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,undefined}], [{host,"localhost"},{port,27017},{database,<<"account_dev">>}]) at <0.361.0> exit with reason no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in context start_error
11:57:51.430 [error] [<0.377.0>] gen_server <0.377.0> terminated with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275
11:57:51.433 [error] [<0.377.0>] CRASH REPORT Process <0.377.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:terminate/7 line 812
11:57:51.434 [error] [<0.379.0>] CRASH REPORT Process <0.379.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:init_it/6 line 352
11:57:51.435 [error] [<0.353.0>] Supervisor mc_pool_sup had child worker_pool started with poolboy:start_link([{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,undefined}], [{host,"localhost"},{port,27017},{database,<<"account_dev">>}]) at {restarting,<0.361.0>} exit with reason no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in context start_error
11:57:51.435 [error] [<0.380.0>] gen_server <0.380.0> terminated with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275
11:57:51.437 [error] [<0.380.0>] CRASH REPORT Process <0.380.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:terminate/7 line 812
11:57:51.492 [error] [<0.382.0>] CRASH REPORT Process <0.382.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:init_it/6 line 352
11:57:51.494 [error] [<0.353.0>] Supervisor mc_pool_sup had child worker_pool started with poolboy:start_link([{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,undefined}], [{host,"localhost"},{port,27017},{database,<<"account_dev">>}]) at {restarting,<0.361.0>} exit with reason no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in context start_error
11:57:51.494 [error] [<0.383.0>] gen_server <0.383.0> terminated with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275
11:57:51.494 [error] [<0.383.0>] CRASH REPORT Process <0.383.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:terminate/7 line 812
11:57:51.495 [error] [<0.385.0>] CRASH REPORT Process <0.385.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:init_it/6 line 352
11:57:51.497 [error] [<0.353.0>] Supervisor mc_pool_sup had child worker_pool started with poolboy:start_link([{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,undefined}], [{host,"localhost"},{port,27017},{database,<<"account_dev">>}]) at {restarting,<0.361.0>} exit with reason no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in context start_error
11:57:51.497 [error] [<0.386.0>] gen_server <0.386.0> terminated with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275
11:57:51.499 [error] [<0.386.0>] CRASH REPORT Process <0.386.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:terminate/7 line 812
11:57:51.500 [error] [<0.388.0>] CRASH REPORT Process <0.388.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:init_it/6 line 352
11:57:51.500 [error] [<0.389.0>] gen_server <0.389.0> terminated with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275
11:57:51.501 [error] [<0.389.0>] CRASH REPORT Process <0.389.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:terminate/7 line 812
11:57:51.501 [error] [<0.353.0>] Supervisor mc_pool_sup had child worker_pool started with poolboy:start_link([{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,undefined}], [{host,"localhost"},{port,27017},{database,<<"account_dev">>}]) at {restarting,<0.361.0>} exit with reason no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in context start_error
11:57:51.501 [error] [<0.391.0>] CRASH REPORT Process <0.391.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:init_it/6 line 352
11:57:51.502 [error] [<0.353.0>] Supervisor mc_pool_sup had child worker_pool started with poolboy:start_link([{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,undefined}], [{host,"localhost"},{port,27017},{database,<<"account_dev">>}]) at {restarting,<0.361.0>} exit with reason no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in context start_error
11:57:51.502 [error] [<0.392.0>] gen_server <0.392.0> terminated with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275
11:57:51.502 [error] [<0.392.0>] CRASH REPORT Process <0.392.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:terminate/7 line 812
11:57:51.502 [error] [<0.394.0>] CRASH REPORT Process <0.394.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:init_it/6 line 352
11:57:51.503 [error] [<0.353.0>] Supervisor mc_pool_sup had child worker_pool started with poolboy:start_link([{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,undefined}], [{host,"localhost"},{port,27017},{database,<<"account_dev">>}]) at {restarting,<0.361.0>} exit with reason no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in context start_error
11:57:51.503 [error] [<0.395.0>] gen_server <0.395.0> terminated with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275
11:57:51.503 [error] [<0.395.0>] CRASH REPORT Process <0.395.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:terminate/7 line 812
11:57:51.503 [error] [<0.397.0>] CRASH REPORT Process <0.397.0> with 0 neighbours exited with reason: no match of right hand value {error,econnrefused} in poolboy:new_worker/1 line 275 in gen_server:init_it/6 line 352

(accountserver@yang-mbp-local-skyvm)1>
(accountserver@yang-mbp-local-skyvm)1>
(accountserver@yang-mbp-local-skyvm)1> w:tl().
11:58:00.145 [warning] [<0.30.0>] lager_error_logger_h dropped 3243 messages in the last second that exceeded the limit of 50 messages/sec
11:58:00.145 [error] [<0.355.0>] gen_server <0.355.0> terminated with reason: no such process or port in call to gen_server:call(mc_pool_sup, {start_child,[[{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,...}],...]}, infinity) in gen_server:call/3 line 212
11:58:00.146 [error] [<0.355.0>] CRASH REPORT Process <0.355.0> with 0 neighbours exited with reason: no such process or port in call to gen_server:call(mc_pool_sup, {start_child,[[{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,...}],...]}, infinity) in gen_server:terminate/7 line 812
11:58:00.146 [error] [<0.357.0>] gen_server <0.357.0> terminated with reason: no such process or port in call to gen_server:call(mc_pool_sup, {start_child,[[{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,...}],...]}, infinity)
11:58:00.147 [error] [<0.357.0>] CRASH REPORT Process <0.357.0> with 1 neighbours exited with reason: no such process or port in call to gen_server:call(mc_pool_sup, {start_child,[[{worker_module,mc_worker},{size,10},{max_overflow,10},{overflow_ttl,0},{overflow_check_period,...}],...]}, infinity) in gen_server:terminate/7 line 812
** exception error: no match of right hand side value {error,timeout}
in function w:tl/0 (src/w.erl, line 152)

@wandermyz
Copy link
Author

Seems related to
#208
#185
#171
#156

@comtihon
Copy link
Owner

Hi,
sorry, I don't have much time right now to investigate your problem. But yes, it seems to be related to what you've mentioned.
Will linking topology to your sup help you when restarting?

@wandermyz
Copy link
Author

Linking topology to my sup doesn't help because topology doesn't crash; only mc_worker crashes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants