This repository has been archived by the owner on Sep 2, 2020. It is now read-only.

Hard cyanite recovery after failure #285

Open
dancb10 opened this issue Dec 11, 2017 · 0 comments

We stress-tested our Cyanite infrastructure to find the point at which it fails. Once we stop the test, Cyanite appears to stay blocked, and only a restart gets it up and running again. We split the Cyanite nodes into read and write machines and observed the following:

WRITE nodes
In some cases Cyanite throws out-of-memory errors even with a 12 GB heap. I'm not sure whether this is a leak:

java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.lang.reflect.Method.copy(Method.java:147)
	at java.lang.reflect.ReflectAccess.copyMethod(ReflectAccess.java:140)
	at sun.reflect.ReflectionFactory.copyMethod(ReflectionFactory.java:302)
	at java.lang.Class.searchMethods(Class.java:3005)
	at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
	at java.lang.Class.getMethod0(Class.java:3010)
	at java.lang.Class.getMethod(Class.java:1776)
	at clojure.lang.Reflector.getMethods(Reflector.java:385)
	at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:27)
	at io.cyanite.engine.MetricResolution.ingest_BANG_(engine.clj:54)
	at io.cyanite.engine.Engine.ingest_BANG_(engine.clj:105)
	at io.cyanite.engine$fn__12854$G__12850__12857.invoke(engine.clj:19)
	at io.cyanite.engine$fn__12854$G__12849__12861.invoke(engine.clj:19)
	at clojure.core$partial$fn__6855.invoke(core.clj:2597)
	at io.cyanite.engine.queue.EngineQueue$fn__12124$fn__12125.invoke(queue.clj:57)
	at io.cyanite.engine.queue.EngineQueue$fn__12124.invoke(queue.clj:53)
	at clojure.lang.AFn.call(AFn.java:18)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
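
The trace above shows the error surfacing on the ingest path behind `EngineQueue`. Whether that queue is bounded is not established in this report, so the following is only an illustration, in Python rather than Clojure, of the general idea of a bounded ingest buffer that sheds load instead of growing until the heap is exhausted. The class and method names (`BoundedIngestQueue`, `offer`, `poll`) are hypothetical and are not part of Cyanite.

```python
import queue


class BoundedIngestQueue:
    """Hypothetical bounded buffer between a metric receiver and its writers.

    This is NOT Cyanite's implementation; it only illustrates load shedding.
    """

    def __init__(self, capacity):
        # queue.Queue with maxsize > 0 refuses new items once it is full.
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def offer(self, metric):
        """Enqueue without blocking; count and drop the metric if the buffer is full."""
        try:
            self._queue.put_nowait(metric)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def poll(self):
        """Return the next metric, or None if the buffer is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None

    def depth(self):
        """Approximate number of metrics currently buffered."""
        return self._queue.qsize()


# Usage: a buffer of 2 accepts two metrics and drops the third.
buf = BoundedIngestQueue(capacity=2)
results = [buf.offer(("cpu.load", 0.5, ts)) for ts in (1, 2, 3)]
print(results, buf.depth(), buf.dropped)
```

With a buffer like this, memory use is capped by `capacity`, and the `dropped` counter makes lost metrics visible instead of silent.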

READ nodes
There were a lot of read timeouts on the Cyanite instances:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
	at com.datastax.driver.core.exceptions.ReadTimeoutException.copy(ReadTimeoutException.java:88)
	at com.datastax.driver.core.exceptions.ReadTimeoutException.copy(ReadTimeoutException.java:25)
	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
	at qbits.alia$execute.invokeStatic(alia.clj:384)
	at qbits.alia$execute.invoke(alia.clj:326)
	at qbits.alia$execute.invokeStatic(alia.clj:392)
	at io.cyanite.index.cassandra$native_sasi_index.invokeStatic(cassandra.clj:79)
	at io.cyanite.index.cassandra$native_sasi_index.invoke(cassandra.clj:62)
	at io.cyanite.index.cassandra$load_prefixes_fn.invokeStatic(cassandra.clj:100)
	at io.cyanite.index.cassandra.CassandraIndex$reify__16230.load(cassandra.clj:129)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache.lambda$new$0(BoundedLocalCache.java:3070)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache$$Lambda$4/1846944624.apply(Unknown Source)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:1895)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$5/1816062018.apply(Unknown Source)
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:1893)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1876)
	at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:113)
	at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:67)
	at io.cyanite.index.cassandra.CassandraIndex.prefixes(cassandra.clj:152)
	at io.cyanite.api$fn__15748.invokeStatic(api.clj:119)
	at io.cyanite.api$fn__15748.invoke(api.clj:109)
	at clojure.lang.MultiFn.invoke(MultiFn.java:229)
	at io.cyanite.api$process.invokeStatic(api.clj:89)
	at io.cyanite.api$make_handler$fn__15791.invoke(api.clj:162)
	at io.cyanite.http$request_handler$fn__15485.invoke(http.clj:110)
	at io.cyanite.http$netty_handler$fn__15493.invoke(http.clj:125)
	at io.cyanite.http.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:435)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
	at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:250)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:1018)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:394)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:299)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
	at com.datastax.driver.core.exceptions.ReadTimeoutException.copy(ReadTimeoutException.java:115)
	at com.datastax.driver.core.Responses$Error.asException(Responses.java:124)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:506)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1070)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:993)
	at com.datastax.shaded.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
	at com.datastax.shaded.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
	at com.datastax.shaded.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
	at com.datastax.shaded.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
	at com.datastax.shaded.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
	at com.datastax.shaded.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1280)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
	at com.datastax.shaded.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:890)
	at com.datastax.shaded.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:505)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:419)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:391)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
	at com.datastax.shaded.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
	... 1 common frames omitted
Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
	at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:62)
	at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
	at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:289)
	at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:269)
	at com.datastax.shaded.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
	... 20 common frames omitted
ERROR [2017-12-11 11:06:13,589] epollEventLoopGroup-3-1 - io.cyanite.api could not process request
com.datastax.driver.core.exceptions.OperationTimedOutException: [va6-qe-pcs-pcs3gw-3/172.27.39.225:9042] Timed out waiting for server response
	at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
	at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
	at com.datastax.driver.core.ArrayBackedResultSet$MultiPage.prepareNextRow(ArrayBackedResultSet.java:313)
	at com.datastax.driver.core.ArrayBackedResultSet$MultiPage.one(ArrayBackedResultSet.java:275)
	at qbits.alia.codec$lazy_result_set_.invokeStatic(codec.clj:27)
	at qbits.alia.codec$lazy_result_set_.invoke(codec.clj:25)
	at qbits.alia.codec$lazy_result_set_$fn__13109.invoke(codec.clj:28)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.RT.seq(RT.java:525)
	at clojure.core$seq__6422.invokeStatic(core.clj:137)
	at clojure.core$map$fn__6881.invoke(core.clj:2719)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.RT.seq(RT.java:525)
	at clojure.core$seq__6422.invokeStatic(core.clj:137)
	at clojure.core$map$fn__6881.invoke(core.clj:2719)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.RT.seq(RT.java:525)
	at clojure.core$seq__6422.invokeStatic(core.clj:137)
	at clojure.core$map$fn__6881.invoke(core.clj:2719)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.RT.seq(RT.java:525)
	at clojure.core$seq__6422.invokeStatic(core.clj:137)
	at clojure.core$map$fn__6881.invoke(core.clj:2719)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.Cons.next(Cons.java:39)
	at clojure.lang.RT.next(RT.java:703)
	at clojure.core$next__6406.invokeStatic(core.clj:64)
	at clojure.core$concat$cat__6515$fn__6516.invoke(core.clj:734)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.RT.seq(RT.java:525)
	at clojure.core$seq__6422.invokeStatic(core.clj:137)
	at clojure.core$map$fn__6881.invoke(core.clj:2719)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.Cons.next(Cons.java:39)
	at clojure.lang.RT.next(RT.java:703)
	at clojure.core$next__6406.invokeStatic(core.clj:64)
	at clojure.core$concat$cat__6515$fn__6516.invoke(core.clj:734)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:56)
	at clojure.lang.RT.seq(RT.java:525)
	at clojure.core$seq__6422.invokeStatic(core.clj:137)
	at clojure.core$filter$fn__6908.invoke(core.clj:2782)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.RT.seq(RT.java:525)
	at clojure.core$seq__6422.invokeStatic(core.clj:137)
	at clojure.core$map$fn__6881.invoke(core.clj:2719)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.Cons.next(Cons.java:39)
	at clojure.lang.RT.next(RT.java:703)
	at clojure.core$next__6406.invokeStatic(core.clj:64)
	at clojure.core$reduce1.invokeStatic(core.clj:936)
	at clojure.core$set.invokeStatic(core.clj:4065)
	at globber.glob$filter_compound_ast.invokeStatic(glob.clj:313)
	at globber.glob$glob.invokeStatic(glob.clj:355)
	at io.cyanite.index.cassandra$load_prefixes_fn.invokeStatic(cassandra.clj:100)
	at io.cyanite.index.cassandra.CassandraIndex$reify__16230.load(cassandra.clj:129)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache.lambda$new$0(BoundedLocalCache.java:3070)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache$$Lambda$4/1846944624.apply(Unknown Source)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:1895)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$5/1816062018.apply(Unknown Source)
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:1893)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1876)
	at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:113)
	at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:67)
	at io.cyanite.index.cassandra.CassandraIndex.prefixes(cassandra.clj:152)
	at io.cyanite.api$fn__15748.invokeStatic(api.clj:119)
	at io.cyanite.api$fn__15748.invoke(api.clj:109)
	at clojure.lang.MultiFn.invoke(MultiFn.java:229)
	at io.cyanite.api$process.invokeStatic(api.clj:89)
	at io.cyanite.api$make_handler$fn__15791.invoke(api.clj:162)
	at io.cyanite.http$request_handler$fn__15485.invoke(http.clj:110)
	at io.cyanite.http$netty_handler$fn__15493.invoke(http.clj:125)
	at io.cyanite.http.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:435)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
	at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:250)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:1018)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:394)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:299)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [va6-qe-pcs-pcs3gw-3/172.27.39.225:9042] Timed out waiting for server response
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:772)
	at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
	at com.datastax.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
	at com.datastax.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
	at com.datastax.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
	at com.datastax.shaded.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
	... 1 common frames omitted

CASSANDRA nodes:
We've seen SASI index searches exceed their time quota:

org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: null
        at org.apache.cassandra.index.sasi.plan.QueryController.checkpoint(QueryController.java:158) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.Expression.checkpoint(Expression.java:320) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.disk.OnDiskIndex.searchPoint(OnDiskIndex.java:392) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.disk.OnDiskIndex.searchRange(OnDiskIndex.java:296) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.disk.OnDiskIndex.search(OnDiskIndex.java:254) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SSTableIndex.search(SSTableIndex.java:103) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.TermIterator.lambda$build$0(TermIterator.java:130) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.TermIterator$$Lambda$316/1683221136.run(Unknown Source) [apache-cassandra-3.11.1.jar:3.11.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_45]
        at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299) [guava-18.0.jar:na]
        at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) [na:1.8.0_45]
        at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50) [guava-18.0.jar:na]
        at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37) [guava-18.0.jar:na]
        at org.apache.cassandra.index.sasi.TermIterator.build(TermIterator.java:125) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryController.getIndexes(QueryController.java:145) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.Operation$Builder.complete(Operation.java:433) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryPlan.analyze(QueryPlan.java:57) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryPlan.execute(QueryPlan.java:68) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SASIIndex.lambda$searcherFor$2(SASIIndex.java:290) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SASIIndex$$Lambda$297/2047499964.search(Unknown Source) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:418) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1884) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2594) [apache-cassandra-3.11.1.jar:3.11.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.11.1.jar:3.11.1]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
DEBUG [Native-Transport-Requests-1] 2017-12-11 11:07:31,022 ReadCallback.java:132 - Timed out; received 0 of 1 responses
INFO  [Service Thread] 2017-12-11 11:07:31,023 StatusLogger.java:47 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
WARN  [ReadStage-2] 2017-12-11 11:07:31,025 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[ReadStage-2,5,main]: {}
java.lang.RuntimeException: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2598) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_45]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.11.1.jar:3.11.1]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
Caused by: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: null
        at org.apache.cassandra.index.sasi.plan.QueryController.checkpoint(QueryController.java:158) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.Expression.checkpoint(Expression.java:320) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.TermIterator.build(TermIterator.java:157) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryController.getIndexes(QueryController.java:145) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.Operation$Builder.complete(Operation.java:433) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryPlan.analyze(QueryPlan.java:57) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryPlan.execute(QueryPlan.java:68) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SASIIndex.lambda$searcherFor$2(SASIIndex.java:290) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SASIIndex$$Lambda$297/2047499964.search(Unknown Source) ~[na:na]
        at org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:418) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1884) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2594) ~[apache-cassandra-3.11.1.jar:3.11.1]
        ... 5 common frames omitted

Besides this, it seems that Cyanite loses requests under load.
The load test ran at 58680 metrics per second against an infrastructure of 4 Cyanite nodes (2 for reads, 2 for writes), each with 8 cores and 16 GB of RAM, and a three-node Cassandra cluster (8 cores, 64 GB of RAM).
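To put the rates in per-node terms, here is a small arithmetic sketch. It assumes the load is split evenly across the 2 write nodes, which this report does not confirm; the function name `per_write_node_rate` is made up for the example.

```python
def per_write_node_rate(total_rate, write_nodes):
    """Rate each write node would see if the load were split evenly (an assumption)."""
    return total_rate / write_nodes


WRITE_NODES = 2  # from the setup described above

# The two rates used in these load tests.
for total in (29340, 58680):
    print(total, "->", per_write_node_rate(total, WRITE_NODES), "per write node")
```

Under that even-split assumption, the failing rate of 58680 works out to 29340 per write node, and the rate of 29340 to 14670 per write node.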
We ran multiple tests at different rates and saw a pattern in how Cyanite behaves after it fails. At 29340 requests/s everything was OK and it performed well. Since the load on our instances was really low, we tried 58680 requests/s, which made Cyanite fail. We then stopped the load test and tried again at 29340 requests/s, but Cyanite never recovered and this test failed as well. So I'm not sure whether there is a queue limit, a write bottleneck, or a bug in the Clojure code. Unfortunately I don't know Clojure, which makes debugging hard.
So the only way we can get Cyanite working again is to restart the process on all instances. Cassandra doesn't seem to be the problem here: it handled the load well, and it accepted data again without any problems once Cyanite became healthy.
NOTE that in all load tests the machines had plenty of resources left and the load was not heavy.
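One way to tell a slowly draining backlog apart from a process that is truly stuck is to estimate how long a backlog would take to drain. The sketch below is only a back-of-the-envelope model: the sustainable capacity of 40000/s and the 60-second overload window are hypothetical placeholders, not measured values, and the function names are invented for the example.

```python
def backlog_after_overload(ingest_rate, capacity, seconds):
    """Items queued up after `seconds` of ingesting faster than `capacity`."""
    return max(0, (ingest_rate - capacity) * seconds)


def drain_seconds(backlog, ingest_rate, capacity):
    """Time to clear `backlog` while still ingesting at `ingest_rate`.

    Returns infinity when there is no spare capacity, i.e. the backlog never drains.
    """
    if backlog <= 0:
        return 0.0
    spare = capacity - ingest_rate
    if spare <= 0:
        return float("inf")
    return backlog / spare


# HYPOTHETICAL numbers: capacity and overload duration are NOT from the tests.
ASSUMED_CAPACITY = 40000   # items/s the system could sustain (placeholder)
OVERLOAD_SECONDS = 60      # length of the overload phase (placeholder)

backlog = backlog_after_overload(58680, ASSUMED_CAPACITY, OVERLOAD_SECONDS)
print("backlog:", backlog)
print("drain time at 29340/s:", drain_seconds(backlog, 29340, ASSUMED_CAPACITY))
```

If the observed recovery time is far longer than such an estimate for any plausible capacity, a stuck process is a more likely explanation than a backlog that is still draining.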
Do you have any numbers or "best practices" in terms of usage and hardware specs for Cyanite? We don't know if scaling horizontally/vertically can fix our problems or if there's a bug.
