Cassandra Timeouts with No CPU Usage

Question

I am getting Cassandra timeouts using the Phantom-DSL with the Datastax Cassandra driver. However, Cassandra does not seem to be overloaded. Below is the exception I get:

com.datastax.driver.core.exceptions.OperationTimedOutException: [node-0.cassandra.dev/10.0.1.137:9042] Timed out waiting for server response
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:766)
    at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1267)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:588)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:662)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:385)
    at java.lang.Thread.run(Thread.java:745)

And here are the statistics I get from the Cassandra Datadog connector over this time period:

You can see our read rate (per second) on the top-center graph. Our CPU and memory usage are very low.

Here is how we are configuring the Datastax driver:

val points = ContactPoints(config.cassandraHosts)
  .withClusterBuilder(_.withSocketOptions(
    new SocketOptions()
      .setReadTimeoutMillis(config.cassandraNodeTimeout)
  ))
  .withClusterBuilder(_.withPoolingOptions(
    new PoolingOptions()
      .setConnectionsPerHost(
        HostDistance.LOCAL,
        2,
        2
      )
      .setConnectionsPerHost(
        HostDistance.REMOTE,
        2,
        2
      )
      .setMaxRequestsPerConnection(
        HostDistance.LOCAL,
        2048
      )
      .setMaxRequestsPerConnection(
        HostDistance.REMOTE,
        2048
      )
      .setPoolTimeoutMillis(10000)
      .setNewConnectionThreshold(
        HostDistance.LOCAL,
        1500
      )
      .setNewConnectionThreshold(
        HostDistance.REMOTE,
        1500
      )

))

Our nodetool cfstats looks like this:

$ nodetool cfstats alexandria_dev.match_sums
Keyspace : alexandria_dev
        Read Count: 101892
        Read Latency: 0.007479115141522397 ms.
        Write Count: 18721
        Write Latency: 0.012341060840767052 ms.
        Pending Flushes: 0
                Table: match_sums
                SSTable count: 0
                Space used (live): 0
                Space used (total): 0
                Space used by snapshots (total): 0
                Off heap memory used (total): 0
                SSTable Compression Ratio: 0.0
                Number of keys (estimate): 15328
                Memtable cell count: 15332
                Memtable data size: 21477107
                Memtable off heap memory used: 0
                Memtable switch count: 0
                Local read count: 17959
                Local read latency: 0.015 ms
                Local write count: 15332
                Local write latency: 0.013 ms
                Pending flushes: 0
                Percent repaired: 100.0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 0
                Bloom filter off heap memory used: 0
                Index summary off heap memory used: 0
                Compression metadata off heap memory used: 0
                Compacted partition minimum bytes: 0
                Compacted partition maximum bytes: 0
                Compacted partition mean bytes: 0
                Average live cells per slice (last five minutes): 1.0
                Maximum live cells per slice (last five minutes): 1
                Average tombstones per slice (last five minutes): 1.0
                Maximum tombstones per slice (last five minutes): 1
                Dropped Mutations: 0

When we ran cassandra-stress, we didn't experience any issues: we were getting a steady 50k reads per second, as expected.

Cassandra has this error whenever I make my queries:

INFO  [Native-Transport-Requests-2] 2017-03-10 23:59:38,003 Message.java:611 - Unexpected exception during request; channel = [id: 0x65d7a0cd, L:/10.0.1.98:9042 ! R:/10.0.1.126:35536]
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
        at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]

Why are we getting timeouts?

EDIT: I had the wrong dashboard uploaded. Please see the new image.

score 0 · Answer 1 · answered Mar 10 '17 at 11:15

2 questions that'll be helpful:

What's your timeout set to
What's the query?

Now some clarification on where I think you're going wrong here:

the resolution is too coarse to diagnose a single query, I could have a server doing nothing, do one expensive query that pegs some bottleneck for the entire time and on that scale look like nothing was bottlenecked, run iostat -x 1 on the servers at the same time and you may find something drastically different than what the charts say at that resolution.
If I'm looking at your CPU usage chart correctly there it looks like 50% usage. On modern servers that's actually fully busy because of hyperthreading and how aggregate CPU usage works see https://www.percona.com/blog/2015/01/15/hyper-threading-double-cpu-throughput/

**EDIT:** I had the wrong dashboard uploaded. Please see the new image. — Ian Macalinao, Mar 10 '17 at 22:58

Piotr Mazurek · Answer 2 · 2017-03-10T12:57:25.797

I suggest tracing the problematic query to see what cassandra was doing.

https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tracing_r.html

Open cql shell, type TRACING ON and execute your query. If everything seems fine, there is a chance that this problem happens occasionally, in which case I'd suggest tracing the queries using nodetool settraceprobablilty for some time, until you manage to catch the problem.

You enable it on each node separately using nodetool settraceprobability <param> where param is the probability (between 0 and 1) that the query will get traced. Careful: this WILL cause increased load, so start with a very low number and go up.

If this problem is occasional there is a chance that this might be caused by long garbage collections, in which case you need to analyse the GC logs. Check how long your GC's are.

edit: just to be clear, if this problem is caused by GC's you will NOT see it with tracing. So first check your GC's, and if its not the problem then move on to tracing.

**EDIT:** I had the wrong dashboard uploaded. Please see the new image. — Ian Macalinao, Mar 10 '17 at 22:58
This is extremely fast. I only have 20k rows. I think it may be a problem with the Datastax Java driver. (or the way I am using it) — Ian Macalinao, Mar 10 '17 at 23:39

Cassandra Timeouts with No CPU Usage

2 Answers2

Linked