Quantcast

Follower drops out of quorum, can't reconnect

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Follower drops out of quorum, can't reconnect

Ben Sherman-2
I have been chasing this problem for months and I can't figure out what is
happening.

We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as well),
and sometimes, a node gets a read timeout, drops all the connections and
tries to re-establish itself to the quorum.  It can usually do this in a
few seconds, but last night it took almost 15 minutes to reconnect.

These are 5 servers in AWS, and we've tried tuning the timeouts, but the
are exceeding any reasonable timeout and still failing.

Any hints here?

2017-05-10 11:21:34,386 [myid:5] - WARN
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
2017-05-10 11:21:34,387 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
        at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
2017-05-10 11:21:34,387 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
2017-05-10 11:21:34,388 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
socket connection for client /192.168.23.109:40666 which had sessionid
0x55bd553c076002d
[snip .. zk closes all its connections]
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
socket connection for client /192.168.23.88:65025 which had sessionid
0x55bd553c0760765
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
Shutting down
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting
down
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] -
Shutting down
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting
down
2017-05-10 11:21:34,392 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] -
shutdown of request processor complete
2017-05-10 11:21:34,392 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] -
Shutting down
2017-05-10 11:21:34,392 [myid:5] - INFO
 [CommitProcessor:5:CommitProcessor@153] - CommitProcessor exited loop!
2017-05-10 11:21:34,392 [myid:5] - INFO
 [SyncThread:5:SyncRequestProcessor@186] - SyncRequestProcessor exited!
2017-05-10 11:21:34,392 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
2017-05-10 11:21:34,393 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id =  5, proposed zxid=0x4100013987
2017-05-10 11:21:34,393 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:34,393 [myid:5] - INFO
 [FollowerRequestProcessor:5:FollowerRequestProcessor@97] -
FollowerRequestProcessor exited loop!
2017-05-10 11:21:34,593 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 400
2017-05-10 11:21:34,594 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:34,994 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 800
2017-05-10 11:21:34,994 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:35,794 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 1600
2017-05-10 11:21:35,795 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:37,395 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 3200
2017-05-10 11:21:37,395 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:40,595 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 6400
2017-05-10 11:21:40,596 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:46,996 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 12800
2017-05-10 11:21:46,996 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:59,796 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 25600
2017-05-10 11:21:59,797 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
---
MANY minutes later, we see things finally timeout, connections are
re-established, and everything comes back:
---
2017-05-10 11:37:02,469 [myid:5] - WARN
 [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
3, my id = 5, error =
java.net.SocketException: Connection timed out (Read failed)
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
2017-05-10 11:37:02,470 [myid:5] - WARN
 [RecvWorker:3:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
2017-05-10 11:37:02,469 [myid:5] - WARN
 [RecvWorker:2:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
2, my id = 5, error =
java.net.SocketException: Connection timed out (Read failed)
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
2017-05-10 11:37:02,470 [myid:5] - WARN
 [RecvWorker:2:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
2017-05-10 11:37:02,469 [myid:5] - WARN
 [RecvWorker:4:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
4, my id = 5, error =
java.net.SocketException: Connection timed out (Read failed)
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
2017-05-10 11:37:02,470 [myid:5] - WARN
 [RecvWorker:4:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
2017-05-10 11:37:02,470 [myid:5] - WARN
 [SendWorker:3:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
2017-05-10 11:37:02,471 [myid:5] - WARN
 [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2017-05-10 11:37:02,470 [myid:5] - WARN
 [SendWorker:4:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
2017-05-10 11:37:02,471 [myid:5] - WARN
 [SendWorker:4:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2017-05-10 11:37:02,472 [myid:5] - WARN
 [SendWorker:2:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
2017-05-10 11:37:02,472 [myid:5] - WARN
 [SendWorker:2:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2017-05-10 11:37:03,493 [myid:5] - WARN
 [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
1, my id = 5, error =
java.net.SocketException: Connection timed out (Read failed)
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
2017-05-10 11:37:03,493 [myid:5] - WARN
 [RecvWorker:1:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
2017-05-10 11:37:03,494 [myid:5] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
2017-05-10 11:37:03,494 [myid:5] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2017-05-10 11:37:16,607 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 60000
2017-05-10 11:37:16,611 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,611 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,611 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,611 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,613 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,613 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,613 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING
2017-05-10 11:37:16,613 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created
server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
datadir /opt/data-zookeeper/version-2 snapdir /opt/data-zookeeper/version-2
2017-05-10 11:37:16,613 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING - LEADER
ELECTION TOOK - 942220
2017-05-10 11:37:16,615 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
2017-05-10 11:37:16,615 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
2017-05-10 11:37:16,615 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 4 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Follower drops out of quorum, can't reconnect

Jordan Zimmerman-3
Have you tried increasing syncLimit?

====================
Jordan Zimmerman

> On May 10, 2017, at 9:29 PM, Ben Sherman <[hidden email]> wrote:
>
> I have been chasing this problem for months and I can't figure out what is
> happening.
>
> We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as well),
> and sometimes, a node gets a read timeout, drops all the connections and
> tries to re-establish itself to the quorum.  It can usually do this in a
> few seconds, but last night it took almost 15 minutes to reconnect.
>
> These are 5 servers in AWS, and we've tried tuning the timeouts, but the
> are exceeding any reasonable timeout and still failing.
>
> Any hints here?
>
> 2017-05-10 11:21:34,386 [myid:5] - WARN
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> following the leader
> java.net.SocketTimeoutException: Read timed out
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>        at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>        at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>        at
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>        at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>        at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
> 2017-05-10 11:21:34,387 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>        at
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>        at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
> 2017-05-10 11:21:34,387 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> 2017-05-10 11:21:34,388 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> socket connection for client /192.168.23.109:40666 which had sessionid
> 0x55bd553c076002d
> [snip .. zk closes all its connections]
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> socket connection for client /192.168.23.88:65025 which had sessionid
> 0x55bd553c0760765
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
> Shutting down
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting
> down
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] -
> Shutting down
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting
> down
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] -
> shutdown of request processor complete
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] -
> Shutting down
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [CommitProcessor:5:CommitProcessor@153] - CommitProcessor exited loop!
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [SyncThread:5:SyncRequestProcessor@186] - SyncRequestProcessor exited!
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
> 2017-05-10 11:21:34,393 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> election. My id =  5, proposed zxid=0x4100013987
> 2017-05-10 11:21:34,393 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:34,393 [myid:5] - INFO
> [FollowerRequestProcessor:5:FollowerRequestProcessor@97] -
> FollowerRequestProcessor exited loop!
> 2017-05-10 11:21:34,593 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 400
> 2017-05-10 11:21:34,594 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:34,994 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 800
> 2017-05-10 11:21:34,994 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:35,794 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 1600
> 2017-05-10 11:21:35,795 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:37,395 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 3200
> 2017-05-10 11:21:37,395 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:40,595 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 6400
> 2017-05-10 11:21:40,596 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:46,996 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 12800
> 2017-05-10 11:21:46,996 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:59,796 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 25600
> 2017-05-10 11:21:59,797 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> ---
> MANY minutes later, we see things finally timeout, connections are
> re-established, and everything comes back:
> ---
> 2017-05-10 11:37:02,469 [myid:5] - WARN
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
> 3, my id = 5, error =
> java.net.SocketException: Connection timed out (Read failed)
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
> 2017-05-10 11:37:02,469 [myid:5] - WARN
> [RecvWorker:2:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
> 2, my id = 5, error =
> java.net.SocketException: Connection timed out (Read failed)
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [RecvWorker:2:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
> 2017-05-10 11:37:02,469 [myid:5] - WARN
> [RecvWorker:4:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
> 4, my id = 5, error =
> java.net.SocketException: Connection timed out (Read failed)
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [RecvWorker:4:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [SendWorker:3:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>        at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> 2017-05-10 11:37:02,471 [myid:5] - WARN
> [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [SendWorker:4:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>        at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> 2017-05-10 11:37:02,471 [myid:5] - WARN
> [SendWorker:4:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
> 2017-05-10 11:37:02,472 [myid:5] - WARN
> [SendWorker:2:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>        at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> 2017-05-10 11:37:02,472 [myid:5] - WARN
> [SendWorker:2:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
> 2017-05-10 11:37:03,493 [myid:5] - WARN
> [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
> 1, my id = 5, error =
> java.net.SocketException: Connection timed out (Read failed)
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
> 2017-05-10 11:37:03,493 [myid:5] - WARN
> [RecvWorker:1:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
> 2017-05-10 11:37:03,494 [myid:5] - WARN
> [SendWorker:1:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>        at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> 2017-05-10 11:37:03,494 [myid:5] - WARN
> [SendWorker:1:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
> 2017-05-10 11:37:16,607 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> 2017-05-10 11:37:16,611 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,611 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,611 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,611 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created
> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
> datadir /opt/data-zookeeper/version-2 snapdir /opt/data-zookeeper/version-2
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING - LEADER
> ELECTION TOOK - 942220
> 2017-05-10 11:37:16,615 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> 2017-05-10 11:37:16,615 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> 2017-05-10 11:37:16,615 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 4 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Follower drops out of quorum, can't reconnect

Ben Sherman-2
Yes, synclimit is 5, initlimit is 10, ticktime is 2000.

I'm less concerned with the disconnection and more concerned with the
inability to reconnect for 15 minutes.  Networking on this host appeared to
be fine as other services were operating nominally, as were the other 4
servers in the ensemble.

On Wed, May 10, 2017 at 12:38 PM, Jordan Zimmerman <
[hidden email]> wrote:

> Have you tried increasing syncLimit?
>
> ====================
> Jordan Zimmerman
>
> > On May 10, 2017, at 9:29 PM, Ben Sherman <[hidden email]> wrote:
> >
> > I have been chasing this problem for months and I can't figure out what
> is
> > happening.
> >
> > We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as
> well),
> > and sometimes, a node gets a read timeout, drops all the connections and
> > tries to re-establish itself to the quorum.  It can usually do this in a
> > few seconds, but last night it took almost 15 minutes to reconnect.
> >
> > These are 5 servers in AWS, and we've tried tuning the timeouts, but the
> > are exceeding any reasonable timeout and still failing.
> >
> > Any hints here?
> >
> > 2017-05-10 11:21:34,386 [myid:5] - WARN
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > following the leader
> > java.net.SocketTimeoutException: Read timed out
> >        at java.net.SocketInputStream.socketRead0(Native Method)
> >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> java:116)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> >        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >        at
> > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumPacket.
> deserialize(QuorumPacket.java:83)
> >        at
> > org.apache.jute.BinaryInputArchive.readRecord(
> BinaryInputArchive.java:99)
> >        at
> > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> >        at
> > org.apache.zookeeper.server.quorum.Follower.followLeader(
> Follower.java:85)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
> > 2017-05-10 11:21:34,387 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> > java.lang.Exception: shutdown Follower
> >        at
> > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
> > 2017-05-10 11:21:34,387 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> > 2017-05-10 11:21:34,388 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> > socket connection for client /192.168.23.109:40666 which had sessionid
> > 0x55bd553c076002d
> > [snip .. zk closes all its connections]
> > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> > socket connection for client /192.168.23.88:65025 which had sessionid
> > 0x55bd553c0760765
> > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
> > Shutting down
> > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting
> > down
> > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] -
> > Shutting down
> > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting
> > down
> > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] -
> > shutdown of request processor complete
> > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] -
> > Shutting down
> > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > [CommitProcessor:5:CommitProcessor@153] - CommitProcessor exited loop!
> > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > [SyncThread:5:SyncRequestProcessor@186] - SyncRequestProcessor exited!
> > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
> > 2017-05-10 11:21:34,393 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> > election. My id =  5, proposed zxid=0x4100013987
> > 2017-05-10 11:21:34,393 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:21:34,393 [myid:5] - INFO
> > [FollowerRequestProcessor:5:FollowerRequestProcessor@97] -
> > FollowerRequestProcessor exited loop!
> > 2017-05-10 11:21:34,593 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > Notification time out: 400
> > 2017-05-10 11:21:34,594 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:21:34,994 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > Notification time out: 800
> > 2017-05-10 11:21:34,994 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:21:35,794 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > Notification time out: 1600
> > 2017-05-10 11:21:35,795 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:21:37,395 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > Notification time out: 3200
> > 2017-05-10 11:21:37,395 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:21:40,595 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > Notification time out: 6400
> > 2017-05-10 11:21:40,596 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:21:46,996 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > Notification time out: 12800
> > 2017-05-10 11:21:46,996 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:21:59,796 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > Notification time out: 25600
> > 2017-05-10 11:21:59,797 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > ---
> > MANY minutes later, we see things finally timeout, connections are
> > re-established, and everything comes back:
> > ---
> > 2017-05-10 11:37:02,469 [myid:5] - WARN
> > [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Connection broken for
> id
> > 3, my id = 5, error =
> > java.net.SocketException: Connection timed out (Read failed)
> >        at java.net.SocketInputStream.socketRead0(Native Method)
> >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> java:116)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> >        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> QuorumCnxManager.java:1013)
> > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > [RecvWorker:3:QuorumCnxManager$RecvWorker@1031] - Interrupting
> SendWorker
> > 2017-05-10 11:37:02,469 [myid:5] - WARN
> > [RecvWorker:2:QuorumCnxManager$RecvWorker@1028] - Connection broken for
> id
> > 2, my id = 5, error =
> > java.net.SocketException: Connection timed out (Read failed)
> >        at java.net.SocketInputStream.socketRead0(Native Method)
> >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> java:116)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> >        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> QuorumCnxManager.java:1013)
> > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > [RecvWorker:2:QuorumCnxManager$RecvWorker@1031] - Interrupting
> SendWorker
> > 2017-05-10 11:37:02,469 [myid:5] - WARN
> > [RecvWorker:4:QuorumCnxManager$RecvWorker@1028] - Connection broken for
> id
> > 4, my id = 5, error =
> > java.net.SocketException: Connection timed out (Read failed)
> >        at java.net.SocketInputStream.socketRead0(Native Method)
> >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> java:116)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> >        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> QuorumCnxManager.java:1013)
> > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > [RecvWorker:4:QuorumCnxManager$RecvWorker@1031] - Interrupting
> SendWorker
> > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > [SendWorker:3:QuorumCnxManager$SendWorker@944] - Interrupted while
> waiting
> > for message on queue
> > java.lang.InterruptedException
> >        at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> >        at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> >        at
> > java.util.concurrent.ArrayBlockingQueue.poll(
> ArrayBlockingQueue.java:418)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> QuorumCnxManager.java:1097)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> access$700(QuorumCnxManager.java:74)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> QuorumCnxManager.java:932)
> > 2017-05-10 11:37:02,471 [myid:5] - WARN
> > [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving
> thread
> > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > [SendWorker:4:QuorumCnxManager$SendWorker@944] - Interrupted while
> waiting
> > for message on queue
> > java.lang.InterruptedException
> >        at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> >        at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> >        at
> > java.util.concurrent.ArrayBlockingQueue.poll(
> ArrayBlockingQueue.java:418)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> QuorumCnxManager.java:1097)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> access$700(QuorumCnxManager.java:74)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> QuorumCnxManager.java:932)
> > 2017-05-10 11:37:02,471 [myid:5] - WARN
> > [SendWorker:4:QuorumCnxManager$SendWorker@954] - Send worker leaving
> thread
> > 2017-05-10 11:37:02,472 [myid:5] - WARN
> > [SendWorker:2:QuorumCnxManager$SendWorker@944] - Interrupted while
> waiting
> > for message on queue
> > java.lang.InterruptedException
> >        at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> >        at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> >        at
> > java.util.concurrent.ArrayBlockingQueue.poll(
> ArrayBlockingQueue.java:418)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> QuorumCnxManager.java:1097)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> access$700(QuorumCnxManager.java:74)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> QuorumCnxManager.java:932)
> > 2017-05-10 11:37:02,472 [myid:5] - WARN
> > [SendWorker:2:QuorumCnxManager$SendWorker@954] - Send worker leaving
> thread
> > 2017-05-10 11:37:03,493 [myid:5] - WARN
> > [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Connection broken for
> id
> > 1, my id = 5, error =
> > java.net.SocketException: Connection timed out (Read failed)
> >        at java.net.SocketInputStream.socketRead0(Native Method)
> >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> java:116)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> >        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> QuorumCnxManager.java:1013)
> > 2017-05-10 11:37:03,493 [myid:5] - WARN
> > [RecvWorker:1:QuorumCnxManager$RecvWorker@1031] - Interrupting
> SendWorker
> > 2017-05-10 11:37:03,494 [myid:5] - WARN
> > [SendWorker:1:QuorumCnxManager$SendWorker@944] - Interrupted while
> waiting
> > for message on queue
> > java.lang.InterruptedException
> >        at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> >        at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> >        at
> > java.util.concurrent.ArrayBlockingQueue.poll(
> ArrayBlockingQueue.java:418)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> QuorumCnxManager.java:1097)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> access$700(QuorumCnxManager.java:74)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> QuorumCnxManager.java:932)
> > 2017-05-10 11:37:03,494 [myid:5] - WARN
> > [SendWorker:1:QuorumCnxManager$SendWorker@954] - Send worker leaving
> thread
> > 2017-05-10 11:37:16,607 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > Notification time out: 60000
> > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING
> > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created
> > server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
> > datadir /opt/data-zookeeper/version-2 snapdir
> /opt/data-zookeeper/version-2
> > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING -
> LEADER
> > ELECTION TOOK - 942220
> > 2017-05-10 11:37:16,615 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> > 2017-05-10 11:37:16,615 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > FOLLOWING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> > 2017-05-10 11:37:16,615 [myid:5] - INFO
> > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > FOLLOWING (n.state), 4 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Follower drops out of quorum, can't reconnect

hanm
I would suggest create a JIRA issue and attach the full log of sid 5 (if
that's possible). The log posted here does not have enough information to
analyze what happened on sid 5 during the 15 minutes when it's trying to
connect to an established quorum. Please also attach another one or two
servers log so we can cross check the quorum state for that time.

On Wed, May 10, 2017 at 12:47 PM, Ben Sherman <[hidden email]> wrote:

> Yes, synclimit is 5, initlimit is 10, ticktime is 2000.
>
> I'm less concerned with the disconnection and more concerned with the
> inability to reconnect for 15 minutes.  Networking on this host appeared to
> be fine as other services were operating nominally, as were the other 4
> servers in the ensemble.
>
> On Wed, May 10, 2017 at 12:38 PM, Jordan Zimmerman <
> [hidden email]> wrote:
>
> > Have you tried increasing syncLimit?
> >
> > ====================
> > Jordan Zimmerman
> >
> > > On May 10, 2017, at 9:29 PM, Ben Sherman <[hidden email]> wrote:
> > >
> > > I have been chasing this problem for months and I can't figure out what
> > is
> > > happening.
> > >
> > > We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as
> > well),
> > > and sometimes, a node gets a read timeout, drops all the connections
> and
> > > tries to re-establish itself to the quorum.  It can usually do this in
> a
> > > few seconds, but last night it took almost 15 minutes to reconnect.
> > >
> > > These are 5 servers in AWS, and we've tried tuning the timeouts, but
> the
> > > are exceeding any reasonable timeout and still failing.
> > >
> > > Any hints here?
> > >
> > > 2017-05-10 11:21:34,386 [myid:5] - WARN
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > > following the leader
> > > java.net.SocketTimeoutException: Read timed out
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumPacket.
> > deserialize(QuorumPacket.java:83)
> > >        at
> > > org.apache.jute.BinaryInputArchive.readRecord(
> > BinaryInputArchive.java:99)
> > >        at
> > > org.apache.zookeeper.server.quorum.Learner.readPacket(
> Learner.java:153)
> > >        at
> > > org.apache.zookeeper.server.quorum.Follower.followLeader(
> > Follower.java:85)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
> > > 2017-05-10 11:21:34,387 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown
> called
> > > java.lang.Exception: shutdown Follower
> > >        at
> > > org.apache.zookeeper.server.quorum.Follower.shutdown(
> Follower.java:166)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
> > > 2017-05-10 11:21:34,387 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown
> called
> > > 2017-05-10 11:21:34,388 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> > > socket connection for client /192.168.23.109:40666 which had sessionid
> > > 0x55bd553c076002d
> > > [snip .. zk closes all its connections]
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> > > socket connection for client /192.168.23.88:65025 which had sessionid
> > > 0x55bd553c0760765
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140]
> -
> > > Shutting down
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] -
> shutting
> > > down
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107]
> -
> > > Shutting down
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] -
> Shutting
> > > down
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] -
> > > shutdown of request processor complete
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] -
> > > Shutting down
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [CommitProcessor:5:CommitProcessor@153] - CommitProcessor exited loop!
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [SyncThread:5:SyncRequestProcessor@186] - SyncRequestProcessor exited!
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
> > > 2017-05-10 11:21:34,393 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> > > election. My id =  5, proposed zxid=0x4100013987
> > > 2017-05-10 11:21:34,393 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:34,393 [myid:5] - INFO
> > > [FollowerRequestProcessor:5:FollowerRequestProcessor@97] -
> > > FollowerRequestProcessor exited loop!
> > > 2017-05-10 11:21:34,593 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 400
> > > 2017-05-10 11:21:34,594 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:34,994 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 800
> > > 2017-05-10 11:21:34,994 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:35,794 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 1600
> > > 2017-05-10 11:21:35,795 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:37,395 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 3200
> > > 2017-05-10 11:21:37,395 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:40,595 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 6400
> > > 2017-05-10 11:21:40,596 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:46,996 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 12800
> > > 2017-05-10 11:21:46,996 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:59,796 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 25600
> > > 2017-05-10 11:21:59,797 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > ---
> > > MANY minutes later, we see things finally timeout, connections are
> > > re-established, and everything comes back:
> > > ---
> > > 2017-05-10 11:37:02,469 [myid:5] - WARN
> > > [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Connection broken
> for
> > id
> > > 3, my id = 5, error =
> > > java.net.SocketException: Connection timed out (Read failed)
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> > QuorumCnxManager.java:1013)
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [RecvWorker:3:QuorumCnxManager$RecvWorker@1031] - Interrupting
> > SendWorker
> > > 2017-05-10 11:37:02,469 [myid:5] - WARN
> > > [RecvWorker:2:QuorumCnxManager$RecvWorker@1028] - Connection broken
> for
> > id
> > > 2, my id = 5, error =
> > > java.net.SocketException: Connection timed out (Read failed)
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> > QuorumCnxManager.java:1013)
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [RecvWorker:2:QuorumCnxManager$RecvWorker@1031] - Interrupting
> > SendWorker
> > > 2017-05-10 11:37:02,469 [myid:5] - WARN
> > > [RecvWorker:4:QuorumCnxManager$RecvWorker@1028] - Connection broken
> for
> > id
> > > 4, my id = 5, error =
> > > java.net.SocketException: Connection timed out (Read failed)
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> > QuorumCnxManager.java:1013)
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [RecvWorker:4:QuorumCnxManager$RecvWorker@1031] - Interrupting
> > SendWorker
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [SendWorker:3:QuorumCnxManager$SendWorker@944] - Interrupted while
> > waiting
> > > for message on queue
> > > java.lang.InterruptedException
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> > reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> > >        at
> > > java.util.concurrent.ArrayBlockingQueue.poll(
> > ArrayBlockingQueue.java:418)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> > QuorumCnxManager.java:1097)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> > access$700(QuorumCnxManager.java:74)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> > QuorumCnxManager.java:932)
> > > 2017-05-10 11:37:02,471 [myid:5] - WARN
> > > [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving
> > thread
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [SendWorker:4:QuorumCnxManager$SendWorker@944] - Interrupted while
> > waiting
> > > for message on queue
> > > java.lang.InterruptedException
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> > reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> > >        at
> > > java.util.concurrent.ArrayBlockingQueue.poll(
> > ArrayBlockingQueue.java:418)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> > QuorumCnxManager.java:1097)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> > access$700(QuorumCnxManager.java:74)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> > QuorumCnxManager.java:932)
> > > 2017-05-10 11:37:02,471 [myid:5] - WARN
> > > [SendWorker:4:QuorumCnxManager$SendWorker@954] - Send worker leaving
> > thread
> > > 2017-05-10 11:37:02,472 [myid:5] - WARN
> > > [SendWorker:2:QuorumCnxManager$SendWorker@944] - Interrupted while
> > waiting
> > > for message on queue
> > > java.lang.InterruptedException
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> > reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> > >        at
> > > java.util.concurrent.ArrayBlockingQueue.poll(
> > ArrayBlockingQueue.java:418)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> > QuorumCnxManager.java:1097)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> > access$700(QuorumCnxManager.java:74)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> > QuorumCnxManager.java:932)
> > > 2017-05-10 11:37:02,472 [myid:5] - WARN
> > > [SendWorker:2:QuorumCnxManager$SendWorker@954] - Send worker leaving
> > thread
> > > 2017-05-10 11:37:03,493 [myid:5] - WARN
> > > [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Connection broken
> for
> > id
> > > 1, my id = 5, error =
> > > java.net.SocketException: Connection timed out (Read failed)
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> > QuorumCnxManager.java:1013)
> > > 2017-05-10 11:37:03,493 [myid:5] - WARN
> > > [RecvWorker:1:QuorumCnxManager$RecvWorker@1031] - Interrupting
> > SendWorker
> > > 2017-05-10 11:37:03,494 [myid:5] - WARN
> > > [SendWorker:1:QuorumCnxManager$SendWorker@944] - Interrupted while
> > waiting
> > > for message on queue
> > > java.lang.InterruptedException
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> > reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> > >        at
> > > java.util.concurrent.ArrayBlockingQueue.poll(
> > ArrayBlockingQueue.java:418)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> > QuorumCnxManager.java:1097)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> > access$700(QuorumCnxManager.java:74)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> > QuorumCnxManager.java:932)
> > > 2017-05-10 11:37:03,494 [myid:5] - WARN
> > > [SendWorker:1:QuorumCnxManager$SendWorker@954] - Send worker leaving
> > thread
> > > 2017-05-10 11:37:16,607 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 60000
> > > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] -
> Created
> > > server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout
> 40000
> > > datadir /opt/data-zookeeper/version-2 snapdir
> > /opt/data-zookeeper/version-2
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING -
> > LEADER
> > > ELECTION TOOK - 942220
> > > 2017-05-10 11:37:16,615 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> > > 2017-05-10 11:37:16,615 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> > > 2017-05-10 11:37:16,615 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 4 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> >
>



--
Cheers
Michael.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Follower drops out of quorum, can't reconnect

Mike Richardson
Unsubscribe


Mike Richardson

Senior Software Engineer



MoTuM N.V. | Dellingstraat 34 | B-2800 MECHELEN | Belgium


T +32(0)15 28 16 63
M +41 7943 69538


www.motum.be

On 11 May 2017 at 06:17, Michael Han <[hidden email]> wrote:
I would suggest create a JIRA issue and attach the full log of sid 5 (if
that's possible). The log posted here does not have enough information to
analyze what happened on sid 5 during the 15 minutes when it's trying to
connect to an established quorum. Please also attach another one or two
servers log so we can cross check the quorum state for that time.

On Wed, May 10, 2017 at 12:47 PM, Ben Sherman <[hidden email]> wrote:

> Yes, synclimit is 5, initlimit is 10, ticktime is 2000.
>
> I'm less concerned with the disconnection and more concerned with the
> inability to reconnect for 15 minutes.  Networking on this host appeared to
> be fine as other services were operating nominally, as were the other 4
> servers in the ensemble.
>
> On Wed, May 10, 2017 at 12:38 PM, Jordan Zimmerman <
> [hidden email]> wrote:
>
> > Have you tried increasing syncLimit?
> >
> > ====================
> > Jordan Zimmerman
> >
> > > On May 10, 2017, at 9:29 PM, Ben Sherman <[hidden email]> wrote:
> > >
> > > I have been chasing this problem for months and I can't figure out what
> > is
> > > happening.
> > >
> > > We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as
> > well),
> > > and sometimes, a node gets a read timeout, drops all the connections
> and
> > > tries to re-establish itself to the quorum.  It can usually do this in
> a
> > > few seconds, but last night it took almost 15 minutes to reconnect.
> > >
> > > These are 5 servers in AWS, and we've tried tuning the timeouts, but
> the
> > > are exceeding any reasonable timeout and still failing.
> > >
> > > Any hints here?
> > >
> > > 2017-05-10 11:21:34,386 [myid:5] - WARN
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > > following the leader
> > > java.net.SocketTimeoutException: Read timed out
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumPacket.
> > deserialize(QuorumPacket.java:83)
> > >        at
> > > org.apache.jute.BinaryInputArchive.readRecord(
> > BinaryInputArchive.java:99)
> > >        at
> > > org.apache.zookeeper.server.quorum.Learner.readPacket(
> Learner.java:153)
> > >        at
> > > org.apache.zookeeper.server.quorum.Follower.followLeader(
> > Follower.java:85)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
> > > 2017-05-10 11:21:34,387 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown
> called
> > > java.lang.Exception: shutdown Follower
> > >        at
> > > org.apache.zookeeper.server.quorum.Follower.shutdown(
> Follower.java:166)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
> > > 2017-05-10 11:21:34,387 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown
> called
> > > 2017-05-10 11:21:34,388 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> > > socket connection for client /192.168.23.109:40666 which had sessionid
> > > 0x55bd553c076002d
> > > [snip .. zk closes all its connections]
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> > > socket connection for client /192.168.23.88:65025 which had sessionid
> > > 0x55bd553c0760765
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140]
> -
> > > Shutting down
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] -
> shutting
> > > down
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107]
> -
> > > Shutting down
> > > 2017-05-10 11:21:34,391 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] -
> Shutting
> > > down
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] -
> > > shutdown of request processor complete
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] -
> > > Shutting down
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [CommitProcessor:5:CommitProcessor@153] - CommitProcessor exited loop!
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [SyncThread:5:SyncRequestProcessor@186] - SyncRequestProcessor exited!
> > > 2017-05-10 11:21:34,392 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
> > > 2017-05-10 11:21:34,393 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> > > election. My id =  5, proposed zxid=0x4100013987
> > > 2017-05-10 11:21:34,393 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:34,393 [myid:5] - INFO
> > > [FollowerRequestProcessor:5:FollowerRequestProcessor@97] -
> > > FollowerRequestProcessor exited loop!
> > > 2017-05-10 11:21:34,593 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 400
> > > 2017-05-10 11:21:34,594 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:34,994 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 800
> > > 2017-05-10 11:21:34,994 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:35,794 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 1600
> > > 2017-05-10 11:21:35,795 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:37,395 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 3200
> > > 2017-05-10 11:21:37,395 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:40,595 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 6400
> > > 2017-05-10 11:21:40,596 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:46,996 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 12800
> > > 2017-05-10 11:21:46,996 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:21:59,796 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 25600
> > > 2017-05-10 11:21:59,797 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > ---
> > > MANY minutes later, we see things finally timeout, connections are
> > > re-established, and everything comes back:
> > > ---
> > > 2017-05-10 11:37:02,469 [myid:5] - WARN
> > > [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Connection broken
> for
> > id
> > > 3, my id = 5, error =
> > > java.net.SocketException: Connection timed out (Read failed)
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> > QuorumCnxManager.java:1013)
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [RecvWorker:3:QuorumCnxManager$RecvWorker@1031] - Interrupting
> > SendWorker
> > > 2017-05-10 11:37:02,469 [myid:5] - WARN
> > > [RecvWorker:2:QuorumCnxManager$RecvWorker@1028] - Connection broken
> for
> > id
> > > 2, my id = 5, error =
> > > java.net.SocketException: Connection timed out (Read failed)
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> > QuorumCnxManager.java:1013)
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [RecvWorker:2:QuorumCnxManager$RecvWorker@1031] - Interrupting
> > SendWorker
> > > 2017-05-10 11:37:02,469 [myid:5] - WARN
> > > [RecvWorker:4:QuorumCnxManager$RecvWorker@1028] - Connection broken
> for
> > id
> > > 4, my id = 5, error =
> > > java.net.SocketException: Connection timed out (Read failed)
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> > QuorumCnxManager.java:1013)
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [RecvWorker:4:QuorumCnxManager$RecvWorker@1031] - Interrupting
> > SendWorker
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [SendWorker:3:QuorumCnxManager$SendWorker@944] - Interrupted while
> > waiting
> > > for message on queue
> > > java.lang.InterruptedException
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> > reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> > >        at
> > > java.util.concurrent.ArrayBlockingQueue.poll(
> > ArrayBlockingQueue.java:418)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> > QuorumCnxManager.java:1097)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> > access$700(QuorumCnxManager.java:74)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> > QuorumCnxManager.java:932)
> > > 2017-05-10 11:37:02,471 [myid:5] - WARN
> > > [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving
> > thread
> > > 2017-05-10 11:37:02,470 [myid:5] - WARN
> > > [SendWorker:4:QuorumCnxManager$SendWorker@944] - Interrupted while
> > waiting
> > > for message on queue
> > > java.lang.InterruptedException
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> > reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> > >        at
> > > java.util.concurrent.ArrayBlockingQueue.poll(
> > ArrayBlockingQueue.java:418)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> > QuorumCnxManager.java:1097)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> > access$700(QuorumCnxManager.java:74)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> > QuorumCnxManager.java:932)
> > > 2017-05-10 11:37:02,471 [myid:5] - WARN
> > > [SendWorker:4:QuorumCnxManager$SendWorker@954] - Send worker leaving
> > thread
> > > 2017-05-10 11:37:02,472 [myid:5] - WARN
> > > [SendWorker:2:QuorumCnxManager$SendWorker@944] - Interrupted while
> > waiting
> > > for message on queue
> > > java.lang.InterruptedException
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> > reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> > >        at
> > > java.util.concurrent.ArrayBlockingQueue.poll(
> > ArrayBlockingQueue.java:418)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> > QuorumCnxManager.java:1097)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> > access$700(QuorumCnxManager.java:74)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> > QuorumCnxManager.java:932)
> > > 2017-05-10 11:37:02,472 [myid:5] - WARN
> > > [SendWorker:2:QuorumCnxManager$SendWorker@954] - Send worker leaving
> > thread
> > > 2017-05-10 11:37:03,493 [myid:5] - WARN
> > > [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Connection broken
> for
> > id
> > > 1, my id = 5, error =
> > > java.net.SocketException: Connection timed out (Read failed)
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.socketRead(SocketInputStream.
> > java:116)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:171)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > >        at java.io.BufferedInputStream.fill(BufferedInputStream.java:
> 246)
> > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:
> 265)
> > >        at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(
> > QuorumCnxManager.java:1013)
> > > 2017-05-10 11:37:03,493 [myid:5] - WARN
> > > [RecvWorker:1:QuorumCnxManager$RecvWorker@1031] - Interrupting
> > SendWorker
> > > 2017-05-10 11:37:03,494 [myid:5] - WARN
> > > [SendWorker:1:QuorumCnxManager$SendWorker@944] - Interrupted while
> > waiting
> > > for message on queue
> > > java.lang.InterruptedException
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.
> > reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> > >        at
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> > >        at
> > > java.util.concurrent.ArrayBlockingQueue.poll(
> > ArrayBlockingQueue.java:418)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(
> > QuorumCnxManager.java:1097)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager.
> > access$700(QuorumCnxManager.java:74)
> > >        at
> > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(
> > QuorumCnxManager.java:932)
> > > 2017-05-10 11:37:03,494 [myid:5] - WARN
> > > [SendWorker:1:QuorumCnxManager$SendWorker@954] - Send worker leaving
> > thread
> > > 2017-05-10 11:37:16,607 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > Notification time out: 60000
> > > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,611 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] -
> Created
> > > server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout
> 40000
> > > datadir /opt/data-zookeeper/version-2 snapdir
> > /opt/data-zookeeper/version-2
> > > 2017-05-10 11:37:16,613 [myid:5] - INFO
> > > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING -
> > LEADER
> > > ELECTION TOOK - 942220
> > > 2017-05-10 11:37:16,615 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> > > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> > > 2017-05-10 11:37:16,615 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> > > 2017-05-10 11:37:16,615 [myid:5] - INFO
> > > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> > > FOLLOWING (n.state), 4 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> >
>



--
Cheers
Michael.

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Follower drops out of quorum, can't reconnect

Ben Sherman-2
In reply to this post by hanm
Thanks, I added some logs and created
https://issues.apache.org/jira/browse/ZOOKEEPER-2783

I believe there is some sort of stale connection garbage happening between
the nodes on the quorum/election connections, but I have to dig in a bit
more - anyone running a ZK cluster on AWS should be seeing the same
behavior if this is the case.


On Wed, May 10, 2017 at 9:17 PM, Michael Han <[hidden email]> wrote:

> I would suggest create a JIRA issue and attach the full log of sid 5 (if
> that's possible). The log posted here does not have enough information to
> analyze what happened on sid 5 during the 15 minutes when it's trying to
> connect to an established quorum. Please also attach another one or two
> servers log so we can cross check the quorum state for that time.
>
>
Loading...