Unable to connect node to ensemble after restart of node zookeeper 3.4.6


hkwan
I have a 3-node ensemble in production, and after restarting one node it can
no longer connect to the ensemble. I am getting the errors below:

2018-01-10 00:49:32,492 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 2)
2018-01-10 00:50:20,342 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@780] - Connection broken for id 1, my id = 2, error =
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:197)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.net.SocketInputStream.read(SocketInputStream.java:211)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
2018-01-10 00:50:20,343 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
2018-01-10 00:50:20,343 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
2018-01-10 00:50:20,343 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@706] - Send worker leaving thread
2018-01-10 00:50:32,491 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
2018-01-10 00:50:32,493 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 2 (n.leader), 0x707e3e9a9 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x7 (n.peerEpoch) LOOKING (my state)
2018-01-10 00:50:32,495 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 2)
2018-01-10 00:51:32,494 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
2018-01-10 00:51:32,494 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 2 (n.leader), 0x707e3e9a9 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x7 (n.peerEpoch) LOOKING (my state)
2018-01-10 00:51:32,496 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 2)
2018-01-10 00:52:19,126 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@780] - Connection broken for id 1, my id = 2, error =
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:197)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.net.SocketInputStream.read(SocketInputStream.java:211)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
2018-01-10 00:52:19,127 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
2018-01-10 00:52:19,127 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
2018-01-10 00:52:19,128 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@706] - Send worker leaving thread
2018-01-10 00:52:32,495 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
2018-01-10 00:52:32,497 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 2 (n.leader), 0x707e3e9a9 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x7 (n.peerEpoch) LOOKING (my state)
2018-01-10 00:52:32,499 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@193] - Have smaller server identifier, so dropping the connection: (3, 2)
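
For reading the notification lines in the log: a ZooKeeper zxid is a 64-bit number whose high 32 bits hold the epoch and whose low 32 bits hold a counter. The sketch below splits the n.zxid value from the log; the function name split_zxid is hypothetical and used only for illustration.

```python
def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter)."""
    epoch = zxid >> 32            # high 32 bits
    counter = zxid & 0xFFFFFFFF   # low 32 bits
    return epoch, counter


epoch, counter = split_zxid(0x707e3e9a9)
print("epoch=0x%x counter=0x%x" % (epoch, counter))
# prints: epoch=0x7 counter=0x7e3e9a9
```

The epoch part, 0x7, matches the n.peerEpoch value reported in the same notification.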


My configuration on all three servers is:

clientPort=2181
dataDir=/var/opt/zookeeper/data
tickTime=2000
autopurge.purgeInterval=24
initLimit=10
syncLimit=5
server.1=10.1.0.122:2888:3888
server.2=10.1.1.75:2888:3888
server.3=10.1.2.221:2888:3888

Server 3 is currently the leader.
Server 1 is currently a follower.
Server 2 currently cannot rejoin the ensemble.
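
One way to confirm each server's role is the four-letter command srvr, sent to the client port. The sketch below is a minimal illustration that assumes four-letter commands are reachable on clientPort 2181 (on 3.4.6 they should be); the function names parse_mode and query_mode are hypothetical.

```python
import socket


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line from srvr output, or None."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None  # e.g. the server is not currently serving requests


def query_mode(host, port=2181, timeout=5.0):
    """Send 'srvr' to host:port and return the reported mode, or None."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the socket after replying
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))


# Example (addresses taken from the configuration above):
#   for host in ("10.1.0.122", "10.1.1.75", "10.1.2.221"):
#       print(host, query_mode(host))
```

A server that is still in the LOOKING state would be expected to return no Mode line, so query_mode gives None for it.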

The myid files are correctly configured on all three servers. This is a
production cluster, so I would like to know whether there is a way to force the
node back into the cluster without doing anything drastic that would cause
quorum to be lost.





--
Sent from: http://zookeeper-user.578899.n2.nabble.com/

Re: Unable to connect node to ensemble after restart of node zookeeper 3.4.6

Andor Molnar
Hi hkwan,

java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:197)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.net.SocketInputStream.read(SocketInputStream.java:211)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)

This looks like a network issue to me.
Have you tried connecting a client from server 2 to the leader?

Regards,
Andor
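
A client connection only exercises clientPort 2181, while quorum traffic uses the two ports from the server.N lines (2888 for followers connecting to the leader, 3888 for leader election). The sketch below is one way to test plain TCP reachability of those ports; the function names can_connect and check_ports are hypothetical, and a successful connect only shows that the port accepts connections, not that the quorum protocol works.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(hosts, ports, timeout=3.0):
    """Map every (host, port) combination to its reachability."""
    return {(h, p): can_connect(h, p, timeout) for h in hosts for p in ports}


# Example, run from server 2 (addresses taken from the posted configuration):
#   print(check_ports(["10.1.0.122", "10.1.2.221"], [2181, 2888, 3888]))
```

Port 2888 is normally open only on the current leader, so a refusal on the follower is expected. It is also worth running the same check from servers 1 and 3 against 10.1.1.75:3888, since election connections are opened in both directions.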




Re: Unable to connect node to ensemble after restart of node zookeeper 3.4.6

hkwan
I am able to connect from server 2 to the leader using zkCli.sh.




Re: Unable to connect node to ensemble after restart of node zookeeper 3.4.6

Andor Molnar
[WorkerSender[myid=2]:QuorumCnxManager@193] - Have smaller server identifier,
so dropping the connection: (3, 2)

Sorry, the peer explicitly drops the connection during leader election for
some weird reason. The error message indicates that it believes that
another connection is already established between the two servers.
Not sure why.

Andor
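
A possible reading of this log line, based on the 3.4.x QuorumCnxManager source and not confirmed anywhere in this thread: when two peers open election connections to each other, only the connection initiated by the server with the larger id is kept. A server that connects to a peer with a larger id logs this message, closes its socket, and waits for the peer to connect back. The sketch below models that assumed rule; the function names are hypothetical and this is not ZooKeeper code.

```python
def initiator_keeps_connection(my_sid, remote_sid):
    """Assumed tie-break: the initiator keeps the socket only if its id is larger."""
    return remote_sid < my_sid


def describe_attempt(my_sid, remote_sid):
    """Describe what the initiating server does under the assumed rule."""
    if initiator_keeps_connection(my_sid, remote_sid):
        return "server %d keeps its connection to server %d" % (my_sid, remote_sid)
    return "server %d drops its connection and waits for server %d to connect back" % (
        my_sid,
        remote_sid,
    )


# The pair (3, 2) in the log is (remote id, my id):
print(describe_attempt(my_sid=2, remote_sid=3))
```

Under this reading the message itself is routine, and the open question becomes whether server 3 ever connects back to server 2 on the election port (3888 in the posted configuration).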




Re: Unable to connect node to ensemble after restart of node zookeeper 3.4.6

Andor Molnar
It's very likely related to this:

https://issues.apache.org/jira/browse/ZOOKEEPER-2938




Re: Unable to connect node to ensemble after restart of node zookeeper 3.4.6

hkwan
At this point would it be safe to say that I would have to reboot the entire
ensemble to get that node back into the cluster?




Re: Unable to connect node to ensemble after restart of node zookeeper 3.4.6

Andor Molnar
Yes, you shouldn't experience any data loss, only a short outage in your
service until clients are able to reconnect to the ensemble.
Unfortunately I'm unable to reproduce your issue with ZK 3.4.6. Is there
anything specific about your ensemble that I need to know? Your config looks
quite ordinary.
Are you running on bare metal or a cloud provider?

Andor



Re: Unable to connect node to ensemble after restart of node zookeeper 3.4.6

hkwan
I am running it on AWS EC2 instances. It is a pretty standard setup; the
only thing of note is that it has been running for a year without any need
for a reboot and without any issues. This is the first time I've tried to
reboot one of the nodes.


