Quantcast

Zookeeper cluster failing

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Zookeeper cluster failing

Scott Lindner-2
Hi,

We are running 3.4.2 and have a development cluster of 3 servers running on
Windows and have run into some strange issues.  It seems to be related to
one of servers falling out of sync with the other 2 in the cluster.  I'll
describe the problem in more detail:

Everything was working fine yesterday and then all of a sudden late in the
morning the cluster seemed to get into a bad state and no clients could
connect to the cluster.  No changes have been made to the cluster for over
a week and things had been working smoothly up until now.  Looking at the
servers it appeared they were all able to communicate with each other,
perform leader election, but then shortly after we would get errors while
waiting for messages.  (Relevant snippets from the logs at various points
later in the day included below)

At this time no clients could connect and we would see "zookeeper sever not
running" messages as well.  (see logs)

If we took down 2 of the 3 servers in the cluster all clients could connect
and everything would function fine, but as soon as we introduce an
additional server to the cluster client connections would no longer work
and the errors described above would repeat.  Also, no configuration that
involved the out of sync server would allow the servers to properly
communicate to one another, but if we only involved the 2 servers that were
in sync then communication between the servers seemed to work without error.

We backed up the data from the out of sync server, deleted its data, and
brought back all 3 servers.  This seemed to solve the problem of the
servers having communication errors and the new / empty server slowly
synced its data. However even now no clients could connect to the cluster
if there was more than 1 server involved.

We then took down the entire cluster and moved to different ports (all 3
ports) and this allowed us to connect using the command line client.  The
only difference now is that we have about 5 or 6 application instances that
are not trying to talk to the cluster on the new ports - so the "traffic"
is less than the original configuration.  We did not try to update our
applications to point to the new ports as we really want to understand what
the problem is and how to resolve / avoid this in the future.

Any help / ideas you can provide would be greatly appreciated as we are
fairly stumped at this point.

As an aside, we tried various configuration options (tick time, init limit,
sync limit, etc) but they didn't seem to have any effect in terms of how
long it would wait prior to interrupting. It also doesn't seem that our
log4j configuration is being respected, but these are more minor to the
larger problem described above.

Thanks,
-Scott


2012-03-05 16:06:01,828 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.30.160:60736
2012-03-05 16:06:01,828 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.20.145:56169
2012-03-05 16:06:01,828 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:06:01,828 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.30.160:60736 (no session established for client)
2012-03-05 16:06:01,828 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:06:01,828 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.20.145:56169 (no session established for client)
2012-03-05 16:06:01,812 [myid:3] - INFO  [
zookeeper-windows3.vivisimo.com/10.17.20.164:13888:QuorumCnxManager$Listener@493]
- Received conne
ction request /10.17.20.142:35819
2012-03-05 16:06:01,828 [myid:3] - WARN
 [RecvWorker:1:QuorumCnxManager$RecvWorker@762] - Connection broken for id
1, my id = 3, error =
java.net.SocketException: socket closed
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.io.DataInputStream.readInt(Unknown Source)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
2012-03-05 16:06:01,828 [myid:3] - WARN
 [RecvWorker:1:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
2012-03-05 16:06:01,859 [myid:3] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@679] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown
Source)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown
Source)
        at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
2012-03-05 16:06:01,859 [myid:3] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@688] - Send worker leaving thread
2012-03-05 16:06:01,859 [myid:3] - INFO
 [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 3
(n.leader), 377957122068 (n.zxid)
, 1 (n.round), FOLLOWING (n.state), 1 (n.sid), 88 (n.peerEPoch), LEADING
(my state)
2012-03-05 16:06:02,203 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /192.168.0.238:47529
2012-03-05 16:06:02,203 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:06:02,203 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /192.168.0.238:47529 (no session established for client)



==================================





2012-03-05 16:07:06,468 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.30.100:61367 which had sessionid 0x335e4af5c7b0007
2012-03-05 16:07:06,468 [myid:3] - INFO
 [CommitProcessor:3:ZooKeeperServer@596] - Established session
0x335e4af5c7b0008 with negotiated tim
eout 15000 for client /192.168.0.238:47643
2012-03-05 16:07:06,468 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x335e4af5c7b0008, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
        at java.lang.Thread.run(Unknown Source)
2012-03-05 16:07:06,468 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /192.168.0.238:47643 which had sessionid 0x335e4af5c7b0008
2012-03-05 16:07:06,468 [myid:3] - INFO
 [CommitProcessor:3:ZooKeeperServer@596] - Established session
0x335e4af5c7b0009 with negotiated tim
eout 15000 for client /10.17.30.100:61372
2012-03-05 16:07:06,468 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x335e4af5c7b0009, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
        at java.lang.Thread.run(Unknown Source)






====================================




2012-03-05 16:05:50,234 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.20.145:56160
2012-03-05 16:05:50,234 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:05:50,234 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.20.145:56160 (no session established for client)
2012-03-05 16:05:50,421 [myid:3] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@679] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown
Source)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown
Source)
        at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
2012-03-05 16:05:50,437 [myid:3] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@688] - Send worker leaving thread
2012-03-05 16:05:50,437 [myid:3] - INFO
 [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 3
(n.leader), 377957122068 (n.zxid)
, 1 (n.round), FOLLOWING (n.state), 1 (n.sid), 88 (n.peerEPoch), LEADING
(my state)
2012-03-05 16:05:50,640 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.30.160:60716
2012-03-05 16:05:50,640 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:05:50,640 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.30.160:60716 (no session established for client)
2012-03-05 16:05:50,703 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection
f
rom /10.17.30.100:61281
2012-03-05 16:05:50,703 [myid:3] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session
0x0 due to java.io.IOException: ZooKeeperServer not running
2012-03-05 16:05:50,703 [myid:3] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
clien
t /10.17.30.100:61281 (no session established for client)

=======
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Zookeeper cluster failing

Ted Dunning
OK.  Trying again against the correct thread.

I trimmed down your summary to something that jumped out at me.  If you
have three ZK servers in a cluster, then taking 2 down should not allow
normal operation.

I think you have a config error.

On Tue, Mar 6, 2012 at 8:15 AM, Scott Lindner <[hidden email]>wrote:

> ...
> If we took down 2 of the 3 servers in the cluster all clients could connect
> and everything would function fine, but as soon as we introduce an
> additional server to the cluster client connections would no longer work
> and the errors described above would repeat.  Also, no configuration that
> involved the out of sync server would allow the servers to properly
> communicate to one another, but if we only involved the 2 servers that were
> in sync then communication between the servers seemed to work without
> error.
> ...
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Zookeeper cluster failing

Scott Lindner
Hi Ted,

Sorry about that - I should have been more specific.  When I said "taking 2
down" I actually meant taking the entire cluster down and reconfiguring it
to only have a single server.  So to put it another way, when it is
actually set up as a cluster the failures occurred however when it was
configured to run with only a single server it worked fine.  Also when it
was configured to only include the 2 servers that were in sync that also
worked fine.  However any configurations involving the out of sync server
failed to work properly.

Any other ideas / suggestions?

Thanks again,
-Scott

On Tue, Mar 6, 2012 at 12:49 PM, Ted Dunning <[hidden email]> wrote:

> OK.  Trying again against the correct thread.
>
> I trimmed down your summary to something that jumped out at me.  If you
> have three ZK servers in a cluster, then taking 2 down should not allow
> normal operation.
>
> I think you have a config error.
>
> On Tue, Mar 6, 2012 at 8:15 AM, Scott Lindner <[hidden email]
> >wrote:
>
> > ...
> > If we took down 2 of the 3 servers in the cluster all clients could
> connect
> > and everything would function fine, but as soon as we introduce an
> > additional server to the cluster client connections would no longer work
> > and the errors described above would repeat.  Also, no configuration that
> > involved the out of sync server would allow the servers to properly
> > communicate to one another, but if we only involved the 2 servers that
> were
> > in sync then communication between the servers seemed to work without
> > error.
> > ...
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Zookeeper cluster failing

ramachandra
Hi Scott,

Even I am facing the same issue in one of our projects here. We use a
cluster  of 3 zookeeper. Suddenly the client was not able to connect to
cluster and zookeeper logs showing "Zookeeper Server not running" message.
We moved from cluster to stand alone server, then it started working as
expected.

Any solution for the above mentioned issue, it would be really helpful. ??


Regards,
Ramachandra Bhatt
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Zookeeper cluster failing

dcgregorya
In reply to this post by Scott Lindner-2
There's 3 ports that matter to Zookeeper and all 3 must be different.

1) Client port - where the clients connect.
2) Leader port - the other servers connect to the leader on this port - only listens while the server is elected leader.
3) Quorum port - used for leader election and voting and things like that between the Zookeeper servers to keep their data consistent.

If My config looks like this :

clientPort=2181
server.1=server1:3555:4555
server.2=server2:3555:4555
server.3=server3:3555:4555

Then that means each server needs 2181 open for clients, 3555 open for when that server is the leader and 4555 open for all quorum communications.  Your errors look like you have the leader port and quorum ports switched or firewalling in between.  It sounds like no quorum is achieved and that's why Zookeeper is not functioning.  If you have 3 servers in your ensemble, achieving quorum requires 2 of them to be up and communicating with each other over the proper ports.

In other words, make sure it doesn't look like this:

clientPort=2181
server.1=server1:2181:2181 <- WRONG

Anything where the ports are shared between functions or otherwise unable to communicate with each other will break the ensemble.

Thanks,
Greg
Loading...