|
Hi,
We are running 3.4.2 and have a development cluster of 3 servers running on Windows and have run into some strange issues. It seems to be related to one of servers falling out of sync with the other 2 in the cluster. I'll describe the problem in more detail: Everything was working fine yesterday and then all of a sudden late in the morning the cluster seemed to get into a bad state and no clients could connect to the cluster. No changes have been made to the cluster for over a week and things had been working smoothly up until now. Looking at the servers it appeared they were all able to communicate with each other, perform leader election, but then shortly after we would get errors while waiting for messages. (Relevant snippets from the logs at various points later in the day included below) At this time no clients could connect and we would see "zookeeper sever not running" messages as well. (see logs) If we took down 2 of the 3 servers in the cluster all clients could connect and everything would function fine, but as soon as we introduce an additional server to the cluster client connections would no longer work and the errors described above would repeat. Also, no configuration that involved the out of sync server would allow the servers to properly communicate to one another, but if we only involved the 2 servers that were in sync then communication between the servers seemed to work without error. We backed up the data from the out of sync server, deleted its data, and brought back all 3 servers. This seemed to solve the problem of the servers having communication errors and the new / empty server slowly synced its data. However even now no clients could connect to the cluster if there was more than 1 server involved. We then took down the entire cluster and moved to different ports (all 3 ports) and this allowed us to connect using the command line client. The only difference now is that we have about 5 or 6 application instances that are not trying to talk to the cluster on the new ports - so the "traffic" is less than the original configuration. We did not try to update our applications to point to the new ports as we really want to understand what the problem is and how to resolve / avoid this in the future. Any help / ideas you can provide would be greatly appreciated as we are fairly stumped at this point. As an aside, we tried various configuration options (tick time, init limit, sync limit, etc) but they didn't seem to have any effect in terms of how long it would wait prior to interrupting. It also doesn't seem that our log4j configuration is being respected, but these are more minor to the larger problem described above. Thanks, -Scott 2012-03-05 16:06:01,828 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection f rom /10.17.30.160:60736 2012-03-05 16:06:01,828 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection f rom /10.17.20.145:56169 2012-03-05 16:06:01,828 [myid:3] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2012-03-05 16:06:01,828 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for clien t /10.17.30.160:60736 (no session established for client) 2012-03-05 16:06:01,828 [myid:3] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2012-03-05 16:06:01,828 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for clien t /10.17.20.145:56169 (no session established for client) 2012-03-05 16:06:01,812 [myid:3] - INFO [ zookeeper-windows3.vivisimo.com/10.17.20.164:13888:QuorumCnxManager$Listener@493] - Received conne ction request /10.17.20.142:35819 2012-03-05 16:06:01,828 [myid:3] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@762] - Connection broken for id 1, my id = 3, error = java.net.SocketException: socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(Unknown Source) at java.net.SocketInputStream.read(Unknown Source) at java.net.SocketInputStream.read(Unknown Source) at java.io.DataInputStream.readInt(Unknown Source) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747) 2012-03-05 16:06:01,828 [myid:3] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker 2012-03-05 16:06:01,859 [myid:3] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source) at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667) 2012-03-05 16:06:01,859 [myid:3] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@688] - Send worker leaving thread 2012-03-05 16:06:01,859 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 3 (n.leader), 377957122068 (n.zxid) , 1 (n.round), FOLLOWING (n.state), 1 (n.sid), 88 (n.peerEPoch), LEADING (my state) 2012-03-05 16:06:02,203 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection f rom /192.168.0.238:47529 2012-03-05 16:06:02,203 [myid:3] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2012-03-05 16:06:02,203 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for clien t /192.168.0.238:47529 (no session established for client) ================================== 2012-03-05 16:07:06,468 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for clien t /10.17.30.100:61367 which had sessionid 0x335e4af5c7b0007 2012-03-05 16:07:06,468 [myid:3] - INFO [CommitProcessor:3:ZooKeeperServer@596] - Established session 0x335e4af5c7b0008 with negotiated tim eout 15000 for client /192.168.0.238:47643 2012-03-05 16:07:06,468 [myid:3] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x335e4af5c7b0008, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224) at java.lang.Thread.run(Unknown Source) 2012-03-05 16:07:06,468 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for clien t /192.168.0.238:47643 which had sessionid 0x335e4af5c7b0008 2012-03-05 16:07:06,468 [myid:3] - INFO [CommitProcessor:3:ZooKeeperServer@596] - Established session 0x335e4af5c7b0009 with negotiated tim eout 15000 for client /10.17.30.100:61372 2012-03-05 16:07:06,468 [myid:3] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x335e4af5c7b0009, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224) at java.lang.Thread.run(Unknown Source) ==================================== 2012-03-05 16:05:50,234 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection f rom /10.17.20.145:56160 2012-03-05 16:05:50,234 [myid:3] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2012-03-05 16:05:50,234 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for clien t /10.17.20.145:56160 (no session established for client) 2012-03-05 16:05:50,421 [myid:3] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source) at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667) 2012-03-05 16:05:50,437 [myid:3] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@688] - Send worker leaving thread 2012-03-05 16:05:50,437 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 3 (n.leader), 377957122068 (n.zxid) , 1 (n.round), FOLLOWING (n.state), 1 (n.sid), 88 (n.peerEPoch), LEADING (my state) 2012-03-05 16:05:50,640 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection f rom /10.17.30.160:60716 2012-03-05 16:05:50,640 [myid:3] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2012-03-05 16:05:50,640 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for clien t /10.17.30.160:60716 (no session established for client) 2012-03-05 16:05:50,703 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection f rom /10.17.30.100:61281 2012-03-05 16:05:50,703 [myid:3] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2012-03-05 16:05:50,703 [myid:3] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for clien t /10.17.30.100:61281 (no session established for client) ======= |
|
OK. Trying again against the correct thread.
I trimmed down your summary to something that jumped out at me. If you have three ZK servers in a cluster, then taking 2 down should not allow normal operation. I think you have a config error. On Tue, Mar 6, 2012 at 8:15 AM, Scott Lindner <[hidden email]>wrote: > ... > If we took down 2 of the 3 servers in the cluster all clients could connect > and everything would function fine, but as soon as we introduce an > additional server to the cluster client connections would no longer work > and the errors described above would repeat. Also, no configuration that > involved the out of sync server would allow the servers to properly > communicate to one another, but if we only involved the 2 servers that were > in sync then communication between the servers seemed to work without > error. > ... |
|
Hi Ted,
Sorry about that - I should have been more specific. When I said "taking 2 down" I actually meant taking the entire cluster down and reconfiguring it to only have a single server. So to put it another way, when it is actually set up as a cluster the failures occurred however when it was configured to run with only a single server it worked fine. Also when it was configured to only include the 2 servers that were in sync that also worked fine. However any configurations involving the out of sync server failed to work properly. Any other ideas / suggestions? Thanks again, -Scott On Tue, Mar 6, 2012 at 12:49 PM, Ted Dunning <[hidden email]> wrote: > OK. Trying again against the correct thread. > > I trimmed down your summary to something that jumped out at me. If you > have three ZK servers in a cluster, then taking 2 down should not allow > normal operation. > > I think you have a config error. > > On Tue, Mar 6, 2012 at 8:15 AM, Scott Lindner <[hidden email] > >wrote: > > > ... > > If we took down 2 of the 3 servers in the cluster all clients could > connect > > and everything would function fine, but as soon as we introduce an > > additional server to the cluster client connections would no longer work > > and the errors described above would repeat. Also, no configuration that > > involved the out of sync server would allow the servers to properly > > communicate to one another, but if we only involved the 2 servers that > were > > in sync then communication between the servers seemed to work without > > error. > > ... > |
|
Hi Scott,
Even I am facing the same issue in one of our projects here. We use a cluster of 3 zookeeper. Suddenly the client was not able to connect to cluster and zookeeper logs showing "Zookeeper Server not running" message. We moved from cluster to stand alone server, then it started working as expected. Any solution for the above mentioned issue, it would be really helpful. ?? Regards, Ramachandra Bhatt |
|
In reply to this post by Scott Lindner-2
There's 3 ports that matter to Zookeeper and all 3 must be different.
1) Client port - where the clients connect. 2) Leader port - the other servers connect to the leader on this port - only listens while the server is elected leader. 3) Quorum port - used for leader election and voting and things like that between the Zookeeper servers to keep their data consistent. If My config looks like this : clientPort=2181 server.1=server1:3555:4555 server.2=server2:3555:4555 server.3=server3:3555:4555 Then that means each server needs 2181 open for clients, 3555 open for when that server is the leader and 4555 open for all quorum communications. Your errors look like you have the leader port and quorum ports switched or firewalling in between. It sounds like no quorum is achieved and that's why Zookeeper is not functioning. If you have 3 servers in your ensemble, achieving quorum requires 2 of them to be up and communicating with each other over the proper ports. In other words, make sure it doesn't look like this: clientPort=2181 server.1=server1:2181:2181 <- WRONG Anything where the ports are shared between functions or otherwise unable to communicate with each other will break the ensemble. Thanks, Greg |
| Powered by Nabble | Edit this page |
