Zookeeper c client disconnecting after hours of runtime

Andrew Jorgensen
I am going to try to provide as much information as possible, but it might
be a bit sparse because I am still trying to get a grip on exactly what I'm
seeing with the C client.

ZooKeeper client version: 3.4.5
ZooKeeper server version: 3.4.10
5-node ZooKeeper cluster

The workflow is essentially this: a long-lived process creates an
ephemeral node with some data, which is read by a number of other
processes on separate machines. Standard cluster coordination stuff. The
issue I am seeing is that after about 7-9 hours of runtime, ZooKeeper
expires the client session because it has reached the 30-second session
timeout. On the client side, I've confirmed that there are no calls to the
watcher function or the context supplied to zookeeper_init. The long-lived
process does other work during its runtime, but after creating the
ephemeral node at the beginning, its only interaction with ZooKeeper is
via callback events and a pipe.

One other data point: I created an event loop that uses the same client
that created the ephemeral node to read the node's data every 60 seconds
and log it. While this event loop is running, I do not observe the client
session expiring at all, even after 14 hours of runtime.
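
To make the timing concrete, here is a rough sketch of the arithmetic as I
understand it. The one-third and two-thirds fractions are my reading of how
the 3.4 C client schedules pings and read deadlines, and the helper names
are made up for illustration, so treat all of it as an assumption rather
than a statement of what the client actually does.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; not part of the ZooKeeper API. */

/* Assumption: an idle, connected client sends a ping once a third of the
 * session timeout has passed without it sending anything. */
static int ping_interval_ms(int session_timeout_ms)
{
    assert(session_timeout_ms > 0);
    return session_timeout_ms / 3;
}

/* Assumption: the client gives up on the connection when it has received
 * nothing for two thirds of the session timeout. */
static int recv_deadline_ms(int session_timeout_ms)
{
    assert(session_timeout_ms > 0);
    return session_timeout_ms * 2 / 3;
}

/* The server expires a session once it has heard nothing from the client
 * for the full session timeout. Returns 1 if a silent gap of the given
 * length would be long enough for that to happen, 0 otherwise. */
static int gap_expires_session(int session_timeout_ms, int silent_gap_ms)
{
    return silent_gap_ms >= session_timeout_ms;
}
```

With the 30 second timeout, ping_interval_ms(30000) is 10000 and
recv_deadline_ms(30000) is 20000. gap_expires_session(30000, 60000) is 1,
so if this reading is right, reads every 60 seconds could not keep the
session alive on their own without pings going out in between.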

I am not sure how to explain the client disconnecting without any message
to either the callback function or the context. I am also not sure how to
explain why this happens only after many hours of running without issue.

If anyone has seen something similar, how did you go about fixing it? Any
ideas on how to debug this issue would also be very helpful.

Thanks!
Andrew Jorgensen
@ajorgensen

Re: Zookeeper c client disconnecting after hours of runtime

Andrew Jorgensen
One other piece of information that might be helpful: if I run "lsof -n -P"
against the process, I can see 3 entries for connections to ZooKeeper on
port 2181 in the ESTABLISHED state. However, if I look at a node whose
session has timed out, the connection appears to be in the CLOSE_WAIT
state. Is there a clean way to recover from this?
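
For reference, CLOSE_WAIT means the remote side has closed its end and the
local process has not yet closed the socket. Below is a minimal sketch of
how a process can detect that condition on a socket it owns. It uses a
local socketpair rather than a real ZooKeeper connection, peer_has_closed
and demo_peer_close are made-up names, and MSG_DONTWAIT is assumed to be
available (it is on Linux). Whether the C client exposes its socket for a
check like this depends on how it is used, so this is only a diagnostic
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the peer has closed its end of the
 * stream socket, 0 if the connection still looks open, -1 on error. */
static int peer_has_closed(int fd)
{
    char byte;
    /* Peek without blocking so no pending data is consumed. */
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);

    if (n == 0)
        return 1;   /* orderly shutdown by the peer */
    if (n > 0)
        return 0;   /* data is waiting, so the peer is still there */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;   /* nothing to read, but the connection is open */
    return -1;      /* some other socket error */
}

/* Hypothetical demo: returns 0 if the check reports "open" before the
 * peer closes and "closed" afterwards, 1 if not, -1 on setup failure. */
static int demo_peer_close(void)
{
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    int before = peer_has_closed(sv[0]);
    close(sv[1]);   /* the "server" side goes away */
    int after = peer_has_closed(sv[0]);
    close(sv[0]);   /* closing our end is what leaves CLOSE_WAIT */

    return (before == 0 && after == 1) ? 0 : 1;
}
```

The check peeks instead of reading so that it can be run alongside normal
traffic without swallowing any bytes that belong to the protocol.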

Andrew Jorgensen
@ajorgensen

In reply to the message of Wed, May 3, 2017 at 11:32 PM from Andrew Jorgensen.
