Ensemble fails when one node loses connectivity


Ensemble fails when one node loses connectivity

Jim Keeney
I'm using ZooKeeper with Solr to create a cluster, and I have come across what seems like unexpected behavior. The cluster is set up on AWS using OpsWorks. I am using a 3-node ZooKeeper ensemble. The ZooKeeper config on all three nodes is:

clientPort=2181
dataDir=/var/opt/zookeeper/data
tickTime=2000
autopurge.purgeInterval=24
initLimit=100
syncLimit=5
server.1=172.31.86.130:2888:3888
server.2=172.31.16.234:2888:3888
server.3=172.31.73.122:2888:3888
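For reference, a quick sketch of the timing windows these settings imply (following the ZooKeeper admin guide's definition of initLimit and syncLimit as multiples of tickTime):

```python
# Timing windows implied by the zoo.cfg above (values taken from the post).
tick_time_ms = 2000     # tickTime: basic time unit, in milliseconds
init_limit_ticks = 100  # initLimit: ticks a follower may take to connect and sync
sync_limit_ticks = 5    # syncLimit: ticks a follower may lag before being dropped

init_window_s = tick_time_ms * init_limit_ticks / 1000  # window for initial sync
sync_window_s = tick_time_ms * sync_limit_ticks / 1000  # window for ongoing sync
print(init_window_s, sync_window_s)
```

So with this config a restarting follower has a 200-second window to finish its initial sync with the leader, and a 10-second window for ongoing sync.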



Here is the issue: 

If one node in the ensemble fails or is shut down, the ensemble carries on. However, when the node is restarted, its attempts to connect to the other members of the cluster are rejected. The only way I have found to restore the ensemble is to restart all of the nodes within a short time span of each other.

If I do that, they are able to discover each other, carry out a proper leader election, and restore order.

Once they are restored everything is fine, but if one of the nodes goes down we are faced with the same problem.

How do I ensure that if a node goes down, it can restart and rejoin the ensemble without having to manually restart all the other nodes?

Any help appreciated.

Thanks. 

Jim K. 




--
Jim Keeney
President, FitterWeb
M: 703-568-5887

FitterWeb Consulting
Are you lean and agile enough? 

Attachment: zookeeper-logs.zip (222K)

Re: Ensemble fails when one node loses connectivity

svanschalkwyk
Does the log say anything about timing out on init?
Your initLimit is already pretty big, but then we don't know anything about
your setup.
Are you storing more than 1MB in a znode? Then increase jute.maxbuffer (in
java.env, as -Djute.maxbuffer=xxxxxx).
I've recently run into that with Fusion 3.1.
Post more details, if you would.
Good luck.
Steph



Re: Ensemble fails when one node loses connectivity

Jim Keeney
Thanks. Yes, I have about 2MB stored in the configuration folders. I will
increase jute.maxbuffer and see if that helps.

Jim K.


Re: Ensemble fails when one node loses connectivity

Jim Keeney
Steph -

I read about the maxbuffer and am pretty sure that this might explain the
behavior we are seeing, since it occurs when there has been a significant
reboot of all the servers. We have over 2 MB of config files for all of our
indexes, and if all the Solr nodes are syncing their configs at once, it
seems like that might overflow the buffer.

Newbie question: where would I set -Djute.maxbuffer? Should I update
the zkServer.sh file so this is applied every time ZooKeeper is started or
restarted?

Also, I noted the caution and will make sure that all of the nodes are set
to the same value. I saw some discussion about having to change the zkCli
settings to be larger than those of the server. Is that true?

Thanks in advance.

Jim K.


Re: Ensemble fails when one node loses connectivity

Shawn Heisey
On 3/1/2018 7:59 PM, Jim Keeney wrote:
> I read about the maxbuffer and am pretty sure that this might explain the
> behavior we are seeing, since it occurs when there has been a significant
> reboot of all the servers. We have over 2 MB of config files for all of our
> indexes, and if all the Solr nodes are syncing their configs at once, it
> seems like that might overflow the buffer.

You probably recognize me from the Solr side.  Hello again.  I do know
enough to handle this part, so I'm answering. I didn't consider the
maxbuffer setting, because I didn't see anything about large packets in
the logs you shared on the Solr mailing list, and it's very rare for
Solr users to need to increase it.

You only need to worry about the maxbuffer if any single part of the
config in ZK (what is called a "znode") is over 1MB. Each file in the
configs that you upload will go into its own znode.  So if none of the
individual files in your configs is really large, you probably won't
need to set jute.maxbuffer.
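A quick way to act on this is to check whether any single file in the config directory crosses the 1 MB default znode limit. A minimal sketch (the demo directory and file names are invented for illustration):

```python
import os
import tempfile

ZNODE_LIMIT = 1024 * 1024  # ZooKeeper's default jute.maxbuffer: 1 MB per znode


def oversized_files(conf_dir, limit=ZNODE_LIMIT):
    """Return (path, size) for every file that would exceed the znode limit."""
    hits = []
    for root, _dirs, files in os.walk(conf_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size > limit:
                hits.append((path, size))
    return hits


# Demo against a throwaway directory standing in for a Solr configset.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "schema.xml"), "wb") as f:
    f.write(b"x" * 50_000)             # well under the limit: fine as-is
with open(os.path.join(demo, "synonyms.txt"), "wb") as f:
    f.write(b"y" * (2 * 1024 * 1024))  # 2 MB: would need jute.maxbuffer raised
print(oversized_files(demo))
```

If this reports nothing for your real configset, no single znode exceeds the default, which matches Shawn's point that jute.maxbuffer probably isn't the culprit.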

As for the other things that Solr puts in ZK:  Unless you have a REALLY
huge cluster (tons of collections, shards, replicas, servers, etc) then
that information should be quite small.

> Newbie question: where would I set -Djute.maxbuffer? Should I update
> the zkServer.sh file so this is applied every time ZooKeeper is started or
> restarted?

If jute.maxbuffer is needed, it must be set on the startup options for
every ZK server and every client that will access large znodes.  Which
means all your ZK servers, all your Solr servers, and any invocations of
things like the scripts Solr includes for uploading configs.

Thanks,
Shawn


Re: Ensemble fails when one node loses connectivity

svanschalkwyk
In reply to this post by Jim Keeney
Hi Jim

You set it in the java.env file in /opt/zookeeper/conf.

JVMFLAGS=" -Xmx4g -Djute.maxbuffer=2147483648"

The example above is for 2GB, so please change the size. :) In this case
(-Xmx4g) the ZK node was running on an 8GB VM.
And yes, make sure that you do that on all the servers.

Here is one reference to it:
https://community.cloudera.com/t5/Storage-Random-Access-HDFS/zookeeper-error-Unexpected-exception-causing-shutdown-while-sock/td-p/30914

If you need more debug information, you can add logging level as well:
-Dzookeeper.log.threshold=INFO

for example: JVMFLAGS=" -Xmx4g  -Djute.maxbuffer=2147483648
-Dzookeeper.log.threshold=DEBUG"

Good luck! I hope this works.
Steph




Re: Ensemble fails when one node loses connectivity

Jim Keeney
In reply to this post by Shawn Heisey
Shawn -

Thanks for jumping in on the ZK side as well.

I will take a hard look at my config files, but I checked and I do not have
any one file over 1MB. The combined files (10 indexes) total 2.2MB.

I am using micros for the nodes, which are very limited in memory.

I'm not currently using a java.env file, so I guess I'm using the default
values for the JVM, which is typically -Xmx512M if I remember correctly.

Could it be just a memory issue?

Jim K.


Re: Ensemble fails when one node loses connectivity

Shawn Heisey
On 3/2/2018 6:54 AM, Jim Keeney wrote:

> Thanks for jumping in on the ZK side as well.
>
> I will take a hard look at my config files, but I checked and I do not have
> any one file over 1MB. The combined files (10 indexes) total 2.2MB.
>
> I am using micros for the nodes, which are very limited in memory.
>
> I'm not currently using a java.env file, so I guess I'm using the default
> values for the JVM, which is typically -Xmx512M if I remember correctly.
>
> Could it be just a memory issue?

Usually Java on Linux has a default heap size of about 4GB.  But it
would be highly dependent on the amount of memory actually present on
the machine.  Just yesterday, I saw Java report a 6GB default heap size,
on a machine with 24GB of memory. Information I can find about AWS
instance types says that a micro instance has 1GB of memory.  So the
default heap size is probably quite small.

Even in small server situations, I would strongly recommend that anytime
you have a java commandline, you define -Xmx for the max heap, and -Xms
should probably be set as well, to the same value as -Xmx.  That way
you're not relying on defaults, you're absolutely sure what the heap
size is.
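A sketch of that advice as a java.env fragment (the 512m figure is an assumption for a 1 GB micro instance, not a tested recommendation):

```shell
# conf/java.env -- sourced by zkServer.sh at ZooKeeper startup.
# Pin min and max heap to the same value so the heap size is explicit
# rather than left to JVM defaults.
JVMFLAGS="-Xms512m -Xmx512m"
```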

For ZK servers handling 2 megabytes of config data plus the rest of a
small SolrCloud install, something like 256MB or 512MB of heap would
probably be plenty.  ZK holds a copy of its entire database in memory. 
Small SolrCloud installs won't put much of a load on ZK.  A micro
instance should be plenty for ZK when the software using it is Solr, as
long as that's the only thing it's running.

Thanks,
Shawn


Re: Ensemble fails when one node loses connectivity

svanschalkwyk
If this is a t2.micro on AWS, then it has 1GB of RAM.





Re: Ensemble fails when one node loses connectivity

Jim Keeney
In reply to this post by Shawn Heisey-2
Thanks again, Shawn and Steph. That would tend to rule out the maxbuffer
and heap-size issues.

I'll double-check Java and explicitly set the -Xmx and -Xms settings.

I think the next step is to try to get more information on what is
happening.

I'll play with log settings and see if I can get more information.

The good thing is that I'm pretty sure I can reproduce the behavior.

Thanks.

Jim K.




