I feel fairly certain that this thread will be an annoyance.  I don't
know enough about zookeeper to answer the questions that are being
asked, so I apologize about needing to relay questions about ZK fault
tolerance in two datacenters.

It seems that everyone wants to avoid the expense of a tie-breaker ZK VM
in a third datacenter.

The scenario, which this list has seen over and over:

DC1 - three ZK servers, one or more Solr servers.
DC2 - two ZK servers, one or more Solr servers.

I've already explained that if DC2 goes down, everything's fine, but if
DC1 goes down, Solr goes read-only, and there's no way to prevent that.

The conversation went further, and I'm sure you guys have seen this
before too: "Is there any way we can get DC2 back to operational with
manual intervention if DC1 goes down?"  I explained that any manual
intervention would briefly take Solr down ... at which point the
following proposal was mentioned:

Add an observer node to DC2, and in the event DC1 goes down, run a
script that reconfigures all the ZK servers to change the observer to a
voting member and does rolling restarts.

Will their proposal work?  What happens when DC1 comes back online?  As
you know, DC1 will contain a partial ensemble that still has quorum,
about to rejoin what it THINKS is a partial ensemble *without* quorum,
which is not what it will find.  I'm guessing that ZK assumes the
question of who has the "real" quorum shouldn't ever need to be
negotiated, because the rules prevent multiple partitions from gaining
quorum.

Solr currently ships with ZK 3.4.6, but the next version of Solr (about
to drop any day now) will have 3.4.10.  Once ZK 3.5 is released and Solr
is updated to use it, does the situation I've described above change in
any meaningful way?

Thanks,
Shawn
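P.S. In case a concrete picture helps, here is roughly what I
understand their setup and their script to be, in zoo.cfg terms.  The
hostnames and ports are invented, and this is only a sketch of the
idea, not a tested config:

    # zoo.cfg: the same server list on every node
    server.1=zk1.dc1:2888:3888
    server.2=zk2.dc1:2888:3888
    server.3=zk3.dc1:2888:3888
    server.4=zk1.dc2:2888:3888
    server.5=zk2.dc2:2888:3888
    server.6=zk3.dc2:2888:3888:observer

    # and additionally, in server 6's own zoo.cfg:
    peerType=observer

    # With DC1 down, the script would rewrite the server.6 line on the
    # DC2 nodes to drop the :observer flag, making it a voter, and then
    # do the rolling restarts:
    server.6=zk3.dc2:2888:3888

I presume the script would also have to remove the three DC1 lines,
because with six voters the quorum is four, and DC2 only has three
servers.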
In ZK 3.4.x if you have configuration differences amongst your instances you are susceptible to a split brain. See this email thread, "Rolling Config Change Considered Harmful":
http://zookeeper-user.578899.n2.nabble.com/Rolling-config-change-considered-harmful-td7578761.html

In ZK 3.5.x I'm not even sure it would work.

-JZ
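P.S. For what it's worth, in 3.5.x the supported route is the dynamic
reconfig API rather than rolling restarts.  A rough, untested sketch in
Java, with the connect string and server spec invented for
illustration:

    import org.apache.zookeeper.admin.ZooKeeperAdmin;
    import org.apache.zookeeper.data.Stat;

    public class PromoteObserver {
        public static void main(String[] args) throws Exception {
            // Connect to a surviving DC2 server (placeholder address).
            ZooKeeperAdmin admin =
                new ZooKeeperAdmin("zk1.dc2:2181", 15000, event -> { });
            try {
                // Incremental reconfig: re-add server.6 as a voting
                // participant.  The server spec uses the zoo.cfg
                // format with ";clientPort" appended.
                admin.reconfigure(
                    "server.6=zk3.dc2:2888:3888:participant;2181", // joining
                    null, // leaving: no servers removed
                    null, // newMembers: null selects incremental mode
                    -1,   // fromConfig: -1 = don't pin a config version
                    new Stat());
            } finally {
                admin.close();
            }
        }
    }

But note that reconfig is itself a quorum operation: the new
configuration has to be committed by a majority of the old one.  With
DC1's three voters gone, DC2 only has 2 of the 5 votes, so this call
would never commit, which is a big part of why I doubt the approach.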
On 5/26/2017 9:48 AM, Jordan Zimmerman wrote:
> In ZK 3.4.x if you have configuration differences amongst your
> instances you are susceptible to a split brain.

Thank you for your reply.  I don't fully understand everything being
discussed in that thread, but it sounds like bad things could happen
once connectivity is restored.

If DC1 and DC2 were both operational from a client perspective, but
unable to communicate with each other, I think the potential for bad
things would be even higher, because there could be confusion about
which Solr servers are leaders, as well as which ZK server is the
leader.

Thanks,
Shawn
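P.S. Trying to work out the vote counting for myself, assuming I have
the rules right: the old view has 5 voters, so quorum is 3, and DC1
alone satisfies that.  For DC2 to become writable, the script has to
leave it with a view where its own servers are a majority, say 3 voters
(the two originals plus the promoted observer), where quorum is 2.
When the datacenters reconnect, DC1 has a valid quorum under the old
view and DC2 has one under the new view, so both sides could have a
leader accepting writes.  That looks like the exact split brain from
the thread you linked.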