How to restore a snapshot after an accidental ZK cleanup

AALISHE
Hi all,

I have a CDH cluster with ZK 3.4.5 running on 3 nodes, and one of the
developers accidentally ran an API command that does zooKeeperCleanUp.

I didn't know before this that ZK needs backups, so there are no
backups on hand.

But I see some log and snapshot files under zookeeper lib/data
(attached as well):

 65M  Feb 19 05:48  log.900000001
 18K  Feb 19 07:19  snapshot.9000000a0
 65M  Feb 19 08:14  log.a00000001
 18K  Feb 19 08:21  snapshot.a00000759
 65M  Feb 19 08:22  log.b00000001
 17K  Feb 19 08:27  snapshot.c00000000
   2  Feb 19 11:39  acceptedEpoch
 21K  Feb 19 11:39  snapshot.c000016c7
   2  Feb 19 11:39  currentEpoch
129M  Feb 20 04:21  log.c00000001
 21K  Feb 20 04:21  snapshot.d000184ec
 65M  Feb 20 17:50  log.d000184ed
 23K  Feb 20 17:50  snapshot.d0002bf88
4.0K  Feb 20 17:50  .
 65M  Feb 21 03:24  log.d0002bf89



The incident happened on *Feb 20 17:50*. As you can see, the
snapshot and log files stamped with the earlier time *Feb 20 04:21* are larger
in size.


I think these last 2 files might be the restore candidates? If yes, how
do I properly restore to that point?


cheers!

Re: How to restore a snapshot after an accidental ZK cleanup

AALISHE
forgot the attachment


Re: How to restore a snapshot after an accidental ZK cleanup

Ted Yu
The attachment didn't go through.

Consider using pastebin.


Re: How to restore a snapshot after an accidental ZK cleanup

AALISHE
Thanks Ted,

this is the link: http://pastebin.com/CgGi45EN

cheers!

Re: How to restore a snapshot after an accidental ZK cleanup

AALISHE
Anything anyone please?

Re: How to restore a snapshot after an accidental ZK cleanup

vikrant singh-2
I have not tried it, but as I understand it, these are the steps to
follow:
Step 1 - Back up these snapshot files.
Step 2 - Choose the snapshot files from which you want to recover.
Step 3 - Remove all other files from the data dir.
Step 4 - Start the server.
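
The first three steps could be scripted roughly as below. This is an untested sketch of the steps as stated, not a verified recovery procedure. The function names and the backup folder naming are made up for illustration, and leaving acceptedEpoch/currentEpoch untouched is an assumption of the sketch. Run anything like this only while the ZooKeeper server is stopped.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_data_dir(data_dir, backup_root):
    """Step 1: copy the whole data dir into a new timestamped folder."""
    target = Path(backup_root) / f"zk-backup-{time.strftime('%Y%m%d-%H%M%S')}"
    # copytree refuses to write into an existing folder, so an earlier
    # backup is never overwritten.
    shutil.copytree(data_dir, target)
    return target


def keep_only(data_dir, keep):
    """Steps 2-3: delete snapshot.* and log.* files whose names are not in keep.

    Other files (for example acceptedEpoch and currentEpoch) are left alone;
    that is a choice made for this sketch, not something the thread states.
    """
    removed = []
    for path in sorted(Path(data_dir).iterdir()):
        is_zk_file = path.name.startswith(("snapshot.", "log."))
        if path.is_file() and is_zk_file and path.name not in keep:
            path.unlink()
            removed.append(path.name)
    return removed


if __name__ == "__main__":
    # Demo on a throwaway directory with empty stand-in files.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        data.mkdir()
        for name in ("snapshot.d000184ec", "snapshot.d0002bf88", "log.d0002bf89"):
            (data / name).touch()
        print("backup at:", backup_data_dir(data, Path(tmp) / "backups"))
        print("removed:", keep_only(data, {"snapshot.d0002bf88"}))
```

Taking the backup first means a failed attempt can be retried from the same starting files.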


Re: How to restore a snapshot after an accidental ZK cleanup

Flavio Junqueira-3
In reply to this post by AALISHE
Hi there,

I'm not sure what the API command you're referring to does. Does it delete the whole data tree or a subtree or what? If you populate the data directory with a consistent snapshot/log pair, then I don't see why you shouldn't be able to recover at least some of your data. The sequence of steps should be:

- Copy appropriate files to the data/log directory of all replicas
- Start the replicas
- Check if it loaded the data tree successfully
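
To illustrate what a "consistent snapshot/log pair" might mean for the files listed in this thread, here is a small sketch. It assumes that the hex suffix of each file name is a zxid, that a log file is named after the first zxid it contains, and that the logs relevant to a snapshot are the last one starting at or before the snapshot's zxid plus every later one. The helper names are hypothetical. Note that copying every later log also replays everything recorded after the snapshot; leaving later logs out rolls the state back further.

```python
def zxid_of(name):
    """Return the zxid encoded as the hex suffix of a snapshot.* or log.* name."""
    return int(name.split(".", 1)[1], 16)


def files_to_restore(names, snapshot):
    """Return the chosen snapshot followed by the logs that may extend it."""
    snap_zxid = zxid_of(snapshot)
    logs = sorted((n for n in names if n.startswith("log.")), key=zxid_of)
    # The last log that starts at or before the snapshot may still contain
    # transactions newer than the snapshot, so it is kept as well.
    at_or_before = [n for n in logs if zxid_of(n) <= snap_zxid]
    later = [n for n in logs if zxid_of(n) > snap_zxid]
    return [snapshot] + at_or_before[-1:] + later


# File names copied from the listing in the original post.
THREAD_FILES = [
    "log.900000001", "snapshot.9000000a0", "log.a00000001",
    "snapshot.a00000759", "log.b00000001", "snapshot.c00000000",
    "snapshot.c000016c7", "log.c00000001", "snapshot.d000184ec",
    "log.d000184ed", "snapshot.d0002bf88", "log.d0002bf89",
]

if __name__ == "__main__":
    for name in files_to_restore(THREAD_FILES, "snapshot.d000184ec"):
        print(name)
```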

In general, ZK doesn't need backups. What many applications end up doing is having a way to reconstruct the data tree in case of disaster, such as accidentally deleting the whole data tree.

-Flavio

Re: How to restore a snapshot after an accidental ZK cleanup

AALISHE
In reply to this post by vikrant singh-2
Hi Vikrant/All,

I have some thoughts about the steps to share:


1- Since this is a 3-node cluster, I must identify which one is the
leader ZK node
2- Stop ZK from Cloudera Manager
3- Go to the snapshot folder (on the leader) and set a backup aside
4- Delete the files (snapshot + log) with the newest date stamp? (on all 3
nodes)
5- Start ZK and make sure the previous leader is the current leader? Or
maybe I should initialize the ZK data?



Can anyone please take a look and confirm or correct the above steps?
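
For step 1, one way to see each server's role is ZooKeeper's four-letter "srvr" command, whose reply includes a "Mode:" line. The sketch below assumes the default client port 2181 and uses hypothetical host names; the helper names are made up for illustration, and the sample text in the demo is only an illustrative fragment of a reply.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a four-letter command such as 'srvr' and return the text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line (e.g. 'leader'), or None if absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    # Illustrative fragment only; a real reply has more lines.
    print(parse_mode("Mode: follower\n"))
    # Against a live ensemble (host names are hypothetical):
    # for host in ("zk1.example.com", "zk2.example.com", "zk3.example.com"):
    #     print(host, parse_mode(four_letter_word(host, 2181, "srvr")))
```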


cheers!


Re: How to restore a snapshot after an accidental ZK cleanup

vikrant singh-2
I think you need not worry about the leader election or who was the
previous leader. The quorum should be able to handle that when it comes up.
Nor do you need to validate who becomes the new leader.

Before you delete any files, please make sure you keep a backup, so that if
your experiment fails you do not end up with no files to try again.

That said, once all the data is backed up I would delete all the
snapshot.* and log.* files except the latest one. In your case I would leave
snapshot.d0002bf88
in the data folder. Please note the number at the end of the file name: it is the
transaction number after which this snapshot was created. On each of your
ZK servers you will have a file for which this number is in the same
range. Keep those files on the server.
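
The number at the end of each file name can be decoded to compare files across servers. The sketch below relies on the documented layout of a ZooKeeper transaction id (zxid): a 64-bit value whose high 32 bits are the leader epoch and whose low 32 bits are a counter within that epoch. The function name is made up for illustration.

```python
def split_zxid(file_name):
    """Split the hex suffix of a snapshot.* or log.* name into (epoch, counter)."""
    zxid = int(file_name.rsplit(".", 1)[1], 16)
    epoch = zxid >> 32           # high 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF  # low 32 bits: counter within the epoch
    return epoch, counter


if __name__ == "__main__":
    for name in ("snapshot.d000184ec", "snapshot.d0002bf88", "log.d0002bf89"):
        epoch, counter = split_zxid(name)
        print(f"{name}: epoch={epoch:#x} counter={counter:#x}")
```

Files on different servers that decode to the same epoch and nearby counters would be the ones "in the same range".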

I do not think you need to initialize any data manually. Once the snapshot
files are in place you can start the server, and most likely it will
come up.

All the best.

Re: How to restore a snapshot after an accidental ZK cleanup

Jordan Zimmerman-3
Be careful when restoring that you don’t go “back in time”. ZooKeeper can be used as a datastore (bad idea) and as a coordinator. If the transaction files you are restoring contain paths that are involved in leader elections, etc., insanity can ensue.


Re: How to restore a snapshot after an accidental ZK cleanup

AALISHE
Thanks Jordan... can you elaborate more on your answer?

Re: How to restore a snapshot after an accidental ZK cleanup

Jordan Zimmerman-3
Imagine a scenario in an unstable ensemble where the ZK leader is moving around. Which server are you getting your transaction logs from? How old are they? There is a potential to go back in time. For persistent nodes this probably isn’t a big deal. But what about ephemeral nodes involved in a lock recipe? Remember, writes only go to a quorum of servers.

Imagine this wholly contrived scenario:

1. Your clients execute a leader election recipe and a leader is selected
2. The leader writes a ZNode as a type of flag (note: in practice there are issues with this, but don’t worry about it now)
3. The leader executes a REST call to a third party server that starts something important that is not idempotent
4. The ensemble has a horrid crash at this point

Now, you want to restore from a backup. How old is the backup? Which server did the backup come from? There’s a very good chance that you restore from backup and the ZNode written in step 2 is not in the transaction log you restored. Now, your leader is going to send that REST call again. Even if the ZNode is recorded, old ephemerals may appear again and the leader might think it’s the leader a second time. There are so many vagaries that it’s difficult to reason about.

Again, this is highly contrived, but you can imagine many similar types of scenarios. ZK is a coordinator, not a database.
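
The contrived scenario can be mimicked with a toy model. Nothing below talks to ZooKeeper: a plain dict stands in for the data tree, and the znode path and class names are hypothetical. It only shows how restoring an older state can make a non-idempotent call fire twice.

```python
class ThirdParty:
    """Stand-in for the external service; counts non-idempotent calls."""

    def __init__(self):
        self.calls = 0

    def start_job(self):
        self.calls += 1


def run_leader(znodes, service):
    """Leader logic: call the service once, guarded by a flag znode."""
    if "/app/job-started" not in znodes:
        znodes["/app/job-started"] = b"1"  # step 2: write the flag
        service.start_job()                # step 3: non-idempotent call


service = ThirdParty()
live = {"/app": b""}
backup = dict(live)            # backup taken before the flag was written
run_leader(live, service)      # normal run: flag written, one call made
restored = dict(backup)        # step 4: crash, then restore the older state
run_leader(restored, service)  # the flag is gone, so the call is repeated
print(service.calls)  # prints 2
```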

-JZ
