IllegalStateException: instance must be started before calling this method


Yuval Dori
Hi,

Our customer is using ZK 3.4.3 and got "IllegalStateException: instance must be started before calling this method",
with the following stack trace:
2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194; [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler]; Unexpected exception causing shutdown while sock still open 
java.net.SocketTimeoutException: Read timed out 
at java.net.SocketInputStream.socketRead0(Native Method) 
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
at java.net.SocketInputStream.read(SocketInputStream.java:171) 
at java.net.SocketInputStream.read(SocketInputStream.java:141) 
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
at java.io.BufferedInputStream.read(BufferedInputStream.java:265) 
at java.io.DataInputStream.readInt(DataInputStream.java:387) 
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) 
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) 
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) 
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:542) 
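The SocketTimeoutException above is thrown while LearnerHandler.run() blocks in DataInputStream.readInt() waiting for the next quorum packet and the socket's read timeout expires first. A minimal, self-contained sketch of that failure mode (plain JDK sockets, hypothetical names; this is not ZooKeeper code):

```java
import java.io.DataInputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeout {
    // Opens a loopback connection whose peer never writes anything, so the
    // blocking readInt() hits the socket's SO_TIMEOUT, just as the leader's
    // LearnerHandler does when a follower stops sending packets.
    static String demo() throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket peer = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.setSoTimeout(100); // 100 ms read timeout for the demo
            DataInputStream in = new DataInputStream(accepted.getInputStream());
            try {
                in.readInt(); // blocks; the peer sends nothing
                return "read succeeded";
            } catch (SocketTimeoutException e) {
                return "Read timed out";
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

If I read the server code correctly, the timeout on this path is derived from tickTime and syncLimit, so recurring timeouts here usually point at a network stall or an overloaded follower rather than at the exception site itself.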


Here is a description of the scenario:

At first, ps-main.1 is started as primary on WSOSTSLXBMS01 GSC 3.

24.01.2018 11:31:04,947 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/1 Space instance is Primary 

After some relocation of ps-main instances, a new backup instance is started on the same GSC.

24.01.2018 11:32:59,886 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T146 SPACE/GS-LRMI Connection-pool-1-thread-6/1 Space instance [ frqMain_container1:frqMain ] has been selected as Backup 

Later, a failover test is performed by shutting down WSOSTSLXBMS02 and WSOSTSLXMSS02.
We would expect the ps-main.1 backup to be elected as primary, since the former primary was located on WSOSTSLXBMS02, but no election happened. In the end there was no primary of ps-main.1 running anywhere in the cluster.

We found a suspicious exception in the logs:
24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/frqMain_container1:frqMain Lost leadership [sessionId=0x36128253b820001] 
24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T98 SPACE/Curator-Framework-0/CuratorFrameworkImpl backgroundOperationsLoop exiting 
24.01.2018 11:32:44,990 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/1 Quiesce state set to SUSPENDED 
24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ZooKeeper Session: 0x36128253b820001 closed 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector The leader threw an exception 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector java.lang.IllegalStateException: instance must be started before calling this method 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:339) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:123) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:449) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:466) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:65) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:246) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:240) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.lang.Thread.run(Thread.java:748) 
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector
24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T97 SPACE/GS-LRMI Connection-pool-1-thread-1-EventThread/ClientCnxn EventThread shut down for session: 0x36128253b820001 
24.01.2018 11:32:44,997 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/frqMain1_1 Channel is closed [target=frqMain1_1, target url=jini://*/frqMain_container1_1/frqMain?groups=POLBAY&ignoreValidation=true&schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS02:11002/pid[3756]/24795468496_3_2538032931939568226_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(frqMain_container1_1:frqMain)]] 
24.01.2018 11:32:44,998 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ps-main-mirror-service Channel is closed [target=ps-main-mirror-service, target url=jini://*/ps-main-mirror-service_container/ps-main-mirror-service?schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&groups=POLBAY&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS01:11002/pid[3220]/25906827215_3_7196393537085178059_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(ps-main-mirror-service_container:ps-main-mirror-service)]] 
24.01.2018 11:32:45,767 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/1 Shutdown complete 
24.01.2018 11:32:45,768 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/container Container <frqMain_container1> shutdown completed 

24.01.2018 11:32:46,037 +01:00 WARN WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Failed to delete deployed processing unit from [C:\Program Files\Frequentis\Frequentis LifeX Platform\gigaspaces\work\processing-units\ps-main_1_2008048111] 
24.01.2018 11:32:46,037 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Stopped 
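The "instance must be started" error itself looks like Curator's lifecycle guard firing: every CuratorFramework operation first checks that the client is still started, and in the log above the ZooKeeper session is closed at 11:32:44,991 before the LeaderSelector thread releases its leadership lock, so the lock-znode delete runs against an already-closed client. A minimal JDK-only mimic of that shutdown ordering (simplified, hypothetical names; not Curator's actual implementation):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseRace {
    // Stands in for CuratorFramework's started/closed lifecycle flag.
    static final AtomicBoolean started = new AtomicBoolean(false);

    // Stands in for a client operation such as delete(): refuses once closed.
    static void delete(String znode) {
        if (!started.get()) {
            throw new IllegalStateException(
                "instance must be started before calling this method");
        }
        // ...would delete the znode here...
    }

    // Reproduces the ordering seen in the log: close first, release lock after.
    static String reproduce() {
        started.set(true);   // client.start()
        started.set(false);  // client.close() happens during shutdown
        try {
            delete("/leader-lock"); // LeaderSelector releasing its lock too late
            return "released";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(reproduce());
    }
}
```

If that reading is correct, the exception is a symptom of shutdown ordering (stop the LeaderSelector before closing the CuratorFramework) rather than of the failed election itself.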

A thread dump is attached. The failover happened around 24.01.2018 11:40:20,893 +01:00:
WARN WSOSTSLXBMS01/GSC/3 P1280/T59 SPACE/GS-Watchdog/watchdog The ServerEndPoint [WSOSTSLXMSS02/10.17.46.142:11001] is not reachable or is reachable but with no matching invocation in progress at the server peer, closing invalid connection with local address [/10.17.46.136:49864] [ remote invocation of: net.jini.lookup.DiscoveryAdmin.getMemberGroups] [java.net.SocketTimeoutException]


Could it be a ZK bug?

Thanks,

Yuval




Attachment: threads.txt (146K)

Re: IllegalStateException: instance must be started before calling this method

Martin Gainty


Is this running in a Docker container?

If so, how do you resolve hosts (hosts file, Kubernetes, Swarm)?
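One quick way to answer that from each node is a small JDK resolution check (the hostname below is just taken from the logs; an unresolvable name returns null here):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostCheck {
    // Resolves a hostname the same way the JVM does for a socket connect;
    // returns the address, or null when the name cannot be resolved.
    static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        for (String h : new String[] {"localhost", "WSOSTSLXMSS02"}) {
            System.out.println(h + " -> " + resolve(h));
        }
    }
}
```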

M
________________________________
From: Yuval Dori <[hidden email]>
Sent: Sunday, March 25, 2018 5:34 AM
To: [hidden email]
Cc: Yuval Dori
Subject: IllegalStateException: instance must be started before calling this method


Re: IllegalStateException: instance must be started before calling this method

Yuval Dori
Hi Martin,

Not in Docker. We use the word "container" for one of our services.

Thanks,

Yuval

On Sun, Mar 25, 2018 at 11:57 PM, Martin Gainty <[hidden email]> wrote:

>
>
> in docker  container?
>
> if so how do you resolve hosts (.hosts file, kubernetes, swarm)?
>
> M

RE: IllegalStateException: instance must be started before calling this method

VALLASTER Stefan
Hi,

This discussion was originally started in https://issues.apache.org/jira/browse/ZOOKEEPER-2172?focusedCommentId=16389390&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16389390

How can we proceed further from here?

Best regards
Stefan

-----Original Message-----
From: Yuval Dori [mailto:[hidden email]]
Sent: Monday, 26 March 2018 13:14
To: [hidden email]
Subject: Re: IllegalStateException: instance must be started before calling this method

Hi Martin,

Not in a Docker. We are using the word container for one of our services.

Thanks,

Yuval

On Sun, Mar 25, 2018 at 11:57 PM, Martin Gainty <[hidden email]> wrote:

>
>
> in docker  container?
>
> if so how do you resolve hosts (.hosts file, kubernetes, swarm)?
>
> M
> ________________________________
> From: Yuval Dori <[hidden email]>
> Sent: Sunday, March 25, 2018 5:34 AM
> To: [hidden email]
> Cc: Yuval Dori
> Subject: IllegalStateException: instance must be started before calling
> this method
>
> Hi,
>
> Our customer is using ZK 3.4.3 and got IllegalStateException: instance
> must be started before calling this method
>
>
> with the following stack trace:
> 2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194;
> [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler<ht
> tp://10.17.46.142:49336/LearnerHandler>]; Unexpected exception causing
> shutdown while sock still open
> java.net<http://java.net>.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at org.apache.zookeeper.server.quorum.QuorumPacket.
> deserialize(QuorumPacket.java:83)
> at org.apache.jute.BinaryInputArchive.readRecord(
> BinaryInputArchive.java:99)
> at org.apache.zookeeper.server.quorum.LearnerHandler.run(
> LearnerHandler.java:542)
>
>
> This is the description:
>
> At first ps-main.1 is started as primary on WSOSTSLXBMS01 GSC 3.
>
> 24.01.2018 11:31:04,947 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/1 Space instance is Primary
>
> After some relocation of ps-main instances a new backup instance is
> started on the same GSC.
>
> 24.01.2018 11:32:59,886 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T146
> SPACE/GS-LRMI Connection-pool-1-thread-6/1 Space instance [
> frqMain_container1:frqMain ] has been selected as Backup
>
> Later a failover test is performed by shutting down WSOSTSLXBMS02 and
> WSOSTSLXMSS02.
> We would expect the ps-main.1 backup to be elected as primary, because the former primary was located at WSOSTSLXBMS02, but no election happened. In the end there was no primary of ps-main.1 running anywhere in the cluster.
>
> We found a suspicious exception in the logs:
> 24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/frqMain_container1:frqMain Lost leadership [sessionId=0x36128253b820001]
> 24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T98 SPACE/Curator-Framework-0/CuratorFrameworkImpl backgroundOperationsLoop exiting
> 24.01.2018 11:32:44,990 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/1 Quiesce state set to SUSPENDED
> 24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ZooKeeper Session: 0x36128253b820001 closed
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector The leader threw an exception
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector java.lang.IllegalStateException: instance must be started before calling this method
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:339)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:123)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:449)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:466)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:65)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:246)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:240)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.lang.Thread.run(Thread.java:748)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector
> 24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T97 SPACE/GS-LRMI Connection-pool-1-thread-1-EventThread/ClientCnxn EventThread shut down for session: 0x36128253b820001
> 24.01.2018 11:32:44,997 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/frqMain1_1 Channel is closed [target=frqMain1_1, target url=jini://*/frqMain_container1_1/frqMain?groups=POLBAY&ignoreValidation=true&schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS02:11002/pid[3756]/24795468496_3_2538032931939568226_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(frqMain_container1_1:frqMain)]]
> 24.01.2018 11:32:44,998 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ps-main-mirror-service Channel is closed [target=ps-main-mirror-service, target url=jini://*/ps-main-mirror-service_container/ps-main-mirror-service?schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&groups=POLBAY&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS01:11002/pid[3220]/25906827215_3_7196393537085178059_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(ps-main-mirror-service_container:ps-main-mirror-service)]]
> 24.01.2018 11:32:45,767 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/1 Shutdown complete
> 24.01.2018 11:32:45,768 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/container Container <frqMain_container1> shutdown completed
>
> 24.01.2018 11:32:46,037 +01:00 WARN WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Failed to delete deployed processing unit from [C:\Program Files\Frequentis\Frequentis LifeX Platform\gigaspaces\work\processing-units\ps-main_1_2008048111]
> 24.01.2018 11:32:46,037 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Stopped
>
> Attached thread dump. Failover happened around 24.01.2018 11:40:20,893 +01:00
> WARN WSOSTSLXBMS01/GSC/3 P1280/T59 SPACE/GS-Watchdog/watchdog The ServerEndPoint [WSOSTSLXMSS02/10.17.46.142:11001] is not reachable or is reachable but with no matching invocation in progress at the server peer, closing invalid connection with local address [/10.17.46.136:49864] [ remote invocation of: net.jini.lookup.DiscoveryAdmin.getMemberGroups] [java.net.SocketTimeoutException]
>
>
> Could it be a ZK bug?
>
> Thanks,
>
> Yuval
>
>
>
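[Editor's note] The quoted stack trace shows the exception originating in Guava's Preconditions.checkState inside CuratorFrameworkImpl.delete: Curator guards each operation with a lifecycle check, so when the LeaderSelector releases its lock node after the CuratorFramework has already been closed (note the "Session: 0x36128253b820001 closed" line logged just before the error), the guard fires. The following is a minimal, self-contained Java sketch of that guard pattern; it is not Curator's actual code, and SketchClient and its methods are invented for illustration only.

```java
// Sketch (assumption: illustrative only, not Curator source) of the lifecycle
// guard that produces "instance must be started before calling this method".
final class SketchClient {
    private volatile boolean started = false;

    void start() { started = true; }
    void close() { started = false; }

    void delete(String path) {
        // Curator performs this check via Guava's Preconditions.checkState.
        if (!started) {
            throw new IllegalStateException(
                "instance must be started before calling this method");
        }
        // ... a real client would delete the znode here ...
    }
}

public class ClosedClientDemo {
    public static void main(String[] args) {
        SketchClient client = new SketchClient();
        client.start();
        client.close(); // session lost, framework closed first
        try {
            client.delete("/leader-lock/node"); // lock release arrives too late
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Read this way, the IllegalStateException looks like a symptom of shutdown ordering on the client side (the lock release racing with the client close) rather than of ZooKeeper server state.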

RE: IllegalStateException: instance must be started before calling this method

VALLASTER Stefan
In reply to this post by Martin Gainty
How can we move this topic forward again? The mail thread seems to have stalled for no obvious reason. Is anyone willing to help us out here again?

Best regards
Stefan

-----Original Message-----
From: Martin Gainty [mailto:[hidden email]]
Sent: Sonntag, 25. März 2018 22:58
To: [hidden email]
Subject: Re: IllegalStateException: instance must be started before calling this method



in docker  container?

if so how do you resolve hosts (.hosts file, kubernetes, swarm)?

M



Re: IllegalStateException: instance must be started before calling this method

Martin Gainty
MG>waiting for Yuval to answer:


in docker  container?
if so how do you resolve hosts (.hosts file, kubernetes, swarm)?


MG>?

________________________________
From: VALLASTER Stefan <[hidden email]>
Sent: Tuesday, April 10, 2018 4:46 AM
To: [hidden email]
Subject: RE: IllegalStateException: instance must be started before calling this method

How can we get this topic forward again? It seems the mail thread is currently stalled for no obvious reason. Anyone willing to help us out here again?

Best regards
Stefan



Re: IllegalStateException: instance must be started before calling this method

Yuval Dori
Hi,

I answered the following on March 26:

Hi Martin,

Not in Docker. We use the word "container" for one of our services.

Thanks,

Yuval




> On Apr 10, 2018, at 2:40 PM, Martin Gainty <[hidden email]> wrote:
>
> MG>waiting for Yuval to answer:
>
>
> in docker  container?
> if so how do you resolve hosts (.hosts file, kubernetes, swarm)?
>
>
> MG>?
>
> ________________________________
> From: VALLASTER Stefan <[hidden email]>
> Sent: Tuesday, April 10, 2018 4:46 AM
> To: [hidden email]
> Subject: RE: IllegalStateException: instance must be started before calling this method
>
> How can we get this topic forward again? It seems the mail thread is currently stalled for no obvious reason. Anyone willing to help us out here again?
>
> Best regards
> Stefan
>
> -----Original Message-----
> From: Martin Gainty [mailto:[hidden email]]
> Sent: Sonntag, 25. März 2018 22:58
> To: [hidden email]
> Subject: Re: IllegalStateException: instance must be started before calling this method
>
>
>
> in docker  container?
>
> if so how do you resolve hosts (.hosts file, kubernetes, swarm)?
>
> M
> ________________________________
> From: Yuval Dori <[hidden email]>
> Sent: Sunday, March 25, 2018 5:34 AM
> To: [hidden email]
> Cc: Yuval Dori
> Subject: IllegalStateException: instance must be started before calling this method
>
> Hi,
>
> Our customer is using ZK 3.4.3 and got IllegalStateException: instance must be started before calling this method
>
>
> with the following stack trace:
> 2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194; [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler<http://10.17.46.142:49336/LearnerHandler>]; Unexpected exception causing shutdown while sock still open
> java.net<http://java.net>.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:542)
>
>
> This is the description:
>
> At first ps-main.1 is started as primary on WSOSTSLXBMS01 GSC 3.
>
> 24.01.2018 11:31:04,947 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/1 Space instance is Primary
>
> After some relocation of ps-main instances a new backup instance is started on the same GSC.
>
> 24.01.2018 11:32:59,886 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T146 SPACE/GS-LRMI Connection-pool-1-thread-6/1 Space instance [ frqMain_container1:frqMain ] has been selected as Backup
>
> Later a failover test is performed by shutting down WSOSTSLXBMS02 and WSOSTSLXMSS02.
> We would expect that ps-main.1 backup will be elected as primary because the former primary was located at WSOSTSLXBMS02 but no election happend. In the end in the whole cluster there was no primary of ps-main.1 running.
>
> We found a suspicious exception in the logs:
> 24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/frqMain_container1:frqMain Lost leadership [sessionId=0x36128253b820001]
> 24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T98 SPACE/Curator-Framework-0/CuratorFrameworkImpl backgroundOperationsLoop exiting
> 24.01.2018 11:32:44,990 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/1 Quiesce state set to SUSPENDED
> 24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ZooKeeper Session: 0x36128253b820001 closed
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector The leader threw an exception
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector java.lang.IllegalStateException: instance must be started before calling this method
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:339)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:123)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:449)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:466)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:65)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:246)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:240)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.lang.Thread.run(Thread.java:748)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector
> 24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T97 SPACE/GS-LRMI Connection-pool-1-thread-1-EventThread/ClientCnxn EventThread shut down for session: 0x36128253b820001
> 24.01.2018 11:32:44,997 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/frqMain1_1 Channel is closed [target=frqMain1_1, target url=jini://*/frqMain_container1_1/frqMain?groups=POLBAY&ignoreValidation=true&schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS02:11002/pid[3756]/24795468496_3_2538032931939568226_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(frqMain_container1_1:frqMain)]]
> 24.01.2018 11:32:44,998 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ps-main-mirror-service Channel is closed [target=ps-main-mirror-service, target url=jini://*/ps-main-mirror-service_container/ps-main-mirror-service?schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&groups=POLBAY&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS01:11002/pid[3220]/25906827215_3_7196393537085178059_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(ps-main-mirror-service_container:ps-main-mirror-service)]]
> 24.01.2018 11:32:45,767 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/1 Shutdown complete
> 24.01.2018 11:32:45,768 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/container Container <frqMain_container1> shutdown completed
>
> 24.01.2018 11:32:46,037 +01:00 WARN WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Failed to delete deployed processing unit from [C:\Program Files\Frequentis\Frequentis LifeX Platform\gigaspaces\work\processing-units\ps-main_1_2008048111]
> 24.01.2018 11:32:46,037 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Stopped
>
> Thread dump attached. Failover happened around 24.01.2018 11:40:20,893 +01:00
> WARN WSOSTSLXBMS01/GSC/3 P1280/T59 SPACE/GS-Watchdog/watchdog The ServerEndPoint [WSOSTSLXMSS02/10.17.46.142:11001] is not reachable or is reachable but with no matching invocation in progress at the server peer, closing invalid connection with local address [/10.17.46.136:49864] [ remote invocation of: net.jini.lookup.DiscoveryAdmin.getMemberGroups] [java.net.SocketTimeoutException]
>
>
> Could it be a ZK bug?
>
> Thanks,
>
> Yuval
>
>


RE: IllegalStateException: instance must be started before calling this method

VALLASTER Stefan
Hi,

Getting back to this topic in more detail: the container meant here is a normal Java VM running specific middleware that offers a "container" concept. It has nothing to do with Docker.

The config is read from disk.

Best regards
Stefan

-----Original Message-----
From: Yuval Dori [mailto:[hidden email]]
Sent: Dienstag, 10. April 2018 14:34
To: [hidden email]
Cc: LEONHARTSBERGER Stefan
Subject: Re: IllegalStateException: instance must be started before calling this method

Hi,

I answered the following on March 26:

Hi Martin,

Not in a Docker. We are using the word container for one of our services.

Thanks,

Yuval




> On Apr 10, 2018, at 2:40 PM, Martin Gainty <[hidden email]> wrote:
>
> MG>waiting for Yuval to answer:
>
>
> in docker  container?
> if so how do you resolve hosts (.hosts file, kubernetes, swarm)?
>
>
> MG>?
>
> ________________________________
> From: VALLASTER Stefan <[hidden email]>
> Sent: Tuesday, April 10, 2018 4:46 AM
> To: [hidden email]
> Subject: RE: IllegalStateException: instance must be started before calling this method
>
> How can we move this topic forward again? The mail thread seems to be stalled for no obvious reason. Is anyone willing to help us out here again?
>
> Best regards
> Stefan
>
> -----Original Message-----
> From: Martin Gainty [mailto:[hidden email]]
> Sent: Sonntag, 25. März 2018 22:58
> To: [hidden email]
> Subject: Re: IllegalStateException: instance must be started before calling this method
>
>
>
> in docker  container?
>
> if so how do you resolve hosts (.hosts file, kubernetes, swarm)?
>
> M
> ________________________________
> From: Yuval Dori <[hidden email]>
> Sent: Sunday, March 25, 2018 5:34 AM
> To: [hidden email]
> Cc: Yuval Dori
> Subject: IllegalStateException: instance must be started before calling this method
>
> [...]


RE: IllegalStateException: instance must be started before calling this method

VALLASTER Stefan
In reply to this post by Martin Gainty
Hello Martin,

Did you read the input from Yuval and me? Does it answer your question?

Best regards
Stefan

-----Original Message-----
From: Martin Gainty [mailto:[hidden email]]
Sent: Dienstag, 10. April 2018 13:40
To: [hidden email]
Subject: Re: IllegalStateException: instance must be started before calling this method

MG>waiting for Yuval to answer:


in docker  container?
if so how do you resolve hosts (.hosts file, kubernetes, swarm)?


MG>?

________________________________
From: VALLASTER Stefan <[hidden email]>
Sent: Tuesday, April 10, 2018 4:46 AM
To: [hidden email]
Subject: RE: IllegalStateException: instance must be started before calling this method

How can we move this topic forward again? The mail thread seems to be stalled for no obvious reason. Is anyone willing to help us out here again?

Best regards
Stefan

-----Original Message-----
From: Martin Gainty [mailto:[hidden email]]
Sent: Sonntag, 25. März 2018 22:58
To: [hidden email]
Subject: Re: IllegalStateException: instance must be started before calling this method



in docker  container?

if so how do you resolve hosts (.hosts file, kubernetes, swarm)?

M
________________________________
From: Yuval Dori <[hidden email]>
Sent: Sunday, March 25, 2018 5:34 AM
To: [hidden email]
Cc: Yuval Dori
Subject: IllegalStateException: instance must be started before calling this method

[...]



Re: IllegalStateException: instance must be started before calling this method

Martin Gainty
Too many variables, so let's knock this down to:

No Docker (run from a known command line and let ZK select the container)

JDK 1.8 (I'm hearing squirrelly behaviour from JDK 1.9 with MR jars; let's use the 1.9 MR jars after we verify 1.8 is OK)

.hosts file for the Host Container Resolver (we don't know if swarm/kubernetes may be causing issues), e.g.:

# .hosts file entry for the unreachable node
10.17.46.142 OutOfReachHost

this error:
2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194; [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler]; Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)


smacks of either:

destination unreachable, or

some intervening node (proxy/reverse proxy) slowing down the socket read,
so we will need to set the sessionTimeout and connectionTimeout to an insanely high value such as 60 sec.
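For reference, the "Read timed out" in that trace is just the JDK's standard socket read timeout firing; a minimal self-contained sketch (plain JDK, no ZooKeeper, ports chosen by the OS) reproduces the same exception:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws IOException {
        // Local server that completes the TCP handshake (via the listen backlog)
        // but never writes, so the client's blocking read must time out.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort())) {
            client.setSoTimeout(500); // read timeout in milliseconds
            try {
                client.getInputStream().read(); // blocks until SO_TIMEOUT fires
            } catch (SocketTimeoutException e) {
                System.out.println(e.getMessage()); // prints "Read timed out"
            }
        }
    }
}
```

A stalled proxy between quorum members would produce exactly this on the LearnerHandler socket: the connection stays open, bytes never arrive, and the read times out.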


* @param zkServers
*            The Zookeeper servers
* @param sessionTimeout
*            The session timeout in milliseconds
* @param connectionTimeout
*            The connection timeout in milliseconds
* @param zkSerializer
*            The Zookeeper data serializer
* @param operationRetryTimeout
*            Most operations done through this {@link org.I0Itec.zkclient.ZkClient} are retried in cases like
*            connection loss with the Zookeeper servers. During such failures, this
*            <code>operationRetryTimeout</code> decides the maximum amount of time, in milliseconds, each
*            operation is retried. A value lesser than 0 is considered as
*            "retry forever until a connection has been reestablished".


// Set sessionTimeOut and connectionTimeOut to an insanely high value such as 60 sec
// and set operationRetryTimeout to a negative value to retry forever (per the javadoc above)

final int sessionTimeOut = 60000;
final int connectionTimeOut = 60000;
final long forever = -1L; // < 0 means "retry forever until a connection has been reestablished"
org.I0Itec.zkclient.ZkClient zkClient = new org.I0Itec.zkclient.ZkClient(
        "OutOfReachHost:49336", sessionTimeOut, connectionTimeOut,
        new org.I0Itec.zkclient.serialize.SerializableSerializer(), forever);

If you increase sessionTimeOut and connectionTimeOut to 60 sec, will ZkClient connect?

If bumping up sessionTimeOut and/or connectionTimeOut doesn't work, then from the client try
>tracert OutOfReachHost


Ask your net admin which servers sit between you and OutOfReachHost.


If one of the intervening hosts is a proxy server, it is very likely that proxy that is slowing down your request, in which case try using curl to bypass the proxy:

curl  -v --noproxy '*' -XGET "OutOfReachHost:4711"
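
A quick programmatic check can also tell whether the TCP connect itself succeeds within a bounded timeout (a sketch; host and port are placeholders matching the log above), which separates "cannot connect at all" from "connected but reads time out", the SocketTimeoutException in the trace:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ReachCheck {
    // Returns true only if a TCP connection to host:port completes
    // within timeoutMs; false on refusal, unreachable host, or timeout.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "OutOfReachHost" is the placeholder hostname from the thread
        System.out.println(isReachable("OutOfReachHost", 49336, 5000));
    }
}
```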


?
Martin

______________________________________________


________________________________
From: VALLASTER Stefan <[hidden email]>
Sent: Tuesday, April 17, 2018 8:59 AM
To: [hidden email]
Subject: RE: IllegalStateException: instance must be started before calling this method

Hello Martin,

did you read the input from Yuval and me? Does it answer your question?

Best regards
Stefan

-----Original Message-----
From: Martin Gainty [mailto:[hidden email]]
Sent: Dienstag, 10. April 2018 13:40
To: [hidden email]
Subject: Re: IllegalStateException: instance must be started before calling this method

MG>waiting for Yuval to answer:


in docker  container?
if so how do you resolve hosts (.hosts file, kubernetes, swarm)?


MG>?

________________________________
From: VALLASTER Stefan <[hidden email]>
Sent: Tuesday, April 10, 2018 4:46 AM
To: [hidden email]
Subject: RE: IllegalStateException: instance must be started before calling this method

How can we get this topic forward again? It seems the mail thread is currently stalled for no obvious reason. Anyone willing to help us out here again?

Best regards
Stefan

-----Original Message-----
From: Martin Gainty [mailto:[hidden email]]
Sent: Sonntag, 25. März 2018 22:58
To: [hidden email]
Subject: Re: IllegalStateException: instance must be started before calling this method



in docker  container?

if so how do you resolve hosts (.hosts file, kubernetes, swarm)?

M
________________________________
From: Yuval Dori <[hidden email]>
Sent: Sunday, March 25, 2018 5:34 AM
To: [hidden email]
Cc: Yuval Dori
Subject: IllegalStateException: instance must be started before calling this method

Hi,

Our customer is using ZK 3.4.3 and got IllegalStateException: instance must be started before calling this method


with the following stack trace:
2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194; [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler]; Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:542)


This is the description:

At first ps-main.1 is started as primary on WSOSTSLXBMS01 GSC 3.

24.01.2018 11:31:04,947 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/1 Space instance is Primary

After some relocation of ps-main instances a new backup instance is started on the same GSC.

24.01.2018 11:32:59,886 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T146 SPACE/GS-LRMI Connection-pool-1-thread-6/1 Space instance [ frqMain_container1:frqMain ] has been selected as Backup

Later a failover test is performed by shutting down WSOSTSLXBMS02 and WSOSTSLXMSS02.
We would expect the ps-main.1 backup to be elected as primary, because the former primary was located on WSOSTSLXBMS02, but no election happened. In the end there was no primary of ps-main.1 running anywhere in the cluster.

We found a suspicious exception in the logs:
24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/frqMain_container1:frqMain Lost leadership [sessionId=0x36128253b820001]
24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T98 SPACE/Curator-Framework-0/CuratorFrameworkImpl backgroundOperationsLoop exiting
24.01.2018 11:32:44,990 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/1 Quiesce state set to SUSPENDED
24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ZooKeeper Session: 0x36128253b820001 closed
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector The leader threw an exception
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector java.lang.IllegalStateException: instance must be started before calling this method
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:339)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:123)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:449)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:466)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:65)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:246)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:240)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.FutureTask.run(FutureTask.java:266)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.FutureTask.run(FutureTask.java:266)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector at java.lang.Thread.run(Thread.java:748)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector
24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T97 SPACE/GS-LRMI Connection-pool-1-thread-1-EventThread/ClientCnxn EventThread shut down for session: 0x36128253b820001
24.01.2018 11:32:44,997 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/frqMain1_1 Channel is closed [target=frqMain1_1, target url=jini://*/frqMain_container1_1/frqMain?groups=POLBAY&ignoreValidation=true&schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS02:11002/pid[3756]/24795468496_3_2538032931939568226_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(frqMain_container1_1:frqMain)]]
24.01.2018 11:32:44,998 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ps-main-mirror-service Channel is closed [target=ps-main-mirror-service, target url=jini://*/ps-main-mirror-service_container/ps-main-mirror-service?schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&groups=POLBAY&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS01:11002/pid[3220]/25906827215_3_7196393537085178059_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(ps-main-mirror-service_container:ps-main-mirror-service)]]
24.01.2018 11:32:45,767 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/1 Shutdown complete
24.01.2018 11:32:45,768 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/container Container <frqMain_container1> shutdown completed

24.01.2018 11:32:46,037 +01:00 WARN WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Failed to delete deployed processing unit from [C:\Program Files\Frequentis\Frequentis LifeX Platform\gigaspaces\work\processing-units\ps-main_1_2008048111]
24.01.2018 11:32:46,037 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Stopped

Attached thread dump. Failover happened around 24.01.2018 11:40:20,893 +01:00
WARN WSOSTSLXBMS01/GSC/3 P1280/T59 SPACE/GS-Watchdog/watchdog The ServerEndPoint [WSOSTSLXMSS02/10.17.46.142:11001] is not reachable or is reachable but with no matching invocation in progress at the server peer, closing invalid connection with local address [/10.17.46.136:49864] [ remote invocation of: net.jini.lookup.DiscoveryAdmin.getMemberGroups] [java.net.SocketTimeoutException]


Could it be a ZK bug?

Thanks,

Yuval