
Key: RHQ-835
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Blocker
Assignee: John Mazzitelli
Reporter: Joseph Marques
Votes: 0
Watchers: 0
RHQ Project

agent needs to reconnect to the cloud properly

Created: 13/Sep/08 07:51 PM   Updated: 08/Oct/08 09:29 AM
Component/s: Agent, High Availability
Affects Version/s: 1.1pre
Fix Version/s: 1.1

Time Tracking:
Not Specified

Environment: for simplicity: single-server cloud, single agent

Resolution Date: 16/Sep/08 01:08 AM
Date of First Response: 13/Sep/08 09:18 PM
Tester: Jeff Weiss
VCS Revision: 1472


Description
the use case:

* server and agent are both up, and cache is loaded
* take the server down for a while
* bring the server back up

expected result:

* you see cache reload messages if you're tailing the server log

actual result:

* you see nothing

after reading a bit of code, i think the problem is that the agent stops retrying after it unsuccessfully tries each server in its failover list once (the too_many_retries error msg). this could only happen if the entire cloud is down (or if there are rolling failures across the cloud such that at the instant this agent tries each cloud endpoint, the endpoint is/was unavailable). this is indeed unfortunate, but pragmatically speaking it becomes statistically more and more rare under normal operating conditions as the size of the cloud grows. in any event, i think we want to retry the servers indefinitely. just keep looping and looping and looping over the failover list until one of them comes back up.

Comments
John Mazzitelli - 13/Sep/08 09:18 PM
I was wondering about this and thought the loop-once-and-fail was optimal.

We already have code in place that handles the case when an agent is out of contact of its server (and now, of its serverS). That's the whole point of the message spooling and guaranteed delivery.

In addition, we can't retry indefinitely - eventually, messages will time out and/or the agent will deadlock and/or the agent will run out of threads as it continually tries to send new messages while old messages haven't "timed out" or stopped trying to fail over.

I think if we do the "infinite looping", the agent will end up in a bad state - I think "bad things will happen".

If ALL of the servers in a cloud are down (which usually indicates a network failure - I think it will be a rare instance when all servers actually are down), the agent should enter its "started>" mode (i.e. the command sender gets stopped, guaranteed messages are spooled) and wait for the poller/auto-discovery to see a server come back.

Joseph Marques - 14/Sep/08 05:21 AM
> If ALL of the servers in a cloud are down...I think it will be a rare

what if size(cloud) is 1? this single-server setup is what smaller environments will use by default, so we have to ensure that we're not degrading them to achieve what we think are better semantics for a large cloud.

anyway, maybe you and i have a fundamental disconnect about what it means for the agent to be in failover mode. if the agent is failing over, it shouldn't be sending any commands at all. i know that things will still be collecting, but i don't see why those messages / reports would still attempt to be sent up to the server when we know the agent is not currently connected to anything.

in other words, i would expect that the agent failing over puts the comm layer in, say, "paused" mode. when in that mode, the plugin container - which is responsible for collecting data as necessary and pushing results up to the server - simply spools the data it collects instead of trying to send commands. it should be possible for the PC layer to periodically query the comm layer, or for the comm layer to somehow notify downstream consumers that its services have temporarily ceased.

Joseph Marques - 14/Sep/08 05:33 AM
in any event, what i thought was happening here was that the agent never again tried to connect to any server once the failover list was exhausted. i've already disproved that theory empirically, but i still don't see the connectAgent call being made upon reconnect in a single-server environment. and that is a must.

John Mazzitelli - 14/Sep/08 08:09 AM
I have a feeling I just have to add a bit of code to send the connectAgent message when the poller/auto-discovery restarts the comm sender. This is probably not hard to add in.

Remember the PC should have no knowledge of the comm layer (think of embedded - there IS no comm layer). The PC should have no dependency on any communications layer or have any knowledge that there is even a remote server in the picture. The PC doesn't really "send" messages, at least, it doesn't know it is doing so.

When the agent is failing over, the sender is still in the "sending" state so it still attempts to send messages - we don't stop messages from entering the comm layer. If a message fails to be sent because the server is down, that message sending thread enters the "failover" code. This is concurrent and thread-safe - the failover blocks other messages until one attempt to a new server is made. I have the code "constantly" moving, trying not to block any one thread for a long amount of time.

The poller will still be there and I believe it will still be able to loop through the failover list infinitely. If the poller sees all servers down, THEN it stops the comm layer. When one comes back up, it starts the comm layer again. This is true for 1 server in the cloud (as it was before) and for N servers. The auto-discovery listener is a bit different - it's only listening for the primary server - but as Greg said, let's not worry about the AD listener working for 100% of the cases because people will rarely be using it.
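A minimal sketch of the poller behaviour described in the paragraph above, assuming a plain list of server names and a connectivity probe. `PollerSketch`, `pollOnce` and `isSenderRunning` are hypothetical names for illustration, not the agent's actual classes:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical poller; it only illustrates the stop/start decision. */
final class PollerSketch {
    private final List<String> servers;   // failover list, primary first
    private final Predicate<String> isUp; // assumed connectivity probe
    private boolean senderRunning = true; // stands in for the comm sender state

    PollerSketch(List<String> servers, Predicate<String> isUp) {
        this.servers = List.copyOf(servers);
        this.isUp = isUp;
    }

    boolean isSenderRunning() {
        return senderRunning;
    }

    /**
     * One pass over the failover list. The sender is stopped only when
     * every server is down, and started again as soon as one is reachable.
     */
    Optional<String> pollOnce() {
        for (String server : servers) {
            if (isUp.test(server)) {
                senderRunning = true;
                return Optional.of(server);
            }
        }
        senderRunning = false;
        return Optional.empty();
    }
}
```

Calling `pollOnce()` repeatedly gives the "loop through the failover list infinitely" behaviour without blocking any message-sending thread, and it treats a single-server list the same as an N-server one.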

All of what I just said above is what I hope we test in the coming weeks to make sure I am not lying :) Again, I will have to add code because I don't think we are sending the connectAgent message when the comm layer starts back up. I guess we need to do this because if the server was down because it was rebooted, we have to reconnect to it so it can reload the cache.

Greg Hinkle - 14/Sep/08 08:09 PM
So... is there a rigid "failover" mode (even if it's not stopping the sender)? I'm wondering about those rebalance use cases we had gone over. We effectively want the agents to continue trying to get back to the primary server. Otherwise you could get failed over to the secondary and stay there forever. Also, let's take the simple case: my server list is A, B, C... First A goes down so I switch to B (I should periodically try to get back to A)... if B goes down... I should first try to see if A is back (i.e. at the start of each failover process, start from the top of the list).

John Mazzitelli - 14/Sep/08 09:36 PM
every time the agent successfully connects to SOME server, it always resets the failover list index so it'll always try from the top of the list the next time a failure occurs (this is the purpose of the "reset" method I added to FailoverListComposite).
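To illustrate the reset-on-success idea, here is a rough sketch. `FailoverListSketch` is a hypothetical stand-in and does not reproduce the real FailoverListComposite; only the notion of a "reset" method comes from the comment above, and the wrap-around in `next()` is an assumption:

```java
import java.util.List;

/** Hypothetical stand-in for a failover list; not the real FailoverListComposite. */
final class FailoverListSketch {
    private final List<String> servers;
    private int nextIndex = 0; // index of the next server to try

    FailoverListSketch(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        this.servers = List.copyOf(servers);
    }

    /** Returns the next candidate, wrapping around at the end of the list. */
    synchronized String next() {
        String server = servers.get(nextIndex);
        nextIndex = (nextIndex + 1) % servers.size();
        return server;
    }

    /** Called after any successful connection so the next failover starts at the top. */
    synchronized void reset() {
        nextIndex = 0;
    }
}
```

With the list A, B, C: if the agent fails over to B and connects, `reset()` is called, so when B later goes down the first candidate returned by `next()` is A again.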

John Mazzitelli - 15/Sep/08 08:23 AM
To close this JIRA out, I need to implement the following two things:

1) Have the agent periodically check to make sure it's connected to the "top" of the failover list. If the agent is not pointing to the top/primary server (even if it is connected to another server successfully), it should attempt to switch over. I will need to have the agent create another thread to do this polling.

2) Every time the client command sender is started (either via the poller, or the prompt command, or some other way), we need to send the connectAgent message to the server, effectively telling the server we are going to start sending it messages. We have a callback mechanism in the sender that we can leverage to get this to work.
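The two items above could be sketched roughly as follows. All class and method names here (`SenderSketch`, `addStartCallback`, `PrimaryCheckSketch`, `checkOnce`) are hypothetical and only illustrate the idea; the sender's real callback mechanism is not shown in this issue:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sender; the real client command sender API is not shown here. */
final class SenderSketch {
    private final List<Runnable> startCallbacks = new ArrayList<>();
    private boolean sending = false;

    synchronized void addStartCallback(Runnable callback) {
        startCallbacks.add(callback);
    }

    /** Fires the callbacks only on a stopped-to-started transition. */
    synchronized void start() {
        if (sending) {
            return;
        }
        sending = true;
        for (Runnable callback : startCallbacks) {
            callback.run(); // e.g. send the connectAgent message (item 2)
        }
    }

    synchronized void stop() {
        sending = false;
    }
}

/** Hypothetical periodic check that moves the agent back to the primary (item 1). */
final class PrimaryCheckSketch {
    private final String primary;
    private final Predicate<String> canConnect; // assumed connectivity probe
    private volatile String current;

    PrimaryCheckSketch(String primary, String current, Predicate<String> canConnect) {
        this.primary = primary;
        this.current = current;
        this.canConnect = canConnect;
    }

    String current() {
        return current;
    }

    /** One check; a dedicated polling thread would call this at a fixed interval. */
    void checkOnce() {
        if (!current.equals(primary) && canConnect.test(primary)) {
            current = primary;
        }
    }
}
```

Registering the connectAgent send as a start callback means every path that starts the sender (poller, prompt command, or anything else) triggers it without each caller having to remember to do so.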

John Mazzitelli - 15/Sep/08 11:59 AM
#2 done - svn rev 1461

John Mazzitelli - 16/Sep/08 01:08 AM
#1 done - svn rev 1472

John Mazzitelli - 16/Sep/08 01:08 AM
this needs A LOT of testing :)

Jeff Weiss - 08/Oct/08 09:29 AM
Fixed, rev1712