> If ALL of the servers in a cloud are down...I think it will be a rare
what if size(cloud) is 1? this single-server setup is what smaller environments will use by default, so we have to ensure that we're not degrading them to achieve what we think are better semantics for a large cloud.
anyway, maybe you and i have a fundamental disconnect about what it means for the agent to be in failover mode. if the agent is failing over, it shouldn't be sending any commands at all. i know that things will still be collecting, but i don't see why those messages / reports would still attempt to be sent up to the server when we know the agent is not currently connected to anything. in other words, i would expect that the agent failing over puts the comm layer in, say, "paused" mode. when in that mode, the plugin container - which is responsible for collecting data as necessary and pushing results up to the server - simply spools the data it collects instead of trying to send commands. it should be possible for the PC layer to periodically query the comm layer, or for the comm layer to somehow notify downstream consumers that its services have temporarily ceased.
in any event, what i thought was happening here was that the agent never again tried to connect to any server once the failover list was exhausted. i've already disproved that theory empirically, but i still don't see the connectAgent call being made upon reconnect in a single-server environment. and that is a must.
I have a feeling I just have to add a bit of code to send the connectAgent message when the poller/auto-discovery restarts the comm sender. This is probably not hard to add.
Remember the PC should have no knowledge of the comm layer (think of embedded - there IS no comm layer). The PC should have no dependency on any communications layer or have any knowledge that there is even a remote server in the picture. The PC doesn't really "send" messages; at least, it doesn't know it is doing so.
When the agent is failing over, the sender is still in the "sending" state, so it still attempts to send messages - we don't stop messages from entering the comm layer. If a message fails to be sent because the server is down, that message-sending thread enters the "failover" code. This is concurrent and thread-safe - the failover blocks other messages until one attempt to a new server is made. I have the code "constantly" moving, trying not to block any one thread for a long time.
The poller will still be there and I believe it will still be able to loop through the failover list infinitely. If the poller sees all servers down, THEN it stops the comm layer. When one comes back up, it starts it. This is true for 1 server in the cloud (as it was before) and N servers. The auto-discovery listener is a bit different - he's only listening for the primary server - but as Greg said, let's not worry about the AD listener working for 100% of the cases because people will rarely be using it.
All of what I just said above is what I hope we test in the coming weeks to make sure I am not lying :) Again, I will have to add code because I don't think we are sending the connectAgent message when the comm layer starts back up. I guess we need to do this because if the server was down because it was rebooted, we have to reconnect to it so it can reload the cache.
So... is there a rigid "failover" mode (even if it's not stopping the sender)? I'm wondering about those rebalance use cases we had gone over. We effectively want the agents to continue trying to get back to the primary server.
Otherwise you could get failed over to the secondary and stay there forever. Also, let's take the simple case: my server list is A, B, C. First A goes down, so I switch to B (I should periodically try to get back to A). If B then goes down, I should first try to see if A is back (i.e. at the start of each failover process, start from the top of the list).
Every time the agent successfully connects to SOME server, it resets the failover list index, so it'll always try from the top of the list the next time a failure occurs (this is the purpose of the "reset" method I added to FailoverListComposite).
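To make the reset-on-success behavior concrete, here is a minimal sketch. `FailoverListSketch` and its `next()` method are hypothetical stand-ins written for illustration; they are not the real FailoverListComposite API, of which only the existence of a "reset" method is stated above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a failover list; NOT the real FailoverListComposite. */
final class FailoverListSketch {
    private final List<String> servers;
    private int nextIndex = 0; // position of the next server to try

    FailoverListSketch(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        this.servers = Collections.unmodifiableList(new ArrayList<>(servers));
    }

    /** Returns the next candidate server, or null once the list is exhausted. */
    synchronized String next() {
        return nextIndex < servers.size() ? servers.get(nextIndex++) : null;
    }

    /** Called after ANY successful connection so the next failure starts from the top. */
    synchronized void reset() {
        nextIndex = 0;
    }
}
```

With a list of A, B, C: if the agent fails over to B and connects, the caller invokes `reset()`, so when B later fails the first candidate handed out is A again.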
To close this JIRA out, I need to implement the following two things:
1) Have the agent periodically check to make sure it's connected to the "top" of the failover list. If the agent is not pointing to the top/primary server (even if it is connected to another server successfully), it should attempt to switch over. I will need to have the agent create another thread to do this polling.
2) Every time the client command sender is started (either via the poller, or the prompt command, or some other way), we need to send the connectAgent message to the server, effectively telling the server we are going to start sending it messages. We have a callback mechanism in the sender that we can leverage to get this to work.
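The two items above could look roughly like the sketch below. Everything in it is an assumption made for illustration: `SenderStartListener`, `CommandSenderSketch`, and `PrimaryCheckSketch` are hypothetical names, and the real sender's callback mechanism may have a different shape. The primary check is written as a plain decision function; the polling thread from item 1 would call it on a timer (for example from a `java.util.concurrent.ScheduledExecutorService`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical callback fired whenever the sender transitions into the sending state. */
interface SenderStartListener {
    void senderStarted();
}

/** Hypothetical sender that notifies listeners on each stopped -> sending transition. */
final class CommandSenderSketch {
    private final List<SenderStartListener> listeners = new ArrayList<>();
    private boolean sending = false;

    synchronized void addListener(SenderStartListener listener) {
        listeners.add(listener);
    }

    /** However the sender is started (poller, prompt command, ...), listeners fire once. */
    synchronized void startSending() {
        if (sending) {
            return; // already started: do not send connectAgent again
        }
        sending = true;
        for (SenderStartListener listener : listeners) {
            listener.senderStarted(); // e.g. send the connectAgent message here
        }
    }

    synchronized void stopSending() {
        sending = false;
    }
}

/** Decision logic for the periodic "am I on the primary?" check (item 1). */
final class PrimaryCheckSketch {
    /** Returns the server the agent should point at after this check. */
    static String chooseServer(String current, String primary, Predicate<String> canConnect) {
        if (current.equals(primary)) {
            return current; // already on the top of the list, nothing to do
        }
        // Only switch if the primary actually answers; otherwise stay where we are.
        return canConnect.test(primary) ? primary : current;
    }
}
```

Registering the connectAgent send as a listener means every start path goes through the same code, which is the point of item 2. The sketch invokes listeners while holding the sender's lock for simplicity; a real implementation would need to decide whether that is acceptable.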
We already have code in place that handles the case when an agent is out of contact with its server (and now, with its serverS). That's the whole point of the message spooling and guaranteed delivery.
In addition, we can't retry indefinitely - eventually, messages will time out and/or the agent will deadlock and/or the agent will run out of threads as it continually tries to send new messages when old messages haven't "timed out" or stopped trying to fail over.
I think if we do the "infinite looping", the agent will end up in a bad state - I think "bad things will happen".
If ALL of the servers in a cloud are down (which usually indicates a network failure - I think it will be a rare instance when all servers actually are down), the agent should enter its "started>" mode (i.e. the command sender gets stopped, guaranteed messages are spooled) and wait for the poller/auto-discovery to see it come back
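A rough sketch of that stopped-sender behavior, under stated assumptions: `SpoolingSenderSketch` is a hypothetical class, the in-memory `delivered` list stands in for the remote server, and rejecting non-guaranteed messages while stopped is an assumption, since the comments above only say what happens to guaranteed ones.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sender that spools guaranteed messages while it is stopped. */
final class SpoolingSenderSketch {
    enum Outcome { SENT, SPOOLED, REJECTED }

    private final Deque<String> spool = new ArrayDeque<>();
    private final List<String> delivered = new ArrayList<>(); // stands in for the server
    private boolean sending = true;

    synchronized Outcome send(String message, boolean guaranteed) {
        if (sending) {
            delivered.add(message);
            return Outcome.SENT;
        }
        if (guaranteed) {
            spool.addLast(message); // keep it for when a server comes back
            return Outcome.SPOOLED;
        }
        return Outcome.REJECTED; // ASSUMPTION: non-guaranteed messages fail fast
    }

    /** Called when the poller sees every server down. */
    synchronized void stopSending() {
        sending = false;
    }

    /** Called when the poller/auto-discovery sees a server again; flushes in order. */
    synchronized void startSending() {
        sending = true;
        while (!spool.isEmpty()) {
            delivered.add(spool.pollFirst());
        }
    }

    synchronized List<String> delivered() {
        return new ArrayList<>(delivered);
    }
}
```

Because callers get an immediate outcome instead of blocking on a retry loop, no sending thread is held while all servers are down, which is the concern raised about infinite looping.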