
Key: RHQ-1069
Type: Improvement
Status: Integrated
Resolution: Fixed
Priority: Critical
Assignee: John Mazzitelli
Reporter: John Mazzitelli
Votes: 0
Watchers: 0
RHQ Project

be more fault tolerant when failing to download plugins

Created: 06/Nov/08 07:54 AM   Updated: 08/Dec/08 07:30 AM
Component/s: Agent
Fix Version/s: 1.2

Time Tracking:
Not Specified

Issue Links:
Relation

Resolution Date: 15/Nov/08 11:11 AM
VCS Revision: 2,005


Description
When the agent starts up, it attempts to download its plugins. If it fails to download one or more plugins, it continues on. This probably shouldn't be the case. If, during startup, the agent fails to pull down a plugin, the agent should try to get it later.

I have seen this happen when the agent comes up while some servers in the cloud are down or, worse, go down during the download. The agent will attempt to switch over to another server, but when that happens the remote stream becomes invalid (a remote stream is only valid for the server where it originated). As soon as the switchover happens, the agent gets a remote stream error and the plugin fails to download. In this case, perhaps the agent should retry pulling down that plugin - be fault tolerant of the case where the agent switched over to another server under the covers.

If we don't fix this, an agent could have an incomplete set of plugins and may fail to start properly (if the plugin it failed to get was the platform plugin, the agent will certainly be dead in the water).

Comments
John Mazzitelli - 11/Nov/08 11:44 AM
I would like to get this fixed in 1.2.

John Mazzitelli - 15/Nov/08 11:11 AM
The agent will now attempt several times to download a plugin (sleeping for a bit between retries). Only if it fails multiple times will the agent give up.
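
The retry behavior described above can be sketched roughly as follows. This is only an illustration, not the actual agent code: the class name `PluginDownloadRetry`, the method `withRetries`, and its parameters are all hypothetical. The key assumption is that each attempt opens a fresh remote stream, since a stream obtained from one server is no longer valid after the agent switches over to another.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of RHQ: retries a download attempt with a pause in between. */
public class PluginDownloadRetry {

    /**
     * Runs the given attempt up to maxAttempts times, sleeping sleepMillis between failures.
     * The attempt is assumed to open a NEW remote stream every time it is called.
     */
    public static <T> T withRetries(Callable<T> attempt, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (InterruptedException ie) {
                // Do not retry if the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw ie;
            } catch (Exception e) {
                lastFailure = e;
                System.err.println("WARN: download attempt " + i + " of " + maxAttempts
                        + " failed, will retry if attempts remain: " + e.getMessage());
                if (i < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        // Every attempt failed: give up and report the last error.
        throw lastFailure;
    }
}
```

A caller would wrap the per-plugin download in a lambda and pass it in, for example `withRetries(() -> downloadPlugin(name), 5, 10000L)`, where `downloadPlugin` stands in for whatever the agent really uses to pull a plugin. The attempt count and sleep time here are placeholders, not the values the agent uses.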

John Mazzitelli - 15/Nov/08 11:16 AM
This might be tough to test due to the timing, but here's what to try to verify that it works:

1) have 1 or 2 servers in the cloud
2) start the agent
3) while the agent is downloading plugins, kill 1 or both servers
4) after a minute, restart the server(s)
5) in the agent log, you should see the agent successfully download all plugins after some warnings about needing to retry

It's tough because you have to kill the servers in step 3 at the exact moment the agent is downloading its plugins. Perhaps you could deploy a really fat plugin: create a temporary plugin with a very minimal, but valid, rhq-plugin.xml, but put very large files inside the plugin .jar so that the jar is 100MB or more. It will then take the agent a long time to download it, which gives the tester enough time to see the download start and to kill the server in the middle of the download.
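
One way to build such a fat plugin jar is sketched below. It is an illustration under stated assumptions: the class `FatPluginJar` and the entry name `padding.bin` are made up for this example, the descriptor is assumed to live at `META-INF/rhq-plugin.xml` inside the jar, and the caller must supply the bytes of a minimal but valid descriptor (none is invented here). The padding is random data so that jar compression does not shrink it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

/** Hypothetical test helper: writes a plugin jar padded with incompressible data. */
public class FatPluginJar {

    public static void write(Path jar, byte[] descriptor, long paddingBytes) throws IOException {
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out)) {
            // Assumed descriptor location; the bytes come from the caller.
            jos.putNextEntry(new JarEntry("META-INF/rhq-plugin.xml"));
            jos.write(descriptor);
            jos.closeEntry();

            // Random bytes do not compress, so the jar stays about as large as the padding.
            jos.putNextEntry(new JarEntry("padding.bin"));
            Random random = new Random(42);
            byte[] chunk = new byte[64 * 1024];
            long remaining = paddingBytes;
            while (remaining > 0) {
                random.nextBytes(chunk);
                int n = (int) Math.min(chunk.length, remaining);
                jos.write(chunk, 0, n);
                remaining -= n;
            }
            jos.closeEntry();
        }
    }
}
```

Calling `FatPluginJar.write(path, descriptorBytes, 100L * 1024 * 1024)` would produce a jar of roughly 100MB, which should slow the download enough to kill a server partway through.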