
Key: RHQ-1092
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Critical
Assignee: John Mazzitelli
Reporter: John Mazzitelli
Votes: 0
Watchers: 0
RHQ Project

be able to configure the availability "quiet time" before backfilling

Created: 08/Nov/08 06:30 PM   Updated: 22/Dec/08 09:17 PM
Component/s: FX - Availability, Core Server
Fix Version/s: 1.2

Resolution Date: 12/Nov/08 01:23 AM
Date of First Response: 22/Dec/08 09:17 PM
Tester: Corey Welton
VCS Revision: 1,969


Description
In very large environments, the server may not be able to process avail reports fast enough before our checkForSuspectAgents job backfills the agent. I see this every now and again: resources go red, then green, and ping-pong back and forth between up and down.

We should be able to configure the "quiet time" that we allow before backfilling. Currently, it's hardcoded to 2 minutes, and the agent sends avail messages every 1 minute. The agent is configurable: set "rhq.agent.plugins.availability-scan.period-secs" to have it send avail reports faster or slower than once a minute. We currently have a hack to configure the server (set a property in rhq-server.properties; see below). We should either consider putting a value in RHQ_SYSTEM_CONFIG so it's the same across the cloud (and changeable via the Admin UI page), or we could put some kind of smarts in the server so it could say something like, "I'm getting clobbered with a lot of agent messages/inventory reports/etc., so I'll let agents slide another 2 minutes on avail reports and won't backfill unless I don't hear from an agent in 4 minutes". The server could then readjust later when it catches up, back to the 2-minute backfill quiet time.
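The "smarts in the server" option could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, the pending-report count used as a load signal, and the overload threshold are all hypothetical and do not correspond to existing RHQ code.

```java
/** Hypothetical sketch: widen the backfill quiet time while the server is overloaded. */
final class AdaptiveQuietTime {
    /** The 2-minute quiet time that is currently hardcoded. */
    static final long BASE_QUIET_MILLIS = 1000L * 60 * 2;
    /** Extra slack (another 2 minutes) granted to agents while the server is swamped. */
    static final long EXTRA_QUIET_MILLIS = 1000L * 60 * 2;

    /**
     * @param pendingReports    assumed load signal: agent reports waiting to be processed
     * @param overloadThreshold assumed backlog size above which the server counts as overloaded
     * @return the quiet time, in milliseconds, to allow before backfilling
     */
    static long quietTimeMillis(int pendingReports, int overloadThreshold) {
        boolean overloaded = pendingReports > overloadThreshold;
        // Once the backlog drains, this falls back to the base 2-minute quiet time.
        return overloaded ? BASE_QUIET_MILLIS + EXTRA_QUIET_MILLIS : BASE_QUIET_MILLIS;
    }
}
```

With these assumed values, an overloaded server would wait 4 minutes before backfilling and return to 2 minutes once it catches up.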

In our AgentManagerBean we have:

    @SuppressWarnings("unchecked")
    public void checkForSuspectAgents() {
        if (log.isDebugEnabled())
            log.debug("Checking to see if there are agents that we suspect are down...");

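        // TODO [mazz]: make this configurable via SystemManager bean
        // default quiet time: 2 minutes, in milliseconds
        long maximumQuietTimeAllowed = 1000L * 60 * 2;
        try {
            // hidden override: the property value is interpreted as milliseconds
            String propStr = System.getProperty("rhq.server.agent-max-quiet-time-allowed");
            if (propStr != null) {
                maximumQuietTimeAllowed = Long.parseLong(propStr);
            }
        } catch (Exception e) {
            // ignore an unreadable or malformed property value and keep the default
        }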


Comments
John Mazzitelli - 08/Nov/08 09:01 PM
An alternative is to perform some additional checking after 2 minutes of quiet time but before we actually backfill.

Perhaps we can look in our DB for ANY activity from the agent right before we backfill. If we see that we have already processed (within the past 2 minutes) an inventory report, a measurement report, an operation result, a configuration change or any other agent-originating message, we can assume the agent is up and just hasn't been able to send us its avail report yet. In this case, we abort the backfill.

So it's:

1) checkForSuspectAgents looks for an avail report that occurred within the past 2 minutes. If nothing then:
2) check to see if the agent has sent us any message in the previous 2-minute interval (like an inventory report, measurement report, operation result, etc). If we DID get such a message from the agent, abort and do not backfill. Otherwise:
3) continue with the normal backfill processing

So step 2) would be new.
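The three steps above can be sketched as a single decision function. This is a minimal illustration, not RHQ code: the class and parameter names are hypothetical, and it assumes the server can look up the time of the agent's last avail report and of its last message of any kind.

```java
/** Hypothetical sketch of the proposed three-step backfill check. */
final class BackfillDecision {
    /**
     * @param now             current time, epoch millis
     * @param lastAvailReport time of the agent's last avail report, epoch millis
     * @param lastAnyMessage  time of the agent's last message of any kind, epoch millis
     * @param quietMillis     maximum quiet time allowed, e.g. 2 minutes
     * @return true if the agent should be backfilled
     */
    static boolean shouldBackfill(long now, long lastAvailReport, long lastAnyMessage, long quietMillis) {
        // Step 1: a recent avail report means the agent is not suspect.
        if (now - lastAvailReport <= quietMillis) {
            return false;
        }
        // Step 2 (new): any other recent agent message means the agent is assumed up.
        if (now - lastAnyMessage <= quietMillis) {
            return false;
        }
        // Step 3: nothing heard at all within the quiet time, so backfill as usual.
        return true;
    }
}
```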

John Mazzitelli - 11/Nov/08 11:25 AM
Making this critical. We need to at least explore the possibility of bumping up the quiet time interval and the avail report interval.

John Mazzitelli - 12/Nov/08 01:23 AM
The Admin > Server Config page now allows you to specify the agent max quiet time allowed setting, which is what our check-suspect-agent job will use. Therefore, this setting takes effect across the cloud. We no longer support the hidden system property override.
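For illustration only, reading such a server-wide setting with a safe fallback might look like the sketch below. It uses a plain java.util.Properties object as a stand-in for the system configuration, and the key name is an assumption; the actual lookup used by the fix is not shown in this issue.

```java
/** Hypothetical sketch: resolve the quiet time from a server-wide configuration. */
final class QuietTimeSetting {
    /** Assumed key name, for illustration only. */
    static final String KEY = "AGENT_MAX_QUIET_TIME_ALLOWED";
    /** Fallback: the original 2-minute default, in milliseconds. */
    static final long DEFAULT_MILLIS = 1000L * 60 * 2;

    /** Returns the configured quiet time in millis, or the default if missing or invalid. */
    static long resolve(java.util.Properties systemConfig) {
        String value = systemConfig.getProperty(KEY);
        if (value == null) {
            return DEFAULT_MILLIS;
        }
        try {
            long parsed = Long.parseLong(value.trim());
            // A non-positive quiet time would backfill every agent immediately, so reject it.
            return parsed > 0 ? parsed : DEFAULT_MILLIS;
        } catch (NumberFormatException e) {
            return DEFAULT_MILLIS;
        }
    }
}
```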

Corey Welton - 22/Dec/08 09:17 PM
QA Verified.