
Key: RHQ-1103
Type: Bug
Status: Integrated
Resolution: Fixed
Priority: Critical
Assignee: Joseph Marques
Reporter: Joseph Marques
Votes: 0
Watchers: 0
RHQ Project

slowness on resource group browser

Created: 11/Nov/08 04:32 PM   Updated: 15/Dec/08 10:02 AM
Component/s: Core UI, Core Server, FX - Resource Grouping
Affects Version/s: 1.1, 1.1.2
Fix Version/s: 1.2

Time Tracking:
Not Specified

Environment: 6-server HA environment; only 7 groups, but membership sizes ranging from 80 to 320, with mixed availability

Resolution Date: 04/Dec/08 11:47 AM
Date of First Response: 11/Nov/08 06:11 PM
VCS Revision: 2,126


Description
significant slowness loading the compatible groups page.

Comments
John Mazzitelli - 11/Nov/08 06:11 PM
We were just discussing this very thing this morning. On Greg's plate to think of a way to fix this.

See: http://jira.rhq-project.org/browse/RHQ-1002

John Mazzitelli - 11/Nov/08 06:19 PM
I may have jumped the gun - are you saying the "resource hub" page is slow? i.e. clicking on Browse Resources -> Compatible Groups link?

If so, that other JIRA is probably related but doesn't address that exactly.

reopening just in case

John Mazzitelli - 12/Nov/08 07:17 AM
from joe:

i think i might have a bead on why we see contention from time to time with queries like

update RHQ_AVAILABILITY set START_TIME=:1, END_TIME=:2, AVAILABILITY_TYPE=:3, RESOURCE_ID=:4 where ID=:5

the suspicion is that the group browser might be taking a long time to load because it takes a long time to find the current availabilities of all resources in the system. to test this, i created a recursive dynagroup of "groupby resource.parent.trait[Trait.hostname]", which created 6 groups, each containing all the resources that exist on one machine. the membership counts were: 16000, 3530, 1864, 653, 653, and 240.

it took 123 seconds to load the group definition view page, which essentially has the same logic as the group browser except that it's filtered by groupDefinitionId. yes, this is an extreme example, but i did it on purpose to test performance and database contention. sure enough, after doing so, i saw a lot of update statements (like the one above) sitting there blocked.

so, i think the availability subquery is hurting us here. after looking at some of the composite queries across the system that pull in the current availability, i see that we're using two different strategies: max(starttime) or endtime is null. we probably want to shy away from null endtime lookups, because there are plenty of dbs out there that can't index a null value (their argument being that you can't index the lack of a value).

but more than that, i think we should have a direct link from resource to availability. this way, we can use JPQL fragments like "resource.currentAvailability" or something like that. this would enable us to bypass the subquery altogether because joins would do a direct index lookup (as opposed to a subquery) to find the availability row in question. however, that requires a little bit of extra code in the availability manager to set the current availability on the corresponding resource object when the new RLE data comes across the line.

note, this is what we do for the pluginConfiguration and resourceConfiguration objects too (though, that was done to make navigating the object graph starting from a resource easier so as to make the dynamic JPQL generation for dynagroups simpler).
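
For readers following along, here is a minimal in-memory sketch of the two "current availability" lookup strategies mentioned in the comment above (greatest start time vs. null end time). It is not the RHQ code: the class CurrentAvailabilityLookup, its Availability row type, and the sample data are all hypothetical stand-ins for run-length-encoded rows in RHQ_AVAILABILITY. Either way, the lookup has to search the availability rows of each resource, which is the per-resource work a direct link would avoid.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; names do not come from the RHQ code base.
class CurrentAvailabilityLookup {
    enum AvailabilityType { UP, DOWN }

    /** One run-length-encoded row: a state that held from startTime until endTime (null = still open). */
    static final class Availability {
        final int resourceId;
        final long startTime;
        final Long endTime;
        final AvailabilityType type;

        Availability(int resourceId, long startTime, Long endTime, AvailabilityType type) {
            this.resourceId = resourceId;
            this.startTime = startTime;
            this.endTime = endTime;
            this.type = type;
        }
    }

    /** Strategy 1: the current row is the one with the greatest start time for the resource. */
    static Optional<Availability> byMaxStartTime(List<Availability> rows, int resourceId) {
        return rows.stream()
                .filter(r -> r.resourceId == resourceId)
                .max(Comparator.comparingLong((Availability r) -> r.startTime));
    }

    /** Strategy 2: the current row is the one whose end time is still null. */
    static Optional<Availability> byNullEndTime(List<Availability> rows, int resourceId) {
        return rows.stream()
                .filter(r -> r.resourceId == resourceId && r.endTime == null)
                .findFirst();
    }

    /** Made-up data: resource 1 went UP then DOWN, resource 2 has been UP throughout. */
    static List<Availability> sampleRows() {
        List<Availability> rows = new ArrayList<>();
        rows.add(new Availability(1, 100L, 200L, AvailabilityType.UP));
        rows.add(new Availability(1, 200L, null, AvailabilityType.DOWN));
        rows.add(new Availability(2, 150L, null, AvailabilityType.UP));
        return rows;
    }

    public static void main(String[] args) {
        List<Availability> rows = sampleRows();
        System.out.println(byMaxStartTime(rows, 1).get().type); // DOWN
        System.out.println(byNullEndTime(rows, 1).get().type);  // DOWN
        System.out.println(byNullEndTime(rows, 2).get().type);  // UP
    }
}
```

As long as the RLE data is well formed (exactly one open row per resource), both strategies return the same row.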

John Mazzitelli - 12/Nov/08 07:19 AM
> i think we should have a direct link from resource to availability

I'd be hesitant to add yet another relationship off of the Resource entity - won't that add more contention on an already-very-busy entity? (i.e. will we add additional contention elsewhere in the system that is using the RHQ_RESOURCE table - which is probably used in many more places than the avail data)

Joseph Marques - 12/Nov/08 12:37 PM
a couple reasons i think we should have a direct link:

* although we have many things that hang off of Resource, it is a relatively static construct
** this entity does not change often, nor does its row count (once your inventory has stabilized after import), as opposed to availability data, where RLE processing must insert a new row as well as update an existing row for each member in the availability report
* availability table is several times larger than resource - if each resource in the system has over time gone down and come back up, say, 10 times, then the availability table will be 10x as large as the resource table
* in stable state (resource importing done, uninventory/deletes are infrequent), availability reporting happens much more frequently than the discovery reports

in fact, since availability data is so frequently used, i'm even tempted to consider a denormalized structure where the last known AvailabilityType enum (UP/DOWN) is directly set on the Resource entity. something like:

@Entity Resource { AvailabilityType currentAvailability; }

this would allow ALL queries that need to serve up availability data with the resource (which is a large majority of the resource queries) to bypass the rhq_availability table altogether. all the AvailabilityManagerBean needs to do is keep this field up to date when a new entry is inserted into the availability table. this way, the rhq_resource table is the only thing needed for composite queries, while the rhq_availability table is used for the monitor subtab to show the "xmas tree lights".

Greg Hinkle - 12/Nov/08 01:39 PM
What is the subquery that is slow? Have we proven that it's that part that is slowing things down, and does it change much switching between max(startTime) and null endTime?

I'm ok with denormalizing if it will improve things, but I don't want to optimize the wrong thing or make it worse. The Resource table is much wider than the avail table and will be significantly more expensive to update in terms of io and cache. This would be worse when we're doing the checkForSuspectAgents task of marking all resources on a box down.

Joseph Marques - 19/Nov/08 02:58 PM
schema updates
* new table - rhq_current_availability(resource_id, availability_type)
* resource_id fk to rhq_resource(id)
* index on resource_id

db-upgrade
* insert into rhq_current_availability(resource_id, availability_type)
select res.id, (correlated subQ for current avail) from rhq_resource res

biz logic
* upon resource import, insert row into rhq_current_availability too (null availability_type --> unknown)
* upon receipt of availability report, update rhq_current_availability table

ui changes
* update resource group browser queries to hit this rhq_current_availability table, instead of 2 joins with correlated subQ (rhq_resource_group left join rhq_resource left join rhq_availability w/correlated subQ between res and avail where avail has the max starttime or null endtime)
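
A rough in-memory sketch of the biz logic outlined above, for illustration only. The class PrecomputedAvailability and its methods are hypothetical, not the code that was committed: a list stands in for rhq_availability, and a map stands in for rhq_current_availability. The openRows map is purely a convenience of this sketch. Importing a resource seeds a null (unknown) entry, and each availability report updates the RLE history and the precomputed value together.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names do not come from the RHQ code base.
class PrecomputedAvailability {
    enum AvailabilityType { UP, DOWN }

    /** One RLE row in the history table; endTime stays null while the run is open. */
    static final class AvailabilityRow {
        final int resourceId;
        final long startTime;
        Long endTime;
        final AvailabilityType type;

        AvailabilityRow(int resourceId, long startTime, AvailabilityType type) {
            this.resourceId = resourceId;
            this.startTime = startTime;
            this.type = type;
        }
    }

    // Stand-in for rhq_availability (full history).
    private final List<AvailabilityRow> history = new ArrayList<>();
    // Open (endTime == null) row per resource, kept here only to avoid scanning the history.
    private final Map<Integer, AvailabilityRow> openRows = new HashMap<>();
    // Stand-in for rhq_current_availability: resource_id -> availability_type (null = unknown).
    private final Map<Integer, AvailabilityType> current = new HashMap<>();

    /** Upon resource import: seed a row with a null (unknown) availability type. */
    void importResource(int resourceId) {
        current.put(resourceId, null);
    }

    /** Upon receipt of an availability report entry: update history and the precomputed value. */
    void report(int resourceId, long time, AvailabilityType type) {
        if (!current.containsKey(resourceId)) {
            throw new IllegalArgumentException("resource not in inventory: " + resourceId);
        }
        AvailabilityRow open = openRows.get(resourceId);
        if (open != null && open.type == type) {
            return; // same state as the open run: RLE means nothing new to write
        }
        if (open != null) {
            open.endTime = time; // close the previous run (the "update an existing row" part)
        }
        AvailabilityRow next = new AvailabilityRow(resourceId, time, type);
        history.add(next); // the "insert a new row" part
        openRows.put(resourceId, next);
        current.put(resourceId, type); // keep the precomputed table in step
    }

    /** What a group browser query would read: one direct lookup, no subquery over history. */
    AvailabilityType currentAvailability(int resourceId) {
        return current.get(resourceId);
    }

    int historySize() {
        return history.size();
    }

    public static void main(String[] args) {
        PrecomputedAvailability store = new PrecomputedAvailability();
        store.importResource(7);
        store.report(7, 100L, AvailabilityType.UP);
        store.report(7, 200L, AvailabilityType.UP);   // no state change, no new row
        store.report(7, 300L, AvailabilityType.DOWN);
        System.out.println(store.currentAvailability(7)); // DOWN
        System.out.println(store.historySize());          // 2
    }
}
```

The point of the design is that readers of the current state never touch the history, at the cost of one extra write per state change.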

Joseph Marques - 20/Nov/08 10:51 PM
branch FEATURE_PRECOMPUTE_AVAIL:
rev2054 - precompute current / latest resource availability data;
use precompute data to improve named queries in ResourceGroup entity that display aggregate/average group availability;

Joseph Marques - 04/Dec/08 11:47 AM
rev2064 - merge branches/FEATURE_PRECOMPUTE_AVAIL back into trunk;

rev2073 - without setting the resourceId explicitly, this blows up upon insert (because resource.getId() does not match resourceId)

rev2074 - use hibernate postPersist hooks to seed the rhq_resource_avail table during initial persistence of resources, instead of as separate calls to entityManager.persist(ResourceAvailability) in DiscoveryBossBean;
this isn't just a cleaner methodology, it's critical - without this fix, rhq_resource_avail table will only ever get persisted with the InventoryReport's root resources;

rev2075 - use surrogate id field for rhq_resource_availability;
update other things to work in accordance with that;

rev2076 - additional processing to support updating resource availability during backfilling procedure;
break ResourceAvailability processing into its own SLSB;

rev2077 - update monitor tab auto-group queries to use precompute resource availability;

rev2078 - update monitor tab auto-group children queries to use precompute resource availability;

rev2079 - yet more monitor tab updates for auto-group and/or auto-group children queries to use precompute resource availability;

rev2080 - rest of the monitor tab updates for auto-group and/or auto-group children queries to use precompute resource availability;

rev2081 - LEFT JOIN the resource's availability so the cardinality of the query and countQuery are the same when viewing groups with 0 resource members;

rev2082 - LEFT JOIN the resource's availability so the cardinality of the query and countQuery are the same when viewing group detail page with 0 resource members;

rev2083 - update resourceGroup queries that display group membership details to use precompute resource availability;

rev2084 - update resource browser queries that display platforms / servers / services to use precompute resource availability;

rev2085 - fix query to update ResourceAvailability entities when back-filling occurs;

rev2098 - necessary logic to ensure ResourceAvailability already has a record for every COMMITTED resource in inventory;
see comment in ResourceAvailabilityManagerLocal for details;

rev2116 - fix availability tests;

rev2126 - no longer need to explicitly create the ResourceAvailability objects in the SLSB, this is done now as a post-persist hook;