Need Some SQL Help

fluxion · Mar 19, 2009

I have 2 tables roughly defined as follows (using mysql btw):

Code:

[b]table: bundles[/b]
bundle_id INT
machine_name VARCHAR
state VARCHAR

to help put it in context, a "bundle" is basically a list of software packages to be installed on the machine "machine_name" in a specific order. state is "new", "processing", or "complete" depending on whether the software packages havent begun installing, have begun installing but arent yet complete, or have finished installing, respectively.

the actual packages/install jobs are stored in a seperate table:

Code:

[b]table: install_jobs[/b]
package_name VARCHAR
bundle_id INT
state VARCHAR
seq_num INT

bundle_id is the id corresponding to the bundle/machine the install job is for. state is "pending" or "complete", depending on whether or not the installation of the package on machine_ipaddr is complete. seq_num is the number corresponding to the order in which that package is to be installed *for a particular bundle/machine*

so i might have something like:

Code:

[b]bundle_id, machine_name, state[/b]
1, "machine1", "complete"
2, "machine2", "processing"
3, "machine3", "new"

[b]package_name, bundle_id, state, seq_num[/b]
"quake", 1, "complete", 1
"counterstrike", 2, "pending", 2
"crysis", 2, "pending", 1
"doom", 3, "pending", 1
"mariokart", 3, "pending", 2

i have a routine get_jobs() that needs to do the following:

for each machine_name that is not in a state "complete"
- find the package in install_jobs with the lowest seq_num, that is not in a state "complete", and belongs to that machine/bundle
- add (machine_name, package_name) to the @jobs array

so if i ran get_jobs() on the above tables i'd get the following list:

Code:

("machine2", "crysis")
("machine3", "doom")

nothing for "machine1" is listed because it was already marked complete. "crysis" was chosen over "counterstrike" because it had a lower sequence number...ditto for "doom" over "mariokart".

make sense? then i'd do the install, mark those installs as "complete", and repeat.

so the lazy version is something like:

Code:

@jobs = ()
@machines = SELECT machine_name FROM bundles WHERE state != 'complete'

foreach $machine_name in @machines:
  $package_name = SELECT package_name FROM bundles, install_jobs WHERE bundles.bundle_id = install_jobs.bundle_id AND install_jobs.state = 'pending' ORDER BY seq_num ASC LIMIT 1
  add ($machine_name, $package_name) to @jobs list

return @jobs list

which is okay i guess, but this is gonna be fairly heavilly trafficked, and its a web query so if i do a bajillion SQL queries in the for() loop it might time out or something (could be over a thousand machines)... and possibly for reasons of concurrency i'd rather have all this done in 1, nice little SQL query. i.e.

@jobs = <insert awesome SQL statement here>

hopefully this doesnt come off too messy. its actually a nice, clean little problem to think about when you cut to the root of it.

i blew half a day trying to figure this out, reading up on joins and subselect statements and whatnot and i think i'm a bit out of my element here, so any help would be MUCH appreciated. and if you dont think its possible let me know so i can stop wasting my time on this, and maybe reconsider the db structure a bit

im hitting the sack but ill check back in the morning. TIA

heatlesssun · Mar 19, 2009

This query should give you the rows that you want:

SELECT t2.machine_name, t1.package_name FROM install_jobs t1 INNER JOIN bundles t2 ON t1.bundle_id=t2.bundle_id WHERE t2.state!='complete' AND t1.seq_num=(SELECT MIN(seq_num) FROM install_jobs WHERE t1.bundle_id=bundle_id) GROUP BY t2.machine_name, t1.package_name

Then just loop through each row and execute your task. You could update the database with one update statement as well but then you get into the issue of setting up a transaction and rolling back in case of failure of your install task.

Also you'll want to put indexes and keys on these tables if they are going to be large and constantly accessed.

fluxion · Mar 19, 2009

wow, awesome!

so, just to confirm...a subselect can utilize row data from outside the subselect query? so its basically running the subselect for each row processed by the main select?

good stuff....ill try this as soon as i get to work. thanks!

heatlesssun · Mar 19, 2009

Yes to your question. A subselect is just a way of expressing a join, its easier for me conceptualize and when the tables are properly key and indexed there, like joins they are very efficient since they are joins.

Please let me know if this worked for you when you get a chance.

fluxion · Mar 19, 2009

thanks heatlesssun!

i tried you're query pretty much verbatim and it seems to do the trick:

SELECT machine_name, package_name FROM bundles INNER JOIN install_jobs USING(bundle_id)
WHERE bundles.state != 'complete'
AND install_jobs.seq_num = (SELECT MIN(seq_num) FROM install_jobs WHERE bundles.bundles_id = install_jobs.bundle_id);

one question, is there any significance to the GROUP BY i left off? since there's only gonna be 1 package per machine i didnt really see where it would come into play

heatlesssun · Mar 19, 2009

I believe that you are correct, as long as the sequence numbers are unique within a bundlle, there's no no need for the GROUP BY clause, doesn't effect the results but good catch!

fluxion · Mar 19, 2009

ahh, right, glad you mentioned that. i was planning on dealing with bundles that dont have a specified install order by setting all the seq_num's to be 0, in which case i'd violate my 1-package-per-machine constraint. So i did need the GROUP BY

thanks again!

Need Some SQL Help

fluxion

Gawd

heatlesssun

Extremely [H]

fluxion

Gawd

heatlesssun

Extremely [H]

fluxion

Gawd

heatlesssun

Extremely [H]

fluxion

Gawd