Need MySQL Query Optimization help

mystykmax

I have a possibly very, very large InnoDB table in a MySQL 5.0 db. For the sake of simplicity, let's say it has 4 columns: three email columns (email, e_alt1, e_alt2) and the primary key, id. I've run into an issue that I need to clean up: sometimes one row's alternate address is another row's primary address, e.g. (id=1, email=A) alongside (id=2, email=B, e_alt1=A). So what I want is to set e_alt1 = NULL if I find its value listed in the email column of a different id value. So stupidly I tried:

Code:
UPDATE table1 t1, table t2 SET t1.e_alt1 = NULL WHERE t1.email=t2.e_alt1
which works fine for a smallish number of rows, but of course with a large set... well... that just won't work, even though there are indexes on all 4 columns we're operating on here. (In reality the table has at least 25 columns, and probably more like 40+ in the wild.)

Can anyone help me optimize this query or am I stuck and need to just protect against the possibility of this issue in my UI?
 
Your query doesn't match what you describe, though I guess it's symmetrical. Are table1 and table intentionally different? From your description, I figured that we were thinking of the same table.

You might also want to add "and t1.id <> t2.id" to it. Otherwise, you're going to destroy data.

Code:
ID,email,e_alt1
1,FOO,BAR
2,BAR,FOO

will end up being set to

Code:
1,FOO,NULL
2,BAR,NULL

and I'm not positive that's what you want.
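
For reference, here's the same multi-table UPDATE with that guard tacked on, keeping your original comparison as-is and assuming both aliases were meant to be table1 (untested sketch):

Code:
UPDATE table1 t1, table1 t2
  SET t1.e_alt1 = NULL
  WHERE t1.email = t2.e_alt1
    AND t1.id <> t2.id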

You're generating a Cartesian product by self-joining and forcing the database to probe every single row for each row it considers. That's necessarily a long operation for large tables, as the work grows quadratically. It might be faster to code your exact intent; something like this:

Code:
UPDATE Table1
  SET e_alt1 = NULL
  WHERE ID IN
  (
    SELECT ID
      FROM Table1 T1
      WHERE EXISTS (SELECT 1 FROM Table1 T2
                    WHERE T1.e_alt1 = T2.email AND T1.ID <> T2.ID)
  )

depending on how the optimizer treats updates on self-join result sets. You can use that sub-select separately, perhaps with a LIMIT, to make sure you're going to attack the correct rows.
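
Something along these lines, say (same placeholder names as above, untested):

Code:
SELECT ID
  FROM Table1 T1
  WHERE EXISTS (SELECT 1 FROM Table1 T2
                WHERE T1.e_alt1 = T2.email AND T1.ID <> T2.ID)
  LIMIT 100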

Depending on your requirements, another approach to take might involve having your application start to correct the database. That is, whenever someone saves their record, also check this condition and fix it. This will let you fix records as you go, instead of all in one batch.
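
As a rough sketch of what that per-save check could look like -- two separate statements run by the application, with the ? placeholders bound at save time (all the names here are just the placeholders from above):

Code:
-- 1) does the e_alt1 value being saved appear as someone else's primary email?
SELECT COUNT(*)
  FROM table1
  WHERE email = ?      -- the e_alt1 value being saved
    AND id <> ?;       -- the id of the record being saved

-- 2) if the count is non-zero, the application clears the field as part of the save
UPDATE table1 SET e_alt1 = NULL WHERE id = ?;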
 
I am guessing that you are only dealing with one table here, and when you wrote "table1" you meant "table", correct? In other words, "t1" and "t2" are the same table.

First, mikeblas is right that you need to make sure the id's are not the same, as part of your WHERE clause.

Also, I am assuming that you'll be checking e_alt1 AND e_alt2 for duplicates in email.

I would also include "e_alt1 IS NOT NULL" in the WHERE clause (a bare e_alt1 <> NULL never matches in SQL), and exclude blanks too, if that's something that could occur.

I do not understand mikeblas' statement regarding this being a Cartesian product that grows quadratically. In my mind, the database is going to do a single pass through the table (t2), looking for matches in (t1). So for each row there is a single lookup and a possible assignment to e_alt1 or e_alt2. But mikeblas is sharper on some of this stuff than I am - mikeblas, can you explain?

So, in conclusion, it is what it is. The only other thing I can think of is for you to show us what your indexes look like (is e_alt1 a key unto itself, or is it part of a compound index?).
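
(In MySQL, SHOW INDEX or SHOW CREATE TABLE will dump that for you, using your placeholder table name:)

Code:
SHOW INDEX FROM table1;
-- or the whole table definition, indexes included:
SHOW CREATE TABLE table1;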
 
I did make a bit of a mistake there and I appreciate you guys catching it. It really is the same table joined to itself. In the example below:

id, email, e_alt1
1, foo, bar
2, bar, foo

I want to end up like:
1, foo, NULL
2, bar, NULL

In reality I have to do this for e_alt2 as well.
 
Let's say n is the number of rows in the table.

To find a single row in the table against the index, we're going to look through the index. The index is a BTree, so the time it takes to find a row (or not) is O(log(n)). The base of the logarithm isn't important.

For each row in the table, we need to try to find a matching row in the table. So that means we're going to do an O(log(n)) operation n times, which brings us to O(n*log(n)).

Some people call this a "linearithmic" function. It's not quite as bad as a real quadratic, but it's certainly worse than linear.
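
To put rough numbers on that (just arithmetic, nothing measured):

$$10^4 \cdot \log_2(10^4) \approx 1.3\times10^5 \qquad\text{vs.}\qquad 10^7 \cdot \log_2(10^7) \approx 2.3\times10^8$$

so going from ten thousand rows to ten million rows multiplies the probe count by roughly 1,750, not just 1,000.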

Since we're talking about databases, it's really worse than linear. The reason is that the operation of finding a row involves I/O. If we think about that look through the B-tree to find the row we want, we revisit the implementation of the algorithm. We get the root node; we figure out which branch off the root node might have the key we want. We get that node. We figure out which child of that node has the key we want, and so on -- stopping when we fall off the leaf node, or when we manage to find the key.

In that algorithm, two interesting things are happening. First, the comparison involved is of two strings: the key string and the candidate string. If the columns are fancy and know about different code pages and comparison rules in foreign languages, we spend a lot of time switching around character sets and normalizing the keys for sorting and comparison. Even if not, the operation of a string compare is O(c), where c is the number of characters in the string. So now we're at O(log(n)*c) for this operation. The string comparison is nasty, but it's really minor in comparison to the other thing that happens: I/O.

When we read a node in the B-tree, we're doing an I/O operation. Unless mystykmax has some really fast drives, each I/O operation is going to take ten or twelve milliseconds. They're also likely to be random I/Os, since there's no good way to sequentially read a tree. (So even if he does have a great drive array, the seek time is still a few milliseconds--and that's assuming nobody else is using the system.) If we're lucky, some pre-fetching happens. There are cache hits, and so on. If this table and all of its indexes fit in memory, we're in better shape--once they get into memory.

O(log(n)*c) tells us the number of operations we're doing. It turns out the cost of the operations, though, can be really high. Each operation involves an I/O, probing a cache, and then doing string compares. I guess O(log(n)*c) isn't quite accurate, since we do more than one string compare per page that we read.

We're used to thinking that when n is small, the algorithms are quick. That's true. Here, though, there's a huge knee in the O(n*log(n)) curve that we're expecting to see. When n is small, all the rows and the indexes fit into memory. When n gets big, then the table and its indexes take more and more memory until you're out of memory, and you can't hold everything in cache ... and then the cost of each operation steps substantially higher, and your performance becomes just awful.

And all this doesn't even count the update operation, or any of the locking, or any of that tomfoolery.

And that's why databases are slow!
 
hi guys,

so I tried what mikeblas said, or at least how I interpreted it...:)

Code:
UPDATE contacts
    ->   SET email_alt1 = NULL
    ->   WHERE c_id
    ->   IN
    ->   (
    -> SELECT c_id
    ->   FROM contacts c1
    ->   WHERE EXISTS (SELECT 1 FROM contacts c2 WHERE c1.email_alt1= c2.email AND c1.c_id <> c2.c_id)
    ->   );
ERROR 1093 (HY000): You can't specify target table 'contacts' for update in FROM clause
 
Are you comparing email = e_alt1, email = e_alt2, and e_alt1 = e_alt2? Or just against email?

Basically, what I need to do is look through and see if a contact's email_alt1 matches the email of any other contact. If it does, I need to set email_alt1 = NULL. Then I must do the same for email_alt2.
 
Mike,

Reading your post made me realize that what I need to do is a pretty nasty query. The contacts table I'm dealing with could be over 10 million rows in some bad cases. I think trying to clean up the data this way is just going to take too long. The servers are all RAID and so forth, but at 10 million rows, I think we can forget about a fast way to do this unless there's something I'm not seeing, but I'm certainly open to ideas.

 
hi guys,

so I tried what mikeblas said, or at least how I interpreted it...:)
That's what I'm talking about, yes. Sorry; for obvious reasons, I don't use MySQL that much. The query I wrote works just fine under SQL Server, but apparently your version of MySQL can't cope with updating a table from which it is also executing a subselect to drive the update. Sorry.
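
One workaround I've seen suggested (though I haven't tried it on your version) is to wrap the subselect in one more derived table so MySQL materializes it into a temporary table first, which sidesteps that restriction:

Code:
UPDATE contacts
  SET email_alt1 = NULL
  WHERE c_id IN
  (
    SELECT c_id FROM
    (
      SELECT c1.c_id
        FROM contacts c1
        WHERE EXISTS (SELECT 1 FROM contacts c2
                      WHERE c1.email_alt1 = c2.email AND c1.c_id <> c2.c_id)
    ) AS dupes
  );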

but at 10 million rows, I think we can forget about a fast way to do this unless there's something I'm not seeing, but I'm certainly open to ideas.
If you have to do a lot of work, then it might not be fast. There's not always a silver bullet, or some magic optimization to bail you out.

You've not said much about your requirements; where this has to run, how fast it has to be, what the server hardware is, what the application is, what your uptime requirements are and so on. Without knowing that, it's hard to know what solutions might be acceptable for you.

Maybe you just suck it up: take the downtime, run the query, come back online with a fix. Maybe you can write code that does a few accounts at a time, eventually fixing them all. Perhaps you can write code that gets the list of accounts that are iffy, stores it, and then walks through that list doing the update as a separate step. Maybe you can get a backup of the database, mount it on a different server, run through the data there, and then force the update on the production server. I already offered my suggestion above about doing it incrementally per update from the application. Or perhaps this is a bargaining chip for a hardware upgrade. RAID arrays are nice; the problem I was pointing out was that you need more memory. There are probably other ideas; maybe you can come up with some, or other readers of this thread have a few.
 
Have you thought of inserting id's into a scratch table through an aliased self join? Then run an update off of the contents of the scratch table?

I suck at ANSI joins, but here is how I'd do it in Oracle.

Code:
Insert into scratcha
select b.id
from table1 a, table1 b
where a.email = b.e_alt1;

Insert into scratchb
select b.id
from table1 a, table1 b
where a.email = b.e_alt2;

update table1
set e_alt1 = ''
where id in (select id from scratcha);

update table1
set e_alt2 = ''
where id in (select id from scratchb);
 
I kinda like geiger's idea of a temp table but I'd try to modify it a touch...

Code:
create table scratch as
select t2.id
from sometable t1
inner join sometable t2
    on (t1.email = t2.alt_email)
where t2.alt_email is not null;  -- this might help a bit?

update sometable
set alt_email = null
where id in (select id from scratch);

Any way you slice it, it's going to be ugly.

Preventing this in the future isn't exactly pretty, either - dealing with 25 columns of 'equivalent' data is going to be rough.
 
Have you thought of inserting id's into a scratch table through an aliased self join? Then run an update off of the contents of the scratch table?
I've made this suggestion above, but we haven't heard back from mystykmax about it, yet. This query won't be any faster--it'll be slower, actually. The advantage in using it would be that once the temporary table is built, you can slowly work over the rows in the generated table to reduce load. Unfortunately, I think the SELECT is by far the more expensive part of the query.

If the query takes a lot longer than his maintenance window, I think the solution will be to write a program which does the work slowly over time. Get a list of each account; then, for each account, test it and fix it, then wait a few seconds, then do it again. Spreading the work out actually ends up being more work for the server, but doing the work in short bursts keeps the system unloaded enough that it's responsive to its normal workload.

Of course, this would be done after fixing the client code to make sure new problems aren't added to the database behind the working utility.

Taking a look at the indexes isn't a bad idea. If the optimizer in the version of MySQL that mystykmax is using is any good, and usable indexes exist on all the fields involved, the operation can be satisfied by a merge join instead of a nested loops join.

mystykmax, have you shown us the output from EXPLAIN on the query that you're running? And the exact query that you intend to run? How long does the query take to run on your populated test server?
 
When I ran the query using the WHERE clause, or putting the join right after the word "UPDATE", it took about 6 hours to run.

To give you a better idea of the whole process here: there is a large set of upgrade steps, ranging from simple data cleanup to ALTER TABLE statements. That process is tucked inside a larger process of file replacement, ad infinitum. Adding an hour to my part of the upgrade will be completely unacceptable to "the powers that be".

Can I run EXPLAIN in MySQL on an UPDATE statement?

It would be an interesting idea to fix the data in post.
 
As I mentioned, I spend as little time with MySQL as possible. But I'd expect to be able to run a command like EXPLAIN against any statement I could send to the server.
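
If EXPLAIN won't take the UPDATE directly, you could explain the equivalent SELECT instead and look at the access plan for the join, something like this (reusing the column names from your query):

Code:
EXPLAIN SELECT c1.c_id
  FROM contacts c1
  WHERE EXISTS (SELECT 1 FROM contacts c2
                WHERE c1.email_alt1 = c2.email AND c1.c_id <> c2.c_id);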

How big is the table, and its involved indexes, in bytes?

Given your time constraints, here's what I'd do:


1) Fix the bug in the code that lets this happen.
2) Deploy that code.
3) Take a backup of the database.
4) Mount the backup on a test server.
5) Write code that does the join against the test server, saving the results into a new table on the test server.
6) Write a program that connects to the test server and the production server. Every few minutes, read a few matching records from the result table, test that the condition is still true for those records, then execute the update for the ones that still match.
7) Loop, slowly. Sleep for five minutes, do a thousand rows, and so on--whatever you think your server can handle while it's still live.
8) Done.

By not loading the production server with the whole query, you can still get the job done and stay up. At Step 5, you might find that the rows affected are only a few, and maybe you can just do one quick batch during your next scheduled downtime--or even live, if you have enough headroom.
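
For step 6, each pass might boil down to something like this (the id values are made up for illustration; the real list would come from the result table built in step 5):

Code:
-- re-check that the condition still holds for the next batch of candidate ids
SELECT c1.c_id
  FROM contacts c1
  WHERE c1.c_id IN (1001, 1002, 1003)
    AND EXISTS (SELECT 1 FROM contacts c2
                WHERE c2.email = c1.email_alt1 AND c2.c_id <> c1.c_id);

-- then null out only the ids that re-checked true
UPDATE contacts SET email_alt1 = NULL WHERE c_id IN (1001, 1003);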
 
That sounds like a good plan, Mike. I'm not sure how big each customer's table of contacts is, but I know some can reach 20 million rows while some are much smaller. I'm not sure how to calculate or find out how big the table is including indexes in bytes, but I'd be willing to learn. I could probably ask our hosting dept. If I exclude all rows where email_alt1 and email_alt2 are already NULL, this problem might get much smaller. I'll have to get our hosting dept to get me some numbers to find out if that's a viable way to fix this faster. Thanks again for all the help here. I promise to give back to the forum as I can.
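
It looks like SHOW TABLE STATUS reports Data_length and Index_length in bytes, so that's probably where to start:

Code:
SHOW TABLE STATUS LIKE 'contacts';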
 
Name: contacts
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 738094
Avg_row_length: 508
Data_length: 375390208
Max_data_length: 0
Index_length: 510722048
 
Since the table is less than a gig, including indexes, it seems surprising that this query takes so long on the hardware you describe. The rows are pretty wide, but that shouldn't be so much of a problem. What's going on with the indexes?
 
Are you seeing something about the indexes or just wondering why the performance is so crappy?
 
You haven't shown us the indexes yet. If the indexes don't help the query, then you're screwed. Finding a row with an index is something like O(log(n)). Finding a row without the index is O(n); it's a huge difference.
 
sorry bout that...

There are indexes on all the columns involved...
the id of the contact, and then all three email fields...email, email_alt1, email_alt2 as single column indexes.
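
Roughly equivalent to definitions like these (the index names here are made up):

Code:
CREATE INDEX idx_email      ON contacts (email);
CREATE INDEX idx_email_alt1 ON contacts (email_alt1);
CREATE INDEX idx_email_alt2 ON contacts (email_alt2);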
 
There are four indexes, then? It seems remarkable that the query is so slow.
 
Maybe some of the "knobs" in MySQL need to be tuned or something. There are actually more than 4 indexes, but I've truncated some things here to simplify.
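
I'm guessing the first knob to look at is the InnoDB buffer pool, given what was said about memory (sketch only; the number below is just an example, not our actual setting):

Code:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- and in my.cnf, something like:
-- innodb_buffer_pool_size = 1024M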
 