Sunday, December 17, 2006

Primary Key Order Does Matter!

There have been a few posts on PlanetMySQL regarding primary keys and the importance of choosing the right one. This is even more important when the table uses InnoDB. You've read different posts of why it is so important. Now, I'm all about benchmarks and showing the details. So I'll take a table from my previous posts about MySQL 5.1 partitioning and show what I found.

This table was created under MySQL 5.1.12-beta:

CREATE TABLE `big_table_test1` (
`entity_id` int(11) NOT NULL DEFAULT '0',
`col1` int(11) NOT NULL DEFAULT '0',
`col2` int(11) NOT NULL DEFAULT '0',
`col3` int(11) NOT NULL DEFAULT '0',
`col4` int(11) NOT NULL DEFAULT '0',
`col5` int(11) NOT NULL DEFAULT '0',
`col6` int(11) NOT NULL DEFAULT '0',
`ymdh` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`imps` bigint(20) NOT NULL DEFAULT '0',
`clicks` int(11) NOT NULL DEFAULT '0',
`convs` int(11) NOT NULL DEFAULT '0',
`id` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`,`ymdh`),
KEY `ix_big1` (`ymdh`,`entity_id`,`col3`) USING BTREE,
KEY `ix_big2` (`ymdh`,`entity_id`,`col4`) USING BTREE,
KEY `ix_big3` (`ymdh`,`entity_id`,`col2`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT


I loaded about 180 million records into this table (a small set of data for us!) and ran one of our really popular types of queries:



SELECT col1,col2,col3,SUM(imps),SUM(clicks),SUM(convs)
FROM big_table_test1
WHERE ymdh IN ('2006-10-01 00:00:00','2006-10-02 00:00:00','2006-10-03 00:00:00',
'2006-10-04 00:00:00','2006-10-05 00:00:00','2006-10-06 00:00:00',
'2006-10-07 00:00:00')
AND entity_id = 2
GROUP BY col1, col2, col3
ORDER BY col1, col2, col3;
Doesn't look terribly nasty does it? This query takes about 7 MINUTES to run!!! EXPLAIN on the query shows nothing out of the ordinary, as it uses one of the secondary indexes on the table. The cardinality of entity_id is not really high, so forcing one of the secondary indexes over another wouldn't yield any performance benefits. The id column is basically a numerical hash of the tables "real" primary key, which is entity_id plus col1 through col6, and is used for uniqueness. What's interesting is that throughout our application, there are no direct queries against this id column. It just exists. But, it can't be removed.

If this column really serves no really significant value, what if we swapped the order of the definition of the primary key? So the definition of the primary key looks like:

PRIMARY KEY (`ymdh`,`id`)

Logically, no difference so we do not break any uniqueness constraints in the application. If we run the query again, 4 SECONDS!!!! Wow! How do we explain this massive performance increase?

Remember that InnoDB uses a clustered index for the primary key. Clustered indexes are indexes that are built based on the same key by which the data is ordered on disk. They are very efficient during scanning, but have performance implications when inserting new data, as some re-ordering may need to be done. All of our data is inserted in ymdh column order, so it makes sense if the primary key was based on this column. There are a lot of efficiencies that can be obtained, such as sequential disk read-ahead. The previous index for the primary key needs lots of random disk I/O to read the data portion of the table.

5 comments:

santosh said...

Remember that InnoDB uses a clustered index for the primary key. Clustered indexes are indexes that are built based on the same key by which the data is ordered on disk.

So, in the first case, the data on the disk would be ordered by "id" in a monotonically increasing/decreasing order? Would that be correct? If true, INSERT's would also take longer and involve more work.

Or have I misunderstood?

anru said...

hi:

i am not a expert on mysql , just use it on job.

you article is also suprised me, since change order of primay key will have such a big performance impact.

but since you did not write out how you actully test query, so i have to say have you check other things? before claim change order of primay key will do a such big improvement to query?

"other things" i mean, do you have any cache system running on same machine as mysql server, probabaly query result comes cache system, or any configuration setting will affect query speed.

if change order of primay key will have that big impact, i think this is defintely a bug. you should report it to innodb team.

regards,

ctx2002

Chip said...

I think you're misunderstanding what is going on here. The problem is the indexes on ymdh do not include all of the columns you are querying. This means, for every row innodb fetches from the secondary index, it then must probe the primary key index to find the other columns of data. So you are doing scans on an index, but then random IO inside of the table itself.

In the case where you have the PK on (ymdh, id) then you trigger the innodb behavior of index organized tables, that is, the PK index *is* the table, and in scanning it, you get all of the columns for "free". In this case, you access the table by scanning ranges, which is very efficient.

You would see the same performance improvement if you made a secondary index leading with ymdh and containing all of the columns you're querying.

santosh said...

Perhaps this might help:
Multi-column indexes in MySQL

MySQL can create composite indexes (that is, indexes on multiple columns). An index may consist of up to 15 columns. For certain data types, you can index a prefix of the column (see Section 7.4.3, “Column Indexes”).

A multiple-column index can be considered a sorted array containing values that are created by concatenating the values of the indexed columns.

MySQL uses multiple-column indexes in such a way that queries are fast when you specify a known quantity for the first column of the index in a WHERE clause, even if you do not specify values for the other columns.

santosh said...

Here is another section from the same manual,

If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).

MySQL cannot use a partial index if the columns do not form a leftmost prefix of the index.

How MySQL uses indexes