This table was created under MySQL 5.1.12-beta:
CREATE TABLE `big_table_test1` (
`entity_id` int(11) NOT NULL DEFAULT '0',
`col1` int(11) NOT NULL DEFAULT '0',
`col2` int(11) NOT NULL DEFAULT '0',
`col3` int(11) NOT NULL DEFAULT '0',
`col4` int(11) NOT NULL DEFAULT '0',
`col5` int(11) NOT NULL DEFAULT '0',
`col6` int(11) NOT NULL DEFAULT '0',
`ymdh` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`imps` bigint(20) NOT NULL DEFAULT '0',
`clicks` int(11) NOT NULL DEFAULT '0',
`convs` int(11) NOT NULL DEFAULT '0',
`id` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`,`ymdh`),
KEY `ix_big1` (`ymdh`,`entity_id`,`col3`) USING BTREE,
KEY `ix_big2` (`ymdh`,`entity_id`,`col4`) USING BTREE,
KEY `ix_big3` (`ymdh`,`entity_id`,`col2`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT
I loaded about 180 million records into this table (a small set of data for us!) and ran one of our really popular types of queries:
Doesn't look terribly nasty does it? This query takes about 7 MINUTES to run!!! EXPLAIN on the query shows nothing out of the ordinary, as it uses one of the secondary indexes on the table. The cardinality of
SELECT col1,col2,col3,SUM(imps),SUM(clicks),SUM(convs)
FROM big_table_test1
WHERE ymdh IN ('2006-10-01 00:00:00','2006-10-02 00:00:00','2006-10-03 00:00:00',
'2006-10-04 00:00:00','2006-10-05 00:00:00','2006-10-06 00:00:00',
'2006-10-07 00:00:00')
AND entity_id = 2
GROUP BY col1, col2, col3
ORDER BY col1, col2, col3;
entity_id
is not really high, so forcing one of the secondary indexes over another wouldn't yield any performance benefits. The id
column is basically a numerical hash of the tables "real" primary key, which is entity_id plus col1 through col6, and is used for uniqueness. What's interesting is that throughout our application, there are no direct queries against this id column. It just exists. But, it can't be removed.If this column really serves no really significant value, what if we swapped the order of the definition of the primary key? So the definition of the primary key looks like:
PRIMARY KEY (`ymdh`,`id`)
Logically, no difference so we do not break any uniqueness constraints in the application. If we run the query again, 4 SECONDS!!!! Wow! How do we explain this massive performance increase?
Remember that InnoDB uses a clustered index for the primary key. Clustered indexes are indexes that are built based on the same key by which the data is ordered on disk. They are very efficient during scanning, but have performance implications when inserting new data, as some re-ordering may need to be done. All of our data is inserted in
ymdh
column order, so it makes sense if the primary key was based on this column. There are a lot of efficiencies that can be obtained, such as sequential disk read-ahead. The previous index for the primary key needs lots of random disk I/O to read the data portion of the table.