Saturday, February 26, 2011
Why the Facebook/Percona versions of MySQL are so much better
One of my roles at Yahoo is to provide a rock solid infrastructure for MySQL-based projects. To do that, I have to satisfy many groups at Yahoo: Developers need an easy way of deploying the db server in their environment. Software Engineering (Operations) is looking for features to help scale MySQL, and DBAs are looking for more knobs and gauges to help tune the server. After about a year of managing the infrastructure and following the MySQL sagas, my job is getting easier thanks to companies like Facebook and Percona, as well as one of my predecessors (jcole), who have released quality code that extends MySQL in many more ways than just raw performance. I've been able to take the best code from each company and mold it into an internal version that satisfies all crowds. MySQL/Oracle, you should do the same.
Tuesday, February 15, 2011
Are you using the deadline scheduler? (Part 2)
The deadline scheduler has many advantages over the cfq scheduler, which is the default in operating systems like RedHat Enterprise Linux. In my previous post, I quickly showed how much of a performance gain can be had by switching to the deadline scheduler. Now I will show some real performance numbers for different RAID configurations.
All of the tests were performed on a Dell PowerEdge 2950 with 2x Quad Core Xeon, 16GB of memory and 6x146GB SAS drives on a Perc/5 RAID controller, and all filesystems were standard EXT3. The TPCC benchmarks were conducted with a smallish buffer pool (2GB) and a 1GB log file size. The database is approximately 7GB in size (100 warehouses). I wanted to show what performance an I/O bound test would yield. The numbers here show that it's possible to get that added boost without resorting to re-creating the entire database with a different filesystem (like XFS). I will come back to XFS later as it provides the best performance.
| Scheduler | 2 disk RAID-0 | 4 disk RAID-0 | 6 disk RAID-5 | 6 disk RAID-10 |
|---|---|---|---|---|
| CFQ | 461.467 | 947.067 | 845.943 | 862.763 |
| Deadline | 851.067 | 2876.933 | 2145.866 | 2580.430 |
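To put the table in perspective, the gain of deadline over CFQ can be worked out per RAID layout. The following is a small sketch in Python that uses only the figures from the table; the dictionary names are just for illustration, and it assumes the numbers are in the same TpmC metric used in Part 1.

```python
# Throughput figures copied from the table (assumed to be TpmC).
cfq = {"2 disk RAID-0": 461.467, "4 disk RAID-0": 947.067,
       "6 disk RAID-5": 845.943, "6 disk RAID-10": 862.763}
deadline = {"2 disk RAID-0": 851.067, "4 disk RAID-0": 2876.933,
            "6 disk RAID-5": 2145.866, "6 disk RAID-10": 2580.430}

# Ratio of deadline throughput to CFQ throughput for each RAID layout.
speedup = {layout: deadline[layout] / cfq[layout] for layout in cfq}

for layout, ratio in speedup.items():
    print(f"{layout}: {ratio:.2f}x")
```

The gain is smallest on the two-disk stripe (a bit under 2x) and roughly 2.5x to 3x on the larger arrays.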
Sunday, December 19, 2010
Are you using the deadline scheduler? (Part 1)
There have been many posts about performance, benchmarking and the results. Many DBAs have talked in the past about the deadline scheduler, available in all modern Linux distributions. The deadline scheduler is very effective for database systems, as it tries to prevent starvation of I/O requests. To see what the net effect of using the deadline scheduler looks like, I am using tpcc-mysql, a simple TPCC benchmark program created by the Percona team, to measure the number of New-Order transactions per minute (TpmC). All of the testing was performed using RedHat Enterprise Linux 5, using the standard ext3 file system and a combination of the default scheduler (cfq) and the deadline scheduler (dl):
CFQ: 845 TpmC
DL: 2145 TpmC
This is a huge difference in performance between the two schedulers. How do you know if you are using the deadline scheduler? There are a few ways. The deadline scheduler can be enabled for all of your disks at boot time via the LILO or GRUB boot loaders. The scheduler can be changed by appending the string elevator=deadline to the boot line (file /boot/grub/grub.conf). The command cat /proc/cmdline displays the kernel parameters that the system was booted with, so you can check whether the option is in effect.
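The boot-time check can also be scripted. Here is a minimal sketch in Python, assuming a Linux host where /proc/cmdline is readable; the function names are made up for illustration.

```python
def elevator_from_cmdline(cmdline):
    """Return the value of the elevator= option in a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("elevator="):
            return token.split("=", 1)[1]
    return None


def boot_elevator(path="/proc/cmdline"):
    """Read the running kernel's command line and report its elevator setting."""
    with open(path) as handle:
        return elevator_from_cmdline(handle.read())
```

A result of None only means that no scheduler was requested at boot, in which case the distribution's default applies.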
The second way of switching schedulers is to change it "on the fly." You can list all of the available schedulers, as well as determine which scheduler is current, by using this command: cat /sys/block/drive/queue/scheduler where drive is the actual drive as presented to your system (e.g. sda, hda). The list of schedulers is displayed, and the current scheduler is noted within square brackets. To change the scheduler, echo the name of the new scheduler to the meta-file from the previous command (as user root). Be careful when switching schedulers on the fly, as there is a possibility of hanging the system.
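The same on-the-fly check can be sketched in Python. This assumes the sysfs layout described above; the function names are hypothetical, and set_scheduler must run as root and carries the same risk as doing the echo by hand.

```python
import re


def parse_schedulers(text):
    """Split a sysfs scheduler line into (available schedulers, current scheduler)."""
    available = [name.strip("[]") for name in text.split()]
    match = re.search(r"\[([^\]]+)\]", text)  # current scheduler is in brackets
    current = match.group(1) if match else None
    return available, current


def read_schedulers(drive):
    """Read the scheduler meta-file for a drive such as 'sda'."""
    with open(f"/sys/block/{drive}/queue/scheduler") as handle:
        return parse_schedulers(handle.read())


def set_scheduler(drive, name):
    """Write a scheduler name to the meta-file (root only; use with care)."""
    with open(f"/sys/block/{drive}/queue/scheduler", "w") as handle:
        handle.write(name)
```

For a line such as "noop anticipatory deadline [cfq]", parse_schedulers reports cfq as the current scheduler.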
In part 2, I'll go over my testing methodology and show how using the deadline scheduler with different types of RAID drive configurations and filesystem types affects the overall throughput of the system.
Wednesday, December 15, 2010
Wednesday, October 13, 2010
Woes of ROW based logging implementation
I have been trying to find ways to implement ROW based logging at our company, as it provides better reliability and far fewer chances of a slave going "out-of-sync" with a master. One of the big issues that I faced was constant replication lag from one datacenter to another because of the massive amounts of data that can potentially be generated from just one single SQL statement.
With the traditional STATEMENT based replication, one SQL statement is written to the binary log - very little network overhead there transferring that across the wire to another datacenter. With ROW based logging, if that single SQL statement changes 20,000 rows, every one of those row changes is written to the binary log and shipped across the wire. That's where the agony begins, and business continuity takes a beating.
And to compound the situation even further, more and more operations are suddenly becoming "unsafe for STATEMENT based logging", generating thousands of warnings in the error log files. With 5.1.50, LOAD DATA INFILE statements generate warnings now. This leads me to believe that at some point these operations will become unsupported. Blech! It does not give me the warm fuzzy feeling I used to have with MySQL. But I will still keep trying to find a real-world solution to this problem.
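A back-of-the-envelope comparison shows why the volume differs so much. This is only a sketch in Python: the statement and table name are invented, the 200 bytes per logged row change is an assumed figure, and binary log event headers are ignored.

```python
def binlog_bytes_statement(statement):
    """Approximate payload when only the SQL text is logged."""
    return len(statement.encode("utf-8"))


def binlog_bytes_row(rows_changed, bytes_per_row_change):
    """Approximate payload when every changed row is logged."""
    return rows_changed * bytes_per_row_change


# Hypothetical statement that touches 20,000 rows.
statement = "UPDATE orders SET status = 'shipped' WHERE batch_id = 42"
stmt_bytes = binlog_bytes_statement(statement)
# Assumption: about 200 bytes logged per changed row.
row_bytes = binlog_bytes_row(rows_changed=20_000, bytes_per_row_change=200)

print(f"statement: {stmt_bytes} bytes, row: {row_bytes} bytes")
```

With these made-up numbers, the row format ships about 4 MB across the wire where the statement format ships a few dozen bytes.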
Wednesday, September 22, 2010
Extracting load files from the binary log
There are times when you may be rebuilding a DB server by replaying the binary logs using the mysqlbinlog utility. Extracting CRUD statements and DDL is relatively straightforward, but not for statements like LOAD DATA INFILE. The actual data file is embedded within the binary log, and not very visible to the naked eye. But there is an easy way to decipher the binary log and extract the file to load manually.
As an example, I have taken a simple text file of numbers and loaded it into a fictitious table abc using the LOAD DATA LOCAL INFILE command. To see where in the binary log that command would reside, the mysqlbinlog utility is used:
$ mysqlbinlog mysqld-bin.000003 | grep -i -B7 "load data"
# at 174
#100921 21:42:10 server id 1136902037 end_log_pos 218
#Begin_load_query: file_id: 1 block_len: 21
# at 218
#100921 21:42:10 server id 1136902037 end_log_pos 432 Execute_load_query thread_id=5 exec_time=0 error_code=0
use test/*!*/;
SET TIMESTAMP=1285130530/*!*/;
LOAD DATA LOCAL INFILE '/tmp/SQL_LOAD_MB-1-2' INTO TABLE `abc` FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' (`a`)
$
You can see from the output above that the command, including the load file, is contained between positions 174 and 432 of the binary log. Now that the start/stop positions are known, it is possible to extract the data file to load manually into your database (or into another database):
mysqlbinlog --start-position=174 --stop-position=432 --local-load=/var/tmp mysqld-bin.000003
The --local-load option specifies the directory in which to store the actual load file. You can then take the file and use the LOAD DATA command above (changing the directory name as needed) to load that data back into your database or to seed another database.
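Finding the start/stop positions by eye gets tedious when a log holds many loads. Below is a rough sketch in Python that pulls them out of mysqlbinlog text output. It assumes the output looks like the example above (a "# at N" line before each event, then Begin_load_query and Execute_load_query comment lines); the function name is made up.

```python
import re

AT_RE = re.compile(r"^# at (\d+)")
END_RE = re.compile(r"end_log_pos (\d+)")


def find_load_ranges(dump_text):
    """Return (start, stop) binlog positions for each LOAD DATA in the dump."""
    ranges = []
    current_at = None  # position given by the most recent "# at N" line
    start = None       # position where the current load began
    for raw in dump_text.splitlines():
        line = raw.strip()
        at = AT_RE.match(line)
        if at:
            current_at = int(at.group(1))
            continue
        if "Begin_load_query" in line and start is None:
            start = current_at
        elif "Execute_load_query" in line and start is not None:
            end = END_RE.search(line)
            if end:
                ranges.append((start, int(end.group(1))))
            start = None
    return ranges
```

Each pair can then be passed to --start-position and --stop-position as in the command above.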
Sunday, November 29, 2009
I did not fall off the face of the earth....
It's been a looong time since I've posted anything. For the past year, I've been focused on operations and streamlining DBA tasks, as the group's responsibilities continue to grow. It's one thing to manage 10-20 production MySQL database servers, but when the number starts climbing to 160-200, things start getting interesting. For 2010, I expect that number to double. Performance is key, but more important is reliability, uptime, monitoring and notification. Dashboards are a good start, but the most important subsystem will be monitoring. How scalable does the system need to be? For 10-20 systems, off-the-shelf products work fine. But when thousands of systems need to be monitored, then it starts getting interesting. I'll share my thoughts along the way as far as how we are handling this type of growth.
Saturday, October 25, 2008
Should you be worried about STATEMENT based replication?
Earlier this month, it was announced that STATEMENT based binary logging would be the default starting with MySQL version 5.1.29. I've always preached that backwards compatibility is key to new releases. In this case, lessons were not learned until close to the final GA date.
I would like to point out that for 90% of customer cases, STATEMENT based replication will work fine as advertised. But I'd like to point out some use cases where STATEMENT based replication will be at best spotty (at least it is in 5.1.28).
If you primarily use InnoDB as your storage engine you will want to pay close attention to your transaction isolation level. There is a minimum requirement that the READ COMMITTED level be used, otherwise statement based replication cannot be used.
Partitioning + InnoDB + STATEMENT-based binlog also has its problems. We faced constant issues, getting the error 'Binary logging not possible. Message: Statement-based format required for this statement, but not allowed by this combination of engines'. What the heck does this mean?
It's a misleading error message. Partitioning is not a true storage engine, but a virtual one. The first time an underlying table is opened, the partition engine caches the table flags of the real storage engine, in this case InnoDB. Let's say, for example, an app performs a SELECT on a partitioned InnoDB table. Let's also assume that the transaction isolation level is READ UNCOMMITTED. The SELECT will execute without any issues. But try to insert a record from a different session, and it will fail every time. I filed this as BUG#39084 over a month and a half ago. Repeatable test cases were also given.
Since we are committed to releasing our reporting database and cannot wait for MySQL to come around and realize the mistakes they made, we came up with our own patch that addresses this immediate concern for this reporting db.
So, should you be worried? I think I would be.
Sunday, December 02, 2007
Wishlist for partitioning
I love the way partitioning works in MySQL. I remember in the past how many projects I implemented using application logic to parallelize I/O. Partitioning makes this seamless now. But it's not without its share of problems and workarounds. So I compiled my own wishlist that hopefully might make it into a future version of MySQL.
1. Partition level table locking. Partitions should be treated like tables and locked individually rather than locking the whole table and all of its partitions.
2. Ability to add partitions from existing tables. This is very useful, especially when trying to perform bulk maintenance operations.
3. Ability to convert a partition to a table.
4. Be able to mix and match storage engines for partitions and subpartitions. How cool would it be to have older data reside in an archive partition using ARCHIVE tables while the remaining partitions are InnoDB or MyISAM?
5. More usable datatypes for partition pruning. How many times can you use datetime when timestamp is available? Also, when will functions other than TO_DAYS and YEAR be supported?
Tuesday, June 26, 2007
Version 3 of mysqlbackup - small bug fix
I just posted version 3 of mysqlbackup to MySQL Forge.
Small bugfix: Added option --add-drop-table to the default options for mysqldump. This was causing a failure in restoring views.
Sunday, June 17, 2007
Not all MySQL errors are visible to replication
This probably warrants a bug report to MySQL, but I want to let other people know about this first. There are situations where MySQL receives incomplete statements from replication relay logs, but does not trigger a replication error. Case in point is exceeding max_allowed_packet.
I recently had a situation where one of my machines was incorrectly configured with a different value for max_allowed_packet. What happened is not what I had expected. Instead of raising a replication error (which we monitor for using Nagios), MySQL filled its error log with messages about exceeding max_allowed_packet. The only visible problem through our monitoring framework was that replication had fallen behind, and was continuing to fall behind.
Fixing the problem was rather easy: stop the slave, change the max_allowed_packet variable globally in the db server and in the configuration file, and then start the slave.
This is one of those things that falls under the category "MySQL annoyances and one-offs". Shouldn't this really trigger a true replication error, rather than spewage in log files? I will have to reproduce this and then file a bug report to MySQL, but I really shouldn't have to if there was some consistency in error reporting.
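Until the server reports this as a real replication error, one stopgap is to watch the error log directly. Here is a minimal sketch in Python, assuming only that the offending lines mention max_allowed_packet somewhere; the function names and the log path you pass in are up to you, and the exact message format may differ between versions.

```python
def packet_errors(log_lines):
    """Return the error-log lines that mention max_allowed_packet."""
    return [line.rstrip("\n") for line in log_lines
            if "max_allowed_packet" in line.lower()]


def check_error_log(path):
    """Scan a MySQL error log file and return the matching lines."""
    with open(path, errors="replace") as handle:
        return packet_errors(handle)
```

A non-empty result could be fed to whatever monitoring check you already have, alongside the usual replication-lag alert.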
Script to backup binary logs on a master
I have recently posted a script on MySQL Forge to back up MySQL binary logs. One of the ideas that I had when I originally wrote the script was to take into account all of the slaves and what master log file & position that each one has executed. This way, only the relevant binary logs would get archived and then subsequently purged. You can find the script here.
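The core idea can be illustrated in a few lines. This is not the script from MySQL Forge, just a sketch in Python of the selection logic, assuming you have already collected the list of binary logs on the master and the Relay_Master_Log_File value from SHOW SLAVE STATUS on each slave; the function names are made up.

```python
def log_number(name):
    """Numeric suffix of a binary log name, e.g. 'mysqld-bin.000003' gives 3."""
    return int(name.rsplit(".", 1)[1])


def purgeable_logs(master_logs, slave_master_log_files):
    """Return the master logs that every slave has already moved past."""
    if not slave_master_log_files:
        return []  # no slave information, so purge nothing
    oldest_needed = min(log_number(name) for name in slave_master_log_files)
    return [name for name in sorted(master_logs, key=log_number)
            if log_number(name) < oldest_needed]
```

Only logs older than the one the slowest slave is still executing are returned, so they can be archived and then purged without breaking any slave.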
Reload data quickly into MySQL InnoDB tables
As DBAs that manage large quantities of database servers, we are always looking for the fastest or most efficient way to load data into the database. Some DBAs have quarterly maintenance periods where they reload data into a database to refresh the indexes.
If you primarily use InnoDB tables in your MySQL database server, then this set of tricks will help make the reload process a bit faster than just a straight dump & reload.
my.cnf configuration
innodb_flush_log_at_trx_commit = 0
innodb_support_xa = 0
skip-innodb-doublewrite
disable log-bin & log_slow_queries
Since the goal is to reload data quickly, we need to eliminate any potential bottlenecks. Setting innodb_flush_log_at_trx_commit = 0 reduces the amount of disk I/O by avoiding a flush to disk on each commit. If you are not using XA compliant transactions (multi system two-phase commits), you don't need innodb_support_xa enabled, and setting it to 0 avoids an extra disk flush. The skip-innodb-doublewrite option turns off the doublewrite buffer, which will eke out a little bit more performance. Also, if you don't need the binary log, turn it off during your reload period. Remember that any disk I/O that is not needed will hurt the performance of reloading the database.
Unloading the data
There are many ways to unload & reload the data using the standard MySQL tools or your own crafted toolset. Again the main idea is efficiency. The best advice here is while selecting the data to be unloaded, make sure that the select is in primary key order. If the data is sorted ahead of time, it loads pretty fast back into InnoDB as the primary key is a clustered index, meaning that the data is sorted based on the primary key as it is inserted into the database. Use the --order-by-primary option of the mysqldump utility while selecting the data.
I hope these small tips help you make the process a bit less painful.
Thursday, May 10, 2007
Why is QA so important?
Wednesday, February 28, 2007
A case of no research before product rollout
I am digressing a bit from my usual topics, but this is pretty hilarious....
Wanted to get a bit of an after dinner snack and I find some corn dogs in the freezer. Yum!
Pop one out, read the instructions on the box (yes, RTFM!!!) and what do I see? Instructions that read Heat for 70 seconds.

When was the last time you have ever set the timer on a microwave oven for 70 seconds?????!!!!! Would it have taken just too much time to do some product research and realize that you should set the microwave for 1 minute 10 seconds?
So is there a moral to this story? Kind of. Documentation is not to be taken lightly, no matter how trivial it may seem to be.
Sunday, February 04, 2007
MySQL and iSCSI - a winning combo!
So it's been a very long time again between posts. So much has happened. Let me first begin by saying that I am very impressed with iSCSI performance and I believe that it is mature enough to actually run production workloads (but it really depends on the type of workload).
After all of the benchmarking and analysis, we finally decided on moving forward with a purchase of an iSCSI storage solution. For the types of queries we run (large amount of records to scan, small resultset returned) we had to tweak the schema just a bit in order to realize the performance that we desired (that plus good quality fibre-channel drives to get that extra oomph that's needed).
Bottom line is we had to make a significant investment in hardware in order to realize the benefits of having a proper storage solution in place. The benefits though outweigh the overwhelming maintenance required to keep all of the machines running. Backups using the storage provider's snapshot mechanism will be extremely beneficial as well. All in all, a good decision to ease our minds.
Sunday, December 17, 2006
Primary Key Order Does Matter!
There have been a few posts on PlanetMySQL regarding primary keys and the importance of choosing the right one. This is even more important when the table uses InnoDB. You've read different posts of why it is so important. Now, I'm all about benchmarks and showing the details. So I'll take a table from my previous posts about MySQL 5.1 partitioning and show what I found.
This table was created under MySQL 5.1.12-beta:
CREATE TABLE `big_table_test1` (
  `entity_id` int(11) NOT NULL DEFAULT '0',
  `col1` int(11) NOT NULL DEFAULT '0',
  `col2` int(11) NOT NULL DEFAULT '0',
  `col3` int(11) NOT NULL DEFAULT '0',
  `col4` int(11) NOT NULL DEFAULT '0',
  `col5` int(11) NOT NULL DEFAULT '0',
  `col6` int(11) NOT NULL DEFAULT '0',
  `ymdh` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `imps` bigint(20) NOT NULL DEFAULT '0',
  `clicks` int(11) NOT NULL DEFAULT '0',
  `convs` int(11) NOT NULL DEFAULT '0',
  `id` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`,`ymdh`),
  KEY `ix_big1` (`ymdh`,`entity_id`,`col3`) USING BTREE,
  KEY `ix_big2` (`ymdh`,`entity_id`,`col4`) USING BTREE,
  KEY `ix_big3` (`ymdh`,`entity_id`,`col2`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT;
I loaded about 180 million records into this table (a small set of data for us!) and ran one of our really popular types of queries:
SELECT col1,col2,col3,SUM(imps),SUM(clicks),SUM(convs)
FROM big_table_test1
WHERE ymdh IN ('2006-10-01 00:00:00','2006-10-02 00:00:00','2006-10-03 00:00:00',
'2006-10-04 00:00:00','2006-10-05 00:00:00','2006-10-06 00:00:00',
'2006-10-07 00:00:00')
AND entity_id = 2
GROUP BY col1, col2, col3
ORDER BY col1, col2, col3;
Doesn't look terribly nasty, does it? This query takes about 7 MINUTES to run!!! EXPLAIN on the query shows nothing out of the ordinary, as it uses one of the secondary indexes on the table. The cardinality of entity_id is not really high, so forcing one of the secondary indexes over another wouldn't yield any performance benefits. The id column is basically a numerical hash of the table's "real" primary key, which is entity_id plus col1 through col6, and is used for uniqueness. What's interesting is that throughout our application, there are no direct queries against this id column. It just exists. But, it can't be removed.
If this column really serves no significant purpose, what if we swapped the order of the definition of the primary key? So the definition of the primary key looks like:
PRIMARY KEY (`ymdh`,`id`)
Logically, no difference so we do not break any uniqueness constraints in the application. If we run the query again, 4 SECONDS!!!! Wow! How do we explain this massive performance increase?
Remember that InnoDB uses a clustered index for the primary key. Clustered indexes are indexes that are built based on the same key by which the data is ordered on disk. They are very efficient during scanning, but have performance implications when inserting new data, as some re-ordering may need to be done. All of our data is inserted in ymdh column order, so it makes sense for the primary key to be based on this column. There are a lot of efficiencies that can be obtained, such as sequential disk read-ahead. The previous index for the primary key needs lots of random disk I/O to read the data portion of the table.
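The effect can be illustrated with a toy model. This is a sketch in Python, not a model of InnoDB internals: it simply sorts made-up rows by each candidate primary key, cuts them into fixed-size "pages" (100 rows per page is an arbitrary assumption), and counts how many pages hold at least one row from a one-week date range.

```python
import random


def pages_touched(rows, sort_key, wanted_days, rows_per_page=100):
    """Count pages holding a wanted row when rows are stored in sort_key order."""
    ordered = sorted(rows, key=sort_key)
    touched = {position // rows_per_page
               for position, (_row_id, day) in enumerate(ordered)
               if day in wanted_days}
    return len(touched)


rng = random.Random(42)  # fixed seed so the run is repeatable
# Toy data: 30 days, 1,000 rows per day; id stands in for the numeric hash column.
rows = [(rng.getrandbits(32), day) for day in range(30) for _ in range(1000)]
week = set(range(7))

id_first = pages_touched(rows, lambda r: (r[0], r[1]), week)    # PRIMARY KEY (id, ymdh)
ymdh_first = pages_touched(rows, lambda r: (r[1], r[0]), week)  # PRIMARY KEY (ymdh, id)
print(f"(id, ymdh): {id_first} pages, (ymdh, id): {ymdh_first} pages")
```

With the date first, the week's rows sit in 70 consecutive pages that can be read sequentially; with the hash first, they are scattered across nearly every page of the table.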
Monday, December 04, 2006
MySQL Partitioning 5.1 - Part 5 Slowdown Problem Solved!
Finally figured out what was causing the lack of performance on the
partitions with the query. The use of FORCE INDEX was causing the slowdown
with the partitioned table. Once I removed the clause, the query ran in 1
minute 19 seconds, which is more in line with expectations. Sorry for any
inconvenience!
MySQL 5.1 Partitioning - Part 4 (Results)
In my previous post I started out by setting up MySQL 5.1.12 on a box in order to test the performance of the new partitioning option. During testing, I noted that I did not see any noticeable performance improvements from using partitioning. So I spent some time Saturday and Sunday (I guess I don't have anything else better to do!) to build the testing environment and perform the tests. So I was wrong, but only slightly. Partitioning does show better performance than standard tables, but not by as much as you would think. But, wait, there is light at the end of the tunnel (as well as a WTF). The numbers...
| Table Type | Elapsed Time |
|---|---|
| Normal | 7 minutes, 41 seconds |
| Partitioned | 5 minutes, 51 seconds |
By partitioning, we gain about 25% improvement in performance. Not bad, but not great. I would have expected more than 25%, given that I loaded approximately 110 million records for the test, and there are approximately 650,000 records per day for the range for the test.
Here are the details of testing:
- Machine is an HP DL385 with 16GB of memory, 1 73GB 15k RPM SCSI drive for boot, root and /var partitions. 3 146GB 15k RPM SCSI drives in RAID-0 stripe (Hardware RAID).
- Using CentOS 4.3 as operating system
- MySQL version 5.1.12-beta (Compiled from SRPM)
- InnoDB Parameters
innodb_buffer_pool_size = 11G
innodb_file_per_table
innodb_open_files = 1000
innodb_log_file_size = 2000M
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit = 1
innodb_support_xa = 0
- Prior to each test run, the database server was restarted, and mysql was run with the database name (It takes approximately 40 seconds for mysql to run the first time as it caches the table information)
- The test query was run 3 times (each with a db server restart) and the average of the 3 runs taken.
- SQL (Non-partitioned):
select entity_id, buyer_entity_id, buyer_line_item_id, sum(roi_cost)
from network_daily_local_nonpart force index (ix_nsdl_ymdh_entity_seller)
where ymdh in ('2006-11-01 00:00:00', '2006-11-02 00:00:00', '2006-11-03 00:00:00',
'2006-11-04 00:00:00', '2006-11-05 00:00:00', '2006-11-06 00:00:00',
'2006-11-07 00:00:00', '2006-11-08 00:00:00', '2006-11-09 00:00:00',
'2006-11-10 00:00:00', '2006-11-11 00:00:00', '2006-11-12 00:00:00')
and entity_id = 2
group by entity_id, buyer_entity_id, buyer_line_item_id
order by entity_id, buyer_entity_id, buyer_line_item_id
- SQL (Partitioned):
select entity_id, buyer_entity_id, buyer_line_item_id, sum(roi_cost)
from network_daily_local_part force index (ix_nsdl_ymdh_entity_seller)
where ymdh in ('2006-11-01 00:00:00', '2006-11-02 00:00:00', '2006-11-03 00:00:00',
'2006-11-04 00:00:00', '2006-11-05 00:00:00', '2006-11-06 00:00:00',
'2006-11-07 00:00:00', '2006-11-08 00:00:00', '2006-11-09 00:00:00',
'2006-11-10 00:00:00', '2006-11-11 00:00:00', '2006-11-12 00:00:00')
and entity_id = 2
group by entity_id, buyer_entity_id, buyer_line_item_id
order by entity_id, buyer_entity_id, buyer_line_item_id
Now, one thing at RightMedia is that we have LOTS of data. In fact there is some skew on the entity_id column, and the majority of the records have the value listed in the SQL above. That still should not explain huge increases in performance, should it?
Okay, one last experiment (It's 3:46AM EST!!): I am going to subpartition the partitioned table into 7 subpartitions, hashed by the entity_id column.
Result: 14.32 seconds! Holy friggin' turbo mode Batman! It's way too late to be trying to quantify performance gain, so I'm quantifying it by WTF units. I'd really like to get an explanation of what is really going on behind the scenes that yields this type of performance gain, and why I don't see at least 1 or 2 WTF units of gain on the original partitioned table, given that the amount of data is much less for the date range compared to the whole table.
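For reference, a range-partitioned table with 7 hash subpartitions on entity_id can be declared roughly as follows. This is a cut-down sketch, not the actual test table: the table name, the reduced column list, the roi_cost type, the primary key, and the three partitions are all assumptions made for illustration.

```sql
-- Hypothetical, cut-down table; names, types and key are assumptions.
-- MySQL requires every unique key (including the primary key) to contain
-- all columns used in the partitioning and subpartitioning expressions,
-- which is why entity_id appears in the primary key here.
CREATE TABLE network_daily_subpart_example (
  id        INT UNSIGNED NOT NULL,
  entity_id INT NOT NULL,
  ymdh      DATETIME NOT NULL,
  roi_cost  DOUBLE NOT NULL DEFAULT 0,
  PRIMARY KEY (id, ymdh, entity_id),
  KEY ix_ymdh_entity (ymdh, entity_id)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(ymdh))
SUBPARTITION BY HASH (entity_id)
SUBPARTITIONS 7 (
  PARTITION p20061101 VALUES LESS THAN (TO_DAYS('2006-11-02')),
  PARTITION p20061102 VALUES LESS THAN (TO_DAYS('2006-11-03')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);
```

An equality predicate such as entity_id = 2 lets the optimizer restrict the scan to one subpartition per date partition, though with the skew described above that alone may not account for the whole gain.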
Sunday, December 03, 2006
MySQL 5.1 Partitioning - Part 3
Finally, it's time to start putting MySQL 5.1.12-beta through the wringer. First order of business, convert the existing table schema to one that supports partitioning...
I made some minor changes to the configuration for partitioning, namely innodb_file_per_table and innodb_open_files. I set innodb_open_files to 1000 based on the tables and partitions I plan on supporting.
This is what the new table schema looks like with partitioning:
CREATE TABLE `network_daily` (
`entity_id` int(11) NOT NULL default '0',
`buyer_entity_id` int(11) NOT NULL default '0',
`buyer_line_item_id` int(11) NOT NULL default '0',
`seller_entity_id` int(11) NOT NULL default '0',
`seller_line_item_id` int(11) NOT NULL default '0',
`size_id` int(11) NOT NULL default '0',
`pop_type_id` int(11) NOT NULL default '0',
`country_group_id` int(11) NOT NULL default '0',
`is_adjustment` tinyint(4) NOT NULL default '0',
`adv_learn_type` char(1) NOT NULL default '',
`pub_learn_type` char(1) NOT NULL default '',
`frequency` smallint(6) NOT NULL default '0',
`ymdh` datetime NOT NULL default '0000-00-00 00:00:00',
`imps` bigint(20) NOT NULL default '0',
`clicks` int(11) NOT NULL default '0',
`convs` int(11) NOT NULL default '0',
`id` int(10) unsigned NOT NULL default '0',
`checkpoint` int(11) default NULL,
PRIMARY KEY (`id`,`ymdh`),
KEY `ix_nsl_ymdh_buyerli` (`ymdh`,`buyer_line_item_id`),
KEY `ix_nsdl_ymdh_entity_buyer` (`ymdh`,`entity_id`,`buyer_entity_id`),
KEY `ix_nsdl_ymdh_entity_seller` (`ymdh`,`entity_id`,`seller_entity_id`)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(`ymdh`))
(
PARTITION p2005 VALUES LESS THAN (TO_DAYS('2006-01-01')),
PARTITION p200601 VALUES LESS THAN (TO_DAYS('2006-02-01')),
PARTITION P200602 VALUES LESS THAN (TO_DAYS('2006-03-01')),
...
PARTITION P200609 VALUES LESS THAN (TO_DAYS('2006-10-01')),
PARTITION P20061001 VALUES LESS THAN (TO_DAYS('2006-10-02')),
PARTITION P20061002 VALUES LESS THAN (TO_DAYS('2006-10-03')),
PARTITION P20061003 VALUES LESS THAN (TO_DAYS('2006-10-04')),
...
PARTITION P20061130 VALUES LESS THAN (TO_DAYS('2006-12-01'))
);
There are a lot of partitions that I have defined. I need to keep a rolling 60 days of daily partitions active. The plan is to use the ALTER TABLE REORGANIZE PARTITION statement to merge the older partitions together once per day, and to add a new partition once per day.
After the data was reloaded, it was time to test the performance, and the ability to add partitions and reorganize them (this was broken in version 5.1.11-beta).
Performance, surprisingly, wasn't what I expected. Queries that ran on the partitions performed about the same as those on the unpartitioned table. Hmm, I have to double-check my results. I'll post all of the performance data.
The really good news is that the partition maintenance commands all worked with InnoDB! Dropping and reorganizing partitions worked perfectly. I'll have to redo my testing to see what happened to the performance.
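The daily maintenance described above might look something like the following sketch, written against the network_daily schema shown earlier. The merged partition name (p200609_1001) and the new daily partition (P20061201) are invented for the example, and the dates simply continue the pattern in the schema.

```sql
-- Sketch against the network_daily schema above; the merged partition name
-- (p200609_1001) and the new daily partition (P20061201) are made up here.

-- Merge the oldest daily partition into the adjacent monthly one.
-- The partitions being merged must be adjacent, and the replacement
-- must cover the same overall range.
ALTER TABLE network_daily
  REORGANIZE PARTITION P200609, P20061001 INTO (
    PARTITION p200609_1001 VALUES LESS THAN (TO_DAYS('2006-10-02'))
  );

-- Add the next daily partition at the top of the range.
ALTER TABLE network_daily
  ADD PARTITION (
    PARTITION P20061201 VALUES LESS THAN (TO_DAYS('2006-12-02'))
  );

-- Dropping a partition discards its rows along with it.
ALTER TABLE network_daily DROP PARTITION p2005;
```

REORGANIZE PARTITION copies the affected rows into the new partition, so merging is cheapest when the partitions involved are small; DROP PARTITION removes a partition and its data without touching the rows in the others.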

