GIT – You have a “huge” table, and you need to ALTER it. By “huge” I mean that the ALTER would take an unacceptable amount of time.
You have a “huge” table, and you need to ALTER it. By “huge” I mean that the ALTER would take an unacceptable amount of time.
Things you might want to do. and for which this is well suited:
⚈ Add/drop columns
⚈ Modify datatypes (INT -> BIGINT, SIGNED -> UNSIGNED)
⚈ Modify character set / collation of column(s)
⚈ Add/drop indexes
⚈ Add/drop/modify PARTITIONing
⚈ Change Engine (MyISAM to/from InnoDB)
The Future Solution
Rumor has it that the ALTER in version 5.6 can do everything, and do it without locking the table.
Overview of Solution
1. Build the ‘Alter’ script to copy rows into a new table (in clumps of rows)
2. Push code to add a Database Layer between client(s) and the database
3. Push code to augment the Database Layer to handle the effect of the Alter
4. Turn on the Alter
5. Push code to deactivate the Alter
The Layer may already be in place; it is ‘best practice’. However, after seeing what is ahead (below), you may want to clean up the Layer some. One example is to change a BEGIN…COMMIT into a single API call (if practical). If this leads to a cleaner API, then that helps the API. Later, when you get to step 3, life will be simpler for other reasons.
Steps 2 and 3 are separate because I am assuming you also cannot afford screwups. Step 2 focuses on the API from Client to Layer; Step 3 focuses on tweaking the Layer for this Alter — they are separable focuses, and is safer to think about only one at a time.
“Turning on the Alter” is essentially running the script (Step 1) to copy the table over. This may take hours or days.
The guiding principles…
⚈ If the existing table continues to be read/written by the identical code as before, all queries on it should continue to work correctly.
⚈ At all times, rows with id <= $left_off (a “highwater mark”) will be correctly transformed, inserted, updated, etc.
⚈ At the end, when $left_off is at the end of the table, the transformation is complete.
Most of the rest of this discussion centers on the complex case of a table that is being modified — any row could be UPDATEd, INSERTs could go anywhere into the table. And it assumes a single machine, or a single Master.
Some likely special cases are covered near the end (“Alternative…”)
⚈ This discussion assumes you can walk through the original table using an explicit PRIMARY or UNIQUE key.
⚈ Single-column (not ‘compound’) key is used to walk through the table.
⚈ You have enough disk space to simultaneously hold both the original table and the new table(s).
⚈ There is enough ‘wiggle room’ in the performance of the databases so that the overhead of this process (100%?) can be handled.
⚈ INSERTs are single-row, UPDATEs are single-table, DELETEs are single-row
⚈ You can modify the Layer to change how INSERTs, etc, operate.
⚈ SELECTs need no modification. (Nice, eh?)
⚈ UPDATE statements do not set the key being used to walk through the table.
⚈ INSERT..ON DUPLICATE KEY UPDATE.. is not used
⚈ INSERT IGNORE does not depend on a secondary UNIQUE key
⚈ There are no FOREIGN KEYs in the definition of this table, nor any in other tables referencing this table.
⚈ Only one table is involved
⚈ You have all the write operations (no ad hoc queries from users)
⚈ Engine — The outline given here should work equally well for MyISAM or InnoDB.
⚈ Replication — You can perform this on the read/write Master and have it propagate gracefully to the slaves.
What to do if assumptions are not met?
⚈ Compound index — The code is more complex, but doable. (I have such code, but it is not included in this document.)
⚈ No PRIMARY/UNIQUE key — A non-unique key can be used, but this may lead to arbitrarily large locks on the table.
⚈ No explicit keys at all — Punt. (LIMIT+OFFSET is not viable!)
⚈ Inadequate disk space — Punt. Get space first.
⚈ If the system gets bogged down, two dynamic tunables can be tweaked to slow down the Alter and make it less invasive.
⚈ Multi-row INSERT — design the Layer’s API to make it easy to determine which rows are below $left_off.
⚈ Multi-table UPDATE — This is a challenge
⚈ DELETE .. IN(…) — pass the IN list in an array for easy handling
⚈ DELETE .. WHERE (multi-row) — no problem
⚈ UPDATE that changes the key — You must recognize the situation and write some messy code. (It would be handy if the API helped make this obvious.)
⚈ INSERT..ON DUPLICATE KEY UPDATE.. — Don’t know.
⚈ INSERT IGNORE — The problem is that it might get INSERTed before the conflicting record is inserted. No workaround available.
⚈ FOREIGN KEYS — have not investigated
⚈ Extra tables for normalization — not a big deal
⚈ ad hoc writ queries — must disallow for the duration of the Alter.
It is “best practice” to have a “Layer” between your clients and your database. SQL statements are only known to the Layer, not to the clients. By having the Layer, you are segregating “business logic” from “database details”. For the purposes here, the “database details” will be changing; it is better to have such code isolated.
The Layer would be called from the Client with calls like “Insert this stuff…”. If the ‘stuff’ needs massaguing to be compatible with MySQL (eg, timestamps), then the Layer is the ‘right’ place to do the conversion. The Layer is also a good place to hide any “normalization” — clients should no care whether a ‘name’ is stored in the table directly, or normalized into another table and only an id is put into the “Fact” table.
Keep the API clean and simple; hide any database complexity in the Layer. And, for this task, it will get complex.
For the task at hand, we will depend on the Layer to make it easy to migrate from one table to another, and to do that without any real knowledge of “business logic”. This ignorance of the client side of things makes it easier and safer to write the code and have confidence that it is correct.
This is a script that does the conversion by copying the data into a new table, which already has the ‘new’ schema elements (added columns, dropped indexes, etc).
⚈ CREATE TABLE — the new table, with all new indexes/datatypes, etc.
⚈ Change the row in the helper table `Migrate` (running=1, etc) (below)
⚈ Loop (below)
⚈ Deactivate (running=0)
⚈ RENAME TABLE existing TO old, new TO existing;
Each iteration does:
⚈ Migrates a “clump” of, say, 100 rows from the old table to the new.
⚈ Any transforms (normalization, datatype conversion, etc) are done for those rows.
⚈ Locks out all ‘write’ operations while handling the clump. (More later)
The starting values for Migrate:
⚈ running = 1
⚈ clump_size = 100 (1000 if existing table is InnoDB with no secondary keys); <100 if lots of keys)
⚈ delay = 1 (second)
⚈ left_off_* — whatever key is less than first value (eg, 0 or ”)
The Loop should watch for running out of rows. When that happens, it needs to do the Deactivate and RENAME _before_ the Unlock.
One Iteration (Copy one Clump)
⚈ Fetch the row from Migrate
⚈ Find the key of the 100th row after where you ‘left_off’.
⚈ Lock (exclusive)
⚈ INSERT INTO new SELECT * FROM existing WHERE id > $left_off AND id <= $hundredth_hence;
⚈ If no more rows, Deactivate and RENAME
⚈ Update row in Migrate
⚈ Sleep 1 second (dynamically tunable)
⚈ The “Find” step is outside the Lock/Unlock so as to minimize the time in lock state.
⚈ Risk: with the Find step outside, you could end up with more than 100 rows to move. (Probably not a big deal.)
⚈ Lock/Unlock — an exclusive LOCK on the Migrate table. Note that all INSERTs/UPDATEs/etc must also Lock that table
⚈ The SELECT step should include any transforms needed, build normalization records, and whatever else is required.
The database Layer is vital for giving us a simple way to modify all INSERT/UPDATE/REPLACE/DELETE statements. All client write operations must be going through the Layer. (SELECTs should go through the Layer, but that does not matter for this discussion.)
I will discuss ‘simple’ write operations only.
Around every atomic operation, you need to add the following:
⚈ Lock (read-lock) on Migrate
⚈ Fetch the row from Migrate
⚈ If not ‘running’, then skip most of these steps. (Alter has not yet started, or has finished.)
⚈ Perform the SQL statement on old table
⚈ Modify statement to include “AND id <= $left_off”, and perform it on new table.
For “transactions” in BEGIN..COMMIT, either make the whole transaction a single API call, or carefully coordinate the Lock/Unlock with BEGIN..COMMIT. BEGIN should map to the Lock and Fetch steps; COMMIT should map to the Unlock step.
REPLACE should be split into what it actually does — a DELETE and an INSERT. Both of these sound be inside a single Lock/Unlock pair.
As noted earlier, some multi-row and multi-table write operations get more complicated.
Helper Table: `Migrate`
This table has only 1 row.
CREATE TABLE Migrate ( running TINYINT UNSIGNED NOT NULL DEFAULT '0', clump_size INT UNSIGNED NOT NULL DEFAULT '100', delay FLOAT NOT NULL DEFAULT '1', left_off_1 ... (INT, VARCHAR, etc, matching first field in KEY being used) left_off_2 ... (more field(s), if needed) clumps_moved INT UNSIGNED NOT NULL DEFAULT '0', rows_moved BIGINT UNSIGNED NOT NULL DEFAULT '0', lock_time DOUBLE NOT NULL DEFAULT '0', move_time DOUBLE NOT NULL DEFAULT '0', sleep_time DOUBLE NOT NULL DEFAULT '0', last_move TIMESTAMP NOT NULL ) ENGINE=MyISAM;
⚈ `running` is 1 while Alter is running. INSERTs, etc, check this to know whether to INSERT into the new table.
⚈ clump_size says how many rows to move per iteration; “100” is used as an example. This can be dynamically tuned.
⚈ delay is how long to sleep() between iterations; 1 second is a good first guess. This is tunable.
⚈ left_off_* contain the value(s) of the last key used by the last clump. (Initialize to 0 or ”, or whatever is less than first key.)
⚈ clumps_moved and rows_moved tally the progress.
⚈ the *_time fields keep track of the seconds consumed by the phases of the loop.
⚈ last_move is a timestamp, overwritten at the end of each iteration. At the end of Alter, it is the finish time. (If the value is too long ago, but not finished, maybe it is broken.)
⚈ MyISAM was chosen to avoid any risk of involving this in BEGIN..COMMIT transactions.
⚈ Replicating the table is optional; it has no function on Slaves.
⚈ It was not intended to JOIN this table with any other; I don’t know if “left_off_*” is useful.
The table must be CREATEd before the Layer has the code in it that looks at the table. The imporant value (at that point) is running=0.
Tuning on the fly
What to watch…
⚈ In Slaves, keep an eye on Seconds_behind_master (or a heartbeat if you have such)
⚈ lock_time / clumps_moved — This should be close to zero; if it is high, then Client processes are not getting enough time.
⚈ move_time / clumps_moved — The average time that it takes to move a clump. This impacts latency of Client write operations.
⚈ sleep_time : move_time — which is getting more time, the Clients, or the Alter.
The following two values in the Migrate table can be changed at will; changes take effect at the start of the next clump.
“clump_size” (nominally 100) effectively controls how long INSERTs and other writes are blocked at a time. Decrease during busy times; increase during lulls.
“delay” (nominally 1 second; declared FLOAT), controls the gap between clumps. It is intended to let other operations get in. Also, it somewhat controls the overloading of the replication stream. Increase `delay` if replication gets behind.
After the Alter is finished, the code automatically “deactivates” (by setting running=0) itself. No further writes are needed to the new table (which has been RENAMEd out of existence anyway).
However, the Layer is still checking Migrate, but this is no longer necessary. To clean up the code, you need another code push. This code push should be no more complicated than turning off a flag (a code flag, not the dynamic ‘running’ flag). The flag should have been built in when Alter is originally written. It will control whether to look at the Migrate table, or simply follow the “not running” path and skip the lock and unlock.
I recommend leaving these vestiges of the code in place; you are likely to do another Alter in the future. Meanwhile, the code will be sufficiently efficient. (A few “if false” statements are insignifant.)
Alternative – Log table
If the table is a simple, append-only, “log”, then the job can be done a lot simpler. There is no need for the Database Layer (though that might be best practice). The Migrate script can probably be run faster.
⚈ Migrate 1K-10K records per iteration, no extra locking needed.
⚈ Key being used must be the one for which new rows are sent to the ‘end’ of the table.
⚈ The final step that involves the RENAME gets tricky.
The only step that needs careful design (and I don’t have the details ironed out…)
Plan A. When the Migrate script discovers that there are no rows left to move, it LOCKs the existing and new tables, does the RENAME, and UNLOCKs. (Could some INSERT slip in? Especially for InnoDB?)
Plan B. When the Migrate script discovers that there are no rows left to move, it does the RENAME, then checks for any extra rows that snuck in. If there are extra rows, it moves them. (What happens to any INSERTs that are trying to run when the RENAME occurs? Maybe they wait? Maybe they die? Does your code automatically retry in that case?)
Alternative – Dual Master
If you are set up with a pair of Masters replicating from each other (for quick recovery from a crash), then this is a much simpler approach. Or it is simpler for the developers — it can be done entirely by the SEs.
The idea is to do some ALTER on offline server(s), failover, then do the rest of the ALTER(s).
Caveat: This only applies if the replication stream will not be confused by a Master having the ‘old’ schema, and the a Slave having the ‘new’ schema. (Note: RBR will probably be messed up by COLUMN changes, but not INDEX changes?) If any INSERT or UPDATE that would be confused, then you need to use the main approach so that you can have two flavors of statement and know which to apply to the existing versus the new tables.
Caution: Replication will need to be stopped or replication will be stuck, perhaps for as long as the ALTER takes. If the ALTER takes a week, you need enough disk space for a week’s worth of binlogs and/or relay logs.
Caution: A really big table could take _days_ to ALTER. There will be multiple (different machines), non-overlaping, ALTERs. Plan accordingly.
Plan A — Allow the ALTER to replicate, but carefully staged: The idea is to run the ALTER on the Backup Master, and do the failover just when it is ready to start on the Live Master. You can spot that time by knowing when the ALTER finishes on the Backup Master. If the ALTER turns out to be “wrong”, fail back to the original Live Master and lick your wounds. Now you have the equivalent of the Backup Master having crashed.
Plan B — Manually do one machine at a time. This requires SET SESSION SQL_LOG_BIN=0; in front of the ALTER each time.
1. One at a time, take a Slave out of rotation and do the ALTER on it. Do multiple Slaves at a time, but not more than you can afford to have OOR simultaneously.
2. Ditto for the Backup Master. Note: Any slaves connected to this server will become ‘stale’, so should be OOR.
3. Failover to the Backup Master
4. ALTER what had been the Live Master. Again note that its slaves should be OOR for the length of the ALTER.
Caveat: The Backup Master will be unavailable for failover during the ALTER. However, you can abort the ALTER if you need to failover.
Caveat: Any machine offline should probably have replication turned on (though stuck) so that the I/O thread is receiving updates. This should avoid having binlogs purged out from under you. However, it means that the receiving machine will have a pileup in the relay logs — be sure to have enough disk space for such.