The true tale of migrating to Amazon Aurora with no interruption of service
The Nulab Account’s Amazon RDS database recently migrated from MySQL to Amazon Aurora. In this article, we discuss how we made this migration possible and successful.
Reasons for the migration
Our Nulab Account handles authentication for all our services: Backlog, Cacoo, and Typetalk. If authentication is not functioning properly, none of our services can operate. The Nulab Account must therefore always be available so that authentication works at all times.
We were previously using RDS for MySQL but considered switching to Amazon Aurora from the moment it was released, especially since it is compatible with MySQL. Much has been said about Aurora’s merits, but it was its availability and scalability that appealed to us even more than its performance advantages.
Availability
A huge merit for us was Amazon Aurora’s high tolerance of disk failure and its advanced replication, described in the “High Availability and Replication” section of the FAQ.
Multi-AZ fail-over for RDS for MySQL relies on DNS, so it can take quite a few minutes to go from the failure to the completion of the fail-over. Amazon Aurora, even without a cluster, can recover from a failure within 15 minutes. With a multi-node cluster, it is said to complete the fail-over process within one minute.
This leads to high availability.
Scalability
You can expand storage on RDS for MySQL, but it is difficult to do so while the system is running.
As stated in the “Hardware and Scaling” section of the FAQ, you can extend Amazon Aurora’s storage without stopping the database or affecting its performance.
This is a huge benefit.
Conclusion
Aurora does have some disadvantages when compared to the RDS for MySQL—there are fewer instance types to choose from, the price is slightly higher, and there are several other minor points—but in the end, we did not find any real reason to choose MySQL over Aurora. We soon decided to migrate our Nulab Account to Amazon Aurora.
The migration process
Architecture and availability of Nulab Account
The Nulab Account is implemented as a Java servlet running on EC2 instances. We had been using RDS for MySQL as the database.
As previously mentioned, we must offer authentication at all times. To achieve this, the database was designed so that if it becomes read-only, authentication will continue to work even if other functions are affected.
Considering migration steps
What we considered most important when migrating the database was keeping the authentication function running.
We first migrated our staging environment from MySQL to Aurora. Through trial and error there, we worked out a migration strategy and discovered that no significant changes were needed in the application itself.
We migrated the data from MySQL to Aurora using the replication function, following the user’s guide “Migrating Data to Amazon Aurora DB Cluster.” To do this, we made an Aurora instance from a snapshot of MySQL. Then we migrated the MySQL data to Aurora by setting up replication with MySQL as the master and Aurora as the slave.
After checking that the replication worked correctly, we made the actual switch of the database that the applications connect to. One thing we had to keep in mind was that data inconsistency could arise between MySQL and Aurora if new data was written between the time we migrated the data and the time we switched the connected database. We avoided this inconsistency by temporarily stopping application use without stopping the authentication function. We decided to announce a maintenance time, make the MySQL database read-only during that time (keeping authentication active), and migrate the data in that window.
Step 1: Make read replicas of MySQL and stop that replication to prevent the binary log from being deleted.
Step 2: Start Aurora.
Step 3: Construct replication with MySQL as master and Aurora as slave.
Step 4: Switch MySQL to read-only, temporarily disabling some functions.
Step 5: Switch the access point to Aurora, application by application. Once the switch is made, all functions become active again.
Step 6: Stop MySQL.
We decided to migrate our database without stopping our applications, using the above steps.
For steps 4 and 5, we set a one-hour maintenance window—in which some functions become unavailable—and planned to carry out the migration during that hour. In case of an error, we took a snapshot right after switching to read-only in step 4. From step 5, data writing to Aurora began. If a major problem occurred and forced us to fall back on MySQL, we would need to abandon the data written to Aurora since the beginning of step 5. We made the decision not to go back after step 5 even if a problem occurred.
We notified our users via our website.
Step 1: Make ‘read replicas’ of MySQL and stop replication to prevent the binary log from being deleted
To replicate the data from MySQL to Aurora, we used the “mysql.rds_set_external_master” command to set up replication manually, outside the RDS management system.
RDS for MySQL normally backs up binary logs to S3 and deletes them roughly every five minutes. When replicating under RDS management, it checks that the replica is up to date before deleting a log, but when replicating outside RDS management, it deletes binary logs without checking whether they have been replicated.
If a needed log is deleted before the slave has read it, replication breaks.
To solve this problem, as stated in “Replication Between Aurora and MySQL or Between Aurora and Another Aurora DB Cluster” under “Replication with Amazon Aurora,” if you make another read replica under RDS management and then stop its replication, the master keeps the binary log instead of deleting it. Following the procedure in that document, we made a MySQL read replica and ran the “mysql.rds_stop_replication” command to temporarily stop its replication.
mysql> call mysql.rds_stop_replication;
+---------------------------+
| Message                   |
+---------------------------+
| Slave is down or disabled |
+---------------------------+
1 row in set (1.03 sec)
As the binary log would no longer be deleted, we had to pay attention to the remaining storage space. Since this database did not hold a large amount of data, the increase was not significant.
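One way to keep an eye on that growth is to total the sizes reported by SHOW BINARY LOGS. The following is a minimal Java sketch of such a check; the endpoint and credentials are placeholders, and this was not part of the migration procedure itself.

// Minimal sketch: sum the size of the binary logs retained on the master
// while replication is stopped. The endpoint and credentials are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BinlogSizeCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://{Endpoint of the MySQL master}:3306/";
        try (Connection conn = DriverManager.getConnection(url, "{user}", "{password}");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW BINARY LOGS")) {
            long totalBytes = 0;
            while (rs.next()) {
                totalBytes += rs.getLong("File_size");
            }
            System.out.println("Retained binary logs: " + totalBytes + " bytes");
        }
    }
}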
We then needed to check the read replica’s position in the binary log, since this information is required to set up the replication in Aurora.
mysql> show slave status \G
*************************** 1. row ***************************
...Snip...
Master_Log_File: mysql-bin-changelog.208690
Read_Master_Log_Pos: 110119
...Snip...
1 row in set (0.00 sec)
The Master_Log_File and Read_Master_Log_Pos values identify the position in the binary log that we need.
Step 2: Start Aurora
Using “Migrate Latest Snapshot” in the “Instance Actions” menu of the management console, we made an Aurora instance from the MySQL read replica. So that the instance would be based on the latest data, we took a snapshot of the MySQL read replica immediately beforehand. We also set the security group and parameter group. This Aurora instance becomes the writer node. Next, we started the reader node from the management console and constructed the cluster.
Step 3: Construct replication with MySQL as master and Aurora as slave
On Aurora, set MySQL as the master with the “mysql.rds_set_external_master” command, specifying as parameters the binary log position you checked earlier.
mysql> call mysql.rds_set_external_master (
    ->   "{Endpoint of the MySQL master}"
    -> , "3306"
    -> , "{user}"
    -> , "{password}"
    -> , "mysql-bin-changelog.208690"
    -> , 110119
    -> , 0
    -> );
Query OK, 0 rows affected (0.03 sec)
Next, start the replication.
mysql> call mysql.rds_start_replication;
+-------------------------+
| Message                 |
+-------------------------+
| Slave running normally. |
+-------------------------+
1 row in set (1.01 sec)

Query OK, 0 rows affected (1.01 sec)
After a while, MySQL and Aurora will have the same data.
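To judge when Aurora has caught up, you can watch Seconds_Behind_Master in the output of SHOW SLAVE STATUS on the Aurora side. A minimal Java sketch of such a check follows; the endpoint and credentials are placeholders.

// Minimal sketch: read Seconds_Behind_Master on the Aurora writer to see
// how far replication lags behind the MySQL master. Placeholders as above.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReplicationLagCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://{Endpoint of the Aurora writer}:3306/";
        try (Connection conn = DriverManager.getConnection(url, "{user}", "{password}");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW SLAVE STATUS")) {
            if (rs.next()) {
                // 0 means Aurora holds the same data as the MySQL master.
                System.out.println("Seconds_Behind_Master = "
                        + rs.getString("Seconds_Behind_Master"));
            }
        }
    }
}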
Step 4: Switch MySQL to read-only
When the maintenance window started, the first thing we did was change the read_only variable in the parameter group from the management console, switching MySQL to read-only mode. This stopped any writes to MySQL, and therefore to Aurora, and some application functions stopped.
Step 5: Switch the application access point to Aurora one by one
Application by application, we switched the database connection from MySQL to Aurora. As each application switched over, all of its functions became available again.
The maintenance time could now come to an end.
All the applications were now connected to Aurora.
We then canceled the replication from MySQL to Aurora, using the “mysql.rds_reset_external_master” command.
mysql> call mysql.rds_reset_external_master;
+----------------------+
| message              |
+----------------------+
| Slave has been reset |
+----------------------+
1 row in set (0.01 sec)

Query OK, 0 rows affected (0.01 sec)
Step 6: Stop MySQL
Finally, we stopped the MySQL instance, although we waited a few days after the switch before doing so.
Now the database migration was complete.
Overall, there were no significant issues and we felt the migration process went smoothly.
Changing the Applications
Changing behavior when the database is read-only
Previously, our application showed an error page if it attempted to write to the database while the database was read-only.
So that a deliberate switch to read-only would be clearly communicated, we changed the applications before the migration: instead of showing an error page, the application now shows a maintenance page.
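The check can be done, for example, in a servlet filter. The following is a minimal sketch, not our exact implementation; the class name, the maintenance page path, and the use of MySQL error code 1290 (raised when a write is rejected by a read-only server) are assumptions for illustration.

// Minimal sketch: show a maintenance page when a write fails because the
// database is read-only. Class name, page path, and error code usage are
// illustrative assumptions.
import java.io.IOException;
import java.sql.SQLException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class ReadOnlyMaintenanceFilter implements Filter {

    // MySQL error code returned when a write is rejected by a read-only server.
    private static final int ER_OPTION_PREVENTS_STATEMENT = 1290;

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        try {
            chain.doFilter(req, res);
        } catch (RuntimeException | ServletException e) {
            if (causedByReadOnlyDatabase(e)) {
                // The database is read-only: show the maintenance page
                // instead of an error page.
                req.getRequestDispatcher("/maintenance.jsp").forward(req, res);
                return;
            }
            throw e;
        }
    }

    // Walk the cause chain looking for the "read only" SQL error.
    private boolean causedByReadOnlyDatabase(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof SQLException
                    && ((SQLException) cur).getErrorCode() == ER_OPTION_PREVENTS_STATEMENT) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void init(FilterConfig filterConfig) {
    }

    @Override
    public void destroy() {
    }
}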
Changing JDBC driver
When connecting from a Java application to Aurora, you can achieve fast fail-over by using MariaDB Connector/J as the JDBC driver. We changed the JDBC driver after the database migration was completed.
MariaDB Connector/J
The conventional Multi-AZ RDS uses a hot-standby configuration. When a fail-over occurs, the cluster endpoint switches from the master’s IP address to the slave’s. Because this switch happens at the DNS level, it is affected by the TTL, so as long as the application refers to the cluster endpoint, we cannot avoid a period during the fail-over when the application cannot reach the database.
MariaDB Connector/J is a JDBC driver for MariaDB and MySQL. It has a built-in fail-over mechanism, and it works with Aurora clusters. The driver achieves fast fail-over by listing the node endpoints in the JDBC URL, as below, rather than the cluster endpoint.
jdbc:mysql:aurora://{Endpoint of node #1}:3306,{Endpoint of node #2}:3306,.../account-db
Aurora can construct a cluster of one writer and several readers. The writer and the readers are distinguished by the global variable innodb_read_only. When the cluster detects a failure in the writer node, it immediately changes the innodb_read_only value of one correctly functioning reader, making that node the writer.
If the driver detects a failure when issuing a query, it uses innodb_read_only to find the writer node, switches to that node, and re-issues the query.
Even if a fail-over occurs, the driver immediately switches the access point so the application continues its processing without being interrupted by the fail-over.
For details, please check the documentation on mariadb.com.
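As an illustration, the following minimal sketch obtains a connection through the fail-over URL shown above and reads innodb_read_only to see whether it is talking to the writer. The endpoints and credentials are placeholders, and a real application would normally hand the URL to a connection pool instead.

// Minimal sketch: connect through MariaDB Connector/J with the Aurora
// fail-over URL and check which node the connection currently points at.
// Endpoints and credentials are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AuroraConnectionExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql:aurora://"
                + "{Endpoint of node #1}:3306,"
                + "{Endpoint of node #2}:3306"
                + "/account-db";
        try (Connection conn = DriverManager.getConnection(url, "{user}", "{password}");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT @@innodb_read_only")) {
            if (rs.next()) {
                // 0 means this connection is currently talking to the writer.
                System.out.println("innodb_read_only = " + rs.getInt(1));
            }
        }
    }
}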
Some points to consider
We used version 1.2.3 of MariaDB Connector/J and found some points that need attention.
The following statement is on the MariaDB website:
It’s originally based on the Drizzle JDBC code, and with a lot of additions and bug fixes.
As this says, the driver is originally based on the Drizzle ( http://www.drizzle.org/ ) JDBC driver. Its code base is different from that of MySQL Connector/J, so it has a different set of parameters and some subtle differences in behavior.
At present, T getObject(int i, Class<T> type), a method of the ResultSet interface added in JDBC 4.1, always returns null. The Nulab Account has some places where a Java enum is mapped to a MySQL VARCHAR column, and we used this method to get the value from the ResultSet, so we needed to amend those places.
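A minimal sketch of the kind of amendment we mean: read the column as a String and convert it explicitly, rather than asking the driver for the enum. The Plan enum and the column name here are hypothetical.

// Minimal sketch of the workaround: read the VARCHAR as a String and map
// it to the enum explicitly, instead of relying on
// ResultSet.getObject(int, Class<T>), which always returned null with
// driver version 1.2.3. The Plan enum and column name are hypothetical.
import java.sql.ResultSet;
import java.sql.SQLException;

public class EnumMappingExample {

    enum Plan { FREE, STANDARD, PREMIUM }

    static Plan readPlan(ResultSet rs) throws SQLException {
        String value = rs.getString("plan");
        return value == null ? null : Plan.valueOf(value);
    }
}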
Also, there seemed to be an issue with a few threads being left over when the application shut down.
As a solution, the org.mariadb.jdbc.Driver class provides a method called unloadDriver. Calling this method when the driver is no longer needed allows the application to shut down correctly. In the Nulab Account, the following code is called from the contextDestroyed method of a ServletContextListener.
try {
    final Class<?> clazz = ClassUtils.getClass("org.mariadb.jdbc.Driver");
    MethodUtils.invokeStaticMethod(clazz, "unloadDriver");
} catch (final ReflectiveOperationException e) {
    // Ignore: the MariaDB driver is simply not on the classpath.
}
Ensuring quality with a unit test
Even though the production environment switched to Aurora, developers continued to use MySQL, and the Jenkins CI environment still uses MySQL. Aurora advertises MySQL compatibility, but we could not be sure how far that compatibility extends.
We decided to ensure there were no SQL errors by running the unit tests against the staging environment that had been migrated to Aurora. On that server the application works the same way as in production; in addition, we made it a Jenkins slave and scheduled a build once a day.
If there is any incompatibility between MySQL and Aurora, this test can detect it.
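A minimal sketch of what such a test can look like: a JUnit test that runs a representative query against whichever database the environment points at, so that any SQL Aurora rejects fails the daily build. The class name, the db.url system property, and the query are hypothetical.

// Minimal sketch: run a representative query against whichever database the
// environment points at (MySQL on developer machines, Aurora on the staging
// Jenkins slave). Class name, system property, and query are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.junit.Assert;
import org.junit.Test;

public class SqlCompatibilityTest {

    @Test
    public void representativeQueryRuns() throws Exception {
        String url = System.getProperty("db.url"); // set per environment
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT id, name FROM account ORDER BY id LIMIT 10")) {
            // An incompatibility would surface here as an SQLException,
            // failing the build.
            Assert.assertNotNull(rs);
        }
    }
}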
So far, the test has never failed. We feel that Aurora’s quality is very high in terms of MySQL compatibility.
Summary
You can see Amazon’s excellence in many aspects of Aurora.
We have not had a database failure or fail-over since the migration, so we have not yet experienced the merits of Aurora’s fail-over in the production environment. However, in the staging environment, we have confirmed that the application is not affected when a fail-over occurs.
Aurora’s high scalability and availability raise the scalability and availability of the system as a whole.
We believe Aurora’s compatibility with MySQL is truly remarkable. Even when infrastructure changes, we can continue with our operation and development without being too concerned.
We hope this article helps those who are considering migrating to Amazon Aurora!