On the one hand, we have Cloud SQL, Google’s fully managed relational database service that supports MySQL, PostgreSQL, and SQL Server. BigQuery, on the other hand, is Google’s serverless, highly scalable, and cost-effective multi-cloud data warehouse.
As a result, when you need to run fast analytical queries over large datasets, you may want to move your data from Cloud SQL to BigQuery.
In this article, we will walk you through several methods of migrating data from Cloud SQL to BigQuery. Let’s get started:
You can approach the migration in several different ways, and rest assured, some of them require little to no coding. Here are a few ways of migrating data from Cloud SQL to BigQuery:
1. Google Cloud Dataflow
Google Cloud Dataflow is a fully managed service for stream and batch data processing. It can handle large volumes of data and, therefore, can help with the migration process.
Here are the steps to follow:
- Before starting the migration, ensure that Cloud SQL, BigQuery, and Dataflow APIs are enabled in your Google Cloud project. Give appropriate IAM roles to the Dataflow service account, including roles for accessing Cloud SQL and BigQuery.
- Make sure that your Cloud SQL instance is configured to allow connections from Dataflow. You might need to set up IP-allow lists or configure Private IP.
- Install the Apache Beam SDK for Python or Java.
- Use the ‘JdbcIO’ connector (Java) or a custom ‘DoFn’ (Python) to read data from Cloud SQL; a minimal Python sketch of this approach follows the list below. If you exported data to Cloud Storage instead, read the exported files using the appropriate I/O connector (e.g., ‘TextIO’, ‘AvroIO’).
- Optionally, apply any transformations your data needs using Beam’s PTransforms.
- Use the ‘BigQueryIO’ connector to write the data to BigQuery. Depending on your data size and pipeline requirements, you can choose different methods like direct write or writing to a temporary table first.
- Use the ‘gcloud’ command-line tool or the Google Cloud Console to deploy your Dataflow pipeline. Specify pipeline options, including your project ID, region, and any other relevant configurations.
- Monitor the pipeline’s execution in the Google Cloud Console, checking for any errors and ensuring data is correctly flowing from Cloud SQL to BigQuery.
- Once the pipeline completes, validate that the data in BigQuery matches your expectations and the source data in Cloud SQL.
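As a starting point, here is a minimal Python sketch of such a pipeline: it reads rows from a Cloud SQL (MySQL) table with a custom ‘DoFn’ and writes them to BigQuery via ‘WriteToBigQuery’. The connection details, table names, and schema are placeholder assumptions you would replace with your own, and ‘pymysql’ would need to be listed in the workers’ requirements.

```python
# Minimal sketch, assuming a MySQL Cloud SQL instance reachable from Dataflow
# and placeholder project, table, and credential values.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ReadFromCloudSQL(beam.DoFn):
    """Reads all rows from a Cloud SQL table and emits them as dicts."""

    def process(self, _):
        import pymysql  # imported here so the dependency is loaded on the workers
        conn = pymysql.connect(
            host="10.0.0.3",          # Cloud SQL private IP (placeholder)
            user="migrator",
            password="secret",        # prefer Secret Manager in real pipelines
            database="appdb",
            cursorclass=pymysql.cursors.DictCursor,
        )
        with conn.cursor() as cur:
            cur.execute("SELECT id, name, created_at FROM customers")
            for row in cur:
                row["created_at"] = row["created_at"].isoformat()
                yield row
        conn.close()


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",         # placeholder project ID
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Start" >> beam.Create([None])              # single seed element
            | "ReadCloudSQL" >> beam.ParDo(ReadFromCloudSQL())
            | "WriteBQ" >> beam.io.WriteToBigQuery(
                "my-project:analytics.customers",
                schema="id:INTEGER,name:STRING,created_at:TIMESTAMP",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )


if __name__ == "__main__":
    run()
```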
Depending on your pipeline’s performance, you might need to adjust parallelism, choose different machine types, or optimize your source/query for better throughput.
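For reference, such tuning can be expressed through the Python SDK’s pipeline options; the worker counts and machine type below are illustrative assumptions, not recommendations.

```python
# Illustrative Dataflow tuning options for the pipeline above;
# adjust the values for your own workload.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    num_workers=5,                  # initial worker count
    max_num_workers=20,             # autoscaling ceiling
    machine_type="n2-standard-4",   # larger workers for heavier transforms
)
```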
2. BigQuery Scheduled Queries
If you don’t want to write a custom Dataflow pipeline, you can rely on BigQuery’s scheduled queries instead. Here are the steps:
- Use Cloud SQL’s export functionality to export data to Google Cloud Storage (GCS). This can be automated with Cloud Scheduler and Cloud Functions to trigger exports at regular intervals. (Ensure the data is exported in a format that BigQuery can import, such as CSV, JSON, Avro, or Parquet.)
- If you haven’t already, create a dataset in BigQuery to hold your imported data.
- Schedule Data Load from GCS:
- Go to the BigQuery UI, click on “Scheduled queries” in the side menu, and then click “Create scheduled query.”
- Point the query at your exported files in GCS, for example by defining an external table over the Cloud Storage path, or use the BigQuery Data Transfer Service with “Cloud Storage” as the data source and specify the path to your data in GCS.
- Write a ‘CREATE TABLE’ or ‘INSERT’ statement to load data into your BigQuery dataset. For large datasets or frequent updates, consider using ‘CREATE OR REPLACE TABLE’ to overwrite the existing table or ‘INSERT INTO’ to append to it. (A Python sketch of the equivalent load is shown after this list.)
- Configure the schedule for how often this query should run. This will depend on how frequently your Cloud SQL data is exported to GCS and how up-to-date you need the data in BigQuery to be.
- Use the BigQuery UI to monitor the execution of your scheduled queries and ensure data is being updated as expected.
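To make the load step concrete, here is a minimal Python sketch of the load that a scheduled query or transfer would perform, using the google-cloud-bigquery client to import a CSV export from GCS. The bucket, dataset, and table names are placeholder assumptions.

```python
# Minimal sketch of loading a Cloud SQL CSV export from GCS into BigQuery.
# In production, a scheduled query or transfer would replace this one-off load.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # placeholder project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                # skip the header row of the export
    autodetect=True,                    # infer the schema from the CSV
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-export-bucket/cloudsql/customers-*.csv",
    "my-project.analytics.customers",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
print(client.get_table("my-project.analytics.customers").num_rows, "rows loaded")
```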
Note that this method does not give you real-time data; there will be latency between each export and the corresponding load into BigQuery.
3. Cloud Datastream for CDC
You can also use Google Cloud Datastream for Change Data Capture (CDC) to replicate data from Cloud SQL to BigQuery. Here are the steps:
- Setting up:
- Make sure that the Cloud Datastream API is enabled in your Google Cloud project.
- Configure a connection profile for your Cloud SQL instance. You can do this by specifying connection details such as instance ID, database type, and credentials.
- Configure a connection profile for BigQuery as your destination.
- Create a new stream in Cloud Datastream, selecting your previously created source and destination connection profiles. You’ll also need to specify the database tables in Cloud SQL that you want to replicate.
- Choose the replication settings, such as the starting point for data replication (e.g., from now, from the earliest available point, or from a specific point in time).
- Once configured, start the stream. Cloud Datastream will begin capturing changes from the specified Cloud SQL tables and replicate them to the destination in BigQuery.
- Ensure that the dataset and tables where the data will be replicated are properly set up in BigQuery. You may need to define schemas that match your Cloud SQL tables.
- Datastream will output changes to Google Cloud Storage in a format such as Avro. Use a Dataflow job or a custom service to ingest these changes into BigQuery, transforming the data as necessary to fit your BigQuery schema (a minimal Python sketch follows this list).
- Apply error handling and alerting mechanisms to quickly address any issues that may arise during the replication process.
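To illustrate the ingestion step, here is a minimal Python sketch that batch-loads Datastream’s Avro output from GCS into a BigQuery staging table with the google-cloud-bigquery client. The bucket, dataset, and table names are placeholder assumptions, and this is not a full CDC pipeline on its own.

```python
# Minimal sketch: append the Avro change files Datastream wrote to GCS
# into a BigQuery staging (changelog) table. Paths and names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # placeholder project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    use_avro_logical_types=True,                  # keep timestamps/decimals typed
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-datastream-bucket/appdb/customers/*.avro",
    "my-project.analytics.customers_changelog",   # staging/changelog table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
```

In practice, you would land the change records in a staging table like this and then periodically apply them to the final table (for example, with a MERGE statement) so that updates and deletes from Cloud SQL are reflected correctly.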
While this method is efficient and delivers near real-time replication, cost and complexity should be your two biggest considerations.
Final Note
Remember that BigQuery is optimized for append-only, immutable writes rather than frequent row-level updates, so you might need a more tailored strategy than the solutions we have provided here if your workload depends on updates or deletes. Also, BigQuery limits compressed CSV and JSON load files to 4 GB each, so you may need to chunk your data extracts before loading them.