
As more databases move to the cloud, and as those databases grow and proliferate, technologies to move data are advancing rapidly to keep up with new requirements, including the ability to share data across clouds.
The ability to move data has always been fundamental to data management, but in the cloud it is paramount: hybrid clouds and multi-clouds, by definition, must be architected for data sharing. Data integration, distribution, consumption, migration, and updates all involve moving data in and out of databases.
Developments in Cloud Data Sharing
A range of new and improved technologies enable high-speed, low-friction data distribution. Here are recent developments in five key areas:
1. APIs
On September 28, Google Cloud announced the general availability of Apigee Integration, an API management and integration platform for connecting data and applications via APIs. The solution has pre-built connectors for Google’s Cloud SQL and BigQuery databases, with more connectors planned, making it easier for developers to tap into those data sources when building applications.
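Apigee’s own connector configuration is beyond the scope of this article, but the kind of programmatic access those connectors streamline looks roughly like the sketch below, which queries BigQuery directly using Google’s Python client library. The project, dataset, and table names are hypothetical placeholders.

```python
# A minimal sketch of pulling data from BigQuery with Google's Python client.
# The project, dataset, and table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # uses default credentials

query = """
    SELECT customer_id, SUM(order_total) AS lifetime_value
    FROM `my-analytics-project.sales.orders`
    GROUP BY customer_id
    ORDER BY lifetime_value DESC
    LIMIT 10
"""

for row in client.query(query).result():  # blocks until the query job completes
    print(row.customer_id, row.lifetime_value)
```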
2. Change Data Capture
Recently, Google Cloud introduced Datastream, a cloud service that provides change data capture and replication. Change data capture (CDC) automatically captures inserts, updates, and deletes on a primary database and propagates them to other systems, keeping them synchronized and consistent. Datastream works not only with Google Cloud Storage, but also with Oracle, MySQL, BigQuery, Cloud SQL, and Cloud Spanner. Many other vendors have their own change data capture capabilities.
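To make the idea concrete, here is a deliberately simplified sketch of CDC semantics: changes captured on a primary database are applied to a replica so the two stay in sync. Production services such as Datastream tail the database’s transaction logs rather than polling a table, which avoids putting extra query load on the primary; the table and column names below are hypothetical.

```python
# Simplified change-data-capture loop: read new entries from a (hypothetical)
# change_log table on the primary and apply them to a replica. Real CDC tools
# read the database's write-ahead/redo log instead of polling.
import time
import sqlite3  # stand-ins for the primary and replica databases

primary = sqlite3.connect("primary.db")
replica = sqlite3.connect("replica.db")
last_seen = 0  # position of the last change already applied

def apply_change(op, customer_id, email):
    if op == "DELETE":
        replica.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
    else:  # INSERT or UPDATE
        replica.execute(
            "INSERT INTO customers (id, email) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
            (customer_id, email),
        )

while True:
    changes = primary.execute(
        "SELECT change_id, op, customer_id, email FROM change_log "
        "WHERE change_id > ? ORDER BY change_id",
        (last_seen,),
    ).fetchall()
    for change_id, op, customer_id, email in changes:
        apply_change(op, customer_id, email)
        last_seen = change_id
    replica.commit()
    time.sleep(1)  # poll interval; log-based CDC streams changes instead
```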
3. ETL & ELT
Tools to extract, transform, and load (ETL) have long been the standard way of moving data from databases, apps, and other sources into data warehouses. In the cloud, transformation often happens as the last step, after the data has been loaded into the warehouse, which is reflected in a slightly different abbreviation: ELT. Either way, these capabilities are essential for data integration and for establishing pipelines of quality-controlled “clean data” for analysis.
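The difference is easiest to see side by side. The sketch below contrasts the two patterns using pandas and a generic SQL warehouse; the connection strings and table names are hypothetical placeholders.

```python
# ETL vs. ELT, sketched with pandas and SQLAlchemy. Connection strings and
# table names are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("postgresql://app_user@source-db/orders")
warehouse = create_engine("postgresql://etl_user@warehouse-db/analytics")

# --- ETL: transform in the pipeline, then load the cleaned result ---
raw = pd.read_sql("SELECT * FROM orders", source)                  # extract
clean = raw.dropna(subset=["order_total"])                          # transform
clean["order_total"] = clean["order_total"].round(2)
clean.to_sql("orders_clean", warehouse, if_exists="replace", index=False)  # load

# --- ELT: load the raw data first, transform later inside the warehouse ---
raw.to_sql("orders_raw", warehouse, if_exists="replace", index=False)      # load
with warehouse.begin() as conn:                                     # transform in SQL
    conn.execute(text(
        "CREATE TABLE orders_clean_elt AS "
        "SELECT order_id, ROUND(order_total::numeric, 2) AS order_total "
        "FROM orders_raw WHERE order_total IS NOT NULL"
    ))
```

Pushing the transformation into the warehouse takes advantage of its scalable compute, which is a big part of why the ELT pattern has caught on in the cloud.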
4. Data Fabrics
InterSystems, Oracle, Talend, and other vendors offer data fabrics that weave together data wherever it resides, whether in databases, applications, or IoT devices. These many-to-many middleware fabrics support both data streaming and replication for real-time analytics.
5. Distributed Databases
A growing number of new-generation cloud databases are designed to be distributed, with portions of the database running on different servers and potentially across geographic regions. A distributed database functions as one logical database, albeit with a distributed architecture. Microsoft’s Azure Cosmos DB, CockroachDB, DataStax Astra DB, and YugabyteDB are examples.
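From the application’s point of view, that single logical database is the whole story. CockroachDB, for instance, speaks the PostgreSQL wire protocol, so a standard Postgres driver works unchanged; in the hypothetical sketch below, the app connects through a load balancer in front of the cluster and issues ordinary SQL while the cluster handles data placement and replication behind the scenes.

```python
# One logical database, many nodes: connecting to a CockroachDB cluster with a
# standard PostgreSQL driver. Host, credentials, and schema are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="cockroach-lb.example.internal",  # load balancer in front of the nodes
    port=26257,                            # CockroachDB's default SQL port
    dbname="bank",
    user="app_user",
    sslmode="require",
)

with conn, conn.cursor() as cur:
    # The cluster decides where each row's replicas live (potentially across
    # regions); the application simply issues ordinary SQL.
    cur.execute("INSERT INTO accounts (id, balance) VALUES (%s, %s)", (1, 1000))
    cur.execute("SELECT id, balance FROM accounts")
    for account_id, balance in cur.fetchall():
        print(account_id, balance)
```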
Follow the Money
Data movement makes data management more complex and requires added attention to security and governance. Yet data-in-motion technologies are essential to data lakes, data warehouses, data clouds, and other data-driven business operations.
Venture investment is pouring into these products and services. For example, Matillion, a data-integration and ETL specialist, just pulled in $150 million in Series E funding, bringing its total venture capital to $310 million at a $1.5 billion valuation.
Also, Fivetran recently announced plans to acquire HVR for $700 million, bringing together Fivetran’s automated data integration and HVR’s data replication. According to Fivetran, the goal is to make data access “as simple and reliable as electricity.”
Fivetran also revealed $565 million in additional venture funding, led by Andreessen Horowitz, bringing its total funding to $730 million at a $5.6 billion valuation.
Beyond Big Data
As databases expand to petabytes and beyond, standard approaches to data distribution and replication may not be sufficient. New solutions for very large databases (VLDBs) will ultimately be required.
In one interesting example, a handful of U.S. national labs have a project underway to develop processes that support petabyte-scale data transfers. The initiative, the Petascale DTN Project, involves data transfer nodes (DTNs) connected by high-speed networking. An article in The Next Platform describes the project, with diagrams and a link to a paper by the team behind it.
When gigabit-speed connections aren’t enough, it’s time to bring in the 18-wheelers. My favorite example of big-time data transfer is AWS’s Snowmobile, a 45-foot-long shipping container that can move 100 petabytes at a time by driving down the road to an AWS data center. AWS says it could take more than 20 years to transfer that much data over a 1-Gbps connection.
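The arithmetic behind that claim is easy to check with a quick back-of-the-envelope calculation (assuming a fully saturated link and decimal petabytes):

```python
# Rough transfer-time estimate for a full 100 PB Snowmobile load.
PETABYTE_BITS = 1e15 * 8            # bits in one (decimal) petabyte
data_bits = 100 * PETABYTE_BITS     # 100 PB

for gbps in (1, 10, 100):
    seconds = data_bits / (gbps * 1e9)
    years = seconds / (60 * 60 * 24 * 365)
    print(f"{gbps:>3} Gbps: {years:.1f} years")

# Approximate output:
#   1 Gbps: 25.4 years
#  10 Gbps: 2.5 years
# 100 Gbps: 0.3 years
```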
Yet data transfer by truck can only be an interim solution, so state-of-the-art technologies must continue to advance to take on bigger workloads at faster speeds.
Data transfer gets even more complicated when entire databases are moved to the cloud. For more on that topic, see my article, “Cloud Vendors Confront ‘Highest Risk’ Projects: Database Migration.”
No one said data movement would be easy. But it’s absolutely essential, as the best way to unlock the tremendous value of data is through distribution and sharing.