In this repository, we will compare the various compute options we can leverage to copy files in Azure Data Factory. The compute options we will compare are the Azure Integration Runtime, the Self-Hosted Integration Runtime (SHIR) on Azure VMs, and the Managed VNet Integration Runtime. It is important to note that we will be copying files between two cloud data stores. If the data store were located on-premises, installing the SHIR on commodity hardware on-premises would be more appropriate.
Our comparison will cover the following aspects:
- Performance
- Cost
- Operational aspects
The copy alternatives we will cover are:
- Using Azure Data Factory Copy Activity
- Using an Azure Data Factory Web Activity to call a REST API - we will leverage Azure Container Apps (ACA) to create the REST API.
What will be covered?
- How Azure Data Factory pricing works.
- Use cases requiring a copy activity.
- A detailed view of the experiment setup.
- A brief overview of the different compute options.
- A detailed view of the experiment results.
- Conclusion
Ready to get started? Let's go!
The documentation provides a good overview, with examples, of how the cost* is calculated. The key points are:
- Orchestration cost is calculated based on the number of activities and pipeline runs.
- Compute cost is calculated based on duration, measured in Data Integration Unit (DIU)-hours.

*Cost - covering the pipeline execution and the compute resources used to run the pipeline.
With the number of activity runs in mind, it is suggested to reduce the number of activities in a pipeline. This is because activity execution time is rounded up to the minute: an activity that executes in 45 seconds is billed as 1 minute. The more activities you have in a pipeline, the more you will pay. For example, if you need to copy 100 files, using a single copy activity to copy all 100 files will be cheaper than looping over 100 copy activities that copy one file each.
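A quick back-of-the-envelope model makes the difference concrete. The sketch below is illustrative only: the $1 per 1000 activity runs and $0.25 per DIU-hour figures are assumed list prices, so check the Azure Data Factory pricing page for current values in your region.

```python
import math

# Assumed list prices (USD), for illustration only.
ORCHESTRATION_PER_1000_RUNS = 1.00   # activity/trigger runs
DIU_HOUR_PRICE = 0.25                # data movement on the Azure IR

def copy_activity_cost(duration_sec: float, diu: int = 4) -> float:
    """Cost of one copy activity: duration is rounded UP to a full minute."""
    billed_minutes = math.ceil(duration_sec / 60)
    diu_hours = billed_minutes * 60 * diu / 3600
    run_cost = ORCHESTRATION_PER_1000_RUNS / 1000
    return run_cost + diu_hours * DIU_HOUR_PRICE

# One activity copying 100 files in 90 seconds ...
single = copy_activity_cost(90)
# ... versus a ForEach loop running 100 short activities of ~2 seconds each,
# every one of which is billed as a full minute.
looped = 100 * copy_activity_cost(2)

print(f"single activity: ${single:.4f}, 100 activities: ${looped:.4f}")
```

Because every looped activity pays both the per-run orchestration charge and a full billed minute of compute, the batched copy comes out far cheaper.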
You receive data from multiple sources, which are all landing in your storage account. You need to copy this data from the landing zone to your data lake.
The following resources were created:
- Two storage accounts - source and destination. In most cases both would be within the same region.
- Sample data - we created a sample container app that generates the sample files. Each file is 21 KB in size. We used it to create containers with 1000, 2000, 5000, and 10,000 files in our source storage account.
- Azure Data Factory - we created multiple pipelines to test the different scenarios, all triggered manually. The pipelines are as follows:
  - Pipeline using a Copy Activity - copying files from source to destination using a Copy Activity. We leveraged pipeline parameters to change the source/target of each execution. We have two instances of this pipeline: one using the Azure Integration Runtime and the other using the Self-Hosted Integration Runtime (SHIR).
  - Pipeline using a Web Activity - copying files from source to destination using a Web Activity + ACA.
- Azure Container Apps - hosting a REST API to copy files.
  - Copy using ACA - copying files from source to destination using a Web Activity that calls the REST API. We leveraged pipeline parameters to change the source/target of each execution.
  - Using an external call is cost-effective when looping over a large number of small items. In the context of the experiment, we leveraged Azure Container Apps to host a REST API that copies files from one location to another. The API was implemented using the 202 Accepted pattern, and the pipeline was configured to ignore the async response. This means the time taken to copy the files is not included in the pipeline duration, so it was not considered in the experiment results.
- Two SHIR nodes - we used a quickstart template to create the nodes. The template can be found here. We used Standard A4 v2 (4 vCPUs, 8 GiB memory) VMs.

NOTE: The experiment does not support full network isolation, as this was not part of the scope of the experiment.

- Managed Identity - used by the Azure Container Apps to access the storage accounts and Key Vault.
- Key Vault - stores the connection strings for the storage accounts, used by the Container Apps.
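The 202 Accepted pattern used by the ACA REST API can be sketched with a minimal in-memory job manager. All names here (`submit_copy`, `statusQueryGetUri`, and so on) are hypothetical stand-ins for illustration, not the actual API from the experiment:

```python
import threading
import uuid

# In-memory job store standing in for the REST API's state.
jobs: dict[str, str] = {}

def copy_files(job_id: str) -> None:
    """Placeholder for the actual blob-copy work."""
    jobs[job_id] = "Succeeded"

def submit_copy(source: str, destination: str) -> tuple[int, dict]:
    """POST handler: start the copy in the background, answer immediately.

    Returns 202 Accepted plus a status URL, so the caller (the ADF Web
    Activity) is not billed for the time the copy actually takes.
    """
    job_id = str(uuid.uuid4())
    jobs[job_id] = "Running"
    threading.Thread(target=copy_files, args=(job_id,)).start()
    return 202, {"statusQueryGetUri": f"/jobs/{job_id}"}

def get_status(job_id: str) -> tuple[int, dict]:
    """GET handler: 202 while the job is still running, 200 once it is done."""
    state = jobs.get(job_id, "NotFound")
    code = 202 if state == "Running" else 200
    return code, {"status": state}

status, body = submit_copy("landing/1000files", "datalake/1000files")
```

In the experiment the pipeline simply ignored the async response rather than polling the status URL, which is why the copy time does not appear in the pipeline duration.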
The most common compute option for Azure Data Factory. It is a managed compute service, hosted in Azure, that can connect to public data stores. If you need to connect to private data stores, you can allow-list the IP address ranges published for the service, but in many scenarios this is not desirable from a security perspective. It runs the copy activity, among many other activities, and is a shared resource: multiple pipelines can use the same Azure Integration Runtime.
NOTE: If data needs to be copied into (or from) a virtual network, we encourage you to use either SHIR or Azure Managed VNet Integration Runtime. The Azure Integration Runtime is not supported in a virtual network.
The same service can be hosted by you on your own compute. Users can create the integration runtime service on standalone compute or reuse existing capacity. In this experiment, we used a dedicated two-node cluster to run the copy activity. We used this Quickstart to create all required resources for the SHIR.
The Managed VNet Integration Runtime is a compute option that is managed and hosted within your virtual network in Azure. This allows you to create a private endpoint between the IR and your data store. When choosing the Managed VNet IR, you get a secure, fully managed, fully isolated, and highly available compute option.
Result values shown in the tables below were taken from the pipeline run details and consumption report.
We show both the estimated cost per single pipeline run and the cost for 1000 runs of a pipeline in a month. All prices are in USD. We used 'West Europe' as the region for all resources. Pricing details are taken from here.
Experiment | DIU | Activity Duration [sec] | Activity Runs | DIU-Hours | Cost/Run | Cost/1000 Runs |
---|---|---|---|---|---|---|
1000 Files | 4 | 26 | 1 | 0.0667 | 0.027 | 17.71 |
2000 Files | 4 | 42 | 1 | 0.0667 | 0.027 | 17.73 |
5000 Files | 4 | 78 | 1 | 0.1333 | 0.043 | 34.43 |
10000 Files | 4 | 180 | 1 | 0.2 | 0.06 | 51.25 |
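The DIU-Hours column in the table above can be reproduced from the activity duration: the duration is rounded up to a full minute and multiplied by the DIU count. A small sketch of that calculation:

```python
import math

def diu_hours(duration_sec: int, diu: int = 4) -> float:
    """Copy-activity compute is billed per minute, rounded up."""
    billed_seconds = math.ceil(duration_sec / 60) * 60
    return billed_seconds * diu / 3600

# Durations from the experiment runs above.
for seconds in (26, 42, 78, 180):
    print(seconds, round(diu_hours(seconds), 4))
```

Both the 26-second and 42-second runs land on the same 0.0667 DIU-hours, since both round up to one billed minute.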
Experiment | Activity Duration [sec] | Activity Runs | External Activity Runs | Cost/Run | Cost/1000 Runs |
---|---|---|---|---|---|
1000 Files | 14 | 1 | 0.0167 | 0.01 | 1.02 |
2000 Files | 14 | 1 | 0.0167 | 0.01 | 1.02 |
5000 Files | 14 | 1 | 0.0167 | 0.01 | 1.02 |
10000 Files | 14 | 1 | 0.0167 | 0.01 | 1.02 |
Note: The cost of the ACA is not included in the experiment results. The pricing calculator can be found here.
X-Compute time is the time taken to run the copy activity on the SHIR nodes. With the VMs we used, the compute cost was $0.01 per minute for the two nodes.
Experiment | Activity Duration [sec] | Activity Runs | External Activity Runs | SHIR Runs | X-Compute Cost | Cost/Run | Cost/1000 Runs |
---|---|---|---|---|---|---|---|
1000 Files | 55 | 1 | 0.0167 | 1 | 0.01 | 0.02 | 11.53 |
2000 Files | 96 | 1 | 0.0333 | 1 | 0.02 | 0.030 | 21.55 |
5000 Files | 197 | 1 | 0.0667 | 1 | 0.04 | 0.05 | 41.61 |
10000 Files | 397 | 1 | 0.1333 | 1 | 0.05 | 0.06 | 51.73 |
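Assuming the node time is billed per started minute at the rate above (an assumed rounding model, which reproduces most rows of the table), the X-Compute Cost column can be sketched as:

```python
import math

VM_RATE_PER_MINUTE = 0.01  # two Standard A4 v2 nodes, from the experiment

def x_compute_cost(duration_sec: int) -> float:
    """VM cost attributed to one run, billed per started minute
    (assumption; the experiment table was the reference)."""
    return math.ceil(duration_sec / 60) * VM_RATE_PER_MINUTE

# Durations of the 1000-, 2000-, and 5000-file runs above.
for seconds in (55, 96, 197):
    print(seconds, round(x_compute_cost(seconds), 2))
```

Unlike the Azure IR, this VM cost accrues whether or not a pipeline is running, so idle time between runs also has to be paid for.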
For the Managed VNet IR, we used the copy-activity durations measured on the Azure IR, adding the cluster startup time.
Experiment | DIU | Activity Duration [sec] | Activity Runs | DIU-Hour | Cluster Startup [sec] | Cost/1000 Runs |
---|---|---|---|---|---|---|
1000 Files | 4 | 26 | 1 | 0.0667 | 60 | 41.56 |
2000 Files | 4 | 42 | 1 | 0.0667 | 60 | 46.01 |
5000 Files | 4 | 78 | 1 | 0.1333 | 60 | 72.66 |
10000 Files | 4 | 180 | 1 | 0.2 | 60 | 117.67 |
The first conclusion is that it is always better to test your hypothesis before reaching conclusions. Our hypothesis was that using the Self-Hosted Integration Runtime would be the most cost-effective approach. The experiment results show that this is not always the case.
The second conclusion is that each workload must be examined individually. When choosing your compute option, you need to understand the cost elements, extrapolate with your own data, and choose the best option for your workload.
In many cases it is easy to switch between the Azure IR and the Managed VNet IR. Using SHIR, however, brings additional cost factors and, depending on your company or project, the overhead of maintaining the SHIR nodes.
In the specific use case of processing a large number of individual files, using a copy activity could be less effective than delegating the work to another compute option, such as Azure Container Apps or Azure Functions. This does bring coding complexity into the pipeline, but it can be a good option to consider.