- Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. It’s built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
- Choosing a data pipeline orchestration technology in Azure
- Ingest: With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis.
- Transform: After data is present in a centralized data store in the cloud, process or transform the collected data by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.
- Monitor: Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.
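- As a minimal sketch of that PowerShell monitoring surface (assuming the Az.DataFactory module is installed and you are signed in; the resource group and factory names are placeholders):

```powershell
# Assumes the Az.DataFactory module is installed and you are signed in (Connect-AzAccount).
$rg      = "my-rg"            # placeholder resource group name
$factory = "my-data-factory"  # placeholder data factory name

# List pipeline runs updated in the last day
$runs = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $rg -DataFactoryName $factory `
    -LastUpdatedAfter (Get-Date).AddDays(-1) -LastUpdatedBefore (Get-Date)

# Drill into the activity runs of the first returned pipeline run
Get-AzDataFactoryV2ActivityRun -ResourceGroupName $rg -DataFactoryName $factory `
    -PipelineRunId $runs[0].RunId `
    -RunStartedAfter (Get-Date).AddDays(-1) -RunStartedBefore (Get-Date)
```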
- Key Concepts
- Pipeline: A pipeline is a logical grouping of activities that performs a unit of work. It allows you to manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.
- Activity: A processing step in a pipeline. There are three types: data movement, data transformation, and control activities.
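- To illustrate how activities are chained, a minimal sketch of a pipeline definition with two placeholder Wait activities, where the second runs only after the first succeeds; activities without a dependsOn entry run in parallel. All names are placeholders:

```powershell
# Minimal pipeline JSON: ActivityB runs only after ActivityA succeeds (sequential chaining).
$pipelineJson = @'
{
    "name": "DemoChainedPipeline",
    "properties": {
        "activities": [
            {
                "name": "ActivityA",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 5 }
            },
            {
                "name": "ActivityB",
                "type": "Wait",
                "dependsOn": [
                    { "activity": "ActivityA", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": { "waitTimeInSeconds": 5 }
            }
        ]
    }
}
'@
Set-Content -Path .\DemoChainedPipeline.json -Value $pipelineJson
Set-AzDataFactoryV2Pipeline -ResourceGroupName "my-rg" -DataFactoryName "my-data-factory" `
    -Name "DemoChainedPipeline" -DefinitionFile ".\DemoChainedPipeline.json"
```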
- Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
- Linked services are much like connection strings, which define the connection information that's needed for Data Factory to connect to external resources.
- To represent a data store
- To represent a compute resource that can host the execution of an activity
- It's recommended that you store linked service credentials in Azure Key Vault
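- One possible shape of that pattern, sketched with placeholder names: an Azure SQL Database linked service whose connection string is resolved from a Key Vault secret at runtime. It assumes a Key Vault linked service named AzureKeyVaultLS and a secret named sql-connection-string already exist:

```powershell
# Assumes a Key Vault linked service "AzureKeyVaultLS" already exists in the factory
# and that the vault holds a secret named "sql-connection-string" (both are placeholders).
$lsJson = @'
{
    "name": "AzureSqlDatabaseLS",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVaultLS",
                    "type": "LinkedServiceReference"
                },
                "secretName": "sql-connection-string"
            }
        }
    }
}
'@
Set-Content -Path .\AzureSqlDatabaseLS.json -Value $lsJson
Set-AzDataFactoryV2LinkedService -ResourceGroupName "my-rg" -DataFactoryName "my-data-factory" `
    -Name "AzureSqlDatabaseLS" -DefinitionFile ".\AzureSqlDatabaseLS.json"
```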
- Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off.
- Schedule: Daily, weekly, and monthly frequencies, in addition to minute- and hour-based settings
- Tumbling window: This trigger type is well suited to automating historical data loads. It supports the WindowStart and WindowEnd system variables: in the trigger definition you can reference triggerOutputs().windowStartTime and triggerOutputs().windowEndTime and pass them to pipeline parameters so that activities process only that window's data range (see the sketch after this list).
- Event-based: Fires in response to a storage event, such as a blob being created or deleted.
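- A sketch of a tumbling window trigger that passes the window boundaries to pipeline parameters; the pipeline name, parameter names, and start time are placeholders:

```powershell
# Tumbling window trigger: one 24-hour window per run; the window boundaries are passed
# to the target pipeline's windowStart/windowEnd parameters. Names and times are placeholders.
$triggerJson = @'
{
    "name": "DailyTumblingTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 24,
            "startTime": "2024-01-01T00:00:00Z",
            "maxConcurrency": 2
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "CopyDailySlice",
                "type": "PipelineReference"
            },
            "parameters": {
                "windowStart": "@{trigger().outputs.windowStartTime}",
                "windowEnd": "@{trigger().outputs.windowEndTime}"
            }
        }
    }
}
'@
Set-Content -Path .\DailyTumblingTrigger.json -Value $triggerJson
Set-AzDataFactoryV2Trigger -ResourceGroupName "my-rg" -DataFactoryName "my-data-factory" `
    -Name "DailyTumblingTrigger" -DefinitionFile ".\DailyTumblingTrigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName "my-rg" -DataFactoryName "my-data-factory" `
    -Name "DailyTumblingTrigger" -Force
```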
- A pipeline run is an instance of a pipeline execution. Pipeline runs are typically instantiated by passing arguments to the parameters that are defined in the pipeline.
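- A minimal sketch of starting a run on demand and passing arguments to pipeline parameters (the pipeline and parameter names match the placeholder trigger sketch above):

```powershell
# Start a run on demand, passing values for the pipeline's parameters, then check its status.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "my-rg" -DataFactoryName "my-data-factory" `
    -PipelineName "CopyDailySlice" `
    -Parameter @{ windowStart = "2024-01-01T00:00:00Z"; windowEnd = "2024-01-02T00:00:00Z" }

Get-AzDataFactoryV2PipelineRun -ResourceGroupName "my-rg" -DataFactoryName "my-data-factory" `
    -PipelineRunId $runId
```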
- Sample Walkthrough
- Create a pipeline manually
- Create a pipeline through copy data wizard
- Copy activity in Azure Data Factory
- Copy activity performance and scalability guide
- List of data stores that Copy Activity supports as sources and sinks
- Parameterize linked services in Azure Data Factory
- You can copy data from a REST endpoint by using Azure Data Factory
- The REST connector specifically supports copying data from RESTful APIs
- External Transformations (code): Transform staged data
- You can hand-code transformations and manage the external compute environment yourself.
- Transform data in Azure Data Factory
- Spark activity
- Transform un-staged data with the Stored Procedure Activity/Stored Procedure sink of the Copy Activity
- When copying data into Azure SQL Database or SQL Server, you can configure the SqlSink in the copy activity to invoke a stored procedure (see the sketch below)
- Steps to invoke a stored procedure from a SQL sink
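- A sketch of what the sink side of such a copy activity can look like; the stored procedure name, table type, and parameter names are illustrative assumptions, and the rest of the activity (source and dataset references) is elided:

```powershell
# Fragment of a copy activity sink (SqlSink): each batch of rows is handed to a stored
# procedure instead of being bulk-inserted into a table. The procedure name, table type,
# and parameter names below are illustrative placeholders.
$sqlSinkFragment = @'
"sink": {
    "type": "SqlSink",
    "sqlWriterStoredProcedureName": "spUpsertSourceData",
    "sqlWriterTableType": "SourceDataType",
    "storedProcedureTableTypeParameterName": "sourceData",
    "storedProcedureParameters": {
        "loadDate": { "value": "2024-01-01", "type": "String" }
    }
}
'@
```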
- Native Transformations (Codeless): Transform data with the Mapping Data Flow
- Allows you to build visual, code-free, data transformations within ADF
- Supports flexible schemas. You can derive columns using template expression patterns based on name and type matching instead of defining static field names.
- You may need to first stage the data into a supported source (such as Data Lake or Azure SQL Database staging tables)
- Sample Walkthrough
- Native Transformations (Codeless): Transform data with the Wrangling Data Flow
- The Azure Integration Runtime supports connecting to data stores and compute services with publicly accessible endpoints.
- Use a self-hosted integration runtime to connect to on-premises data sources
- Use the SSIS integration runtime to run SSIS packages in ADF. The Azure-SSIS IR can be provisioned in either a public network or a private network. On-premises data access is supported by joining the Azure-SSIS IR to a virtual network that is connected to your on-premises network (see the provisioning sketch below).
- Steps to create an Azure-SSIS Integration Runtime in Azure Data Factory
- Steps to migrate SSISDB to Azure SQL Database managed instance
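- A sketch of provisioning and starting an Azure-SSIS IR with PowerShell; the node size, node count, edition, and catalog server endpoint are placeholder choices:

```powershell
# Placeholder values throughout; the catalog endpoint is the Azure SQL logical server
# that will host SSISDB. For on-premises data access, -VNetId and -Subnet can also be
# supplied to join the IR to a virtual network connected to your on-premises network.
$cred = Get-Credential   # admin credential for the catalog server

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "my-rg" -DataFactoryName "my-data-factory" `
    -Name "AzureSsisIR" -Type Managed -Location "EastUS" `
    -NodeSize "Standard_D4_v3" -NodeCount 2 -MaxParallelExecutionsPerNode 4 `
    -Edition "Standard" -LicenseType "LicenseIncluded" `
    -CatalogServerEndpoint "myserver.database.windows.net" `
    -CatalogAdminCredential $cred -CatalogPricingTier "S1"

# Provisioning only defines the IR; start it before running packages (startup can take 20-30 minutes).
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "my-rg" -DataFactoryName "my-data-factory" `
    -Name "AzureSsisIR" -Force
```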
- Self-Hosted Runtime Architecture Guidance
- Sample Walkthrough:
- Command Channel vs Data Channel
- The command channel allows communication between data movement services in Data Factory and self-hosted integration runtime. The communication contains information related to the activity.
- The data channel is used for transferring data between on-premises data stores and cloud data stores.
- A public IP address is used for command channel communications (which take place between the SHIR node and ADF). ADF is not injected into your VNet, so it cannot communicate with your SHIR directly through a private IP address.
- The self-hosted integration runtime makes only outbound HTTP-based connections to the open internet.
- Scale out a Self-Hosted Integration Runtime (SHIR)
- Identify and document external dependencies for each connection. For example, do your connections depend on an ODBC driver, DSN, connection file, host file, registry key, environment variable, etc.?
- Install and configure any additional dependencies on your new node
- Consider placing the SHIR node in an availability set
- Considerations
- Before you add another node for high availability and scalability, ensure that the Remote access to intranet option is enabled on the first node. To do so, select Microsoft Integration Runtime Configuration Manager > Settings > Remote access to intranet
- A single self-hosted integration runtime can scale out to a maximum of four nodes
- You don't need to create a new self-hosted integration runtime to associate each node. Instead, you can install the self-hosted integration runtime on another machine and register it by using the same authentication key.
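- As a sketch, the authentication key can be retrieved with PowerShell and then supplied during registration on the additional machine (names are placeholders; the registration step itself happens on the new node):

```powershell
# Retrieve the existing SHIR's authentication keys; either key can be used to register
# an additional node against the same integration runtime. Names are placeholders.
$keys = Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName "my-rg" `
    -DataFactoryName "my-data-factory" -Name "MySelfHostedIR"
$keys.AuthKey1

# On the new machine, install the self-hosted integration runtime and supply AuthKey1
# (or AuthKey2) during registration; the machine then joins the existing IR as another
# node instead of creating a new integration runtime.
```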
- Configure SHIR
- The default value of the concurrent jobs limit is set based on the machine size. The factors used to calculate this value depend on the amount of RAM and the number of CPU cores of the machine. So the more cores and the more memory, the higher the default limit of concurrent jobs. You can override the calculated default value in the Azure portal. Select Author > Connections > Integration Runtimes > Edit > Nodes > Modify concurrent job value per node.
- You can also use the PowerShell Update-AzDataFactoryV2IntegrationRuntimeNode cmdlet.
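- A sketch of that cmdlet call, with placeholder names and an arbitrary limit:

```powershell
# Override the calculated concurrent jobs limit on one node of the SHIR (placeholder names).
Update-AzDataFactoryV2IntegrationRuntimeNode -ResourceGroupName "my-rg" `
    -DataFactoryName "my-data-factory" -IntegrationRuntimeName "MySelfHostedIR" `
    -Name "Node_1" -ConcurrentJobsLimit 12
```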
- Overview
- Pricing
- Architecture info
- Edition Info
- Create ADF + SSIS Integration Runtime (tutorial)
- Connect SSIS runtime to on-premises environment
- Use SSMS to connect to the SSIS Catalog (this requires specifying SSISDB as the database to connect to in the connection options)
- Deploy an SSIS package through SSDT. See section "To deploy a project to the Integration Services Server"
- Monitor SSIS packages
- Run a package
- Run an SSIS package with the Execute SSIS Package Activity in Azure Data Factory
- Schedule SSIS Packages
- Connect to data sources and file shares with Windows Authentication from SSIS packages in Azure
- Install any third-party drivers.
- Customize setup for the Azure-SSIS integration runtime