Azure Data Factory

  • Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. It’s built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
  • Choosing a data pipeline orchestration technology in Azure

How It Works

  • Ingest: With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis.
  • Transform: After data is present in a centralized data store in the cloud, process or transform the collected data by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.
  • Monitor: Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.
  • Key Concepts
    • Pipeline: A pipeline is a logical grouping of activities that performs a unit of work. It allows you to manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.
    • Activity: A processing step in a pipeline. There are three types of activities: data movement, data transformation, and control activities.
    • Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
    • Linked services are much like connection strings, which define the connection information that's needed for Data Factory to connect to external resources.
    • Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off.
      • Schedule: Daily, weekly, and monthly frequencies, in addition to minute- and hour-based settings.
      • Tumbling window: This trigger type is well suited to automating loads of historical data. It supports the WindowStart and WindowEnd system variables: triggerOutputs().windowStartTime and triggerOutputs().windowEndTime can be referenced in the trigger definition and passed to pipeline activities to process a specific data range (see the sketch after this list).
      • Event based: fires on storage events such as a blob being created or deleted.
    • A pipeline run is an instance of a pipeline execution. Pipeline runs are typically instantiated by passing arguments to the parameters defined in the pipeline (see the sketch after this list).
  • Sample Walkthrough
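
A minimal PowerShell sketch of the trigger and pipeline-run concepts above, assuming hypothetical resource group, factory, pipeline, and parameter names (rg-adf-demo, adf-demo, CopyHistoricalData, windowStart, windowEnd); the expressions and cmdlet parameters should be checked against the current Az.DataFactory documentation:

```powershell
# Placeholder names for illustration only; substitute your own resources.
$rg = "rg-adf-demo"
$df = "adf-demo"

# Tumbling window trigger definition. The trigger passes its window boundaries
# (trigger().outputs.* is the long form of the triggerOutputs() shorthand) to
# the pipeline's windowStart/windowEnd parameters.
@"
{
    "name": "DailyTumblingTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 24,
            "startTime": "2024-01-01T00:00:00Z",
            "maxConcurrency": 1
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "CopyHistoricalData",
                "type": "PipelineReference"
            },
            "parameters": {
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime"
            }
        }
    }
}
"@ | Set-Content -Path ".\DailyTumblingTrigger.json"

# Deploy and start the trigger.
Set-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $df `
    -Name "DailyTumblingTrigger" -DefinitionFile ".\DailyTumblingTrigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $df `
    -Name "DailyTumblingTrigger" -Force

# A pipeline run can also be instantiated on demand by passing arguments to the
# same parameters, then monitored from PowerShell.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $df `
    -PipelineName "CopyHistoricalData" `
    -Parameter @{ windowStart = "2024-01-01T00:00:00Z"; windowEnd = "2024-01-02T00:00:00Z" }
Get-AzDataFactoryV2PipelineRun -ResourceGroupName $rg -DataFactoryName $df `
    -PipelineRunId $runId
```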

Data Movement and Transformations

Connect to On-Premises

  • The Azure Integration Runtime supports connecting to data stores and compute services with publicly accessible endpoints.
  • Use a self-hosted integration runtime (SHIR) to connect to on-premises data sources (see the sketch after this list).
  • Use the SSIS integration runtime to run SSIS packages in ADF. The Azure-SSIS IR can be provisioned in either a public network or a private network. On-premises data access is supported by joining the Azure-SSIS IR to a virtual network that is connected to your on-premises network.
  • Self-Hosted Runtime Architecture Guidance
  • Sample Walkthrough:
  • Command Channel vs Data Channel
    • The command channel allows communication between data movement services in Data Factory and self-hosted integration runtime. The communication contains information related to the activity.
    • The data channel is used for transferring data between on-premises data stores and cloud data stores.
    • A public IP address is used for command channel communications (which take place between the SHIR node and ADF). ADF is not injected into your VNet, so it cannot communicate with your SHIR directly through a private IP address.
    • The self-hosted integration runtime only makes outbound HTTP-based connections to the open internet.
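
As a minimal sketch of the self-hosted option above (hypothetical resource names; the local dmgcmd.exe path varies by installed SHIR version), the runtime is created in the factory and the on-premises node is then registered with its authentication key:

```powershell
# Placeholder names for illustration only.
$rg   = "rg-adf-demo"
$df   = "adf-demo"
$shir = "OnPremSelfHostedIR"

# Create the self-hosted integration runtime resource in the data factory.
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $rg -DataFactoryName $df `
    -Name $shir -Type SelfHosted -Description "Connects to on-premises data stores"

# Retrieve an authentication key. The key is entered into the SHIR setup on the
# on-premises machine; the same key is reused to register additional nodes.
$key = (Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $rg `
    -DataFactoryName $df -Name $shir).AuthKey1

# On the on-premises machine, after installing the SHIR, register the node via
# Microsoft Integration Runtime Configuration Manager or (path varies by version):
#   & "C:\Program Files\Microsoft Integration Runtime\<version>\Shared\dmgcmd.exe" -RegisterNewNode "<AuthKey1>"
```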

Scale out a Self-Hosted Integration Runtime (SHIR)

  • Identify and document the external dependencies for each connection. For example, do your connections depend on an ODBC driver, a DSN, a connection file, a hosts file, a registry key, an environment variable, etc.?
  • Install and configure any additional dependencies on your new node.
  • Consider placing the SHIR node in an availability set
  • Considerations
    • Before you add another node for high availability and scalability, ensure that the Remote access to intranet option is enabled on the first node. To do so, select Microsoft Integration Runtime Configuration Manager > Settings > Remote access to intranet.
    • You can scale out to a maximum of four SHIR nodes.
    • You don't need to create a new self-hosted integration runtime to associate each node. Instead, you can install the self-hosted integration runtime on another machine and register it by using the same authentication key.
  • Configure SHIR
    • The default value of the concurrent jobs limit is set based on the machine size; the calculation depends on the amount of RAM and the number of CPU cores, so the more cores and memory, the higher the default limit. You can override the calculated default in the Azure portal: select Author > Connections > Integration Runtimes > Edit > Nodes > Modify concurrent job value per node.
    • You can also use the Update-AzDataFactoryV2IntegrationRuntimeNode PowerShell cmdlet, as sketched below.
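
A minimal sketch of that PowerShell override, assuming hypothetical resource group, factory, runtime, and node names; the limit value is arbitrary:

```powershell
# Placeholder names for illustration only.
$rg   = "rg-adf-demo"
$df   = "adf-demo"
$shir = "OnPremSelfHostedIR"

# Raise the concurrent jobs limit on one SHIR node. The node name is the machine
# name shown under the integration runtime's Nodes view.
Update-AzDataFactoryV2IntegrationRuntimeNode -ResourceGroupName $rg `
    -DataFactoryName $df -IntegrationRuntimeName $shir `
    -Name "SHIR-NODE-01" -ConcurrentJobsLimit 12
```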

SSIS Integration Runtime

Walkthrough
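
A rough sketch of provisioning an Azure-SSIS IR from PowerShell, assuming hypothetical names and sizing; add -VNetId and -Subnet only when joining a virtual network for on-premises access, as described earlier:

```powershell
# Placeholder names and sizing for illustration only.
$rg   = "rg-adf-demo"
$df   = "adf-demo"
$ssis = "AzureSsisIR"

# Provision a managed (Azure-SSIS) integration runtime. Add -VNetId and -Subnet
# to join a virtual network that is connected to your on-premises network.
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $rg -DataFactoryName $df `
    -Name $ssis -Type Managed -Location "WestEurope" `
    -NodeSize "Standard_D4_v3" -NodeCount 2 `
    -Edition "Standard" -MaxParallelExecutionsPerNode 4

# Start the runtime; provisioning typically takes tens of minutes.
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $rg -DataFactoryName $df `
    -Name $ssis -Force
```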