This repository contains workshop content allowing users to gain hands-on experience with Azure Data Factory Mapping Data Flows (or Synapse Pipeline Mapping Data Flows).


Azure-Data-Factory-Mapping-Data-Flow-Workshop

Use this repository for hands-on training of Azure Data Factory Mapping Data Flows capabilities.

What is Azure Data Factory?

Azure Data Factory is Azure's cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management.

❕ Another product within Azure, Azure Synapse Pipelines, is mostly synonymous with Azure Data Factory. While this repository uses Azure Data Factory for demonstration purposes, the lessons and concepts apply to Azure Synapse Pipelines as well.

What are Azure Data Factory mapping data flows?

This repository focuses on the mapping data flows feature within Azure Data Factory. Mapping data flows allow data engineers to develop data transformation logic without writing code (visual ETL). The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using existing Azure Data Factory scheduling, control flow, and monitoring capabilities.

Mapping data flows provide an entirely visual experience with no coding required. Your data flows run on ADF-managed execution clusters for scaled-out data processing. Azure Data Factory handles all the code translation, path optimization, and execution of your data flow jobs.
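Under the hood, each mapping data flow is invoked as an Execute Data Flow activity inside a pipeline. A minimal sketch of what such a pipeline definition looks like, expressed here as a Python dict mirroring the pipeline JSON (the pipeline, activity, and data flow names are invented placeholders, not from this workshop):

```python
import json

# Hypothetical pipeline definition that runs a mapping data flow.
# "pl_run_data_flow", "RunDataFlow", and "MyDataFlow" are placeholder names.
pipeline = {
    "name": "pl_run_data_flow",
    "properties": {
        "activities": [
            {
                "name": "RunDataFlow",
                # "ExecuteDataFlow" is the activity type ADF uses for mapping data flows
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "MyDataFlow",
                        "type": "DataFlowReference",
                    },
                    # Spark cluster sizing for the data flow run
                    "compute": {"computeType": "General", "coreCount": 8},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Because the activity is just another pipeline activity, it can be combined with triggers, parameters, and monitoring like any copy or control-flow activity.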

🤔 Prerequisites

  • An Azure account with an active subscription. Note: If you don't have access to an Azure subscription, you may be able to start with a free account.
  • You must have the necessary privileges within your Azure subscription to create resources, perform role assignments, and register resource providers (if required).

🧪 Lab Environment Setup

📚 Learning Modules

  1. Create Integration Runtime
  2. Create Linked Services
  3. Two Ways to Do a Basic Copy
  4. Joins
  5. Slowly Changing Dimensions
  6. Change Data Capture Storage to SQL (module planned)
  7. Medallion Architecture: Bronze Layer
  8. Medallion Architecture: Silver Layer
  9. Medallion Architecture: Gold Layer
  10. Medallion Architecture: Consumption Layer
  11. Troubleshooting
  12. Best Practices

📚 Optional Learning Modules

  1. SAP Change Data Capture
  2. Working with pipeline templates

📚 Medallion Architecture

In a medallion architecture, data is organized into layers:

  • Bronze Layer: Holds raw data.
  • Silver Layer: Contains cleansed data.
  • Gold Layer: Stores aggregated data that's useful for business analytics.
  • Consumption Layer: Applications and data integrations read from the gold layer and may optionally create purpose-built versions of the data for their use case. This layer may reside in a transactional database used by the application or in another analytical store, or be exposed as an API or other technology.

In this model, data is democratized: all or most services that work with a dataset connect to a single underlying data source to ensure consistency. Integrated, row-level security is typically built in to allow for maximum data asset reuse.
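One common convention, not prescribed by this workshop, is to give each medallion layer its own container or folder in the data lake. A hedged sketch of such a layout, with an invented storage account name and paths:

```python
# Hypothetical ADLS Gen2 layout for the medallion layers described above.
# The storage account name ("mydatalake") and folder paths are illustrative only.
ACCOUNT = "mydatalake"

layers = {
    "bronze": f"abfss://bronze@{ACCOUNT}.dfs.core.windows.net/sales/raw/",
    "silver": f"abfss://silver@{ACCOUNT}.dfs.core.windows.net/sales/cleansed/",
    "gold":   f"abfss://gold@{ACCOUNT}.dfs.core.windows.net/sales/aggregated/",
}

for layer, path in layers.items():
    print(f"{layer:>6}: {path}")
```

Keeping one container per layer makes it easy to apply different retention policies and access controls at each stage.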

In this lab, the following concepts are covered, by layer:

  • Bronze Layer
    • Data Ingestion
    • Data Retention Policy
  • Silver Layer
    • De-duplication
    • Data quality assertions
    • Cast data types
    • Joins
    • Reroute errors
    • Schema drift
  • Gold Layer
    • Calculated value(s)
    • Because the source includes both general and confidential attributes, the data is written to two sinks, once for consumption of general data and once for consumption of confidential data.
      • Sink for general sensitivity: selected attributes that are confirmed to be available for general use are included using explicit column mapping.
      • Sink for confidential sensitivity: all attributes are passed through using schema drift and auto-mapping.
  • Consumption Layer
    • Read gold layer, and sink aggregate dataset with a new calculated column
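Conceptually, the silver-, gold-, and consumption-layer steps above map onto familiar dataframe operations. A minimal pandas sketch for orientation only; the column names and data are invented, and the workshop itself builds these steps in the no-code data flow designer, not in Python:

```python
import pandas as pd

# Invented data standing in for the bronze layer (raw, string-typed, with a duplicate).
bronze = pd.DataFrame({
    "customer_id": ["1", "1", "2"],
    "region_id": [10, 10, 20],
    "salary": ["50000", "50000", "72000"],  # a confidential attribute
})
regions = pd.DataFrame({"region_id": [10, 20], "region": ["East", "West"]})

# Silver: de-duplicate, cast data types, join in reference data.
silver = (
    bronze.drop_duplicates()
          .astype({"customer_id": "int64", "salary": "float64"})
          .merge(regions, on="region_id", how="left")
)

# Gold: add a calculated value, then split into two sinks by sensitivity.
gold = silver.assign(salary_band=lambda d: d["salary"] // 25000)
general_sink = gold[["customer_id", "region"]]  # explicit columns only
confidential_sink = gold                         # all attributes pass through

# Consumption: read gold and sink an aggregate with a new calculated column.
consumption = gold.groupby("region", as_index=False).agg(avg_salary=("salary", "mean"))

print(len(silver), list(general_sink.columns), len(consumption))
```

Each chained operation here corresponds roughly to one transformation shape in the mapping data flow canvas (aggregate, derived column, join, select, sink).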

🔗 References

🔗 Workshop URL

https://github.com/adhazel/Azure-Data-Factory-Mapping-Data-Flow-Workshop
