Module 05 - Classifications


📢 Introduction

In Microsoft Purview, classifications are similar to subject tags and are used to mark and identify data of a specific type found within your data estate during scanning. Classifications help you manage your data better: you can use them to prioritize your data efforts or to improve data security and regulatory compliance. Classifications also improve user productivity and decision-making, and can reduce costs by classifying and finding unused data.

Microsoft Purview provides a large set of default classifications that represent typical data types that might exist in your data estate (e.g. email address, credit card number, passport number). In this module, you will learn how to create a custom classification, which is an alternative when the default classifications don't meet your needs.

🤔 Prerequisites

🔨 Tools

🎯 Objectives

  • Create a custom classification.
  • Trigger a scan that will apply the custom classification to an asset.

📑 Table of Contents

# | Section | Role
1 | Create a Classification | Data Curator
2 | Create a Classification Rule (Regular Expression) | Data Curator
3 | Create a Scan Rule Set | Data Source Admin
4 | Upload Data to an Azure Data Lake Storage Gen2 Account | Azure Administrator
5 | Scan an Azure Data Lake Storage Gen2 Account | Data Source Admin
6 | Search by Classification | Data Reader

1. Create a Classification

  1. Open the Microsoft Purview Governance Portal, navigate to Data map > Classifications (under Annotation management) and click New.

    New Classification

  2. Copy and paste the values below into the appropriate fields and click OK.

    Name

    Twitter Handle
    

    Description

    The username that appears at the end of your unique Twitter URL.
    

    Create Classification

  3. Navigate to the Custom tab to confirm the custom classification has been created.

    Create Classification

2. Create a Custom Classification Rule (Regular Expression)

  1. Navigate to Data map > Classification rules (under Annotation management) and click New.

    New Classification Rule

  2. Populate the classification rule fields as per the example below and click Continue.

    Field | Example Value
    Name | twitter_handle
    Description | The username that appears at the end of your unique Twitter URL.
    Classification name | Twitter Handle
    State | Enabled
    Type | Regular Expression

    💡 Did you know?

    There are two types of classification rules. Regular Expression rules perform pattern matching against the actual data and/or the column name, whereas Dictionary-based rules let you supply a list of all possible values via a CSV or TSV file.

    Regular Expression Classification Rule

  3. Download a copy of twitter_handles.csv to your local machine: open the link in a new tab, right-click within the body of the content, and click Save as.

    CSV Save as

  4. Click the Browse icon and open the local copy of twitter_handles.csv.

    Upload file

  5. Select the data pattern associated with the Handle column and click Add to patterns.

    💡 Did you know?

    Thresholds help minimise the possibility of false-positive classifications. Minimum match threshold is the minimum percentage of data value matches in a column that needs to be found by the scanner for the classification to be applied.

    Pattern Detection

  6. Modify the Data Pattern by replacing the plus symbol (+) with {5,15}.

    • The plus symbol (+) matches one or more occurrences of the preceding item. This may lead to false positives, as it allows an unlimited number of alphanumeric characters. Twitter handles must be a minimum of 5 and a maximum of 15 characters.
    • Using {5,15} ensures that matches only occur where there are at least 5 and at most 15 occurrences of the preceding item.

    Classification Data Pattern

  7. While we can also specify a Column Pattern, in this example we will rely solely on the Data Pattern. Clear the Column Pattern input and click Create.

    Create Classification Rule
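
The effect of the quantifier change in step 6, together with the minimum match threshold, can be sketched in Python. This is an illustration only, not how the Purview scanner is implemented. The two patterns, the sample column values, and the threshold of 60 are assumptions made for the example (the pattern Purview suggests for the Handle column may differ), and the sketch assumes each value must match the pattern in full.

```python
import re

# Assumed patterns for illustration; the pattern Purview detects may differ.
LOOSE = re.compile(r"[A-Za-z0-9_]+")         # '+': one or more characters, no upper bound
BOUNDED = re.compile(r"[A-Za-z0-9_]{5,15}")  # {5,15}: between 5 and 15 characters


def match_percentage(pattern, values):
    """Return the percentage of values that match the pattern in full."""
    if not values:
        return 0.0
    hits = sum(1 for value in values if pattern.fullmatch(value))
    return 100.0 * hits / len(values)


def would_classify(pattern, values, minimum_match_threshold=60.0):
    """Apply the classification only if enough values in the column match."""
    return match_percentage(pattern, values) >= minimum_match_threshold


# Hypothetical column values, not taken from twitter_handles.csv.
handles = ["purview_fan", "data_curator", "azure_lake", "scan_rule_set", "regex_user"]
order_ids = ["A1", "B22", "C333", "D4444", "ORDER_2024_0000000001"]

for name, column in (("handles", handles), ("order_ids", order_ids)):
    print(name, "loose:", would_classify(LOOSE, column),
          "bounded:", would_classify(BOUNDED, column))
```

With these sample values, the loose pattern would classify both columns, while the bounded pattern classifies only the handles column, because just one of the five order IDs is between 5 and 15 characters long.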

3. Create a Scan Rule Set

  1. Navigate to Data map > Scan rule sets (under Source management) and click New.

    💡 Did you know?

    Scan Rule Sets determine which File Types and Classification Rules are in scope. If you want to include a custom file type or custom classification rule as part of a scan, a custom scan rule set will need to be created.

    New Scan Rule Set

  2. Change the Source Type to Azure Data Lake Storage Gen2 then copy and paste the values below into the appropriate fields. Click Continue.

    Scan rule set name

    twitter_scan_rule_set
    

    Scan rule description

    Custom scan rule set to detect parquet files and classify twitter handles.
    

    Scan Rule Set Name

  3. Clear all file type selections with the exception of PARQUET and click Continue.

    Scan Rule Set File Type

  4. Clear all selected System rules, select the custom classification rule twitter_handle, and click Continue.

    Scan Rule Set Classification

  5. Click Create.

    💡 Did you know?

    Ignore patterns tell Microsoft Purview which assets to exclude during scanning. During scanning, Microsoft Purview will compare the asset's URL against these regular expressions. All assets matching any of the regular expressions mentioned will be ignored while scanning.

    Ignore patterns
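
The ignore pattern behaviour described above (an asset is skipped when its URL matches any of the regular expressions) can be sketched as follows. This is a rough illustration under stated assumptions: the function name, the URLs, and the patterns are hypothetical, and the use of `re.search` is a choice made for the example rather than a statement about how Microsoft Purview anchors its matches.

```python
import re


def assets_in_scope(asset_urls, ignore_patterns):
    """Return the asset URLs that match none of the ignore patterns."""
    compiled = [re.compile(pattern) for pattern in ignore_patterns]
    return [
        url for url in asset_urls
        if not any(regex.search(url) for regex in compiled)
    ]


# Hypothetical asset URLs and ignore patterns for illustration.
urls = [
    "https://example.dfs.core.windows.net/raw/Twitter/twitter_handles.parquet",
    "https://example.dfs.core.windows.net/raw/Twitter/_SUCCESS",
    "https://example.dfs.core.windows.net/raw/tmp/scratch.parquet",
]
ignore = [r".*/_SUCCESS$", r".*/tmp/.*"]

for url in assets_in_scope(urls, ignore):
    print(url)
```

Only the first URL remains in scope here, since the other two each match one of the ignore patterns.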

4. Upload Data to an Azure Data Lake Storage Gen2 Account

Before proceeding with the following steps, you will need to:

  • Download and install Azure Storage Explorer.
  • Open Azure Storage Explorer.
  • Sign in to Azure via View > Account Management > Add an account....

Note: If you have not created an Azure Data Lake Storage Gen2 Account, see module 02.

  1. Download a copy of twitter_handles.parquet to your local machine.

  2. Navigate to your Azure Data Lake Storage Gen2 Account, expand Blob Containers, and open the raw container. Note: If a raw container does not exist, create one.

    Open Container

  3. Click on the New Folder button, provide the folder a name (e.g. Twitter) and click OK.

    New Folder

  4. Right-click on the newly created folder and click Open.

    Open Folder

  5. Click on the Upload button and select Upload Files....

    Upload File

  6. Select the local copy of twitter_handles.parquet and click Upload.

    Upload Parquet
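
The lab performs the upload through Azure Storage Explorer. As an alternative, the same steps could be scripted; the following is a minimal sketch, assuming the `azure-identity` and `azure-storage-file-datalake` packages are installed and that your identity has write access to the storage account. The function names `upload_parquet` and `connect` are hypothetical helpers introduced for this example, and the sketch assumes the raw container already exists.

```python
from pathlib import Path


def upload_parquet(service_client, local_path, container="raw", folder="Twitter"):
    """Upload a local file into <container>/<folder> and return its remote path."""
    local_path = Path(local_path)
    file_system = service_client.get_file_system_client(container)
    directory = file_system.get_directory_client(folder)
    directory.create_directory()  # counterpart of the New Folder step
    file_client = directory.get_file_client(local_path.name)
    file_client.upload_data(local_path.read_bytes(), overwrite=True)
    return f"{container}/{folder}/{local_path.name}"


def connect(account_name):
    """Build a Data Lake service client for the given storage account name."""
    # Imported here so the helper above can be read and tried without the SDK.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    return DataLakeServiceClient(
        account_url=f"https://{account_name}.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
```

A call might look like `upload_parquet(connect("<storage-account-name>"), "twitter_handles.parquet")`, where the account name placeholder is replaced with your own Azure Data Lake Storage Gen2 account.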

5. Scan an Azure Data Lake Storage Gen2 Account

  1. Open the Microsoft Purview Governance Portal, navigate to Data map > Sources and click New Scan within the Azure Data Lake Storage Gen2 tile. Note: If you have not registered your Azure Data Lake Storage Gen2 Account, see module 02.

    New Scan

  2. Click Test connection to ensure the credentials have access and click Continue.

    Test Connection

  3. By default, Microsoft Purview will have the parent Azure Data Lake Storage Gen2 account selected and therefore include all paths in scope. To reduce the scope, deselect the parent and select the Twitter folder only. Click Continue.

    Scope Scan

  4. To validate the scope of the custom scan rule set, click View detail.

    Scan Rule Set Details

  5. Confirm that the custom scan rule set includes the PARQUET file type and the custom classification rule twitter_handle. Click OK.

    Verify Scan Rule Set

  6. Select the custom scan rule set twitter_scan_rule_set and click Continue.

    Select Scan Rule Set

  7. Set the scan trigger to Once and click Continue.

    Scan Cadence

  8. Click Save and Run.

    Run Scan

  9. To view the progress of the scan, navigate to Sources and click View details on the Azure Data Lake Storage Gen2 tile.

    Source Details

  10. Periodically click Refresh to update the scan status until it shows Complete. Note: This will take approximately 5 to 10 minutes.

    Scan Progress

6. Search by Classification

  1. Once the scan has completed, perform a wildcard search by typing the asterisk character (*) into the search bar and pressing Enter.

    Wildcard Search

  2. Limit the search results by setting Classification within the filter panel to Twitter Handle. Click on the asset title (twitter_handles.parquet) to view the asset details.

    Filter Classification

  3. You will notice on the Overview tab that the schema includes the Twitter Handle classification. To identify which column has been classified, navigate to the Schema tab.

    Asset Details

  4. Within the Schema tab we can see that Account name is the column that has been classified.

    Asset Schema

🎓 Knowledge Check

https://aka.ms/purviewlab/q05

  1. Which of the following is a valid classification rule type?

    A ) Python
    B ) Regular Expression
    C ) C++

  2. When creating a regular expression based classification rule, you must specify a Data Pattern AND a Column Pattern.

    A ) True
    B ) False

  3. Custom classifications are automatically in scope of a system default scan rule set.

    A ) True
    B ) False

🎉 Summary

This module provided an overview of how to create a custom classification, and how to have the classification automatically applied as part of a scan using a custom scan rule set.
