In Microsoft Purview, classifications are similar to subject tags, and are used to mark and identify data of a specific type that's found within your data estate during scanning. Classifications help you to better manage your data. You can use them to prioritize your data efforts or to improve data security and regulatory compliance. Classifications also improve user productivity and decision-making, and allow you to reduce costs by classifying and finding unused data.
Microsoft Purview provides a large set of default classifications that represent typical data types that might exist in your data estate (e.g. email address, credit card number, passport number). In this module you will learn how to create a custom classification, which can be used as an alternative when the default classifications don't meet your needs.
- An Azure account with an active subscription.
- An Azure Data Lake Storage Gen2 Account (see module 00).
- A Microsoft Purview account (see module 01).
- Create a custom classification.
- Trigger a scan that will apply the custom classification to an asset.
# | Section | Role |
---|---|---|
1 | Create a Classification | Data Curator |
2 | Create a Classification Rule (Regular Expression) | Data Curator |
3 | Create a Scan Rule Set | Data Source Admin |
4 | Upload Data to an Azure Data Lake Storage Gen2 Account | Azure Administrator |
5 | Scan an Azure Data Lake Storage Gen2 Account | Data Source Admin |
6 | Search by Classification | Data Reader |
-
Open the Microsoft Purview Governance Portal, navigate to Data map > Classifications (under Annotation management) and click New.
-
Copy and paste the values below into the appropriate fields and click OK.
Name
Twitter Handle
Description
The username that appears at the end of your unique Twitter URL.
-
Navigate to the Custom tab to confirm the custom classification has been created.
-
Navigate to Data map > Classification rules (under Annotation management) and click New.
-
Populate the classification rule fields as per the example below and click Continue.
Field | Example Value |
---|---|
Name | twitter_handle |
Description | The username that appears at the end of your unique Twitter URL. |
Classification name | Twitter Handle |
State | Enabled |
Type | Regular Expression |
💡 Did you know?
There are two types of classification rules. Regular Expression rules perform pattern matching against the actual data and/or the column name, whereas Dictionary-based rules allow us to supply a list of all possible values via a CSV or TSV file.
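To make the difference between the two rule types concrete, here is a minimal Python sketch. It does not call Microsoft Purview and is not how the scanner is implemented; the sample pattern, the dictionary values, and the function names are all assumptions made up for illustration.

```python
import csv
import io
import re

# Hypothetical regular expression: "@" followed by letters, digits or underscores.
HANDLE_PATTERN = re.compile(r"@[A-Za-z0-9_]+")

# Hypothetical dictionary file content (one allowed value per row), inlined
# here so the example runs without any file on disk.
DICTIONARY_CSV = "value\n@contoso\n@fabrikam\n"


def load_dictionary(csv_text: str) -> set:
    """Read the allowed values from CSV text with a single 'value' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["value"] for row in reader}


def matches_regex(value: str) -> bool:
    """Regular Expression style: the value must fit the pattern."""
    return HANDLE_PATTERN.fullmatch(value) is not None


def matches_dictionary(value: str, allowed: set) -> bool:
    """Dictionary style: the value must be one of the listed values."""
    return value in allowed


allowed_values = load_dictionary(DICTIONARY_CSV)
print(matches_regex("@any_handle_123"))                       # fits the pattern
print(matches_dictionary("@any_handle_123", allowed_values))  # not in the list
print(matches_dictionary("@contoso", allowed_values))         # in the list
```

A regular expression suits values that follow a shape but cannot be enumerated (such as handles), while a dictionary suits a closed, known list of values.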
-
Download a copy of twitter_handles.csv to your local machine by opening the link in a new tab, right-clicking within the body of the content, and clicking Save as.
-
Click the Browse icon and open the local copy of twitter_handles.csv.
-
Select the data pattern associated with the Handle column and click Add to patterns.
💡 Did you know?
Thresholds help minimise the possibility of false-positive classifications. Minimum match threshold is the minimum percentage of data value matches in a column that needs to be found by the scanner for the classification to be applied.
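The idea behind the minimum match threshold can be sketched in a few lines of Python. This is only an illustration of the concept described above, not Microsoft Purview's implementation; the pattern, the threshold value of 60, and the sample column values are assumptions chosen for the example.

```python
import re

# Hypothetical data pattern and threshold, chosen only for this illustration.
HANDLE_PATTERN = re.compile(r"@[A-Za-z0-9_]+")
MINIMUM_MATCH_THRESHOLD = 60.0  # percent


def match_percentage(values: list) -> float:
    """Return the percentage of values that fully match the pattern."""
    if not values:
        return 0.0
    matched = sum(1 for v in values if HANDLE_PATTERN.fullmatch(v))
    return 100.0 * matched / len(values)


def should_classify(values: list, threshold: float = MINIMUM_MATCH_THRESHOLD) -> bool:
    """Apply the classification only if enough values in the column match."""
    return match_percentage(values) >= threshold


# Made-up column samples: one mostly handles, one mostly free text.
handles = ["@purview_lab", "@data_curator", "@scan_admin", "n/a"]
free_text = ["hello world", "@someone_said", "42", "lorem ipsum"]

print(match_percentage(handles), should_classify(handles))      # 75.0 True
print(match_percentage(free_text), should_classify(free_text))  # 25.0 False
```

A single stray handle in a free-text column stays below the threshold, so the column is not classified; that is the false-positive protection the threshold provides.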
-
Modify the Data Pattern by replacing the plus symbol (+) with {5,15}.
- The plus symbol (+) matches one or more occurrences of the preceding item. This may lead to false positives, as it would allow an unlimited number of alphanumeric characters. Twitter handles must be a minimum of 5 and a maximum of 15 characters.
- With {5,15}, matches only occur where there are at least 5 and at most 15 occurrences of the preceding item.
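The effect of swapping the quantifier can be checked quickly with Python's re module. The two patterns below are hypothetical stand-ins (the pattern suggested by the portal for your file may look different); only the difference between + and {5,15} is the point.

```python
import re

# Hypothetical patterns; only the quantifier differs between them.
UNBOUNDED = re.compile(r"@[A-Za-z0-9_]+")     # one or more characters
BOUNDED = re.compile(r"@[A-Za-z0-9_]{5,15}")  # between 5 and 15 characters

samples = [
    "@abc",                               # 3 characters: too short
    "@purview",                           # 7 characters: valid length
    "@a_very_long_name_that_never_ends",  # 32 characters: too long
]

for value in samples:
    unbounded_hit = UNBOUNDED.fullmatch(value) is not None
    bounded_hit = BOUNDED.fullmatch(value) is not None
    print(f"{value!r}: plus={unbounded_hit}, bounded={bounded_hit}")
```

The unbounded pattern accepts all three samples, while the bounded pattern accepts only the 7-character one.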
-
While we can also specify a Column Pattern, in this example we will rely solely on the Data Pattern. Clear the Column Pattern input and click Create.
-
Navigate to Data map > Scan rule sets (under Source management) and click New.
💡 Did you know?
Scan Rule Sets determine which File Types and Classification Rules are in scope. If you want to include a custom file type or custom classification rule as part of a scan, a custom scan rule set will need to be created.
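One way to picture a scan rule set is as two allow-lists: one for file types and one for classification rules. The Python sketch below is a simplified, hypothetical model for illustration only. The ScanRuleSet class and its methods are not part of any Microsoft Purview SDK, and the file paths are made up.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ScanRuleSet:
    """Hypothetical, simplified model of a scan rule set."""
    name: str
    file_types: frozenset = field(default_factory=frozenset)
    classification_rules: frozenset = field(default_factory=frozenset)

    def file_in_scope(self, path: str) -> bool:
        """A file is in scope if its extension is one of the selected types."""
        extension = PurePosixPath(path).suffix.lstrip(".").upper()
        return extension in self.file_types

    def rule_in_scope(self, rule_name: str) -> bool:
        """A classification rule is applied only if it was selected."""
        return rule_name in self.classification_rules


# Mirrors the rule set built in this module: PARQUET files only,
# and only the custom twitter_handle classification rule.
rule_set = ScanRuleSet(
    name="twitter_scan_rule_set",
    file_types=frozenset({"PARQUET"}),
    classification_rules=frozenset({"twitter_handle"}),
)

print(rule_set.file_in_scope("raw/Twitter/twitter_handles.parquet"))
print(rule_set.file_in_scope("raw/Twitter/twitter_handles.csv"))
print(rule_set.rule_in_scope("twitter_handle"))
```

In this model a CSV file or an unselected rule is simply out of scope, which is why a custom rule needs a custom scan rule set that explicitly includes it.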
-
Change the Source Type to Azure Data Lake Storage Gen2, then copy and paste the values below into the appropriate fields. Click Continue.
Scan rule set name
twitter_scan_rule_set
Scan rule description
Custom scan rule set to detect parquet files and classify twitter handles.
-
Clear all file type selections with the exception of PARQUET and click Continue.
-
Clear all selected System rules, select the custom classification rule twitter_handle, and click Continue.
-
Click Create.
💡 Did you know?
Ignore patterns tell Microsoft Purview which assets to exclude during scanning. During scanning, Microsoft Purview will compare the asset's URL against these regular expressions. All assets matching any of the regular expressions mentioned will be ignored while scanning.
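The "ignore if any pattern matches" behaviour can be sketched as follows. This is an illustration only: the ignore patterns, the storage account name in the URLs, and the choice to match against the whole URL are all assumptions, not documented Microsoft Purview behaviour.

```python
import re

# Hypothetical ignore patterns, for illustration only.
IGNORE_PATTERNS = [
    re.compile(r".*/_temporary/.*"),  # anything under a _temporary folder
    re.compile(r".*\.tmp"),           # anything ending in .tmp
]


def is_ignored(asset_url: str) -> bool:
    """Return True if the whole URL matches any of the ignore patterns."""
    return any(p.fullmatch(asset_url) for p in IGNORE_PATTERNS)


# Made-up asset URLs using a placeholder storage account name.
base = "https://examplelake.dfs.core.windows.net/raw/Twitter"
assets = [
    f"{base}/twitter_handles.parquet",
    f"{base}/_temporary/part-0000.parquet",
    f"{base}/upload.tmp",
]

for url in assets:
    print(url, "ignored" if is_ignored(url) else "scanned")
```

Only the first asset would be scanned here; the other two each match one of the patterns and are skipped.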
Before proceeding with the following steps, you will need to:
- Download and install Azure Storage Explorer.
- Open Azure Storage Explorer.
- Sign in to Azure via View > Account Management > Add an account....
Note: If you have not created an Azure Data Lake Storage Gen2 Account, see module 02.
-
Download a copy of twitter_handles.parquet to your local machine.
-
Navigate to your Azure Data Lake Storage Gen2 Account, expand Blob Containers, and open the raw container. Note: If a raw container does not exist, create one.
-
Click on the New Folder button, give the folder a name (e.g. Twitter) and click OK.
-
Right-click on the newly created folder and click Open.
-
Click on the Upload button and select Upload Files....
-
Select the local copy of twitter_handles.parquet and click Upload.
-
Open the Microsoft Purview Governance Portal, navigate to Data map > Sources and click New Scan within the Azure Data Lake Storage Gen2 tile. Note: If you have not registered your Azure Data Lake Storage Gen2 Account, see module 02.
-
Click Test connection to ensure the credentials have access and click Continue.
-
By default, Microsoft Purview will have the parent Azure Data Lake Storage Gen2 account selected and therefore include all paths in scope. To reduce the scope, deselect the parent and select the Twitter folder only. Click Continue.
-
To validate the scope of the custom scan rule set, click View detail.
-
Confirm that the custom scan rule set includes the PARQUET file type and the custom classification rule twitter_handle. Click OK.
-
Select the custom scan rule set twitter_scan_rule_set and click Continue.
-
Set the scan trigger to Once and click Continue.
-
Click Save and Run.
-
To view the progress of the scan, navigate to Sources and click View details on the Azure Data Lake Storage Gen2 tile.
-
Periodically click Refresh to update the scan status until it shows Complete. Note: This will take approximately 5 to 10 minutes.
-
Once the scan has completed, perform a wildcard search by typing the asterisk character (*) into the search bar and pressing Enter.
-
Limit the search results by setting Classification within the filter panel to Twitter Handle. Click on the asset title (twitter_handles.parquet) to view the asset details.
-
You will notice on the Overview tab that the schema includes the Twitter Handle classification. To identify which column has been classified, navigate to the Schema tab.
-
Within the Schema tab we can see that Account name is the column that has been classified.
-
Which of the following is a valid classification rule type?
A ) Python
B ) Regular Expression
C ) C++
-
When creating a regular expression based classification rule, you must specify a Data Pattern AND a Column Pattern.
A ) True
B ) False
-
Custom classifications are automatically in scope of a system default scan rule set.
A ) True
B ) False
This module provided an overview of how to create a custom classification, and how to have the classification automatically applied as part of a scan using a custom scan rule set.