Azure Purview - to own your data
While exploring the problem of migration to the cloud, I realized it would be worth shedding some light on the issue of data ownership. This is where Azure Purview comes in.
I'm not a Chief Data Officer, but if I were one, I would be asked some fundamental questions about my data: what is the source, what does the ETL/ELT process look like, who owns the data, what is the data lineage, and finally, what does all that data mean?
All those questions are very important these days. They might be triggered by regulatory requirements (data lineage, anonymisation), data science work (where is the source dataset used to train my model?), internal auditors, or the business itself, which has struggled for years to define common domain models.
While working for one of my clients, I faced the problem of defining a Product Master application, which was meant to be the golden source of domain model definitions across the company. The issue was too big to be swallowed at once. The model definitions were described at the subdomain level, which is fine as long as applications and business users stay within their subdomain silos. On the other hand, the isolated domains prevented product consolidation and migration to global operating models (including outsourcing and acquisitions).
Migration to the cloud
For migration to the cloud, we can find plenty of processes and tools that help us transfer our data (databases) and infrastructure (VMs only, if we are lucky).
Unfortunately, the problem is far more than IT-specific, as the actual value of the migrated applications is measured by their data. The conveyed data creates links that have to be cut before the migration takes place. As IT, we don't really need to understand the domain in detail, but we do need to be able to explain what the data lineage is and who owns the data.
Azure Purview
This is where Azure Purview comes into play. It's a brand new Azure service, though it had already been available on the platform in preview for some months (it's GA now).
The service is provisioned in Azure in the form of a workspace (mostly to expose standard Azure features like ACLs, networking or identity management). The right place to go is Purview Studio, a well-known concept already used by Azure Synapse and Azure Data Factory. The studio lets us configure one of around 30 data sources that can be used for analysis.
https://docs.microsoft.com/en-us/azure/purview/purview-connector-overview
The sources can be organized (grouped) into collections. Once a source is registered, we can scan it. The scan applies a number of predefined classification rules (hundreds of them!) that try to guess, based on the scanned data, what information hides behind it. For example, if the data within a column looks like a social security number or an email address, the field will be classified as SSN or email respectively.
We can also create our own custom rules, providing a fixed list of values or using regular expressions. If a rule matches a sufficient share of cases (a 60% threshold is a good value to start with), the analyzed field is assigned to the corresponding classification.
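To illustrate the threshold idea, here is a minimal, purely conceptual Python sketch (not the Purview API): a regex rule is applied to a sample of column values, and the classification is assigned only when the match ratio exceeds 60%. The pattern and sample data are hypothetical.

```python
import re

# Conceptual sketch of a regex-based classification rule, mirroring the idea
# behind Purview's custom rules; this is NOT the Purview API itself.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")  # hypothetical SSN-like pattern

def classify_column(values, pattern=SSN_PATTERN, threshold=0.6):
    """Return True when enough sampled values match the rule's pattern."""
    sampled = [v for v in values if v]          # ignore empty cells
    if not sampled:
        return False
    matches = sum(1 for v in sampled if pattern.match(v))
    return matches / len(sampled) >= threshold

# Example: 3 of 4 non-empty values match (75% >= 60%), so the column
# would be tagged with the SSN classification.
column = ["123-45-6789", "987-65-4321", "n/a", "555-12-3456", ""]
print(classify_column(column))   # True
```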
The identified entity is also analysed in the ETL context. If you scan a database and a particular table column is used by a view, this dependency will be visualized on a diagram. It looks even more impressive when Azure Data Factory or Azure Synapse is involved: the data lineage is handled natively and exceeds the database boundaries by reflecting pipeline processing. Very cool, isn't it?
To solve the problem of subdomains, Azure Purview introduces glossaries. These can be owned by different business owners (e.g. departments); in that case, a term such as "asking price" may vary between subdomains, but that's fine, as each variant lives as a definition within its particular glossary. Glossary definitions can also be imported from a CSV file to populate the information in bulk.
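As an illustration of the bulk import, a glossary CSV could look roughly like the snippet below. Treat the column headers as approximate; the authoritative column set comes from the template that Purview lets you download, and the names and definitions here are made up.

```csv
Name,Definition,Status,Experts,Stewards
Asking Price,"Price requested by the seller before negotiation",Approved,jane.doe@contoso.com,john.smith@contoso.com
Asking Price (Retail),"List price shown to retail customers, VAT included",Draft,retail.team@contoso.com,
```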
If we are happy with the data lineage analysis but are not able to use the Purview studio in our project, then no worries: Azure Purview has been built on top of Apache Atlas 2.0 and exposes its well-documented API.
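For example, a lineage graph captured by Purview can be fetched programmatically through that Atlas-compatible API. The sketch below assumes a hypothetical account name and entity GUID, authenticates with azure-identity, and follows the Atlas v2 lineage endpoint shape; treat the exact URL and token scope as assumptions to verify against the current API reference.

```python
import requests
from azure.identity import DefaultAzureCredential   # pip install azure-identity requests

ACCOUNT = "my-purview-account"                        # hypothetical account name
ENTITY_GUID = "00000000-0000-0000-0000-000000000000"  # GUID of a scanned asset

# Acquire an Azure AD token for the Purview data plane (scope assumed).
token = DefaultAzureCredential().get_token("https://purview.azure.net/.default")

# Atlas v2 lineage call: walk dependencies in both directions, three hops deep.
url = (f"https://{ACCOUNT}.purview.azure.com/catalog/api/atlas/v2/"
       f"lineage/{ENTITY_GUID}")
response = requests.get(
    url,
    params={"direction": "BOTH", "depth": 3},
    headers={"Authorization": f"Bearer {token.token}"},
)
response.raise_for_status()

# The payload lists upstream/downstream entities and the edges between them.
for edge in response.json().get("relations", []):
    print(edge["fromEntityId"], "->", edge["toEntityId"])
```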
Summary
I'm wondering how fast it analyses petabytes of parquet files or terabyte-sized databases, and how efficient the delta load/update is. I'm eager to validate it in the context of a cloud migration of unknown data. To quote Star Trek: "Its continuing mission: to explore strange new worlds, to seek out new life and new civilisations, to boldly go where no one has gone before".