Data Drift Monitor in Azure

To catch suspicious data coming from external sources, we usually define a set of rules that explicitly examine incoming data and validate it against those rules. But what happens if the data still looks good and stays within the defined ranges and schemas, yet something smells off?

Classic approach

Let's consider the case of a company that tracks changes in the real estate market. If the volume of data coming from an external provider suddenly drops, or values burst their banks, it is easy to catch that breach by introducing validation rules. For example, if the price of a property is higher than $100M or lower than zero, then such input data (e.g. a file) should be rejected or fixed before processing. The business users may not be happy with the delay, but still... it's better to be safe than sorry.
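A minimal sketch of such a static rule check, assuming the weekly load arrives as a CSV file with a price column (the file name, column name and limits are illustrative):

```python
import csv

MIN_PRICE = 0
MAX_PRICE = 100_000_000  # illustrative upper bound for a property price


def find_rule_violations(path: str) -> list[dict]:
    """Return the rows that break the static price rules."""
    rejected = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            price = float(row["price"])
            if price <= MIN_PRICE or price > MAX_PRICE:
                rejected.append(row)
    return rejected


bad_rows = find_rule_violations("weekly_load.csv")
if bad_rows:
    print(f"Rejecting load: {len(bad_rows)} rows violate the price rules")
```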


Now let's consider a case where the average price of a property drifts over time. If one week the average price equals $100k and the next weekly load increases it by 2%, we may take this as normal market behaviour. The validation rules stay perfectly satisfied, unaware of the upcoming disaster. If it happens again the next week, and the price grows by another 2%, it still looks healthy compared to the previous week. After enough weeks of such compounding, the price has moved from $100k to $200k, leaving the business users in the dark, like the proverbial frog slowly boiled in heated water.
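A quick back-of-the-envelope calculation shows how quietly this adds up: at 2% growth per week the average doubles in roughly 35 weeks, while every individual week-over-week check still passes.

```python
import math

weekly_growth = 1.02  # 2% increase, invisible to a week-over-week rule
weeks_to_double = math.log(2) / math.log(weekly_growth)
price = 100_000 * weekly_growth ** round(weeks_to_double)

print(f"Doubles after ~{weeks_to_double:.0f} weeks, price then ≈ ${price:,.0f}")
```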

The asking price could be one of the key values. If we are lucky, such a drift might be caught by the business users anyway. But what if the price is just a computed result depending on fields that come with the original dataset? We may stay happy with an average price fluctuating around $100k, while the value of our derivative is computed from drifted inputs that nobody has been watching.



Data Drift Detection with ML

To capture such behaviour we can use Machine Learning mechanisms. One of them is the Azure ML Dataset Monitor, which is part of the Azure Machine Learning product. Let's see how to set it up.

  1. Define the base dataset - the range of healthy data we want to take as a reference and compare new data updates against
  2. Define the target dataset - the range of data we want to compare with the base dataset (usually including the latest updates)
  3. Define a drift magnitude threshold - if it's exceeded, raise an alert to whomever it concerns (support, business users, ...)
  4. Create a schedule - define how often we want to run the monitor to compare the updated target dataset with the reference base dataset (all four steps are sketched in code below)
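These steps can be done in the Azure ML studio UI or scripted. The sketch below uses the azureml-datadrift package from the Azure ML SDK v1; exact class and parameter names may differ between SDK versions, and the workspace, dataset, compute and email names are placeholders. It assumes both datasets are already registered in the workspace (a registration sketch follows further down).

```python
from azureml.core import Workspace, Dataset
from azureml.datadrift import AlertConfiguration, DataDriftDetector

ws = Workspace.from_config()  # assumes a local config.json for the workspace

# Previously registered tabular datasets (names are placeholders)
baseline = Dataset.get_by_name(ws, "estate-prices-baseline")
target = Dataset.get_by_name(ws, "estate-prices-weekly")

monitor = DataDriftDetector.create_from_datasets(
    ws,
    "estate-price-drift",
    baseline,
    target,
    compute_target="cpu-cluster",               # existing AML compute cluster
    frequency="Week",                           # how often the schedule runs
    feature_list=["price", "area", "rooms"],    # illustrative feature names
    drift_threshold=0.4,                        # 40% magnitude triggers an alert
    alert_config=AlertConfiguration(["support@example.com"]),
)

monitor.enable_schedule()  # start the scheduled weekly comparison
```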

How does it work? First, we need to create a base dataset - the reference data load that we trust and want to compare subsequent loads against. We also need to create a target dataset that will contain the new (potentially suspicious) information. The target dataset has to be updated every time before the analysis is executed (see Azure Synapse or Azure Data Factory). Once both are defined, we can run the analysis ad hoc by simply pressing the Analyse existing data button. We can also set up the schedule (daily, weekly, monthly...) so the analysis runs on its own.
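Registering the two datasets can also be done from code. A minimal sketch with the SDK v1, assuming the files land in a registered datastore under hypothetical estate/ folders and carry a load_date column (the target dataset needs a timestamp column so the monitor can slice it over time; the timestamp keyword has changed names across SDK releases):

```python
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()
datastore = Datastore.get(ws, "estate_datastore")  # placeholder datastore name

# Trusted reference load
baseline = Dataset.Tabular.from_delimited_files(
    path=(datastore, "estate/baseline/*.csv")
).register(ws, name="estate-prices-baseline", create_new_version=True)

# Rolling target: new weekly files, sliced by the load_date column
target = (
    Dataset.Tabular.from_delimited_files(path=(datastore, "estate/weekly/*.csv"))
    .with_timestamp_columns(timestamp="load_date")
    .register(ws, name="estate-prices-weekly", create_new_version=True)
)
```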

Once executed, the ML magic takes place: the service crunches the fields' values (columns become features) and tries to assess how much each feature has changed, scoring drift points.

What do you get? 

The drift magnitude. It's a single number that expresses the impact of the drift. Zero means the base dataset and the target dataset are the same; 100 means the two datasets look completely different. If the value exceeds the established threshold (for example 40 points), the system can send notifications about the situation (including to Application Insights... so we can do anything we want :) ).
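To build intuition for what such a magnitude measures, here is a tiny, self-contained illustration that compares the price distributions of two weekly loads with a two-sample Kolmogorov-Smirnov test. Azure's internal scoring works differently, so this is only a stand-in for the idea of turning distribution change into a single 0-100 number checked against a threshold.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline_prices = rng.normal(100_000, 15_000, size=5_000)  # reference week
drifted_prices = rng.normal(140_000, 15_000, size=5_000)   # later, drifted week

statistic, p_value = ks_2samp(baseline_prices, drifted_prices)
magnitude = statistic * 100  # scale the statistic to a 0-100 "drift score"

print(f"drift score ≈ {magnitude:.0f} / 100 (p-value {p_value:.2g})")
if magnitude > 40:
    print("Threshold exceeded - raise an alert")
```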


Individual field (feature) assessment. This is the level of detail you may value most! The drift monitor will tell you how a particular field (aka feature) has changed over time.



Summary

I like this tool for one more reason. One of the biggest challenges I usually face in my projects is a lack of understanding of the data. That makes it difficult to assess the quality of the data and its dynamics. The drift monitor, executed ad hoc, lets me do an initial check and easily catch an anomaly, despite my limited domain knowledge.

To get started, request your free Azure account credit ($200) and try to run your first Azure ML drift monitor. It's really easy to set up, and it made my day.

