Data validation in your data ingestion process is vital to keep your data pipelines flowing smoothly and to ensure that the data you’re working with is correct and useful.
We outlined some more of the objectives of data validation in this post and here we’ll show you how you can build these validation steps into your pipelines with CloverDX to maintain the quality of the data you're ingesting.
Both data profiling and data validation play an important role in managing your data quality.
In this example we’ll walk through both the profiling and validation steps in a CloverDX data ingestion pipeline.
This case study involves a platform where multiple schools upload data to either an FTP site or via email and get an analyzed view of that data returned.
The ingest process involves several steps, in a single pipeline:
So our basic process looks like this (with the validation steps highlighted):
And we build that out into an executable data pipeline in CloverDX that looks like this:
The section highlighted in green above is where we’re performing all those steps, and we can see each individual part of that section by opening it up:
There are several places where we’re doing some kind of check on our data:
And we can drill down further into that profiling step:
This is our validation step. We’re taking a flat file and running two data pipelines against it: the profiler, to give us that holistic, overarching view; and the business rules validation, where we validate our data against specific rules. We don’t necessarily stop the pipeline if we encounter any suspect data, but we do log it to Excel files.
When we run this process in CloverDX we can see the numbers of files flowing through the process, and how it’s changing at each step. We can also see all the individual records that are coming across, and can see that we’re profiling based on a number of different criteria: for this example where we’re looking at school data we have records based on classes, enrolments, students, teachers and so on.
We can also look at the validation results for each record. For instance, in this example we’re checking for null records. We’ve defined a threshold of 10% - if we’re getting fewer than 10% null records we continue processing, but as soon as it hits that threshold we log an error and alert the user, pulling those results into a spreadsheet.
And we can see that even though we haven’t yet reached that critical threshold, the data is trending in the wrong direction – we’re getting more and more null counts. With that information, your end user or support team can address those errors before they even reach critical status – enabling you to fix your data before it even fails.
Each school in our example has different business rules, different ways they format teacher or class IDs for instance, each very specific to a particular system or data source. The challenge in dealing with data ingestion from multiple sources is how to scale that up, without needing to build new data pipelines for each source?
In this video you can see how, instead of having those validation rules for each source explicitly listed in the pipeline, we’re externalizing the rules.
We’re looking at a file for each school, and depending on which school it is, changing the variables, allowing you to change the data validation rules on the fly per source system.
As mentioned above, in CloverDX you can define how you want to handle errors. Some you might want to stop the pipeline for, but some softer errors you might not. But it’s still useful to inform users about those soft errors in a way that’s meaningful and allows them to take action.
In our example we’re creating two Excel sheets, one showing errors by school, and one showing errors by date. You can imagine how for instance you might want to send the by-school report to those schools so they can fix any common errors. The errors-by-date report can be useful internally, for instance to check how your pipelines are running over time.
Building data ingestion pipelines with CloverDX gives you complete flexibility in how you manage your data and what data validation rules you want to apply. Building data validation into your automated processes enables you to:
Watch the whole video of this demo of data validation in data ingestion processes here. And if you want to talk about how your can build and automate your data ingestion pipelines in CloverDX, just get in touch with us.