At CloverDX, whether we’re talking to prospective new customers or to existing clients, we often hear the same problems from people trying to maintain and improve their data architecture.
Despite the differences in company sizes, industries, or the data goals they’re trying to achieve, it’s revealing that the pains people suffer are often very similar.
And these specific problems are usually indicators of larger, business-critical issues – the company’s losing money or making bad investment decisions, or someone’s embarrassing themselves in front of the board with incorrect data.
We thought we’d dive into some of these common symptoms people have told us about, and share what the diagnosis (and treatment) might be.
So if any of these strike a chord with you, here’s what you can do about it…
This is a very common complaint. Data workflows evolve over time, and can get so complex and involve so many workarounds that no one understands why or how something is working any more. This is especially true when it’s all the work of one person who maybe didn’t document things as thoroughly as they might have – and who has now left the company. There inevitably comes a time when you need to update the process, and you can’t. Probably because everyone is too scared to touch it in case it breaks.
You’re wasting time (and money) because your development team can’t split up the work into tasks they can tackle efficiently.
If you’re just thinking about the data part of your process, there’s a tendency to forget all the ‘other stuff’ that a robust process needs to include – the automation, orchestration and logging for example.
Large, complex data jobs are really difficult to maintain well, and they’re really hard to cover with tests. Things breaking when you make changes is a common sign that your architecture could use some improvement.
Large jobs are one of the most typical signs of data architecture that needs improvement. It’s really difficult to stay productive and efficient when you’re working on a single huge chunk of code. It’s hard to read, hard to understand, and hard to extend and test.
The remedy is to break complex processes into smaller pieces, with each individual piece of the pipeline dealing with one single task or responsibility.
The goal here is to identify patterns in your workflows so you can create reusable, repeatable pieces.
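As a rough illustration of the idea – plain Python rather than any particular tool, with made-up step names – a monolithic job becomes a handful of single-task steps composed into a pipeline:

```python
# Sketch only: each step does one thing, so it's easy to read, test in
# isolation, and swap out without touching the rest of the pipeline.

def extract(raw_lines: list[str]) -> list[dict]:
    """Parsing only – nothing else happens here."""
    return [dict(zip(["id", "amount"], line.split(","))) for line in raw_lines]


def clean(rows: list[dict]) -> list[dict]:
    """Normalisation only – easy to unit-test on its own."""
    return [{**row, "amount": float(row["amount"])} for row in rows]


def summarise(rows: list[dict]) -> float:
    """Aggregation only – can be replaced without touching the other steps."""
    return sum(row["amount"] for row in rows)


def pipeline(raw_lines: list[str]) -> float:
    """The whole job is just the composition of small, named pieces."""
    return summarise(clean(extract(raw_lines)))


print(pipeline(["1,10.50", "2,4.25"]))  # 14.75
```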
Benefits of breaking down your data processes
If you have 80 data sources, but you’ve got a process that’s the same for all of those sources, you don’t need 80 data workflows.
When you’re building everything from scratch every time, you’re bound to end up with differences in how each process works. Implementing something like auditing functionality in different ways throughout a complex pipeline can lead to errors and confusion.
Reusing what you’ve already built saves time and improves consistency. And by creating ‘modules’ that you reuse across your pipelines, you make it far easier to be flexible and adapt pipelines without having to reinvent the wheel to build an entirely new process.
For example, if you want to swap a data source and you’ve built your pipeline in a modular way, then you don’t need to build an entirely new pipeline, you can just plug in the new source – and crucially you won’t need to touch the rest of the pipeline, resulting in significant time and cost savings.
This works the other way round too – you can take a source from one pipeline and plug it into another pipeline. This can be especially useful if you've already done work on that source, such as data quality checks, validation, or aggregations. Instead of doing all that work again, you can reuse what you've already built in a new process.
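Here’s a simplified sketch of that idea – a source ‘module’ that carries its own quality check, plugged into two different pipelines (the helper names and file names are purely illustrative):

```python
# Illustrative sketch: the source module bundles reading with the validation
# already built for it, and any pipeline can reuse it unchanged.
import csv
from typing import Callable


def read_customers(path: str) -> list[dict]:
    """Reusable source module: reading plus its quality check."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return [r for r in rows if r.get("customer_id")]  # validation travels with the source


def reporting_pipeline(source: Callable[[str], list[dict]], path: str) -> int:
    """One pipeline that accepts any source module."""
    return len(source(path))


def archive_pipeline(source: Callable[[str], list[dict]], path: str, target: str) -> None:
    """A different pipeline reusing the same source module unchanged."""
    rows = source(path)
    with open(target, "w", newline="") as f:
        if rows:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)


# Usage: the same source slots into either pipeline, and one parameterised loop
# over similar feeds replaces dozens of near-identical workflows, e.g.
#   for feed in ["customers_eu.csv", "customers_us.csv"]:
#       reporting_pipeline(read_customers, feed)
```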
Benefits of good modular design and reusing components:
The only thing worse than knowing your data is wrong is not knowing it’s wrong. One customer who came to us only discovered that their data had errors months down the line – by which point the bad data had already led to some poor decisions, and fixing everything retrospectively turned into a costly process.
A lack of development conventions and not having a good process for collaboration is bad for efficiency and productivity, and slows down delivery of your projects.
For anyone monitoring jobs in production, it’s important to be able to see and understand what’s happening in the pipelines. When each developer tackles problems differently, it makes troubleshooting – as well as extending and maintaining your jobs – very difficult.
Defining conventions right across your data processes helps build consistency. From naming conventions for files and processes, to conventions for documentation and development (where you should also have a solid approach to versioning and teamwork) – all of these help increase productivity and make data flows easier to understand.
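As a small illustration – not any specific tool, and the module and job names are made up – conventions are easiest to enforce when they live in one shared place that every job imports:

```python
# Hypothetical shared 'conventions' module: every job names its outputs and
# logs its progress the same way, so production monitoring reads consistently.
import logging
from datetime import date


def output_name(source: str, dataset: str, run_date: date) -> str:
    """One naming convention for every output file, defined once."""
    return f"{source}_{dataset}_{run_date:%Y%m%d}.csv"


def get_job_logger(job_name: str) -> logging.Logger:
    """One log format for every job."""
    logging.basicConfig(
        format="%(asctime)s | %(name)s | %(levelname)s | %(message)s",
        level=logging.INFO,
    )
    return logging.getLogger(job_name)


# Usage in any job:
log = get_job_logger("daily_customer_load")  # hypothetical job name
log.info("writing %s", output_name("crm", "customers", date.today()))
```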
Benefits of striving for consistency
Because of bad, inefficient or even completely missing data quality checks, business decisions can end up being based on incorrect information. No one wants to be responsible for wasting money, or presenting the company board with the wrong data.
This client was getting data from their transactional system, but not storing it anywhere. So by the time they discovered the issues with their data, the original data was already gone and it was too late to correct it – meaning they were unable to deliver their project.
If data quality checks are bottlenecking your process, the first question is often ‘should I just stop checking?’ The answer is typically ‘no’, but you should be thinking about how you can make things more efficient. For example, in order to check the structure of an incoming file, one customer we spoke to was reading the entire file – not at all necessary – and changing the process to read just one line achieves the same outcome in a much smarter way.
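A rough sketch of that ‘read one line, not the whole file’ approach (the expected column names are just placeholders):

```python
# Sketch only: confirm an incoming CSV has the expected layout by reading
# nothing more than its header line.
EXPECTED_COLUMNS = ["customer_id", "order_id", "amount"]  # hypothetical layout


def has_expected_structure(path: str) -> bool:
    """Read only the first line – enough to check the file's structure."""
    with open(path, newline="") as f:
        header = f.readline().strip().split(",")
    return header == EXPECTED_COLUMNS
```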
Your data is going to be less-than-perfect, because this is real life. But by expecting that from the start, you can mitigate the impact. Validate early, do it in a smart way, reuse validation rules for consistency and efficiency, and back up data that you might need to fix any future issues.
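Putting that together, an ingestion step might look something like this sketch – back up the raw file first, then validate every row against a shared rule set at the point of entry (the paths and rules are purely illustrative):

```python
# Illustrative only: one shared rule set reused across pipelines, raw data
# archived before anything else touches it.
import csv
import shutil
from pathlib import Path

RULES = {  # reused by every pipeline for consistency
    "customer_id": lambda v: v != "",
    "amount": lambda v: v.replace(".", "", 1).isdigit(),
}


def ingest(source: Path, archive_dir: Path) -> tuple[list[dict], list[dict]]:
    """Back up the raw input, then split rows into valid and invalid at the point of entry."""
    shutil.copy(source, archive_dir / source.name)  # keep the original for future fixes
    good, bad = [], []
    with source.open(newline="") as f:
        for row in csv.DictReader(f):  # validate early, before anything downstream
            ok = all(rule(row.get(field, "")) for field, rule in RULES.items())
            (good if ok else bad).append(row)
    return good, bad
```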
Building data validation into your data ingestion processes
Benefits of improving your data quality processes
The good news is that these common problems are all fixable. By implementing some best practices when developing your data architecture you can achieve some business-critical improvements:
If you’ve got any worrying symptoms with your data architecture that you want us to take a look at, just get in touch.