So, you’re ready to build a data ingest pipeline. You know that manual data ingest is a waste of time and resources, and you know that a better data ingest process will help you grow. Now it’s time to jump into the tools and start building… right?
Not quite.
Before you get started, it’s essential to consider some key points around frameworks and requirements to help you hone your use case and configure appropriately from the start. Here are 3 questions to ask before you begin architecting your data ingest pipeline:
First, which data ingest model do you want to use? There are three common types: bulk/batch, real-time streaming, and Lambda architecture.
Bulk/batch data ingest – data is collected, mapped, validated, uploaded and logged in batches. A batch could be a small micro-batch or a data set containing millions of lines, it could run every few minutes or every few months, and it could run on a regular schedule or when triggered.
Real-time streaming – when data needs to reach the target destination immediately to support up-to-the-minute insights and processes, an always-on data ingest approach may be best. Rather than arriving as large data sets with many rows, streamed data is usually ingested record by record.
Lambda architecture – many organisations need a combination of bulk/batch and real-time streaming. Lambda architecture runs a streaming path alongside a batch path: the streaming path addresses the latency concerns associated with batch processing, whilst the batch path provides the reconciliation capabilities and accuracy required with large data sets.
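To make the Lambda idea more concrete, here is a minimal sketch in Python. It is an illustration only: the names `build_batch_view`, `SpeedLayer` and `query` are hypothetical, the data is made up, and a real pipeline would use proper storage and messaging rather than in-memory dictionaries.

```python
from collections import defaultdict


def build_batch_view(events):
    """Batch path: recompute totals from the full, reconciled history."""
    totals = defaultdict(int)
    for key, amount in events:
        totals[key] += amount
    return dict(totals)


class SpeedLayer:
    """Streaming path: update running totals one record at a time."""

    def __init__(self):
        self.totals = defaultdict(int)

    def ingest(self, key, amount):
        self.totals[key] += amount


def query(key, batch_view, speed_layer):
    """Combine the batch result with records that have not been batched yet."""
    return batch_view.get(key, 0) + speed_layer.totals.get(key, 0)


# Hypothetical data: records already processed by the last batch run...
history = [("acme", 100), ("globex", 40), ("acme", 25)]
batch_view = build_batch_view(history)

# ...and one record that arrived after that run finished.
speed = SpeedLayer()
speed.ingest("acme", 5)

total = query("acme", batch_view, speed)
```

In this sketch the batch view stays the accurate, reconciled record, and the streaming totals only cover the gap since the last batch run. When the next batch completes, the streaming totals for that period would be discarded.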
Second, who will define your data mapping, and how often will it be used? Ideally, the rules around your data mapping should be influenced by subject matter experts and business users, i.e. people who know what the data will need to do in the target platform and why it may be in its current state. However, it’s also important to consider how often the transformation will be required, so that the benefit justifies the cost. Will this mapping be used daily? If so, it’s worth making it easy for your business users to work with, freeing up developer time and reducing friction. But if it’s only being used once, it might not be worth the effort.
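One way to keep a frequently used mapping easy for business users to maintain is to hold the rules as data rather than code. The sketch below assumes a hypothetical rule format (`source`, `target`, `transform`) and made-up field names. In practice the rules might live in a spreadsheet or configuration file that developers load at run time.

```python
# Mapping rules expressed as data, so they can be edited without code changes.
# All field names and transform names here are hypothetical examples.
MAPPING_RULES = [
    {"source": "Cust Name", "target": "customer_name", "transform": "strip"},
    {"source": "Amt", "target": "amount", "transform": "to_float"},
    {"source": "Country", "target": "country_code", "transform": "upper"},
]

# The small set of transforms developers make available to the rules.
TRANSFORMS = {
    "strip": lambda value: value.strip(),
    "upper": lambda value: value.strip().upper(),
    "to_float": lambda value: float(value),
}


def apply_mapping(record, rules):
    """Build a target record by applying each rule to the source record."""
    return {
        rule["target"]: TRANSFORMS[rule["transform"]](record[rule["source"]])
        for rule in rules
    }


row = {"Cust Name": "  Acme Ltd ", "Amt": "19.50", "Country": "gb"}
mapped = apply_mapping(row, MAPPING_RULES)
```

For a one-off load, hard-coding the mapping is probably cheaper. The rules-as-data approach pays off when the same mapping runs regularly and the rules change.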
Third, how will you handle bad data? It’s important to architect your data ingest processes around the assumption that you will receive bad data from time to time, if not often. Your best bet is to assume the data will arrive in the worst possible state, so that your processes hold up whatever the data quality. So how can you make sure that data is processed effectively without compromising standards?
One way to achieve this is to combine auto-generated validation rules with your target data schema so you can spot-check the data at each step of the ingest process. You should then produce actionable error logs: reports that make sense to business users and make clear exactly what action is needed to rectify each error.
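As a rough sketch of that idea, the Python below derives its checks from a target schema and writes error messages a business user could act on. The schema format, field names and message wording are all assumptions made for illustration, not a prescribed approach.

```python
# Hypothetical target schema; the checks below are derived from it.
SCHEMA = {
    "customer_name": {"type": str, "required": True},
    "amount": {"type": float, "required": True, "min": 0.0},
    "country_code": {"type": str, "required": False, "length": 2},
}


def validate(record, schema, row_number):
    """Return a list of plain-language error messages for one record."""
    errors = []
    for field, rule in schema.items():
        value = record.get(field)
        if value is None or value == "":
            if rule["required"]:
                errors.append(
                    f"Row {row_number}: '{field}' is missing. Please supply a value."
                )
            continue
        if not isinstance(value, rule["type"]):
            errors.append(
                f"Row {row_number}: '{field}' should be of type "
                f"{rule['type'].__name__} but was {value!r}."
            )
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(
                f"Row {row_number}: '{field}' is {value} but must be at least "
                f"{rule['min']}. Please correct the value."
            )
        if "length" in rule and len(value) != rule["length"]:
            errors.append(
                f"Row {row_number}: '{field}' is {value!r} but must be exactly "
                f"{rule['length']} characters long."
            )
    return errors


# Made-up sample data: one clean record and one with three problems.
records = [
    {"customer_name": "Acme Ltd", "amount": 19.5, "country_code": "GB"},
    {"customer_name": "", "amount": -3.0, "country_code": "GBR"},
]

accepted, error_log = [], []
for number, record in enumerate(records, start=1):
    problems = validate(record, SCHEMA, number)
    if problems:
        error_log.extend(problems)  # rejected rows go to the error report
    else:
        accepted.append(record)
```

Each message names the row, the field and the fix required, so the report can go straight back to whoever owns the source data.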
Whatever your data ingest use case, resolve these considerations before you begin architecting your pipeline. When you are ready, CloverDX can help you build a data ingest process that covers all your must-haves.