Thought leadership content from CloverDX

Data Integration

Written by CloverDX | Sep 26, 2019 12:15:27 PM

What is Data Integration?

Data integration is the process of combining data from multiple sources in order to extract additional value. The process usually involves joining, cleansing, validating, and enriching the data along the way.

Even smaller organizations may use dozens of applications within their business, and that number can reach the thousands for larger organizations. The ability to work with all data across each of those different applications is crucial. Data integration makes this possible.

For example, a simple data integration task might consist of combining various Excel spreadsheets with the information in an Access database. At the other end of the scale, an extensive BI (business intelligence) platform needs to work with data from a customer database, web analytics platform, invoicing system, stock management platform, and third-party systems such as address databases – all in sync and without overlap.

Data Integration Examples

  • For a 360 degree customer view
    By combining data from CRM applications, marketing applications, and accounting systems, companies can derive valuable insights into how customers react to various marketing campaigns and measure effectiveness compared to campaign cost and actual sales data.
Blog: How to embrace digital transformation with confidence.
  • Creating a data warehouse
    Typically created to store data collected from many different applications, the data in a data warehouse is formatted so that a unified view of a whole department, or even an entire company, can be easily and effectively queried.
  • Data migration
    A data migration project often requires integrating data from two (or more) sources into a single new system. For instance, migrating a customer database and a marketing system into one new CRM.
  • Combining data from different applications or systems for business efficiency 
    For example, one leading bank integrated data from multiple systems into their central identity management platform to automate keeping user permissions up to date.

Types of Data Integration

Data Consolidation

Data consolidation is a process that combines data from multiple systems and stores it in one location. The goal is to simplify the application landscape and provide a single data point for other applications to access without needing to deal with the complexities – and often impossibilities – of dealing with data at the original source. Consolidating data makes processes such as BI, reporting, and data mining far more straightforward.

The standard example of data consolidation in action is a data warehouse. The primary objective of a data warehouse is to aggregate information from multiple systems and present that data to reporting or data mining systems to extract maximum value, all from a single data source.

Data Propagation

Data propagation is the process of copying data from one system to another, often while adding further information from elsewhere to enrich the data and prevent duplicate entries across multiple systems.

For example, enriching a licensing application with data from a CRM application can allow teams that manage licenses to serve customers more effectively by avoiding the need to enter customer information into two different applications.

Data ingestion vs data integration: What's the difference?

Another typical example is the propagation of data from production systems to QA or pre-production labs to support end-to-end testing of various in-house applications during regular development and maintenance cycles.

Such data propagation often involves logic that anonymizes data or selects only a subset of the source data due to security or volume concerns.

Virtualization

Data virtualization provides a near real-time unified view of data across multiple different systems without having to store all of that data in a single location.

When data is queried, the virtualization system queries the source systems and builds a unified view of the data on demand. Various caches are often employed to ensure that queries are fast and do not overload the source applications.

Being able to expose virtualization queries through multiple mechanisms, including easily configured API endpoints, is vital to make the query results as accessible as possible.

Data Integration Techniques

There are several ways in which data integration can be completed, ranging from the manual integration approach with little or no automation to the more comprehensive common data storage technique (e.g., a data warehouse).

Manual Integration or Common User Interface

Users operate with all the relevant information, accessing all the source systems or web page interfaces. No unified view of the data exists.

Application-Based Integration

This approach requires particular applications to implement all the integration efforts. The downside of this technique is that it's only manageable when there’s a minimal number of applications.

Middleware Data Integration

This approach transfers the integration logic from particular applications to a new middleware layer. Although the integration logic is not implemented in the applications anymore, the applications still need to participate in the data integration partially.

Read more on the importance of a data integration layer:
Download the white paper: Designing Data Applications the Right Way

Uniform Data Access or Virtual Integration

The uniform data access – or virtual integration – technique leaves data in the source systems. It defines a set of views to provide and access a unified view of the customer across the whole enterprise.

For example, when a user accesses customer information, the particular details of that customer are transparently acquired from the respective system.

The main benefits of virtual integration are:

  • Nearly zero latency of the data update propagation from the source system to the consolidated view
  • There’s no need for a separate store for the consolidated data

However, drawbacks can include:

  • Limited information about the data's history and version management
  • Restrictions on being able to apply the method to only 'similar’ data sources (e.g., the same type of database)
  • The fact that access to the user data generates extra load on the source systems, which they may not have been designed to accommodate

Blog: API Data IntegrationCommon Data Storage or Physical Data Integration

This usually means creating a new system that keeps a copy of the source system’s data to store and manage independently.

The most well-known example of this approach is a data warehouse, the benefits of which include data version management and being able to combine data from very different sources (mainframes, databases, flat files, etc.) The physical integration, however, requires a separate system to handle the vast volumes of data.

Challenges of Data Integration

Business vs IT

Any data integration initiative should be driven by business, not IT. The project needs a thorough analysis of the requirements so it can be designed in a way that will achieve business goals. This includes having a business champion who understands the company's data assets and can lead discussions about the long-term aims of the data integration project.

Blog: The 6 Biggest Data Integration Challenges (and How to Solve Them)

Technical considerations

The most significant data integration challenge is often integrating data from disparate, often incompatible, sources. To help design a successful project, you need to analyze thoroughly:

  • From which systems will the data be sourced?
  • Is all the data to fulfill the requirements available?
  • What are the options for extracting the data? Update notifications? Incremental extracts? Full extracts?
  • How often will you need to run your extracts?
  • Are the required data fields populated correctly and consistently?
  • What volume of data are you going to need to process?

Business process considerations

Discover who in the business owns the systems you need to work with and ensure they are fully involved in the process. Your data security and backup policies can also affect your integration.

Planning for the new system (and the future)

As well as the technical integration, you’ll need to determine who within the organization will be responsible for managing the new system and what the support and SLAs are. Before the project begins, you must also have a plan for future development – who will fund and manage ongoing maintenance and upgrades?

Considering your future requirements early on in the process can save expensive headaches further down the line. Can your integration be scaled to handle larger data volumes or new sources?

Poor error handling

As the number of systems involved in your integrations grows, the possibility of failure also increases. Having a strategy for handling errors will help manage issues such as downtime or poor data quality.

Building in automated error notification processes also means that errors can be detected and fixed quickly, whether automatically, as part of an error management pipeline, or flagged up for manual resolution.

White Paper: How to Design Your Systems to Effectively Control Bad Data

Custom-coded solutions

Custom-coded data integration solutions can work perfectly well with simple or smaller projects. However, as the code grows, maintainability becomes a severe challenge. It's essential to be aware of the potential pitfalls of a custom-coded approach, especially if your data needs increase:

  • Many can treat logging like an afterthought, resulting in a lack of diagnostic information when issues arise and problems if developers who carry a lot of knowledge in their heads (and have never logged it anywhere) decide to leave
  • Integration with new technology is slow to implement
  • Performance can dwindle as serious bottlenecks develop over the years, as these issues are often hard to spot before they become a serious problem. Fixing these can then require a full and highly time-consuming refactoring
  • Bugs can and will happen. Debugging can be challenging and cause downtime or interruptions

Data Integration Best Practice

Data quality monitoring

Monitoring every involved application is vital to prevent polluting multiple applications with low-quality data from a single source.

 

Resource usage monitoring

Data integration works with live systems that usually cannot be shut down while the integration is querying or updating the data. Proper resource monitoring will help ensure that the integration doesn’t negatively impact the involved applications. For example, monitoring response times and disk or memory usage can all help detect overloads of production systems.

Testing

Since data integration is a continuous process, implementation needs to be as robust as possible, created through rigorous testing before deployment to production.

It is good practice to test production data by copying it and configuring testing instances of all involved applications or systems. Such end-to-end testing can be beneficial not only to the integration itself but to the other systems involved in the integration.

Data design

When designing a data integration, it is always important to decide how to represent the data. If the integration copies data from one application to another (data propagation), the design is given by the applications involved.

However, for data consolidation and virtualization efforts, selecting the suitable representation of each entity can make or break a project.

If the data representation is hard to understand, it will be avoided, resulting in applications adding point-to-point interfaces rather than using a centralized data location.

Start small

Building a comprehensive, unified data model may sound like a good idea, but these attempts almost always fail due to sheer complexity (even with just a handful of applications). In practice, it’s best to start small, using only the minimum required data that serves the business needs.

Data Integration vs ETL

What is ETL?

ETL stands for Extract, Transform, and Load, and is a process for getting data from one place (the source system) to another (the target) while performing some kind of transformation on the data in the middle.

There’s also often a Cleaning step in the process to ensure the data maintains a high level of quality.

What's the difference between data integration and ETL?

Whereas data integration can define any process that combines different data sources, ETL is most often used to describe the process of transporting data into a data warehouse specifically. However, ETL can also be used to talk about any process where the data is transformed between the source and target system.

Steps in the ETL Process

Extract

The first step covers getting the data from the source system and making it available for processing, ideally using as little resource as possible and in a way that doesn’t negatively affect the source system’s performance, response times, or any kind of locking.

Extracts can be done in several ways, either by taking only the data that’s been changed since the last extract or a complete extract of the data, which is then compared with the previous extract to identify changes.

The frequency with which these extracts are performed is fundamental, as there can be a massive amount of data to parse. For example, full extracts can often involve volumes that reach tens of gigabytes.

Clean

Cleaning the extracted data is essential to maintain the quality of the target system. Data cleaning should perform basic data unification rules, such as:

  • Converting phone numbers and ZIP codes to a standardized format
  • Making identifiers unique (e.g., ensuring that Male/Female and M/F are translated into one standard)
  • Removing or standardizing missing values
  • Validating address fields and converting them to proper naming (e.g., unifying Street, St., St, and Str.)
  • Validating address fields against each other (e.g., making sure the State and Country or the City and ZIP Code values are valid together)

What's the difference between data ingestion and ETL?

Transform

The transform step applies a set of rules to transform the data from the source to the target. This includes converting any data to the same dimension (i.e., conformed dimension) using the same units so that they can later be joined.

The transformation step can also involve joining data from several sources, generating aggregates; generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.

Load

The load step involves loading the cleaned, transformed data into the target system (often a database). During this step, it's important to ensure the load is performed correctly and use as little resource as possible. To make the process efficient, good practice dictates that any constraints and indexes are turned off before the load and enabled again only after the load is completed.

Read more:

ETL? ELT? What's the deal?
What's the difference between ETL and ELT?

Data Integration Software

Using specialist data integration software can make dealing with a large number of data sources, formats, or transformations easier. It can also help automate your data integration processes to save manual effort, increase repeatability and reliability, and build in error monitoring and handling.

Choosing the right software to suit your organization's requirements can be a complicated decision. Before you start the process, it’s crucial to have a solid understanding of what you want to achieve. Other things to consider are:

  • Technical considerations, such as where your data sits
  • What your future needs might be so you can scale easily
  • Who will be using your software, and what’s their level of technical expertise?
  • How you plan to implement your solution – do you want to do it yourself, or get an expert to help you?
Read more about making the decision to upgrade from homegrown ETL solutions to a robust data platform:

Data Integration with CloverDX

Your data integration process should be as streamlined and straightforward as possible. Key to that is automation. By replacing ad-hoc Python scripts, sprawling spreadsheets and other unmanaged tools with an automated and reliable data pipeline, it becomes possible to:

  • Increase transparency and visibility across the pipeline
  • Correct any errors quickly thanks to instant error alerts
  • Reduce downtime, both in occurrence and how long it lasts
  • Make collaboration easier, including for less technically-minded users
  • Eliminate the risk of relying on a ‘black box’ or that one person who understands the process because they built it

CloverDX has helped many businesses with their data integration, including helping clients move away from Excel spreadsheets into automated integration, bringing thousands of systems together while tracking large volumes of data, or even helping one business recoup six working days a month.

What the process looks like

Data integration is a broad term, covering a considerable number of challenges and outcomes. We speak with you to learn what you hope to achieve and consider what might suit your business.

Some questions to consider will include:

  • What does your business want to gain from data integration? Is it a one-off project or the start of an ongoing, more automated approach?
  • What is the current state of your data? Where does it currently sit, and what are your plans for it?
  • Who do you want to work with your data? Will it be primarily technical teams, business users… or both?

We can learn more about your data integration needs by asking the above questions and others. Every business is unique, and your data may vary in quality, quantity, storage, format, etc.

From building robust architecture to frameworks, data integration takes many forms. Moving on from scripts or Excel can seem daunting, but our customers set the pace and direction of travel. You decide what you need; we find a way to make it happen.

In summary, your business defines what data integration means to you. Want to save time and resources? You can. Are you hoping to reduce instances of error and enable better collaboration? It’s possible.

Whatever your goals, we’ll always do our best to help you achieve them and create data pipelines that are automated and scalable. The ultimate goal is to save on time, cost, and stress for your business.

To learn more about how CloverDX helps modernize your ETL, click here.

We can help

Our demos are the best way to see how CloverDX works up close.

Your time is valuable, and we are serious about not wasting a moment. Here are three promises we make to everyone who signs up:

  • Tailored to you. Every business is unique. Our experts will base the demo on your unique business use case so you can visualize the direct impact our platform can have.
  • More conversation than demonstration. Have a question? We want to hear it. Integrating your data doesn’t have a one-size-fits-all solution, and we understand the different options can feel overwhelming. Whatever concerns or reservations you have, let us hear them.
  • Zero obligation. We’ve all been there. You spend some time hearing about a product or service… and then comes the hard sell. Our team doesn’t ‘do’ pushy. We prefer honest, open communication that leaves you feeling informed and confident.

Get in touch for a personalized demo.

Read More about Data Integration