Components of a data validation strategy
May 12, 2025
A data validation strategy combines economic goals with the appropriate technical means. It takes shape against the backdrop of a process and IT landscape that favors or hinders certain procedures. Objectives, such as avoiding production downtime caused by faulty documents, are relatively easy to determine. Selecting the appropriate means, on the other hand, is often more difficult, as these have both organizational and technical implications. To make it easier for you to develop a strategy, we present the characteristics of various approaches to data validation below.
Checking single documents or aggregated data
Let's start with the terms: a single document could be an eInvoice, for example. Aggregated data, on the other hand, is a structure in which a large amount of information or many documents are stored together, such as the contents of a data warehouse. One factor for the comparison is the response time between sending the data and receiving the validation report; another is the effort required to identify recurring error patterns.

In terms of response time, single-document checks are ahead of the game. If data senders receive feedback within one to two minutes, in most cases they can still correct their documents so that the supply chain is not impaired. One prerequisite is a defined correction process that they can follow; another is that the data sender has personnel ready to act immediately when an error occurs. When aggregated data is checked, by contrast, no feedback is sent while the documents are being collected for the subsequent check. In just-in-time or just-in-sequence processes, this can simply take too long.
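To make the single-document case concrete, here is a minimal sketch in Python. The eInvoice fields and the rules are invented for illustration and do not refer to any particular standard or product:

```python
from dataclasses import dataclass, field


@dataclass
class ValidationReport:
    """Feedback for exactly one document, returned within seconds."""
    document_id: str
    errors: list[str] = field(default_factory=list)

    @property
    def is_valid(self) -> bool:
        return not self.errors


def validate_invoice(invoice: dict) -> ValidationReport:
    # Hypothetical rules; a real check interface would hold many more.
    report = ValidationReport(document_id=invoice.get("id", "<unknown>"))
    if not invoice.get("buyer_reference"):
        report.errors.append("Missing buyer reference")
    if invoice.get("total_amount", 0) <= 0:
        report.errors.append("Total amount must be positive")
    return report


# The sender gets the report immediately and can still correct the document.
print(validate_invoice({"id": "INV-4711", "total_amount": 0}).errors)
```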
When it comes to revealing recurring error patterns, however, checks on aggregated data have the advantage. If thousands of files are checked individually and several hundred documents contain errors, a correspondingly large number of validation reports are generated. To prevent these from overwhelming their recipients, an aggregation mechanism for presenting the results must be implemented. This often requires a separate project and therefore more effort than validating data that is already aggregated.
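Such an aggregation mechanism can be as simple as counting error messages across all single-document reports. A minimal sketch, assuming a hypothetical report structure:

```python
from collections import Counter

# Hypothetical reports, one per checked file; the shape is an assumption.
reports = [
    {"document_id": "INV-0001", "errors": ["Missing buyer reference"]},
    {"document_id": "INV-0002", "errors": []},
    {"document_id": "INV-0003", "errors": ["Missing buyer reference",
                                           "Total amount must be positive"]},
]

# Condense hundreds of individual reports into one readable summary.
error_counts = Counter(e for r in reports for e in r["errors"])
for message, count in error_counts.most_common():
    print(f"{count:>4}  {message}")
```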
Single-document checks therefore play to their strengths where multiple potential sources of error have to be checked in time-critical processes. Checks on aggregated data, on the other hand, are particularly recommended where you only want to examine one specific, non-time-critical aspect of a document type across all data senders, for example, which partners particularly often omit contact details for inquiries in their order confirmations.
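Phrased as code, such an aggregated check is a single query over the collected data. The rows and column names below are assumptions for illustration:

```python
from collections import Counter

# Order confirmations as they might sit in a data warehouse table.
rows = [
    {"partner": "ACME",   "contact_email": "sales@acme.example"},
    {"partner": "ACME",   "contact_email": ""},
    {"partner": "Globex", "contact_email": ""},
    {"partner": "Globex", "contact_email": ""},
]

# One specific, non-time-critical question across all data senders:
# which partners most often omit contact details?
missing = Counter(r["partner"] for r in rows if not r["contact_email"])
print(missing.most_common())  # [('Globex', 2), ('ACME', 1)]
```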
Checking in the original format or the target format
Here, too, there are a number of comparison factors that you can formulate as key questions when developing your data validation strategy:
- Should the causes of errors be tackled or is it enough to deal with their symptoms?
- Are the people receiving the feedback familiar with the data format in which the errors are presented?
- How advantageous is it to need only a single check interface per document type?
- How problematic is it if individual documents have already been corrected by other people or applications before validation, so that they can no longer be recognized as originally incorrect?
- At what point in the data stream do you have all the information you need for validation?
The advantage of checking documents in their original format is that you can send the validation report to the original data sender. This allows you to tackle the causes of errors rather than their symptoms. Data senders will find it easier to rectify particularly complex errors if they receive a comparison of actual and target values based on their own document. Conversely, they will find it harder to correct errors that are explained in terms of the target format. However, if the errors are corrected on your side anyway, it can make sense to check documents in the target format straight away. This is particularly interesting if you have no influence on the data sender and therefore cannot expect any improvements on their side. With common target formats such as CSV, JSON or XML, you also reduce the familiarization time for new employees and thus increase the efficiency of the correction processes.
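As an illustration of such actual/target feedback, here is a sketch that phrases an error in terms of the sender's own document. The field location and the expected value are invented examples:

```python
def feedback_line(location: str, actual: str, expected: str) -> str:
    """One actual/target comparison, anchored in the sender's document."""
    return f"{location}: found '{actual}', expected {expected}"


# The sender sees the error where it occurs in their own structure,
# not translated into a target format they may never have seen.
print(feedback_line(
    location="Invoice/PaymentTerms/DueDate",
    actual="2025-02-31",
    expected="an existing calendar date",
))
```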
Check interfaces are not static constructs. New requirements from the business departments make new checks necessary or force existing ones to be adapted. Since validation in the target format usually gets by with significantly fewer check interfaces than separate checks for each source format, such adaptations require less effort and are less error-prone. In addition, you can more easily identify errors that a data sender makes across different data formats. How much weight this factor carries depends on the number of source formats.
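The following sketch illustrates the efficiency argument: one rule set written against the canonical target structure, reached through one converter per source format. The converter, fields, and rules are assumptions:

```python
def check_target(record: dict) -> list[str]:
    """Single check interface, written once against the target format."""
    errors = []
    if not record.get("order_number"):
        errors.append("Missing order number")
    return errors


def convert_from_csv(line: str) -> dict:
    # One converter per source format, but only one set of checks overall.
    order_number, partner = line.split(";")
    return {"order_number": order_number, "partner": partner}


# Whatever the source format, the same checks run after conversion.
print(check_target(convert_from_csv("4711;ACME")))  # []
print(check_target(convert_from_csv(";ACME")))      # ['Missing order number']
```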

However, the conversion of a document can also fail if it contains serious structural errors. In that case, the document never exists in the target format and therefore cannot be checked there. If it is corrected manually in the course of the business process, in the worst case nobody will even notice that the original data contained an error. You avoid this problem if you check documents in their original format; if such problems hardly ever occur, however, this factor is negligible. In some check scenarios, data relevant to the check is only added after the conversion process and integrated into the target format. In that case, a check in the original format is not possible.
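A minimal sketch of this failure mode, using JSON parsing as a stand-in for any source-to-target conversion:

```python
import json


def check_original(raw: str) -> list[str]:
    """A check in the original format still sees structurally broken data."""
    try:
        json.loads(raw)  # stands in for the conversion to the target format
        return []
    except json.JSONDecodeError as exc:
        return [f"Structural error in original document: {exc}"]


broken = '{"id": "INV-4711", "total":'   # truncated, conversion would fail
print(check_original(broken))            # the error is recorded, not lost
```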
You should prefer checks in the original format if the original data senders are supposed to correct errors themselves with lasting effect. As a rule, these will be external data senders. Validating a standardized target format, on the other hand, is attractive if either a party other than the data sender makes the corrections or the format is also known to the data sender. In such cases you gain the efficiency advantage of fewer check interfaces while still being able to assume that the data sender can understand the feedback. This applies above all to data streams within a company. The other factors can tip the scales in favor of one of the two approaches, but should not usually carry decisive weight.
The goal determines the path
As you can see, every approach has advantages and disadvantages. The ideal validation strategy is the one that achieves your goals to the greatest extent with the least effort. To find it, you should first get a clear picture: Which goals do you want to achieve? Which means are suitable for reaching them? And which circumstances will influence the result? You don't have to go down this path alone, however: in data quality projects, we are happy to advise you on developing the right validation strategy.