Skip to main content
Data Science Series (3 of 3): Data Loading and Integration in the Cloud with AWS | Kranio

""" Output
Clothing: 2314 Technology: 232
"""

Fixing reconciliation errors

It is extremely common that once the data is reviewed, it does not match; this does NOT always mean that the transformations were done incorrectly. It may also happen that the data source is different or that when entering the data into a system to reconcile it, we are misinterpreting some numerical values, etc. We must identify in which cases the values are not matching, proceed to isolate them for analysis and comparison.

Focusing error detection

Next, we will correct a reconciliation error with real examples, in this case correcting a count of values that does not match. A useful alternative to identify the source of this error is to obtain a grouping of a column used in the transformation filtering by the value that does not match; this way we can count the values with the goal of exporting this result and being able to compare the file in Excel.


df = df.select('group_column').where(df.TIPO_PRODUCTO == 'ROP').groupBy('group_column').count()
df.show()
""" Output
+-----------------+---------------+
|group_column     |      count()  |
+-----------------+---------------+
|       32413EH200|              1|
|       30413EH300|              2|
|       30413EH400|              1|
|       33313EH500|              3|
+-----------------+---------------+
"""

# We export the file to compare it in Excel
test.write.csv('grp_matnr', sep=';')

We create a tab in Excel where we add the result of the exported CSV and the same calculation from the script but done in Excel, each with their corresponding columns, as follows:

This will directly show us which values differ from our script and will allow us to focus our analysis on a few data points instead of thousands of them.

Transforming values that alter columns

On the other hand, we may encounter cases where some column values are altering the script’s calculations or disabling their calculation from the reconciliations. To fix this, we can generate scripts that transform the resulting data or the ingested data. For example, below it was identified that the use of double quotes in some values caused problems, so they are removed from the script.


df = dataframes_dict['main']

# Dictionary with the data types of each column
columns_types_dict = {
	#...
  'product_desc': StringType(),
}

# We iterate over each column indicated in the dictionary
for column_name, column_type in columns_types_dict.items():
	...
	
	# We remove the double quote from all Strings 
	elif column_type == StringType():
	  df = df.withColumn(
	    column_name,
	    regexp_replace(column_name, '"', '')
	  )

Fix separator in transformation

Sometimes the separator in the output file does not match the tool we use for reconciliation. Fortunately, fixing the separator of the generated document is very simple to do at the time of exporting in the script.


test.write.csv('export', sep=';')

It is also possible to fix the separator by searching all values where the separator appears and replacing it in all cases with another. For these tasks, a tool like Notepad++ can be very useful.

Fix sum of values

Sometimes when obtaining data from management systems like SAP with raw data, we may find data that we must transform to make it readable from the Data Warehouse and so that it can be reconciled, having to convert formats as follows:

410.500,210 ⇒ 410500.210

Script to fix decimal format


df_double_obj = ['peso_neto', 'cantidad', 'us$_fob', 'us$_cif', 'us$_flete', 'cantidad_aux', 'espesor', 'volumen_m3', 'densidad']
    
    for column_name in df_double_obj:
        if column_name not in data.columns:
            continue
        data[column_name] = data[column_name].apply(lambda x: str(x).replace('.','') if ',' in str(x) else x)
        data[column_name] = data[column_name].apply(lambda x: str(x).replace(',','.') if ',' in str(x) else x)

Sometimes we cannot use the script to reconcile and must correct the file manually. For example, in Excel, numerical values with incorrect decimal formatting do not allow sums to reconcile; this is because the numbers are misinterpreted, altering the real value of the results.

As in the previous case, it is possible to modify all these types of formats with replacements in tools like Notepad++.

Ready to optimize your data loading and validation in the cloud?

At Kranio, we have data solutions experts who will help you implement efficient loading and reconciliation processes using AWS tools, ensuring the quality and availability of your information for strategic decision-making. Contact us and discover how we can drive your company’s digital transformation.

Previous Posts

CAG in LLMs: How to Reduce Latency and Costs in AI

CAG in LLMs: How to Reduce Latency and Costs in AI

Discover what CAG is and how it enhances speed, reduces costs, and improves consistency in LLMs. Learn when to use it and how to design efficient AI architectures.

Rate limiting: protect your API and prevent overloads

Rate limiting: protect your API and prevent overloads

Control the traffic of your API with rate limiting. Enhance security, stability, and cost efficiency in your digital infrastructure.

Caching in RESTful Applications: How to Improve API Speed