Data Science Series (3 of 3): Loading and Squaring Data in the Cloud with AWS

Organizations rely on their data to make intelligent decisions, and this decision-making is usually carried out by senior officials and management teams. To understand the data, it must be presented in a report or similar format so that it is easy to interpret. There are many alternatives for generating reports; however, the ideal is to build automatic reports.

About this article

This is the third article in the Data Science series (see Part 1 and Part 2).

Each article can be read in any order, since the content is separated into different stages that, despite being strongly connected, can be understood individually. Each publication seeks to shed light on the processes being carried out in the industry, and it can help you decide whether your organization should hire a service to migrate its data to the cloud, or help you learn how this type of project is developed if you are a student. Some concepts are taken for granted here but explained elsewhere in the series, so a full reading is recommended if some keywords sound unfamiliar.

One of the fundamental processes in Data Science projects is loading data. In this article we will see what the loading processes for visualization consist of, and how data is squared to ensure that the developed product returns the data the business expects.

Data load

Loading data is the last step in the ETL flow and makes the data available for visualization, which consists of presenting data in a system in a visual way so that customers can make vital decisions. We always want the visualization pipeline to be automated, so that the data transformation and cleaning tasks run without manual intervention. There are several tools to carry out data loading and the automation of all these processes; today, however, we have the cloud.

Why might we be interested in migrating to the cloud?

The cloud is the technology that allows you to store, manage and provide access to all your business data. It also lets organizations avoid investing money in their own equipment and data centers to manage their business information. The cloud gives companies access to a wide variety of powerful applications and services that are fundamental when it comes to modernizing the business, taking advantage of all the virtues of technology and turning that into a competitive advantage that drives the business forward.

On cloud platforms we can find a multitude of services that allow you to extract all the digital potential of your organization, including tools for managing and developing Big Data solutions.

Loading in the cloud

Once the transformation processes are carried out, the results are sent to the cloud, into a zone of a Data Lake Bucket. Since this is the last transformation process, its results are stored in a provision zone, to be loaded into a cloud Data Warehouse system such as AWS Redshift, or made queryable with AWS Glue and Athena by means of a Crawler. In this way the data goes from the provision zone to being available, partitioned, in a structured table ready to be queried by a reporting system, which enables data visualization and Business Intelligence practices. The tools we can use to manage these Big Data processes are described below.
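
Before looking at those tools, here is a minimal sketch, assuming boto3 is available, of how the results of the last transformation could be uploaded to the provision zone; the bucket name, prefix, and local file path are hypothetical and depend on how your Data Lake is organized.

import boto3

# Hypothetical names: adjust them to your own Data Lake layout
BUCKET = 'datalake-example'
PROVISION_PREFIX = 'provision-zone/sales/'

s3 = boto3.client('s3')

# Upload the result of the last transformation to the provision zone,
# from where a Crawler or a load into Redshift can pick it up
s3.upload_file(
    Filename='output/sales_transformed.csv',
    Bucket=BUCKET,
    Key=PROVISION_PREFIX + 'sales_transformed.csv',
)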

AWS Glue

This is an AWS service that implements ETL processes with the goal of categorizing, cleaning, enriching and moving data. It tracks data sources and loads structured data to a destination. When used with a Crawler and Athena, it builds a catalog by interpreting data types. These catalogs, called the Data Catalog, are metadata repositories holding data formats and schemas.

Glue Crawler

Crawlers analyze the data to create the metadata that allows Glue and services like Athena to view the information stored in S3 as databases with tables. By making the data available in a structured format, the Glue Data Catalog can be created; on each execution of the ETL flow, the Data Catalog is generated and updated.
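
For reference, the sketch below shows how a Crawler run could be triggered from code with boto3; the crawler and database names are hypothetical and assume the Crawler was already configured to point at the provision zone in S3.

import boto3

glue = boto3.client('glue')

# Hypothetical crawler, created beforehand and pointed at the provision zone
glue.start_crawler(Name='provision-zone-crawler')

# Once the crawler finishes, the tables it discovered appear in the Data Catalog
tables = glue.get_tables(DatabaseName='datalake_provision')
for table in tables['TableList']:
    print(table['Name'])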

Athena

Athena is a data analysis tool. It is used to process complex SQL queries in a short time. It is a serverless service, so it does not require you to manage infrastructure, and you only pay per query performed. With this service, you can query the data from platforms such as Tableau.

SQL queries: SQL queries are computational processes in which information is requested from a structured database, which returns the requested information.
Serverless: Platforms or systems that do not need a permanently enabled server; they run only when consumed, saving computing power and idle servers.
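
As a rough sketch of how such a query could be launched from code with boto3 once the Crawler has catalogued the data (the database, table, and results bucket below are hypothetical):

import time
import boto3

athena = boto3.client('athena')

# Launch the query; Athena writes its results to the indicated S3 location
execution = athena.start_query_execution(
    QueryString='SELECT CATEGORIA, SUM(m3) AS total_m3 FROM sales GROUP BY CATEGORIA',
    QueryExecutionContext={'Database': 'datalake_provision'},
    ResultConfiguration={'OutputLocation': 's3://datalake-example/athena-results/'},
)
query_id = execution['QueryExecutionId']

# Wait until the query finishes, then read the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    print(rows)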

Redshift

It is a Data Warehouse service in the AWS cloud that lets you run SQL queries over Big Data. You only need to create a cluster to start working with the Redshift Data Warehouse. It is designed to work with gigantic amounts of data and processes data in parallel in order to save time.

Cluster: Clusters consist of one or more servers that act as one to carry out a process that requires computing power.
Big Data: All those processes in which large amounts of information are processed are called Big Data.
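
As an illustrative sketch (the cluster, database, user, and table names are hypothetical), a similar query can be run against Redshift with the Redshift Data API:

import boto3

redshift_data = boto3.client('redshift-data')

# Run a query on an existing cluster; the identifiers here are placeholders
statement = redshift_data.execute_statement(
    ClusterIdentifier='dw-cluster',
    Database='analytics',
    DbUser='etl_user',
    Sql='SELECT categoria, SUM(m3) AS total_m3 FROM sales GROUP BY categoria',
)

# The result can be fetched later with get_statement_result(Id=statement['Id'])
print(statement['Id'])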

Tableau

It is a data visualization tool for data analysis and Business Intelligence. It takes data from queryable systems such as Athena or Redshift and lets us visualize it with multiple tools and multiple views.

Visualization: Visualization consists of taking structured business data and generating graphics or visual representations that allow us to understand the business quickly and simply, so that intelligent decisions can be made.
Business Intelligence: Business Intelligence consists of interpreting and analyzing data in the simplest and fastest way, through the use of technology and various processes, so that the business can make vital decisions.

Quadrature of data

Once the data is loaded into a Data Warehouse, it is available for consumption. However, whether or not the data is sensitive, quadrature (reconciliation) tasks must be carried out to ensure that the final result has not altered the expected values, and thus confirm that the correct values are being reached.

Quadrature tasks are processes that seek to obtain the same results from two sides: the data that the user reaches with their manual solution, and the data arrived at by the load or by some of the transformation processes.

In some cases, instead of contrasting against a result constructed by the user or the client, the data source against which to square has to be built manually, taking the data from the ingestion zone.

Alternatives to square

There are many alternatives for squaring. You can use Excel, or Python libraries such as Pandas and PySpark.

We can square the values of some columns by obtaining their total sum or the number of elements; we can also perform the same calculations or specific filters on the transformed results and confirm that the same values are reached.

The process would be similar to the following:

  1. We download the original data from the ingestion zone.
  2. We read the source with some tool.
  3. We perform calculations, transformations, filters, pivot tables, etc., to arrive at the same results as the script.
  4. We compare both sources and share the results with the team.

It is essential that the data source used for the transformations and for the manual result is the same; otherwise, the same results are unlikely to be achieved.
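
As a minimal sketch of steps 2 to 4 with Pandas (the file names are hypothetical, and it assumes both files share the CATEGORIA and m3 columns used elsewhere in this article):

import pandas as pd

# Step 2: read the original source (ingestion zone) and the transformed output
source = pd.read_csv('ingestion/sales_raw.csv', sep=';')
result = pd.read_csv('provision/sales_transformed.csv', sep=';')

# Step 3: reproduce the calculation that the script performs, e.g. total m3 per category
source_totals = source.groupby('CATEGORIA')['m3'].sum()
result_totals = result.groupby('CATEGORIA')['m3'].sum()

# Step 4: compare both results; any difference is a quadrature error to analyze
comparison = pd.DataFrame({'source': source_totals, 'result': result_totals})
comparison['matches'] = comparison['source'] == comparison['result']
print(comparison[~comparison['matches']])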

Square with Excel

To square this way, we simply open our data source with Microsoft Excel or a similar tool.

Then we calculate a series of figures related to the transformation file. For example, if we know that the transformation relates certain data, we should validate that specific data by grouping and filtering, in addition to the numerical values that are relevant.

We build a pivot table to validate that the groupings we generate give the same result, as seen in the following image for the filters and values:

Square in PySpark

PySpark is a Python library that allows Big Data transformations with Spark. The following describes a series of scripts that can be used in data quadrature processes.

Get the sum per group

To obtain the sum of the values of each grouping we generate, we can use groupBy(), which lets us indicate the columns used to group, followed by the type of calculation performed on the groups, in this case a sum with sum().

dataframe.groupBy('CATEGORIA','TIPO_PRODUCTO').sum('m3').sort('CATEGORIA').show()

Remember that a dataframe is a data structure that can be managed through code using specialized data analysis libraries such as Pandas and PySpark.


""" Output
+-----------------+---------------+------------------+
|CATEGORIA        |  TIPO_PRODUCTO|           sum(m3)|
+-----------------+---------------+------------------+
|             null|           null|                31|
|             Ropa|         Polera|           556.291|
|             Ropa|        Poleron|         10154.277|
|       Tecnología|           Ipad|         20905.502|
+-----------------+---------------+------------------+
"""

Get the count per group

We can use groupBy() as in the previous case, or we can use filters to check that the values obtained match, counting different indicators.

count_ropa = df.where(df.CATEGORIA == 'Ropa').count()
count_tecnologia = df.where(df.CATEGORIA == 'Tecnología').count()

print('Ropa:', count_ropa, 'Tecnología:', count_tecnologia)


""" Output
Ropa: 2314 Tecnología: 232
"""

Resolve quadrature errors

It is extremely common that, once the data has been reviewed, it does not square. This does NOT always mean that the transformations were performed incorrectly; it can also happen that the data source is different, or that when entering the data into a system to square it we are misinterpreting some numerical values, etc. We must identify in which cases the values do not match and isolate them in order to analyze and contrast them.

Focus on error detection

Next we will correct a quadrature error with real examples, in this case a value count that does not match. A useful way to identify the source of this error is to group by a column used in the transformation, filtering by the value that does not square, and count the values; this result can then be exported and contrasted against the file in Excel.


df = df.where(df.TIPO_PRODUCTO == 'ROP').groupBy('group_column').count()
df.show()
""" Output
+-----------------+---------------+
|group_column     |          count|
+-----------------+---------------+
|       32413EH200|              1|
|       30413EH300|              2|
|       30413EH400|              1|
|       33313EH500|              3|
+-----------------+---------------+
"""

# Export the file to compare it in Excel
df.write.csv('grp_matnr', sep=';')

We create a tab in Excel where we add the result of the exported CSV and the same calculation from the script but carried out in Excel, each with its corresponding columns, as follows:

This directly shows us which values differ from those of our script and allows us to focus our analysis on a few pieces of data, instead of thousands of them.
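
If we prefer to do this comparison in code instead of Excel, a sketch with Pandas could look like the following; the file names are hypothetical, and it assumes the script's export was consolidated into a single CSV and the manual counts were saved as CSV as well.

import pandas as pd

# Counts per grouping column produced by the script (exported earlier with write.csv)
script_counts = pd.read_csv('grp_matnr.csv', sep=';', names=['group_column', 'count'])

# The same counts calculated manually and saved as CSV
manual_counts = pd.read_csv('manual_counts.csv', sep=';', names=['group_column', 'count'])

# An outer merge with an indicator shows which groups differ between the two sources
merged = script_counts.merge(
    manual_counts, on='group_column', how='outer',
    suffixes=('_script', '_manual'), indicator=True,
)
mismatches = merged[(merged['count_script'] != merged['count_manual']) | (merged['_merge'] != 'both')]
print(mismatches)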

Transform values that alter columns

On the other hand, we may find that some values in the columns are altering the script's calculations or preventing the quadrature from being calculated. To correct this, we can write scripts that transform the resulting data or the ingested data. For example, in this case it was identified that double quotes in some of the values caused problems, so they are removed in the script.


from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import StringType

df = dataframes_dict['main']

# Dictionary with the data type of each column
columns_types_dict = {
    # ...
    'product_desc': StringType(),
}

# Iterate over each column declared in the dictionary
for column_name, column_type in columns_types_dict.items():
    # ... handling for the other column types (elided)

    # Remove double quotes from every String column
    if column_type == StringType():
        df = df.withColumn(
            column_name,
            regexp_replace(column_name, '"', '')
        )

Correct separator in transformation

Sometimes the separator of the output file does not match the tool we use to square. Fortunately, correcting the separator of the generated document is very simple when exporting from the script.


df.write.csv('export', sep=';')

It is also possible to correct the separator by searching for all the places where it appears and replacing it with another one in every case. For these tasks, a tool such as Notepad++ can be used.
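
If Notepad++ is not at hand, a few lines of Python can do the same replacement; the file names and separators below are only an example, and note that a blind replacement also affects any decimal commas, so the result should be reviewed.

# Replace the separator of an exported file; input and output names are illustrative
with open('export.csv', encoding='utf-8') as source_file:
    content = source_file.read()

with open('export_fixed.csv', 'w', encoding='utf-8') as target_file:
    target_file.write(content.replace(',', ';'))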

Correct sum of values

Sometimes, when obtaining raw data from management systems such as SAP, we find data that must be transformed so that it is readable from the Data Warehouse and can be squared, having to convert formats as follows:

410.500,210 ⇒ 410500.210

Script to correct decimal format


# Columns that should be interpreted as decimal numbers
df_double_obj = ['peso_neto', 'cantidad', 'us$_fob', 'us$_cif', 'us$_flete', 'cantidad_aux', 'espesor', 'volumen_m3', 'densidad']

for column_name in df_double_obj:
    if column_name not in data.columns:
        continue
    # Drop the thousands separator and then use a dot as the decimal separator
    data[column_name] = data[column_name].apply(lambda x: str(x).replace('.', '') if ',' in str(x) else x)
    data[column_name] = data[column_name].apply(lambda x: str(x).replace(',', '.') if ',' in str(x) else x)

Sometimes we cannot use the script to square and we have to correct the file manually. For example, in Excel, numerical values with an incorrect decimal format do not allow the sums to be squared, because the numbers are misinterpreted, altering the real value of the results.

As in the previous case, it is possible to modify all these types of formats with replacements in tools such as Notepad++.

Team Kranio

September 16, 2024