Blog: Data Science Series (2 of 3): Data Extraction and Transformation in ETL Projects

Nowadays we are increasingly connected to the internet, meaning that we consume more information and as a result we generate it in equal or greater proportion. This whole sea of data is a competitive advantage for organizations that are able to capture it and derive true value from it. We could see this in detail in the previous article - “Data Science Series: Data and Roles (1 of 3)”, which explains the value of data, the different types that we can find, the alternatives for its storage, in addition to defining the roles and processes that come into play when contracting this type of service.

However, in this article we will focus on more technical aspects, such as: Data extraction and transformation. To understand how this can allow us to activate the data and begin to generate new value.

About this item

This is the Second article From the series about DData Science. Each article can be read regardless of the order since the content is separated by different stages that, despite having a strong connection, can be understood individually. Each publication seeks to shed light on the processes that are being carried out in the industry and that could help you decide if your organization can hire a service to migrate your data to the cloud or learn about how the development of this type of project works if you are a student. There are some concepts that are taken for granted, which are explained in the series, so it is recommended to give a full reading in case some keywords sound unknown.

There are different professionals involved in the development of Science projects, however in this case we will focus on the role played by the Data Engineer professional.

What role does a Data Engineer play?

The Data Engineer builds ETL flows so that data can be easily consulted and in the appropriate format from different platforms. These flows contain a series of transformations that are nothing more than scripts that must be programmed to modify and Enrich the data. It is to be expected that the Data Engineer will create automations to periodically execute the ETL flow and processes. In essence, the Data Engineer ensures that the rest of the roles have data to analyze. The following explains in detail what the ETL flow consists of.

Enrich the data: It is any process from which a set of data undergoes a transformation and results in the creation of new columns or new data to it. It can also result in standardizing and cleaning the data after undergoing the transformation.

ETL flow

To understand where to extract our data and why we must transform it, we must first understand what the ETL flow is.

ETL They are the initials of Extract, Transform and Load. ETL is a process that draws information from different data sources to a Data Lake, transforms the data in a route or area and finally the onus in a Data Warehouse for consumption.

pullout

In the extraction, data is collected from several sources, those selected data are stored in a Staging area. It is often quite common that files extracted from SAP or other similar platforms that manage business data are taken as data sources.

Staging area: It is an area dedicated entirely to data processing, it is from where the data is taken to carry out the transformations and at the same time it is where the files of prosecution which consist of files that are generated and that will be used by other transformation steps of the ETL process.

During the extraction process, the selected data can have several formats, including:

Tables
SQL databases
NoSQL databases
Unstructured data
Text files

It's important to leave extraction data in a staging area because it's very common for extraction data to contain inconsistent, dirty data with an unpredictable variety of formats. In this way, by placing the data in a specific space, they can be read by the process and correct all its errors through transformations, in this way we can generate new data, with the correct format and ready to be analyzed.

It is considered Inconsistent data to information that has values in the same column or document with different formats. For example, if we have a column with date values and we receive some values with the format 2021/11/09 and others with the formats 2021-03-10 14:30, etc.

Transformation

In the transformation process, the data that was loaded during the extraction is taken and converted to a standard format in order to be able to upload them to another area of the Data Lake or to a Data Ware House and thus obtain results to analyze or present.

Transformations are performed multiple times with different scripts during the same flow or Pipeline. These are generally performed in scripts that can be built with different languages and libraries. The most common is to use Python as a programming language and Pandas or PySpark as a library for working with data.

Data Pipeline: They are a series of data processes that are executed in series where each process has an input and an output that continues the flow.

In transformation scripts, sequential tasks are performed to ensure that all elements process certain tasks, such as:

Apply format: We can define the data type of each column and carry out transformations according to the type of data assigned.
Apply filters: To get only the values we need.
Clear columns: In this way, we can apply a transformation on empty or null values and at the same time unify values that are repeated or registered differently but that point to the same thing.
Integrate data: We can obtain new data by integrating two different data sources or columns.

Carga

In this process, the transformed data is taken and loaded into a Data Warehouse. This is the final result of the ETL process and the idea is that the entire flow from extraction to loading is executed in certain periods of time on a regular basis.

Once the data is already available in a Data Warehouse, it can be consulted and used to visualize the data from a reporting system such as Tableau or Google Data Studio. Next, the second process will be explored in depth, that is, how data is transformed.

Activate the data

Being able to take advantage of available data is essential for organizations, even more so if you want to move the business through a process of digital transformation. One of the most common ways to take advantage of data is to take two or more Datasets and get new results by mixing the data.

We can build a new data based on some criteria, if we take a sales dataset and a supplier dataset, by relating them we can obtain a new column that indicates the age of the supplier or build a new value by combining the value content of both datasets.

Dataset: Datasets are sets of different data that are cohesive or generated from the same source. They generally have a tabular format, that is: They have rows and columns. An example of this would be the data generated by a user when making an online purchase.

Data mapping

To begin to take advantage of an organization's information, it is first necessary to have the data mapped. To do this, a data map can be generated, which consists of a diagram that identifies all the data sources generated in business processes.

The base component from which to begin to identify the data being generated are the datasets. We can group a set of datasets under the same source to form a Data Bucket. For example, we can have a Bucket with all the Users/Customers datasets, another for Corporate data and finally a Bucket of Industry/Competition data.

Identify unmapped date

These three Buckets make up the data map. If more datasets are added to the Buckets, much richer conclusions can be drawn with this new information found.

However, there may be times when there is data that we are not mapping or locating correctly, but that we know exists. In this case, we can start from the Buckets to determine which dataset we are needing. For example: My organization is generating data operations that we have not yet captured, so in the data map we indicate that it is pending obtaining a dataset in the Corporate Data Bucket.

Cloud Change Patterns

Migrations to the cloud allow systems and solutions to access different benefits such as: development of more agile and secure solutions, being able to work with data from the cloud, saving costs and generating new value. In addition, if the solution On Premise is obsolete or legacy can be completely transformed to integrate other services and modernize it.

Moving solutions to the cloud is not a process that always obeys the same rules, it all depends on the context. There are at least five patterns that can be adopted when you want to modernize an on-premise solution with the cloud.

On-premise: Refers to those applications or systems that have been developed and that run locally or on the business's own infrastructure.

01. Move and change

This pattern consists of moving the application to the cloud, making minimal modifications to make it work and once it is running, the original solution can be updated and improved in a simpler and more agile way.

02. Change and move

The next pattern is similar to the previous one only that the steps are reversed. The on-premise solution is modified to prepare it for the cloud and once ready, the migration is carried out to start running the application and apply any necessary changes or improvements, with the services associated with the selected cloud provider.

03. Maintain and extend

It consists of building from scratch a new solution from the cloud that is an extension or connects to the on-premise application. An example of this type of pattern is the creation of APIs to connect the local application with other services or applications that cannot be integrated into the original application.

04. Maintain and replace

Another very common alternative is to keep the on-premise application operating while creating a new application in the cloud environment so that, once developed, it will replace the on-premise solution and thus begin to obtain greater value and agility for the business.

05. JUST Move

In some cases, just freeing up the application or data is enough to modernize the infrastructure layer. This can be done with simpler or more common products. For example: migrating the organization's data to the cloud or modernizing an application infrastructure creating a virtualized environment for disaster recovery, among many other cases.

Next we will see a specific case where transformations are applied to a specific case.

Transform the data

As described above, transforming information helps us to make the most of the data we have, while at the same time giving us greater control over it in order to analyze it.

Use case

To demonstrate how the transformations are carried out, we will use the construction of a report as an example.

This on-premise report can take a certain amount of time to complete and is built using many different data sources, which must be manually consulted and configured to obtain the desired metrics. Then all the data is collected in an Excel and several calculations are applied. The final report would have a format similar to this:

The resulting report simulates the data of a company that distributes different products, we can see that it is a report that has concentrated information on the results obtained with some products, in addition to the calculation of the budget and Forecast for each one. This is the result table and to arrive at it we must perform transformations on different sources, specifically a series of transactional and master data.

Transactional data

Transactional data are those that are processed every certain period of time, which each time they are processed usually bring new data, modifications of previous data or the same data.

In our example, we have several transactional data, which we will call as venta_ropa.csv, venta_accesorios.csv and costos_margen.csv

What we'll do is take the sales sources and group data based on product_type. At the same time, we need to relate the data from this data to the master source to obtain the descriptive data that will be used to create the name and description columns.

Master Data

Master data or master tables are those data that do not usually change over time and have already defined values. In the example, these are tables that contain the codes of each product together with their description.

In our example, we will use a single master data that we will call tipo_producto_detail.csv

Both transactional and master data would have a format similar to the following:

Transformations applied to the case

In this case, we will solve the transformations using PySpark, it is a library for Big Data processes that is designed to work on Datalakes and Data Warehouses. It focuses on processing speed and Spark can be used in Python, Java, R and Scala. In our case, we will use Python as a programming language.

When starting to create scripts, it is normal to first start developing scripts called Scen, which contain all the logic and tests to build the script and then, once tested, this script is converted to the development version with slight modifications.

Read data

To read the data in PySpark we first use read.format () to indicate the type of file we are going to read, then with .option () we select separate options, in this case we use it to define the file delimiter and whether or not it includes headers. Finally, with .load () we indicate the path where the file is located.

Para leer los datos en PySpark primero hacemos uso de read.format() para indicar el tipo de archivo que vamos a leer, posteriormente con .option() seleccionamos opciones por separado, en este caso lo utilizamos para definir el delimitador del archivo y si incluye o no headers. Finalmente con .load() indicamos la ruta donde se encuentra el archivo.

In the example we read only venta_ropa.csv, but the logic is the same for the rest of the sources.

Headers: First row of the table that contains the names of each column.

-- CODE language-python -- # Leemos el archivo ropa_df = spark.read.format('csv').option('delimiter', ';').option('header', False).load('/datasets/transactional/venta_ropa.csv') # Asignamos los headers ropa_df = ropa_df.toDF( 'NOMBRE', 'CODIGO_PRODUCTO', 'FECHA', 'CANTIDAD', ... )

Correct data types

En este caso creamos la función set_correct_data_types la cual se encarga de estandarizar los datos para que asignar el tipo de dato correcto a cada columna y al mismo tiempo realizar modificaciones sobre dichos valores dependiendo del tipo de dato asignado.

In this code example, we correct negative values on the left, erroneous formats in text values, and date formats.

-- CODE language-python -- def set_correct_data_types(df): columns_types_dict = { 'FECHA': DateType(), 'CANTIDAD': IntegerType(), 'US_VENTAS': DoubleType(), 'NOMBRE': StringType(), 'CODIGO_PRODUCTO': IntegerType(), ... } for column_name, column_type in columns_types_dict.items(): if column_type == DoubleType(): column = col(column_name) negative_col = concat(column.substr(-1,1), column.substr(lit(1), length(column)-1)) df = df.withColumn( column_name, when( column.endswith('-'), negative_col.cast(column_type) ).otherwise(column.cast(column_type)) ) elif column_type == DateType(): df = df.withColumn( column_name, to_date(col(column_name), 'yyyyMMdd') ) elif column_type == StringType(): df = df.withColumn( column_name, regexp_replace(column_name, '"', '') ) return {'add_correct_data_types': df}

Get master relationships

En nuestro ejemplo tenemos un dato maestro denominado

 
tipo_producto_detail.csv

el cual posee los valores descriptivos utilizados para construir dos de las columnas del reporte final.

Transactional values such as venta_ropa.csv They have a column called product_code Which when relating it to the column Code From the master file tipo_producto_detail.csv It allows us to obtain the name, product code and the description of the product that we will rename Description. The result is that we managed to add two additional columns to ventas_ropa.csv when relating to the master file, as shown in the following image:

En el código utilizamos .join para obtener las columnas que necesitamos al hacer coincidir la columna codigo_producto con codigo.

-- CODE language-python -- def add_tipo_producto_descr(df): maestro = dataframes_dict['tipo_producto_detail'].select('codigo', 'nombre') df = df.join(maestro, df.codigo_producto == maestro.codigo, 'left') \ .withColumnRenamed('nombre', 'nombre') \ .drop('codigo') return {'add_tipo_producto': df} def add_descripcion_descr(df): maestro = dataframes_dict['tipo_producto_detail'].select('codigo', 'desc') df = df.join(maestro, df.codigo_producto == maestro.codigo, 'left') \ .withColumnRenamed('desc', 'descripcion') \ .drop('codigo') return {'add_descripcion': df}

Filter to generate new data

We can typically use filters to obtain only that information that is relevant to our objective. However, as we will see in the following case, it can also be used to build new data.

Con withColumn() podemos modificar el valor de una columna o crearla si agregamos el nombre de una columna que no existe. Luego con when podemos agregar una condición y el valor que se aplicará en caso de que esta condición coincida. En el ejemplo lo utilizamos para crear la columna categoria.

-- CODE language-python -- def add_category_filter(df): df = df.withColumn( 'CATEGORIA', when( (df.tipo_producto == 'Polera') | (df.tipo_producto == 'Poleron') | (df.tipo_producto == 'Gorro') \ , lit('Ropa').cast(StringType()) \ .when( (df.tipo_producto == 'Lentes') | (df.tipo_producto == 'Billetera') \ , lit('Accesorios').cast(StringType()) \ .otherwise('') )

Understanding the result

Based on what we saw above, we can describe the transformations using the following figure:

What we did was take two sources of data that shared characteristics, venta_ropa.csv and venta_accesorios.csv, we carried out standardization processes (Data Correction Script), obtained new columns by relating a column of the original file to the master source (Master Relationship Script) and through a filtering process we generated a new column by matching a condition (Filter Script). Additionally, we could take the learned references and perform more complex calculations to, for example, use the transactional table of costos_margen.csv and in this way, obtain the Forecast values.

Additionally, because the data is now categorized by column Category we could obtain the sum of the numerical columns, in this way we will extend the scope and purpose of our transformations.

In the next article in the series, we'll look in detail at the next step of the ETL flow, loading data and squaring the results. We hope that the article was helpful, if you have any questions left or your organization needs help solving projects of this type, do not hesitate to contact us.

Team Kranio

September 16, 2024