Introduction to Data Science: Data Types and Key Roles in Projects

Data science projects encompass all those developments in which data is extracted from various sources, manipulated and visualized in order to carry out analysis.

To build these projects, the client's business and the data it holds must be understood, so that the resulting solution delivers value to the organization and supports its decision-making.

About this series and article

This is the first article in the "Data Science" series. Each article can be read in any order, since the content is divided into stages that, although strongly connected, can be understood individually. Each publication aims to shed light on the processes currently carried out in the industry, whether that helps you decide if your organization should hire a service to migrate its data to the cloud, or, if you are a student, shows you how this type of project is developed. In this first part, we will talk about the value of data and the roles that customers, users and developers play in data science projects.

The data

Data is all information that is useful to a company, and organizations can access a great deal of it today. This includes internal organization data, external customer data, and external industry or competition data. Companies that have digitized their operations generate data that can be captured, processed and analyzed.

To work with data it is first necessary to store it, and for this we have several alternatives. Cloud computing services such as Google Cloud Platform or Amazon Web Services (among others) are extremely efficient and cost-effective, since each provides a variety of services that help us store data efficiently and securely.

The value of data

To derive value from data, we must capture, store and structure it in a way that allows business decisions to be made. Data can be used not only to analyze past or current situations, but also to make predictions and take intelligent actions. This means that, after capturing the data, a way must be found to derive true value from it.

Once we capture, identify or enable a data source, we must store the data. We can distinguish between two different storage systems, explained below.

Data Warehouse vs Data Lake

Both data warehouses and data lakes aim to solve the same problem: storing large amounts of data. Their main difference is that data lakes are designed to store raw data, while data warehouses store structured information that has already been filtered and processed into the structure in which it is kept inside the warehouse.
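The contrast can be sketched in a few lines of Python. This is only an illustration, using the local filesystem as a stand-in for a data lake and an in-memory SQLite database as a stand-in for a warehouse; the event, paths and table are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw event, exactly as it arrives from a source system.
raw_event = '{"customer": "ACME", "amount": "1,250.50", "ts": "2024-09-16"}'

# Data lake: store the event as-is, with no prior processing.
lake = Path(tempfile.mkdtemp()) / "lake" / "sales" / "2024-09-16"
lake.mkdir(parents=True)
(lake / "event-001.json").write_text(raw_event)

# Data warehouse: only filtered, structured data enters the table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, ts TEXT)")
record = json.loads(raw_event)
record["amount"] = float(record["amount"].replace(",", ""))  # clean before loading
conn.execute("INSERT INTO sales VALUES (?, ?, ?)",
             (record["customer"], record["amount"], record["ts"]))
conn.commit()
```

The lake keeps the original string untouched, while the warehouse only accepts the record after it has been parsed and typed.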

Structured vs Unstructured Data

When storing data, we can find two formats:

Structured data: This is highly organized data, such as customer records, tables or other data in tabular format, and it tends to be quantitative. The advantage of this format is that it can be easily stored and managed in databases. Note that this type of data is produced by building models and structures that allow it to accumulate in an orderly manner. This type of information is stored in data warehouses.

Unstructured data: This is data that is not organized; it tends to be qualitative, contains undefined information and comes in many formats. Examples include images, audio and PDF files. This type of information is stored in data lakes.

Quantitative: all information that can be measured directly.
Qualitative: all information that cannot be measured directly and for which measurement scales or models must be created.
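The distinction can be illustrated with a minimal Python sketch. The CSV rows and the review text below are invented examples: the tabular data can be aggregated immediately, while the free text first needs a rule or model (here, a deliberately toy one) before anything measurable comes out of it.

```python
import csv
import io

# Structured (quantitative): tabular rows that fit a fixed schema.
structured = io.StringIO("customer_id,purchases\n101,12\n102,7\n")
rows = list(csv.DictReader(structured))
total = sum(int(r["purchases"]) for r in rows)  # directly measurable

# Unstructured (qualitative): free text with no predefined schema;
# a scale or model must be created before it can be measured.
review = "The delivery was late but the product quality is excellent."
sentiment = "positive" if "excellent" in review else "unknown"  # toy scale
```

The structured rows support arithmetic out of the box; the review only becomes measurable once we impose a (here very crude) scale on it.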

Below we can see a figure that describes the differences between structured and unstructured data.

©Kranio SPA

Both of these structures can be used to achieve results and make intelligent decisions. However, unstructured historical data is much harder to analyze. With the right cloud tools, value can still be extracted from unstructured data by using APIs to create structure.

Historical data: the information that organizations generate over the years, which is generally disordered and comes from several sources.
APIs: tools that allow two different systems to be integrated or to communicate with each other.
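As a minimal local illustration of "creating structure" from unstructured text, the sketch below extracts named fields from a free-form invoice line with a regular expression. The invoice string and field names are hypothetical; in a real project this parsing layer would typically sit behind a cloud API or service rather than a hand-written pattern.

```python
import re

# Unstructured input: a free-form line, e.g. from a scanned document.
raw = "Invoice 4482 issued to ACME Ltd on 2024-09-16 for USD 1,250.50"

# A small parsing layer imposes structure on the raw text.
pattern = re.compile(
    r"Invoice (?P<number>\d+) issued to (?P<customer>.+?) "
    r"on (?P<date>\d{4}-\d{2}-\d{2}) for USD (?P<amount>[\d,.]+)"
)
match = pattern.search(raw)
structured = match.groupdict()
structured["amount"] = float(structured["amount"].replace(",", ""))
```

The result is a typed record that could be loaded into a warehouse table, even though the source was plain text.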

Roles

To carry out a project, there must be effective communication between three of the roles present in a data project: the customer, the user and the development team.

The Customer

The customer plays a fundamental role in this type of project: when working with an organization's data, it is imperative that the development team understands, throughout the project, how the client company works and how its data works. In essence, it is about understanding the business logic.

This understanding of the business is built hand in hand between the developers and the customer. The development team must ensure that all its doubts about how the business operates and how the data is used are resolved, and the client, in turn, must be able to resolve those doubts. This exchange makes the difference in obtaining a good result.

Meetings with the customer

The first thing to do when starting the project is a phase consisting of a series of meetings with the client, held to understand their expectations and their workflow and to define the solution they need. The first meetings between client and developers are called "discovery meetings", and those held later are called "understanding meetings". Both aim to get to the heart of the problem to be solved and, ideally, to determine together what value the data has for the business. The real objective of these meetings, however, is to understand the business logic (hence the name). This process includes knowing how the data is obtained, what manual processes are carried out on it, how it is presented and, ultimately, how it is expected to be viewed or accessed. In short: the objective of the understanding meetings is to trace the data that shapes the business logic.

During the development of the project, it is recommended that the customer designate a Product Owner to make communication even more effective and agile. A Product Owner working alongside the development team can help ensure, as far as possible, that the team's efforts directly target what the customer is looking for, reducing discarded work or time invested in developments that are later modified or dropped because they are far from the customer's needs.

In agile projects, the Product Owner is the team member belonging to the client's organization who supports the developers and the Scrum Master in carrying out the project in line with the vision and requirements of their own organization.

Automated ingestion of customer data

For the proper development of the project, the client must ensure that the development team has sufficient data available to build the logic.

These files are usually uploaded to intake zones on a cloud computing service, such as AWS S3.

Intake zones are the directories or spaces within a data lake where the data that will be part of the ingestion process in the pipeline, or data flow, is stored. The development team uses them to test and build the expected result.
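A common convention, sketched below under assumed names, is to build the intake-zone object key from the source and the date. The `intake/` prefix, the bucket name and the helper function are all hypothetical; the commented-out `boto3` call shows what the actual upload to S3 would look like once credentials are configured.

```python
from datetime import date

def intake_key(source: str, filename: str, day: date) -> str:
    """Build the object key for the intake zone of a hypothetical data lake."""
    return f"intake/{source}/{day:%Y/%m/%d}/{filename}"

key = intake_key("sales", "report.csv", date(2024, 9, 16))

# With AWS credentials configured, the customer's upload would look like:
# import boto3
# boto3.client("s3").upload_file("report.csv", "my-datalake-bucket", key)
```

Keeping the key layout deterministic lets the rest of the pipeline discover new files by prefix and date alone.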

However, this process is not limited to development: for the solution's data flow to work properly, these intake areas must be supplied with new data periodically. Generally, the frequency with which this data is uploaded is directly related to how often the entire process is triggered, at least when working with a serverless project.

A serverless project is one that only consumes resources and only executes when required.
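A minimal sketch of this trigger-on-arrival idea, assuming an AWS Lambda-style function reacting to S3 notifications: the handler runs only when a new object lands in the intake zone. The bucket and key below are invented, and the sample event only mimics the shape of a real S3 notification.

```python
def handler(event, context=None):
    """Serverless-style entry point: invoked only when new data arrives,
    so the pipeline consumes resources only while it executes."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # here the pipeline would read the object and start the data flow
        processed.append(f"s3://{bucket}/{key}")
    return processed

# Simulate the notification S3 would send when the customer uploads a file.
sample_event = {"Records": [
    {"s3": {"bucket": {"name": "my-datalake-bucket"},
            "object": {"key": "intake/sales/2024/09/16/report.csv"}}}
]}
```

Because the function is driven entirely by the event, the upload frequency in the intake zone directly determines how often the whole process runs.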

The user

When developing the solution, it must always be built with the user in mind. In a data project, a user can range from the management of an area and its supervisors to a worker who has been generating reports manually for some time, and whose solution will now be transferred to the cloud to automate their work.

The development team needs users in order to understand the solution being developed, especially when a solution already exists and is to be transferred to a cloud service.

To ensure that the solution is consistent with what the user is looking for, the user must meet directly with the development team. In this way, the team that will develop the solution has the full context and can understand the data model and how to obtain the necessary metrics. As with the understanding meetings, this is an iterative process in which all details related to the data must be resolved, specifically the flow through which the data passes.

Development Team

The development team, made up of professionals from various IT disciplines, is responsible for carrying out the solution.

For example, we can find professionals who fulfill the roles of Data Engineer, DataOps, DevOps, Cloud Engineer and Data Analyst.

The context of the project changes throughout the development and the team must always be aware and focused. Feedback from both users and customers allows developers to build a deliverable that meets the expectations of both roles. We can summarize the concepts seen in this last section using the following image:

©Kranio SPA

In the next article in the series, we will see in detail what an ETL flow is and how data is extracted and transformed. We hope this article was helpful. If you have any questions, or your organization needs support with projects of this type, do not hesitate to contact us.

Team Kranio

September 16, 2024