
The consistency generator: “Data Cleaning”

02/09/20 6 min. read

In data analysis it is important that the information being studied does not include unnecessary or incorrectly formatted data, so as to avoid erroneous or inaccurate deductions. For example, if you want to model the cost of low-cost flights within Europe, it does not make sense to include first-class tickets in the data, or even worse, bus tickets!

Bad data should also be considered. For example, if you analyse the average time a bus takes to reach a stop, inconsistent readings from a faulty GPS unit in the vehicle are of little use to you.

Data cleaning procedures make the data consistent with the question you are trying to answer.

In the following paragraphs, written for people who are taking their first steps in data analysis, we explain why data cleaning matters and use a simplified example to show the effect of incorrect or irrelevant information.

A Machine Learning model 🧠

The life cycle of a Machine Learning model can be broken down into the following steps:

  1. Get the data
  2. Clean and prepare the data
  3. Generate the model
  4. Evaluate the model
  5. Deploy the model
  6. Make predictions

Of these steps, cleaning and preparing the data is the most expensive, and it is also one of the most important. This stage is known as Data Cleansing or Data Cleaning.


What is Data Cleaning? 🧹

By Data Cleaning we mean all the operations needed to remove from a data set the information that could divert the analysis from its purpose.

During this phase, records are discarded and missing data are handled.

You may think that this cleaning loses information that is important for the model. Done correctly, however, it improves the quality of the data and brings us closer to a correct answer to the question we are asking.

Some of the cleaning operations that can be carried out are:

Cleaning up badly formatted data 💻

The impact of the data format on data analysis is evident. If we expect a field to hold a price but it actually contains free text, the process will fail.
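As a minimal sketch of this kind of cleaning, the snippet below turns price strings into numbers and marks anything that cannot be parsed as missing. The `parse_price` helper and the sample values are hypothetical, and the sketch assumes that a comma is a thousands separator. It uses plain Python so that it runs without extra libraries.

```python
def parse_price(raw):
    """Return the price as a float, or None if the text is not a valid number."""
    if raw is None:
        return None
    # Drop surrounding whitespace and a leading currency symbol.
    text = str(raw).strip().lstrip("$€").strip()
    # Assumption: a comma is a thousands separator, as in "1,250.00".
    text = text.replace(",", "")
    try:
        return float(text)
    except ValueError:
        # Free text such as "n/a" or an empty field ends up here.
        return None


raw_prices = ["12.50", " $7.00", "1,250.00", "n/a", ""]
prices = [parse_price(p) for p in raw_prices]
print(prices)
```

The values that come back as `None` can then be handled like any other missing data.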

Cleaning up inconsistent data 🗑

Cleaning inconsistent data depends on knowing the data. For example, if you are analysing taxi routes to the airport, you can exclude every route that does not end at the airport.
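The airport example could look roughly like this. The field names (`trip_id`, `dropoff_zone`) and the sample records are invented for illustration; a real data set would more likely store coordinates or zone codes.

```python
# Hypothetical trip records; only the drop-off zone matters for this filter.
trips = [
    {"trip_id": 1, "dropoff_zone": "airport"},
    {"trip_id": 2, "dropoff_zone": "midtown"},
    {"trip_id": 3, "dropoff_zone": "airport"},
    {"trip_id": 4, "dropoff_zone": "harlem"},
]

# Keep only the routes that end at the airport.
airport_trips = [t for t in trips if t["dropoff_zone"] == "airport"]
print([t["trip_id"] for t in airport_trips])
```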

Compensation for missing data 🔎

The data may contain fields that have not been filled in; we call these missing data. In this case we have two options:

  • Discard the data
  • Fill them in as best you can

In many cases the second option is chosen. For numerical data you can fill in the mean or the median of the field, and for non-numerical data you can isolate the missing values as a separate category.
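A small sketch of the second option, using the median for a numerical field and a separate category for a non-numerical one. The field names and the `"unknown"` label are assumptions made for this example.

```python
from statistics import median

# Hypothetical records in which None marks a missing value.
rows = [
    {"distance_km": 2.0, "payment_type": "card"},
    {"distance_km": None, "payment_type": "cash"},
    {"distance_km": 4.0, "payment_type": None},
    {"distance_km": 9.0, "payment_type": "card"},
]

# The median is computed from the known values only.
known = [r["distance_km"] for r in rows if r["distance_km"] is not None]
fill_value = median(known)

filled = [
    {
        # Numerical field: replace a missing value with the median.
        "distance_km": r["distance_km"] if r["distance_km"] is not None else fill_value,
        # Non-numerical field: isolate missing values in their own category.
        "payment_type": r["payment_type"] if r["payment_type"] is not None else "unknown",
    }
    for r in rows
]
print(filled)
```

The median is often preferred over the mean when the field contains extreme values, because a single outlier does not shift it as much.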

Practical example: Price forecast 📈

In what follows we will show how useful Data Cleaning can be with an example. To do this, we will carry out an analysis to forecast New York taxi fares within the island of Manhattan. This is one of the classic introductory problems on the Kaggle website.

For those of you who don’t know it: Kaggle is a very active Data Science community that hosts data analysis challenges. If you are interested in the world of Machine Learning, I suggest you take a look.

To run the example we will use Jupyter Notebook (I will prepare a separate article about the installation and use of Jupyter Notebook :D. Meanwhile, a good reference is the Jupyter page itself).

The data about the taxi movement will be obtained from the NYC OpenData page.

To be able to test within the Jupyter Notebook environment, a reduced version is required, in order to have acceptable run times. We also restricted the dataset to only Manhattan. To simplify the model, we are going to eliminate non-numerical fields (e.g. “payment_type”) and also those that by their nature have no correlation with the question.
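Dropping the non-numerical fields could be sketched as follows. Apart from “payment_type”, the column names and values are made up for illustration and do not reflect the actual layout of the NYC OpenData file.

```python
from numbers import Number

# Hypothetical sample rows; the real data set has many more columns.
rows = [
    {"trip_distance": 1.2, "passenger_count": 1, "fare_amount": 6.5, "payment_type": "card"},
    {"trip_distance": 3.4, "passenger_count": 2, "fare_amount": 12.0, "payment_type": "cash"},
]


def is_numeric(value):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, Number) and not isinstance(value, bool)


# A column is kept only if it holds a number in every row.
numeric_columns = [name for name in rows[0] if all(is_numeric(r[name]) for r in rows)]
numeric_rows = [{name: r[name] for name in numeric_columns} for r in rows]
print(numeric_columns)
```

Fields that are numerical but unrelated to the question still have to be removed by hand, since that decision depends on knowing the data.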


With this data, without any cleaning, we tried to generate a Random Forest model.


The error value that we obtain in this case should be as small as possible. A value of zero would mean that the forecast coincides exactly with the real fares; in that case we would be talking about Overfitting, which is another problem.

The result can be interpreted as follows: given all the other data, the model can predict the cost of the taxi ride with an error of about 1 dollar, which is not bad.
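To make an error "in dollars" concrete, here is a minimal sketch of one common measure, the root mean squared error. The article does not say which measure was used, so the choice of this one is an assumption, and the fares below are invented.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equally long lists of numbers."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("both lists must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Invented fares in dollars; every prediction is off by exactly one dollar.
actual_fares = [10.0, 12.0, 8.0, 15.0]
predicted_fares = [11.0, 11.0, 9.0, 14.0]

error = rmse(actual_fares, predicted_fares)
print(error)
```

Because the error is expressed in the same unit as the fare, it can be read directly as "the forecast is typically off by about this many dollars".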

Now let us see what happens if we clean up the data a bit. For example, we remove:

  • negative values of the amount
  • empty taxis (without passengers)
  • trips that are too long (1000 km inside Manhattan island is quite rare)
  • fares that seem absurd (we assume a limit of $410)
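The rules above can be sketched as a single filter. The field names are hypothetical, and using 1000 km as the distance cut-off is an assumption taken from the example in the list; only the $410 limit is stated as such in the text.

```python
MAX_DISTANCE_KM = 1000  # assumed cut-off, taken from the example in the list
MAX_FARE_USD = 410      # fare limit stated in the text


def is_plausible(trip):
    """Return True if the trip passes all four cleaning rules."""
    return (
        trip["fare_amount"] >= 0                       # no negative amounts
        and trip["passenger_count"] > 0                # no empty taxis
        and trip["trip_distance_km"] <= MAX_DISTANCE_KM  # no overly long trips
        and trip["fare_amount"] <= MAX_FARE_USD        # no absurd fares
    )


# Invented records: one valid trip and one violation of each rule.
trips = [
    {"trip_id": 1, "fare_amount": 12.5, "passenger_count": 1, "trip_distance_km": 3.2},
    {"trip_id": 2, "fare_amount": -4.0, "passenger_count": 1, "trip_distance_km": 1.0},
    {"trip_id": 3, "fare_amount": 9.0, "passenger_count": 0, "trip_distance_km": 2.5},
    {"trip_id": 4, "fare_amount": 30.0, "passenger_count": 2, "trip_distance_km": 1500.0},
    {"trip_id": 5, "fare_amount": 999.0, "passenger_count": 1, "trip_distance_km": 4.0},
]

clean_trips = [t for t in trips if is_plausible(t)]
print([t["trip_id"] for t in clean_trips])
```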

We then recalculate the model with the pre-processed data.


We now get a deviation of 0.17 dollars for our predictive model, without any kind of calibration. This is a much better result than the previous one, and it is in line with what could be expected.

Conclusion 😎

As we have seen, preparing and cleaning the data before generating the model leads to a significant improvement in the predictions the model achieves.

It is important to bear in mind that preparing the data is the most expensive part of the work, since it requires an exploratory analysis of the data and the processing of millions of entries to format them correctly.


Fabrizio Dutra

Santander Global Tech

Graduated in Physics and always in love with computers. Currently I dedicate more time to Big Data, but I like to investigate everything that I find interesting: Machine Learning, Cyber, Configuration Management, Terry Pratchett, Scrum, 3D printing …

 
