
Data collection and cleaning

Updated: May 2, 2024

Data collection and cleaning are crucial steps in the data analysis process. Without clean, good-quality data, any analysis can lead to erroneous conclusions and poor decisions. Throughout this report, I will share effective methods for collecting and cleaning data that guarantee its quality and integrity, based on my professional experience.


Professional Experience:


During my career as a data analyst, I had the opportunity to gain a deep understanding of the importance of data collection and cleansing in different corporate environments. Using SQL and Python skills, I developed customised scripts to automate the process of extracting data from various sources, ensuring data consistency and integrity over time.


Efficient Methods for Data Collection and Cleaning:


Defining Clear Objectives: In any data analysis project, it is essential to establish clear and specific objectives in order to guide the data collection process and ensure its relevance to business needs.

Using Reliable Sources: Selecting reliable and up-to-date data sources is essential to ensure the quality and accuracy of analyses. This can involve using internal company databases, integrating with third-party APIs or accessing quality public data sources.

Data collection automation: Automation is a powerful strategy for increasing efficiency and reducing errors in data collection. By developing customised scripts in Python or using ETL tools, it is possible to automate the extraction of data from various sources, speeding up the entire process.

Data standardisation and normalisation: Before starting any analysis, it is crucial to standardise and normalise the data to ensure its consistency and comparability. This can include correcting typos, uniformly formatting dates and normalising values as necessary.

Identification and Treatment of Missing Data: Detecting and handling missing data are key steps in preserving data integrity. Strategies such as imputing data based on existing patterns or removing incomplete records are essential for guaranteeing the quality of data sets.

Removing Duplicates: The presence of duplicate records can distort analyses and insights. It is therefore essential to identify and remove duplicates during the data cleansing process, thus guaranteeing the accuracy and reliability of the final results.


Data Set Used

Before we dive into the practical examples, let's take a look at the dataset we'll be using to demonstrate the collection and cleaning methods. This dataset was found on Kaggle under the title "E-Commerce Data". Link: https://www.kaggle.com/datasets/carrie1/ecommerce-data/data

The dataset contains transactions from an online retailer between 1 December 2010 and 9 December 2011. It includes the following columns:


- `InvoiceNo`: Invoice number.

- `StockCode`: Item code.

- `Description`: Description of the item.

- `Quantity`: Quantity of the item.

- `InvoiceDate`: Invoice date.

- `UnitPrice`: Unit price of the item.

- `CustomerID`: Customer ID.

- `Country`: Country of the customer.


Here's a visualisation of the first few rows of the data set:

| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... |

This data set will be used to illustrate the effective data collection and cleaning methods that we will discuss below.


Practical examples

1. Defining Clear Objectives: Let's assume that the objective of our project is to analyse the sales performance of products in a retail shop.

  • Project objectives: the aim of our project is to analyse the sales performance of products in a retail shop. To do this, we will exploit available data to understand sales patterns, identify the best-selling products, assess trends over time and ultimately provide actionable insights to improve sales performance and drive business growth.

This objective will be achieved by carrying out the following analyses:

1. Analysing the distribution of sales by product.

2. Identifying the best-selling products.

3. Evaluating sales trends over time.

4. Analysing the contribution of different product categories to total sales.

5. Investigating seasonal patterns or special events that affect sales.

With these clear objectives in mind, we can move on to the stages of collecting, cleaning and analysing data.
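To make these objectives concrete, here is a minimal preview of how objective 2 (identifying the best-selling products) might eventually be answered; the full analysis is left for the next article, and the sketch assumes the dataset has already been loaded into a DataFrame called df, as shown in the next step.

# Preview: total quantity sold per product, ten best-selling first
top_products = df.groupby('Description')['Quantity'].sum().nlargest(10)
print(top_products)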

2. Using Reliable Sources: Selecting reliable and up-to-date data sources is essential to guarantee the quality and accuracy of the analyses carried out. This means that we must seek data from reliable and verifiable sources that are aligned with the objectives of our project.


Data Sources Used: For our project to analyse the sales performance of products in a retail shop, we will make use of the data available on Kaggle.

  • The dataset contains detailed information on sales transactions, including invoice numbers, product codes, product descriptions, quantities, invoice dates, unit prices, customer IDs and countries.

  • By using this Kaggle data, we ensure that we are working with information that is reliable and relevant to the objectives of our project. In addition, it is important to maintain an ethical and legal approach to obtaining and using this data, ensuring respect for privacy and compliance with applicable regulations.

# Import the necessary libraries
import pandas as pd

# Load the CSV file downloaded from Kaggle
df = pd.read_csv('data.csv')
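After loading, it is worth confirming that the file was read as expected. The lines below are a quick inspection sketch; note that, in some environments, this particular CSV may need an explicit encoding (for example encoding='ISO-8859-1') passed to read_csv, which is an assumption to verify against your own download.

# Inspect the first rows and the column types to confirm the load
print(df.head())
print(df.info())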



3. Data Collection Automation: For this sales analysis project, since we are using a dataset that is publicly available on Kaggle and will only be used as an example, data collection will be done manually by downloading the dataset directly from Kaggle.


Whereas automation is useful when data must be collected systematically and repeatedly, manual collection is suitable when the data is accessed only occasionally or when automating the process is not worthwhile.


While collecting data manually may be simpler in terms of implementation, it is important to ensure that the data is downloaded from reliable sources and that it is aligned with the project's objectives. In addition, it is essential to keep clear records of where the data came from and when it was obtained for future reference.
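The record-keeping mentioned above can itself be handled with a few lines of code. The snippet below is a minimal sketch: the file name data_provenance.json is an arbitrary choice, and it simply stores the source URL, the file name and the download date for future reference.

import json
from datetime import date

# Record where the data came from and when it was obtained
provenance = {
    'source': 'https://www.kaggle.com/datasets/carrie1/ecommerce-data/data',
    'file': 'data.csv',
    'downloaded_on': date.today().isoformat(),
}

with open('data_provenance.json', 'w') as f:
    json.dump(provenance, f, indent=2)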


4. Data Standardisation and Normalisation: Before we begin the actual analysis, it is important to prepare the data to ensure that it is organised and ready to be explored. Two crucial steps in this process are data standardisation and normalisation.


Why standardise and normalise?

When dealing with data, it's common to find information presented in different ways. For example, product names may be written in different ways (with upper and lower case letters, abbreviations, etc.), and sales figures may be on different scales. Standardising and normalising data helps us to make it more consistent and comparable, making it easier to analyse.


Making standardisation and normalisation work:

1. Standardising Product Names: We'll start by ensuring that product names are consistent. This means transforming all names into lower or upper case letters, removing extra spaces and other irregularities. This way, we'll have a uniform product list that's easier to work with.

# Standardising product names (trim extra spaces and convert to upper case)
df['Description'] = df['Description'].str.strip().str.upper()
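The overview above also mentions formatting dates uniformly. Since our objectives include evaluating trends over time, it helps to convert InvoiceDate from text to a proper datetime type at this stage. This is a small sketch; the month/day/year format is assumed from the preview of the dataset shown earlier.

# Convert InvoiceDate from text to datetime for time-based analysis
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%Y %H:%M')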

2. Normalising sales figures: Next, we'll normalise the sales figures. This involves adjusting the values to a common scale, which helps us to compare the sales of different products in a more balanced way. For example, if we have products with very high prices and others with low prices, normalising the values allows us to analyse them more fairly.

# Normalise Quantity and UnitPrice to a common 0-1 scale (min-max scaling),
# keeping the results in new columns so the original values are preserved
df['QuantityNorm'] = (df['Quantity'] - df['Quantity'].min()) / (df['Quantity'].max() - df['Quantity'].min())
df['UnitPriceNorm'] = (df['UnitPrice'] - df['UnitPrice'].min()) / (df['UnitPrice'].max() - df['UnitPrice'].min())

By carrying out these data preparation steps, we ensure that we are starting our analysis on a solid and consistent footing. This will make it easier to understand and interpret the results as we progress through our product sales analysis.


5. Identifying and Handling Missing Data: One of the common challenges when dealing with datasets is dealing with missing data, also known as missing values. These are values that are missing from one or more columns of the dataset, which can jeopardise the quality of the analyses if not handled properly.


Why identify and treat missing data?

Missing data can occur for a variety of reasons, such as collection failures, data entry errors or intentional omissions. Regardless of the cause, it is important to identify and deal with these missing values in order to guarantee the accuracy and reliability of the analyses we carry out.


Identification and handling of missing data:

1. Identifying Missing Data: The first step is to identify which columns have missing data and in what proportion. This helps us understand the extent of the problem and decide on the best approach to deal with it.

2. Handling Missing Data: There are various strategies for dealing with missing data, such as filling in the missing values with averages or medians, removing the rows or columns that contain missing data, or using more advanced techniques such as model-based data imputation. The choice of strategy depends on the context of the data and the objectives of the analysis.

# Check for missing data
df.isnull().sum()
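As an illustration of the strategies described above, the sketch below applies two of them. Which columns actually need treatment should be decided from the isnull() report, so the column choices here are assumptions made for the example.

# Strategy 1: fill missing numeric values with the column average
df['UnitPrice'] = df['UnitPrice'].fillna(df['UnitPrice'].mean())

# Strategy 2: drop rows missing a value we cannot reconstruct,
# assuming CustomerID is required for the analysis
df = df.dropna(subset=['CustomerID'])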

By identifying and handling missing data appropriately, we can ensure that our analyses are based on complete and reliable data, allowing us to extract meaningful insights and make informed decisions.


6. Removing Duplicates: Another important aspect of data cleansing is dealing with duplicate records. Duplicates are repeated entries in the dataset, which can distort analyses and results if not handled properly.


Why remove duplicates?

Duplicate records can arise due to data entry errors, flaws in collection processes or the integration of different data sources. The presence of duplicates can lead to erroneous conclusions and distort analyses, jeopardising the quality of decisions based on the data.


Identifying and removing duplicates:

1. Identifying Duplicates: The first step is to identify whether there are duplicate records in the dataset. This can be done by checking for identical rows or by using specific criteria to determine duplicates, such as combinations of values in certain columns.

2. Removing Duplicates: Once duplicates have been identified, we can safely remove them. This usually involves keeping only the first occurrence of each set of duplicate records or applying additional criteria to select which duplicate to keep, depending on the specific needs of the analysis.

# Identify and remove duplicates
df.drop_duplicates(inplace=True)
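To quantify the problem before deleting anything, and to illustrate the additional criteria mentioned in step 2, the following sketch first counts fully identical rows and then deduplicates on a subset of columns; the column subset and the choice to keep the last occurrence are assumptions made for the example.

# Count fully identical rows before removing them
print(df.duplicated().sum())

# Alternative: define duplicates by a subset of columns and keep the last occurrence
df = df.drop_duplicates(subset=['InvoiceNo', 'StockCode', 'Quantity', 'InvoiceDate'], keep='last')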

By removing duplicates, we ensure that our data is clean and free of distortions, allowing our analyses to be more accurate and reliable.


Conclusion

Data collection and cleaning are fundamental steps in the data analysis process. By following simple methods such as establishing clear objectives, using reliable sources and applying data organisation and correction techniques, we can ensure that the data is ready for accurate and reliable analysis. Investing time in these initial steps is essential for extracting valuable insights and making informed decisions.

Our next step is to explore the data in more detail to identify patterns and trends that will help us achieve our goals. Collecting and cleaning data is just the beginning of our data analysis journey, but it is fundamental to building a solid foundation for future analyses and decision-making.

We'll move on to analysing the data in our next article, continuing to explore and learn from the information available.





