Welcome back to our project journey! In this installment, we'll explore the essential preliminary stages of our machine-learning pipeline: data ingestion and data cleaning. These stages lay the foundation for the accuracy and reliability of our predictive model by ensuring that our input data is of high quality and properly processed.
You can find the complete source code and project files on GitHub. Feel free to explore, fork, and contribute to the project:
Data ingestion and data cleaning are crucial steps in any data analysis or machine learning project. They lay the foundation for accurate, reliable, and meaningful insights and models.
Data ingestion is the foundational step in any data analysis or machine learning project. It involves collecting raw data from diverse sources, such as databases, APIs, files, or streaming platforms.
- Yahoo Finance provides a free API that allows users to access a wide range of financial data, including historical stock prices, company information, and market statistics.
- When retrieving stock data, users can specify parameters such as the stock ticker symbol, start date, end date, and frequency of data (daily, weekly, monthly).
- These parameters let users customize the data retrieval process to their specific requirements.
Here is how we get the stock data in our project:
import logging

import yfinance as yf

def download_stock_data(stock_name, start_date, end_date, output_path):
    """
    Download stock data from Yahoo Finance and save it to a CSV file.

    Parameters:
        stock_name (str): The ticker symbol of the stock.
        start_date (str): The start date for the historical data (YYYY-MM-DD).
        end_date (str): The end date for the historical data (YYYY-MM-DD).
        output_path (str): The path where the downloaded data will be saved.

    Raises:
        ValueError: If no data is available for the specified stock.

    Returns:
        None
    """
    logging.info(f"Downloading data for {stock_name} stock")
    try:
        # Retrieve stock data using yfinance
        stock_data = yf.download(stock_name, start=start_date, end=end_date)

        # Check if data was retrieved successfully
        if stock_data.empty:
            raise ValueError(f"No data available for the stock '{stock_name}'")

        # Save stock data
        stock_data.to_csv(output_path, index=True)
        logging.info(f"Stock data downloaded and saved to {output_path}")
    except Exception as e:
        logging.error(f"An error occurred while downloading stock data: {str(e)}")
This function downloads stock data from the Yahoo Finance API and saves it to a CSV file. It handles exceptions and logs the download process.
Here is an example of the stock data that can be downloaded:
Open High Low Close Adj Close Volume
Date
2022-01-03 177.509995 179.720001 177.309998 179.529999 179.052185 9872800
2022-01-04 180.949997 181.380005 177.100006 177.300003 176.836288 14024600
2022-01-05 176.789993 179.399994 175.750000 179.259995 178.790207 13988900
2022-01-06 178.800003 179.570007 176.809998 178.380005 177.913971 11479400
2022-01-07 178.360001 178.699997 175.309998 175.570007 175.113892 13302500
This data frame contains the following columns:
- Open: the stock's opening price.
- High: the highest price of the day.
- Low: the lowest price of the day.
- Close: the closing price.
- Adj Close: the adjusted closing price (accounting for splits and dividends).
- Volume: the trading volume.
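Once the CSV has been saved, it can be loaded back with pandas for the cleaning stage. A minimal sketch (the two rows below are sample values standing in for the real file, and `StringIO` substitutes for a file path):

```python
import pandas as pd
from io import StringIO

# Sample rows in the same shape as the CSV written by download_stock_data
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2022-01-03,177.509995,179.720001,177.309998,179.529999,179.052185,9872800
2022-01-04,180.949997,181.380005,177.100006,177.300003,176.836288,14024600
"""

# Parse the Date column as a datetime index, matching yfinance's output
df = pd.read_csv(StringIO(csv_text), parse_dates=["Date"], index_col="Date")
print(df["Close"].mean())
```

With a real file, `StringIO(csv_text)` would simply be replaced by the `output_path` passed to the download function.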
Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing values in the data. It ensures that the data is accurate, complete, and suitable for analysis.
- Clean data leads to more accurate and reliable insights and models.
- Data cleaning involves tasks such as removing duplicates, handling missing values, standardizing formats, and correcting errors.
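As a generic illustration (not our project's code, and with made-up sample values), each of these tasks maps onto a short pandas operation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Ticker ": ["AAPL", "AAPL", "MSFT"],
                   "Price": [150.0, 150.0, np.nan]})

df = df.rename(columns=str.strip)                     # standardize formats: trim header whitespace
df = df.drop_duplicates()                             # remove duplicate rows
df["Price"] = df["Price"].fillna(df["Price"].mean())  # handle missing values
print(df)
```

Our project uses interpolation rather than a mean fill, as shown next.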
Here is how we apply data cleaning steps in our project:
def clean_data(df):
    """
    Apply data cleaning steps to the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to be cleaned.

    Returns:
        pd.DataFrame: The cleaned DataFrame.
    """
    try:
        logging.info("Cleaning data...")
        # Standardize column names: lowercase, spaces -> underscores
        df.columns = df.columns.str.lower().str.replace(' ', '_')
        # Fill missing values in each numerical column by linear interpolation
        df['open'] = df['open'].interpolate(method='linear', limit_direction='both')
        df['high'] = df['high'].interpolate(method='linear', limit_direction='both')
        df['low'] = df['low'].interpolate(method='linear', limit_direction='both')
        df['close'] = df['close'].interpolate(method='linear', limit_direction='both')
        df['adj_close'] = df['adj_close'].interpolate(method='linear', limit_direction='both')
        df['volume'] = df['volume'].interpolate(method='linear', limit_direction='both')
        logging.info("Data cleaned successfully")
        return df
    except Exception as e:
        logging.error(f"Error occurred while cleaning data: {str(e)}")
        raise ValueError(f"Failed to clean data: {str(e)}")
- The function converts column names to lowercase and replaces spaces with underscores. This ensures uniformity and consistency in column names.
- It interpolates missing values in the numerical columns using linear interpolation, which estimates each missing value from its neighboring data points. The limit_direction='both' parameter ensures that missing values at the beginning and end of each column are also filled.
Example:
Date Value
2022-01-01 100
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 110
2022-01-05 NaN
2022-01-06 120
After applying linear interpolation, the data frame would be updated as follows:
Date Value
2022-01-01 100.00
2022-01-02 103.33 <-- Interpolated
2022-01-03 106.67 <-- Interpolated
2022-01-04 110.00
2022-01-05 115.00 <-- Interpolated
2022-01-06 120.00
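This behavior can be reproduced directly with pandas. A small self-contained sketch of the same series:

```python
import numpy as np
import pandas as pd

s = pd.Series([100, np.nan, np.nan, 110, np.nan, 120],
              index=pd.date_range("2022-01-01", periods=6))

# method='linear' treats the values as equally spaced and fills gaps
# from both neighbors; limit_direction='both' also fills leading and
# trailing NaNs when they occur
filled = s.interpolate(method="linear", limit_direction="both")
print(filled.round(2).tolist())  # [100.0, 103.33, 106.67, 110.0, 115.0, 120.0]
```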
In summary, data ingestion and data cleaning are crucial steps in any data analysis or machine learning project. They lay the foundation for accurate, reliable, and meaningful insights and models, ultimately driving informed decision-making and impactful outcomes.