Welcome back to our project journey! In this installment, we'll cover the crucial preliminary stages of our machine-learning pipeline: data ingestion and cleaning. These stages lay the foundation for the accuracy and reliability of our predictive model by ensuring that our input data is high quality and correctly processed.
You'll find all the source code and project files on GitHub. Feel free to explore, fork, and contribute to the project:
Data ingestion and data cleaning are essential steps in any data analysis or machine learning project. They lay the foundation for accurate, reliable, and meaningful insights and models.
Data ingestion is the foundational step in any data analysis or machine learning project. It involves collecting raw data from various sources, such as databases, APIs, files, or streaming platforms.
- Yahoo Finance offers a free API that allows users to access a wide range of financial data, including historical stock prices, company information, and market statistics.
- When retrieving stock data, users can specify parameters such as the stock ticker symbol, start date, end date, and frequency of data (daily, weekly, monthly).
- These parameters let users tailor the data retrieval process to their specific requirements, as the short sketch below illustrates.
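For instance, here is a minimal sketch of how those parameters map onto the yfinance library's download call (the ticker "AAPL" and the dates are just placeholders):

import yfinance as yf

# Ticker, date range, and frequency below are illustrative placeholders.
# interval accepts values such as "1d" (daily), "1wk" (weekly), "1mo" (monthly).
data = yf.download("AAPL", start="2022-01-01", end="2022-12-31", interval="1d")
print(data.head())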
Here is how we get the stock data in our project:
import logging

import yfinance as yf

def download_stock_data(stock_name, start_date, end_date, output_path):
    """
    Download stock data from Yahoo Finance and save it to a CSV file.

    Parameters:
        stock_name (str): The ticker symbol of the stock.
        start_date (str): The start date for the historical data (YYYY-MM-DD).
        end_date (str): The end date for the historical data (YYYY-MM-DD).
        output_path (str): The path where the downloaded data will be saved.

    Raises:
        ValueError: If no data is available for the specified stock.

    Returns:
        None
    """
    logging.info(f"Downloading data for {stock_name} stock")
    try:
        # Retrieve stock data using yfinance
        stock_data = yf.download(stock_name, start=start_date, end=end_date)

        # Check whether any data was retrieved
        if stock_data.empty:
            raise ValueError(f"No data available for the stock '{stock_name}'")

        # Save stock data
        stock_data.to_csv(output_path, index=True)
        logging.info(f"Stock data downloaded and saved to {output_path}")
    except Exception as e:
        logging.error(f"An error occurred while downloading stock data: {str(e)}")
This function downloads stock data from the Yahoo Finance API and saves it to a CSV file. It handles exceptions and logs the download process.
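A hypothetical invocation might look like this (the ticker and output path are placeholders for illustration):

import logging

logging.basicConfig(level=logging.INFO)

# Placeholder ticker, date range, and output file
download_stock_data("AAPL", "2022-01-01", "2022-12-31", "aapl_2022.csv")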
Here is an example of the stock data that might be downloaded.
Open High Low Close Adj Close Volume
Date
2022-01-03 177.509995 179.720001 177.309998 179.529999 179.052185 9872800
2022-01-04 180.949997 181.380005 177.100006 177.300003 176.836288 14024600
2022-01-05 176.789993 179.399994 175.750000 179.259995 178.790207 13988900
2022-01-06 178.800003 179.570007 176.809998 178.380005 177.913971 11479400
2022-01-07 178.360001 178.699997 175.309998 175.570007 175.113892 13302500
This data frame includes the following:
- Open: the stock's opening price.
- High: the highest price.
- Low: the lowest price.
- Close: the closing price.
- Adj Close: the closing price adjusted for dividends and stock splits.
- Volume: the trading volume.
Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing values in the data. It ensures that the data is accurate, complete, and suitable for analysis.
- Clean data leads to more accurate and reliable insights and models.
- Data cleaning involves tasks such as removing duplicates, handling missing values, standardizing formats, and correcting errors, as illustrated in the sketch after this list.
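As a generic illustration of those tasks, here is a standalone sketch with made-up data (not our project's code):

import pandas as pd

# A tiny, made-up frame to illustrate common cleaning tasks
df = pd.DataFrame({
    "ticker": ["aapl", "AAPL", "msft"],
    "price": [179.5, 179.5, None],
})

df["ticker"] = df["ticker"].str.upper()               # standardize formats
df = df.drop_duplicates()                             # remove duplicates
df["price"] = df["price"].fillna(df["price"].mean())  # handle missing values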
Here is how we apply data cleaning steps in our project:
def clean_data(df):
    """
    Apply data cleaning steps to the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to be cleaned.

    Returns:
        pd.DataFrame: The cleaned DataFrame.
    """
    try:
        logging.info("Cleaning data...")

        # Standardize column names: lowercase, with underscores instead of spaces
        df.columns = df.columns.str.lower().str.replace(' ', '_')

        # Fill missing values in each numerical column by linear interpolation
        for column in ['open', 'high', 'low', 'close', 'adj_close', 'volume']:
            df[column] = df[column].interpolate(method='linear', limit_direction='both')

        logging.info("Data cleaned successfully")
        return df
    except Exception as e:
        logging.error(f"Error occurred while cleaning data: {str(e)}")
        raise ValueError(f"Failed to clean data: {str(e)}")
- The function converts column names to lowercase and replaces spaces with underscores. This ensures uniformity and consistency in column names.
- It interpolates missing values in numerical columns using linear interpolation. This technique fills in missing values by estimating them from neighboring data points. The limit_direction='both' parameter ensures that missing values at the beginning and end of each column are also filled.
Example:
Date Price
2022-01-01 100
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 110
2022-01-05 NaN
2022-01-06 120
After applying linear interpolation, the data frame will be updated as follows:
Date Price
2022-01-01 100
2022-01-02 103.33 <-- Interpolated
2022-01-03 106.67 <-- Interpolated
2022-01-04 110
2022-01-05 115 <-- Interpolated
2022-01-06 120
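This behavior can be reproduced with a few lines of pandas (a standalone sketch, separate from the project code):

import numpy as np
import pandas as pd

prices = pd.Series([100.0, np.nan, np.nan, 110.0, np.nan, 120.0])
filled = prices.interpolate(method='linear', limit_direction='both')
print(filled)
# 0    100.000000
# 1    103.333333
# 2    106.666667
# 3    110.000000
# 4    115.000000
# 5    120.000000

Note that gaps spanning several rows are filled with evenly spaced values between the surrounding points, which is why the two missing days between 100 and 110 become 103.33 and 106.67 rather than a single repeated value.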
In summary, data ingestion and data cleaning are essential steps in any data analysis or machine learning project. They lay the foundation for accurate, reliable, and meaningful insights and models, ultimately driving informed decision-making and impactful outcomes.