Welcome back to our project journey! In this installment, we'll cover the crucial preliminary stages of our machine-learning pipeline: data ingestion and cleaning. These stages lay the foundation for the accuracy and reliability of our predictive model by ensuring that our input data is high quality and correctly processed.
You'll find all the source code and project files on GitHub. Feel free to explore, fork, and contribute to the project:
Data ingestion and data cleaning are essential steps in any data analysis or machine learning project. They lay the foundation for accurate, reliable, and meaningful insights and models.
Data ingestion is the foundational step in any data analysis or machine learning project. It involves collecting raw data from various sources, such as databases, APIs, files, or streaming platforms.
- Yahoo Finance offers a free API that allows users to access a wide range of financial data, including historical stock prices, company information, and market statistics.
- When retrieving stock data, users can specify parameters such as the stock ticker symbol, start date, end date, and frequency of data (daily, weekly, monthly).
- These parameters let users tailor the data retrieval process to their specific requirements, as the short sketch below illustrates.
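For instance, here is a minimal sketch of how those parameters map onto the yfinance library's download call (the ticker "AAPL" and the dates are just placeholders):

import yfinance as yf

# Ticker, date range, and frequency below are illustrative placeholders.
# interval accepts values such as "1d" (daily), "1wk" (weekly), "1mo" (monthly).
data = yf.download("AAPL", start="2022-01-01", end="2022-12-31", interval="1d")
print(data.head())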
Here is how we get the stock data in our project:
import logging

import yfinance as yf

def download_stock_data(stock_name, start_date, end_date, output_path):
    """
    Download stock data from Yahoo Finance and save it to a CSV file.

    Parameters:
        stock_name (str): The ticker symbol of the stock.
        start_date (str): The start date for the historical data (YYYY-MM-DD).
        end_date (str): The end date for the historical data (YYYY-MM-DD).
        output_path (str): The path where the downloaded data will be saved.

    Raises:
        ValueError: If no data is available for the specified stock.

    Returns:
        None
    """
    logging.info(f"Downloading data for {stock_name} stock")
    try:
        # Retrieve stock data using yfinance
        stock_data = yf.download(stock_name, start=start_date, end=end_date)

        # Check whether any data was retrieved
        if stock_data.empty:
            raise ValueError(f"No data available for the stock '{stock_name}'")

        # Save stock data
        stock_data.to_csv(output_path, index=True)
        logging.info(f"Stock data downloaded and saved to {output_path}")
    except Exception as e:
        logging.error(f"An error occurred while downloading stock data: {str(e)}")
This function downloads stock data from the Yahoo Finance API and saves it to a CSV file. It handles exceptions and logs the download process.
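A hypothetical invocation might look like this (the ticker and output path are placeholders for illustration):

import logging

logging.basicConfig(level=logging.INFO)

# Placeholder ticker, date range, and output file
download_stock_data("AAPL", "2022-01-01", "2022-12-31", "aapl_2022.csv")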
Here is an example of the stock data that might be downloaded.
Open High Low Close Adj Close Volume
Date
2022-01-03 177.509995 179.720001 177.309998 179.529999 179.052185 9872800
2022-01-04 180.949997 181.380005 177.100006 177.300003 176.836288 14024600
2022-01-05 176.789993 179.399994 175.750000 179.259995 178.790207 13988900
2022-01-06 178.800003 179.570007 176.809998 178.380005 177.913971 11479400
2022-01-07 178.360001 178.699997 175.309998 175.570007 175.113892 13302500
This data frame includes the following:
- Open: the stock's opening price.
- High: the highest price.
- Low: the lowest price.
- Close: the closing price.
- Adj Close: the closing price adjusted for dividends and stock splits.
- Volume: the trading volume.
Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing values in the data. It ensures that the data is accurate, complete, and suitable for analysis.
- Clean data leads to more accurate and reliable insights and models.
- Data cleaning involves tasks such as removing duplicates, handling missing values, standardizing formats, and correcting errors, as illustrated in the sketch after this list.
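As a generic illustration of those tasks, here is a standalone sketch with made-up data (not our project's code):

import pandas as pd

# A tiny, made-up frame to illustrate common cleaning tasks
df = pd.DataFrame({
    "ticker": ["aapl", "AAPL", "msft"],
    "price": [179.5, 179.5, None],
})

df["ticker"] = df["ticker"].str.upper()               # standardize formats
df = df.drop_duplicates()                             # remove duplicates
df["price"] = df["price"].fillna(df["price"].mean())  # handle missing values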
Here is how we apply data cleaning steps in our project:
def clean_data(df):
    """
    Apply data cleaning steps to the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to be cleaned.

    Returns:
        pd.DataFrame: The cleaned DataFrame.
    """
    try:
        logging.info("Cleaning data...")

        # Standardize column names: lowercase, with underscores instead of spaces
        df.columns = df.columns.str.lower().str.replace(' ', '_')

        # Fill missing values in each numerical column by linear interpolation
        for column in ['open', 'high', 'low', 'close', 'adj_close', 'volume']:
            df[column] = df[column].interpolate(method='linear', limit_direction='both')

        logging.info("Data cleaned successfully")
        return df
    except Exception as e:
        logging.error(f"Error occurred while cleaning data: {str(e)}")
        raise ValueError(f"Failed to clean data: {str(e)}")
- The function converts column names to lowercase and replaces spaces with underscores. This ensures uniformity and consistency in column names.
- It interpolates missing values in numerical columns using linear interpolation. This technique fills in missing values by estimating them from neighboring data points. The limit_direction='both' parameter ensures that missing values at the beginning and end of each column are also filled.
Example:
Date Price
2022-01-01 100
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 110
2022-01-05 NaN
2022-01-06 120
After applying linear interpolation, the data frame will be updated as follows:
Date Price
2022-01-01 100
2022-01-02 103.33 <-- Interpolated
2022-01-03 106.67 <-- Interpolated
2022-01-04 110
2022-01-05 115 <-- Interpolated
2022-01-06 120
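This behavior can be reproduced with a few lines of pandas (a standalone sketch, separate from the project code):

import numpy as np
import pandas as pd

prices = pd.Series([100.0, np.nan, np.nan, 110.0, np.nan, 120.0])
filled = prices.interpolate(method='linear', limit_direction='both')
print(filled)
# 0    100.000000
# 1    103.333333
# 2    106.666667
# 3    110.000000
# 4    115.000000
# 5    120.000000

Note that gaps spanning several rows are filled with evenly spaced values between the surrounding points, which is why the two missing days between 100 and 110 become 103.33 and 106.67 rather than a single repeated value.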
In summary, data ingestion and data cleaning are essential steps in any data analysis or machine learning project. They lay the foundation for accurate, reliable, and meaningful insights and models, ultimately driving informed decision-making and impactful outcomes.