In the world of machine learning (ML), efficiency and scalability are key. The pipeline design pattern has emerged as a powerful tool to streamline the ML workflow, reduce errors, and improve reproducibility. This blog will introduce you to the pipeline design pattern, highlight its benefits, and provide practical examples using scikit-learn that demonstrate its application to tabular data, text data, and a combination of both. We'll also explore how to create custom pipeline steps and how to perform hyperparameter tuning with GridSearchCV.
A pipeline in machine learning is a sequence of data processing steps organized in a specific order. It encapsulates the entire workflow, from data preprocessing to model evaluation, ensuring that each step is executed consistently and efficiently.
- Modularity and Reusability: Pipelines let you encapsulate steps, making your code modular and reusable across different projects.
- Streamlined Workflow: Ensures each step in the ML process runs in the correct order, simplifying the workflow.
- Error Reduction: Standardizes the process, reducing manual errors.
- Hyperparameter Tuning: Facilitates tuning across all stages, from preprocessing to model parameters.
- Consistency and Reproducibility: Ensures the same steps are applied to both training and test data, improving reliability (see the sketch just below).
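To make the last point concrete, here is a minimal sketch (with small hypothetical arrays) of how a fitted pipeline replays its preprocessing on new data: calling predict runs the scaler that was fitted on the training data before the model sees the new rows, so the scaler is never accidentally refitted on test data.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, just to illustrate the flow
X_train = [[1.0], [2.0], [3.0], [4.0]]
y_train = [0, 0, 1, 1]
X_new = [[2.5]]

pipe = Pipeline([('scaler', StandardScaler()), ('logreg', LogisticRegression())])
pipe.fit(X_train, y_train)    # fits the scaler, then fits the model on scaled data
print(pipe.predict(X_new))    # applies the already-fitted scaler before predicting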
Next, let's look at a simple example of using a pipeline with tabular data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Sample tabular data
data = {
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [10, 20, 30, 40, 50],
    'label': [0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)

# Split the data
X = df[['feature1', 'feature2']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate the model
score = pipeline.score(X_test, y_test)
print(f'Model Accuracy: {score:.2f}')
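Once fitted, the same pipeline object can predict on unseen rows, applying the scaler automatically. Continuing the snippet above with a hypothetical new sample:

# Predict on a new, unseen row; scaling is applied automatically
new_sample = pd.DataFrame({'feature1': [6], 'feature2': [60]})
print(pipeline.predict(new_sample))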
Next, let's create a pipeline for processing text data.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample text data
texts = ["I love machine learning", "Machine learning is great", "I hate spam emails", "Spam emails are annoying"]
labels = [1, 1, 0, 0]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Define the pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate the model
score = pipeline.score(X_test, y_test)
print(f'Model Accuracy: {score:.2f}')
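The fitted text pipeline handles TF-IDF vectorization transparently at prediction time as well. Continuing the snippet above with a couple of hypothetical new sentences:

# New, unseen sentences; TF-IDF vectorization is applied automatically
new_texts = ["machine learning is amazing", "this spam is annoying"]
print(pipeline.predict(new_texts))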
Now, let's combine tabular and text data in a single pipeline.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Sample mixed data
data = {
    'numeric_feature': [1, 2, 3, 4, 5],
    'text_feature': ["I love ML", "ML is great", "I hate spam", "Spam is bad", "ML is useful"],
    'label': [1, 1, 0, 0, 1]
}
df = pd.DataFrame(data)

# Split the data
X = df[['numeric_feature', 'text_feature']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['numeric_feature']),
        ('text', TfidfVectorizer(), 'text_feature')
    ])

# Define the pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate the model
score = pipeline.score(X_test, y_test)
print(f'Model Accuracy: {score:.2f}')
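If you want to see which features the ColumnTransformer actually produced, you can inspect the fitted preprocessor inside the pipeline. A short sketch, assuming scikit-learn 1.0 or newer (where get_feature_names_out is available):

# Inspect the combined feature space produced by the ColumnTransformer
fitted_preprocessor = pipeline.named_steps['preprocessor']
print(fitted_preprocessor.get_feature_names_out())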
Next, let's create a custom pipeline step.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer that adds a constant feature
class AddConstantFeature(BaseEstimator, TransformerMixin):
    def __init__(self, constant=1):
        self.constant = constant

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X['constant_feature'] = self.constant
        return X

# Sample data
data = {
    'feature1': [1, 2, 3, 4, 5],
    'label': [0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)

# Split the data
X = df[['feature1']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the pipeline with the custom step
pipeline = Pipeline([
    ('add_constant', AddConstantFeature()),
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate the model
score = pipeline.score(X_test, y_test)
print(f'Model Accuracy: {score:.2f}')
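As a side note, for simple stateless transformations like this one, scikit-learn's FunctionTransformer can achieve the same effect without writing a full class. A minimal equivalent sketch:

from sklearn.preprocessing import FunctionTransformer

def add_constant(X, constant=1):
    X = X.copy()
    X['constant_feature'] = constant
    return X

# Drop-in replacement for the AddConstantFeature step above
add_constant_step = FunctionTransformer(add_constant, kw_args={'constant': 1})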
Finally, let's tune hyperparameters across the whole pipeline with GridSearchCV. Parameters are addressed by step name, a double underscore, and then the parameter name (for example, logreg__C targets the C parameter of the 'logreg' step).

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Expanded sample tabular data
data = {
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'label': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

# Split the data
X = df[['feature1', 'feature2']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])

# Define the parameter grid (step name, double underscore, parameter name)
param_grid = {
    'scaler__with_mean': [True, False],
    'logreg__C': [0.1, 1, 10]
}

# Perform GridSearchCV with 3 splits
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Report the best cross-validation result
best_score = grid_search.best_score_
best_params = grid_search.best_params_
print(f'Best Model Accuracy: {best_score:.2f}')
print(f'Best Parameters: {best_params}')
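Note that best_score_ is the mean cross-validation accuracy on the training folds, not a test-set score. Since GridSearchCV refits the best pipeline on the full training set by default, you can score it directly on the held-out test split:

# Score the refitted best pipeline on the held-out test data
test_score = grid_search.score(X_test, y_test)
print(f'Test Accuracy: {test_score:.2f}')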
The pipeline design pattern is a powerful tool in machine learning, providing modularity, streamlining workflows, reducing errors, and enabling end-to-end hyperparameter tuning. By incorporating pipelines into your ML projects, you can improve consistency, reproducibility, and efficiency.
Pipelines are widely used in machine learning libraries and frameworks such as scikit-learn, Apache Spark, spaCy, and Hugging Face, demonstrating their versatility and importance in the ML ecosystem.
Stay tuned for my upcoming blogs, where I will explore how to scale machine learning pipelines using Apache Spark to handle large datasets efficiently.
We hope this blog has given you a clear understanding of the pipeline design pattern and its practical applications using scikit-learn. Happy coding!