Kenang Laverda & Muhammad Fitrah
Departemen Statistika, Fakultas Sains dan Analitika Data, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
ABSTRACT
Australia is a continent known for its extreme climate and high weather variability, where rainfall affects sectors such as agriculture, water resource management, and urban planning. Accurate rain prediction is therefore essential to support these activities. This study applies machine learning to predict rainfall events in Australia using the "Rain in Australia" dataset from Kaggle. Four classification methods are analyzed: Logistic Regression, Random Forest, Extreme Gradient Boosting (XGBoost), and Decision Tree. The evaluation shows that XGBoost delivers the best performance, with an accuracy of 89.8% and an F1 score of 88.73%, along with an efficient fit time of 1.71 seconds. XGBoost is therefore recommended as the primary model for rain prediction in Australia. This study also emphasizes the importance of further hyperparameter optimization and of continued evaluation on larger, more varied datasets to ensure the model remains robust and reliable under diverse weather conditions. The results are expected to contribute significantly to better planning and resource management in Australia.
Hello everyone! This time I will apply classification analysis methods to the "Rain in Australia" dataset. The methods used are Decision Tree, Random Forest, XGBoost, and Logistic Regression. The goal of this exercise is to determine which method works best for this case study.
The dataset contains 19 variables, comprising continuous, discrete, and categorical variables. The first step is:
1. IMPORT DATA
# importing libraries
# general purpose libraries
import numpy as np
import pandas as pd
# for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# for statistical analysis
import time
from statsmodels.stats.outliers_influence import variance_inflation_factor
# for preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
# for train test split
from sklearn.model_selection import train_test_split
# for oversampling
from imblearn.over_sampling import SMOTE
# for models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# for evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, recall_score, precision_score, f1_score
%matplotlib inline
# to hide any warnings
import warnings
warnings.filterwarnings('ignore')
# to make sure all columns are displayed
pd.set_option('display.max_columns', None)
# Load the dataset
df = pd.read_csv('Data FP SML.csv', sep=';', header=0)
# Preview the dataset
df.head()
2. CHECK DATA TYPES
# checking the data types of the columns
df.dtypes
# Pressure3pm is read as object, so coerce it to numeric
df['Pressure3pm'] = pd.to_numeric(df['Pressure3pm'], errors='coerce')
# re-checking the data types of the columns
df.dtypes
3. PRE-PROCESSING (Handling Missing Values, Outliers, Imbalanced Data, and Correlation)
# check for missing values
df.isna().sum()
# Get the unique entries from the Location column of the dataset
unique_location = df['Location'].unique()
# Iterate over every numeric column that has missing values
for column in df.select_dtypes(exclude='object'):
    if df[column].isnull().any():
        # Iterate over every unique location
        for location in unique_location:
            # Calculate the median for this location and column
            location_median = df[df['Location'] == location][column].median()
            # Fill missing values with the median of the specific location
            df.loc[df['Location'] == location, column] = df[df['Location'] == location][column].fillna(location_median)
# Iterate over every object-type (categorical) column that has missing values
for column in df.select_dtypes(include='object'):
    if df[column].isnull().any():
        # Iterate over every unique location
        for location in unique_location:
            # Calculate the mode for this location and column
            location_mode = df[df['Location'] == location][column].mode()[0]
            # Fill missing values with the mode of the specific location
            df.loc[df['Location'] == location, column] = df[df['Location'] == location][column].fillna(location_mode)
# dropping the remaining rows with missing values
df.dropna(subset=['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow',
                  'Pressure9am', 'Pressure3pm'], inplace=True, axis=0)
# resetting the index of the dataframe
df.reset_index(drop=True, inplace=True)
# confirming the result
df.isnull().sum()
# Select numeric columns, excluding 'object' and 'int32' dtypes
numeric_cols = df.select_dtypes(exclude=['object', 'int32']).columns
# defining the figure size
plt.figure(figsize=(25, 15))
# drawing one boxplot per subplot
for i, column in enumerate(numeric_cols):
    plt.subplot(4, 4, i + 1)
    # plotting the boxplot
    sns.boxplot(y=df[column])
    # using the column name as the title of each boxplot
    plt.title(f'The boxplot of {column}')
# adjusting the spacing between subplots
plt.tight_layout()
plt.show()
# List of columns that have outliers
col = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
       'Humidity9am', 'Pressure3pm', 'Pressure9am', 'Temp9am', 'Temp3pm']
# Convert the columns to numeric, coercing errors to NaN
for i in col:
    df[i] = pd.to_numeric(df[i], errors='coerce')
# For loop for replacing outliers
for i in col:
    # Find Q1 and Q3
    q1 = df[i].quantile(0.25)
    q3 = df[i].quantile(0.75)
    # Find the IQR
    iqr = q3 - q1
    # Find the lower and upper limits
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Find the median of the column
    median = df[i].median()
    # Replace outliers with the median
    df[i] = np.where((df[i] < lower_limit) | (df[i] > upper_limit), median, df[i])
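The figure referred to below is not reproduced in this excerpt; it can be regenerated by simply redrawing the boxplots after the replacement step, for example:
# redrawing the boxplots to verify the effect of the outlier replacement
plt.figure(figsize=(25, 15))
for i, column in enumerate(numeric_cols):
    plt.subplot(4, 4, i + 1)
    sns.boxplot(y=df[column])
    plt.title(f'The boxplot of {column} (after outlier handling)')
plt.tight_layout()
plt.show()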
The boxplots show that the outliers in each variable have been reduced, although some remain in Rainfall, WindGustSpeed, Humidity9am, Pressure9am, and Pressure3pm. These are ignored because they are few in number and therefore have no significant effect on the analysis.
# IMBALANCED DATA (OVERSAMPLING)
# checking the value counts of `RainTomorrow`
df['RainTomorrow'].value_counts()
# bar plot of the counts of the unique values of `RainTomorrow`
df['RainTomorrow'].value_counts().plot(kind='bar')
# setting up the title
plt.title('Bar Plot of value counts of RainTomorrow Before Oversampling')
# keeping the x axis labels horizontal
plt.xticks(rotation=0)
plt.xlabel('RainTomorrow (0 = No rain, 1 = Rain)')
plt.show()
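SMOTE requires numeric inputs, and LabelEncoder is imported above although the encoding step itself does not appear in this excerpt. A minimal sketch of how the categorical columns were presumably encoded before resampling:
# assumed step: encoding the object-type columns to integers before SMOTE
le = LabelEncoder()
for column in df.select_dtypes(include='object'):
    df[column] = le.fit_transform(df[column])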
# HANDLING
# defining X and y
X_resample = df.drop('RainTomorrow', axis=1)
y_resample = df['RainTomorrow']
# splitting the data into train and test sets
X_train_resample, X_test_resample, y_train_resample, y_test_resample = train_test_split(
    X_resample, y_resample, test_size=0.2, random_state=42)
# calling SMOTE
smote = SMOTE(random_state=42)
# applying SMOTE to the training data only
X_train_resample, y_train_resample = smote.fit_resample(X_train_resample, y_train_resample)
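To confirm that the oversampling balanced the classes, the counts can be re-checked on the resampled training labels:
# verifying the class balance after SMOTE
print(y_train_resample.value_counts())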
# CHECKING CORRELATION
# Heatmap of the correlation matrix of the numeric features (excluding day, month & year)
# calculating the correlation matrix
corr_matrix = df.select_dtypes(include='float64').corr()
# defining the figure size
plt.figure(figsize=(16, 6))
# plotting the heatmap
sns.heatmap(corr_matrix, cbar=False, cmap='viridis', annot=True, fmt='.2f', linewidths=0.5, linecolor='black')
# rotating the x axis labels
plt.xticks(rotation=45)
# setting up the title
plt.title('Correlation Matrix of Numeric Features')
plt.show()
4. DEFINE FEATURES AND TARGET
The next step is to split the data into training and test sets. Before doing so, the features must be defined, which helps prevent overfitting and reduces multicollinearity: variables with a high VIF are dropped from the feature set. The VIF of each variable was obtained as follows.
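The VIF table itself is not reproduced in this excerpt; below is a minimal sketch of the computation, using the variance_inflation_factor import from above and assuming the features have already been encoded to numeric values:
# computing the VIF of every candidate feature
features = df.drop('RainTomorrow', axis=1).select_dtypes(exclude='object')
vif = pd.DataFrame({
    'Variable': features.columns,
    'VIF': [variance_inflation_factor(features.values, i) for i in range(features.shape[1])]
})
print(vif.sort_values('VIF', ascending=False))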
The variables dropped from the feature set are Temp3pm, Temp9am, and Pressure3pm. Table 6 shows that MinTemp has a VIF of 42.40, but this variable is not dropped because, together with MaxTemp, it plays an important role in the analysis. Once the features are defined, the data are split into 80% training data and 20% test data, as sketched below.
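The drop and the feature/target definition are not shown in the excerpt; a minimal sketch consistent with the text (X and y are the names used by the models below):
# assumed step: dropping the high-VIF variables and defining features and target
X = df.drop(['RainTomorrow', 'Temp3pm', 'Temp9am', 'Pressure3pm'], axis=1)
y = df['RainTomorrow']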
5. CLASSIFICATION
a. Logistic Regression
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# calling the logistic regression
model = LogisticRegression()
# applying logistic regression
model.fit(X_train, y_train)
# predicting values
y_pred = model.predict(X_test)
# checking the accuracy score
print('Accuracy Score :', accuracy_score(y_test, y_pred), '\n')
# checking the recall score
print('Recall Score :', recall_score(y_test, y_pred), '\n')
# checking the precision score
print('Precision Score :', precision_score(y_test, y_pred), '\n')
# checking the f1 score
print('F1 Score :', f1_score(y_test, y_pred), '\n')
# checking the classification report
print('Classification Report :', '\n', classification_report(y_test, y_pred))
# confusion matrix heatmap (rows are actual labels, columns are predictions)
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='.2f',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'], cbar=False)
# setting up the x and y labels
plt.xlabel('Predicted')
plt.ylabel('Actual')
# setting up the title
plt.title('Confusion Matrix of Logistic Regression')
plt.show()
b. Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, classification_report
# optional imports for visualizing individual trees
from sklearn.tree import export_graphviz
import graphviz
import pydotplus
from IPython.display import Image

# Calling the random forest
rf = RandomForestClassifier(random_state=42)
# Applying random forest
rf.fit(X_train, y_train)
# predicting values
y_pred = rf.predict(X_test)
# checking the accuracy score
print('Accuracy Score :', accuracy_score(y_test, y_pred), '\n')
# checking the recall score
print('Recall Score :', recall_score(y_test, y_pred), '\n')
# checking the precision score
print('Precision Score :', precision_score(y_test, y_pred), '\n')
# checking the f1 score
print('F1 Score :', f1_score(y_test, y_pred), '\n')
# checking the classification report
print('Classification Report :', '\n', classification_report(y_test, y_pred))
# confusion matrix heatmap (rows are actual labels, columns are predictions)
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='.2f',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'], cbar=False)
# setting up the x and y labels
plt.xlabel('Predicted')
plt.ylabel('Actual')
# setting up the title
plt.title('Confusion Matrix of Random Forest Classifier')
plt.show()
Checking the accuracy on the training & test data
# calculating the train accuracy
y_pred_train = rf.predict(X_train)
print(accuracy_score(y_train, y_pred_train))
# calculating the test accuracy
y_pred_test = rf.predict(X_test)
print(accuracy_score(y_test, y_pred_test))
Since the gap between the training and test accuracy indicates overfitting, hyperparameter tuning is performed.
# calling the random forest and adjusting the parameters
model = RandomForestClassifier(
    n_estimators=150,      # Number of trees
    max_depth=10,          # Maximum depth of each tree
    min_samples_split=10,  # Minimum samples required to split an internal node
    min_samples_leaf=5,    # Minimum samples required at a leaf node
    max_features=5,        # Number of features considered when looking for the best split
    criterion='entropy',   # Splitting criterion
    random_state=42
)
# fitting the model
model.fit(X_train, y_train)
# calculating the train and test accuracy
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# printing the train and test accuracy
print(f"Train accuracy: {accuracy_score(y_train, y_pred_train)}")
print(f"Test accuracy: {accuracy_score(y_test, y_pred_test)}")
c. Extreme Gradient Boosting (XGBoost)
# calling the xgboost classifier
xgb = XGBClassifier(random_state=42)
# fitting the model
xgb.fit(X_train, y_train)
# predicting values
y_pred = xgb.predict(X_test)
# checking the accuracy score
print('Accuracy Score :', accuracy_score(y_test, y_pred), '\n')
# checking the recall score
print('Recall Score :', recall_score(y_test, y_pred), '\n')
# checking the precision score
print('Precision Score :', precision_score(y_test, y_pred), '\n')
# checking the f1 score
print('F1 Score :', f1_score(y_test, y_pred), '\n')
# checking the classification report
print('Classification Report :', '\n', classification_report(y_test, y_pred))
# confusion matrix heatmap (rows are actual labels, columns are predictions)
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='.2f',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'], cbar=False)
plt.title('Confusion Matrix of XGBoost Classifier')
plt.show()
d. Decision Tree
from sklearn.tree import DecisionTreeClassifier

# Calling the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)
# Fitting the model
dt.fit(X_train, y_train)
# Predicting values
y_pred = dt.predict(X_test)
# Checking the accuracy score
print('Accuracy Score :', accuracy_score(y_test, y_pred), '\n')
# Checking the recall score
print('Recall Score :', recall_score(y_test, y_pred), '\n')
# Checking the precision score
print('Precision Score :', precision_score(y_test, y_pred), '\n')
# Checking the f1 score
print('F1 Score :', f1_score(y_test, y_pred), '\n')
# Checking the classification report
print('Classification Report :', '\n', classification_report(y_test, y_pred))
# Confusion matrix heatmap (rows are actual labels, columns are predictions)
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='.2f',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'], cbar=False)
plt.title('Confusion Matrix of Decision Tree Classifier')
plt.show()
e. Overall Comparison of the Methods
# Importing the necessary libraries for the models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Creating an empty list to store the results
results = []
# Creating a dictionary of models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42)
}
# Fitting each model and evaluating its performance
for name, model in models.items():
    # Timing the fit process
    start_fit = time.time()
    # Fitting the model
    model.fit(X_train, y_train)
    end_fit = time.time()
    # Calculating the fit time for each model
    fit_time = end_fit - start_fit
    # Taking predictions from the model on the test data
    y_pred = model.predict(X_test)
    # Calculating the accuracy, precision, and recall scores
    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    # Appending the result to the list
    results.append((name, acc, precision, recall, fit_time))
# Printing the evaluation result of every model
result_df = pd.DataFrame(results, columns=['Model', 'Accuracy Score', 'Precision', 'Recall', 'Fit Time (s)'])
print(result_df)
From the table it can be seen that:
· The model with the highest accuracy is XGBoost (0.898411).
· The model with the highest precision is XGBoost (0.916866).
· The model with the highest recall is Random Forest (0.873350), although XGBoost also achieves a very good recall (0.859569).
· The model with the fastest fit time is XGBoost (1.708394 seconds).
import matplotlib.pyplot as plt

# Data for the comparison charts
model_names = ['Logistic Regression', 'Random Forest', 'XGBoost', 'Decision Tree']
accuracy = [0.769956, 0.889137, 0.898411, 0.822644]
f1_scores = [0.748807, 0.879451, 0.887053, 0.809980]  # renamed so the f1_score function is not shadowed

# Bar chart of accuracy
plt.figure(figsize=(10, 6))
bars = plt.bar(model_names, accuracy, color=['blue', 'green', 'red', 'purple'])
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Accuracy Comparison Across Models')
plt.ylim(0, 1)
# Adding the percentage above each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, yval, f'{yval*100:.2f}%', ha='center', va='bottom')
plt.show()

# Bar chart of F1 score
plt.figure(figsize=(10, 6))
bars = plt.bar(model_names, f1_scores, color=['blue', 'green', 'red', 'purple'])
plt.xlabel('Model')
plt.ylabel('F1 Score')
plt.title('F1 Score Comparison Across Models')
plt.ylim(0, 1)
# Adding the percentage above each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, yval, f'{yval*100:.2f}%', ha='center', va='bottom')
plt.show()
6. CONCLUSION
Based on this study, which set out to determine the best classification method for predicting rain in Australia using the "Rain in Australia" dataset, evaluation results were obtained for four classification models: Logistic Regression, Random Forest, Extreme Gradient Boosting (XGBoost), and Decision Tree. The analysis shows that XGBoost delivers the best performance, with the highest accuracy of 89.8% and the highest F1 score of 88.73%. In addition, XGBoost also showed the most efficient fit time, at 1.71 seconds.
It can therefore be concluded that XGBoost is the best model for predicting next-day rain in Australia. The model is not only highly accurate but also efficient in processing time, so XGBoost is recommended as the primary model for this classification task. Even though XGBoost already shows the best performance, further hyperparameter optimization can be carried out to ensure the model reaches its maximum performance. Continued evaluation on larger and more varied datasets is also needed to make sure the model remains robust and reliable under diverse conditions.
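As a starting point for the further hyperparameter optimization suggested above, a randomized search over a small XGBoost grid could look like the sketch below; the parameter ranges are illustrative assumptions, not values validated in this study.
from sklearn.model_selection import RandomizedSearchCV

# illustrative search space (assumed values, not tuned in this study)
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.7, 0.85, 1.0],
}
search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10, scoring='f1', cv=3, random_state=42
)
search.fit(X_train, y_train)
print('Best parameters :', search.best_params_)
print('Best CV F1 score:', search.best_score_)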