INTRODUCTION
This report is based on the Titanic dataset from Kaggle (https://www.kaggle.com/c/titanic/data). The primary aim of this technical report is to explore the dataset and develop a predictive model that predicts the survival rate of passengers on the Titanic.
For this report, I used two Python libraries to support my analysis. I used Pandas to read, understand and get insights from the data, and I used the Seaborn library to visualise the data.
From the Extended Data Dictionary (EDD), I observed that there are 12 columns in the dataset, with 7 numerical columns and 5 categorical columns:
Numerical Data:
· PassengerId
· Survived
· Pclass
· Age
· SibSp
· Parch
· Fare
Categorical Data:
· Name
· Sex
· Ticket
· Cabin
· Embarked
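A minimal sketch of how the numerical/categorical split above could be reproduced with Pandas. The small DataFrame is a hand-made stand-in (toy rows, not records from the Kaggle file), and the variable names are my own; on the real data the frame would come from reading the downloaded CSV instead.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT records from the Kaggle file.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, float("nan"), 30.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Columns with a numeric dtype are treated as numerical,
# everything else as categorical.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical)
print("Categorical:", categorical)
```

Note that a dtype-based split treats PassengerId, Survived and Pclass as numerical even though they are identifiers or codes, which matches the grouping used in this report.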
OBSERVATION
By merely looking at the data, I was able to observe that there were 891 passengers on the Titanic and that the Sex column is highly related to the Survived column, as most of the survivors are women.
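The relationship between Sex and Survived can be checked with a group-wise mean. This is a sketch on a toy frame (the rows are invented for illustration, so the numbers are not the report's results); the column names follow the dataset.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the Titanic dataset.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# Number of passengers is the number of rows.
n_passengers = len(df)

# Survived is coded 0/1, so its mean per group is the survival rate.
survival_by_sex = df.groupby("Sex")["Survived"].mean()

print(n_passengers)
print(survival_by_sex)
```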
From the Extended Data Dictionary (EDD), I made the following observations:
Missing Values:
The EDD returned a count of the values in each column, and from that count I was able to determine which columns had missing values. They include:
· Age
· Cabin
· Embarked
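The count-based check described above can be sketched as follows: a column whose non-null count is lower than the number of rows has missing values. The frame is a toy example with invented rows, and the variable names are my own.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the Titanic dataset.
df = pd.DataFrame({
    "Age": [22.0, float("nan"), 30.0, float("nan")],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Pclass": [3, 1, 2, 3],
})

# count() returns the number of non-null values per column.
non_null = df.count()

# Any shortfall against the row count is the number of missing values.
missing = len(df) - non_null
cols_with_missing = missing[missing > 0].index.tolist()

print(missing)
print(cols_with_missing)
```

The same numbers can be obtained directly with `df.isna().sum()`.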
Possible Outliers:
I also noticed possible outliers in some columns because of the jump in values between the 75th and the 100th percentile. This was noticed in the following columns:
· Age
· SibSp
· Parch
· Fare
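One way to make the "jump between the 75th and the 100th percentile" concrete is to compare that gap with the interquartile range. The sketch below uses toy values, and the 1.5 x IQR threshold is a common rule of thumb that I am assuming for illustration; it is not a rule taken from the EDD.

```python
import pandas as pd

# Toy values for illustration only; not taken from the Titanic dataset.
df = pd.DataFrame({
    "Fare": [7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 500.0],
    "Pclass": [3, 3, 3, 2, 2, 2, 1, 1],
})

# describe() reports the 25th, 50th, 75th percentiles and the max (100th).
summary = df.describe().loc[["25%", "50%", "75%", "max"]]

iqr = summary.loc["75%"] - summary.loc["25%"]
jump = summary.loc["max"] - summary.loc["75%"]

# Assumed rule of thumb: flag a column when the max lies more than
# 1.5 * IQR above the 75th percentile.
suspect = jump[jump > 1.5 * iqr].index.tolist()

print(summary)
print(suspect)
```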
CONCLUSION
From the dataset, I observed missing values in a few columns. They can be treated by replacing the missing values with either the median or the mode of the column. The mean can also be used, but it is distorted by any outliers in the column, so it is the less reliable choice here.
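A minimal sketch of this treatment, assuming the median is used for a numerical column and the mode for a categorical one. The rows are toy values chosen so that one large Age pulls the mean well above the median.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the Titanic dataset.
df = pd.DataFrame({
    "Age": [22.0, float("nan"), 30.0, 80.0],
    "Embarked": ["S", "C", None, "S"],
})

# Both statistics ignore missing values by default.
age_mean = df["Age"].mean()      # pulled upwards by the value 80
age_median = df["Age"].median()  # robust to that value

# mode() returns a Series (there can be ties); take the first entry.
embarked_mode = df["Embarked"].mode().iloc[0]

df["Age"] = df["Age"].fillna(age_median)
df["Embarked"] = df["Embarked"].fillna(embarked_mode)

print(age_mean, age_median, embarked_mode)
print(df)
```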
I also noticed outliers in certain columns, and they can be treated by replacing the outliers with either the 0th or the 99th percentile of the column.
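Replacing outliers with percentile values can be sketched with `clip`, which sets anything below the lower bound or above the upper bound to that bound. The values are toy data and the variable names are my own. Since the 0th percentile is simply the column minimum, only the upper bound changes anything in this example.

```python
import pandas as pd

# Toy values for illustration only; not taken from the Titanic dataset.
fare = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 500.0], name="Fare")

lower = fare.quantile(0.0)   # 0th percentile, i.e. the minimum
upper = fare.quantile(0.99)  # 99th percentile

# Values outside [lower, upper] are replaced by the nearest bound.
capped = fare.clip(lower=lower, upper=upper)

print(lower, upper)
print(capped)
```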