This post explains how machine learning models can be used to predict the survival of passengers who were on board the Titanic on that fateful day. I intend to walk you through the basics by focusing, step by step, on what is involved in building any prediction model. This and future articles aim to help beginners approach any machine learning problem. I will be using a Jupyter notebook to run the Python code.
The problem statement is well known in the data science community, and the dataset is one of the most worked on. It comes from a competition hosted by Kaggle; the details of the competition and the dataset itself can be found here.
The process will be divided into two parts:
Part 1 — Data cleaning
Part 2 — Model Building
Full Jupyter notebook with code can be found on GitHub here.
Part — 1 : Data Cleaning
Importing data and getting a bird’s eye view
The best way to start is to understand the problem we are supposed to solve. That may sound vague, so let me break it down for you.
The dataset (available at the link mentioned above) comes in two parts, both in CSV format. Let us import them and start digging in!
Importing the datasets
Checking the shape of each file
Taking a glance at the dataset
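The import steps might look roughly like the following. This is a minimal sketch rather than the notebook's exact code: the two small in-memory CSV strings stand in for the Kaggle files (assumed here to be named train.csv and test.csv), and the variable names train and test are my own choice.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins so the sketch runs on its own. With the real
# Kaggle files you would pass a path instead, e.g. pd.read_csv("train.csv")
# (file names are an assumption).
train_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,26\n"
)
test_csv = io.StringIO(
    "PassengerId,Pclass,Sex,Age\n"
    "892,3,male,34.5\n"
    "893,3,female,47\n"
)

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# Shape is (rows, columns) for each file.
print(train.shape, test.shape)

# Columns that exist in train but not in test.
extra_columns = set(train.columns) - set(test.columns)
print(extra_columns)

# A glance at the first rows of the training data.
print(train.head())
```

Comparing the two column sets directly is a quick way to spot the target column without reading the shapes by eye.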
We can see that the train dataset has one column more than the test dataset; that column is our target variable, Survived. In the first rows, the Survived column contains only 0s and 1s, which suggests that this is going to be a classification problem. But we can't be sure from a visual check of the first 5 rows alone, so let us check properly. Listing the unique values in the column confirms that the problem is one of classification.
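One way to run that check is sketched below. The short Series is made-up sample data standing in for the real Survived column.

```python
import pandas as pd

# Made-up stand-in for train["Survived"].
survived = pd.Series([0, 1, 1, 0, 1], name="Survived")

# Distinct values in the target column.
unique_values = sorted(survived.unique().tolist())
print(unique_values)

# How many rows fall into each class.
class_counts = survived.value_counts()
print(class_counts)
```

Two distinct values in the target means a binary classification problem; value_counts additionally shows how balanced the two classes are.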
Exploring other features
This step tells us about each column in a little bit more detail. We can see there are some missing values in our training dataset. But this could be the case with the test dataset too.
Since there are missing values in both datasets, and we will be working on imputation along with some feature engineering, let us combine the two to avoid any discrepancy later. There are other ways of applying the same treatment to the test dataset as to the train dataset, but I prefer combining the two and then splitting them again before model building.
We will be excluding the Survived feature while combining the two datasets at this stage and will get back to it a couple of steps from here.
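Combining the two datasets without the target, and then counting missing values per column, could be sketched as follows. The small frames are made-up sample data and the name combined is my own.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the train and test datasets.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Age": [34.5, np.nan],
    "Cabin": ["B45", np.nan],
})

# Stack train (without the target) on top of test, with a fresh 0..n-1 index.
combined = pd.concat(
    [train.drop(columns="Survived"), test],
    ignore_index=True,
)

# Missing values per column, as counts and as a share of all rows.
missing_counts = combined.isnull().sum()
missing_share = combined.isnull().mean()
print(missing_counts)
print(missing_share)
```

The share of missing values is what drives decisions such as dropping a column that is mostly empty.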
The main area where we will have to use imputation methods is the AGE column.
Considering that CABIN has more than 75% of its data missing, and in the absence of any imputation logic, we will drop the feature entirely.
FARE and EMBARKED have 1 and 2 missing values respectively, which can easily be filled with the mean and the mode of the respective column.
Before we take any step towards filling the missing values, let us add the Survived column to the combined dataset as well; it will help us tell the train rows from the test rows later on. Since the Survived column is shorter than the combined data, this needs a small workaround.
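One possible workaround is sketched below; it is an assumption on my part and not necessarily what the notebook does. It relies on the train rows sitting at the top of the combined data, and it pads the shorter Survived column with NaN for the test rows. The frames are made-up sample data.

```python
import pandas as pd

# Made-up stand-ins for the train and test datasets.
train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
test = pd.DataFrame({"PassengerId": [892, 893]})

combined = pd.concat(
    [train.drop(columns="Survived"), test],
    ignore_index=True,
)

# train has index 0..2, which matches the first rows of combined.
# Reindexing to the combined index leaves NaN for every test row.
combined["Survived"] = train["Survived"].reindex(combined.index)

# The NaN pattern now marks which rows came from which file.
is_train = combined["Survived"].notna()
train_part = combined[is_train]
test_part = combined[~is_train].drop(columns="Survived")

# The rows on either side of the point where train ends and test begins.
boundary = combined.iloc[len(train) - 1 : len(train) + 1]
print(boundary)
```

Note that Survived becomes a float column once it contains NaN, so it may need converting back to integers after the split.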
To understand what has happened in the execution above, let us look at the dataset where train and test data points meet.
The highlighted area explains the output of the code.
Feature Engineering and EDA
Modification of existing and creation of new features with detailed analysis
Feature — 1 : PassengerId
The count of unique entries is the same as the length of the dataset, so this feature is just a row identifier and can be dropped.
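A quick way to measure how unique a column is can be sketched like this. The frame is made-up sample data.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Ticket": ["A1", "A1", "B2", "C3", "D4"],
    "Pclass": [3, 1, 3, 1, 2],
})

# Share of distinct values per column; 1.0 means every row is different.
unique_share = df.nunique() / len(df)
print(unique_share)

# Direct yes/no check for a single column.
id_is_unique = df["PassengerId"].is_unique
print(id_is_unique)
```

A column whose share is 1.0 behaves like an identifier and gives a model nothing to generalize from.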
Feature — 2 : Pclass
Passenger class is divided into 3 categories. Although it is stored as int64, we will convert it to the Object data type before feeding it into our model, so that the model treats it as a category rather than as a quantity.
Feature — 3 : Name
The name in itself may not be useful, but we can see that a title is mentioned along with each name. Let us try to extract it.
A visual check tells us that most of the titles are one of these: ‘Mr.’, ‘Mrs.’, ‘Master.’, ‘Miss.’, and hence we will work accordingly.
Now we have an additional feature which explains if the passenger on the ship was married or not. For cases with ‘Others’ the information could not be extracted.
We ignored the titles as such because the variability can also be explained by the Sex column, so it seemed a better idea to include only the marital status.
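Extracting the title could be sketched as below. The regular expression, the variable names and the handful of sample names (written in the dataset's "Surname, Title. Given names" format) are my own choices. How each title maps to a marital status is not spelled out in the text, so the sketch stops at the grouped title.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Uruchurtu, Don. Manuel E",
])

# The title is the text between the comma and the first full stop.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Keep the four common titles and lump everything else into "Others".
common_titles = ["Mr", "Mrs", "Master", "Miss"]
title_group = titles.where(titles.isin(common_titles), "Others")
print(title_group.tolist())
```

From here, a marital-status feature would be a plain dictionary lookup on title_group, with "Others" left as its own category.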
Feature — 4 : Sex
The feature looks pretty straightforward, with no missing values; we will encode it later, before model building.
Feature — 5: Age
This feature is numeric, and the first thing we check is its missing values. We will also need to check its distribution to see how close it is to normal.
Now, one way is to simply impute the missing values in the feature with the mean. But let us try to find a better way to do it.
I will go through the remaining features first and then come back to the imputation of Age later on.
Feature — 6 : SibSp (Siblings and Spouses)
The feature tells us the number of siblings and spouses the passenger was travelling with.
Feature — 7 : Parch (Parent and Child)
The feature tells us the number of parents and children the passenger was travelling with.
Feature — 8 : Ticket (Ticket Number)
The feature is of Object type and has 70% unique values. It will be dropped for the same reason as PassengerId, though only after it has been used to derive Group Size further below.
Feature — 9 : Cabin
The feature has more than 75% of its data points missing and hence will be dropped.
Feature — 10 : Embarked
The feature is of Object type with 2 values missing. Let us fill those values.
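Filling Embarked with the mode, and the single missing Fare mentioned earlier with the mean, could be sketched like this. The frame is made-up sample data.

```python
import numpy as np
import pandas as pd

# Made-up sample rows with one gap in each column.
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S", "Q"],
    "Fare": [7.25, 71.28, 8.05, np.nan, 13.0],
})

# mode() returns a Series because ties are possible; take the first entry.
embarked_mode = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(embarked_mode)

# mean() skips NaN by default, so it is computed from the known fares only.
fare_mean = df["Fare"].mean()
df["Fare"] = df["Fare"].fillna(fare_mean)

print(df)
```

With only one or two gaps, the choice of fill value has very little influence on the model, which is why the simple mean and mode are good enough here.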
Feature — 11 : Fare
The plot shows the distribution to be highly right-skewed, and outliers look quite likely. We will work on normalizing this feature later on.
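The skew can also be checked numerically, without a plot. A minimal sketch on made-up fares:

```python
import pandas as pd

# Made-up fares with one very large value, mimicking a long right tail.
fare = pd.Series(
    [7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33],
    name="Fare",
)

# Positive skewness indicates a long right tail.
fare_skew = fare.skew()
print(fare_skew)

# In a right-skewed column the mean sits well above the median (50%).
summary = fare.describe()
print(summary[["mean", "50%", "max"]])
```

A mean far above the median is the usual quick sign that a few large values are pulling the distribution to the right.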
Filling in the missing values
Imputation of Age column
One good way to fill in the missing Age values is to use the mean Age for each combination of Sex and Title/Marital Status. Let us do that.
But this only shows us one category. If we group the data in the desired form, we can arrive at the required output.
This shows us how we can fill the missing Age values in a much more detailed way. This way we can avoid losing variability, which in turn will be good for the model. Let us use these values and fill wherever required.
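The group-wise fill could be sketched as follows. The frame is made-up sample data, and the column name Title is an assumption standing in for the title or marital status feature built earlier.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; "Title" is an assumed column name.
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Mean age per (Sex, Title) group, for inspection.
group_means = df.groupby(["Sex", "Title"])["Age"].mean()
print(group_means)

# transform("mean") returns the group mean aligned to every row, so it can
# be passed straight to fillna; only the missing ages are replaced.
row_level_means = df.groupby(["Sex", "Title"])["Age"].transform("mean")
df["Age"] = df["Age"].fillna(row_level_means)
print(df)
```

Using transform avoids a separate merge step, because the group means come back in the same row order as the original data.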
Family Size : This can be obtained by adding SibSp and Parch
Group Size (size of group or number or travelers per ticket) : This can be obtained by counting passengers on each ticket number.
Since Group Size will work on the same logic as Family Size, we will have to exclude the lead person from each ticket count. So for any person travelling alone, the Group Size will be 0.
Merging the details with main dataset
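The two new features and the merge could be sketched as below. The frame is made-up sample data, and the column names FamilySize and GroupSize are my own.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "SibSp": [1, 1, 0, 1, 1, 0],
    "Parch": [0, 0, 0, 1, 1, 2],
})

# Family Size: relatives on board, not counting the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"]

# Group Size: passengers sharing the ticket, minus one so that a solo
# traveller gets 0, mirroring how Family Size is counted.
group_size = (
    (df.groupby("Ticket").size() - 1)
    .rename("GroupSize")
    .reset_index()
)

# Bring the per-ticket count back onto every passenger row.
df = df.merge(group_size, on="Ticket", how="left")
print(df)
```

A left merge keeps every passenger row and simply attaches the count for that passenger's ticket.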
And, finally dropping the columns which are of no further use to us
Feature Transformation with Outlier detection
Now that we have dealt with missing values and have created additional features we need to work further on transforming the existing features to a more suitable form for the model we are going to make. We will start with most apparent ones, which are Age and Fare. As shown earlier, we need to normalize the two. But before scaling let us first treat the outliers.
Outliers can easily be identified or viewed using box plots.
All highlighted points can be termed outliers. Let us use the IQR method and cap the outliers at the fences that lie 1.5 * IQR beyond the first and third quartiles.
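IQR capping, in its usual form, can be sketched like this. The fares are made-up sample data, and the fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR are the standard convention, assumed here to match the notebook.

```python
import pandas as pd

# Made-up fares with one extreme value.
fare = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0], name="Fare")

# Interquartile range: the spread of the middle half of the data.
q1 = fare.quantile(0.25)
q3 = fare.quantile(0.75)
iqr = q3 - q1

# Standard fences; anything outside them is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# clip() replaces values beyond a fence with the fence value itself,
# so no rows are removed.
fare_capped = fare.clip(lower=lower, upper=upper)
print(lower, upper)
print(fare_capped.tolist())
```

Capping keeps every row in the dataset, which matters here because the test rows must all survive to the prediction step.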
Checking the box plot with replaced values
Now, scaling and normalizing go hand in hand, and we can try the various methods available in the sklearn library. But to keep to the simple approach we have taken so far, we will stick to StandardScaler for both features for now.
Checking if standardization has been successful or not
The plot now looks better than before. The mean of the standardized values is still not exactly 0, but it is much closer to 0 than it was before (a tiny residual from floating-point rounding is normal), and hence we will accept the output and move ahead.
For Age, we will follow the same steps as we did for Fare. Starting with its statistical characteristics, we will aim to bring the mean and std of the feature to 0 and 1 respectively to normalize it.
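Standardizing a column with StandardScaler and checking the result could be sketched as follows. The fares are made-up sample data and the column name Fare_scaled is my own.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up fares, already capped.
df = pd.DataFrame({"Fare": [7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 53.25]})

# StandardScaler expects a 2-D input, hence the double brackets; the
# result is a 2-D array that is flattened back into one column.
scaler = StandardScaler()
df["Fare_scaled"] = scaler.fit_transform(df[["Fare"]]).ravel()

# StandardScaler divides by the population standard deviation (ddof=0),
# while pandas' std() defaults to ddof=1, so ddof is set explicitly.
scaled_mean = df["Fare_scaled"].mean()
scaled_std = df["Fare_scaled"].std(ddof=0)
print(scaled_mean, scaled_std)
```

If the std is checked with pandas' default settings instead, it will come out slightly above 1; that is a difference in the formula, not a failed standardization.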
Starting with outlier detection
We can see some outliers here. However, with Age, removing outliers does not seem to be a good idea, since these are all genuine data points. Hence I am skipping the step where we cap the outliers.
The plot already looked normal but scaling has improved the mean and std. The output below shows how the mean is much closer to 0 now than it was before.
This concludes cleaning of our data and now we will move ahead with building our model, which is Part 2 of this article.
We performed the following steps in this article:
- Data import
- Data merging
- Individual Feature Study
- Feature Engineering
- Feature Transformation
Any kind of detailed data exploration which would have involved intensive plotting has been deliberately skipped so that we can focus on the process of data preparation for model building.
Some of you might be able to come up with a better approach to the problem. Please comment and share it so that I can improve.
Further explanation follows in Part 2. Happy learning till then!