Step-by-Step Analysis of the Role Played by Social Class and Gender in Titanic Survivability

Henry Biko
6 min read · Dec 19, 2020
Photo by NOAA on Unsplash

Table of Contents

  1. Introduction
  2. Tutorial
  3. Summary
  4. References

Introduction

What is social class? Social class is usually described in terms of three broad economic categories: a very affluent upper class that owns and controls the means of production; a middle class of skilled workers, small business owners and low-level managers; and a lower class that depends on low-paying employment for its livelihood and often experiences poverty.

Below, I’ll offer a Python tutorial on ensemble learning with Random Forest. In addition to classification, we will run a SHAP analysis and statistical tests on our variables of interest. Feel free to use the code.

Tutorial

Photo by Wahid Khene on Unsplash

Supervised Classification Model

This tutorial covers importing the necessary libraries, preprocessing the Titanic data, training and evaluating a Random Forest classifier, interpreting the model with SHAP, and testing our hypotheses statistically.

The code below loads the packages needed to prepare the data for the model. Some of them, like pandas and matplotlib, will already be familiar from most data projects.

Loading the packages needed for data preparation
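
As a rough sketch, the imports for a workflow like this one could look as follows. The exact list is an assumption based on the steps described in this tutorial; matplotlib and shap would be imported in the same way for plotting and for the SHAP analysis.

```python
# Data handling
import numpy as np
import pandas as pd

# Preprocessing, model and evaluation tools from scikit-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Hypothesis testing
from scipy import stats
```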

The next section uses the imported libraries to read the data and check the variable types and the missing values.

Checking for missing values is important because they can greatly affect the results of your analysis. In this case, the Age column had a high number of missing values. Since age is one of our variables of interest, we will impute the missing ages rather than drop the rows where age is missing.

Percentage of missing entries per column
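
A minimal sketch of this check, assuming the data sits in a pandas DataFrame. The tiny frame below is a made-up stand-in with illustrative values and assumed column names; with the real file you would build `df` with `pd.read_csv` instead.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Titanic data (illustrative values, not real passengers)
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Gender": ["male", "female", "female", "male"],
})

# Type of each variable
print(df.dtypes)

# Share of missing entries per column, expressed as a percentage
missing_pct = df.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

In this toy frame half of the Age entries are missing, which is the kind of pattern that motivates imputing rather than dropping rows.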

Filling in missing values on the age variable

In our case, we will use KNN imputation, which often gives more accurate estimates than simple strategies such as mean, median or most-frequent imputation because it fills each gap using the most similar passengers.
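
Here is a small sketch of KNN imputation using scikit-learn's `KNNImputer`. The column names, the toy values and `n_neighbors=2` are assumptions chosen for illustration, not the settings used on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric columns (illustrative values); Age has two gaps
df = pd.DataFrame({
    "Age":    [22.0, np.nan, 35.0, 54.0, np.nan, 4.0],
    "Fare":   [7.25, 8.05, 53.10, 51.86, 52.00, 16.70],
    "Pclass": [3, 3, 1, 1, 1, 3],
})

# Each missing Age is replaced by the mean Age of the 2 most similar rows,
# where similarity is measured on the columns that are present
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
df["Age"] = imputed["Age"]

print(df)
```

Because KNN relies on distances, columns with large ranges (such as Fare) dominate the neighbor search, so scaling the features first is worth considering on real data.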

Data Cleaning

The next step involves dropping all remaining NAs and removing variables that are not of interest, for example Name_wiki, Hometown, Destination, Name, Boarded, Ticket and Embarked. Some of these, like Ticket, are already captured by passenger class (first class, second class and so on), while others, like Name, are not useful for the analysis: it does not make logical sense that one's name would affect one's survivability.

We also change variable types to the desired ones. For example, Survived should be of type int rather than float because it is a label: 1 represents a passenger who survived, while 0 represents one who died.
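
The cleaning steps above can be sketched as follows. The toy frame only contains a few of the listed columns, and its values are placeholders; the column names `Pclass`, `Gender` and `Survived` are assumptions about how the data is labeled.

```python
import numpy as np
import pandas as pd

# Toy frame with placeholder values (not real passengers)
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Ticket": ["T1", "T2", "T3"],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
    "Gender": ["male", "female", np.nan],
    "Survived": [0.0, 1.0, 1.0],
})

cols_to_drop = ["Name_wiki", "Hometown", "Destination", "Name",
                "Boarded", "Ticket", "Embarked"]

# errors="ignore" skips any listed column that is absent from this toy frame
df = df.drop(columns=cols_to_drop, errors="ignore")

# Drop rows that still contain missing values
df = df.dropna()

# Survived is a label (1 = survived, 0 = died), so store it as an integer
df["Survived"] = df["Survived"].astype(int)

print(df)
```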

Encoding

The last stage in our data preprocessing is encoding. Random Forest, as implemented in scikit-learn, expects numeric inputs, so categorical variables must be converted to numbers first. We will use OrdinalEncoder from sklearn to encode the Gender column, then drop the original column and keep the encoded one.

Encoded Gender where 1 represents male and 0 female
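
A minimal sketch of this encoding step. The new column name `Gender_encoded` is hypothetical, and the category order is written out explicitly so that female maps to 0 and male to 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame (illustrative values)
df = pd.DataFrame({
    "Gender": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})

# Listing the categories fixes the mapping: female -> 0, male -> 1
encoder = OrdinalEncoder(categories=[["female", "male"]], dtype=int)
df["Gender_encoded"] = encoder.fit_transform(df[["Gender"]]).ravel()

# Keep only the encoded column
df = df.drop(columns="Gender")

print(df)
```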

Model creation

We are now done with data preprocessing. The remaining steps are splitting the data into training and test sets, then creating our model and fitting it to the training data.

Random Forest Model
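
The split and the model fit could look roughly like this. The data is synthetic, and the 80/20 split, `n_estimators=100` and the random seeds are assumptions for illustration rather than tuned settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data (not real passengers)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Gender": rng.integers(0, 2, n),   # 1 = male, 0 = female
    "Pclass": rng.integers(1, 4, n),   # 1, 2 or 3
    "Age": rng.uniform(1, 70, n),
})
# Made-up label so that the example has a pattern to learn
y = ((X["Gender"] == 0) | (X["Pclass"] == 1)).astype(int)

# Hold out 20% of the rows for testing, keeping the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(y_pred[:10])
```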

Evaluating the model

In this part, we will compute the accuracy, precision, recall and F1 score. I will show you how to compute them but won’t go into much detail, given that they are not the focus of this article.

Model evaluation results
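
As a quick illustration, the four metrics can be computed with scikit-learn as shown here. The two label lists are made up; in the tutorial they would be the test labels and the model's predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = survived, 0 = died
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_test, y_pred)    # share of correct predictions
precision = precision_score(y_test, y_pred)  # predicted survivors who did survive
recall = recall_score(y_test, y_pred)        # actual survivors the model found
f1 = f1_score(y_test, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

With 3 true positives, 3 true negatives, 1 false positive and 1 false negative, all four metrics come out at 0.75 in this toy case.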

Validation of the results

Here we will validate the accuracy result above using cross-validation. Scoring the model on several different train/test splits helps check that the result was not an artifact of one favorable split or of overfitting.
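
A minimal sketch of 5-fold cross-validation with scikit-learn. The dataset comes from `make_classification` purely as a stand-in, and the number of folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features and labels
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Each of the 5 folds is used once as the held-out set
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

If the mean is close to the single-split accuracy and the spread across folds is small, the earlier result is less likely to depend on one particular split.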

Performing SHAP analysis to check how each variable contributed to the model's decisions

The results from the SHAP analysis show that Gender and Fare (representing social class) contributed most to the model's decisions, which brings us back to our topic of interest. In the next steps, we will perform some statistical analysis to test our hypothesis that gender and social class contributed to survivability.

Statistical analysis and hypothesis testing

In this section, we will apply a paired-sample t-test to analyze the chance of survival by gender. We will then use a significance test to conclude whether gender played a role.

We will group the samples by gender and draw random samples of passengers over 18 from both genders, assuming anyone younger than 18 would be counted as a child. We will then compare the mean number of survivors in the random samples.

Grouping our samples by Gender

The code below selects the passengers over 18 from both groups, then draws a random sample from each group independently to limit sampling bias.
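
The selection and sampling can be sketched like this. The data is synthetic, the survival rates are made up, and the sample size of 100 is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Synthetic passengers (not real data); Gender: 1 = male, 0 = female
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "Gender": rng.integers(0, 2, n),
    "Age": rng.integers(1, 75, n),
})
# Made-up survival rates: 70% for women, 20% for men
df["Survived"] = (rng.random(n) < np.where(df["Gender"] == 0, 0.7, 0.2)).astype(int)

# Keep adults only, then split by gender
adults = df[df["Age"] > 18]
men = adults[adults["Gender"] == 1]
women = adults[adults["Gender"] == 0]

# Draw an independent random sample from each group
sample_size = 100  # hypothetical
men_sample = men.sample(n=sample_size, random_state=0)
women_sample = women.sample(n=sample_size, random_state=0)

print("men survival rate:", men_sample["Survived"].mean())
print("women survival rate:", women_sample["Survived"].mean())
```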

Hypothesis testing

H0: Gender did not influence the chances of surviving the Titanic
H1: Gender had an influence, i.e. women were more likely than men to have survived the sinking of the Titanic

Gender hypothesis testing results
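
A sketch of the test itself, under stated assumptions. Since the men and women sampled are separate groups rather than matched pairs, this example swaps in an independent two-sample (Welch) t-test via `scipy.stats.ttest_ind`; `scipy.stats.ttest_rel` would be the paired counterpart if the samples were matched. The survival counts are made up.

```python
import numpy as np
from scipy import stats

# Made-up samples of 100 adults each: 1 = survived, 0 = died
women_survived = np.repeat([1, 0], [70, 30])
men_survived = np.repeat([1, 0], [20, 80])

# Welch's t-test compares the two means without assuming equal variances
t_stat, p_value = stats.ttest_ind(women_survived, men_survived, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print("significant difference" if p_value < alpha else "no significant difference")
```

A positive t statistic here means the first group (women) has the higher mean survival rate.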

The p-value is below 0.05, so the difference between the two groups is statistically significant, and we conclude that gender played a role in the likelihood of surviving the Titanic disaster.

Analysis by Social class (Part of Statistical Analysis)

In this section, we will analyze whether social class played a role in survival on the Titanic.

We will group the data by social class. The Titanic had three passenger classes based on fare: 1st class (coded as 1) paid the most, followed by 2nd class and finally 3rd class.

Grouping the Titanic data into the different social classes

We will then draw random samples from 1st class and 3rd class and perform the statistical analysis on them. Random sampling helps us avoid bias.
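
Grouping by class and sampling from 1st and 3rd class could look like the sketch below. The column name `Pclass`, the made-up survival rates and the sample size of 100 are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic passengers (not real data); Pclass: 1 = first, 2 = second, 3 = third
rng = np.random.default_rng(2)
n = 600
df = pd.DataFrame({"Pclass": rng.integers(1, 4, n)})
# Made-up survival probability per class
survival_prob = df["Pclass"].map({1: 0.6, 2: 0.45, 3: 0.25})
df["Survived"] = (rng.random(n) < survival_prob).astype(int)

# Survival rate per class
print(df.groupby("Pclass")["Survived"].mean())

# Independent random samples from the two classes being compared
first_class = df[df["Pclass"] == 1]
third_class = df[df["Pclass"] == 3]
sample_size = 100  # hypothetical
first_sample = first_class.sample(n=sample_size, random_state=0)
third_sample = third_class.sample(n=sample_size, random_state=0)
```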

Hypothesis testing

H0: Social status did not affect the likelihood of survival
H1: Social status affected the likelihood of survival

Social class hypothesis results
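
The class comparison can be sketched in the same spirit, again using an independent two-sample (Welch) t-test from SciPy as an assumption about the test, with made-up counts for the 1st-class and 3rd-class samples.

```python
import numpy as np
from scipy import stats

# Made-up samples of 100 passengers each: 1 = survived, 0 = died
first_class_survived = np.repeat([1, 0], [60, 40])
third_class_survived = np.repeat([1, 0], [25, 75])

print("1st class survival rate:", first_class_survived.mean())
print("3rd class survival rate:", third_class_survived.mean())

# Welch's t-test on the two independent samples
t_stat, p_value = stats.ttest_ind(
    first_class_survived, third_class_survived, equal_var=False
)

reject_h0 = p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```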

The p-value is below 0.05, so the difference between the two classes is statistically significant, and we conclude that social class played a role in the likelihood of surviving the Titanic disaster.

Summary

The statistical analysis of social class suggests that passengers in a higher social class had a higher chance of surviving the Titanic disaster. This is supported by other sources, such as a Washington Post article noting that the Titanic sank at night, when most third-class passengers were well below deck, making escape very difficult. Gender also played a role, as women and children were saved at higher rates than their male counterparts (Hall, 1986).

Github link to the code project link

Data source

References

The Washington Post, 2012. [online] Available at: <https://www.washingtonpost.com/opinions/women-and-children-did-go-first-on-titanic/2012/04/19/gIQAgSaugT_story.html> [Accessed 19 December 2020].

Hall, W. (1986). Social class and survival on the S.S. Titanic. Social Science & Medicine, 22(6), 687-690. doi: 10.1016/0277-9536(86)90041-9

Henry Biko

Data scientist | Artificial intelligence | Applied problem-solving enthusiast | Computational Sciences major | henrybiko2016@minerva.kgi.edu