Thursday, 21 March 2013

Titanic Data Analysis for survival of passengers

In the name of God, the Most Gracious, the Most Merciful

Titanic Survival Data

Data Description

(from https://www.kaggle.com/c/titanic)
survival: Survival (0 = No; 1 = Yes)
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Special Notes:
Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1). If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.
Some children travelled only with a nanny, therefore parch=0 for them.
As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
titanic = sns.load_dataset('titanic')
titanic.head()
Out[2]:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

Detail of data

In [718]:
titanic.describe()
Out[718]:
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Step 1-Data Cleaning

We will check for NaN or Null values and also check whether any columns need to be added or removed
In [3]:
# total number of null values in each column
titanic.isnull().sum()
Out[3]:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
We see that there are 177 missing values in "age" and 688 missing values in "deck". First we check the values in "deck".
In [4]:
titanic['deck'].value_counts()
Out[4]:
C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: deck, dtype: int64
The "deck" column is of little use because about 77% of its values (688 of 891) are missing, so we remove it. We also remove the 'alive' column, which repeats the information in 'survived'.
In [5]:
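# remove unwanted columns
def remove_columns(df, column):
    # drop() with inplace=True modifies df directly and returns None,
    # so there is nothing useful to return here
    df.drop(column, axis=1, inplace=True)

remove_columns(titanic, 'alive')
remove_columns(titanic, 'deck')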
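The decision to drop a column because too much of it is missing can be sketched as below. This is only an illustration: the small frame `toy` and the cutoff `threshold = 0.5` are assumptions made for the example and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "age":  [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a boolean frame; its column-wise mean is the missing fraction
missing_fraction = toy.isnull().mean()

# Keep only the columns whose missing fraction is at or below the cutoff
threshold = 0.5  # assumed cutoff, chosen only for this example
kept = toy.loc[:, missing_fraction <= threshold]
print(missing_fraction)
print(list(kept.columns))
```

On the real data the same calculation for "deck" is 688 / 891, which is roughly 0.77.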
Number of Men, Women and Children
In [6]:
titanic['who'].value_counts()
Out[6]:
man      537
woman    271
child     83
Name: who, dtype: int64
In [101]:
sns.barplot(data=titanic, x='who', y='survived')
plt.title('Survival Data for Men, Women and Children')
plt.xlabel('Men, Women and Children')
plt.ylabel('Survival Probability')
Out[101]:
<matplotlib.text.Text at 0xe6c1d30>

There are 177 null values in age, which can affect our results, so we have to fill them with an appropriate value

In [102]:
#first we find the mean
titanic['age'].mean()
Out[102]:
29.345679012345681
The mean is 29, but we cannot fill all 177 missing values with 29, as there may be many children among them. Let us check the median
In [103]:
titanic['age'].median()
Out[103]:
28.0
The median is close to the mean. We cannot fill all 177 missing values with 28 either, as there may be many children among them. So now we check how many children there are
In [104]:
# we see that the 'who' column tells us whether the entry is a 'man', 'woman' or 'child'.
# so we check the number of children here
In [105]:
def count_values(df,col_name):
    return df[col_name].value_counts()
count_values(titanic,'who')
Out[105]:
man      537
woman    271
child     83
Name: who, dtype: int64
The number of children is 83, so we cannot blindly fill the NaN values with the median (28) or the mean (29). We can fill the NaN entries with the median only if they belong to an adult 'man' or 'woman'. Let us check the NaN entries in titanic['age'] corresponding to titanic['who']=='child'
Checking how many Children have missing values
In [106]:
titanic.loc[titanic['who']=='child'].isnull().sum()
Out[106]:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alone          0
Sex_integer    0
child          0
parents        0
dtype: int64
We see that none of the 83 children have NaN values in the age column, so it is safe to replace the missing age values with the median (28)
In [107]:
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
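The worry discussed above, that a single median could give children an adult age, can also be handled by filling each missing age with the median of its own 'who' group. The sketch below shows that alternative on a small made-up frame; it is not the method used in this notebook, and `toy` and `age_filled` are names invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one man and one child have a missing age
toy = pd.DataFrame({
    "who": ["man", "man", "man", "child", "child", "woman"],
    "age": [30.0, 40.0, np.nan, 4.0, np.nan, 28.0],
})

# Median age within each 'who' group, aligned back to the original rows
group_median = toy.groupby("who")["age"].transform("median")

# Fill only the missing entries; observed ages are left untouched
toy["age_filled"] = toy["age"].fillna(group_median)
print(toy)
```

Here the missing man is filled with 35 (the median of 30 and 40) and the missing child with 4, instead of both receiving one overall median.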
In [108]:
# total number of null values in each column
titanic.isnull().sum()
Out[108]:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alone          0
Sex_integer    0
child          0
parents        0
dtype: int64
In [109]:
#making a new variable for Gender
titanic['Sex_integer']=np.where(titanic.sex=='male',1,0)
In [110]:
titanic.head()
Out[110]:
   survived  pclass     sex  age  sibsp  parch     fare embarked  class    who  adult_male  embark_town  alone  Sex_integer  child  parents
0         0       3    male   22      1      0   7.2500        S  Third    man        True  Southampton  False            1      0        0
1         1       1  female   38      1      0  71.2833        C  First  woman       False    Cherbourg  False            0      0        0
2         1       3  female   26      0      0   7.9250        S  Third  woman       False  Southampton   True            0      0        0
3         1       1  female   35      1      0  53.1000        S  First  woman       False  Southampton  False            0      0        0
4         0       3    male   35      0      0   8.0500        S  Third    man        True  Southampton   True            1      0        0
In [111]:
# convert floats to integer
titanic['age']=titanic['age'].astype(int)

Observations

Total Number of male and female

In [112]:
titanic['sex'].value_counts()
Out[112]:
male      577
female    314
Name: sex, dtype: int64
In [113]:
sns.countplot(x='sex', data=titanic)
plt.xlabel('Gender')
plt.title('Male and Female Passengers')
Out[113]:
<matplotlib.text.Text at 0xe7ae5f8>

Number of adult males, adult females and children

In [114]:
titanic_gender_pivot=titanic.pivot_table('sex', 'who',aggfunc='count')
titanic_gender_pivot
Out[114]:
       sex
who
child   83
man    537
woman  271
In [115]:
titanic_gender_pivot.plot(kind='bar',figsize=(15,10))
plt.ylabel('Number of persons')
plt.xlabel('Child or Man or Woman')
Out[115]:
<matplotlib.text.Text at 0xe7912b0>

Age distribution

In [116]:
sns.distplot(titanic['age'],color='red',kde=False)
plt.xlabel('Age')
plt.title('Age of passengers')
Out[116]:
<matplotlib.text.Text at 0xdf725c0>
Most passengers are between 20 and 40 years old. The peak at 28 comes from the missing values that were filled with the median

Question : Is there a relation between "Survival" and "Gender"?

In [117]:
titanic_male_female=titanic.pivot_table('survived','sex',aggfunc='sum')
titanic_male_female
Out[117]:
        survived
sex
female       233
male         109
In [118]:
titanic_male_female.plot(kind='bar',figsize=(10,6))
plt.ylabel('survived')
Out[118]:
<matplotlib.text.Text at 0xdcd12b0>
In [119]:
p= sns.barplot(x="sex", y="survived", data=titanic)
p.set(title = 'Gender Distribution by Survival', 
        xlabel = 'Gender', 
        ylabel = 'Whether Survived', 
        xticklabels = ['Male', 'Female']);
plt.show()

The above calculation and graph show that females had a higher chance of survival: 233 of 314 females (about 74%) survived, compared to 109 of 577 males (about 19%)
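The raw survivor counts only become comparable once they are divided by the number of passengers of each sex. A minimal sketch of that arithmetic, using the counts printed in Out[112] and Out[117] (the variable names are chosen only for this example):

```python
import pandas as pd

# Counts taken from the value_counts and pivot-table outputs shown earlier
passengers = pd.Series({"female": 314, "male": 577})
survivors = pd.Series({"female": 233, "male": 109})

# Survival rate per sex as a percentage, rounded to one decimal place
rate = (survivors / passengers * 100).round(1)
print(rate)
```

Dividing by group size matters here because there are almost twice as many males as females aboard.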

Question-2 : Did the economic condition of the passengers play any role in survival?

We will solve this in two parts

First : We check the survival related to cabin class (i.e. First, Second & Third class)

Second : We check the survival related to the fare

In [120]:
titanic.head()
Out[120]:
   survived  pclass     sex  age  sibsp  parch     fare embarked  class    who  adult_male  embark_town  alone  Sex_integer  child  parents
0         0       3    male   22      1      0   7.2500        S  Third    man        True  Southampton  False            1      0        0
1         1       1  female   38      1      0  71.2833        C  First  woman       False    Cherbourg  False            0      0        0
2         1       3  female   26      0      0   7.9250        S  Third  woman       False  Southampton   True            0      0        0
3         1       1  female   35      1      0  53.1000        S  First  woman       False  Southampton  False            0      0        0
4         0       3    male   35      0      0   8.0500        S  Third    man        True  Southampton   True            1      0        0
In [121]:
def percentage_survival_by_two_factors(data,factor1,factor2):
    
    return data.pivot_table('survived',factor1,factor2)*100
In [122]:
# finding survival percentage between class and gender
survival_class_gender=percentage_survival_by_two_factors(titanic,'class','sex')
survival_class_gender
Out[122]:
sex        female       male
class
First   96.808511  36.885246
Second  92.105263  15.740741
Third   50.000000  13.544669
We see that about 97% of females and 37% of males in first class survived, compared to 50% and 14% respectively in third class
In [123]:
g=sns.factorplot(x='sex',
                y='survived',hue='pclass' , kind='bar',data=titanic)
# Fix up the labels
g.set(xlabel='gender',ylabel='survived', title='Gender, Class and Survival'
)

plt.show()
In [124]:
g = sns.factorplot("survived", col="pclass", col_wrap=4,
                    data=titanic,
                    kind="count", size=4.5, aspect=.8,)

g.set_axis_labels("", "Count")
g.set_xticklabels(["Died", "Alive"])
Out[124]:
<seaborn.axisgrid.FacetGrid at 0xe2cd198>
In [125]:
g= sns.factorplot('pclass','survived',data=titanic)
g.set_axis_labels("Passenger Class", "Survival Probability")
g.set_xticklabels(["Class1", "Class2", "Class3"])
g.set(ylim=(0, 1))
Out[125]:
<seaborn.axisgrid.FacetGrid at 0xdaa0a20>
The above plots and calculations show that First Class passengers had a higher chance of survival than Third Class passengers. Second Class had the next highest chance, and Third Class had the lowest.
Both males and females had a higher chance of survival in the upper classes, as the data shows.
All of the above plots show that a higher cabin class meant a higher chance of survival

Second : Now we check the survival with the fares

In [126]:
fare = pd.qcut(titanic['fare'], 4)
tt=titanic.pivot_table('survived',['sex',fare])
tt.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.show()
The above graph shows that people paying higher fares had a higher chance of survival. This is likely because they were in the upper classes, which we already showed had higher chances of survival.
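The fare comparison relies on pd.qcut, which cuts on quantiles so that every bin holds roughly the same number of passengers. The sketch below shows the idea on made-up fares; the frame `toy` and the labels Q1 to Q4 are assumptions for the example, not values from the Titanic data.

```python
import pandas as pd

# Hypothetical fares and outcomes, made up for illustration
toy = pd.DataFrame({
    "fare":     [5.0, 7.0, 8.0, 10.0, 15.0, 30.0, 60.0, 100.0],
    "survived": [0,   0,   0,   1,    0,    1,    1,    1],
})

# qcut splits on quartiles, so each of the 4 bins gets 2 of the 8 rows
toy["fare_bin"] = pd.qcut(toy["fare"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean of a 0/1 column is the survival rate within each fare quartile
rate_by_fare = toy.groupby("fare_bin", observed=True)["survived"].mean()
print(rate_by_fare)
```

Equal-sized bins keep one very expensive ticket from leaving most of the bins nearly empty, which would happen with equal-width bins.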
I will divide age into two groups, i.e. 0-18 and 18-80, for females and males
In [127]:
# two age bands: (0, 18] and (18, 80]
age = pd.cut(titanic['age'], [0, 18, 80])
titanic_fare=titanic.pivot_table('survived', ['sex', age], 'class',aggfunc='mean')
titanic_fare
Out[127]:
class               First    Second     Third
sex    age
female (0, 18]   0.909091  1.000000  0.487805
       (18, 80]  0.975904  0.903226  0.495050
male   (0, 18]   0.750000  0.500000  0.200000
       (18, 80]  0.350427  0.086022  0.121622
In [128]:
titanic_fare.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.xlabel('Sex and age group')
Out[128]:
<matplotlib.text.Text at 0xef0a1d0>
We see that the chance of survival in Class-3 is the lowest for all age and sex groups except males between 18 and 80 years, where Class-2 survival (about 9%) is lower than Class-3 (about 12%). Men of all age groups have their highest survival in First class.
The above graph again shows that first class had the highest chance of survival in every group except females aged 0-18, where Second class was higher

Which age group has more survival chance?

In [129]:
age_groups = pd.cut(titanic['age'], [0,20,40,60,81])
survival_by_age_gender=titanic.pivot_table('survived', ['sex', age_groups])*100
survival_by_age_gender
Out[129]:
                   survived
sex    age
female (0, 20]    68.000000
       (20, 40]   75.661376
       (40, 60]   75.555556
       (60, 81]  100.000000
male   (0, 20]    24.489796
       (20, 40]   16.577540
       (40, 60]   19.753086
       (60, 81]   10.526316
In [130]:
survival_by_age_total=titanic.pivot_table('survived', age_groups)*100
survival_by_age_total
Out[130]:
           survived
age
(0, 20]   43.352601
(20, 40]  36.412078
(40, 60]  39.682540
(60, 81]  22.727273
In [131]:
age_groups = pd.cut(titanic['age'], [0,20,40,60,81])
titanic.groupby(age_groups).size().plot(kind='bar',stacked=True)
plt.title("Distribution of Age Groups",fontsize=14)
plt.ylabel('Count')
plt.xlabel('Age Group');
In [132]:
p = sns.violinplot(data = titanic, x = 'survived', y = 'age')
p.set(title = 'Survival by Age', 
        xlabel = 'Survival', 
        ylabel = 'Age Distribution', 
        xticklabels = ['Died', 'Survived']);
plt.show()
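The above observations show that passengers aged 20 or under had the highest overall survival rate (about 43%), followed by the 40 to 60 group (about 40%) and the 20 to 40 group (about 36%). Passengers over 60 had the lowest rate (about 23%)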
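One detail of pd.cut is worth checking with the bins used above: intervals are closed on the right by default, so (0, 20] does not contain 0. Because age was converted to integers earlier, an infant younger than one year (the minimum age in the data is 0.42) ends up with age 0. The sketch below, on a made-up series `ages`, shows the default behaviour and one possible way to keep such rows; it is an illustration, not what the notebook does.

```python
import pandas as pd

# Hypothetical integer ages, including an infant stored as 0
ages = pd.Series([0, 5, 20, 21, 40, 80])
edges = [0, 20, 40, 60, 81]

# Default bins are right-closed: (0, 20] excludes 0 and includes 20
default_bins = pd.cut(ages, edges)

# include_lowest=True also closes the first bin on the left, so 0 is kept
closed_bins = pd.cut(ages, edges, include_lowest=True)

print(default_bins.isna().sum(), closed_bins.isna().sum())
```

Rows that fall outside every bin become NaN and are silently left out of any pivot table built on those bins.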

Question : What is the survival chance for passengers travelling alone?

sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
In [133]:
titanic.pivot_table('survived' ,'alone' ,aggfunc='count')
Out[133]:
       survived
alone
False       354
True        537
This table gives the number of passengers in each group, not the number of survivors: 537 passengers travelled alone and 354 travelled with family
In [134]:
ax = sns.violinplot(x="alone", y="survived", data=titanic)
The above graph shows that passengers travelling alone had a lower chance of survival
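The difference between counting passengers and measuring survival comes down to the aggfunc passed to pivot_table. A small sketch on made-up data (the frame `toy` is hypothetical) shows both side by side:

```python
import pandas as pd

# Hypothetical data: three passengers alone, two with family
toy = pd.DataFrame({
    "alone":    [True, True, True, False, False],
    "survived": [0,    0,    1,    1,     1],
})

# aggfunc='count' returns how many rows are in each group
group_size = toy.pivot_table(values="survived", index="alone", aggfunc="count")

# aggfunc='mean' of a 0/1 column returns the survival rate of each group
survival_rate = toy.pivot_table(values="survived", index="alone", aggfunc="mean")

print(group_size)
print(survival_rate)
```

The rows are ordered False, then True. Only the 'mean' version answers the survival question; the 'count' version describes how the passengers are split.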

Question : What is survival chance for children without parents?

In [135]:
titanic['child']=np.where(titanic.who=='child',1,0)
titanic['parents']=np.where(titanic.parch!=0,1,0)
In [136]:
# Draw a nested barplot to show survival for children and adults,
# split by whether they had parents/children aboard
g = sns.factorplot(x="child", y="survived", hue="parents", data=titanic,
                   size=6, kind="bar", palette="muted",
                   )
g.set(title = 'Children Survival with respect to Family');
g.set_axis_labels("Child or Adult", "Survival Probability")
g.set_xticklabels(["Adult", "Child"])
Out[136]:
<seaborn.axisgrid.FacetGrid at 0xefe3198>
The above graph shows that children travelling with parents had a higher chance of survival than children without parents (for example, those travelling with nannies).
The above calculations show that passengers travelling alone had a lower chance of survival
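When comparing groups such as children with and without parents, it helps to look at the group sizes next to the rates, since a rate computed from a handful of passengers is not very reliable. A minimal sketch on made-up data (the frame `toy` is hypothetical; the column names follow the ones created above):

```python
import pandas as pd

# Hypothetical flags: child (1 = child), parents (1 = parch != 0)
toy = pd.DataFrame({
    "child":    [1, 1, 1, 1, 0, 0],
    "parents":  [1, 1, 1, 0, 0, 1],
    "survived": [1, 1, 0, 0, 0, 1],
})

# Survival rate and number of passengers for every child/parents combination
summary = toy.groupby(["child", "parents"])["survived"].agg(["mean", "count"])
print(summary)
```

In this toy example the "child without parents" group contains a single passenger, so its rate of 0 says very little on its own.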

Conclusions

Based on the above calculations we can approximately say that :

1-Females had a higher survival chance than males

2-First class passengers had a higher survival chance than the lower classes (economic factor)

3-Passengers who paid higher fares had a higher survival chance (also an economic factor)

4-Passengers aged 20 or under had the highest survival chance (about 43%), and passengers over 60 had the lowest (about 23%)

5-Passengers travelling alone had a lower survival chance than those travelling with companions

6-Children travelling without parents had a lower survival chance than children travelling with parents

Limitations:

The above findings cannot be fully accurate for several reasons. For example, a lot of age data is missing (177 entries).

Also, we do not know which age range was considered a "child" in those days.

There are also 688 missing entries in the "deck" column, which can affect our finding that Class-1 passengers survived more. We also do not know the locations of these decks; it is possible that deck location influenced survival more than passenger class did.

It is also possible that Class-3 passengers were located in a part of the ship that made it harder for them to survive, or that the iceberg struck near the location of the Class-3 passengers.

So these conclusions cannot be 100% correct, as many factors are involved about which we have no information, and there are many missing values.