In the name of God, the Most Gracious, the Most Merciful
Titanic Survival Data
Data Description
(from https://www.kaggle.com/c/titanic)
survival: Survival (0 = No; 1 = Yes)
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Special Notes:
Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them.
In addition, some travelled with very close friends or neighbors from their village; the definitions do not cover such relations.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
titanic = sns.load_dataset('titanic')
titanic.head()
Out[2]:
Detail of data
In [718]:
titanic.describe()
Out[718]:
Step 1-Data Cleaning
We will check for NaN or null values and also check whether any columns need to be added or removed.
In [3]:
# total number of null values in each column
titanic.isnull().sum()
Out[3]:
We see that there are 177 missing values in "age" and 688 missing values in "deck". First, check the values in "deck".
In [4]:
titanic['deck'].value_counts()
Out[4]:
The "deck" column is of little use because about 77% of its values (688 of 891) are missing, so we remove it. We also remove the "alive" column, which duplicates "survived".
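The decision above rests on the share of missing values per column, not only on the raw count. A minimal sketch of that check, using a small made-up frame (`toy`, its values and the 0.5 threshold are assumptions for illustration, not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "age":  [22.0, np.nan, 35.0, 4.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 21.08, 8.46],
})

missing_count = toy.isnull().sum()   # number of NaN values per column
missing_share = toy.isnull().mean()  # fraction of NaN values per column

# Keep only columns where at most half of the values are missing
# (the 0.5 threshold is an arbitrary choice for this sketch)
kept = toy.loc[:, missing_share <= 0.5]

print(missing_share)
print(list(kept.columns))
```

Working with the fraction makes the cut-off explicit and lets the same rule be reused on other columns.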
In [5]:
# remove unwanted columns
def remove_columns(df, column):
    # drop the column in place; nothing is returned
    df.drop(columns=column, inplace=True)

remove_columns(titanic, 'alive')
remove_columns(titanic, 'deck')
Number of Men, Women and Children
In [6]:
titanic['who'].value_counts()
Out[6]:
In [101]:
sns.barplot(data=titanic,x='who',y='survived')
plt.title('Survival Data for Men, Women and Children')
plt.xlabel('Men, Women and Children')
plt.ylabel('Survival Probability')
Out[101]:
There are 177 null values in age, which can affect our results, so we have to fill them with an appropriate value.
In [102]:
#first we find the mean
titanic['age'].mean()
Out[102]:
The mean is roughly 29.7, but we cannot simply fill all 177 values with it, as there may be many children among them. Let us check the median.
In [103]:
titanic['age'].median()
Out[103]:
The median is 28, close to the mean. Again, we cannot fill all 177 values with 28 if many of them might be children, so let us check how many children there are.
In [104]:
# the 'who' column tells us whether the entry is a 'man', 'woman' or 'child',
# so we check the number of children here
In [105]:
def count_values(df,col_name):
    return df[col_name].value_counts()

count_values(titanic,'who')
Out[105]:
The number of children is 83, so we should not blindly fill the NaN values with the median (28) or the mean. Filling with the median is reasonable only if the missing entries belong to an adult 'man' or 'woman'. Let us check the NaN entries in titanic['age'] corresponding to titanic['who']=='child'.
Checking how many Children have missing values
In [106]:
titanic.loc[titanic['who']=='child'].isnull().sum()
Out[106]:
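We see that none of the 83 children have NaN values in their age column, so it is safe to replace the missing ages with the median of 28. One caveat: if 'who' was itself derived from 'age' when the dataset was built, a passenger with a missing age could never be labelled 'child', so this check cannot fully rule out children among the missing values.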
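The same question (which groups do the missing ages belong to?) can be answered for all 'who' categories at once. A minimal sketch on a made-up frame; `toy` and its values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the six ages are missing
toy = pd.DataFrame({
    "who": ["man", "woman", "child", "man", "child", "woman"],
    "age": [30.0, np.nan, 5.0, np.nan, 9.0, 41.0],
})

# Flag missing ages, then count the flags within each 'who' group
missing_by_who = toy["age"].isnull().groupby(toy["who"]).sum()
print(missing_by_who)
```

A zero for 'child' here corresponds to the observation made above for the real data.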
In [107]:
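# assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
titanic['age'] = titanic['age'].fillna(titanic['age'].median())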
In [108]:
# total number of null values in each column
titanic.isnull().sum()
Out[108]:
In [109]:
#making a new variable for Gender
titanic['Sex_integer']=np.where(titanic.sex=='male',1,0)
In [110]:
titanic.head()
Out[110]:
In [111]:
# convert floats to integer
titanic['age']=titanic['age'].astype(int)
Observations
Total Number of male and female
In [112]:
titanic['sex'].value_counts()
Out[112]:
In [113]:
sns.countplot(x='sex',data=titanic)
plt.xlabel('Gender')
plt.title('Male and Female Passengers')
Out[113]:
Number of adult males, adult females and children
In [114]:
titanic_gender_pivot=titanic.pivot_table('sex', 'who',aggfunc='count')
titanic_gender_pivot
Out[114]:
In [115]:
titanic_gender_pivot.plot(kind='bar',figsize=(15,10))
plt.ylabel('Number of persons')
plt.xlabel('Child or Man or Woman')
Out[115]:
Age distribution
In [116]:
# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(titanic['age'],color='red')
plt.xlabel('Age')
plt.title('Age of passengers')
Out[116]:
Most ages fall between 20 and 40. The spike at 28 is an artifact: it comes from the missing values that were filled with the median.
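One way to keep such an imputation spike from distorting later plots is to record which rows were filled before filling them. This is a sketch of that idea, not something the notebook does; the `toy` frame and the `age_imputed` column name are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with two missing entries
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0, np.nan, 4.0]})

# Remember which ages were missing before filling them
toy["age_imputed"] = toy["age"].isnull()
toy["age"] = toy["age"].fillna(toy["age"].median())

# Only the ages that were actually observed
observed_ages = toy.loc[~toy["age_imputed"], "age"]
print(toy)
```

Plotting `observed_ages` instead of the full column would show the age distribution without the artificial peak.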
Question: Is there a relation between "Survival" and "Gender"?
In [117]:
titanic_male_female=titanic.pivot_table('survived','sex',aggfunc='sum')
titanic_male_female
Out[117]:
In [118]:
titanic_male_female.plot(kind='bar',figsize=(10,6))
plt.ylabel('survived')
Out[118]:
In [119]:
p = sns.barplot(x="sex", y="survived", data=titanic)
p.set(title='Survival Rate by Gender',
      xlabel='Gender',
      ylabel='Survival Probability',
      xticklabels=['Male', 'Female']);
plt.show()
The calculation and graphs above show that females had a higher chance of survival.
Question 2: Did the economic condition of the passengers play any role in survival?
We will answer this in two parts.
First: we check survival by passenger class (first, second and third class).
Second: we check survival by fare.
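A pivot with aggfunc='sum' counts survivors, while the bar plot shows the mean of 'survived', which is a rate. The two can tell different stories when the groups differ in size. A minimal sketch on made-up data (`toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data: four men and two women, one survivor in each group
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male", "female", "female"],
    "survived": [0, 0, 0, 1, 1, 0],
})

survivors = toy.pivot_table("survived", "sex", aggfunc="sum")   # counts
rates = toy.pivot_table("survived", "sex", aggfunc="mean")      # proportions

# Row-normalised cross-tabulation: share of died (0) and survived (1) per sex
shares = pd.crosstab(toy["sex"], toy["survived"], normalize="index")

print(survivors)
print(rates)
print(shares)
```

In this toy case both groups have one survivor, yet the survival rates differ, which is why the rate is the better basis for comparing groups.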
In [120]:
titanic.head()
Out[120]:
In [121]:
def percentage_survival_by_two_factors(data,factor1,factor2):
    return data.pivot_table('survived',factor1,factor2)*100
In [122]:
# finding survival percentage between class and gender
survival_class_gender=percentage_survival_by_two_factors(titanic,'class','sex')
survival_class_gender
Out[122]:
We see that about 97% of females and 37% of males in first class survived, compared with 50% and about 13% respectively in third class.
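The two-factor pivot used here is a shorthand for grouping by both factors, averaging 'survived', and spreading one factor across the columns. A small sketch showing the equivalence on made-up data (`toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data with two classes and both sexes
toy = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "First", "Third", "Third", "First"],
    "sex": ["female", "male", "female", "male", "female", "male", "female", "male"],
    "survived": [1, 0, 1, 0, 1, 0, 0, 1],
})

# pivot_table defaults to the mean, so this is the survival percentage
by_pivot = toy.pivot_table("survived", "class", "sex") * 100

# The same table built step by step with groupby
by_group = toy.groupby(["class", "sex"])["survived"].mean().unstack("sex") * 100

print(by_pivot)
```

Because 'survived' is coded as 0/1, its mean within a group is the fraction of that group that survived.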
In [123]:
# catplot is the current name of the older factorplot function
g = sns.catplot(x='sex', y='survived', hue='pclass', kind='bar', data=titanic)
# Fix up the labels
g.set(xlabel='gender', ylabel='survived', title='Gender, Class and Survival')
plt.show()
In [124]:
# catplot replaces factorplot; the old 'size' argument is now 'height'
g = sns.catplot(x="survived", col="pclass", col_wrap=4,
                data=titanic,
                kind="count", height=4.5, aspect=.8)
g.set_axis_labels("", "Count")
g.set_xticklabels(["Died", "Alive"])
Out[124]:
In [125]:
# kind='point' reproduces the point plot that factorplot drew by default
g = sns.catplot(x='pclass', y='survived', data=titanic, kind='point')
g.set_axis_labels("Passenger Class", "Survival Probability")
g.set_xticklabels(["Class1", "Class2", "Class3"])
g.set(ylim=(0, 1))
Out[125]:
The plots and calculations above show that first class had the highest chance of survival, followed by second class, with third class the lowest.
Both males and females had a higher chance of survival in the upper classes.
All of the plots above show that a higher passenger class goes with a higher chance of survival.
Second: now we check survival against the fares
In [126]:
fare = pd.qcut(titanic['fare'], 4)
tt=titanic.pivot_table('survived',['sex',fare])
tt.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.show()
The graph above shows that passengers who paid higher fares had a better chance of survival. This is likely because they were mostly in the upper classes, which we already showed had a higher chance of survival.
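The link between fare and class can be checked directly instead of being assumed. A minimal sketch on made-up fares (`toy`, its values and the "low"/"high" labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: passenger class and fare paid
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "fare": [80.0, 60.0, 55.0, 26.0, 13.0, 8.0, 7.9, 7.2],
})

# Split fares at the median into two equally sized bins
fare_bin = pd.qcut(toy["fare"], 2, labels=["low", "high"])

# How many passengers of each class fall into each fare bin
table = pd.crosstab(toy["pclass"], fare_bin)

# Typical fare per class
median_fare = toy.groupby("pclass")["fare"].median()

print(table)
print(median_fare)
```

If most high-fare passengers sit in first class, as in this toy table, fare and class carry largely the same information and their effects on survival cannot be separated by this plot alone.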
I will divide age into two groups, 0-18 and 18-80, for females and males
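One detail of pd.cut matters for this grouping: by default each bin excludes its left edge, so a value equal to the lowest edge falls into no bin. Since ages were converted to integers earlier, infants under one year now have age 0. A small sketch of the behaviour (the `ages` series is made up for illustration):

```python
import pandas as pd

# Hypothetical integer ages, including an infant recorded as 0
ages = pd.Series([0, 5, 18, 19, 80])

# Default bins are (0, 18] and (18, 80]: the value 0 is left out
default_bins = pd.cut(ages, [0, 18, 80])

# include_lowest=True also places the lowest edge in the first bin
closed_bins = pd.cut(ages, [0, 18, 80], include_lowest=True)

print(default_bins.isnull().sum())  # ages that fell outside every bin
print(closed_bins.isnull().sum())
```

Rows whose bin is NaN are silently dropped from a pivot table, so it is worth checking for them before interpreting the result.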
In [127]:
# include_lowest=True keeps passengers whose integer age is 0 in the first bin
age = pd.cut(titanic['age'], [0, 18, 80], include_lowest=True)
titanic_fare=titanic.pivot_table('survived', ['sex', age], 'class',aggfunc='mean')
titanic_fare
Out[127]:
In [128]:
titanic_fare.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.xlabel('Sex and age group')
Out[128]:
We see that the chance of survival in third class is the lowest for every age and sex group except males between 18 and 80, where third-class survival is higher than second-class survival. Men of all age groups have their highest survival in first class.
The graph above again shows that first class had the highest chance of survival in all of these age groups.
Which age group has more survival chance?
In [129]:
age_groups = pd.cut(titanic['age'], [0,20,40,60,81], include_lowest=True)
survival_by_age_gender=titanic.pivot_table('survived', ['sex', age_groups])*100
survival_by_age_gender
Out[129]:
In [130]:
survival_by_age_total=titanic.pivot_table('survived', age_groups)*100
survival_by_age_total
Out[130]:
In [131]:
age_groups = pd.cut(titanic['age'], [0,20,40,60,81], include_lowest=True)
titanic.groupby(age_groups).size().plot(kind='bar',stacked=True)
plt.title("Distribution of Age Groups",fontsize=14)
plt.ylabel('Count')
plt.xlabel('Age Group');
In [132]:
p = sns.violinplot(data = titanic, x = 'survived', y = 'age')
p.set(title = 'Survival by Age',
xlabel = 'Survival',
ylabel = 'Age Distribution',
xticklabels = ['Died', 'Survived']);
plt.show()
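The 20 to 40 group is by far the largest, so it accounts for the most survivors in absolute terms. For the chance of survival, however, the percentages computed above are the right basis for comparison, and they point to the youngest group (0 to 20) as having the highest survival rate.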
Question: What is the survival chance for passengers travelling alone?
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
In [133]:
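# the mean of 'survived' is the survival rate; 'count' would only give the group sizes
titanic.pivot_table('survived' ,'alone' ,aggfunc='mean')
Out[133]:
So passengers with a companion survived at a higher rate than those travelling alone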
In [134]:
# a bar plot of the mean is easier to read than a violin plot for a 0/1 outcome
ax = sns.barplot(x="alone", y="survived", data=titanic)
The graph above shows that passengers travelling alone had a lower chance of survival
Question: What is the survival chance for children without parents?
In [135]:
titanic['child']=np.where(titanic.who=='child',1,0)
titanic['parents']=np.where(titanic.parch!=0,1,0)
In [136]:
# Draw a nested barplot of survival for adults and children, with and without family aboard
# (catplot replaces factorplot; the old 'size' argument is now 'height')
g = sns.catplot(x="child", y="survived", hue="parents", data=titanic,
                height=6, kind="bar", palette="muted")
g.set(title = 'Children Survival with Respect to Family');
g.set_axis_labels("Child or Adult", "Survival Probability")
g.set_xticklabels(["Adult", "Child"])
Out[136]:
The graph above shows that children with parents had a higher chance of survival than children without parents or with nannies.
The calculations above show that passengers travelling alone had a lower chance of survival.
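Survival rates for small groups, such as children travelling without parents, can rest on very few passengers, so it helps to look at the group sizes next to the rates. A minimal sketch on made-up data (`toy` and its values are assumptions; the 'child' and 'parents' flags are built the same way as above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "who": ["child", "child", "child", "man", "woman", "man"],
    "parch": [1, 2, 0, 0, 1, 0],
    "survived": [1, 1, 0, 0, 1, 0],
})

toy["child"] = np.where(toy["who"] == "child", 1, 0)
toy["parents"] = np.where(toy["parch"] != 0, 1, 0)

# Survival rate and number of passengers for each combination
summary = toy.groupby(["child", "parents"])["survived"].agg(["mean", "size"])
print(summary)
```

A rate computed from only a handful of passengers should be read with caution, whatever its value.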