# In the name of Allah, the Most Gracious, the Most Merciful

# Titanic Survival Data

## Data Description

(from https://www.kaggle.com/c/titanic)

survival: Survival (0 = No; 1 = Yes)

pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

name: Name

sex: Sex

age: Age

sibsp: Number of Siblings/Spouses Aboard

parch: Number of Parents/Children Aboard

ticket: Ticket Number

fare: Passenger Fare

cabin: Cabin

embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

**Special Notes:**

Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in years; it is fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic

Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)

Parent: Mother or Father of Passenger Aboard Titanic

Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them.

As well, some travelled with very close friends or neighbors in a village; however, the definitions do not support such relations.
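The two family variables defined above can be added together to get a rough count of relatives aboard. The sketch below uses made-up rows, and `family_size` and `travelling_alone` are hypothetical column names chosen for illustration, not columns taken from the dataset.

```python
import pandas as pd

# Toy rows invented for illustration; these are not real passengers.
toy = pd.DataFrame({
    "sibsp": [1, 0, 0, 3],
    "parch": [0, 0, 2, 1],
})

# Relatives aboard under the definitions above:
# siblings/spouses plus parents/children.
toy["family_size"] = toy["sibsp"] + toy["parch"]

# Hypothetical flag: no sibling, spouse, parent or child aboard.
toy["travelling_alone"] = toy["family_size"] == 0

print(toy)
```

Because of the definitions, such a flag would also mark a child travelling only with a nanny, or a passenger travelling with friends, as alone.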

In [1]:

```
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

In [2]:

```
titanic = sns.load_dataset('titanic')
titanic.head()
```

Out[2]:

## Detail of data

In [718]:

```
titanic.describe()
```

Out[718]:

# Step 1-Data Cleaning

We will check for NaN or null values and also check whether any columns need to be added or removed.

In [3]:

```
# total number of null values in each column
titanic.isnull().sum()
```

Out[3]:

In [4]:

```
titanic['deck'].value_counts()
```

Out[4]:

In [5]:

```
# remove unwanted columns
def remove_columns(df, column):
    # drops the column in place, so the function returns None
    df.drop(column, axis=1, inplace=True)

remove_columns(titanic, 'alive')
remove_columns(titanic, 'deck')
```

**Number of Men, Women and Children**

In [6]:

```
titanic['who'].value_counts()
```

Out[6]:

In [101]:

```
# bar height is the mean of 'survived' (the survival rate) for each group
sns.barplot(data=titanic, x='who', y='survived')
plt.title('Survival Data for Men, Women and Children')
plt.xlabel('Men, Women and Children')
plt.ylabel('Survival Probability')
```

Out[101]:

In [102]:

```
#first we find the mean
titanic['age'].mean()
```

Out[102]:

In [103]:

```
titanic['age'].median()
```

Out[103]:

In [104]:

```
# we see that the 'who' column tells us whether the entry is a 'man', 'woman' or 'child',
# so we check the number of children here
```

In [105]:

```
def count_values(df, col_name):
    return df[col_name].value_counts()

count_values(titanic, 'who')
```

Out[105]:

**Checking how many Children have missing values**

In [106]:

```
titanic.loc[titanic['who']=='child'].isnull().sum()
```

Out[106]:

In [107]:

```
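# fill missing ages with the overall median; assign back rather than using inplace on a column selection
titanic['age'] = titanic['age'].fillna(titanic['age'].median())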
```
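The cell above fills every missing age with a single overall median. As a comparison only, and not the approach used in this notebook, the sketch below shows a group-wise median fill on a toy frame. The rows are invented, and the choice of `who` as the grouping column is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "who": ["child", "child", "child", "man", "man", "man"],
    "age": [4.0, np.nan, 8.0, 30.0, 40.0, np.nan],
})

# Overall median: one value is used for every missing age.
overall = toy["age"].fillna(toy["age"].median())

# Group-wise median: each missing age is filled from its own 'who' group.
group_median = toy.groupby("who")["age"].transform("median")
by_group = toy["age"].fillna(group_median)

print(pd.DataFrame({"who": toy["who"], "overall": overall, "by_group": by_group}))
```

In the toy data the overall median gives the child with a missing age an adult-looking value, while the group-wise fill keeps it within the range of the other children.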

In [108]:

```
# total number of null values in each column
titanic.isnull().sum()
```

Out[108]:

In [109]:

```
#making a new variable for Gender
titanic['Sex_integer']=np.where(titanic.sex=='male',1,0)
```

In [110]:

```
titanic.head()
```

Out[110]:

In [111]:

```
# convert floats to integer
titanic['age']=titanic['age'].astype(int)
```

# Observations

## Total number of males and females

In [112]:

```
titanic['sex'].value_counts()
```

Out[112]:

In [113]:

```
sns.countplot(x='sex', data=titanic)
plt.xlabel('Gender')
plt.title('Male and Female Passengers')
```

Out[113]:

## Number of adult males, adult females and children

In [114]:

```
titanic_gender_pivot=titanic.pivot_table('sex', 'who',aggfunc='count')
titanic_gender_pivot
```

Out[114]:

In [115]:

```
titanic_gender_pivot.plot(kind='bar',figsize=(15,10))
plt.ylabel('Number of persons')
plt.xlabel('Child or Man or Woman')
```

Out[115]:

# Age distribution

In [116]:

```
# histplot replaces the deprecated distplot; it draws no KDE curve by default
sns.histplot(titanic['age'], color='red')
plt.xlabel('Age')
plt.title('Age of passengers')
```

Out[116]:

# Question: Is there a relation between "Survival" and "Gender"?

In [117]:

```
titanic_male_female=titanic.pivot_table('survived','sex',aggfunc='sum')
titanic_male_female
```

Out[117]:

In [118]:

```
titanic_male_female.plot(kind='bar',figsize=(10,6))
plt.ylabel('survived')
```

Out[118]:

In [119]:

```
p= sns.barplot(x="sex", y="survived", data=titanic)
p.set(title = 'Gender Distribution by Survival',
xlabel = 'Gender',
ylabel = 'Whether Survived',
xticklabels = ['Male', 'Female']);
plt.show()
```

## The above calculation and graph show that females had a higher chance of survival
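When comparing groups of different sizes, the number of survivors and the survival rate can tell different stories. The following sketch uses invented rows to show both side by side; the column names `survivors`, `passengers` and `rate` are hypothetical labels for the aggregated output.

```python
import pandas as pd

# Toy data invented for illustration: six men and three women.
toy = pd.DataFrame({
    "sex": ["male"] * 6 + ["female"] * 3,
    "survived": [1, 1, 0, 0, 0, 0, 1, 1, 0],
})

# sum counts survivors, count gives the group size, mean gives the survival rate
summary = toy.groupby("sex")["survived"].agg(
    survivors="sum",
    passengers="count",
    rate="mean",
)

print(summary)
```

In this toy frame both groups have the same number of survivors, yet the rate differs because the groups differ in size. The rate is usually the safer figure to compare.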

# Question 2: Does the economic condition of the passengers play any role in survival?

## We will solve this in two parts

## First: We check survival in relation to cabin class (i.e. first, second and third class)

## Second: We check survival in relation to the fare

In [120]:

```
titanic.head()
```

Out[120]:

In [121]:

```
def percentage_survival_by_two_factors(data, factor1, factor2):
    # mean of 'survived' per (factor1, factor2) cell, expressed as a percentage
    return data.pivot_table('survived', index=factor1, columns=factor2) * 100
```

In [122]:

```
# finding survival percentage between class and gender
survival_class_gender=percentage_survival_by_two_factors(titanic,'class','sex')
survival_class_gender
```

Out[122]:

In [123]:

```
# catplot replaces factorplot, which newer seaborn releases no longer provide
g = sns.catplot(x='sex', y='survived', hue='pclass', kind='bar', data=titanic)
# Fix up the labels
g.set(xlabel='gender', ylabel='survived', title='Gender, Class and Survival')
plt.show()
```

In [124]:

```
# catplot replaces factorplot; the old 'size' argument is now 'height'
g = sns.catplot(x="survived", col="pclass", col_wrap=4,
                data=titanic,
                kind="count", height=4.5, aspect=.8)
g.set_axis_labels("", "Count")
g.set_xticklabels(["Died", "Alive"])
```

Out[124]:

In [125]:

```
# kind='point' matches the old factorplot default
g = sns.catplot(x='pclass', y='survived', kind='point', data=titanic)
g.set_axis_labels("Passenger Class", "Survival Probability")
g.set_xticklabels(["Class1", "Class2", "Class3"])
g.set(ylim=(0, 1))
```

Out[125]:

As the data shows, both males and females had a higher chance of survival in the upper classes.

All of the above plots show that a higher cabin class meant a higher chance of survival.

## Second: Now we check survival against the fares

In [126]:

```
fare = pd.qcut(titanic['fare'], 4)
tt=titanic.pivot_table('survived',['sex',fare])
tt.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.show()
```
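The cell above uses `pd.qcut` to split fares into four groups. The sketch below, on invented fare values, contrasts it with `pd.cut` to show why quantile bins can suit a skewed variable: `qcut` puts roughly the same number of rows in each bin, while `cut` with an integer makes bins of equal width.

```python
import pandas as pd

# Toy fares invented for illustration, with one very large value.
fares = pd.Series([5, 7, 8, 10, 15, 30, 80, 500], name="fare")

# Quantile-based bins: each bin holds about the same number of rows.
equal_count = pd.qcut(fares, 4)

# Equal-width bins: a single large fare stretches the range.
equal_width = pd.cut(fares, 4)

print(equal_count.value_counts().sort_index())
print(equal_width.value_counts().sort_index())
```

With these toy values the quantile bins hold two fares each, while the equal-width bins put almost everything in the first bin.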

**I will divide age in two groups i.e 0-18 and 18-80 for females and males**

In [127]:

```
# two age groups: (0, 18] and (18, 80]
age = pd.cut(titanic['age'], [0, 18, 80])
# mean survival by sex and age group, with one column per passenger class
titanic_fare = titanic.pivot_table('survived', ['sex', age], 'class', aggfunc='mean')
titanic_fare
```

Out[127]:

In [128]:

```
titanic_fare.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.xlabel('Sex and age group')
```

Out[128]:

**The above graph again shows that first class had a higher chance of survival in all of the above age groups**

# Which age group has more survival chance?

In [129]:

```
age_groups = pd.cut(titanic['age'], [0,20,40,60,81])
survival_by_age_gender=titanic.pivot_table('survived', ['sex', age_groups])*100
survival_by_age_gender
```

Out[129]:

In [130]:

```
survival_by_age_total=titanic.pivot_table('survived', age_groups)*100
survival_by_age_total
```

Out[130]:

In [131]:

```
age_groups = pd.cut(titanic['age'], [0,20,40,60,81])
titanic.groupby(age_groups).size().plot(kind='bar',stacked=True)
plt.title("Distribution of Age Groups",fontsize=14)
plt.ylabel('Count')
plt.xlabel('Age Group');
```

In [132]:

```
p = sns.violinplot(data = titanic, x = 'survived', y = 'age')
p.set(title = 'Survival by Age',
xlabel = 'Survival',
ylabel = 'Age Distribution',
xticklabels = ['Died', 'Survived']);
plt.show()
```

**The above observations show that the age group between 20 and 40 has a higher chance of survival**
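One detail of `pd.cut` is worth keeping in mind when reading the age tables: by default the bins are open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin. Since ages were cast to integers earlier, any age below one becomes 0. The sketch below uses invented ages to show the effect; `include_lowest=True` is shown as one possible adjustment, not as something this notebook does.

```python
import pandas as pd

# Toy ages invented for illustration; 0 stands for an infant after integer casting.
ages = pd.Series([0, 1, 20, 21, 40, 80])
edges = [0, 20, 40, 60, 81]

# Default: bins are (0, 20], (20, 40], ... so age 0 is left unassigned (NaN).
default_bins = pd.cut(ages, edges)

# include_lowest=True widens the first bin so that it contains the lowest edge.
inclusive_bins = pd.cut(ages, edges, include_lowest=True)

print(default_bins.isna().sum())
print(inclusive_bins.isna().sum())
```

Rows whose bin is NaN are left out of the group summaries, so the youngest group can be slightly under-counted with the default settings.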

# Question: What is the survival chance for passengers travelling alone?

sibsp: Number of Siblings/Spouses Aboard

parch: Number of Parents/Children Aboard

In [133]:

```
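# 'count' is the number of passengers in each group; 'mean' is the share of them who survived
titanic.pivot_table('survived', 'alone', aggfunc=['count', 'mean'])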
```

Out[133]:

**So the majority of passengers with a companion survived**

In [134]:

```
ax = sns.violinplot(x="alone", y="survived", data=titanic)
```

**The above graph shows that passengers travelling alone had a lower chance of survival**
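A cross-tabulation is another way to check a statement like this in numbers. The sketch below uses invented rows, and the `company` column with its two labels is a hypothetical stand-in for the `alone` flag. `pd.crosstab` with `normalize="index"` turns each row of counts into proportions that sum to one.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "company": ["alone"] * 4 + ["accompanied"] * 4,
    "survived": [1, 0, 0, 0, 1, 1, 0, 1],
})

# Raw counts of died (0) and survived (1) per group.
counts = pd.crosstab(toy["company"], toy["survived"])

# Row-wise proportions: the share of each group that died or survived.
rates = pd.crosstab(toy["company"], toy["survived"], normalize="index")

print(counts)
print(rates)
```

The column labelled 1 in `rates` is the survival rate of each group, which can be compared directly even when the groups differ in size.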

# Question: What is the survival chance for children without parents?

In [135]:

```
titanic['child']=np.where(titanic.who=='child',1,0)
titanic['parents']=np.where(titanic.parch!=0,1,0)
```

In [136]:

```
# Draw a nested barplot of survival by child/adult status, split by whether parents or children were aboard
# catplot replaces factorplot; the old 'size' argument is now 'height'
g = sns.catplot(x="child", y="survived", hue="parents", data=titanic,
                height=6, kind="bar", palette="muted")
g.set(title='Children Survival with Respect to Family')
g.set_axis_labels("Child or Adult", "Survival Probability")
g.set_xticklabels(["Adult", "Child"])
```

Out[136]:

**The above calculations show that lonely passengers had a lower chance of survival**
