Wednesday 31 August 2016

data visualization course on coursera solving!

TITANIC: MACHINE LEARNING FROM DISASTER

=======
DAY-1
=======
In this series we work with the Titanic disaster data set and use it to analyse "the sort of people who were likely to survive".

Below are the variable descriptions for the data set.

VARIABLE DESCRIPTIONS:
survival        Survival
(0 = No; 1 = Yes)
pclass          Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
The data set:

train.csv
test.csv

=======
DAY-2
=======

Today we will learn how to get an overview of the data by looking at the frequency distribution and percentage of variables when they are grouped on a certain condition.

To analyse the data, we break the main question into sub-questions and analyse each one in turn. For example, take a question like "What is the count of people who survived, by the class they were travelling in?"

Questions like this give us hypotheses that tell us where to start the analysis.

Below is the full description of my approach:


Question 1)
What is the count of people who survived, by the class they were travelling in?



import pandas as pd
data = pd.read_csv('train.csv')  # training file listed under DAY-1

# Count passengers for every (Pclass, Survived) combination.
Pclass = data.groupby(['Pclass', 'Survived'])['PassengerId'].count()
total = float(Pclass.sum())
print(pd.DataFrame({'Frequency': Pclass,
                    'Percentage': 100 * Pclass / total,
                    'Cumulative_Frequency': Pclass.cumsum(),
                    'Cumulative_percentage': 100 * Pclass.cumsum() / total}))

The above code produced:

                 Cumulative_Frequency  Cumulative_percentage  Frequency  Percentage
Pclass Survived                                                                    
1      0                           80               8.978676         80    8.978676
       1                          216              24.242424        136   15.263749
2      0                          313              35.129068         97   10.886644
       1                          400              44.893378         87    9.764310
3      0                          772              86.644220        372   41.750842
       1                          891             100.000000        119   13.355780

From the above table we can conclude that first-class passengers were the most likely to survive (136 of 216, about 63%) and third-class passengers were the most likely to die (372 of 491, about 76%).
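The Percentage column above is a share of all 891 passengers, so comparing classes fairly takes one more step: dividing each count by the total for its own class. Here is a minimal sketch of that step. It uses the counts printed in the table rather than re-reading train.csv, and the names counts, class_totals and rate_within_class are my own, not part of the original code.

```python
import pandas as pd

# Counts copied from the Frequency column of the table above.
index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 1)],
    names=['Pclass', 'Survived'])
counts = pd.Series([80, 136, 97, 87, 372, 119], index=index)

# Total passengers per class, repeated on every row of that class.
class_totals = counts.groupby(level='Pclass').transform('sum')

# Share of each outcome within its class, in percent.
rate_within_class = (100.0 * counts / class_totals).round(1)
print(rate_within_class)
```

On the full data frame, pd.crosstab(data['Pclass'], data['Survived'], normalize='index') should give the same shares as fractions, without building the counts by hand.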


Question 2) What is the count of people who survived, by sex and age?



# Split passengers into three age bands.
x = data[data['Age'] <= 25]
y = data[(data['Age'] > 25) & (data['Age'] <= 50)]
z = data[data['Age'] > 50]
# Count passengers for every (Sex, Survived) combination in each band.
x = x.groupby(['Sex', 'Survived'])['PassengerId'].count()
y = y.groupby(['Sex', 'Survived'])['PassengerId'].count()
z = z.groupby(['Sex', 'Survived'])['PassengerId'].count()
print(pd.DataFrame({'for age<=25': x, 'for 25<age<=50': y, 'for age>50': z}))

Below is the table and the graph.


                 for 25<age<=50  for age<=25  for age>50
Sex    Survived
female 0                     28           52           1
       1                     94          123          16
male   0                    177          250          41
       1                     50           53           6


From the above table we can conclude that males aged 25 or under were likely to die (250 of 303) and females aged 25 or under were likely to survive (123 of 175). The same pattern holds in the two older age bands.
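The three age bands can also be assigned in a single pass with pd.cut, which avoids filtering the frame three times. Below is a rough sketch on a small made-up frame. The six rows are hypothetical and only the column names match train.csv. The AgeBand column, the band labels and the upper bound of 120 are my own choices.

```python
import pandas as pd

# Hypothetical rows standing in for train.csv; only the column names match.
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5, 6],
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Age': [22.0, 38.0, 26.0, 54.0, None, 4.0],
    'Survived': [0, 1, 1, 0, 0, 1],
})

# Bins are closed on the right: (0, 25], (25, 50], (50, 120].
# 120 is an arbitrary upper bound chosen for this sketch.
sample['AgeBand'] = pd.cut(sample['Age'], bins=[0, 25, 50, 120],
                           labels=['age<=25', '25<age<=50', 'age>50'])

# Rows with a missing Age get no band and drop out of the count.
band_counts = sample.groupby(
    ['AgeBand', 'Sex', 'Survived'], observed=True)['PassengerId'].count()
print(band_counts)
```

One thing this makes visible is that passengers with no recorded Age fall outside every band, so the band counts can add up to fewer rows than the data set contains.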


Question 3) What is the count of people who survived, by whether they had spouses/siblings or parents/children aboard?


# Count passengers for every (SibSp, Survived) and (Parch, Survived) combination.
x = data.groupby(['SibSp', 'Survived'])['PassengerId'].count()
y = data.groupby(['Parch', 'Survived'])['PassengerId'].count()
# Combinations that occur for only one of the two variables are filled with 0.
z = pd.DataFrame({'SibSp': x, 'Parch': y,
                  'SibSp_cumulative': x.cumsum(),
                  'SibSp_freq_dist': 100 * x / float(x.sum()),
                  'Parch_cumulative': y.cumsum(),
                  'Parch_freq_dist': 100 * y / float(y.sum())}).fillna(0)
print(z)

Below is the table.


     Parch  Parch_cumulative  Parch_freq_dist  SibSp  SibSp_cumulative  SibSp_freq_dist
0 0    445               445        49.943883    398               398        44.668911
  1    233               678        26.150393    210               608        23.569024
1 0     53               731         5.948373     97               705        10.886644
  1     65               796         7.295174    112               817        12.570146
2 0     40               836         4.489338     15               832         1.683502
  1     40               876         4.489338     13               845         1.459035
3 0      2               878         0.224467     12               857         1.346801
  1      3               881         0.336700      4               861         0.448934
4 0      4               885         0.448934     15               876         1.683502
  1      0                 0         0.000000      3               879         0.336700
5 0      4               889         0.448934      5               884         0.561167
  1      1               890         0.112233      0                 0         0.000000
6 0      1               891         0.112233      0                 0         0.000000
8 0      0                 0         0.000000      7               891         0.785634



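Here we don't get as clear an idea. Passengers with no relatives aboard are by far the largest group, so they account for both the most survivors and the most deaths. As a rate, though, I would say they did worse than passengers with one relative: about 35% of those with SibSp = 0 survived (210 of 608), against about 54% of those with SibSp = 1 (112 of 209).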
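Raw counts favour the largest group, so a survival rate per SibSp value is a useful companion to the table. Here is a minimal sketch, again using the counts from the SibSp column of the table above instead of the CSV file. The frame layout and the names died, survived, total and survival_pct are my own.

```python
import pandas as pd

# Died and survived counts per SibSp value, copied from the table above.
sibsp_counts = pd.DataFrame(
    {'died': [398, 97, 15, 12, 15, 5, 7],
     'survived': [210, 112, 13, 4, 3, 0, 0]},
    index=pd.Index([0, 1, 2, 3, 4, 5, 8], name='SibSp'))

# Group size, then survivors as a percentage of that group.
sibsp_counts['total'] = sibsp_counts['died'] + sibsp_counts['survived']
sibsp_counts['survival_pct'] = (
    100.0 * sibsp_counts['survived'] / sibsp_counts['total']).round(1)
print(sibsp_counts)
```

The groups with three or more siblings/spouses hold fewer than 20 passengers each, so their rates are much noisier than those for SibSp = 0 or 1.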

Still, we will keep these variables, because later in the course we may be able to form a hypothesis that we cannot see right now.

That ends today's post. I hope you got a clear idea of the frequency distribution and how it can help in making sense of the data.

Below is the full code.