TITANIC: MACHINE LEARNING FROM DISASTER
=======
DAY-1
=======
In this series we work with the Titanic disaster data set and use the given data to analyse "the sort of people who were likely to survive".
Below is the variable description of the data set.
VARIABLE DESCRIPTIONS:
    survival    Survival (0 = No; 1 = Yes)
    pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
    name        Name
    sex         Sex
    age         Age
    sibsp       Number of Siblings/Spouses Aboard
    parch       Number of Parents/Children Aboard
    ticket      Ticket Number
    fare        Passenger Fare
    cabin       Cabin
    embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.
Age is in Years; Fractional if Age less than One (1). If the Age is Estimated, it is in the form xx.5.
With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.
    Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
    Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
    Parent: Mother or Father of Passenger Aboard Titanic
    Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.
Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

The data set:
train.csv
test.csv
=======
DAY-2
=======
Today we will learn how to look through the data and get an overview from the frequency distribution and percentage of the variables when they are grouped on a certain condition.
To analyse the data, we break the main question into sub-questions and analyse each of them. Suppose we take a question like "What is the count of people who survived, on the basis of the class they were travelling in?"
Such questions give us hypotheses about where to start the analysis.
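As a minimal sketch of what such a frequency distribution contains (frequency, percentage, and their cumulative versions per group), the snippet below builds one in plain Python. The function name frequency_table and the toy sample are assumptions made for illustration; they are not the real data and not the code used in this post.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (group, frequency, percentage,
    cumulative frequency, cumulative percentage), sorted by group."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    running = 0
    for group in sorted(counts):
        freq = counts[group]
        running += freq
        rows.append((group, freq, 100.0 * freq / total,
                     running, 100.0 * running / total))
    return rows


# Hypothetical toy sample of (Pclass, Survived) pairs, NOT the real data.
sample = [(1, 1), (1, 0), (1, 1), (2, 0), (3, 0), (3, 0), (3, 1), (3, 0)]

table = frequency_table(sample)
for group, freq, pct, cum_freq, cum_pct in table:
    print(group, freq, round(pct, 1), cum_freq, round(cum_pct, 1))
```

The last row of such a table always has a cumulative percentage of 100, which is a quick check that no group was dropped.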
Below is the full description of my approach:
Question 1)
What is the count of people who survived, on the basis of the class they were travelling in?
    import pandas as pd

    # data is assumed to be the DataFrame loaded from train.csv.
    # Count passengers in each (Pclass, Survived) group.
    Pclass = data.groupby(['Pclass', 'Survived'])['PassengerId'].count()
    total = float(Pclass.sum())
    print(pd.DataFrame({
        'Frequency': Pclass,
        'Percentage': 100 * Pclass / total,
        'Cumulative_Frequency': Pclass.cumsum(),
        'Cumulative_percentage': 100 * Pclass.cumsum() / total,
    }))
The above code produced:
                     Cumulative_Frequency  Cumulative_percentage  Frequency  Percentage
    Pclass Survived
    1      0                           80               8.978676         80    8.978676
           1                          216              24.242424        136   15.263749
    2      0                          313              35.129068         97   10.886644
           1                          400              44.893378         87    9.764310
    3      0                          772              86.644220        372   41.750842
           1                          891             100.000000        119   13.355780
From the above table we can conclude that first-class passengers were the most likely to survive (136 of 216, about 63%) and third-class passengers were the most likely to die (372 of 491, about 76%).
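The table gives counts as a share of all 891 passengers, while the conclusion compares survival within each class. A small sketch of that within-class calculation is shown here, using the counts copied from the table above. The helper name survival_rate_by_group is an assumption for illustration and is not part of the original code.

```python
# (Pclass, Survived) -> passenger count, copied from the Question 1 table.
counts = {
    (1, 0): 80, (1, 1): 136,
    (2, 0): 97, (2, 1): 87,
    (3, 0): 372, (3, 1): 119,
}


def survival_rate_by_group(counts):
    """Return {group: percentage of that group with Survived == 1}."""
    totals = {}
    survivors = {}
    for (group, survived), n in counts.items():
        totals[group] = totals.get(group, 0) + n
        if survived == 1:
            survivors[group] = survivors.get(group, 0) + n
    return {g: 100.0 * survivors.get(g, 0) / totals[g] for g in totals}


rates = survival_rate_by_group(counts)
for pclass in sorted(rates):
    print("Pclass %d: %.1f%% survived" % (pclass, rates[pclass]))
```

Dividing by the size of each class rather than by the overall total is what makes the classes comparable, since third class alone holds more than half of the passengers.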
Question 2) What is the count of people who survived, on the basis of sex and age?
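    # Split the passengers into three age bands.
    x = data[data['Age'] <= 25]
    # The inner merge on PassengerId keeps only passengers with 25 < Age <= 50;
    # it also adds the _x/_y suffixes to the duplicated column names.
    y = pd.merge(data[data['Age'] <= 50], data[data['Age'] > 25],
                 how='inner', on='PassengerId')
    z = data[data['Age'] > 50]
    # Count passengers in each (Sex, Survived) group within every band.
    x = x.groupby(['Sex', 'Survived'])['PassengerId'].count()
    y = y.groupby(['Sex_x', 'Survived_x'])['PassengerId'].count()
    z = z.groupby(['Sex', 'Survived'])['PassengerId'].count()
    print(pd.DataFrame({'for age<=25': x, 'for 25<age<=50': y, 'for age>50': z}))
Below is the table and the graph.
                       for 25<age<=50  for age<=25  for age>50
    Sex_x  Survived_x
    female 0                       28           52           1
           1                       94          123          16
    male   0                      177          250          41
           1                       50           53           6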
From the above view we can conclude that males aged 25 or younger were likely to die (250 of 303 died) and females aged 25 or younger were more likely to survive (123 of 175 survived).
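To compare the groups on an equal footing, the counts can be turned into a survival percentage per sex and age band. The sketch below does this with the counts copied from the table above. The helper age_band is a hypothetical illustration of the three age bands used in this question and is not part of the original code.

```python
def age_band(age):
    """Map an age in years to one of the three bands used in Question 2."""
    if age <= 25:
        return "age<=25"
    if age <= 50:
        return "25<age<=50"
    return "age>50"


# counts[sex][band] = (died, survived), copied from the Question 2 table.
counts = {
    "female": {"age<=25": (52, 123), "25<age<=50": (28, 94), "age>50": (1, 16)},
    "male": {"age<=25": (250, 53), "25<age<=50": (177, 50), "age>50": (41, 6)},
}

# Percentage of each (sex, band) group that survived.
rates = {}
for sex, bands in counts.items():
    for band, (died, survived) in bands.items():
        rates[(sex, band)] = 100.0 * survived / (died + survived)

for key in sorted(rates):
    print(key, "%.1f%% survived" % rates[key])
```

For example, age_band(22) falls in the "age<=25" band, and the rate for ("male", "age<=25") is 53 survivors out of 303 passengers.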
Question 3) What is the count of people who survived, on the basis of their having a spouse/siblings or parents/children aboard?
    # Count passengers in each (SibSp, Survived) and (Parch, Survived) group.
    x = data.groupby(['SibSp', 'Survived'])['PassengerId'].count()
    y = data.groupby(['Parch', 'Survived'])['PassengerId'].count()
    z = pd.DataFrame({
        'SibSp': x,
        'Parch': y,
        'SibSp_cumulative': x.cumsum(),
        'SibSp_freq_dist': 100 * x / float(x.sum()),
        'Parch_cumulative': y.cumsum(),
        # The Parch percentages are taken against the Parch total, y.sum().
        'Parch_freq_dist': 100 * y / float(y.sum()),
    }).fillna(0)
    print(z)
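Below is the table. The first index column is the SibSp or Parch value and the second is Survived; combinations that do not occur for a variable are filled with 0.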
         Parch  Parch_cumulative  Parch_freq_dist  SibSp  SibSp_cumulative  SibSp_freq_dist
    0 0    445               445        49.943883    398               398        44.668911
      1    233               678        26.150393    210               608        23.569024
    1 0     53               731         5.948373     97               705        10.886644
      1     65               796         7.295174    112               817        12.570146
    2 0     40               836         4.489338     15               832         1.683502
      1     40               876         4.489338     13               845         1.459035
    3 0      2               878         0.224467     12               857         1.346801
      1      3               881         0.336700      4               861         0.448934
    4 0      4               885         0.448934     15               876         1.683502
      1      0                 0         0.000000      3               879         0.336700
    5 0      4               889         0.448934      5               884         0.561167
      1      1               890         0.112233      0                 0         0.000000
    6 0      1               891         0.112233      0                 0         0.000000
    8 0      0                 0         0.000000      7               891         0.785634
Here we do not get a clear picture, but the table does show that passengers with no relatives aboard form the largest group among both the survivors and the dead: with SibSp = 0, 210 survived and 398 died; with Parch = 0, 233 survived and 445 died.
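One possible way to read the same table, sketched here under the assumption that a within-group percentage is of interest, is to divide the survivors in each group by the size of that group. The counts are copied from the table above; the helper name survival_rates is hypothetical and not part of the original code.

```python
# value -> (died, survived), copied from the Question 3 table.
sibsp = {0: (398, 210), 1: (97, 112), 2: (15, 13), 3: (12, 4),
         4: (15, 3), 5: (5, 0), 8: (7, 0)}
parch = {0: (445, 233), 1: (53, 65), 2: (40, 40), 3: (2, 3),
         4: (4, 0), 5: (4, 1), 6: (1, 0)}


def survival_rates(groups):
    """Return {value: (group size, percentage of the group that survived)}."""
    return {value: (died + survived, 100.0 * survived / (died + survived))
            for value, (died, survived) in groups.items()}


sibsp_rates = survival_rates(sibsp)
parch_rates = survival_rates(parch)

for name, rates in (("SibSp", sibsp_rates), ("Parch", parch_rates)):
    for value in sorted(rates):
        size, pct = rates[value]
        print("%s=%d: n=%d, %.1f%% survived" % (name, value, size, pct))
```

The group size is printed next to each percentage because several groups contain only a handful of passengers, so their percentages should be read with caution.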
Still, we will keep this data, because later in the course we may figure out a hypothesis that we are not able to see right now.
That is all for today. I hope you got a clear idea of the frequency distribution and how it can help in getting meaning out of the data.
Below is the full code.