Kaggle Titanic Competition Walkthrough
23 Jul 2016

Welcome to my first, and rather long, post on data analysis. I recently retook Andrew Ng's machine learning course on Coursera, which I highly recommend as an intro course, and Harvard's CS109 Data Science, which is filled with practical Python examples and tutorials, so I thought I'd apply what I've learned to some real-life data sets.
Kaggle's Titanic competition is one of their "getting started" competitions for budding data scientists. The forum is well populated with sample solutions and pointers, so I thought I'd whip up a classifier and see how I fare on the Titanic journey.
Introduction and Conclusion (tl;dr)
Given how long this post is, I thought it'd be a good idea to put my conclusions at the very top for the less patient.
The web version of this notebook might be a bit hard to digest. If you'd like to run each of the Python cells and play around with the data yourself, click here for a copy.
I'll skip the introduction to what the Titanic is about. As for the competition, this notebook took me about three weekends to complete. Given the limited size of the training set, throwing away training rows wasn't an option. Like any data science project, I started by exploring every raw feature I was given, trying to connect the high-level relationships with what I know about shipwrecks. Once I felt comfortable with the data set, the majority of my time went into imputing missing values by cleaning and engineering new features. Without referring to external resources, quite a bit of creativity is needed to come up with differentiating factors.
The way I've structured this post is to give a narrative of my analysis, starting with some introductory exploration and then making incremental conclusions. Everything is written in Python in a Jupyter notebook, so it's easy to clone my repo and fiddle around with the data yourself. Let's get started!
Table of Contents
Load Data Set and Libraries
In [1]:
Read our data set in as train and test, then combine the two data frames to form a union of features for later imputation.
In [2]:
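Roughly, this step might look like the following sketch (the frame names `train`, `test`, and `full` are my own choices; the later sketches build on them):

```python
import pandas as pd

# Read the competition CSVs into separate frames
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Union of the two, so cleaning and imputation can see all 1309 passengers;
# Survived is NaN for the 418 test rows
full = pd.concat([train, test], ignore_index=True)
```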
Data Exploration
In this section, we'll explore the basic structure of our data and try to come up with creative ways of reformatting our features to make them more machine readable.
But first, let's familiarize ourselves with the formatting of our data set.
In [3]:
Age | Fare | Parch | PassengerId | Pclass | SibSp | Survived | |
---|---|---|---|---|---|---|---|
count | 1046.000000 | 1308.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 891.000000 |
mean | 29.881138 | 33.295479 | 0.385027 | 655.000000 | 2.294882 | 0.498854 | 0.383838 |
std | 14.413493 | 51.758668 | 0.865560 | 378.020061 | 0.837836 | 1.041658 | 0.486592 |
min | 0.170000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 21.000000 | 7.895800 | 0.000000 | 328.000000 | 2.000000 | 0.000000 | 0.000000 |
50% | 28.000000 | 14.454200 | 0.000000 | 655.000000 | 3.000000 | 0.000000 | 0.000000 |
75% | 39.000000 | 31.275000 | 0.000000 | 982.000000 | 3.000000 | 1.000000 | 1.000000 |
max | 80.000000 | 512.329200 | 9.000000 | 1309.000000 | 3.000000 | 8.000000 | 1.000000 |
Sex
Now that you have a rough understanding of what each feature entails, let's start by exploring how gender relates to the overall survival rate.
In [4]:
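A sketch of the aggregation behind the table below, using the `train` frame from above:

```python
# Survivor count, headcount, and survival rate per gender (labelled rows only)
by_sex = train.groupby('Sex')['Survived'].agg(['sum', 'count'])
by_sex.columns = ['Survived', 'People']
by_sex['PctSurvived'] = by_sex['Survived'] / by_sex['People']
by_sex
```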
Survived | People | PctSurvived | |
---|---|---|---|
Sex | |||
female | 233.0 | 314 | 0.742038 |
male | 109.0 | 577 | 0.188908 |
This makes sense: the Titanic's evacuation procedure had a women-and-children-first policy, so the vast majority of the women were led onto lifeboats.
In [5]:
[Figure: survival rate by sex]
Passenger Class
What about survival with respect to each passenger class (Pclass)? Did people of higher social class tend to live through the catastrophe?
In [6]:
Survived | People | PctSurvived | |
---|---|---|---|
Pclass | |||
1 | 136.0 | 216 | 0.629630 |
2 | 87.0 | 184 | 0.472826 |
3 | 119.0 | 491 | 0.242363 |
So people in higher passenger classes did have higher survivability. It's hard to say why without digging into the details: maybe it's the way they were dressed, the way they conducted themselves during emergencies, being better connected, or a higher percentage of women (who, as we found earlier, tended to be led onto lifeboats more readily). All of these are questions we want to keep in mind.
In [7]:
[Figure: survival rate by passenger class]
How about if we separate by both social class and gender? How does the survival rate look then?
In [8]:
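One way to get this breakdown is a pivot table over the training rows (a sketch):

```python
# Mean survival rate for every Sex x Pclass combination
pd.pivot_table(train, values='Survived', index='Sex', columns='Pclass', aggfunc='mean')
```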
Pclass | 1 | 2 | 3 |
---|---|---|---|
Sex | |||
female | 0.968085 | 0.921053 | 0.500000 |
male | 0.368852 | 0.157407 | 0.135447 |
Looking across passenger classes, males in higher classes get the biggest survivability boost. Let's graph this for some visuals.
In [9]:
[Figure: survival rate by sex and passenger class]
Port of Embarkation
In [10]:
Survived | People | PctSurvived | |
---|---|---|---|
Embarked | |||
C | 93.0 | 168 | 0.553571 |
Q | 30.0 | 77 | 0.389610 |
S | 217.0 | 644 | 0.336957 |
So embarking at port C gives a higher survival rate. I wonder why.
In [11]:
[Figure: survival rate by port of embarkation]
Age
So, was the Titanic filled with kids? Or seniors?
I’ve grouped people into buckets of 5 years, starting with age 0.
In [12]:
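The 5-year bucketing can be done with simple integer arithmetic on Age (a sketch; `AgeGroup5` matches the column name used later):

```python
import numpy as np

# Bucket ages into 5-year bins: 0-4 -> 0, 5-9 -> 5, and so on (NaN stays NaN)
full['AgeGroup5'] = np.floor(full['Age'] / 5) * 5

# How many passengers fall into each bucket
full['AgeGroup5'].value_counts().sort_index().plot(kind='bar')
```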
[Figure: passenger counts by 5-year age group]
We can get a bit more granular: let's break down the ages by passenger class.
In [13]:
So we think that kids, like females, were given priority during the evacuation. Let's check if this is true.
In [14]:
Survived | People | PctSurvived | |
---|---|---|---|
AgeGroup5 | |||
0.0 | 27.0 | 40 | 0.675000 |
5.0 | 11.0 | 22 | 0.500000 |
10.0 | 7.0 | 16 | 0.437500 |
15.0 | 34.0 | 86 | 0.395349 |
20.0 | 39.0 | 114 | 0.342105 |
In [15]:
[Figure: survival rate by age group]
Running survival rate
Thought I’d try something different by plotting the cumulative survival rate from age 0, to see how it drops off as you get older.
In [16]:
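A sketch of how the running rate can be computed: sort the labelled passengers by age, then take cumulative sums.

```python
# Cumulative survivors and cumulative headcount as age increases
by_age = train[['Age', 'Survived']].dropna().sort_values('Age')
by_age['CumSurvived'] = by_age['Survived'].cumsum()
by_age['CumCount'] = range(1, len(by_age) + 1)
by_age['CumSurvivalRate'] = by_age['CumSurvived'] / by_age['CumCount']

by_age.plot(x='Age', y='CumSurvivalRate')
```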
[Figure: cumulative survival rate by age]
In [17]:
Age | Survived | CumSurvived | CumCount | CumSurvivalRate | |
---|---|---|---|---|---|
233 | 5.0 | 1.0 | 28.0 | 41 | 0.682927 |
58 | 5.0 | 1.0 | 29.0 | 42 | 0.690476 |
777 | 5.0 | 1.0 | 30.0 | 43 | 0.697674 |
448 | 5.0 | 1.0 | 31.0 | 44 | 0.704545 |
751 | 6.0 | 1.0 | 32.0 | 45 | 0.711111 |
Fare
Fare is essentially a proxy for passenger class: the more expensive the fare, the wealthier you are.

Anyway, let's visualize what the actual fare price distribution looks like.
In [18]:
[Figure: fare price distribution]
In [19]:
Fare | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
Pclass | ||||||||
1 | 323.0 | 87.508992 | 80.447178 | 0.0 | 30.6958 | 60.0000 | 107.6625 | 512.3292 |
2 | 277.0 | 21.179196 | 13.607122 | 0.0 | 13.0000 | 15.0458 | 26.0000 | 73.5000 |
3 | 708.0 | 13.302889 | 11.494358 | 0.0 | 7.7500 | 8.0500 | 15.2458 | 69.5500 |
We need to come back and revisit Fare, since there are fare prices of $0 as well as nulls.
In [20]:
18
In [21]:
Let's see if there's a relationship between fare price and survival.
In [22]:
Fare | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
Pclass | ||||||||
1 | 136.0 | 95.608029 | 85.286820 | 25.9292 | 50.98545 | 77.9583 | 111.481225 | 512.3292 |
2 | 87.0 | 22.055700 | 10.853502 | 10.5000 | 13.00000 | 21.0000 | 26.250000 | 65.0000 |
3 | 118.0 | 13.810946 | 10.663057 | 6.9750 | 7.77500 | 8.5896 | 15.887500 | 56.4958 |
In [23]:
Fare | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
Pclass | ||||||||
1 | 75.0 | 68.996275 | 60.224407 | 5.0000 | 29.7000 | 50.00 | 79.2000 | 263.00 |
2 | 91.0 | 20.692262 | 14.938248 | 10.5000 | 12.9375 | 13.00 | 26.0000 | 73.50 |
3 | 369.0 | 13.780497 | 12.104365 | 4.0125 | 7.7500 | 8.05 | 15.2458 | 69.55 |
In [24]:
Percentage of passengers who survived, by Pclass

And not surprisingly, people in better passenger classes had a higher survival rate.
In [25]:
Cabin
There are a lot of missing values in the Cabin feature, but let's check it out anyway. We would think that the closer your cabin is to the lifeboats, the better your chances.

First, we need to strip the cabin number from the Cabin column, leaving the cabin class letter.
In [26]:
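Since the cabin class is just the leading letter of the cabin string, the extraction is a one-liner (a sketch):

```python
# 'C85' -> 'C', 'B57 B59 B63 B66' -> 'B'; missing cabins stay NaN
full['CabinClass'] = full['Cabin'].str[0]
```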
In [27]:
Survived | People | PctSurvived | AvgFare | AvgAge | PctFemaleInCabin | |
---|---|---|---|---|---|---|
CabinClass | ||||||
A | 7.0 | 15 | 0.466667 | 42.454164 | 44.833333 | 0.066667 |
B | 35.0 | 47 | 0.744681 | 118.550464 | 34.955556 | 0.574468 |
C | 35.0 | 59 | 0.593220 | 100.151341 | 36.086667 | 0.457627 |
D | 25.0 | 33 | 0.757576 | 57.244576 | 39.032258 | 0.545455 |
E | 24.0 | 32 | 0.750000 | 46.026694 | 38.116667 | 0.468750 |
F | 8.0 | 13 | 0.615385 | 18.696792 | 19.954545 | 0.384615 |
G | 2.0 | 4 | 0.500000 | 13.581250 | 14.750000 | 1.000000 |
T | 0.0 | 1 | 0.000000 | 35.500000 | 45.000000 | 0.000000 |
Weird how there’s a single person in cabin T …
More visualizations by cabin class!
In [28]:
[Figure: survival by cabin class]
Feature Engineering
Alright, the fun begins. Now we need to think of new features to add to our classifier. What can we deduce from the raw components we examined earlier?

One quick way of picking out the features with missing values is to look at the non-null counts from the describe() function. We can also visually inspect the quantiles of each feature to see if they make logical sense. For example, why does the Fare feature have a minimum price of 0?
In [29]:
Age | Fare | Parch | PassengerId | Pclass | SibSp | Survived | AgeGroup5 | |
---|---|---|---|---|---|---|---|---|
count | 1046.000000 | 1308.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 891.000000 | 1046.00000 |
mean | 29.881138 | 33.295479 | 0.385027 | 655.000000 | 2.294882 | 0.498854 | 0.383838 | 27.90631 |
std | 14.413493 | 51.758668 | 0.865560 | 378.020061 | 0.837836 | 1.041658 | 0.486592 | 14.60369 |
min | 0.170000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.00000 |
25% | 21.000000 | 7.895800 | 0.000000 | 328.000000 | 2.000000 | 0.000000 | 0.000000 | 20.00000 |
50% | 28.000000 | 14.454200 | 0.000000 | 655.000000 | 3.000000 | 0.000000 | 0.000000 | 25.00000 |
75% | 39.000000 | 31.275000 | 0.000000 | 982.000000 | 3.000000 | 1.000000 | 1.000000 | 35.00000 |
max | 80.000000 | 512.329200 | 9.000000 | 1309.000000 | 3.000000 | 8.000000 | 1.000000 | 80.00000 |
Extracting Titles from Name
Name has a wealth of string information that we can take advantage of. Notice how each person has a title prefix; maybe we can extract it and use it to predict a person's age when it's missing.
In [30]:
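Titles sit between the comma and the first period of each name, so a regular expression pulls them out (a sketch):

```python
# "Braund, Mr. Owen Harris" -> "Mr"; "Rothes, the Countess. of ..." -> "the Countess"
full['Title'] = full['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Age statistics per raw title
full.groupby('Title')[['Age']].describe()
```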
Age | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
Title | ||||||||
Capt | 1.0 | 70.000000 | NaN | 70.00 | 70.00 | 70.0 | 70.00 | 70.0 |
Col | 4.0 | 54.000000 | 5.477226 | 47.00 | 51.50 | 54.5 | 57.00 | 60.0 |
Don | 1.0 | 40.000000 | NaN | 40.00 | 40.00 | 40.0 | 40.00 | 40.0 |
Dona | 1.0 | 39.000000 | NaN | 39.00 | 39.00 | 39.0 | 39.00 | 39.0 |
Dr | 7.0 | 43.571429 | 11.731115 | 23.00 | 38.00 | 49.0 | 51.50 | 54.0 |
Jonkheer | 1.0 | 38.000000 | NaN | 38.00 | 38.00 | 38.0 | 38.00 | 38.0 |
Lady | 1.0 | 48.000000 | NaN | 48.00 | 48.00 | 48.0 | 48.00 | 48.0 |
Major | 2.0 | 48.500000 | 4.949747 | 45.00 | 46.75 | 48.5 | 50.25 | 52.0 |
Master | 53.0 | 5.482642 | 4.161554 | 0.33 | 2.00 | 4.0 | 9.00 | 14.5 |
Miss | 210.0 | 21.774238 | 12.249077 | 0.17 | 15.00 | 22.0 | 30.00 | 63.0 |
Mlle | 2.0 | 24.000000 | 0.000000 | 24.00 | 24.00 | 24.0 | 24.00 | 24.0 |
Mme | 1.0 | 24.000000 | NaN | 24.00 | 24.00 | 24.0 | 24.00 | 24.0 |
Mr | 581.0 | 32.252151 | 12.422089 | 11.00 | 23.00 | 29.0 | 39.00 | 80.0 |
Mrs | 170.0 | 36.994118 | 12.901767 | 14.00 | 27.00 | 35.5 | 46.50 | 76.0 |
Ms | 1.0 | 28.000000 | NaN | 28.00 | 28.00 | 28.0 | 28.00 | 28.0 |
Rev | 8.0 | 41.250000 | 12.020815 | 27.00 | 29.50 | 41.5 | 51.75 | 57.0 |
Sir | 1.0 | 49.000000 | NaN | 49.00 | 49.00 | 49.0 | 49.00 | 49.0 |
the Countess | 1.0 | 33.000000 | NaN | 33.00 | 33.00 | 33.0 | 33.00 | 33.0 |
Do some titles tend to dominate in one gender?
In [31]:
Survived | ||
---|---|---|
Sex | female | male |
Title | ||
Capt | NaN | 0.0 |
Col | NaN | 1.0 |
Don | NaN | 0.0 |
Dona | NaN | NaN |
Dr | 1.0 | 2.0 |
Jonkheer | NaN | 0.0 |
Lady | 1.0 | NaN |
Major | NaN | 1.0 |
Master | NaN | 23.0 |
Miss | 127.0 | NaN |
Mlle | 2.0 | NaN |
Mme | 1.0 | NaN |
Mr | NaN | 81.0 |
Mrs | 99.0 | NaN |
Ms | 1.0 | NaN |
Rev | NaN | 0.0 |
Sir | NaN | 1.0 |
the Countess | 1.0 | NaN |
Working with all these different titles can overfit our data, so let's reclassify the less frequent titles into one of Mr, Mrs, Miss, Master, or Officer. The goal is to come up with the fewest unique titles that still give us non-overlapping age ranges.
In [32]:
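The exact regrouping isn't spelled out, but a mapping along these lines is consistent with the age table that follows (the assignments, and the gender split for Dr, are my reconstruction):

```python
# Fold the rare titles into the five common buckets
title_map = {
    'Capt': 'Officer', 'Col': 'Officer', 'Major': 'Officer',
    'Don': 'Mr', 'Sir': 'Mr', 'Jonkheer': 'Mr', 'Rev': 'Mr',
    'the Countess': 'Mrs', 'Lady': 'Mrs',
    'Dona': 'Miss', 'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Miss',
}
full['Title'] = full['Title'].replace(title_map)

# Doctors are split by gender
full.loc[(full['Title'] == 'Dr') & (full['Sex'] == 'male'), 'Title'] = 'Mr'
full.loc[full['Title'] == 'Dr', 'Title'] = 'Mrs'

full['Title'].unique()
```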
array(['Mr', 'Mrs', 'Miss', 'Master', 'Officer'], dtype=object)
Let's see how well we have separated the age groups by title.
In [33]:
Age | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
Title | ||||||||
Master | 53.0 | 5.482642 | 4.161554 | 0.33 | 2.0 | 4.0 | 9.0 | 14.5 |
Miss | 215.0 | 21.914372 | 12.171758 | 0.17 | 15.5 | 22.0 | 30.0 | 63.0 |
Mr | 598.0 | 32.527592 | 12.476329 | 11.00 | 23.0 | 30.0 | 40.0 | 80.0 |
Mrs | 173.0 | 37.104046 | 12.852049 | 14.00 | 27.0 | 36.0 | 47.0 | 76.0 |
Officer | 7.0 | 54.714286 | 8.440266 | 45.00 | 49.5 | 53.0 | 58.0 | 70.0 |
Family Size
In [34]:
Parch | SibSp | FamilySize | |
---|---|---|---|
7 | 1 | 3 | 5 |
13 | 5 | 1 | 7 |
16 | 1 | 4 | 6 |
In [35]:
In [36]:
Lone Travellers
In [37]:
In [38]:
Survived | People | PctSurvived | |
---|---|---|---|
Alone | |||
0 | 179.0 | 354 | 0.505650 |
1 | 163.0 | 537 | 0.303538 |
Family Name
Extract out the family names
In [39]:
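The family name is everything before the comma in Name, so (a sketch):

```python
# "Braund, Mr. Owen Harris" -> "Braund"
full['FamilyName'] = full['Name'].str.split(',').str[0].str.strip()
```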
In [40]:
Survived | FamilySize | |
---|---|---|
FamilyName | ||
Abbing | 0.0 | 1 |
Abbott | 1.0 | 3 |
Abelseth | NaN | 1 |
Abelson | 1.0 | 2 |
Abrahamsson | NaN | 1 |
Are families more likely to survive?
In [41]:
Survived | FamilyCount | PeopleInFamily | PctSurvived | |
---|---|---|---|---|
FamilySize | ||||
1 | 153.0 | 475 | 475 | 0.322105 |
2 | 89.0 | 104 | 208 | 0.427885 |
3 | 65.0 | 59 | 177 | 0.367232 |
4 | 21.0 | 14 | 56 | 0.375000 |
5 | 4.0 | 6 | 30 | 0.133333 |
6 | 5.0 | 5 | 30 | 0.166667 |
7 | 5.0 | 2 | 14 | 0.357143 |
8 | 0.0 | 1 | 8 | 0.000000 |
11 | 0.0 | 1 | 11 | 0.000000 |
Impute Missing Values
Label Encoding
Categorical columns need to be mapped to integers for some classifiers to work. We'll use sklearn.preprocessing.LabelEncoder() to help us with this.
In [42]:
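A sketch of the encoding loop (it assumes a FamilySizeBucket column was created in an earlier, unshown cell):

```python
from sklearn.preprocessing import LabelEncoder

# One integer-encoded copy of each categorical column
categorical_cols = ['Sex', 'Pclass', 'Title', 'FamilyName', 'FamilySizeBucket']
encoders = {}
for i, col in enumerate(categorical_cols):
    print(i, col)
    encoders[col] = LabelEncoder()
    full[col + 'Encoding'] = encoders[col].fit_transform(full[col].astype(str))
```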
0 Sex
1 Pclass
2 Title
3 FamilyName
4 FamilySizeBucket
Embarked
This is a bit of overkill, but we have two passengers missing their port of embarkation. Simple analysis would have sufficed, but let's go the whole nine yards anyway for completeness' sake.
In [43]:
In [44]:
Survived | Count | AvgFare | AvgAge | ||
---|---|---|---|---|---|
Pclass | Embarked | ||||
1 | C | 59.0 | 85 | 106.845330 | 39.062500 |
Q | 1.0 | 2 | 90.000000 | 38.000000 | |
S | 74.0 | 127 | 72.148094 | 39.121987 |
In [45]:
Survived | |||
---|---|---|---|
Pclass | Embarked | CabinClass | |
1 | C | A | 7 |
B | 22 | ||
C | 21 | ||
D | 11 | ||
E | 5 | ||
Q | C | 2 | |
S | A | 8 | |
B | 23 | ||
C | 36 | ||
D | 18 | ||
E | 20 | ||
T | 1 |
In [46]:
Survived | |||
---|---|---|---|
Pclass | Embarked | Sex | |
1 | C | female | 43 |
male | 42 | ||
Q | female | 1 | |
male | 1 | ||
S | female | 48 | |
male | 79 |
For imputation, let's train a k-nearest-neighbours classifier to figure out which port of embarkation a passenger boarded from.
In [47]:
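A minimal sketch of the KNN imputation (the particular predictor columns are my choice, not necessarily the ones used in the notebook):

```python
from sklearn.neighbors import KNeighborsClassifier

# Features used to guess the port (hypothetical selection)
embark_features = ['Fare', 'PclassEncoding', 'SexEncoding']

known = full[full['Embarked'].notnull() & full['Fare'].notnull()]
unknown = full[full['Embarked'].isnull()]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(known[embark_features], known['Embarked'])
knn.predict(unknown[embark_features])
```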
In [48]:
array(['S', 'S'], dtype=object)
KNN tells us that our passengers with no port of embarkation probably boarded at "S", so let's fill that in.
In [49]:
Let’s run our encoder code again on Embarked.
In [50]:
0 Embarked
Ticket
There's a bit of a judgement call in this one. The goal is to bucket the ticket strings into various classes. After going through all the labels, you slowly start to see some patterns.
In [51]:
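One plausible way to do the bucketing is to key off the alphabetic prefix of each ticket (a sketch; these rules are my guess and the notebook's exact buckets may differ at the margins):

```python
def ticket_group(ticket):
    """Bucket a raw ticket string by its alphabetic prefix."""
    cleaned = ticket.replace('.', '').replace('/', '').upper()
    if cleaned.isdigit():
        return 'Integer_Group'
    prefix = cleaned.split()[0]
    if prefix.startswith(('STON', 'SOTON')):
        return 'STON_Group'
    for key in ('SC', 'SP', 'A', 'C', 'F', 'L', 'P', 'W'):
        if prefix.startswith(key):
            return key + '_Group'
    return 'Other_Group'

full['TicketGroup'] = full['Ticket'].apply(ticket_group)
```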
Survived | |
---|---|
TicketGroup | |
A_Group | 29 |
C_Group | 46 |
F_Group | 7 |
Integer_Group | 661 |
L_Group | 4 |
P_Group | 65 |
SC_Group | 23 |
SP_Group | 7 |
STON_Group | 36 |
W_Group | 13 |
In [52]:
0 TicketGroup
Impute Age
Yay if you've actually read this far! Age is a very important factor in predicting survivability, so let's take all the raw and engineered features and fill in the blanks for our passengers.
In [53]:
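A sketch of the grid-searched KNN regression for Age (the predictor list is my assumption; the notebook reports a best parameter of 15 neighbours below):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical predictors for Age
age_features = ['SexEncoding', 'PclassEncoding', 'TitleEncoding',
                'SibSp', 'Parch', 'FamilySize']

has_age = full[full['Age'].notnull()]
no_age = full[full['Age'].isnull()]

grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={'n_neighbors': [2, 4, 7, 10, 15, 30, 50, 100, 200]},
                    cv=5, n_jobs=4)
grid.fit(has_age[age_features], has_age['Age'])
grid.best_params_

# Keep known ages, fill the gaps with the regressor's predictions
full['ImputedAge'] = full['Age']
full.loc[full['Age'].isnull(), 'ImputedAge'] = grid.predict(no_age[age_features])
```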
{'n_neighbors': 15}
In [54]:
Now that we've trained our classifier, what does the new imputed age distribution look like compared to the old one?
In [55]:
[Figure: original vs. imputed age distributions]
Impute Fare
In [56]:
Fare | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
Pclass | ||||||||
1 | 323.0 | 87.508992 | 80.447178 | 0.0 | 30.6958 | 60.0000 | 107.6625 | 512.3292 |
2 | 277.0 | 21.179196 | 13.607122 | 0.0 | 13.0000 | 15.0458 | 26.0000 | 73.5000 |
3 | 708.0 | 13.302889 | 11.494358 | 0.0 | 7.7500 | 8.0500 | 15.2458 | 69.5500 |
In [57]:
In [58]:
GridSearchCV(cv=5, error_score='raise',
estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'),
fit_params={}, iid=True, n_jobs=4,
param_grid={'n_neighbors': [2, 4, 7, 10, 15, 30, 50, 100, 200]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
In [59]:
Fare | |
---|---|
Pclass | |
1 | 7 |
2 | 6 |
3 | 4 |
In [60]:
In [61]:
In [62]:
EmbarkedEncoding | SexEncoding | PclassEncoding | Alone | SibSp | Parch | FamilySize | TitleEncoding | ImputedAge | TicketGroupEncoding | |
---|---|---|---|---|---|---|---|---|---|---|
179 | 2 | 1 | 2 | 1 | 0 | 0 | 1 | 2 | 36.0 | 4 |
263 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 2 | 40.0 | 3 |
271 | 2 | 1 | 2 | 1 | 0 | 0 | 1 | 2 | 25.0 | 4 |
277 | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | 32.0 | 3 |
302 | 2 | 1 | 2 | 1 | 0 | 0 | 1 | 2 | 19.0 | 4 |
In [63]:
Mother
"Mothers and children first", or so I imagine they'd say as the Titanic was sinking. Let's try to identify the mothers using the passenger's gender, age, title, and whether she has a child.
In [64]:
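A sketch of the flag, built from the criteria above (the adult-age cutoff of 18 is my assumption):

```python
# Likely mother: adult female travelling with a child/parent relation, not titled "Miss"
full['Mother'] = ((full['Sex'] == 'female') &
                  (full['ImputedAge'] >= 18) &
                  (full['Parch'] > 0) &
                  (full['Title'] != 'Miss')).astype(int)
```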
Cabin Class
Yikes, there are quite a few missing cabin values to impute. I'm not sure how our classifier will do given the very limited number of training examples, but let's give it a shot anyway.
In [65]:
1014
What are the different types of cabins given to us?
In [66]:
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6', 'C23 C25 C27',
'B78'], dtype=object)
In [67]:
In [68]:
0 CabinClass
In [69]:
Survivors in each cabin class, broken down by passenger class.
In [70]:
Survived | |||
---|---|---|---|
Pclass | 1 | 2 | 3 |
CabinClass | |||
A | 7.0 | NaN | NaN |
B | 35.0 | NaN | NaN |
C | 35.0 | NaN | NaN |
D | 22.0 | 3.0 | NaN |
E | 18.0 | 3.0 | 3.0 |
F | NaN | 7.0 | 1.0 |
G | NaN | NaN | 2.0 |
T | 0.0 | NaN | NaN |
In [71]:
In [72]:
In [73]:
In [74]:
In [75]:
/Users/Estinox/bin/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:516: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
% (min_labels, self.n_folds)), Warning)
GridSearchCV(cv=None, error_score='raise',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False),
fit_params={}, iid=True, n_jobs=1,
param_grid={'n_estimators': [10, 20, 30]}, pre_dispatch='2*n_jobs',
refit=True, scoring=None, verbose=0)
In [76]:
In [77]:
0 ImputedCabinClass
In [78]:
Cabin Number
Trying to be a bit creative here. My hypothesis is that the cabin numbers are laid out in order from one end of the ship to the other, so the number would indicate whether the room is at the front, middle, or back of the ship.
In [79]:
In [80]:
CabinNumber | Cabin | |
---|---|---|
1 | 85 | C85 |
3 | 123 | C123 |
6 | 46 | E46 |
10 | 6 | G6 |
11 | 103 | C103 |
In [81]:
In [82]:
0 CabinNumber
In [83]:
In [84]:
/Users/Estinox/bin/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:516: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
% (min_labels, self.n_folds)), Warning)
GridSearchCV(cv=None, error_score='raise',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False),
fit_params={}, iid=True, n_jobs=1,
param_grid={'n_estimators': [10, 20, 50, 100]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
In [85]:
In [86]:
0 ImputedCabinNumber
In [87]:
CabinNumber | |
---|---|
count | 289 |
unique | 104 |
top | 6 |
freq | 9 |
In [88]:
[Figure: cabin number distribution]
Cabin N-Tile
Take a look at the min and max of our cabin numbers.
In [89]:
In [90]:
(2, 148)
In [91]:
In [92]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
In [93]:
In [94]:
Model Fitting
Base Line
If we just predicted that no one survived, we'd have a 61.6% accuracy rate. This is the minimum baseline we have to beat with our classifier.
In [95]:
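The baseline is simply one minus the training survival rate:

```python
# "Everyone perished" baseline: accuracy of always predicting Survived = 0
1 - train['Survived'].mean()
```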
0.6161616161616161
Null Checks
In [96]:
Check for null entries in our features
In [97]:
EmbarkedEncoding | Parch | PclassEncoding | SexEncoding | SibSp | TitleEncoding | FamilySize | Alone | FamilyNameEncoding | ImputedAge | Mother | ImputedFare | TicketGroupEncoding | ImputedCabinTertile |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Okay, good: the check above returned no rows with null entries. We can continue.
Fitting
I've commented out a few classifiers I tried to fit the model with. If you've cloned this notebook, you can easily uncomment those rows and check out the results for yourself.

For each classifier, I've decided to use grid-search cross-validation and have the grid search return the best parameters along with the best cross-validation score.
In [98]:
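A sketch of the fitting loop (it assumes all the engineered columns from the sections above exist on `full`; the feature names match the null-check table further down, and the random forest's n_estimators grid matches the scores reported below):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Feature matrix for the labelled (training) passengers
feature_cols = ['EmbarkedEncoding', 'Parch', 'PclassEncoding', 'SexEncoding',
                'SibSp', 'TitleEncoding', 'FamilySize', 'Alone',
                'FamilyNameEncoding', 'ImputedAge', 'Mother', 'ImputedFare',
                'TicketGroupEncoding', 'ImputedCabinTertile']
labelled = full[full['Survived'].notnull()]
X, y = labelled[feature_cols], labelled['Survived']

# Each candidate model gets its own hyper-parameter grid; other classifiers
# (e.g. logistic regression, SVC) can be added to this list and tried the same way
candidates = [
    (RandomForestClassifier(), {'n_estimators': [100, 500, 1000, 2000, 4000]}),
]

fitted = []
for estimator, param_grid in candidates:
    search = GridSearchCV(estimator, param_grid=param_grid, cv=5, n_jobs=4)
    search.fit(X, y)
    fitted.append(search)
print('models fitted')
```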
models fitted
In [99]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
{'n_estimators': 1000}
0.837261503928
[mean: 0.83277, std: 0.01839, params: {'n_estimators': 100}, mean: 0.83614, std: 0.02583, params: {'n_estimators': 500}, mean: 0.83726, std: 0.02555, params: {'n_estimators': 1000}, mean: 0.83502, std: 0.02497, params: {'n_estimators': 2000}, mean: 0.83502, std: 0.02683, params: {'n_estimators': 4000}]
How well have we trained against our own training set?
In [100]:
<class 'sklearn.ensemble.forest.RandomForestClassifier'>
precision recall f1-score support
0.0 1.00 1.00 1.00 550
1.0 1.00 1.00 1.00 341
avg / total 1.00 1.00 1.00 891
How important was each of the features in our random forest?
In [101]:
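Building on the model-fitting sketch above, the ranking can be pulled from the fitted forest like this:

```python
# Rank features by the best random forest's impurity-based importances
best_rf = fitted[0].best_estimator_
importances = pd.DataFrame({'Feature': feature_cols,
                            'Importance': best_rf.feature_importances_})
print('Random Forest:')
print(importances.sort_values('Importance', ascending=False))
```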
Random Forest:
Importance Feature
3 0.168709 SexEncoding
11 0.161194 ImputedFare
8 0.157862 FamilyNameEncoding
9 0.143067 ImputedAge
5 0.114327 TitleEncoding
2 0.069930 PclassEncoding
6 0.039600 FamilySize
12 0.034325 TicketGroupEncoding
4 0.027307 SibSp
0 0.025159 EmbarkedEncoding
13 0.024457 ImputedCabinTertile
1 0.016596 Parch
7 0.009526 Alone
10 0.007940 Mother
How certain are we of our predictions? We can graph this using predict_proba.
In [102]:
Submission
Finally, we'll loop through all our classifiers and save a predicted CSV file for each.
In [103]:
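A sketch of the submission loop, again building on the earlier fitting sketch (the output filename pattern is my own):

```python
# One submission file per fitted model; Kaggle expects PassengerId,Survived
unlabelled = full[full['Survived'].isnull()]
for search in fitted:
    name = type(search.best_estimator_).__name__
    preds = search.predict(unlabelled[feature_cols]).astype(int)
    submission = pd.DataFrame({'PassengerId': unlabelled['PassengerId'],
                               'Survived': preds})
    submission.to_csv('submission_%s.csv' % name, index=False)
```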
Extras
Over here, we have the learning curve analysis from Andrew Ng’s machine learning class.
In [104]:
Conclusion
It was definitely fun developing this notebook, and hopefully it helps anyone who's trying to tackle the Kaggle Titanic problem. Again, if you're looking for the source of the Jupyter notebook, it can be found in the introduction. Thanks for reading!