Titanic Data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
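The next summary lists 9 columns instead of 12: Name, Ticket and Cabin are gone. The call that removed them is not shown, so the following is only a minimal sketch of how such a reduction might be done with `DataFrame.drop`. The frame `toy` and its values are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Toy stand-in for the Titanic frame; the values are invented for illustration.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Ticket": ["T1", "T2", "T3"],
    "Cabin": [None, "C85", None],
    "Age": [22.0, None, 26.0],
})

# Drop the free-text and mostly-empty columns; the rows are left untouched.
reduced = toy.drop(columns=["Name", "Ticket", "Cabin"])
print(list(reduced.columns))
print(len(reduced))
```

Dropping columns keeps all rows, which matches the unchanged count of 891 entries in the summary that follows.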
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.7+ KB
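The summary that follows has 712 entries, all non-null, and an index that still runs from 0 to 890. That pattern is what one would expect from dropping every row that contains a missing value, although the call itself is not shown. A minimal sketch, assuming `DataFrame.dropna` with default arguments on a made-up frame:

```python
import pandas as pd

# Toy frame with gaps in two columns; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Remove every row that has at least one missing value.
complete = toy.dropna()
print(len(toy), "->", len(complete))

# The surviving rows keep their original labels, so the index has gaps.
print(list(complete.index))
```

Because the original labels are kept, the index is no longer a contiguous range, which is why the summary reports an Int64Index rather than a RangeIndex.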
<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 712 non-null int64
Survived 712 non-null int64
Pclass 712 non-null int64
Sex 712 non-null object
Age 712 non-null float64
SibSp 712 non-null int64
Parch 712 non-null int64
Fare 712 non-null float64
Embarked 712 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB
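# The dataset is reduced from 891 to 712 rows, which means we are wasting
# data: 179 rows, about 20% of the passengers, are thrown away.
# So we keep every row and make use of as much of the data as we can.
# Convert Pclass, Sex and Embarked to indicator (dummy) columns with pandas,
# and drop the original columns after the conversion.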
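dummies = []
cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
    # one indicator (dummy) frame per categorical column
    dummies.append(pd.get_dummies(df[col]))
# combine the indicator frames side by side
titanic_dummies = pd.concat(dummies, axis=1)
titanic_dummies.head()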
   1  2  3  female  male  C  Q  S
0  0  0  1       0     1  0  0  1
1  1  0  0       1     0  1  0  0
2  0  0  1       1     0  0  0  1
3  1  0  0       1     0  0  0  1
4  0  0  1       0     1  0  0  1
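df = pd.concat((df, titanic_dummies), axis=1)
# drop the original categorical columns now that they are encoded
df = df.drop(['Pclass', 'Sex', 'Embarked'], axis=1)
df.head()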
   C  Q  S
0  0  0  1
1  1  0  0
2  0  0  1
3  0  0  1
4  0  0  1
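The encoding above produces bare column names such as 1, 2 and 3, which are hard to interpret once they sit next to the other features. As an alternative sketch (not the approach used in this notebook), `pd.get_dummies` can be given the whole frame plus a `columns` list; it then prefixes each indicator with its source column and removes the originals in one call. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Toy frame with the three categorical columns; values are invented.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Fare": [7.25, 71.28, 7.92],
})

# Encode the listed columns; each indicator is named "<column>_<value>"
# and the original categorical columns are not kept in the result.
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex", "Embarked"])
print(sorted(encoded.columns))
```

Only the values that actually occur get a column, so this toy frame yields no `Pclass_2` or `Embarked_Q`.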
df.info()
# Age has many missing values: only 714 of the 891 entries are non-null
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
1 891 non-null uint8
2 891 non-null uint8
3 891 non-null uint8
female 891 non-null uint8
male 891 non-null uint8
C 891 non-null uint8
Q 891 non-null uint8
S 891 non-null uint8
dtypes: float64(2), int64(4), uint8(8)
memory usage: 48.8 KB
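In the summary that follows, Age has 891 non-null values instead of 714, so the 177 gaps were filled in between. The call that did this is not shown. A minimal sketch, assuming the gaps are filled with the column median (mean imputation or interpolation would give the same non-null count), on a made-up column:

```python
import pandas as pd

# Toy Age column with gaps; the values are invented for illustration.
toy = pd.DataFrame({"Age": [22.0, None, 26.0, None, 35.0]})

# Compute the median of the known ages; missing values are skipped by default.
median_age = toy["Age"].median()

# Replace each missing age with that median, keeping every row.
toy["Age"] = toy["Age"].fillna(median_age)
print(int(toy["Age"].isna().sum()))
```

Unlike dropping incomplete rows, this keeps every row, at the cost of introducing estimated ages.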
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
1 891 non-null uint8
2 891 non-null uint8
3 891 non-null uint8
female 891 non-null uint8
male 891 non-null uint8
C 891 non-null uint8
Q 891 non-null uint8
S 891 non-null uint8
dtypes: float64(2), int64(4), uint8(8)
memory usage: 48.8 KB