Sparkling Titanic

Introduction

titanic_logReg.py trains a logistic regression model and makes predictions for the Titanic dataset, as part of the Kaggle competition, using Apache Spark (spark-1.3.1-bin-hadoop2.4) and its Python API on a local machine. I used pyspark_csv.py to load the data as a Spark DataFrame; for more instructions, see this.
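
For orientation, the sketch below shows roughly what such a training script can look like on Spark 1.3.x. It is not the actual contents of titanic_logReg.py: the feature choice (Pclass, Fare), the file path, and the csvToDataFrame call are assumptions based on the usual pyspark_csv usage.

    # Rough sketch (not the repo's script): load Kaggle's train.csv with
    # pyspark_csv and fit a logistic regression with Spark 1.3.x MLlib.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint
    import pyspark_csv as pycsv

    sc = SparkContext(appName="sparkling-titanic")
    sqlContext = SQLContext(sc)
    sc.addPyFile("pyspark_csv.py")  # ship the CSV helper to the executors

    # Parse train.csv (from Kaggle) into a Spark DataFrame
    train_df = pycsv.csvToDataFrame(sqlContext, sc.textFile("train.csv"))

    # Label = Survived, features = (Pclass, Fare); an assumed feature choice
    points = train_df.rdd.map(
        lambda row: LabeledPoint(row.Survived, [row.Pclass, row.Fare]))

    # Fit the model and inspect its weights
    model = LogisticRegressionWithLBFGS.train(points)
    print(model.weights)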

The following will be added later:

  • Imputing NAs in the train and test sets (a sketch of one approach follows this list)
  • Cross-validation
  • Using more features and feature engineering
  • RandomForest classifier, SVM, etc.
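
As a placeholder for the first item, here is a sketch of mean-imputing the Age column (an assumption, not the repo's eventual implementation); it continues the sketch from the Introduction and reuses train_df:

    # Fill missing Age values with the column mean (continues the earlier sketch)
    mean_age = train_df.selectExpr("avg(Age) AS mean_age").first().mean_age
    train_df = train_df.fillna({"Age": mean_age})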

Running PySpark Script in Shell

Run the script with $SPARK_HOME/bin/spark-submit scriptDirectoryPath/titanic_logReg.py. To use multiple threads, add the option --master local[N], where N is the number of worker threads.
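
For example, to run on four local worker threads (keeping the placeholder scriptDirectoryPath from above):

    $SPARK_HOME/bin/spark-submit --master local[4] scriptDirectoryPath/titanic_logReg.py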
