Sparkling Titanic

Introduction

titanic_logReg.py trains a logistic regression model and makes predictions for the Titanic dataset, as part of the Kaggle competition, using Apache Spark (spark-1.3.1-bin-hadoop2.4) and its Python API on a local machine. I used pyspark_csv.py to load the data as a Spark DataFrame; for more instructions, see this.
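
For orientation, the sketch below shows roughly what such a training script can look like on Spark 1.3.x. It is not the actual contents of titanic_logReg.py: the feature choice (Pclass, Fare), the file path, and the csvToDataFrame call are assumptions based on the usual pyspark_csv usage.

    # Rough sketch (not the repo's script): load Kaggle's train.csv with
    # pyspark_csv and fit a logistic regression with Spark 1.3.x MLlib.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint
    import pyspark_csv as pycsv

    sc = SparkContext(appName="sparkling-titanic")
    sqlContext = SQLContext(sc)
    sc.addPyFile("pyspark_csv.py")  # ship the CSV helper to the executors

    # Parse train.csv (from Kaggle) into a Spark DataFrame
    train_df = pycsv.csvToDataFrame(sqlContext, sc.textFile("train.csv"))

    # Label = Survived, features = (Pclass, Fare); an assumed feature choice
    points = train_df.rdd.map(
        lambda row: LabeledPoint(row.Survived, [row.Pclass, row.Fare]))

    # Fit the model and inspect its weights
    model = LogisticRegressionWithLBFGS.train(points)
    print(model.weights)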

The following will be added later:

  • Imputing NAs in the train and test sets (a sketch of one approach follows this list)
  • Cross-validation
  • Using more features and feature engineering
  • RandomForest classifier, SVM, etc.
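
As a placeholder for the first item, here is a sketch of mean-imputing the Age column (an assumption, not the repo's eventual implementation); it continues the sketch from the Introduction and reuses train_df:

    # Fill missing Age values with the column mean (continues the earlier sketch)
    mean_age = train_df.selectExpr("avg(Age) AS mean_age").first().mean_age
    train_df = train_df.fillna({"Age": mean_age})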

Running PySpark Script in Shell

Run the script with $SPARK_HOME/bin/spark-submit scriptDirectoryPath/titanic_logReg.py. To use multiple threads, add the option --master local[N], where N is the number of worker threads.
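
For example, to run on four local worker threads (keeping the placeholder scriptDirectoryPath from above):

    $SPARK_HOME/bin/spark-submit --master local[4] scriptDirectoryPath/titanic_logReg.py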
