Big Data is on every CIO’s mind. It is presently synonymous with open source technologies like Hadoop and the ‘NoSQL’ class of databases. Another technology that is shaking things up in Big Data is R (www.r-project.org, #rstats). R is an open source programming language and software environment designed for statistical computing and visualisation. It is often described as one of the fastest growing analytics platforms and is established in both academia and industry for its robustness, reliability and accuracy. For real big data analyses you have to access the data in your preferred database on the fly. In this talk I will give a short overview of R and the available connections to MongoDB, and present some big data analyses using R and MongoDB.
1. R Statistics with MongoDB
Dr. Markus Schmidberger
October 14th, 2013 Munich, Germany
Email: markus@mongosoup.de
Twitter: @cloudHPC
3. Outline
Introduction to Big Data, MongoSoup and R
R statistics with MongoDB and Examples
Summary & Questions
4. Big Data
Wikipedia: … a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management
tools or traditional data processing. …
storing
processing
5. Storing: NoSQL - MongoDB
databases using looser consistency models to store data
German MongoDB as a Service: MongoSoup
cloudControl Add-On
currently running on AWS EU-Region (Ireland)
all features available: shared / dedicated hosting, replica set, sharding
24/7 support available
6. MongoSoup in < 5 min
go to cloudControl: www.cloudcontrol.com
add an account and a billing address
create a new app, e.g. “rmongodb”
install cloudControl command line tools: cctrlapp
enable your preferred MongoSoup hosting: cctrlapp rmongodb/default addon.add mongosoup.medium
go to the cloudControl Web-Console-AddOns and get your credentials: https://www.cloudcontrol.com/console/app/rmongodb
7. Processing: Analyzing with R and Hadoop
backward-looking analysis is outdated
today: quasi real-time analysis
tomorrow: forward-looking predictive analysis
more complex methods, more data available, more processing time required
Check my Strata London Tutorial “Big Data Analyses with R”
8. Introduction to R
R is a free software environment for statistical computing and graphics
offers tools to manage and analyze data
standard statistical methods are implemented
compiles and runs under different OS
support via huge community
www.r-project.org
9. huge online libraries with > 5000 R packages: http://cran.r-project.org
possibility to write personalized code and to contribute new packages
widely known since January 6, 2009: The New York Times, “Data Analysts Captivated by R's Power”
10. RStudio IDE
http://www.rstudio.com
11. R as calculator
(5+5) - 1 * 3
[1] 7
x <- 3
x
[1] 3
x^2 + 4
[1] 13
12. R Statistics with MongoDB
y <- c(1,2,3)
y
[1] 1 2 3
x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10
x < 5
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
13. R Statistics with MongoDB
x[3:7]
[1] 3 4 5 6 7
mean(x)
[1] 5.5
help("mean")
?mean
16. R Statistics with MongoDB
plot(dat, col = cl$cluster, cex=2, pch=16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
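The definitions of dat and cl are not part of this transcript. A minimal sketch of one plausible setup, assuming dat is a two-column numeric matrix and cl is the result of kmeans() with four centers (the simulated data, the seed and the name true_centers are assumptions for illustration only):

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Hypothetical data: four point clouds in two dimensions
true_centers <- rbind(c(0, 0), c(3, 0), c(0, 3), c(3, 3))
dat <- do.call(rbind, lapply(1:4, function(i) {
  cbind(x = rnorm(50, mean = true_centers[i, 1], sd = 0.4),
        y = rnorm(50, mean = true_centers[i, 2], sd = 0.4))
}))

# k-means clustering with four clusters (stats package, part of base R)
cl <- kmeans(dat, centers = 4, nstart = 10)

# Same plotting calls as on the slide
plot(dat, col = cl$cluster, cex = 2, pch = 16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
```

cl$cluster holds one cluster label per row of dat, and cl$centers holds the four fitted cluster centers, which is why the colours 1:4 line up with the point colours.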
17. R Shiny - easy web application
developed by RStudio
turns R analyses into interactive web applications that anyone can use
let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields
easily incorporate any number of outputs like plots, tables, and summaries
no HTML or JavaScript knowledge is necessary, only R
http://www.rstudio.com/shiny/
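A minimal sketch of such an application, assuming a current version of the shiny package is installed. The input id "bins", the output id "hist" and the choice of the iris data are illustrative assumptions, not taken from the talk:

```r
library(shiny)

# User interface: one slider as input, one plot as output
ui <- fluidPage(
  titlePanel("Histogram sketch"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server logic: the plot is redrawn whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(iris$Sepal.Length, breaks = input$bins,
         main = "Sepal length", xlab = "cm")
  })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app)  # uncomment to launch the app in a browser
```

The whole application is plain R: the slider is the "friendly control", and renderPlot() is the output that reacts to it.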
18. R and Databases
SQL provides a standard language to filter, aggregate, group, sort data
SQL in new places: Hive, Impala, …
ODBC provides SQL interface to non-database data (Excel, CSV, text files)
R stores relational data in data.frames (extended lists)
19. R Statistics with MongoDB
data(iris)
head(iris, n=3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
class(iris)
[1] "data.frame"
20. R package: sqldf
running SQL statements on R data frames
library(sqldf)
sqldf("select * from iris limit 2")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
sqldf("select count(*) from iris")
  count(*)
1      150
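The same approach extends to grouping and aggregation. A minimal sketch, assuming the sqldf package is installed; it only touches the Species column, because the handling of dots in column names such as Sepal.Length may differ between sqldf versions:

```r
library(sqldf)

data(iris)

# Count the rows per species with plain SQL on the data frame
per_species <- sqldf("select Species, count(*) as n from iris group by Species")
per_species
```

The result is again an ordinary data frame with one row per species, so it can be passed on to any other R function.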
21. Other relational R packages
RMySQL package provides an interface to MySQL
RPostgreSQL package provides an interface to PostgreSQL
ROracle package provides an interface for Oracle
RJDBC package provides access to databases through a JDBC interface
RSQLite package provides access to SQLite (SQLite engine is included)
One big problem:
by default, all packages read the full result into R memory
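The memory concern applies when a whole result set is pulled in with a single call. A minimal sketch of fetching in chunks instead, assuming current versions of the DBI and RSQLite packages; the in-memory SQLite database and the chunk size of 50 are stand-ins chosen for illustration:

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for a real server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)

# Send the query without fetching, then pull the rows in small chunks
res <- dbSendQuery(con, "SELECT * FROM iris")
total_rows <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 50)       # at most 50 rows are fetched per call
  total_rows <- total_rows + nrow(chunk)
}
dbClearResult(res)
dbDisconnect(con)

total_rows
```

Each chunk can be summarised and discarded before the next one is fetched, so only one chunk has to fit into R memory at a time.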
22. R and MongoDB
on CRAN there are two packages to connect R with MongoDB
rmongodb (supported by MongoDB, Inc.)
  powerful for big data
  difficult to use due to BSON objects
RMongo
  easy to use
  limited functionality
  reads full results into R memory
  does not work on Mac OS X
23. R package: RMongo
library(RMongo)
# connect to the database (database name, host, port) and authenticate
mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)
dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
# list collections, run a JSON query, insert a JSON document
dbShowCollections(mongo)
dbGetQuery(mongo, "zips", "{'state':'AL'}")
dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')
dbDisconnect(mongo)
24. R package: rmongodb
developed on top of the MongoDB supported C driver
library(rmongodb)
mongo <- mongo.create(host="dbs001.mongosoup.de",
  db="cc_JwQcDLJSYQJb",
  username="JwQcDLJSYQJb",
  password="RSXPkUkXXXXX")
mongo
[1] 0
attr(,"mongo")
<pointer: 0x105a1de80>
attr(,"class")
[1] "mongo"
attr(,"host")
[1] "dbs001.mongosoup.de"
attr(,"name")
[1] ""
attr(,"username")
[1] "JwQcDLJSYQJb"
attr(,"password")
[1] "RSXPkUkXXXXX"
attr(,"db")
[1] "cc_JwQcDLJSYQJb"
attr(,"timeout")
[1] 0
25. R Statistics with MongoDB
mongo.get.database.collections(mongo, "cc_JwQcDLJSYQJb")
[1] "cc_JwQcDLJSYQJb.zips" "cc_JwQcDLJSYQJb.ccp" "cc_JwQcDLJSYQJb.test"
mongo <- mongo.disconnect(mongo)
26. R Statistics with MongoDB
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "state", "AL")
[1] TRUE
query <- mongo.bson.from.buffer(buf)
query
	state : 2 	 AL
27. R Statistics with MongoDB
res <- mongo.find.one(mongo, "cc_JwQcDLJSYQJb.zips", query)
res
	city : 2 	 ACMAR
	loc : 4
		0 : 1 	 -86.515570
		1 : 1 	 33.584132
	pop : 16 	 6055
	state : 2 	 AL
	_id : 2 	 35004
28. R Statistics with MongoDB
out <- mongo.bson.to.list(res)
out$loc
[1] -86.52  33.58
typeof(out$loc)
[1] "double"
out$pop
[1] 6055
out$state
[1] "AL"
29. R Statistics with MongoDB
cursor <- mongo.find(mongo, "cc_JwQcDLJSYQJb.zips", query)
res <- NULL
while (mongo.cursor.next(cursor)){
  value <- mongo.cursor.value(cursor)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}
err <- mongo.cursor.destroy(cursor)
head(res, n=4)
       city         loc       pop   state _id
Rvalue "ACMAR"      Numeric,2 6055  "AL"  "35004"
Rvalue "ADAMSVILLE" Numeric,2 10616 "AL"  "35005"
Rvalue "ADGER"      Numeric,2 3205  "AL"  "35006"
Rvalue "KEYSTONE"   Numeric,2 14218 "AL"  "35007"
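Growing res with rbind() inside the loop yields a matrix of list cells (hence the "Numeric,2" entries). A minimal sketch of an alternative in base R, assuming the documents have already been converted with mongo.bson.to.list(): collect them in a list and bind once at the end. The docs list below is a hand-written stand-in that reuses the four rows shown above and leaves out the loc field; no database connection is involved:

```r
# Stand-in for the lists returned by mongo.bson.to.list() (loc omitted)
docs <- list(
  list(city = "ACMAR",      pop = 6055,  state = "AL", `_id` = "35004"),
  list(city = "ADAMSVILLE", pop = 10616, state = "AL", `_id` = "35005"),
  list(city = "ADGER",      pop = 3205,  state = "AL", `_id` = "35006"),
  list(city = "KEYSTONE",   pop = 14218, state = "AL", `_id` = "35007")
)

# Turn every document into a one-row data frame ...
rows <- lapply(docs, function(d) {
  data.frame(city = d$city, pop = d$pop, state = d$state, id = d$`_id`,
             stringsAsFactors = FALSE)
})

# ... and bind all rows in a single step
zips <- do.call(rbind, rows)
zips
```

The result is a regular data frame with typed columns, so functions such as mean(zips$pop) or sqldf() work on it directly.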
30. It is all about creating BSON query or field objects
b <- mongo.bson.from.list(
list(name="Fred", age=29, city="Boston"))
b
	name : 2 	 Fred
	age : 1 	 29.000000
	city : 2 	 Boston
mongo.bson.to.list(b)
$name
[1] "Fred"
$age
[1] 29
$city
[1] "Boston"
32. CCP Web Analytics Challenge
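# empty query: match all documents
buf <- mongo.bson.buffer.create()
query <- mongo.bson.from.buffer(buf)
# field selector: return only the "user" and "type" fields (plus _id by default)
buf <- mongo.bson.buffer.create()
err <- mongo.bson.buffer.append(buf, "user", 1)
err <- mongo.bson.buffer.append(buf, "type", 1)
field <- mongo.bson.from.buffer(buf)
# fetch at most 1000 documents from the ccp collection
out <- mongo.find(mongo, "cc_JwQcDLJSYQJb.ccp", query, fields=field, limit=1000)
# convert each BSON document to an R list and stack the rows
res <- NULL
while (mongo.cursor.next(out)){
  value <- mongo.cursor.value(out)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}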
33. R Statistics with MongoDB
boxplot(as.integer(table(unlist(res[,2]))), cex=4, horizontal=TRUE, main="Number of actions per user")
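The call works in three steps: unlist() flattens the column into a plain vector, table() counts how often each value occurs, and as.integer() keeps only the counts. A minimal base R sketch of the same steps, using a small made-up vector of user ids in place of the database result (the ids are purely illustrative):

```r
# Hypothetical stand-in: one entry per recorded action, holding the user id
users <- c("a", "a", "a", "b", "c", "c")

# Number of actions per user, as a plain integer vector
counts <- as.integer(table(users))
counts

# Same plot as on the slide, drawn from the stand-in counts
boxplot(counts, cex = 4, horizontal = TRUE,
        main = "Number of actions per user")
```

Each element of counts corresponds to one user, so the boxplot summarises how the activity is distributed across users.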
34. Shiny Mongo
R based MongoDB User Interface
R packages shiny and rmongodb
less than 200 lines of code
DEMO: http://localhost:8100
https://github.com/comsysto/ShinyMongo
35. Summary
R is a powerful statistical tool to analyse many different kinds of data
R can access databases
MongoDB and rmongodb are ready for Big Data
start playing around with R, Big Data and MongoDB
http://www.r-project.org
http://www.mongodb.org
http://www.mongosoup.de 
36. See you soon
thanks a lot for your attention
there are R trainings in December 2013 in Munich
http://comsysto.com/events.html#r
we are hosting many events and meetups
meet you at the MongoSoup booth
Email: markus@mongosoup.de
Twitter: @cloudHPC