Definition
The R language (R Core Team 2017; Chambers 2008; Matloff 2011) is currently the most popular tool in the general data science field. It features outstanding graphics capabilities and a rich set of more than 10,000 library packages to draw upon, and its interfaces to SQL databases and to the C/C++ languages are first rate. (Other notable languages in data science are Python and Julia. Python is popular among those trained in computer science. Julia, a newer language, makes producing fast code its top priority.) All of this, along with recent developments regarding memory issues, makes R well poised to serve as a highly effective tool in Big Data applications. This chapter presents the use of R in Big Data settings.
It should be noted that Big Data can be “big” in one of two ways, phrased in terms of the classical n × p matrix representing a dataset:
- Big-n: Large number of data points.
- Big-p: Large number of variables/features.
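The two senses can be illustrated with a minimal R sketch. The matrices below are hypothetical toy data, and their sizes are chosen only for illustration; they are far smaller than anything that would count as Big Data in practice. The names big_n and big_p are likewise assumptions made for this example.

```r
# Toy illustration of the two senses of "big" for an n x p data matrix.
# All sizes and object names here are illustrative assumptions.
set.seed(1)

# Big-n: many data points (rows), few variables (columns)
big_n <- matrix(rnorm(100000 * 5), nrow = 100000, ncol = 5)

# Big-p: few data points (rows), many variables (columns)
big_p <- matrix(rnorm(50 * 2000), nrow = 50, ncol = 2000)

# dim() returns c(n, p): the number of rows, then the number of columns
dim(big_n)
dim(big_p)

# In the Big-p case there are more variables than data points
ncol(big_p) > nrow(big_p)
```

In both cases the object is an ordinary n × p matrix; what differs is which dimension dominates.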
Both senses will come into play later. For now, though,...