Data Science and R, Base R, and the tidyverse Ecosystem

Introduction to Data Science in Biostatistics

Abstract

This chapter offers introductory guidance and examples on a recurring theme: quality software is never static but is instead subject to continuous improvement, whether the software is proprietary or open source. R is a typical example, and it illustrates that software must either evolve or die. S was developed in the mid-1970s at a world-renowned telecommunications research laboratory, largely to address the deficiencies of then-existing software. S continued for a time, until a new business model at the telecommunications company contributed to limited support for S. Later, S was reimagined and evolved into what is now known as R. R has evolved continuously since, through each new version of R and the many thousands of freely available R-based packages. A major evolutionary jump for the R community came in the mid-2000s, when the first packages associated with the tidyverse ecosystem became available. These packages offered functionality that finally encouraged many data scientists to embrace R as a first-choice language, using Base R together with this new ecosystem of packages, the tidyverse.

Notes

  1.

    Look at the history of the reshape package (and later the reshape2 package and even later the tidyr package) and note when each package first became available. The reshape package and the ggplot2 package were among the earliest packages associated with the tidyverse ecosystem and their use was quickly accepted by data scientists within the R community, leading to the development of other tidyverse packages and associated tools.

  2.

    Review the term ASCII Art to see how simple graphics were attempted in these early days, often with questionable results (including the subject matter) in some of the early figures.

  3.

    As early as January 6, 2009, a prominent national newspaper in the United States featured R and how it was emerging as a first choice for many engaged in data management and statistical analyses.

  4.

    This is not the only place where it could be mentioned, but it should be remembered that names, both first names and last names, are messy. This issue is especially evident when datasets are merged. Imagine a subject with Sean as a first name. It is not at all inconceivable, right or wrong, that this name may be entered in other datasets as Jean, Jehan, John, Shane, Shaun, Shawn, Shayne, or Shon. Which spelling is correct, which is incorrect, and is it possible to accommodate differences? If there are language accent marks over some characters, how will the software address these special characters, if at all? Then consider the clever idea of using Social Security Numbers, National Identity Numbers, or some other de facto recognized means of identifying individuals by using unique government-issued codes. At first this idea may sound grand, but is it legal? If legal, is it prudent? These national identification numbers may allow for consistency in identifying individuals and they are especially convenient when datasets are merged, but their use also puts individuals at risk for identity fraud, such as unauthorized credit card purchases, title theft, and dodgy mortgage applications. Always deploy best practices against the possibility of a data breach and use caution when deploying identification codes.

  5.

    With the March 2023 release of tidyverse 2.0.0, it can now be stated that the lubridate package has been added as part of the package of packages associated with core tidyverse. Older references to core tidyverse may not refer to the lubridate package.

  6.

    Go to the URL https://cran.r-project.org/web/packages/available_packages_by_name.html#available-packages-A to see the list of Available CRAN Packages By Name. At the time this chapter was prepared, there were nearly 20,000 packages included in this listing. Of those packages, more than 70 packages had the text string tidy somewhere in the package name (not the package description). This listing does not include the many tidyverse ecosystem packages where the text string tidy is not included in the package name, such as broom and pillar. The tidyverse ecosystem is ubiquitous in R.

  7.

    Data scraping is an increasingly important activity, but demonstrating it is far beyond the scope of an introductory text.

  8.

    To promote initiative, a series of challenges are offered on how the data associated with Addendum 2 can be used to address interesting issues. Yet, unlike what is seen in Addendum 1, in Addendum 2 some syntax and output are purposely excluded from this chapter, all part of the desire to gradually encourage skill development. Use the syntax and process first seen in Addendum 1 as a guide for completion of the challenges identified in Addendum 2. This should be an achievable outcome, even at the introductory level of this text.

  9.

    A series of suggestions are offered on how the data associated with Addendum 3 can be used to address interesting issues. Yet, unlike what is seen in Addendum 1 and less so in Addendum 2, in Addendum 3, most syntax and output are purposely excluded. Use the syntax and process first seen in Addendum 1 and partially repeated in Addendum 2 as a guide for completion of Addendum 3, again as a purposeful effort to encourage skill development.

  10.

    Although the janitor::clean_names() function is used occasionally in this text, there are many data scientists who use this function for nearly all datasets in use, as a standard practice for naming object variables.
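
The practice described in this note can be sketched as follows, assuming the janitor package is installed. The data frame `raw` and its column names are hypothetical and serve only as an illustration; they are not taken from the chapter's datasets.

```r
# Hypothetical data frame with untidy column names (illustrative only).
raw <- data.frame(
  "Entity Name" = c("A", "B"),
  "Birth Rate"  = c(48.4, 30.1),
  check.names   = FALSE  # keep the messy names exactly as typed
)

# Only run the cleaning step if the janitor package is available.
if (requireNamespace("janitor", quietly = TRUE)) {
  # By default, clean_names() converts names to lower-case snake_case.
  cleaned <- janitor::clean_names(raw)
  print(names(cleaned))
}
```

Applying the function immediately after import gives every later step a predictable set of variable names to work with.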

  11.

    Throughout this text, the convention PackageName::FunctionName() has been used regularly to identify functions by their full name, taking into account function namespace, a critical issue if a selected package and associated functions are to work and play well with other packages and other functions. The select() function is an excellent example of why this approach, PackageName::FunctionName(), is used. For this lesson, the select() function is associated with the dplyr package. However, install and load the MASS package and type help(select) at the R prompt to see that the select() function is also associated with the MASS package. If the dplyr package and the MASS package were both loaded in the same R session, a bare call to select() could be ambiguous, because the package attached later masks the function of the same name from the package attached earlier. Using a PackageName::FunctionName() approach to naming functions removes any confusion. If there were ever any question as to which R packages are attached in the current session, merely type sessionInfo() and look at the other attached packages: section of the output.
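
A minimal sketch of the namespace issue in this note, assuming both the dplyr and MASS packages are installed. The data frame `toy` and its values are hypothetical; only the column name birth_rate echoes the chapter.

```r
# Hypothetical toy data frame (names and values are illustrative only).
toy <- data.frame(
  entity     = c("A", "B", "C"),
  birth_rate = c(48.4, 30.1, 12.5)
)

if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("MASS", quietly = TRUE)) {
  # Fully qualified call: always the dplyr version, whatever else is attached.
  selected <- dplyr::select(toy, birth_rate)
  print(names(selected))

  # The two select() functions are defined in different namespaces.
  print(environmentName(environment(dplyr::select)))
  print(environmentName(environment(MASS::select)))
}

# Packages attached by the user in this session (NULL if there are none).
print(names(sessionInfo()$otherPkgs))
```

Because neither package is attached with library() here, nothing is masked, and each call resolves only through its explicit prefix.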

  12.

    An Application Programming Interface (API) paradigm for data retrieval is often desired, but not always feasible. Data scientists need to interact with multiple processes to obtain desired data.

  13.

    There is a known association between wealth and health. Wealth does not guarantee health, but wealth eases access to medicines, availability of services, etc. It is entirely appropriate to consider proxies for wealth, such as Gross Domestic Product (GDP), when examining health-related issues in biostatistics.

  14.

    Addendum 2 purposely does not include the full set of syntax needed for suggested actions, such as syntax used to generate the dataset LGDPByEntity1960OnwardAdjusted.tbl and many later actions based on this dataset. Use the syntax in Addendum 1 as a guide for actions in Addendum 2, but of course make improvements with an expanding skill set and interest.

  15.

    If a simple copy and paste were used to change the syntax in Addendum 1 for use in Addendum 2, do not forget that the main object variable of interest in Addendum 1 was listed as birth_rate, whereas the main object variable of interest in Addendum 2 should be listed as either gdp or GDP. Equally, the scale needs to be adjusted for the Y axis in Addendum 2, with a maximum value for GDP at 21400000000000, whereas the maximum value for the birth_rate Y axis value in Addendum 1 was 48.4.

  16.

    It may be beyond the purpose of this text, but it would still be useful to review the literature to see different views on when missing data should be removed from a dataset and equally, when missing data should be retained but accommodated.

  17.

    The expression 7.68e12, with experience, is far more convenient than writing the exceptionally long number 7680000000000.
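
The equivalence in this note can be checked directly in Base R. This is a small sketch; the object names are illustrative and do not come from the chapter.

```r
# Both literals denote the same number to R.
same_value <- 7.68e12 == 7680000000000
print(same_value)

# The same value rendered as text in fixed and in scientific notation.
fixed_text <- format(7.68e12, scientific = FALSE)
sci_text   <- format(7680000000000, scientific = TRUE)
print(fixed_text)
print(sci_text)
```

The `scientific` argument of format() controls only how the number is displayed; the stored value is unchanged.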

  18.

    The term Deaths of Despair is now part of the lexicon found in the popular press as well as the professional literature and is, collectively, a major contributor to awareness of the growing phenomenon of excess deaths, observed well before the COVID-19 pandemic, but now accelerated far beyond expectations.

  19.

    Recall that use of the leading W is a self-guided practice, indicating that the data are in wide format, not long format.

  20.

    The inclusion of geographic entities in this listing is aspirational. The dataset provided by the World Health Organization may not include all entities. It is also possible that slightly different wording is used for identification of these entities. Use personal judgment on which entities to include.

  21.

    Whatever the result (likely a Pearson’s r estimate), never forget that association (e.g., correlation) does not imply causation.
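
For reference, a Pearson's r estimate of the kind mentioned in this note can be computed in Base R as sketched below. The vectors `x` and `y` are hypothetical values chosen only for illustration and are not drawn from the World Health Organization data.

```r
# Hypothetical paired observations (illustrative values only).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson's r from its definition: covariance scaled by both standard deviations.
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The same estimate from the built-in function.
r_builtin <- cor(x, y, method = "pearson")

# Confirm that the two approaches agree.
print(all.equal(r_manual, r_builtin))
```

A value of r close to 1 or -1 describes only the strength of a linear association between the two variables; it carries no information about why they move together.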

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

MacFarland, T.W. (2024). Data Science and R, Base R, and the tidyverse Ecosystem. In: Introduction to Data Science in Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-031-46383-9_4