Abstract
The purpose of this lesson is to provide a glimpse of contemporary data science. The lesson offers a definition of data science, as promulgated by the federal government, and discusses at length how data science fits into the higher education (i.e., postsecondary, grades 13 and above) curriculum. It also discusses employment opportunities for those who see a future career in which data science skills are either a primary or a secondary job requirement. The lesson then provides a general introduction to R, and to the computing issues inherent to R, where R is used as a platform for data science. The addenda provide more specific information, still at an introductory level, on the tidyverse ecosystem. Those who wish to use this data science text but are totally new to R should consider reviewing materials from among the many published and free resources on R, either before or concurrently with the use of this text.
Notes
- 1.
In this text, the term data is nearly always used in the plural (e.g., the data have been recorded) and the term datum is nearly always used in the singular (e.g., the datum has been recorded). This approach follows the Latin origin of the terms datum and data. The term datapoints is occasionally used, but the term datums is avoided. It is recognized, however, that the term data is regularly seen in the literature for both singular and plural use. Going beyond its use in statistics and data science, the term data is often treated as a mass noun, indicating a substance or quality that cannot be counted.
- 2.
Like uniform coding practices by nearly all other federal agencies, the United States Bureau of Labor Statistics Standard Occupational Classification (SOC) system is a uniform coding system that consists of nearly 900 codes, organized in a hierarchy, used to classify worker occupations.
- 3.
Throughout this text, the color indicates R-based input and the color indicates R-based output.
- 4.
In an effort to save space, look at the expression . Even so, copy and paste the R-based syntax to replicate outcomes.
- 5.
Similar to other software associated with R, the swirl package is free and open source software (FOSS). The package is free to download and install. The syntax associated with the package can also be viewed, to better understand the package and possibly improve upon it.
- 6.
Some apps, of all types and not only those related to health, have automated background sharing of generated data (e.g., daily number of steps, as in the current example) as a default setting. There is often a way of disabling automated data sharing, but many users either do not know about this option or find it difficult to use. A growing number of national governments are responding to this unwanted data harvesting by invoking the right to be left alone. A common remedy is that app developers must make default automated data sharing prominently known at the time of app download or first use, and it must be as easy to disable default automated data sharing as it is to enable it. A few national governments have applied large fines against companies associated with the digital advertisement industry that, unknown to the user, obtain app-generated data without express permission, but the universal application of these protective practices is still uncertain. Regardless of whether large-scale background data harvesting is appropriate, it is only possible due to the emergence of data science applications.
- 7.
The United States Department of Labor (Labor), which is detailed later in this lesson, is also an excellent resource for information on career opportunities and required skills and education for those who wish to work in data science. As an advance organizer to information presented later in this lesson, review Occupational Employment and Wages - 15-2051 Data Scientists (https://www.bls.gov/oes/current/oes152051.htm) and notice the salaries for data scientists (national and selected regions), job distribution throughout the nation, etc. Again, far more detail is provided later in this lesson.
- 8.
Review the detailed description for Medical Informatics (https://nces.ed.gov/ipeds/cipcode/cipdetail.aspx?y=55&cipid=87670) and follow the progression of detail: CIP Code 51, Health Professions and Related Clinical Sciences; CIP Code 51.27, Medical Illustration and Informatics; and CIP Code 51.2706, Medical Informatics.
- 9.
Although the Classification of Instructional Programs was developed specifically for educational purposes, the use of CIP codes is found throughout the many agencies, institutions, businesses, and other entities that have some degree of interest in throughput of students in postsecondary education, including: Department of Commerce, Bureau of the Census; Department of Education, Office of Career, Technical, and Adult Education; Department of Education, Office for Civil Rights; Department of Education, Office of Federal Student Aid; Department of Education, Office of Special Education; Department of Homeland Security; Department of Labor, Bureau of Labor Statistics; National Academy of Sciences; National Institutes of Health; National Occupational Information Coordinating Committee; National Science Foundation; etc. CIP codes are also creatively used by state departments of education and other state agencies, national organizations and professional groups, higher education institutions, technical institutions, and the many businesses that provide employment services.
- 10.
The normal lag from data submission by individual postsecondary institutions to public access of the data at the IPEDS Peer Analysis System has been noticeably extended due to the prior and continuing challenges brought about by COVID-19 and the still frequent practice of work from home (WFH) by many postsecondary education information workers and government counterparts, including external contracted workers. As an example, deadlines for IPEDS data submission were extended, and it is expected that these extensions will ultimately impact the timeliness of data availability. These issues related to data availability are of course not restricted to the United States Department of Education but are endemic to many data resources, like the way COVID-19 has become increasingly endemic.
- 11.
The IPEDS Peer Analysis System offers a Save session option for replication of consistently structured queries, but by no means does this option begin to equal the convenience and quality assurance of an R-based API data query process based on reproducible syntax, as will be seen in later lessons with other data resources.
- 12.
These datasets are in wide format, but tidyverse ecosystem tools can be used to put the data into long format, a tidy approach to dataset structure. This practice is demonstrated multiple times in this text, in later lessons.
- 13.
From among many possible resources, refer to the United States Department of Homeland Security STEM Designated Degree Program list (https://www.ice.gov/sites/default/files/documents/stem-list.pdf) for an audit of the more than 500 CIPs associated with STEM (science, technology, engineering, and mathematics) that are recognized by this federal executive cabinet-level department. Many CIPs on this extensive list require exposure to data science and acquaintance with biostatistics.
- 14.
This listing and many others in this text that are similar should not be viewed as a word-processed table. Instead, text of this type represents output from some type of computing activity, with some degree of modification for presentation purposes. Accordingly, the text shows in, following along with the identification of and used throughout this text.
- 15.
At the time these figures were prepared, Academic Year 2018–2019 was the ending point for the availability of CIP-specific IPEDS data on completers. Even if the data were updated by the time this text becomes available, it is best to question if completions data for Academic Year 2019–2020 and onward for the next few years are typical and appropriate for year-by-year comparisons. From among the many social outcomes of reactions to COVID-19, higher education enrollment patterns were greatly stressed as communities went into lockdown, students went home, and many students suspended their studies, moved on, and may not return to higher education. It will likely be a few years before postsecondary education enrollment patterns, throughput, and completions return to expected patterns.
- 16.
Similar hierarchical coding systems are used throughout the many departments, bureaus, agencies, offices, etc. associated with the United States federal government. The data maintained by these entities should always be considered for use, either as a primary source or as a proxy that at least provides guidance, if possible. Data scientists quickly learn about the use of data resources that are convenient and freely available if the available data meet needs.
- 17.
External resources related to biostatistics that allow access to data that are highly vetted, reliable, and valid are identified in a later lesson.
- 18.
Observe how Occupational Information Network (O*NET) codes use a different numbering sequence than the Classification of Instructional Programs (CIP) codes. Even so, there is structure (e.g., hierarchy) to O*NET codes, and with experience it is possible to learn more about the requirements for each specific job, regardless of how the job is coded by a local employer. Take the time to review, for at least a few selected jobs, highly detailed information on: (1) employment estimate and mean wage, (2) percentile wage estimates, (3) industries with the highest concentration of employment, (4) top paying industries, (5) states with the highest employment level, (6) states with the highest concentration of jobs and location quotients, (7) top paying states, (8) metropolitan areas with the highest employment level, (9) metropolitan areas with the highest concentration of jobs and location quotients, (10) top paying metropolitan areas, (11) nonmetropolitan areas with the highest concentration of jobs and location quotients, and (12) top paying nonmetropolitan areas.
- 19.
These behaviors and dispositions are organized in alphabetical order, to avoid any suggestion that there is a hierarchy of importance or sequence among them. Again, these behaviors and dispositions apply across the many jobs associated with data science and biostatistics. Use the Occupational Information Network for specific job-by-job details and work activities.
- 20.
Give special attention to the following six-digit CIP codes and how these programs of study crosswalk to O*NET-identified jobs: CIP 30.7001 Data Science, General; CIP 51.1201 Medicine; and CIP 51.2706 Medical Informatics.
- 21.
See the prior comment about the use of data from either before or at the earliest stages of the COVID-19 pandemic and the impact of mitigation practices such as lockdowns on data representation. For that reason, data from 2019 and early 2020 are used, but not data from later periods.
- 22.
R supports processes to work with data imputation through the use of functions from a few different packages, such as: Amelia, brms, mi, mice, VIM, mitml, etc.
- 23.
When using R, these special characters should be removed and replaced by NA, the symbolic representation of missing data. A simple and effective transformation process is demonstrated in Addendum 2.
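The replacement of placeholder characters by NA can be sketched in Base R as follows. This is only a minimal illustration and not the process shown in Addendum 2; the vector raw_steps and the placeholder characters "*" and "-" are hypothetical, chosen for the example.

```r
# Hypothetical raw values: "*" and "-" stand in for whatever special
# characters a data provider uses to flag missing observations.
raw_steps <- c("12", "*", "15", "-", "9")

# Replace the placeholders with NA, then convert the vector to numeric.
raw_steps[raw_steps %in% c("*", "-")] <- NA
steps <- as.numeric(raw_steps)

# Summary functions need na.rm = TRUE once NA values are present.
mean_steps <- mean(steps, na.rm = TRUE)
print(steps)
print(mean_steps)
```

Converting to numeric only after the placeholders are replaced avoids the coercion warning that as.numeric() would otherwise raise.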
- 24.
Challenge: Search on the names Jean Jennings Bartik, Frances Bilas, Ruth Licherman, Kathleen McNulty, Betty Snyder, and Marlyn Wescoff. What was their role regarding ENIAC?
- 25.
Search on the term Moore’s law, and see if it applies, then and now.
- 26.
For unexplained reasons, online videos of dancing cats were among the first entertainment experiences for many early Internet users and these funny videos helped entice the public to try out both entertainment and serious video resources that were now available to the public.
- 27.
Server farms, with thousands of computing devices in operation, consume vast amounts of electricity and generate extreme amounts of heat, requiring additional electricity for cooling. There is a growing movement in cloud computing to locate server farms at locations that are naturally cool throughout the year, if not cold, with electricity often generated by either geothermal or hydro technology. The excess heat generated by the many servers is often captured and ported to other buildings for heating purposes, adding to efficiencies of use.
- 28.
Challenge: To save space, the outcomes of the R-based syntax presented in this section are not always presented. However, either duplicate or copy and paste the syntax into an active R session and replicate findings. This self-directed activity is among the many paths used in this text for learning R and later the tidyverse ecosystem.
- 29.
Preemptively, it must be mentioned that an oddity of R is that the base::mode() function is not a measure of central tendency, unlike the base::mean() function or the stats::median() function; it instead reports the storage mode of an object (e.g., numeric or character). As a measure of central tendency, the mode (i.e., the most frequently occurring value in a set of values) is accommodated by functions from external R packages, such as the DescTools::Mode() function, among many other complementary functions that serve the same purpose.
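To make the distinction concrete, the following sketch contrasts base::mode() with a hand-built statistical mode that uses only Base R. The vector x is hypothetical, and the table-based approach is an assumption offered as one possible alternative to the DescTools::Mode() function, not a replacement for it.

```r
# Hypothetical values with one clearly most frequent value.
x <- c(2, 3, 3, 5, 3, 2)

# base::mode() describes how the object is stored, not its central tendency.
storage <- base::mode(x)
print(storage)

# A Base R sketch of the statistical mode: tabulate the values and keep
# the value (or values, if tied) with the highest count.
counts <- table(x)
stat_mode <- as.numeric(names(counts)[counts == max(counts)])
print(stat_mode)
```

If two or more values are tied for the highest count, stat_mode holds all of them.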
- 30.
Any further discussion would go beyond the purpose of this lesson, but recall that decimal notation, as used in this example, is not universal and an experienced data scientist may encounter the use of commas instead of decimal points to express what is otherwise called decimal notation. R can accommodate this practice (e.g., the use of commas instead of decimal points), if needed.
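As one way to see how R can accommodate decimal commas, the sketch below reads a small semicolon-separated text block with the dec argument of the utils::read.table() function. The data and the column names id and weight are hypothetical.

```r
# Hypothetical semicolon-separated data that use a comma as the decimal mark.
raw_text <- "id;weight\n1;70,5\n2;82,25"

# The dec argument tells read.table() which character marks the decimal.
df <- utils::read.table(text = raw_text, header = TRUE, sep = ";", dec = ",")

# The weight column is imported as numeric, not as character.
print(df)
print(class(df$weight))
```

The utils::read.csv2() function applies the same separator and decimal conventions by default.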
- 31.
As a good programming practice (gpp), for when there is no compelling reason to do otherwise, it may be best to have codes organized so that the terms they represent show in alphabetical order, thus 1 for Female and 2 for Male, since this naming scheme follows alphabetical ordering. Of course, there are times when an ordinal approach may be best for the assignment of integer factor-type codes, such as: 1 (Small), 2 (Medium), and 3 (Large), which is decidedly not an ascending alphabetical ordering but is instead an ordering by size. These issues are best communicated in a code book.
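The two coding schemes described in this note might be expressed with the base::factor() function, as in the sketch below. The vectors gender_code and size_code are hypothetical.

```r
# Nominal codes in alphabetical order: 1 for Female and 2 for Male.
gender_code <- c(2, 1, 1, 2)
gender <- factor(gender_code, levels = c(1, 2), labels = c("Female", "Male"))

# Ordinal codes ordered by size, not by alphabet.
size_code <- c(3, 1, 2)
size <- factor(size_code, levels = 1:3,
               labels = c("Small", "Medium", "Large"), ordered = TRUE)

print(gender)
print(size)

# Ordered factors support comparisons that follow the declared order.
print(size[1] > size[2])
```

Writing the levels and labels out explicitly also serves as a short, visible code book within the syntax itself.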
- 32.
Social Security Numbers are now rarely used as a requested number for identification purposes, due to public concern about this practice and justified personal identity and security concerns.
- 33.
The origin date for R (January 1, 1970) borrows from what is commonly referred to as the arbitrary UNIX epoch time and date of 00:00:00 UTC (Coordinated Universal Time) on Thursday, January 1, 1970.
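A brief sketch shows how the origin date appears in practice: a Date object is stored as the number of days since January 1, 1970. The specific dates used here are chosen only for illustration.

```r
# The origin date itself is stored as day 0.
origin_days <- as.numeric(as.Date("1970-01-01"))
print(origin_days)

# Ten days after the origin is stored as 10.
ten_days <- as.numeric(as.Date("1970-01-11"))
print(ten_days)

# Going the other way: 365 days after the origin (1970 was not a leap year).
one_year_on <- as.Date(365, origin = "1970-01-01")
print(one_year_on)
```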
- 34.
For calculations going back over extreme lengths of time, it may be helpful to know that the Gregorian calendar is used for UNIX epoch time, as opposed to use of the Julian calendar.
- 35.
As an independent activity, key help(as.POSIXct) at the R prompt to learn more about this calculation, which with careful thought is not as complex as it may first appear.
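As a companion to the help page, the sketch below shows the core idea behind the base::as.POSIXct() function: a date-time is stored as the number of seconds since the UNIX epoch. The time zone is set to UTC explicitly, an assumption made here so that the results do not depend on the local time zone of the computer.

```r
# One day after the epoch is 24 * 60 * 60 = 86400 seconds.
t0 <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
seconds_since_epoch <- as.numeric(t0)
print(seconds_since_epoch)

# Going the other way: 3600 seconds after the epoch is 01:00:00 UTC.
t1 <- as.POSIXct(3600, origin = "1970-01-01", tz = "UTC")
print(format(t1, "%Y-%m-%d %H:%M:%S"))
```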
- 36.
Using R, the symbol NA equates to the term Not Available. The symbol NaN equates to the term Not a Number.
- 37.
For those who wish to explore this topic in more detail, use available resources, such as keying help(NameOfFunction) at the R prompt, to learn more about the following terms, as used in R: NULL, NA, NaN, and the infinite values Inf and -Inf (positive and negative infinity).
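The differences among these terms can be explored directly at the R prompt. The short sketch below uses only Base R functions; the object names are hypothetical.

```r
# NaN results from an undefined numeric operation; NA marks a missing value.
not_a_number <- 0 / 0
print(is.nan(not_a_number))   # TRUE
print(is.na(not_a_number))    # TRUE: NaN is also treated as missing
print(is.nan(NA))             # FALSE: NA is missing, but it is not NaN

# Division of a nonzero number by zero gives positive or negative infinity.
print(1 / 0)
print(-1 / 0)

# is.finite() is FALSE for Inf, NA, and NaN.
finite_check <- is.finite(c(1, Inf, NA, NaN))
print(finite_check)

# NULL is the empty object and has length zero.
print(length(NULL))
print(is.null(NULL))
```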
- 38.
Think of the expression measure twice and cut once before deploying any action that eliminates data from a dataset.
- 39.
When using R, an array is an object that can store data in more than two dimensions. As seen later, a matrix is a two-dimensional array.
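The relationship between a matrix and an array can be checked with a few lines of Base R. The objects m and a below are hypothetical and hold only sequential integers.

```r
# A matrix has exactly two dimensions: 2 rows by 3 columns here.
m <- matrix(1:6, nrow = 2, ncol = 3)

# An array can have more: 2 rows by 3 columns by 4 layers here.
a <- array(1:24, dim = c(2, 3, 4))

print(dim(m))
print(dim(a))

# A matrix is an array, but a three-dimensional array is not a matrix.
print(is.array(m))
print(is.matrix(a))

# Elements are addressed with one index per dimension.
last_value <- a[2, 3, 4]
print(last_value)
```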
- 40.
As a brief explanation of the value of using Integrated Postsecondary Education Data System (IPEDS) data on completers, it should be mentioned that six-digit CIP data (e.g., the highly granular data that are specific to individual programs of study) are available for nearly all completers, but that is not the case for the availability of six-digit CIP (Classification of Instructional Programs) fall term enrollment data from the IPEDS Peer Analysis System. The rationale for that decision is that postsecondary students frequently change their academic major program of study before completion, such that six-digit CIP enrollment data would be an inconsistent and misleading false friend. The decision to exclude fall term enrollment data also considers that many students transfer from one postsecondary institution to another, again confounding the efficacy of using enrollment as an appropriate metric. In contrast, completion of a program of study from a specific institution and the awarding of either a certificate or degree by that institution is a recorded final event that results in a fixed datum that will not change and therefore serves as a valid measure of interest for individual programs of study.
- 41.
The syntax in this addendum provides an advance organizer to the use of data science and R. For now, give attention to process. More specific exposure to the use of R syntax in support of data science is provided in later lessons.
- 42.
The original column header names are certainly not tidy, but look at how the base::colnames() function is used in the syntax to put the final column names into a tidier format that is equally easy to read and understand.
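A minimal sketch of this renaming step follows. The data frame and both sets of column names are hypothetical and are not the names used in the addendum; they only illustrate the general pattern of assigning new names with the base::colnames() function.

```r
# Hypothetical import with untidy headers (spaces, mixed case, punctuation).
df <- data.frame(
  "Inst Name" = c("College A", "College B"),
  "CIP 51.2706 Completers" = c(10, 12),
  check.names = FALSE
)
print(colnames(df))

# Assign shorter, consistent names that are still easy to read.
colnames(df) <- c("inst_name", "cip_512706_completers")
print(colnames(df))
```

The new names are assigned by position, so the replacement vector must list one name for every column, in column order.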
- 43.
The utils::read.table() function, associated with Base R, the set of packages and functions available when R is first downloaded, is a common tool for importing .csv files into an active R session. One frequently used feature associated with the utils::read.table() function is the stringsAsFactors argument. By using this argument, it is possible to set character data (e.g., Female or Male, Fail or Pass, etc.) as factors during the data import process. The tidyverse ecosystem, and specifically the readxl package, takes a totally different approach to this task. When the readxl package is used to import data, using either the readxl::read_excel(), readxl::read_xls(), or readxl::read_xlsx() function, character data are retained as characters after they are imported and they are not put into factor format during the data import process. If there is a desire to organize character data in factor format, which is common, that action must come later, after the data are imported and put into a tibble. The advantage of this approach is that when using tools in the tidyverse ecosystem, data must be organized early on and the syntax for this set of actions will be very visible as syntax of its own, minimizing a possible misuse of the data because of a simple action that was somehow earlier forgotten.
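The effect of the stringsAsFactors argument can be seen in the sketch below, which reads the same small block of comma-separated text twice. The data and the column names id, gender, and result are hypothetical. Since R 4.0.0 the default for this argument is FALSE, so it is set explicitly in both calls.

```r
# Hypothetical comma-separated data with two character columns.
raw_text <- "id,gender,result\n1,Female,Pass\n2,Male,Fail"

# Character data kept as character during import.
as_chr <- utils::read.table(text = raw_text, header = TRUE, sep = ",",
                            stringsAsFactors = FALSE)

# Character data converted to factors during import.
as_fct <- utils::read.table(text = raw_text, header = TRUE, sep = ",",
                            stringsAsFactors = TRUE)

print(class(as_chr$gender))
print(class(as_fct$gender))

# The explicit, later conversion: a visible step of its own in the syntax.
as_chr$gender <- factor(as_chr$gender)
print(levels(as_chr$gender))
```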
- 44.
Ideally, in the near future the active R session will be based on use of the cloud, but for now a portable drive (a physical portable F drive in this case) meets current needs.
- 45.
After this syntax was prepared, the core tidyverse was updated to tidyverse 2.0.0. The lubridate package was added as part of the new core tidyverse package of packages. More detail is provided in a later lesson.
- 46.
Ideally the IPEDS Peer Analysis System would support function specific R-based API (Application Programming Interface) data retrieval. R syntax would be created in such a way that a function would be invoked from an active R session and by using this function desired data would be returned, eliminating actual interaction with an interface at the originating data source. Unfortunately, the IPEDS data resource does not yet support this type of API data retrieval. Those who work in data science must be able to react to multiple data acquisition processes, not only those that are ideal.
- 47.
Search on the early 1300s writings of William of Ockham, who is generally credited with formulating Occam’s razor, an approach that advocates simple solutions to problems whenever possible.
- 48.
The tidyr::gather() function is superseded. It is still included among the many functions available in the tidyr package, but there are no current efforts to improve upon its functionality. Most importantly, existing syntax from prior projects that uses the tidyr::gather() function still works, but new projects tend to use the tidyr::pivot_longer() function, given its support and improved functionality.
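For readers who have not yet seen the tidyr::pivot_longer() function, the sketch below reshapes a small wide-format data frame into long format. The data frame, its values, and the column names cip, y2018, y2019, year, and completers are all hypothetical. The call is guarded so that the syntax still runs when the tidyr package is not installed.

```r
# Hypothetical wide-format data: one column per academic year.
wide <- data.frame(
  cip   = c("30.7001", "51.2706"),
  y2018 = c(100, 40),
  y2019 = c(150, 55)
)

if (requireNamespace("tidyr", quietly = TRUE)) {
  # One row per CIP code and year, with the counts in a single column.
  long <- tidyr::pivot_longer(
    wide,
    cols      = c(y2018, y2019),
    names_to  = "year",
    values_to = "completers"
  )
  print(long)
} else {
  message("The tidyr package is not installed; install it to run this example.")
}
```

The two-row, three-column wide data frame becomes a four-row, three-column long data frame, with one observation per row.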
- 49.
The best way to learn R syntax is to read R syntax prepared by others, write R syntax, make corrections, read documentation and then experiment with multiple packages and functions, etc. Practice – practice – practice!
- 50.
OCC-coded jobs refer to primary occupation. OCC codes are used by federal agencies and provide some degree of command and control on employment trends.
- 51.
Although it is common to see the mean as a measure of central tendency when average salary information is reported, it is argued that the median may be a more appropriate statistic, since a few very high salaries can pull the mean upward. Yet, both measures of central tendency (i.e., mean and median) are commonly presented and used.
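The difference between the two statistics is easy to demonstrate with a small, deliberately skewed example. The salary values below are hypothetical and are not drawn from any Bureau of Labor Statistics resource.

```r
# Hypothetical salaries: four typical values and one very high value.
salaries <- c(60000, 65000, 70000, 72000, 400000)

mean_salary   <- mean(salaries)
median_salary <- stats::median(salaries)

print(mean_salary)     # 133400, pulled upward by the single high salary
print(median_salary)   # 70000, the middle value
```

Here the mean is nearly double the median, even though four of the five salaries are at or below 72000.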
- 52.
This lesson is introductory. Give attention to the syntax, but more detail on its selection and use is presented in later lessons.
- 53.
The figure is presented in the front matter of this lesson.
- 54.
An entire lesson in this text is provided on APIs. Look at the API process, here, but use the later lesson for explicit detail on how APIs are used in data science.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
MacFarland, T.W. (2024). Emergence of Data Science as a Critical Discipline in Biostatistics. In: Introduction to Data Science in Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-031-46383-9_1
DOI: https://doi.org/10.1007/978-3-031-46383-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46382-2
Online ISBN: 978-3-031-46383-9
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)