Towards Data Science: Yangyong Zhu and Yun Xiong
PROCEEDINGS PAPER
Currently, a huge amount of data is being rapidly generated in cyberspace. Datanature (all data
in cyberspace) is forming as a result of this data explosion. Exploring the patterns and rules in
datanature is necessary but difficult. A new discipline, data science, is emerging. It provides a
novel research method (a data-intensive method) for the natural and social sciences and
goes beyond computer science in researching data. This paper presents the challenges posed
by data and discusses what differentiates data science from the established sciences, data
technologies, and big data. Our goal is to encourage data-related researchers to shift their
focus towards this new science.
1 Introduction
The data explosion is the rapid increase in the amount of data in cyberspace, which is bringing humanity into the
big data era. The meaning of data has evolved. Data are no longer limited to values of qualitative or
quantitative variables, the results of measurements, or scientific data generated within the context of scientific
observations and experiments. In addition to all of that, data now include everything found in cyberspace.
Datanature (all the data in cyberspace) forms and develops unconsciously (Zhu, Zhong, & Xiong, 2009; Zhu &
Xiong, 2009). There are increasing instances of data that have no references in the natural world, such as
computer viruses, online games, and junk data, all of which are generated in datanature. The information
generated in datanature has gradually surpassed the facts existing in the natural world and has come to
exhibit unique patterns.
Since the computer was invented, we have been constantly utilizing and dealing with data. The facts of the
natural world are mapped as data and stored in computers so that we can use them when needed. However,
the method of using data has changed from simple data access to big data analysis, especially in the realm
of science (e.g., life science). This brings new requirements and challenges for data technologies, which lead
to research on the data themselves, such as how to study life through DNA data. The goal of data utilization
is also changing. Data analysis not only aims to solve problems based in reality but also extends to analyzing
data in order to study the phenomena and rules of the data themselves (e.g., discovering the growth patterns
of data and predicting the scale of data in cyberspace ten years into the future). Providing natural and social
sciences with data technologies and methods and exploring datanature can and should lead the transition
towards this new science, data science. Whether you know it or not, whether you accept it or not, and whether
you are ready for it or not, data science is coming. If you have been engaging in data science research, you
may already have become a data scientist.
In this paper, we present the challenges posed by data and investigate why we need data science. We
also examine how data science differs from existing technologies and established sciences. Furthermore, we
discuss some key issues (e.g., fundamental theories, new methods, and research topics) that data science
will face when it becomes an academic discipline with data as its research objects. We also review the
progress being made in current data science research and in the data science community, and discuss a few perspectives and
challenges found on the agenda of data science. Finally, we illustrate how to transfer existing knowledge to
this new science.
demonstrating that we can research life through biological data. Biology research with data also solves some
new problems that traditional methods cannot handle.
On the other hand, more and more scientific research will be directly targeted at data in datanature,
instead of the facts in nature, which will in turn help people to understand data and to explore
nature and human behavior. Natural science takes substances in nature as research objects, and social
science takes human behaviors as research objects. However, data in cyberspace are gradually covering and
exceeding the facts in nature and human behavior because more and more data exist without references in
nature or human behavior. Consequently, data researchers tend to research data in cyberspace, i.e., to take
data as research objects, which differs from the approach of natural science and social science.
5 Conclusions
There is unanimous agreement that data science is different from existing technologies and established
sciences and will be a meaningful and promising research direction in the future. Data-related research
can and should lead the transition towards this new science, data science. Meanwhile, data researchers
should move into data science rather than developing individual or separate data analysis methods and
techniques on their own. We believe that data science will become a new kind of science, on the
same footing as the natural sciences and social sciences.
6 Acknowledgment
This work is supported in part by Shanghai Science and Technology Development Funds
(13dz2260200, 13511504300) and NSFC-61170096.
7 References
Cao, L. B. & Yu, P. S. (2009) Behavior Informatics: An Informatics Perspective for Behavior Studies. IEEE
Intelligent Informatics Bulletin 10(1), pp 6–11.
Cleveland, W. S. (2001) Data Science: An Action Plan for Expanding the Technical Areas of the Field of
Statistics. International Statistical Review 69(1), pp 21–26.
Dhar, V. (2013) Data Science and Prediction. Communications of the ACM 56, p 12.
EMC (2011) Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field. Retrieved from the
World Wide Web November 11, 2014:
http://www.emc.com/collateral/about/news/emc-data-science-study-wp.pdf
Hayashi, C. (1996) What is Data Science? Fundamental Concepts and a Heuristic Example. In Proceedings of
the 5th Conference of the International Federation of Classification Societies (IFCS’96).
Hey, T., Tansley, S., & Tolle, K. (2009) The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft
Research.
Iwata, S. C. (2008) Editor’s Note: Scientific ‘Agenda’ of Data Science. Data Science Journal 7, pp 54–56.
Liu, L., Zhang, H., Li, J. H., et al. (2009) Building a Community of Data Scientists: an Explorative Analysis.
Data Science Journal 8, p 24.
Loukides, M. (2010) What is Data Science? An O’Reilly Radar Report.
Naur, P. (1966) The Science of Datalogy. Communications of the ACM 9(7), p 485.
Smith, F. J. (2006) Data Science as an Academic Discipline. Data Science Journal 5, pp 163–164.
Zhu, Y. Y. & Xiong, Y. (2009) Dataology and Data Science (in Chinese with English abstract). Fudan University
Press.
Zhu, Y. Y. & Xiong, Y. (2011) Dataology and Data Science: Up to Now. Retrieved from the World Wide Web
November 16, 2014: http://www.paper.edu.cn/en_releasepaper/content/4432156
Zhu, Y. Y., Zhong, N., & Xiong, Y. (2009) Data Explosion, Data Nature and Dataology. In Proceedings of
International Conference on Brain Informatics (BI’09).
How to cite this article: Zhu, Y and Xiong, Y 2015 Towards Data Science. Data Science Journal, 14: 8, pp. 1–7, DOI:
http://dx.doi.org/10.5334/dsj-2015-008
Copyright: © 2015 The Author(s). This is an open-access article distributed under the terms of the Creative
Commons Attribution 3.0 Unported License (CC-BY 3.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original author and source are credited. See
http://creativecommons.org/licenses/by/3.0/.
Data Science Journal is a peer-reviewed open access journal published by Ubiquity Press.