Received: 27 August 2019 Accepted: 23 January 2020

DOI: 10.1111/jebm.12373

REVIEW

Brief introduction of medical database and data mining technology in big data era

Jin Yang1,2 Yuanjie Li3 Qingqing Liu1,2 Li Li1 Aozi Feng1 Tianyi Wang4,5
Shuai Zheng4 Anding Xu6 Jun Lyu1,2

1 Department of Clinical Research, The First Affiliated Hospital of Jinan University, Guangzhou, Guangdong, China
2 School of Public Health, Xi’an Jiaotong University Health Science Center, Xi’an, Shaanxi, China
3 Department of Human Anatomy, Histology and Embryology, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, Shaanxi, China
4 School of Public Health, Shaanxi University of Chinese Medicine, Xianyang, Shaanxi, China
5 Xianyang Central Hospital, Xianyang, Shaanxi, China
6 Department of Neurology, The First Affiliated Hospital of Jinan University, Guangzhou, Guangdong, China

Abstract

Data mining technology can search for potentially valuable knowledge in a large amount of data. The process is mainly divided into data preparation, data mining, and the expression and analysis of results. It is a mature information processing technology and applies database technology. Database technology is a software science that researches, manages, and applies databases. The data in the database are processed and analyzed by studying the underlying theory and implementation methods of the structure, storage, design, management, and application of the database. We have introduced several databases and data mining techniques to help a wide range of clinical researchers better understand and apply database technology.

KEYWORDS
big data, data mining, database, method, technology

Correspondence
Jun Lyu, Department of Clinical Research, The
First Affiliated Hospital of Jinan University,
Guangzhou 510632, Guangdong, China.
Email: lyujun2019@163.com
Anding Xu, Department of Neurology, The
First Affiliated Hospital of Jinan University,
Guangzhou 510632, Guangdong Province,
China.
Email: tlil@jnu.edu.cn

Funding information
This study was supported by the National Social
Science Foundation of China (No.16BGL183).

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

© 2020 The Authors. Journal of Evidence-Based Medicine published by Chinese Cochrane Center, West China Hospital of Sichuan University and John Wiley & Sons Australia, Ltd

J Evid Based Med. 2020;1–13. wileyonlinelibrary.com/journal/jebm

1 INTRODUCTION

In the era of the information explosion, the speed of information generation is increasing day by day, and the world’s information is being produced on a massive scale.1 In the past few years, “big data” has become one of the most-used terms in industry, finance, and healthcare.2,3 Most fields have begun to use big data to analyze and discover new value.4,5 The information brought by big data is also changing the ecosystem of medical education and medicine. The amount of data collected and stored digitally is growing exponentially.1,6 The medical industry produces a large amount of data every day and is an important area of big data application. In order to provide patients with the best services and care, medical institutions in many countries have proposed various
modes of medical information systems.7 Therefore, how to better develop and utilize the huge volume of medical big data has become a focus of attention, and promoting the research and application of medical big data has become a key factor in modern medical research.

Big data is an abstract concept. It is usually explained as data collections that are difficult to handle with existing database management tools and that are both massive and complex. Big data are frequently characterized by the five Vs: volume, velocity, variety, value, and veracity.8-10 Volume means “huge in volume”: with the massive generation and collection of data, the scale of data has become larger and larger and has gone beyond traditional storage and analysis techniques. Velocity means “speed,” that is, the timeliness of big data, which means that data collection and analysis must be carried out quickly and on time. Variety means “a wide range of data types,” including semistructured and unstructured data, such as audio, video, web pages, and text, as well as traditional structured data. Value is mainly reflected in a low value density combined with a high commercial value. Veracity emphasizes that meaningful data must be true and accurate. The key question when using big data is how to find value in a large, rapidly generated, and diverse data set.11,12 The computational analysis of integrated databases has become a basic method of medicine and molecular biology.13

Medical data are characterized by disease diversity, heterogeneity of treatment and outcome, and the complexity of collecting, processing, and interpreting data.14 With the development of medical informatization, a large amount of digital data has been produced in the process of medical service, health care, and health management, forming medical big data.15 Medical big data come from a variety of sources, such as administrative claims records, clinical registries, electronic health records, biometric data, patient-reported data, and more.16,17 Big data applications and data collection in healthcare systems have many kinds of value. For example, people with diabetes use mobile devices to communicate with each other, share information, or search for information, thus forming a large big data network.18 The US Department of Health and Human Services has issued a policy to increase the transparency of the US healthcare system, which enables the sharing of big data on many patients, physicians, and other medical-related information.19 Faced with a huge amount of electronic data of different types, new requirements have been put forward for the research and development of related electronic products so that they can adapt to complex and competitive big data and its logic.20,21 Massive electronic medical record data have also revealed new efficacy of existing drugs; for example, metformin, which is used to treat diabetes, may also be used in cancer treatment.22

Medical big data have several unique characteristics that differ from big data in other disciplines: they are often difficult to obtain10; they are usually collected according to protocols and are relatively structured23; professional knowledge may play a dominant role when analyzing data and interpreting results24; and they may be subject to time-dependent confounding.3 Medical data are large in scale, updated extremely fast, polymorphic, incomplete, and time sensitive.25 The construction of a big data platform will facilitate remote consultation, easy operation, low cost, and increased global cooperation to promote clinical practice, education, and scientific research, and will help the translational application of global precision medicine and the emergence of new health management models.26,27

2 MEDICAL PUBLIC DATABASE OVERVIEW

Today’s society produces massive amounts of data all the time. Database technology is a software science that researches, manages, and applies databases. The data in the database are processed and analyzed by studying the basic theory and implementation methods of the structure, storage, design, management, and application of the database. The main medical public databases are described in Table 1.

TABLE 1 Medical public database overview

Database     Subject area                  Patient source   Cost
SEER         Tumor                         USA              Partially free
MIMIC        Intensive care unit           USA              Free
CHNS         Health and nutrition          China            Partially free
HRS          Ageing health                 Global           Free
Dryad        Medicine, biology, ecology    Global           Free
UK Biobank   Biomedical                    UK               Free
BioLINCC     Blood and cardiovascular      USA              Free
GEPIA        Cancer genomics               USA              Free
TCGA         Cancer genomics               USA              Free
TARGET       Childhood cancer              USA              Free
eICU-CRD     Intensive care unit           USA              Free
GEO          Genomics data                 USA              Free
GBD          Burden of disease             Global           Free

Abbreviations: BioLINCC, biologic specimen and data repositories information coordinating center; CHNS, China health and nutrition survey; eICU-CRD, eICU collaborative research database; GBD, global burden of disease; GEO, gene expression omnibus; GEPIA, gene expression profiling interactive analysis; HRS, health and retirement study; MIMIC, medical information mart for intensive care; SEER, Surveillance, Epidemiology, and End Results; TARGET, therapeutically applicable research to generate effective treatments; TCGA, the cancer genome atlas.

2.1 Surveillance, epidemiology, and end results (SEER)

To reduce the cancer burden of the population, the National Cancer Institute (NCI) established the Surveillance, Epidemiology, and End Results (SEER) database for cancer patients in 1973.28 It is one of the most representative large tumor databases in North America and covers approximately 28% of the US population.29,30 For decades, SEER has collected information on the incidence, prevalence, and mortality of cancer patients in some US states and counties, along with other evidence-based medical data, providing valuable information on cancer for clinical medical staff.31 In particular, it provides a broad path for the study of malignant tumors and rare tumors. At the beginning of the establishment of SEER, there were only a few registration
stations in several regions. The number of registration stations has now expanded to 18.32 These registration stations operate using SEER*Stat software, and their data are submitted to the NCI for biennial statistics and aggregation; the cancer-related information of the covered population is then published to the United States and the world. The SEER database has a large sample size, high quality, and strong statistical power, and it can provide tumor researchers with data of high clinical reference value. Researchers can obtain part of the data by applying for an account. There are three ways to obtain data from the SEER database. The first is through the SEER*Stat software, which is the simplest and most widely used method. The second is to download the compressed files from the official SEER website, extract the binary data after decompression, and then use software such as R to convert them into a normal data format; this method requires the user to have some software knowledge. The last is to apply to the administrators for DVD discs and use SEER*Stat without high-speed internet support. Radiation therapy and chemotherapy variables have been removed from the public databases since the November 2016 data submission. These variables can be obtained after signing an additional data use agreement. The agreement describes the completeness of the radiotherapy and chemotherapy variables and the potential bias in the use of these data.

The SEER database is one of the most representative tumor databases in North America, and some of the data are free to the public.33 Although the SEER database has some shortcomings, such as not including the family history of cancer patients, genetic history, genes, disease recurrence, or adjuvant chemotherapy, it is still a good data source that provides high-quality clinical information for clinical researchers.34-36 It gives clinical researchers efficient, convenient, and clear access to data.

2.2 Medical information mart for intensive care (MIMIC)

Critical care medicine is a discipline that studies the characteristics and patterns of any injury or disease that drives the body toward death, and that treats severe diseases. The focus of this discipline is on the monitoring of critically ill patients and on organ support for dysfunctional or failing organs, so that patients can gain time for the cause to be removed while oxygen delivery and organ function are maintained. The intensive care unit (ICU) holds a special and important position in the hospital and undertakes the treatment of patients with serious diseases.37,38 Its level of diagnosis and treatment is also one of the important modern indicators of the level of a hospital. The era of big data provides an unprecedented opportunity for the study of critically ill patients. Strengthening basic and clinical research and making full use of big data and artificial intelligence are the development trend of future critical care medicine.

In order to promote critical care research, the MIMIC (Medical Information Mart for Intensive Care) database was jointly released by the Laboratory for Computational Physiology of the Massachusetts Institute of Technology, Beth Israel Deaconess Medical Center, and Philips Healthcare, with support from the National Institutes of Health. It contains the clinical diagnosis and treatment information of more than 40 000 real patients who stayed in the ICU of Beth Israel Deaconess Medical Center from 2001 to 2012.39 The database has a large sample size, comprehensive information, and long patient follow-up, and it can be used free of charge, providing a wealth of resources for the study of critical care.40,41 It addresses the problem that clinicians lack large amounts of systematic clinical diagnosis and treatment data for scientific research. The MIMIC database is constantly updated, and the latest release is MIMIC-III version 1.4 (release notes available from https://mimic.physionet.org/about/releasenotes/).42,43 Patient information for the database comes from two different intensive care information systems: the Philips CareVue Clinical Information System (https://mimic.physionet.org/mimicdata/carevue/) and the iMDsoft MetaVision ICU system (https://mimic.physionet.org/mimicdata/metavision/). From 2001 to 2008, the Philips CareVue system was used, and patients were tracked for a minimum of 4 years; from 2008 to 2012, the iMDsoft MetaVision ICU system was used, and patients were tracked for a minimum of 90 days.

Using the MIMIC database involves coding work, which is a challenge for clinicians. On the GitHub platform (https://github.com/MIT-lcp/mimiccode), there is an open-source code package for analyzing patient characteristics that can be downloaded and used free of charge by researchers around the world. When users find bugs or possible improvements, they can modify the code themselves and submit a pull request; once it is merged, the modified code is shared with the world, and other users can also use it for free. The MIMIC database has greatly supported research in the fields of critical care medicine, evidence-based medicine, clinical big data mining, and the analysis of data from medical monitoring equipment, and has achieved fruitful results.

The MIMIC database is open to the world and collects the actual medical treatment records of more than 40 000 patients in the ICU of Beth Israel Deaconess Medical Center over 12 years. The sample size is large and the information is comprehensive. GitHub hosts open-source code for researchers all over the world to use. In short, the MIMIC database provides excellent support for all aspects of clinical research.

2.3 China health and nutrition survey (CHNS)

The CHNS project, the China Resident Health and Nutrition Survey, is an open public platform (http://www.cpc.unc.edu/projects/china). The project is an international collaborative cohort study conducted by the Carolina Population Center at the University of North Carolina at Chapel Hill in conjunction with the Center for Nutrition and Health of the Chinese Center for Disease Control and Prevention.44 The study aims to explore how China’s socioeconomic transformation and family planning policies have affected the health and nutritional status of the population over the past 30 years. The research covers the status of and changes in community organizations and in family and individual economic, demographic, and social factors. The research team for this survey is an
international research team composed of researchers in the fields of nutrition, public health, economics, sociology, and demography. The project began in 1989, and survey research, data compilation, and data release were carried out in 1989, 1991, 1993, 1997, 2000, 2004, 2006, 2009, 2011, and 2015. The CHNS website updated the dataset content on 12 June 2018. The updated dataset covers longitudinally integrated data from the 10 surveys conducted from 1989 to 2015.45 The China Health and Nutrition Survey (CHNS) has shown shifts in nutrients, food items, and dietary patterns, and these dietary shifts are associated with education, income, urbanicity, and the macro food environment and policy.46-49

The survey used multistage stratified cluster random sampling to collect data from 15 provinces, autonomous regions, and municipalities in China’s eastern, central, and western regions.50,51 As of August 2018, a total of 220 community samples, 7200 family samples, and 30 000 resident samples were included. The survey data include community, household, and personal survey data. Personal and household survey data include basic demographics, health status, nutritional and dietary status, health indicators, and medical insurance.52

The family data and personal data in the CHNS are freely available to the public on the official CHNS website. The community data can be obtained by completing the community-level data use agreement online (Data Linkage Request Form). Access to the community data is paid, and the fee is 330 dollars. To make proper use of the CHNS database, researchers must first read the CHNS project documentation in detail and fully understand it. The CHNS database website provides a clear and detailed study description document. The questionnaires, database descriptions, ID variable names, and so on for each survey year are given in the CHNS database description file.

CHNS, a longitudinal cohort study based on international cooperation, began in 1989 and had conducted 10 surveys by 2015. The research covers data on the health and nutritional status of Chinese residents at the individual, family, and community levels. It has since provided more comprehensive data support for research on Chinese national health, nutrition, medicine, economics, society, and other areas.53 The official CHNS website covers the details of the research; it not only dynamically updates the number of studies but also provides links to related research, so that existing results based on the survey can be conveniently retrieved. The CHNS research data set can be downloaded from the official CHNS website, which is efficient and convenient.

2.4 Health and retirement study (HRS)

As an important measure of the level of international economic and social development, population ageing not only means an increase in the number of elderly people but also poses severe challenges to the economy and society.54 This has become a major social problem that cannot be ignored. There are many types of research on the health of the ageing population; data types are constantly being enriched, and data reserves are growing rapidly. It is difficult to conduct an effective and comprehensive statistical analysis through traditional data collection methods.

The Health and Retirement Study (HRS), supported by the National Institute on Aging (NIA U01AG009740) and the Social Security Administration, is a longitudinal panel survey conducted by the University of Michigan since 1992 that has established a representative large-sample database.55 A growing body of multidisciplinary data is collected through unique, in-depth interviews conducted every 2 years with participants over 50 years of age. The HRS database provides valuable information that researchers in fields such as social science and healthcare can use to address important issues related to the challenges and opportunities of ageing.56-58 International sister studies that evaluate global ageing and its similarities across countries have also emerged, including studies in Mexico, the United Kingdom, Europe, South Korea, Japan, Ireland, China, Indonesia, Costa Rica, and New Zealand, with further studies being developed in India, Brazil, Africa, Scotland, and Canada, together forming global big data on ageing and health.

The HRS database has a large sample size, is of high quality, and is complex. In order to make the data easier to study, the HRS data are classified into public data and sensitive/restricted data. Anyone can create an account on the HRS data download site to get public data, whereas restricted data and sensitive health data require a separate application. The HRS data products are organized into seven areas: biennial data products, longitudinal data, off-year studies, sensitive health data (requires additional registration), researcher contributions, RAND contributed data, and cognitive economics projects. Each subdataset file can be read with one of three statistical packages: SAS, SPSS, or Stata.

The HRS database is a resource on ageing in the United States that covers changes in health and in the economic environment. Most of the public data in this database are freely available after user registration. Its multidisciplinary data focus on income and wealth, health, cognition, use of health services, work and retirement, and family connections. Since 2006, data collection has expanded to include biomarkers and genetics as well as greater depth on psychology and social background. This database, which combines economic, health, and psychological information, provides unprecedented potential for researchers’ work.59,60 The HRS database can help researchers in all disciplines to obtain data more conveniently, efficiently, and clearly and thus improve work efficiency.

2.5 Dryad

With the advent of the era of big data, data reusability and data sharing policies are attracting global attention. The infrastructure and related regulations for data management and data sharing have developed rapidly in the past decade. Since 2003, the National Institutes of Health has required all large funded research projects to disclose their data. PLOS One, the world’s largest open-access journal, requires authors to submit their data to a public database platform when the article is published. The BMJ Publishing Group recommends that authors store data in the Dryad database while submitting the
manuscript.61-63 As a large and robust data sharing platform, Dryad is a model for realizing data circulation and improving data reuse.

The Dryad database was funded by the National Science Foundation and was established in September 2008 as a nonprofit membership organization. The Dryad database stores research data in the fields of medicine, biology, and ecology. It is open to the world, and its data can be downloaded free of charge and reused. Dryad was born out of an initiative of leading journals and scientific societies in the fields of biology and ecology, which encourage researchers who submit to journals to deposit their data in a professional database for storage and sharing (http://dryad2.lib.ncsu.edu/pages/organization). The Dryad database helps researchers realize that data can be archived for long periods of time and opened for free reuse. As of February 2018, more than 600 journals were working with the Dryad database, with more than 60 000 data files and 2.3 million downloads (http://dryad2.lib.ncsu.edu/).

More and more journals encourage researchers to publish research data. On the one hand, this encourages the reuse of scientific research data to generate new scientific discoveries; on the other hand, it promotes the transparency and openness of medical research. Researchers publish data on Dryad for sharing. When someone searches for data through Dryad, they will also find the articles published using the data, which helps to increase the reputation and academic influence of the researcher and the publisher. At the same time, Dryad assigns each data package a globally identifiable, permanent digital object identifier (DOI) that can be used to cite the data. Dryad performs the necessary checks on each submitted file, for example, whether the file can be opened, whether it contains a virus, whether there is a copyright restriction, and whether sensitive data are exposed. Dryad also checks the integrity and correctness of the metadata, for example, information about related publications, delayed dates for data publication, and index keywords. Once the article is published online, the data package is published publicly, unless the data provider chooses to postpone the release of the data. Because the title, abstract, authors, and other details of an article often change during the publishing process, Dryad confirms and updates this information based on the accepted or published article.

Compared to other public database platforms, the Dryad database is more efficient in data sharing because it works with many mainstream journals. By assigning DOIs, it makes data citable, increasing the utilization rate of scientific data while increasing the academic reputation of researchers and publishers. Dryad has a detailed management policy for data maintenance and disaster recovery, so data can be stored for a long time. The “zero threshold” for data use and the friendly interface also make the Dryad database more and more popular among researchers.

2.6 UK biobank

UK Biobank (http://www.ukbiobank.ac.uk) is the world’s largest database of biomedical samples and officially opened all data to global researchers on 30 April 2017. Between 2006 and 2010, UK Biobank recruited 500 000 volunteers aged 40-69 from across the United Kingdom to obtain baseline data, including family history, drug history, and health status.64 UK Biobank collected approximately 15 million biological samples of blood, urine, and saliva and performed genotyping and blood biochemical analysis on all participants.65,66 Moreover, the database will keep track of their health and medical record information for a long time. At the same time, the database collects all research results and provides them to other researchers. It aims to study the relationships of genetic factors, environmental factors, and lifestyle with major human diseases.66

UK Biobank started a new medical imaging data collection program in 2014, using magnetic resonance imaging (MRI) and X-ray technology to scan the brain, heart, and bones of more than 100 000 volunteers.67 The imaging analysis is performed to establish a database of scanned images of internal organs. This will also be the most significant health imaging study in the world to date. These vast amounts of data will help researchers analyze population differences in diseases such as cancer, heart disease, diabetes, arthritis, and Alzheimer’s disease, as well as their causes, and may even change scientists’ perceptions of such chronic and epidemic diseases.

UK Biobank’s application process sets high requirements for the research background, research purposes, and research motivation of researchers and research institutions, including the need to provide evidence of recently published academic results to ensure that research is conducted in good faith.

The most significant advantage of UK Biobank is that all recruited volunteers are registered with the UK National Health Service (NHS) and have agreed to the linkage of their medical records. This allows UK Biobank to track the health and disease status of all volunteers in detail through national medical data.68 Prospective cohort studies are important for the identification of disease risk factors and for the prevention and treatment of diseases. However, a cohort that is too small is detrimental to the study of rare diseases and of the complicated relationships between different risk factors and diseases. UK Biobank’s prospective design, large sample size, and continued integration with health records provide researchers with an excellent platform to address a variety of research questions.

The disadvantage of UK Biobank is that each sample provider must fill out a detailed baseline questionnaire, including name, gender, NHS number, disease information, and so on, which inevitably carries a risk of privacy leaks.69 At the same time, the registration and application process is complicated and cumbersome, and it takes a long time. It may be difficult for first-time applicants.

We believe that UK Biobank will provide more comprehensive research data and biological sample coverage in the future, providing global researchers with more efficient and convenient registration, application, and use services, as well as better information security.

2.7 Biologic specimen and data repositories information coordinating center (BioLINCC)

BioLINCC was established in 2008 by the National Heart, Lung, and Blood Institute (NHLBI). The Institute provides global leadership in
the prevention and treatment of heart, lung, and blood diseases and supports basic, translational, and clinical research in these areas. By establishing BioLINCC, NHLBI provides medical researchers with access to scientific data and biological samples, maximizing the utilization of the research resources that NHLBI develops and maintains. These resources are the NHLBI biological sample library, managed by the Blood Disease Resources Department since 1975, and the NHLBI database, managed by the Cardiovascular Science Research Center since 2000.70-72

The BioLINCC public website (https://biolincc.nhlbi.nih.gov/) was established in October 2009. The site provides clinical and epidemiological research data and biological samples from more than 110 research institutes collected by NHLBI. BioLINCC is actively engaged in data sharing and is well regarded by many medical science and technology workers. Each year, more than 100 research project leaders apply to BioLINCC for its clinical data. A study from the affiliated hospital of Yale University School of Medicine in 2015 showed that more than 90% of users were satisfied with the data shared by BioLINCC and considered the data suitable for conducting their clinical research. Half of the users had used the data to complete their research, and 67% of them had published articles, with more than 1000 articles in total.73

Data and biological samples stored in the BioLINCC database are provided free of charge, but the cost of shipping biological samples is borne by the investigator. Researchers are required to submit an application to BioLINCC for review in order to access the data or biological samples they are applying for. After the researcher applies for data or biological samples, the NHLBI staff review the application materials. For applications for data resources, NHLBI mainly reviews whether the requested data match the research plan and whether the ethics committee’s statement on the research plan shows that ethical review has been passed or exempted. BioLINCC sends an email reminder on 1 March every year to submit the study progress report; the researcher can also submit a progress report on his or her application page at any time after the application is successful. The published article will be displayed on the research project page, where the resource is

2.8 Gene expression profiling interactive analysis (GEPIA)

The use of big data analysis has facilitated the development of cancer genomics research. In essence, cancer is a genetic disease caused by differential gene expression within the cell. With the establishment and opening of many public databases, more and more researchers can access sequencing data. GEPIA (Gene Expression Profiling Interactive Analysis) is a newly developed web server for cancer and normal gene expression profiling and interactive analysis, filling gaps in cancer genomics big data information and helping clinical researchers use public data resources more efficiently.

GEPIA (http://gepia.cancer-pku.cn/index.html) was developed by Professor Zhang Zemin from Peking University. The RNA-Seq data set used by GEPIA is based on the UCSC Xena project (http://xena.ucsc.edu). The project used standard pipelines to compute and analyze RNA sequencing expression data from 9736 tumors and 8587 normal samples from the TCGA and GTEx projects. TCGA produced 9736 tumor samples in 33 cancer types but provided only 726 normal samples. The imbalance between tumor and normal data can lead to inefficiencies in various differential analyses, so GEPIA also integrates data from GTEx. The GTEx project produced RNA sequencing data for more than 8000 normal samples. The UCSC Xena project recalculated the TCGA and GTEx raw RNA-Seq data under the same standard pipeline, which made the two datasets compatible and directly comparable. Therefore, TCGA and GTEx data can be integrated for very comprehensive expression analysis. GEPIA uses MySQL to create its databases. The main analysis is done in R and Perl. The web-based interactive interface, built with PHP, provides the key interactive analyses of GEPIA, including tumor/normal differential expression profiling and mapping based on tumor type or pathological stage. Analysis modules such as patient survival analysis, similar gene detection, correlation analysis, and dimensionality reduction analysis,
as well as rapid customization are used.
located.
GEPIA is a public database developed by the Chinese. Using the
BioLINCC is a collector of high-quality medical research data
GEPIA database, laboratory biologists can easily explore TCGA and
and biological samples. It is the disseminator of advanced medical
GTEx data sets, find answers to questions, and test their hypotheses. In
research concepts and research methods and is a practitioner who
the differential analysis and expression profiling, users can easily dis-
actively promotes global medical data sharing. Through the use of
cover genes that are differentially expressed. With the application of
BioLINCC research resources, more and more research results are
genetic testing, the model of tumor prognosis assessment and treat-
continually being produced. The disadvantage of BioLINCC is that
ment options for immunohistochemistry-based tumors has been grad-
each resource shared by BioLINCC needs to be applied separately.
ually changed, and the more accurate classification of tumors has more
For applicants who want to apply for multiple research resources,
important guiding significance for prognosis evaluation and treatment.
the application process is complicated; when searching for biolog-
ical samples, BioLINCC needs to provide the name of the biologi-
cal sample for research purposes. The search method is not efficient
2.9 The cancer genome atlas (TCGA)
enough for unidentified researchers. In the future, BioLINCC will also
expand the field of data sharing, provide a more convenient resource For a long time, tumor prevention, early screening, individualized treat-
application process, collect and maintain data and specimens in a ment, and prognosis evaluation have always been the key issues that
“high-efficiency-low cost” way, and maximize the utilization of existing the medical community is committed to. As a result, the magnitude
resources. of cancer is increasing substantially, with over 20 million new cancer

cases projected for 2025 compared to an estimated 14.1 million new cases in 2012.74 Studies have found that genetic variation is an important molecular cause of all tumors. Therefore, more and more oncology researchers have begun to conduct research from the perspective of molecular genetics. By measuring the expression of specific genes, it is possible to predict tumor growth, spread, and patient survival, and to develop a targeted diagnosis and treatment plan based on gene expression.75 Whole-genome sequencing and the development of bioinformatics provide new clues for cancer genome research.76
TCGA is a publicly funded project launched by NCI in 2006. It has published phased results since 2008.77 In 2009, a further US$275 million was invested to expand the range of cancer data. By 2014, the analysis had been extended to 33 cancer types (including 10 rare tumors), with more than 11 000 tumor samples and a data volume of up to 255 TB, including clinical data, DNA, RNA, protein, and other multilevel data. In terms of data generation, the project has been an undisputed success. The goal of TCGA is to integrate multidimensional omics data through large-scale, high-throughput genome sequencing and gene chip technology in order to study, define, discover, and analyze all human tumor genome changes, and finally to draw a genome-wide, multidimensional cancer genome map.78 TCGA provides oncology researchers with a large amount of genomic data and related clinical data, a broad basis for finding small mutations in cancer-related genes and for studying the biological mechanisms of tumors, thereby improving the scientific understanding of cancer at the molecular level and the ability to prevent, diagnose, and treat it. For example, some researchers use gene expression data and patient survival data to explore the association between the two and then predict patient survival.79-81 TCGA includes genomic, proteomic, transcriptomic, epigenomic, and clinical data.82 These data are supported and maintained by multiple organizations and units.
TCGA has opened up an era of tumor molecular biology and precision medicine. It gives researchers new opportunities to study the development of cancer and allows us to view cancer from an unprecedented microscopic perspective, so that we can approach its overall picture step by step. TCGA data have been used to discover new mutations, identify intrinsic tumor types, determine pan-cancer similarities and differences, and collect evidence of tumor evolution. More and more bioinformatics tools for the TCGA database have been developed.

2.10 Therapeutically applicable research to generate effective treatments (TARGET)
In recent years, with the continuous progress of medicine, the overall prognosis of childhood cancer has greatly improved, but childhood malignant tumors are still the main cause of childhood death. The TARGET (Therapeutically Applicable Research to Generate Effective Treatments) database uses a multiomics approach to determine the molecular changes that drive the development and progression of childhood cancer. The TARGET database targets childhood tumors, and its major disease items include acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), kidney tumors (KT), model systems (MDLS), neuroblastoma (NBL), and osteosarcoma (OS). TARGET characterizes the genome, transcriptome, and epigenome of specific childhood cancers through sequencing and chip technology. A multiomics approach is used to generate a comprehensive map of molecular alterations for each type of cancer (alterations are changes in DNA or RNA, such as rearrangements of chromosome structure or changes in gene expression). Computational analysis and validation of biological function are used to determine which alterations disrupt the functional pathways of genes and promote cancer growth, progression, and survival, thereby identifying candidate therapeutic targets and prognostic markers among the cancer-related alterations. The TARGET program originated from two pilot projects: ALL and NBL. To date, TARGET consists of five projects: ALL, AML, KT, NBL, and OS. It is managed by NCI’s Office of Cancer Genomics and Cancer Therapy Evaluation Program. ALL is one of the major types of childhood leukemia. The ALL project characterizes the comprehensive molecular features of the disease to identify genetic changes involved in the initiation and progression of childhood cancers that are difficult to treat. AML is a cancer derived from immature white blood cells in the bone marrow, or myeloblasts. About 25% of children with leukemia have AML. Through comprehensive genome-wide characterization, researchers can develop more targeted treatments based on the genetic and epigenetic changes found, to improve the prognosis of children with AML. KT refers to kidney tumors, which account for about 7% of childhood cancers; the vast majority are nephroblastomas. Through genome-wide research, it is hoped that key molecules of these tumors will be discovered so that better treatments can be developed to improve the prognosis of patients. NBL refers to neuroblastoma, the most common extracranial solid tumor in children and one of the most important diseases affecting children’s physical and mental health. The NBL project characterizes comprehensive molecular features to identify genetic changes involved in the initiation and progression of cancer in high-risk or difficult-to-treat children. OS refers to osteosarcoma. TARGET generates genomic data for selected pediatric cancers and makes them available so that therapeutic targets for childhood cancer can be discovered and the findings translated into clinical applications.
The TARGET database focuses on childhood tumors; although it covers fewer types of disease, it is more targeted. To a certain extent, the database can help researchers conduct more in-depth disease research and lead to more precise treatment options.

2.11 eICU collaborative research database (eICU-CRD)
Critical care medicine is an inevitable trend and a prominent symbol of the development and progress of modern medicine; it is a product of medical science having developed to a fairly high level. Critical care medicine involves many difficult problems, including the application and management of noninvasive ventilation, the rational use of antibiotics, the implementation of nutritional assessment and nutritional support, the indications for analgesia and sedatives, and the scope of application of the ICU risk assessment model.83 Philips

Healthcare is a leading provider of ICU equipment and services and offers a teleICU service called the eICU program. After the eICU program is implemented, a large amount of data is collected and streamed for real-time monitoring by the remote ICU team. These data were archived by Philips and converted into a research database by the eICU Institute.84
The eICU Collaborative Research Database (eICU-CRD) is a large public database created by the Philips Group in collaboration with the Massachusetts Institute of Technology (MIT) Laboratory for Computational Physiology (LCP).85 The release of eICU-CRD builds on the successful establishment of MIMIC-III and expands the scope of research by providing data from multiple centers. The database consists of data from a number of ICUs in the United States. The current version is Version 2.0, released on 17 May 2018. The database covers routine data from more than 200 000 ICU patients in 2014 and 2015, with a wealth of high-quality clinical information including vital signs, care plan documentation, disease severity, diagnostic information, and treatment information. The free availability of the data will support many applications, including machine learning algorithms, decision support tools, and clinical research.
To obtain access to the eICU Collaborative Research Database, researchers must first apply for registration.86 The agreement stipulates that applicants do not share the data with others, do not attempt to reidentify any patient or institution, and abide by the principles of collaborative research.87 A GitHub repository stores the eICU Collaborative Research Database code, and the code for generating tables and descriptive statistics is available online (https://github.com/mit-lcp/eicu-code).
With the advent of health information networks, cost-effective systems are needed to reduce the time and effort spent recording health care data. Patients in the ICU are closely monitored throughout their hospital stay to detect changes in their condition, and such changes require medical workers to modify the treatment plan in time. The eICU Collaborative Research Database addresses the problem that medical workers rarely have the time and energy to collect a large amount of complete information, and it is freely open to medical workers all over the world.

2.12 Gene expression omnibus (GEO)
The GEO database is an international public repository of functional gene expression data created by NCBI. The database has powerful inclusion and storage capabilities that allow users or researchers to submit, save, and retrieve many different types of data. GEO provides a simple submission process and format, and its data come from researchers’ submissions. GEO data submission follows the MIAME principles. The GEO database not only provides researchers with a wealth of disease-related gene expression profiles, but also provides tools for querying and downloading experiments and gene expression data, so that users can find and download the studies and gene expression profiles that interest them. The GEO database contains the raw data and the data sets or profiles generated from the raw data. GEO’s raw data are organized in three different entity databases: platform, sample, and series.
The search results for a GEO data set include its name, description, species, platform, submitter contact, series, publication time, numeric type, and number of samples. The search results for a GEO expression profile show, in graphical form, the expression level of a gene in all samples. The experimental conditions given in the search results make it easy to observe differences in the expression level of a gene under different conditions. Each data set outlines its research report and purpose and shows the number of platforms, samples, and series associated with it, from which researchers can select the content of interest and download the data.
GEO also offers the GEO2R online analysis tool. GEO2R is an interactive web tool for screening differentially expressed genes: users compare two or more groups of samples in a GEO series to identify genes that are differentially expressed under different experimental conditions, and the results are presented as a table of genes sorted by significance. GEO2R uses the GEOquery and limma R packages from the Bioconductor project to perform the comparison on the original processed data tables provided by the submitters. Unlike GEO’s other data set analysis tools, GEO2R does not rely on curated data sets but instead queries the original series matrix data files.
Developed and maintained by NCBI, GEO is one of the best-known comprehensive databases for storing and querying chip data from various chip technology platforms. The Gene Expression Omnibus (GEO) was created in 2000, and its last modified date is 26 July 2016. Researchers explore the potential biological value of the gene expression data provided by gene chips through in-depth mining and analysis, and apply it to research on gene analysis, gene expression and regulation, disease diagnosis, and drug screening. The mining and analysis of gene expression profile data help to understand the functions of genes, the interactions between genes, and their genetic characteristics. GEO follows the development trend of chip databases, reduces the cost of chip detection, shortens data reading time, uses resources efficiently and rationally, and integrates data from more researchers.

2.13 Global burden of disease (GBD)
People have always been concerned about diseases that endanger human health. Accurately grasping the burden of various diseases around the world is of great significance for understanding the harm and development of diseases, improving the efficiency of health services, and promoting the health of residents and social and economic development.88,89 In 1988, with the support of the World Health Organization (WHO) and the World Bank and with funding from the Bill and Melinda Gates Foundation of the United States, the Harvard School of Public Health began research on the GBD.90 Subsequently, the Institute for Health Metrics and Evaluation of the University of Washington established the GBD Research Group to study the GBD.91,92
The GBD is a comprehensive study of health loss. The GBD database contains all GBD diseases, risks, etiologies, injuries, natures of injury, and sequelae. Indicators that measure the GBD include: deaths, years of life lost (YLLs), years lived with disability (YLDs), disability-adjusted life years

(DALYs), prevalence, incidence, life expectancy, probability of death, healthy life expectancy (HALE), maternal mortality ratio (MMR), and summary exposure value (SEV). The metrics (units) of the extracted data include: number, rate, percentage, years, and probability of death. The extractable data cover the annual results of all measures from 1990 to 2017 for all GBD age groups and by sex: male, female, or both combined. The locations are divided into GBD super regions, regions, countries, selected subnational units, World Health Organization regions, World Bank income levels, etc. The data can be downloaded free of charge to help a wide range of clinical researchers.
Although data can be queried and downloaded from the GBD database, including many search parameters can cause problems. A query sometimes produces a file that omits certain results specified in the query, such as specific age groups or years, and querying all locations together with many or all causes, age groups, years, etc. returns incomplete data. The tool is not available for Internet Explorer 10 and earlier.

3 CLINICAL DATA MINING METHODS
With the advent of the information age, data mining is increasingly being used in clinical practice. With information technology, medical records and follow-up data can be stored and extracted more efficiently. At the same time, potential relationships or patterns can be sought in medical data to gain effective knowledge about the diagnosis and treatment of patients, increase the accuracy of disease prediction, detect disease at an early stage, and improve the cure rate. Unlike traditional research methods, data mining extracts information and discovers knowledge without explicit assumptions, that is, without prior research design. The information obtained should have three characteristics: previously unknown, effective, and practical. Data mining technology has not emerged to replace traditional statistical analysis but to extend statistical analysis methodology. Data mining methods can be divided into two categories: descriptive and predictive. Descriptive methods characterize the general nature of the data and include association analysis and cluster analysis. Predictive methods make inferences from current data and include classification and regression.

3.1 Description
3.1.1 Association analysis
Association analysis, also known as association mining, is the search for frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction data, relational data, or other information carriers. In other words, association analysis discovers connections between data in large amounts of data. Shopping basket analysis is a classic example of association analysis. It analyzes customers’ buying habits by discovering which products appear together in their shopping baskets. Knowing which items customers often purchase at the same time can help retailers develop marketing plans. This phenomenon is the correlation between goods in the store. Association analysis consists of two steps. The first step is to list all high-frequency item sets; the second is to generate association rules from them: for the high-frequency item sets obtained in the first step, a rule that satisfies the minimum confidence is an association rule. Machine learning methods for association analysis include the Apriori algorithm, the FP tree frequency set algorithm, and lift.

Apriori algorithm
The Apriori algorithm is based on the a priori principle, which reflects the relationship between subsets and supersets: all nonempty subsets of a frequent item set must be frequent, and all supersets of an infrequent item set must be infrequent. If item set I does not satisfy the minimum support threshold, then I is not frequent. The items that appear together in a shopping record form a combination; the combination is unordered within the record, and such an unordered combination is called a “pattern.” Some patterns occur with low frequency and some with high frequency. Patterns with higher frequency are generally considered more instructive, and such a high-frequency pattern is called a “frequent pattern.” The a priori property is therefore mainly used to prune candidates when searching for frequent item sets. In this way, the Apriori algorithm avoids blind search and improves the efficiency of the frequent item set search.

FP tree frequency set algorithm
The FP tree is constructed by reading in transactions one by one and mapping each transaction to a path in the FP tree. Since different transactions may share several items, their paths may partially overlap. The more the paths overlap, the better the compression achieved by the FP tree structure; if the FP tree is small enough to fit in memory, the frequent item sets can be extracted directly from the in-memory structure without repeatedly scanning the data stored on the hard disk. The main idea of the FP tree frequency set algorithm is to compress the frequent item sets in the database into a frequent pattern tree after the first scan, while retaining the association information, and then to mine the conditional pattern bases separately.

Lift
With either the Apriori algorithm or the FP tree frequency set algorithm, the rules generated may in some cases be useless even if both support and confidence are relatively high. Lift provides an additional indicator for evaluating the quality of rules. It compares how often the antecedent and the consequent of a rule occur together with how often they would be expected to co-occur if they were independent; in other words, it measures how much the occurrence of the antecedent raises the probability of the consequent.

3.1.2 Cluster analysis
A classification algorithm must know the information of each category in advance, and all the data to be classified have
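The two steps of association analysis described in Section 3.1.1 (finding frequent item sets, then keeping the rules that reach a minimum confidence) can be illustrated with a short Python sketch. It is a minimal, unoptimized illustration of an Apriori-style level-wise search and is not taken from the cited references. The function names, the example baskets, and the thresholds (minimum support 0.6, minimum confidence 0.8) are all assumptions chosen for illustration. Lift is computed as confidence divided by the support of the consequent.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    singles = {frozenset([item]) for b in baskets for item in b}
    current = {s for s in singles if support(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (a priori principle): all (k-1)-subsets must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / frequent[consequent]
                    rules.append((antecedent, consequent, sup, confidence, lift))
    return rules


# Hypothetical shopping baskets, invented for illustration only.
example_baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = frequent_itemsets(example_baskets, min_support=0.6)
for antecedent, consequent, sup, conf, lift in association_rules(frequent, 0.8):
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 suggests that the antecedent and consequent occur together more often than independence would predict, which is the situation the lift indicator is meant to detect. An FP tree implementation would replace the repeated support counting with a compressed tree but should return the same frequent item sets.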

corresponding categories. When these conditions are not met, cluster analysis is needed. Cluster analysis studies how to group similar things into one category. Clustering divides objects into groups or subsets so that the members of the same subset have similar properties. There are several kinds of clustering methods: partition-based, hierarchical, density-based, and grid-based algorithms.

Partition-based algorithm
The K-means method is the most commonly used and most basic clustering algorithm in cluster analysis. It is a prototype-based, partitional technique that relies on distance. Given the parameter K, the N objects are first roughly divided into K classes, and unreasonable assignments are then corrected iteratively according to an optimization criterion. The advantages of the K-means algorithm are that it is simple, fast, easy to understand, and has low time complexity. However, K-means handles high-dimensional data poorly and does not recognize nonspherical clusters.

Hierarchical clustering algorithm
The hierarchical clustering algorithm decomposes the data set hierarchically. It is divided into bottom-up agglomerative hierarchical clustering and top-down divisive hierarchical clustering. Commonly used hierarchical clustering methods include BIRCH, CURE, ROCK, Chameleon, and other algorithms. An agglomerative algorithm initially treats each point as a cluster and then merges clusters according to their closeness; proximity can be defined in different ways depending on the meaning of “close.” The merging ends when further merging would produce undesirable results.

Density-based algorithm
To find clusters of arbitrary shape, a cluster can be regarded as a dense region separated from other clusters by sparse regions in the data space; this is the core idea of density-based algorithms. Common methods include DBSCAN, OPTICS, and DENCLUE. DBSCAN is the most representative. It is a density-based method built on high-density connected regions: it connects core objects and their neighborhoods to form dense regions as clusters, and it is mainly used to deal with noise. The density of an object O can be measured by the number of objects close to O. The core of the algorithm is to find all core points, boundary points, and noise points. DBSCAN does not need the number of clusters as input and can handle clusters of various shapes. However, the time complexity of the algorithm is high, so it is not suited to high-dimensional data.

Grid-based algorithm
Partitioning and hierarchical clustering methods cannot find clusters of nonconvex shape. Density-based algorithms can effectively find clusters of arbitrary shape, but they generally have high time complexity. From 1996 to 2000, data mining researchers proposed a large number of grid-based clustering algorithms. The grid method can effectively reduce the computational complexity of the algorithm, although it is also sensitive to density parameters. Grid-based clustering uses a multiresolution grid data structure. Its advantage is that processing is extremely fast and depends only on the number of cells in each dimension of the quantized space. Common methods include STING, CLIQUE, and WaveCluster. STING divides the space into square cells at different resolutions; CLIQUE combines the ideas of grid and density clustering to perform subspace clustering of high-dimensional data; WaveCluster uses wavelet analysis to make the boundaries of clusters clearer.

3.2 Prediction
3.2.1 Regression analysis
Traditional regression is a widely used statistical analysis method that uses ordinary linear regression to determine the quantitative relationship between two or more variables. Its expression is y = w'x + e, where the error e follows a normal distribution with mean 0. According to the number of independent variables, regression analysis can be divided into simple linear regression and multiple linear regression. Simple linear regression contains only one independent variable and one dependent variable, and a straight line approximates the relationship between the two. If the regression includes two or more independent variables and the relationship between the dependent variable and the independent variables is linear, it is called multiple linear regression. In practice, a phenomenon is often associated with multiple factors, so regression analysis usually requires two or more independent variables. Predicting or estimating the dependent variable from the optimal combination of several independent variables is more effective and more realistic than using only one independent variable, so multiple linear regression is more practical than simple linear regression. Multiple linear regression analysis consists of three steps: first, the collected data are used to establish a regression equation; second, a hypothesis test is performed on the regression equation obtained; third, if the regression equation is significant, a hypothesis test is performed on the partial regression coefficient of each independent variable. After the variables whose partial regression coefficients are not significant have been eliminated, the multiple regression equation is re-established without them and the process is repeated. The basic principle is to fit the linear regression model by the least-squares method.
Most statistical models in traditional methods have specific requirements for the data, and the model itself has a mathematical form that can be clearly expressed. The quality of a model is mostly judged by tests derived from assumptions about the distribution of the data. However, in practice it is difficult to make any assumptions about the distribution of real-world data. At the same time, it is difficult to describe with a limited
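The K-means procedure described under the partition-based algorithm (divide the N objects into K classes, then correct the assignments iteratively) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation. The function name, the example points, and the choice of the first K points as initial centroids are assumptions made here for reproducibility; practical implementations usually choose the starting centroids more carefully.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain K-means on a list of equal-length numeric tuples."""
    # Assumption: the first k points serve as initial centroids (deterministic,
    # but naive; practical implementations choose starting points more carefully).
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object joins the class of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return labels, centroids


# Hypothetical two-dimensional observations forming two well-separated groups.
example_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
                  (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(example_points, k=2)
print(labels)
print(centroids)
```

Because every object is assigned to its nearest centroid by Euclidean distance, the resulting clusters tend to be roughly spherical, which is consistent with the limitation noted in the text for nonspherical clusters.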
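The first step of multiple linear regression analysis in Section 3.2.1, establishing the regression equation by least squares, can be sketched with the Python standard library alone. The sketch below solves the normal equations by Gaussian elimination and reports the coefficient of determination. The hypothesis tests of the second and third steps are not implemented here. The function names and the example data (generated exactly from y = 1 + 2*x1 + 3*x2) are assumptions chosen for illustration.

```python
def fit_ols(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix (collinear predictors?)")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    w = [0.0] * p
    for i in reversed(range(p)):
        w[i] = (A[i][p] - sum(A[i][j] * w[j] for j in range(i + 1, p))) / A[i][i]
    return w


def r_squared(X, y, w):
    """Share of the variance of y explained by the fitted equation."""
    pred = [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
example_X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
example_y = [9, 8, 19, 18, 26]
coefficients = fit_ols(example_X, example_y)
print([round(c, 6) for c in coefficients])
print(round(r_squared(example_X, example_y, coefficients), 6))
```

With real data the fit would not be exact, and the significance tests described in the text would then decide which independent variables remain in the equation.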

mathematical formula. Machine learning methods make no assumptions about the data, and their results are cross-validated. This form of verification judges whether a prediction model built on a given algorithm or program is effective, and cross-validation results are easily understood and accepted by most practitioners. Machine learning methods for regression models include decision trees, adaptive boosting, bagging, random forests, support vector machines, the nearest neighbor algorithm, and artificial neural networks.

3.2.2 Classification analysis

Classification is a supervised learning process whose goal is to "tag" the data so that valuable information can be extracted. The more accurate the categories are, the more valuable the results will be. The choice of method usually depends on the variables involved: logistic regression, probit regression, or classical discriminant analysis when the categorical independent variables have few levels and the dependent variable has two levels; discriminant analysis when the categorical independent variables have few levels and the dependent variable has more than two levels; and machine learning methods when the categorical independent variables have many levels. Building a classification model can help us better understand the data, but there are limitations: the categories must be known in advance, and every observation to be classified must have a corresponding category. When the dependent variable is categorical and the independent variables include multiple categorical variables, or a categorical variable has many levels, classical statistical methods are not applicable; machine learning methods are more practical for processing such complex data and give better accuracy.

4 PROSPECTS AND CHALLENGES OF MEDICAL DATA MINING

Using new cutting-edge disciplines to generate and analyze big data is a trend that has emerged in the transition from traditional medicine to precision medicine. The development of big data will help the global application of precision medicine and the emergence of new health management models.27 The potential of big data is still to be discovered. However, medical big data mining still faces enormous challenges, mainly the following: medical knowledge concepts are complex, and the key technologies for medical knowledge reasoning have not yet achieved a breakthrough; medical information comes from a wide range of sources, with many data modalities, high dimensionality, unbalanced types, and complicated structure; hospital electronic medical record systems are poor in openness and scalability; and the out-of-hospital process is poorly regulated. Although it is not easy to generate new findings and conclusions from massive data, we can foresee the convenience that big data analysis, visualization, and artificial intelligence will bring to medical care and daily life, as long as we make productive investments in appropriate systems and achieve key breakthroughs in technology and workforce.

5 CONCLUSIONS

This article briefly introduces the databases and data mining methods commonly used in the era of big data. With the advent of the information age, data mining is increasingly being used in clinical practice. Information technology allows medical records and follow-up data to be stored and extracted more efficiently. Data mining can also uncover potential relationships or patterns in medical data, yielding useful knowledge for the diagnosis and treatment of patients, and it can increase the accuracy of disease prediction, help detect disease at an early stage, and improve the cure rate. The databases we introduced are only a small sample; many others are worthy of research attention, such as the Catalogue of Somatic Mutations in Cancer (COSMIC), the Human Gene Mutation Database (HGMD), Oncomine, the cBioPortal for Cancer Genomics (cBioPortal), the Sequence Read Archive (SRA), the WHO Mortality Database, Orphanet, the Database of Genomic Variants (DGV), and Online Mendelian Inheritance in Man (OMIM). With the deepening of theoretical research and further practical exploration, medical data mining will play an influential role in the diagnosis and treatment of diseases, medical research and teaching, and hospital management.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

REFERENCES

1. Schlick CJR, Castle JP, Bentrem DJ. Utilizing big data in cancer care. Surg Oncol Clin N Am. 2018;27:641–652.
2. Trifiro G, Sultana J, Bate A. From big data to smart data for pharmacovigilance: the role of healthcare databases and other emerging sources. Drug Saf. 2018;41:143–149.
3. Binder H, Blettner M. Big data in medical science–a biostatistical view. Dtsch Arztebl Int. 2015;112:137–142.
4. Bahi M, Walmsley RS, Gray AR, et al. The risk of non-melanoma skin cancer in New Zealand in inflammatory bowel disease patients treated with thiopurines. J Gastroenterol Hepatol. 2018;33:1047–1052.
5. Jonathan E, Mayer RHG. Arsenic and skin cancer in the USA: the current evidence regarding arsenic-contaminated drinking water. J Dermatol. 2016;55:585–591.
6. Bayne LE. Big data in neonatal health care: big reach, big reward? Crit Care Nurs Clin North Am. 2018;30:481–497.
7. Ristevski B, Chen M. Big data analytics in medicine and healthcare. J Integr Bioinform. 2018;15:20170030.
8. Bellazzi R. Big data and biomedical informatics: a challenging opportunity. Yearb Med Inform. 2014;9:8–13.
9. Sinha A, Hripcsak G, Markatou M. Large datasets in biomedicine: a discussion of salient analytic issues. J Am Med Inform Assoc. 2009;16:759–767.
10. Scruggs SB, Watson K, Su AI, et al. Harnessing the heart of big data. Circ Res. 2015;116:1115–1119.
11. Chen M, Mao S, Liu Y. Big data: a survey. Mobile Net Appl. 2014;19:171–209.
12. IEEE Internet Computing. IEEE Internet Computing, 2012:1–6.
13. Bolouri H. Modeling genomic regulatory networks with big data. Trends Genet. 2014;30:182–191.
14. Dinov ID. Methodological challenges and analytic opportunities for modeling and interpreting Big Healthcare Data. Gigascience. 2016;5:12.
15. Lee CH, Yoon H-J. Medical big data: promise and challenges. Kidney Res Clin Pract. 2017;36:3–11.
16. Ossio R, Roldan-Marin R, Martinez-Said H, Adams DJ, Robles-Espinoza CD. Melanoma: a global perspective. Nat Rev Cancer. 2017;17:393–394.
17. Rumsfeld JS, Joynt KE, Maddox TM. Big data analytics to improve cardiovascular care: promise and challenges. Nat Rev Cardiol. 2016;13:350–359.
18. Fernandez-Luque L MY, Mayer MA, Hasvold PE, Joshi S. Panel: big data & social media for empowering patients with diabetes. Stud Health Technol Inform. 2016;225:607–609.
19. Feldman K, Chawla NV. Does medical school training relate to practice? Evidence from big data. Big Data. 2015;3:103–113.
20. Ellaway RH, Pusic MV, Galbraith RM, Cameron T. Developing the role of big data and analytics in health professional education. Med Teach. 2014;36:216–222.
21. O’Sullivan DE, Brenner DR, Demers PA, et al. Indoor tanning and skin cancer in Canada: a meta-analysis and attributable burden estimation. Cancer Epidemiol. 2019;59:1–7.
22. Xu H, Aldrich MC, Chen Q, et al. Validating drug repurposing signals using electronic health records: a case study of metformin associated with reduced cancer mortality. J Am Med Inform Assoc. 2015;22:179–191.
23. Wang W, Krishnan E. Big data and clinicians: a review on the state of the science. JMIR Med Inform. 2014;2:e1.
24. Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77:81–97.
25. United Nations Development Programme. Human Development Report 2015. New York: PBM Graphics. 2015;204.
26. Hsieh J-C, Li A-H, Yang C-C. Mobile, cloud, and big data computing: contributions, challenges, and new directions in telecardiology. Int J Environ Res Public Health. 2013;10:6131–6153.
27. Alyass A, Turcotte M, Meyre D. From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med Genomics. 2015;8:33.
28. Ma H, Sun H, Sun X. Survival improvement by decade of patients aged 0–14 years with acute lymphoblastic leukemia: a SEER analysis. Sci Rep. 2014;4:4227.
29. Taylor JS, He W, Harrison R, et al. Disparities in treatment and survival among elderly ovarian cancer patients. Gynecol Oncol. 2018;151:269–274.
30. Rauh-Hain JA, Melamed A, Schaps D, et al. Racial and ethnic disparities over time in the treatment and mortality of women with gynecological malignancies. Gynecol Oncol. 2018;149:4–11.
31. Gaitanidis A, Alevizakos M, Pitiakoudis M, Wiggins D. Trends in incidence and associated risk factors of suicide mortality among breast cancer patients. Psychooncology. 2018;27:1450–1456.
32. Moss HA, Havrilesky LJ, Chino J. Insurance coverage among women diagnosed with a gynecologic malignancy before and after implementation of the Affordable Care Act. Gynecol Oncol. 2017;146:457–464.
33. Yang J, Chen S, Li Y, et al. Incidence rate and risk factors for suicide death in patients with skin malignant melanoma: a Surveillance, Epidemiology, and End Results analysis. Melanoma Res. 2018.
34. Megwalu UC. Observation versus thyroidectomy for papillary thyroid microcarcinoma in the elderly. J Laryngol Otol. 2017;131:173–176.
35. Kooby DA, Gillespie TW, Liu Y, et al. Impact of adjuvant radiotherapy on survival after pancreatic cancer resection: an appraisal of data from the national cancer data base. Ann Surg Oncol. 2013;20:3634–3642.
36. Scosyrev E, Messing J, Noyes K, Veazie P, Messing E. Surveillance Epidemiology and End Results (SEER) program and population-based research in urologic oncology: an overview. Urol Oncol. 2012;30:126–132.
37. Ednell AK, Siljegren S, Engstrom A. The ICU patient diary-A nursing intervention that is complicated in its simplicity: a qualitative study. Intensive Crit Care Nurs. 2017;40:70–76.
38. Noome M, Beneken Genaamd Kolmer DM, van Leeuwen E, Dijkstra BM, Vloet LCM. The role of ICU nurses in the spiritual aspects of end-of-life care in the ICU: an explorative study. Scand J Caring Sci. 2017;31:569–578.
39. Saeed M, Villarroel M, Reisner AT, et al. Multiparameter Intelligent Monitoring in Intensive Care II: a public-access intensive care unit database. Crit Care Med. 2011;39:952–960.
40. Liu Q, Yang J, Zhang J, et al. Description of clinical characteristics of VAP patients in MIMIC database. Front Pharmacol. 2019;10:62.
41. Jiang X, Su Z, Wang Y, et al. Prognostic nomogram for acute pancreatitis patients: an analysis of publicly electronic healthcare records in intensive care unit. J Crit Care. 2019;50:213–220.
42. Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.
43. Zhang Z. Accessing critical care big data: a step by step approach. J Thorac Dis. 2015;7:238–242.
44. Popkin BM, Du S, Zhai F, Zhang B. Cohort Profile: the China Health and Nutrition Survey–monitoring and understanding socio-economic and health change in China, 1989–2011. Int J Epidemiol. 2010;39:1435–1440.
45. Zhang B, Zhai FY, Du SF, Popkin BM. The China Health and Nutrition Survey, 1989–2011. Obes Rev. 2014;15:2–7.
46. Popkin BM, Adair LS, Ng SW. Global nutrition transition and the pandemic of obesity in developing countries. Nutr Rev. 2012;70:3–21.
47. Popkin BM. Synthesis and implications: China’s nutrition transition in the context of changes across other low- and middle-income countries. Obes Rev. 2014;15:60–67.
48. Zhai FY, Du SF, Wang ZH, Zhang JG, Du WW, Popkin BM. Dynamics of the Chinese diet and the role of urbanicity, 1991–2011. Obes Rev. 2014;15:16–26.
49. Friel S, Hattersley L, Snowdon W, et al. Monitoring the impacts of trade agreements on food environments. Obes Rev. 2013;14:120–134.
50. Zhen S, Ma Y, Zhao Z, Yang X, Wen D. Dietary pattern is associated with obesity in Chinese children and adolescents: data from China Health and Nutrition Survey (CHNS). Nutr J. 2018;17:68.
51. Shi Z, Yuan B, Hu G, Dai Y, Zuo H, Holmboe-Ottesen G. Dietary pattern and weight change in a 5-year follow-up among Chinese adults: results from the Jiangsu Nutrition Study. Br J Nutr. 2011;105:1047–1054.
52. Li M, Shi Z. Dietary pattern during 1991–2011 and its association with cardio metabolic risks in Chinese adults: the China health and nutrition survey. Nutrients. 2017;9:1–13.
53. Barlow P, McKee M, Basu S, Stuckler D. The health impact of trade and investment agreements: a quantitative systematic review and network co-citation analysis. Global Health. 2017;13:13.
54. Wang C, Li F, Wang L, et al. The impact of population aging on medical expenses: a big data study based on the life table. Biosci Trends. 2018;11:619–631.
55. Fisher GG, Ryan LH. Overview of the health and retirement study and introduction to the special issue. Work Aging Retire. 2018;4:1–9.
56. Lewis NA, Brazeau H, Hill PL. Adjusting after stroke: changes in sense of purpose in life and the role of social support, relationship strain, and time. J Health Psychol. 2018:135910531877265.
57. Sonnega A, Faul JD, Ofstedal MB, Langa KM, Phillips JW, Weir DR. Cohort profile: the Health and Retirement Study (HRS). Int J Epidemiol. 2014;43:576–585.
58. Morin RT, Midlarsky E. Depressive symptoms and cognitive functioning among older adults with cancer. Aging Ment Health. 2018;22:1465–1470.
59. Byles JE, Vo K, Forder PM, et al. Gender, mental health, physical health and retirement: a prospective study of 21,608 Australians aged 55–69 years. Maturitas. 2016;87:40–48.
60. Assari S, Nikahd A, Malekahmadi MR, Lankarani MM, Zamanian H. Race by gender group differences in the protective effects of socioeconomic factors against sustained health problems across five domains. J Racial Ethn Health Disparities. 2016;4:884–894.
61. Dyke SO HT. Developing and implementing an institute-wide data sharing policy. Genome Med. 2011;28:60.
62. Piwowar HA, Day RS, Fridsma DB. Sharing detailed research data is associated with increased citation rate. PLoS One. 2007;2:e308.
63. Khan K, Weeks A. Dryad in the UK and USA—prospective and retrospective data publication. Toxicol Sci. 2016;153:225–227.
64. Ollier W ST, Peakman T. UK Biobank: from concept to reality. Pharmacogenomics. 2005;6:639–646.
65. Palmer LJ. UK Biobank: bank on it. The Lancet. 2007;369(9578):1980–1982.
66. Collins R. What makes UK Biobank special? The Lancet. 2012;379(9822):1173–1174.
67. Matthews PM, Sudlow C. The UK Biobank. Brain. 2015;138:3463–3465.
68. Littlejohns TJ, Sudlow C, Allen NE, Collins R. UK Biobank: opportunities for cardiovascular research. Eur Heart J. 2019;40:1158–1166.
69. Barbour V. UK Biobank: a project in search of a protocol? The Lancet. 2003;361(9370):1734–1738.
70. Giffen CA, Carroll LE, Adams JT, Brennan SP, Coady SA, Wagner EL. Providing contemporary access to historical biospecimen collections: development of the NHLBI biologic specimen and data repository information coordinating center (BioLINCC). Biopreserv Biobank. 2015;13:271–279.
71. Shea KE, Wagner EL, Marchesani L, Meagher K, Giffen C. Efficiently maintaining a national resource of historical and contemporary biological collections: the NHLBI biorepository model. Biopreserv Biobank. 2017;15:17–19.
72. Giffen CA, Wagner EL, Adams JT, et al. Providing researchers with online access to NHLBI biospecimen collections: the results of the first six years of the NHLBI BioLINCC program. PLoS One. 2017;12:e0178141.
73. Ross JS, Ritchie JD, Finn E, et al. Data sharing through an NIH central database repository: a cross-sectional survey of BioLINCC users. BMJ Open. 2016;6:e012769.
74. Fidler MM, Soerjomataram I, Bray F. A global view on cancer incidence and national levels of the human development index. Int J Cancer. 2016;139:2436–2446.
75. Olex AL, Turkett WH, Fetrow JS, Loeser RF. Integration of gene expression data with network-based analysis to identify signaling and metabolic pathways regulated during the development of osteoarthritis. Gene. 2014;542:38–45.
76. Garraway LA, Lander ES. Lessons from the cancer genome. Cell. 2013;153:17–37.
77. Network TC. Corrigendum: comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2013;494(7438):506.
78. Chin L, Andersen JN, Futreal PA. Cancer genomics: from discovery science to personalized medicine. Nat Med. 2011;17:297–303.
79. Rao S, Welsh L, Cunningham D, et al. Correlation of overall survival with gene expression profiles in a prospective study of resectable esophageal cancer. Clinical Colorectal Cancer. 2011;10:48–56.
80. Chen YC, Ke WC, Chiu HW. Risk classification of cancer survival using ANN with gene expression data from multiple laboratories. Comput Biol Med. 2014;48:1–7.
81. Sfakianos GP, Iversen ES, Whitaker R, et al. Validation of ovarian cancer gene expression signatures for survival and subtype in formalin fixed paraffin embedded tissues. Gynecologic Oncology. 2013;129:159–164.
82. Gao J, Ciriello G, Sander C, Schultz N. Collection, integration and analysis of cancer genomic profiles: from data to insight. Curr Opin Genet Dev. 2014;24:92–98.
83. Badawi O, Liu X, Hassan E, Amelung PJ, Swami S. Evaluation of ICU risk models adapted for use as continuous markers of severity of illness throughout the ICU stay. Crit Care Med. 2018;46:361–367.
84. McShea M, Holl R, Badawi O, Riker R, Silfen EA. Collaboration between industry, health-care providers, and academia. IEEE Eng Med Biol Mag. 2010;29:18–25.
85. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data. 2018;5:180178.
86. Johnson AEW, Pollard TJ, Celi LA, Mark RG. Analyzing the eICU Collaborative Research Database. 2017:631.
87. Braunschweiger P, Goodman KW. The CITI Program: an international online resource for education in human subjects protection and the responsible conduct of research. Acad Med. 2007;82:861–864.
88. Afshin A, Sur PJ, Fay KA, et al. Health effects of dietary risks in 195 countries, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. The Lancet. 2019;393:1958–1972.
89. GBD 2017 Disease and Injury Incidence and Prevalence Collaborators. Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. The Lancet. 2018;392(10519):1789–1858.
90. Liang J, Li X, Kang C, et al. Maternal mortality ratios in 2852 Chinese counties, 1996–2015, and achievement of Millennium Development Goal 5 in China: a subnational analysis of the Global Burden of Disease Study 2016. The Lancet. 2019;393:241–252.
91. The Lancet. GBD 2017: a fragile world. The Lancet. 2018;392:1683.
92. Global Burden of Disease Liver Cancer Collaboration, Akinyemiju T, Abera S, et al. The burden of primary liver cancer and underlying etiologies from 1990 to 2015 at the global, regional, and national level: results from the Global Burden of Disease Study 2015. JAMA Oncol. 2017;3:1683–1691.

How to cite this article: Yang J, Li Y, Liu Q, et al. Brief introduction of medical database and data mining technology in big data era. J Evid Based Med. 2020;1–13. https://doi.org/10.1111/jebm.12373
