
Motivation and Justification of Naturalistic Method for Bioinformatics Research

2014
Journal of Emerging Trends in Computing and Information Sciences, Vol. 5, No. 2, February 2014, ISSN 2079-8407. ©2009-2014 CIS Journal. All rights reserved. http://www.cisjournal.org

Motivation and Justification of Naturalistic Method for Bioinformatics Research

1 Nooruldeen Nasih Qader, 2 Hussein Keitan Al-Khafaji
1 Computer Science, University of Sulaimani, Sulaimani, Iraq
2 Computer Communication, Alrafidain University College, Baghdad, Iraq

ABSTRACT
This paper introduces and proposes the naturalistic method as a basis for Bioinformatics research. The naturalistic method emphasizes finding biodata properties by insight into the nature of real data, reflecting its de facto character and staying as far from theoretical Bioinformatics assumptions as possible. We present and justify motivating factors in this direction: studies that depend mainly on hypothesis models lead to the derivation of imperfect biological models; huge amounts of real data are available; and, furthermore, new technologies enable a sustainable flow of data. This method aims to find better ways of representing biological data and processes. This goal can be reached by finding biodata properties and characteristics. In turn, the discovered properties can be utilized to enhance different algorithms in Bioinformatics.

Keywords: Naturalistic, Bioinformatics, microarray, property, algorithm, data mining, motif, genome, gene, PWM, nucleotide, DNA, binding site.

1. INTRODUCTION
Research methodologies are continuously developing to incorporate new techniques and ideas. The appearance of the network and the web made it possible for the scientific community to share data produced by high-throughput techniques, thus providing massive, new, and free data to be investigated and analyzed. A set of data on its own is very hard to interpret. There is a lot of information contained in the data, but it is hard to see; ways of understanding important features of the data are necessary [1], [2].
To overcome the challenges faced in research, different disciplines continuously design new methods alongside ordinary research methods; pragmatism was a philosophical foundation for new methods of research [3], [4]. In this context, disciplines such as Bioinformatics, and more precisely data mining in Bioinformatics, come to the fore. These efforts lead to good progress, knowledge, and efficiency in medicine and Bioinformatics. In Bioinformatics, recent trends concentrate on the nature of biological data to make designs more efficient [5]. This results in an increase in the amount of information mined from the data. This study proposes and emphasizes a naturalistic and realistic trend as a basis for the Bioinformatics research method.

In this study we demonstrate the shortcomings and disadvantages of using theoretical assumptions in Bioinformatics, such as in motif representation and sequence generation. We also briefly introduce the naturalistic method. The aim of this work is to present motivation and justification factors for shifting Bioinformatics research to rely more on available data.

2. RELATED WORK
No single scientific method can be applied to all branches of science. Pragmatism and finding solutions to problems have made scientists use whatever means they can. In the following we present some ideas related to research methods:

2.1 Deduction Philosophy vs. Induction Philosophy
In the article "Is the Scientific Paper a Fraud?", Peter Medawar argued that induction, unlike deduction, has no place in scientific research. Medawar agrees with Karl Popper, a philosopher of science; Popper refused to accept induction as a legitimate form of inference in the process of scientific research [6]. The reason deduction generally enjoys a preferred philosophical standing is that if the axiom and the observation are correct, the logical inference must also be correct.
By contrast, induction is regarded as philosophically insecure because it is vulnerable to counter-examples [7].

2.2 Hypothesis-driven and Data-driven Methods
Popper and Medawar argued vehemently for a method of scientific practice based on the so-called hypothetico-deductive system, the essence of which is the formulation of a hypothesis derived from a collection of facts, testing the hypothesis by trying to 'falsify' it,
collecting more facts if 'falsification' fails, and repeating the falsification tests until either you and the hypothesis agree on a draw or one of you admits defeat [6]. Another direction tries to show that data- and technology-driven programs are not alternatives to hypothesis-led studies in scientific knowledge discovery but are complementary and iterative partners with them. Many fields are data-rich but hypothesis-poor. Here, computational methods of data analysis, which may be automated, provide the means of generating novel hypotheses, especially in the post-genomic era [8]. Another researcher proposes to extend the period of hypothesis generation through discovery-driven approaches, hoping to develop a comprehensive and interesting hypothesis. There is even a view that hypothesis-driven science is dead [9], [10].

2.3 Hypothesis-free Science
Einstein said, "If we knew what we were doing, it wouldn't be called research, would it?" Max Ferdinand Perutz likewise said, "In practice, scientific advances often originate from observation, made either by accident or design, without any hypothesis or paradigm in mind." Research can thus be looking around, observing, describing, and mapping undiscovered territory, not testing theories or models. The goal is to discover things we neither knew nor expected, and to see relationships and connections among the elements. This process is not driven by hypothesis and should be as model-independent as possible. A hypothesis-free process, when applied to data alone, is sufficient to produce a gain in understanding [7]. In the following, we present some examples of hypothesis-free science [8]:

a. The double helical structure of DNA is regarded as one of the three major pillars of modern biology.
Watson and Crick solved the structure of DNA without a specific hypothesis.

b. There are examples of novel discoveries that won Nobel prizes for their creators. In biological chemistry, Sanger developed methods for sequencing proteins and nucleic acids; Mullis invented the polymerase chain reaction; and soft-ionization mass spectrometry methods were developed in the same spirit.

c. Epidemiology holds a special place as a well-established science that is essentially data-driven, in which hypotheses are the results of the epidemiological study of interest and not its starting point.

In a similar vein, almost all kinds of data mining follow this pattern. A now-common strategy in post-genomic biology is to measure, quantitatively, the action of all (or as many as possible) of the genes at the level of the transcriptome, proteome, metabolome, and phenotype, and to use computerized methods to infer gene function. Such activities are seen as lacking in hypotheses.

3. ASSUMPTIONS AND DATA MINING IN BIOINFORMATICS
The explosion and exponential growth of biological data has resulted in urgent collaborative work to enable the understanding and analysis of such data. The aim is to exploit and utilize data better in daily life. Although massive efforts have been made, Bioinformatics is still in its infancy. Many factors make the challenges harder, including the huge amount of information carried by a genome, the lack of techniques to reveal useful knowledge from it, and the difficulty of validating results with biology laboratory tests. Challenges also arise from the multiple disciplines of Bioinformatics, because some of the disciplines have their own computational problems [11]. Data mining comes as the first technique for designing new methods and algorithms for knowledge extraction by finding patterns, classification, clustering, etc. The objective is to find characteristics and properties of the biosequences that make up a genome; to this end, numerous data structures and mappings have been used.
Recent research motivates investigating the structural properties of biological sequences to enhance algorithms in molecular biology [12], [13]. Therefore, we focus on the nature of biological data to formulate and develop a method of research in Bioinformatics: a method that is more efficient, following the new trends in Bioinformatics. The ENCyclopedia Of DNA Elements (ENCODE) is delving into how variation between people affects the activity of regulatory elements in the genome. "At some places there's going to be some sequence variation that means a Transcription Factor (TF) is not going to bind here the same way it binds over here," says Mark Gerstein, a computational biologist at Yale University in New Haven [14]. In this section, we demonstrate some examples of common assumptions and models that fall short of being correct. These examples are discussed in the following:

3.1 Motif Representation
A critical step in the process of motif discovery is the choice of an appropriate structure to model the motifs. This choice is a trade-off between the expressiveness of the model in describing particular biological properties and the efficiency of the algorithms that can be applied when that model is chosen [15]. Arguably the most important distinction between motif discovery tools is the model that is
used.
A motif can be represented by two popular models: a string representation (consensus or pattern) and a matrix representation (Position Frequency Matrix (PFM), Position Weight Matrix (PWM), or profile). Figure 1 displays an example of these models. The consensus sequence gives the most frequent nucleotide in each position. To allow for degeneracy, the characters used to describe a motif can be extended from {A, C, G, T} to the IUPAC characters; e.g., "TATRNT" is a consensus where "R" stands for a purine (A or G) and "N" stands for a base of any type [16]. The PFM records the frequency of each base type at each position. The PWM computes a log-ratio between the observed frequencies in the frequency matrix and the base occurrence frequencies in random DNA (the background frequencies). In the PWM, a motif of length l is represented by a matrix of size 4 × l, with the four entries in the jth column giving the scores of the four bases at position j [17]–[19]. The PWM describes the effect of each base on binding separately. Due to the low resolution of most existing data, it is not clear how generally applicable this model is [20].

3.2 Nucleotide Position Interdependency
The two motif representation models, the string and the matrix, share an important common weakness: they assume that the occurrence of each nucleotide at a particular position of a binding site is independent of the occurrence of nucleotides at other positions. Thus, these motif representations cannot model biological reality well, because they fail to capture nucleotide interdependence. It has been pointed out by many researchers that the nucleotides of a DNA binding site cannot be treated independently, e.g.
the binding sites of zinc fingers in proteins. The TF CSRE, which activates the gluconeogenic structural genes, can bind to the following binding sites:

CGGATGAATGG
CGGATGAATGG
CGGATGAAAGG
CGGACGGATGG
CGGACGGATGG

Note that there is a dependence between the fifth and the seventh symbols [16], [18]. Strong base interdependencies have also been observed in a stretch of three to five A or T residues flanking the core binding site in multiple TF classes [20].

Fig 1: Motif representation forms [36]

3.3 Probability Analysis
In Bioinformatics, two models have been used exhaustively to generate sequences. First, in the Bernoulli model, it is assumed that the symbols of a sequence are generated according to an independent, identically distributed process; hence there is no dependency between the probability distributions of the symbols. But this assumption is not entirely true, since sequences are believed to be biologically related [16], [18]. Second, the hidden Markov model (which relies on a basic Markov process) is a simplified view of reality, because it states that the probability of an event depends only on the event that occurred in the previous time step and is not affected by events that happened two or more steps previously. Most events in the real world do depend on what happened two or more steps in the past. Both models use assumptions that are not entirely true, but they simplify the problem [21], [22]. The shortcomings shown in the presented examples of motif representation and sequence generation call for new perspectives, ideas, and methods for dealing with Bioinformatics data.

4. CHARACTERISTICS OF BIOLOGICAL SEQUENCES
Knowing the properties of biological sequences can be very valuable in analyzing data and drawing appropriate conclusions. In this context, appropriate characterization of biological sequence structures and exploitation of biosequence properties are an important step in developing and creating powerful algorithms in Bioinformatics.
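As a concrete sketch of the representations discussed in Section 3.1, the following Python snippet builds the consensus, PFM, and PWM for the five CSRE binding sites listed above. It is illustrative only: the uniform 0.25 background frequency and the pseudocount of 1 are our assumptions, not values from the cited works.

```python
import math

# The five CSRE binding sites from Section 3.2
sites = [
    "CGGATGAATGG",
    "CGGATGAATGG",
    "CGGATGAAAGG",
    "CGGACGGATGG",
    "CGGACGGATGG",
]

BASES = "ACGT"
L = len(sites[0])

# Position Frequency Matrix: count of each base at each position
pfm = {b: [sum(1 for s in sites if s[j] == b) for j in range(L)] for b in BASES}

# Consensus: the most frequent base at each position
consensus = "".join(max(BASES, key=lambda b: pfm[b][j]) for j in range(L))

# Position Weight Matrix: log2 ratio of observed frequency (with a
# pseudocount of 1 to avoid log(0)) to an assumed uniform 0.25 background
n = len(sites)
pwm = {
    b: [math.log2((pfm[b][j] + 1) / (n + 4) / 0.25) for j in range(L)]
    for b in BASES
}

def score(seq):
    """Score a candidate site by summing PWM entries column by column."""
    return sum(pwm[base][j] for j, base in enumerate(seq))

print(consensus)                       # consensus of the five sites
print(round(score(sites[0]), 2))       # a true site scores high
print(round(score("ACGTACGTACG"), 2))  # an unrelated sequence scores low
```

Note that the column-by-column sum in `score` is exactly the independence assumption criticized in Section 3.2: each column contributes regardless of what the other columns contain.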
Biodata, or more pricisly molecular biological data DNA, RNA and proteins, create organism body. Biodata are rich of 82 Vol. 5, No. 2 February 2014 ISSN 2079-8407 Journal of Emerging Trends in Computing and Information Sciences ©2009-2014 CIS Journal. All rights reserved. http://www.cisjournal.org information and have many properties. Some of the related properties are listed briefly: a. Small alphabet, biosequence alphabet (DNA, RNA and proteins) regards small when compared with transaction sequences (e.g. market-basket analysis). Biosequence typically requires an alphabet of size less than 21; DNA and RNA consist of four alphabets and proteins consists of 20 alphabets [21], [23], [24]. b. Long sequences, biological sequences carry full detail information about organism species in the genes. Biosequences are long, for example chromosome 1 of the human sized 243 megabytes and human genome sized more than 3 gigabytes. Therefore, long sequences considered an important property of biological sequence data set [25], [26]. c. Mutation, it is the most outstanding property that distinguishes between biosequences and transactional sequences. Occurrences of patterns are not always identical; some copies may be approximated. The biosequence pattern usually allows nontrivial numbers of insertions, deletions, and other mutations. The instances of the pattern usually differ from the model in a few positions. Mutation represents a real challenge of sequential pattern mining [25], [27], [28]. method to understand the object. Therefore, this method depends on discovering properties of biodata. Simple example of applying this method in DNA motif discovery, DNA shows de facto properties such as small alphabet, long sequence, containing gaps, and mutation. But we know that DNA is full of information [14], therefore, they have more properties. Naturalistic method is calling to concentrate more on biodata in order to discover hidden properties. 
We expect following naturalistic method will increase our understanding of biodata. The method’s limitations: first study time; a study conducted over a certain interval of time is a snapshot dependent on conditions occurring during that time. Job Dekker, ENCODE group leader at the University of Massachusetts Medical School in Worcester, says “It sometimes takes you a long time to know how much can you learn from any given data set” [14]. Second is mechanism and how to figure out new properties. While the above statements about the method are not enough and not a panacea, it is certainly a step towards clarifying method choice. We will indicate the method by example in the next paper. Because the aim of this paper is emphasis on justifications of the method. 5.2 Motivation and Justification Factors Factors motivating the method are described in the following: a. The natural world is not possible to avoid. In contextualized knowledge, the natural world reforms official knowledge with respect to its practical objectives. Pragmatics is the soul of design. Naturalistic differentiates the purely natural world from models, formal systems, and specify the restrictions of formal systems in catching natural world operation. Formal rule methodized thinking is not capable to do various things that people do, such as realizing daily language [29]. b. Naturalistic method concurs as well as with tranquility along with moving conventional and traditional data mining paradigm to domain-driven data mining (D3M). D3M has been suggested to fill the space between academic objectives and business goals, because traditional data mining research principally concentrated upon improving, presenting, and applying the use of particular algorithms and models. An illustration will be in which scientists tend to be thinking about new pattern kinds, whilst professionals worry about obtaining an issue resolved. Real-world company and industry difficulties (in many cases) are hidden 5. 
NATURALIST METHOD JUSTIFICATION 5.1 Naturalistic Method This method proposes shifting the direction of researches in Bioinformatics to rely more on real biodata to deduce knowledge. It avoids assumption-driven model that restrains the researcher to see the real picture. This method enables the researchers to dive further into the data to understand biodata properties, ground their research on a meaningful theory with a meaningful purpose, seeks to discover and describe biodata properties, configure arguments to explain properties of biodata, and they all theorize about how a structure of biodata can be used to deduce their features. In-depth studies of biodata structure gain more understanding of biodata. The goal of the method is recognize biodata reality and comprehend its nature. It selects and uses analytical techniques to gain maximum meaning of biodata and processes. It emphasizes on discovering biodata characteristics by analyzing the real data nature to reflect its de facto and to be as far as possible from Bioinformatics theoretical assumptions. Characteristics and properties of any object form corner stone and powerful 83 Vol. 5, No. 2 February 2014 ISSN 2079-8407 Journal of Emerging Trends in Computing and Information Sciences ©2009-2014 CIS Journal. All rights reserved. http://www.cisjournal.org within complex conditions as well as elements. Environmentally components are usually strained or simplified within conventional research. As a result, there is a large gap between a syntactic programs and its actual target problem. The discovered patterns cannot be used for problem solving. D3M has been found to undertake the above problem [30]–[32]. c. e. Studies that depend mainly on hypotheses models lead to the derivation of imperfect biological models such as models used to generate the sequences (i.e., the Bernoulli Model, hidden Markov model) and motifs representation (indicated previously at page 81). 
Epigenetics present evidences that environment factors have impact on gene expression (i.e., genetically inconceivable) and gene-protein interaction that challenged the view of DNA [10]. These challenges generate questions about the suitability of hypothesis-driven science in Bioinformatics. Although assumptions trend produced no entirely true result, it provides a simplification which is a good approximation to the actual verified values; but to validate results it always requires laboratory experiment that is not possible constantly due to cost, time, technology availability, etc. On the other hand, the process simplification results in loss of information and may be having driven away from reality (e.g. Mendelian process). f. The sophisticated and unknown aspects of organism systems make the naturalistic method most in harmony with the observed facts on living things. The systems in living creatures constructs the general domain of Bioinformatics. These systems are in an ideal state, and they are interrelated in such a high order and the level that is beyond the comprehension of current sciences; for example, genes/proteins interact in a complex biological network. Genes and gene products interact on several levels (i.e., gene regulatory networks, protein-protein interaction network, and metabolic network), in many cases these different levels of interaction are integrated – for example, when the presence of an external signal triggers a cascade of interactions that involves both biochemical reactions and transcriptional regulation. Complex network produces from the relationship and operation of many genes, which consider a source of ambiguity in the genotype– phenotype relationship [11]. Keeping this in view, all beings are somehow interconnected. For the sake of simplicity they are subdivided into biology, chemistry, physics, environment, physiology, psychology, etc. 
Although many disciplines are employed in Bioinformatics (i.e., applied mathematics, informatics, statistics, computer science, artificial intelligence, biology, and biochemistry), they are still in their infancy here. Bioinformatics is full of unknown areas, such as biological roles: the functions of over 50% of discovered genes are unknown [33]. Moreover, due to the large number of TFs (>1,000), cell types, and environmental states, exhaustive application of such approaches to understanding human transcriptional regulation is not feasible. Furthermore, observing where TFs bind in the genome does not explain why they bind there [20]. Our knowledge is also limited regarding the differences in human protein abundance and the genetic basis for these differences [34]. No one knows how much more information the human genome holds, or when to stop looking for it. We do not know what most of our DNA does, nor how, or to what extent, it governs traits. In other words, we do not fully understand the mechanisms of work at the molecular level. The DNA story has turned out to be a little more complex, and there should be a bolder admission, indeed a celebration, of the known unknowns [11]. Deeper characterization of everything the genome is doing is probably only 10% finished. The lack of dependable information has led many biologists to think that clues to much of human complexity lie in the 'deserts' between the genes. Although a single-letter difference, or variant, may be associated with disease risk, researchers have few clues about the mechanism by which it causes or controls disease. Furthermore, the limited insight gained by ENCODE, and its unclear endpoint, have driven a few scientists to complain and to prefer changing the current method [14]. Thus, the limitation of current information is behind the use of statistical approaches to find even approximate concepts, and of statistical techniques for establishing relationships between the data and the unknowns [35]. d. On the other hand, genomics and Bioinformatics in general are considered promising fields for finding approaches to critical problems (e.g., genetic diseases and cancer). The current state needs more research, as well as reviews of research methodology. There is a pressing need to use these data and computational techniques to build network models of complex biological processes and disease phenotypes. Data mining will play an essential role in addressing these fundamental problems and in the development of novel therapeutic/diagnostic solutions in the post-genomics era of medicine. g. The validity and adequacy of an evaluation are affected by the availability of data [4]. Limited availability of data often makes assumptions and estimation the best choice, although in such cases achieving accuracy is a difficult task. Therefore, another factor that motivates this direction of research is the availability of huge amounts of real data (large amounts of data generated from microarrays and next-generation sequencing). Furthermore, new technologies such as Chromatin Immunoprecipitation (ChIP) and gene-chip technology enable a sustainable flow of data. In general, continuous progress in data mining techniques (preprocessing) and biotechnology promises the availability and sustainability of biodata. h. Large-scale distributed computing platforms offer a great opportunity to review and develop research methods, such as the cluster built by Google and IBM in conjunction with six pilot universities. This cluster is composed of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with special software from IBM and Google. Massive biodata, along with good progress in data mining, software, and hardware, offers a whole new way of understanding biodata [10]. i.
Bioinformatics is similar to Google in the sense that it is growing in a field of massively abundant data. Google does not depend on any model, but in spite of that it has achieved success and presents a good example. Peter Norvig, Google's research director, said at the O'Reilly Emerging Technology Conference: "All models are wrong, and increasingly you can succeed without them". We think current Bioinformatics research suffers from, and is restricted by, its reliance on models. Using models is inappropriate in Bioinformatics because models are systems produced in the minds of scientists, who simply do not have enough information about bio-systems. Therefore, as we learn more about biology, we find more weaknesses in the use of models (for examples, refer to Section 3). The naturalistic method calls for studying raw biodata without a specific hypothesis, model, or assumption. Although such a hypothesis simplifies problems and helps us focus on one relation to reach a fast result, hypotheses, models, and assumptions limit the ability to see the whole true picture. With enough data, the numbers speak for themselves [10]. j. Technological developments rarely lead to hypotheses, yet they strongly influence scientific research. Highly effective computer systems, data mining and prospecting instruments, and massive biodata make it possible to investigate biodata without assumptions. In this sort of environment, hypothesizing, modeling, and testing have become unnecessary. An example in this direction is the experiment conducted by J. Craig Venter, whose objective was to sequence entire ecosystems. He employed supercomputers, high-speed sequencers, and statistical instruments to analyze the data. Venter achieved success and made a real contribution to the advancement of biology [10]. k.
The naturalistic method is more satisfactory to the research process, because the research process demands a "rigorous, impersonal method following the demands of truth, judgment and objective procedure" [13]. It sets aside personal judgment, focuses on exact observations in the analysis, and avoids being selective about what is recorded as results. Although some may believe this is not feasible, since analysts are human beings and cannot be neutral or value-free about any scenario [1], the naturalistic method is a choice that aims at such neutrality.

6. CONCLUSION
The motivation and justification factors shown by this study lead to preferring the naturalistic method for Bioinformatics research, because it depends on real data. The method empowers Bioinformatics techniques to handle the true properties of biodata and to reduce assumptions about un-modeled or undiscovered biodata phenomena. This empowerment comes from recognizing and understanding biodata properties and processes.

7. FUTURE WORK
The ideas used in this study deserve further utilization. In order to show the advantages of the proposed method, we will present an example of the naturalistic method by searching for biological data characteristics and exploiting them to develop algorithms in motif discovery, which is regarded as the most active and vital field of data mining in Bioinformatics. The following are some suggestions for future work:

a. Apply the naturalistic method to biodata and discover new structural properties of motifs and biosequences.
b. Utilize the discovered properties to develop an algorithm that efficiently discovers more complex patterns in scalable data sets.
c. Implement the designed algorithm and experimentally evaluate its efficiency and robustness on many factors in comparison with current algorithms.

REFERENCES
[1] B. Robin, "An Introduction to Statistics: Hypotheses, Power and Sample Size," Power, pp. 1–25, 2011.
[2] K. Hon, "An Introduction to Statistics," alt2.mathlinks.ro, no. February, pp. 1–29, 2010.
[3] F. Soriano, Conducting Needs Assessments: A Multidisciplinary Approach, 2nd ed. SAGE Human Services Guides, 2012, p. 240.
[4] M. Bamberger and J. Rugh, Real World Evaluation—Working under Budget, Time, Data, and Political Constraints: Overview, 2nd ed. SAGE Publications, 2012, p. 712.
[5] J. F. Allen, "Bioinformatics and discovery: induction beckons again," BioEssays, vol. 23, no. 1, pp. 104–107, Jan. 2001.
[6] D. B. Kell and S. G. Oliver, "Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era," BioEssays, vol. 26, no. 1, pp. 99–105, Jan. 2004.
[7] P. Kumar, P. Krishna, and S. Raju, Pattern Discovery Using Sequence Data Mining: Applications and Studies. IGI Global, 2012, p. 286.
[8] I. Rothchild, "Induction, deduction, and the scientific method," Soc. Study Reprod., 2006.
[9] "Is Discovery Science Really Bogus?" [Online]. Available: http://ivory.idyll.org/blog/is-discovery-science-really-bogus.html. [Accessed: 19-Aug-2013].
[10] C. Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," Wired, pp. 8–10, 2008.
[11] P. Ball, "DNA: Celebrate the unknowns," Nature, vol. 496, no. 7446, pp. 419–420, Apr. 2013.
[12] M. Friberg, P. Von Rohr, and G. Gonnet, "Scoring Functions for Transcription Factor Binding Site Prediction," BMC Bioinformatics, vol. 6, no. 1, pp. 1–11, 2005.
[13] A. S. Sundar, S. M. Varghese, K. Shameer, N. Karaba, M. Udayakumar, and R. Sowdhamini, "STIF: Identification of stress-upregulated transcription factor binding sites in Arabidopsis thaliana," Bioinformation, vol. 2, no. 10, p. 431, 2008.
[14] W. Doolittle, "The Human Encyclopaedia," Proc. Natl. Acad. Sci., pp. 8–10, 2013.
[15] C. Pizzi, "Motif Discovery with Compact Approaches - Design and Applications," intechopen.com, 2011.
[16] H. Ji and W. H. Wong, "Computational biology: toward deciphering gene regulatory information in mammalian genomes," Biometrics, vol. 62, no. 3, pp. 645–663, 2006.
[17] F. Chin and H. Leung, "Optimal algorithm for finding DNA motifs with nucleotide adjacent dependency," Proc. APBC, pp. 343–352, 2008.
[18] F. Chin and H. C. M. Leung, "DNA Motif Representation with Nucleotide Dependency," IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 5, no. 1, pp. 110–119, 2008.
[19] Y. Zhang and M. Zaki, "SMOTIF: efficient structured pattern and profile motif search," Algorithms Mol. Biol., vol. 1, no. 1, p. 22, Jan. 2006.
[20] A. Jolma, J. Yan, T. Whitington, and J. Toivonen, "DNA-Binding Specificities of Human Transcription Factors," Cell, vol. 152, no. 1–2, pp. 327–339, Jan. 2013.
[21] F. Masseglia, P. Poncelet, and M. Teisseire, Successes and New Directions in Data Mining. Information Science Reference, 2008, p. 386.
[22] S. Kumar, "Finding Patterns in Sequences: Comparison of Motif Extraction, Dynamic Time Warping, and Hidden Markov Model Approaches," University of Illinois, 2004.
[23] E. Loekito, J. Bailey, and J. Pei, "A Binary Decision Diagram Based Approach for Mining Frequent Subsequences," Knowl. Inf. Syst., vol. 24, no. 2, pp. 235–268, Sep. 2010.
[24] K. Pavel and P. Vladimir, "Efficient Motif Finding Algorithms for Large-Alphabet Inputs," BMC Bioinformatics, vol. 11, suppl. 8, p. S1, 2010, doi: 10.1186/1471-2105-11-S8-S1.
[25] M. Piipari, T. A. Down, and T. J. P. Hubbard, "Large-Scale Gene Regulatory Motif Discovery with NestedMICA," … Pattern Discov., vol. 7, p. 1, 2011.
[26] F. Hadzic, T. Dillon, and H. Tan, Mining of Data with Complex Structures. 2011, p. 348.
[27] H. Chen-Ming, C. Chien-Yu, and L. Baw-Jhiune, "WildSpan: mining structured motifs from protein sequences," Algorithms Mol. Biol., vol. 6, no. 1, p. 6, 2011.
[28] G. Chen and Q. Zhou, "Heterogeneity in DNA multiple alignments: modeling, inference, and applications in motif finding," Biometrics, vol. 66, no. 3, pp. 694–704, 2010.
[29] P. Storkerson, "Naturalistic cognition: A research paradigm for human-centered design," J. Res. Pract., vol. 6, no. 2, pp. 1–24, 2010.
[30] K. P. Karunakaran, "Review of Domain Driven Data Mining," vol. 2, no. 3, pp. 112–116, 2013.
[31] V. R. Elangovan and E. Ramaraj, "Comparative Study of Domain Driven Data Mining for IT Infrastructure Support," no. 1, pp. 225–231, 2013.
[32] L. Cao, "Domain-Driven Data Mining: Challenges and Prospects," IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, pp. 755–769, 2010.
[33] "About the Human Genome Project." [Online]. Available: http://web.ornl.gov/sci/techresources/Human_Genome/project/info.shtml. [Accessed: 24-Aug-2013].
[34] L. Wu, S. I. Candille, Y. Choi, D. Xie, L. Jiang, J. Li-Pook-Than, H. Tang, and M. Snyder, "Variation and genetic control of protein abundance in humans," Nature, p. 12223, May 2013.
[35] W. Goddard, "Research methodology: An introduction," Methods, vol. IX, pp. 1–23, 2004.
[36] T. T. Nguyen and I. P. Androulakis, "Recent Advances in the Computational Discovery of Transcription Factor Binding Sites," Algorithms, vol. 2, no. 1, pp. 582–605, Mar. 2009.