Data science, also known as data-driven science, is an interdisciplinary field concerned with scientific methods, processes, and systems for extracting knowledge or insights from data in various forms, whether structured or unstructured, and is closely related to data mining. Data science deals with large quantities of data in order to extract meaningful and logical results, conclusions, and patterns. It is a newly emerging field that encompasses a number of activities, such as data mining and data analysis, and it employs techniques from mathematics, statistics, information technology, computer programming, data engineering, pattern recognition and learning, visualization, and high-performance computing. This paper gives a clear overview of the different data science technologies used in big data analytics. Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data.
Big data analytics has created opportunities for researchers to process huge amounts of data, but it has also created a serious threat to individual privacy. Data processed by big data analytics platforms may contain personal information that must be protected while deriving useful research results. Existing privacy-preserving techniques, such as anonymization, require the dataset to be partitioned into sets of attributes: sensitive attributes, quasi-identifiers, and non-sensitive attributes. With structured data such a partition may be possible, but in unstructured data it is very difficult to identify sensitive attributes and quasi-identifiers.
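The attribute partition this abstract describes underlies k-anonymity-style anonymization: quasi-identifiers are generalized until every record is indistinguishable from at least k−1 others. A minimal sketch, assuming toy records and an illustrative ZIP/age split that is not from the paper itself:

```python
from collections import Counter

# Hypothetical toy records; the attribute split (quasi-identifiers vs. the
# sensitive "disease" attribute) is an illustrative assumption.
records = [
    {"zip": "14850", "age": 34, "disease": "flu"},
    {"zip": "14850", "age": 36, "disease": "cold"},
    {"zip": "14853", "age": 52, "disease": "flu"},
    {"zip": "14853", "age": 57, "disease": "asthma"},
]

def generalize(record):
    """Coarsen the quasi-identifiers: truncate ZIP, bucket age by decade."""
    return (record["zip"][:3] + "**", record["age"] // 10 * 10)

def is_k_anonymous(records, k):
    """Every generalized quasi-identifier group must contain >= k records."""
    groups = Counter(generalize(r) for r in records)
    return all(count >= k for count in groups.values())

print(is_k_anonymous(records, 2))  # True: each generalized group has 2 records
```

For unstructured text, as the abstract notes, there is no such column layout, so even deciding which spans count as quasi-identifiers becomes the hard problem.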
During a pandemic, such as COVID-19, the scientific community must optimize collaboration as part of the race against time to identify and repurpose existing treatments. Today, Artificial Intelligence (AI) offers us a significant opportunity to generate insights and provide predictive models that could substantially improve our understanding of the core metrics that characterize the epidemic. A principal barrier to effective AI models in a collaborative environment, especially in the medical and pharmaceutical industries, is dealing with datasets that are distributed across multiple organizations, as traditional AI models rely on the datasets being in one location. In the status quo, organizations must slog through a costly and time-consuming extract-transform-load (ETL) process to build a dataset in a single location. This paper addresses how Federated Learning may be applied to facilitate flexible AI models trained on biopharma and clinical unstructured data, with a special focus on extracting actionable intelligence from existing research and communications via Natural Language Processing (NLP).
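The core mechanic of federated learning that avoids the ETL step is that each organization trains locally and shares only model parameters, which a coordinator averages (the FedAvg scheme). A minimal sketch, with hypothetical site sizes and parameter vectors standing in for locally trained NLP models:

```python
# Minimal federated-averaging sketch: each site trains on its own data and
# shares only its parameter vector, never the raw records.
def federated_average(site_weights, site_sizes):
    """Average per-site parameter vectors, weighted by local dataset size."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical organizations with locally trained 2-parameter models;
# the larger site (300 records) pulls the global model toward its weights.
w_global = federated_average([[1.0, 2.0], [3.0, 4.0]], site_sizes=[100, 300])
print(w_global)  # [2.5, 3.5]
```

In practice each round alternates local gradient steps with this aggregation, but the privacy argument rests entirely on the fact that only `site_weights`, not patient-level data, ever leaves an organization.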
In today's highly developed world, people around the globe express themselves every minute via various platforms on the Web, and each minute a huge amount of unstructured data is generated. This data takes the form of text gathered from forums and social media websites, and is termed big data. User opinions relate to a wide range of topics, such as politics, the latest gadgets, and products. These opinions can be mined using various technologies and are of utmost importance for making predictions or for one-to-one consumer marketing, since they directly convey the viewpoint of the masses. Here we propose to analyse the sentiments of Twitter users through their tweets in order to extract what they think. We use Hadoop for sentiment analysis, which processes the huge amount of data faster on a Hadoop cluster. Keywords—Opinion mining, sentiment analysis, Hadoop cluster, Twitter, unstructured data, movie review analysis, tokenisation.
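The per-tweet work in such a pipeline, tokenization followed by lexicon-based scoring, is exactly what the map step of a Hadoop job would run on each record. A minimal sketch, assuming a toy lexicon and made-up tweets (the real system would use a curated sentiment word list):

```python
import re

# Toy sentiment lexicon; the words and tweets below are illustrative only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

def tokenize(text):
    """Lowercase word tokenization, the first step of the pipeline."""
    return re.findall(r"[a-z']+", text.lower())

def score(tweet):
    """Lexicon score: +1 per positive token, -1 per negative token."""
    tokens = tokenize(tweet)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

tweets = ["I love this movie, great cast!", "Terrible plot, I hate it."]
print([score(t) for t in tweets])  # [2, -2]
```

On a cluster, `score` would be applied independently to millions of tweets in parallel, with a reduce step aggregating the per-topic totals.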
We present a Heterogeneous Data Quality Methodology (HDQM) for Data Quality (DQ) assessment and improvement that considers all types of data managed in an organization, namely structured data represented in databases, semi-structured data usually represented in XML, and unstructured data represented in documents. We also define a meta-model to describe the relevant knowledge managed in the methodology. The different types of data are translated into a common conceptual representation. We consider two dimensions widely analyzed in the specialist literature and used in practice: Accuracy and Currency. The methodology provides stakeholders involved in DQ management with a complete set of phases for data quality assessment and improvement. A non-trivial case study from the business domain is used to illustrate and validate the methodology.
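The two dimensions named here have simple operational readings: accuracy as agreement with a reference source, currency as the age of a stored value. A minimal sketch under assumed toy data (the company names and fields are illustrative, not from the paper):

```python
from datetime import date

# Illustrative reference source and records; accuracy compares stored names
# against the reference, currency measures how stale each value is.
reference = {"acme": "Acme Corp", "globex": "Globex Inc"}
records = [
    {"key": "acme", "name": "Acme Corp", "updated": date(2024, 1, 10)},
    {"key": "globex", "name": "Globex LLC", "updated": date(2020, 6, 1)},
]

def accuracy(records, reference):
    """Fraction of stored values that agree with the reference source."""
    hits = sum(r["name"] == reference[r["key"]] for r in records)
    return hits / len(records)

def currency(record, today):
    """Age of the value in days; lower means more current."""
    return (today - record["updated"]).days

print(accuracy(records, reference))          # 0.5
print(currency(records[0], date(2024, 1, 20)))  # 10
```

HDQM's contribution is applying such measures uniformly after databases, XML, and documents have been mapped to the common conceptual representation, so the same dimension definitions cover all three data types.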
The paper discusses the feasibility of constructing a SQL Server FILESTREAM-based English Language Learning System (ELLS), focusing on the implementation phase of the system. It explains the prospect of storing and managing unstructured data (e.g., images, video, Word, Excel, PDF, and MP3 files) for educational purposes using the FILESTREAM technique provided by SQL Server 2012, and explains how to maintain efficient storage of and access to BLOB data. Storing unstructured data poses many challenges, such as how to maintain transactional consistency between structured and unstructured data and how to manage backup, restore, and storage performance. The paper seeks to combine SQL Server 2012 features with NTFS (New Technology File System) to improve the efficiency and performance of the ELLS system for children, while maintaining transactional consistency between the unstructured data and the corresponding structured data. Furthermore, the system allows the user to customize operations such as creating, updating, and deleting photos, videos, or audio files in the database, and it supports maintenance operations such as backup, restore, and consistency checking.
Abstract. We present our ongoing research on constructing and extending a version of Controlled English (CE) in support of knowledge sharing and decision-making for effective and efficient operations in the military coalition environment. This work would be useful for any multinational English-speaking environment. This CE is intended for both human use and machine processing, providing: (i) a user-friendly language in the form of English that the user can employ in a fairly intuitive way; (ii) a precise language that enables clear, unambiguous representation of information amenable to rule-based interpretation and inferencing. The paper focuses on methods for CE construction that optimize the balance between naturalness for humans and machine readability of the CE language, in light of theoretical considerations and empirical experimentation. We discuss certain aspects of CE syntax, semantics, and the lexical model as examples. We also show sample CE...
Big data have become a global strategic issue, as increasingly large amounts of unstructured data challenge the IT infrastructure of global organizations and threaten their capacity for strategic forecasting. As experienced in former massive information issues, big data technologies, such as Hadoop, should efficiently tackle the incoming large amounts of data and provide organizations with relevant processed information that was formerly neither visible nor manageable. After having briefly recalled the strategic advantages of big data solutions in the introductory remarks, in the first part of this paper, we focus on the advantages of big data solutions in the currently difficult time of the COVID-19 pandemic. We characterize it as an endemic heterogeneous data context; we then outline the advantages of technologies such as Hadoop and its IT suitability in this context. In the second part, we identify two specific advantages of Hadoop solutions, globality combined with flexibility, and we notice that they are at work with a “Hadoop Fusion Approach” that we describe as an optimal response to the context. In the third part, we justify selected qualifications of globality and flexibility by the fact that Hadoop solutions enable comparable returns in opposite contexts of models of partial submodels and of models of final exact systems. In part four, we remark that in both these opposite contexts, Hadoop’s solutions allow a large range of needs to be fulfilled, which fits with requirements previously identified as the current heterogeneous data structure of COVID-19 information. In the final part, we propose a framework of strategic data processing conditions. To the best of our knowledge, they appear to be the most suitable to overcome COVID-19 massive information challenges.
Linguistic flexibility around non-predetermined expressions is important for more effective human-robot face-to-face interaction. In the past, most robots have been fitted with a limited and supervised response process and programmed with certain responses for predetermined words or sentences. As a result, implementing viable robot-based recommendation services has been difficult. The purpose of this paper is to propose a text-mining approach to flexible robot-based recommendation services in which, when the robot encounters linguistic expressions that differ substantially from the programmed linguistic database, flexible responses are generated based on understanding of several external corpora and a knowledge-learning process. This study combines two text-mining methods, TF-IDF and LDA, to generate flexible communication, which enables a robot to respond with recommendation content that is not pre-programmed. The results of our analysis suggest that the proposed combined approach outperforms the TF-IDF and LDA methods in terms of overall accuracy and F-score.
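The abstract does not specify how the TF-IDF and LDA scores are fused, so the sketch below simply averages a TF-IDF term score with a hypothetical LDA topic weight; the documents, the term, and the 0.6 weight are all illustrative assumptions:

```python
import math
from collections import Counter

# Toy utterance corpus; a real system would use the external corpora the
# paper mentions.
docs = [
    "robot recommends a book about space",
    "user asks the robot about the weather",
    "book club meets to discuss a space book",
]

def tfidf(term, doc, docs):
    """Classic tf-idf: term frequency in doc times log inverse doc frequency."""
    tokens = doc.split()
    tf = Counter(tokens)[term] / len(tokens)
    df = sum(term in d.split() for d in docs)
    idf = math.log(len(docs) / df)
    return tf * idf

# Hypothetical P(topic | utterance) from a separately fitted LDA model.
lda_topic_weight = 0.6

# One possible fusion: an equal-weight average of the two signals.
combined = 0.5 * tfidf("book", docs[2], docs) + 0.5 * lda_topic_weight
print(round(combined, 3))  # 0.351
```

The reported result, that the fusion beats either signal alone on accuracy and F-score, is plausible because TF-IDF captures surface keyword salience while LDA contributes topic-level similarity for expressions absent from the programmed database.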
This paper presents a work-in-progress that assesses the use of controlled vocabularies during requirements engineering as a means to mine data from different sources (interviews, contracts, schemas, and diagrams). Doing so facilitates requirements description, analysis, and comprehension for both developers and end users. As our research methodology, we use a systematic mapping study covering the years 2000 to 2014. As far as we know, such studies have not yet been done, even though the cost incurred from errors in the requirements elicitation phase is one of the problems most commonly reported by practitioners. Our study includes data on the processes of building controlled vocabularies and assesses their productivity and quality. We are also interested in tools and techniques for classifying and retrieving information. Our first findings suggest that this is an under-researched area.