Social media have become a discussion platform for individuals and groups, so users belonging to different groups can communicate with one another. Positive and negative messages, as well as media, circulate among these users. Users can form special groups with people they already know in real life or whom they meet through social networking after being suggested by the system. In this article, we propose a framework for recommending communities to users based on their preferences; for example, a community for people who are interested in certain sports, art, hobbies, diseases, age, case, and so on. The framework is based on a feature extraction algorithm that utilizes user profiling and combines the cosine similarity measure with term frequency to recommend groups or communities. Once data is received from the user, the system tracks their behavior, identifies relationships, and then recommends one or more communities based on their preferences. Finally, experimental studies are conducted using a prototype developed to test the proposed framework, and the results show the value of our framework in recommending people to communities.
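The core matching idea the abstract describes, combining term frequency with cosine similarity between a user profile and community descriptions, can be sketched as follows. The community names, description texts, and user profile below are invented stand-ins, not the paper's actual data or algorithm:

```python
# Sketch: match a user's interest profile to community descriptions
# using term-frequency vectors and cosine similarity.
from collections import Counter
import math

def tf_vector(text):
    # Term-frequency vector: raw counts of lowercase whitespace tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical communities, each described by a few keywords.
communities = {
    "running club": "running marathon training sports fitness",
    "art circle": "painting drawing sketch gallery art",
}

user_profile = "I enjoy marathon running and general fitness training"
user_vec = tf_vector(user_profile)
best = max(communities, key=lambda c: cosine(user_vec, tf_vector(communities[c])))
print(best)  # → running club
```

A real system would build the user vector from tracked behavior rather than a single sentence, but the ranking step stays the same.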
6LoWPAN was introduced by the IETF as a standard protocol to interconnect tiny, constrained devices across IPv6 clouds. 6LoWPAN supports a QoS feature based on two priority bits. So far, little interest has been paid to this QoS feature, and there are no implementations of it in real networks. In this paper, we evaluate the capacity of these priority bits to provide QoS in various scenarios. We show that under very heavy or very low network load, these bits have a limited effect on the delay.
The story of the Nahal holding in the Old City: the beginning of settlement in the Jewish Quarter after the Six-Day War.
The story of the 'Moriah' Nahal holding, the first settlement in the Jewish Quarter of Jerusalem after the Six-Day War. The Old City of Jerusalem, 1967-1972.
Spam on social networks has become a major issue globally, and in recent years spam detection on social networks has drawn increasing attention. Spammers have realized that social networks are vulnerable to attack and exploit them for malicious purposes. The influx of spam poses a serious threat to individuals, organizations, governments, and institutions; left unchecked, spam threatens to undermine resource sharing, interactivity, and openness. The ubiquitous use of social networks generates huge amounts of social data, which gives spammers the leverage to perform various forms of malicious attacks and spam activities. This paper surveys three computational categorization issues in social networks, namely size, noise, and dynamism, which make social network data complex to analyse. The paper discusses the various data mining techniques used over the decades to mine diverse aspects of social network sites, from the earliest to the most recent models, including novel algorithms such as the Porter Stemmer algorithm and TF-IDF. General Terms: Porter Stemmer Algorithm (PSA), TF-IDF Algorithm.
The paper attempts to analyze whether the sentiment stability of financial 10-K reports over time can determine a company's future mean returns. A diverse portfolio of stocks was selected to test this hypothesis. The proposed framework downloads companies' 10-K reports from the SEC's EDGAR database and passes them through a preprocessing pipeline to extract critical sections of the filings for NLP analysis. Using the Loughran and McDonald sentiment word list, the framework generates sentiment TF-IDF vectors from the 10-K documents, calculates the cosine similarity between two consecutive 10-K reports, and proposes to leverage this cosine similarity as the alpha factor. To analyze the effectiveness of this alpha factor at predicting future returns, the framework uses the alphalens library to perform factor return analysis and turnover analysis, and to compare the Sharpe ratios of potential alpha factors. The results show a strong correlation between the sentiment stability of the portfolio's 10-K statements and its future mean returns. For the benefit of the research community, the code and Jupyter notebooks related to this paper have been open-sourced on GitHub.
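The alpha-factor computation described above, sentiment-restricted TF-IDF of two consecutive filings followed by cosine similarity, can be sketched as below. The short word list and the two "filing" snippets are toy stand-ins, not the Loughran-McDonald list or real EDGAR filings:

```python
# Sketch: cosine similarity between sentiment-restricted TF-IDF vectors
# of two consecutive 10-K texts, used as a "sentiment stability" score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy sentiment vocabulary (assumption; the paper uses Loughran-McDonald).
sentiment_words = ["adverse", "litigation", "decline", "gain", "strong", "loss"]

filing_y1 = "strong gain this year despite minor litigation exposure"
filing_y2 = "strong gain continued despite litigation exposure and a small decline"

# Restrict TF-IDF to the sentiment vocabulary only.
vec = TfidfVectorizer(vocabulary=sentiment_words)
X = vec.fit_transform([filing_y1, filing_y2])

# Stability of sentiment tone between the two consecutive filings.
alpha = cosine_similarity(X[0], X[1])[0, 0]
print(round(alpha, 3))
```

A value near 1 indicates stable sentiment tone year over year; lower values flag a shift in tone.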
Data mining is the process of analyzing data to discover knowledge in databases. One of the techniques of data mining is classification, and the neural network has emerged as a popular classification algorithm. In this research, the backpropagation neural network algorithm is adapted for text mining to classify advisor lecturers based on students' final projects. Generally, small neural network structures are faster when deployed. The use of SVD and weight initialisation for optimizing the neural network structure is proposed: SVD is used to identify and eliminate redundant hidden nodes. Moreover, the optimal neural network size is highly dependent on the weight initialisation. The method starts from a fairly large network and dynamically removes unimportant connections. The experiment was run 5 times for each testing scenario. The results showed that the neural network algorithm with the pruning method and a large amount of training data produces better results, with an accuracy of 85%, a precision of 90.63%, a recall of 85%, and an F-measure of 87.72%. 1. Introduction 1.1 Background The final project is a student's scientific work and one of the requirements for obtaining a bachelor's degree. In preparing a final project, students need a supervising lecturer for consultation while completing it. The supervisor should be someone who masters the field matching the student's final project so that the supervision process can run well. At the Department of Informatics Engineering of UMM, the process of assigning supervisors is still done manually, relying on personal knowledge of the required lecturer expertise. Therefore, an analysis of lecturer expertise matching students' final project topics is needed.
In this final project research, the researcher applies data mining based on the experience of lecturers who have supervised students, using the topic, title, and abstract keywords of final projects as parameter variables. By recognizing patterns in the variables that describe the final projects a lecturer has supervised, an application can be built to assign final project supervisors using classification techniques. Classification recognizes patterns describing groups of objects that have already been classified and draws conclusions
Increasing progress in numerous research fields and information technologies has led to an increase in the publication of research papers. As a result, researchers spend a lot of time finding interesting research papers close to their field of specialization. Consequently, in this paper we propose a document classification approach that can cluster the text documents of research papers into meaningful categories that share a similar scientific field. The presented approach is based on the essential focus and scope of the target categories, where each category includes many topics. Accordingly, we extract word tokens from the topics that relate to each specific category separately. The frequency of word tokens in documents affects the weight of a document, which is calculated using the numerical statistic term frequency-inverse document frequency (TF-IDF). The proposed approach uses the title, abstract, and keywords of the paper, in addition to the category topics, to perform the classification. Subsequently, documents are classified and clustered into the primary categories based on the highest cosine similarity between category weights and document weights.
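The category-assignment step just described can be sketched as follows: each category is represented by its topic word tokens, both sides are TF-IDF weighted, and the paper goes to the category with the highest cosine similarity. The category vocabularies and the paper text below are invented for illustration:

```python
# Sketch: assign a paper to the category whose topic tokens have the
# highest TF-IDF cosine similarity with the paper's title/abstract text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical categories, each described by its topic word tokens.
categories = {
    "networks": "routing protocol wireless packet latency bandwidth",
    "machine learning": "classifier training features model accuracy dataset",
}

paper = ("a lightweight routing protocol reducing packet latency "
         "in wireless sensor networks")

names = list(categories)
corpus = [categories[n] for n in names] + [paper]
X = TfidfVectorizer().fit_transform(corpus)

cat_vecs = X[: len(names)]   # one row per category
doc_vec = X[len(names):]     # the paper itself
sims = cosine_similarity(doc_vec, cat_vecs).ravel()
print(names[int(sims.argmax())])  # → networks
```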
With the ever-growing amount of text and information in digital space, it is nearly impossible to extract summaries manually. Hence, there is demand for an automatic system that can comprehend such data and deliver relevant information efficiently in a short time. In this project, we have developed an unsupervised extractive text summarizer that pulls out the most important and relevant information from text to form a concise and accurate summary. The system is designed to generate summaries for both categories of dataset, i.e. single documents and multiple documents. Various extractive summarization algorithms such as TextRank, TF-IDF, and Luhn's algorithm are used for experimenting and building the model.
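One of the simpler extractive strategies mentioned, TF-IDF sentence scoring, can be sketched as below: rank sentences by the mean TF-IDF weight of their terms and keep the top-scoring ones. The three example sentences are invented, and this is only one of the algorithms the project combines:

```python
# Sketch of TF-IDF-based extractive summarization: score each sentence
# by the mean TF-IDF weight of its terms and pick the highest scorer.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

sentences = [
    "The new reactor design improves thermal efficiency.",
    "Lunch was served at noon.",
    "Thermal efficiency gains come from the improved reactor coolant loop.",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(sentences)             # one TF-IDF row per sentence
scores = np.asarray(X.mean(axis=1)).ravel()  # mean TF-IDF weight per sentence
summary = sentences[int(np.argmax(scores))]  # one-sentence extractive summary
print(summary)
```

A multi-sentence summary would keep the top-k sentences in their original order instead of just the single best one.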
When managers fail to attend a business meeting, they have to read the transcript from the meeting they missed and get informed about the decisions that have been taken. Text mining may fully automate this process. Support tools which can automatically detect decisions in business meeting transcripts would be a benefit for companies in terms of efficiency and productivity.
This report examines whether Machine Learning for Text Classification can be used to identify useful information in textual data. Specifically, Naive Bayes (NB) and Support Vector Machines (SVM), two popular machine learning algorithms for text classification tasks, are used to explore whether it is possible to recognize decisions (a kind of valuable information for the purpose of this study) in business meeting transcripts.
An imbalanced dataset containing decisions and non-decisions was built from transcripts of the United States Chemical Safety and Hazard Investigation Board (CSB) business meetings. An experiment was conducted comparing the two classifiers, NB and SVM. The results showed that SVM can identify decisions more successfully than NB, achieving 0.92 precision and 0.50 recall, which could be significantly improved with a balanced dataset.
Growth of research article publication across various streams of research is exponential. Searching for a particular article in a research repository is considered a tremendous and time-consuming task. Classifying research articles by their respective domains plays an important role in letting researchers retrieve articles quickly. Hence a popular search mechanism, namely keyword search, has been applied to retrieve appropriate articles, documents, texts, graphs, and even relational databases. When documents from new domains are added to the repository, keywords have to be identified and added to the corresponding domains for proper classification. The numerical statistic TF-IDF is proposed to determine the relevance of a word to a document corpus. Clustering algorithms, namely hierarchical, K-Means, and Fuzzy C-Means, are used to cluster articles based on the TF-IDF relevance factor. The strength of the Fuzzy C-Means clustering is validated using the silhouette cluster validation technique. Finally, performance is evaluated using precision, recall, and F-measure, demonstrating that Fuzzy C-Means clustering achieves better accuracy than K-Means and hierarchical clustering.
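The clustering stage can be sketched as below with plain K-Means standing in for the three algorithms the paper compares (scikit-learn does not ship Fuzzy C-Means). The four article titles are invented:

```python
# Sketch: cluster article texts by their TF-IDF vectors with K-Means.
# K-Means stands in here for the hierarchical / K-Means / Fuzzy C-Means
# comparison described in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "deep neural networks for image recognition",
    "convolutional networks improve image classification",
    "soil nutrients and crop rotation in agriculture",
    "crop yield response to soil fertilizer treatments",
]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # docs 0-1 share one cluster, docs 2-3 the other
```

Silhouette validation would then score how well-separated the resulting clusters are (e.g. with `sklearn.metrics.silhouette_score`).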
A recurring problem among students of Universitas Atma Jaya Yogyakarta is the difficulty of obtaining lecture information quickly and accurately. In the near future, Universitas Atma Jaya Yogyakarta will adopt a service-based technology, the Application Program Interface (API). To obtain academic data through the UAJY API, students need an application that serves them through a question-and-answer approach, namely a ChatBot application. The UAJY ChatBot application provides specific information and directs students to lecture topics. The UAJY ChatBot's data source is the data warehouse of Universitas Atma Jaya Yogyakarta. With this service automation, the UAJY ChatBot is expected to become the best solution for obtaining lecture information at Universitas Atma Jaya Yogyakarta.
This presentation refers to the project done by Ms. Sidra Mehtab as part of her MSc (Data Science & Analytics) minor projects series. The project has two parts. In Part I of the project, we carried out a sentiment analysis on Twitter data based on reviews written by customers of six US airlines. The tweets are already classified into three categories: "positive", "negative", and "neutral". Using a supervised classification approach, we trained a Random Forest classifier model on the tweet data, tested it on the test data, and evaluated it on various metrics such as precision, recall, and F1-score. In the second part of the project, we carried out another important text mining task known as topic modeling, using the Scikit-Learn library of Python. We used a food review dataset consisting of 50K text reviews on various food items and categorized the reviews into various topics using a method called Latent Dirichlet Allocation (LDA).
Feature selection and extraction are frequently used approaches for reducing the computational burden in text classification problems. We introduce an extraction method that, for each class, summarizes the characteristics of the sample documents, so that the new features aggregate information on the amount of evidence contained in a document. To construct the abstract features of a new feature space with dimensionality equal to the number of classes, the high-dimensional properties of documents are projected. This paper aims to explore how various feature extraction methods for text data influence text classification results. Two different Bag-of-Words extraction methods are studied, specifically the Count Vector and TF-IDF approaches. An embedding method, the GloVe extraction process, is also investigated. The effectiveness of and improvements to classifiers on standard text classification test sets are compared. The findings show that the choice of extraction method has a substantial effect on the resulting classifications, but that no approach consistently outperforms the others. Instead, the findings indicate the best output on the retrieval measures with GloVe and the best output on the precision measurements with the Bag-of-Words approach. While the main emphasis is on TF-IDF and word embedding methods, various other feature extraction methods are also discussed.
The key to immortality and eternal youth lies in the correct answer to the main question: how can we naively discover new essential, but still hidden, features required for properly training novel adaptive supervised machine learning algorithms (SMLA)?
Automatically assigning documents to predefined categories, one of the several benefits of text classification, is one of the primary steps toward knowledge extraction from raw textual data. In such tasks, words are treated as a set of features. Due to the high dimensionality and sparseness of the feature vectors resulting from traditional feature selection methods, most text classification methods proposed for this purpose lack performance and accuracy. Many algorithms have been applied to the problem of automatic text categorization, which is why we tried newer methods drawing on information extraction, natural language processing, and machine learning. This paper proposes an innovative approach to improve the classification performance of Persian text. Naive Bayes classifiers, which are widely used for text classification in machine learning, are based on conditional probability. We compared the Gaussian, Multinomial, and Bernoulli variants of the naive Bayes algorithm with the SVM algorithm. For statistical text representation, TF, TF-IDF, and character-level 3-grams [6,9] were used. Finally, experimental results are reported on 10 newsgroups.
In the present day, the development of the internet has resulted in a significant rise in the number of electronic documents in several regional languages. As Tamil text data in digital format, in both online and offline modes, is growing significantly, management and retrieval of these documents is a tedious process. Automatic text classification aims to allocate fixed class labels to unclassified text documents. Many natural language processing (NLP) techniques are extremely dependent on the automatic classification of Tamil text documents. Recent developments in machine learning (ML) algorithms help to attain effective Tamil document classification. In this view, this paper introduces an automated Tamil document classification technique using ML models. The presented model involves different processes such as preprocessing, feature extraction, feature selection, and classification. The proposed model uses the term frequency-inverse document frequency (TF-IDF) approach for feature extraction. Besides, the chi-square test is employed to select an optimal set of features. At last, three ML models, random forest (RF), decision tree (DT), and gradient boosting tree (GBT), are applied to determine the class labels of the Tamil documents. To assess the performance of the presented model, a set of simulations was run on a Tamil document dataset collected on our own. The experimental values confirmed the effective classification results of the presented model over the compared methods, showing that the GBT model reached an effective classification outcome with a maximum accuracy of 85.10%, precision of 87.01%, recall of 85.10%, and F1-score of 85.52%.
The sentiment analysis approach determines the sentiment in text content using a keyword intensity or term frequency based approach. Keyword extraction models determine which words in the text data carry sentiment and eliminate the remaining content based on the selection or design of the feature extraction model. The keyword-based features are then transformed into numeric form using ratio, weight, or appearance based descriptions, and further classified using a supervised classification model to identify their orientation. In this paper, the supervised machine learning approach combines count vectorization and TF-IDF based features with chi-square based feature selection for sentiment analysis on the IMDB review database. The proposed feature description model combines various N-gram features, namely unigrams, bigrams, and trigrams, which capture different aspects of the sentiment contained in the text data. The proposed model outperforms the existing layered model based on a count method with TF-IDF. After evaluating the results, the Support Vector Machine (SVM) classification method is found to be the best method with the proposed feature descriptor.
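The pipeline described, unigram-to-trigram TF-IDF features, chi-square feature selection, and an SVM classifier, can be sketched as follows. The four labelled reviews are invented stand-ins for the IMDB data, and `k=10` is an arbitrary illustrative choice:

```python
# Sketch: n-gram TF-IDF features + chi-square selection + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = [
    "a wonderful and moving film", "truly great acting and story",
    "a dull and boring movie", "terrible plot and bad acting",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),  # unigrams, bigrams, trigrams
    SelectKBest(chi2, k=10),              # keep the 10 strongest features
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.predict(["a dull and boring story"]))
```

On a real corpus, `k` would be tuned on held-out data rather than fixed.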
Near-miss incidents can be treated as events that signal weaknesses in the safety management system (SMS) at a workplace. Analyzing near-misses will reveal relevant root causes behind such incidents so that effective safety-related interventions can be developed beforehand. Despite a huge potential for workplace safety improvements, analysis of near-misses is scant in the literature owing to the fact that near-misses are often reported as text narratives. The aim of this study is therefore to explore text mining for extracting root causes of near-misses from narrative text descriptions of such incidents and to measure their relationships probabilistically. Root causes were extracted with a word cloud technique, and a causal model was constructed using a Bayesian network (BN). Finally, using the BN's inference mechanism, scenarios were evaluated and root causes were listed in prioritized order. A case study in a steel plant validated the approach and raised concerns for a variety of circumstances, such as incidents related to collisions, slips-trips-falls, and working at height.
In this paper we present and validate a novel approach for single-label multi-class document categorization. The proposed categorization approach relies on the statistical property of Principal Component Analysis (PCA), which minimizes the reconstruction error of the training documents used to compute a low-rank category transformation matrix. This matrix allows projecting the original training documents from a given category to a new low-rank space and then optimally reconstructs them to the original space with a minimum loss of information. The proposed method, called Minimum Loss of Reconstruction Information (mLRI) classifier, uses this property, extends and applies it to unseen documents. Several experiments on three well-known multi-class datasets for text categorization are conducted in order to highlight the stable and generally better performance of the proposed approach in comparison with other popular categorization methods.
The growth of interest in epistolary texts over the last few decades has led to a flourishing of international research projects devoted to cataloguing, editing, and studying modern letters, in a collective and coordinated effort to better understand these materials. In this seminar, Dr Gianluca Valenti will introduce epistolarITA, a project dedicated to the edition and analysis of epistolary texts written in Italian between the 15th and 17th centuries and sent from the former Low Countries.
Emotion is the human feeling when communicating with other humans or reacting to everyday events. Emotion classification is needed to recognize human emotions from text. This study compares the performance of the TF-IDF and Word2Vec models for representing features in emotional text classification. We use the support vector machine (SVM) and Multinomial Naïve Bayes (MNB) methods for classifying emotional text in commuter line and Transjakarta tweet data. The emotion classification in this study has two steps: the first step classifies data as containing emotion or no emotion, and the second step classifies emotional data into five types of emotion, i.e. happy, angry, sad, scared, and surprised. This study used three scenarios, namely SVM with TF-IDF, SVM with Word2Vec, and MNB with TF-IDF. The SVM with TF-IDF method generates the highest accuracy in both the first and second classification steps, followed by MNB with TF-IDF, and lastly SVM with Word2Vec. Evaluation using precision, recall, and F1-measure shows that SVM with TF-IDF is the best method overall. This study shows that TF-IDF modeling performs better than Word2Vec modeling and improves classification performance compared to previous studies.
There have been many notable works on plagiarism detection techniques in the English language but very few in the Nepali language, mostly due to the challenges involved in preprocessing the Devanagari script (the script for the Nepali language). The complicated grammatical rules and structure of the Nepali language compared to English, and the lack of available datasets, give rise to these difficulties. So, in this paper, we build a rule-based recursive stemming algorithm for preprocessing Nepali texts to develop a Nepali Plagiarism Detection System using TF-IDF feature vector construction with the cosine similarity measure.
User reviews provide a rich source of information regarding user interests. Many web platforms allow or even encourage their visitors to leave feedback on the products and services they have consumed. Term Frequency (TF) and Inverse Document Frequency (IDF) are two factors that have been used extensively in capturing users' preferences. This paper collects users' reviews from e-tourism web platforms, calculates the TF and the IDF for each user, and adopts a multi-criteria approach to quantify users' preferences and dynamically adapt the websites' design accordingly. It utilizes AHP and similarity methods to determine the relative importance of terms and web pages and then rearranges them into a new web site structure.
A spam filter is a program used to identify unwanted emails and prevent those messages from reaching a user's mailbox. The study focused on how the algorithms can be applied to a collection of e-mails consisting of both ham and spam. First, the working principle and implementation steps of stop-word removal, TF-IDF, and a stemming algorithm on NVIDIA's Tesla P100 GPU are discussed, and the findings are verified by executing the Naïve Bayes algorithm. After fully training and testing on the spam e-mail dataset taken from Kaggle using the proposed method, we obtained a high training accuracy of 99.67% and a testing accuracy of about 99.03% on the multicore GPU, which boosted the speed of execution of the training and testing periods. This improved the training and testing accuracy by around 0.22% and 0.18% respectively compared to applying only Naïve Bayes, i.e. the conventional method, to the same dataset, where training and testing accuracy were 99.45% and 98.85% respectively. Also, the training time on the GPU was 1.361 seconds, about 1.49X faster than the 2.029 seconds on the CPU, and the testing time on the GPU was 1.978 seconds, about 1.15X faster than the 2.280 seconds on the CPU.
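A CPU-only sketch of the filtering pipeline the abstract describes, stop-word removal and TF-IDF weighting feeding a Naïve Bayes classifier, is below. The GPU acceleration and the Kaggle dataset are outside this snippet, and the six messages are invented:

```python
# Sketch: spam filtering with stop-word removal, TF-IDF, and Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize claim your reward now",
    "cheap loans act now limited offer",
    "free reward click to claim instantly",
    "meeting moved to tuesday afternoon",
    "please review the attached project report",
    "lunch with the team on friday",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # drop stop words, weight terms
    MultinomialNB(),
)
model.fit(emails, labels)
print(model.predict(["claim your free prize now"]))  # → [1]
```

A stemming step (e.g. a Porter stemmer applied in a custom tokenizer) would slot in before vectorization in a fuller implementation.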
Fault Tree Analysis (FTA) is a proven technique for finding the root cause of a problem and simplifying it systematically and logically. In auto parts manufacturing companies, line stoppage is a major problem, and thus bottleneck machines are identified. In this case a honing machine, used for honing brake drums, was identified as the bottleneck machine. The problem was the seat check alarm, which halted the machine; only after cleaning the brake drum surface and holes would the machine restart. This was not only time consuming but also delayed the production of parts with respect to the fixed takt time. The burrs of the holes on the fixture seating area also affected the proper seating of the next part on the fixture surface, causing further delay in production. This could have been avoided if a chamfer operation had been added to the rear face of the drum holes in the initial design and process, but that would have meant an additional operation and another machine. The proposed approach solves the problem by changing the fixture plate so that the holes do not fall in the seating area and the burr area is relieved. This requires a new fixture plate design with proper repositioning of the seat check air hole while keeping the clamping area the same. The functioning of the machine was studied for a month after mounting the newly designed fixture plate, and the seat check alarm was not triggered; thus the proposed technique successfully eliminated the stoppage issue, thereby improving production efficiency.
Sentiment analysis is an interdisciplinary field spanning natural language processing, artificial intelligence, and text mining. The key element of sentiment analysis is polarity, i.e., whether the sentiment is positive or negative (Chen, 2012). This study applies the support vector machine classification method to 648 consumer reviews. The data were obtained from consumer reviews on a marketplace where the products sold are mobile phones. The study identifies three aspects of marketplace sentiment: service, delivery, and product. The slang dictionary used for the normalization process contains 552 slang words. The study compares feature-extraction approaches to obtain the best classification result, because classification accuracy is influenced by the feature-extraction process. Comparing n-gram and TF-IDF features using the support vector machine method, unigram features achieved the highest accuracy, at 80.87%. The results show that, for aspect-level sentiment analysis with this feature comparison, the combination of unigram features and a support vector machine classifier is the best model.
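Two preprocessing steps the study relies on, slang normalization against a dictionary and unigram feature extraction, can be sketched as follows. The slang entries and the sample review are invented illustrations, not the paper's 552-word dictionary or its data:

```python
# Hypothetical slang dictionary mapping informal tokens to normal forms.
SLANG = {"gr8": "great", "svc": "service", "thx": "thanks"}

def normalize(review):
    """Replace slang tokens with their dictionary form."""
    return " ".join(SLANG.get(tok, tok) for tok in review.lower().split())

def unigrams(review):
    """Unigram features: the set of individual normalized tokens."""
    return set(normalize(review).split())

print(unigrams("Gr8 svc and fast delivery"))
```

Normalizing before feature extraction matters because "gr8" and "great" would otherwise count as unrelated features.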
The 2020 simultaneous regional elections (Pilkada) in the midst of the COVID-19 pandemic have generated lively discussion both in the real world and in cyberspace, especially on the social medium Twitter. Twitter has been widely used by various communities in recent years and is one of the media that represents public responses to public issues. Ahead of a general election (PEMILU), several parties usually want to know the results of public sentiment on an issue: academics, intellectuals, or even political opponents. Since holding the regional elections has been highly controversial in society, this study analyzes tweets discussing a public issue, namely the 2020 regional elections during the COVID-19 pandemic. Such analysis typically classifies tweets containing public sentiment about the issue. The classification methods used in this research are the Naive Bayes Classifier (NBC) and the Support Vector Machine (SVM). The Naive Bayes Classifier is combined with features that weight terms using probabilities. Tweets in this study are classified by combining two label sets, a sentiment class and a category class. The sentiment classification consists of positive and negative. Test results on the built application show that Naive Bayes delivers better accuracy than the Support Vector Machine; overall, the Naive Bayes method performs well in classifying tweets, with an accuracy rate of 92.2%. Keywords: sentiment analysis, classification, Naive Bayes Classifier.
Social media enable governments to discover events in real time and to forecast public opinion. This study presents a system prototype for measuring public opinion from news channels, Bulletin Board Systems (BBS), and social networking sites, including Facebook. The proposed system aims to improve communication between government officials and ordinary citizens about service delivery. It applies event-driven simulation to accelerate processing speed, and thus provides a better solution for measuring public opinion.
It began in 2004 with a simple idea. By organizing Walks on World Diabetes Day, organisations and individuals could raise awareness about diabetes, and how to prevent it. These Walks would be low-cost, educational, and fun. WDF would help by providing banners, tools, and guidance. Since then, 5 million participants of Global Diabetes Walks worldwide have raised awareness, galvanised communities - and, in some cases, even changed public policy.
Objective. This paper describes the application of a tool for the semantic analysis of a document collection based on term frequency–inverse document frequency (TF-IDF). Methodology. A system based on PHP and MySQL databases is developed for managing a thesaurus, computing TF-IDF (as an indicator of semantic weight), and building a relevance tree (made up of the most relevant concepts of the topic analyzed). The tool was evaluated on the semantic analysis of a document collection in Psychology. Results. The system identified the level of presence of the topic of professional ethics in a collection of documents from the Psychology program. Conclusions. The experience described confirms the viability of the tool for the semantic analysis of a document collection. It underlines the relevance and the capabilities of information professionals in developing tools for information processing. The authors suggest a particular technical approach based on the use of scripts and information flows.
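The TF-IDF weight used above as a semantic-weight indicator can be computed in a few lines. This is a minimal sketch using the plain tf(t,d) · log(N/df(t)) formulation on invented example documents, not the PHP/MySQL system the paper describes:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF weights: tf(t, d) * log(N / df(t))."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()                       # document frequency of each term
    for toks in tokenized:
        df.update(set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)               # raw term frequency in this doc
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = ["professional ethics in psychology",
        "psychology research methods",
        "ethics codes for professional practice"]
w = tf_idf(docs)
```

A term appearing in every document gets weight 0, while rarer terms dominate, which is what makes TF-IDF usable as a relevance signal.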
User reviews provide a rich source of information regarding user interests. Many Web platforms allow or even encourage their visitors to leave feedback on the products and services they have consumed. Term Frequency (TF) and Inverse Document Frequency (IDF) are two factors that have been used extensively to capture users' preferences. This paper collects users' reviews from e-tourism Web platforms, calculates TF and IDF for each user, and adopts a multi-criteria approach in order to quantify users' preferences and dynamically adapt the website's design accordingly. It utilizes the Analytic Hierarchy Process (AHP) and similarity methods to determine the relative importance of terms and Web pages, and then rearranges them in a new website structure. Keywords: Web Adaptation; TF-IDF; AHP; Multi-Criteria Analysis.
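The AHP step mentioned above derives relative importances from a pairwise-comparison matrix. A common approximation of the principal-eigenvector priorities is to normalize each column and average the rows; the 3×3 matrix comparing three hypothetical terms below is an invented example, not data from the paper:

```python
def ahp_priorities(matrix):
    """Approximate AHP priority vector: normalize columns, average rows."""
    n = len(matrix)
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]
    normalized = [[matrix[i][j] / col_sums[j] for j in range(n)]
                  for i in range(n)]
    return [sum(normalized[i]) / n for i in range(n)]

# pairwise[i][j] = how many times more important item i is than item j;
# the matrix is reciprocal: pairwise[j][i] = 1 / pairwise[i][j].
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights = ahp_priorities(pairwise)
print(weights)  # priorities sum to 1; the first item ranks highest
```

The resulting vector can then order terms or Web pages by importance before the similarity-based rearrangement.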
Given the ever-growing volume of information on the internet, continuous improvement of the various information-retrieval techniques is sought in order to achieve more efficient and effective results, finding documents that are increasingly relevant to a given query, and in a timely manner. Although many sites offer search mechanisms for publications, an enormous amount of information is lost, since these searches try to match the exact words and do not take into account the semantic space of a similar context. This work therefore aims to build a closed collection of psychology documents collected from Google Scholar in order to apply the term frequency–inverse document frequency (TF-IDF) weighting function and a computational information-retrieval technique called latent semantic analysis (LSA), drawing on the current literature for the creation, application, and comparison of the results obtained through an algorithm developed in Python by the author.
SMS classification technology is important for helping people deal with SMS messages. Although people can classify SMS messages with little or no effort, it remains difficult for computers. Machine learning offers a promising approach to designing algorithms that train computer programs to classify short text messages efficiently and accurately. In this paper we introduce a weighting method based on statistical estimation of the importance of a word for an SMS categorization problem, which classifies mobile SMS messages into predefined classes such as occasions, friendship, sales, etc. All SMS messages are converted into text documents. After preprocessing, a vector space model is prepared and a weight is assigned to each term. The experiments reported in the paper show that this weighting method significantly improves classification accuracy as measured on many categorization tasks.
The Semantic Web opens up new opportunities for data mining research. Identifying a user's current interests from short-term navigational patterns, rather than from explicit user information, has proved to be a potential source for predicting pages that may interest the user. This helps organizations in various analyses, such as web-site improvement. Various techniques are employed to achieve personalized recommendation. This research employs web usage mining techniques for determining the interests of "similar" users, together with a technique for classifying and matching an online user based on his browsing interests. A novel approach for predicting unvisited pages is employed. The complete next-page prediction process represented in the architecture broadly consists of two components: an offline component and an online component.
The Bangla blogosphere is growing rapidly in the information era, and consequently blogs have diverse layouts and categorizations. In this context, automated blog-post classification is a comparatively more efficient way to organize Bangla blog posts in a standard form so that users can easily find the articles they are interested in. In this research, nine supervised learning models, namely support vector machine (SVM), multinomial naïve Bayes (MNB), multi-layer perceptron (MLP), k-nearest neighbours (k-NN), stochastic gradient descent (SGD), decision tree, perceptron, ridge classifier, and random forest, are utilized and compared for classifying Bangla blog posts. Moreover, to evaluate performance in predicting blog posts across eight categories, three feature extraction techniques are applied: unigram TF-IDF (term frequency-inverse document frequency), bigram TF-IDF, and trigram TF-IDF. The majority of the classifiers achieve above 80% accuracy, and other performance evaluation metrics also show good results when comparing the selected classifiers.
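The unigram, bigram, and trigram features that feed the TF-IDF representations above differ only in the window size over the token sequence. A minimal sketch, with an invented English sample sentence standing in for a Bangla post:

```python
def ngrams(text, n):
    """Contiguous word n-grams of a text, used as TF-IDF features."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

post = "bangla blog post classification"
print(ngrams(post, 1))  # ['bangla', 'blog', 'post', 'classification']
print(ngrams(post, 2))  # ['bangla blog', 'blog post', 'post classification']
print(ngrams(post, 3))  # ['bangla blog post', 'blog post classification']
```

Higher-order n-grams capture word order and short phrases at the cost of a much larger, sparser feature space.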
In recent times, the exponential growth of the Internet has resulted in an enormous number of electronic documents in several regional languages apart from English. Since numerous documents in the Tamil language are being generated from news, blogs, eBooks, and entertainment, automated classification of Tamil documents is needed. As automated Tamil document classification has not been explored proficiently, this study focuses on the development of deep learning (DL) models for Tamil document classification. This paper introduces an ensemble of feature selection with DL-based classification models for Tamil documents. The presented model primarily involves preprocessing to remove unwanted data and improve data quality to a certain extent. Besides, the term frequency-inverse document frequency (TF-IDF) approach is used to extract features from the Tamil documents. In addition, two feature selection (FS) techniques, namely Chi-Squared (CS) and Extra Tree (ET) classifier models, are employed. The proposed method also uses deep neural network (DNN) and convolutional neural network (CNN) models for classification purposes. A detailed experimental analysis is carried out using a Tamil document dataset that we gathered ourselves. The experimental values showed that the ETFS-CNN model obtained effective classification outcomes, with a maximum accuracy of 90%, precision of 90.57%, recall of 90%, and F-score of 89.89%.
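The chi-squared feature-selection step above scores each term by how strongly its presence depends on the class. A minimal sketch of the standard 2×2 contingency-table statistic, with hypothetical document counts rather than figures from the Tamil dataset:

```python
def chi2(n11, n10, n01, n00):
    """Chi-squared statistic for a 2x2 term/class contingency table.
    n11: in-class docs containing the term, n10: in-class docs without it,
    n01: out-of-class docs containing it, n00: out-of-class docs without it.
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den

# A term seen in 40 of 50 in-class docs but only 5 of 50 out-of-class
# docs scores high (kept); a term spread evenly scores zero (dropped).
print(chi2(40, 10, 5, 45))   # strongly class-dependent term
print(chi2(25, 25, 25, 25))  # → 0.0 (class-independent term)
```

Keeping only the top-scoring terms shrinks the TF-IDF feature space before it is fed to the DNN/CNN classifiers.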