
    Arzucan Ozgur

    Customer comments collected by companies through various channels are useful resources for understanding customer satisfaction. The continuous increase in the volume of comments makes manual analysis infeasible. In this study, Turkish-language customer comments on banking services, collected through NPS questionnaires, were analyzed using Natural Language Processing methods. BERT-based sentiment classification models were developed and compared with traditional methods for the banking domain. The effectiveness of the methods was investigated in a low-resource setting, where (i) there is a small amount of labeled training data and (ii) there is no labeled training data in the target domain. For the first case, the results showed that the BERTurk-based model performs better than the traditional models and that its performance is less affected by a decrease in training data size. For the second case, training with out-of-domain data from Twitter was explored. In addition, zero-shot learning with XLM-Roberta, which was pretrained for natural language inference, was investigated. While using out-of-domain data resulted in poor performance, the zero-shot learning approach achieved promising results for sentiment classification in the banking domain.
    Sentiment analysis is one of the key topics in Natural Language Processing and supports applications ranging from social media analysis to stock market prediction. Sentiment analysis datasets are generally collected by semi-supervision through shopping or review websites. These datasets are constructed by mapping users' text reviews to the scores those users gave. However, these datasets might contain errors due to automatic mapping, and they generally lack the characteristic features of social media texts such as emojis, slang, and typos. To address these problems, one of the first manually annotated Turkish Sentiment Analysis datasets from Twitter is proposed. The BounTi dataset contains Turkish tweets about specific universities in Turkey. Furthermore, the performance of multilingual and Turkish transformer models such as MBERT, XLM-Roberta, and BERTurk is analyzed on this dataset. The best proposed model (https://github.com/boun-tabi/BounTi-Turkish-Sentiment-Analysis) is based on BERTurk and achieves a macro-averaged recall of 0.729 on the test set. Finally, a social media analysis demonstration with the best model is performed on Turkish tweets about a food brand. The BounTi dataset, fine-tuned models, and related scripts are publicly released.
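    For readers unfamiliar with the metric reported above, macro-averaged recall is the unweighted mean of the per-class recall values, so each sentiment class counts equally regardless of how many tweets it has. The sketch below is only an illustration of that definition; the function name and the toy labels are invented here and are not taken from the BounTi release.

```python
from collections import defaultdict


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Recall for class c = correctly predicted c / all true c.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy labels invented for illustration (not BounTi data).
y_true = ["pos", "pos", "neg", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "pos", "neu"]
score = macro_recall(y_true, y_pred)
```

    In this toy case the per-class recalls are 1/2, 2/3, and 1, so the macro average is their plain mean; a class with few examples moves the score as much as a large one.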
    Networks describe various complex natural systems, including social systems. We investigate the social network of co-occurrence in the Reuters-21578 corpus, which consists of news articles that appeared on the Reuters newswire in 1987. People are represented as vertices, and two persons are connected if they co-occur in the same article. The network has small-world features with a power-law degree distribution. The network is disconnected, and the component size distribution has power-law characteristics. Community detection on a degree-reduced network provides meaningful communities. An edge-reduced network, which contains only the strong ties, has a star topology. The "importance" of persons is investigated. The network reflects the situation in 1987; after 20 years, a better judgment on the importance of the people can be made. A number of ranking algorithms, including citation count and PageRank, are used to assign ranks to vertices. The ranks given by the algorithms are compared against ho...
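    The construction described in this abstract (people as vertices, an edge when two people appear in the same article, then a ranking over vertices) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the damping factor of 0.85, the fixed iteration count, and the toy articles are all assumptions made here.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(articles):
    """Return {person: {neighbor: weight}}; weight counts the articles shared."""
    graph = defaultdict(lambda: defaultdict(int))
    for people in articles:
        # Each unordered pair of distinct people in an article adds one tie.
        for a, b in combinations(sorted(set(people)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on the undirected, unweighted view of the graph."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbor m passes an equal share of its rank to n.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Toy articles with placeholder names (hypothetical, not from Reuters-21578).
articles = [["A", "B", "C"], ["A", "B"], ["A", "D"]]
graph = build_cooccurrence(articles)
rank = pagerank(graph)
```

    People who never co-occur with anyone do not enter this graph at all, which is one simple way to handle isolated vertices; the edge weights are kept so that a threshold on them could produce a strong-ties-only network like the edge-reduced one mentioned above.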
    This work covers the processing and classification of tweets written in Turkish. Tweet datasets from four different sectors are vectorized with a word embedding model and classified with Support Vector Machine and Random Forest classifiers, and the results are compared. We have shown that sector-based tweet classification is more successful than classification of general tweets. Accuracy rates of 89.97% for the Banking sector, 84.02% for Football, 73.86% for Telecom, 63.68% for Retail, and 74.60% overall were achieved.
    With the increasing availability of images on the web, identifying image-related sentences has become an important problem. This research area is also important for the news publishing community, for automatic captioning of news images and for summarization. Although a large body of research has been devoted to image captioning, it is still a challenging problem. Previous work on image captioning mostly focuses on generating new captions for the images. The problem of identifying image-related sentences in news articles is discussed in this study for the first time; it is novel because we do not try to generate a caption from scratch but instead select the most appropriate set of sentences for the image from the news text itself. We used the CNN news dataset, which contains only the text parts of news articles, as a basis and augmented it by collecting the images of the news articles. We generated two class ground truth for the image and sentences of news by using Tf-Idf and W...
    This paper addresses the task of political orientation prediction: assigning a person to one of the ‘democrat’ or ‘republican’ classes based on Twitter data produced by Republican and Democrat voters. We used Long Short Term Memory Recurrent Neural Networks and Support Vector Machines to model the classification process. Long Short Term Memory Recurrent Neural Networks performed better, achieving an accuracy of 77.92%.
    Coronavirus Disease of 2019 (COVID-19) created dire consequences globally and triggered an enormous scientific effort from different domains. The resulting publications formed a gigantic domain-specific collection of text in which finding studies on a biomolecule of interest is quite challenging for general-purpose search engines due to the terminology-rich nature of the publications. Here, we present Vapur, an online COVID-19 search engine specifically designed for finding related protein-chemical pairs. Vapur is empowered with a biochemically related entities-oriented inverted index in order to group studies relevant to a biomolecule with respect to its related entities. The inverted index of Vapur is automatically created with a BioNLP pipeline and integrated with an online user interface. The online interface is designed for the smooth traversal of the current literature and is publicly available at https://tabilab.cmpe.boun.edu.tr/vapur/.
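    The idea of an inverted index organized around related entities can be illustrated with a small sketch: instead of mapping a term to documents, each entity maps to its related entities, and each related pair maps to the studies that mention it. This is only a guess at the general shape of such an index, not Vapur's actual data structure; the function name, the input format, and the placeholder entity and document names are hypothetical.

```python
from collections import defaultdict


def build_pair_index(documents):
    """Map each entity to {related entity: set of ids of documents with the pair}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, pairs in documents.items():
        for protein, chemical in pairs:
            # Index in both directions so a query can start from either entity.
            index[protein][chemical].add(doc_id)
            index[chemical][protein].add(doc_id)
    return index


# Hypothetical input: document id -> (protein, chemical) pairs found in it,
# standing in for the output of an upstream text-mining pipeline.
documents = {
    "doc1": [("protein_A", "chemical_X")],
    "doc2": [("protein_A", "chemical_X"), ("protein_B", "chemical_Y")],
    "doc3": [("protein_A", "chemical_Z")],
}
index = build_pair_index(documents)
```

    Looking up `index["protein_A"]` then returns the studies on that protein already grouped by related chemical, which matches the grouping behavior the abstract describes.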
    Motivation: Prediction of the interaction affinity between proteins and compounds is a major challenge in the drug discovery process. WideDTA is a deep-learning based prediction model that employs chemical and biological textual sequence information to predict binding affinity. Results: WideDTA uses four text-based information sources, namely the protein sequence, ligand SMILES, protein domains and motifs, and maximum common substructure words, to predict binding affinity. WideDTA outperformed DeepDTA, one of the state-of-the-art deep learning methods for drug-target binding affinity prediction, on the KIBA dataset with statistical significance. This indicates that the word-based sequence representation adopted by WideDTA is a promising alternative to the character-based sequence representation approach in deep learning models for binding affinity prediction, such as the one used in DeepDTA. In addition, the results showed that, given the protein sequence and ligand SMILES, the incl...
    The biomedical literature is growing rapidly. This increases the need for text mining techniques that automatically extract biologically important information, such as protein-protein interactions, from free text. Besides identifying an interaction and the interacting pair of proteins, it is also important to extract from the full text the most relevant sentences describing that interaction. These issues were addressed in the BioCreAtIvE II (Critical Assessment for Information Extraction in Biology) challenge evaluation as sub-tasks under the protein-protein interaction extraction (PPI) task. We present our approach of using dependency parsing and machine learning techniques to identify interacting protein pairs from full text articles (Protein Interaction Pairs Sub-task 2 (IPS)) and to extract the most relevant sentences that describe their interaction (Protein Interaction Sentences Sub-task 3 (ISS)).
    The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins, suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a Simplified Molecular Input Line Entry System (SMILES)-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them based on their ligands. The proteins are defined using the word embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in a protein clustering task using the TransClust and MCL algorithms. Two other protein representation methods that utilize protein sequence, Basic Local Alignment Search Tool (BLAST) and ProtVec, and two compound fingerprint-based protein representation...
    Adverse drug reactions (ADRs), also called drug adverse events (AEs), are reported in the FDA drug labels; however, it is a major challenge to properly retrieve and analyze the ADRs and their potential relationships from textual data. Previously, we identified and ontologically modeled over 240 drugs that can induce peripheral neuropathy by mining public drug-related databases and drug labels. However, the ADR mechanisms of these drugs are still unclear. In this study, we aimed to develop an ontology-based literature mining system to identify ADRs from drug labels and to elucidate potential mechanisms of the neuropathy-inducing drugs (NIDs). We developed and applied an ontology-based SciMiner literature mining strategy to mine ADRs from the drug labels provided in the Text Analysis Conference (TAC) 2017, which included drug labels for 53 NIDs. We identified an average of 243 ADRs per NID and constructed an ADR-ADR network, which consists of 29 ADR n...
    BioC is a simple XML format for text, annotations, and relations, developed to achieve interoperability for biomedical text processing. Following the success of BioC in BioCreative IV, the BioCreative V BioC track addressed a collaborative task to build an assistant system for BioGRID curation. In this paper, we describe the framework of the collaborative BioC task and discuss our findings based on the user survey. This track consisted of eight subtasks, including gene/protein/organism named entity recognition, protein-protein/genetic interaction passage identification, and annotation visualization. Using BioC as their data-sharing and communication medium, nine teams from around the world participated and contributed either new methods or improvements of existing tools to address different subtasks of the BioC track. Results from different teams were shared in BioC and made available to other teams as they addressed different subtasks of the track. In the end, all submitted runs were m...
    Twitter is an extremely high-volume platform for user-generated contributions on any topic. The wealth of content created in real time and in massive quantities calls for automated approaches to identify the topics of the contributions. Such topics can be utilized in numerous ways, such as public opinion mining, marketing, entertainment, and disaster management. Towards this end, approaches to relate single or partial posts to knowledge base items have been proposed. However, in microblogging systems like Twitter, topics emerge from the culmination of a large number of contributions. Therefore, it is necessary to identify topics based on collections of posts, where individual posts contribute to some aspect of the greater topic. Models such as Latent Dirichlet Allocation (LDA) propose algorithms for relating collections of posts to sets of keywords that represent underlying topics. In these approaches, figuring out what the specific topic(s) the keyword sets represent remains as a...
    In this study, we analyzed the effects of applying different levels of stemming approaches, such as fixed-length word truncation and morphological analysis, for multi-document summarization (MDS) on Turkish, which is an agglutinative and morphologically rich language. We constructed a manually annotated MDS data set and, to the best of our knowledge, reported the first results on Turkish MDS. Our results show that a simple fixed-length word truncation approach performs slightly better than no stemming, whereas applying complex morphological analysis does not improve Turkish MDS.
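    Fixed-length word truncation, the simpler of the stemming approaches named above, amounts to keeping only the first few characters of each word. The sketch below illustrates the idea; the prefix length of 5 and the example words are assumptions chosen here for illustration, since the abstract does not state the length that was used.

```python
def truncate_stem(tokens, length=5):
    """Keep the first `length` characters of each token (fixed-length truncation)."""
    return [tok[:length] for tok in tokens]


# Inflected and derived forms built on the Turkish root "kitap" (book),
# plus a short word that is left unchanged.
tokens = ["kitaplar", "kitaplarımız", "kitapçı", "ev"]
stems = truncate_stem(tokens)
```

    Because Turkish is suffixing, the first characters of a word usually cover its root, so several surface forms collapse to one prefix without any morphological analyzer. The trade-off is that unrelated words sharing a prefix are merged as well.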
    G protein-coupled receptors (GPCRs) are probably the most attractive drug target membrane proteins, constituting nearly half of the drug targets in the contemporary drug discovery industry. While the majority of drug discovery studies employ existing GPCR and ligand interactions to identify new compounds, there remains a shortage of specific databases with precisely annotated GPCR-ligand associations. We have developed a new database, GLASS, which aims to provide a comprehensive, manually curated resource for experimentally validated GPCR-ligand associations. A new text-mining algorithm was proposed to collect GPCR-ligand interactions from the biomedical literature; these interactions are then cross-checked against five primary pharmacological datasets to enhance the coverage and accuracy of the identified GPCR-ligand associations. A special architecture has been designed to allow users to perform homologous ligand searches with flexible bioactivity parameters. The current database contains appro...
