Aline Paes
    The number of extreme weather events has been increasing across the planet, strongly impacting large urban centers, especially those that have grown in a disorderly fashion. In these regions, floods and landslides cause many deaths every year. Planning that helps to anticipate, react to, and avoid such events is therefore of fundamental importance. In this article, we present a platform called @WeatherNit for monitoring rainfall and weather events in the city of Niterói. The platform provides interactive visualization of historical and real-time data on accumulated rainfall volumes and on flood and landslide occurrences, integrated and stored using a Data Lakehouse. The platform was evaluated in a case study using data from CEMADEN and from the Niterói City Hall, which demonstrated the potential of the approach for crisis monitoring and for supporting the development of public policies.
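A minimal sketch of the kind of aggregation such a platform visualizes: accumulated rainfall per station over a rolling window, computed from timestamped gauge readings. The table layout and column names are assumptions for illustration, not the platform's actual schema.

```python
import pandas as pd

# Hypothetical gauge readings; in the paper's setting these would come from
# the Data Lakehouse that integrates CEMADEN and city-hall data.
readings = pd.DataFrame({
    "station": ["centro", "centro", "centro", "icarai"],
    "ts": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:15",
                          "2024-01-01 10:30", "2024-01-01 10:30"]),
    "mm": [2.0, 5.5, 1.0, 0.5],
})

# Accumulated rainfall per station over the last hour, at each reading.
acc = (readings.set_index("ts").sort_index()
       .groupby("station")["mm"].rolling("1h").sum())
print(acc)
```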
    Although studies on how to measure the readability of a text date back to the last century, there is still no consensus on which metrics are best. Natural Language Processing (NLP) tools can support this task, but they depend on a large number of training samples, which is a barrier to their advancement. The main goal of this article is to analyze the impact of certain data augmentation (DA) methods in overcoming this barrier and supporting readability classification in Brazilian Portuguese (BP). To that end, a paired and classified corpus was established, with complex original texts and their simplified versions on science topics, developed by linguists. This corpus was augmented with agnostic DA techniques: synonym substitution (SS) and back-translation (RT). We evaluated 75 models with different techniques and combinations of input attributes. The best result obtained for the set of corpus texts without augment...
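A hedged sketch of the two agnostic augmentation techniques the abstract names, SS and RT. The helpers `synonyms_for` and `translate` are hypothetical stand-ins for a thesaurus lookup and a machine-translation service; they are passed in as parameters precisely because the abstract does not specify them.

```python
import random

def synonym_substitution(tokens, synonyms_for, rate=0.1):
    """Replace a fraction of tokens with a randomly chosen synonym (SS)."""
    augmented = list(tokens)
    for i, tok in enumerate(augmented):
        candidates = synonyms_for(tok)  # hypothetical thesaurus lookup
        if candidates and random.random() < rate:
            augmented[i] = random.choice(candidates)
    return augmented

def back_translation(text, translate, pivot="en"):
    """Translate pt -> pivot -> pt to obtain a paraphrased sample (RT)."""
    return translate(translate(text, src="pt", dst=pivot), src=pivot, dst="pt")
```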
    The Brazilian judiciary carries a heavy workload, which leads to long times for the conclusion of judicial proceedings. Several digitization initiatives have emerged, opening up the possibility of using computational resources to support everyday tasks in the legal domain. The legal domain deals mostly with textual data, and Artificial Intelligence offers techniques that can support these everyday tasks, speeding up the proceedings. However, the legal-domain datasets required by some current Artificial Intelligence techniques are scarce and hard to obtain, since they require annotations by experts. This article presents four legal-domain datasets: two with document corpora and some metadata but no labels, and two annotated with a heuristic aimed at their use in the textual semantic similarity task.
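The abstract does not spell out the labeling heuristic, but a metadata-based rule is one plausible shape for it: document pairs sharing a key field are labeled similar, all others dissimilar. The field name and keys below are purely illustrative.

```python
def label_pairs(docs, field="subject"):
    """Heuristically label document pairs for semantic textual similarity.

    docs: list of dicts carrying a text and a metadata field; pairs sharing
    the metadata value get label 1 (similar), otherwise 0. Illustrative only.
    """
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            label = int(docs[i][field] == docs[j][field])
            pairs.append((docs[i]["text"], docs[j]["text"], label))
    return pairs
```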
    Modeling business processes as a set of activities to accomplish goals naturally leads to their being executed several times. Usually, such executions produce a large amount of provenance data in different formats such as text, audio, and video. Such a multiple-type nature gives rise to multimodal provenance data. Analyzing multimodal provenance data in an integrated form may be complex and error-prone when performed manually, as it requires extracting information from free-text, audio, and video files. However, such an analysis may generate valuable insights into the business process. The present article presents MINERVA (Multimodal busINEss pRoVenance Analysis). This approach focuses on identifying improvements that can be implemented in business processes, as well as on collaboration analysis using multimodal provenance data. MINERVA was evaluated through a feasibility study that used data from a consulting company.
    Scientific Workflows (SWfs) have revolutionized how scientists in various domains of science conduct their experiments. The management of SWfs is performed by complex tools that provide support for workflow composition, monitoring, execution, capturing, and storage of the data generated during execution. In some cases, they also provide components to ease the visualization and analysis of the generated data. During the workflow’s composition phase, programs must be selected to perform the activities defined in the workflow specification. These programs often require additional parameters that serve to adjust the program’s behavior according to the experiment’s goals. Consequently, workflows commonly have many parameters to be manually configured, in many cases encompassing more than one hundred. Choosing wrong parameter values can crash workflow executions or produce undesired results. As the execution of data- and compute-intensive workflows is commonly performed ...
    Arguably, player behavior profiling is one of the most relevant tasks of Game Analytics. However, to fulfill the needs of this task, gameplay data should be handled so that the player behavior can be profiled and even understood. Usually, gameplay data is stored as raw log-like files, from which gameplay metrics are computed. However, gameplay metrics have been commonly used as input to classify player behavior with two drawbacks: (1) gameplay metrics are mostly handcrafted and (2) they might not be adequate for fine-grained analysis, as they are only computed after key events, such as stage or game completion. In this paper, we present a novel approach for player profiling based on provenance graphs, an alternative to log-like files that models causal relationships between entities in the game. Our approach leverages recent advances in deep learning over graph representations of player states and their neighboring contexts, requiring no handcrafted features. We perform clustering on the learned node representations to profile player behavior at a fine grain in provenance data collected from a multiplayer battle game, and we assess the obtained profiles through statistical analysis and data visualization.
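A minimal sketch of the profiling step, assuming the graph-learning stage has already produced one embedding vector per player state (the abstract leaves the network architecture to the paper). Random vectors stand in for the learned representations, and k=4 is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for learned player-state embeddings (n_states x dim).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))

# Cluster the representations; each cluster is a candidate behavior profile.
profiles = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)
print(np.bincount(profiles))  # how many states fall into each profile
```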
    Most scientific experiments can be modeled as workflows. These workflows are usually computing- and data-intensive, demanding the use of high-performance computing environments such as clusters, grids, and clouds. The latter offers the advantage of elasticity, which allows for changing the number of virtual machines (VMs) on demand. Workflows are typically managed using scientific workflow management systems (SWfMS). Many existing SWfMSs offer support for cloud-based execution. Each SWfMS has its own scheduler that follows a well-defined cost function. However, such cost functions should consider the characteristics of a dynamic environment, such as live migrations or performance fluctuations, which are far from trivial to model. This article proposes a novel scheduling strategy, named ReASSIgN, based on reinforcement learning (RL). By relying on an RL technique, one may assume that there is an optimal (or suboptimal) solution for the scheduling problem, and the strategy aims at learning the best scheduling based on previous executions, in the absence of a mathematical model of the environment. For this, an extension of the well-known workflow simulator WorkflowSim is proposed to implement an RL strategy for scheduling workflows. Once the scheduling plan is generated via simulation, the workflow is executed in the cloud using the SciCumulus SWfMS. We conducted a thorough evaluation of the proposed scheduling strategy using a real astronomy workflow named Montage.
    ILP has been successfully applied to a variety of tasks. Nevertheless, ILP systems have huge time and storage requirements, owing to a large search space of possible clauses. Therefore, clever search strategies are needed. One promising family of search strategies is that of stochastic local search methods. These methods have been successfully applied to propositional tasks, such as satisfiability, substantially improving their efficiency. Following the success of such methods, a promising research direction is to employ stochastic local search within ILP, to accelerate the runtime of the learning process. An investigation in that direction was recently performed within ILP [Železný et al., 2004]. Stochastic local search algorithms for propositional satisfiability benefit from the ability to quickly test whether a truth assignment satisfies a formula. As a result, many possible solutions (assignments) can be tested and scored in a short time. In contrast, the analogous test with...
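For contrast with the ILP setting, here is a sketch of the random-walk variant of stochastic local search for satisfiability: the cheap satisfaction test is what lets it score thousands of candidate assignments quickly. Clauses use DIMACS-style signed integers.

```python
import random

def walksat(clauses, n_vars, max_flips=10_000):
    """Random-walk local search for SAT. clauses: list of lists of signed ints."""
    assign = [random.random() < 0.5 for _ in range(n_vars + 1)]  # index 0 unused
    satisfied = lambda lit: assign[abs(lit)] == (lit > 0)
    for _ in range(max_flips):
        unsat = [c for c in clauses if not any(satisfied(l) for l in c)]
        if not unsat:
            return assign  # model found
        # Flip a variable from a randomly chosen violated clause.
        var = abs(random.choice(random.choice(unsat)))
        assign[var] = not assign[var]
    return None  # no model found within the flip budget

print(walksat([[1, 2], [-1, 3], [-2, -3]], n_vars=3) is not None)
```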
    Social media has become a popular environment for communication. Therefore, analyzing the sentiment users express in their social network posts is an important research field. However, detecting the polarity of such content is challenging, partly because the amount of labeled data for training classifiers is scarce in many situations. This article explores strategies for reusing a model learned from a source dataset to classify instances of a target dataset. The experiments are conducted with 22 tweet sentiment analysis datasets and approaches based on similarity metrics. The results indicate that the size of the source training set plays an essential role in the classifiers' performance when they are used to infer the class of target instances.
    During the definition and execution of business processes, provenance data in different formats is collected, and analyzing it in an integrated form is a complex and error-prone task if performed manually. However, such integration can yield insights into the process. This article presents the MINERVA (Multimodal busINEss pRoVenance Analysis) approach, which enables collaboration analysis and the identification of improvement points in business processes by means of multimodal provenance data and graph-oriented databases. The approach was evaluated through a feasibility study with real data from a consulting company.
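A hedged sketch of the kind of graph such an approach can query: activities, agents, and artifacts as nodes, provenance relations as edges. networkx stands in for the graph-oriented database the abstract mentions, and the relation names borrow PROV vocabulary as an assumption about the model.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge("review_report", "alice", relation="wasAssociatedWith")
g.add_edge("review_report", "bob", relation="wasAssociatedWith")
g.add_edge("report_v2", "review_report", relation="wasGeneratedBy")

# A simple collaboration query: agents associated with the same activity.
agents = [n for n, attrs in g["review_report"].items()
          if attrs["relation"] == "wasAssociatedWith"]
print(agents)  # ['alice', 'bob']
```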
    Mental disorders such as depression and anxiety have been increasing at alarming rates in the worldwide population. Notably, major depressive disorder has become a common problem among higher education students, aggravated, and maybe even occasioned, by the academic pressures they must face. While the reasons for this alarming situation remain unclear (although widely investigated), a student already facing this problem must receive treatment. For that, it is first necessary to screen the symptoms. The traditional way to do so relies on clinical consultations or questionnaires. However, nowadays, the data shared on social media is a ubiquitous source that can be used to detect depression symptoms even when the student cannot afford or seek professional care. Previous works have already relied on social media data to detect depression in the general population, usually focusing on either posted images or texts or relying on metadata. In this work,...
    Description logics based languages have become the standard representation scheme for ontologies. They formalize the domain knowledge using interrelated concepts, contained in terminologies. The manual definition of terminologies is an expensive and error-prone task; therefore, automatic learning methods are a necessity. In this paper, we lay the foundations of a multiple concept learning method that uses virtual concepts to aid the learning process, yielding more compact and readable terminologies. We define virtual concepts and show how they can be incorporated into current concept learning methods, and we show through experiments how the method stacks up against other multiple concept learning methods.
    Sustainability is a major concern of our time. Companies can have a considerable negative impact on the environment, as their productive processes often contribute to the emission of greenhouse gases, generate toxic waste, and consume natural resources. As such, they also have a great share of the responsibility towards our sustainable development. This paper presents a vision for a method to identify and mitigate unsustainable practices in business organisations. The method is inspired by the KAOS framework, in the sense that it offers catalogues for the identification and resolution of obstacles to sustainability. The method will be complemented by a metamodel, a semi-structured language, and a knowledge base, to eventually allow automation.
    Automatic retrieval of information is better achieved when the data has previously been classified into categories. However, given the amount of information currently available, and constantly being produced, it is very hard to classify all of it manually. Thus, it is essential to establish a methodology able not only to retrieve data, but also to present it under its correct class. One way to accomplish this latter goal is to use machine learning algorithms for classification. In this work, we tackle this problem using a deep learning technique, given the increasing success of this area in classification tasks. Particularly, we address the automatic classification of images related to the Olympic Games, focusing on predicting which sport an image is associated with. We present an extensive empirical study encompassing the training of the deep learning network known as GoogLeNet with 26 of the 42 Olympic sports disciplines and 5362 images. From the results, we show that GoogLeN...
    The growth of criminality in Brazilian cities is a common theme addressed by the media as well as by the legal authorities. To effectively reduce criminality, people and infrastructure must be carefully involved, not only to punish those who have committed crimes, but also to predict and prevent them. Since acquiring official data about crimes is far from trivial, citizens have become important data sources through Web-based collaborative systems. These systems provide a huge volume of data that has to be analyzed. How to analyze this volume of data and identify crime patterns is an important, yet open, issue. Thus, this work presents a system called SiAPP. Its main objective is to support the analysis and prediction of crime patterns using a machine learning algorithm. SiAPP automatically acquires data from collaborative sources, generates logical rules, and visualizes the found patterns. Experimental analysis shows that SiAPP is a promising tool to assist in crime prevention.
    Smart city initiatives have the potential to improve the lives of citizens in a huge number of dimensions. One of them is the development of techniques and services capable of contributing to the enhancement of public security policies. Finding criminal patterns in historical data would arguably help in predicting and even preventing the thefts and burglaries that continuously increase in urban centers worldwide. However, accessing such a history and finding patterns across the interrelated crime occurrence data are challenging tasks, particularly in developing countries. In this paper, we address these problems by combining three techniques: we collect crime data from existing crowd-sourcing systems, we automatically induce patterns with relational machine learning, and we manage the entire process using scientific workflows. The framework developed along these lines is named CRiMINaL (Crime patteRn MachINe Learning). Experimental results conducted from a popular Brazilian sour...
    Scientific workflows are a de facto standard for modeling scientific experiments. However, several workflows have too many parameters to be manually configured. Poor choices of parameter values may lead to unsuccessful executions of the workflow. In this paper, we present FReeP, a parameter recommendation algorithm that suggests a value for a parameter in agreement with the user's preferences. FReeP is based on the preference learning technique. A preliminary experimental evaluation performed over the SciPhy workflow showed the feasibility of FReeP in recommending parameter values for scientific workflows.
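A hedged illustration of the recommendation setting FReeP addresses: given a history of successful runs and the parameter values the user has already fixed, suggest a value for one remaining parameter. The majority vote below is a simple stand-in for FReeP's actual preference-learning machinery, which the abstract does not detail; the parameter names are invented for the example.

```python
from collections import Counter

def recommend(history, preferences, target):
    """history: list of dicts (past runs); preferences: params the user fixed."""
    # Keep only past runs compatible with the user's preferences.
    matching = [run for run in history
                if all(run.get(k) == v for k, v in preferences.items())]
    votes = Counter(run[target] for run in matching if target in run)
    return votes.most_common(1)[0][0] if votes else None

history = [{"model": "GTR", "bootstrap": 100},
           {"model": "GTR", "bootstrap": 1000},
           {"model": "HKY", "bootstrap": 100}]
print(recommend(history, {"model": "GTR"}, "bootstrap"))  # 100 (tie -> first seen)
```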
    With the growth of social media such as Twitter, plenty of user-generated data emerges daily. The short texts published on Twitter, the tweets, have earned significant attention as a rich source of information to guide many decision-making processes. However, their inherent characteristics, such as an informal and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks, including sentiment analysis. Sentiment classification is tackled mainly by machine-learning-based classifiers. The literature has adopted word representations of distinct natures to transform tweets into vector-based inputs to feed sentiment classifiers. The representations range from simple count-based methods, such as bag-of-words, to more sophisticated ones, such as BERTweet, built upon the trendy BERT architecture. Nevertheless, most studies mainly focus on evaluating those models using only a small number of datasets. Despite the progress made in recent years in language mo...
    The use of social media data to mine opinions during elections has emerged as an alternative to traditional election polls. However, relying on social media data in electoral scenarios comes with a number of challenges, such as tackling sentences with domain-specific terms, texts full of hate speech, noisy and informal vocabulary, and sarcasm and irony. Also, on Twitter, for instance, loss of context may occur due to the character limit imposed on posts. Furthermore, prediction tasks that use machine learning require labeled datasets, and it is not trivial to annotate them reliably during the short period of campaigns. Motivated by the aforementioned issues, we investigate whether it is possible to use or mix curated datasets from other domains as a starting point for opinion mining tasks during elections. To avoid introducing knowledge from other domains that could end up disturbing the task, we propose to use similarity metrics that indicate whether or not a dataset should be used. In our approach, we conduct a case study using the 2018 Brazilian Presidential Elections and labeled sentiment analysis datasets from other domains. To identify the similarity between the datasets, we use the Jaccard distance and a metric based on word embeddings. Our experimental results show that, by taking into account the (dis)similarity between different domains, it is possible to achieve results close to those that would be achieved with classifiers trained on annotated datasets from the electoral domain.
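The Jaccard distance the abstract mentions can be computed at the vocabulary level; a minimal sketch, assuming the datasets are compared as token sets after simple whitespace tokenization:

```python
def jaccard_distance(texts_a, texts_b):
    """1 - |A ∩ B| / |A ∪ B| over the two corpora's vocabularies."""
    vocab_a = {tok for text in texts_a for tok in text.lower().split()}
    vocab_b = {tok for text in texts_b for tok in text.lower().split()}
    return 1.0 - len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# Lower distance suggests the source dataset is lexically closer to the
# electoral target domain and thus a better candidate for reuse.
print(jaccard_distance(["vote for her", "great debate"],
                       ["the debate tonight", "vote now"]))
```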
    Election polls are the de facto mechanism to predict political outcomes. Traditionally, they are conducted with personal interviews and questionnaires. This process is costly and time-consuming, demanding the development of alternative approaches that are faster and less expensive. On the other hand, social media has emerged as an important tool for people to express their opinions about candidates in electoral scenarios. In this context, there is an increasing number of election prediction approaches using social media and opinion mining, modeling this problem in different ways. In this work, we present a survey of approaches to election prediction and discuss the many decision points in the general process of constructing solutions to this end, including the quantity of collected data, the specific social media used, the collection period, and the algorithms and prediction approaches adopted, among other aspects. Our overview allowed us to identify the main factors that should be considered when predicting election outcomes supported by social media content, as well as the main open issues and limitations of the approaches found in the literature for data science communities. In brief, the main challenges we found include, but are not limited to: labeling data reliably during the short period of electoral campaigns, the absence of a robust methodology to collect and analyze data, the non-availability of domain (labeled) datasets, the lack of a standard to evaluate the obtained results, and the exploration of new machine learning algorithms and methods for tackling the peculiarities of this scenario.
    The Internet’s popularization has increased the amount of content produced and consumed on the web. To take advantage of this new market, major content producers such as Netflix and Amazon Prime have emerged, focusing on video streaming services. However, despite the large number and diversity of videos made available by these content providers, few of them attract the attention of most users. For example, in the data explored in this article, only 6% of the most popular videos account for 85% of total views. Finding out in advance which videos will be popular is not trivial, especially given the many influencing variables. Nevertheless, a tool with this ability would be of great value to help dimension network infrastructure and properly recommend new content to users. In this way, this manuscript examines the machine learning-based approaches that have been proposed to predict web content popularity. To this end, we first survey the literature and elaborate a taxonomy ...
    An ontology formalises a number of dependent and related concepts in a domain, encapsulated as a terminology. Manually defining such terminologies is a complex, time-consuming and error-prone task. Thus, there is great interest in strategies to learn terminologies automatically. However, most of the existing approaches induce a single concept definition at a time, disregarding dependencies that may exist among the concepts. As a consequence, terminologies that are difficult to interpret may be induced. Thus, systems capable of learning all concepts within a single task, respecting their dependencies, are essential for reaching concise and readable ontologies. In this paper, we tackle this issue by presenting three terminology learning strategies that aim at finding dependencies among concepts before, during, or after they have been defined. Experimental results show the advantages of regarding the dependencies among the concepts to achieve readable and concise terminologies, compared to...
    Making artificial agents that learn how to play is a long-standing goal in the area of Game AI. Recently, several successful cases have emerged, driven by Reinforcement Learning (RL) and neural network-based approaches. However, in most of these cases, the results have been achieved by training directly from pixel frames, at the cost of valuable computational resources. In this paper, we devise agents that learn how to play the popular game Bomberman by relying on state representations and RL-based algorithms, without looking at the pixel level. To that end, we designed five vector-based state representations and implemented Bomberman on top of the Unity game engine through the ML-Agents toolkit. We enhance the ML-Agents algorithms by developing an imitation-based learner (IL) that improves its model with the Actor-Critic Proximal Policy Optimization (PPO) method. We compared this approach with a PPO-only learner that uses either a Multi-Layer Perceptron or a Long Short-Term Memory (LSTM) network. We conducted several training and tournament experiments by making the agents play against each other. The hybrid state representation and our IL followed by PPO learning achieve the best overall quantitative results, and we also observed that their agents learn correct Bomberman behavior.
    Scientific experiments can be modeled as workflows. Such workflows are usually computing- and data-intensive, demanding the use of high-performance computing environments such as clusters, grids, and clouds. The latter offers the advantage of elasticity, which allows for increasing and/or decreasing the number of Virtual Machines (VMs) on demand. Workflows are typically managed using Scientific Workflow Management Systems (SWfMS). Many existing SWfMSs offer support for cloud-based execution. Each SWfMS has its own scheduler that follows a well-defined cost function. However, such cost functions must consider the characteristics of a dynamic environment, such as live migrations and/or performance fluctuations, which are far from trivial to model. This paper proposes a novel scheduling strategy, named ReASSIgN, based on Reinforcement Learning (RL). By relying on an RL technique, one may assume that there is an optimal (or sub-optimal) solution for the scheduling problem, and the strategy aims at learning the best scheduling based on previous executions, in the absence of a mathematical model of the environment. For this, an extension of the well-known workflow simulator WorkflowSim is proposed to implement an RL strategy for scheduling workflows. Once the scheduling plan is generated, the workflow is executed in the cloud using the SciCumulus SWfMS. We conducted a thorough evaluation of the proposed scheduling strategy using a real astronomy workflow.
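A minimal tabular Q-learning skeleton for a scheduling setting like this one: states could encode the ready task and current VM loads, actions assign the task to a VM, and the reward penalizes the predicted makespan. All environment detail is hidden behind `env`, a hypothetical simulator interface (in the paper's pipeline that role is played by the WorkflowSim extension); the paper's exact RL formulation is not specified in the abstract.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """env must provide reset() -> state, actions(state) -> list,
    and step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            # Epsilon-greedy: mostly exploit the current Q estimates.
            action = (random.choice(actions) if random.random() < eps
                      else max(actions, key=lambda a: Q[(state, a)]))
            nxt, reward, done = env.step(action)
            best_next = max((Q[(nxt, a)] for a in env.actions(nxt)), default=0.0)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = nxt
    return Q  # greedy policy: argmax over Q[(state, a)]
```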
    In recent years, several studies have proposed different Reinforcement Learning (RL) methods and formulations for the Traffic Signal Control problem, presenting promising results and flexible traffic light solutions. These studies generally choose travel time as the objective to optimize. However, travel time has some crucial shortcomings: it is not easily decomposable into rewards; it hinders analysis at any simulation time but the very end; and it produces unrealistic results in deadlock and starvation situations. In this paper, we propose two versions of objectives based on time loss, namely instant time loss per driver and consolidated time loss per driver, to address the shortcomings of travel time. We also show that improving time loss implies improving travel time and that there is a direct relationship between the time loss objective and the agents' reward. Our experimental results point out that our time loss-based RL formulation improves the time savings by 6% when compared to other commonly-adopted state-of-the-art formulations.
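A sketch of the instant variant, under one common definition of time loss: a vehicle traveling at speed v where the allowed speed is v_max loses (1 - v/v_max) seconds of progress per simulated second. How the paper samples vehicles and feeds the reward to agents is an assumption here.

```python
def instant_time_loss_per_driver(vehicles):
    """vehicles: iterable of (speed, allowed_speed) pairs at the current step."""
    losses = [1.0 - speed / allowed for speed, allowed in vehicles if allowed > 0]
    return sum(losses) / len(losses) if losses else 0.0

# As an RL reward, negate the loss so that less time lost means more reward;
# the consolidated variant would accumulate these values over the episode.
step_reward = -instant_time_loss_per_driver([(8.0, 13.9), (0.0, 13.9)])
print(step_reward)
```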
    Analyzing electoral trends in political scenarios using social media with data mining techniques has become popular in recent years. A problem in this field is to reliably annotate data during the short period of electoral campaigns. In this paper, we present a methodology to measure labeling divergence and an exploratory analysis of data related to the 2018 Brazilian Presidential Elections. As a result, we point out some of the main characteristics that lead to a high level of divergence during the annotation process in this domain. Our analysis shows a high degree of divergence mainly in regard to sentiment labels. Also, a significant difference was identified between labels obtained by manual annotation and labels obtained using an automatic annotation approach.
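The abstract does not name its divergence measure, but a standard way to quantify annotator disagreement of this kind is an agreement coefficient such as Cohen's kappa; the snippet below illustrates that general idea, not the paper's actual methodology.

```python
from sklearn.metrics import cohen_kappa_score

# Sentiment labels from two annotators over the same five tweets.
annotator_a = ["pos", "neg", "neu", "neg", "pos"]
annotator_b = ["pos", "neg", "pos", "neg", "neu"]

# Kappa corrects raw agreement for chance: 1.0 means perfect agreement,
# values near 0 mean agreement no better than chance.
print(cohen_kappa_score(annotator_a, annotator_b))
```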
    The Internet's popularization has increased the amount of content produced and consumed on the Web. To take advantage of this new market, major content producers such as Netflix and Amazon Prime have emerged, focusing on video streaming services. However, despite the large number and diversity of videos made available by these content providers, few of them attract most users' attention. For example, in the data explored in this paper, only 6% of the most popular videos are responsible for 85% of the total views. Finding out in advance which videos will be popular is not trivial, especially because of the large number of influencing variables. Nevertheless, a tool with this ability would be of great value to help dimension network infrastructure and to properly recommend new content to users. In this work, we propose two approaches to obtaining features to classify the popularity of a video before it is published. The first one builds upon predictive attributes defined by feature engineering. The second leverages word embeddings from the descriptions and titles of the videos. We experiment with the proposed approaches on a set of videos from GloboPlay, the largest provider of video streaming services in Latin America. A combination of both engineered features and the embeddings using the Random Forest machine learning algorithm reached the best result, with an accuracy of 87%.
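A sketch of the combined strategy, assuming averaged word vectors as the text embedding and leaving the engineered attributes as an opaque vector; how the paper actually derives either representation is not specified in the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def text_embedding(text, word_vectors, dim=300):
    """Average the vectors of known words; zeros if none are known."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def build_features(engineered, title, word_vectors):
    """Concatenate engineered attributes with the title embedding."""
    return np.concatenate([engineered, text_embedding(title, word_vectors)])

# Train on rows built with build_features(...) and binary popularity labels:
# clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```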
    Sentiment analysis on social media data can be a challenging task, among other reasons, because labeled data for training is not always available. Transfer learning approaches address this problem by leveraging a labeled source domain to obtain a model for a target domain that is different from, but related to, the source domain. However, the question that arises is how to choose proper source data for training the target classifier, which can be done by considering the similarity between source and target data using distance metrics. This article investigates the relation between these distance metrics and the classifiers’ performance. For this purpose, we propose to evaluate four metrics combined with distinct dataset representations. Computational experiments, conducted in the Twitter sentiment analysis scenario, showed that the cosine similarity metric combined with bag-of-words normalized with term frequency-inverse document frequency presented the best results in terms of predictive pow...
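A minimal sketch of the best-performing combination the abstract reports: TF-IDF-weighted bag-of-words plus cosine similarity. Comparing the two corpora as single aggregated documents is one simple choice made here; the paper may compare them differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def corpus_similarity(source_texts, target_texts):
    vectorizer = TfidfVectorizer().fit(source_texts + target_texts)
    src = vectorizer.transform([" ".join(source_texts)])
    tgt = vectorizer.transform([" ".join(target_texts)])
    return cosine_similarity(src, tgt)[0, 0]  # 1.0 = identical direction

print(corpus_similarity(["great movie", "awful plot"],
                        ["great acting", "boring plot"]))
```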
    Heterogeneous systems employing CPUs and GPUs are becoming increasingly popular in large-scale data centers and cloud environments. In these platforms, sharing a GPU across different applications is an important feature to improve hardware utilization and system throughput. However, under scenarios where GPUs are competitively shared, some challenges arise. The decision on the simultaneous execution of different kernels is made by the hardware and depends on the kernels' resource requirements. Besides that, it is very difficult to understand all the hardware variables involved in the simultaneous execution decisions in order to describe a formal allocation method. In this work, we use machine learning techniques to understand how the resource requirements of the kernels from the most important GPU benchmarks impact their concurrent execution. We focus on making the machine learning algorithms capture the hidden patterns that make a kernel interfere with the execution of another when they are submitted to run at the same time. The techniques analyzed were k-NN, Logistic Regression, Multilayer Perceptron, and XGBoost (which obtained the best results) over the GPU benchmark suites Rodinia, Parboil, and SHOC. Our results showed that, among the features selected in the analysis, the number of blocks per grid, the number of threads per block, and the number of registers are the resource consumption features that most affect the performance of concurrent execution.
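A toy sketch of the classification task, assuming kernel pairs are encoded by concatenating each kernel's static resource features and labeled by whether co-scheduling hurt performance. The two-row dataset is a placeholder; in the study, the features come from the benchmark kernels.

```python
import numpy as np
from xgboost import XGBClassifier

# Per row: blocks_per_grid, threads_per_block, registers for kernel A,
# then the same three features for kernel B. Label 1 = harmful interference.
X = np.array([[80, 256, 32, 120, 128, 48],
              [16,  64, 16,  20, 128, 24]])
y = np.array([1, 0])

model = XGBClassifier(n_estimators=50).fit(X, y)
print(model.predict(X))
```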
    Semantic Role Labelling (SRL) is the process of automatically finding the semantic roles of terms in a sentence. It is an essential task towards creating a machine-meaningful representation of textual information. One public linguistic resource commonly used for this task is the FrameNet Project. FrameNet is a human- and machine-readable lexical database containing a considerable number of annotated sentences; those annotations link sentence fragments to semantic frames. However, while the annotations across all the documents covered in the dataset link to most of the frames, a large group of frames lacks annotations in the documents pointing to them. In this paper, we present a data augmentation method for FrameNet documents that increases the total number of annotations by over 13%. Our approach relies on lexical, syntactic, and semantic aspects of the sentences to provide additional annotations. We evaluate the proposed augmentation method by comparing the performance of a state-of...
    Several large-scale experiments modeled as scientific workflows may run in parallel for days or weeks in high-performance environments. The execution time is determined by factors such as the volume of input data, the number of parameters explored, etc. It is therefore important for the scientist to minimize workflow executions that produce unsatisfactory or erroneous results. Estimating which executions will fail (or not) is an important, yet open, problem. To mitigate it, we propose a parameter recommendation mechanism for workflows based on data mining algorithms, so that the scientist can configure the workflow in the best possible way (e.g., to avoid errors) before the actual execution.
