-
Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups
Authors:
Kiana Farhadyar,
Federico Bonofiglio,
Maren Hackenberg,
Daniela Zoeller,
Harald Binder
Abstract:
In settings requiring synthetic data generation based on a clinical cohort, e.g., due to data protection regulations, heterogeneity across individuals might be a nuisance that we need to control or faithfully preserve. The sources of such heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and thus reflected only in properties of distributions, such as bimoda…
▽ More
In settings requiring synthetic data generation based on a clinical cohort, e.g., due to data protection regulations, heterogeneity across individuals might be a nuisance that we need to control or faithfully preserve. The sources of such heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and thus reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique that utilizes a low-dimensional latent representation. To faithfully reproduce unknown heterogeneity reflected in marginal distributions, we propose to combine VAEs with pre-transformations. For dealing with known heterogeneity due to sub-groups, we complement VAEs with models for group membership, specifically from propensity score regression. The evaluation is performed with a realistic simulation design that features sub-groups and challenging marginal distributions. The proposed approach faithfully recovers the latter, compared to synthetic data approaches that focus purely on marginal distributions. Propensity scores add complementary information, e.g., when visualized in the latent space, and enable sampling of synthetic data with or without sub-group specific characteristics. We also illustrate the proposed approach with real data from an international stroke trial that exhibits considerable distribution differences between study sites, in addition to bimodality. These results indicate that describing heterogeneity by statistical approaches, such as propensity score regression, might be more generally useful for complementing generative deep learning for obtaining synthetic data that faithfully reflects structure from clinical cohorts.
△ Less
Submitted 12 December, 2023;
originally announced December 2023.
-
Sequential Item Recommendation in the MOBA Game Dota 2
Authors:
Alexander Dallmann,
Johannes Kohlmann,
Daniel Zoller,
Andreas Hotho
Abstract:
Multiplayer Online Battle Arena (MOBA) games such as Dota 2 attract hundreds of thousands of players every year. Despite the large player base, it is still important to attract new players to prevent the community of a game from becoming inactive. Entering MOBA games is, however, often demanding, requiring the player to learn numerous skills at once. An important factor of success is buying the co…
▽ More
Multiplayer Online Battle Arena (MOBA) games such as Dota 2 attract hundreds of thousands of players every year. Despite the large player base, it is still important to attract new players to prevent the community of a game from becoming inactive. Entering MOBA games is, however, often demanding, requiring the player to learn numerous skills at once. An important factor of success is buying the correct items which forms a complex task depending on various in-game factors such as already purchased items, the team composition, or available resources. A recommendation system can support players by reducing the mental effort required to choose a suitable item, helping, e.g., newer players or players returning to the game after a longer break, to focus on other aspects of the game. Since Sequential Item Recommendation (SIR) has proven to be effective in various domains (e.g. e-commerce, movie recommendation or playlist continuation), we explore the applicability of well-known SIR models in the context of purchase recommendations in Dota 2. To facilitate this research, we collect, analyze and publish Dota-350k, a new large dataset based on recent Dota 2 matches. We find that SIR models can be employed effectively for item recommendation in Dota 2. Our results show that models that consider the order of purchases are the most effective. In contrast to other domains, we find RNN-based models to outperform the more recent Transformer-based architectures on Dota-350k.
△ Less
Submitted 17 January, 2022;
originally announced January 2022.
-
A Case Study on Sampling Strategies for Evaluating Neural Sequential Item Recommendation Models
Authors:
Alexander Dallmann,
Daniel Zoller,
Andreas Hotho
Abstract:
At the present time, sequential item recommendation models are compared by calculating metrics on a small item subset (target set) to speed up computation. The target set contains the relevant item and a set of negative items that are sampled from the full item set. Two well-known strategies to sample negative items are uniform random sampling and sampling by popularity to better approximate the i…
▽ More
At the present time, sequential item recommendation models are compared by calculating metrics on a small item subset (target set) to speed up computation. The target set contains the relevant item and a set of negative items that are sampled from the full item set. Two well-known strategies to sample negative items are uniform random sampling and sampling by popularity to better approximate the item frequency distribution in the dataset. Most recently published papers on sequential item recommendation rely on sampling by popularity to compare the evaluated models. However, recent work has already shown that an evaluation with uniform random sampling may not be consistent with the full ranking, that is, the model ranking obtained by evaluating a metric using the full item set as target set, which raises the question whether the ranking obtained by sampling by popularity is equal to the full ranking. In this work, we re-evaluate current state-of-the-art sequential recommender models from the point of view, whether these sampling strategies have an impact on the final ranking of the models. We therefore train four recently proposed sequential recommendation models on five widely known datasets. For each dataset and model, we employ three evaluation strategies. First, we compute the full model ranking. Then we evaluate all models on a target set sampled by the two different sampling strategies, uniform random sampling and sampling by popularity with the commonly used target set size of 100, compute the model ranking for each strategy and compare them with each other. Additionally, we vary the size of the sampled target set. Overall, we find that both sampling strategies can produce inconsistent rankings compared with the full ranking of the models. Furthermore, both sampling by popularity and uniform random sampling do not consistently produce the same ranking ...
△ Less
Submitted 27 July, 2021;
originally announced July 2021.
-
Adapting deep generative approaches for getting synthetic data with realistic marginal distributions
Authors:
Kiana Farhadyar,
Federico Bonofiglio,
Daniela Zoeller,
Harald Binder
Abstract:
Synthetic data generation is of great interest in diverse applications, such as for privacy protection. Deep generative models, such as variational autoencoders (VAEs), are a popular approach for creating such synthetic datasets from original data. Despite the success of VAEs, there are limitations when it comes to the bimodal and skewed marginal distributions. These deviate from the unimodal symm…
▽ More
Synthetic data generation is of great interest in diverse applications, such as for privacy protection. Deep generative models, such as variational autoencoders (VAEs), are a popular approach for creating such synthetic datasets from original data. Despite the success of VAEs, there are limitations when it comes to the bimodal and skewed marginal distributions. These deviate from the unimodal symmetric distributions that are encouraged by the normality assumption typically used for the latent representations in VAEs. While there are extensions that assume other distributions for the latent space, this does not generally increase flexibility for data with many different distributions. Therefore, we propose a novel method, pre-transformation variational autoencoders (PTVAEs), to specifically address bimodal and skewed data, by employing pre-transformations at the level of original variables. Two types of transformations are used to bring the data close to a normal distribution by a separate parameter optimization for each variable in a dataset. We compare the performance of our method with other state-of-the-art methods for synthetic data generation. In addition to the visual comparison, we use a utility measurement for a quantitative evaluation. The results show that the PTVAE approach can outperform others in both bimodal and skewed data generation. Furthermore, the simplicity of the approach makes it usable in combination with other extensions of VAE.
△ Less
Submitted 14 May, 2021;
originally announced May 2021.
-
Distributed Multivariate Regression Modeling For Selecting Biomarkers Under Data Protection Constraints
Authors:
Daniela Zöller,
Harald Binder
Abstract:
The discovery of clinical biomarkers requires large patient cohorts and is aided by a pooled data approach across institutions. In many countries, data protection constraints, especially in the clinical environment, forbid the exchange of individual-level data between different research institutes, impeding the conduct of a joint analyses. To circumvent this problem, only non-disclosive aggregated…
▽ More
The discovery of clinical biomarkers requires large patient cohorts and is aided by a pooled data approach across institutions. In many countries, data protection constraints, especially in the clinical environment, forbid the exchange of individual-level data between different research institutes, impeding the conduct of a joint analyses. To circumvent this problem, only non-disclosive aggregated data is exchanged, which is often done manually and requires explicit permission before transfer, i.e., the number of data calls and the amount of data should be limited. This does not allow for more complex tasks such as variable selection, as only simple aggregated summary statistics are typically transferred. Other methods have been proposed that require more complex aggregated data or use input data perturbation, but these methods can either not deal with a high number of biomarkers or lose information. Here, we propose a multivariable regression approach for identifying biomarkers by automatic variable selection based on aggregated data in iterative calls, which can be implemented under data protection constraints. The approach can be used to jointly analyze data distributed across several locations. To minimize the amount of transferred data and the number of calls, we also provide a heuristic variant of the approach. When performing global data standardization, the proposed method yields the same results as pooled individual-level data analysis. In a simulation study, the information loss introduced by local standardization is seen to be minimal. In a typical scenario, the heuristic decreases the number of data calls from more than 10 to 3, rendering manual data releases feasible. To make our approach widely available for application, we provide an implementation of the heuristic version incorporated in the DataSHIELD framework.\
△ Less
Submitted 1 October, 2023; v1 submitted 1 March, 2018;
originally announced March 2018.
-
Modeling Activity Tracker Data Using Deep Boltzmann Machines
Authors:
Martin Treppner,
Stefan Lenz,
Harald Binder,
Daniela Zöller
Abstract:
Commercial activity trackers are set to become an essential tool in health research, due to increasing availability in the general population. The corresponding vast amounts of mostly unlabeled data pose a challenge to statistical modeling approaches. To investigate the feasibility of deep learning approaches for unsupervised learning with such data, we examine weekly usage patterns of Fitbit acti…
▽ More
Commercial activity trackers are set to become an essential tool in health research, due to increasing availability in the general population. The corresponding vast amounts of mostly unlabeled data pose a challenge to statistical modeling approaches. To investigate the feasibility of deep learning approaches for unsupervised learning with such data, we examine weekly usage patterns of Fitbit activity trackers with deep Boltzmann machines (DBMs). This method is particularly suitable for modeling complex joint distributions via latent variables. We also chose this specific procedure because it is a generative approach, i.e., artificial samples can be generated to explore the learned structure. We describe how the data can be preprocessed to be compatible with binary DBMs. The results reveal two distinct usage patterns in which one group frequently uses trackers on Mondays and Tuesdays, whereas the other uses trackers during the entire week. This exemplary result shows that DBMs are feasible and can be useful for modeling activity tracker data.
△ Less
Submitted 28 February, 2018;
originally announced February 2018.
-
Improving Session Recommendation with Recurrent Neural Networks by Exploiting Dwell Time
Authors:
Alexander Dallmann,
Alexander Grimm,
Christian Pölitz,
Daniel Zoller,
Andreas Hotho
Abstract:
Recently, Recurrent Neural Networks (RNNs) have been applied to the task of session-based recommendation. These approaches use RNNs to predict the next item in a user session based on the previ- ously visited items. While some approaches consider additional item properties, we argue that item dwell time can be used as an implicit measure of user interest to improve session-based item recommen- dat…
▽ More
Recently, Recurrent Neural Networks (RNNs) have been applied to the task of session-based recommendation. These approaches use RNNs to predict the next item in a user session based on the previ- ously visited items. While some approaches consider additional item properties, we argue that item dwell time can be used as an implicit measure of user interest to improve session-based item recommen- dations. We propose an extension to existing RNN approaches that captures user dwell time in addition to the visited items and show that recommendation performance can be improved. Additionally, we investigate the usefulness of a single validation split for model selection in the case of minor improvements and find that in our case the best model is not selected and a fold-like study with different validation sets is necessary to ensure the selection of the best model.
△ Less
Submitted 30 June, 2017;
originally announced June 2017.
-
Of course we share! Testing Assumptions about Social Tagging Systems
Authors:
Stephan Doerfel,
Daniel Zoller,
Philipp Singer,
Thomas Niebler,
Andreas Hotho,
Markus Strohmaier
Abstract:
Social tagging systems have established themselves as an important part in today's web and have attracted the interest from our research community in a variety of investigations. The overall vision of our community is that simply through interactions with the system, i.e., through tagging and sharing of resources, users would contribute to building useful semantic structures as well as resource in…
▽ More
Social tagging systems have established themselves as an important part in today's web and have attracted the interest from our research community in a variety of investigations. The overall vision of our community is that simply through interactions with the system, i.e., through tagging and sharing of resources, users would contribute to building useful semantic structures as well as resource indexes using uncontrolled vocabulary not only due to the easy-to-use mechanics. Henceforth, a variety of assumptions about social tagging systems have emerged, yet testing them has been difficult due to the absence of suitable data. In this work we thoroughly investigate three available assumptions - e.g., is a tagging system really social? - by examining live log data gathered from the real-world public social tagging system BibSonomy. Our empirical results indicate that while some of these assumptions hold to a certain extent, other assumptions need to be reflected and viewed in a very critical light. Our observations have implications for the design of future search and other algorithms to better reflect the actual user behavior.
△ Less
Submitted 28 March, 2014; v1 submitted 3 January, 2014;
originally announced January 2014.