
ProcessGAN: Generating Privacy-Preserving Time-Aware Process Data with Conditional Generative Adversarial Nets

Online AM: 28 August 2024

Abstract

Process data constructed from event logs provides valuable insights into procedural dynamics over time. Because process data contains confidential information and is intricate in nature, such datasets are difficult to share and challenging to collect. Consequently, research that uses process data and analytics in the process mining domain is limited. In this study, we introduced a synthetic process data generation task to address the scarcity of sharable process data. We introduced a generative adversarial network, called ProcessGAN, to generate process data consisting of activity sequences and their corresponding timestamps. ProcessGAN consists of a transformer-based network as the generator and a time-aware self-attention network as the discriminator. It can generate privacy-preserving process data from random noise. ProcessGAN considers the duration of the process and the time intervals between activities to generate realistic activity sequences with timestamps. We evaluated ProcessGAN on five real-world datasets: two public datasets and three private datasets collected in medical domains. To evaluate the synthetic data, in addition to computing statistical metrics, we trained a supervised model to score the synthetic processes. We also used process mining to discover workflows for the synthetic medical processes and had domain experts evaluate the clinical applicability of these workflows. ProcessGAN outperformed existing generative models in generating complex processes with valid parallel pathways. The synthetic process data generated by ProcessGAN better represented the long-range dependencies between activities, a feature relevant to complicated medical and other processes. The timestamps generated by ProcessGAN followed distributions similar to those of the authentic timestamps. In addition, we trained a transformer-based network to generate synthetic contexts (e.g., patient demographics) associated with the synthetic processes. The synthetic contexts generated by our model outperformed those of the baseline models, with distributions similar to those of the authentic contexts. We conclude that ProcessGAN can generate sharable synthetic process data indistinguishable from authentic data. Our source code is available at https://github.com/raaachli/ProcessGAN.
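
The abstract mentions two time quantities that the model takes into account: the duration of a process and the time intervals between its activities. As a minimal sketch of what these quantities look like for a single trace, the Python snippet below derives both from a list of (activity, timestamp) pairs. The trace contents, the timestamp format, and the helper name `time_features` are hypothetical and chosen only for illustration; they are not taken from the ProcessGAN source code, and the actual preprocessing in that repository may differ.

```python
from datetime import datetime

# Hypothetical trace for one case: (activity, timestamp) pairs.
# Activity names and times are invented for illustration only.
trace = [
    ("Triage", "2024-01-01 08:00:00"),
    ("Blood test", "2024-01-01 08:15:00"),
    ("Diagnosis", "2024-01-01 09:00:00"),
]


def time_features(trace, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the ordered activities, the gaps between consecutive
    activities in seconds, and the total process duration in seconds."""
    # Parse timestamps and sort events chronologically so gaps are non-negative.
    events = sorted((datetime.strptime(ts, fmt), act) for act, ts in trace)
    times = [t for t, _ in events]
    activities = [act for _, act in events]
    # Interval i is the time elapsed between activity i and activity i + 1.
    intervals = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    # Duration spans the first to the last event; an empty trace has duration 0.
    duration = (times[-1] - times[0]).total_seconds() if times else 0.0
    return activities, intervals, duration


activities, intervals, duration = time_features(trace)
print(activities, intervals, duration)
```

For the example trace, the intervals are 900 and 2700 seconds and the duration is 3600 seconds. Comparing the distributions of such intervals and durations between authentic and synthetic traces is one simple way to check whether generated timestamps look plausible.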


Published In

ACM Transactions on Knowledge Discovery from Data, Just Accepted
EISSN: 1556-472X

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Online AM: 28 August 2024
Accepted: 26 July 2024
Revised: 14 July 2024
Received: 09 October 2022

Author Tags

1. Synthetic data generation
2. Process mining
3. Sequential data
4. Generative adversarial networks
5. Data privacy
6. Time aware

Qualifiers

• Research-article
