https://doi.org/10.1145/3678717.3691230
research-article
Open access

Augmentation Techniques for Balancing Spatial Datasets in Machine and Deep Learning Applications

Published: 22 November 2024

Abstract

Thanks to the availability of huge amounts of spatial data, many new machine and deep learning (ML/DL) applications have emerged that can handle this kind of information. In particular, new cost models have been developed to predict the cost of spatial operations accurately. To obtain good ML/DL models, training is usually performed on synthetically generated datasets that capture as many spatial distributions as possible and as many combinations of features as desired (e.g., cardinality, geometry complexity), with the aim of improving the generalization capabilities of the trained models. However, when a model is used to estimate some property of a spatial operation, such as range query selectivity, balancing the characteristics of the input datasets may not be enough to guarantee balance in the ground-truth values of the target variable. Therefore, we need a way to balance the final results without recomputing the operation from scratch. This paper formalizes the notion of dataset balancing in the context of spatial ML/DL, proposes a set of metrics for evaluating the degree of balance of the input domains and the target values, and defines a set of augmentation techniques specifically tailored to spatial data. Finally, it tests the effects of these augmentations on the training of a generic ML cost model for estimating the selectivity of spatial range queries.
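
To make the imbalance problem concrete, the following minimal sketch computes range query selectivity on a synthetic point set and scores how evenly the resulting target values are spread. It is only an illustration: the abstract does not state the paper's metrics, so the normalized-entropy `balance_score`, the helper `random_box`, the bin count, and the uniform point set are all assumptions introduced here.

```python
import math
import random


def range_selectivity(points, box):
    """Fraction of points that fall inside an axis-aligned query box."""
    xmin, ymin, xmax, ymax = box
    hits = sum(1 for x, y in points if xmin <= x <= xmax and ymin <= y <= ymax)
    return hits / len(points)


def balance_score(values, bins=10):
    """Normalized Shannon entropy of values in [0, 1] over uniform bins.

    1.0 means all bins are equally populated; 0.0 means one bin holds
    every value. Illustrative measure only, not the paper's metric.
    """
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    entropy = sum((c / total) * math.log(total / c) for c in counts if c)
    return entropy / math.log(bins)


rng = random.Random(0)
# Hypothetical training dataset: uniform points in the unit square.
points = [(rng.random(), rng.random()) for _ in range(2000)]


def random_box(max_side):
    """Random query box inside the unit square with sides up to max_side."""
    w, h = rng.uniform(0, max_side), rng.uniform(0, max_side)
    x, y = rng.uniform(0, 1 - w), rng.uniform(0, 1 - h)
    return (x, y, x + w, y + h)


# Only small queries: the selectivities concentrate near zero.
small = [range_selectivity(points, random_box(0.2)) for _ in range(200)]
# Queries of mixed size: the selectivities cover more of [0, 1].
mixed = [range_selectivity(points, random_box(1.0)) for _ in range(200)]

print(balance_score(small), balance_score(mixed))
```

Both query sets are drawn from the same dataset, yet the ground-truth selectivities are distributed very differently, which is the kind of target-value imbalance the abstract describes.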


      Published In

      SIGSPATIAL '24: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems
      October 2024
      743 pages
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Spatial augmentation
      2. deep learning
      3. machine learning
      4. training set balancing

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Interconnected Nord-Est Innovation Ecosystem (iNEST) - European Union Next-GenerationEU

      Conference

      SIGSPATIAL '24
      Acceptance Rates

      SIGSPATIAL '24 Paper Acceptance Rate 37 of 122 submissions, 30%;
      Overall Acceptance Rate 257 of 1,238 submissions, 21%
