Abstract
The Internet provides a wide variety of information that can be collected and studied, creating a massive data repository. Among these data are articles about violence against women (VAW) published in the digital press, which are of great societal interest. In this work, we used web scraping to gather VAW-related news from the Internet and applied text mining techniques to study VAW and its characteristics. Our work comprises an exploratory analysis and the application of topic modelling to VAW events to identify latent topics and their semantic structures. We employed classification algorithms on a set of VAW press articles to determine the type of violence they refer to, namely physical, psychological, sexual, or a combination of them. We proposed two methodologies to assign target labels to the data: the first is based on dictionaries of VAW types, while the second extends the first by using the predominant type of violence to identify other associated types. Furthermore, we implemented two feature selection techniques: TF-IDF and \(\chi^{2}\). We then applied Support Vector Machine, Decision Tree, Bayesian Networks, XGBoost Classifier, Random Forest, and Artificial Neural Networks. The results showed that the classifiers performed better when using \(\chi^{2}\). The Boost Classifier (XGBoost) demonstrated the best performance, followed by Random Forest.
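As an illustration of the first, dictionary-based labelling methodology, the following is a minimal sketch and not the authors' implementation. The term lists, the function names `count_matches` and `label_article`, the exact-token matching (no stemming or lemmatization), and the `"unlabelled"` fallback are all assumptions made for this example; the dictionaries actually used in the study are not given in this text.

```python
import re
from collections import Counter

# Hypothetical, illustrative term lists (assumption): the study's real
# dictionaries of VAW types are not reproduced here.
VAW_DICTIONARIES = {
    "physical": {"beat", "hit", "punch", "stab", "strangle"},
    "psychological": {"threat", "insult", "humiliate", "intimidate"},
    "sexual": {"rape", "harass", "grope"},
}


def count_matches(text):
    """Count how many tokens of `text` appear in each violence-type dictionary."""
    # Lowercase and keep alphabetic tokens only; no stemming is applied,
    # so "threatened" would not match "threat" in this simplified sketch.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for label, terms in VAW_DICTIONARIES.items():
        counts[label] = sum(1 for tok in tokens if tok in terms)
    return counts


def label_article(text):
    """Label an article with every violence type whose dictionary matches.

    Returns a single type (e.g. "physical"), a combination joined with "+"
    (e.g. "physical+psychological"), or "unlabelled" when nothing matches.
    """
    counts = count_matches(text)
    matched = sorted(label for label, n in counts.items() if n > 0)
    return "+".join(matched) if matched else "unlabelled"


if __name__ == "__main__":
    print(label_article("He hit her and made a threat."))
    print(label_article("The court met today."))
```

In a fuller pipeline, labels produced this way could serve as the target variable for the classifiers listed above, with the per-type counts from `count_matches` indicating which type predominates.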
Availability of data and materials
The data and materials are available at https://doi.org/10.6084/m9.figshare.21252987.v1.
Abbreviations
- ANN: Artificial neural networks
- BC: Boost classifier
- BN: Bayesian networks
- DT: Decision tree
- VAW: Violence against women
- LDA: Latent Dirichlet allocation
- NLP: Natural language processing
- RF: Random forest
- SVM: Support vector machine
Funding
We acknowledge financial support from the Ministerio de Ciencia e Innovación (Spain) (Research Project PID2020-112495RB-C21) and the I+D+i FEDER 2020 project B-TIC-42-UGR20.
Author information
Contributions
All authors have contributed equally to this work.
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Stephanie, E.M.A., Ruiz, L.G.B., Vila, M.A. et al. Study of violence against women and its characteristics through the application of text mining techniques. Int J Data Sci Anal 18, 35–48 (2024). https://doi.org/10.1007/s41060-023-00448-y
DOI: https://doi.org/10.1007/s41060-023-00448-y