Evaluating pre-trained models for user feedback analysis in software engineering: A study on classification of app-reviews

MA Hadi, FH Fard - Empirical Software Engineering, 2023 - Springer
Context
Automatic classification of mobile application users' feedback has been studied in different areas of software engineering. However, supervised classification requires a large amount of manually labeled data, and whenever new classes or new platforms are introduced, new labeled data and new models are required. Pre-trained neural Language Models (PLMs) have found success in the Natural Language Processing field, but their applicability has not been explored for app review classification.
Objective
We evaluate the use of PLMs for issue classification of app reviews in multiple settings and compare them with existing models.
Method
We set up several studies to evaluate the performance and time efficiency of PLMs compared to prior approaches on six datasets, covering binary vs. multi-class, zero-shot, multi-task, and multi-resource settings. In addition, we train and study domain-specific (Custom) PLMs by incorporating app reviews into the pre-training. We report micro- and macro-averaged Precision, Recall, and F1-scores, as well as the time required for training and prediction with each model.
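As a rough illustration of this kind of fine-tuning-and-evaluation setup (the study's exact models, datasets, and hyperparameters may differ), a minimal sketch with Hugging Face Transformers could look like the following. The model name, label set, and example reviews are illustrative assumptions, not the paper's setup.

```python
# Sketch: fine-tune a generic PLM for binary app-review issue classification
# and report micro-/macro-averaged Precision, Recall, and F1, as in the study.
import numpy as np
from datasets import Dataset
from sklearn.metrics import precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical labeled app reviews (label 1 = bug report, 0 = other).
reviews = ["App crashes every time I open the camera.", "Love the new design!"]
labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

ds = Dataset.from_dict({"text": reviews, "label": labels})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                padding="max_length", max_length=128),
            batched=True)

def compute_metrics(eval_pred):
    # Micro- and macro-averaged scores, matching the metrics reported above.
    logits, y_true = eval_pred
    y_pred = np.argmax(logits, axis=-1)
    out = {}
    for avg in ("micro", "macro"):
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average=avg, zero_division=0)
        out.update({f"{avg}_precision": p, f"{avg}_recall": r, f"{avg}_f1": f1})
    return out

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=ds,
    eval_dataset=ds,  # toy setup; use a held-out split in practice
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```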
Results
Our results show that PLMs classify app issues with higher scores, except in the multi-resource setting. On the largest dataset, the micro- and macro-average F1-scores improve by 13 and 8 points, respectively, compared to the prior approaches. The domain-specific PLMs achieve the highest scores in all settings with less prediction time, and they benefit from pre-training on a larger number of app reviews. On the largest dataset, the large domain-specific PLMs obtain micro- and macro-average F1-scores of 98 and 92 (4.5 to 8.3 F1 points more than general pre-trained models), an F1-score of 71 in the zero-shot setting, and F1-scores of 93 and 92 in the multi-task and multi-resource settings, respectively.
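One plausible reading of how such Custom PLMs are built from app reviews is continued (domain-adaptive) masked-language-model pre-training; the sketch below illustrates that idea only, and its corpus, base model, and hyperparameters are assumptions, not the paper's recipe.

```python
# Sketch: continued MLM pre-training on an app-review corpus to obtain a
# domain-specific (Custom) PLM, later fine-tuned for issue classification.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Stand-in for a large corpus of unlabeled app reviews.
corpus = ["Great app but it logs me out randomly.",
          "Please add dark mode to the settings."]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

ds = Dataset.from_dict({"text": corpus})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

# Randomly masks 15% of tokens so the model adapts to review-domain language.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="custom-plm", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
trainer.save_model("custom-plm")  # reload with AutoModelForSequenceClassification
```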
Conclusion
Although prior approaches achieve high scores in some settings, PLMs are the only models that perform well in the zero-shot setting. When pre-trained on app reviews, the Custom PLMs achieve higher scores and lower prediction times.
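For the zero-shot setting, one common way to classify app-review issues with a PLM and no labeled examples is an NLI-based zero-shot pipeline; this is a generic illustration, not necessarily the zero-shot formulation used in the paper, and the model and label names are assumptions.

```python
# Sketch: zero-shot issue classification of an app review via an NLI model.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")  # assumed model choice

review = "The app drains my battery and freezes on startup."
candidate_labels = ["bug report", "feature request", "user experience", "rating"]
result = classifier(review, candidate_labels)
print(result["labels"][0], result["scores"][0])  # top predicted issue type
```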