What does affect the correlation among evaluation measures?

N Ferro - ACM Transactions on Information Systems (TOIS), 2017 - dl.acm.org
Information Retrieval (IR) is well known for its large number of evaluation measures, with new ones appearing more and more frequently. In this context, correlation analysis is the tool used to study evaluation measures: it lets us understand whether two measures rank systems similarly, whether they capture different aspects of system performance or reflect different user models, and whether a new measure is well motivated. To this end, the two most commonly used correlation coefficients are Kendall’s τ correlation and the AP correlation τ_AP.
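To make the difference between the two coefficients concrete, here is a minimal Python sketch, not taken from the article. It assumes there are no tied scores, and it uses the commonly stated form of the AP correlation, τ_AP = 2/(N-1) · Σ_{i=2..N} C(i)/(i-1) - 1, where C(i) counts the systems ranked above position i in the candidate ranking that are also ranked above it in the reference ranking. The function names `kendall_tau` and `tau_ap` and the example scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau (tau-a) between two score lists; assumes no ties."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def tau_ap(reference, candidate):
    """AP correlation of `candidate` with respect to `reference`.

    Both arguments hold one score per system, in the same system order.
    The measure is asymmetric: it walks down the candidate ranking and
    checks each system against the reference ranking. Assumes no ties.
    """
    n = len(candidate)
    # System indices sorted by candidate score, best first.
    order = sorted(range(n), key=lambda s: candidate[s], reverse=True)
    total = 0.0
    for pos in range(1, n):  # pos is the 0-based rank, so rank - 1 == pos
        current = order[pos]
        # Systems above `current` that the reference also places above it.
        correct = sum(
            1 for above in order[:pos] if reference[above] > reference[current]
        )
        total += correct / pos
    return 2.0 * total / (n - 1) - 1.0


# Hypothetical scores for four systems under a reference measure.
reference = [0.9, 0.8, 0.7, 0.6]
top_swap = [0.8, 0.9, 0.7, 0.6]      # the two best systems are swapped
bottom_swap = [0.9, 0.8, 0.6, 0.7]   # the two worst systems are swapped

print(kendall_tau(reference, top_swap), kendall_tau(reference, bottom_swap))
print(tau_ap(reference, top_swap), tau_ap(reference, bottom_swap))
```

Both swaps produce a single discordant pair, so Kendall's τ is 2/3 in each case. The AP correlation drops to 1/3 for the swap at the top but only to 7/9 for the swap at the bottom, which reflects its heavier penalty for disagreements among the best-performing systems.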
The goal of the article is to investigate the properties of correlation analysis itself, the tool we use to study evaluation measures. In particular, we investigate three research questions about these two correlation coefficients: (i) what is the effect of the number of systems and topics? (ii) what is the effect of removing low-performing systems? (iii) what is the effect of the experimental collections?
To answer these research questions, we propose a methodology based on the General Linear Mixed Model (GLMM) and ANalysis Of VAriance (ANOVA) to isolate the effects of the number of topics, the number of systems, and the experimental collections, and to let us observe expected correlation values, net of these effects, which are stable and reliable.
We found that the effect of the number of topics is more prominent than the effect of the number of systems. Although it produces different absolute values, removing low-performing systems does not seem to provide substantially different information from keeping them, especially when comparing a whole set of evaluation measures. Finally, we found that both document corpora and topic sets affect the correlation among evaluation measures, with the effect of the latter being more prominent. Moreover, there is a substantial interaction between evaluation measures, corpora, and topic sets, meaning that the correlation between different evaluation measures can be substantially increased or decreased depending on the corpora and topics at hand.