\makesavenoteenv

longtable

Leveraging Natural Language and Item Response Theory Models for ESG Scoring

César P. Soares
Ph.D. in Global Health and Sustainability
cpscesar@gmail.com

Abstract

This paper explores an innovative approach to Environmental, Social, and Governance (ESG) scoring by integrating Natural Language Processing (NLP) techniques with Item Response Theory (IRT), specifically the Rasch model. The study utilizes a comprehensive dataset of news articles in Portuguese related to Petrobras, a major oil company in Brazil, collected from 2022 and 2023. The data is filtered and classified for ESG-related sentiments using advanced NLP methods. The Rasch model is then applied to evaluate the psychometric properties of these ESG measures, providing a nuanced assessment of ESG sentiment trends over time. The results demonstrate the efficacy of this methodology in offering a more precise and reliable measurement of ESG factors, highlighting significant periods and trends. This approach may enhance the robustness of ESG metrics and contribute to the broader field of sustainability and finance by offering a deeper understanding of the temporal dynamics in ESG reporting.

Keywords ESG $\cdot$ NLP $\cdot$ IRT $\cdot$ Rasch Model $\cdot$ Psychometry

Introduction

In recent years, the integration of language models into various facets of research and industry has ushered in a new era of text generation and comprehension. These models, driven by advancements in deep learning and natural language processing (NLP), hold the potential to change how we understand and interact with textual data.

The convergence of NLP and psychometric methods are emerging in this context as an interdisciplinary approach for tackling complex socio-environmental analyses [6, 4, 5, 8, 9]. By combining both, researchers aim to unlock new insights into the intricate interplay between human behaviour and nature aspects. This paper leverages this interdisciplinary synergy to propose a new methodology to help estimate Environmental, Social, and Governance (ESG) rating scores. Integrating psychometric techniques allows for the rigorous measurement of latent constructs underlying ESG performance, while NLP techniques facilitate the extraction of valuable insights from vast troves of unstructured textual data.

ESG factors have emerged as critical for investors, stakeholders, and regulators seeking to evaluate companies and entities’ sustainability and societal impact. Integrating ESG criteria into investment decisions reflects a growing recognition of the interconnectedness between corporate performance, environmental stewardship, social responsibility, and effective governance practices. Consequently, there is an increasing demand for robust methodologies and tools to assess and quantify ESG performance accurately [12, 7, 11].

Previous studies in the realm of ESG scoring have explored the application of NLP techniques to analyze textual data extracted from various sources such as news articles, social media, and corporate reports [10, 2, 1]. However, notably absent from these endeavours is the integration of psychometric methods, which may work instrumentally in assessing the reliability and validity of the measured ESG construct. While NLP offers unparalleled capabilities in processing and extracting insights from unstructured textual data, the absence of psychometric validation may introduce potential limitations in the robustness and interpretability of the resulting ESG ratings. By exploring the possibility of incorporating psychometric principles into the ESG measurement framework, our research aims to address this gap by providing a more comprehensive and rigorous approach to ESG scoring, thereby enhancing the accountability and reliability of the assessment process.

In other words, this paper aims to explore a novel methodology that integrates NLP and Item Response Theory (IRT) to enhance the robustness of studies calculating ESG metrics. While the paper does not propose a specific metric, the combined NLP and IRT results presented may contribute to further developing more reliable and robust ESG metric calculations. The results may also contribute to advancing scholarship at the intersection of finance, sustainability, and computational linguistics by shedding light on the transformative potential of labelled text data for enhancing ESG estimation methodologies.

This paper is organized as follows: Section 2 presents the methodology. Section 3 presents the results. Section 4 discusses the implications of our findings and potential applications. Finally, Section 5 concludes the paper and outlines future research directions.

Methodology

Study Design

This study’s methodology leverages labelled text data to enhance the estimation of the ESG construct in a similar way applied by Aue, Jatowt & Färber [2]. Labelled text data consists of textual information annotated according to specific ESG-related dimensions, including environmental impact, social responsibility, and corporate governance practices. These data sources encompass news articles, offering valuable insights for ESG analysis. Advanced NLP techniques are employed to extract meaningful information from these textual sources.

Incorporating IRT into the methodology provides a robust framework for assessing the psychometric properties of ESG constructs. IRT facilitates the measurement of latent traits by analyzing the relationship between news data and the underlying trait being measured. This study uses the Rasch model to evaluate the consistency and reliability of ESG measures derived from textual data.

The first step involved downloading and scraping news articles in Portuguese using the Global Database of Events, Language, and Tone (GDEL)¹¹1More information: https://www.gdeltproject.org/. The focus was on Brazil in 2022 and 2023, particularly the company Petrobras²²2More information: https://petrobras.com.br/en/, a significant oil business in Brazil. A comprehensive dataset covering these years was organized by concatenating data from all months. Each month was treated as an individual "item" to be lately evaluated by the Rasch model.

The second step involved cleaning the dataset using Bidirectional Encoder Representations from Transformers (BERT) for the Portuguese language to apply embeddings and similarity algorithms with ESG definitions. This process filtered relevant and irrelevant news articles, retaining only those pertinent to one of the ESG dimensions. The final dataset considered 3,653 news for 2022 and 4,401 news for 2023.

In the third step, the dataset was processed using embeddings and similarity algorithms to classify a sample of 365 cases, or 10% of 2022 news articles’ data, as positive or negative. A Portuguese BERT classification model, trained and fine-tuned on this data sample, determined the sentiment of news articles regarding ESG dimensions on the whole dataset of 2022 and 2023. The table below shows the ESG definitions and the positive/negative criteria used for similarity analysis. The text is in Portuguese since the news texts were in this language.

Table 1: ESG Definitions

ESG Dimension	ESG Definition	Positive ESG	Negative ESG
Environment	São reportados dados sobre alterações climáticas, emissões de gases com efeito estufa, perda de biodiversidade, desflorestação/reflorestação, mitigação da poluição, eficiência energética e gestão da água.	Iniciativas para a mitigação das alterações climáticas, redução das emissões de gases com efeito de estufa, promoção da biodiversidade, reflorestação, redução da poluição, melhoria da eficiência energética e gestão eficaz da água.	Questões relacionadas com a inação em matéria de alterações climáticas, o aumento das emissões de gases com efeito de estufa, a perda de biodiversidade, a desflorestação, a poluição, a utilização ineficiente da energia e a má gestão da água.
Social	Os dados são relatados sobre segurança e saúde dos funcionários, condições de trabalho, diversidade, equidade e inclusão, e conflitos e crises humanitárias, e são relevantes nas avaliações de risco e retorno diretamente através dos resultados no aumento (ou destruição) da satisfação do cliente e da satisfação dos funcionários.	Melhorias na saúde e segurança dos funcionários, melhores condições de trabalho, avanços na diversidade e inclusão, resolução de conflitos, maior satisfação do cliente e envolvimento dos funcionários.	Problemas com a saúde e segurança dos funcionários, más condições de trabalho, falta de diversidade e inclusão, conflitos não resolvidos, baixa satisfação dos clientes e envolvimento dos funcionários.
Governance	Os dados são relatados sobre governança corporativa, como prevenção de suborno, corrupção, diversidade do conselho de administração, remuneração de executivos, práticas de segurança cibernética, privacidade e estrutura de gestão.	Governança corporativa forte, medidas antissuborno, diversidade do conselho, remuneração justa dos executivos, práticas robustas de segurança cibernética e privacidade.	Governança corporativa fraca, problemas com suborno e corrupção, falta de diversidade no conselho, remuneração inadequada dos executivos, práticas inadequadas de segurança cibernética e privacidade.

Source: Built by the author, 2024, based on Wikipedia and OpenAI ChatGPT consultation.

The final step involved structuring the dichotomous dataset, where each row represented the sentiment of the news articles (1 for positive and 0 for negative), and each column represented a month of the year 2022 and 2023. Then, we run the Rasch psychometric model to obtain the psychometric characteristics of the ESG construct.

Training Dataset

We used a sample of news from 2022 (n = 365) to build the dataset for training the model. All the news was in Portuguese.

The training and evaluation pipeline used a pre-trained BERTimbau Base model for feature extraction and a custom classifier for classification. The training was done in two stages: the first stage only trained the classifier, and the second stage fine-tuned the entire model.

The results obtained by the training process were logged in a CSV file, including various metrics that were further analyzed to determine the best model. The steps for this procedure are below.

Refer to caption — Figure 1: Steps flow of training

1.

Function Definition and Initialization

The code first defined a function that took several parameters (e.g., learning rate, batch size, epochs). Below, you can see the combination of tested parameters. In the end, we tested 729 combinations of parameter values. The last subsection of the methodology explains how we analyzed the metrics to define the best model to be used.

Table 2: Parameters considered

Parameters	Values
Learning rate	1e-5	2e-5	3e-5
Number of layers	2	5	10
Hidden Size	256	512	768
Batch Size	5	10	20
Epochs	2	5	10
Max Length	50	100	200

Source: build by the author, 2024.

2.

Loading Pre-trained Model and Tokenizer

Based on the function definition above, the initialization started loading the pre-trained BERTimbau Base model and tokenizer from the Hugging Face Transformers library. We tokenized the input text (news) and converted it into input IDs and attention masks. These tokenized sequences were fed into the model, which produced high-dimensional contextualized embeddings for each token in the sequence. These embeddings captured the semantic information of the words in the context of the entire sequence.

3.

Defining Classification Layers and Optimizer

A custom classification layer was defined using PyTorch’s ‘Sequential‘ module. It consisted of a Long Short-Term Memory (LSTM) layer followed by a dropout layer to prevent overfitting and a linear layer to map the LSTM’s outputs to the final classification decision. The softmax activation function was applied to the output of this linear layer to produce class probabilities for news sentiment classification analysis. This architecture allowed the model to learn temporal dependencies and patterns in the sequence of embeddings.

The purpose of this classifier was to learn patterns and relationships in the embeddings that were relevant to the classification task. Specifically, the custom classifier was a separate neural network built on the embedding model. It was designed to take the contextualized embeddings and make a final classification decision (e.g., positive or negative in the case of the news considered in this paper). We initialized this classifier with random weights and trained it to fit the classification task.

4.

Tokenizing

The news text was tokenized using the previously loaded tokenizer. It then converted the tokenized data into input IDs and attention masks. The input IDs, attention masks, and labels were stored in separate lists for use in the next steps.

5.

Data Splitting and Loading

The data was divided into training (n = 329, 90%) and validation (n = 36, 10%) sets. We defined data loaders for batch-wise processing.

6.

Training Loop

The code had a two-stage training loop. The first stage involved freezing the model and training only the custom classifier. The second stage unfreezes the model for fine-tuning:

•

Stage 1 (Classifier Training):
- –
  
  Objective: In this initial stage, the primary goal was to adapt a pre-trained model for our specific text classification task. This process aimed to "teach" the model to classify text into different categories or labels (e.g. positive or negative).
- –
  
  Model Configuration: At the beginning of Stage 1, the model was set to "evaluation mode," and we froze its parameters. This process means that we did not update the model during this stage. It remained as it was after pre-training on a large corpus.
- –
  
  Custom Classifier: In addition to the model, we had a custom classifier, which typically consists of one or more layers (in this case, an LSTM layer and a linear layer). These layers were responsible for taking the representations learned by the model and making the final classification decision.
- –
  
  Training Focus: During Stage 1, the only updated parameters were those of the custom classifier (LSTM and linear layers). The model’s parameters remained fixed. This stage aimed to fine-tune the custom classifier’s parameters to be well-suited for the specific classification task. It taught the model how to use its pre-trained embeddings to classify text.
•

Stage 2 (Fine-tuning):
- –
  
  Objective: After completing Stage 1, the model had a custom classifier that was better adapted to the classification task. However, the pre-trained model still had generic language representations. In Stage 2, the goal was to further adapt the entire model for the specific classification task.
- –
  
  Model Configuration: Unlike Stage 1, this stage, it was allowed updates to both the custom classifier (LSTM and linear layers) and the model.
- –
  
  Training Focus: During Stage 2, the model was fine-tuned using the labelled data for the classification task. This process allowed the model to adjust the learned representations of the text data, including those learned before, to fit the classification task’s nuances better. Essentially, it allowed the model to make more significant changes to its internal representations based on the specific data it was working with.

This two-stage process was designed to leverage the power of pre-trained language models like BERTimbau Base while still tailoring them to this paper’s specific NLP classification task. Stage 1 focused on adapting the classifier, and Stage 2 further adapted the entire model to maximize its performance on the task at hand.

7.

Validation

After each epoch, the classifier was set to evaluation mode, iterating over the validation data and calculating loss and accuracy. Additionally, it computed precision, recall, F1-score, and AUC-ROC for each class (e.g., positive and negative).

8.

Results Collection and Saving

We computed metrics, and results were collected and written in the output CSV file.

Evaluating the model

The trained classifier was evaluated using the validation data. We passed the validation data through the model, and the logits were obtained. The cross-entropy loss was computed, and evaluation metrics such as accuracy, precision, recall, F1 score, and area under the ROC curve (AUC) were calculated. We used the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) scores for ranking.

Based on the evaluation metrics calculated during the finetuning process, we implemented a normalization function to normalize the values using a weighted approach. For each value, the function calculated the root of the sum of squares and then applied weighted normalization to each element. The weights were determined based on the literature on the area and the researcher’s knowledge of the importance of the variables.

Further, the score calculation involved computing the distance from each column’s ideal best and worst values. We summed up these distances after squaring each value. Finally, the TOPSIS score for each model was computed as the distance ratio from the ideal worst to the sum of the distances from the ideal best and ideal worst. The ranking of the best models was derived based on their TOPSIS scores.

Table 3: TOPSIS weights

Train loss	Train accuracy	Val. loss	Val. accuracy	Val. precision	Val. recall	Val. F1 score	Val. auc roc1	Val. auc roc2
0.1	0.1	0.2	0.2	0.1	0.1	0.1	0.1	0.1

Source: build by the author, 2024. “Val.” means Validation. “auc roc” means Area Under the Receiver Operating Characteristic Curve.

IRT – Rasch Model

The final step involved structuring the dichotomous dataset, where each row represented the sentiment of the news articles (1 for positive and 0 for negative), and each column represented a month of the years 2022 and 2023. This arrangement allowed for a comprehensive temporal analysis of sentiment trends across the specified period.

IRT is a family of mathematical models used to analyze the relationship between individuals’ latent traits (unobservable characteristics or attributes) and their item responses on assessments or questionnaires. IRT provides a framework for understanding how specific test items function across different latent trait levels. It is often used in educational testing, psychological assessments, and health outcomes measurement. IRT models include the Rasch model, the one and two-parameter logistic model, and the three-parameter logistic model, each differing in complexity and assumptions about the data [3].

We employed the Rasch psychometric model to analyze the psychometric properties of the ESG construct. The Rasch model, a specific form of IRT, is particularly suitable for dichotomous data and is widely used in psychometrics. It facilitates the measurement of latent traits by modelling the probability of a given response as a function of the respondent’s ability and the item’s difficulty. This model helped understand how different periods (represented by months) contributed to the overall ESG sentiment, enabling a detailed assessment of its psychometric characteristics. Moreover, the "individual" being analyzed is the Petrobras company, and the news articles served as the responses to the items in the dataset. The primary objective was not to examine the specific result of each news item but to assess how each month’s positive or negative sentiment affected the ESG construct.

This analysis helped us to understand the temporal dynamics of ESG sentiment for Petrobras based on media news and may guide future strategies to measure ESG. By structuring the dataset to reflect monthly sentiment trends, we can gain valuable insights into how different periods impact the company’s ESG profile.

When applied to ESG measurement, the concepts of IRT might not directly translate into the traditional sense used in educational or psychological assessments. Instead of "proficiency," the latent trait measured is the sentiment level regarding ESG news. This sentiment level indicates how positive or negative the news is perceived. Similarly, item difficulty (monthly difficulty parameters) is understood differently. Higher difficulty values for a particular month indicate that achieving positive news during that month was harder. This context could be due to more negative news or challenging ESG-related events.

To analyze these factors, we presented each month’s Item Information Curves (IIC) and Item Characteristic Curves (ICC). The IICs show how much information each month provides about the sentiment levels. Peaks at different sentiment levels indicate that certain months are more informative for different sentiment ranges. The ICC illustrates the probability of receiving a positive/negative sentiment score as a function of the underlying sentiment trait.

We can understand the temporal trends in ESG sentiment by analyzing the difficulty parameters and information curves. Months with greater difficulty and more information could indicate critical periods where significant ESG events occurred. These periods are particularly informative for understanding the challenges and achievements in the company’s ESG initiatives. This comprehensive analysis highlights key periods of ESG performance and provides a framework for ongoing monitoring and improvement of ESG strategies.

Results

Considering the TOPSIS score, the model below presented the best results:

Table 4: Best model parameters combination

Parameters	Best Model
Learning rate	2e-5
Number of layers	5
Hidden Size	768
Batch Size	20
Epochs	10
Max Length	200

Source: build by the author, 2024.

The confusion matrix below shows how the model classified the news after the training.

Fig. 2: Confusion matrix

Source: build by the author, 2024. Class 0: Negative ESG News. Class 1: Positive ESG News.

The confusion matrix results indicate that the classification model used to distinguish between negative and positive ESG news is highly effective. The model correctly identified 100% of the negative ESG news (Class 0) without any false positives, demonstrating precision for negative news. It also achieved a 97.37% accuracy in identifying positive ESG news (Class 1), with a false negative rate of 2.63%.

The figures below show the IIC for 2022 and 2023.

Fig. 3: IIC, 2022

Source: build by the author, 2024.

Fig. 4: IIC, 2023

Source: build by the author, 2024.

Each curve on the IIC shows the amount of information a particular month provides across these sentiment difficulty levels. Peaks at different points indicate which months are more informative for different ranges of sentiment difficulty. For instance, months with peaks at higher sentiment values are more informative for capturing difficult (negative) sentiment, while those peaking at lower values are more informative for easier (positive) sentiment.

The figures below show the ICC for 2022 and 2023.

Fig. 5: ICC, 2022

Source: build by the author, 2024.

Fig. 6: ICC, 2023

Source: build by the author, 2024.

The ICCs for 2022 and 2023 illustrate the probability of achieving positive ESG sentiment across different months. Considering that a sentiment value of 1 is positive, we move from left to right on the x-axis from easiest (most positive news) to most difficult (most negative news). These curves help identify key periods with significant positive or negative ESG events, aiding in understanding temporal dynamics in ESG sentiment and guiding strategic ESG measurement improvements.

Discussion

This study has explored the potential of using IRT, specifically the Rasch model, to analyze ESG sentiment data. Applying the Rasch model in this context may enhance the measurement methods of this construct.

One of the main advantages of using the Rasch model is its ability to provide a more precise measurement of ESG sentiment. By accounting for both item difficulty (in this case, variability in sentiment across different months) and respondent ability (in this case, the Brazillian company Petrobras), the model places all data on an interval scale. This not only facilitates more accurate comparisons over time but also ensures that these comparisons are independent of the specific sample used [3]. The scale independence may be important for ESG scoring process, where variability can arise from numerous external factors, such as market events or regulatory changes.

The Rasch model’s capacity to handle variability in item characteristics allows for a nuanced understanding of how different months contribute to the overall ESG sentiment. This is particularly valuable for organizations aiming to track sentiment trends related to ESG topics, as it can identify periods of heightened scrutiny or significant sentiment shifts.

The use of IICs and ICCs in the Rasch model provides additional insight into the informativeness of each month. This helps in identifying which time periods offer the most significant data for assessing ESG sentiment, thereby guiding more targeted analyses and strategic responses. Additionally, the latent trait estimation offered by the Rasch model may contribute to a more accurate depiction of the underlying sentiment trends than simple averages, which can be skewed by outliers or non-normal data distributions, or techniques used on the studies of the area [10, 2, 1].

The precision and reliability of the Rasch model may contribute to ESG accountability calculations. By providing accurate and consistent sentiment scores, the model enhances transparency and comparability across different periods and among various organizations. This is important for stakeholders, including investors and regulators, who rely on clear and credible ESG reporting. The model’s ability to identify key periods and provide objective criteria for performance measurement allows for more strategic planning and targeted interventions.

The findings also highlight the potential for integrating the Rasch model’s output with other ESG metrics and qualitative data already used on the study area [10, 2, 1]. This integration can provide a comprehensive view of ESG performance, supporting a balanced scorecard approach.

Conclusion

This exploratory study contributes to ESG research by introducing the use of the Rasch model, a method from IRT, to analyze ESG sentiment data. The study’s findings highlight the model’s advantages in providing precise, reliable, and consistent measurements compared to traditional approaches. By effectively handling variability and offering deeper insights into the latent traits of ESG sentiment, the Rasch model enables a more nuanced and accurate assessment of ESG performance. This methodological advancement enhances the accuracy of ESG accountability measures, supports more informed decision-making processes, and strengthens stakeholder trust through clearer and more credible reporting.

Future studies could further explore and refine the application of the Rasch model in ESG sentiment analysis. One potential area for advancement is the integration of more diverse data sources, such as social media sentiment, stakeholder surveys, and financial performance metrics, to enrich the analysis and provide a more comprehensive view of ESG performance. Additionally, longitudinal studies could investigate the stability and predictive power of ESG sentiment measured by the Rasch model over time, assessing how these sentiments correlate with actual ESG outcomes and broader market or societal trends.

References

[1] I. Atanassova, M. Potier, M. Mathie, M. Bertin, and P. K. Ningrum. Criticalminds: Enhancing ML models for ESG impact analysis categorisation using linguistic resources and aspect-based sentiment analysis. In Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing@ LREC-COLING 2024, pages 248–253, 2024.
[2] T. Aue, A. Jatowt, and M. Färber. Predicting companies’ ESG ratings from news articles using multivariate timeseries analysis. arXiv preprint arXiv:2212.11765, 2022. Doi: https://doi.org/10.48550/arXiv.2212.11765.
[3] T. G. Bond and C. M. Fox. Applying the Rasch model: Fundamental measurement in the human sciences. Psychology Press, 2013.
[4] D. Johannßen and C. Biemann. Between the lines: Machine learning for prediction of psychological traits-a survey. In Machine Learning and Knowledge Extraction: Second IFIP TC 5, TC 8/WG 8.4, 8.9, TC 12/WG 12.9 International Cross-Domain Conference, CD-MAKE 2018, Hamburg, Germany, August 27–30, 2018, Proceedings 2, pages 192–211. Springer International Publishing, 2018.
[5] C. J. Kennedy, G. Bacon, A. Sahn, and C. von Vacano. Constructing interval variables via faceted Rasch measurement and multitask deep learning: A hate speech application. arXiv preprint arXiv:2009.10277, 2020. Doi: https://doi.org/10.48550/arXiv.2009.10277.
[6] J. P. Lalor, H. Wu, and H. Yu. Building an evaluation scale using item response theory. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2016, page 648. NIH Public Access, 2016. Doi: https://doi.org/10.48550/arXiv.1605.08889.
[7] L. Michalski and R. K. Y. Low. Corporate credit rating feature importance: Does ESG matter? Available at SSRN 3788037, 2021. Doi: http://dx.doi.org/10.2139/ssrn.3788037.
[8] S. Rathi, J. P. Verma, R. Jain, A. Nayyar, and N. Thakur. Psychometric profiling of individuals using Twitter profiles: A psychological Natural Language Processing based approach. Concurrency and Computation: Practice and Experience, 34(19):e7029, 2022. Doi: https://doi.org/10.1002/cpe.7029.
[9] C. P. Soares and M. da Penha Vasconcellos. Proposal of a socio-environmental vulnerability scale of municipalities that receive financial compensation. 2022. Doi: https://doi.org/10.31235/osf.io/jg4a8.
[10] A. Sokolov, J. Mostovoy, J. Ding, and L. Seco. Building machine learning systems to automate ESG index construction. 2020.
[11] G. Zairis, P. Liargovas, and N. Apostolopoulos. Sustainable finance and ESG importance: A systematic literature review and research agenda. Sustainability, 16(7):2878, 2024. Doi: https://doi.org/10.3390/su16072878.
[12] I. Zumente and J. Bistrova. ESG importance for long-term shareholder value creation: Literature vs. practice. Journal of Open Innovation: Technology, Market, and Complexity, 7(2):127, 2021. Doi: https://doi.org/10.3390/joitmc7020127.