Abstract
For a healthier democracy in the UK, novel methods of visualising political data are key to improving transparency and encouraging engagement. This paper proposes a visualisation tool that uses Large Language Models (LLMs), such as GPT-3.5 and GPT-4, to conduct Natural Language Processing (NLP) within a novel methodology. We investigate partisan voting profiles, specifically of the Conservative, Labour, and Liberal Democrat parties, along 11 predetermined dimensions ranging from Immigration and Borders, through Welfare and Social Housing, to European Union and Foreign Affairs. Higher-order dimensions reveal shifts in party preference over time, while clear trends of more extreme voting behaviour can be seen across parties between 2016 and 2023. The novel visualisation methodology reveals that voting behaviour has become more polarised along party lines, with Labour becoming more left-wing and the Conservatives becoming more right-wing on most political topics. Liberal Democrat voting behaviour has typically been that of an opposition party, albeit becoming somewhat more extreme.
Introduction
If we don’t know how our representatives vote,
then how can we hold them to account?
It is the proposal of this paper that an information asymmetry exists between constituents and representatives in the UK, which prevents a proper dynamic between these two groups. The existence of an ‘information gap’ may be a social problem, but its solution falls in the domain of data science and visualisation tools. This paper bridges two fields—examining government and politics through a lens of data science and modelling.
UK democracy can be characterised by three key observations that combined point to a concerning trend that motivates this paper:
1. Generally, the quality of democratic societies scales negatively with participation: ‘The highest levels of disengagement have occurred in 16 out of the 20 countries classified as “full democracies”’ [1];
2. Youth participation in general elections is especially poor in western democracies [2, 3]; turnout for 18–34 year olds in the UK has averaged 50% over the past two decades [4];
3. Less than 40% of UK participants could name their local representative [5], let alone on what, or how, they last voted.
The common logic that ‘the most politically motivated members of society are the young’ appears to be refuted by actual participation statistics. The simple explanation given for this is that the right to vote is ‘taken for granted’, or that people are ‘not interested in politics’ [2]. While this is likely a partial explanation, a more challenging interpretation of the observation above is that people generally feel under-represented by the current system.
The problems at the crux of this paper highlight the disillusionment of youth in western democracies, despite growing engagement in alternative political activity such as protests, online petitions, and community action [6, 7]. If apathy cannot explain low youth participation, then disengagement must stem from a different source.
‘They are apathetic because they are powerless, not powerless
because they are apathetic’—Barber [8]
Policy designed to increase engagement, such as compulsory voting, has been successful in Australia and Belgium [9], but is criticised in the UK because of the public’s lack of political awareness [10, 11]. In the context of this paper, the ‘information gap’ refers to the inability of more than half of UK constituents to name their local representative, let alone what they voted on, or how they voted in a division [6]. If we want to close this gap, digitally democratic projects in the UK should focus on platforms designed for ‘information provision’, which will act as a stepping-stone toward further political engagement. Only with greater political awareness are more mature digital democratic concepts likely to succeed.
The primary contribution of this paper is to propose a visualisation tool that allows UK citizens to make more informed decisions about who represents them in parliament.
Our results section will show that new advances in NLP are able to combine, quantify, and visualise vast textual data, producing visualisations that are simple but sufficiently expressive to differentiate party positions in Parliament.
The secondary aim of this paper is to explore trends in the data, to validate the methodology, and connect the statistics to the UK’s cultural context.
Ian Dunt, a political commentator, and Rory Stewart, a former MP, both claim a lack of serious legislative criticism in the House of Commons over the past decade [12, 13], suggesting that Commons debate is being used for partisan political grandstanding rather than compromise and policy discussion. This could be expressed in the voting record as polarisation, and leads to our hypotheses:
- H0: There is no evidence of party political polarisation in the House of Commons voting record.
- H1: There is evidence of party political polarisation in the House of Commons voting record.
Literature review
Related literature on political visualisation tools has matured in step with government transparency. Most significant contributions use open-source parliamentary data as the backbone of their projects [14,15,16]. Projects tend to focus heavily on US [15, 17, 18], South American [19, 20], and Indian [21, 22] democracies because of their relative importance, complexity, and size respectively.
Broadly speaking, the literature can be split into three camps: visualisations for conveying election results, for supporting exploratory analysis, or for comparing ideological difference.
Of these three, the need to articulate election results to the general public has resulted in the most simple, but effective, visualisations. Congress seat percentages, typically shaped as a donut, and cartographic vote maps are prolific in the literature and the media [14, 21, 23, 24].
There is a tendency for visualisation tools to target political scientists, journalists, and those with a strong political education [14, 15, 22, 25], despite a general aspiration to promote citizens’ education [21, 25,26,27,28]. This disconnect between intention and outcome can result from a focus on interactivity, which grows the scope of possible findings at the cost of intimidating users [25]. The disconnect can also result from the complexity of visualisations, which require greater standards of graph literacy [21].
The literature comparing ideological difference is the least concerned with public accessibility. Approaches use either roll-call data [29, 30] or Natural Language Processing (NLP) on parliamentary debates [31, 32]. Spatial models are overwhelmingly used in roll-call analysis. Visualisations generally result in two-dimensional scatter plots, optimised for predictive power in voting behaviour. This literature generally claims that high dimensionality is desirable, but too computationally expensive [33], and unnecessary given the high predictive power of 1 or 2 dimensions. In the UK context, parties exert significant influence over representatives’ voting [30, 34]. As such, quantitative measures of alternative data such as debate transcripts are often deployed to identify legislator preference, rather than using roll-call data directly.
We intend to follow the literature comparing ideological difference, embracing concepts from both intellectual pathways. Combining metadata and roll-call data is not a novel concept [33], although prior models are ultimately designed by scholars for scholars [29]. We depart from spatial roll-call analysis by maintaining higher dimensions. We suggest that higher dimensionality is desirable for measuring ideological difference, not because of its predictive power, but because of its explanatory value. This is important to maximise information while maintaining relevance for the public. Visualisations are shape-based, simultaneously making comparison easier and lowering graph literacy requirements.
The visualisation tool described in this paper requires a novel methodology, and so content is heavily weighted towards methods. We resort to LLMs because the complexity and quantity of data presents a significant barrier to manual labelling, while traditional NLP techniques lack the flexibility and context to solve the tasks alone. The novelty of LLMs makes it prudent to introduce, briefly, the models and techniques being used.
The launch of ChatGPT by OpenAI in November 2022 [35] marked the first time the public had access to LLMs as powerful as GPT-3. The rapid rate of progress in the field during the writing process meant that models such as GPT-4 became available by March 2023, and these were utilised for later tasks.
GPT stands for Generative Pre-trained Transformer. A high level conceptualisation of these models can be understood by breaking down these three terms. To simplify the concepts we can approximate terms like ‘tokens’, as ‘words’, and ‘embeddings’, as vectorised tokens.
- “Generative”—ChatGPT creates a new output by predicting one token at a time. The model is probabilistic, sampling from a weighted distribution of likely tokens.
- “Pre-trained”—The parameters within the model have been learned from vast text examples; they interact with input data through highly tuned weights within the model.
- “Transformer”—Refers to the model architecture. It can be broken into three core concepts: embedding/un-embedding, attention, and Multi-Layer Perceptron (MLP) blocks. Embedding layers convert input text into tokens, which can be vectorised as the input to the model. Un-embedding reverses this step, turning token position embeddings into probability distributions at the end of the process.
These vectors pass through ‘attention blocks’, which identify the strength of relationships between multiple token embeddings at a time, based on the context within the original input. For example, in ‘the smooth rock’, the embedding of the noun ‘rock’ needs to attend to the embedding of the adjective ‘smooth’ a great deal, while ‘the’ has a lesser impact on ‘rock’. Attention then shifts the embedding of ‘rock’ in a high-dimensional space toward ‘smooth’ concepts and away from ‘rough’ or ‘jagged’ concepts.
Finally, MLP blocks identify nuanced relationships within the text using up and down projections. Attention and MLP blocks are repeated, allowing for greater nuance, before the final un-embedding layer produces one new word of the output at a time.
The process generates outputs which at least give the illusion of deep understanding. This is enough to solve text-based problems too difficult for traditional NLP techniques (Fig. 1).
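To make the attention step described above concrete, the following is a minimal numerical sketch of scaled dot-product attention for the ‘the smooth rock’ example. The embedding values are invented for illustration, and real models use learned query, key, and value projections rather than the identity used here.

```python
# A minimal numerical illustration of the scaled dot-product attention step
# described above, using toy 4-dimensional embeddings (all values invented).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy embeddings for the tokens "the", "smooth", "rock"
E = np.array([[0.1, 0.0, 0.2, 0.0],    # the
              [0.9, 0.8, 0.1, 0.0],    # smooth
              [0.2, 0.7, 0.9, 0.4]])   # rock

# In a real model Q, K and V are learned projections of E; identity is used here
Q, K, V = E, E, E
scores = Q @ K.T / np.sqrt(E.shape[1])  # how much each token attends to the others
weights = softmax(scores)               # 'rock' places more weight on 'smooth' than on 'the'
contextualised = weights @ V            # embeddings nudged toward attended-to concepts
```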
Methods
Studying voter preference amongst MPs (or amongst groups of MPs, say at party level) using decades of parliamentary debates, would have been almost impossible before the emergence of NLP and LLMs. But this is now a viable task for modern AI. To avoid bias, it is desirable to produce patterns organically from the data, rather than imposing any structures from the top down. This leads to a number of key methodological challenges:
- Clustering—Can the vast number of divisions be simplified into digestible clusters? If so, how many clusters? What clustering tools are suitable?
- Classification—Does a division positively or negatively impact its given cluster?
- Visualisation—How best to present the data?
These challenges will be used as a structure for the methods section.
Data
Voting data pins down an MP’s ‘opinion’ at a given time. We make use of the Hansard public record of divisions (votes). As data are missing between 2004 and 2006, a time frame between 01/01/2006 and 18/10/2023 was selected. This period contains a full record of all the division bells. The Data.Parliament API contained 4021 divisions in that time period, with the following parameters—Division Title, Date, Number of Votes For, Number of Votes Against, and the names of MPs that participated in the division.
The API date and title were used to match divisions on the Hansard.Parliament website, which counted 4043 divisions in the same time frame. Hansard.Parliament was used to manually collect 1115 ‘Debate As Text’ .txt files relating to 2663 divisions in the Data.Parliament API. These files contained between approximately 1000 and 80,000 words, with an average of around 28,000 words linked to the debates preceding a given division (Footnote 1).
The 1380 missing divisions are due to a combination of missing entries and discrepancies in title and dates between the Data API and Hansard websites. In total 32,890,533 words were processed to classify the data.
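As an illustration of this matching step, the sketch below joins the two sources on title and date. The file and column names are assumptions for illustration, not the original pipeline.

```python
# Illustrative sketch of matching Data.Parliament divisions to Hansard debate
# files by date and title. File and column names are assumed, not the original pipeline.
import pandas as pd

api_divisions = pd.read_csv("data_parliament_divisions.csv")   # Division Title, Date, ...
hansard_index = pd.read_csv("hansard_debate_index.csv")        # Title, Date, DebateFile

matched = api_divisions.merge(
    hansard_index,
    left_on=["Division Title", "Date"],
    right_on=["Title", "Date"],
    how="left",
)
missing = matched["DebateFile"].isna().sum()   # divisions with no matching debate text
print(f"matched {len(matched) - missing} of {len(matched)} divisions")
```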
Clustering
The first task was to simplify the data by grouping divisions into similar clusters. These clusters will be used as dimensions in an MP’s vote profile. Three traditional NLP methods were tried, but each failed to reduce the divisions to a small number of relevant clusters using ‘Division Title’ alone. This led to the use of ChatGPT-3.5.
Data engineering
The size of the ‘Debate As Text’ files meant many NLP algorithms using term-document frequency matrices were too computationally expensive. Instead, ‘Division Titles’, as seen in Table 1, were used for clustering.
Traditional NLP
Three clustering algorithms were applied to the unique list of 784 division titles—K-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and hierarchical clustering. All methods were ineffective, as grouped titles were not distinct enough between clusters, yet too diverse within clusters. The titles alone did not contain enough information to successfully complete the clustering task using traditional NLP methods; evidence for this is the sparseness of the various term-document frequency matrices. Briefly (an illustrative sketch of these approaches follows the list below):
- K-means, optimised using the elbow method, settled around 550 clusters, which does not simplify the divisions enough.
- DBSCAN worked best with a minimum cluster size of 2 and epsilon set at 2.14. With these parameters DBSCAN identified 32 clusters, leaving 603 titles unclustered, before one group would expand to consume all unclustered titles.
- Hierarchical clustering with a Jaccard distance matrix and the Ward.D2 method was the most successful traditional strategy. Silhouette scores indicated a peak around 32 clusters; however, obscure clusters could be identified, and 32 dimensions does not simplify the data enough.
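The sketch below illustrates how these three approaches could be applied to the title list. The placeholder titles, feature representation, and exact pre-processing are assumptions that only loosely mirror the settings reported above.

```python
# Illustrative sketch of the three traditional clustering attempts on division
# titles. Placeholder titles and the TF-IDF/binary representations are assumptions;
# parameter values echo those reported in the text.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans, DBSCAN
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

titles = [
    "Illegal Migration Bill: motion to disagree with Lords Amendment 1",
    "Nationality and Borders Bill: Third Reading",
    "Finance Bill: Second Reading",
    "Energy Bill [Lords]: Report Stage",
]  # placeholder examples

# Sparse term-document representation of the titles
tfidf = TfidfVectorizer(stop_words="english").fit_transform(titles)

# 1. K-means: sweep k and inspect inertia for an "elbow"
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf).inertia_
            for k in range(2, min(len(titles), 10))}

# 2. DBSCAN with a minimum cluster size of 2 (eps would be tuned, e.g. 2.14)
db_labels = DBSCAN(eps=2.14, min_samples=2).fit_predict(tfidf.toarray())

# 3. Hierarchical clustering on a Jaccard distance matrix with Ward linkage
binary = CountVectorizer(binary=True).fit_transform(titles).toarray().astype(bool)
jaccard = pdist(binary, metric="jaccard")
hc_labels = fcluster(linkage(jaccard, method="ward"), t=32, criterion="maxclust")
```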
The failure of these traditional approaches led us to turn to LLMs for the purpose of clustering.
LLM
At this stage, ChatGPT-3.5 was used to identify 12 clusters, including 1 miscellaneous cluster that captured poorly defined categories. Reinforcement Learning from Human Feedback (RLHF), a feature of ChatGPT’s model design, meant the clusters were relevant and interesting.
The prompt design was ‘Split, Zero-shot’. Splitting the task across two prompts avoided token limits, while encouraging output coherence.
GPT-3.5 was very successful at identifying clusters and combining the titles under a single heading such as ‘Immigration and Borders’. GPT-3.5 was inaccurate at a secondary task asking it to iteratively assign titles to each cluster via ChatGPT’s API, because each separate prompt would launch a new ‘conversation’ that did not remember the previous clustering process. This would regenerate clusters with each prompt. The model’s inherent randomness meant much of the cluster assignment had to be checked manually.
2241 titles out of the 2663 divisions fit into the 11 clusters excluding miscellaneous. This is not a rigorous measure of accuracy, because it is measured differently to traditional NLP methods, which would measure the level of fit within and between clusters. However, the fact that 84.15% of divisions clustered into 11 meaningful categories gives some idea of the effectiveness of the method.
In this first task, the LLM took only the list of pre-processed titles and a prompt. The prompt design used few prompt engineering techniques, which made it very accessible. The creativity parameter, ‘temperature’, was set to 0.1 in ChatGPT-3.5 to make it possible to reproduce similar clusters if necessary. At the time, OpenAI had not released the ability to set a seed within the model, hence the use of a low temperature.
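A minimal sketch of an API-based clustering prompt is given below. The prompt wording is a paraphrase rather than the exact prompt shown in Table 2, and the client code assumes the current openai Python SDK rather than the early-2023 interface actually used. Note that each API call starts a fresh ‘conversation’, which is the statelessness issue described above.

```python
# Sketch of prompting an OpenAI chat model to cluster division titles.
# Prompt text is a paraphrase (assumption), and the SDK usage assumes the
# current openai Python client; the original work used early-2023 tooling.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def cluster_titles(titles: list[str]) -> str:
    prompt = (
        "Group the following UK parliamentary division titles into roughly a "
        "dozen thematic clusters, including a 'Miscellaneous' cluster, and "
        "give each cluster a short heading:\n" + "\n".join(titles)
    )
    # Each call is stateless: any previously agreed clusters would have to be
    # re-sent in the messages list to keep assignments consistent.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.1,  # low temperature to encourage reproducible clusters
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```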
In January 2023, GPT-4 had not yet been made publicly available, which is why GPT-3.5 was used for this task. Table 2 shows the initial prompt, and Table 3 shows the final output of the clustering conversation.
Classification
Before visualising an MP’s voting record, we need to identify how each division impacts its given cluster.
A problem identified by ‘The Public Whip’ is the complexity of divisions [37]. A division can be a vote for or against a bill, an amendment to a bill, or a motion to disagree with an amendment to a bill. Experts are needed to understand the complexity of any debate before breaking down the logic behind each division call. ‘The Public Whip’ expects errors, and so relies on some degree of crowd-sourcing to review its data.
We propose that, with very tight prompt engineering, an LLM could outperform individual experts, and take a fraction of the time. For each division we are looking for either a 1 or a −1, corresponding to the relationship between the division and its cluster. For example, will voting ‘For’ the division:
‘Illegal Migration Bill: motion to disagree with Lords Amendment 1’
tighten or loosen immigration and border control?
If it ‘tightens’ we can call this Sentiment: 1, and if it ‘loosens’ we can call this Sentiment: −1. That way, when paired with a vote (Ayes: 1, Abstain: 0, Noes: −1), we can identify a politician’s stance by taking the product of sentiment and vote.
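A minimal sketch of this product, using an illustrative sentiment lookup and hypothetical values:

```python
# Sketch of the stance calculation described above: the product of a division's
# sentiment towards its cluster (+1 tightens / -1 loosens) and an MP's vote
# (Aye = +1, Abstain = 0, No = -1). Values and lookup tables are illustrative.
SENTIMENT = {"Illegal Migration Bill: motion to disagree with Lords Amendment 1": 1}
VOTE = {"Aye": 1, "Abstain": 0, "No": -1}

def stance(division_title: str, vote: str) -> int:
    """+1 means the MP voted to tighten the cluster's policy axis, -1 to loosen it."""
    return SENTIMENT[division_title] * VOTE[vote]

# An MP voting Aye on the motion above registers +1 on 'Immigration and Borders'.
assert stance("Illegal Migration Bill: motion to disagree with Lords Amendment 1", "Aye") == 1
```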
In this methodology, cluster assignment is based on unique titles, which tends to group in irrelevant divisions. The classification task is therefore multi-class, identifying divisions that relate positively, negatively, or neutrally to their given cluster. An example of a neutral classification might be a ‘procedural’ division about the deadline for a given bill, clustered into ‘Immigration and Borders’; most people would not consider the result of such a division indicative of an MP’s position on the matter.
The LLM GPT-4 was used because of its accessibility and feature list, rather than the model design itself. Specifically, we used ‘Document Scan’, an addition to ‘Bing Chat’ arising from Microsoft’s partnership with OpenAI. While large language models at the scale of GPT-4 are successfully transferable, smaller models have reduced accuracy when prompted with tasks using text outside of the model’s training data [38]. Without access to the parameters inside the GPT-4 model, document scan was a key feature to ensure the model was exposed to some relevant data during classification. Document scan (Footnote 2) was not available with ChatGPT-3.5, which necessitated the switch of large language models during the project.
The prompt design for the classification task was a single prompt, ‘Zero-shot, Chain of Thought (CoT)’, with document scan [39, 40]. CoT is useful for complex tasks because justifying its classification gives the model time to ‘think’, and the prediction of the final letter, A or B, will then be consistent with its prior reasoning. In the relevant literature this has been shown to drastically improve accuracy while reducing hallucinations [40, 41] (Table 4).
Other techniques, such as ‘tagging’ the text (repeating words across the prompt and the document being scanned), help to locate the correct division, which the model can otherwise conflate with the bill itself. In the final prompt, parameters were not set, because in testing this seemed to have little effect, and there was no way to check whether a parameter set in text was being applied consistently in BingChat. Demonstrating the expected ‘output format’ in the prompt was key to producing consistent formatting (Table 5), which in turn was key for post-processing and visualisation.
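A small sketch of such post-processing is shown below; the ‘Classification: X’ pattern is an assumed stand-in for the output format demonstrated in Table 5, which is not reproduced here.

```python
# Sketch of post-processing a model response into a usable label. The
# "Classification: X" pattern is an assumption standing in for the actual
# output format demonstrated in the prompt (Table 5).
import re

LABEL_TO_SENTIMENT = {"A": 1, "B": -1, "C": None}  # C (neutral/unsure) is discarded later

def parse_classification(model_output: str):
    match = re.search(r"Classification:\s*([ABC])", model_output)
    return LABEL_TO_SENTIMENT[match.group(1)] if match else None

example = "The amendment would relax detention rules... Classification: B"
print(parse_classification(example))  # -> -1
```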
When constructing this classification method in a social science context, we found that issues such as ‘Economy and Finance’ created a potential point of failure. Economic growth can be considered a bipartisan issue, because no party supports a weaker economy in the UK.
This poses a significant theoretical problem, because any pair of opposing policy positions must meet two criteria. First, it has to separate Right vs Left political stances within a country’s political context. Second, the large language model itself must be able to distinguish between the two positions effectively. If the first criterion is not met, the findings would contain little information about party positions. If the second criterion is not met, the reliability of the model would be reduced.
We resolved this issue by selecting Left vs Right policy positions suggested by the model itself. In the case of ‘Economy and Finance’, we worked with GPT-4 to manually identify ‘Stimulus and the Free Market’ vs ‘Austerity and Regulation’. This approach had two benefits: it limited the influence of our own subjectivity and political ignorance, while identifying policy positions the model could easily differentiate.
This process began by prompting the model to suggest possible political positions on a given cluster. Selecting the best results from this step, we could iteratively ask the model to find opposing positions with prompts like: ‘If I was debating on the topic: “Economy and Finance”, and my position was “Stimulus and the Free Market”, what would the opposing argument be?’. This resulted in Table 6.
Model validation
The data are unlabelled, and GPT-4 is essentially being used as a transfer-learning model to solve a classification task. The standard method to check reliability is to split the data into test and training sets. In this scenario GPT-4 is pre-trained, so we only need a suitable test sample, traditionally greater than 25% of the data.
Due to the complexity and volume of legislation, creating a substantial test set of that size was unachievable given the lack of cross-discipline expertise, funding, and time (it would represent years of work).
Using the model’s inherent randomness, we can test the reliability of prompt designs, and use this to infer the effectiveness of the methodology. If the model is unable to produce consistent classifications based on the same text inputs, then the model is unreliable at the task. If, despite the variation in output text, the final classification is consistent across prompts given the same inputs, this gives some idea of how confident the model is in its classification.
Testing involved 10 repetitions for a given prompt; up to 15 examples would be tested, checking for consistency. There were two main outcomes (a sketch of this consistency check follows the list below):
- Consistent Classification—The vast majority of final prompts would classify a division in the same way, returning A or B for all 10 repetitions. It can be assumed that there is some error rate, however it was very low. Very rarely, the model would identify C consistently for all 10 repetitions. This would usually occur if the division was incorrectly clustered; for example, a division on ‘Parliamentary Procedure’ about an MP’s conduct was incorrectly clustered into ‘Crime and Justice’.
- Inconsistent Classification—This was an infrequent error, but a limitation of the model and method. This problem occurred if the classification A or B was subjective, or the policy positions did not properly reflect the choice being made in the division. GPT-4 would justify its decision by stating ‘In the long-term...’ for classification A, vs ‘The direct impact of...’ for classification B. The model would often be aware of its uncertainty in the response, but would only classify the division as C about 20% of the time. The fact that GPT-4 would classify these uncertain divisions as either A or B 80% of the time means that there is certainly some degree of error in the data. It is very difficult to know how frequently this occurred during the classification task. In the end, misclassification just appears as noise on top of the majority of correctly classified debates.
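A minimal sketch of this repetition-based check, where `classify_division` is a hypothetical stand-in for the prompt-and-parse step:

```python
# Sketch of the repetition-based reliability check: re-run the same prompt ten
# times and treat the division as consistently classified only if all runs agree.
# `classify_division` is an assumed stand-in for the prompt-and-parse step.
from collections import Counter

def consistency_check(classify_division, division_id: str, repetitions: int = 10):
    labels = [classify_division(division_id) for _ in range(repetitions)]
    label, freq = Counter(labels).most_common(1)[0]
    return {
        "majority_label": label,
        "agreement": freq / repetitions,
        "consistent": freq == repetitions,  # unanimous agreement across runs
    }
```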
Visualisation
The following ‘voter preference’ diagrams provide visualisations at the resolution of the party. All votes in the period are multiplied by sentiment, such that each vote reflects an MP’s stance toward its given cluster. A mean party position is established for each division by tallying votes by party, then dividing by the total number of voting MPs in the given party. To find the average stance of a party member over a given period, we merge all divisions by cluster, simply taking another mean, resulting in a value between −1 and +1. The diagrams require the final output to be between 0 and 1, which is achieved by adding 1 to all values before dividing by 2. This customised form of linear scaling preserves the position of values within the range in a way min–max normalisation would not.
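A minimal sketch of this aggregation and scaling, assuming a table of sentiment-weighted votes with illustrative column names:

```python
# Sketch of the party-level aggregation: average sentiment-weighted votes per
# division, then per cluster, and rescale from [-1, 1] to [0, 1] via (x + 1) / 2.
# Assumes a DataFrame `votes` with columns: party, cluster, division_id,
# sentiment (+1/-1) and vote (+1/0/-1); column names are illustrative.
import pandas as pd

def party_profiles(votes: pd.DataFrame) -> pd.DataFrame:
    votes = votes.assign(stance=votes["sentiment"] * votes["vote"])
    # Mean stance of a party's voting MPs in each division
    per_division = votes.groupby(["party", "cluster", "division_id"])["stance"].mean()
    # Mean over all divisions in the cluster, then linear rescaling to [0, 1]
    per_cluster = per_division.groupby(["party", "cluster"]).mean()
    return ((per_cluster + 1) / 2).unstack("cluster")
```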
Outliers are often created when fewer than 5 votes on a given dimension occur in the set period. The resulting lack of expression appears to indicate more radical perspectives, which may be misleading. When this happens in the results section, the dimension is highlighted as an outlier before being centred at 0.5.
The charts are organised so that typically left-wing issues associated with 1 are on the left of the diagram, and typically right-wing issues associated with 1 are positioned on the right of the diagram. These associations shift over time, and so have dated the diagrams to some degree. The orientation of clusters helps differentiate between political stances by the shape of the diagrams, as seen in Fig. 2.
It is important to note that 1 does not mean good, and 0 does not mean bad. Instead, if a politician scores closer to 1 or 0, it implies they have stronger opinions on an issue, while approximately 0.5 indicates a more moderate, compromising stance.
Deciding which policy should be associated with 1 and which with 0 is intended to be arbitrary. A limitation of the diagrams is that people intuitively associate 1 with being ‘pro’ an issue, and 0 with being against it. In most cases this helped to logically organise the chart. Judgement calls were made for dimensions such as ‘Economy and Finance’, where the difference in A/B classification reflects opposing policy, e.g. ‘Stimulus and the Free Market’ vs ‘Austerity and Regulation’. Here stimulus was associated with 1 and austerity with 0.
Results
Our contribution is a tool that visualises MP voter profiles based on their voting record in parliamentary divisions. The visualisation depends on novel NLP-based clustering of textual data from each division and associated debates—containing in total 32,890,533 words. These visualisations (or Voter Preference diagrams) are similar to circular bar charts—see for example Fig. 3. In order to avoid controversy, this paper will visualise MP voter profiles at the resolution of the party rather than at the level of individual MPs. This masks some of the model’s expressiveness by taking the average stance of MP groups, but is necessary to focus the paper on method rather than political message.
The findings narrow their focus to the Conservative, Labour and Liberal Democrat parties as case studies. The three parties are selected based on their historic popularity between 2006 and 2023. For the purpose of demonstrating how the visualisation can capture changes in position over time, the plots below are broken into periods of 427 divisions each, which represent 3.4 years on average—approximately the average duration of a sitting parliament.
The key given in Table 6 is important in understanding the plots—although the diagrams are designed to make sense intuitively.
Some comments on the Conservative party voter preference diagrams:
1. The diagrams show greater levels of moderation between 2006 and 2012. Only in the areas of ‘Environment and Energy’ and ‘Health and Healthcare’ do the Conservative party exceed the 0.25–0.75 range, indicating strong views only on specific issues.
2. A clear trend can be seen from 2009, roughly coinciding with a Conservative House majority (2010), which shifts their vote profile right in the diagram. Beginning in 2012, seven of the eleven dimensions exceed the 0.25–0.75 range. This trend continues toward the present day, where the diagrams become quite distorted in dimensions such as ‘Immigration and Borders’ and ‘Defence and Armed Forces’. Some signs of moderation can be seen in ‘Standards and Technology’ and ‘Economy and Financial Services’.
3. Some notable shifts in Conservative policy stance are worth mentioning. For example, ‘Environment and Energy’ shifts from a core Conservative value between 2006 and 2009, scoring 0.815, to a key feature of their opposition, with a score of 0.141 by 2023. In the diagram this indicates a preference for ‘Economic Growth’ over ‘Environmental Sustainability’.
Some comments on the Labour party voter preference diagrams:
1. In the 2006–2009 period, all Labour party scores are less than 0.75, perhaps reflecting the centre-left stance of the Blair–Brown administrations.
2. Post Blair–Brown, and in opposition, there is a shift left, with scores of 0.75 and higher in ‘Health and Healthcare’, ‘Welfare and Social Housing’ and ‘Education and Learning’ emerging as we go from 2009 to 2023.
Other comments:
1. By 2021 the voting behaviour of both the Conservative and Labour parties is significantly distorted from a neutral position (0.5), while being diametrically opposed to each other.
2. The Liberal Democrats are an interesting third party for comparison. Their views are largely characterised by their opposition to the dominant party, rather than by any particular core value. In 2006 they look very similar to the Conservative party, while from 2016 onward they imitate the Labour stance.
3. Only in the intermediate periods from 2012 to 2016 do the Lib Dems take a unique vote profile, where they appear as the party of moderation; this is short-lived as the Conservative majority defines their later shape.
4. A few events have been highlighted, the EU referendum and ‘Partygate’, to which all parties react. This reaction is a good indicator of the model’s expressiveness and confidence, as it is able to capture sensible shifts in response to current events.
5. It is also worth noting the fairly neutral stance of all parties on ‘EU and Foreign Affairs’ before the 2016 EU referendum, with the Lib Dems slightly more positive than the other two. After the referendum, the Conservative party score drops significantly, whilst the other two increase above 0.75.
Evidence for strong classification capabilities can be seen in the final outputs. If significant misclassification had taken place, we would expect voter profile shapes to be generally circular, indicating high levels of randomness. Instead, Fig. 2 shows that between 2006 and 2023 Labour party votes shifted to the left of the diagrams, while Conservative votes generally shifted toward the right of the diagram. Further evidence for GPT-4’s reliability results from the extreme polarisation between 2021 and 2023. If the method were unreliable, each vote could be considered a random walk from the neutral position (0.5), and the probability of any dimension ending at, or close to, either pole (1 or 0) under a substantial degree of randomness is low.
Using this same logic, we could claim evidence of poor classification capabilities for the dimension of ‘Crime and Justice’, because of its limited fluctuations over the period 2016–2023. We reject the claim that this reflects poorly on the model’s reliability, because the observation is not consistent throughout the period between 2006 and 2016. Reviewing the individual divisions, the dimension appears to conflate two ideas. The Conservative party, which we would expect to support harsher punishments generally, vote against these punishments when the legal decision relates to parliamentary conduct or standards. The legal dimension of these divisions meant they were clustered into ‘Crime and Justice’ rather than ‘Parliamentary Procedures’, which creates the anomaly in the diagrams from 2016 onwards.
Hypothesis
Using the above visualisations, we can quantify the level of extreme voting behaviour in the House of commons by era.
Re-normalising the values around 0 and adjusting the scale from −1 to 1 for clarity, a measurement of the level of polarisation can be obtained by taking the absolute value of each dimension. Outliers highlighted in Figs. 3, 4, and 5 have been excluded.
Figure 6 shows that the trend between 2006 and 2023 is non-linear; as such, a Mann–Kendall test is the preferred measure of significance.
The results of the Mann–Kendall test show a moderate to strong positive trend in the data, with a Kendall’s tau of 0.435. The extremely low p-value (2.22e−16) indicates that this trend is highly statistically significant and very unlikely to have occurred by chance (Fig. 7).
Based on this we reject H0:
- H0: There is no evidence of party political polarisation in the House of Commons voting record.
Accepting H1:
- H1: There is evidence of party political polarisation in the House of Commons voting record.
If we accept the credibility of the methodology, we can claim strong evidence of partisan polarisation over the past decade, which supports claims of a reduction in legislative scrutiny in the House of Commons [12, 13].
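A minimal sketch of the polarisation measure and trend test described above is given below. Since the Mann–Kendall statistic is Kendall’s tau between the series and a time index, scipy’s kendalltau is used here as a stand-in for the authors’ exact implementation, which is an assumption.

```python
# Sketch of the polarisation trend test. The Mann-Kendall test statistic is
# Kendall's tau between the polarisation series and time, so scipy's kendalltau
# is used as a stand-in for the authors' exact implementation (assumed).
import numpy as np
from scipy.stats import kendalltau

def polarisation(profile_0_to_1: np.ndarray) -> np.ndarray:
    """Re-centre [0, 1] scores to [-1, 1] and take absolute values as 'extremity'."""
    return np.abs(2 * profile_0_to_1 - 1)

def trend_test(polarisation_series: np.ndarray):
    time_index = np.arange(len(polarisation_series))
    tau, p_value = kendalltau(time_index, polarisation_series)
    return tau, p_value
```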
Discussion
We can see that Labour becomes more left-wing, and the Conservatives become more right-wing, over the period 2006–2023. All parties appear to polarise to a similar degree in each period, suggesting this is a dependent behaviour, becoming more extreme as a reaction to the opposition. Between 2006 and 2009, the classification of a division had moderate predictive power over how a party would vote on a dimension. As a consequence of the polarisation, by 2023 dimensions such as ‘Immigration and Borders’ have become so extreme that the model has almost perfect predictive power.
The diagrams also reveal interesting dynamics in the UK’s multi-party political system, where the Liberal Democrats shift allegiances between 2009, when they vote similarly to the Conservatives, and 2023, when their vote profile is almost identical to Labour’s. We can use this to predict that, in the event of a large Labour majority in the upcoming 2024 elections, the Liberal Democrats will once again begin to embody a more conservative stance.
An easy misunderstanding of the graphs above is to assume that we are measuring some amount of ‘rebel rate’ within a party. This is a minor factor in the diagrams; the majority of the variance is created by the classification of bills and amendments. This has some interesting implications when inspecting individual MP diagrams. For example, if politician ‘J’ had a reputation for voting against their party on specific issues, you might expect their final diagram to look more similar to that of their opposition. In reality, their score between 0 and 1 combines all the divisions in which their party voted positively on the issue with some of the divisions in which the opposition voted positively on the issue. The resulting vote profile correctly shows the politician taking a more radical stance, but this diverges from the opposition’s party diagram to a greater, not lesser, degree.
The correct interpretation of these results recognises that the visualisations are created from the voting record, which is subject to significant structural bias, especially from the party whip in UK politics.
The discovery that voting profiles were becoming more extreme was an unexpected finding. We have described this trend as polarisation, but it refers specifically to an increased degree of tactical voting. This could result from a greater difference between parties along ideological grounds, which aligns with the standard definition of polarisation. It could also result from structural incentives that express themselves in the voting record but do not directly measure ideological difference.
Without proper grounding within a historical context, it is unclear whether the increase in polarisation is an alarming trend, or whether periods of polarisation and moderation occur in cycles as part of a healthy democracy.
We can say that today the UK is in a period of increased polarisation as measured by parliamentary divisions on key issues. This ultimately raises questions about the health of our democracy, and the ability of Parliament to effectively compromise, create new policy, or govern effectively.
Limitations
1. The methodology above does not capture abstention from a vote, because it is indistinguishable from absence in the voting record.
2. Minor changes in the prompt had massive impacts on the consistency of classification, making the methodology extremely fragile. For example, the words ‘associated with policy x’ were used in the final prompt because the model has been made incredibly cautious about making politically fuelled judgements. If asked directly whether voting Aye in a division would tighten border control, it had a high probability of rejecting the prompt under safety guidelines. BingChat (Co-pilot) has since increased its ‘safety’ parameters, limiting its effectiveness.
3. Classification C was introduced to the prompt as a ‘catch-all’ clause that discouraged the model from guessing in very difficult divisions. This classification meant discarding the divisions in post-processing. Despite the sensitivity to the prompt, C classification occurred in only 104 out of the 2241 divisions that made it to the classification task. C classifications include poorly clustered divisions, irrelevant divisions correctly clustered by title, and manually assigned divisions where the output was found to alternate between A and B classification in testing. A 4.64% C-classification rate gives some idea of the methodology’s effectiveness, but cannot be used to indicate GPT-4’s accuracy in the classification task.
The lack of time and resources available to manually label the data leaves a question about the reliability of GPT-3.5 and GPT-4 in this methodology. Without a clear measure of accuracy, there remains the potential to mislead the public on sensitive issues. It is unclear whether the value of the findings outweighs this potential risk outside of an academic setting.
Future work
Addressing reliability with a substantial test set of labelled data would determine model accuracy. This could also be used to check for variation between clusters, which would be useful for optimising future prompt designs.
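A brief sketch of how such an evaluation might look, assuming a labelled table with illustrative column names:

```python
# Sketch of how a future labelled test set could be used to measure the model's
# classification accuracy and per-cluster variation. Column names are assumed.
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate(labelled: pd.DataFrame) -> pd.DataFrame:
    """`labelled` holds columns: cluster, human_label, model_label (A/B/C)."""
    overall = accuracy_score(labelled["human_label"], labelled["model_label"])
    kappa = cohen_kappa_score(labelled["human_label"], labelled["model_label"])
    per_cluster = (labelled.groupby("cluster")
                   .apply(lambda g: accuracy_score(g["human_label"], g["model_label"])))
    print(f"overall accuracy={overall:.3f}, Cohen's kappa={kappa:.3f}")
    return per_cluster.rename("accuracy").to_frame()
```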
During testing, it became clear that simply removing divisions suffering from ‘inconsistent classification’, or taking their most common class, would improve accuracy. This suggests the potential for ensemble large language modelling, known as self-consistency with CoT [40]. At the time of writing, no LLMs met all three requirements for this method, namely public availability, document scanning, and a functioning API. This limitation has since been removed with OpenAI’s developer toolkits [36] (Footnote 3).
Attempts should be made to extend the data set prior to 2006, filling the gap between 2004 and 2006 in the Hansard dataset. This will help to identify whether extreme voting behaviour in recent years is a point of concern, or part of normal oscillations between polarisation and moderation, that is, part of a healthy democracy.
Generative summaries could be used as novel input data for traditional clustering algorithms for better clustering. While some literature warns against generative data as ‘garbage in’ [42], generative summaries have the potential to overcome the lack of information which hinders division titles in traditional NLP approaches.
A few directions to extend the method’s academic reach:
We can see evidence of more extreme voting behaviour, which we have loosely described as polarisation. The term polarisation is well defined in the literature, despite a variety of causes which, as with most social phenomena, appear to be non-linear and context dependent [43]. This paper identifies a potential method for measuring the level of strategic voting on key issues, through which partisan polarisation may be expressed. The correlation between strategic voting and the level of polarisation in the UK House of Commons will depend on factors such as the structure of the leading party and the rebel rate; this work has been omitted from the current paper due to this uncertainty. A simple quantitative metric for polarisation could be an important measure of the health of our democracy, and useful for comparison with other countries around the world.
Finally, rather than splitting the politicians by party, we can cluster them based on similar vote profiles. This can be used to identify similar groups within parties, or to check for the influence of special interest groups that might influence politicians’ voting behaviour. Higher-dimensional roll-call analysis may reveal previously hidden coalitions.
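A minimal sketch of this extension, assuming a table of MP vote profiles with one column per dimension:

```python
# Illustrative sketch of the proposed extension: clustering MPs by their
# 11-dimensional vote-profile vectors rather than by party label. `profiles`
# is assumed to be a DataFrame indexed by MP name with one column per dimension.
import pandas as pd
from sklearn.cluster import KMeans

def group_mps(profiles: pd.DataFrame, n_groups: int = 4) -> pd.Series:
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
    labels = km.fit_predict(profiles.values)
    return pd.Series(labels, index=profiles.index, name="vote_profile_group")
```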
Conclusion
The visualisation tool outlined in this paper broadly conveys party positions on core issues, while capturing logical reactions to contemporary events. For members of the public with low political awareness, visualising party profiles by linking them to issues people actually care about could help to close the information gap in UK politics.
Ultimately, the goal of the visualisation tool is to boost political engagement. We believe this can be stimulated by users growing more confident in their understanding, or by an emotional reaction to information that conflicts with existing beliefs. Our visualisation tool helps UK citizens to understand how their representatives vote, allowing the public to make more informed decisions about who represents them in parliament.
The UK has begun to lead globally in government transparency, although this tends to focus on ‘thin transparency’ as defined by Edwards [44, 45]. Public initiatives focused on information provision, such as ‘They Work For You’ (TWFY), are criticised for over-simplification of voting records and mis-characterisation of divisions, which tends to polarise users rather than support political engagement. If we accept that the above visualisations are created with a credible methodology, then machine-based clustering followed by classification solves the ‘thin transparency’ problem at greater scale and with greater objectivity. It would be encouraging to see TWFY adopt this method and so use visualisations for more impartial public information provision.
Data availability
Debate-As-Text files are available to the public in the Hansard record of debates. To aid retrieval, these files have been stored in a GitHub repository: https://github.com/JDLilley/JDLilley/tree/main/Digital_Democracy/Data Example from: https://github.com/JDLilley/JDLilley/blob/main/Digital_Democracy/Data/DebateAsText/Armed%20Forces%20Bill%202021-06-23.txt
The generative outputs containing division classifications, as seen in Table 5, have also been stored in the digital democracy GitHub repository under LLM_Output: https://github.com/JDLilley/JDLilley/tree/main/Digital_Democracy/Data Example from: https://github.com/JDLilley/JDLilley/blob/main/Digital_Democracy/Data/LLM_Output/103590.txt.
Notes
1. A full repository of text files can be found in the GitHub Digital Democracy repository.
2. In the interest of reproducibility, we have described some functional steps that were required to produce the results using BingChat; these can be found in the ReadMe file on GitHub (Digital Democracy).
3. Relevant ‘Debate As Text’ files are stored in the GitHub Digital Democracy repository for easy access. These should be fed into the new GPT-4 developer hub for a custom-trained large language model, which can be used to improve the accuracy of the classification task.
References
Simon, J., Bass, T., Boelman, V., & Mulgan, G. (January 2017). Digital democracy. The tools transforming political engagement. UK: Nesta
Commission, D. D., et al. (2015). Open up! report of the speaker’s commission on digital democracy. London: House of Commons. http://www.digitaldemocracy.parliament.uk/documents/Open-Up-Digital-Democracy-Report.pdf
Tomaž, D. (2023). Young people’s participation in European democratic processes. AFCO Committee, European Parliament.
Uberoi, E. (2023). Turnout at elections. House of Commons Library.
Society, H. (2013). Audit of political engagement 10. Hansard Society 2013.
Select Committee on Modernisation of the House of Commons. (2004). Connecting Parliament with the public. House of Commons, First Report of Session 2003–04.
Uberoi, E., & Johnston, N. (2022). Political disengagement in the UK: Who is disengaged? House of Commons Library.
Barber, B. (2003). Strong democracy: Participatory politics for a new age.
Hooghe, M., & Stiers, D. (2017). Do reluctant voters vote less accurately? The effect of compulsory voting on party-voter congruence in Australia and Belgium. Australian Journal of Political Science, 52(1), 75–94.
Armstrong, H. (2015). Compulsory voting. House of Commons Library (00954), 4.
Strathclyde, L. (2010). Parliamentary voting system and constituencies bill. House of Lords Debate 723(00954).
Dunt, I. (2023). How Westminster Works... and Why It Doesn’t.
Stewart, R. (2023). Politics on the Edge.
Silva, R. N. M., Spritzer, A., & Freitas, C. D. S (2018). Visualization of roll call data for supporting analyses of political profiles. In 2018 31st SIBGRAPI conference on graphics, patterns and images (SIBGRAPI) (pp. 150–157). IEEE.
Kinnaird, P., Romero, M., & Abowd, G. (2010). Connect 2 congress: visual analytics for civic oversight. In CHI ’10 extended abstracts on human factors in computing systems. CHI EA ’10 (pp. 2853–2862). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1753846.1753872
Gupta, K., Sampat, S., Sharma, M., & Rajamanickam, V. (2016). Visualization of election data: Using interaction design and visual discovery for communicating complex insights. JeDEM - eJournal of eDemocracy and Open Government, 8, 59–86. https://doi.org/10.29379/jedem.v8i2.422
Boche, A., Lewis, J. B., Rudkin, A., & Sonnet, L. (2018). The new voteview.com: Preserving and continuing Keith Poole’s infrastructure for scholars, students and observers of congress. Public Choice, 176, 17–32.
Kim, N. W., Jung, J., Ko, E.-Y., Han, S., Lee, C. W., Kim, J., & Kim, J. (2016). Budgetmap: Engaging taxpayers in the issue-driven classification of a government budget. In Proceedings of the 19th ACM conference on computer-supported cooperative work & social computing. CSCW ’16 (pp. 1028–1039). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2818048.2820004
Borja, F. G., & Freitas, C. M. D. S. (2015). Civisanalysis: Interactive visualization for exploring roll call data and representatives’ voting behaviour. In 2015 28th SIBGRAPI conference on graphics, patterns and images (pp. 257–264). https://doi.org/10.1109/SIBGRAPI.2015.34
Méndez, G. G., & Moreno, O. (2021). Enabling comparative analysis of election data in Ecuador. VINCI ’21. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3481549.3481571
Gupta, K., Sampat, S., Sharma, M., & Rajamanickam, V. (2016). Visualization of election data: Using interaction design and visual discovery for communicating complex insights. JeDEM-eJournal of eDemocracy and Open Government, 8(2), 59–86.
Kumar, M., Narayan, C., Hangal, S., & Trivedi, P. (2020). Lokdhaba: Acquiring, visualizing and disseminating data on Indian elections. In Proceedings of the 3rd ACM SIGCAS conference on computing and sustainable societies (pp. 243–253).
Ondrejka, P. (2016). Mapping election results in proportional electoral systems. Journal of Maps, 12(sup1), 591–596.
Borja, F. G., & Freitas, C. M. (2015). Civisanalysis: Interactive visualization for exploring roll call data and representatives’ voting behaviour. In 2015 28th SIBGRAPI conference on graphics, patterns and images (pp. 257–264). IEEE.
Méndez, G. G., Moreno, O., & Mendoza, P. (2022). Legislatio: A visualization tool for legislative roll-call vote data. In Proceedings of the 15th international symposium on visual information communication and interaction (pp. 1–8).
Kim, N. W., Jung, J., Ko, E.-Y., Han, S., Lee, C. W., Kim, J., & Kim, J. (2016). Budgetmap: Engaging taxpayers in the issue-driven classification of a government budget. In Proceedings of the 19th ACM conference on computer-supported cooperative work & social computing (pp. 1028–1039).
Méndez, G. G., & Moreno, O. (2021). Enabling comparative analysis of election data in Ecuador. In Proceedings of the 14th international symposium on visual information communication and interaction (pp. 1–2).
Kinnaird, P., Romero, M., & Abowd, G. (2010). Connect 2 congress: Visual analytics for civic oversight. In CHI’10 extended abstracts on human factors in computing systems (pp. 2853–2862).
Clinton, J., Jackman, S., & Rivers, D. (2004). The statistical analysis of roll call data. American Political Science Review, 98(2), 355–370.
Kellermann, M. (2012). Estimating ideal points in the British house of commons using early day motions. American Journal of Political Science, 56(3), 757–771.
Shapiro, J. M., & Taddy, N. M. (2015). Measuring polarization in high-dimensional data: Method and application to congressional speech. NBER Working Paper 22423.
Slapin, J. B., & Proksch, S.-O. (2008). A scaling model for estimating time-series party positions from texts. American Journal of Political Science, 52(3), 705–722.
Lauderdale, B. E., & Clark, T. S. (2014). Scaling politically meaningful dimensions using texts and votes. American Journal of Political Science, 58(3), 754–771.
Spirling, A., & McLean, I. (2007). UK OC OK? Interpreting optimal classification scores for the UK house of commons. Political Analysis, 15(1), 85–96.
OpenAI. (2022). Introducing Chatgpt. OpenAI Blog.
OpenAI. (2023). New models and developer products announced at devday. OpenAI Blog.
The Public Whip. (2022). Why do you refer to majority and minority instead of aye and no? The Public Whip.
Vu, T., Wang, T., Munkhdalai, T., Sordoni, A., Trischler, A., Mattarella-Micke, A., Maji, S., & Iyyer, M. (2020). Exploring and predicting transferability across NLP tasks. Preprint arXiv:2005.00770
Yu, Z., He, L., Wu, Z., Dai, X., & Chen, J. (2023). Towards better chain-of-thought prompting strategies: A survey. Preprint arXiv:2310.04959
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of thoughts: Deliberate problem solving with large language models. Preprint arXiv:2305.10601
Badhan, M. (2023). Comprehensive guide to chain-of-thought prompting. Mercity Blog.
Simmons, A., & Vasa, R. (2023). Garbage in, garbage out: Zero-shot detection of crime using large language models. Preprint arXiv:2307.06844
Russo, J. (2021). Polarisation, radicalisation, and populism: Definitions and hypotheses. Politikon: The IAPSS Journal of Political Science, 48, 7–25.
Edwards, A., & Kool, D. (2016). Digital democracy: Opportunities and dilemmas. Rathenau Instituut: The Dutch Parliament in a Networked Society. Den Haag.
Curtin, D., & Meijer, A. J. (2006). Does transparency strengthen legitimacy? Information Polity, 11(2), 109–122.
Ethics declarations
Conflict of interest
The authors declared no potential conflict of interest with respect to the research, authorship, and/or publication of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lilley, J., Townley, S. Tackling transparency in UK politics: application of large language models to clustering and classification of UK parliamentary divisions. J Comput Soc Sc 7, 2563–2589 (2024). https://doi.org/10.1007/s42001-024-00317-z