This paper introduces a new unsupervised method based on Latent Dirichlet Allocation to
automatically detect the playing styles of soccer teams and players. In this approach, a model is
trained on Opta’s F9 data to learn a set of underlying playing styles that best describe differences
in the training dataset. Once fit, the model can describe the playing style of teams and players at
scale with minimal human intervention. The method proves effective at aligning data-driven style
detection with grounded applications that practitioners such as clubs and coaches can immediately
incorporate into their day-to-day operations.
1. Introduction
Soccer data analytics lags far behind its counterparts in the major North American sports for widely
cited reasons: soccer is a complex, dynamic and fluid invasion sport, with highly interdependent
events occurring simultaneously and continuously. On top of this, soccer is an extremely low-scoring
sport, and the majority of events on the pitch do not influence the final result directly (with a
goal, for example); not only is the data complex, but the signal from which we want to discern
structure is infrequent. As such, traditional supervised approaches to soccer gain little traction
over short time-frames and are largely reserved for stakeholders, such as betting syndicates, that
can afford to hedge on long-term odds and outcomes. Moreover, the insight from this type of research
is often abstract, visible only as small differences in model parameters that increase or decrease a
team’s chances of winning a league; it rarely trickles down in a descriptive, applicable or
accessible way to short-term stakeholders like clubs and coaches interested in winning the next game
or planning their tactical system.
In response to the circumstances outlined above, this paper presents a novel unsupervised approach
aimed at extracting robust ‘stylistic’ insight from aggregate feature data on a short, match-by-match
timescale. Most importantly, our methodology aims to reconcile mathematical robustness and
scalability with descriptiveness and applicability, so that hands-on practitioners such as coaches or
journalists can immediately leverage the results in their daily work.
The methodology presented here is inspired by Natural Language Processing (NLP), more specifically
by Topic Extraction, which is concerned with automatically sorting text documents into the semantic
topics that constitute them. In the information age, automatic and scalable methods to classify
documents semantically are incredibly practical. Burgeoning fields such as digital marketing and
sentiment analysis of social media content rely heavily on the scalability of text mining: most
humans can classify tweets about a brand into different ‘sentiment categories’, but the manpower
needed to do this across the vast quantities of data available is prohibitive.
The document is organised as follows: In Section 2 we introduce what an LDA model is and how we
can train such a model, explaining how we have re-conceptualised and repurposed the training
algorithm to deal with learning styles of teams and players rather than topics of documents. In
Section 3 we provide more details on the data used and how it was pre-processed to train the LDA
style model, as well as exploring the fitted models to understand what the learned styles mean in a
soccer context. Section 4 speaks to practitioners by providing an in-depth account of applied
possibilities for the method’s results. Finally, Section 5 closes with some conclusions and proposes
lines of future work that this research opens up.
2. Latent Dirichlet Allocation
Introduced by Blei, Ng and Jordan (2003) in their seminal paper, LDA is primarily used for document
classification tasks in which the content of a document is explained by unobserved groups of latent
topics. In theory, an LDA model is a generative model that conceptualises a document as a mixture of
unobserved latent topics sampled from a sparse Dirichlet distribution, where a topic is a probability
distribution over the dictionary (the set of words). In a hypothetical example, this means that
latent topics such as religion or politics assign probabilities to certain words (religion will
assign higher probability to words like “god” or “virtue”, while politics will assign higher
probability to words like “law” or “constituency”); and the mixture of a document over these latent
topics (e.g. 65% religion and 35% politics) will determine the probability of different words
appearing in it with different frequencies.
FIGURE 1: Generative scheme of the LDA model: each document d in the corpus has a sparse Dirichlet
distribution θ_d ~ Dir(α) over the latent topics, and the topic of each word in d is sampled from
Multinomial(θ_d). In parallel, each topic k has a sparse Dirichlet prior over the dictionary,
φ_k ~ Dir(β_k), from which the words assigned to topic k are sampled.
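For completeness, the generative process sketched in Figure 1 can be written out fully as follows (a standard statement of LDA; z_{d,i} denotes the latent topic assignment of the i-th word in document d, a step the caption glosses over):

\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta_k) && \text{topic } k\text{'s distribution over the dictionary} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{document } d\text{'s mixture over topics} \\
z_{d,i} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic assignment of word } i \text{ in document } d \\
w_{d,i} &\sim \mathrm{Multinomial}(\phi_{z_{d,i}}) && \text{the observed word itself}
\end{align*}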
In practice, however, NLP researchers will not realistically have access to a full representation of
all possible latent topics and their probability distributions over all existing words. In light of this, an
unsupervised implementation using Bayesian inference techniques, which fits an LDA model and learns
the latent topics and their distributions over the dictionary from a corpus of documents, is where
LDA has become the celebrated champion of unobserved topic inference (Newman, Smyth, Welling and
Asuncion, 2008). This point is worth emphasising and reflecting on for a moment: there is absolutely
no supervision involved in LDA inference, meaning that the algorithm has no knowledge at all of the
topics of human language or of the words associated with them. It simply fits a model that helps
explain the frequencies of the different words (words, for the algorithm, are abstract signals
holding no semantic meaning at all) in a collection of documents. It is the actual structure of
language and semantics that causes the algorithm to learn meaningful topics (by ‘meaningful’ we mean
semantic topics in the set of documents). The fact that we will need no supervision to learn
meaningful styles of teams and players is truly valuable: in exploring the results, readers with a
knowledge of European soccer will realise that the learned styles correspond to empirical styles
commonly discussed in the game.
A fitted LDA model outputs a set of n unlabelled topics (where n is passed as a parameter) along with
each topic’s probability distribution over the dictionary. Even if the researcher does not know the
topics in the training set beforehand, it is usually not difficult to infer them by empirically
inspecting the words with the highest probabilities for each topic. Additionally, a fitted LDA model
allows observations (documents) to be expressed as a mixture of the learned topics using maximum
likelihood methods (e.g. a given document is 35% topic 1, 20% topic 2, etc.).
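As an illustrative sketch of this workflow (not the pipeline used in this paper; the corpus, number of topics and settings below are toy placeholders), scikit-learn’s LatentDirichletAllocation covers both steps, learning topics from a Term Frequency Matrix and expressing documents as mixtures of them:

# Minimal topic-extraction sketch with scikit-learn (illustrative only; toy corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "god virtue prayer church faith",
    "law constituency parliament vote policy",
    "god law virtue policy faith vote",
]  # toy documents

# Term Frequency Matrix: rows = documents, columns = words, entry (i, j) = count of word j in document i
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)

# Fit an LDA model with n unlabelled topics (n is passed as a parameter)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic_mix = lda.fit_transform(tf)  # each row is a document's mixture over the topics

# Label the topics empirically by inspecting their most probable words
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[::-1][:3]]
    print(f"topic {k}: {top_words}")

print(doc_topic_mix[0])  # e.g. roughly [0.65, 0.35]: 65% topic 0, 35% topic 1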
In the end, a fitted LDA model is no more than a dimensionality reduction method on new
observations; but in contrast with other methods of non-descriptive dimensionality reduction like
Principal Component Analysis, LDA’s strength lies precisely in the rich descriptiveness of its
components. Its explanatory nature as a generative model for frequencies of discrete features from
different unobserved types has made it appealing and successful in information retrieval problems
other than text mining. It has even made its way into soccer data analytics, although with a
different approach: in a fascinating and highly recommended application, Wang et al. (2015) apply a
variation of LDA inference to passing-combination frequencies, fitting a model to what the authors
coin “patterns of play”.
The traditional input of an LDA inference algorithm is a Term Frequency Matrix: a matrix whose rows
correspond to documents and whose columns correspond to words, so that entry (i, j) is the number of
times word j appears in document i. A reasonable parallel can be drawn between this sort of matrix
and aggregate metric data matrices (where each row corresponds to a team’s performance in a match and
columns correspond to metrics such as ‘clearance with head’), in which entry (i, j) is the number of
times feature j was performed in match i. The figure below illustrates this conceptual analogy.
The interpretation of the analogy we are making between topics in the setting of document
classification and team/player style or persona in the setting of soccer data is natural: just as a
document’s mixture of topics will determine with what frequency different words will appear, a
team/player’s mixture of styles or personas can be thought of as latent characteristics that
determine the frequency with which they perform certain actions on the pitch. As an example, for a
team that employs the notorious tiki-taka style, the most likely features will probably be something
along the lines of ‘successful passes’, ‘touches’ and ‘accurate short pass’; while a long-ball counter-
attacking team will assign more probability to ‘long balls into opposition half’, ‘fast-break’ and ‘flick-
ons’.
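To make the analogy concrete, the soccer analogue of a Term Frequency Matrix is simply a team-match by feature count table. A minimal sketch, assuming a hypothetical long-format table of per-match event counts (the column and feature names are our own, not Opta’s):

# Sketch: building the soccer analogue of a Term Frequency Matrix (hypothetical column names).
import pandas as pd

# Hypothetical long-format aggregate data: one row per (match, team, feature) count
raw = pd.DataFrame({
    "match_id": [1, 1, 1, 2, 2, 2],
    "team":     ["A", "A", "B", "A", "B", "B"],
    "feature":  ["accurate_short_pass", "touches", "long_balls", "touches", "flick_ons", "fast_breaks"],
    "count":    [310, 542, 12, 498, 9, 7],
})

# Rows = a team's performance in a match, columns = features, entry (i, j) = number of times
# feature j was performed in match i, mirroring documents x words in the NLP setting.
term_frequency_like = raw.pivot_table(
    index=["match_id", "team"], columns="feature", values="count", fill_value=0
)
print(term_frequency_like)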
With this framework in mind, our research consists of two stages. First, we fit an LDA model using
historical match aggregate data. In this stage the model learns, in an unsupervised way, the
underlying styles (along with their probability distributions over the set of actions) that best
represent the differences in teams’ and players’ action frequencies, and the emerging styles are
labelled empirically by inspecting the associated distributions.
In the second stage, we transform match observations under the fitted models to express each
observation as a mixture of the styles that were learned in the first stage. The descriptiveness of the
style components will allow us to tell a story and create a mental picture of what type of
performance a team/player produced in a match. However, it is also worth remembering that the
transformed observations are vectorised in an n-dimensional space (where n is the number of
learned topics), and as such we can make full use of these vectorisations to produce a rich dossier of
applications. For example, although the match by match mixture of styles of a team or player can be
highly contextual to the specific match’s circumstances, average team/player vectors throughout a
whole season can produce rich and robust insight into their underlying style or intent of play.
Section 4 will provide a general overview of the applications.
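Continuing the toy sketch above (illustrative only; the real models are fit on Opta F9 aggregates, and the number of styles is a modelling choice discussed later), the two stages could look roughly as follows:

# Sketch of the two-stage style pipeline, reusing term_frequency_like from the sketch above.
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation

X = term_frequency_like.to_numpy()
feature_names = term_frequency_like.columns

# Stage 1: learn the latent styles and their distributions over the features
style_model = LatentDirichletAllocation(n_components=2, random_state=0)
style_model.fit(X)
for k, style in enumerate(style_model.components_):
    top_features = [feature_names[i] for i in style.argsort()[::-1][:3]]
    print(f"style {k}: {top_features}")  # inspect to label each style empirically

# Stage 2: express each observation as a mixture of the learned styles
match_mix = pd.DataFrame(style_model.transform(X), index=term_frequency_like.index)

# Average a team's matches for a season-level style profile
season_profile = match_mix.groupby(level="team").mean()
print(season_profile)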
3. Data and Model Training
We used the complete 2016/17 season, plus any matches completed before December 2017 in the 2017/18
season (for the English Premier League, for example, that is 14 matches per team). The model was
trained on what are known as ‘the big 5’ leagues: the English Premier League, Spanish La Liga, German
Bundesliga, French Ligue 1 and Italian Serie A. Once fit, the model was deployed to transform
observations from these same leagues with the addition of the Turkish Süper Lig, Portuguese Primeira
Liga and Dutch Eredivisie.
In text applications, raw term frequencies are typically weighted with tf-idf (term frequency-inverse
document frequency) before training. The purpose of tf-idf is to represent the relative importance of
a word in a document by weighting the frequency of the word in the document, offset by the number of
documents in which it appears. Its design ensures that common stop-words such as ‘the’ are weighted
lowly even if they appear with high frequency, as they will appear in most if not all documents,
while highly frequent words which do not appear in most other documents receive a higher weighting.
Tf-idf has become one of the most popular definitions of relative term importance in NLP (Beel, Gipp,
Langer and Breitinger, 2016); but for our soccer data methodology we cannot implement a completely
analogous statistic, since our problem has a small dictionary (the number of features collected in
the dataset, which is much lower than the number of words found in a corpus of documents) and most
features appear in almost every match (every match has passes, shots, head clearances, etc.). As a
relative importance weighting, we therefore break away from traditional text-based LDA and simply use
a standard z-score scaler per feature. In other words, instead of feeding the raw features to the LDA
inference algorithm (83 passes, for example), we feed the z-score over the training dataset (2.5
standard deviations above the mean for passes).
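A minimal sketch of this weighting step is given below. Note that standard LDA implementations expect non-negative inputs; the paper does not detail how negative z-scores are reconciled with this requirement, so the clipping shown is purely our own assumption:

# Sketch: per-feature z-score weighting before LDA inference (illustrative).
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Z = scaler.fit_transform(X)  # X: raw feature counts per team-match, as in the sketches above
# e.g. a match with 83 passes might become roughly 2.5 (standard deviations above the training mean)

# Assumption: one simple way of keeping the inputs non-negative for the LDA algorithm
Z_nonneg = np.clip(Z, 0.0, None)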
TEAM MODEL
Topic Label: Most Probable Features
Stern Defending: effective clearance, total clearance, effective head clearance, blocked cross, effective blocked cross, high claims, good high claims, lost corners, punches, outfielder block, attempts conceded out of box, possession won in defensive third, clean sheet, total launches
Conceding Chances: saves, diving saves, saved in box, attempts conceded in box, saved out of box, attempts conceded out of box, challenge lost, interceptions in box, outfielder block, error lead to goal, attempted tackle foul, lost corners, free kick given, yellow cards
Crosses into Box: total cross, accurate cross, cross not from corner, corners into box, won corners, crosses behind 18 yards, crosses after 18 yards, penalty area entries, accurate crosses not from corners, shot off target, missed headed attempt, total headed attempts, missed attempt in box
Fast Breaks and Playing Behind Defenders: attempted fast breaks, shot from fastbreak, total fast breaks, big chance created, one-on-one attempt, big chance scored, big chance missed, attempt from centre of box, close miss, miss in box, shot off target, accurate through ball, on target scoring attempt
Long Balls and Launches: possession lost, possession lost control, total long balls, accurate launches, long pass from own half into opposition’s, total launches, total flick ons, aerial won, aerial lost, accurate flick on, ball recovery, unsuccessful touch, duel won, duel lost, possession won middle third
Many Shots and Attempts: on target scoring attempt, accurate pull back, attempt on target right foot, total pull back, attempt on target in box, accurate through ball, total through ball, attempt saved in low centre, big chance created, attempt on target left foot, one-on-one attempt, attempt on target out of box, big chance scored, big chance missed, attempt from open play
In light of the above discussion, we decided to train three separate models for defenders,
midfielders and forwards respectively (these positional labels are available in the dataset). This
choice eliminates comparability between the different categories of players, since observations
transformed under different models live in unrelated dimensionality reductions; but it has the added
advantage that the learned styles for each category are perhaps less obvious and less widely
documented, so that the scalability of the method provides true value.
For consistency of the theoretical framework, despite training three essentially unrelated models, we
trained each one with 7 topics/styles. Again, this choice is discussed in more detail in Section 5,
but for now the tables below present the 7 learned styles for each player model.
DEFENDER MODEL
Topic Label: Most Probable Features
Passing - Forward Areas: passes left, rightside pass, accurate forward zone pass, touches, total forward zone pass, successful final third passes, open play pass, total pass, successful open play pass, accurate pass, final third entries, forward pass, backward pass, clean sheet, possession won middle third
Passing in the Back: accurate back zone pass, total back zone pass, leftside pass, accurate pass, successful open play pass, total pass, open play pass, successful long passes from own half into opposition’s, touches, forward pass, rightside pass, passes right, accurate forward zone pass, possession won defensive third, offside provoked, accurate long balls, clean sheet, head pass, ball recovery
Gritty Defending: interceptions in box, interceptions, offside provoked, yellow card, total tackle, won tackle, possession won defensive third, outfielder block, attempted tackle foul, fouls, challenge lost, duel won, ball recovery, was fouled, successful put through, head pass, possession won middle third, aerial lost
Stern Defending: effective clearance, effective head clearance, total clearance, outfielder block, offside provoked, aerial won, clean sheet, head pass, aerial lost, possession won defensive third, duel won, total launches, accurate launches, yellow card, total back zone pass, accurate back zone pass, long passes from own half into opposition’s
Defending on the Touchline: effective blocked cross, blocked cross, blocked pass, total tackle, won tackle, put through, possession won defensive third, duel won, attempted tackle foul
Crossing into the Box: crosses after 18 yards, total crosses not from corners, accurate cross not from corner, crosses behind 18 yards, total cross, penalty area entries, passes right, off target shot assist, possession lost, won corner, attempted assist from open play, total attempted assist, total final third passes, final third entries
Long Balls and Launches: total chipped pass, accurate chipped pass, long pass from own half into opposition’s, offside provoked, accurate long balls, successful long balls, forward pass, possession won defensive third, total launches, outfielder block, accurate launches, aerial won, total clearances
MIDFIELDER MODEL
Topic Label: Most Probable Features
Goal Attempts: attempt to the centre from out of box, missed out of box attempt, blocked out of box attempt, total attempts with right foot, blocked scoring attempt, attempt open play, total scoring attempt, shot off target, total attempts left foot, on target attempts right foot, on target scoring attempts, missed attempt in box, won corners, touches in opposition box
Defensive Work: total tackle, won tackle, attempted tackle foul, challenge lost, interception, interception won, yellow card, fouls, possession won middle third, ball recovery, possession won defensive third, outfielder block, duel won, successful put through, duel lost, clean sheet, blocked pass, chipped passes
Dominate Passing and Possession: total pass, accurate pass, open play pass, successful open play pass, accurate forward zone pass, touches, leftside pass, rightside pass, final third entries, accurate chipped pass, accurate back zone pass, total back zone pass, successful long passes from own half into opposition’s, successful final third passes, forward pass, possession won middle third, accurate long balls
Aerial Game: accurate flick ons, aerial won, head pass, total flick on, aerial lost, effective head clearance, head clearance, duel won, duel lost, was fouled, fouls, attempt from centre of box
High Risk/High Reward: turnover, unsuccessful touch, overrun, total contest, won contest, blocked pass, put through, dispossessed, successful put through, duel lost, fouled final third, was fouled, possession lost, possession lost control, duel won, challenge lost, fouls, possession won attacking third, aerial lost
Creating Chances and Playing within Lines: accurate layoffs, total layoffs, attempts assisted open play, on target attempts assisted, total attempts assisted, successful final third passes, total final third passes, backward pass, accurate forward zone pass, possession won attacking third, passes left, passes right, touches in opposition box, fouled final third, successful open play pass, penalty area entries
Crosses into Box: total crosses, accurate crosses, penalty area entries, accurate cross not from corner, crosses behind 18 yards, crosses after 18 yards, off target attempts assisted, total attempts assisted, won corner, possession lost control, on target attempt assisted, attempted assists open play, total forward zone passes, total final third passes, blocked pass, total contest
FORWARD MODEL
Topic Label: Most Probable Features
High-Risk/High Reward: unsuccessful touch, turnover, dispossessed, overrun, duel lost, fouled final third, total contest, was fouled, blocked pass, put through, won contest, fouls, successful put through, possession won attacking third, duel won, challenge lost
Blocked and Missed Attempts: missed attempt in box, shot off target, attempt missed to the left, attempt missed to the right, missed headed attempt, total headed attempts, high missed attempt, total scoring attempts, big chance missed, attempt in box blocked, blocked scoring attempt, total offside
Aerial Target Man: total flick on, accurate flick on, aerial lost, aerial won, head pass, duel lost, total offside, duel won, total layoffs, unsuccessful touch, turnover, accurate layoffs, fouls, dispossessed, total headed attempts, was fouled, possession lost control, effective head clearance
Goal-Scoring: goals, goal inside box, goals open play, big chance scored, goal right foot, on target scoring attempt, attempt from centre of box, touches in opposition box, total scoring attempts, total offside, total attempts left foot, total headed attempts, total layoffs
Creating Chances and Playing within Lines: accurate layoffs, total layoffs, total final third passes, successful final third passes, backward pass, accurate forward zone pass, total forward zone pass, attempt assisted open play, on target attempt assisted, total attempt assisted, passes left, rightside pass, won contest, touches in opposition box, off target attempt assisted, penalty area entries, touches
Crosses into Box: big chance created, accurate cross not from a corner, total attempted assist, accurate cross, crosses behind 18 yards, total crosses not from a corner, total cross, attempt assisted open play, off target attempt assisted, on target attempt assisted, penalty area entries, crosses after 18 yards
Before immersing ourselves in the practical applications of the results, a note on readily available
labels of player styles/roles: the dataset also contains slightly more granular labels such as
Central Attacking Midfielder (CAM), Defensive Midfielder (DM), Full Back (FB), etc. A first glance at
the projection of the results for defenders and midfielders onto their first two principal components
shows that this unsupervised method can successfully differentiate between the styles of these
different positions. We build on this promising taste of the results in Section 4.
FIGURE 3: Projection onto the first two principal components of the Defender model (left) and
Midfielder model (right). Observations are coloured by the granular role labels available in the data.
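A plot in the spirit of Figure 3 can be produced directly from the transformed observations. The sketch below uses randomly generated mixtures and role labels as hypothetical stand-ins for the transformed defender data:

# Sketch: 2-D PCA projection of style mixtures, coloured by granular role label (stand-in data).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
player_mix = rng.dirichlet(np.ones(7), size=200)  # 200 player-seasons x 7 style components
roles = rng.choice(["FB", "CB"], size=200)        # hypothetical granular role labels

coords = PCA(n_components=2).fit_transform(player_mix)
for role in np.unique(roles):
    mask = roles == role
    plt.scatter(coords[mask, 0], coords[mask, 1], label=role, alpha=0.6)
plt.legend()
plt.title("Defender style mixtures: first two principal components")
plt.show()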
4. Applications
REMARK: Radar charts are a common feature in soccer data analytics, but in a very different use case,
which might make our visualisations ‘miss the point’ if the main difference is not explained: in
traditional uses of radar charts, each axis is an independent metric, so a team/player’s chart can
grow indefinitely large in all directions. In our visualisations, by contrast, the axes are components
of a mixture that sums to 1, so a larger share of one style necessarily comes at the expense of the
others.
League Styles
Let us begin with the Team model. A first application is identifying stylistic variation across
competitions:
Average ‘league styles’ can be computed by averaging the style mixtures of the matches played in each
league. As can be seen, there are significant differences in the types of matches that tend to occur
across the ‘big 5’ leagues. There are more matches, relatively speaking, characterised by a
large volume of shots and attempts in the French Ligue 1. Strikingly, and in line with general
opinion, matches in the German Bundesliga are more likely to fall under the ‘fast breaks and playing
behind defenders’ category, as well as being a ‘high-energy contest’ (it should be noted that we
believe there may be some idiosyncrasy to the coding of F9 data across these competitions,
especially for the German Bundesliga). The English Premier League, again typifying its stereotype,
contains a disproportionate amount of matches falling under the topic of ‘stern defending’. Games
in the Italian Serie A are most likely to be characterised by ‘crosses into box’ or domination of
‘passing and possession’. They also feature many conceded chances. Somewhat surprisingly, the
Spanish La Liga features a relatively high number of matches falling under the ‘long balls and
launches’ topic, as well as games in which many chances are conceded.
The examples above plot the average team profiles over our whole sample of data, but they are also
available on a match-by-match basis to tell a story of team performances in an individual match.
This concept can be demonstrated initially with the stylistic effect of Lionel Messi, widely
considered the best player in the world and one of the all-time greats, on his club FC Barcelona:
Indeed, the stylistic impact of the world’s best players on the teams they play for is substantial.
Since Paul Pogba arrived at Manchester United for an at-the-time world record fee at the beginning of
last season, he has become invaluable to their style of play:
The impact of some players on their team’s tactical intentions becomes readily visible through our
methodology and radar visualisations. Roberto Firmino, who plays as a striker for Liverpool, fills a
unique role in facilitating their goal-scoring wingers, Mohamed Salah and Sadio Mane. This requires an
unusual sort of striker play that makes Firmino extremely valuable to Liverpool:
In addition to quantifying the stylistic effect of current first-team players, we can attempt to
anticipate the effects of a retiring star. Francesco Totti, AS Roma’s talismanic player, retired at the
end of last season after more than 25 years as a central player at the club. Below we can see the
dramatic influence that his absence creates in terms of style.
Lastly, this simple framework can also be used in day-to-day decision making at clubs, such as lineup
choices. As an example, we look at two pairs of players that are ‘substitutes’ for each other at their
clubs: at Liverpool, James Milner and Alberto Moreno compete for the left-back spot; at Tottenham,
Danny Rose and Ben Davies grapple for the same position.
In the similar personnel choice at Tottenham, Ben Davies seems to provide added competence and
efficacy in possession. The team also has fewer matches characterised by ‘conceding chances’ when he
is playing.
Team Similarity
Another way in which we can leverage the results comes from the fact that the model is a
dimensionality reduction onto stylistically relevant components. This means we can trust the metric of
the transformed space to serve as a robust proxy for similarity, letting us rank teams by how
stylistically alike they are. In the example explored here, Arsenal, Sevilla, and Paris Saint-Germain
come out as the most similar teams to the queried side.
REMARK: Since the 8 components onto which the method projects form a mixture model (i.e. they must add
up to 1 per observation), the projected transformations actually live on a 7-dimensional hyperplane,
meaning that the covariance matrix is singular. To be able to compute the Mahalanobis distance, we
need to drop one of the components to obtain a non-singular covariance matrix.
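A minimal sketch of this similarity computation, using randomly generated mixtures as a stand-in for the team style vectors, might look as follows:

# Sketch: Mahalanobis distance on the style space, dropping one component so that the
# covariance matrix is non-singular (team_mix is a stand-in for the real team mixtures).
import numpy as np
from scipy.spatial.distance import mahalanobis

team_mix = np.random.default_rng(1).dirichlet(np.ones(8), size=50)  # 50 teams x 8 components
X = team_mix[:, :-1]  # keep 7 of the 8 mixture components

VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance of the style vectors
distance = mahalanobis(X[0], X[1], VI)       # stylistic distance between two teams
print(distance)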
The most obvious application is comparing types of players within the same overarching ‘position’
(that is, within the ‘defender’, ‘midfielder’ and ‘forward’ labels to which each of the three player
models corresponds).
Player Styles
We can begin by looking at the projection under the ‘Forwards’ model of some of the world’s
foremost forwards:
Looking at different forwards within the Premier League, perhaps with less glamorous reputations, we
can still appreciate how the method helps us quickly understand what type of player somebody is.
Contrasting Lukaku’s playstyle at Manchester United and Everton offers valuable insight into the
effects of team quality on a striker’s output:
At Manchester United, Lukaku plays with better attackers who can supply him with chances. This is
evident in the increased prominence of the ‘Attempts on Target’, ‘Goal-Scoring’ and ‘Blocked and
Missed Attempts’ topics. That he has to play less like an ‘Aerial Target Man’ is a function of the
difference in styles between Manchester United and Everton.
Player Similarity
Once again using the Mahalanobis metric on the observation space as a proxy for similarity, we can
provide similarity scores for players. The examples below provide the 10 most similar players to
Lionel Messi and Paul Pogba.
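Ranking a player’s nearest neighbours under this metric is then straightforward; a sketch assuming a hypothetical table of season-average player style mixtures (player identifiers and data are placeholders):

# Sketch: the 10 most similar players to a query player under the Mahalanobis metric.
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(2)
player_profiles = pd.DataFrame(
    rng.dirichlet(np.ones(7), size=100)[:, :-1],  # drop one component (see the remark above)
    index=[f"player_{i}" for i in range(100)],    # hypothetical player identifiers
)
VI = np.linalg.inv(np.cov(player_profiles.to_numpy(), rowvar=False))

query = player_profiles.loc["player_0"]
distances = player_profiles.apply(lambda row: mahalanobis(query, row, VI), axis=1)
print(distances.drop("player_0").nsmallest(10))   # the 10 most stylistically similar players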
5. Conclusions and Future Work
It is worth noting that the applications we have explored depend solely on the fact that the results
are deployed as a Bayesian mixture model, essentially in a dimensionality reduction setting; they are
largely independent of the actual assumptions of LDA’s underlying generative model (i.e. any
dimensionality reduction technique could, in principle, be exploited in the same way).
Additionally, the process of selecting the number of topics highlighted an interesting research
question which further adds to the potential lines of future work. To select this parameter in our
research, we kept track of the top features representing each topic/style for each choice of n using
different random seeds, and made an empirical judgement as to when the learned styles ceased to be
interpretable and descriptive in a natural way. In general, however, the question of the optimum
number of topics remains an open problem in NLP research (Greene, O’Callaghan and Cunningham, 2014).
Our experience in reviewing the feature distributions of the emerging topics revealed that broad
topics either survive from one choice of n to the next or divide into two distinct topics, although
this process is muddled by the variability of the training random state. This line of thought suggests
a methodology based on Monte Carlo simulations: tracking topic persistence with a measure of
statistical distance between the feature distributions of the learned styles for different choices of
n could provide a principled framework for choosing the optimum number of topics. It would
additionally remove the empirical element from evaluating the learned styles, with the added bonus
that emerging styles which are not recognised empirically could still be found and studied.
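One way to operationalise this persistence tracking (our own sketch, not a procedure taken from the paper) is to match each learned style from one run to its closest counterpart in another run using, for example, the Jensen-Shannon distance between their feature distributions:

# Sketch: measuring style persistence across runs (different n or random seed) via
# Jensen-Shannon distance between topic-feature distributions (illustrative only).
import numpy as np
from scipy.spatial.distance import jensenshannon

def topic_distributions(lda_model):
    """Normalise the rows of components_ into probability distributions over the features."""
    comp = lda_model.components_
    return comp / comp.sum(axis=1, keepdims=True)

def persistence(model_a, model_b):
    """For each style of model_a, the JS distance to its closest style in model_b."""
    A, B = topic_distributions(model_a), topic_distributions(model_b)
    return np.array([min(jensenshannon(a, b) for b in B) for a in A])

# Styles with consistently small distances across seeds and choices of n are 'persistent';
# averaging persistence over Monte Carlo repetitions gives a criterion for choosing n.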
Another interesting question which our research has surfaced is how to effectively use a metric on the
transformed observation space as a proxy for similarity. This is not an uncommon idea in soccer data
analytics (Meza, 2017; Gyarmati, Kwak and Rodriguez, 2014; Peña and Navarro, 2015), but it requires
some design to ensure that the chosen metric is appropriate to the structure of the transformed
entries. In this document we decided to use the Mahalanobis metric over the Euclidean one, since the
different stylistic components are not identically distributed; but the Mahalanobis metric also has
deficiencies, given the collinearities between the different components. What type of metric to use
given the structure of the problem is an interesting question in its own right, and its result (a
robust measure of team/player style similarity) is definitely appealing for the sport.
Finally, another compelling line of research which this document opens up for future researchers is
structuring how players in a team contribute to a team’s style mixture. The features with which the
model is trained for teams are mostly the sum of features performed by their players (i.e. the
number of shots, passes or interceptions that a team performs is the sum of those performed by its
players). This opens the door to quantifying a team’s style mixture into the smaller constituent
contributions of its different players, in a similar flavour to the applications showcased in Section
4, but in a more robust, comparable and scalable way.
References
[1] Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3(Jan), pp. 993-1022.
[2] Greene, D., O’Callaghan, D. and Cunningham, P., 2014. How many topics? Stability analysis for
topic models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases
(pp. 498-513). Springer, Berlin, Heidelberg.
[3] Gyarmati, L., Kwak, H. and Rodriguez, P., 2014. Searching for a unique style in soccer. arXiv preprint
arXiv:1409.0308.
[4] Meza, D.A.P., 2017. Flow network motifs applied to soccer passing data. In Proceedings of
MathSport International 2017 Conference (p. 305).
[5] Newman, D., Smyth, P., Welling, M. and Asuncion, A.U., 2008. Distributed inference for latent
Dirichlet allocation. In Advances in Neural Information Processing Systems (pp. 1081-1088).
[6] Peña, J.L. and Navarro, R.S., 2015. Who can replace Xavi? A passing motif analysis of football
players. arXiv preprint arXiv:1506.07768.
[7] Wang, Q., Zhu, H., Hu, W., Shen, Z. and Yao, Y., 2015. Discerning tactical patterns for
professional soccer teams: an enhanced topic model with applications. In Proceedings of the 21st ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2197-2206). ACM.