Clustering Forwards in Europe’s Elite Leagues Using K-Means for the 2023–24 Season

Abstract

This study conducts a clustering analysis on forwards from Europe’s top five football leagues during the 2023–24 season, utilizing K-Means to group players based on their goal-scoring metrics. The analysis specifically targets players who logged at least 900 minutes of play, equivalent to 10 full matches, ensuring a focus on regularly participating athletes. By applying K-Means clustering, four distinct clusters were identified, each representing unique goal-scoring profiles characterized by specific metrics: goals per 90 minutes (‘G/90’), shots per 90 minutes (‘Sh/90’), shots on target percentage (‘SoT%’), expected goals per 90 minutes (‘xG/90’), goal deviation from expected (‘G-xG’), and penalty area touches per 90 minutes (‘Pen Touch/90’).

The centroids for these clusters illustrate the variance in playing styles and efficiencies: Cluster 0, with moderate goal-scoring and shot metrics; Cluster 1, showing higher efficiency in shots on target; Cluster 2, marked by lower overall performance in the analysed metrics; and Cluster 3, which includes the highest scorers with aggressive attacking metrics. Notably, Cluster 1 forwards demonstrated a significant positive deviation in actual goals scored over expected, highlighting their exceptional scoring efficiency.

The effective application of K-Means clustering demonstrates its capability to classify player roles distinctly using only high-level scoring statistics. This approach affirms the utility of simple performance metrics in revealing substantial differences in player styles, thereby providing insights that can influence team performance and tactical planning. Future studies could enhance the depth and breadth of player analysis by integrating more advanced performance metrics, including metrics that capture creative contributions, to achieve a more nuanced understanding of player roles and impacts.

Introduction

The integration of data science and analytics into the sports industry has revolutionized traditional approaches to game analysis, player performance monitoring, and tactical decision-making. Recent advancements in Big Data, Machine Learning, and Deep Learning have paved the way for more sophisticated and insightful analytical practices, enhancing competitive strategies across various sports (Thakkar, P. and Shah, M., 2021). In professional sports, especially football, the application of these technologies has led to a significant transformation in how teams operate and compete.

Football, as a sport, not only holds a deep cultural and social significance worldwide but also commands immense economic value. The sport’s global appeal and widespread viewership have fostered substantial financial growth, particularly in Europe’s premier competitions. The top five European leagues — often referred to as the “Big Five” (England, Spain, France, Italy, and Germany) — have seen unprecedented revenue growth, exemplified by a record €15.6 billion revenue achievement in the 2017/18 season, marking a 6% increase from the previous year (Deloitte, 2019). This economic success underscores the leagues’ prominent role in the global sports economy, attracting investments and innovations that drive further growth and development.

Central to the spectacle of football is the role of forwards, whose primary responsibilities include scoring goals and creating offensive opportunities. The effectiveness of forwards is frequently the decisive factor in match outcomes, making their performance analysis crucial for team success. Efficient and precise evaluation of forwards’ gameplay and contributions is, therefore, essential to optimize team strategies and enhance player development.

This study proposes the use of K-means clustering, a widely used unsupervised machine learning technique, to classify players based on key performance metrics. K-means clustering has been effectively employed in various sports contexts to analyse player performances, design training programs, and evaluate game strategies, demonstrating its utility in grouping data into meaningful categories (Shelly, Z. et al., 2020; Rojas, H.E.U. and Llave, B.C., 2022; Behravan, I. and Razavi, S.M., 2021). By applying this technique to segment forwards in Europe’s top leagues, this research aims to identify distinct player profiles based on metrics such as ‘G/90’, ‘Sh/90’, ‘SoT%’, ‘xG/90’, ‘G-xG’, and ‘Pen Touch/90’, and to describe how goal-scoring performance differs between those profiles.

This study aims to demonstrate how even basic statistical data can reveal distinct playing styles and performance levels among forwards. Classifying players with only high-level scoring statistics would affirm the utility of simple performance metrics and offer actionable insights that can inform team tactics, player management, and tactical planning.

The remainder of this paper is organized as follows: the next section details the methodology used to collect, process, and analyse the data, followed by a presentation of the research findings. The discussion interprets these findings in the context of existing literature, and the final section concludes with a summary of the implications and potential directions for future research.

Methodology

Data Collection

The dataset for this study was sourced from FBref, a comprehensive football statistics website that aggregates performance data across Europe’s top five leagues. The data covered player performances during the 2023–24 season.

Data Cleaning

The initial dataset included a wide range of positions; however, for the purpose of this study, only players classified strictly as ‘Forwards’ were retained.

Players who also played in other positions or had fewer than 900 minutes of gameplay were excluded to maintain a focus on regular and specialized forward players. This filtering was crucial to ensure that the analysis would only reflect the performance of true forwards.
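The filtering step described above can be sketched in pandas. This is only an illustration: the sample rows are invented, and the column names `Pos` and `Min` are assumptions about how the exported table is laid out, not details confirmed by the study.

```python
import pandas as pd

# Hypothetical sample rows; "Pos" and "Min" are assumed column names.
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Pos": ["FW", "FW,MF", "FW", "MF"],
    "Min": [2100, 1800, 450, 2500],
})

MIN_MINUTES = 900  # threshold stated in the text (10 full matches)

# Keep players listed strictly as forwards with at least 900 minutes.
is_pure_forward = players["Pos"] == "FW"
has_enough_minutes = players["Min"] >= MIN_MINUTES
forwards = players[is_pure_forward & has_enough_minutes].reset_index(drop=True)

print(forwards)
```

Matching the position string exactly, rather than checking whether it contains the forward label, is what drops players who are also listed in a second position.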

Variable Selection

Given the extensive nature of the dataset, which contained over 45 different performance metrics, a focused approach was necessary.

The selection was narrowed down to key goal-scoring metrics that would most effectively represent the forwards’ performance:

· Goals per 90 minutes (‘G/90’)

· Shots per 90 minutes (‘Sh/90’)

· Shots on Target percentage (‘SoT%’)

· Expected Goals per 90 (‘xG/90’)

· Goal deviation from expected (‘G-xG’)

· Penalty Area Touches per 90 (‘Pen Touch/90’)

These metrics were not initially all presented on a per 90-minute basis; thus, conversions were performed where necessary to standardize the data, allowing for a consistent comparison across all players and games.
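The per-90 conversion mentioned above amounts to dividing a season total by minutes played and multiplying by 90. A minimal sketch, assuming hypothetical column names (`Gls`, `Sh`, `Min`) and invented sample values:

```python
import pandas as pd


def per_90(total, minutes):
    """Convert a season total into a per-90-minute rate."""
    return total / minutes * 90


# Hypothetical season totals; column names are assumptions.
season = pd.DataFrame({
    "Player": ["A", "B"],
    "Min": [1800, 900],
    "Gls": [10, 3],
    "Sh": [60, 25],
})

season["G/90"] = per_90(season["Gls"], season["Min"])
season["Sh/90"] = per_90(season["Sh"], season["Min"])

print(season[["Player", "G/90", "Sh/90"]])
```

Only count-based totals need this treatment; a ratio such as ‘SoT%’ is already independent of minutes played.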

Clustering Technique

The clustering analysis was conducted using the K-means algorithm, which partitions the data around K mean values (centroids) so as to minimize the variance within each cluster (Burkardt, J., 2009). The number of clusters, four, was determined using the silhouette method, which suggested that this number of clusters best captured the inherent groupings in the data.

Figure 1: Silhouette score plot used to determine the optimal number of clusters for the K-means clustering of forwards. The plot displays silhouette scores for different numbers of clusters.
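A silhouette-based choice of K, as described above, can be sketched with scikit-learn as follows. The data here are synthetic, well-separated blobs standing in for the standardized player features, and the candidate range of 2 to 8 clusters is an assumption; the study's own range is not stated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized feature matrix (six features).
centers = np.array([
    [0, 0, 0, 0, 0, 0],
    [10, 10, 10, 10, 10, 10],
    [0, 10, 0, 10, 0, 10],
    [10, 0, 10, 0, 10, 0],
], dtype=float)
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Fit K-means for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest mean silhouette score is the suggested choice.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters.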

Implementation

Data Preparation: Missing values within the selected variables were imputed using the mean of each respective metric to maintain data integrity.
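Mean imputation of this kind could look like the sketch below, which uses scikit-learn's `SimpleImputer`. The sample values are invented, and whether the original code used this class or a pandas equivalent is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical feature table with two missing entries.
features = pd.DataFrame({
    "G/90": [0.50, 0.30, np.nan],
    "SoT%": [40.0, np.nan, 36.0],
})

# Replace each missing value with the mean of its own column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)

print(imputed)
```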

Standardization: All features were standardized to have zero mean and unit variance, ensuring that no single metric would disproportionately influence the cluster analysis.
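Standardization to zero mean and unit variance can be sketched with scikit-learn's `StandardScaler`. The small matrix below is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rows: G/90, Sh/90, SoT% (values invented for the example).
X = np.array([
    [0.50, 3.0, 40.0],
    [0.30, 2.5, 36.0],
    [0.80, 3.7, 45.0],
    [0.25, 2.3, 34.0],
])

# Subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Without this step, a percentage such as ‘SoT%’ (values in the tens) would dominate the Euclidean distances used by K-means over rates such as ‘G/90’ (values below one).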

Clustering Execution: The K-means clustering was implemented using Python’s scikit-learn library. Multiple runs with different seed values were performed to ensure the stability and reliability of the clusters.
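One way to check stability across seeds is to fit K-means several times and compare the resulting labelings pairwise. The sketch below uses the adjusted Rand index for that comparison, which is an assumption, since the text does not say how the runs were compared. The data are again synthetic.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the standardized feature matrix.
centers = np.array([
    [0, 0, 0, 0, 0, 0],
    [10, 10, 10, 10, 10, 10],
    [0, 10, 0, 10, 0, 10],
    [10, 0, 10, 0, 10, 0],
], dtype=float)
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=1.0, random_state=0)

# Fit the same model with several seeds.
seeds = [0, 1, 2, 3, 4]
labelings = [
    KMeans(n_clusters=4, n_init=10, random_state=seed).fit_predict(X)
    for seed in seeds
]

# The adjusted Rand index ignores label numbering, so 1.0 means identical partitions.
ari = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
min_ari = min(ari)
print(min_ari)
```

Cluster numbers can differ between runs even when the partitions match, which is why a permutation-invariant score is used rather than a direct comparison of labels.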

Centroid Calculation: Post-clustering, the centroids of each cluster were calculated, providing a quantitative profile of the typical performance characteristics within each group.
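Because clustering runs on standardized features, the fitted centroids are in standardized units. A sketch of mapping them back to the original units, using synthetic data and the study's six feature names as column labels:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

feature_names = ["G/90", "Sh/90", "SoT%", "xG/90", "G-xG", "Pen Touch/90"]

# Synthetic, well-separated data standing in for the player table.
centers = np.array([
    [0, 0, 0, 0, 0, 0],
    [10, 10, 10, 10, 10, 10],
    [0, 10, 0, 10, 0, 10],
    [10, 0, 10, 0, 10, 0],
], dtype=float)
raw, _ = make_blobs(n_samples=200, centers=centers, cluster_std=1.0, random_state=0)
X = pd.DataFrame(raw, columns=feature_names)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

# Undo the standardization so the centroids read in the original units.
centroids = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_), columns=feature_names
)

# Cross-check: per-cluster means of the unscaled data.
group_means = X.groupby(km.labels_).mean()
print(centroids.round(2))
```

Since standardization is a linear rescaling, the back-transformed centroids match the per-cluster means of the unscaled features once the algorithm has converged.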

Python Code: The Python code can be found here.

The subsequent sections will detail the findings from this analysis and discuss their implications for football strategy and player management.

Results

The K-means clustering algorithm successfully identified four distinct clusters, demonstrating significant variability in playing styles and performance metrics among the players. Each cluster represents a unique type of forward, differentiated by their roles and effectiveness on the field.

Figure 2: Parallel coordinate plot illustrating the distribution of forwards across the four clusters based on performance metrics.
Figure 3: Violin plots demonstrating the distribution and density of player performance metrics within each cluster.
Figure 4: Box plots illustrating the distributions of shots per 90 minutes (Sh/90), shots on target percentage (SoT%), and goals per 90 minutes (G/90) for each cluster.
Figure 5: Box plots illustrating the distributions of expected goals per 90 minutes (xG/90), goal deviation from expected (G-xG), and penalty area touches per 90 minutes (Pen Touch/90) across each cluster.
Figure 6: Centroid values for the four clusters based on key performance metrics.

The results of the K-means clustering have been visually presented in the figures above, illustrating the distinct characteristics and performance metrics of each cluster. These visualizations provide a clear, initial understanding of the variability and specificity in the playing styles and efficiencies of forwards in Europe’s top five leagues. A detailed examination of these results will be explored in the subsequent Discussion section.

Discussion

Cluster 0: Players in this cluster show balanced, mid-range metrics. They average 0.38 goals and 3.18 shots per 90 minutes, with a shots on target percentage of approximately 37%. Their expected goals rate of 0.47 per 90 is the second highest of the four clusters, yet their goal deviation from expected is notably negative at -1.75, suggesting underperformance relative to the quality of their chances. Their involvement in the penalty area, about 6.23 touches per 90, indicates active participation in goal-scoring opportunities.

Figure 7: Top Five Goalscorers in Cluster 0

Cluster 1: Players in this cluster exhibit high scoring efficiency, achieving an average of 0.53 goals per 90 minutes from 2.40 shots per 90. Their shots on target percentage is the highest of any cluster at 47.19%. Their expected goals rate of 0.41 per 90 is the second lowest of the four clusters; only Cluster 2 is lower. This highlights their efficiency, as they maintain a high goal-scoring rate despite lower expected goals, which is further emphasized by a significant positive deviation from expected goals (+2.39). Their involvement in the penalty area, approximately 4.57 touches per 90, indicates moderate but effective engagement in critical goal-scoring areas.

Figure 8: Top Five Goalscorers in Cluster 1

Cluster 2: This cluster comprises players who are less involved in direct goal-scoring, evidenced by the lowest average goals per 90 (0.25) and shots per 90 (2.29). They also have the lowest shots on target percentage (34.27%) and the lowest expected goals rate (0.27 per 90) among all clusters. The slight negative goal deviation (-0.42) suggests a minor underperformance relative to expected goals. Their involvement inside the penalty box is likewise the lowest, at 4.23 penalty area touches per 90.

Figure 9: Top Five Goalscorers in Cluster 2

Cluster 3: This cluster includes players with the highest offensive output, scoring 0.83 goals per 90 minutes and taking 3.72 shots per 90. Their shots on target percentage is high at 44.78%, and they have the highest expected goals rate (0.74 per 90) of any cluster. A positive goal deviation (+1.79) indicates that they exceed expected goal-scoring metrics. Their involvement in the penalty area is the highest, with 6.91 touches per 90, underscoring their primary role in attacking and finishing plays.

Figure 10: Top Five Goalscorers in Cluster 3
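The centroid values quoted in the four cluster descriptions above can be collected into one table to make the comparisons easier to follow. In the sketch below, the two derived columns (goals per shot, and G/90 minus xG/90) are illustrative additions and not metrics used in the clustering; the SoT% for Cluster 0 is entered as 37.0 because the text gives it only approximately.

```python
import pandas as pd

# Centroid values as quoted in the cluster descriptions.
centroids = pd.DataFrame(
    {
        "G/90": [0.38, 0.53, 0.25, 0.83],
        "Sh/90": [3.18, 2.40, 2.29, 3.72],
        "SoT%": [37.0, 47.19, 34.27, 44.78],  # Cluster 0 value is approximate
        "xG/90": [0.47, 0.41, 0.27, 0.74],
        "G-xG": [-1.75, 2.39, -0.42, 1.79],
        "Pen Touch/90": [6.23, 4.57, 4.23, 6.91],
    },
    index=pd.Index([0, 1, 2, 3], name="Cluster"),
)

# Illustrative derived ratios (not part of the clustering features).
centroids["Goals per shot"] = (centroids["G/90"] / centroids["Sh/90"]).round(3)
centroids["G/90 minus xG/90"] = (centroids["G/90"] - centroids["xG/90"]).round(2)

print(centroids)
```

On these figures, Clusters 1 and 3 convert roughly 22% of their shots into goals (0.221 and 0.223 goals per shot), against roughly 11% to 12% for Clusters 0 and 2 (0.119 and 0.109). This is consistent with the efficiency gap described above.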

Conclusion

This study successfully employed K-means clustering to analyse performance metrics of forwards from Europe’s top five football leagues, revealing four distinct player profiles based on their goal-scoring capabilities. By segmenting players into these specific clusters, the research fulfilled its aim of demonstrating how basic statistical data can be utilized to identify varied playing styles and performance levels among forwards. This methodological approach provided actionable insights that could significantly influence team tactics and player management, aligning with the study’s objectives to enhance understanding through data-driven analysis.

The clustering technique applied here facilitated a systematic examination of forwards, distinguishing them by key performance metrics such as goals per 90 minutes, shots per 90, shots on target percentage, expected goals, goal deviation from expected, and penalty area touches per 90. These insights allow coaches and analysts to tailor training and match strategies more effectively, optimizing each forward’s role based on their specific strengths and tendencies identified through the cluster analysis.

However, the study’s scope introduces certain limitations. Firstly, the analysis focused solely on forwards, excluding other positions that also play critical roles in match outcomes. This focus potentially limits the applicability of the findings across different team dynamics where the interaction between various player positions is key. Secondly, the study concentrated predominantly on the goal-scoring aspect of playing as a forward. While this is a crucial part of a forward’s role, it omits other dimensions such as ball control, passing ability, and defensive contribution, which are also vital for a comprehensive evaluation of a player’s overall impact on the field.

In conclusion, the application of K-means clustering in this study was successful in distinguishing forwards in Europe’s top football leagues into distinct profiles based on straightforward statistical metrics. This research demonstrated how simple statistics can be effectively utilized to classify player types, providing clear, actionable insights that can significantly influence tactical and strategic decision-making in football. However, while the current approach offers valuable initial findings, future research would benefit substantially from incorporating advanced metrics and broadening the analysis to include various playing positions and areas.

References

Behravan, I. and Razavi, S.M., 2021. A novel machine learning method for estimating football players’ value in the transfer market. Soft Computing, 25(3), pp.2499–2511.

Burkardt, J., 2009. K-means clustering. Virginia Tech, Advanced Research Computing, Interdisciplinary Center for Applied Mathematics.

Deloitte, 2019. Annual Review of Football Finance.

FBref, 2024. Big 5 European Leagues Stats. Available at: https://fbref.com/en/comps/Big5/possession/players/Big-5-European-Leagues-Stats

Guimarães, J.H.M.M., 2018. Data analytics applied to football and football players.

Rojas, H.E.U. and Llave, B.C., 2022. Football pitch condition analysis based on k-means clustering. Interfases, (015), pp.57–69.

Shelly, Z., Burch, R.F., Tian, W., Strawderman, L., Piroli, A., and Bichey, C., 2020. Using K-means clustering to create training groups for elite American football student-athletes based on game demands. International Journal of Kinesiology and Sports Science, 8(2), pp.47–63.

Thakkar, P. and Shah, M., 2021. An assessment of football through the lens of data science. Annals of Data Science, pp.1–14.
