How to best quantify football?

More data describing and analysing football matches is available than ever before, provided by a number of sources and quantifying every aspect of the game. To counteract new controversies, FIFA aims to help stakeholders better understand some of these key statistics.

Not least since the likes of Moneyball in baseball, there have been attempts to turn every sport, including football, into a numbers game. With improving video quality, automated image processing and large resources for post-match analysis, the football world has more statistics available than we care to notice. With media, broadcasters, teams and the audience at large gaining access to this data, and data science taking an interest in the “big data” of football, FIFA assessed the validity of some of this information.

One of the longest-standing ways of quantifying a football match is known as “Event Data”. In short, the game is broken down into a series of “events” such as passes, shots, goals or tackles, thus technically describing what is happening at every moment of the game. Typically, a game will be divided into several thousand events, depending on the depth of information provided. These events then allow statistical analyses of players, games or entire leagues, and in particular comparisons between teams, seasons or individuals.
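
As a rough illustration of the idea, event data can be thought of as a time-ordered list of records that are then counted or grouped. The Python sketch below is a minimal example under stated assumptions: the `Event` fields, the event type names, the `count_events` helper and the sample values are all hypothetical and do not reflect any particular provider's format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One hypothetical match event; field names are illustrative only."""
    second: int   # seconds since kick-off
    team: str
    player: str
    kind: str     # e.g. "pass", "shot", "goal", "tackle"


def count_events(events, by="kind"):
    """Count events grouped by one attribute, e.g. "kind" or "team"."""
    return Counter(getattr(event, by) for event in events)


# Invented sample data, not taken from any real match.
sample = [
    Event(12, "Home", "Player A", "pass"),
    Event(15, "Home", "Player B", "pass"),
    Event(19, "Away", "Player C", "tackle"),
    Event(47, "Home", "Player B", "shot"),
    Event(48, "Home", "Player B", "goal"),
]

kind_counts = count_events(sample)             # events per type
team_counts = count_events(sample, by="team")  # events per team
```

A real feed would carry far more detail per event (location, outcome, qualifiers), which is where the "depth of information provided" comes in.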

A recent internal study carried out during the FIFA Club World Cup 2018, comparing a number of leading providers, gave insight into the validity, reliability and comparability of the various data sets. As a result, FIFA can highlight some trends with a call for caution:

    1. Even for the most basic factual decisions, there are still discrepancies between how providers quantify matches. While the number of goals tends to be consistent, the numbers of free-kicks, corners or even yellow cards are not always accurate.
    2. More subjective indicators such as “successful tackles” or “completed crosses”, which rely to a degree on operational definitions, showed discrepancies between providers of upwards of 50%.
    3. Due to the nature of the business, almost all data providers still rely heavily on manual processing. While good providers have very high levels of real-time accuracy, much of the data is not verifiable quickly enough.
    4. Quality control of the data, for example through new technologies, is still in its infancy.
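
The study's method for measuring discrepancies is not described here. One possible way to express the disagreement between providers on a single indicator, offered purely as an assumption, is the spread between the highest and lowest count relative to the lowest. In the Python sketch below, the `relative_spread` function, the provider names and the counts are all invented for illustration.

```python
def relative_spread(counts):
    """Spread between the largest and smallest count, relative to the smallest.

    This definition is an assumption for illustration; it is not
    the measure used in the FIFA study.
    """
    values = list(counts.values())
    if not values:
        raise ValueError("at least one provider count is required")
    low, high = min(values), max(values)
    if low == 0:
        # Avoid division by zero: no spread if all counts are zero.
        return 0.0 if high == 0 else float("inf")
    return (high - low) / low


# Invented counts of "successful tackles" for one match.
tackles = {"provider_a": 20, "provider_b": 31, "provider_c": 24}
spread = relative_spread(tackles)  # (31 - 20) / 20
```

Other definitions, such as the spread relative to the mean, would give different percentages for the same counts, which is one more reason to check how a figure was derived before comparing data sets.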

All of the above shows that a degree of skepticism towards such data is healthy, especially when comparing different competitions or data over time. While the wish to have as much data as quickly as possible is understandable, there is a duty of due diligence to ensure that this data accurately reflects the game. This becomes all the more critical when developing performance indexes or trying to judge player performance across different data sets.