Machine Learning

To explore the relationship between transfer price and performance on the field, we will use basic techniques of machine learning. But what is machine learning? Broadly, it is a family of techniques for building models that learn patterns from data rather than being explicitly programmed. For this research, we will narrow our focus to supervised learning. Simply put, we want to create a model that maps inputs to outputs. In our case, the inputs will be various match statistics for a player (goals, assists, etc.) and the output will be their transfer price. If we can find a good model for this relationship, then given another player and their statistics, we can predict a fair market transfer value for them. Taking this one step further, we can use our model on past seasons and compare our algorithm's predicted transfer fee with the actual transfer fee. We can then look at a wide variety of transfers across leagues to see which players were likely not worth the hefty fee. In particular, we're interested in the differences between expensive, star player transfers and younger, rising talent.

But how exactly will we create this model? We begin with labeled training data, meaning historical data that contains player statistics and their corresponding transfer fees. Now let's think of our machine learning algorithm as a black box. This black box receives a player's statistics as input, transforms them, and outputs a transfer fee. To create our model (the black box's transformation), we give it labeled data (known input-output pairs). It then tunes its transformation to match, as best as possible, these input-output pairs. We then have a model that can be used on new inputs to produce a predicted output. But not every machine learning technique is created equal! It is very possible that during training, the algorithm overfits the labeled data, meaning it works very well on the already known training data but does not generalize to new input (Towards Data Science).
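The idea of holding data out to catch overfitting can be sketched in a few lines of Python. Everything here is hypothetical: the (goals, fee) pairs are made up, and the one-weight model is far simpler than anything used in a real analysis.

```python
# Hedged sketch of a train/test split on hypothetical (goals, fee) pairs.
labeled = [(10, 20.0), (5, 9.0), (20, 41.0), (8, 15.0), (15, 31.0), (2, 5.0)]

# Hold out the last third as a test set the model never sees during training.
split = len(labeled) * 2 // 3
train, test = labeled[:split], labeled[split:]

# "Train" a one-weight model fee ~ w * goals via the least-squares formula.
w = sum(g * f for g, f in train) / sum(g * g for g, _ in train)

# Error on the held-out test set reveals how well the model generalizes;
# a model that merely memorized the training rows would do poorly here.
test_error = sum(abs(w * g - f) for g, f in test) / len(test)
```

Low error on the training data alone proves little; it is the held-out error that tells us whether the model has actually learned the relationship.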

There are two general types of models: classification and regression. Classification models produce a discrete output, such as classifying email as spam or not spam. Regression models produce a continuous output (like a number), such as predicting the temperature or the price of a home. Since we are trying to predict transfer price, a continuous value, we will use a regression model. Specifically, we will use linear regression, in which the model is a weighted sum of the input features (also known as a linear combination). An example model may look something like this:

Transfer Fee = 0.2 * Goals + 0.015 * Assists – 0.3 * Goals Conceded

This simple model is just an illustration, but it would suggest that goals raise transfer fees more than assists do, while goals conceded lower them.
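As a sketch, the illustrative model above can be written directly in Python; the weights are the same made-up numbers from the formula, not fitted values.

```python
# The illustrative model above as code; the weights are made-up numbers
# from the example formula, not values fitted to real data.
weights = {"goals": 0.2, "assists": 0.015, "goals_conceded": -0.3}

def predicted_fee(stats):
    # Linear combination: each statistic multiplied by its weight, summed.
    return sum(weights[name] * value for name, value in stats.items())

# Hypothetical player: 12 goals, 8 assists, none conceded.
print(round(predicted_fee({"goals": 12, "assists": 8, "goals_conceded": 0}), 2))  # prints 2.52
```

Training replaces these hand-picked weights with ones learned from the labeled data, but the shape of the model stays exactly this simple.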

Creating a model like this involves a few steps. The first is obtaining the data to base the model on. We need historical (labeled) data, where we know the inputs (statistics) and the output (transfer fee paid). We also need new data (unlabeled) to run our model on to predict transfer fees. Head on over to the Data page to see the data sets and their sources that we use for our analysis.

The next step is data cleansing. This involves removing unwanted, misleading, or badly formatted data from the data set. In our case, we will remove free transfers and loaned players. As explained on the Data page, our statistics differentiate between home games and away games. This split should not affect transfer fee, so we collapse those categories as well. For goals scored, for instance, we will only look at overall goals scored rather than home goals and away goals separately. Closely related to cleansing are feature selection and dimensionality reduction, which reduce training time and can improve accuracy. Head on over to the Feature Selection page to learn more.
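A minimal sketch of this cleansing step, using hypothetical records rather than the real data set described on the Data page, might look like:

```python
# Hedged sketch of the cleansing step on hypothetical records, not the
# actual data set described on the Data page.
raw = [
    {"player": "A", "fee": 30.0, "loan": False, "home_goals": 6, "away_goals": 4},
    {"player": "B", "fee": 0.0,  "loan": False, "home_goals": 2, "away_goals": 1},  # free transfer
    {"player": "C", "fee": 12.0, "loan": True,  "home_goals": 5, "away_goals": 3},  # loaned player
]

cleaned = []
for row in raw:
    # Remove free transfers and loaned players.
    if row["fee"] == 0.0 or row["loan"]:
        continue
    # Collapse the home/away split into a single overall total.
    cleaned.append({
        "player": row["player"],
        "fee": row["fee"],
        "goals": row["home_goals"] + row["away_goals"],
    })
```

Only player A survives: B was a free transfer and C a loan, and A's home and away goals are merged into one overall total.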

We then perform model fitting, which involves training with our machine learning algorithm to determine the weights on each important selected feature. The algorithm uses the known training data to learn the relationship between the inputs and the outputs.
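To make the fitting step concrete, here is a hedged sketch that learns two weights by stochastic gradient descent on a tiny, made-up training set; a real analysis would typically rely on a library implementation instead.

```python
# Hedged sketch: fit weights for (goals, assists) by stochastic gradient
# descent on a tiny, made-up training set. Real fitting would use a library.
train = [((10, 4), 22.0), ((5, 2), 11.0), ((8, 6), 19.0), ((12, 1), 24.5)]

w = [0.0, 0.0]      # one weight per feature
lr = 0.001          # learning rate
for _ in range(20000):
    for (goals, assists), fee in train:
        pred = w[0] * goals + w[1] * assists
        err = pred - fee
        # Nudge each weight against the gradient of the squared error.
        w[0] -= lr * err * goals
        w[1] -= lr * err * assists
```

Because the toy data was generated with weights 2.0 and 0.5, the learned weights converge toward those values, which is exactly the "tuning the transformation" described above.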

Next, we evaluate the model. We first test it on historical labeled data that was held out of training to measure its accuracy, and then run it on new data, drawing conclusions and making observations about the results.
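One common accuracy measure for regression is mean absolute error, the average gap between predicted and actual values. A minimal sketch with hypothetical fee numbers:

```python
# Hedged sketch: mean absolute error between actual and predicted fees.
# The fee values below are hypothetical.
def mean_absolute_error(actual, predicted):
    # Average absolute gap between each prediction and the true value.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual_fees    = [35.0, 12.0, 60.0, 8.0]
predicted_fees = [30.0, 15.0, 55.0, 10.0]
print(mean_absolute_error(actual_fees, predicted_fees))  # prints 3.75
```

An average miss of 3.75 (in the same units as the fees) gives a direct, interpretable sense of how far off the model's valuations tend to be.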

Finally, we fine-tune the parameters of the model to incrementally improve accuracy and give us the best insights into the problem we are studying!
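As one example of what tuning can look like, we can sketch a sweep over a ridge-style regularization strength, keeping whichever value gives the lowest error on held-out validation data. All numbers here are made up, and the single-feature closed form is only for illustration.

```python
# Hedged sketch of parameter tuning on made-up data, not our real pipeline.
train = [(10, 20.0), (5, 9.0), (20, 41.0), (8, 15.0)]   # (goals, fee) pairs
valid = [(15, 28.0), (2, 3.5)]                           # held-out validation

def fit(alpha):
    # Closed-form ridge solution for a single-feature model: fee ~ w * goals.
    return sum(g * f for g, f in train) / (sum(g * g for g, _ in train) + alpha)

def val_error(w):
    # Mean absolute error on the validation set.
    return sum(abs(w * g - f) for g, f in valid) / len(valid)

# Try several regularization strengths and keep the best-performing one.
best_alpha = min([0.0, 1.0, 10.0, 100.0], key=lambda a: val_error(fit(a)))
```

The point is the loop, not the specific knob: any parameter of the model can be swept this way, with held-out validation error deciding which setting to keep.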