Feature Selection

One of the biggest difficulties in machine learning is high dimensionality. The dimension of a problem is the number of distinct input variables. For example, if we were trying to predict the temperature outside given the humidity and air pressure, our problem would have a dimension of 2. A high-performance computer could find such a relationship rather quickly. But what if we made the problem more complex and asked the computer to find a relationship between 20 different measurements and the temperature? The amount of data needed to cover the space of possible inputs grows exponentially with the dimension, and the time and number of computations required grow with it.
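To get a feel for that growth, here is a small sketch. It assumes, purely for illustration, that each input variable is sampled at 10 distinct values; the grid resolution and the function name are hypothetical and not part of the analysis.

```python
# Illustrative only: assume each input variable is sampled at 10 distinct values.
POINTS_PER_AXIS = 10  # hypothetical grid resolution


def grid_size(dimension: int, points_per_axis: int = POINTS_PER_AXIS) -> int:
    """Number of input combinations needed to cover every grid point."""
    return points_per_axis ** dimension


for d in (2, 5, 20):
    print(f"dimension {d:>2}: {grid_size(d):,} input combinations")
```

Going from 2 inputs to 20 multiplies the number of combinations by a factor of 10 for every added input, which is why trimming even a few features can matter.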

When we have a problem like predicting transfer fee, we have a wide variety of statistics to look at. This would give us a very high dimension for our problem, which makes creating a model difficult. But it’s likely the case that some statistics are irrelevant or don’t significantly affect transfer fee. We can thus reduce the dimension of our problem significantly and make it feasible using feature selection.

Feature selection involves determining which features (in our case statistics) are most important to predict our output (transfer fee). Then, we create our model using only these features, which reduces our dimension and increases feasibility (Machine Learning Mastery).

In addition to reducing dimensionality, feature selection helps prevent overfitting and can improve accuracy. If we include noisy and essentially meaningless input data, our model may overfit by assigning weight to these inputs for the given training data, when in reality they should not be considered in our model. Selecting only certain features also removes potentially misleading data (Machine Learning Mastery). A human looking at a sample of this data might try to draw a connection between hair color and transfer fee, even though we know no such connection exists (although it may appear to in certain samples of data). Computers and machine learning models can make the exact same mistake!
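The overfitting effect can be sketched on synthetic data. The example below is an assumption-laden illustration, not part of the transfer fee analysis: it uses NumPy least squares, one made-up informative input, and 30 made-up noise columns, and compares the fit on training data with the fit on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def with_intercept(X):
    """Prepend a column of ones so the linear model has an intercept."""
    return np.column_stack([np.ones(len(X)), X])


def r_squared(X, y, coef):
    """Coefficient of determination of the linear fit X @ coef on (X, y)."""
    residual = y - X @ coef
    centered = y - y.mean()
    return 1.0 - (residual @ residual) / (centered @ centered)


# Synthetic data: one informative input plus 30 columns of pure noise.
n_train, n_test, n_noise = 40, 200, 30
n_total = n_train + n_test
signal = rng.normal(size=(n_total, 1))
noise = rng.normal(size=(n_total, n_noise))
y = 2.0 * signal[:, 0] + rng.normal(scale=1.0, size=n_total)

results = {}
candidates = {"signal only": signal, "signal + noise": np.hstack([signal, noise])}
for name, features in candidates.items():
    X = with_intercept(features)
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    results[name] = (r_squared(X_train, y_train, coef), r_squared(X_test, y_test, coef))
    print(f"{name:>15}: train R^2 = {results[name][0]:.3f}, test R^2 = {results[name][1]:.3f}")
```

Adding columns can never make a least-squares fit worse on the training data, so the training score alone rewards the noise columns. The held-out score is the one that shows whether they actually helped.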

There are many techniques for feature selection; we will use 3 different techniques and analyze the results: univariate selection, recursive feature elimination, and feature importance (Machine Learning Mastery).

Univariate selection uses a statistical test to determine which features (inputs) correlate most strongly with the output, considering each feature on its own. This gives us an indication of which variables have the highest impact on the output. The following Python code shows the result of univariate selection using scikit-learn's SelectKBest with the f_regression scoring function, a univariate linear regression F-test:

import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# Score every feature in X against the target y with a univariate F-test
features = SelectKBest(f_regression, k=15).fit(X, y)
df_scores = pd.DataFrame(features.scores_)
df_cols = pd.DataFrame(X.columns)

# Pair each feature name with its score and show the 15 highest
feature_scores = pd.concat([df_cols, df_scores], axis=1)
feature_scores.columns = ["Feature", "Score"]
print(feature_scores.nlargest(15, "Score"))

Which gives the output:

                          Feature       Score
3                   goals_overall  308.347641
7            clean_sheets_overall  217.322483
13           goals_per_90_overall  208.558743
11  goals_involved_per_90_overall  150.363727
4                 assists_overall  138.483388
16       min_per_conceded_overall  117.849721
5                   penalty_goals  116.795887
2             appearances_overall   90.225162
6                  penalty_misses   86.756875
1          minutes_played_overall   64.364946
0                             age   60.581390
15        conceded_per_90_overall   37.953083
12         assists_per_90_overall   27.192637
18           min_per_card_overall   23.744603
9            yellow_cards_overall   12.693165

This shows us that goals and clean sheets scored highest, so univariate selection ranks them as the most important features.
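For intuition about what the Score column measures: as we understand it, f_regression turns the Pearson correlation r between a single feature and the target into an F-statistic, F = r² / (1 - r²) × (n - 2), where n is the number of samples. The sketch below reproduces that calculation in plain Python. The helper names and the toy numbers are hypothetical and are not taken from the transfer data set.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def f_score(x, y):
    """F-statistic of a one-feature linear regression of y on x."""
    r = pearson_r(x, y)
    return r * r / (1.0 - r * r) * (len(x) - 2)


# Hypothetical toy data: six players, fee in millions.
goals = [2, 5, 9, 12, 20, 25]
shoe_size = [43, 41, 44, 42, 45, 42]
fee = [1.0, 2.5, 4.0, 7.5, 12.0, 18.0]

print("goals:    ", round(f_score(goals, fee), 2))
print("shoe size:", round(f_score(shoe_size, fee), 2))
```

With these toy numbers, goals receives a far larger score than shoe size, which is the same kind of separation the table above shows between the top and bottom features.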

Next, we use recursive feature elimination. In this technique, features are eliminated in rounds. In each round, all of the remaining features are used to build a model, and the least important feature is removed. This process is repeated until the desired number of features remains. The following code uses this algorithm to find the 3 most important features:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Rank features with a linear model and keep the top 3
model = LinearRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
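# The loop body is cut off in the source; printing the column name is an assumed completion
for index, val in enumerate(fit.support_):
    if val:
        print(X.columns[index])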

Which gives the output:

Num Features: 3
Selected Features: [False False False False False False False False False False False  True
  True  True False False False False False]
Feature Ranking: [ 7 15 11  4  6 12  2  5 10  9  8  1  1  1 17  3 14 13 16]


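This technique selected goals_involved_per_90_overall, assists_per_90_overall, and goals_per_90_overall (positions 11 to 13 in the arrays above, matching the row labels in the univariate table), so it found goals and assists to be the most valuable. The interesting result here is that it chose the “per 90 minutes” version of each of these statistics, and that one of the three, goals involved per 90, counts being involved in a goal in any capacity.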
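The elimination rounds described above can be sketched by hand. This is a simplified illustration on synthetic data, not scikit-learn's implementation: the function name simple_rfe is hypothetical, the model is ordinary least squares via NumPy, and "least important" is taken to mean the smallest absolute coefficient.

```python
import numpy as np


def simple_rfe(X, y, n_keep):
    """Repeatedly fit least squares and drop the feature with the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Intercept column followed by the features still in play
        design = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[1:])))  # skip the intercept
        del remaining[weakest]
    return remaining


# Synthetic data: only columns 0 and 2 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

print("kept columns:", simple_rfe(X_demo, y_demo, n_keep=2))
```

Because the size of a coefficient depends on the units of its feature, a ranking like this is most meaningful when the features are on comparable scales.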

Finally, we explore feature importance. This technique fits an ensemble of randomized decision trees and measures how much each feature contributes to the splits in those trees. We can use this as an estimate of the importance of the feature. Here we will use an Extra Trees Regressor, which is a regression algorithm built from a forest of randomized decision trees (Machine Learning Mastery). The following code shows this:

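import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesRegressor

# Fit an ensemble of randomized trees; results vary between runs unless random_state is set
model = ExtraTreesRegressor()
model.fit(X, y)

# Importance of each feature, labeled by column name
feature_importances = pd.Series(model.feature_importances_, index=X.columns)

# Assumed completion (the plotting call is missing from the source): horizontal bar chart
feature_importances.sort_values().plot(kind="barh")
plt.show()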

Which gives the output:

[Figure: feature importances from the Extra Trees Regressor]

From this, we have overall clean sheets, goals, and goals involved per 90 minutes as the most important features.

Each of these techniques has its benefits and drawbacks. The feature importance technique, for example, uses a randomized algorithm, which means that its results can vary from run to run. Given the size of our data set, recursive feature elimination (RFE) is the most robust: it rebuilds the best fit after removing a single feature in each iteration. The results from RFE can also be explained logically. All three features it chose are “per 90” statistics, which makes sense since those statistics control for things like injuries, strength of schedule, and other factors. The transfer value of a player is about his productivity and efficiency, not just total overall output. Even so, considering all 3 techniques, a clear trend appears: the value of a player is best explained by goals and assists, and less so by factors such as age, minutes played, cards earned, and appearances.
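The run-to-run variation of the feature importance technique can be checked directly. The sketch below is one possible approach on synthetic data, not something done in the analysis above: it fixes random_state so each run is repeatable, then compares importances across several seeds. The data, the helper name importances, and the choice of 5 seeds are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: column 0 matters most, column 1 a little, columns 2 and 3 not at all.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=300)


def importances(seed):
    """Fit an Extra Trees model with a fixed seed and return its feature importances."""
    model = ExtraTreesRegressor(n_estimators=100, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.feature_importances_


# One row of importances per seed
runs = np.array([importances(seed) for seed in range(5)])
print("mean importance per feature:", runs.mean(axis=0).round(3))
print("spread across seeds (std):  ", runs.std(axis=0).round(3))
```

If the spread across seeds is small compared with the gaps between features, the ranking can be trusted more; if not, averaging over several seeds is a reasonable way to steady it.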

This result may seem obvious, but it does have some interesting implications. First, it suggests that there is a premium paid for superstars. Superstars are often the flashy ones scoring goals or making beautiful through passes. From this analysis, these actions correlate best with a high transfer fee. But what about the best defenders and keepers in the league? These results would indicate that there is likely a premium paid for superstar forwards over superstar defenders or keepers. In other words, you would pay much more for the best forward than you would for the best defender, even though both provide you with the league’s best output. This could be a tip to managers looking to get value for their transfer dollars: buy superstar defenders and keepers, and look to young rising talent or the club’s youth academy for forwards and attacking midfielders.


Machine Learning Mastery. Feature Selection For Machine Learning in Python.