# FIFA 20 Data

As thoroughly outlined in the Feature Selection landing page, developing a predictive method for the Transfer Fee of a player is an intensive process. The most impactful features in determining a player’s worth included the following:

```
goals_involved_per_90_overall
assists_per_90_overall
goals_per_90_overall
```

This regression identified the most valuable statistics on the transfer market; however, statistics are well known to tell an incomplete story of a soccer match. See this blog post for more on The Difficulty of Statistically Analyzing Match Performance.

In search of a more data-driven approach, the 36 player attributes incorporated into FIFA 20 can be used as a foil for real-life professional soccer statistics. Depending on a player’s position, his or her weighted average rating is calculated based on a unique distribution of these 36 metrics.
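As a rough illustration of a position-dependent weighted average, the sketch below combines a handful of attributes using a per-position weight table. The weights, the three attributes chosen, and the names `weighted_rating` and `STRIKER_WEIGHTS` are hypothetical placeholders, not the distribution FIFA 20 actually uses.

```python
def weighted_rating(attributes, weights):
    """Weighted average of the attributes named in `weights`."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(attributes[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical weights for a striker (NOT the real FIFA 20 values).
STRIKER_WEIGHTS = {"Finishing": 5, "Positioning": 3, "Shot_Power": 2}

# Made-up attribute ratings; attributes without a weight are ignored.
player = {"Finishing": 90, "Positioning": 80, "Shot_Power": 70, "Marking": 30}

striker_rating = weighted_rating(player, STRIKER_WEIGHTS)
print(striker_rating)
```

A different position would simply supply a different weight table, so the same player can receive a different rating depending on where he or she plays.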

(Another interesting component of this report was the position-wise regression of transfer fees. These models, fitted in RStudio, identified which statistics are most lucrative for each position. For example, a forward’s transfer fee can be predicted with the equation:

`Transfer Fee (Million £) = 8.78 * Goals + 7.84 * Assists - 0.027 * Minutes Played + 8.37`

This was the most significant regression, but the midfielders’ results were interesting in that assists were valued significantly higher than goals. Finally, the defenders’ results indicated an inverse correlation between goals conceded and transfer fees – also very reasonable.)
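To make the forward equation concrete, here is a minimal sketch that evaluates it for one player. The function name `predict_forward_fee` and the sample inputs are made up for illustration, and it assumes Goals, Assists, and Minutes Played are season totals, which the report does not state explicitly.

```python
def predict_forward_fee(goals, assists, minutes_played):
    """Predicted transfer fee in million GBP, using the forward regression above."""
    return 8.78 * goals + 7.84 * assists - 0.027 * minutes_played + 8.37


# Hypothetical forward: 10 goals and 5 assists in 2,000 minutes.
example_fee = predict_forward_fee(goals=10, assists=5, minutes_played=2000)
print(round(example_fee, 2))  # 87.8 + 39.2 - 54.0 + 8.37 = 81.37
```

Note the negative coefficient on minutes: for a fixed number of goals and assists, a forward who needed fewer minutes to produce them is predicted to be worth more.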

In addition to determining a player’s overall rating, these 36 attributes can also serve as the data set for predicting a player’s worth. The same machine learning techniques that were applied to the official soccer statistics can be applied to the FIFA data.

The following code was used to score every attribute by univariate selection and print the 20 highest-scoring features:

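```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Score every attribute in X against the target y with the chi-squared test.
# k only limits what transform() keeps; scores_ still covers every column.
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

# Pair each attribute name with its score and print the 20 highest.
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
print(featureScores.nlargest(20, 'Score'))
```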

Results:

```
                 Specs        Score
11         Gk Reflexes  2566.316481
9            Finishing  1829.106330
18             Marking  1721.851206
26      Sliding_Tackle  1571.904722
32             Volleys  1559.240133
29     Standing_Tackle  1520.755150
17          Long_Shots  1365.197172
10  Free_Kick_Accuracy  1280.670941
21         Positioning  1227.818970
8            Dribbling  1227.013862
14       Interceptions  1148.642781
6             Crossing  1011.847402
7                Curve   986.739920
25          Shot_Power   882.620074
20           Penalties   840.008643
5         Ball_Control   793.621343
31              Vision   746.930722
24       Short_Passing   653.089402
2           Aggression   637.132533
```
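To give a feel for what the Score column measures, the sketch below computes a chi-squared statistic for a single attribute in plain Python. It follows the observed-versus-expected construction that scikit-learn's `chi2` is built on, as far as I understand it. The test needs non-negative feature values and discrete class labels, so the example assumes the transfer fee has been binned into classes such as "high" and "low". The function name `chi2_score` and the toy ratings are invented for illustration and are not taken from the FIFA data.

```python
from collections import defaultdict


def chi2_score(feature_values, labels):
    """Chi-squared statistic of one non-negative feature against class labels."""
    if any(v < 0 for v in feature_values):
        raise ValueError("chi-squared scoring needs non-negative feature values")
    n_samples = len(labels)
    feature_total = sum(feature_values)

    observed = defaultdict(float)  # sum of the feature within each class
    class_count = defaultdict(int)  # number of samples in each class
    for value, label in zip(feature_values, labels):
        observed[label] += value
        class_count[label] += 1

    score = 0.0
    for label, count in class_count.items():
        # Expected share if the feature were independent of the class.
        expected = feature_total * count / n_samples
        if expected > 0:
            score += (observed[label] - expected) ** 2 / expected
    return score


# Toy data: fee classes and two made-up attributes for four players.
fee_class = ["high", "high", "low", "low"]
informative = chi2_score([90, 85, 40, 35], fee_class)    # differs by class
uninformative = chi2_score([60, 60, 60, 60], fee_class)  # identical everywhere
print(informative, uninformative)
```

An attribute whose total is spread across the classes in proportion to class size scores zero, while one concentrated in a single class scores high, which is why attributes that separate expensive from cheap players rise to the top of the table.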

Next, a feature importance analysis was run to visualize which attributes have the greatest influence on transfer fees:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
print(model.feature_importances_)  # built-in importances of tree-based models

# Plot the 10 most important attributes for easier comparison.
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
```

Results:

Finally, a correlation heat map was created to show not only how each attribute relates to a player’s transfer fee, but also how the attributes relate to one another.

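```python
import matplotlib.pyplot as plt
import seaborn as sns

corrmat = df.corr()  # pairwise correlations between the columns of df
top_corr_features = corrmat.index
plt.figure(figsize=(100, 100))
g = sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="RdYlGn")
```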

Results: