April 18, 2012

Chess and Statistics Part 1

I've wanted to talk about this for a long time, but I never really found the time, and when I did, I always ended up procrastinating and doing something else.

Nevertheless, I start simply by refuting a claim about the Elo rating system which many people already know to be false:

"A person's rating does not mean anything"

I can only say that anybody who finds the nerve to say something like this either:

1. has no understanding of the Elo rating system, or
2. has no understanding of statistics
(or both)

Some may recognise a version of that phrase being said by Raymond some time ago, but it is not limited to him, and he is definitely not the sole target of what is about to follow (I have better things to do!).

It is indeed unfortunate that I have also heard this nonsensical statement (or some less dire version of it) from chess players whom I know and respect (though certainly not for their grasp of mathematical statistics!), and I hope that by the end of this article, should you read it through, you will find that a person's rating does indeed tell you a lot about that player.

To some, it seems incredibly trivial that this statement is clearly incorrect; but if one puts too much thought into it, it does indeed get confusing. Take the famous case of one person whom Garry Kasparov, the highest-rated player in the world, has a minus score against: Boris Gulko, who, in terms of rating, is much weaker than Kasparov. The rating system is in jeopardy! Suddenly things are not so clear: we have to closely examine the workings of the Elo rating system.

Behind such a fallacious claim is, to some extent, the assumption that the Elo system assumes the better player will always win. Remember too that the Elo system takes the higher-rated player to be the better one. It then follows that if the lower-rated player beats the higher-rated player, the premise of our rating system breaks down, because the outcome has contradicted the system's prediction.

To anybody with a formal understanding of logic, it's clear that the fault in such an argument lies in its premise, which is, in this case, itself just an assumption (namely, that the Elo system assumes the better player always wins)!

We need to go back to the foundations of the Elo rating system.

Rhetorical question: What is it?
Answer: It's a prediction model for the outcome of a game between two players (Note: not limited to chess!).

Next rhetorical question: How does it work?
Answer: Unfortunately this is rather complicated to define formally and to understand at first glance.

A difficult part is assigning a rating to a player; we'll assume that the ratings given are in their "steady state", i.e. "appropriate". This will be the case after assigning a provisional rating and playing a large number of games, assuming the model is good. In a way, two key features of the model support each other: an appropriate rating and the prediction made. Strictly speaking, this is philosophically unsound, but in practice it is acceptable (real world example: money and goods/services; we are willing to exchange one for the other, even though fiat money doesn't have any intrinsic value).

We'll look at the basic ideas of how you compute your rating change:
1. Define the result of a game as we know it: 1 point to the winner, 0 to the loser and 0.5 each for a draw (interpretation: 1 point to be given away per game, 2 players fight for it).
2. Compute the rating difference between yourself and your opponent*. Plug this into a formula to get an expected value, denoted "We" (for "wins expected"), strictly between 0 and 1. This is the model's prediction.
3. Note the actual outcome of the game, W: 1 for a win, 0.5 for a draw and 0 for a loss.
4. Calculate W-We, the difference between what actually happened and what was predicted**.
5. Multiply this by the K-factor defined by FIDE (30, 15 or 10). That's your rating change (a short sketch of these steps in code follows below).
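For the programmatically inclined, here is a minimal sketch of steps 2 to 5 in Python. The expectancy formula itself isn't spelled out above; the one below is the standard logistic formula used by FIDE, and the ratings and K-factor in the example are simply made up.

    # A minimal sketch of the update described above. The expectancy formula
    # is the standard logistic one used by FIDE; the ratings and K-factor
    # below are made-up examples.
    def expected_score(own_rating, opp_rating):
        """Step 2: the model's prediction We, strictly between 0 and 1."""
        return 1.0 / (1.0 + 10.0 ** ((opp_rating - own_rating) / 400.0))

    def rating_change(own_rating, opp_rating, result, k=15):
        """Steps 3-5: the actual result W (1, 0.5 or 0) minus We, times the K-factor."""
        return k * (result - expected_score(own_rating, opp_rating))

    # Example: a 2000-rated player beats a 2200-rated player, K = 15.
    print(expected_score(2000, 2200))    # We is about 0.24
    print(rating_change(2000, 2200, 1))  # about +11.4 rating points

Note that only the rating difference enters the calculation, which is exactly the point made in the first footnote below.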

In a sense, you could think that your rating changes because the model made an "error" in predicting the outcome of your game (somewhat true, but this "error" is what makes the system work!).

Enough beating around the bush; we are starting to go off on a tangent here. Remember that the key point is the outcome that the Elo system predicts; in fact, steps 3, 4 and 5 are irrelevant to our discussion.

The false claim that we started with essentially stems from a distrust of step (2). From a statistical point of view, this is due to unfamiliarity with the notion of an expected value. How can a number like 0.73 make any sense when the actual outcome can only be 0, 0.5 or 1? What does this mysterious number tell us?

Well, consider a coin toss. Biased coins aside, we can safely say that heads is just as likely as tails. Say we get a point for heads and none for tails. Then the expected value is 0.5. Yet if we toss the coin once, we can only score 1 or 0. But say we decide to toss the coin 6 times, and I ask you to tell me how many heads will turn up. You wouldn't bet your house on it, but if anything, the most probable answer is 3. It may be 2, it may be 4, but most likely 3 (which we obtain by multiplying the expected value by the number of trials, i.e. 0.5*6). Admittedly, by "most likely" I mean there's a 31.25% chance of this happening. But that's still the lion's share of the probabilities: you can get anywhere between 0 and 6 heads, so the remaining 6 individual possibilities have to split 68.75% of the probability among themselves. So you are very likely (notice I did not use the word "will"!) to get close to 3 heads; in fact, you'll get between 2 and 4 heads 78.125% of the time.
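If you'd rather not take my word for those percentages, a few lines of Python with the binomial distribution will verify them (nothing here beyond the numbers already quoted above):

    # Probability of exactly k heads in 6 fair tosses: C(6, k) / 2**6.
    from math import comb

    probs = {k: comb(6, k) / 2 ** 6 for k in range(7)}
    print(probs[3])                        # 0.3125, i.e. 31.25%
    print(1 - probs[3])                    # 0.6875 shared among the other six outcomes
    print(probs[2] + probs[3] + probs[4])  # 0.78125, i.e. 2 to 4 heads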

Anyway, within this context, it is not wrong to say that We is the proportion of the total points you will get from the games played in the long run***. That is, the model predicts that, assuming neither player learns anything from the previous games, if your wins expected is 0.73 and you play 100 games against each other, you're likely to score about 0.73*100 = 73 points. This is essentially the law of large numbers at work.
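Here is a toy simulation of that thought experiment. The 0.73 figure and the simplification of scoring each game as an all-or-nothing point (ignoring draws) are purely for illustration; the point is only that the long-run proportion settles near the expected value.

    # Toy illustration of the law of large numbers with We = 0.73.
    # Each game is scored as a straight 0.73 chance of the full point;
    # real games also have draws, but only the expectation matters here.
    import random

    random.seed(1)
    for n in (10, 100, 10000):
        score = sum(1 for _ in range(n) if random.random() < 0.73)
        print(n, score / n)  # the proportion drifts towards 0.73 as n grows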

(In fact, this is basically how rating changes are calculated: we don't look at each individual game, but rather at your overall score in a tournament; we take your rating, the average rating of all your opponents, and then your score against them. Interestingly, the rating change is, in a sense, linear: you can calculate the change against each opponent individually, or work with the aggregate over all your opponents, and you get the same numbers. This allows people to calculate individual rating changes instead, which is in fact the more common approach nowadays.)
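That linearity is easy to see in code: summing the per-game changes gives exactly the same number as applying the K-factor once to (total score minus total expected score). The ratings and results below are invented for illustration.

    # Per-game rating changes versus one calculation on the totals.
    def expected_score(own, opp):
        return 1 / (1 + 10 ** ((opp - own) / 400))

    opponents = [2150, 1980, 2310, 2045]  # hypothetical opponent ratings
    results   = [1, 0.5, 0, 1]            # hypothetical results
    me, k = 2100, 15

    per_game  = sum(k * (w - expected_score(me, r)) for r, w in zip(opponents, results))
    on_totals = k * (sum(results) - sum(expected_score(me, r) for r in opponents))
    print(per_game, on_totals)            # the same number either way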

In a nutshell, some people fail to grasp that a 73% chance of winning your game does not mean that you will win the game, which is when they lose faith in the Elo system.

This is where applied statistics becomes difficult for some people to accept: such a scenario, where two players play a large number of games while learning nothing from the previous one, is impossible to create. So how do we know it's working? By empirical evidence! We check the predictions of the model against the actual outcomes:


[Chart from ChessBase: the dots are the actual outcomes; the line is the model's prediction.]

As far as the model is concerned, the basic shape is right, except that the line appears slightly too squashed inwards. Also, the horizontal segments are due to the difference between the players' ratings being capped at 400 when calculations are made, a cap which apparently should be removed. So fundamentally the model is good; it's just a case of changing a few numerical parameters. Applied statistics works!
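Incidentally, it's easy to see where those horizontal segments come from: the rating difference is capped at 400 points before the prediction is computed, so every larger gap yields the same We. A quick sketch (the 400-point cap is the real FIDE rule; the ratings are made up):

    # The flat ends of the prediction line: the rating difference is capped
    # at 400 before We is computed, so any larger gap gives the same prediction.
    def expected_score(own, opp, cap=None):
        diff = opp - own
        if cap is not None:
            diff = max(-cap, min(cap, diff))
        return 1 / (1 + 10 ** (diff / 400))

    for gap in (300, 400, 500, 700):
        uncapped = expected_score(2400, 2400 - gap)
        capped   = expected_score(2400, 2400 - gap, cap=400)
        print(gap, round(uncapped, 3), round(capped, 3))
        # the uncapped curve keeps rising; the capped one sticks at about 0.909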

*Interesting note: this implies that only the difference in rating matters; the model predicts the same outcome for a game between a 2600 and a 2400 as for one between a 2400 and a 2200, which further implies that if the ratings of everybody in the world went down by 1000 points in the next rating list, essentially nothing would change!

**This number is always positive if you win and negative if you lose; if you draw, it's positive if your rating is lower than your opponent's and negative if it's higher.

***If you're unhappy with this definition, here's a link to the Wikipedia page for a proper one; eat your heart out.

An important thing to remember is that applied statistics in the real world is essentially about prediction. A good model makes correct predictions most of the time, and in fact there is no model that can predict everything in a system (in this case, the world of chess games). If there were, the system would be deterministic, i.e. without randomness, which we know cannot be the case (chess players can have lapses of concentration and off days, neither of which we can see coming, at least not 100% of the time).

Needless to say, there is always room for improvement in the model, but sometimes this improvement requires the consideration of more variables, thus complicating the model. An example would be whether you played White or Black, where statistics show that White has an advantage. Furthermore, as the absolute strength of the players (not the difference between them) falls, this advantage diminishes (compare White's score in low-level tournaments with that in high-level tournaments), and we would lose property (*) in doing so.

And never forget, ratings attempt to reflect your current strength, but changes in rating are constrained. If a 2000-rated player improves his understanding of the game and is suddenly 100 points stronger, the model would see his performance in a tournament and say, "Sorry, my prediction was erroneous, here's a positive W-We for you", and carry on until eventually (not instantly!) he reaches 2100. Of course it works both ways: a rusty player will typically play below his "steady state" rating, which is why FIDE has an inactive list.
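To get a feel for how quickly the catching-up happens, here is a toy simulation of that improved player: rated 2000 but actually playing at 2100 strength against a pool of opponents around 2000. All the numbers (opponent pool, K-factor, draws ignored) are invented for illustration.

    # A player rated 2000 who actually plays at 2100 strength: the repeated
    # positive W - We pushes the rating up over successive games.
    import random

    def expected_score(own, opp):
        return 1 / (1 + 10 ** ((opp - own) / 400))

    random.seed(7)
    rating, true_strength, k = 2000.0, 2100.0, 15
    for game in range(1, 201):
        opp = 2000 + random.uniform(-100, 100)   # opponents of similar rating
        # the true strength decides the result (draws ignored for simplicity)
        w = 1.0 if random.random() < expected_score(true_strength, opp) else 0.0
        rating += k * (w - expected_score(rating, opp))
        if game % 50 == 0:
            print(game, round(rating))           # heads towards 2100, with some noise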

To conclude: A person's Elo rating is not an arbitrary 4-digit number, and sure as hell means more than nothing. Granted, it is not a magic number that determines the outcome of a game, but you can predict the outcome with a certain degree of confidence based on this number. Empirical evidence has shown that the Elo system is very good at making predictions in general, although there is undoubtedly still room for improvement. Consequently, since the predictions are good, it follows that ratings are indeed a good indicator of a player's strength.

Further reading: ChessBase's and Wikipedia's explanation of the Elo rating system.