Looking around a bit shows I am not alone in using Monte Carlo simulation to estimate the true odds in horse racing. For example, see this Monte Carlo example. The author has a nice idea of using standard horse ratings, namely the Official Rating (OR) and the Racing Post Rating (RPR), as the input to the model. My method is to use the past performance of each horse as the input, which I believe has some neat benefits, but the basic approach is the same.
Basically, the first thing we have to do is define a probability density function (PDF) for the speed we think each horse might run in the race. It might look like this:
It represents how likely the horse is to run at any given speed. The peak of the PDF marks the most likely speed for the horse. It may run faster or slower, but each is less likely, and the PDF tails off at the edges to show this. Speeds at the extreme edges are pretty unlikely. The shape of the PDF is important. The typical thing to do is to use a Normal (in other words, Gaussian) form for the PDF. This is not a bad choice, because Gaussian PDFs crop up all over the place in nature, so the likely running speed for a horse probably follows one.
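To make the shape concrete, here is a minimal sketch that evaluates a Gaussian PDF at a few speeds. The numbers are assumptions chosen purely for illustration (a hypothetical horse with a most likely speed of 16.0 m/s and a standard deviation of 0.5 m/s), and `gaussian_pdf` is just a helper name I've made up for the example.

```python
import math


def gaussian_pdf(speed, mean, std_dev):
    """Density of a normal distribution with the given mean and width at `speed`."""
    z = (speed - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2.0 * math.pi))


# Hypothetical values, for illustration only.
mean_speed = 16.0  # most likely speed (peak of the PDF)
std_dev = 0.5      # width of the PDF

# The density is highest at the mean and falls off symmetrically either side.
for speed in (15.0, 15.5, 16.0, 16.5, 17.0):
    print(f"speed {speed:4.1f}  density {gaussian_pdf(speed, mean_speed, std_dev):.4f}")
```

The printed densities should peak at 16.0 and drop away equally on both sides, which is the bell shape described above.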
When we execute the Monte Carlo race model, we run a lot of imaginary races (maybe 1000 or more) and simply count the times each horse wins. To simulate each race, we draw a random speed for each horse from its PDF, so that the most likely race speed for each horse sits in the middle of its distribution, with the frequency falling off towards the edges. We then rank the horses by speed and work out the winner. Here the choice of a Gaussian PDF is handy, because programming languages often have a ready-made function for generating random samples from a normal distribution. I'm using Python, and it does the job nicely.
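The loop described above can be sketched in a few lines using `random.gauss` from Python's standard library. This is a rough illustration under stated assumptions, not a definitive implementation: the horse names, mean speeds and shared standard deviation are all hypothetical, and `simulate_races` is a name invented for the example.

```python
import random


def simulate_races(mean_speeds, std_dev, n_races=1000, seed=None):
    """Run n_races imaginary races and count how often each horse wins.

    mean_speeds maps a horse name to the centre of its Gaussian speed PDF.
    std_dev is the (assumed shared) width of every horse's PDF.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in mean_speeds}
    for _ in range(n_races):
        # Draw one random speed per horse from its own normal distribution.
        speeds = {name: rng.gauss(mean, std_dev) for name, mean in mean_speeds.items()}
        # The fastest horse in this imaginary race is the winner.
        winner = max(speeds, key=speeds.get)
        wins[winner] += 1
    return wins


# Hypothetical three-horse field, for illustration only.
field = {"Horse A": 16.2, "Horse B": 16.0, "Horse C": 15.8}
print(simulate_races(field, std_dev=0.3, n_races=1000, seed=1))
```

Passing a `seed` only makes a run repeatable. The win counts will still wobble from seed to seed, and the wobble shrinks as `n_races` grows.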
So, we have the results of a thousand or so simulated races, and we can now calculate the "true" odds for the horses easily enough: each horse's win probability is simply its share of the simulated wins. We made a few assumptions along the way, but if these hold, our odds should be good. Just to state those assumptions again:
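As a quick illustration of that last step, the sketch below turns win counts into a win probability and the matching fair odds, using the standard relations decimal odds = 1 / p and fractional odds = (1 - p) / p. The win counts are made-up numbers, not results from my model, and `win_counts_to_odds` is a hypothetical helper name.

```python
def win_counts_to_odds(wins):
    """Convert simulated win counts into win probability and fair odds."""
    total = sum(wins.values())
    odds = {}
    for name, count in wins.items():
        if count == 0:
            # The horse never won in the simulation, so there is no finite estimate.
            odds[name] = None
            continue
        probability = count / total
        odds[name] = {
            "probability": probability,
            "decimal": 1.0 / probability,                     # total return per unit staked
            "fractional": (1.0 - probability) / probability,  # profit per unit staked, "x to 1"
        }
    return odds


# Made-up win counts from 1000 imaginary races, for illustration only.
example_wins = {"Horse A": 500, "Horse B": 300, "Horse C": 200}
for name, line in win_counts_to_odds(example_wins).items():
    print(name, line)
```

A horse that wins half of the simulated races comes out at decimal odds of 2.0, which is evens. These are "fair" odds with no bookmaker margin built in.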
- We assume the past performance of the horse provides a good measure of its quality.
- We assume the horse's speed PDF is a Gaussian distribution positioned in proportion to the horse's quality score.
- We have to "invent" a width for this Gaussian (called the standard deviation). This is a bit of a weakness, in that we have to make this up to make the odds look right. Still, it should probably be fairly constant for all races.
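To see why the choice of width matters so much, here is a small sketch that reruns the same hypothetical field with several standard deviations and reports how often the favourite wins. Everything here is an assumption for illustration: the mean speeds, the widths tried, and the helper name `favourite_win_share`.

```python
import random


def favourite_win_share(mean_speeds, std_dev, n_races=5000, seed=0):
    """Fraction of simulated races won by the horse with the highest mean speed."""
    rng = random.Random(seed)
    favourite = max(mean_speeds, key=mean_speeds.get)
    favourite_wins = 0
    for _ in range(n_races):
        # One random speed per horse, all sharing the same width.
        speeds = {name: rng.gauss(mean, std_dev) for name, mean in mean_speeds.items()}
        if max(speeds, key=speeds.get) == favourite:
            favourite_wins += 1
    return favourite_wins / n_races


# Hypothetical three-horse field, for illustration only.
field = {"Horse A": 16.2, "Horse B": 16.0, "Horse C": 15.8}
for width in (0.1, 0.3, 1.0):
    print(f"std_dev {width:.1f}  favourite wins {favourite_win_share(field, width):.1%}")
```

A narrow width makes the best horse win nearly every time, while a wide one pushes the field towards a lottery, so the invented standard deviation directly controls how short or long the resulting odds are.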
So, that's the basis of the approach. I'm working on some improvements that are quite subtle, which I'll introduce in future posts. Now let's see how well it works!