Wednesday 11 March 2020

Coronavirus Data Analysis: Very Worrying Findings

I have been digging into the coronavirus data. First of all, the easiest one to access: Worldometer,
https://www.worldometers.info/coronavirus/
A useful data source, although it only gives historic data for the total case count and the death count. In their raw form, these accumulating totals are not very informative, so I turned them into a "counts per day" form and then plotted them on a log scale. Here is the cases per day graph.
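The conversion itself is just differencing consecutive totals. Here is a minimal sketch of that step; the numbers below are made up for illustration, not real case data:

```python
# Turn Worldometer-style cumulative totals into daily new counts.
# These figures are invented, purely to show the transformation.
cumulative_cases = [100, 150, 230, 360, 500, 610, 700]

# Daily new cases = difference between consecutive cumulative totals
daily_cases = [b - a for a, b in zip(cumulative_cases, cumulative_cases[1:])]
print(daily_cases)  # [50, 80, 130, 140, 110, 90]

# Plotting on a log scale (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.semilogy(daily_cases)
# plt.ylabel("new cases per day (log scale)")
# plt.show()
```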



The useful thing about using a log scale is that exponential increases (a sign of uncontrolled growth) show up as straight lines. On the left side of the graph you can see the exponential growth that took place in China, early in the epidemic. Then there is a dip, as China brought their outbreak under control. On the right, we see the case rate increase again as the virus takes off in Europe. I was initially reassured that this growth seems to be slowing down. But then I took a look at the death rate graph.

The notable feature is that the death rate seems to be accelerating, if anything. It is now at its highest rate ever, surpassing the peak of the Chinese wave. Was this discrepancy just a glitch?

To understand more, I downloaded the Johns Hopkins University dataset from GitHub. This seems to be the best source of simple, QC'd data on Covid-19. I scratched out some Python code to do similar plots, broken down by country, focusing on a few places of interest: Hubei, South Korea, Iran and some key European countries. Below are the case rate and death rate plots:
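For reference, the per-country processing boils down to a groupby and a difference. The sketch below assumes the JHU time series has one row per region with a "Country/Region" column followed by date columns; the tiny frame here is synthetic stand-in data, not the real dataset:

```python
import pandas as pd

# Synthetic stand-in for the JHU wide-format time series
df = pd.DataFrame({
    "Country/Region": ["Italy", "Iran"],
    "3/1/20": [1694, 978],
    "3/2/20": [2036, 1501],
    "3/3/20": [2502, 2336],
})

# Sum any provinces within a country, then difference across the date
# columns to get new cases per day
totals = df.groupby("Country/Region").sum()
daily = totals.diff(axis=1).dropna(axis=1)
print(daily)
```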

Look at the death rate graph... Most noticeable is the exponential growth occurring in Italy, and also in Iran, although at a slightly lower rate. This shows that the epidemic is completely out of control in these countries. Now look back at the new case count graph. The growth for these countries is tailing off. So I did a cross check of deaths reported against total cases. Below is the result.

The ratio of deaths to true cases (i.e. the fatality rate) SHOULD be relatively constant, so a high apparent death rate indicates a very poor rate of detecting cases. What this shows is that Italy has exponential increases in both cases and deaths, but a terrible, and worsening, case detection rate via testing: almost ten times worse than the best performers. Iran is at least slightly more under control. The UK is doing better, and South Korea, who seem to have things under control now, have done best of all.
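The cross check itself is a one-liner per country. Here is a sketch; the counts below are illustrative placeholders, not the actual March 2020 figures:

```python
# Deaths divided by reported cases. If the true fatality rate is roughly
# constant, a high ratio suggests the country is under-detecting cases.
# These numbers are invented for illustration only.
reported = {"Italy": 10149, "Iran": 8042, "UK": 382, "South Korea": 7513}
deaths   = {"Italy": 631,   "Iran": 291,  "UK": 6,   "South Korea": 54}

for country in reported:
    ratio = deaths[country] / reported[country]
    print(f"{country}: apparent fatality rate {ratio:.1%}")
```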

The lesson: Beware Italy and Iran. They have immense problems and no solution in sight.


Thursday 5 March 2020

Coronavirus - why politicians and journalists should learn multiplication

Context switch. Coronavirus. Covid-19. Something tells me I am going to be blogging about it a lot in the next few months. And that something is data science and models. Disease spread models tell us, based on some assumptions, what is likely to happen next. In my other life, I have been playing with disease spread models a lot recently, under a research project, and have come to understand their ways.

Currently I am worried about Covid-19, and I believe you should be too. This is a case where being scared can save you. It can save us all. I'll explain it all in future posts, but here are the basic facts.

When the virus first emerged in China, it was not noticed for a while. When doctors did begin to notice an unusual increase in a particular type of pneumonia, the information was suppressed, for political reasons. Nobody did anything to slow the spread, so it spread like wildfire. Finally the Chinese government came to their senses and resolved to shut down the exponential growth in cases. They were remarkably successful, because they took it seriously, to the maximum degree. Had they not done so, the ever multiplying number of cases would have continued until most of the population had been infected. Multiplying is the key word here. Cases would have doubled every few days - doubling and doubling.

So, the Chinese got a grip. They know how to shut things down. It was remarkable.

But cases dispersed out from the core, travelling on aeroplanes to all corners of the world. And now, the growth begins again. In Iran, where the problem was not acknowledged, there has again been exponential growth. In Italy, it seems to have been spotted rather late, leading to another case of explosive growth. The lesson from China is very clear: if you cut pairwise contact down by an order of magnitude, the virus spread CAN be shut down. But the government here in the UK (for here I be) is playing it another way. They will start to think seriously about shutting it down once it gets big and scary. For now, they will watch and wait. This is the most dangerous nonsense. The course of the virus is already set; there are already hundreds of people incubating the virus in the UK. The growth will be exponential, and it will be harder and harder to stop, the longer this goes on.

Key messages for today:

  1. Take this seriously; it's coming
  2. Take the maximum precautions you can; minimize pairwise contact, wash your hands, eat food cooked at home, etc.
  3. Stock up on food
By avoiding catching this at all costs, you protect yourself and others.

I will talk about the models and what the data is saying in future posts. Sadly, journalists and politicians don't understand models and data science. They don't understand, so YOU need to. 


Tuesday 5 March 2019

Bitcoin woes

Just veering off the main topic for a minute to visit old pastures. A couple of news stories about bitcoin exchanges and disappearing funds...

https://www.ccn.com/new-zealand-bitcoin-exchange-cryptopia-postpones-launch

https://www.independent.co.uk/life-style/gadgets-and-tech/news/bitcoin-exchange-quadrigacx-password-cryptocurrency-scam-a8763676.html

This is by no means the first time bitcoin has disappeared, of course. Remember Mt. Gox and Mark Karpelès, or the Silk Road case, where a huge number of coins went missing during the investigation.
Something about these latest cases is a little mystifying to me. I did some work on tracing coins a few years back. Bitcoin is an open ledger. It was always pretty easy to track coins at any point, to find out when coins were transferred out of a wallet and follow the trail downstream, with the right tools. And analytics have improved plenty too, with techniques like heuristic clustering used to cluster wallets together by ownership (working out who owns a given wallet is the harder part). There are companies that specialize in bitcoin analytics. So, what's going on? Why is it so hard to follow illicit coins and blacklist them? Is there some nation state actor (I wonder who?) at work stealing it and converting to hard currency? Perhaps ransomware isn't raising enough these days?
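To give a flavour of the clustering idea: the classic "common input ownership" heuristic assumes that addresses spent together as inputs of one transaction share an owner, so you can merge them with a union-find. A toy sketch, with all addresses and transactions invented:

```python
# Union-find over addresses; inputs spent together get merged.
parent = {}

def find(addr):
    parent.setdefault(addr, addr)
    while parent[addr] != addr:
        parent[addr] = parent[parent[addr]]  # path halving
        addr = parent[addr]
    return addr

def union(a, b):
    parent[find(a)] = find(b)

# Each (made-up) transaction lists the addresses that funded it
transactions = [
    ["addr_A", "addr_B"],  # A and B spent together -> same owner
    ["addr_B", "addr_C"],  # C joins the same cluster via B
    ["addr_D"],            # D stays on its own
]
for inputs in transactions:
    for addr in inputs[1:]:
        union(inputs[0], addr)

print(find("addr_A") == find("addr_C"))  # True: clustered together
print(find("addr_A") == find("addr_D"))  # False: separate wallets
```

Real analytics firms layer several such heuristics, plus change-address detection, but this is the core trick.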

Saturday 9 February 2019

Called it right!

Liverpool 3, Bournemouth 0! Well done boys!

It's a little bit unfair when forecasters take credit for "getting it right" when they make a probability-based forecast. Not sure if all my odds were right, but the maximum likelihood option did come in!

Liverpool v Bournemouth Today: Analytics and Form Heatmaps

I have been running some analysis on today's games. Here are some numbers for one of the crucial ones - Liverpool v Bournemouth. I've developed a "heatmap" to show the form of both teams!

Here is a heatmap for Liverpool. Attacking form increases left to right, while defensive form increases up the page. Believe me, this shows phenomenal form!


Compare this with Bournemouth. They are clearly less good!!

So, here is my odds estimate!

        x:0     x:1     x:2     x:3     x:4      x:5      x:6   x:7  x:8  x:9
0:y    57/1    90/1   367/1  1723/1  8332/1  49999/1      ---   ---  ---  ---
1:y    15/1    28/1    90/1   426/1  2272/1  16666/1  49999/1   ---  ---  ---
2:y     9/1    16/1    58/1   234/1  1999/1  49999/1      ---   ---  ---  ---
3:y     8/1    13/1    45/1   229/1  1922/1  16666/1  49999/1   ---  ---  ---
4:y     9/1    15/1    58/1   242/1  2499/1  16666/1      ---   ---  ---  ---
5:y    13/1    22/1    78/1   367/1  2173/1  16666/1      ---   ---  ---  ---
6:y    22/1    39/1   123/1   609/1  4999/1  24999/1      ---   ---  ---  ---
7:y    47/1    75/1   235/1  1281/1  8332/1      ---      ---   ---  ---  ---
8:y   105/1   179/1   514/1  7142/1 24999/1      ---      ---   ---  ---  ---
9:y   164/1   275/1   745/1  4999/1 49999/1  49999/1      ---   ---  ---  ---

summary:  Home win, 1/8; Away win, 29/1; Score draw 17/1; No score draw, 57/1.

Looks like the most likely result (by a whisker) is 3 - nil to Liverpool.
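For anyone unused to fractional odds: a quote of a/b pays out a for every b staked, so (assuming "fair" odds with no bookmaker margin built in) it corresponds to a probability of b / (a + b). A quick sketch:

```python
# Convert fractional odds to an implied probability, assuming no margin.
def fractional_to_prob(numerator, denominator=1):
    return denominator / (numerator + denominator)

print(fractional_to_prob(57))    # 0-0 at 57/1: about a 1.7% chance
print(fractional_to_prob(1, 8))  # home win at 1/8: about 88.9%
```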


A reminder, this is a new method and still needs to be validated, so use with caution.

Wednesday 6 February 2019

Everton v Man City Tonight!

So, here is my first odds forecast, for tonight's prem game...

Here is a grid of the odds for all score combinations up to 9 apiece...

        x:0     x:1     x:2     x:3     x:4     x:5     x:6     x:7     x:8     x:9
0:y    19/1     8/1    10/1    13/1    32/1    74/1   207/1   768/1  2499/1  9999/1
1:y    17/1     9/1     9/1    15/1    28/1    70/1   226/1   832/1  1999/1  9999/1
2:y    42/1    19/1    19/1    31/1    68/1   146/1   285/1  1666/1       -  9999/1
3:y   151/1    59/1    61/1    92/1   166/1   369/1  1110/1       -  9999/1       -
4:y   344/1   262/1   178/1   434/1   999/1  1999/1  3332/1  4999/1       -       -
5:y  2499/1  1110/1   908/1  1666/1  3332/1  4999/1       -       -       -       -
6:y  4999/1  4999/1       -  9999/1       -       -       -       -       -       -
7:y  9999/1       -       -       -       -       -       -       -       -       -
8:y       -       -       -       -       -       -       -       -       -       -
9:y       -       -       -       -       -       -       -       -       -       -


And here are some summary odds:
Home win: 44/10;
Away win: 1/2;
Score draw: 5/1;
No score draw: 19/1

These are based on 10,000 modeled games and 5000 particles per team. Interested to know if there are other odds people would be interested in.
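To show roughly how a score grid like this can be produced by simulation: draw each team's goals from a Poisson distribution many times and count score frequencies. The goal rates below are invented for illustration; the real model draws rates from the particle set rather than fixing them.

```python
import math
import random

random.seed(1)

def poisson(lam):
    # Knuth's algorithm for a Poisson draw, stdlib only
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

n_games = 10_000
counts = {}
for _ in range(n_games):
    score = (poisson(1.1), poisson(2.0))  # (home goals, away goals)
    counts[score] = counts.get(score, 0) + 1

home_win = sum(c for (h, a), c in counts.items() if h > a) / n_games
print(f"P(home win) is roughly {home_win:.2f}")
```

Each cell of the grid is then just that score's frequency converted to fractional odds.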

I compared with Bet365 odds, which are generally fairly similar. My numbers seem to like the idea of a home win slightly more than Bet365 does. 1-0 to Everton looks like a value bet (although the odds are long).

Health warning: I am still validating this model, although I believe the approach is generally solid!

***Update*** The game finished 0-2 (a win for Man City). It stood at 0-1 until injury time, which would have tallied with my most likely result.

Sunday 3 February 2019

Back to Bayes-ics

As explained last post, to do our football analytics, we need some input parameters describing how "good" the two teams facing each other in a match are likely to be, on the day. There are two alternative approaches. One is based on classical statistics: you look back over a load of matches and work out an average scoring rate and an average rate of conceding. You can also estimate, on average, how much better the team performs at home. This approach has some weaknesses though. A team can get better or worse over the season, so it's no good at telling you how good the team will be today. It also requires quite a lot of data (a lot of matches) and assumes every team's form is stable over time. In other words, it makes some assumptions that are not true. Which is never good.
A much better approach is to use Bayesian statistics. Thomas Bayes was a statistician with a keen interest in games of chance, hence his work is very relevant to all sorts of gambling! The formulas he gave us are all about inferring the underlying truth from a series of observations. Each observation modifies our belief in a given hypothesis. To cut a long story short, Bayesian inference crops up everywhere in modern analytics.
The particular method I am deploying for football match analysis is the Particle Filter - a modern development, based entirely on Bayesian inference. You can find a pretty good intro to particle filters in this slide deck. Note the reference to football results analysis on slide 24... Using a PF for football analysis is a nice party trick that often crops up in tutorial material, although I do it in a slightly more sophisticated way than the standard approach.
Applying a particle filter to the English Premier League works like this:

  • Each team is represented by a large number of "particles", each of which is a guess at the "model" - i.e. the qualities of the team (its attacking strength, defensive strength etc.)
  • Between fixtures, we "advance" these models, saying in effect: last week the team was like this, so how might it have moved on this week?
  • After a fixture, we "filter" the particles, preferentially keeping those that best explain the result. Incidentally, this is where Bayes comes in. His theorem says that instead of asking the hard question, "how good is my particle (model), given the result?", we can ask "how likely is my result, given my model?". This turns out to be an easier question and one we can answer. Importantly, we consider not just the result but the capabilities of both teams involved. Hence all the analysis is interconnected.
  • Now, when two teams face off, we have a set of guesses about the teams' capabilities at the present time that is based on all previous results, especially the last result. We can model the game considering the full range of guesses and get the best possible odds prediction, given the evidence.
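The loop above can be sketched in a few lines of Python. This is a toy version with invented numbers throughout: each particle guesses a single goal-scoring rate, we jitter the particles between fixtures, then reweight them by the Poisson likelihood of the observed score and resample.

```python
import math
import random

random.seed(0)

N = 2000
particles = [random.uniform(0.2, 3.0) for _ in range(N)]  # guessed goal rates

def advance(ps, jitter=0.05):
    # "How might the team have moved on since last week?"
    return [max(0.01, p + random.gauss(0, jitter)) for p in ps]

def poisson_likelihood(goals, rate):
    return math.exp(-rate) * rate ** goals / math.factorial(goals)

def filter_step(ps, observed_goals):
    # Bayes: weight each particle by P(result | model), then resample
    weights = [poisson_likelihood(observed_goals, p) for p in ps]
    total = sum(weights)
    weights = [w / total for w in weights]
    return random.choices(ps, weights=weights, k=len(ps))

for goals in [2, 3, 2, 4]:  # a made-up run of scores for one team
    particles = advance(particles)
    particles = filter_step(particles, goals)

estimate = sum(particles) / len(particles)
print(f"estimated goal rate is roughly {estimate:.2f}")
```

The real model tracks several qualities per particle (attack, defence, home advantage) and filters on full results rather than one team's goals, but the advance/filter rhythm is exactly this.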
In a nutshell, that's it. My plan now is to publish some predictions before the weekend fixtures and try to ascertain if we can beat the bookies. That's my goal. Bookies are there to be beaten, after all.