Covid-19 deaths

I wrote last week about how the number of cases of coronavirus was following a textbook exponential growth pattern. I didn’t look at the number of deaths from coronavirus at the time, as there were too few deaths in the UK for a meaningful analysis. Sadly, that is no longer true, so I’m going to take a look at that today.

First, though, let’s have a quick update on the number of cases. There is a glimmer of good news here, in that cases have been rising more slowly than we might have predicted from the figures I looked at last week. Here is the growth in cases, with the predicted line based on last week’s numbers.

As you can see, cases in the last week have consistently been lower than predicted based on the trend up to last weekend. However, I’m afraid this is only a tiny glimmer of good news. It’s not clear whether this represents a real slowing in the number of cases or merely reflects the fact that not everyone showing symptoms is being tested any more. It may just be that fewer cases are being detected.

So what of the number of deaths? I’m afraid this does not look good. This is also showing a classic exponential growth pattern so far:

The last couple of days’ figures are below the fitted line, so there is a tiny shred of evidence that the rate may be slowing down here too, but I don’t think we can read too much into just 2 days’ figures. Hopefully it will become clearer over the coming days.

One thing which is noteworthy is that the rate of increase of deaths is faster than the rate of increase of total cases. While the number of cases is doubling, on average, every 2.8 days, the number of deaths is doubling, on average, every 1.9 days. Since it’s unlikely that the death rate from the disease is increasing over time, this does suggest that the number of cases is being recorded less completely as time goes by.
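For anyone who wants to check the arithmetic, converting between a daily growth factor and a doubling time is a one-liner. Here’s a quick sketch in Python, just illustrating the relationship using the doubling times quoted above (it isn’t the code used for the actual fitting):

```python
import numpy as np

def doubling_time(daily_growth_factor):
    """Days taken for the count to double, given the factor it multiplies by each day."""
    return np.log(2) / np.log(daily_growth_factor)

# Illustrative values only: the growth factors implied by the doubling times quoted above.
cases_factor = 2 ** (1 / 2.8)   # cases doubling every 2.8 days -> about 28% growth per day
deaths_factor = 2 ** (1 / 1.9)  # deaths doubling every 1.9 days -> about 44% growth per day

print(f"Cases:  {cases_factor - 1:.0%} per day, doubling every {doubling_time(cases_factor):.1f} days")
print(f"Deaths: {deaths_factor - 1:.0%} per day, doubling every {doubling_time(deaths_factor):.1f} days")
```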

So what happens if the number of deaths continues growing at the current rate? I’m afraid it doesn’t look pretty:

(Note that I’ve plotted this on a log scale.)

At that rate of increase, we would reach 10,000 deaths by 1 April and 100,000 deaths by 7 April.

I really hope that the current restrictions being put in place take effect quickly so that the rate of increase slows down soon. If not, then this virus really is going to have horrific effects on the UK population (and of course on other countries, but I’ve only looked at UK figures here).

In the meantime, please keep away from other people as much as you can and keep washing those hands.

Covid-19 and exponential growth

One thing about the Covid-19 outbreak that has been particularly noticeable to me as a medical statistician is that the number of confirmed cases reported in the UK has been following a classic exponential growth pattern. For those who are not familiar with what exponential growth is, I’ll start with a short explanation before I move on to what this means for how the epidemic is likely to develop in the UK. If you already understand what exponential growth is, then feel free to skip to the section “Implications for the UK Covid-19 epidemic”.

A quick introduction to exponential growth

If we think of something, such as the number of cases of Covid-19 infection, as growing steadily, we might expect a similar number of new cases each day. That would be a linear growth pattern. Let’s assume we have 50 new cases each day; after 60 days we’ll have 3,000 cases. A graph of that would look like this:

That’s not what we’re seeing with Covid-19 cases. Rather than following a linear growth pattern, we’re seeing an exponential growth pattern. With exponential growth, rather than adding a constant number of new cases each day, the number of cases increases by a constant percentage amount each day. Equivalently, the number of cases multiplies by a constant factor in a constant time interval.

Let’s say that the number of cases doubles every 3 days. On day zero we have just one case, on day 3 we have 2 cases, on day 6 we have 4 cases, on day 9 we have 8 cases, and so on. This makes sense for an infectious disease epidemic. If you imagine that each person who is infected can infect (for example) 2 new people, then you would get a pattern very similar to this. When only one person is infected, that’s just 2 new people who get infected, but if 100 people have the disease, then 200 people will get infected in the same time.

On the face of it, the example above sounds like it’s growing much less quickly than my first example where we have 50 new cases each day. But if you are doubling the number of cases each time, then you start to get to scarily large numbers quite quickly. If we carry on for 60 days, then although the number of cases isn’t increasing much at first, it eventually starts to increase at an alarming rate, and by the end of 60 days we have over a million cases. This is what it looks like if you plot the graph:

It’s actually quite hard to see what’s happening at the beginning of that curve, so to make it easier to see, let’s use the trick of plotting the number of cases on a logarithmic scale. What that means is that a constant interval on the vertical axis (generally known as the y axis) represents not a constant difference, but a constant ratio. Here, the ticks on the y axis represent an increase in cases by a factor of 10.

Note that when you plot exponential growth on a logarithmic scale, you get a straight line. That’s because we’re increasing the number of cases by a constant ratio in each unit time, and a constant ratio corresponds to a constant distance on the y axis.
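If you’d like to reproduce the toy example yourself, here’s a minimal sketch in Python using matplotlib (the numbers are just the ones from the example above: 50 new cases a day versus doubling every 3 days):

```python
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(61)
linear_cases = 50 * days             # linear growth: 50 new cases every day
exponential_cases = 2 ** (days / 3)  # exponential growth: doubling every 3 days from 1 case

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].plot(days, linear_cases)
axes[0].set_title("Linear: 50 new cases/day")
axes[1].plot(days, exponential_cases)
axes[1].set_title("Exponential: doubling every 3 days")
axes[2].plot(days, exponential_cases)
axes[2].set_yscale("log")  # a constant ratio becomes a constant distance, so the line is straight
axes[2].set_title("Same exponential curve, log scale")
for ax in axes:
    ax.set_xlabel("Day")
    ax.set_ylabel("Cases")
plt.tight_layout()
plt.show()
```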

Implications for the UK Covid-19 epidemic

OK, so that’s what exponential growth looks like. What can we see about the number of confirmed Covid-19 cases in the UK? Public Health England makes the data available for download here. The data have not yet been updated with today’s count of cases as I write this, so I added in today’s number (1372) based on a tweet by the Department of Health and Social Care.

If you plot the number of cases by date, it looks like this:

That’s pretty reminiscent of our exponential growth curve above, isn’t it?

It’s worth noting that the numbers I’ve shown are almost certainly an underestimate of the true number of cases. First, it seems likely that some people who are infected will have only very mild (or even no) symptoms, and will not bother to contact the health services to get tested. You might say that it doesn’t matter if the numbers don’t include people who aren’t actually ill, and to some extent it doesn’t, but remember that they may still be able to infect others. Also, there is a delay from infection to appearing in the statistics. So the official number of confirmed cases includes people only after they have caught the disease, gone through the incubation period, developed symptoms that were bothersome enough to seek medical help, got tested, and had the test results come back. This represents people who were infected probably at least a week ago. Given that the number of cases is growing so rapidly, the number of people actually infected today will be considerably higher than today’s statistics for confirmed cases.

Now, before I get into the analysis, I need to decide where to start it. I’m going to start from 29 February, as that was when the first case of community transmission was reported, so by then the disease was circulating within the UK community. Before then, the numbers had mainly been driven by people arriving in the UK having caught the disease abroad, so the pattern was probably a bit different.

If we start the graph at 29 February, it looks like this:

Now, what happens if we fit an exponential growth curve to it? It looks like this:

(Technical note for stats geeks: the way we actually do that is with a linear regression analysis of the logarithm of the number of cases on time, calculate the predicted values of the logarithm from that regression analysis, and then back-transform to get the number of cases.)
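For anyone who’d like to see that recipe spelled out, here is a minimal sketch in Python (the file and column names are hypothetical placeholders, and this isn’t the exact code used for the analysis):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical CSV with one row per day: a 'date' column and a cumulative 'cases' column,
# starting from 29 February.
df = pd.read_csv("uk_confirmed_cases.csv", parse_dates=["date"])
df["day"] = (df["date"] - df["date"].min()).dt.days

# Linear regression of log(cases) on time...
X = sm.add_constant(df["day"])
fit = sm.OLS(np.log(df["cases"]), X).fit()
print(f"R-squared: {fit.rsquared:.2f}")

# ...then back-transform the predicted values to get the fitted exponential curve.
df["fitted_cases"] = np.exp(fit.predict(X))

# The slope gives the average daily growth rate, and hence the doubling time.
daily_factor = np.exp(fit.params["day"])
print(f"Average growth: {daily_factor - 1:.0%} per day")
print(f"Doubling time: {np.log(2) / np.log(daily_factor):.1f} days")
```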

As you can see, it’s a pretty good fit to an exponential curve. In fact it’s really very good indeed. The R-squared value from the regression analysis is 0.99. R-squared is a measure of how well the data fit the modelled relationship on a scale of 0 to 1, so 0.99 is a damn near perfect fit.

We can also plot it on a logarithmic scale, on which it should look like a straight line:

And indeed it does.

There are some interesting statistics we can calculate from the above analysis. The average rate of growth is about a 30% increase in the number of cases each day. That means that the number of cases doubles about every 2.6 days, and increases tenfold in about 8.6 days.

So what happens if the number of cases keeps growing at the same rate? Let’s extrapolate that line for another 6 weeks:

This looks pretty scary. If it continues at the same rate of exponential growth, we’ll get to 10,000 cases by 23 March (which is only just over a week away), to 100,000 cases by the end of March, to a million cases by 9 April, and to 10 million cases by 18 April. By 24 April the entire population of the UK (about 66 million) will be infected.
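Those dates come from simply asking how many more days of roughly 30% growth are needed to reach each threshold. Here’s a rough sketch of the calculation, taking today as 14 March with 1372 confirmed cases; it lands within a day or so of the dates above, since the extrapolation in the chart uses the fitted line rather than the single raw count:

```python
import math
from datetime import date, timedelta

today = date(2020, 3, 14)   # the day this was written
current_cases = 1372        # today's confirmed count
daily_factor = 1.3          # roughly 30% growth per day, from the fit above

for threshold in (10_000, 100_000, 1_000_000, 10_000_000, 66_000_000):
    days_needed = math.log(threshold / current_cases) / math.log(daily_factor)
    eta = today + timedelta(days=round(days_needed))
    print(f"{threshold:>11,} cases around {eta:%d %B}")
```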

Now, obviously it’s not going to continue growing at the same rate for all that time. If nothing else, it will stop growing when it runs out of people to infect. And even if the entire population have not been infected, the rate of new infections will surely slow down once enough people have been infected, because it becomes increasingly unlikely that anyone with the disease who might be able to pass it on will encounter someone who hasn’t yet had it. (I’m assuming here that people who have already had the disease will be immune to further infection, which seems likely, although we don’t yet know that for sure.)

However, that effect won’t kick in until at least several million people have been infected, a situation which we will reach by the middle of April if other factors don’t cause the rate to slow down first.

Several million people being infected is a pretty scary prospect. Even if the fatality rate is “only” about 1%, then 1% of several million is several tens of thousands of deaths.

So will the rate slow down before we get to that stage?

I genuinely don’t know. I’m not an expert in infectious disease epidemiology. I can see that the data are following a textbook exponential growth pattern so far, but I don’t know how long it will continue.

Governments in many countries are introducing drastic measures to attempt to reduce the spread of the disease.

The UK government is not.

It is not clear to me why the UK government is taking a more relaxed approach. They say that they are being guided by the science, but since they have not published the details of their scientific modelling and reasoning, it is not possible for the rest of us to judge whether their interpretation of the science is more reasonable than that of many other European countries.

Maybe the rate of infection will start to slow down now that there is so much awareness of the disease and of precautions such as hand-washing, and now that, even in the absence of government advice, many large gatherings are being cancelled.

Or maybe it won’t. We will know more over the coming weeks.

One final thought. The government’s latest advice is for people with mild forms of the disease not to seek medical help. This means that the rate of increase of the disease may well appear to slow down as measured by the official statistics, as many people with mild disease will no longer be tested and so not be counted. It will be hard to know whether the rate of infection is really slowing down.

More nonsense about vaping

A paper was published in PLoS One a few days ago by Soneji et al that made the bold claim that “e-cigarette use currently represents more population-level harm than benefit”.

That claim, for reasons we’ll come to shortly, is not remotely supported by the evidence. But this leaves me with rather mixed feelings. On the one hand, I am disappointed that such a massively flawed paper can make it through peer review. It is a useful reminder that just because a paper is published in a peer reviewed journal does not mean that it is necessarily even approximately believable.

But on the other hand, the paper was largely ignored by the British media. I find that rather encouraging. We have seen flawed studies about e-cigarettes cheerfully picked up by the media before (here’s one example, but there are plenty of others), who don’t seem too bothered about whether the research is any good or not, just that it makes a good story. Perhaps the media are starting to learn that parroting press releases, when those press releases are a load of nonsense, is not such a great idea after all.

Sure, the paper made it into two of our most dreadful and unreliable newspapers, but as far as I can tell, the story was not picked up at all by the BBC or any of the broadsheet newspapers. And that’s a good thing.

So what was wrong with the paper then?

It’s important to understand that the paper did not collect any new data. There was no survey or clinical trial or review of health records or anything like that. It was purely a mathematical modelling study based on previously published data.

Soneji et al attempted to model the benefits and harms of e-cigarettes at the population level by considering what proportion of smokers are helped to quit by e-cigarettes, thus experiencing a health benefit, and what proportion of never-smokers are encouraged to start smoking by e-cigarettes, thus experiencing harm.

Of course a mathematical model is only as good as the assumptions that go into it. The big problem with this model is that there is no evidence that e-cigarettes encourage anyone to start smoking.

Now, there have been studies that show that young people who use e-cigarettes are more likely to start smoking than young people who don’t use e-cigarettes. Soneji et al used a meta-analysis of those studies to obtain the necessary estimates of just how much more likely that was.

But there is a big problem here. The assumption in Soneji et al’s modelling paper is that the observed association between e-cigarette use and subsequent smoking initiation is causal. In other words, they assume that those people who use e-cigarettes and then go on to start smoking have started smoking because they used e-cigarettes.

A moment’s thought shows that there are other perfectly plausible explanations rather than a causal relationship. Surely it is more likely that there is confounding by personality type here. The sort of person who uses e-cigarettes is probably the type of person who is more likely to start smoking. If e-cigarettes were not available, those people who first used e-cigarettes and then subsequently started smoking would probably have started smoking anyway.

But this is to some extent guesswork. While Soneji et al can most definitely not prove that the association between e-cigarette use and subsequent smoking is causal, no-one can prove it isn’t causal from those association studies, even if another explanation is more plausible.

We can, however, look at other data to help understand what is going on. Given that e-cigarettes are now far more available than they were a few years ago, if e-cigarettes were really causing people who wouldn’t otherwise have smoked to start smoking, then you would expect to see population-level rates of smoking start to increase.

In fact, according to data from the Office for National Statistics, the opposite is happening. According to the ONS data, “Since 2010, smoking has become less common across all age groups in the UK, with the most pronounced decrease observed among those aged 18 to 24 years”.

Now, of course we can’t say that that decrease in smoking prevalence is because of e-cigarettes, but it does seem to argue strongly against the hypothesis that e-cigarettes are encouraging young people to start smoking on a grand scale.

And if you believe Soneji et al’s claims, people would be starting smoking on a grand scale. Prof Peter Hajek, quoted by the Science Media Centre, has calculated what Soneji et al’s claims would mean for the UK if they were true:

“This new ‘finding’ is based on the bizarre assumption that for every one smoker who uses e-cigs to quit, 80 non-smokers will try e-cigs and take up smoking. It flies in the face of available evidence but it is also mathematically impossible. In the UK alone, 1.5 million smokers have quit smoking with the help of e-cigarettes. The ‘modelling’ in this paper assumes that we also have 120 million young people who became smokers.”

I think we can all see that having 120 million young people who are smokers among the UK population doesn’t make a whole lot of sense. Why could the peer-reviewers of the paper not see that?

Lessons must be learned. It must never happen again.

Now that multiple accusations of rape and other serious sexual offences have been made against Harvey Weinstein, everyone agrees that what happened is terrible, that lessons must be learned, and that it must never happen again.

A few weeks ago, when Grenfell Tower burned down in London, with the loss of dozens of lives, everyone agreed that it was terrible, that lessons must be learned, and that it must never happen again.

When it turned out that British journalists had been hacking phones on a grand scale, including the phone of a dead schoolgirl, everyone agreed that it was terrible, that lessons must be learned, and that it must never happen again.

When it became clear that Jimmy Savile had been a prolific sexual abuser, everyone agreed that it was terrible, that lessons must be learned, and that it must never happen again.

When the banking system collapsed in 2008, causing immense damage to the wider economy, everyone agreed that it was terrible, that lessons must be learned, and that it must never happen again.

It seems to me that the lesson from all these things, and more, is clear. When people are in a position of power, sometimes they will abuse that power. And because they are in a position of power, they will probably get away with it.

This will happen again. People in a position of power are the ones who make the rules, and it doesn’t seem likely that they will change the rules to make it easier to hold powerful people to account.

I suppose it could happen, in a democracy such as the UK, if voters insist that their politicians prioritise holding the powerful to account. Sadly, I can’t see that happening. Most people prioritise other things when they go to the ballot box.

So unless that changes, all these things, and similar, will happen again.


Do 41% of middle aged adults really walk for less than 10 minutes each month?

I was a little surprised to hear on the radio this morning that a new study had been published, allegedly showing that millions of middle aged adults are so inactive that they don’t even walk for 10 minutes each month. The story has been widely covered in the media, for example here, here, and here.

The specific claim is that 41% of adults aged 40 to 60 in England, or about 6 million people, do not walk for 10 minutes in one go at a brisk pace at least once a month, based on a survey by Public Health England (PHE). I tracked down the source of this claim to this report on the PHE website.

I found that hard to believe. Walking for just 10 minutes a month is a pretty low bar. Can it really be true that 41% of middle aged adults don’t even manage that much?

Well, if it is, which I seriously doubt, then the statistic is at best highly misleading. The same survey tells us that less than 20% of the same sample of adults were physically inactive, where being physically inactive is defined as “participating in less than 30 minutes of moderate intensity physical activity per week”. Here is the table from the report about physical activity:

So we have about 6 million people doing less than 10 minutes of walking per month, but only 3 million people doing less than 30 minutes of moderate intensity physical activity per week. So somehow, there must be at least 3 million people who are doing at least 30 minutes of physical activity per week while simultaneously walking for less than 10 minutes per month.
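To make that arithmetic explicit, here’s a quick sketch using the rounded figures quoted in the coverage:

```python
# Rough arithmetic behind the apparent contradiction (rounded figures as quoted).
adults_40_to_60 = 6_000_000 / 0.41      # about 14.6 million, implied by "41% = about 6 million"
rarely_walk = 0.41 * adults_40_to_60    # walk briskly for less than 10 minutes per month
inactive = 0.20 * adults_40_to_60       # do less than 30 minutes of moderate activity per week (at most)

# Even if every 'inactive' person is also one of the non-walkers, that still leaves:
active_but_rarely_walking = rarely_walk - inactive
print(f"At least {active_but_rarely_walking / 1e6:.0f} million people doing 30+ minutes of activity "
      f"a week while walking briskly for less than 10 minutes a month")
```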

I suppose that’s possible. Maybe those people cycle a lot, or perhaps drive to the gym and have a good old workout and then drive home again. But it seems unlikely.

And even if it’s true, the headline figure that 41% of middle aged adults are doing so little exercise that they don’t even manage 10 minutes of walking a month is grossly misleading. Because in fact over 80% of middle aged adults are exercising for at least 30 minutes per week.

I notice that the report on the PHE website doesn’t link to the precise questions asked in the survey. I am always sceptical of any survey results that aren’t accompanied by a detailed description of the survey methods, including specifying the precise questions asked, and this example only serves to remind me of the importance of maintaining that scepticism.

The news coverage focuses on the “41% walk for less than 10 minutes per month” figure and not on the far less alarming finding that fewer than 20% do less than 30 minutes of exercise per week. The 41% figure is also presented first on the PHE website, and I’m guessing, given the similarity of stories in the media, that that was the figure they emphasised in their press release.

I find it disappointing that a body like PHE is prioritising newsworthiness over honest science.

Brexit voting and education

This post was inspired by an article on the BBC website by Martin Rosenbaum, which presented data on a localised breakdown of EU referendum voting figures, and a subsequent discussion of those results in a Facebook group. In that discussion, I observed that the negative correlation between the percentage of graduates in an electoral ward and the leave vote in that ward was remarkable, and much stronger than any correlation you normally see in the social sciences. My friend Barry observed that age was also correlated with voting leave, and that age was likely to be correlated with the percentage of graduates, and questioned whether the percentage of graduates was really an independent predictor, or whether a high percentage of graduates was more a marker for a young population.

The BBC article, fascinating though it is, didn’t really present its findings in enough detail to be able to answer that question. Happily, Rosenbaum made his raw data on voting results available, and data on age and education are readily downloadable from the Nomis website, so I was able to run the analysis myself to investigate.

To start with, I ran the same analyses as described in Rosenbaum’s article, and I’m happy to say I got the same results. Here is the correlation between voting leave and the percentage of graduates, together with a best-fit regression line:

For age, I found that adding a quadratic term improved the regression model, so the relationship between age and voting leave is curved, increasing with age at first but tailing off in the oldest age groups:
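In case you’re wondering what fitting that kind of quadratic model looks like in practice, here’s a minimal sketch in Python (the column names leave_pct and median_age are hypothetical stand-ins for the ward-level data, and this isn’t the exact code used for the analysis):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical ward-level file with one row per ward.
wards = pd.read_csv("ward_data.csv")

linear_fit = smf.ols("leave_pct ~ median_age", data=wards).fit()
quadratic_fit = smf.ols("leave_pct ~ median_age + I(median_age ** 2)", data=wards).fit()

# A lower AIC (or a clearly significant squared term) is evidence that the curved model fits better.
print(f"AIC, linear: {linear_fit.aic:.1f}   AIC, quadratic: {quadratic_fit.aic:.1f}")
print(quadratic_fit.params)
```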

Rosenbaum also looked at the relationship with ethnicity, so I did too. Here I plot the percentage voting leave against the percentage of people in each ward identifying as white. Again, I found the model was improved by a quadratic term, showing that the relationship is non-linear. This fits with what Rosenbaum said in his article, namely that although populations with more white people were mostly more likely to vote leave, that relationship breaks down in populations with particularly high numbers of ethnic minorities:

It’s interesting to note that the fitted curve reaches its minimum leave vote at a white population of a little over 40%. I suspect that the important thing here is not so much the proportion of white people as how diverse a population is. Once the proportion of white people becomes very low, the population is perhaps just as lacking in diversity as populations where the proportion of white people is very high.

Anyway, the question I was interested in at the start was whether the percentage of graduates was an independent predictor of voting, even after taking account of age.

The short answer is yes, it is.

Let’s start by looking at it graphically. If we start with our regression model looking at the relationship between voting and age, we can calculate a residual for each data point, which is the difference between the data point in question and the line of best fit. We can then plot those residuals against the percentage of graduates. What we are now plotting is the voting patterns adjusted for age. So if we see a relationship with the percent of graduates, then we know that it’s still an independent predictor after adjusting for age.
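Here’s a minimal sketch of that residual trick in Python (again with hypothetical column names, and not the exact code used for the analysis):

```python
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Hypothetical ward-level data with columns 'leave_pct', 'median_age' and 'graduates_pct'.
wards = pd.read_csv("ward_data.csv")

# Step 1: model the leave vote as a function of age (including the quadratic term from earlier).
age_fit = smf.ols("leave_pct ~ median_age + I(median_age ** 2)", data=wards).fit()

# Step 2: the residuals are the leave vote with the age relationship stripped out.
wards["leave_resid"] = age_fit.resid

# Step 3: plot those age-adjusted residuals against the percentage of graduates.
plt.scatter(wards["graduates_pct"], wards["leave_resid"], s=5)
plt.xlabel("% graduates in ward")
plt.ylabel("Leave vote (residual after adjusting for age)")
plt.show()
```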

This is what we get if we do that:

As you can see, it’s still a very strong relationship, so we can conclude that the percentage of graduates is a good predictor of voting, even after taking account of age.

What if we take account of both age and ethnicity? Here’s what we get if we do the same analysis but with the residuals from an analysis of both age and ethnicity:

Again, the relationship still seems very strong, so the percentage of graduates really does seem to be a robust independent predictor of voting.

For the more statistically minded, another way of looking at this is to examine the regression coefficient for the percentage of graduates alone, or after adjusting for age and ethnicity (in all cases with the % voting leave as the dependent variable). Here is what we get:

Model                            Regression coefficient      t    P value
Education alone                                   -0.97  -45.9    < 0.001
Education and age                                 -0.90  -52.5    < 0.001
Education and ethnicity                           -0.91  -55.0    < 0.001
Education, age, and ethnicity                     -0.89  -53.9    < 0.001

So although the regression coefficient does get slightly smaller after adjusting for age and ethnicity, it doesn’t get much smaller, and remains highly statistically significant.
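If you want to see where a table like that comes from, the adjusted coefficients drop straight out of multivariable regressions; here’s a sketch (hypothetical column names again, and not the exact models used):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical ward-level data; 'white_pct' is the percentage identifying as white.
wards = pd.read_csv("ward_data.csv")

models = {
    "Education alone": "leave_pct ~ graduates_pct",
    "Education and age": "leave_pct ~ graduates_pct + median_age + I(median_age ** 2)",
    "Education and ethnicity": "leave_pct ~ graduates_pct + white_pct + I(white_pct ** 2)",
    "Education, age, and ethnicity": (
        "leave_pct ~ graduates_pct + median_age + I(median_age ** 2)"
        " + white_pct + I(white_pct ** 2)"
    ),
}

for name, formula in models.items():
    fit = smf.ols(formula, data=wards).fit()
    print(f"{name:32s} coef={fit.params['graduates_pct']:6.2f}  "
          f"t={fit.tvalues['graduates_pct']:6.1f}  p={fit.pvalues['graduates_pct']:.3g}")
```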

What if we turn this on its head and ask whether age is still an important predictor after adjusting for education?

Here is a graph of the residuals from the analysis of voting and education, plotted against age:

There is still a clear relationship, though perhaps not quite as strong as before. And what if we look at the residuals adjusted for both education and ethnicity, plotted against age?

The relationship seems to be flattening out, so maybe age isn’t such a strong independent predictor once we take account of education and ethnicity (it turns out that areas with a higher proportion of white people also tend to be older).

For the statistically minded, here are what the regression coefficients look like (for ease of interpretation, I’m not using a quadratic term for age here and only looking at the linear relationship with age).

Model                            Regression coefficient      t    P value
Age alone                                          1.66   17.2    < 0.001
Age and education                                  1.28   25.3    < 0.001
Age and ethnicity                                  0.71   5.95    < 0.001
Age, education, and ethnicity                      0.82   13.5    < 0.001

Here the adjusted regression coefficient is considerably smaller than the unadjusted one, showing that the initially strong looking relationship with age isn’t quite as strong as it seems once we take account of education and ethnicity.

So after all this I think it is safe to conclude that education is a remarkably strong predictor of voting outcome in the EU referendum, and that that relationship is not much affected by age or ethnicity. On the other hand, the relationship between age and voting outcome, while still certainly strong and statistically significant, is not quite as strong as it first appears before education and ethnicity are taken into account.

One important caveat with all these analyses of course is that they are based on aggregate data for electoral wards rather than individual data, so they may be subject to the ecological fallacy. We know that wards with a high percentage of graduates are more likely to have voted remain, but we don’t know whether individuals with degrees are more likely to have voted remain. It seems reasonably likely that that would also be true, but we can’t conclude it with certainty from the data here.

Another caveat is that data were not available from all electoral wards, and the analysis above is based on a subset of 1070 wards in England only (there are 8750 wards in England and Wales). However, the average percent voting leave in the sample analysed here was 52%, so it seems that it is probably broadly representative of the national picture.

All of this of course raises the question of why wards with a higher proportion of graduates were less likely to vote leave, but that’s probably a question for another day, unless you want to have a go at answering it in the comments.

Update 12 February 2017:

Since I posted this yesterday, I have done some further analysis, this time looking at the effect of socioeconomic classification. This classifies people according to the socioeconomic status (SES) of the job they do, ranging from 1 (higher managerial and professional occupations) to 8 (long term unemployed).

I thought it would be interesting to see the extent to which education was a marker for socioeconomic status. Perhaps it’s not really having a degree-level education that predicts voting remain, but rather being in a higher socioeconomic group?

To get a single number I could use for socioeconomic status, I calculated the percentage of people in each ward in categories 1 and 2 (the highest status categories). (I also repeated the analysis calculating the average status for each ward, and the conclusions were essentially the same, so I’m not presenting those results here.)

The relationship between socioeconomic status and voting leave looks like this:

This shouldn’t come as a surprise. Wards with more people in higher SES groups were less likely to vote leave. That fits with what you would expect from the education data: wards with more people with higher SES are probably also those with more graduates.

However, if we look at the multivariable analyses, this is where it starts to get interesting.

Let’s look at the residuals from the analysis of education plotted against SES. This shows the relationship between voting leave and SES after adjusting for education.

You’ll note that the slope of the best-fit regression line is now going the other way: it now slopes upwards instead of downwards. This tells us that, for wards with identical proportions of graduates, the ones with higher SES are now more likely to vote leave.

So what we are seeing here really is a correlation between education itself and voting behaviour, rather than education simply standing in for SES. Other things (ie education) being equal, wards with a higher proportion of people in high SES categories were actually more likely to vote leave.

For the statistically minded, here are the regression coefficients for the effect of socioeconomic status on voting leave:

Model                                Regression coefficient      t    P value
SES alone                                             -0.58  -20.6    < 0.001
SES and education                                      0.81   26.5    < 0.001
SES, education, and ethnicity                          0.49   12.4    < 0.001
SES, education, age, and ethnicity                     0.31    6.5    < 0.001

Note how the sign of the regression coefficient reverses in the adjusted analyses, consistent with the slope in the graph changing from downward sloping to upward sloping.

And what happens to the regression coefficients for education once we adjust for SES?

Model                                Regression coefficient      t    P value
Education alone                                       -0.97  -45.9    < 0.001
Education and SES                                     -1.75  -51.9    < 0.001
Education, SES, age, and ethnicity                    -1.20  -23.4    < 0.001

Here the relationship between education and voting remain becomes even stronger after adjusting for SES. This shows us that it really is education that is correlated with voting behaviour, and it’s not simply a marker for higher SES. In fact once you adjust for education, higher SES predicts a greater likelihood of voting leave.

To be honest, I’m not sure these results are what I expected to see. I think it’s worth reiterating the caveat above about the ecological fallacy. We do not know whether individuals of higher socioeconomic status are more likely to vote leave after adjusting for education. All we can say is that electoral wards with a higher proportion of people of high SES are more likely to vote leave after adjusting for the proportion of people in that ward with degree level education.

But with those caveats in mind, it certainly seems as if it is a more educated population first and foremost which predicts a higher remain vote, and not a population of higher socioeconomic status.

Do you believe in dinosaurs?

One of my earliest memories is from when I was at primary school. I must have been about 5 years old at the time, and I had just heard about dinosaurs. I can’t remember how I heard about them. Perhaps my parents had given me a book about them. That’s probably the sort of thing that parents do for 5-year-olds, right?

Anyway, I was fascinated by the whole idea (as I expect most kids of that age are), and at school I asked my teacher “Do you believe in dinosaurs?”

The teacher was smart enough to spot that I was asking a question with some rather poor assumptions behind it, and helpfully and patiently explained to me why it’s not really a question of belief. Dinosaurs, she explained, were an established fact, as seen from abundant evidence from the fossil record. Belief didn’t come into it: dinosaurs existed.

I understood what my teacher explained to me, and learned an important lesson that day. Some things are not about belief: they are about facts. In fact looking back on this with the benefit of 40-odd years of hindsight, I think perhaps that lesson was the single most important thing I ever learned at school (yes, even more important than that thing about ox-bow lakes). It’s a shame I can’t remember the name of the teacher, because I’d really like to thank her.

But it’s even more of a shame that so many people who don’t believe in global warming or who do believe in homeopathy or similar didn’t have such a good primary school teacher as I had.

Consequences of dishonest advertising

As I was travelling on the London Underground the other day, I saw an advert that caught my eye.

Please note: if you are a journalist from the Daily Mirror and would like to use this photo in a story, it would be appreciated if you would ask permission first rather than just stealing it like you did with the last photo I posted of a dodgy advert.

That was a surprising claim, I thought. Just wipe a magic potion across your brow, and you get fast, effective relief from a headache.

So I had a look to see what the medical literature had to say about it. Here is what a PubMed search for 4head or its active ingredient levomenthol turned up:

A Google Scholar search similarly failed to find a shred of evidence that the product has any effect whatever on headaches. So I have reported the advert to the ASA. It will be interesting to see if the manufacturer has any evidence to back up their claim. I suppose they might, but they are keeping it pretty well hidden if they do.

But it occurred to me that something is very wrong with the way advertising regulation works. If the advert is indeed making claims which turn out to be completely unsubstantiated, the manufacturer can do that with no adverse consequences whatever. False advertising is effectively legalised lying.

When I last reported a misleading advert to the ASA, the ASA did eventually rule that it was misleading and asked the advertiser to withdraw it. But it took almost a year from my report to the ruling, giving the advertiser completely free rein to continue telling lies in the meantime.

In a just society, there might be some penalty for misleading the public like that. But there isn’t. The only sanction is being asked to take the advert down. As long as you comply (and with very rare exceptions, even if you don’t), there are no fines or penalties of any sort.

So where is the incentive for advertisers to be truthful? Most dishonest adverts probably don’t get reported, and even if they are reported, the ASA might be prepared to be generous to the advertiser and not find against them anyway. Advertisers know that they can be dishonest with no adverse consequences.

I would like to suggest a new way of regulation for adverts. Every company that advertises would need to nominate an advertising compliance officer, probably a member of the board of directors. That person would need to sign off every advert that the company uses. If an advert is found to be dishonest, that would be a criminal offence, and the advertising compliance officer would be personally liable, facing a criminal record and a substantial fine. The company would be fined as well.

We criminalise other forms of taking money by fraud. Why does fraudulent advertising have to be different?

The Trials Tracker and post-truth politics

The All Trials campaign was founded in 2013 with the stated aim of ensuring that all clinical trials are disclosed in the public domain. This is, of course, an entirely worthy aim. There is no doubt that sponsors of clinical trials have an ethical responsibility to make sure that the results of their trials are made public.

However, as I have written before, I am not impressed by the way the All Trials campaign misuses statistics in pursuit of its aims. Specifically, the statistic they keep promoting, “about half of all clinical trials are unpublished”, is simply not evidence based. Most recent studies show that the extent of trials that are undisclosed is more like 20% than 50%.

The latest initiative by the All Trials campaign is the Trials Tracker. This is an automated tool that looks at all trials registered on clinicaltrials.gov since 2006 and determines, using an automated algorithm, which of them have been disclosed. They found 45% were undisclosed (27% of industry-sponsored trials and 54% of non-industry trials). So, surely this is evidence to support the All Trials claim that about half of trials are undisclosed, right?

Wrong.

In fact it looks like the true figure for undisclosed trials is not 45%, but at most 21%. Let me explain.

The problem is that an automated algorithm is not very good at determining whether trials are disclosed or not. The algorithm can tell if results have been posted on clinicaltrials.gov, and also searches PubMed for publications with a matching clinicaltrials.gov ID number. You can probably see the flaw in this already. There are many ways that results could be disclosed that would not be picked up by that algorithm.

Many pharmaceutical companies make results of clinical trials available on their own websites. The algorithm would not pick that up. Also, although journal publications of clinical trials should ideally make sure they are indexed by the clinicaltrials.gov ID number, in practice that system is imperfect. So the automated algorithm misses many journal articles that aren’t indexed correctly with their ID number.

So how bad is the algorithm?

The sponsor with the greatest number of unreported trials, according to the algorithm, is Sanofi. I started by downloading the raw data, picked the first 10 trials sponsored by Sanofi that were supposedly “undisclosed”, and tried searching for results manually.

As an aside, the Trials Tracker team get 7/10 for transparency. They make their raw data available for download, which is great, but they don’t disclose their metadata (descriptions of what each variable in the dataset represents), so it was rather hard work figuring out how to use the data. But I think I figured it out in the end, as after trying a few combinations of interpretations I was able to replicate their published results exactly.

Anyway, of those 10 “undisclosed” trials by Sanofi, 8 of them were reported on Sanofi’s own website, and one of the remaining 2 was published in a journal. So in fact only 1 of the 10 was actually undisclosed. I posted this information in a comment on the journal article in which the Trials Tracker is described, and it prompted another reader, Tamas Ferenci, to investigate the Sanofi trials more systematically. He found that 227 of the 285 Sanofi trials (80%) listed as undisclosed by Trials Tracker were in fact published on Sanofi’s website. He then went on to look at “undisclosed” trials sponsored by AstraZeneca, and found that 38 of the 68 supposedly undisclosed trials (56%) were actually published on AstraZeneca’s website. Ferenci’s search only looked at company websites, so it’s possible that more of the trials were reported in journal articles.

The above analyses only looked at a couple of sponsors, and we don’t know if they are representative. So to investigate more systematically the extent to which the Trials Tracker algorithm underestimates disclosure, I searched for results manually for 100 trials: a random selection of 50 industry trials and a random selection of 50 non-industry trials.

I found that 54% (95% confidence interval 40-68%) of industry trials and 52% (95% CI 38-66%) of non-industry trials that had been classified as undisclosed by Trials Tracker were available in the public domain. This might be an underestimate, as my search was not especially thorough. I searched Google, Google Scholar, and PubMed, and if I couldn’t find any results in a few minutes then I gave up. A more systematic search might have found more articles.
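Those confidence intervals are just the usual normal-approximation intervals for a proportion. For anyone who wants to check them, here’s a quick sketch using the counts implied by the percentages above (27 of 50 industry trials and 26 of 50 non-industry trials found to be disclosed):

```python
from statsmodels.stats.proportion import proportion_confint

# Trials classed as 'undisclosed' by the algorithm that a manual search found to be disclosed.
for label, found, n in [("Industry", 27, 50), ("Non-industry", 26, 50)]:
    low, high = proportion_confint(found, n, alpha=0.05, method="normal")
    print(f"{label}: {found / n:.0%} actually disclosed (95% CI {low:.0%} to {high:.0%})")
```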

If you’d like to check the results yourself, my findings are in a csv file here. This follows the same structure as the original dataset (I’d love to be able to give you the metadata for that, but as mentioned above, I can’t), but with the addition of 3 variables at the end. “Disclosed” specifies whether the trial was disclosed, and if so, how (journal, company website, etc). It’s possible that trials were disclosed in more than one place, but once I’d found a trial in one place I stopped searching. “Link” is a link to the results if available, and “Comment” is any other information that struck me as relevant, such as whether a trial was terminated prematurely or was of a product which has since been discontinued.

Putting these figures together with the Trials Tracker main results, this suggests that only 12% of industry trials and 26% of non-industry trials are undisclosed, or 21% overall (34% of the trials were sponsored by industry). And given the rough and ready nature of my search strategy, this is probably an upper bound for the proportion of undisclosed trials. A far cry from “about half”, and in fact broadly consistent with the recent studies showing that about 80% of trials are disclosed. It’s also worth noting that industry are clearly doing better at disclosure than academia. Much of the narrative that the All Trials campaign has encouraged is of the form “evil secretive Big Pharma deliberately withholding their results”. The data don’t seem to support this. It seems far more likely that trials are undisclosed simply because triallists lack the resources to write them up for publication. Research in industry is generally better funded than research in academia, and my guess is that the better funding explains why industry do better at disclosing their results. I and some colleagues have previously suggested that one way to increase trial disclosure rates would be to ensure that funders of research ringfence a part of their budget specifically for the costs of publication.
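For anyone who wants to check how those headline numbers combine, here’s the arithmetic, using the figures quoted above:

```python
# Trials Tracker's claimed undisclosed rates, and the proportion of those 'undisclosed'
# trials that a manual search found to be disclosed after all.
tracker_undisclosed = {"industry": 0.27, "non-industry": 0.54}
actually_disclosed = {"industry": 0.54, "non-industry": 0.52}
industry_share = 0.34  # share of the registered trials sponsored by industry

truly_undisclosed = {
    group: tracker_undisclosed[group] * (1 - actually_disclosed[group])
    for group in tracker_undisclosed
}
overall = (industry_share * truly_undisclosed["industry"]
           + (1 - industry_share) * truly_undisclosed["non-industry"])

print({group: f"{rate:.0%}" for group, rate in truly_undisclosed.items()})  # ~12% and ~26%
print(f"Overall: {overall:.0%}")                                            # ~21%
```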

There are some interesting features of the 23 out of the 50 industry-sponsored trials that really did seem to be undisclosed. 9 of them were not trials of a drug intervention. Of the 14 undisclosed drug trials, 4 were of products that had been discontinued and a further 3 had sample sizes less than 12 subjects, so none of those 7 studies are likely to be relevant to clinical practice. It seems that undisclosed industry-sponsored drug trials of relevance to clinical practice are very rare indeed.

The Trials Tracker team would no doubt respond by saying that the trials missed by their algorithm have been badly indexed, which is bad in itself. And they would be right about that. Trial sponsors should update clinicaltrials.gov with their results. They should also make sure that the clinicaltrials.gov ID number is included in the publication (although in several cases of published trials that were missed by the algorithm, the ID number was in fact included in the abstract of the paper, so this seems to be a fault of Medline indexing rather than any fault of the triallists).

However, the claim made by the Trials Tracker is not that trials are badly indexed. If they stuck to making only that claim, then the Trials Tracker would be a perfectly worthy and admirable project. But the problem is they go beyond that, and claim something which their data simply do not show. Their claim is that the trials are undisclosed. This is just wrong. It is another example of what seems to be all the rage these days, namely “post-truth politics”. It is no different from when the Brexit campaign said “We spend £350 million a week on the EU and could spend it on the NHS instead” or when Donald Trump said, well, pretty much every time his lips moved really.

Welcome to the post-truth world.


Evidence-based house moving

I live in London. I didn’t really intend to live in London. But I got a job here that seemed to suit me, so I thought maybe it would be OK to live here for a couple of years and then move on.

That was in 1994. Various life events intervened and I sort of got stuck here. But now I’m in the fortunate position where my job is home-based, and my partner also works from home, so we could pretty much live anywhere. So finally, moving out of London is very much on the agenda.

But where should we move to? The main intention is “somewhere more rural than London”, which you will appreciate doesn’t really narrow it down very much. Many people move to a specific location for a convenient commute to work, but we have no such constraints, so we need some other way of deciding.

So I decided to do what all good statisticians do, and use data to come up with the answer.

There is a phenomenal amount of data that can be freely downloaded from the internet these days about various attributes of small geographic areas.

House prices are obviously one of the big considerations. You can download data from the Land Registry on every single residential property transaction going back many years. This needs a bit of work before it becomes usable, but it’s nothing a 3-level mixed effects model with random coefficients at both middle-layer super output area and local authority area can’t sort out (the model actually took about 2 days to run: it’s quite a big dataset).
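Describing that model fully is beyond the scope of this post, but for the curious, here is a much-simplified sketch of the general idea in Python: random intercepts and time slopes for local authorities, plus random intercepts for the output areas nested within them, rather than the full three-level random-coefficients model mentioned above. The file and column names are hypothetical, and this isn’t the model actually used.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical extract of Land Registry price-paid data with columns
# 'price', 'sale_date', 'msoa' (middle-layer super output area) and 'local_authority'.
sales = pd.read_csv("price_paid.csv", parse_dates=["sale_date"])
sales["log_price"] = np.log(sales["price"])
sales["years"] = (sales["sale_date"] - sales["sale_date"].min()).dt.days / 365.25

model = smf.mixedlm(
    "log_price ~ years",                 # overall price trend over time
    data=sales,
    groups="local_authority",            # random intercept and slope per local authority...
    re_formula="~years",
    vc_formula={"msoa": "0 + C(msoa)"},  # ...plus a random intercept per MSOA within it
)
result = model.fit()
print(result.summary())
```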

Although I don’t have to commute to work every day, I’m not completely free of geographic constraints. I travel for work quite a bit, so I don’t want to be too far away from the nearest international airport. My parents, who are not as young as they used to be, live in Sussex, and I don’t want to be too many hours’ drive away from them. My partner also has family in the southeast of England and would like to remain in easy visiting distance. And we both love going on holiday to the Lake District, so somewhere closer to there would be nice (which is of course not all that easy to reconcile with being close to Sussex).

Fortunately, you can download extensive data on journey times from many bits of the country to many other bits, so that can be easily added to the data.

We’d like to live somewhere more rural than London, but don’t want to be absolutely in the middle of nowhere. Somewhere with a few shops and a couple of takeaways and pubs would be good. So I also downloaded data on population density. I figured about 2500 people/square km would be a good compromise between escaping to somewhere more rural and not being in the middle of nowhere, and gave areas more points the closer they came to that ideal.

I’d like to have a big garden, so we also give points to places that have a high ratio of garden space to house space, which can easily be calculated from land use data. Plenty of green space in the area would also be welcome, and we can calculate that from the same dataset.

One of the problems with choosing places with low house prices is that they might turn out to be rather run-down and unpleasant places to live. So I’ve also downloaded data on crime rates and deprivation indices, so that run-down and crime-ridden areas can be penalised.

In addition to all that, I also found data on flood risk, political leanings, education levels, and life satisfaction, which I figured are probably also relevant.

I dare say there are probably other things that could be downloaded and taken into account, though that’s all I can think of for now. Suggestions for other things are very welcome via the comments below.

I then calculate a score for each of those things for each middle-layer super output area (an area of approximately 7000 people), weight each of those things by how important I think it is, and take a weighted average. Anything that scores too badly on an item I figured was important (this was just house prices and distance to my parents) automatically gets a score of zero.

The result is a database of a score for every middle-layer super output area in England and Wales (I figured Scotland was just too far away from Sussex), which I then mapped using the wonderful QGIS mapping software.

The results are actually quite sensitive to the weightings applied to each attribute, so I allowed some of the weightings to vary over reasonable ranges, and then picked the areas that consistently performed well.
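For the curious, here’s a rough sketch of what that scoring and weight-jiggling logic might look like; the weights, cut-offs, and column names are purely illustrative, not the ones I actually used:

```python
import numpy as np
import pandas as pd

# Hypothetical file with one row per middle-layer super output area and one column per
# attribute, each already rescaled so that higher = better (0 to 1).
areas = pd.read_csv("msoa_scores.csv", index_col="msoa")

weights = {  # subjective importance weightings (illustrative values only)
    "house_prices": 3.0, "parents_drive_time": 3.0, "airport_time": 1.5,
    "density_match": 1.0, "garden_ratio": 1.0, "green_space": 1.0,
    "crime": 1.5, "deprivation": 1.5, "flood_risk": 0.5,
    "politics": 0.5, "education": 0.5, "life_satisfaction": 0.5,
}
hard_constraints = ["house_prices", "parents_drive_time"]  # score zero if these are too bad

def score(areas, weights):
    w = pd.Series(weights)
    total = (areas[w.index] * w).sum(axis=1) / w.sum()     # weighted average of the attributes
    too_bad = (areas[hard_constraints] < 0.2).any(axis=1)  # arbitrary cut-off for "too badly"
    return total.where(~too_bad, 0.0)

# Sensitivity check: jiggle the weights and keep the areas that do well under all of them.
rng = np.random.default_rng(0)
ranks = []
for _ in range(200):
    jiggled = {k: v * rng.uniform(0.5, 1.5) for k, v in weights.items()}
    ranks.append(score(areas, jiggled).rank(ascending=False))
worst_rank = pd.concat(ranks, axis=1).max(axis=1)
print(worst_rank.sort_values().head(10))  # areas that are never ranked badly, whatever the weights
```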

The final map looks like this:


Red areas are those with low scores, green areas are those with high scores.

Not surprisingly, setting a constraint on house prices ruled out almost all of the southeast of England. Setting a constraint on travelling time to visit my parents ruled out most of the north of England. What is left is mainly a little band around the midlands.

And which is the best place to live, taking all that into account? Turns out that it’s Stafford. I’ve actually never been to Stafford. I wonder if it’s a nice place to live? I suppose I should go and visit it sometime and see how well my model did.