Germany’s Vaccination Backlog

Quite often we hear the lament in the news: “if only we had enough vaccine!”. In principle that is true, but it is more the theoretical claim that if we had 170 million doses, everybody in Germany could get the two shots … The fact is that, being Germans and doing everything as thoroughly as possible – or even more so – the process isn’t that fast, and we actually have millions of unused doses. There is certainly some need to keep a certain backlog to avoid shortages when one company or another does not deliver in time, but the current backlog is enough to cover roughly three weeks of vaccinations without any newly delivered doses.

Looking at the different German federal states, we see quite a difference in the efficiency of the vaccination process. Given the data as of March 13th, North Rhine-Westphalia has a backlog of more than 36%, whereas Bavaria and Bremen (structurally quite different) are at about 20%.
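The backlog share behind these percentages is just unused doses over delivered doses. Here is a small Python sketch; the state-level figures below are made-up illustrations, not the official March 13th numbers:

```python
def backlog_share(delivered, administered):
    """Fraction of delivered doses that is still sitting unused."""
    return (delivered - administered) / delivered

# Hypothetical figures (delivered, administered) per state:
states = {
    "North Rhine-Westphalia": (2_000_000, 1_270_000),
    "Bavaria":                (1_500_000, 1_200_000),
    "Bremen":                 (  100_000,    80_000),
}

for state, (delivered, administered) in states.items():
    print(f"{state}: {backlog_share(delivered, administered):.0%} backlog")
```

With real delivery and vaccination counts per state, the same three lines reproduce the ranking discussed above.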

Here is the data visualized:


Time Is Up!

As there are still some blokes around who do not get how critical the Corona pandemic situation in Germany is, I want to boil it down to one number:

>90 days

left until all intensive care beds in the whole of Germany are occupied.
(As of April 26th, extrapolating the 7-day average growth rate yields June 28th as the day when all intensive care beds will be occupied.)
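The extrapolation behind that date is simple exponential growth. A sketch in Python; the occupancy, capacity and growth figures are hypothetical placeholders, not the actual April 26th DIVI numbers:

```python
import math

def days_until_full(occupied, capacity, daily_growth):
    """Days until exponential growth at rate `daily_growth`
    (e.g. 0.025 for +2.5% per day) fills `capacity` beds,
    starting from `occupied` beds: occupied * (1+g)^d = capacity."""
    return math.log(capacity / occupied) / math.log(1 + daily_growth)

# Hypothetical: 5,000 COVID ICU patients, 24,000 beds, +2.5% per day
print(round(days_until_full(5_000, 24_000, 0.025)), "days left")
```

Adding the result to today's date gives the projected day of full occupancy.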

Data are taken from Germany’s central intensive care register (DIVI).

Corona Crisis: The numbers that really matter

2021-02-14: ECLS turnaround
2021-02-07: Rate now slowly increasing to 1.5–2.0%
2021-01-31: Decline in Covid-19 beds still only at a -1% rate
2021-01-24: Despite fewer new infections, intensive care still high
2021-01-17: Still waiting for turn around in death figures
2021-01-10: The turning point! Proceed with fingers crossed!
2021-01-03: For the first time more than 50% of ECLS beds are used
2020-12-27: Almost 70% of all beds are now COVID-19 cases
2020-12-20: Hospitals start to switch to a “Corona Only” strategy
2020-12-13: New strict lockdown starting 2020-12-16

Early on, when the corona pandemic started in China, Johns Hopkins University (JHU) started building a dashboard to monitor the number of infected cases and the number of patients who died from the disease.

The numbers are certainly collected with great care, but as soon as the virus spread outside mainland China, we saw drastic differences between the number of infected and the number of dead: as of 30.3.2020, 20:00, in Germany 560 out of 63,929 infected had died, a rate of 0.876%, whereas in Italy 11,591 dead were counted out of 97,689, a rate of 11.87%.
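The arithmetic behind these naive case-fatality rates is worth making explicit, since it uses nothing but the two dashboard counts:

```python
def cfr(deaths, confirmed):
    """Naive case-fatality rate in percent: deaths per confirmed case."""
    return deaths / confirmed * 100

# Figures as of 30.3.2020, 20:00 (from the JHU dashboard):
print(f"Germany: {cfr(560, 63_929):.3f}%")    # ~0.876%
print(f"Italy:   {cfr(11_591, 97_689):.2f}%")  # ~11.87%
```

The thirteen-fold gap between two comparable countries is exactly why the next paragraph argues that confirmed-case counts are only a proxy.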

What is obvious is that neither is the medical treatment so much worse in Italy, nor is the age structure so different from Germany’s. As testing procedures and testing rates may vary vastly between countries, the number of affected people is only a rough proxy for what the problem really is.

Sad as it is, we will see people dying from the disease, but the catastrophe really starts when the medical system is overwhelmed by people who need intensive care, and doctors need to triage who will get treatment and who will be “left alone dying”, as dramatically happened in Italy and Spain.

Germany has a register of Intensive and Emergency Care (DIVI), which shows the availability of intensive care beds, ECLS (Extracorporeal Life Support) capabilities and the number of currently ventilated corona patients.

Monitoring these figures allows one to really judge to what extent the medical system is still able to manage the corona crisis. Here are the figures:




As these figures emerge over time, we will see how well the German medical system can cope with the crisis and whether or not the strict lock-down may be relaxed step by step.

Stay home and stay healthy!

Interactive Graphics with R Shiny

Well, R is definitely here to stay and has made its way into the data science tool zoo. As a statistician, I often feel alienated surrounded by these animals, but R is still the statistician’s tool of choice (yes, it has come of age, but where are the predators …?)

What was usually a big problem for us statisticians was to get our methods and models out to our customers, who (usually) don’t speak R. This is where Shiny comes in handy: it offers a whole suite of bread-and-butter interface widgets, which can be deployed to web pages and wired to R functions via all kinds of callback routines.

A typical example (sorry for the data set) looks like this:

(Please use this example in class to demonstrate how limited k-means is!)
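For the classroom point in that parenthesis, a minimal sketch of why k-means is limited: plain Lloyd's algorithm (written here in Python on toy data, not the iris app above) prefers compact, round clusters and will happily cut an elongated cluster in half.

```python
def kmeans(points, centers, iters=20):
    """Plain Lloyd's algorithm; `centers` is a list of starting centers."""
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        labels = [min(range(len(centers)),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(pt, centers[c])))
                  for pt in points]
        # update step: each center moves to the mean of its points
        for c in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two long horizontal "stripes" -- the true classes are bottom vs. top:
points = [(x, 0.0) for x in range(11)] + [(x, 1.0) for x in range(11)]
labels = kmeans(points, centers=[[2.0, 0.5], [8.0, 0.5]])
# k-means splits left vs. right instead of bottom vs. top:
print(labels)
```

The within-cluster spread along x dominates the small gap in y, so the squared-distance criterion ends up cutting both stripes in half.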

Hey, this is already pretty interactive by the standards of what we know from R, and all without messing around with Tcl/Tk or other hard-to-manage and hard-to-port UI builders. But what I was keen to try out was what can actually be done with “real” interactive graphics as we know them from, e.g., Mondrian and in some parts from Tableau.

Here is what I came up with (same data for better recognition ;-):

The whole magic is done with these lines of code:


library(shiny)
library(MASS)   # for parcoord()

options(shiny.sanitize.errors = FALSE)
options(shiny.fullstacktrace = TRUE)

ui <- fluidPage(title = "Shiny Linking Demo",
  plotOutput("plot1",
             click  = "plot_click",
             brush  = brushOpts("plot_brush"),
             width  = 500,
             height = 500),
  plotOutput("plot2",
             click  = "plot2_click",
             width  = 500,
             height = 500),
  plotOutput("plot3",
             click  = "plot3_click",
             brush  = brushOpts("plot3_brush"),
             width  = 600,
             height = 400)
)

server <- function(input, output, session) {
  keep      <- rep(FALSE, 150)
  old_brush <- -9999
  var       <- 1
  keeprows <- reactive({
    keepN <- keep
    # any click clears the current selection
    if (!is.null(input$plot_click$x) | !is.null(input$plot3_click$x))
      keepN <- rep(FALSE, 150)
    # brushing the scatterplot selects the brushed points
    if (!is.null(input$plot_brush$xmin)) {
      if (old_brush != input$plot_brush$xmin) {
        keepN <- brushedPoints(iris, input$plot_brush,
                               xvar = "Sepal.Length",
                               yvar = "Sepal.Width",
                               allRows = TRUE)$selected_
        old_brush <<- input$plot_brush$xmin
      }
    }
    # clicking a bar selects the corresponding species
    if (!is.null(input$plot2_click$x)) {
      keepN <- pmax(1, pmin(3, round(input$plot2_click$x))) ==
        as.numeric(iris$Species)
    }
    # brushing an axis of the parallel coordinate plot selects a range
    if (!is.null(input$plot3_brush$xmin)) {
      if (old_brush != input$plot3_brush$xmin) {
        var <<- round((input$plot3_brush$xmin + input$plot3_brush$xmax) / 2)
        coor_min <- min(iris[, var]) + input$plot3_brush$ymin * diff(range(iris[, var]))
        coor_max <- min(iris[, var]) + input$plot3_brush$ymax * diff(range(iris[, var]))
        keepN <- iris[, var] >= coor_min & iris[, var] <= coor_max
        old_brush <<- input$plot3_brush$xmin
      }
    }
    # input$key (is shift pressed?) is assumed to be set by a small
    # JavaScript key binding (not shown); it extends the selection
    if (is.null(input$key)) {
      keep <<- keepN
    } else {
      if (input$key)
        keep <<- keepN | keep
      else
        keep <<- keepN
    }
    keep
  })
  output$plot1 <- renderPlot({
    plot(iris$Sepal.Length, iris$Sepal.Width, main = "Drag to select points")
    points(iris$Sepal.Length[keeprows()],
           iris$Sepal.Width[keeprows()], col = 2, pch = 16)
  })
  output$plot2 <- renderPlot({
    barplot(table(iris$Species), main = "Click to select classes")
    barplot(table(iris$Species[keeprows()]), add = TRUE, col = 2)
  })
  output$plot3 <- renderPlot({
    parcoord(iris[, -5], col = keeprows() + 1, lwd = keeprows() + 1)
  })
}

shinyApp(ui, server)

What makes this example somewhat special is:

  • It does not need too much code
  • It is relatively general, i.e. other plots may be added
  • It uses traditional R graphics off the shelf
  • It is not too slow

Of course it is a hack! But it proves that Shiny is capable of doing interactive statistical graphics to some degree.

Something the developers of Shiny are actually thinking about.

Statistics is dead, long live Statistics!

It was March 7th this year when this mail from the ASA found its way to the ASA members:

At first sight it didn’t look like something that needed too much attention, but in the longer PDF version you can read these six principles:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

taken from the full statement in The American Statistician.

For me this sounds like “the end” of classical statistics as a sub-discipline of mathematics. The cause seems obvious to me: in the light of data science as a widely promoted but hardly defined discipline, statistics seems to lose more and more ground. Unfortunately, the ASA does not really deliver new directions that would make ordinary statisticians more future-proof.

Is this new? I would say no. Ever since John Tukey promoted EDA (Exploratory Data Analysis, for those who are too young to know), we have gotten new directions from someone who really knew the math behind statistics and as a result saw its limitations.

Digging through my old talks, I found this slide from 2002:

Nothing new, really – back then, 15 years ago, it was in the light of the buzzword “data mining”. But the point is the same.

The only question is:

Does the statistics community react too late, and is now doomed to diminish towards insignificance?

Jobs in Data Science

Well, if you are not in Data Science today, you are apparently missing a major trend … many say. Just in the last year, I witnessed at least three people mutating from ordinary computer scientists or statisticians into data scientists or data engineers. If you don’t really know what these people do, Analytics Vidhya has an easy classification for you.

You might have your doubts about what is written there (and maybe they are the same as mine), but one thing is for sure: your mutation from a computer-savvy statistician to a data scientist could be worth no less than $30,000.

Go and reinvent yourself!

Is Big Data all about Dark Data?

So far, my favorite description of Big Data is:

Big Data is when it is cheaper to keep all data than to think about what data you probably need to answer your (business) questions.

Why is this description so attractive? Well, Big Data is primarily a technology, i.e., storing data in the Hadoop Distributed File System (HDFS) – at least for most of us. This makes storing data extremely cheap, both in terms of structuring your data (far more expensive in a database) and physically storing it.

But at some point we need to analyze the data, no matter whether we stored it “without” much structure in HDFS or with an analysis in mind in a database. In the Big Data case, we have probably just postponed the process of getting this work done.

Here is where the new buzzword comes into play: “Dark Data Mining”. According to Gartner, Dark Data is data that we “fail to use for analysis purposes”. And KDnuggets even has a great visualization of the whole problem:

Whereas Kaushik Pal still sees a big business potential within Dark Data, I would look at it from a different perspective.

Dark Data Mining is like coal mining where you do not separate lignite and spoil during the mining process but put both on the same dump – because it is cheaper – and only start mining the dump for lignite once you actually want to use it …

Why touch is less

It is now almost 10 years since I asked a friend to bring me an iPod Touch along when he was visiting NYC. I was thrilled and curious to see the new interface Apple had introduced with iOS. Since then, smartphones and tablets have become ubiquitous, and the touch interface is here to stay.

Even the surface of my mouse features a touch interface by now, and getting back to my scroll-wheel mouse at work is always a pain.

The question that arises is whether or not touch devices will completely replace the traditional desktop interfaces we have gotten used to over the last 30 years.

Interestingly, the makers of Windows and Mac OS X seem to have different opinions on this.

Whereas Windows 10 advertises Continuum, a functionality that lets you use your (high-end) Windows Phone as a desktop (?) computer and vice versa (?), Apple is starting to align app functionality between Mac OS X and iOS without pushing one interface onto both worlds (desktop and touch) or mixing both worlds into one product.

I couldn’t really argue much for why the one or the other way would be preferable, other than that a touch screen for a laptop seems an odd choice, as you constantly hide content with the touching hand …

After having used Mondrian on a 70” Sharp touch panel at work, I more clearly understand why Apple still goes two separate ways.

Function                        Desktop   Touch
Click                           yes       yes
Click & Drag                    yes       yes
Mouse Over                      yes       no
Range Selection (Shift-Click)   yes       no
Item Selection (Ctrl-Click)     yes       no
Right Click                     yes       maybe
Precise Click                   yes       no
Pinch to Zoom                   no        yes

The above table (certainly not exhaustive) clearly shows that a lot of functionality (not to mention the keyboard) is lost when going from the desktop interface to a touch interface. For most (trivial) interactions and apps, we can live with this simplification, but when it comes to productivity software, touch is just inferior. That’s fine (and intended) for my smartphone and tablet, but a problem for my laptop or desktop.

Significantly insignificant

I usually enjoy reading the articles in Significance magazine, published by the RSS. Not only is it a glossy magazine (quite uncommon for statistics as a discipline), but it also often features very nice case studies of real-life problems that matter.

Not so for the article in the current issue (December 2015) on the so-called “Diesel Gate”. But before we look deeper into the problem, let’s start by looking at emission regulations in the US and Europe. The following figure from “The Long Tail Pipe” illustrates the problem:

Whereas the US restricts NOx very strongly, the EU pushes on COx. This makes one wonder, as all emissions are bad for the environment and should be regarded as equally “bad” no matter which side of the Atlantic you reside on. Not so! US car makers and consumers value big cars with big (gasoline) engines that reach the speed limit of 55 mph as fast as possible and burn as much gas as possible in stop-and-go traffic – as gasoline in the US is comparatively very cheap. As these big engines produce much COx and (being gasoline engines) very little NOx, the US limits are set accordingly and, as a nice side effect, protect the US car market against the more efficient and smaller Diesel engines from the EU and Japan.

But back to the Significance article. It looks into 5 “studies” from the NYT, Vox, Mother Jones and the Associated Press, which all try to estimate the number of “Total US deaths” caused by Volkswagen’s defeat device in cars sold between 2009 and 2014, based on the estimated “Excess NOx”. As this estimate varies with the average miles driven per year and the NOx death rates, the authors end up with this histogram of 27 different estimates of extra US NOx deaths:

with an average of 160 and a median of roughly 80 “extra deaths”. Although it was hard to find a figure for total annual US NOx emissions, I found a figure of 6,300,000 tons in 2004. With a best-case death rate related to NOx of 0.00085, we get roughly 32,130 deaths, of which up to 200 (or 0.62%) are attributed to VW’s defeat device. (The share goes down to 0.056% with a death rate of 0.0095.)

If we have a NOx problem in the US, VW probably did not contribute to it significantly with their defeat device.

By the way, the US counts roughly 10,000 firearm-related homicides per year; over the period of 2009 to 2014 we thus face roughly twice as many deaths related to firearm misuse as we get from NOx pollution …

Emissions Gate – Is Volkswagen just a bad cheater?

Well, there goes the reputation of the German car makers, or at least that of Volkswagen – or does it? Cheating is not too special in many areas, but of course none of the parties involved wants to be busted. Volkswagen got busted now, and as a first consequence Martin Winterkorn left.

What makes one wonder is that Volkswagen does not really have the competitive edge we would expect from a good cheater – at least Lance Armstrong had one.

As we learned from professional cycling, (almost) everyone doped but only (too) few were actually convicted. Thus the question arises whether Volkswagen is the black sheep, or whether the industry as a whole is cheating. So what is actually behind #dieselgate or #vwgate?

I collected some data from the manufacturers’ websites regarding fuel consumption and compared it to what actual users report. The data collection (where available, I used the smallest Diesel engine for each car size class) looked easy at first sight, but is a bit tricky regarding sample sizes and comparability – though it does not look too bad.

Let’s first look at how many percent more cars consume than advertised (let’s call this variable excess for now):
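For clarity, a Python sketch of how such an excess value is computed; the consumption figures below are invented to reproduce the 14% and 73% endpoints mentioned in the text, not my actual collected data:

```python
def excess(advertised_l_per_100km, reported_l_per_100km):
    """Percent by which real-world consumption exceeds the advertised value."""
    return (reported_l_per_100km / advertised_l_per_100km - 1) * 100

# Invented example values (litres per 100 km):
print(f"{excess(11.5, 13.1):.0f}%")  # a Phaeton-like low excess
print(f"{excess(7.4, 12.8):.0f}%")   # a Q7-like high excess
```

Applied to each (advertised, user-reported) pair, this gives the excess variable plotted below.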

At just 14%, the lowest excess actually belongs to Volkswagen’s Phaeton – a well-known gas guzzler, which is rated as one even by VW. The top scorer, at 73%, is Audi’s Q7.

Let’s now look at boxplots of excess by make

and car size

As the engines of VW and Audi are largely the same, it is quite surprising that VW is closest to what they advertise while Audi seems to be far off. Probably an indication that the typical drivers of a manufacturer have a big impact as well.

Less surprising is that larger cars are the worst cheaters, as this can be explained by simple physics.

Let me conclude with the scatterplot of all data

The diagonal is what we as consumers should get, but all car makers seem to cheat equally well – so let’s see who is next to get busted!

PS: Fuel consumption is here used as a proxy of overall emissions, which are hard to measure otherwise.

The Good & the Bad [07/2015]: The most useless Map

Maybe it is a bit too harsh to speak of the “most useless map”, but when I saw this map on the Greek bail-out referendum in the FAZ this morning, that was what first came to my mind:

Well, yes, the vote was without any doubt against the EU’s suggestion for solving the financial dilemma in Greece. But wouldn’t we like to learn a bit more – given that we get to see a map?

The choice was relatively easy: I created a choropleth map, using the data from the FAZ map and a shapefile from the internet. Nothing too hard – until I found out that the Greeks use ‘k’, ‘c’ and ‘x’ interchangeably to create what appear to be different names that all mean the same (like Khios, Chios and Xiou) … so matching the districts was what took most of the time.
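One way to tame such transliteration variants is to normalize names before matching. A minimal Python sketch; the rewrite rules are a heuristic invented for this illustration (they handle the Khios/Chios/Xiou case), not an official transliteration standard:

```python
import re

def normalize(name):
    """Collapse common Latin transliteration variants of Greek names."""
    s = name.lower()
    s = re.sub(r"kh|ch|x", "k", s)   # kh / ch / x all stand for the same sound
    s = re.sub(r"[ous]+$", "", s)    # drop varying Latin-ish endings
    return s

for variant in ["Khios", "Chios", "Xiou"]:
    print(variant, "->", normalize(variant))
```

Matching districts on the normalized key instead of the raw name would have saved most of the manual work described above.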

Not that we get a striking story now, but at least we see some structure – maybe my Greek friends could help me out here with some deeper insight?!

The only thing I can read from the distribution of the votes over the districts is that the often-claimed “big divide” in Greek society is not really supported geographically, as we see an almost normal distribution.

Drop me a line if you are interested in the data.

Tour de France 2015

I made sort of an early start this year and have the data for the second stage already sorted out. I will start to log the results in the usual way, as in 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 and 2014:

[Figures: Stage Results, cumulative Time and Ranks – Stage, Total and Rank plots; click on the images to enlarge]

– each line corresponds to a rider
– smaller numbers are shorter times, i.e. better ranks
– all stages are on a common scale
– stage results and cumulative times are aligned at the median, which corresponds to the peloton
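The median alignment in the last point can be sketched in a few lines of Python; the stage times below are made up for illustration (seconds, most riders finishing on the peloton time):

```python
def align_at_median(times):
    """Center a list of times at their median, so the peloton sits at zero."""
    s = sorted(times)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [t - median for t in times]

stage = [14830, 14830, 14765, 14830, 15102]  # invented stage times in seconds
print(align_at_median(stage))
```

After alignment, a breakaway shows up as a negative value and a dropped rider as a large positive one, which is exactly what makes the per-stage lines comparable.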

STAGE 2: MARTIN still at the front while ROHAN fell back
STAGE 3: FROOME now at the top, CANCELLARA out after mass collision
STAGE 4: 7 drop outs after 4 stages, more to come …
STAGE 5: top 19 now consistent within roughly 2 minutes
STAGE 6: MARTIN drops out as a crash consequence
STAGE 7: 12 drop outs by now, and the mountains still to come
STAGE 8: SAGAN probably has the strongest team (at least so far …)
STAGE 9: Some mix up in the top 16, but none to fall back
STAGE 10: The mountains change everything, FROOME leads by 3′ now
STAGE 11: BUCHMANN out of nowhere
STAGE 12: No change in the top 6, CONTADOR 4’04” behind
STAGE 13: BENNETT to hold the Lanterne Rouge now
STAGE 14: Is team MOVISTAR strong enough to stop FROOME?
STAGE 15: Top 6 within 5′ – the Alpes will shape the winner
STAGE 16: A group of 23 broke out, but no threat for the classement
STAGE 17: GESCHKE wins and CONTADOR loses further ground
STAGE 18: A gap of more than 20′ after the first 15 riders now
STAGE 19: QUINTANA gains 30” on FROOME
STAGE 20: QUINTANA closes the gap to 1’12”, but not close enough
STAGE 21: Au revoir, with a small “error” in the last stage 😉

Don’t miss the data and make sure to watch Antony’s video on how to analyze the data interactively!