Statistics is dead, long live Statistics!

It was March 7th this year when this mail from the ASA found its way to its members:

At first sight it didn’t look like one needed to pay too much attention, but in the longer pdf version you can read these six principles:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

taken from the full statement in The American Statistician.
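
Principle 5 in particular is easy to illustrate with a small example of my own (not part of the ASA statement): for a simple one-sample t-test, the p-value is driven as much by the sample size as by the size of the effect.

    # p-value of a one-sample t-test for an observed mean m, with sd = 1 and sample size n
    p_val <- function(m, n) 2 * pt(-abs(m) / (1 / sqrt(n)), df = n - 1)

    p_val(m = 0.01, n = 1e6)   # negligible effect, huge sample  -> p far below 0.001
    p_val(m = 0.5,  n = 10)    # sizeable effect, small sample   -> p around 0.15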

For me this sounds like “the end” of classical statistics as a sub-discipline of mathematics. The cause seems obvious to me: in the light of Data Science as a widely promoted but hardly defined discipline, statistics seems to lose more and more ground. Unfortunately, the ASA does not really deliver new directions that would make the ordinary statistician more future-proof.

Is this new? I would say no. Ever since John Tukey promoted EDA (Exploratory Data Analysis, for those who are too young to know), we have had new directions from someone who really knew the math behind statistics and, as a result, saw its limitations.

Digging through my old talks, I found this slide from 2002:

Nothing new, really, and that was 15 years ago, in the light of the then-buzzword “Data-Mining”. But the point is the same.

The only question is:

Is the statistics community reacting too late, and is it now doomed to diminish into insignificance?

Jobs in Data Science

Well, if you are not in Data Science today, you are apparently missing a major trend … many say. Just in the last year, I witnessed at least three people mutating from ordinary computer scientists or statisticians into data scientists or data engineers. If you don’t really know what these people do, Analytics Vidhya has an easy classification for you.

You might have your doubts about what is written there (and maybe they are the same as mine), but one thing is for sure: your mutation from a computer-savvy statistician into a data scientist could be worth no less than $30,000.

Go and reinvent yourself!

Is Big Data all about Dark Data?

So far, my favorite description of Big Data is:

Big Data is when it is cheaper to keep all data than to think about what data you probably need to answer your (business) questions.

Why is this description so attractive? Well, Big Data is primarily a technology, i.e. storing data in the Hadoop Distributed File System (HDFS) – at least for most of us. This makes storing data extremely cheap, both in terms of structuring your data (far more expensive in a database) and physically storing it.

But at some point we need to analyze the data, no matter whether we stored it “without” much structure in HDFS or with an analysis in mind in a database. In the Big Data case we probably just postponed the process of getting this work done.

Here is where the new buzzword comes into play: “Dark Data Mining”. According to Gartner, Dark Data is data that we “fail to use for analysis purposes”. And KDnuggets even has a great visualization of the whole problem:

Whereas Kaushik Pal still sees a big business potential within Dark Data, I would look at it from a different perspective.

Dark Data Mining is like coal mining where you do not separate lignite and spoil during the mining process, but put both on the same dump – because it is cheaper – and only start mining the dump for lignite once you actually want to use it …

Why touch is less

It is now almost 10 years ago that I asked a friend to bring me an iPod Touch when he was visiting NYC. I was thrilled and curious to see the new interface Apple had introduced with iOS. Since then, smartphones and tablets have become ubiquitous and the touch interface is here to stay.

Even the surface of my mouse features a touch interface by now and getting back to my scroll wheel mouse at work is always a pain.

The question that arises is whether or not touch devices will completely replace the traditional desktop interfaces we have gotten used to over the last 30 years.

Interestingly, the makers of Windows and Mac OS X seem to have different opinions on this.

Whereas Windows 10 advertises Continuum, a functionality that lets you use your (high-end) Windows Phone as a desktop (?) computer and vice versa (?), Apple has started to align app functionality between Mac OS X and iOS without pushing one interface onto both worlds (desktop and touch) or mixing both worlds into one product.

I couldn’t really argue much for why one way or the other would be preferable, other than that a touchscreen for a laptop seems an odd choice, as you constantly hide content with the touching hand …

After having used Mondrian on a 70” Sharp touch panel at work, I more clearly understand why Apple still goes two separate ways.

Function                        Desktop   Touch
Click                           yes       yes
Click & Drag                    yes       yes
Mouse Over                      yes       no
Range Selection (Shift-Click)   yes       no
Item Selection (Ctrl-Click)     yes       no
Right Click                     yes       maybe
Precise Click                   yes       no
Pinch to Zoom                   no        yes

The above table (certainly not exhaustive) clearly shows that a lot of functionality (not to mention the keyboard) is lost when going from a desktop interface to a touch interface. For most (trivial) interactions and apps we can live with this simplification, but when it comes to productivity software, touch is just inferior. That’s fine (and intended) for my smartphone and tablet, but a problem for my laptop or desktop.

Significantly insignificant

I usually enjoy reading the articles in Significance magazine, published by the RSS. Not only is it a glossy magazine (quite uncommon for statistics as a discipline), but it also often features very nice case studies of real-life problems that matter.

Not so for the article in the current issue (December 2015) on the so-called “Diesel Gate”. But before we look deeper into the problem, let’s start by looking at emission regulations in the US and Europe. The following figure from “The Long Tail Pipe” illustrates the problem:

Whereas the US restricts NOx very strongly, the EU pushes on COx. This makes one wonder, as all emissions are bad for the environment and should be regarded as equally “bad” no matter which side of the Atlantic you reside on. Not so! US car makers and consumers value big cars with big (gasoline) engines that reach their speed limit of 55 mph as fast as possible and burn as much gas as possible in stop-and-go traffic – as gasoline in the US is comparably very cheap. As these big engines produce a lot of COx and (being gasoline engines) very little NOx, the US limits are set accordingly and, as a nice side effect, protect the US car market against the more efficient and smaller Diesel engines from the EU and Japan.

But back to the Significance article. It looks into 5 “studies” from the NYT, Vox, Mother Jones and the Associated Press, which all try to estimate the number of “Total US deaths” caused by Volkswagen’s defeat device in cars sold between 2009 and 2014, based on the estimated “Excess NOx”. As this estimate varies with the assumed average miles driven per year and the NOx death rates, the authors end up with this histogram of 27 different estimates of extra US NOx deaths

with a mean of 160 and a median of roughly 80 “extra deaths”. Although it was hard to find a figure for total annual US NOx emissions, I found a figure of 6,300,000 tons in 2004. With a best-case death rate related to NOx of 0.00085 we get roughly 32,130 deaths, of which up to 200 (or 0.0062%) are attributed to VW’s defeat device. (The number goes down to 0.00056% with a death rate of 0.0095.)

If we have a NOx problem in the US, VW probably did not contribute to it significantly with their defeat device.

Btw., the US counts roughly 10,000 firearm-related homicides per year; for the period of 2009 to 2014 we thus face roughly twice as many deaths related to firearm misuse as we get from NOx pollution …

Emissions Gate – Is Volkswagen just a bad cheater?

Well, there goes the reputation of the German car makers, or at least that of Volkswagen – or does it? Cheating is not too special in many areas, but of course none of the parties involved wants to be busted. Volkswagen has been busted now, and as a first consequence Martin Winterkorn left.

What makes one wonder is that Volkswagen does not really have that competitive edge we would expect from a good cheater – at least Lance Armstrong had one.

As we learned from professional cycling, (almost) everyone doped, but only (too) few were actually convicted. Thus the question arises whether Volkswagen is the only black sheep, or whether the industry as a whole is cheating. So what is actually behind #dieselgate or #vwgate?

I collected some data on fuel consumption from the manufacturers’ websites and compared it to what actual users report. The data collection (where available, I used the smallest Diesel engine for each car size class) looked easy at first sight, but is a bit tricky regarding sample sizes and comparability – still, it does not look too bad.

Let’s first look at how many percent more the cars consume than advertised (let’s call this variable excess for now):
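
For reference, a minimal R sketch of how such an excess variable and the plots below could be produced (assuming a data frame cars with hypothetical columns make, size, advertised and reported, the latter two being consumption in l/100 km; this is not the code actually used):

    # percentage by which the user-reported consumption exceeds the advertised value
    cars$excess <- 100 * (cars$reported - cars$advertised) / cars$advertised

    # boxplots of excess by make and by car size class
    boxplot(excess ~ make, data = cars, las = 2, ylab = "Excess consumption [%]")
    boxplot(excess ~ size, data = cars, las = 2, ylab = "Excess consumption [%]")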

The lowest excess, with just 14%, is actually Volkswagen’s Phaeton – a well-known gas guzzler, which is rated as one even by VW. The top scorer, with 73%, is Audi’s Q7.

Let’s now look at boxplots of excess by make

and car size

As the engines of VW and Audi are largely the same, it is quite surprising that VW is closest to what they advertise while Audi seems to be far off. Probably an indication that the typical drivers of a manufacturer have a big impact as well.

Less surprising is that larger cars are the worst cheaters, as this can be explained by simple physics.

Let me conclude with a scatterplot of all the data:

The diagonal is what we as consumers should get, but all car makers seem to cheat equally well – so let’s see who is next to get busted!

PS: Fuel consumption is used here as a proxy for overall emissions, which are hard to measure otherwise.

The Good & the Bad [07/2015]: The most useless Map

Maybe it is a bit too harsh to talk of the “most useless map”, but when I saw this map on the Greek bail-out referendum in the FAZ this morning, this was what first came to my mind:

Well, yes, the vote was without any doubt against the EU’s proposal to solve the financial dilemma in Greece. But wouldn’t we like to learn a bit more – given that we get to see a map?

Yes, the choice is relatively easy, and I created a choropleth map using the data from the FAZ map and some shapefile from the internet. Nothing too hard, until I found out that the Greeks use ‘k’, ‘c’ and ‘x’ more or less interchangeably, creating what appear to be different names that all mean the same (like Khios, Chios and Xiou) … so matching the districts was what took most of the time.
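
One way such name matching could be semi-automated (a sketch of the general idea in R, not the procedure I actually used; function and example names are made up):

    # collapse the transliteration variants before matching
    normalize_name <- function(x) {
      x <- tolower(x)
      x <- gsub("kh|ch|x", "k", x)   # 'Khios' and 'Chios' become 'kios', 'Xiou' becomes 'kiou'
      gsub("[^a-z]", "", x)          # drop spaces, hyphens, etc.
    }

    # match a district name to the closest candidate by edit distance on the normalized names
    match_district <- function(name, candidates) {
      candidates[which.min(adist(normalize_name(name), normalize_name(candidates)))]
    }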

Not that we get a striking story now, but at least we see some structure – maybe my Greek friends could help me out here with some deeper insight?!

The only thing I can read from the distribution of the votes over the districts is that the often-claimed “big divide” in Greek society is not really supported geographically, as we see an almost normal distribution.

Drop me a line if you are interested in the data.

Tour de France 2015

I made sort of an early start this year and have the data for the second stage already sorted out. I will now start to log the results in the usual way, as in 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 and 2014:

[Plots: stage results, cumulative times and ranks – click on the images to enlarge]

– each line corresponds to a rider
– smaller numbers are shorter times, i.e. better ranks
– all stages are on a common scale
– stage results and cumulative times are aligned at the median, which corresponds to the peloton (see the sketch below)
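
A minimal R sketch of this median alignment (assuming a hypothetical matrix cum_times with one row per rider and one column per stage; this is not the original code):

    # align the cumulative times at the stage medians, so the peloton sits at zero
    aligned <- sweep(cum_times, 2, apply(cum_times, 2, median, na.rm = TRUE))

    # one line per rider, all stages on a common scale
    matplot(t(aligned), type = "l", lty = 1, col = "grey40",
            xlab = "Stage", ylab = "Time relative to the peloton [s]")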

STAGE 2: MARTIN still at the front while ROHAN fell back
STAGE 3: FROOME now at the top, CANCELLARA out after mass collision
STAGE 4: 7 drop outs after 4 stages, more to come …
STAGE 5: top 19 now consistent within roughly 2 minutes
STAGE 6: MARTIN drops out as a crash consequence
STAGE 7: 12 drop outs by now, and the mountains still to come
STAGE 8: SAGAN probably has the strongest team (at least so far …)
STAGE 9: Some mix up in the top 16, but none to fall back
STAGE 10: The mountains change everything, FROOME leads by 3′ now
STAGE 11: BUCHMANN out of nowhere
STAGE 12: No change in the top 6, CONTADOR 4’04” behind
STAGE 13: BENNETT to hold the Lanterne Rouge now
STAGE 14: Is team MOVISTAR strong enough to stop FROOME?
STAGE 15: Top 6 within 5′ – the Alps will shape the winner
STAGE 16: A group of 23 broke away, but no threat to the classement
STAGE 17: GESCHKE wins and CONTADOR loses further ground
STAGE 18: A gap of more than 20′ after the first 15 riders now
STAGE 19: QUINTANA gains 30” on FROOME
STAGE 20: QUINTANA closes the gap to 1’12”, but not close enough
STAGE 21: Au revoir, with a small “error” in the last stage 😉

Don’t miss the data and make sure to watch Antony’s video on how to analyze the data interactively!

90+4 Minutes of Horror

As FC Bayern München had already won the German soccer league many weeks ago (and consequently dropped out of the remaining contests), no one was really interested in the Bundesliga any longer – no one?

Being boring at one end of the table does not mean that the other end is boring as well. And quite opposite to the top, the bottom of the table was extremely thrilling, with 6 teams that could potentially be relegated before the last match day. Even more excitingly, all 6 teams were spread over only 4 games, meaning 4 of the teams were facing a direct opponent in the fight to stay in the Bundesliga.

Enough said; within the 90+4 minutes of match time we saw 11 goals, and 6 of these goals affected the table ranking, i.e., who was to be relegated or not:

What do we learn from this visualization?

  1. Paderborn, starting the last match day in last place and ending up in the very same place, is relegated. The 32 minutes of hope from 0:04 to 0:36 faded with Stuttgart’s equalizer.
  2. HSV is one of the two rollercoaster teams, with a spread of 4 ranks and 4 rank changes. Winning their game, they climbed from 17th to 16th, which qualifies them to play against KSC of the 2nd Bundesliga to decide who plays in the Bundesliga next season.
  3. The second rollercoaster team is VfB, with a spread of 4 ranks and even 5 rank changes. Having spent most of the last match day (exactly 55 minutes) on a direct relegation rank, their win against Paderborn pushed them up to rank 14.
  4. Hannover did not take any chances and won against Hertha, climbing from 15th to 13th.
  5. Freiburg is the loser of the day. Starting 14th, they found themselves 3 ranks down in 17th, relegating directly to the 2nd Bundesliga.
  6. Hertha only had a very small chance of actually being in danger of relegation. Nonetheless, they slipped 2 ranks, but as 15th they stay in the Bundesliga.

The Good & the Bad [02/2015]

Yes, it’s been a while since the last post, but hey – isn’t it a good sign that I apparently do not stumble over too many bad graphics every day (or I just might not have the time to write about them …)?

This example comes from SAS’s online manual on their Visual Analytics tool. And here it is:

“The Bad” is even supported by explanatory text:

This example uses a butterfly chart to show the actual sales compared to the predicted sales for a line of retail products. The butterfly chart is useful for comparing two unique values. In this chart, the two values are arranged on each side of the Y axis.

I am sure that my friends at SAS know better, so I won’t start bashing here, and I will also not start to refine the plot to perfection, as the problem seems to be too obvious:

Why compare continuous quantities not side by side on the same scale, but on separate, opposite scales – especially when we look at quite small differences?

Maybe the programmers of SAS VA or the writers of this online help are too young, so they might have missed Bill Cleveland’s “The Elements of Graphing Data” from 1985, but they could have stumbled over this 1985 paper (available on today’s internet). This paper is an essence of what the book covers in all of Chapter 4, “Graphical Perception”, which has not changed in the last 30 years (which cannot be said about Figure 3.84 on page 216 ;-).

So here is Cleveland’s advice on the precision of perception tasks:

And here is the same data on a comparable scale – I hardly dare to call it “The Good”:

If someone supplies the ggplot “solution”, please feel free to comment and I will add it to the plot.
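
In the meantime, here is only a sketch of what such a ggplot “solution” could look like, with made-up numbers standing in for the actual and predicted sales from the SAS example:

    library(ggplot2)

    # made-up numbers standing in for the SAS example data
    sales <- rbind(
      data.frame(product = c("A", "B", "C", "D"), type = "actual",
                 value = c(120, 95, 143, 87)),
      data.frame(product = c("A", "B", "C", "D"), type = "predicted",
                 value = c(110, 102, 150, 90))
    )

    # both series as dots along one common scale - Cleveland's most accurate perception task
    ggplot(sales, aes(x = value, y = product, colour = type)) +
      geom_point(size = 3) +
      labs(x = "Sales", y = NULL, colour = NULL)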

Global Warming: 2 years of new data

Two years ago, I posted the data on the relationship between CO2 and global temperature. The conclusion was: at least in the last 10 years, CO2 concentration kept rising, but the global temperature didn’t.

Now, as last year was the warmest year ever measured, it is time to look at the data again, with two more years of data.

Let’s first look at the scatterplot of temperature vs. CO2 concentration, with the last 10 years highlighted:

Again, there is no (linear) relationship whatsoever. Certainly CO2 is a greenhouse gas, and we all know how a greenhouse works.

Looking at the temperature development, we can’t ignore that 2014 was the warmest year ever recorded. Nonetheless, when we use a smoother with a wider span (a smoothing spline with 6 degrees of freedom), which picks up the almost linear trend nicely, the temperature rise looks like it stalled roughly 10 years ago:

Using a far more flexible smoother (25 degrees of freedom) we get a different result, indicating a dramatic rise in temperature in the last year:

As we all know, the volatility of a trend estimate is always highest at the end, where we only have data on one side of the estimate.
Thus, I am afraid we need to wait for another 2 years of data to tell whether 2014 was the end of the temperature stagnation or not.
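
For those who want to experiment with the amount of smoothing themselves, a minimal R sketch of the two smoothers (assuming a data frame temp with hypothetical columns year and anomaly; this is not the original code):

    # a stiff and a very flexible smoothing spline for the temperature series
    fit_stiff    <- smooth.spline(temp$year, temp$anomaly, df = 6)
    fit_flexible <- smooth.spline(temp$year, temp$anomaly, df = 25)

    plot(temp$year, temp$anomaly, pch = 16, col = "grey60",
         xlab = "Year", ylab = "Temperature anomaly")
    lines(fit_stiff,    col = "blue", lwd = 2)   # picks up the long-term trend
    lines(fit_flexible, col = "red",  lwd = 2)   # chases the most recent observations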

Merry Christmas and Happy Holidays

… to all readers! I am off after Christmas: no internet, some kilos of (physical) books and probably some elk – I might take my MacBook along for programming, though.

For those of you who think 2014 went optimally and are thinking about making 2015 even more efficient, here is a scene from The Little Prince:

“Good morning,” said the little prince.

“Good morning,” said the merchant.

This was a merchant who sold pills that had been invented to quench thirst. You need only swallow one pill a week, and you would feel no need for anything to drink.

“Why are you selling those?” asked the little prince.

“Because they save a tremendous amount of time,” said the merchant. “Computations have been made by experts. With these pills, you save fifty-three minutes in every week.”

“And what do I do with those fifty-three minutes?” asked the little prince.

“Anything you like…”

“As for me,” said the little prince to himself, “if I had fifty-three minutes to spend as I liked, I should walk at my leisure toward a spring of fresh water.”

― Antoine de Saint-Exupéry, The Little Prince