Is there something like “old” and “new” statistics?

The blog post “Data science without statistics is possible, even desirable” by Vincent Granville talks about “old statistics” and “new statistics”, which sparked some further discussion about how statistics and data science relate.

Whereas I agree that there is something like “old” and “new” thinking about the role and the tools of statistics, I am less convinced that data science blesses us with many new techniques that are generally more useful than what we have been using in statistics for quite a while.

Anyway, I promise I won’t pester you with more big data related posts (at least this year) and want to close with my favorite quote regarding big data, by George Dyson:

“Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away.” 

Thus the “value” of this kind of data can be doubted, and it becomes quite clear that most companies

  1. probably don’t have the need for big data solutions
  2. will struggle to find a decent business case for big data

UEFA Champions League Round of 16 draw

Each year after the group stage, there is the much-awaited draw for the round of 16, which essentially defines a team’s fate. So far, things are not too complicated: there are 16 teams out of which we need to generate 8 matches – no problem if it were possible to draw the teams without restrictions. But there are quite a few:

  1. Group winners only play group runners-up
  2. Teams can’t play a team that was in the same group
  3. Teams from the same league can’t play each other

Thus there is some combinatorics to solve. Sebastian created a shiny app and the necessary R-Code to generate the probabilities of who plays whom:

Here we immediately see the restrictions as 0% probabilities: there are 8 zeros on the diagonal (restriction 2) and 7 zeros off the diagonal (restriction 3). As each row and column must add up to one (a fact that the friends at SPON got wrong, as they initially posted a false solution), combinations at intersections of rows and columns with many zeros get higher probabilities. So the most likely draws (greedy) are:

  • FC Chelsea vs. Bayer 04 Leverkusen
  • FC Bayern Munich vs. FC Arsenal
  • Borussia Dortmund vs. Manchester City
  • AS Monaco vs. FC Schalke 04
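For readers without the shiny app at hand, here is a minimal rejection-sampling sketch in R (not Sebastian’s code) that estimates such a probability matrix, assuming all admissible complete pairings are equally likely – the actual UEFA procedure draws sequentially, which can differ slightly. The group letters and countries below are placeholders and would have to be replaced by the actual pots.

```r
# Rejection sampling over complete pairings: draw a random assignment of
# runners-up to group winners and keep it only if it satisfies both
# restrictions. Teams, groups and countries are made up for this sketch.
set.seed(1)
winners <- data.frame(team = paste0("W", 1:8), group = LETTERS[1:8],
                      country = c("ENG", "GER", "GER", "FRA",
                                  "ESP", "POR", "ESP", "FRA"))
runners <- data.frame(team = paste0("R", 1:8), group = LETTERS[1:8],
                      country = c("GER", "ENG", "ENG", "GER",
                                  "UKR", "SUI", "ITA", "ESP"))

valid <- function(perm) {
  all(winners$group   != runners$group[perm]) &&   # restriction 2: not the same group
  all(winners$country != runners$country[perm])    # restriction 3: not the same league
}

counts <- matrix(0, 8, 8, dimnames = list(winners$team, runners$team))
n <- 0
while (n < 10000) {
  perm <- sample(8)                                # a random complete pairing
  if (valid(perm)) {
    counts[cbind(1:8, perm)] <- counts[cbind(1:8, perm)] + 1
    n <- n + 1
  }
}
round(counts / n, 3)                               # estimated probability matrix
```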

If these four most likely matches were drawn, we would end up with equal probabilities and still three different opponents for all the remaining teams:

Things look quite different when we go for the least probable match in each draw; these are, e.g.:

  • Real Madrid vs. FC Shakhtar Donetsk (1 out of 9 with 11%)
  • FC Porto vs. FC Basel (9.3%)
  • FC Barcelona vs. Juventus Turin (13%)
  • AS Monaco vs. Manchester City (1 out of 7 with 16.7%)

Now, after only 4 draws, the remaining matches are all fixed by one of the restrictions:

Now we see what makes the draw so interesting: given which matches have already been drawn, the remaining matches are more or less fixed.

Thanks to Sebastian for the nice tool – have fun playing around; maybe you can find three matches which already fix all the remaining ones?! Let’s see what happens on Monday, when the actual draw takes place.

Anyway, a fantastic example of how useful shiny can be.

Statistics vs. Computer Science: A Draw?

I’ve been thinking about what Big Data really is for quite a while now, and I am always happy about voices that can shed some more light on this phenomenon – especially by contrasting it with what we have called statistics for centuries now.

Recently I stumbled over two articles. The first is from Miko Matsumura, who claims “Data Science Is Dead” and largely laments that data science lacks (statistical) theory; the other one is from Thomas Speidel, who asks “Time to Embrace a New Identity?”, largely lamenting that statistics is not embracing new technologies and applications.

In the end, it looks like both think “the grass is always greener on the other side” – nothing new for people who are able to reflect. But there is certainly more to it. Whereas statistics is based on concepts and relies on the interplay of exploration and modeling, technology trends are very transient, and what is bleeding-edge technology today is tomorrow’s old hat.

So both are right, and the only really new thing for statisticians is that we not only need to talk to the domain experts, but also need to talk to the data providers more thoroughly, or start working on Hadoop clusters and using Hive and HQL …

Big Data Visualization

… would have been the better title for the book “Graphics of Large Datasets”. As the book was published a few years before the birth of the next buzzword, promoted by McKinsey with the white paper “Big data: The next frontier for innovation, competition, and productivity”, we just did not know any better.

But to be honest, much of the book’s first half – to be precise the first 101 pages – deals exactly with what we generally face when displaying millions of data points / rows / records.

Here is just a small collection of what you have to expect beyond Amazon’s “look inside” feature:

Well, there is certainly one big drawback with this book, as it does not try to simplify things and offer trivial solutions. To cite Albert Einstein:

Everything Should Be Made as Simple as Possible, But Not Simpler

As we deal with really big data, i.e., millions of records, thousands of variables or thousands of models, the problems need at least more thought than classical business graphics.
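To give a flavor of what “more thought” means in practice, here is a minimal R sketch on synthetic data that contrasts a naive scatterplot of two million points with two density-oriented alternatives along the lines the book discusses:

```r
# Synthetic stand-in for "millions of records": a plain plot() gives little
# more than a black blob, so switch to displays that encode point density.
set.seed(1)
n <- 2e6
x <- rnorm(n)
y <- x + rnorm(n)

plot(x, y, pch = ".")                              # naive: massive overplotting
plot(x, y, pch = ".", col = rgb(0, 0, 0, 0.02))    # alpha blending
smoothScatter(x, y)                                # 2D density estimate (base R)
```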

Talking about classical business graphics – in a very modern (i.e., big-data-compatible) look – here is, e.g., what the guys from Datameer offer as a solution for big data visualization:

Big Data: Not without my Stats Textbook!

Google is certainly the world champion in collecting endless masses of data, be it search terms, web surfing preferences, e-mail communication, social media posts and links, …

As a consequence, at Google they are not only masters of statistics (hey, my former boss at AT&T Labs who was heading statistics research went there!) but they also need to know how to handle Big Data – one might believe. But as with all big companies, there are “those who know” and “those who do”, and the two are unfortunately often not identical.

So, “those who do” at Google built Google Correlate, a simple tool that correlates search terms. To start with an example (all searches originating from Germany), let’s look at what correlates with “VW Tiguan”:

With a correlation of 0.894 it is the fourth-highest-ranking correlation, as I left out “Tiguan” and “Volkswagen Tiguan” as well as “MAN TGX” (which all relate to the term itself or to another car/truck). www.notebookcheck.com is a notebook-related website in German, which is definitely unrelated to the VW Tiguan. The corresponding scatterplot looks like this:

Besides the general problem of Big Data applications – making sense of what we collected – we face two major problems, no matter what kind of data we are actually looking at:

  • With millions to billions of records, differences usually all become statistically significant, no matter how small they are, when classical statistical approaches are used (see the sketch below)
  • The other way round, when looking for similarities, we tend to find things that “behave the same” purely because of the sheer amount of data, although there is no causality at all
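The first point is easy to demonstrate with a few lines of R on synthetic data with a made-up, practically irrelevant effect size:

```r
# A negligible difference becomes "significant" once n is in the millions.
set.seed(1)
n <- 5e6
x <- rnorm(n, mean = 0)
y <- rnorm(n, mean = 0.003)   # a difference of 0.003 standard deviations
t.test(x, y)$p.value          # well below 0.05 despite the tiny effect
```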

But what went wrong with Google Correlate? They certainly fell for the latter of the two problems listed above; the question is why? First there is the spurious correlation (see here for a nice collection of similar causality-free time series), which is driven solely by the smooth, slowly varying component the two time series share. If you remove that smooth component (I used a simple lowess smoother), the scatterplot looks like this:

with a correlation of 0.0025, i.e., no correlation. Looking closer at the time series, it is quite obvious that apart from this smooth common component there is no correlation whatsoever.
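The detrending step can be mimicked with a few lines of R; the two series below are synthetic stand-ins for the weekly search-interest curves:

```r
# Two series that share only a smooth long-term pattern plus independent noise.
set.seed(1)
t <- 1:400
smooth_part <- sin(t / 60)
x <- smooth_part + rnorm(400, sd = 0.2)
y <- smooth_part + rnorm(400, sd = 0.2)

cor(x, y)                      # high, driven entirely by the shared smooth part

# Remove the lowess fit and correlate the residuals
rx <- x - lowess(t, x, f = 1/10)$y
ry <- y - lowess(t, y, f = 1/10)$y
cor(rx, ry)                    # close to zero
```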

Enough of Google-bashing now, but the data aren’t i.i.d., and a Pearson correlation coefficient is not an adequate measure of the similarity of two time series. In the end, it boils down to a rather trivial verdict: trust your common sense and don’t forget what you have learned in your statistics courses!

(By the way, try searching for “Edward Snowden” in Google Correlate – it appears the name has been censored.)

Big Data: The Truth, Nothing But the Truth


With Big Data and the internet, we all feel like we can know and analyze everything. Certainly Google must feel that way, as they collect not only data, but also what we – the users – find interesting in that vast pile of information.

As we should always keep in mind: Google is not a charity and does not offer its services for free, and we should expect to see their commercial interests interfere with what we would usually refer to as “neutrality” or “truth”.

Just after the soccer world cup semi-final, I stumbled over this article on NPR, where it says:

But Google itself is choosing to steer clear of negative terms. The company has created an experimental newsroom in San Francisco to monitor the World Cup, and turn popular search results into viral content. And they’ve got a clear editorial bias.

Their motivation is only superficially of a kindhearted nature as:

“We’re also quite keen not to rub salt into the wounds,” producer Sam Clohesy says, “and a negative story about Brazil won’t necessarily get a lot of traction in social.”

So we need to go to the English press directly to get these fantastic headlines talking about “German tanks rolling into Belgium” (I guess this was at Euro 1980, with two goals by Horst Hrubesch – who was probably mistaken for a tank …) or the 2010 headline of The Sun, “Men v boys.. and the boys won”.

The bottom line is clear: if you want an unbiased excerpt of “the news”, you can’t really rely on Google, as they whitewash the news to make it as “shareable” and “clickable” as possible in order to fuel revenues.

The Tour is over – long live the Tour!

The Tour 2014 is over and has a winner – Vincenzo Nibali. As some readers asked how they could analyze the data interactively themselves, I post this video by Antony Unwin, who looked at the 2013 data from the 100th edition of the Tour.

If you are inspired now, go download the data and the software and start exploring yourself!

Why NIBALI has only a 50% chance to win the tour

Well, to be honest, I see a far higher chance for him to win the tour, but first let’s look at the data. Having collected 10 years of Tour de France data, it is time to look at structural features of a whole tour. With a sample size of 10 (yes, still far away from big data …) we might want to look at the rank of the eventual winner over the course of a tour.

The graph shows the empirical probabilities (supported by a natural spline smoother of degree 5) for each stage that

  1. the current leader wins the tour
  2. the winner is in the top 3
  3. in the top 5, or
  4. in the top 10

From this model we can read off the graph that the chance to win the tour is 50% if you are the leader after stage 14.
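As a rough sketch of how such a curve can be fitted – assuming a data frame tours with one row per (year, stage) and the eventual winner’s rank after that stage in a column winner_rank (the names are made up for this example):

```r
library(splines)

# 1 if the eventual tour winner is currently leading after this stage
tours$leads <- as.numeric(tours$winner_rank == 1)

# empirical probability per stage that the current leader wins the tour
p_lead <- aggregate(leads ~ stage, data = tours, FUN = mean)

# natural spline smoother with 5 degrees of freedom to support the points
fit <- lm(leads ~ ns(stage, df = 5), data = p_lead)

plot(p_lead$stage, p_lead$leads, ylim = c(0, 1),
     xlab = "stage", ylab = "P(current leader wins the tour)")
lines(p_lead$stage, predict(fit))
```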

What really surprised me is the fact that there is such a big gap between leader and top 3, and a far smaller one between top 3 and top 5.

But everyone who knows the basic set-up of a tour knows that the race is decided in the mountains, i.e., the Alps and the Pyrenees, which usually come up between stages 11–14 and 16–19, depending on whether the route visits the Alps first or not. As there is often an individual time trial as the last “counting” stage (you might know of the “non-aggression pact” on the final stage), this time trial might switch the leader one last time if the gaps are no bigger than, say, 3′–4′.

This leads to my personal assessment that NIBALI has a far greater chance to win than 50%: his lead is almost 5′ now, and if he can maintain his performance in the remaining stages in the Pyrenees (which is still some way to go), he will be this year’s winner.

I conclude with a parallel box plot for the ranks of the winners of the last 10 years:

(the highlighted winner is LANDIS in 2006, who was found guilty of doping immediately after his phenomenal comeback in stage 17, harming the sport as well as my statistics …)

Brazil vs. Germany

It is bold to post this after 30 mins in the game, but what can you say …

(Just to make sure, this is just meant as an example for effective visualization ;-) )

Tour de France 2014

After the first 4 stages have passed, I will now start to log the results in the usual way, as in 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 and 2013:

Stage Results – Cumulative Times – Ranks (click on the images to enlarge)

  • each line corresponds to a rider
  • smaller numbers mean shorter times, i.e., better ranks
  • all stages are on a common scale
  • stage results and cumulative times are aligned at the median, which corresponds to the peloton
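For those who want to reproduce such rider traces, here is a minimal sketch, assuming a matrix cumtime with one row per rider and one column per stage holding cumulative times in seconds (the object name is made up; the actual data are linked below):

```r
# Align each stage at its median (the peloton) and draw one line per rider.
centered <- sweep(cumtime, 2, apply(cumtime, 2, median, na.rm = TRUE))
matplot(t(centered), type = "l", lty = 1, col = rgb(0, 0, 0, 0.3),
        xlab = "stage", ylab = "cumulative time relative to the peloton (s)")
```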

STAGE 4: KITTEL wins his third stage, but still ranks only 147th
STAGE 5: Two ASTANA riders take the lead
STAGE 6: Yet another German victory, but NIBALI and FUGLSANG still in the lead
STAGE 7: A group of 42 already set apart 3 min, but the mountains are still to come
STAGE 8: After the first hills, KADRI wins and NIBALI can double his lead
STAGE 9: GERMANY got the 4th soccer world cup title  – congratulations!
STAGE 10: NIBALI back in the lead with now almost 2:30
STAGE 11: As the classement does not change too much, let’s look at the dropouts
STAGE 12: GALLOPIN is the loser of the day – first mountains to come next stage
STAGE 13: After the first stage in the Alps, the top 16 spread 11’11”
STAGE 14: Anyone to stop NIBALI? Now 4’37” in the lead
STAGE 15: After a transfer stage, we may look at JI’s Chinese contribution …
STAGE 16: A group of 14 set apart, but cannot endanger NIBALI’s lead
STAGE 17: MAJKA’s great second half of the tour is pushing him >100 ranks
STAGE 18: And yes, NIBALI wins again …
STAGE 19: NIBALI and JI are the “envelope” of the tour
STAGE 20: MARTIN flies to win the time trial – far from a podium finish in Paris
STAGE 21: Another victory for KITTEL, but we close with the trace of the winner!

Don’t miss the data :-)

German Election: 5. The Mystery of the AfD

It is nothing new to see the rise of some populist party right before an election, exploiting some anxiety in the population. With the AfD (Alternative for Germany) it is the easily activated fear of economic decline, potentially caused by the economic solidarity within Europe. The set-up is simple, with an ever-smiling economics professor at the top as some sort of built-in authority for the promoted anti-Euro politics.

To cut a long story short, the AfD almost made it into the German parliament with 4.7%. Recent polls now have them at even 6%. Thus the question must be: “Who actually voted for this party?”

We may want to look at the socio-economic data for the last election and hunt for high correlations with the AfD results. Surprisingly, the only variable that shows a decent positive correlation with the AfD result is the percentage of voters between the ages of 60 and 75.
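A minimal sketch of this correlation hunt in R, assuming a data frame btw with one row per voting district, the AfD share in a column afd and the socio-demographic shares in the remaining numeric columns (the names are made up for this example):

```r
# Correlate every numeric socio-demographic variable with the AfD share.
num   <- sapply(btw, is.numeric)
socio <- setdiff(names(btw)[num], "afd")
r     <- sapply(btw[socio], function(v) cor(v, btw$afd, use = "complete.obs"))
sort(r, decreasing = TRUE)   # only few variables end up with a positive sign
```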

Interestingly, there are far more variables which have a negative correlation, and it shows that education helps – as is often the case … With the socio-demographic variables not being a good indicator, it is worthwhile to look at the geographical distribution of the election result.

What the map shows is quite surprising, and most people I have shown this result to so far doubted that the contiguity of the areas with high AfD results could be “for real”. This budding conspiracy theory can be pushed even further when we look at the map that separates the areas above and below the overall result of 4.7%:

This (fragment of a) shape is too well known in German history (cf. here). But apart from all conspiracies, there is a good approximation of these areas by just selecting German states.

This ensemble of plots shows that the strongholds of the AfD are more or less isolated in only 6 states, either drawing voters from locally organized right-wing groups (like in the east) or attracting voters who are too afraid of losing their well-established “German Gemütlichkeit” by helping out Greece or other troubled euro states (like in Baden-Württemberg and Bavaria).

If you want to dig deeper, here are the data, the map and the software to do so – have fun!


German Election: 4. How Swabia kills Party Leaders

Now it’s time to show some maps. I won’t go through the usual party maps, as you might have seen them over and over again on TV, in newspapers and on the web (in fact, it is impressive what you can get online by now!).

Instead, I want to look at the two losers of the election: the FDP and the Greens. Here are the maps of the losses for each party. The brighter the yellow, the higher the loss for the FDP, and the greener the green (doesn’t this sound lyrical?), the worse the losses for the Greens:


Whereas for the Greens the losses do not seem to be concentrated only in Baden-Württemberg, for the FDP the brightest yellow shines right in the center of this state.

It is even easier to see once you select only Baden-Württemberg and look at the histograms of the losses. I put them on a common scale, which again highlights that the “problem” is far worse for the FDP.


The selected voting districts from Baden-Württemberg are clearly on the left side of the distributions. Once you switch to spineplots, you can see the conditional distribution of the selected state even better. As the biggest losses for the Greens are in Berlin, the leftmost bar is not completely highlighted, as it is for the FDP.
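A minimal sketch of such a linked spineplot in base R, assuming a data frame res with the per-district loss of the FDP in percentage points (fdp_loss) and the state (state) – both column names are made up for this example:

```r
# Bin the losses and show the highlighted (Baden-Wuerttemberg) share per bin.
res$bw  <- factor(res$state == "Baden-Wuerttemberg",
                  levels = c(FALSE, TRUE),
                  labels = c("other states", "Baden-Wuerttemberg"))
res$bin <- cut(res$fdp_loss, breaks = 10)   # same bins as in the histogram
spineplot(bw ~ bin, data = res)
```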


Given these losses, almost all of the party leaders of the FDP and the Greens quit their offices, which to a great extent can be blamed on the Swabian voters …

Stay tuned for the next post, where we look at the AfD, which almost entered parliament, although nobody really knows who actually voted for them.