German Election 2013: 4. How Swabia Kills Party Leaders

Now it’s time to show some maps. I won’t go through the usual party maps, as you have probably seen them over and over again on TV, in newspapers and on the web (in fact, it is impressive what you can find online by now!).

Instead, I want to look at the two losers of the election: the FDP and the Greens. Here are the maps of the losses for each party. The brighter the yellow, the higher the loss for the FDP, and the greener the green (doesn’t this sound lyrical?), the worse the losses for the Greens:

Whereas for the Greens the losses are not only concentrated in Baden-Württemberg, for the FDP the brightest yellow shines in the center of this state.

It is even easier to see once you select only Baden-Württemberg and look at the histograms of the losses. I put them on the same scale, which again highlights that the “problem” is far worse for the FDP.

The selected voting districts from Baden-Württemberg are clearly on the left side of the distributions. Once you switch to spineplots, you can see the conditional distribution of the selected state even better. As the biggest losses for the Greens are in Berlin, the leftmost bar is not completely highlighted as it is for the FDP.

Given these losses, almost all of the party leaders of the FDP and the Greens resigned from office, which to a great extent can be blamed on the Swabian voters …

Stay tuned for the next post, where we look at the AfD, which almost entered parliament, although nobody really knows who voted for them.

German Election 2013: 3. Structural Considerations

German reunification has now been under way for almost a quarter of a century. One might think that by now it would be hard to trace the artificial division resulting from WWII, as structural features rooted in centuries of common history should be overruling what was a 40-year political intermezzo.

Not so, when you look at these graphs, based on the 2013 election results and the accompanying socio-economic data:

The boxplot shows the share of people who did not get any school degree. Unfortunately, this measure still divides Germany into two parts.

Not having any school degree is also a good predictor of unemployment, which we can read from the scatterplot on the left. A bit surprising is that no matter how poorly people are educated, at a certain point unemployment hardly rises any further – as can be seen from the lowess smoother.

The right scatterplot shows the impact of the unemployment rate on the result of the former communist party (“Die Linke”). Again, we see a strict divide between east and west (even within Berlin). The interesting thing though (which is almost certainly by chance) is that for each percentage point of unemployment, the communists gain 0.5% of the votes – no matter whether you are in the west or the east. The funny thing is that this party calls for full employment, organized by the state, which – using this model 🙂 – would leave them with the worst possible result.
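The roughly half-a-point-per-percent relationship is just the slope of an ordinary least-squares fit. Here is a minimal sketch; the data points below are hypothetical, constructed only to mimic the slope read off the plot:

```python
# Ordinary least squares by hand: slope of vote share (%) on unemployment (%).
# The six data points are made up, chosen to mimic the ~0.5 slope in the plot.
unemployment = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
vote_share   = [3.1, 4.0, 5.2, 5.9, 7.1, 8.0]

n = len(unemployment)
mean_x = sum(unemployment) / n
mean_y = sum(vote_share) / n
slope = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(unemployment, vote_share)) / \
        sum((x - mean_x) ** 2 for x in unemployment)
intercept = mean_y - slope * mean_x

print(round(slope, 2))  # prints 0.49 -- about half a point of votes per % unemployment
```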

Stay tuned for the next post, which will show how the little folk of Swabia killed the party leaders of the FDP and the Greens.

German Election 2013: 2. Long & short term prognosis

While the final results have not yet been published by the “Bundeswahlleiter”, I was curious to see how accurately the polling institutes forecast the election result. Obviously the last polls were quite a bit off from the final results, and even the initial projections from the first counted voting districts were not too accurate.

Here is the simple visualization for infratest dimap and the CDU/CSU, which is not too different from the other institutes. In the end, being 1–2% off isn’t that bad for a forecast, but it is too much when the results are as close as they were last night.

(Thanks to the guys at for compiling all the data, and sorry for the bad x-scale from MS-Excel)

“Here’s to the crazy ones,…”

Today Marcel Reich-Ranicki died in Frankfurt, Germany, at the age of 93. Having survived the Warsaw Ghetto and lost all his family in German concentration camps, it was anything but self-evident for him to stay in post-war Germany and believe in the culture of the “nation of poets and philosophers”. He did, and he never stopped telling us how to move forward (not only in literature).

His relentless way of criticizing should be an inspiration for all of us and encourage us to call something “bad” (or even worse) when it actually is. Maybe Robert’s section on criticism of visualizations is a step in the right direction, and hopefully some of my “Good and Bad” posts encourage us to move forward by learning from what was not good.



German Election 2013: 1. Strongholds

The German election is only four weeks away, so it might be worthwhile to take a look at the historic data from the elections in 2002, 2005 and 2009. Unfortunately, the voting districts are anything but stable, such that a direct comparison is not trivial.

The maps show the voting districts which were won by either CDU/CSU, SPD or Die Linke (the FDP and Bündnis 90/Die Grünen didn’t win any districts in the last three Bundestagswahlen). The lightest shade indicates “won in one election” … the darkest shade indicates “won in all three elections”.
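The shading rule is nothing more than a win count per district over the three elections. A minimal sketch of that count; the district names and winners below are hypothetical:

```python
# Count, per district, how often each party won across the three elections;
# the win count (1..3) maps to the map's shade, lightest to darkest.
# District names and winners are hypothetical.
results = {
    2002: {"District A": "CDU/CSU", "District B": "SPD"},
    2005: {"District A": "CDU/CSU", "District B": "SPD"},
    2009: {"District A": "CDU/CSU", "District B": "Die Linke"},
}

wins = {}  # (district, party) -> number of elections won
for year, districts in results.items():
    for district, winner in districts.items():
        key = (district, winner)
        wins[key] = wins.get(key, 0) + 1

print(wins[("District A", "CDU/CSU")])  # prints 3 -> darkest shade
print(wins[("District B", "SPD")])      # prints 2 -> medium shade
```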

What we see immediately is:

  • The south is clearly dominated by CDU/CSU
  • The only stronghold of the SPD is the center, i.e. Hesse and the Ruhr area
  • Die Linke managed to get hold of parts of the former east

For the two major parties the message is thus relatively clear: the CDU/CSU should battle to win the north against the SPD, and the SPD must battle to win the east from Die Linke.


The new Digital Age – Love it, or Hate it

You might ask yourself what a book by Google chairman Eric Schmidt is doing on a statistics blog. Well, Google’s success was built on doing the “right statistics” on the “right data” at the “right time”.

And not to mention Hal Varian (Google’s Chief Economist) who said: “I keep saying the sexy job in the next ten years will be statisticians.”

In the end, Google makes its money with (our) data, and that is exactly the stuff statisticians analyze and visualize.

But let’s take a look what’s actually inside:

The book consists of 7 chapters, each telling us something about the future of something – ranging from “Our Future Selves” via “The Future of States” to “The Future of Reconstruction”.
“Our Future Selves” reads like a science fiction story, which would be fun if it weren’t for the business case Google already has in mind. Readers should decide for themselves whether they would like to wear shoes that vibrate when it is time to get up from breakfast and go to work, or “drive” a driverless car that optimizes its route to work automatically. After all, humans are special, among other things, because they can acquire knowledge and skills – which in Schmidt’s future will be obsolete, as machines and algorithms take over.

I was a bit reminded of Jacques Tati in “Mon Oncle”, perfectly alienated, getting lost in the optimized and engineered world of aspiring post-war France:

It is hard to argue with Schmidt when it comes to all the changes in politics and society in general caused by “being connected”. These changes will happen (by the way, “will” is the most frequently used word in this book – more frequent than in any apocalyptic scripture in the Bible), and they are here to stay.

But from a data perspective there is more at stake. The NSA scandal showed what happens when organizations and companies go haywire with our data, and the buzzword “big data” has also put statisticians on alert. There are limits that need to be respected; limits that also limit the stock market price of Google – something we need to keep in mind when we read Schmidt’s book.

Making Movies

Making Movies is not only the name of an album by Dire Straits, but also the invitation of the ASA Statistical Graphics Section to enter the video competition. You might find the link a bit late (which I can’t dispute, but most creatives prefer to deliver “last minute”, so there is probably still some time left …), but it is actually not the direct reason for this post.

Inspired by the video competition, Antony sat down and actually created not one but three movies of interactive graphics in action – not intended to go into the competition but to motivate others to either use these methods or to create their own case study videos:

1. Titanic

(here is the data)

2. Decathlon

(here is the data – thanks to the excellent decathlon site)

3. Tour de France 2013

(here is the data – make sure to double click times tagged with a barchart to convert them to continuous variables when using Mondrian)


If you feel inspired by what you see, or have your own case study you want to present in Mondrian, go capture your screen and post it via, e.g., DropBox. The best movies will be added to the Mondrian video library and rewarded with a signed copy of “Interactive Graphics for Data Analysis – Principles and Examples“.

If you don’t feel like being your own movie director, you might still want to download the data and redo what Antony did …

Tour de France 2013

Welcome to the Tour de France No. 100!

Now that the first 6 stages have passed, I will start to log the results in the usual way, as in 2005, 2006, 2007, 2008, 2009, 2010, 2011 and 2012:

Stage Results – Cumulative Time – Ranks
Stage – Total – Rank
(click on the images to enlarge)

– each line corresponds to a rider
– smaller numbers are shorter times, i.e. better ranks
– all stages are on a common scale
– stage results and cumulative times are aligned at the median, which corresponds to the peloton

STAGE 6: GREIPEL wins the stage, but he is still far behind
STAGE 7: Still a crowd of 56 riders very close at the top
STAGE 8: FROOME uses the first mountain arrival to grab the yellow jersey
STAGE 9: The last day in the Pyrenees caused 5 of the 16 dropouts so far
STAGE 10: KITTEL wins his second stage, but is almost 2h behind FROOME
STAGE 11: Tony MARTIN recovered from his crash and beat FROOME by 12”
STAGE 12: KITTEL wins his 3rd stage but is far behind FROOME’s yellow jersey
STAGE 13: Amazing what you can do with Clenbuterol … CONTADOR 3rd by now
STAGE 14: The top 10 stays together, as the stages in the Alps will finalize the tour
STAGE 15: FROOME’s ride is a bit like that of LANDIS in the 2006 Tour … let’s see
STAGE 16: As there is no change at the top, let’s look at the compact group of the last 7
STAGE 17: Can CONTADOR’s team push him over the Alps faster than FROOME’s?
STAGE 18: FROOME now 5′ in front after the legendary Alpe d’Huez stage
STAGE 19: Only a crash can stop FROOME – no change within the top 7
STAGE 20: Another stage with no significant change
STAGE 21: As usual – the trace of the winner (and hey, yet another stage for KITTEL!)

And not to forget the big thanks to Sergej, who helped with the scripts!

Oh, I almost forgot to give you the data 🙂

Modern – What?

This is what I got in the mail some days ago …

Modern What

Hmm, if these are the modern statistical tools and techniques, what are the statistical tools and techniques of the past?

Oh, btw Mondrian turns 15 these days (and I struggle to get version 1.5 finished) … which makes it almost as modern as R.

The L’Aquila earthquake – Could we have known better?

It took a while until the December issue of “Significance” was shipped and I finally got some time to read it, but the article by Jordi Prat, “The L’Aquila earthquake: Science or risk on trial”, immediately caught my attention. Besides the scary fact that you may end up in jail as a consulting statistician, it was Figure 1 that struck me:

Reproduction from Figure 1

Even as a statistician who always seeks exploration first, I was wondering what a simple scatterplot smoother estimating the intensity would look like, and whether or not it would have been an indicator of what happened (or might have happened).

Spline estimate of average magnitude

Looking at a smoothing spline with 4 degrees of freedom, fitted separately to all measurements before and after the earthquake, we see a sharp rise and a narrowing confidence band before April 6th, 2009. As I am not a geologist, I can only interpret the raw data in this case, which I think should have alerted scientists and officials alike.

Naturally, we are always wiser after the event has actually happened, so let’s look at the estimates (I use a loess smoother with a span of 0.65 here) we get three weeks, one week and one day before the disastrous earthquake on April 6th.

Three estimate with varying horizon

Whereas three weeks before the quake things seemed to calm down again, one week before the quake the smoother starts to rise – and not only due to the 4.1-magnitude quake on March 30th. One day before the disaster, the gradient rises strongly.
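The refit-at-each-horizon idea can be sketched without the actual loess fits or data from the post. Below, a crude moving-average smoother stands in for the loess, refitted on the observations available at each horizon; the magnitude series is made up for illustration, with the endpoint of the smoother rising as the quake approaches:

```python
# A crude stand-in for the loess fits in the post: smooth the magnitude series
# with a simple trailing moving average, refitting on the data available at
# each horizon. The magnitudes below are made up; the real data come from the
# Italian seismological database mentioned in the post.
magnitudes = [2.1, 2.0, 2.3, 2.2, 2.1, 2.4, 2.3, 2.6, 4.1, 2.8, 3.0, 3.2, 3.5]

def smoothed_endpoint(series, window=5):
    """Mean of the last `window` observations -- the smoother's current level."""
    tail = series[-window:]
    return sum(tail) / len(tail)

# Re-estimate with the data known three weeks, one week, and one day before.
for horizon, cutoff in [("three weeks", 7), ("one week", 10), ("one day", 13)]:
    level = smoothed_endpoint(magnitudes[:cutoff])
    print(f"{horizon} before: smoothed level {level:.2f}")
```

The point is only the mechanism: each shorter horizon adds the most recent observations, and the smoother’s endpoint climbs accordingly.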

A simple zoom-in on a histogram supports the fatal impression of an apparent slowdown in activity a few weeks before the earthquake.

A Histogram of Earthquake activity

Let me stop speculating here, but it lets me rest a bit more assured (as a statistician) that relatively simple data and methods do show that a stronger event might have been quite close on the evening of April 5th, 2009.

I got the data from the Italian Seismological Instrumental and Parametric Data-Base and it can be accessed here. There are many articles on the web regarding the case and conviction – I only want to point here for further discussion.


Global Warming: Causality vs. Timeframes

The weather site wetter-online pointed me to the latest global temperature anomalies, which made me think about this post. Everybody knows that worldwide temperatures are rising. The concentration of CO2 is rising as well, literally fueled by the burning of fossil fuels. OK, here is the proof that rising CO2 levels correspond to rising temperatures:

Temp vs. CO2 for 1970 to 2000

As CO2 is the “most famous” greenhouse gas and thus causes temperatures to rise, the whole thing fits – at least for the timeframe we are looking at, which is 1970 to 2000. (OK, putting the two quantities on separately selected scales is a bit cheesy, but this is how the media sell these topics to us …)

Looking at the temperatures from 1880 to 2013 alone gives rise to new questions when we look at the last decade:

Global Temperature 1880 to 2013

Looking at the smoothing spline for the monthly data, we see that global warming has stalled for almost a decade now – temperatures even seem to fall slightly.

From the first plot we know that CO2 concentration rises steadily, even in the last decade. So let’s take a look at the correlation between global temperatures and CO2 concentration.

Global Temperature vs CO2 concentration

A simple linear regression for the years 1953 to 2003 (red dots) supports the causal relationship, with an R² of 0.61. Temperatures rose roughly 0.01 degrees centigrade per 1 ppmv. This is an easy-to-use model, which leads to apocalyptic temperatures when projected some decades into the future, as CO2 concentration currently rises by roughly 2 ppmv per year.

Looking at the timeframe of 2003 to 2013 (green dots), the linear trend is slightly negative, with an R² of 0.005, which leads to the conclusion that CO2 does not really have an influence on global temperatures right now. Brushing over an arbitrary decade shows that this change is really unique since the mid-60s:
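The back-of-the-envelope arithmetic behind the “apocalyptic” projection is just the fitted slope times the assumed CO2 growth rate, both taken from the numbers above:

```python
# Back-of-the-envelope projection using the numbers from the 1953-2003 fit:
# roughly 0.01 degrees C per ppmv, and CO2 currently rising ~2 ppmv per year.
DEG_PER_PPMV = 0.01   # slope of the linear fit (from the post)
PPMV_PER_YEAR = 2.0   # current growth of CO2 concentration (from the post)

def projected_warming(years):
    """Temperature rise implied by naively extrapolating the linear model."""
    return DEG_PER_PPMV * PPMV_PER_YEAR * years

print(projected_warming(50))  # prints 1.0 -- a full degree over five decades
```

Of course, this is exactly the kind of naive extrapolation the 2003–2013 data warns against.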


But what is the conclusion now? Is the whole CO2 story bogus? The answer is a clear maybe. Doubtlessly, it is stupid to burn the very limited resources of fossil fuels at the rate we are doing right now – especially since fracking became the alleged salvation for all global energy problems. There is no way around using renewable energy sources, which are CO2 neutral and thus no threat to the climate.

Do we fully understand the changes in global climate? A clear “no”. The timeframe for which we have reliable data is so short compared to the timescales on which global climate changes occur that it is hard to draw final conclusions from what happened in the last decades. Nonetheless, we can stop doing stupid things and, e.g., sell our gas-guzzling SUV tomorrow!

(Thanks to D Kelly O’Day’s blog at which was inspiration and guide to data as well. The data are taken from GISS and NOAA.)

Understanding Area Based Plots: Trellis Displays

This is the third and last post on area-based plots. “Area based” was certainly true for tree maps and mosaic plots, but falls a bit short for trellis displays, for which the term “grid based” would be more suitable. Nonetheless, all three plot types use conditioning within their core definition, and the layout of the plot elements is more or less done on a grid, so the similarity is clear.

Trellis displays (users of R will know them as lattice graphics) were invented by Bill Cleveland in the early to mid 90s – first as so-called co-plots, and later as Trellis Displays within the S-Plus package.
The basic idea is pretty simple: we use categorical variables to systematically condition the plot we want to look at in the first place. Let’s look at an example:
Scatterplot MPG vs. 0-60
This first plot is nothing more than a scatterplot of the cars data I already used in a previous post. The trellis display now conditions the plot on the car type:
A trellis display conditioned by car type
The plot makes all the more sense when we add an estimate of the functional relationship between the two quantities. Let’s start with a linear estimate:
A Trellis Display with linear estimate
In general, you can use up to three variables to condition on (one in the rows of the trellis, one in the columns, and one via colors), and two variables in the so-called panel plot, i.e., the plot which is drawn for each conditioned subset.
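The conditioning idea can be sketched without any plotting library: split the rows on the categorical variable and fit the panel estimate – here a linear one – per subset. The car types and numbers below are hypothetical, loosely mimicking the cars data:

```python
# Trellis-style conditioning by hand: group rows by a categorical variable
# (car type) and fit a linear panel estimate (MPG vs. 0-60 time) per group.
# The data are hypothetical, loosely mimicking the cars example.
rows = [
    ("Sporty", 6.0, 24.0), ("Sporty", 7.0, 26.0), ("Sporty", 8.0, 29.0),
    ("Large", 10.0, 18.0), ("Large", 11.0, 19.0), ("Large", 12.0, 21.0),
]

panels = {}  # car type -> list of (0-60 time, MPG) pairs
for car_type, accel, mpg in rows:
    panels.setdefault(car_type, []).append((accel, mpg))

def linear_fit(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = sum((x - mx) * (y - my) for x, y in points) / \
            sum((x - mx) ** 2 for x, _ in points)
    return slope, my - slope * mx

for car_type, points in sorted(panels.items()):
    slope, intercept = linear_fit(points)
    print(f"{car_type}: MPG = {intercept:.1f} + {slope:.2f} * (0-60 time)")
```

A real trellis layout then simply arranges one such fitted panel per subset on the grid.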
The above example is rather simple; hardcore trellis users will use this plot type extensively for advanced model diagnostics, but that would be too much for this post.
Personally, I would handle the above example in an interactive setting, which allows you to select any subgroup you like:
The data from the trellis plot in an interactive setting
This is what it looks like in Mondrian.