On Twisters and Killer Tornados

Given the trouble I got into after my post on the Japan earthquake, I probably should stay put when it comes to looking at data on hazardous events …

More seriously, as statisticians (or data analysts in general) we often lack the expertise of the domain experts, who usually collected the data. Today, in a “data everywhere” world, we are in the fortunate position to easily access interesting data from various domains, but probably don’t know much about the background.

Thus I was happy to see the three posts

on Jim’s blog. As Jim has a BS from SUNYA in Atmospheric Sciences, MS from FSU in Meteorology, and a PhD from ISU in Agricultural Meteorology, I am pretty sure he knows enough about tornados to reason beyond speculations.

You can find the data (5.5MB) here to play around yourself, which was compiled from this NOAA website. If you need a tool, you might be happy to use Mondrian.

PS: Jim agreed to write a guest post in the next few weeks, so we might learn a bit more on tornados here soon.

Happy Holidays …

… which is usually Merry Christmas around here and some (still too few) Happy Chanukah.

I had a good laugh when I saw Andrew’s reference to this barchart.

Maybe this is the right way to teach upper management the concept of uncertainty via confidence intervals, as the concept of mistrust is surely well known in these circles.

But to leave you with a bit of Christmas feeling, here is a great version of the ancient Latin Christmas hymn “Veni, Veni Emmanuel!”. Although I am not a particular bluegrass fan, I have to admit that this version is far closer to the original intention than what many well-meaning church choirs around will deliver.

-Enjoy!

Note: The video was recorded on Canon 5D Mark II, which gives you (given the right lenses) impressive depth of field effects – given you have the money to own a 5D MII.

EU Debt Crisis – What Crisis?

Following the news and trying to understand what is going on in the “EU debt crisis” is a hard job and maybe a good visualization can help. At first sight the BBC did it. Eurozone debt web: Who owes what to whom? shows nicely how the most “interesting” debtors and creditors in the EU (spiced up with the US and Japan) are related.

There is also a short explanation to each country’s situation in relation to its GDP right of the graph, but that is full of interpretations and “insights” which hardly match with the figures in the graph.

After spinning around in the debt web, I keyed in all the data, and can now create the debt matrix:

I am not sure how much more I can see now, but at least I see it all at once. Surprisingly (or maybe not surprisingly) the two countries which would trouble me most are not within the Euro-Zone and don’t seem to be part of any concerns: UK and US.

One last graph which looks at the influence of the highlighted countries, which already called for help and thus have quite some potential of defaulting on their debts:

The barchart shows creditors sorted according to the share of troubled debt – though I don’t feel enlightened enough to draw any immediate conclusion from this result … I guess the data shows only a small part of what’s going on and no matter how we visualize it, we are not really getting more insight into the crisis.
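For those who want to redo the sorting themselves, here is a minimal sketch of how the share of troubled debt per creditor can be derived from a debt matrix. The figures and the set of “troubled” debtors below are purely hypothetical placeholders, not the BBC data, and the function name is my own choice.

```python
# Hypothetical amounts (in billions); NOT the figures from the BBC graphic.
# debt[debtor][creditor] = amount the debtor owes to the creditor
debt = {
    "Greece":  {"France": 40.0, "Germany": 20.0, "UK": 10.0},
    "Ireland": {"France": 20.0, "Germany": 80.0, "UK": 100.0},
    "Germany": {"France": 160.0, "UK": 90.0},
    "France":  {"Germany": 80.0, "UK": 200.0},
}

# Assumption for illustration: the debtors treated as "troubled"
troubled = {"Greece", "Ireland"}


def troubled_share(debt, troubled):
    """Return (creditor, share) pairs sorted by the share of a creditor's
    claims that are owed by troubled debtors, largest share first."""
    total = {}
    at_risk = {}
    for debtor, owed in debt.items():
        for creditor, amount in owed.items():
            total[creditor] = total.get(creditor, 0.0) + amount
            if debtor in troubled:
                at_risk[creditor] = at_risk.get(creditor, 0.0) + amount
    shares = {c: at_risk.get(c, 0.0) / total[c] for c in total}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


ranking = troubled_share(debt, troubled)
for creditor, share in ranking:
    print(f"{creditor:8s} {share:6.1%}")
```

Note that a share says nothing about the absolute exposure, so a bar chart of shares should probably be read next to the totals.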

Maybe it takes another post with more / better data …

We know what you like – do you?

It’s been a while since Georgios sent me the link to this interesting “psychogram” of iOS users vs. Android users.


At first I thought the really bad (but maybe also amusing) thing about this “analysis” was the fact that some sample has been pushed through some multivariate statistical procedure and generated some output – many opportunities for failure and no idea about significance. While this kind of “analysis” (how did you find yourself in the two worlds?) might be somewhat frightening, the really frightening thing is the site that generated the data.


Hunch.com is a site which gives you automated recommendations about things you (apparently) like, using some “psychogram” questions and sniffing your social network neighborhood. From a statistical or machine learning point of view the task is clear: classification and prediction; from a personal point of view it might feel a bit disconcerting. Each individual, no matter how smart or dumb, is far more nuanced than the few dimensions set up in the model hunch.com might use. In the end, hunch.com does not do this out of pure altruism, they want to sell you stuff you otherwise would not have bought which makes them put us into categories we probably don’t fit into.

Statistics can be of great help in many places, but we should not actively hand over our interests to the results of some data mining algorithm.

The Good & the Bad [12/2011]

This was not meant to be a Good & Bad, but it turned out that the argument is most effective when it goes beyond pure criticism and actually offers an alternative – so we need a Good.

We find this nice illustration of German energy data at the GE visualization site:
This kind of visualization is quite common now and had its “initial public offering” with “The Baby Name Wizard” by Martin Wattenberg. The stacked display has some issues (which can make it “a Bad”) and it takes a careful construction to make sure it is well readable (it actually “only” needs the right stacking order – if there is one). What struck me about the above graphic was the fact that none of the bands is actually aligned at some sort of straight base – typically the x-axis in a plot. As a consequence it is really hard to tell the story behind the data. Most frustrating, the most recent data jiggles extremely, which makes a judgement of the current trend almost impossible.
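To make the baseline problem concrete, here is a small sketch with hypothetical numbers. It assumes two layouts: a plain stacked display starting at zero, and a display centered around a symmetric baseline (my guess at how such graphics are often built, not a statement about the GE graphic itself). In the first, only the bottom band sits on a straight line; in the second, no band does.

```python
def band_edges(series, centered=False):
    """series: one list of values per band, ordered bottom to top.
    Returns a list of (lower_edge, upper_edge) pairs, one per band."""
    n = len(series[0])
    totals = [sum(band[t] for band in series) for t in range(n)]
    # Plain stacking starts at 0; a centered layout starts at -total/2
    base = [-tot / 2 if centered else 0.0 for tot in totals]
    edges = []
    for band in series:
        top = [b + v for b, v in zip(base, band)]
        edges.append((base, top))
        base = top  # the next band sits on top of this one
    return edges


def is_flat(line, tol=1e-12):
    """True if the line is a straight horizontal base."""
    return max(line) - min(line) < tol


# Hypothetical data: three bands, three points in time.
# The middle band is constant, yet its lower edge moves with the band below.
series = [[4, 6, 8], [2, 2, 2], [1, 3, 1]]

flat_stacked = sum(is_flat(lower) for lower, _ in band_edges(series))
flat_centered = sum(is_flat(lower) for lower, _ in band_edges(series, centered=True))
print(flat_stacked, flat_centered)
```

The constant middle band is the instructive case: its values never change, but in both layouts it appears to rise, because it inherits the movement of whatever lies beneath it.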

It took me a while to get the data out of the visualization, but you can actually download the whole visualization here. My first attempt to understand the data better was using simple time series which I created by “misusing” a parallel coordinate plot:
What we lose is the total, as the series are no longer stacked – though, it was quite hard to judge the total in the original visualization as well. The barchart is used as a reference and shows the most recent distribution. What can we learn from this graph:

  1. Well, there was the oil crisis in 1973 – God knows what would have happened without the crisis stopping this ridiculous greed for oil in the early 70s.
  2. The second oil crisis in 1979 actually had a real impact, as the decline in oil consumption lasted for four years and consumption has stayed on a lower level since – quite contrary to the crisis in 1973.
  3. Germany abandoned half of the brown coal sources shortly after the reunification.
  4. Nuclear energy stalled in 2000 and is now on a (projected) decline.
  5. Renewable energy sources are the only ones with a significant growth, but it still takes a long way to supersede oil and gas.
  6. Coal is declining steadily.

You certainly can read off all these points from the GE visualization, but you probably would need to know these facts beforehand, which is certainly the wrong way round, as a visualization should generate insight and not visualize already existing knowledge.

PS: I tried to find a good stacking order, but after 30min. moving series up and down it looked like there is none.

PPS: There is a quite similar post here

Understanding Area Based Plots: Mosaic Plots

Mosaic plots are the Swiss Army knife of categorical data displays. Whereas bar charts are stuck in their univariate limits, mosaic plots and their variants open up the powerful visualization of multivariate categorical data.

But let’s start with an introductory example. The Titanic data is still the most convincing application of mosaic plots, though many of us saw this example over and over again – I will show other examples as well once we are done with it.

The above example starts with a simple bar chart of passengers by class at the top left, with all surviving passengers highlighted (I guess everybody is familiar with what happened to the Titanic …). The top right plot modifies the bar chart such that we can compare the highlighted proportions, i.e., the bar width (instead of the height) is now proportional to the class size, while the highlighting direction stays the same. We call this plot a spineplot.
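The step from bar chart to spineplot can be sketched in a few lines. The counts are the class totals and survivors from the widely distributed Titanic dataset (the one shipped with R, 2201 persons in total); the function names and the dictionary layout are my own choices for illustration.

```python
# class -> (number of persons, number of survivors), Titanic dataset as shipped with R
counts = {
    "1st": (325, 203),
    "2nd": (285, 118),
    "3rd": (706, 178),
    "Crew": (885, 212),
}


def bar_chart(counts):
    """Bar chart: equal widths, height is the count,
    the highlighted part of the bar is the number of survivors."""
    return {
        k: {"width": 1.0, "height": n, "highlight": s}
        for k, (n, s) in counts.items()
    }


def spineplot(counts):
    """Spineplot: equal heights, width is proportional to the count,
    the highlighted part of the bar is the survival rate."""
    total = sum(n for n, _ in counts.values())
    return {
        k: {"width": n / total, "height": 1.0, "highlight": s / n}
        for k, (n, s) in counts.items()
    }


bars = bar_chart(counts)
spine = spineplot(counts)
for k, tile in spine.items():
    print(f"{k:5s} width={tile['width']:.3f} survived={tile['highlight']:.3f}")
```

Because all spineplot bars have the same height, the highlighted heights are directly comparable as proportions, which is exactly what the bar chart cannot offer.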

With a spineplot, we are almost there for a 2-dim. mosaic plot, shown at the bottom of the above graphic. Now we can derive the general building principle of a mosaic plot. We start with a blank rectangle and recursively split each tile according to the conditional distribution of the variable to add within that tile, e.g., we split the whole according to the distribution of class, and each class according to the second variable – in our case survived.
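The building principle can be written down as a short recursive function. This is only a sketch under simplifying assumptions: the split direction simply alternates with each variable, gaps between tiles are omitted, and the function name and data layout are mine. The counts are again those of the Titanic dataset as shipped with R.

```python
def mosaic(cells, depth=0, rect=(0.0, 0.0, 1.0, 1.0), prefix=()):
    """cells maps a tuple of category levels (one per variable) to a count.
    Returns a dict mapping each full tuple to its tile (x, y, width, height)."""
    n_vars = len(next(iter(cells)))
    if depth == n_vars:
        return {prefix: rect}
    # Conditional distribution of the next variable within the current tile
    level_counts = {}
    for key, count in cells.items():
        if key[:depth] == prefix:
            level_counts[key[depth]] = level_counts.get(key[depth], 0) + count
    total = sum(level_counts.values())
    x, y, w, h = rect
    tiles = {}
    offset = 0.0
    for level, count in level_counts.items():
        share = count / total if total else 0.0
        if depth % 2 == 0:   # even depth: split along the x direction
            sub = (x + offset * w, y, share * w, h)
        else:                # odd depth: split along the y direction
            sub = (x, y + offset * h, w, share * h)
        tiles.update(mosaic(cells, depth + 1, sub, prefix + (level,)))
        offset += share
    return tiles


# (class, survived) -> count
titanic = {
    ("1st", "yes"): 203, ("1st", "no"): 122,
    ("2nd", "yes"): 118, ("2nd", "no"): 167,
    ("3rd", "yes"): 178, ("3rd", "no"): 528,
    ("Crew", "yes"): 212, ("Crew", "no"): 673,
}

tiles = mosaic(titanic)
areas = {key: w * h for key, (x, y, w, h) in tiles.items()}
```

Every tile area equals the relative frequency of its cell, and adding a third or fourth variable only means longer key tuples, since the same variable is used for all splits on a given level.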

Leaving the survival information as highlighting, we can recursively split Class by Age and Gender and get the classical Titanic mosaic plot:

I guess it won’t take you long to find the “Women and Children first!” in the plot …

Now it is easy to see the fundamental difference to tree maps. Whereas in a tree map, we may split each node according to an individual criterion, the “tree” behind a mosaic plot is always fully balanced and the splits on a specific level are always according to the distribution of one fixed variable.

On the highest level, there are basically two general uses of mosaic plots.

  1. Conditional Distributions
    Looking at a single response (like survival in the above example) or an interaction, conditioned on (or given) a set of variables (class x age x sex)
  2. Structural properties of high-dim. categorical data
    Often we need to understand the general structure of a high-dim. categorical dataset in terms of finding empty or very small combinations, the dominating classes, or trends and patterns in the data.
    In this case we can make use of the numerous variations of mosaic plots (see, e.g., here for a Multiple Barchart), which mostly leave the strict area proportional constraint (which we need in 1.) and move to a matrix like layout (see Heike’s paper on more details, or try them out in Mondrian. See also Alex’s RMB-plots as latest contribution to this class of plots.)

Let me give you two more examples of mosaic plots. The first is using longitudinal categorical data on respiratory diseases.

For five points in time we see the different development of the disease depending on gender and kind of treatment, with highlighted cases marking patients with a “good” status. We see the highest discrimination between the treatments for t(2) for female patients and t(3) for male patients, and a decreasing effect for t(4) for both genders.

I will close with showing Simpson’s Paradox with the famous Berkeley admission data using mosaic plots:

The mosaic plot of gender with admitted students highlighted (left) shows clearly that the proportion of admitted students is smaller for females than for males. If we split up by department (lower right plot) the share of admitted students is almost completely balanced for departments B-F and even higher for females in department A.

I leave it to the reader to find a neat verbal explanation of what is going on here (as this post is already way too long …), but so much can be said: it has to do with the proportion of females and males within the different departments.
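For readers who prefer numbers to tiles, here is a small sketch that recomputes the rates. The counts are those of the Berkeley admission data as distributed with R (dataset UCBAdmissions, six largest departments); the helper names are my own.

```python
# department -> gender -> (admitted, rejected), as in R's UCBAdmissions dataset
admissions = {
    "A": {"male": (512, 313), "female": (89, 19)},
    "B": {"male": (353, 207), "female": (17, 8)},
    "C": {"male": (120, 205), "female": (202, 391)},
    "D": {"male": (138, 279), "female": (131, 244)},
    "E": {"male": (53, 138), "female": (94, 299)},
    "F": {"male": (22, 351), "female": (24, 317)},
}


def rate(admitted, rejected):
    """Admission rate for one group."""
    return admitted / (admitted + rejected)


def overall_rate(gender):
    """Admission rate for one gender, pooled over all departments."""
    admitted = sum(dept[gender][0] for dept in admissions.values())
    rejected = sum(dept[gender][1] for dept in admissions.values())
    return rate(admitted, rejected)


overall = {g: overall_rate(g) for g in ("male", "female")}
by_dept = {
    name: {g: rate(*dept[g]) for g in ("male", "female")}
    for name, dept in admissions.items()
}
female_higher = [name for name, r in by_dept.items() if r["female"] > r["male"]]

print(f"overall: male {overall['male']:.3f}, female {overall['female']:.3f}")
for name, r in by_dept.items():
    applicants = {g: sum(admissions[name][g]) for g in ("male", "female")}
    print(name, f"male {r['male']:.3f}", f"female {r['female']:.3f}", applicants)
```

The printed applicant counts per department are the part to look at when searching for the explanation.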

MacOSX Lion: King of OS’s GUIs

Mac OS X Lion is now the 7th incarnation of Apple’s new operating system. Each of the version upgrades brought minor additions to the graphical user interface (GUI). None of the increments really had a big impact on how we used the OS – at least for me, things like Exposé, Spaces or the Dashboard were functions I used once in a while, but they didn’t really add to my productivity.

With Mission Control, we now have all things in one place, and it is only the next swipe away to reach the desired functionality. I think this is a good example that often we only lack the last missing link to get to the point where the UI functions fall into place – all of the single functions were released in previous OS releases, but only now is it completely natural to use them all – and not just once in a while.

There is certainly the “one more thing” regarding UI changes in Lion: Natural Scrolling. Just search for the comments you find on the web – they range from “Apple’s ‘natural scrolling’ feels horribly unnatural. Here’s why.” to “Wow, Everyone’s Complaining About “Natural Scrolling” In OS X Lion”. Well, to be honest, it took me a few days to adapt as well, but once you are “over it”, it just works fine (even switching back and forth between the scroll wheel on my Win PC at work and my Mac at home). It is amazing how conservative people are regarding the way they use their computer – even if it is wrong. If we did it wrong for ten years, it has to stay that way … And there is no doubt about the fact that there is really no physical metaphor behind the direction we used the scroll wheel so far – someone just programmed it this way and we used it.

Removing the scroll bars seems to be a comparatively small interference to the user’s expectation – still having enough potential to stir users up.

To sum up, with Lion we see how little progress we made with UI improvements in the last decades – but if we really leap forward, we feel the resistive force in the user base …

The Good & the Bad [7/2011]

This time it is easy to make a point; not because my advice for improvement is so well thought out and fine-tuned – no, just because “The Bad” is so convincingly bad. You find it here at slideshare, called “The Razorfish Social Influence Marketing Report”. Figure 1 on page 10 looks like this:

I would call it the most fluffy pie chart I have ever seen (and when I say fluffy, I mean fluffy – ask Agnes). We have been talking about 3-d effects, projection problems, wild use of colors or transparency misuse … but this one really tops them all, as almost everything is wrong about this chart. It deserves a seat in the hall of shame of pie charts!

My “good” is more Tufte-style as it shows neither axes nor annotated values, but only proportional areas and class labels:

Enjoy! (Thanks to Marco for this great example)

Statistical Graphics vs. InfoVis

The current issue of the Statistical Computing and Graphics Newsletter features two invited articles, which both look at the “graphical display of quantitative data” – one from the perspective of statistical graphics, and one from the perspective of information visualization.

Robert Kosara writes from an InfoVis view: 

Visualization: It’s More than Pictures!

Information visualization is a field that has had trouble defining its boundaries, and that consequently is often misunderstood. It doesn’t help that InfoVis, as it is also known, produces pretty pictures that people like to look at and link to or send around. But InfoVis is more than pretty pictures, and it is more than statistical graphics.

The key to understanding InfoVis is to ignore the images for a moment and focus on the part that is often lost: interaction. When we use visualization tools, we don’t just create one image or one kind of visualization. In fact, most people would argue that there is not just one perfect visualization configuration that will answer a question [4]. The process of examining data requires trying out different visualization techniques, …

read on in the Newsletter.

Andrew Gelman and Antony Unwin write from a statistical graphics view:

Visualization, Graphics, and Statistics

Quantitative graphics, like statistics itself, is a young and immature field. Methods as fundamental as histograms and scatterplots are common now, but that was not always the case. More recent developments like parallel coordinate plots are still establishing themselves. Within academic statistics (and statistically-inclined applied fields such as economics, sociology, and epidemiology), graphical methods tend to be seen as diversions from more “serious” analytical techniques. Statistics journals rarely cover graphical methods, and Howard Wainer has reported that, even in the Journal of Computational and Graphical Statistics, 80% of the articles are about computation, only 20% about graphics.

Outside of statistics, though, infographics and data visualization are more important. Graphics give a sense of the size of big numbers,  …

… read on in the Newsletter.

You will be surprised about the amount of consensus, as well as the topics of dispute – both might probably not match your expectation, but can be a start of an open discussion.

This blog post shall be the platform for this discussion and we are looking forward to reading your comments …

Tour de France 2011

— that’s it for this year, see you in 2012 (the latest) - au revoir! —
(With now 7 years of full Tour de France data, I might start to compare the different tours on a more “global” level.)

Again, the Tour de France has to compete with the soccer world championship – ok, this time it’s the girl’s turn and the attention is somewhat smaller …

Although I “missed” the first 5 stages, I will start to log the results in the usual way as in 2005, 2006, 2007, 2008, 2009 and 2010.

Plots (by stage and in total): Stage Results, Cumulative Time, Ranks

- each line corresponds to a rider
- smaller numbers are shorter times, i.e. better ranks
- all stages are on a common scale,
- stage-results and cum-times are aligned at the median, which corresponds to the peloton
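The median alignment and the ranks are simple to reproduce. The following is only a sketch with hypothetical riders and times (in seconds), not the script used for the plots; the function names are mine.

```python
from statistics import median


def align_at_median(times):
    """Shift all times so that the median rider (the peloton) sits at 0.
    Negative values are riders ahead of the peloton, positive ones behind."""
    m = median(times.values())
    return {rider: t - m for rider, t in times.items()}


def ranks(times):
    """Rank 1 is the shortest time; ties keep their input order."""
    ordered = sorted(times, key=times.get)
    return {rider: i + 1 for i, rider in enumerate(ordered)}


# Hypothetical stage times in seconds: one escapee, a peloton, one straggler
stage_times = {
    "rider_a": 14400,
    "rider_b": 14405,
    "rider_c": 14405,
    "rider_d": 14405,
    "rider_e": 15000,
}

aligned = align_at_median(stage_times)
stage_ranks = ranks(stage_times)
print(aligned)
print(stage_ranks)
```

Since most riders finish with the same time as the main group, the median is a natural reference point: the bulk of the lines collapses onto zero and only the riders who gained or lost time stand out.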

STAGE 6: Still a very compact group of 18 riders at the top of the field
STAGE 7: Easy riding
STAGE 8: The first mountains shrink the top group to 11
STAGE 9: Thomas VOECKLER now almost 2′ ahead
STAGE 10: No changes in the Top 42
STAGE 11: Still waiting for the Pyrenees
STAGE 12: Team Leopard-Trek led by the SCHLECK brothers with 5 in the top 44
STAGE 13: Vincent JEROME lost his last place for the first time
STAGE 14: Thomas VOECKLER still almost 2′ in front
STAGE 15: Here are the 28 drop-outs so far, as the ranks almost stay unchanged
STAGE 16: HUSHOVD wins and TEAM GARMIN takes the lead
STAGE 17: The worse riders are all collected into the peloton
STAGE 18: Andy SCHLECK rushes (almost) to the top! Still 4 within 1′ reach!
STAGE 19: VOECKLER can’t defeat the SCHLECK brothers but still in reach
STAGE 20: EVANS too fast for Andy Schleck but not for MARTIN
STAGE 21: Congratulations to Cadel EVANS

(Note: The official results for stage 2 are missing, and thus are calculated from the differences of the total times)
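Reconstructing a missing stage from the total times amounts to a plain difference per rider. A minimal sketch with hypothetical riders and times in seconds follows; it assumes nothing about the official data format, and if the totals contain time bonuses or penalties, the difference will contain them as well.

```python
def stage_times_from_totals(total_before, total_after):
    """Per-rider stage time as the difference of two cumulative times."""
    return {
        rider: total_after[rider] - total_before[rider]
        for rider in total_after
        if rider in total_before  # skip riders without an earlier total
    }


# Hypothetical cumulative times (seconds) after stage 1 and after stage 2
after_stage_1 = {"rider_a": 16000, "rider_b": 16006, "rider_c": 16100}
after_stage_2 = {"rider_a": 17500, "rider_b": 17510}  # rider_c dropped out

stage_2 = stage_times_from_totals(after_stage_1, after_stage_2)
print(stage_2)
```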

For those who want to play with the data. The graphs are created with Mondrian.

There is a more elaborate analysis of Tour de France (2005) data in the book.

Of course – as every year – a big thanks to Sergej for updating the script!

A Design Classic demystified?

Mr. Beck’s London Tube map is a real design classic. Besides the timeless and universal design, the chosen geographical distortion has always been a point of discussion.

At fourthway [via infosthetics] we find a nice animation between the “real map”, which is geographically correct and the stylized map, which is optimized for reading and aesthetics.

Here are the two versions:

At first glance, the very nice animation might give us the impression that Beck’s creation is really of no help as the geographically correct map is still nicely readable and gives us so much more insight of where we actually are or go.

So did Beck err, and we are being fooled by strange subway maps around the world for no reason? Certainly not!

The answer is quite simple. The clip the guys at fourthway use is a quite small part of London’s inner city. Thus the average distance between stations has a relatively small variance (and is small itself), and – as we see – it does not make much of a difference which version of the map we look at.

Taking the current complete map and shading the chosen clip in it, shows how much of the subway network is not covered:

I am too lazy now to get the real distances fixed, but only looking at the fare zones (as a proxy for distance) shows us that the clip covers mainly only fare zone one, and almost any line extends to at least zone 4, some even as far as 9.

The full story of a geographically correct London Tube map looks like this

and can be found at Wikimedia along with some more detailed maps.

Now it is very easy to understand why Mr. Beck really had a brilliant idea in choosing this particular design which was consequently adopted in subways around the world.

R GUIs: Which one fits you?

The gap of the new “digital divide” between those who only use computers when they are as easy to use as iPads and smartphones and those who like (or at least accept) to type commands to perform jobs, seems to get bigger and bigger.

R – the lingua franca of statistical computing – is exactly such a command-line based language, reasonably well designed but still not GUI based at all. At this point GUIs are the only solution to make R accessible for “generation point-and-click” and bridge the divide.

Personally, I am happy to use any well-designed GUI, but I also see the power of language-based command line interfaces – you need to work with both to be most effective.
But let’s come to the comparison of the four different frontends for R (in lexicographic order) which try to do more than the built-in standard GUIs for the supported platforms:


              JGR            RCommander     RKWard         R Studio
Technology    JAVA           tcl/tk         KDE            Qt
Platform
Installation  simple         simple         painful        easy
Approach      IDE            comprehensive  comprehensive  IDE
Interface     SDI            MDI (plus R)   TDI            MDI
Maturity      1.7-5          1.6-3          0.5.5          0.92.44
Console       yes            yes            yes            yes
CodeEditor    yes            no             yes            yes
Objbrowser    yes            no             yes            yes
DataEditor    yes            via fix()      yes            no
ModelBrws     yes            no             no             no
Logging       console        console        extended       console
Plugins       via iWidgets   yes            yes            no
Web-Client    no             no             no             yes

There are certainly more frontends and features (especially on the technical side) to consider, and not everybody will share my verdict on every point (which I probably didn’t even get completely right), but that’s what comments are for  …

My summary recommendations (regarding the four candidates) are:

  • working styles are very different such that many of the above mentioned issues may be pointless
  • for many of us the built-in GUIs are pretty good already, but differ from platform to platform (so you maybe want to avoid any further hassle)
  • if you are on a Mac, half of the choices are gone already …
  • those who really don’t like “being helped” by their software should opt for the IDE approaches!
  • those who really don’t want to learn any of the R-syntax and are purely on a user level, use one of the comprehensive approaches – you still might not be too happy though
  • if you hate installation procedures, make sure to avoid RKWard (under Windows)
  • the sleekest GUI is definitely R Studio
  • if the webpage were wider, I would certainly have mentioned Deducer, which is a comprehensive offspring of JGR.