The Good & the Bad [9/2010]

This is quite an unusual Good & Bad posting, as it does not refer to some extraordinarily bad graph, but just wants to show some additional aspects of a dataset, compared to the original visualization found on Kaiser’s Junk Charts.

The comments on Kaiser’s post mainly picked on the variability of ranks, so I set off to the US Bureau of Labor Statistics to get the (raw, though seasonally adjusted) yearly data. Here is what it looks like for the last 10 years with the US total rate highlighted:

As we can see from Kaiser’s post, the ranks end up as a big zig-zag, so I leave this graph out for now. To make the variability and range a bit easier to judge, here is the corresponding plot based on boxplots.

The difference between the medians and the US average is a hint that, in years with higher unemployment, larger states seem to be hit more severely.

But what does the data look like when we use the US average as the reference? The following figure centers all data around the US average, retaining the same scale but using a different shift for each year:
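A minimal sketch of this centering, assuming the rates are held as one dictionary per year. The state labels and all numbers are made up for illustration and are not the BLS figures; the function name is hypothetical as well.

```python
def center_on_reference(rates_by_year, reference_by_year):
    """Shift each year's rates so that the reference (here: the US average) sits at 0.

    Only a per-year shift is applied, so the scale (percentage points) is unchanged.
    """
    centered = {}
    for year, rates in rates_by_year.items():
        shift = reference_by_year[year]
        centered[year] = {
            state: round(rate - shift, 2) for state, rate in rates.items()
        }
    return centered


# Hypothetical values, not the actual BLS data.
rates = {
    2007: {"State A": 4.0, "State B": 5.5, "State C": 7.0},
    2009: {"State A": 7.5, "State B": 9.0, "State C": 13.5},
}
us_average = {2007: 4.6, 2009: 9.3}

centered = center_on_reference(rates, us_average)
for year in sorted(centered):
    print(year, centered[year])
```

Positive values are states above the US average in that year, negative values are states below it.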

Now we can see more clearly that there are “winners” and “losers” within the evolving crisis starting in 2008. I highlighted three states that stick out from the rest. Alaska does not do very well in the first years, but also does not seem to be hit by the crisis very much. Nevada improved to a top 10 state by 2004, but fell behind starting in 2005 and was hit by the crisis most severely. Finally, Michigan worsened steadily, with the oncoming crisis not really making things much worse:

Being down to only three states of interest, ranks seem to be the ideal view to show the ups and downs of the unemployment rate:
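For readers who want to compute such ranks themselves, here is a small sketch. It assumes rank 1 means the lowest unemployment rate and that tied states share the smaller rank; both are choices made for this example, and the numbers are invented.

```python
def rank_within_year(rates):
    """Rank states by unemployment rate; rank 1 is the lowest rate.

    Tied states share the same (minimum) rank.
    """
    ordered = sorted(rates.values())
    return {state: ordered.index(rate) + 1 for state, rate in rates.items()}


# Hypothetical rates for two years, not the actual BLS data.
by_year = {
    2004: {"State A": 7.4, "State B": 7.0, "State C": 4.4},
    2009: {"State A": 8.0, "State B": 13.6, "State C": 11.8},
}

ranks = {year: rank_within_year(rates) for year, rates in by_year.items()}
for year in sorted(ranks):
    print(year, ranks[year])
```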

The post is already way too long, so I leave you with the data (incl. map) and the software to play around on your own and find some more interesting facts …

(Note: The data is not identical to the data Kaiser used, so there are smaller differences in the plots. The currently released version of Mondrian does not show the colored lines emphasized so nicely yet … stay tuned)


Let’s do it in Parallel

Parallel coordinate displays have been popular – especially in InfoVis – for quite a while. Now we have the ultimate reference with Al Inselberg’s book (not surprisingly) called “Parallel Coordinates”.

Al Inselberg giving a talk at the DataVis workshop in Berlin 2006

Most of the book actually looks at geometric properties of parallel coordinates, and thus tends to overtax my mathematical education. The most interesting part of the book (from my point of view, which is always biased towards data analysis applications) is Chapter 10: “Data Mining and Other Applications”. One soon gets the idea that parallel coordinate views need an interactive working style and interactive tools, and thus many graphics of real-world data sets fill this chapter. To give a good idea of what is crucial about parallel coordinates, let me point to the discussion of this older post on Andrew’s blog.

So if you still can’t figure out what these funny plots mean, go and get the book!

The only point I don’t like too much about the book is the “useless” CD that comes with it, which has some sample data but lacks the real-world examples discussed in Chapter 10. Nowadays everybody would expect this data to come via a webpage.

The ultimate question, though, will not be answered by the book: Who invented parallel coordinates? This mystery will remain open and only real insiders will know the answer ;-) .


Why do we go to Conferences?

Andrew pointed, on his blog, to a post by Panos Ipeirotis, who asked why we do not use peer review for conference talks in the same way we do for journal papers.

His idea (which is not coming up for the first time, and this year’s InfoVis worked pretty much this way) is to improve the overall quality of presentations, as we have all sat through boring or technically disastrous talks that we would have liked to see improved.

As you can see from the image above, taken at the joint stat. computing and stat. graphics mixer at the 2009 JSM, I see a very important purpose of conferences in the informal meetings around the talks. Here is my comment on Andrew’s blog:

This idea might be interesting, but I think it totally misses the idea of oral presentations at conferences.

Conferences are for meeting people and exchanging ideas – that is what brings research forward. Having a reviewing process will destroy most of it.

What about being provocative and spontaneous? The reviewing would destroy all of this spice.

What is the point of a conference, which essentially gives us journal papers read aloud?


Mac vs PC Reloaded – really?

It is the silly season, aka the “Sommerloch” or the “morte-saison”, so there is time for this post. We all know the legendary “Mac vs. PC” spots, which Apple aired between 2006 and 2009. The underlying idea was the smart newcomer Mac attacking the bold, established PC, which does not always act very smart.

Mac vs. PC

So far, everything matches the system. If you are in a weaker market position, with an apparently smarter product, you need to attack.

Well, now the market leader Microsoft, still outselling Apple 10:1, seems to strike back – although it is hard to understand why, given that the 10:1 share still holds. Microsoft’s campaign seems quite odd. Among other surprising things, MS claims that “Macs can take time to learn” and qualifies the claim with “Things just don’t work the same way on Macs if you’re used to a PC”.

And now we are getting to the point, which I notice frequently. Developers of ill-designed software (and I am really only referring to the user interface here) managed to completely screw up the user’s expectation of how things should work. Computer users who have been exposed to quirky interfaces for years (if not generations) do not expect the obvious any more. Being trained to look for the work-around in the first place, one seems to be unable to expect the straight solution.

Given this situation, MS’s strange claim seems to sell: “Things just don’t work the same way on Macs if you’re used to a PC”, but it does not say at all that PCs do a good job in helping us solve our problems – no, they only meet people’s degraded expectations … sad.


The Good & the Bad [08/2010]

The last regular issue of “The Good & the Bad” dates back to [11/2006], so it is more than time to post.

I found this flowchart on Kaiser’s Junk Charts.


The graph was originally posted on the Internet Monk’s blog – the data comes from a study, which can be found here. There was no data for this migration matrix posted in either of the blogs, so I reverse engineered the graphics (pixel by pixel) and created the data table.

Although the migration matrix only has 36 potential values to depict (some of which don’t even exist), the flowchart is already tremendously cluttered. The general question of which group loses most is hard to answer from it, and the many small migration paths obscure the graph’s message to some extent.

Here is my suggestion, which uses barcharts for the source and destination distributions and a fluctuation diagram for the migrations.

What can we read from the graphs? There is a general movement towards “None”, which is by far the biggest receiving group. Both “Catholics” and “Evangelicals” lose substantially, but “Evangelicals” at least gains somewhat from other groups. “Evangelicals” and “Mainline Protestants” seem to be the biggest 2×2 exchange. “Black Protestants” only lose to “None”, which also might just be a data error.
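The quantities behind such a reading (how much each group loses, gains, and nets) can be derived from the migration matrix directly. The sketch below assumes the matrix is stored as nested dictionaries, `matrix[source][destination]`; the group labels and counts are invented and are not the reverse-engineered data.

```python
def summarize_migration(matrix):
    """Return (lost, gained, net) per group from matrix[source][destination].

    Entries on the diagonal (people who stay) are ignored; missing entries
    count as zero.
    """
    groups = list(matrix)
    lost = {
        g: sum(v for dst, v in matrix[g].items() if dst != g) for g in groups
    }
    gained = {
        g: sum(matrix[src].get(g, 0) for src in groups if src != g)
        for g in groups
    }
    net = {g: gained[g] - lost[g] for g in groups}
    return lost, gained, net


# Hypothetical three-group example.
matrix = {
    "X": {"Y": 2, "None": 5},
    "Y": {"X": 1, "None": 3},
    "None": {"X": 1},
}

lost, gained, net = summarize_migration(matrix)
print("lost:  ", lost)
print("gained:", gained)
print("net:   ", net)
```

The `lost` and `gained` totals correspond to what the two barcharts compare, while the individual cells of `matrix` are what the fluctuation diagram shows.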


The Seven Deadly Sins of Conducting a Survey Study

I stumbled upon this “survey analysis” on an Apple related list called “iPad Opinion Profile – iPad Personality Clash: Elites vs. Geeks”. The brief summary of this survey suggests that iPad owners are “Selfish Elites” and those who oppose the iPad are “Independent Geeks”.

It takes a bit to get an idea of what these guys did, but here is my list of the seven deadly sins of conducting a survey:

  1. don’t care about being representative, just ask some guys on, say, Facebook
    (the survey actually started before the iPad was released …)
  2. normalize the data – somehow
    (it says: “The survey sample was normalized to match the gender, age and personality distribution of 13-49 year olds living in the United States”), good luck!
  3. only pick a tiny fraction of the data to make things more interesting
    (the “study” only looks at 9% of the survey data and mixes owners of an iPad with those who intend to buy one …)
  4. ask a lot of unrelated questions
    (the question for “The Biggest Sin” is really something that haunts us, especially when thinking about touch devices)
  5. never ever show the questions you actually asked
    (no sample questionnaire is supplied)
  6. don’t mention the absolute numbers behind the results, only use ratios, which really go haywire for small numbers; compare to the undefined “average person”
    (no quantities at all, only the overall size of 20,000)
  7. only pick the findings that make a good headline and match your insinuation – never point to contradicting results
    (according to the study, iPad ownership is strongest among families with many children, yet owners are described as “not very kind or altruistic” – great)
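Sin 6 can be made concrete with a few lines. This is only a sketch with invented counts (the study reports none), and the 5% rate for the “average person” is an assumption made for the example.

```python
# All numbers are invented for illustration; the study reports no counts.
AVERAGE_RATE = 0.05  # assumed share of the "average person" with some trait


def ratio_to_average(hits, group_size, average_rate):
    """Ratio of a group's rate to the rate of the 'average person'."""
    return (hits / group_size) / average_rate


def swing_from_one_answer(hits, group_size):
    """Change in the ratio if a single additional respondent had said yes."""
    before = ratio_to_average(hits, group_size, AVERAGE_RATE)
    after = ratio_to_average(hits + 1, group_size, AVERAGE_RATE)
    return after - before


# Both groups start at the same ratio, but react very differently
# to one extra "yes".
print(swing_from_one_answer(2, 20))      # small group
print(swing_from_one_answer(200, 2000))  # large group
```

Without the absolute numbers, a reader cannot tell which of the two situations a reported ratio comes from.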

Although I neither own an iPad nor intend to buy one anytime soon (though donations are welcome ;-) ), my verdict from the results of the study is that iPad critics are “low stimulated, introverted, reserved, insecure, neurotic young males, tending to be aggressive and lazy, mainly found in Hawaii and Alaska”.

- sometimes I feel guilty being a statistician


Surprise Me!

I was a bit puzzled when I read the lines in Robert’s hint to the InfoVis Workshop called “Telling Stories with Data”, saying: “If you haven’t watched the Hans Rosling video yet, you probably haven’t realized that visualization isn’t just there for data analysis, it’s also a great tool for telling stories.”

This is exactly what I mentioned earlier in an older post:

A good visualization should tell us a story about the data you didn’t know before and not the other way round, i.e., once you know the story, you create a visualization around it.

Here is a nice proof of how this usually works:

Antony and Alan talking about some visualization at the 2002 JSM in NYC

If the result of your graphical analysis is not something you can put into a story, you probably didn’t really succeed with your analysis. Of course, we are not equally gifted in telling stories …


Tour de France 2010 Statistics Art Gallery

Sometimes the title may promise more than the post can hold … but I still try my best. As you might know, there is the usual visualization of the stage times, total times and ranks of all riders in the regular post to start with.

As we have more data on the Tour and riders, it is fun to look at these data as well:

Let’s first look at the different types of riders and how they performed:

Total Time by Type of Rider

Note that smaller numbers in the boxplots of the total time by Type of Rider correspond to shorter times and better performance. Obviously the classification is quite accurate. The ordering of the types is not surprising, and given the many hard core mountain stages, climbers are definitely in for a good overall performance.

Type of Rider vs Year of Birth (Age)

Interestingly, “Leader and Top Rider” are oldest on average, and less surprisingly you seem to start your career as a “Helper”, try your luck as a “Sprinter” and at some point probably get to be more than (only) a specialist.

Team by Type

We only highlighted the Leader and Top Rider and Helper here, and sorted by the number of top riders. Team RadioShack really was different, although they won the team classification only by 9′.

Age by Team

Although Team RadioShack is not the oldest on average (actually median), if you look at the Top Riders in the team (highlighted in red), you definitely see that they will need some “fresh blood” in the coming years – no, I do not refer to doping here ;-) .

As we already looked at Age, some more physicals might be interesting:

Result vs Age

We actually look at year of birth (which is a bit more time invariant than Age). The best age for the Tour de France (apart from the older professionals, who “survived” over many years) seems to be around 32, i.e., born in 1978.

There is obviously a correlation between height and weight of the riders, as there is for all of us, so we look at the BMI instead.
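As a reminder, the BMI is the weight in kilograms divided by the squared height in metres, which folds the two correlated measurements into one number. A tiny sketch, with a made-up rider rather than one from the Tour data:

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by squared height in m."""
    return weight_kg / height_m ** 2


# Hypothetical rider, not taken from the Tour data set.
print(round(bmi(70, 1.80), 1))
```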

Result vs. BMI

Although the variance gets quite big at the ends of the data range, we see that it is no good to enjoy the good French food and red wine too much during the tour.

Well, the post is already way too long, though there is more to explore here. I can only encourage you to grab the data and play around yourself using Mondrian or some other visualization package – it’s fun.


Blinded by Animation?

Stopping by at http://www.gapminder.org, you will easily get to the “default” example, which shows the scatterplot of life-expectancy vs. income-per-person running through the years 1800 to 2009.

You really have to look carefully to spot the problem with Russia in 1933. How do we explain a spontaneous drop of life-expectancy from 33 years to only 12 years? This is obviously an error, which was not fixed before the data was released.
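A simple screen for this kind of glitch is to flag implausibly large changes between consecutive observations. In the sketch below only the drop from 33 to 12 in 1933 is taken from the text above; the neighbouring values and the threshold of 10 years are assumptions made for illustration.

```python
def flag_jumps(series, max_change):
    """Return (year, previous, current) for every change larger than max_change.

    `series` maps year -> value; consecutive observations are compared.
    """
    years = sorted(series)
    flagged = []
    for prev_year, year in zip(years, years[1:]):
        if abs(series[year] - series[prev_year]) > max_change:
            flagged.append((year, series[prev_year], series[year]))
    return flagged


# Only the 33 -> 12 drop in 1933 is from the post; the rest is invented.
life_expectancy = {1931: 32.0, 1932: 33.0, 1933: 12.0, 1934: 34.0}

for year, before, after in flag_jumps(life_expectancy, max_change=10):
    print(year, before, "->", after)
```

Such a check flags both the drop and the recovery; whether a flagged value is an error or a real event (a famine, a war) still needs a human look.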

It gets quite clear when you look at the time series itself, which you get when you select time to be the x-axis:

Apart from the drop in 1933, we also find strange data for the time of WWII. The war apparently has no effect on the life-expectancy at all. Hard to believe – but this data might just have been “optimized” by the political regime.

Once again time to remember Peter J. Huber’s words: “Never underestimate the rawness of raw data!”


Why do we do it – ’cause we can!

I was pointed to this nice video of work from Robert Kosara by Hadley via Antony.

Emerging technologies – and multi-touch must be counted as such – offer new possibilities of creating an interaction with graphics. Robert’s implementation is certainly clean and straightforward, but still raises the question of whether these operations are really things we need during a data analysis.

What I always found very distracting when selecting data dynamically was the amount of coordination necessary for the selection, which ultimately drew attention away from the highlighting triggered by the selection. Often enough, this highlighting was most interesting in a different plot, and thus hard to watch while trying to get the dynamic selection right.

I wonder how much this is the case for Robert’s prototype, but I am afraid I can’t tell until I get my hands on the software and a new MacBook Pro.

The final question though for me is whether it will help people to get their data analysis jobs done more easily or not!


World Cup Aftermath

Now that the world cup is over, and we finally have a winner, it is time to compare the expected values with the real outcome – don’t mix this up with comparing the outcome we would have liked to see with the real outcome, which is often done in business analytics …

The expected values are taken from Leitner, Zeileis and Hornik’s paper on the chances to win the world cup. What is appealing in their approach is that they look at the bookmakers’ odds rather than at the more long-term scores from FIFA or ELO.

Here is a visualization ordered by the winning probabilities in %:

When ordered by the winning probabilities, the team ranked 1st should win the cup, number two should be the loser in the final, and so on. During the group stage all teams play 3 games, but we assume the 8 lowest-ranked teams to be the last in their group.

Given this ranking, we can visualize whether or not a team met the expectation. Teams falling short are indicated with red bars, i.e., stages they never reached; teams that performed above expectations extend with green bars.
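The logic behind the red and green bars can be sketched as follows. The mapping from rank to expected stage assumes 32 teams and follows the description above (1st wins, 2nd loses the final, and so on); treating “winner” as its own level above “final” is a choice made for this sketch. Except for the 1st rank of SPAIN mentioned in the text, the ranks in the example are hypothetical.

```python
STAGES = [
    "group stage", "round of 16", "quarter final",
    "semi final", "final", "winner",
]


def expected_stage(rank):
    """Index into STAGES a team should reach if results follow the ranking."""
    if rank == 1:
        return 5  # wins the cup
    if rank == 2:
        return 4  # loses the final
    if rank <= 4:
        return 3
    if rank <= 8:
        return 2
    if rank <= 16:
        return 1
    return 0  # ranks 17 to 32 drop out in the group stage


def surprise(rank, reached):
    """Stages above (positive, green) or below (negative, red) expectation."""
    return STAGES.index(reached) - expected_stage(rank)


print(surprise(1, "winner"))       # SPAIN: ranked 1st and won
print(surprise(20, "semi final"))  # hypothetical rank below the top 16
print(surprise(6, "group stage"))  # hypothetical rank in the top 8
```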

What can we read from the graph? ITALY and FRANCE were the worst under-performers, as they not only fell two stages short, but also ranked last within their groups. URUGUAY is clearly furthest above expectation: according to their rank, they were not even meant to advance to the last 16, but actually made it into the semi final.

What about SPAIN? Although they did win the cup, there was nothing really surprising given they were ranked 1st anyway.

Using the actual winning probabilities, we can also calculate what it actually took the teams to get to the point where they finally dropped out – that might probably rank them quite differently … but that will be another post.


Tour de France 2010

July 3rd was probably the worst day to start the Tour de France, as many of us were captivated by the quarter finals, which sent home none other than Diego Maradona’s dream team, which may dream for another four years now …

Although the world cup has yet to see its best matches, I will start to log the results in the usual way, as in 2005, 2006, 2007, 2008 and 2009. Unlike for the world cup, I swear not to give any model that predicts the winner …

Stage Results, Cumulative Time, Ranks

- each line corresponds to a rider
- smaller numbers are shorter times, i.e. better ranks
- all stages are on a common scale
- stage-results and cum-times are aligned at the median, which corresponds to the peloton
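The median alignment mentioned in the last point amounts to subtracting the median time from every rider’s time, so that the peloton ends up at zero. A minimal sketch with invented times in seconds (rider labels and values are hypothetical):

```python
from statistics import median


def align_at_median(times):
    """Subtract the median so that the bulk of the field (the peloton) sits at 0."""
    m = median(times.values())
    return {rider: t - m for rider, t in times.items()}


# Hypothetical stage times in seconds.
stage_times = {"rider a": 15000, "rider b": 15010, "rider c": 15010, "rider d": 15400}

print(align_at_median(stage_times))
```

Negative values are riders who finished ahead of the peloton, positive values are riders who lost time on it.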

STAGE 2: QuickStep’s CHAVANEL takes the lead – did Fabian run out of batteries ;-) ?
STAGE 3: CANCELLARA recharged; further spread of the field
STAGE 4: BOLE now last on a day without many changes
STAGE 5: Almost 75% of all riders roll in with the peloton
STAGE 6: GRABSCH “Loser of the day” – 9 drop outs by now
STAGE 7: CHAVANEL back at the top, with a newly sorted field after hitting the mountains
STAGE 8: ARMSTRONG falls behind
STAGE 9: ARMSTRONG gains some ranks on CONTADOR but the gap grows
STAGE 10: SCHLECK and CANCELLARA show quite opposite rank profiles
STAGE 11: LEIPHEIMER and CONTADOR the only “constant” top 10 riders

STAGE 14: Neither SCHLECK nor CONTADOR risk any kind of attack
STAGE 15: CONTADOR does not show fair play and gets yellow only due to a technical defect of SCHLECK’s bike – but what would you expect with so many doping allegations on his account …
STAGE 16: ARMSTRONG still good for an extraordinary performance
STAGE 17: 26 drop outs – the rest will most probably make it to the Champs-Élysées
STAGE 18: Not the day of HERNANDEZ, but he knows how it feels to be last …
STAGE 19: GRABSCH and MENCHOV each get their 3rd place
STAGE 20: The profile of the winner, Alberto CONTADOR

See also the summary in this post

For those who want to play with the data. The graphs are created with Mondrian.
