Japan Earthquake: An Exploratory View

Thanks to the data provided by the USGS, we can take a look at all earthquakes since 1973, which cover almost the last 40 years of earthquake activity worldwide.

Let’s first take a look at the yearly development of the earthquake activity overall:

The apparent increase in the last 10 years is striking – though I don’t have any explanation for this change, which is most probably not even man-made. Interestingly the magnitude (see next figure) does not increase, though the chance of stronger earthquakes will grow with the overall number.

The distribution of magnitudes (which is used for the coloring) is even more striking when we look at the earthquake in Japan on March 11th, which is now rated at 9.0.

The whole dataset contains only one earthquake at a higher magnitude, i.e., the earthquake that triggered the terrible tsunami on December 26th, 2004, at a magnitude of 9.1.

Keep in mind that the Richter scale is logarithmic, i.e., stepping up one unit means a 10 times larger measured amplitude (and roughly 32 times more released energy). The strongest earthquake ever measured was in Chile in 1960 at 9.5.
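
To get a feeling for what a logarithmic scale means, here is a small sketch in R. It assumes the usual textbook relations (amplitude grows with 10^M, released energy with 10^(1.5 M)); the function names are made up for this illustration.

```r
# Compare two earthquakes by their magnitudes.
# Assumption: amplitude ~ 10^M and energy ~ 10^(1.5 * M) (textbook relations).
amplitude_ratio <- function(m1, m2) 10^(m1 - m2)
energy_ratio    <- function(m1, m2) 10^(1.5 * (m1 - m2))

amplitude_ratio(9.0, 8.0)   # one full unit: factor 10 in amplitude
energy_ratio(9.0, 8.0)      # one full unit: roughly factor 32 in energy
energy_ratio(9.5, 9.0)      # Chile 1960 vs. Japan 2011: between 5 and 6
```

So even half a unit on the scale is far from a small difference.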

If we look at the coordinates of the measurements in longitudes and latitudes, we see how much the activity is concentrated on the tectonic hotspots.

We roughly see the shapes of some continents, with one exception. Africa seems to be free of any activity; probably due to the fact that it sits happily on its own tectonic plate.

Looking at this data, we can only start to understand the devastation Japan is facing.

(The data can be loaded directly into Mondrian, which was used to create the graphs above.)

The Good & the Bad [3/2011]

This post could as well be called “Which Smartphone is right for you?”, or “Plotting conditional distribution – but the right way!”. Here is the original visualization from Nielsen, which is not really bad, but still hides the important message to some extent.

Kaiser rightly pointed out that some features – important features – of the data are hard to spot in the Nielsen graphic. His improved version does not use areas any more, but shows the shares of the different OSs as lines over the age axis.

From this display we may conclude two things:

  1. Areas are unsuitable to display this kind of data
  2. We understand the data better when we condition on age instead of OS
    (it somehow seems more natural that given a particular age, we might choose a certain phone and not vice versa)

Thanks to Kaiser who shares the data on his blog, I was able to create the “rotated” mosaic plot, which also conditions on age but still uses the proportional areas.

We clearly see that the two-dimensional area representation is not the problem; the conditioning was simply chosen the wrong way round. In this representation we can also retain the overall sizes of the groups, which is an advantage over the line plot.
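
The difference between the two ways of conditioning is easy to try out in R. The following is only a sketch: the counts are invented for illustration and are not the Nielsen data.

```r
# Invented counts (NOT the Nielsen data): rows = age group, columns = OS
counts <- matrix(c(40, 30, 30,
                   20, 30, 50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(age = c("18-24", "55+"),
                                 os  = c("Android", "iPhone", "Blackberry")))

os_given_age <- prop.table(counts, margin = 1)  # each row sums to 1
age_given_os <- prop.table(counts, margin = 2)  # each column sums to 1

round(os_given_age, 2)
round(age_given_os, 2)

# Mosaic plot split by age first: the widths keep the sizes of the age groups
mosaicplot(counts, main = "OS share given age (toy data)")
```

Reading along the rows of os_given_age answers "given this age, which phone?", which is the question the rotated mosaic plot shows.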

Things are even easier to interpret with the marginals as legends:

From this mosaic plot we can perfectly read some of the features of the data:

  • the popularity of Android phones decreases with age
    (maybe because they are cheap and tailored towards tech-oriented people)
  • iPhones and Palms show an increasing popularity for ages above 55, and
    (probably due to an interface more suited for “ordinary” people)
  • vice versa Windows and Blackberry are underrepresented for 55+
    (maybe they are no longer being forced by their employers to use these phones)

(Graphs were made with Mondrian)

Too Hot to handle?

This is the ideal post to combine Infographics/Visualizations with the user interface aspect. I found it on Kaiser’s Junk Charts.

Having spent only a few years of my life in the US and having been raised in orderly and standardized Germany, I can tell that most faucets here come pretty close to the “should be” situation. This is mainly due to the fact that we handle temperature and water flow in two or more separate dimensions. The combined interface is as weird as it is hard to handle. My impression though is that the “magma” range is as wide as the “ice cold” range, which does not change the problem at all.

Nonetheless, once you have taken a shower in the US, you know what it means to find this tiny slot between “ice cold” and “magma” – and for the non-US readers, I really mean “magma”, not just “hot water” ;-) . Thanks to the creator of this fun graphic!

Visualizing Soccer League Standings

I feel ashamed of this boring title, but hope that the entry can make up for it. This visualization inspired me, as a comment on it pointed to my Tour de France visualizations.

As with all visualizations, we need data first – this sounds trivial, but is sometimes a frustrating show-stopper. After I found the Bundesliga data for each round, the only thing missing was the script to pull the data off the website. R's XML package was the choice:

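library(XML)   # provides readHTMLTable()
games <- 23    # number of rounds played so far
for (i in 1:games) {
   # build the URL for round i; the pieces are separate arguments so that
   # no line break ends up inside the URL string
   url <- paste("http://www.sport1.de/dynamic/datencenter/sport/ergebnisse/",
                "fussball/bundesliga-2010-2011/_r10353_/_m", i, "_/", sep="")
   rawtab <- readHTMLTable(url)
   # 6th table on the page, rows 3 to 20 (the 18 teams),
   # column 2 = team, column 9 = points
   tab <- rawtab[[6]][3:20, c(2,9)]
   ids <- order(tab[,1])   # sort by team so the rows line up across rounds
   if (i == 1) {
     result <- tab[ids,]             # team names plus points of round 1
   } else {
     result[,i+1] <- tab[ids,2]      # append points of round i
   }
}
resdf <- as.data.frame(result)
names(resdf)[1] <- "Team"
names(resdf)[2:(games+1)] <- 1:games
# tab-delimited text file, which can be read by Mondrian
write.table(resdf, "Bundesliga.txt", quote=FALSE, row.names=FALSE, sep="\t")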

Although I hadn’t used readHTMLTable before, it was a 15 min. job to get the script working – a definite recommendation for jobs like this!

But now to the visualizations: Let’s start with the simple trajectories of the points of each team.

As one of the comments on reddit already suggested, we might want to align the developing scores along the median:

Now, as this weekend the “Rekordmeister” – as FC Bayern proudly calls itself – lost at home against BVB 1:3, it might be worthwhile to look at the scores from an FC Bayern perspective, i.e., we align the scores at the result of FCB:


It is easy to see that the gap to BVB has remained at the same level for more than 10 games now, and that for roughly five games FCB has somehow not managed to shake off its direct competitors.
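
For those who prefer to do the two alignments in R rather than in Mondrian, here is a minimal sketch. The points are invented and only three teams are used; with the real file, the matrix would simply hold the 18 teams and all rounds.

```r
# Invented cumulative points, rows = teams, columns = rounds
pts <- rbind(BVB   = c(3, 6, 9, 12),
             FCB   = c(1, 4, 5, 8),
             Other = c(3, 4, 7, 8))
colnames(pts) <- 1:4

# Align along the median: subtract each round's median from that column
by_median <- sweep(pts, 2, apply(pts, 2, median))

# Align at FC Bayern: subtract FCB's points from every team, round by round
by_fcb <- sweep(pts, 2, pts["FCB", ])

# One trajectory per team, FCB becomes the horizontal zero line
matplot(t(by_fcb), type = "l", lty = 1,
        xlab = "Round", ylab = "Points relative to FCB")
```

Both alignments only shift each round by a constant, so the distances between the teams within a round stay untouched.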

Here is the text file, which you can use to play around yourself in Mondrian – the tool used to create the visualizations.

Advertising and Statistics

There is certainly a prerequisite for statistics we can’t get around: data. Online advertising services generate tons of it; most of it is not accessible to the public and much of it is probably not very interesting at all.

Chitika has made one statistic public for us: the penetration of iPhones on the AT&T and Verizon networks.

We don’t get any info on how the data is measured (iPhone versions, representative placement of ads, …) and (unfortunately) there is no historic data, i.e., a time series.

Nonetheless, Verizon is catching up quite fast, and I bet the CEOs of competing networks in other countries where Apple has finally opened up would die to get these figures for their networks …

Statistical Computing and Graphics Newsletter

The new issue (Vol. 21, No. 2) is out now. Featured articles are:

barNest: Illustrating nested summary measures
by Jim Lemon and Ofir Levy

You say “graph invariant,” I say “test statistic”
by Carey E. Priebe, Glen A. Coppersmith and Andrey Rukhin

Computation in Large-Scale Scientific and Internet Data Applications is a Focus of MMDS 2010
by Michael W. Mahoney

and of course, announcements and the news from the section chairs.

Andreas Krause passed the graphics editorship over to me last fall, and I am looking forward to a lot of interesting submissions in the coming years.

Please feel free to contact us (Nicholas, computing co-editor, or me, graphics co-editor) no matter whether you are a student or professor, a statistician or a practitioner, … whatever is interesting to the community and of good quality has a pretty good chance of being published.

Data Analysis of Yesteryear

It is not too often that a book is published that integrates data analytical methodology and the illustration of the appropriate use of specific tools. When Henk pointed me to the just released “Data Analysis with Open Source Tools” by Philipp Janert, the excitement was big, but it evaporated as soon as I read through the book.

I did start to flip through the pages with Amazon Preview, and was positively surprised that Part I of the book was on “Graphics: Looking at Data” and the following sections were actually progressing in the dimensionality of the data looked at – nice concept, and well copied. The first figure though, is a jittered dotplot – something we were doing in the 70s when we were still sending our plot commands to a pen plotter, and were trying to avoid ink soaked holes in the paper – we should know better more than a quarter of a century later.

It takes quite some pages until the book hits the widely used boxplots in the section “Only when Appropriate: Summary Statistics and Box Plots”, and we read “These summary statistics (mean and median, standard deviation, and percentiles) apply only under certain assumptions and are misleading, if not downright wrong, if those assumptions are not fulfilled.” Well, how can a median be wrong?

A surprising highlight can be found on page 68, where Janert absolutely hits the point in the distinction between “Graphical Analysis and Presentation Graphics” – something he seems to have forgotten just 50 pages later.

In the section on multivariate data analysis Janert talks about “Interactive Exploration” and writes “Now I could imagine a tool that allows us to select a bin in one of the histograms and then highlights the contribution from the points in that bin in all the other histograms”. His imagination could come true with a few clicks if he used the appropriate tools. On page 124, he lumps ggobi and Mondrian into the subtly named group of “Experimental Tools”. He claims “I don’t think any of these novel plot types have been refined to a point where they are clearly useful.” Certainly, if you do not use these (novel?) plots – btw. PCPs had their 25th anniversary last year and mosaic plots will celebrate their 30th anniversary this year – you won’t see their usefulness. That Janert most likely did not use Mondrian is somehow apparent, otherwise he would not need to imagine a tool that links histograms.

The last lowlight to present here is the “histogram” in Figure 9.4 on page 202, which is – hey – just a scatterplot; they are not that hard to tell apart.

I hate being so critical, but we should not let someone get away with a book on data analysis, published in 2010, that bashes what has been standard in modern, interactive, graphical data analysis for more than a decade. Who would consider using Gnuplot for graphical data analysis in 2011?

If your answer to the question above is “I would”, go buy the book – if not, save the money for a more up-to-date book.

Mondrian Version 1.2 released

The new version (1.2) of Mondrian adds the following (significant) features:

  • The scatterplot smoother now includes “principal curves”, which are one of the nonlinear generalizations of principal components.
  • All smoothers can be plotted for subgroups which have a color assigned (“smoother by colors”).
  • The color scheme has been refined once again, to make use of colors as efficiently as possible.
  • alpha-transparency is now consistent between scatterplots and parallel coordinate plots.
  • A new transformation: columnwise minimum and maximum.
  • Sorting of levels is now stable, i.e. levels which have the same value for an ordering criterion will keep their previous order.
  • The Reference Card speaks Windows now, i.e., Windows users no longer need to translate keyboard shortcuts from the Mac world.

Being able to use colors to estimate scatterplot smoothers for different subgroups is really handy – and actually “stolen” from early versions of DataDesk.

Principal curves are quite fun to play with, as no actual functional relationship is needed; the curve is generated such that the sum of the squared orthogonal distances to the curve is minimized. With no flexibility allowed this is obviously the PCA solution; with more and more flexibility the solution(s) get less obvious …

The example above shows the principal curve (actually the PCA regression, left plot) and a linear least squares fit on the first two principal components (right plot), which is actually the same line as in the left plot, only rotated to be a horizontal line. The highlighting in the left plot underlines that principal curves do not follow a functional relationship like y = f(x).
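
The relation between the PCA line and an ordinary least squares line can be checked with a few lines of R. This is only a sketch on simulated data (not the data shown in the plots), and it uses prcomp and lm rather than anything from Mondrian.

```r
set.seed(1)                      # simulated data, purely illustrative
x <- rnorm(200)
y <- 0.5 * x + rnorm(200, sd = 0.5)

# PCA line: minimizes the squared orthogonal distances
pca <- prcomp(cbind(x, y))       # centers the data by default
pca_slope <- pca$rotation["y", 1] / pca$rotation["x", 1]

# Ordinary least squares: minimizes the squared vertical distances
ols_slope <- unname(coef(lm(y ~ x))[2])

# In the rotated coordinates (the PC scores) the least squares line is flat
scores <- pca$x
flat_slope <- unname(coef(lm(scores[, 2] ~ scores[, 1]))[2])

c(pca = pca_slope, ols = ols_slope, rotated = flat_slope)
```

The PCA slope is at least as steep as the least squares slope, and the fit on the scores has slope zero (up to rounding), which is the horizontal line mentioned above.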

How different the various fits in a scatterplot can look can be seen here:

The plot shows the results of the first and the last time trial of the Tour de France in 2005. Depending on the type of rider we might expect one or the other correlation between the two dimensions, and it is not too obvious how the times should depend on each other.

Graphics *and* Statistics: The Facebook Map

There is this beautiful graph created by the Facebook intern Paul Butler showing all (?) connections between Facebook accounts:

Paul’s article is called “Visualizing Friendships”, which I would rather call “Visualizing connections between Facebook accounts”, but that is probably a different matter.

Although this is a beautiful piece of artwork, from a statistical point of view it does not really give us a great deal of insight. Sure, there are certain “white spots” on the map, where either a competitor of Facebook is more successful or people don’t want to, or cannot, use this kind of “social” contact. Obvious examples are Russia and China. But this is info more on a meta level, i.e., not really part of the info shown.

What would be more interesting is something like a comparison between the data Paul compiled and the expected link intensity based on either population, broadband connections or actual Facebook accounts. Looking at Germany, e.g., we see the former eastern part being less connected, which reflects both a lower population density and less developed broadband connections.
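
Such a comparison boils down to an observed/expected ratio. Here is a minimal sketch in R, with invented numbers for three hypothetical regions and population as the only basis for the expectation:

```r
# Invented numbers for three hypothetical regions
regions <- data.frame(
  region     = c("A", "B", "C"),
  population = c(50, 30, 20),      # e.g. millions of inhabitants
  links      = c(600, 300, 100)    # observed connections
)

# Expected links if connections were spread proportionally to population
regions$expected <- sum(regions$links) * regions$population / sum(regions$population)

# Ratio > 1: more connected than expected; ratio < 1: less connected
regions$ratio <- regions$links / regions$expected
regions
```

Mapping this ratio instead of the raw links would show where the intensity differs from what population alone suggests.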

A visualization of these connection intensities should be hierarchical, starting with continents and offering the ability to drill down into countries, states and cities. That would certainly mean some development and could not be done so easily in R (yes, this map was created in R!) – maybe a case for iplots.

Sharpen your Eyes

We definitely live in a world of overflowing information – certainly more than a human can and wants to digest. Of course, the internet is the principal motor for this, but it also happens with the design of simple everyday things.

Antrepo has a nice example of how product designs can be reduced to what is really unique to them. Here is the example of my daughter’s favorite spread:

How things usually work (and that is the other way round, i.e., from clean to cluttered) can be seen in this great video on youtube:



What can we learn from this for creating better visualizations?

When visualizing information or data, we always face the problem of reducing a large amount of information to an essential message. We will only succeed when we manage to focus on what is essential and do not fall for the next best attention grabber.

Merry Christmas and remember this post when
unboxing your presents on Christmas Eve

Soccer: Can Money buy a Good Team?

The German Bundesliga has its (very short) winter break after 17 games, i.e., half the season. We all know – or at least would not disagree immediately – that good players will cost a team a fortune, and the more a team can invest, the better the result will be.

Using the (potential) value of the 18 German teams from www.transfermarkt.de at the beginning of the 2010/11 season and the points achieved after 17 games, we get the following correlation:

The R^2 is a mere 2.2% for all teams and a vanishing 0.2% if we leave the outlier FC Bayern out (red line). The team managers will hate me for this, but money does not really save the day here.

But fortunately there is the old rule that the goals against the team will make the difference. And indeed, the R^2 is a staggering 73.5% if we look at the scatterplot of Points vs. Goals Against:

(That regression doesn’t even change if we take “FC Bayern” out …)
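
For the record, this is how such an R^2 with and without an outlier can be computed in R. The values below are invented for six hypothetical teams and are not the transfer market figures used in the plots.

```r
# Invented values for six hypothetical teams; the last one is a high-value outlier
value  <- c(40, 55, 60, 75, 90, 280)   # squad value, e.g. in million euros
points <- c(22, 30, 18, 27, 24, 29)    # points after 17 games

# R^2 of the simple linear regression, with and without the outlier
r2_all        <- summary(lm(points ~ value))$r.squared
r2_no_outlier <- summary(lm(points ~ value, subset = value < 200))$r.squared

c(all = r2_all, without_outlier = r2_no_outlier)
```

With a single predictor the R^2 is just the squared correlation, so cor(value, points)^2 gives the same number.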

Visualization makes Life Easier

I recently got my current Miles & More balance. As you might guess, I am not really a frequent flyer, at least not with Lufthansa and its allies.

According to the numbers, I need 36,000 miles or 30 flight segments to get Frequent Traveller status. Given my current 1,500 miles and 4 segments, I am still 96% (miles) or 87% (segments) short of this status.
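
The percentages are quickly verified in R, using only the numbers from my balance:

```r
miles_needed <- 36000; miles_have <- 1500
segs_needed  <- 30;    segs_have  <- 4

# share that is still missing
short_miles <- 1 - miles_have / miles_needed
short_segs  <- 1 - segs_have / segs_needed

round(100 * c(miles = short_miles, segments = short_segs))   # 96 and 87
```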

The nice graph though, shows me that I am almost there ?!? Great!

Hoping that the Lufthansa pilots at least have a better sense of how far their destination still is … they probably trust their numbers ;-)