It is just about a year ago (January 6th, 2009, to be exact) that a New York Times article on R fueled the dispute over which statistical analysis tool is “the best”. One of the highlights of the article was a quote from SAS’ Anne H. Milley:
“I think it addresses a niche market for high-end data analysts that want free, readily available code,” said Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”
I recently found a SAS press release (dated March 23, 2009) entitled: “SAS to offer R integration to support analytical innovation”, which reads:
“It is no secret that SAS has been working on interfacing with R,” said Anne Milley, SAS’ Senior Director of Technology Product Marketing. “SAS and R are here to stay, and as organizations work to harness the full potential of their data, an expanded set of analytics options can only help.”
First, let’s be cheerful about this move (whatever the actual solution will look like). On the other hand, if Anne Milley’s quotes stand for SAS’ reliability, I doubt they deserve their reputation.
It is pretty hard to get any attention while Steve is presenting the iPad, but nonetheless I would like to point to the new version 1.1 of Mondrian. Here are the most important new features:
- Load data directly from R workspace files
- New color schemes
- Compatible with Java 6 on all platforms
- Many bug fixes and minor features added

All about Mondrian can be found at the website and in the book. (Sorry, Steve, for the Windows 7 screenshot … it looks much nicer under MacOS X.)
On Wednesday, January 27th, (not only) the IT-world will be looking westwards to what is coming from Cupertino. Apple will reveal their “latest creation”.

Nothing new regarding the staging: rumors have been piling up in blogs and news lists for months, and analysts predict the dark or golden future of Apple and their competitors (depending on who pays them).
And yet, there seems to be a different touch this time. The Apple iSlate has been rumored for more than a year now, and even Steve Ballmer talked about “… what we will call slate PCs …” at this year’s CES. He showed actual hardware by HP and others, but they were running “only” Windows 7, i.e., they were merely PCs reduced to a small(er) screen without a keyboard - nothing we would call innovation.
Not that the R&D departments of Microsoft, HP, Sony … have been shut down. No, even worse, they seem to be holding still until Apple has defined the new standard of interactions and services for a tablet PC. This somehow reminds me of what happened with the iPhone, but whereas with the iPhone it seemed to have happened by chance, this time it looks to be on purpose.
Let’s see what we’ll get on Wednesday … !
PS: We don’t really know what the “iThing” will look like in particular, but take a look at this nice video from Bonnier R&D, which gives an idea of what it takes to move forward to new standards.
Posted on 01/18/2010, 21:28, by martin, under General.
Here is the post-post scriptum of one of Andrew Gelman’s blog entries. The post was discussing how it could possibly be that such an influential statistician as Brian Ripley has such an outdated webpage:
P.P.S. Somebody pointed out that you can search for B D Ripley’s recent papers using Google. Here’s what’s been going on since 2002. Aside from the R stuff, he seems to have been focusing on applied work. … I find that working with applied collaborators gives me insights that I never would’ve had on my own, and I’d be interested in hearing Ripley’s thoughts on his own successes and struggles on applied problems.
I am a bit puzzled that influential statisticians like Andrew Gelman seem to be surprised that the most important inputs come from real-life problems. But maybe this is mainly because in graphical data analysis there is not much in the way of theory. The next important development usually comes from the next dataset which we can’t analyze efficiently. Once one has understood how the solution generalizes, a new piece can be put into the mosaic.
Anyway, life outside the ivory tower is different (but it is reality), and I think it is important to regularly move in and out of the tower.
On an Apple-related list I found a pointer to this price comparison chart. Although the author already put a disclaimer in his post that this graph was not intended to be “mathematically correct”, it is amazing how badly the actual information is hidden behind the rainbow chart.

Using a simple barchart just does not deliver any dramatic story at all, but hey, if prices are almost identical in Hong Kong and the US, please don’t show a difference. Here is the less appealing, but faithful chart:

Not to mention all the problems of adding the correct taxes, which are not really solved for these prices …
I found this on the infoaesthetics blog. There is one slide in the presentation that made me think:

I got the impression that this quote from Herbert George Wells - better known for his science fiction literature - suffers badly when modified this way.
Statistical thinking - from my point of view - means the ability to understand figures (as in numbers) in a way that utilizes meaningful summaries and graphs, and can somehow distinguish between a signal and noise. I doubt that there is something like “visual thinking”. Rather there is statistical thinking (or more generally speaking analytical thinking) which utilizes graphical representations of the data in order to more easily summarize the essential information.
What Alex Lundry probably meant to do in this slide was to replace the presentation of statistical information in tables with the presentation of this information in graphs.
After getting together the data which was used to generate the visualization criticized in this post, it is only fair to prepare a better version. Tom Carden already showed some quick graphs which improve the initial “pie chart”. Note that I only show the 7 most relevant diseases and grouped the rest into one group “Rest” for simplicity.
A typical problem of the initial chart is that it tries to put many views into one single graph - this usually makes interpretation very hard. Looking at the data we can identify four major questions:
- How do the absolute total costs develop over age?
- Which diseases are dominant at what ages, i.e., a relative view?
- What are the shares to pay by the patients, and how do they differ between diseases and change over ages?
- Are the costs per patient much different for the diseases and how do they develop for older patients?
The absolute numbers of patients are only a secondary question here. To answer the four questions above, four relatively simple graphs can be used, all of which are easy to create in a simple (?) tool like MS Excel.
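The four questions correspond to four simple quantities per age group and disease. Here is a minimal sketch in Python of how they could be derived from one table. The record layout, the function name `four_views` and all numbers are invented for illustration; they are not the data behind the charts.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only (not the real data):
# (age_group, disease, total_cost, cost_paid_by_patients, n_patients)
RECORDS = [
    ("20-29", "Asthma",       400.0, 100.0, 40),
    ("20-29", "Hypertension", 100.0,  20.0, 10),
    ("60-69", "Asthma",       300.0,  45.0, 20),
    ("60-69", "Hypertension", 900.0, 135.0, 50),
]


def four_views(records):
    """Return the four quantities for every (age_group, disease) pair."""
    # Total cost per age group, needed for the relative view.
    total_by_age = defaultdict(float)
    for age, _disease, total, _paid, _n in records:
        total_by_age[age] += total

    views = {}
    for age, disease, total, paid, n in records:
        views[(age, disease)] = {
            "absolute_cost": total,                      # question 1
            "relative_cost": total / total_by_age[age],  # question 2
            "patient_share": paid / total,               # question 3
            "cost_per_patient": total / n,               # question 4
        }
    return views


if __name__ == "__main__":
    for key, row in sorted(four_views(RECORDS).items()):
        print(key, {name: round(value, 3) for name, value in row.items()})
```

Each of the four values then goes into its own chart with age on the horizontal axis and one line (or bar segment) per disease. The sketch assumes that every record has a non-zero total cost and at least one patient.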
Absolute Costs

Not much to learn here (which is not better shown in the next chart), except for the fact that whereas Hypertension and Diabetes generate almost no costs up to age 25, they are the major cost drivers around the age of 60.
Relative Costs

The relative view tells most of the story of the data (of course in conjunction with the absolute plot above). There are basically three groups of diseases:
- Decreasing relative costs: Chronic Sinusitis, Asthma and Depression
- Almost constant relative costs: Acid Reflux
- Increasing relative costs: Osteoporosis, Diabetes and Hypertension (and Rest)
Share of Personal Costs

The share of costs paid by the patient declines from an average of slightly above 20% for younger patients to 15% for patients above 70. The share is almost constant for Acid Reflux and Osteoporosis up to age 60, and shows the strongest decline for Chronic Sinusitis.
Average Costs per Patient

The per patient costs do not show much difference between diseases and increase almost linearly up to age 55. Diabetes is by far the most costly disease whereas Chronic Sinusitis is by far the cheapest. Per patient costs increase sharply in the age range of 70-80 years no matter what disease we look at.
All in all, the four graphs are relatively simple and easy to read. They hopefully enable us to more easily get something like a story out of the data.
When it comes to graphing data in a chart, the scale of the data is the most important factor in determining which graphical representation might be useful. Please pardon me for the examples using the “Iris Data” and the “Titanic Data”, but these data sets are prototypes for multivariate continuous data and multivariate categorical data that everybody can relate to.
The first pair of plots (upper row) should puzzle everybody and only extremists of the one or the other graphing method would actually plot the data this way.

The second pair of plots (lower row) uses a SPLOM to graph the iris data and a mosaic plot to visualize the Titanic data, i.e., both datasets are plotted in a graph which respects the scale of the data.
Admittedly, this example seems to be too obvious, but when it comes to more complex datasets with a mixture of continuous and categorical variables, it might be quite helpful to know how to choose the right (set of) graphics in order to visualize (and analyze) the data properly.
(To create the graphics yourself you might use Mondrian)
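The rule of thumb from the two examples can be written down as a tiny lookup. This is only a sketch in Python: the function name `suggest_plot` and the scale labels are made up here, and for mixed data it deliberately returns no single answer, since the right set of graphics depends on the dataset.

```python
def suggest_plot(scales):
    """Suggest a plot type from the scales of the variables.

    `scales` maps a variable name to "continuous" or "categorical"
    (hypothetical labels chosen for this sketch).
    """
    kinds = set(scales.values())
    if not scales or kinds - {"continuous", "categorical"}:
        raise ValueError("every variable needs a scale of 'continuous' or 'categorical'")
    if kinds == {"continuous"}:
        return "scatterplot matrix (SPLOM)"
    if kinds == {"categorical"}:
        return "mosaic plot"
    # Mixed scales: no single default, several linked plots are usually needed.
    return "mixed scales: combine several plots"


# The four measurements of the iris data are all continuous.
iris = {
    "Sepal.Length": "continuous",
    "Sepal.Width": "continuous",
    "Petal.Length": "continuous",
    "Petal.Width": "continuous",
}

# The Titanic data consists of categorical variables only.
titanic = {
    "Class": "categorical",
    "Sex": "categorical",
    "Age": "categorical",
    "Survived": "categorical",
}

if __name__ == "__main__":
    print(suggest_plot(iris))
    print(suggest_plot(titanic))
```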
Robert has a very long and profound post on this chart:

The whole interactive thing can be found here on the GE site. It seems to be a bit of a provocation that Ben Fry’s company uses a tattered pie chart to visualize the data, which is definitely better visualized in a line chart (i.e., a time series with age as the time axis). There is the suspicion that the radius is proportional to the quantities, which would really take it to the top … Unfortunately we don’t have the data at hand to give an improved version - does anyone want to take on the burden of noting all the values by hand?
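Why a radius proportional to the quantities would be a problem: the area of a wedge grows with the square of its radius, so the visual impression exaggerates differences. The following Python sketch only illustrates this geometric fact under the assumption that the suspicion is true; the function names and the 30 degree angle are arbitrary choices for this example, not properties of the GE chart.

```python
import math


def wedge_area(radius, angle_deg):
    """Area of a circular sector with the given radius and opening angle."""
    return 0.5 * radius ** 2 * math.radians(angle_deg)


def area_ratio(value_a, value_b, angle_deg=30.0):
    """Ratio of wedge areas when the radius is set proportional to the value."""
    return wedge_area(value_a, angle_deg) / wedge_area(value_b, angle_deg)


if __name__ == "__main__":
    # A value twice as large is drawn with four times the area.
    print(area_ratio(2.0, 1.0))
```

To keep areas proportional to the values, the radius would have to be proportional to the square root of the value instead.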
Apart from all technical criticism - which includes the cute but completely useless animation - there is the fundamental chicken and egg question:
A good visualization should tell us a story about the data that we didn’t know before, and not the other way round, i.e., once you know the story, you create a visualization around it.
I actually have a hard time finding a story here at all …
Posted on 10/31/2009, 21:50, by martin, under Books.
As statisticians we are used to the fact that we have a hard time analyzing data where we lack the knowledge of the background. Election data are a common target in statistical investigations and some of us cannot be stopped from talking about red and blue states over and over again.
Having won this book at this year’s ASA statistical graphics and computing mixer at the JSM, I was quite surprised that there is so much more to election data than understanding the election process.
Most of my enlightenment was more on the negative side, and I felt lucky that German elections still lack a lot of the negative campaigning which has been common in the US for decades now.
In any case, I found the book as interesting as it is entertaining - well worth reading!