{"id":1875,"date":"2014-08-30T14:20:50","date_gmt":"2014-08-30T13:20:50","guid":{"rendered":"http:\/\/www.theusrus.de\/blog\/?p=1875"},"modified":"2014-08-30T14:20:50","modified_gmt":"2014-08-30T13:20:50","slug":"big-data-not-without-my-stats-textbook","status":"publish","type":"post","link":"https:\/\/www.theusrus.de\/blog\/big-data-not-without-my-stats-textbook\/","title":{"rendered":"Big Data: Not without my Stats Textbook!"},"content":{"rendered":"<p>Google is certainly the world champion in collecting endless masses of data, be it search terms, web surfing\u00a0preferences, e-mail communication, social media posts and links, &#8230;<\/p>\n<p>As a consequence, at Google they are not only masters of statistics (hey, my former boss at AT&amp;T Labs who was heading statistics research went there!) but they also need to know how to handle Big Data &#8211; one might believe. But as with all big companies, there are &#8220;those who know&#8221; and &#8220;those who do&#8221;, and unfortunately the two groups are often not identical.<\/p>\n<p>So, &#8220;those who do&#8221; at Google built\u00a0<a href=\"http:\/\/www.google.com\/trends\/correlate\" target=\"_blank\">Google Correlate<\/a>, a simple tool that correlates search terms. To start with an example (all based on searches originating in Germany), let&#8217;s look at what correlates with &#8220;VW Tiguan&#8221;:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" src=\"http:\/\/www.theusRus.de\/Blog-files\/CorrelateTSlong.png\" alt=\"\" width=\"575\" height=\"423\" \/><\/p>\n<p>With a correlation of 0.894, it is the fourth-highest-ranking correlation; I left out &#8220;Tiguan&#8221; and &#8220;Volkswagen Tiguan&#8221; as well as &#8220;MAN TGX&#8221; (which all relate to the term itself or to another car\/truck). <em>www.notebookcheck.com<\/em> is a German-language website about notebooks, which is entirely unrelated to the VW Tiguan. 
The corresponding scatterplot looks like this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" src=\"http:\/\/www.theusRus.de\/Blog-files\/CorrelateScatter.png\" alt=\"\" width=\"575\" height=\"403\" \/><\/p>\n<p>Besides the general problem of Big Data applications, namely making sense of what we have collected, we face two major problems &#8211; no matter what kind of data we are actually looking at:<\/p>\n<ul>\n<li>With millions to billions of records, even tiny differences usually become statistically significant when using classical statistical approaches<\/li>\n<li>The other way round, when looking for similarities, we tend to find things that &#8220;behave the same&#8221; although there is no causality at all, simply because of the sheer amount of data<\/li>\n<\/ul>\n<p>But what went wrong with Google Correlate? They certainly fell for the latter of the two problems listed above; the question is why.\u00a0First, there is the spurious correlation (see <a href=\"http:\/\/www.tylervigen.com\/\" target=\"_blank\">here<\/a> for a nice collection of similar causality-free time series), which is based solely on the common trend of the two time series. If you remove the trend from each series (I used a simple lowess smoother), the scatterplot looks like this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" src=\"http:\/\/www.theusRus.de\/Blog-files\/CorrelateScatterNo.png\" alt=\"\" width=\"548\" height=\"495\" \/><\/p>\n<p>with a correlation of 0.0025, i.e., no correlation. 
Looking more closely at the time series, it is quite obvious that, apart from the shared trend, there is no correlation whatsoever.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" src=\"http:\/\/www.theusRus.de\/Blog-files\/CorrelateTSshort.png\" alt=\"\" width=\"576\" height=\"425\" \/><\/p>\n<p>Enough Google-bashing for now; still, the data isn&#8217;t i.i.d., and the Pearson correlation coefficient is not an adequate measure of the similarity of two time series. In the end, it boils down to a rather trivial verdict: <strong>trust\u00a0your common sense and don&#8217;t forget what you have learned in your statistics courses!<\/strong><\/p>\n<p>(By the way, try searching for &#8220;Edward Snowden&#8221; in Google Correlate &#8211; it\u00a0appears the name has\u00a0been censored.)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Google is certainly the world champion in collecting endless masses of data, be it search terms, web surfing\u00a0preferences, e-mail communication, social media posts and links, &#8230; As a consequence, at Google they are not only masters of statistics (hey, my former boss at AT&amp;T Labs who was heading statistics research went there!) 
but they also [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17,1],"tags":[],"class_list":["post-1875","post","type-post","status-publish","format-standard","hentry","category-big-data","category-general"],"_links":{"self":[{"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/posts\/1875","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/comments?post=1875"}],"version-history":[{"count":13,"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/posts\/1875\/revisions"}],"predecessor-version":[{"id":1890,"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/posts\/1875\/revisions\/1890"}],"wp:attachment":[{"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/media?parent=1875"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/categories?post=1875"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.theusrus.de\/blog\/wp-json\/wp\/v2\/tags?post=1875"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}