Sat, 23 Mar 2019 08:13:20 -0700

Tags: academic

Vancouver has a great machine learning paper reading meet-up aptly named Learn Data Science. Every two weeks, the group gets together and discusses a paper voted on by the group at the previous meeting. There are no presentations, just going person by person around the room discussing the paper. I find the format very positive and I try to attend whenever the selected paper is aligned with my interests and I have time (lately I have not been able to attend much, as I have been writing a feature engineering book, but I'll blog about that another day). At the last meet-up they picked a paper I selected, so this blog post is a short summary of the paper, to help people reading it.

In 2005, Prof. Leo Breiman (of Random Forests fame) passed away after a long battle with cancer. He was 77 years old. This information comes from his obituary, which also highlights a very varied life that is captured by the paper for the Learn DS meetup of March 27th. Four years before passing away, when he was well established and well regarded in the field, he wrote Statistical Modeling: The Two Cultures, a journal article for Statist. Sci., Volume 16, Issue 3 (2001), 199-231. The paper presents the field at a pivotal moment in its history, as it was about that time that "Data Science" took off (for example, Columbia University started publishing The Journal of Data Science in 2003). The paper itself is written in a clear tone with an unambiguous message, but it was published together with four letters and a rejoinder. The subtleties in the letters and rejoinder are the most interesting material, particularly as the letter writers included well-known statisticians such as D.R. Cox and Emanuel Parzen. I will summarize this content in turn.

The central thesis of the main paper is that statisticians, as of 2001, cared more about picking an understandable model and deriving properties about it than about making sure the chosen model was a good surrogate of the process being modeled. He calls this the "data modeling culture". According to the paper, the fact that the model may be a poor approximation is seldom addressed and, if it is, model validation is done as a yes-no matter using goodness-of-fit tests and residual analysis. Because these checks are yes-no, models cannot be compared regarding how well they approximate the problem under study. In contrast, the other culture, the "algorithmic modeling culture", uses predictive accuracy on unseen data for model validation.
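To make the contrast concrete, here is a minimal sketch of validation by predictive accuracy on unseen data. Everything in it is my own illustration and not taken from the paper: the synthetic data (y = 2x plus noise), the two candidate models (a constant and a straight line) and the function names are all assumptions.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def compare_on_holdout(seed=0, n=200):
    rng = random.Random(seed)
    # Synthetic data, assumed for illustration only: y = 2x + Gaussian noise.
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

    # First half is used for fitting, second half is never seen while fitting.
    split = n // 2
    train_x, train_y = xs[:split], ys[:split]
    test_x, test_y = xs[split:], ys[split:]

    mean_y = sum(train_y) / len(train_y)
    a, b = fit_line(train_x, train_y)

    # One number per model, so the models can be ranked against each other.
    return {
        "constant": mse(lambda x: mean_y, test_x, test_y),
        "linear": mse(lambda x: a + b * x, test_x, test_y),
    }

if __name__ == "__main__":
    for name, err in compare_on_holdout().items():
        print(f"{name}: held-out MSE = {err:.2f}")
```

The point is that each model gets a comparable score on data it has not seen, rather than a pass/fail verdict from a goodness-of-fit test.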

The paper spends most of its energy explaining the algorithmic modeling culture (estimated by Breiman to be employed by 2% of statisticians in 2001) to statisticians unfamiliar with it. For people coming from computer science and/or machine learning, the material is obvious and describes accepted practices and expectations in the field: an algorithmic model may be complex and obscure, but models that offer better predictions are to be preferred. And even with all their obscurity, certain algorithmic models allow for limited interpretability beyond prediction. What is fascinating is the need to "sell" that point of view to the community. And indeed the paper is a hard sell, asking statisticians to care about more than model elegance and to address a broader, less clear-cut set of problems. The author's reasons for writing it are reflected in the conclusion of the rejoinder:

Many of the best statisticians I have talked to over the past years have serious concerns about the viability of statistics as a field. Oddly, we are in a period where there has never been such a wealth of new statistical problems and sources of data. The danger is that if we define the boundaries of our field in terms of familiar tools and familiar problems, we will fail to grasp the new opportunities.

Understanding his objectives in writing helps explain the confrontational tone, I believe. He starts by describing, in Sec. 3, two projects he worked on when he was a consultant. Both projects are described in detail and are fairly typical machine learning projects. He spends some time explaining why the number of variables and the characteristics of the data made them unsuitable for the usual statistical techniques or (in the case of Sec. 3.1) how attempting to use those techniques led to a failed project.

Sec. 4 is very short and focuses on how most statistical papers by 2001 started with the phrase *Assume that the data are generated by the following model:...*. It reminds me of the expression "let's assume a spherical cow", which clearly stresses a similar concern.

Sec. 5 then goes on to explain how the lack of care in making sure the model is a good surrogate of reality leads statisticians to confuse conclusions derived from the model with conclusions about the problem being (mis-)modeled. This is particularly concerning if these wrong conclusions then shape public policy and the direction of scientific research. The problem, says Sec. 5.2, is that model validation is done as a yes-no matter. Sec. 5.3 correctly states that there are plenty of models that explain the data; picking the right one should be a topic of study but, according to the author, it seldom is. Instead, in Sec. 5.4, the author argues for using predictive accuracy for a more discerning model validation.

Sec. 6 uncovers the main claim of the paper: that the data modeling community is held hostage by simple models that stop it from working on multivariate problems that lack a clear understanding of how the data are generated. These problems, however, have plenty of data and practical interest going for them. Interestingly, he concedes that:

Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature’s mechanism.

This is quite relevant for the LearnDS meetup, where I have enjoyed many MCMC papers in the past.

Sec. 7 describes how practitioners of machine learning do things and most probably can be skipped. He then proceeds to discuss the problems of multiplicity of good models (Sec. 8), the limitations of Occam's razor (simplicity vs. accuracy, Sec. 9) and the curse of dimensionality (Sec. 10). Multiplicity of good models is described as a source of instability and left as a mostly unsolved problem, but he points at bagging as a possible solution (not an unexpected answer from the creator of Random Forests). The discussion about Occam's Razor (a maxim that says to prefer simpler explanations) is quite insightful. The section equates simplicity with interpretability and revisits the blackbox/whitebox discussion by comparing the ease of interpretability of CART trees (Sec. 9.1) with the improved accuracy of Random Forests (Sec. 9.2-9.3). He concludes with the Occam dilemma:

Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors.
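The instability and bagging points can be sketched in a few lines. This is only an illustration under my own assumptions, not code from the paper: the base model is a one-split regression "stump" (a stand-in for a CART tree), the data are synthetic (y = x plus noise), and names such as `fit_stump` and `prediction_spread` are hypothetical.

```python
import random
import statistics

def mean(values):
    return sum(values) / len(values)

def fit_stump(points):
    """One-split regression tree on (x, y) pairs, chosen by least squared error."""
    pts = sorted(points)
    overall = mean([y for _, y in pts])
    # Fallback rule: predict the overall mean if no valid split exists.
    best_err, best_rule = float("inf"), (float("inf"), overall, overall)
    for i in range(1, len(pts)):
        if pts[i - 1][0] == pts[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pts[:i]]
        right = [y for _, y in pts[i:]]
        ml, mr = mean(left), mean(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if err < best_err:
            best_err = err
            best_rule = ((pts[i - 1][0] + pts[i][0]) / 2, ml, mr)
    thr, ml, mr = best_rule
    return lambda x: ml if x < thr else mr

def bagged(points, rng, n_models=20):
    """Average of stumps fitted on bootstrap resamples of the same data."""
    stumps = [fit_stump(rng.choices(points, k=len(points))) for _ in range(n_models)]
    return lambda x: sum(s(x) for s in stumps) / n_models

def prediction_spread(seed=0, repeats=30, n=40, x0=0.5):
    """Variance of the prediction at x0 across freshly drawn training sets."""
    rng = random.Random(seed)
    single, bag = [], []
    for _ in range(repeats):
        # Synthetic data, assumed for illustration: y = x + Gaussian noise.
        data = [(x, x + rng.gauss(0, 0.1)) for x in (rng.random() for _ in range(n))]
        single.append(fit_stump(data)(x0))
        bag.append(bagged(data, rng)(x0))
    return statistics.pvariance(single), statistics.pvariance(bag)

if __name__ == "__main__":
    single_var, bagged_var = prediction_spread()
    print(f"single stump variance: {single_var:.4f}")
    print(f"bagged stumps variance: {bagged_var:.4f}")
```

A single stump flips between two quite different predictions depending on where the split happens to fall, while the bagged average moves much less from one training set to the next. The price is the Occam dilemma quoted above: the averaged model is no longer a single readable rule.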

Regarding the curse of dimensionality, the problem is well understood by any machine learning practitioner, but I was not aware that it was anathema to statisticians. Aggressive feature selection (variable selection) seems to be mandated for them, according to the paper. Breiman tries to challenge that view by explaining support vector machines to this community. The explanation is thorough, but it can definitely be skipped by machine learning practitioners.
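For readers who have not run into the curse of dimensionality, one common face of it is that distances between random points become nearly indistinguishable as dimensions are added. The sketch below is my own choice of illustration, not something taken from the paper; the uniform data and the `distance_contrast` helper are assumptions.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance (Python 3.8+)
    return (max(dists) - min(dists)) / min(dists)

if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(f"dim={dim:>4}: contrast = {distance_contrast(dim):.2f}")
```

As the contrast shrinks, "nearest" and "farthest" points look almost the same, which is part of why methods that rely on local neighbourhoods struggle with many variables.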

Sec. 11 is a little self-serving: it tries to "rally the troops" around the interpretability of Random Forests feature importance. It compares these importances to published work by Efron and others (Efron is one of the scientists who commented on the article) using logistic regression and finds that Breiman's approach is better at finding the true signal in the data. This is later contested by Efron, but the evidence presented seems to side with Breiman. He thereby shows that Random Forests can beat statistical methods on their own turf. Sec. 11.3 then shows how they can be applied to DNA microarrays, whose large number of features puts them out of scope for statistical methods. These examples are supposed to whet the appetite of statistics practitioners for the methods and possibilities of the algorithmic modeling community. This section can also be skipped by machine learning practitioners.
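To give a flavour of how a black-box model can still report which variables matter, here is a minimal sketch of permutation importance: shuffle one feature in held-out data and see how much accuracy drops. As I understand it, this is the idea usually associated with Random Forests importances, but the sketch makes several assumptions of its own: a small k-nearest-neighbour classifier stands in for the forest, the data are synthetic with only feature 0 carrying signal, and all function names are hypothetical.

```python
import random

def knn_predict(train, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, test):
    return sum(knn_predict(train, x) == y for x, y in test) / len(test)

def permutation_importance(train, test, n_features, rng):
    """Drop in held-out accuracy when one feature column is shuffled."""
    baseline = accuracy(train, test)
    importances = []
    for j in range(n_features):
        column = [x[j] for x, _ in test]
        rng.shuffle(column)  # breaks the link between feature j and the label
        shuffled = [(x[:j] + [v] + x[j + 1:], y) for (x, y), v in zip(test, column)]
        importances.append(baseline - accuracy(train, shuffled))
    return importances

def demo(seed=0):
    rng = random.Random(seed)

    def draw(n):
        rows = []
        for _ in range(n):
            x = [rng.random() for _ in range(3)]
            rows.append((x, 1 if x[0] > 0.5 else 0))  # only feature 0 matters
        return rows

    train, test = draw(300), draw(150)
    return permutation_importance(train, test, 3, rng)

if __name__ == "__main__":
    for j, imp in enumerate(demo()):
        print(f"feature {j}: accuracy drop = {imp:.3f}")
```

Shuffling the informative feature should hurt accuracy badly, while shuffling the two noise features should barely change it, so the ranking recovers the signal without the model ever being "interpretable" in the classical sense.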

The paper concludes with:

Over the last ten years, there has been a noticeable move toward statistical work on real world problems (...) this trend (...) has to continue if we are to survive as an energetic and creative field.

The main "meat" of the paper is D.R. Cox's comment and Breiman's rejoinder. I will not summarize the comment, but it is only three pages long and truly worth reading. His criticism seems to embody a whole set of objections most probably shared by many statisticians when hearing about the work done by machine learning people. Sir David Cox is British and the wording is polite and nuanced, making it difficult to read (at least for me); reading Breiman's rejoinder made it easier to follow. Just consider that the comment starts by saying:

He has combined this with a critique of what, for want of a better term, I will call mainstream statistical thinking, based in part on a caricature. Like all good caricatures, it contains enough truth and exposes enough weaknesses to be thought-provoking.

Efron's comments can be summarized by reading Breiman's rejoinder directly. The other two comments are mostly in agreement and only add to the material by highlighting the commenters' relevant experience or ideas. The discussion by Dr. Hoadley of the work on GAMs for interpretable scoring models at Fair, Isaac in the 60s and 70s is very long and detailed, but it also gives an incredible historical perspective on the problem.

All in all, I found the paper has changed my views on the topic. Before, I was inclined to agree with the meme that the "10-year challenge" for 2019 machine learning is 2009 statistics by another name. I can now see there is much more than meets the eye.