Lecture 3
-
citation: Tukey, John W. "The future of data analysis." The Annals of Mathematical Statistics 33.1 (1962): 1-67.
-
reading: ONLY SECTIONS 1-11, 33, and 43-END are required
-
Questions:
- stats: math vs science
- role of universities
- world as is vs should be
We will come back to these questions later with Breiman's "Statistical Modeling: The Two Cultures" and Cleveland's "Data Science", both 2001.
- citation: Diaconis, Persi. "Theories of data analysis: from magical thinking through classical statistics." In Exploring Data Tables, Trends and Shapes. Edited by D. Hoaglin, F. Mosteller, and J. Tukey. 1-36. New York: Wiley, 1985.
- reading: PAGES 1-9 AND 31-END
-
citation: Chambers, John M. "Greater or lesser statistics: a choice for future research." Statistics and Computing 3.4 (1993): 182-184.
-
Questions:
- what is CS not able to do?
- role of universities
-
citation: Tukey, John W. Exploratory Data Analysis. Addison-Wesley, 1977.
-
context: This is Tukey's 1977 textbook, written about six years before PCs became widely available. It's VERY long. Just glance through and look at some of his suggestions for work to do with pen and paper--and with computers, once they became available. Extremely valuable today!
-
readings: please focus on 3 sections:
- section 1a, pp1-3 Note:
- emphasis on subjectivity
- emphasis on domain expertise (a term engineers, particularly software engineers, use for knowledge of a specific application area that makes engineering tools more successful; for example, the expertise that helped statisticians decide which aspects of a country would be more or less useful for a king to know about. The role of domain expertise in machine learning is a topic of current debate: successful engineering results in 'deep learning' seem to suggest that, in some application areas, we need no prior knowledge of the "best features".)
- early distinction between exploratory (w/o "model") and confirmatory (w/model, esp. with p-values)
- section 2c, pp39-43 Background:
- the "5 numbers" JWT mentions are:
- high extreme
- high hinge (usually interpreted as quartile; defined graphically on p33)
- median
- low hinge (usually interpreted as quartile; defined graphically on p33)
- low extreme
Note:
- the emphasis on the mechanics of EDA, even down to the paper he used, in the absence of software
- section 19B, pp623-625 Note:
- immediately dispels the idea that the Gaussian is a "law to which data must adhere"
- pp624-625: tables of calculated numbers were more common before personal computers
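For anyone who wants to try the "5 numbers" from section 2c on a computer rather than on paper, here is a minimal sketch. It assumes the commonly cited depth rule for hinges (median depth = (n + 1) / 2, hinge depth = (floor(median depth) + 1) / 2, counted in from each end); please check it against the graphical definition on p33 rather than treating it as Tukey's exact procedure. The function names `value_at_depth` and `five_number_summary` are hypothetical, chosen only for this illustration.

```python
from math import ceil, floor


def value_at_depth(sorted_values, depth):
    """Value at a 1-based depth counted from the low end.

    A half-depth such as 2.5 is read as the average of the
    values at depths 2 and 3.
    """
    lower = sorted_values[floor(depth) - 1]
    upper = sorted_values[ceil(depth) - 1]
    return (lower + upper) / 2


def five_number_summary(values):
    """Return the extremes, hinges, and median of a batch of numbers.

    Assumption: hinges follow the depth rule stated in the text
    above this block, which is one common reading of Tukey's hinges.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")

    median_depth = (n + 1) / 2
    hinge_depth = (floor(median_depth) + 1) / 2

    return {
        "low_extreme": xs[0],
        "low_hinge": value_at_depth(xs, hinge_depth),
        "median": value_at_depth(xs, median_depth),
        # The high hinge sits at the same depth, counted from the top.
        "high_hinge": value_at_depth(xs, n + 1 - hinge_depth),
        "high_extreme": xs[-1],
    }


if __name__ == "__main__":
    batch = [7, 1, 4, 9, 3, 8, 2, 6, 5]
    for name, value in five_number_summary(batch).items():
        print(f"{name}: {value}")
```

Note that hinges computed this way can differ slightly from the quartiles returned by statistical software, which is one reason the list above says hinges are only "usually interpreted as" quartiles.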
A couple of things to think about as you read these three pieces:
-
In a historical and philosophical vein, think about how you would do the sort of work that Desrosières did with the early modern statisticians: what are the different visions of statistical work these authors present? Desrosières uses controversies to help articulate the stakes of different forms of scientific practice--can you do that here? How do these authors portray those they are arguing against?
-
In a more instrumentalist vein, please bring in at least one form of data analysis that you've gleaned from these papers and that you might like to do, or to figure out how to do, in class.