Syllabus
Matthew L. Jones (A&S) and Chris Wiggins (SEAS)
Data and data-empowered algorithms now shape our professional, personal, and political realities. This course introduces students both to critical thinking and practice in understanding how we got here, and the future we now are building together as scholars, scientists, and citizens.
The intellectual content of the class will comprise
- the history of human use of data;
- functional literacy in how data are used to reveal insight and support decisions;
- critical literacy in investigating how data and data-powered algorithms shape, constrain, and manipulate our commercial, civic, and personal transactions and experiences; and
- rhetorical literacy in how exploration and analysis of data have become part of our logic and rhetoric of communication and persuasion, especially including visual rhetoric.
While introducing students to recent trends in the computational exploration of data, the course will survey the key concepts of "small data" statistics.
All students will be required to:
- write one 750-word op-ed on the ethics and practice of using data by midterm (20%)
- respond to readings each week on Slack (15%)
- participate in laboratory hours including posting to Slack as assigned during class (5%)
Students will be assigned to one of two tracks, typically based on their major. Students with less technical majors will do more technical work, including problem sets; students with more technical background will do more humanistic work, including longer writing assignments. Students whose track is ambiguous are encouraged to clarify it with the instructors and TAs before the first assignment is due.
a) more technical background track (60%)
- write a 15 pp paper on a topic of their choice
- complete 3 problem sets (i.e., the last 3 problem sets: #3, #4, and #5)
b) more humanistic background track (60%)
- write a 10 pp paper on a topic of their choice
- complete 5 problem sets; these problem sets will involve both computational work and writing work
Tentative schedule for homeworks:
- hw1: assigned feb 04, due 11:59 pm NYC time on feb 18
- hw2: assigned feb 18, due 11:59 pm NYC time on mar 04
- hw3: assigned mar 04, due 11:59 pm NYC time on mar 25
- hw4: assigned mar 25, due 11:59 pm NYC time on apr 08
- hw5: assigned apr 08, due 11:59 pm NYC time on apr 22
Tentative and subject to change
2022-01-19
(See Slides)
Lecture 1 was an overview of the class, with some setting of stakes; for Lecture 2 we'll dive into some examples of writing that has had an impact in drawing people's attention to unintended consequences of a reality shaped by data-empowered algorithms:
- Hanna Wallach (2014, December). Big data, machine learning, and the social sciences: Fairness, accountability, and transparency. In NeurIPS Workshop on Fairness, Accountability, and Transparency in Machine Learning. Available via https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d . Dr. Wallach ( http://dirichlet.net/about/ ) is a former CS professor now working in NYC at Microsoft Research. She's been a leader both in machine learning research and the emerging discipline of computational social science. This piece is an early example of technologists beginning to question data and propose a new research field.
- danah boyd and Kate Crawford. "Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon." Information, Communication & Society 15, no. 5 (2012): 662-679. Available via https://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878 .
- 14 very readable pages from 2014 by Zeynep Tufekci on tech & politics. This will wrap up our "setting the stakes" readings on our data-driven present. Tufekci, Zeynep. "Engineering the public: Big data, surveillance and computational politics." First Monday 19, no. 7 (2014). Available via https://firstmonday.org/ojs/index.php/fm/article/view/4901/4097.
This week we start the "historical" view, centered on readings from the 18th and 19th century. We'll have one primary text and two secondary texts. (We'll also have one optional text from Florence David which is encouraged!)
- Ch 3 of our book draft: please send feedback! Comments, questions, suggestions welcome!
- Gigerenzer, Gerd, ed. The Empire of Chance: How Probability Changed Science and Everyday Life. Ideas in Context. Cambridge: Cambridge University Press, 1989, Section 1.6 ("Risk and Insurance"). Eight very quick-moving and readable pages from a very quick-moving and readable book.
- Quetelet, Adolphe. "Preface" and "Introductory," A Treatise on Man (1842) (full book available via https://ia801409.us.archive.org/27/items/treatiseonmandev00quet/treatiseonmandev00quet_bw.pdf ). A "game changer", one would now call it. Enjoy!
We're assigning these as optional not because they're not great -- they are great! -- but because there's no clear "best" part of them, which makes them a bit long in total:
- Porter, Theodore. The Rise of Statistical Thinking, 1820-1900 (Princeton, N.J.: Princeton University Press, 1986), chap. 2 (40-70) + 100-109. Porter (one of the authors of The Empire of Chance) provides lots of context around Quetelet's role in shaping our thinking about data, people, and policy.
- David, F. N. (1980). Adolphe Quetelet: Prophet of the New Statistics (No. CU-SL-80-02-ONR). California Univ Berkeley Statistical Lab. Available via https://apps.dtic.mil/dtic/tr/fulltext/u2/a084771.pdf
Florence Nightingale David ( https://en.wikipedia.org/wiki/Florence_Nightingale_David ) is amazing, a great statistician and a great writer. Very funny and sardonic. Please do at least skim this one for a secondary and historical look at Quetelet by a statistician. We'll talk more about her later this term.
- Ch 3 of book draft, "the statistics of the deviant" (25pp)
- Stephen J. Gould, Mismeasure of Man, ch. 3. Gould brings a sword in this chapter. Great stuff. Opens with a bang, doesn't let up for 40 pages. Warning that he deals head on with the racism of quantitative research of the late 19th c.
- Desrosières, Alain. "Correlation and the Realism of Causes," in The Politics of Large Numbers: A History of Statistical Reasoning. Cambridge, Mass.: Harvard University Press, 1998.
- excerpt from ch 4 on Galton, pp. 112-127
I have to warn you: Desrosières is not messing around. The book is now a standard in the history of stats, but it's pretty scholarly, i.e., dense. Ch 4 is so good, though, that we had to assign it, or at least the Galton part of it.
Great references as well, if you're inspired to dig in, please do enjoy!
- Ch 5 of Nancy Stepan. The Idea of Race in Science: Great Britain, 1800–1960. London: Macmillan, 1982. (Excerpt): Galton and racism in more context
- From Galton: "Regression Towards Mediocrity in Hereditary Stature". The Journal of the Anthropological Institute of Great Britain and Ireland. 15: 246–263. 1886. doi:10.2307/2841583. JSTOR 2841583. There are many Galton papers we could assign; if you're interested, please do check out some of his other primary references!
Our reading for Tuesday moves from Galton's pondering of proxies for "greatness" to direct attempts to define, quantify, and rank intelligence, with direct policy implications:
- Ch 4 of the book
- Gould, Stephen Jay. The Mismeasure of Man. WW Norton & Company, 1996. ONLY pp. 280-2, 286-288, 291-302, 347-350. This is Gould's treatment of Spearman, plus a 3-page addendum about a late 20th century revival, but minus two mathy bits (which is why I list the pages in 4 chunks).
Spearman invented PCA in order to come up with a single number representing "general intelligence". If of interest, see:
- https://en.wikipedia.org/wiki/G_factor_(psychometrics)
- https://en.wikipedia.org/wiki/Charles_Spearman
- https://en.wikipedia.org/wiki/Principal_component_analysis
for more.
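If a concrete toy helps, here is a minimal numerical sketch of the underlying move: collapsing several test scores into the single direction of greatest shared variance. This is an illustration with simulated numbers, not Spearman's actual 1904 procedure (which worked from correlations among school subjects).

```python
# Toy sketch only: summarize several made-up test scores with the first
# principal component, the modern version of a single "g"-like number.
import numpy as np

rng = np.random.default_rng(0)
n_students = 200
latent = rng.normal(size=n_students)  # hypothetical shared factor
# four noisy "tests" that each partly reflect the shared factor
scores = np.column_stack([latent + rng.normal(scale=s, size=n_students)
                          for s in (0.5, 0.8, 1.0, 1.2)])

X = scores - scores.mean(axis=0)          # center each test
cov = np.cov(X, rowvar=False)             # covariances between tests
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
first_pc = eigvecs[:, -1]                 # direction of largest variance
summary = X @ first_pc                    # one number per student

print("share of variance captured by the single summary:",
      round(eigvals[-1] / eigvals.sum(), 2))
print("correlation of the summary with each test:",
      np.round([np.corrcoef(summary, scores[:, j])[0, 1] for j in range(4)], 2))
```

Reducing people to one number is exactly the move Gould criticizes; the sketch only shows how mechanically easy the reduction is, not that it is justified.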
- pp. 272-277 ONLY of: Spearman, Charles. "'General Intelligence,' objectively determined and measured." The American Journal of Psychology 15, no. 2 (1904): 201-292 (available at https://web.archive.org/web/20140407100036/http://www.psych.umn.edu/faculty/waller/classes/FA2010/Readings/Spearman1904.pdf )
- Freedman, David. From association to causation: some remarks on the history of statistics. Journal de la société française de statistique, Volume 140 (1999) no. 3, pp. 5-32. http://www.numdam.org/item/JSFS_1999__140_3_5_0/ Freedman ( https://en.wikipedia.org/wiki/David_A._Freedman ) was a great statistician as well as an expositor and historian of statistics. He writes so well! In this piece please focus on the parts about Yule (Sec 4, pp 11-14), but really the whole thing is great and sets you up well for the next several weeks (and life in general!)
- Yule's actual paper: Yule, G. (1899). An Investigation into the Causes of Changes in Pauperism in England, Chiefly During the Last Two Intercensal Decades (Part I.). Journal of the Royal Statistical Society, 62(2), 249-295. doi:10.2307/2979889
(or at least read footnote 25: "Strictly, for 'due to' read 'associated with.' Changes in proportion of old and in population were much the same in the two groups in all cases. The assumption could not have been made, I imagine, had the kingdom been taken as a whole.")
- Our reading of Gould, from 1996, might give you the impression that reifying IQ was a problem 25 years ago but is all better now. It's not. A scholarly, contemporary, historically-informed look at this among other current problems of data and race is Angela Saini's 2019 work Superior: The Return of Race Science. Optional reading for this lecture is Ch 5.
For more info https://en.wikipedia.org/wiki/Superior:_The_Return_of_Race_Science
- Whereas British statisticians focused on issues of class, statisticians in post-Reconstruction America went to work documenting deep divides between the races, providing empirical grounds for the world of Jim Crow. Figures such as Du Bois carefully tore apart their claims to scientific exactitude. For more on this please see Ch 2 of Muhammad, Khalil Gibran. The Condemnation of Blackness: Race, Crime, and the Making of Modern Urban America, with a new preface. Harvard University Press, 2019.
For more info https://en.wikipedia.org/wiki/Khalil_Gibran_Muhammad
Our reading for Tuesday moves from Spearman's mathematical definition of intelligence, and Yule's work implicitly applying mathematical models to policy decisions, to an explicitly mathematical "baptism" of statistics, applied directly both to decisions and to defining what is true, what is proven, and what is "science."
- The Book, Ch 5: "data's mathematical baptism"
- "Surrogate Science: The Idol of a Universal Method for Scientific Inference" (2015), about the mathbattle between Fisher and Neyman. I'm a huge fan of Gerd Gigerenzer ( https://en.wikipedia.org/wiki/Gerd_Gigerenzer ); over the years we have assigned many parts of his book The Empire of Chance. This is a review article from not so many years ago; to be clear, he's written many times in many ways on this topic, so there are many such reviews by Gerd we could have chosen, all with slightly differing emphases.
Next, two documents from the two belligerents, summarizing the bitter mathbattle:
- Fisher, R. A. "Scientific thought and the refinement of human reasoning." (1960), also available online here.
- Neyman, J. (1961). Silver jubilee of my dispute with Fisher, also available online here.
- Box, Joan Fisher. “Guinness, Gosset, Fisher, and Small Samples.” Statistical Science 2, no. 1 (1987): 45–52.
Joan was one of Fisher's 6 children with Eileen Guinness. She married the statistician George Box, now best known for the quip "All models are wrong, but some are useful," which you should keep in mind while thinking about hypothesis testing. This brief piece gives context into Fisher's writing and his half-century fight with Neyman.
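For readers who want the disagreement in miniature, here is a small illustrative sketch, with invented data and a normal approximation rather than an exact t-test, of the two framings: Fisher reporting a p-value as graded evidence, Neyman-Pearson reporting only a reject/do-not-reject decision at a pre-set alpha.

```python
# Toy contrast of the two camps' framings (invented data; normal approximation
# used only to keep the sketch dependency-free -- a t-test would be more exact).
import math

sample = [1.2, 0.4, -0.3, 0.9, 1.7, 0.8, 0.1, 1.1]  # hypothetical measurements
n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
z = mean / (sd / math.sqrt(n))  # test statistic for H0: true mean == 0

# two-sided tail probability under the standard normal
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"Fisher-style report:         p = {p_value:.4f} (graded evidence against H0)")
alpha = 0.05  # fixed in advance, Neyman-Pearson style
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(f"Neyman-Pearson-style report: {decision} at alpha = {alpha}")
```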
- A special treat: a preprint "Inference Rituals: Algorithms and the History of Statistics" (i.e., please do not distribute) from Chris Phillips, CMU's history department. Great summary with lots of citations.
- Context: As promised, this week we go somewhat back in time, from the "bratty" mathbattle of 1935-1960 in the academy to the actual behind-the-fence birth of computational statistics at Bletchley Park during WWII.
- Modern-day relevance: This is the birth of digital computation!! And computation -- including that driving the device on which you're reading these words -- was born of data; that is, the first arena of attack for building digital computers was breaking codes (data) using statistical methods (what we would now call a Bayesian inference problem), along with abundant "shoe leather" work (domain expertise). Completely breaking from the academic tribe setting the tone and values for making sense of the world through data, martial and industrial (as contractors to the military) concerns and skills are the primary movers at this point through the present day. I claim that these readings reveal the origin of modern data science --- applied computational statistics, driven by concerns of a domain of application, including engineering concerns, rather than being driven by mathematical rigor or scientific hypotheses.
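To make the phrase "Bayesian inference problem" a bit more concrete, here is a minimal sketch of sequential Bayesian updating via accumulated weight of evidence, the style of reasoning behind Bletchley's statistical attacks. This toy is not Banburismus, and every number in it is invented.

```python
# Toy sequential Bayesian updating: odds are multiplied by a likelihood ratio
# for each piece of evidence; Turing measured the log of these ratios in
# "decibans". All numbers below are made up for illustration.
import math

prior_odds = 1 / 100  # hypothetical prior odds that two message settings match
# hypothetical likelihood ratios P(observation | match) / P(observation | no match)
likelihood_ratios = [2.0, 1.5, 0.8, 3.0, 2.2]

weight_db = 10 * math.log10(prior_odds)   # weight of evidence, in decibans
for lr in likelihood_ratios:
    weight_db += 10 * math.log10(lr)      # each observation adds its own weight

posterior_odds = 10 ** (weight_db / 10)
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"accumulated weight of evidence: {weight_db:.1f} decibans")
print(f"posterior probability of a match: {posterior_prob:.3f}")
```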
- Ch 6 of The Book: "data at war"
- "Bayes Goes to War", part of Ch 4 of The Theory That Would Not Die (2011), a popular book written by the science writer Sharon Bertsch McGrayne. (Please also read the 2-page dessert, ch 5.) While you're reading this, consider:
- the scientific and intellectual networks the characters came from: computing was another "trading zone", bringing together hardware engineers and mathematicians, but no statisticians. The vigorous academic debate we saw raging in "mathematical statistics" 1935-1960 is absent from the dawn of computing with data
- the role of hardware and physical labor vs. mathematics and philosophy, different from prior authors
- the focus on a job to be done (especially "decisions") rather than a scientific inquiry ("truth")
- the marked break-point in how we deal with data as it becomes a digital computation, with specialized hardware
McGrayne covers only briefly two things that are going to be important:
- the relation between computation and data on this side of the Atlantic, and
- the role of hidden, gendered labor
To that end, let's read
- "Breaking Codes and Finding Trajectories: Women at the Dawn of the Digital Age”, Ch 1 of “Recoding Gender: Women’s Changing Participation in Computing” (2012) By Janet Abbate, a professor at Virgina Tech.
Of this, please focus on
- pp. 14-16,
- pp. 21-22,
- bottom of 26-29, and
- pp. 33-35.
While you're reading this, consider
- the role of physical labor and how it is valued
- the biases about the different skills needed
- the marked break-point in how we deal with data as it becomes a digital computation, with specialized hardware
- The above give some view of the role of math, hardware, and physical labor at the UK dawn of digital computing -- which was computing with data! However, this doesn't shed light on the "special relationship" between UK intelligence and US corporations -- the corporate contractors who participated in the early military-industrial complex. As we will see in the remaining weeks, these companies dominated what would become data science, particularly Bell Labs here in NYC (after the war Bell moved to NJ). In light of that, please enjoy this breezy excerpt from Hodges's biography of Alan Turing.
While you're reading this, consider
- the "scaling up" of computing with data
- the marked break-point in how we deal with data as it becomes a digital computation, with specialized hardware and....
- who has the infrastructure to build and maintain such hardware? Communications (e.g., postal service in the UK, AT&T / Bell labs in the US) feature prominently in this chapter and in succeeding weeks!
This week's reading is about another present-shaping, history-changing innovation from Alan Turing (in addition to computation itself): artificial intelligence.
- Ch 7 of book: "intelligence without data"
- Secondary reading: "Artificial Intelligence" by Stephanie Dick https://hdsr.mitpress.mit.edu/pub/0aytgrau
Context, including the primary literature:
- Alan Turing: "Computing machinery and intelligence." Mind 59, no. 236 (1950): 433. ( https://en.wikipedia.org/wiki/Computing_Machinery_and_Intelligence ) In these few pages, Turing lays out a plan for what would later be called AI, and thinks through the necessary hardware, the computation, the capabilities, and even many of the critiques and doubts people would raise about the possibility of AI.
There are two ways to interpret the many connections between this almost 70-year-old document and the present: the first is to say "wow! Turing was so prescient to have realized back then everything that would happen for the next 50 years!" The second is to say "wow! For the next 50 years all people did was execute the plan laid out in this document!" Either way you can see in this document what would be the future (and our present!)
questions to ask yourself:
- What was "the Turing test" initially? How did this relate to Turing's biography?
- What objections did he expect against AI? Did he address them well?
- Can you tell that there are multiple ways to achieve AI in this work?
- McCarthy, John, Marvin Minsky, Nathaniel Rochester, and Claude Shannon. "A proposal for the Dartmouth summer research project on artificial intelligence, August 1955." (1955). This is the only reading this term that is actually a funding proposal, not a scholarly work, but it has had tremendous historical impact. It first introduces the phrase "artificial intelligence" itself. By McCarthy's own admission:
"I invented the term artificial intelligence. I invented it because ...we were trying to get money for a summer study in 1956...aimed at the long term goal of achieving human-level intelligence." -- JCM, during the Lighthill debate [1]
There are two ways to interpret the many connections between this 64-year-old document and the present: the first is to say "wow! They were so prescient to have realized back then everything that would happen for the next 50 years!" The second is to say "wow! For the next 50 years all people did was execute the plan laid out in this document!" Either way you can see in this document what would be the future (and our present!)
questions to ask yourself:
- What part of their goal overlaps with what we now think of as AI?
- What parts relate to CS as a field?
- Which goals have we not yet achieved?
- The backlash against AI -- sometimes called "the first AI winter" -- accelerated in the early 1970s. A well-documented example from the UK is "the Lighthill report". There are copious assets from and reactions to this report, and it is a great illustration of how incumbents in the scientific establishment, including Lighthill himself [2], had a role in smiting the upstart field of AI. If you have time I strongly encourage you to watch the videos (see "6", below) of the televised debate itself, just amazing stuff, featuring a who's who of British AI work [3] along with McCarthy himself, trying to defend the nascent field. In prior years we've assigned the report along with reactions; please for Tuesday just read the report itself. A note: Don't expect it to make sense. It's a rant by someone who had been cramming on AI for 3 months before pontificating. He gets plenty wrong.
Let's just read JCM's reaction, written in 1973 (though typeset in 2000; ignore the year):
questions to ask yourself:
- What did Lighthill get right? What did he get wrong?
- How well do you think he argued against AI as a field?
- Professor Sir James Lighthill, FRS. "Artificial intelligence: a paper symposium. In: Science Research Council, 1973." (1974): 317-322.
- Video of the full debate: Lighthill, J. "BBC TV–June 1973 'Debate at the Royal Institution'" https://www.youtube.com/watch?v=03p2CADwGF8
- Pamela McCorduck (2009). Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. AK Peters/CRC Press.
- 8.1 Chapter 5, "The Dartmouth Conference": McCorduck is an unusual secondary source in that she knows personally almost all the people she writes about -- unparalleled access to the minds and interests of the participants.
questions to ask yourself: What were the interests and goals of the participants?
- 8.2 Chapter 9, "L’Affaire Dreyfus": The backlash wasn't just in the UK, of course. A prominent example is the book Dreyfus, Hubert L. "What computers can't do." (1972), discussed at length by Pamela McCorduck.
[1] Lighthill, J. "BBC TV–June 1973 ‘Debate at the Royal Institution" https://www.youtube.com/watch?v=pyU9pm1hmYs&t=266s , 3'00"
[2] https://en.wikipedia.org/wiki/James_Lighthill, who was Lucasian Professor of Mathematics ( https://en.wikipedia.org/wiki/Lucasian_Professor_of_Mathematics )
[3] One of the audience members called upon to speak ( https://youtu.be/3GZWFnWOqkA?t=407 ) is Chris Strachey https://en.wikipedia.org/wiki/Christopher_Strachey, whose whole life story is amazing, including
- following our read of Turing's life: "At the end of his third year at Cambridge, Strachey suffered a nervous breakdown, possibly related to coming to terms with his homosexuality."
- He wrote a love-letter generating machine. See Siobhan Roberts's piece in our local magazine https://www.newyorker.com/tech/annals-of-technology/christopher-stracheys-nineteen-fifties-love-machine
- Last week, you encountered the "aspiration" of AI (1950, 1956), and the struggles (1960s-1980s) of its leading candidate methods: rules, schema, and expertise.
- Next week, you'll meet "AI 2.0" (and find out that it's secretly "machine learning" (which is secretly "pattern recognition" (which --- don't tell anyone --- was secretly statistics all along))).
- But first! We have to find out how it came to pass that we even got so much data we could learn from it. How did we go from the birth of computing to the birth of record-keeping (followed by record-keeping At Web Scale™)?
So this week we'll meet the birth of computational record-keeping, the "big data" of its time, and, importantly, how society reacted to big data when it became too big to be merely an industrial concern. It's not always easy to enter the literature on the history of databases and data warehousing. In fact, I've heard that some people find the subject dry. So first, let's start with the human side of big data, so to speak:
- Chapter 8: "getting hooked on data"
- Sarah Igo. The Known Citizen: A History of Privacy in Modern America. Harvard University Press, 2018. Chapter 6: The Record Prison.
Igo traces battles over corporate and state control of data, and public reactions, in the 1960s and 1970s. This is a very enjoyable, award-winning book. It draws from many sources including the lawyer-author Arthur Miller, who is the first of the two "optional" readings for students who would like to dive into more detail.
Think through: who was gathering data about people at the time? What did citizens fear and for what actions were reformers arguing at this time?
- It’s useful to make sure we connect with “present day” a bit. To that end please do read legal scholar Paul Ohm’s very brief HBR piece that will introduce you to the concept of “a database of ruin”, building on themes from Igo: https://hbr.org/2012/08/dont-build-a-database-of-ruin
- The pieces above were written post-WWW, so they don't feel as strange as pieces written when the keeping of digital records, and attempts to learn from data, were new. For that, let's take a look at the brief IBM piece that introduced the world to the term "Business Intelligence", in 1958:
"A business intelligence system." IBM Journal of research and development 2, no. 4 (1958): 314-319.
Igo's "secondary" reading gives a great view of the cultural context; we next move on to a "primary" reading from IBM's Hans Peter Luhn, shining a light on how and when "big data" became technically possible. Born in Germany, Luhn was an engineer-inventor credited with a new process in textile manufacturing (cf. http://www.lunometer.com/ ). At IBM he evangelized for organizing the world’s information and making it universally accessible and useful. He was ahead of his time. This very brief (5 substantive pages, one page-filling infographic) manifesto from 1958 introduced the term "business intelligence" argues that organizing information and making it useful should be recognized as a critical function of organizations, and that organizations should provide resources -- both in tooling and talent -- to execute this function.
Think through: who was his audience and what was the status quo beforehand?
- Martha Poon. "Scorecards as devices for consumer credit: the case of Fair, Isaac & Company Incorporated." The sociological review 55 (2007): 284-306.
Poon did a tremendous amount of ethnographic work for this article, finding and interviewing 14 people who were involved in the creation of our current system of credit scoring, itself a nationally recognized and standardized process but with origins in a for-profit company (not unlike, e.g., the SAT and ACT tests). As in our readings from the dawn of digital computation, this data processing involved a great deal of labor and was heavily gendered.
Extra creepy is when data about people are reduced to a single number. In the case of credit scoring it's not only a single number -- it's a very powerful single number. (A toy sketch of such a number appears after the lists below.) Poon reviews the development, breaking it into three stages:
- 1958-1974,
- 1980-1985,
- 1986-1991.
As you read think about
- statistical measures: what is the "score" meant for as a summary? Is it a description? A prediction? A prescription?
- ethical principles of fairness and beneficence
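As a deliberately crude toy of what a scorecard does, the sketch below collapses a few made-up applicant attributes into one number. The attributes, point values, and cutoff are invented for illustration and are not Fair, Isaac's actual method, which Poon describes historically.

```python
# Toy scorecard: several attributes of a person collapsed into a single,
# consequential number. Every attribute, point value, and threshold here
# is hypothetical.
def toy_score(applicant):
    points = 0
    points += 40 if applicant["years_at_address"] >= 5 else 15
    points += 50 if applicant["has_bank_account"] else 10
    points += 35 if applicant["age"] >= 30 else 20
    return points

applicant = {"years_at_address": 7, "has_bank_account": True, "age": 26}
score = toy_score(applicant)
print("applicant's single number:", score)
print("decision at a hypothetical cutoff of 100:",
      "approve" if score >= 100 else "decline")
# Poon's questions apply directly: is this number a description, a prediction,
# or -- once lenders act on it -- a prescription?
```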
- Joanna Radin. "“Digital Natives”: How Medical and Indigenous Histories Matter for Big Data." Osiris 32, no. 1 (2017): 43-64.
Another creepy context is health data, which is dominating the current news cycle and will likely continue to do so. This article touches directly on the ethics of justice --- e.g., fair use and fair distribution of benefits --- by focusing on the members of the Gila River Indian Community Reservation.
Two weeks ago we learned AI; last week we understood the privatization of big data; next week we'll encounter a democratization of machine learning and data analysis under the name "Data Science", and its tensions. This week we encounter AI 2.0, i.e., "Machine Learning".
- Ch 9: machines, learning
- One milestone for "field-making" is creating a journal. Machine Learning started in 1986, and in 2011 one of the founding editors reflected on the changing meaning of the term.
- As we mentioned in lab, deep learning pushes to an extreme the tension between performance and interpretability. As one deep learning expert put it to me years ago, "I am anti-interpretability! I think it is a distraction!" This has consequences for real-world problems, particularly for automated decision systems. For that reason, machine learning researcher Cynthia Rudin asks us please to "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." (Fun fact: Joanna Radin, whose work we encountered last week, is actually the cousin of Cynthia Rudin. Cynthia's mom would've been Radin-Rudin if she'd hyphenated.)
- Next we'll read a 6-page review paper from 2015 by the two most well-known names in machine learning in the academy: Michael I. Jordan of Berkeley and Tom Mitchell, who created the world's first "machine learning" department. This piece is fun to read because if you know their two very differing worldviews you can tell which paragraphs are Jordan's and which are Mitchell's. It's a good view of ML as it was understood in 2015.
- Deep Learning is not that new a field but is already too vast and too quickly moving to cover in technical depth; however, there’s a great nontechnical view from a few years ago which will help: https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html Within this long article please focus on the 1st two sections:
- Prologue: You Are What You Have Read
- Part I: Learning Machine
- The 1983 workshop organized by Tom Mitchell and others was another milestone in the creation of the field. That workshop included a provocative talk by Herb Simon, the only person ever to win both a Turing Award and a Nobel Prize (in economics). His talk at the birth of machine learning pushes back on the assumptions of the meeting: "Why should machines learn?"
The last several weeks were about ways WWII and its aftermath directly impacted our world in data: artificial intelligence, the rise of big data, and the statistical approach now known as machine learning, which builds on that data. This week is about the broader community -- not just researchers -- and traces the origin of 'data science' both as a term and a mindset.
- David Donoho "50 years of data science." Journal of Computational and Graphical Statistics 26, no. 4 (2017): 745-766.
Donoho is a respected academic statistician ( https://en.wikipedia.org/wiki/David_Donoho ); in this article he baptizes data science as statistics. Donoho wraps up 50 years of data science history, starting with the famous paper "The Future of Data Analysis" by the heretical statistician John Tukey, who also appears in Bin Yu's talk (see below). Pay attention to Tukey's role as well as that of Breiman; we'll be discussing both in particular, as well as the role of 1) Bell Labs and 2) military funding and interests.
Tukey spent his whole career split between Princeton and Bell Labs. (You might recall that we encountered Tukey in our discussion of exploratory data analysis -- he wrote the book defining the field in 1977.) His PhD (1939) was in topology, but as he later put it "By the end of late 1945, I was a statistician rather than a topologist," having worked on code-breaking and other martial applications. Despite being the founding chair of Princeton's statistics department, he had an adversarial relationship with mathematical statistics. His 1962 paper is the most-cited attack on the field. Breiman, like Tukey, was trained as a proper mathematician, then, like Tukey, worked on extremely applied problems. Also like Tukey, he wrote and spoke trying to get academic mathematical statisticians to embrace computation and data rather than just math.
Donoho worked with Tukey when he was an undergraduate and is at this point possibly the most anointed living computational statistician. This paper baptizes the heresy, bringing it into the church of statistics by providing his view on how statistics should be defined in a way to include data science as he sees it.
- Gina Neff, Anissa Tanweer, Brittany Fiore-Gartland, and Laura Osburn. "Critique and contribute: A practice-based framework for improving critical data studies and data science." Big Data 5, no. 2 (2017): 85-97.
This piece by Professor Gina Neff (CC'93!! https://en.wikipedia.org/wiki/Gina_Neff ) is an astute example of Critical Data Studies -- including the non-technical aspects that differentiate data science from statistics in practice.
Professor Neff has a long history of understanding technology and scientific communities. Her PhD here at Columbia involved very applied ethnographic work hanging out with Silicon Alley people in the late 1990s and understanding their values and networks. In this piece she ties together data science in theory with data science in practice. Enjoy!!
- Jeannette M. Wing, "Ten Research Challenge Areas in Data Science," Sep 30, 2020. Available via https://hdsr.mitpress.mit.edu/pub/d9j96ne4/release/2 . This short piece by Columbia's director of the Data Science Institute sets out current and future areas of activity. As you can see, AI and ML are parts of, but just parts of, this endeavor.
- "Let us own data science" ( https://en.wikipedia.org/wiki/Bin_Yu ): A 2014 talk/post by Bin Yu; she's an extremely accomplished statistician who, like many at the data science - statistics intersection, worked at Bell Labs at some point (in her case, 1998-2000). She's also very aware of the fact that academics, whether they know it or not, are human beings. This lecture tries to frame a new relationship between statistics and data science, as well as a new relationship between academia and industry.
- html https://imstat.org/2014/10/01/ims-presidential-address-let-us-own-data-science/
- pdf http://pages.stat.wisc.edu/~wahba/binyu.on.datascience.pdf
- video https://www.youtube.com/watch?v=92OjsYQJC1U
With last week's readings we've brought ourselves from data's past to data's present. It's time for us, in our remaining weeks, to think about data's future, which, since the data are derived from our behavior, we will all shape.
Likely you've noticed, over the past few weeks, an increasing use of the words "should" and "ethics": in our readings, in your responses, and in our discussions and labs. This week takes a look at what we talk about when we talk about ethics, and how we operationalize this definition into process.
- Salganik Ch 6: Salganik, Matthew J. Bit by bit: social research in the digital age. Princeton University Press, 2017.
Matt Salganik [1] got his PhD at Columbia in 2006 in Sociology. Chapter 6 of his book 'Bit by bit' is explicitly about ethics. For Tuesday, please read sections
- 6.1 "Introduction"
- 6.2 "Three examples"
- 6.4 "Four Principles"
- 6.6 "Difficulty"
(fun fact: Tukey invented the word 'bit'!)
- Metcalf, Jacob, and Emanuel Moss. "Owning Ethics: Corporate Logics, Silicon Valley, and the Institutionalization of Ethics." Social Research: An International Quarterly 86, no. 2 (2019): 449-476. https://datasociety.net/wp-content/uploads/2019/09/Owning-Ethics-PDF-version-2.pdf This piece from last fall gives a sense for how ethics is (or isn't) integrated into the organizational chart and the decision-making process at tech platform companies.
- Sweeney on re-identification: Sweeney, Latanya. "Only you, your doctor, and many others may know." Technology Science 2015092903, no. 9 (2015): 29. Available via https://techscience.org/a/2015092903/
We've mentioned Professor Latanya Sweeney [2] several times in class, e.g., when she was a grad student and sent the Governor of Massachusetts his medical records. This piece is technical but also legible. The section "approaches" makes clear one of the problems of "anonymous" data, as we discussed in the lab on the Netflix prize.
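To make the re-identification problem concrete, here is a toy linkage sketch in the spirit of Sweeney's work: an "anonymized" record matched to a public roster via quasi-identifiers. All records below are fabricated.

```python
# Toy linkage attack: joining an "anonymized" dataset to a public one on
# quasi-identifiers (ZIP code, birth date, sex). All records are fabricated.
anonymized_health_records = [
    {"zip": "10027", "birth_date": "1970-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "10025", "birth_date": "1985-01-12", "sex": "M", "diagnosis": "flu"},
]
public_voter_roll = [
    {"name": "Jane Doe", "zip": "10027", "birth_date": "1970-07-31", "sex": "F"},
    {"name": "John Roe", "zip": "10002", "birth_date": "1990-03-05", "sex": "M"},
]

quasi_identifiers = ("zip", "birth_date", "sex")
for record in anonymized_health_records:
    matches = [voter for voter in public_voter_roll
               if all(voter[k] == record[k] for k in quasi_identifiers)]
    if len(matches) == 1:  # a unique match re-identifies the "anonymous" record
        print(matches[0]["name"], "->", record["diagnosis"])
```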
[1] https://en.wikipedia.org/wiki/Matthew_J._Salganik
[2] https://en.wikipedia.org/wiki/Latanya_Sweeney
This is the last week of the class; however we will have 2 lectures this week: one on problems, and one on solutions.
Within the broad space of problems we're going to focus on a particularly volatile mix: the combination of the ad economy (aka the "attention economy") and venture capital.
There are many, many different places you can read about each of these but since we have 2 lectures in one week, I'm just going to suggest we all focus on two readings, and I'll give everything else as "optional" and you can dig in if you are interested in knowing more.
This week we set a frame for thinking analytically about ethics, and we tried to show how ethics conflicts with profit in general; I alluded to the fact that privacy, in particular, will conflict with "the ad model".
The two readings are about:
- the data-enabled advertising model; and
- the venture model, which accelerates the monetization of data.
If you can make time I encourage you to read one more which is on: the consequences for information platforms.
Data will be in the background this week, making the transformations possible but mentioned only as an assumed accelerant (e.g., in references to machine learning and personalization).
- The advertising model, 1997: Michael Goldhaber "The attention economy and the net." First Monday 2, no. 4 (1997).
This is a very prescient article, written in the early days of the WWW. It's entertaining to see what was proven right, and what was not!
- The venture model, 2019
"Epilogue: From the Past to the Present and the Future" from "VC: An American History", Tom Nicholas, 2019: This is a comprehensive book and we encourage you to read more than this if you'd like to know the history in more detail, but this brief piece gives a sense for the past, present, and future of venture capital.
If you can make time for more:
Grimmelmann, James. "The platform is the message." Georgetown Law Technology Review (2018): 18-30: This brief piece from 2 years ago by the lawyer and scholar Grimmelmann focuses on fake news and how ambiguous it is even to define, which makes content moderation a challenge for algorithms.
If you'd like to know more about the above topics, you might consider
- Excerpt from Isaac, Mike. Super Pumped: The Battle for Uber. WW Norton & Company, 2019. This is less scholarly, more human and fun. A very brief excerpt in which Isaac explains not only venture capital but, more importantly, venture capitalists and the founders they fund. They are people too, so it's useful to remember the human part of this dynamic as we think about the rapid growth of information platform companies, particularly 1998-2010.
- Anna Lee Saxenian. Regional Advantage. Harvard University Press, 1996. Ch 1. A great book from AnnaLee Saxenian ( https://en.wikipedia.org/wiki/AnnaLee_Saxenian ), Dean of the UCB I-School. She predicted (accurately) that Silicon Valley would overtake Boston. This chapter is a brief history of venture capital, WWII -> 1970s.
- "The fundamental problem with Silicon Valley's favorite growth strategy", February 5, 2019, by Tim O'Reilly ( https://en.wikipedia.org/wiki/Tim_O%27Reilly )
This piece contrasts growth as a venture-backed startup with other models and points out consequences potentially bad for society. The author has been a technical writer since 1977 and has also been a VC.
DiResta, Renee "Computational propaganda: If you make it trend, you make it true." The Yale Review 106, no. 4 (2018): 12-29.: This wide-ranging piece covers the design of platform companies and their optimizing algorithms, along with their consequences. Available via https://yalereview.yale.edu/computational-propaganda
1973 (optional): Richard Serra and Carlotta Schoolman. Television Delivers People. Castelli-Sonnabend Films and Tapes, 1973.
- Video: http://www.vdb.org/titles/television-delivers-people
- Transcript: https://www.persee.fr/doc/comm_0588-8018_1988_num_48_1_2857 This is the earliest reference I know of to "you are the product".
This is our last lecture; as such, these are our last readings. Note: many of you will find the readings very helpful for your final papers!
The implicit promise of the name of the class --- “data, past, present and future” --- is that we will also be covering our shared future.
In some classes, discussing the future means cutting edge research. This class, as you might have noticed, is about truth and power; so, discussing the future means discussing the current balances among powers, particularly arenas in which these powers are contested.
A useful framework we’ll use in our last class is grouping relevant powers into
- state power,
- people power, and
- corporate power. Over the last weeks, we've talked quite a bit about state power, for example, in the history of privacy and the construction of ethics. Last week, we covered corporate power, as we covered some of our current problems. By engaging with the contest among powers, we will also of course be engaging with potential solutions.
Your readings for this week are organized around those three powers:
We’ll start with how corporate powers are responding to the growing narrative drawing attention to problems in the way data and machine learning are governing our shared reality. We will see that the corporate response has largely been around technical fixes --- particularly privacy. To dig in, we will study three very short pieces.
- title: Technology Can’t Fix Algorithmic Injustice
- author: Annette Zimmermann, Elena Di Rosa, Hochan Kim
- date: January 09, 2020
- URL: https://bostonreview.net/science-nature-politics/annette-zimmermann-elena-di-rosa-hochan-kim-technology-cant-fix-algorithmic
Moving on from corporate power and technical fixes advocated for by corporations, we’ll discuss state power, which for many people is the first form of solution they think of regarding problematic uses of data and algorithms by corporations.
On this topic please read this 25-page excerpt from a longer law article, "Artificial Intelligence: Risks to Privacy and Democracy." The subsection is called "regulation in the age of AI" and gives a useful contemporary overview of some of the existing and proposed regulatory mechanisms.
- title: Artificial Intelligence: Risks to Privacy and Democracy
- author: Karl Manheim and Lyric Kaplan
- date: Dec 13, 2019
- URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3273016
- pages: 160-184
Finally, let’s return to the subject of people power. Of course there are many forms of people power you’re exercising in your daily life outside this class, like voting (impacting state power), or choosing to spend consumer dollars on companies you support (or, conversely, to boycott them).
This very brief nine page excerpt is about a different type of people power; specifically, the power of employees at technology companies.
(Since this week is about contests of power, it’s in some ways not surprising that this week we have two different pieces that are from the context of law.)
- title: Employees as Regulators: The New Private Ordering in High Technology Companies
- author: Jennifer S. Fan
- date: January 09, 2020
- URL: https://dc.law.utah.edu/ulr/vol2019/iss5/2/
- pages: 990-998
A much lighter and more human look at people power is this piece from two years ago, by the publication California Sunday. This piece contains no editorial content, other than a timeline of protests. However, it contains direct interviews with many people, some anonymous and some named, discussing push-back against the corporate power of their own employers. You may enjoy this direct and ethnographic approach.
- title: The Tech Revolt
- date: Jan 23, 2019
- URL: https://story.californiasunday.com/tech-revolt
if you’re interested in more thinking on Corporate Power, some links:
- The first is a piece by two technologists, specifically computer scientists, who recently wrote a book on how to make algorithms more ethical. This short piece argues that the solution to our problems should come from technical fixes, rather than from immediate regulation.
- title: Ethical algorithm design should guide technology regulation
- authors: Michael Kearns and Aaron Roth,
- date: January 13, 2020
- URL: https://www.brookings.edu/research/ethical-algorithm-design-should-guide-technology-regulation/
- A second short piece investigates more critically one particular corporate fix around privacy: specifically, the extremely recent proposal by Google and Apple to collaborate on a privacy-preserving tracking mechanism in response to the novel coronavirus pandemic.
- title: Apple and Google Announced a Coronavirus Tracking System. How Worried Should We Be?
- author: Jennifer Stisa Granick
- date: April 16, 2020
- URL: https://www.aclu.org/news/privacy-technology/apple-and-google-announced-a-coronavirus-tracking-system-how-worried-should-we-be