
Syllabus

chris wiggins edited this page Nov 19, 2020 · 95 revisions

Data: Past, Present, and Future

Instructors

Matthew L. Jones (A&S) and Chris Wiggins (SEAS)

Course description

Data and data-empowered algorithms now shape our professional, personal, and political realities. This course introduces students both to critical thinking and practice in understanding how we got here, and the future we now are building together as scholars, scientists, and citizens.

The intellectual content of the class will comprise

  • the history of human use of data;

  • functional literacy in how data are used to reveal insight and support decisions;

  • critical literacy in investigating how data and data-powered algorithms shape, constrain, and manipulate our commercial, civic, and personal transactions and experiences; and

  • rhetorical literacy in how exploration and analysis of data have become part of our logic and rhetoric of communication and persuasion, especially including visual rhetoric.

While introducing students to recent trends in the computational exploration of data, the course will survey the key concepts of "small data" statistics.

Requirements

All students will be required to:

  • participate in laboratory hours including posting to Slack as assigned during class (10%)
  • respond to readings each week on Slack (15%)
  • write one 750-word op-ed on the ethics and practice of using data by midterm (15%)

Students will be assigned, typically based on their major, to one of two tracks. Students with less technical majors will do more technical work, including problem sets; students with more technical backgrounds will do more humanistic work, including longer writing assignments. Students for whom there is ambiguity are encouraged to clarify with instructors and TAs before the first assignment is due.

a) more technical background track (60%)

  • pursue a semester-long project culminating in a 15pp paper and any associated code

  • complete 3 problem sets

b) more humanistic background track (60%)

  • write a 10 pp paper on a topic of their choice

  • complete 5 problem sets; these will involve both computational work and writing work

Syllabus

Tentative and subject to change

Lecture 1 : intro to course

(See Slides)

Lecture 2 : setting the stakes

Lecture 1 was an overview of the class, with some setting of stakes; for Lecture 2 we'll dive into some examples of writing that has had an impact in drawing people's attention to unintended consequences of a reality shaped by data-empowered algorithms:

  1. Hanna Wallach (2014, December). Big data, machine learning, and the social sciences: Fairness, accountability, and transparency. In NeurIPS Workshop on Fairness, Accountability, and Transparency in Machine Learning. Available via https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d . Dr. Wallach ( http://dirichlet.net/about/ ) is a former CS Professor now working in NYC at Microsoft Research. She's been a leader both in machine learning research and the emerging discipline of computational social science. This piece is an early example of technologists beginning to question data and propose a new research field.

  2. danah boyd and Kate Crawford. "Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon." Information, communication & society 15, no. 5 (2012): 662-679. Available via https://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878 .

  3. From Zeynep Tufekci, 14 very readable pages from 2014 on tech & politics. This will wrap up our "setting the stakes" readings on our data-driven present.

Tufekci, Zeynep. "Engineering the public: Big data, surveillance and computational politics." First Monday 19, no. 7 (2014).

Lecture 3 : risk and social physics

Lecture 3 is on risk and "social physics"

This week we start the "historical" view, centered on readings from the 18th and 19th century. We'll have one primary text and two secondary texts. (We'll also have one optional text for those of you who really want to dig in.)

  1. Gigerenzer, Gerd, ed. The Empire of Chance: How Probability Changed Science and Everyday Life. Ideas in Context. Cambridge: Cambridge University Press, 1989, Section 1.6 ("Risk and Insurance"). Eight very quick-moving and readable pages from a very quick-moving and readable book.

  2. Porter, Theodore. The Rise of Statistical Thinking, 1820-1900 (Princeton, N.J.: Princeton University Press, 1986), chap. 2 (40-70) + 100-109. Porter (one of the authors from the Empire of Chance) with lots of context around Quetelet's role in shaping our thinking about data, people, and policy.

  3. Quetelet, Adolphe “Preface” and “Introductory,” A Treatise on Man (1842), (full book available via https://ia801409.us.archive.org/27/items/treatiseonmandev00quet/treatiseonmandev00quet_bw.pdf ) A "game changer", one would now call it. Enjoy!

OPTIONAL

If you're up for harder reading, get to know Desrosières:

  1. Desrosières, Alain. "Averages and the Realism of Aggregates" in The Politics of Large Numbers: A History of Statistical Reasoning. Cambridge, Mass.: Harvard University Press, 1998, ch 3. Desrosières on Galton through Durkheim.

Lecture 4 : statecraft and quantitative racism

our reading for Tuesday 2020-02-11 is:

  1. Desrosières, Alain. "Correlation and the Realism of Causes," in The Politics of Large Numbers: A History of Statistical Reasoning. Cambridge, Mass.: Harvard University Press, 1998.
  • excerpt 1 on 'statistics' vs 'vulgar statistics', pp19-23 (top of 23)

  • excerpt from ch 4 on Galton, pp112-127

I have to warn you: Desrosières is not messing around. The book is now a standard in the history of stats, but it's pretty scholarly, i.e., dense. Ch 4 is so good, though, that we had to assign it, or at least the Galton part of it.

Great references as well; if you're inspired to dig in, please do enjoy!

  1. Galton, Francis. “Typical Laws of Heredity,” Royal Institution of Great Britain. Notices of the Proceedings at the Meetings of the Members 8 (February 16, 1877): 282ff. Primary text.

  2. Stephen J. Gould, Mismeasure of Man, ch. 3. Gould brings a sword in this chapter. Great stuff. Opens with a bang, doesn't let up for 40 pages. A warning: he deals head-on with the racism of quantitative research of the late 19th c.

OPTIONAL

  1. Gillham, Nicholas. "Sir Francis Galton and the Birth of Eugenics." Ann. Rev. Genet. 35 (2001): 83-101. A scholarly engagement with Galton, his interests, and the consequences.

Lecture 5 : intelligence, causality, and policy

our reading for Tuesday 2020-02-18 moves from Galton to intelligence, with direct policy implications:

  1. Gould, Stephen Jay. The Mismeasure of Man. W. W. Norton & Company, 1996. ONLY pp: 280-2, 286-288, 291-302, 347-350. This is Gould's treatment of Spearman, plus a 3-page addendum about a late-20th-century revival, but minus two mathy bits (which is why I list the pages in 4 chunks). Spearman invented PCA in order to come up with a single number representing "general intelligence".

  2. pp 272-277 ONLY of: Spearman, Charles. "'General Intelligence,' Objectively Determined and Measured." The American Journal of Psychology 15, no. 2 (1904): 201-292 (available at https://web.archive.org/web/20140407100036/http://www.psych.umn.edu/faculty/waller/classes/FA2010/Readings/Spearman1904.pdf )

  3. Freedman, David. "From association to causation: some remarks on the history of statistics." Journal de la société française de statistique, Volume 140 (1999) no. 3, pp. 5-32. http://www.numdam.org/item/JSFS_1999__140_3_5_0/ Freedman ( https://en.wikipedia.org/wiki/David_A._Freedman ) was a great statistician as well as an expositor and historian of statistics. He writes so well! In this piece please focus on the parts about Yule (Sec 4, pp 11-14), but really the whole thing is great and sets you up well for the next several weeks (and life in general!).
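Spearman's "general intelligence" is, in modern terms, essentially the first principal component of a matrix of test scores. A minimal sketch of that idea with entirely made-up data (a one-factor world plus noise; not drawn from any of the readings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 students x 5 tests. Scores share one latent factor ("g")
# plus independent noise -- the one-factor world Spearman posited.
g = rng.normal(size=(200, 1))
loadings = np.array([[0.9, 0.8, 0.7, 0.6, 0.5]])
scores = g @ loadings + 0.4 * rng.normal(size=(200, 5))

# First principal component of the centered score matrix via SVD.
X = scores - scores.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ Vt[0]                      # each student's "general factor" score
explained = S[0] ** 2 / (S ** 2).sum()

# In this toy setup, pc1 tracks the latent g closely.
r = np.corrcoef(pc1, g.ravel())[0, 1]
print(f"PC1 explains {explained:.0%} of variance; |corr with g| = {abs(r):.2f}")
```

The point is not the linear algebra but the reduction: five test scores collapse to one number, which is exactly the move Gould critiques.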

Optional: Yule's actual paper

  1. Yule, G. (1899). An Investigation into the Causes of Changes in Pauperism in England, Chiefly During the Last Two Intercensal Decades (Part I.). Journal of the Royal Statistical Society, 62(2), 249-295. doi:10.2307/2979889

(or at least read footnote 25:

Strictly, for "due to" read "associated with." Changes in proportion of old and in population were much the same in the two groups in all cases. The assumption could not have been made, I imagine, had the kingdom been taken as a whole.

)

Lecture 6 : data gets real: mathematical baptism

our reading for Tuesday 2020-02-25 moves from Spearman's mathematical definition of intelligence, and Yule's work implicitly applying mathematical models to policy decisions, to an explicitly mathematical "baptism" of statistics, applied directly both to decisions and to defining what is true, what is proven, and what is "science."

  1. Box, Joan Fisher. “Guinness, Gosset, Fisher, and Small Samples.” Statistical Science 2, no. 1 (1987): 45–52.

Joan was one of Fisher's 6 children with Eileen Guinness. She married the statistician George Box, now best known for the quip "All models are wrong, but some are useful," which you should keep in mind while thinking about hypothesis testing. This brief piece gives context for Fisher's writing and his half-century fight with Neyman.

  1. "The Controversy", 10 page excerpt from "The Empire of Chance", a secondary text you've already encountered (the book is a collaboration among the distinguished historians John Beatty, Lorraine Daston, Gerd Gigerenzer, Lorenz Kruger, Theodore Porter, and Zeno Swijtink). This section makes clear the ideas and animus of the fight.

Next, two documents from the two belligerents, summarizing the bitter mathbattle:

  1. Fisher, R. A. "Scientific thought and the refinement of human reasoning." (1960). Also available online.

  2. Neyman, J. "Silver jubilee of my dispute with Fisher." (1961). Also available online.
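To make the substance of the dispute concrete: Fisher treated the p-value as a graded measure of evidence against the null, while Neyman and Pearson fixed an error rate α in advance and issued a binary decision. A toy sketch of both readings of the same invented data (9 successes in 10 fair-coin trials), using only the standard library:

```python
from math import comb

# Fisher-style significance test: how surprising are the data under H0?
# Observed: 9 "successes" in 10 Bernoulli(0.5) trials (invented numbers).
n, k = 10, 9
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(f"Fisher: p = {p_value:.4f}  (a graded measure of evidence against H0)")

# Neyman-Pearson-style test: fix alpha BEFORE seeing data, then decide.
alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"Neyman-Pearson: alpha = {alpha} -> {decision}")
```

The code collapses the two procedures onto one dataset for illustration; the historical fight was precisely over whether the second step (a pre-committed accept/reject rule) is the right way to use the first.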

OPTIONAL

A special treat: "Inference Rituals: Algorithms and the History of Statistics", a preprint (so please do not distribute) from Chris Phillips of CMU's history department. A great summary with lots of citations.

Lecture 7 : WWII, dawn of digital computation

readings for Tuesday 2020-03-03:

context:

As promised, this week we go somewhat back in time, from last week's "bratty" mathbattle of 1935-1960 in the academy to the actual behind-the-fence birth of computational statistics at Bletchley Park during WWII.

  1. We’ll open this week with “Bayes goes to War”, Ch 4 of “The Theory That Would Not Die” (2011), a popular book written by the science writer Sharon Bertsch McGrayne. (please also read the 2-page dessert, ch 5) While you're reading this, consider
  • the scientific and intellectual networks the characters came from: computing was another "trading zone", bringing together hardware engineers and mathematicians, but no statisticians. The vigorous academic debate we saw raging in "mathematical statistics" 1935-1960 is absent from the dawn of computing with data
  • the role of hardware and physical labor vs. mathematics and philosophy, different from prior authors
  • the focus on a job to be done (especially "decisions") rather than a scientific inquiry ("truth")
  • the marked break-point in how we deal with data as it becomes a digital computation, with specialized hardware
  2. McGrayne doesn't go into much detail as to the technical matters -- what they actually did at Bletchley. For this, please read the first 11 pages of an unclassified report by the NSA.

  3. Mar Hicks has a great new book "Programmed Inequality" which you have hopefully heard of. She also has an article specifically about Bletchley and the dawn of computation. In this chapter please focus on pages 19-42.

  4. The above give some view of the role of math, hardware, and physical labor at the UK dawn of digital computing -- which was computing with data! However this doesn't shed light on the "special relationship" between UK intelligence and US corporations -- the corporate contractors who participated in the early military-industrial complex. As we will see in the remaining weeks, these companies dominated what would become data science, particularly Bell Labs here in NYC (after the war Bell moved to NJ). In light of that, please enjoy this breezy excerpt from Hodges's biography of Alan Turing.

While you're reading this, consider

  • the "scaling up" of computing with data
  • the marked break-point in how we deal with data as it becomes a digital computation, with specialized hardware and....
  • who has the infrastructure to build and maintain such hardware? Communications (e.g., postal service in the UK, AT&T / Bell labs in the US) feature prominently in this chapter and in succeeding weeks!

Modern day relevance:

This is the birth of digital computation!! And computation -- including that driving the device on which you're reading these words -- was born of data; that is, the first arena of attack for building digital computers was breaking codes (data) using statistical methods (what we would now call a Bayesian inference problem), along with abundant "shoe leather" work (domain expertise). Completely breaking from the academic tribe setting the tone and values for making sense of the world through data, martial and industrial (as contractors to the military) concerns and skills are the primary movers from this point through the present day. I claim that these readings reveal the origin of modern data science --- applied computational statistics, driven by the concerns of a domain of application, including engineering concerns, rather than by mathematical rigor or scientific hypotheses.
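For a feel of the Bletchley bookkeeping: Turing scored competing hypotheses by summing log-likelihood ratios, in units he called "decibans." The sketch below uses invented rates and ignores the evidence carried by non-matching positions, so it illustrates the arithmetic rather than Banburismus itself:

```python
from math import log10

# Prior odds that two intercepts overlap in key (invented: 1 in 100).
prior_odds = 1 / 100

# Each observed letter coincidence multiplies the odds by a Bayes factor:
# coincidences are more likely if the messages share a key (invented rates).
p_match_if_same_key = 1 / 17   # English-like letter-coincidence rate
p_match_if_random = 1 / 26     # uniform chance coincidence

# Score evidence in "decibans" = 10 * log10(Bayes factor), and add them up.
bayes_factor = p_match_if_same_key / p_match_if_random
deciban_per_match = 10 * log10(bayes_factor)
n_matches = 15

total_db = n_matches * deciban_per_match
posterior_odds = prior_odds * bayes_factor ** n_matches
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"{total_db:.1f} decibans of evidence -> P(same key) = {posterior_prob:.2f}")
```

The attraction for wartime use was that multiplication becomes addition: clerks could tally decibans on paper until a decision threshold was crossed, a "job to be done" framing rather than a truth-seeking one.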

OPTIONAL:

Mar Hicks's chapter complements the great direct quotes which drive "Breaking Codes and Finding Trajectories: Women at the Dawn of the Digital Age", Ch 1 of "Recoding Gender: Women's Changing Participation in Computing" (2012) by Janet Abbate, a professor at Virginia Tech.

Of this, please focus on

  • pp. 14-16,
  • pp. 21-22,
  • bottom of 26-29, and
  • pp. 33-35.

While you're reading this, consider

  • the role of physical labor and how it is valued
  • the biases about the different skills needed
  • the marked break-point in how we deal with data as it becomes a digital computation, with specialized hardware

Lecture 8 : birth and death of AI

This week's reading is about another present-shaping, history-changing innovation from Alan Turing (in addition to computation itself): artificial intelligence.

  1. Secondary reading: Artificial Intelligence by Stephanie Dick https://hdsr.mitpress.mit.edu/pub/0aytgrau

Context, including the primary literature

  1. Alan Turing: "Computing machinery and intelligence." Mind 59, no. 236 (1950): 433. ( https://en.wikipedia.org/wiki/Computing_Machinery_and_Intelligence ) In these few pages, Turing lays out a plan for what would later be called AI, and thinks through the necessary hardware, the computation, the capabilities, and even many of the critiques and doubts people would raise about the possibility of AI.

There are two ways to interpret the many connections between this almost-70-year-old document and the present: the first is to say "wow! Turing was so prescient to have realized back then everything that would happen for the next 50 years!" The second is to say "wow! For the next 50 years all people did was execute the plan laid out in this document!" Either way you can see in this document what would be the future (and our present!)

questions to ask yourself:

  • What was "the Turing test" initially? How did this relate to Turing's biography?
  • What objections did he expect against AI? Did he address them well?
  • Can you tell that there are multiple ways to achieve AI in this work?
  2. McCarthy, John, Marvin Minsky, Nathaniel Rochester, and Claude Shannon. "A proposal for the Dartmouth summer research project on artificial intelligence, August 1955." (1955). This is the only reading this term that is actually a funding proposal, not a scholarly work, but it has had tremendous historical impact: it first introduces the phrase "artificial intelligence" itself. By McCarthy's own admission:

"I invented the term artificial intelligence. I invented it because ...we were trying to get money for a summer study in 1956...aimed at the long term goal of achieving human-level intelligence." -- JCM, during the Lighthill debate [1]

There are two ways to interpret the many connections between this 64-year-old document and the present: the first is to say "wow! They were so prescient to have realized back then everything that would happen for the next 50 years!" The second is to say "wow! For the next 50 years all people did was execute the plan laid out in this document!" Either way you can see in this document what would be the future (and our present!)

questions to ask yourself:

  • What part of their goal overlaps with what we now think of as AI?
  • What parts relate to CS as a field?
  • Which goals have we not yet achieved?

The backlash against AI -- sometimes called "the first AI winter" -- accelerated in the early 1970s. A well-documented example from the UK is "the Lighthill report". There are copious assets from and reactions to this report, and it is a great illustration of how incumbents in the scientific establishment, including Lighthill himself [2], had a role in smiting the upstart field of AI. If you have time I strongly encourage you to watch the videos (see the optional readings below) of the televised debate itself, just amazing stuff, featuring a who's who of British AI work [3] along with McCarthy himself, trying to defend the nascent field. In prior years we've assigned the report along with reactions; for Tuesday, please just read the report itself. A note: don't expect it to make sense. It's a rant by someone who had been cramming on AI for 3 months before pontificating. He gets plenty wrong.

Let's just read JCM's reaction, written in 1973 (though typeset in 2000; ignore the year):

questions to ask yourself:

  • What did Lighthill get right? What did he get wrong?
  • How well do you think he argued against AI as a field?

OPTIONAL

extras for further reading

  1. Professor Sir James Lighthill, FRS. "Artificial Intelligence: A Paper Symposium." Science Research Council, 1973, 317-322.

  2. The full debate Lighthill, J. "BBC TV–June 1973 ‘Debate at the Royal Institution" https://www.youtube.com/watch?v=03p2CADwGF8

  3. Pamela McCorduck (2009). Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. AK Peters/CRC Press.

  • Chapter 5, "The Dartmouth Conference": McCorduck is an unusual secondary source in that she knows personally almost all the people she writes about -- unparalleled access to the minds and interests of the participants.

questions to ask yourself: What were the interests and goals of the participants?

  • Chapter 9, "L’Affaire Dreyfus": The backlash wasn't just in the UK, of course. A prominent example is the book Dreyfus, Hubert L. "What computers can't do." (1972), discussed at length by Pamela McCorduck.

References:

[1] Lighthill, J. "BBC TV–June 1973 ‘Debate at the Royal Institution" https://www.youtube.com/watch?v=pyU9pm1hmYs&t=266s , 3'00"

[2] https://en.wikipedia.org/wiki/James_Lighthill, who was Lucasian Professor of Mathematics ( https://en.wikipedia.org/wiki/Lucasian_Professor_of_Mathematics )

[3] One of the audience members called upon to speak ( https://youtu.be/3GZWFnWOqkA?t=407 ) is Chris Strachey https://en.wikipedia.org/wiki/Christopher_Strachey, whose whole life story is amazing.

Lecture 9 : big data, old school (1958-1980)

The first birth and burial of AI was last week. Next week it's time to get to know "big data" from a historical view. In our opening reading, Hanna Wallach warned that big data is creepy because it's granular and because it's about people. This week's reading takes a historical look.

Our readings are

Sarah Igo. The Known Citizen: A History of Privacy in Modern America. Harvard University Press, 2018. Chapter 6: The Record Prison.

Igo traces battles over corporate and state control of data, and public reactions, in the 1960s and 1970s. This is a very enjoyable, award-winning book. It draws from many sources including the lawyer-author Arthur Miller, who is the first of the two "optional" readings for students who would like to dive into more detail.

Think through: who was gathering data about people at the time? What did citizens fear, and what actions were reformers arguing for at this time?

Luhn, H. P. "A business intelligence system." IBM Journal of Research and Development 2, no. 4 (1958): 314-319.

Igo's "secondary" reading gives a great view of the cultural context; we next move on to a "primary" reading from IBM's Hans Peter Luhn, shining a light on how and when "big data" became technically possible. Born in Germany, Luhn was an engineer-inventor credited with a new process in textile manufacturing (cf. http://www.lunometer.com/ ). At IBM he evangelized for organizing the world’s information and making it universally accessible and useful. He was ahead of his time. This very brief manifesto from 1958 (5 substantive pages, one page-filling infographic) introduced the term "business intelligence" and argues that organizing information and making it useful should be recognized as a critical function of organizations, and that organizations should provide resources -- both in tooling and talent -- to execute this function.

Think through: who was his audience and what was the status quo beforehand?

Joanna Radin. "“Digital Natives”: How Medical and Indigenous Histories Matter for Big Data." Osiris 32, no. 1 (2017): 43-64.

Another creepy context is health data, which is dominating the current news cycle and will likely continue to do so. This article touches directly on the ethics of justice --- e.g., fair use and fair distribution of benefits --- by focusing on the members of the Gila River Indian Community Reservation.

Karl Bode. "COVID-19 Could Provide Cover for Domestic Surveillance Expansion", Vice, Mar 16 2020. https://www.vice.com/en_us/article/884ew5/covid-19-could-provide-cover-for-domestic-surveillance-expansion

With Igo, Luhn, and Radin as a backdrop for thinking about the societal impact, the technical capabilities, and the sensitivities of human health data, respectively, let's move on to the present day. This recent post from Vice ties together all of the above, centering on the ethical tensions between rights and harms. Rights include the right to consent to disclosure of information about a person, which is sacrificed when our personal data are shared with other agents (e.g., doctors, companies). Concerns for harms-vs-benefits include the risk of future discrimination based on medical or demographic information, weighed against benefits both to an individual patient (by statistical comparison with others) and to society (by statistical analyses which correlate, e.g., treatment with outcome). This tension between the ethical principles of rights and harms (we will later frame it as a tension between deontology and consequentialism) sits at the center of human health data; it has long been discussed in the applied ethics literature and is once again entering public discourse. This article is just one of many from the past week we could have chosen; we encourage you to explore on your own and hope that our class gives you a useful framing as you navigate this discourse.

OPTIONAL

If you'd like to dig in more, we encourage you to start with these two readings:

Arthur Raphael Miller. Assault on Privacy: Computers, Data Banks, and Dossiers (1971), excerpt from Ch 2: "The New Technology's Threat to Personal Privacy".

This book is referenced heavily in Igo's Chapter 6. Even in 1971 it was clear that technology was going to conflict directly with our notions of privacy. Miller was ahead of his time in sounding the alarm.

(This Arthur Miller: https://en.wikipedia.org/wiki/Arthur_R._Miller ; not this one: https://en.wikipedia.org/wiki/Arthur_Miller )

As you read, think through what has changed and what hasn't, both in tech and in public attitudes.

  1. Martha Poon. "Scorecards as devices for consumer credit: the case of Fair, Isaac & Company Incorporated." The Sociological Review 55 (2007): 284-306.

Poon did a tremendous amount of ethnographic work for this article, finding and interviewing 14 people who were involved in the creation of our current system of credit scoring, itself a nationally recognized and standardized process but with origins in a for-profit company (not unlike, e.g., the SAT and ACT tests). As in our readings from the dawn of digital computation, this data processing involved a great deal of labor and was heavily gendered.

Extra creepy is when data about people are reduced to a single number. In the case of credit scoring it's not only a single number -- it's a very powerful single number. Poon reviews the development, breaking it into three stages:

  • 1958-1974,
  • 1980-1985,
  • 1986-1991.

As you read think about

  • statistical measures: what is the "score" meant for as a summary? Is it a description? A prediction? A prescription?
  • ethical principles of fairness and beneficence
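A scorecard of the kind Poon describes can be sketched in a few lines: each applicant attribute falls into a bucket with pre-assigned points, and the "very powerful single number" is just their sum. Every attribute, point value, and cutoff below is invented for illustration, not taken from Fair, Isaac's actual tables:

```python
# Hypothetical scorecard: attribute buckets -> points (all values invented).
SCORECARD = {
    "years_at_address": [(0, 2, 10), (2, 7, 25), (7, 99, 40)],  # (lo, hi, pts)
    "years_at_job":     [(0, 1, 5),  (1, 5, 20), (5, 99, 35)],
    "has_bank_account": {True: 30, False: 0},
}

def score(applicant):
    """Sum the pre-assigned points across attribute buckets."""
    total = 0
    for lo, hi, pts in SCORECARD["years_at_address"]:
        if lo <= applicant["years_at_address"] < hi:
            total += pts
    for lo, hi, pts in SCORECARD["years_at_job"]:
        if lo <= applicant["years_at_job"] < hi:
            total += pts
    total += SCORECARD["has_bank_account"][applicant["has_bank_account"]]
    return total

applicant = {"years_at_address": 3, "years_at_job": 6, "has_bank_account": True}
s = score(applicant)
decision = "approve" if s >= 60 else "refer"  # invented cutoff
print(s, decision)
```

As you read Poon, notice that the interesting history is in how the buckets and points were chosen and standardized, and by whom -- the mechanical sum above is the easy part.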

Lecture 10: data science, 1962-2017

The last several weeks were about how communities other than statisticians thought about data. (In fact, the last time we read from or about statisticians was in February!). We traced the military-crypto-computational effort (March 2), the cognitive-computational birth of AI (March 12), and the government-industrial birth of big data (March 26). This week is about the reaction of industrial statisticians to big data and computation, and traces the origin of 'data science' both as a term and a mindset.

  1. David Donoho "50 years of data science." Journal of Computational and Graphical Statistics 26, no. 4 (2017): 745-766.

Donoho is a respected academic statistician ( https://en.wikipedia.org/wiki/David_Donoho ); in this article he baptizes data science as statistics. Donoho wraps up 50 years of data science history, starting with the famous paper "The Future of Data Analysis" by the heretical statistician John Tukey, who also appears in Bin Yu's talk (see below). Pay attention to Tukey's role as well as that of Breiman; we'll be discussing both in particular, as well as the roles of 1) Bell Labs and 2) military funding and interests.

Tukey spent his whole career split between Princeton and Bell Labs. (You might recall that we encountered Tukey in our discussion of exploratory data analysis -- he wrote the book defining the field in 1977.) His PhD (1939) was in topology, but as he later put it "By the end of late 1945, I was a statistician rather than a topologist," having worked on code-breaking and other martial applications. Despite being the founding chair of Princeton's statistics department, he had an adversarial relationship with mathematical statistics. His 1962 paper is the most-cited attack on the field. Breiman, like Tukey, was trained as a proper mathematician, then, like Tukey, worked on extremely applied problems. Also like Tukey, he wrote and spoke trying to get academic mathematical statisticians to embrace computation and data rather than just math.

Donoho worked with Tukey when he was an undergraduate and is at this point possibly the most anointed living computational statistician. This paper baptizes the heresy, bringing it into the church of statistics by providing his view on how statistics should be defined in a way to include data science as he sees it.

  2. Gina Neff, Anissa Tanweer, Brittany Fiore-Gartland, and Laura Osburn. "Critique and contribute: A practice-based framework for improving critical data studies and data science." Big Data 5, no. 2 (2017): 85-97.

This piece by Professor Gina Neff (CC'93!! https://en.wikipedia.org/wiki/Gina_Neff ) is an astute example of Critical Data Studies -- including the non-technical aspects that differentiate data science from statistics in practice.

Professor Neff has a long history of understanding technology and scientific communities. Her PhD here at Columbia involved very applied ethnographic work hanging out with Silicon Alley people in the late 1990s and understanding their values and networks. In this piece she ties together data science in theory with data science in practice. Enjoy!!

  3. "Let us own data science": a 2014 talk/post by Bin Yu ( https://en.wikipedia.org/wiki/Bin_Yu ), an extremely accomplished statistician who, like many at the data science - statistics intersection, worked at Bell Labs at some point (in her case, 1998-2000). She's also very aware of the fact that academics, whether they know it or not, are human beings. This lecture tries to frame a new relationship between statistics and data science, as well as a new relationship between academia and industry.

Lecture 11: AI2.0

Two weeks ago we learned about AI; last week we understood the privatization of big data; this week we understood "Data Science" and its tensions. Next week we encounter AI2.0, i.e., "Machine Learning".

  1. First we'll read a 6-page review paper from 2015 by two of the best-known names in academic machine learning: Michael I. Jordan of Berkeley and Tom Mitchell, who created the world's first "machine learning" department. This piece is fun to read because if you know their two very different worldviews you can tell which paragraphs are Jordan's and which are Mitchell's. It's a good view of ML as it was understood in 2015.

  2. Deep Learning is not that new a field but is already too vast and too quickly moving to cover in technical depth; however, there's a great nontechnical view from a few years ago which will help: https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html Within this long article please focus on the 1st two sections:

  • Prologue: You Are What You Have Read
  • Part I: Learning Machine
  3. As we mentioned in lab, deep learning pushes to an extreme the tension between performance and interpretability. As one deep learning expert put it to me years ago, "I am anti-interpretability! I think it is a distraction!" This has consequences for real-world problems, particularly for automated decision systems. For that reason, machine learning researcher Cynthia Rudin asks us please to "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead."

  4. The technical field of fairness gained great attention from a particular journalistic piece about machine learning + bias called "Machine Bias". This piece will set us up nicely for understanding ethics: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
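The ProPublica analysis turned on comparing error rates across groups: among people who did not go on to reoffend, what fraction did the tool label high-risk? A minimal sketch of that comparison, on invented confusion-matrix counts (not ProPublica's data):

```python
# Invented confusion counts per group, purely to illustrate the metric.
# ProPublica's COMPAS analysis compared exactly these kinds of rates.
groups = {
    "group_a": {"fp": 45, "tn": 55, "fn": 20, "tp": 80},
    "group_b": {"fp": 22, "tn": 78, "fn": 45, "tp": 55},
}

for name, c in groups.items():
    fpr = c["fp"] / (c["fp"] + c["tn"])  # labeled high-risk, didn't reoffend
    fnr = c["fn"] / (c["fn"] + c["tp"])  # labeled low-risk, did reoffend
    print(f"{name}: FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

Note that a tool can be "calibrated" overall and still show unequal FPR/FNR across groups; that impossibility is a large part of the ensuing technical fairness literature.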

OPTIONAL:

Deep learning has also been critiqued by some heretical machine learning researchers as "alchemy", echoing the 1965 RAND paper of Dreyfus ( https://www.rand.org/pubs/papers/P3244.html ), whom we encountered briefly in week 8. If you want to know more I encourage you to dig into the more recent alchemy kerfuffle, either by video or blog post:

Lecture 12: ethics

With last week's readings we've brought ourselves from data's past to data's present. It's time for us in our remaining weeks to think about data's future, which, since the data are derived from our behavior, we will all shape.

Likely you've noticed, over the past few weeks, an increasing use of the words "should" and "ethics": in our readings, in your responses, and in our discussions and labs. This week takes a look at what we talk about when we talk about ethics, and how we operationalize this definition into process.

  1. Salganik Ch 6: Salganik, Matthew J. Bit by bit: social research in the digital age. Princeton University Press, 2017.

Matt Salganik [1] earned his PhD in sociology at Columbia in 2006. Chapter 6 of his book "Bit by Bit" is explicitly about ethics. For Tuesday, please read sections:

  • 6.1 "Introduction"
  • 6.2 "Three examples"
  • 6.4 "Four Principles"
  • 6.6 "Difficulty"

(fun fact: Tukey invented the word 'bit'!)

  2. Metcalf, Jacob, and Emanuel Moss. "Owning Ethics: Corporate Logics, Silicon Valley, and the Institutionalization of Ethics." Social Research: An International Quarterly 86, no. 2 (2019): 449-476. https://datasociety.net/wp-content/uploads/2019/09/Owning-Ethics-PDF-version-2.pdf This piece from last fall gives a sense of how ethics is (or isn't) integrated into the organizational chart and the decision-making process at tech platform companies.

  3. Sweeney on re-identification: Sweeney, Latanya. "Only You, Your Doctor, and Many Others May Know." Technology Science (2015). https://techscience.org/a/2015092903/

We've mentioned Professor Latanya Sweeney [2] several times in class, e.g., how as a grad student she re-identified the Governor of Massachusetts in "anonymized" data and sent him his own medical records. This piece is technical but readable. The section "Approaches" makes clear one of the problems with "anonymous" data, as we discussed in the lab on the Netflix Prize.
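To make the re-identification idea concrete, here is a small illustrative sketch (not from the reading; all names and records below are invented) of the linkage attack Sweeney describes: joining an "anonymized" medical dataset to a public roster on quasi-identifiers such as ZIP code, birth date, and sex.

```python
# Sketch of a linkage attack: "anonymized" records re-identified by joining
# quasi-identifiers against a public roster (e.g., a voter roll).
# All data here are fabricated for illustration.

anonymized_medical = [
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "M", "diagnosis": "X"},
    {"zip": "10027", "birth_date": "1980-01-02", "sex": "F", "diagnosis": "Y"},
]

public_voter_roll = [
    {"name": "Alice Doe", "zip": "10027", "birth_date": "1980-01-02", "sex": "F"},
    {"name": "Bob Roe",   "zip": "02138", "birth_date": "1945-07-31", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def reidentify(medical, roster):
    """Join the two datasets on the quasi-identifier columns."""
    matches = []
    for record in medical:
        key = tuple(record[q] for q in QUASI_IDENTIFIERS)
        candidates = [p["name"] for p in roster
                      if tuple(p[q] for q in QUASI_IDENTIFIERS) == key]
        if len(candidates) == 1:  # a unique match de-anonymizes the record
            matches.append((candidates[0], record["diagnosis"]))
    return matches

print(reidentify(anonymized_medical, public_voter_roll))
# → [('Bob Roe', 'X'), ('Alice Doe', 'Y')]
```

The point, as Sweeney's own work showed at scale, is that ZIP code, birth date, and sex alone uniquely identify a large fraction of the population, so dropping names is not anonymization.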

[1] https://en.wikipedia.org/wiki/Matthew_J._Salganik

[2] https://en.wikipedia.org/wiki/Latanya_Sweeney

Lecture 13: present problems: attention economy+VC=dumpsterfire

Two weeks ago we caught up to the present, including AI + personal data; last week we set a frame for thinking analytically about ethics. This week it's time to see how the power of AI and personal data has been monetized, and how the particular way it's been monetized has transferred the "power" of our personal data from the state to private companies, focused on growth rather than accountability to consumers or consumer protection. The readings cover:

  • the data-enabled advertising model;
  • the venture model, which accelerates the monetization of data; and
  • the consequences for information platforms.

Data will be in the background this week, making the transformations possible but mentioned only as an assumed accelerant (e.g., in references to machine learning and personalization).

  1. The advertising model, 1997: Michael Goldhaber "The attention economy and the net." First Monday 2, no. 4 (1997).

This is a very prescient article, written in the early days of the web. It's entertaining to see what it got right, and what it did not!

  2. The venture model, 2019

The next two excerpts come from books published in 2019 about venture capital and its impact on company growth and governance.

  • 2.1 "Epilogue: From the Past to the Present and the Future" from "VC: An American History", Tom Nicholas, 2019: This is a comprehensive book, and we encourage you to read more of it if you'd like to know the history in more detail, but this brief piece gives a sense of the past, present, and future of venture capital.

  • 2.2 Excerpt from Isaac, Mike. "Super Pumped: The Battle for Uber." W. W. Norton & Company, 2019: This is less scholarly, more human and fun. A very brief excerpt in which Isaac explains not only venture capital but, more importantly, venture capitalists and the founders they fund. They are people too, so it's useful to remember the human part of this dynamic as we think about the rapid growth of information platform companies, particularly 1998-2010.

  3. DiResta, Renee. "Computational propaganda: If you make it trend, you make it true." The Yale Review 106, no. 4 (2018): 12-29: This wide-ranging piece covers the design of platform companies and their optimizing algorithms, along with their consequences. Available via https://yalereview.yale.edu/computational-propaganda

Enjoy!!!

OPTIONAL

PS: Some optional extra readings if you're interested in more on any of the 3 subjects:

Advertising

  1. 1973 (optional): Richard Serra and Carlotta Schoolman. Television Delivers People. Castelli-Sonnabend Films and Tapes, 1973.

Venture

  1. AnnaLee Saxenian. Regional Advantage. Harvard University Press, 1996. Ch. 1. Great book from AnnaLee Saxenian (https://en.wikipedia.org/wiki/AnnaLee_Saxenian), Dean of the UC Berkeley School of Information. It predicted (accurately) that Silicon Valley would overtake Boston. This chapter is a brief history of venture capital, WWII -> 1970s.

Consequences for information platforms:

  1. Grimmelmann, James. "The platform is the message." Georgetown Law Technology Review (2018): 18-30: This brief piece from two years ago by lawyer and scholar James Grimmelmann focuses on fake news and how difficult it is even to define for content-moderation purposes, which makes it a challenge for algorithms.

  2. "Ethical Principles, OKRs, and KPIs: what YouTube and Facebook could learn from Tukey" Chris Wiggins, April 2018 https://datascience.columbia.edu/ethical-principles-okrs-and-kpis-what-youtube-and-facebook-could-learn-tukey

Lecture 14: future solutions

This is our last week; as such, these are our last readings.

The implicit promise of the name of the class --- "data, past, present and future" --- is that we will also be covering our shared future.

In some classes, discussing the future means cutting edge research. This class, as you might have noticed, is about truth and power; so, discussing the future means discussing the current balances among powers, particularly arenas in which these powers are contested.

A useful framework we'll use in our last class is grouping relevant powers into

  • state power,
  • people power, and
  • corporate power.

Over the last weeks, we've talked quite a bit about state power, for example in the history of privacy and the construction of ethics. Last week, we covered corporate power as we covered some of our current problems. By engaging with the contest among powers, we will also, of course, be engaging with potential solutions.

Your readings for this week are organized around those three powers:

Corporate power:

We'll start with how corporate powers are responding to the growing narrative drawing attention to problems in the way data and machine learning govern our shared reality. We will see that the corporate response has largely centered on technical fixes, particularly around privacy. To dig in, we will study three very short pieces.

  1. The first is a piece by two technologists, specifically computer scientists who have recently written a book on how to make algorithms more ethical. This short piece argues that the solution to our problems should come from technical fixes rather than from immediate regulation.
  2. A second short piece investigates more critically one particular corporate fix around privacy: specifically, the very recent proposal by Google and Apple to collaborate on a privacy-preserving tracking mechanism in response to the novel coronavirus pandemic.
  3. The third piece in the section is a direct critique of this argument, called "tech can fix".

State power:

Moving on from corporate power and the technical fixes advocated by corporations, we'll discuss state power, which for many people is the first form of solution that comes to mind for problematic uses of data and algorithms by corporations.

  1. On this topic, please read this 25-page excerpt from a longer law article called "artificial intelligence risk to democracies." The subsection, called "regulation in the age of AI," gives a useful contemporary overview of some existing and proposed regulatory mechanisms.

People power:

Finally, let's return to the subject of people power. Of course, there are many forms of people power you exercise in your daily life outside this class, like voting (impacting state power) or choosing to spend consumer dollars with companies you support (or, conversely, to boycott them).

  1. This very brief nine-page excerpt is about a different type of people power: specifically, the power of employees at technology companies.

(Since this week is about contests of power, it is in some ways not surprising that two of this week's pieces come from the context of law.)

Optional:

If you'd like to know more about this, some optional readings:

  1. A much lighter and more human look at people power is this piece from last year by the publication California Sunday. It contains no editorial content other than a timeline of protests; instead, it offers direct interviews with many people, some anonymous, some named, discussing push-back on the corporate power of their own employers. You may enjoy this direct, ethnographic approach.
  2. A second optional piece, about state power, is another piece from the context of law, investigating what life after Facebook might look like for US government and governance, including some specific proposals.

Finally, if you'd like to know more, I encourage you to read the rest of these papers rather than just the brief excerpts we assigned.

We look forward to discussing these with you on Tuesday.

Prior years' syllabi
