I’m trying something new by posting my notes on the textbook Bit by Bit: Social Research in the Digital Age by Matthew J. Salganik.

When I was first entering the computer science field, I was inspired by the research being done using social networks. The rise of Twitter meant we could track the impact of wildfires at a scale and granularity we never had before. The online game FoldIt helped biochemists find efficient protein structures by formulating the problem as a game. My small effort of tweeting or playing a game could contribute to our understanding. At the same time, I was studying how to produce software and wrangle that data in order to understand the world.

A decade later, things look different: some of the initial results from social research using big data didn’t pan out, I now work with big data as part of my day job, and articles are frequently published on the troubling aspects of big data and digital products, including questions about user data privacy and the ethics of online experiments and products.

Bit by Bit gives an overview and practical advice for conducting research and designing digital experiments with ethics in mind. At the same time, the book expresses the optimism I felt a decade ago.

Part of the book outlines statistical approaches for mining data, gathering surveys, and performing experiments. Computing counts and taking weighted averages of survey data might not be the shiniest of methods, but extracting experimental results out of data and being able to answer a solid research question is an invaluable tool.

While Bit by Bit sometimes feels targeted at social researchers in academia, journalism, or government, I think the textbook’s material is useful for anyone who works with big data. I’d even argue the social researcher’s perspective is important for data scientists, software engineers, and product designers at tech companies that use big data, as it can help us make better decisions about how to use data safely and recognize the data’s potential problems.

Notes

What follows are my (biased) notes from Bit by Bit. They are intended as a table of contents, to supplement your own notes, or to show what’s inside the book. My notes omit most of the book’s anecdotes from real-world social research that illustrate many of these points. Conveniently, the book text is also available online, so you can jump to the relevant section and read the original text!

I don’t promise that the notes are completely accurate, but feel free to open an issue on my GitHub if you want to suggest improvements.

Introduction (Chapter 1)

Examples of social research using big data

  • Using phone records to estimate the economic distribution across Rwanda. [related article]
  • Using Google Search results to predict flu activity. [wikipedia]
  • Manipulating the sentiment of Facebook statuses. [wikipedia]
  • Using an online video game to find more optimal ways of folding proteins. [wikipedia]

Recurring themes of the book

  • “Readymades” vs “custommades”: Should you use existing datasets, products, and systems to host your experiment? Or, if you do need to create something new, how can you do that?
  • Ethics (and optimism): Many social research projects in the digital age come with ethical questions about consent and privacy.

Summary of chapters

  • Observing Data (Chapter 2): Observing behavior by using big data.
  • Asking Questions (Chapter 3): Using surveys to ask questions to find out why.
  • Running Experiments (Chapter 4): Finding causality.
  • Creating Mass Collaboration (Chapter 5): Getting help by crowd-sourcing.
  • Ethics (Chapter 6): Increasing the potential for good while reducing harm.
  • The Future (Chapter 7): How these will look in the future.

Observing Data (Chapter 2)

Observing behavior by using big data.

In the traditional categories of experimentation, “big data” corresponds to observational data. Observational data contrasts with survey data, where the researcher can also ask why, and with experimental data, which gives a clearer picture of causality.

Using big data in social research is not new: government records have been used in social research for a long time. What is new is the sheer amount of data, with records constantly being created. The data is also not as clean.

Typically, big data was collected for some other purpose (tracking the operations of a website! names from phone books!).

Two points with big data:

  • You can collect a lot of data, but it’s not collected for research.
  • You can run experiments that were never possible before.

Tip: First think of the ideal dataset for your problem. Then compare it with the data that’s actually available. The differences can reveal biases and other problems in the dataset you have.

Ten common characteristics of big data (Section 2.3)

Big

Lots of it!

But it can have systematic biases.

Always-on

Logged constantly, sometimes even before the research project is conceived.

This makes it easier to look into the past or to keep following behavior into the future.

Non-reactive

More realistic. Unlike a lab setting, users won't act like they're being studied.

But it can still have social desirability bias.

Incomplete demographics

Not completely representative: e.g. Xbox users are mostly young men.

Can follow up with surveys, user-attribute inference, and imputation.

Missing behavior elsewhere

What are they doing on other platforms?

Can do record linkage

Bad proxy measure

Missing "data to operationalize theoretical construct." Might not measure what the experiment is about.

Inaccessible

Companies are risk-averse about sharing. Sometimes researchers can get access, but are restricted from sharing the data elsewhere.

Nonrepresentative

Not a random sample, so can't generalize to population.

But that can be okay! Some experiments don't need to generalize to a population to prove their point.

Drifting

"To measure change, don't change the measure", but site generating data can change

For example:
  • population: who uses it (% genders)
  • behavior: how it's used (hashtags during protests)
  • system: length of tweets

Algorithm confounded

The system itself influences the numbers.

For example, Facebook encouraged users to have at least 20 friends.

Dirty

Data can have spam and bots.

Sensitive

There are ethical concerns. Some information should be kept private.

Research Strategies (Section 2.4)

  • Counting things: It’s generally better to count in order to answer an interesting research question than to count something just because it’s never been counted before. Big data allows more possibility for segmenting populations.
  • Forecasting things: Forecasting is tricky, so research usually does “nowcasting” instead: predicting the present from the past (a nowcasting sketch follows this list).
  • Approximating experiments: Use observational data to create a natural experiment.
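
Here’s a minimal sketch of the nowcasting idea, with made-up data and hypothetical variable names: official flu counts arrive with a delay, so fit a model that predicts the current, not-yet-published count from last period’s official count plus an always-on signal such as search volume.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weeks = 200

# Made-up series: true flu activity, last week's official report, and a noisy
# real-time proxy such as search volume for flu-related terms.
flu = 100 + 30 * np.sin(np.arange(weeks) / 8) + rng.normal(0, 5, weeks)
lagged_report = np.roll(flu, 1)                        # last week's count
search_volume = 0.8 * flu + rng.normal(0, 10, weeks)   # available right now

X = np.column_stack([lagged_report, search_volume])[1:]  # drop the wrapped week
y = flu[1:]

model = LinearRegression().fit(X[:-1], y[:-1])   # train on history
print("nowcast for the latest week:", model.predict(X[-1:]).round(1))
print("actual value:", y[-1].round(1))
```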

Mathematical Notes

The mathematical notes of chapter 2 describe the potential outcomes framework, a way to make a causal inference from non-experimental data.
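
A minimal sketch of the potential outcomes idea, with simulated data: each person has an outcome with treatment and an outcome without, only one of which is ever observed. When treatment isn’t randomly assigned, the naive difference in means between treated and untreated people doesn’t recover the true average treatment effect, which is why the framework’s assumptions matter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated potential outcomes: y0 without treatment, y1 with treatment.
y0 = rng.normal(50, 10, n)
y1 = y0 + 5                    # the true treatment effect is 5

# Non-random assignment: people with higher y0 are more likely to be treated.
p_treat = 1 / (1 + np.exp(-(y0 - 50) / 10))
treated = rng.random(n) < p_treat

# In real data only one potential outcome per person is observed.
y_obs = np.where(treated, y1, y0)

naive = y_obs[treated].mean() - y_obs[~treated].mean()
print(f"true ATE = {(y1 - y0).mean():.2f}, naive difference = {naive:.2f}")
# The naive estimate is biased upward because treatment is confounded with y0.
```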

Asking Questions (Chapter 3)

Surveys: Asking questions and finding out why

This chapter was about surveys and in-depth interviews.

  • Why ask questions? Because behavioral data doesn’t tell you what people are thinking.

The total survey error framework (Section 3.3)

The total survey error is the combination of representation error (who is asked) and measurement error (how well the answers to the questions capture what the researcher wants to measure).

Representation

To get from the target population to the respondents, there are three sources of error:

  • Target population: the population the researcher intends to generalize to. Since the survey can only reach certain people (e.g. people with telephones), the error introduced in moving to the frame population is the coverage error.
  • Frame population: the population that can be sampled from. The error introduced by sampling from the frame population to create the sample population is the sampling error.
  • Sample population: the people who are sent surveys. The error introduced by people not responding is the non-response error.
  • Respondents: the people who respond to the survey and make up the survey’s dataset.

Measurement

The questions asked in surveys can change what is tested. Some ways to try not to be wrong are:

  • Read up on how to write questions
  • Borrow other questions (with citations)
  • If two wordings both seem reasonable, ask both questions
  • Randomly split the population and give each cohort a different question wording
  • Do a pre-test

Who to ask (Section 3.4)

In probability sampling, all members of the target population have a known, nonzero probability of being sampled.

However, sometimes the probability of being sampled is unknown; methods that still work in this setting are called non-probability sampling. For example, applying post-stratification to the data can help correct estimates, under the assumption of “homogeneous response propensities within groups.” In another case, researchers may run into sample-size issues when they need to split the respondents into small groups; multilevel regression can make up for small sample sizes by pooling information from similar groups.

Surveys linked to big data sources (Section 3.6)

Two ways to combine survey results with big data sources are:

  • enriched asking: expand survey dataset by joining with a data source.
  • amplified asking: extrapolate survey results to a larger population by predicting from the big data source (see the sketch below).
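
As a minimal sketch of amplified asking (the model, features, and data here are all made up): fit a model on the small set of survey respondents using features from a big data source, then use it to predict the survey outcome for everyone covered by that source.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical big data source: one feature row per person (e.g. phone metadata).
n_population = 100_000
X_population = rng.normal(size=(n_population, 5))

# Only a small sample answered the survey (e.g. reported household wealth).
survey_idx = rng.choice(n_population, size=1_000, replace=False)
true_coefs = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y_survey = X_population[survey_idx] @ true_coefs + rng.normal(0, 1, size=1_000)

# Fit on the respondents, then "amplify" to the whole data source.
model = Ridge(alpha=1.0).fit(X_population[survey_idx], y_survey)
y_hat = model.predict(X_population)
print(f"estimated population mean: {y_hat.mean():.2f}")
```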

Mathematical Notes

  • Probability sampling: Random sampling to estimate the population mean.
  • Horvitz-Thompson: Adjust by the probability of inclusion.
  • Stratified sampling: Split the population into groups to measure. Can predict population mean for each subgroup.
  • Post-stratification: Split into groups, make predictions, then recombine with weights. Helps reduce variance. (A sketch of these estimators follows this list.)
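
A minimal sketch of the two weighting ideas above, using a made-up population with two groups: the Horvitz-Thompson estimator weights each respondent by the inverse of their inclusion probability, while post-stratification computes per-group means and recombines them using known group shares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population: group A (70%) and group B (30%) with different outcomes.
N = 100_000
group = rng.choice(["A", "B"], size=N, p=[0.7, 0.3])
y = np.where(group == "A", rng.normal(10, 2, N), rng.normal(20, 2, N))

# Non-uniform sampling: group B members are three times as likely to be sampled.
pi = np.where(group == "A", 0.01, 0.03)      # inclusion probabilities
sampled = rng.random(N) < pi

# Horvitz-Thompson estimate of the population mean: weight responses by 1/pi.
ht_mean = np.sum(y[sampled] / pi[sampled]) / N

# Post-stratification: per-group sample means recombined with known group shares.
shares = {g: np.mean(group == g) for g in ["A", "B"]}
ps_mean = sum(shares[g] * y[sampled & (group == g)].mean() for g in ["A", "B"])

print(f"true={y.mean():.2f}  HT={ht_mean:.2f}  post-stratified={ps_mean:.2f}")
```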

Can have two types of non-response: unit (person) or item (question)

If surveys are sent to a random sample, who responds is still biased, which can affect the results if responding is related to what is being measured (e.g. unemployed people may respond more often, which would overestimate the unemployment rate). One way around the bias is to find subgroups with little variation in response propensities and outcomes, and then use the respondents in each subgroup to stand in for the nonrespondents in that subgroup.

Running Experiments (Chapter 4)

Distinguish between:

  • Perturb-and-observe experiments: Change something and see what happens. Hard to draw conclusions about treatment vs background effect.
  • Randomized controlled trials: Split into treatment and control groups. Only give treatment to the treatment group. Can measure effects.

What are experiments? (Section 4.2)

Randomized controlled trials have several steps.

  • Recruit participants
  • Randomize treatment
  • Deliver treatment
  • Measure outcomes

Two dimensions of experiments: lab-field and analog-digital (Section 4.3)

One dimension is the lab-field spectrum:

  • lab: Controlled. Can test for the exact mechanism (“why”).
  • field: Realistic. Might have more variance. No additional data to answer “why.”

Another dimension is the analog-digital spectrum.

Digital tools can be used in some or all steps of a randomized controlled trial. For example, an experiment run on Wikipedia can be recruited, randomized, delivered, and measured entirely with digital tools, while a researcher might use MTurk just to recruit participants for an otherwise analog experiment.

Some benefits of digital experiments

  • In some types of digital experiments, it is very cheap to increase the size of the study: you can go from 100 to millions of participants with little extra work.
  • They tend to have more access to pretreatment information, due to the always-on nature of big data. Pretreatment information can be used for blocking, targeted recruitment, estimating heterogeneity of treatment effects, and covariate adjustment (a blocking sketch follows this list).
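
As a sketch of the blocking idea (the covariate and the numbers here are hypothetical): sort participants by a pretreatment covariate, form blocks of similar participants, and randomize within each block, so treatment and control end up balanced on that covariate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretreatment covariate, e.g. each user's activity last month.
activity = rng.lognormal(mean=3, sigma=1, size=1_000)

# Sort by the covariate and pair up similar participants.
order = np.argsort(activity)
assignment = np.empty(len(activity), dtype=int)
for pair in order.reshape(-1, 2):
    # Within each pair, one participant is randomly assigned to treatment.
    assignment[pair] = rng.permutation([0, 1])

print("treated mean activity:", round(activity[assignment == 1].mean(), 1))
print("control mean activity:", round(activity[assignment == 0].mean(), 1))
```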

Some weaknesses of digital field experiments

  • Ethical concerns
  • Can only be used to test what is able to be manipulated.

Moving Beyond Simple Experiments (Section 4.4)

There are ways to move beyond simple experiments, and digital tools can help.

Two types of experiments:

  • Within-subjects designs follow the same subjects over time
  • Between-subjects designs compare different subjects given different treatments

There are three other things to consider: validity, heterogeneity of treatment effects, and mechanism.

There are ways to evaluate the experiment’s validity:

  • statistical conclusion validity: Are we using statistics correctly?
  • internal validity: Are we correctly randomizing, delivering, measuring?
  • construct validity: Are we measuring what we think we’re measuring?
  • external validity: Does the experiment generalize?

Heterogeneity of treatment effects means that people within the same group may react differently to the same treatment. In the digital age we have more participants and know more about them, which can help us understand the heterogeneity of treatment effects.

One challenge of experiments is to discover the mechanism (also called intervening variables or mediating variables): how the treatment causes the effect. One approach is to use the ease of recruiting additional participants to run a full factorial design.
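
A minimal sketch of a full factorial design with hypothetical factors: enumerate every combination of treatment arms with itertools.product, then assign participants across all of the cells. With cheap digital recruitment it is often feasible to fill every cell.

```python
import itertools
import random

# Hypothetical factors for an encouragement-style experiment.
factors = {
    "message_tone": ["neutral", "emotional"],
    "sender": ["friend", "organization"],
    "incentive": ["none", "small", "large"],
}

# Full factorial design: every combination of levels (2 * 2 * 3 = 12 cells).
cells = list(itertools.product(*factors.values()))
print(len(cells), "experimental conditions")

# Randomly assign each participant to one cell.
random.seed(0)
participants = [f"user_{i}" for i in range(24)]
assignment = {p: random.choice(cells) for p in participants}
print(assignment["user_0"])
```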

Making it happen (Section 4.5)

Use existing systems

For example, run ads or use ratings online.

Some ethical questions. Cheap, easy, but restricted.

Experiment product

Custom-built product for the purpose of running the lab experiment.

Expensive.

Closer to a lab experiment: less realistic, but you can zoom in on the mechanism.

Build a product

Create a custom-built product. Try to attract users.

Like MovieLens. But it's hard.

Partner with the powerful

Partner with a tech company with access to data and participants.

But business has other motivations.

Advice (Section 4.6)

Advice for running experiments

  • Think about the experiment before you conduct it.
  • Run a bunch of different experiments: all will have shortcomings.
  • Decrease variable cost (per participant).

Reducing harm

  • replace: experiments with less invasive methods
  • refine: treatments to make it as harmless as possible.
  • reduce: the number of participants.

Mathematical Notes

The difference-in-differences estimator can help reduce variance. Instead of comparing the treatment and control groups directly, compare the before-and-after change in the treatment group with the before-and-after change in the control group.
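
A minimal sketch of the difference-in-differences calculation on simulated before/after measurements: subtracting the control group’s change removes the shared time trend, leaving an estimate of the treatment effect with less variance than a simple post-treatment comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated outcome measured before and after; row 0 is control, row 1 is treatment.
before = rng.normal(100, 15, size=(2, n))
common_trend = 3      # change over time that affects everyone
true_effect = 2       # extra change caused by the treatment

after = before + common_trend + rng.normal(0, 5, size=(2, n))
after[1] += true_effect

change_control = (after[0] - before[0]).mean()
change_treatment = (after[1] - before[1]).mean()
did = change_treatment - change_control
print(f"difference-in-differences estimate: {did:.2f} (true effect = {true_effect})")
```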

Creating Mass Collaboration (Chapter 5)

Mass collaboration: Getting help by crowd-sourcing

Use people to do more work!

Human Computation

Give a lot of people a bunch of tiny tasks, such as classifying examples or labeling datasets. Then combine and de-bias.

Ways to de-bias when using human computation

  • Redundancy! Get many answers to the same question.
  • Remove bogus answers from users manipulating the system.
  • Find systematic biases (e.g. distant galaxies tend to get classified as elliptical).
  • Weight answers by how good each user is (see the sketch below).
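
A minimal sketch of combining redundant labels with per-user weights (the users, labels, and accuracy numbers are made up): each item gets labels from several people, and each label is weighted by how well its labeler did on a gold-standard set.

```python
from collections import defaultdict

# Hypothetical (user, item, label) triples from a microtask platform.
labels = [
    ("alice", "galaxy_1", "spiral"), ("bob", "galaxy_1", "spiral"),
    ("carol", "galaxy_1", "elliptical"),
    ("alice", "galaxy_2", "elliptical"), ("bob", "galaxy_2", "spiral"),
    ("carol", "galaxy_2", "elliptical"),
]

# Per-user accuracy measured on a gold-standard set (hypothetical numbers).
accuracy = {"alice": 0.95, "bob": 0.60, "carol": 0.80}

# Weighted vote: add each user's accuracy to the tally for the label they chose.
votes = defaultdict(lambda: defaultdict(float))
for user, item, label in labels:
    votes[item][label] += accuracy[user]

consensus = {item: max(tally, key=tally.get) for item, tally in votes.items()}
print(consensus)  # {'galaxy_1': 'spiral', 'galaxy_2': 'elliptical'}
```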

Open calls

For example, the Netflix prize, FoldIt, Kaggle competitions.

Open calls work best with prediction problems, but don’t work as well for answering “why” or “how”.

Distributed data collection

Let citizen scientists post findings on a website.

Tips for getting more accurate data collection

  • Flag weird results (e.g. reports of birds showing up in a region where they don’t live); see the sketch below.
  • Educate people on how to submit data correctly.
  • Use statistical models to correct the data (e.g. give a higher weight to posts by more accurate birdwatchers).
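
As a sketch of the flagging idea (the species and regions are made up): check each submitted report against where the species is expected to occur, and queue anything unexpected for review instead of silently accepting or dropping it.

```python
# Hypothetical range map: which regions each species is expected in.
expected_regions = {
    "snowy owl": {"arctic", "northern plains"},
    "scarlet macaw": {"tropics"},
}

reports = [
    {"observer": "user_17", "species": "snowy owl", "region": "arctic"},
    {"observer": "user_42", "species": "scarlet macaw", "region": "arctic"},
]

# Flag reports of a species outside its expected range for manual review.
flagged = [
    r for r in reports
    if r["region"] not in expected_regions.get(r["species"], set())
]
print(flagged)  # the scarlet macaw report is queued for review, not discarded
```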

Ethics (Chapter 6)

There are ethical questions about experiments using big data for social research: on one hand, people are getting enrolled in unethical experiments, there are data privacy issues, and there is concern about expanding surveillance. On the other hand, using big data for social research is a powerful technique and some important research isn’t being done because of uncertainty around ethics.

Reasonable people will disagree about what is ethical!

  • Examples of digital experiments with ethical gray areas (Section 6.2).
  • Examples of historical experiments that involved human rights abuses (Section 6.3).

Digital experiments differ from traditional experiments because:

  • participants are always being tracked.
  • data can have unanticipated secondary uses.
  • experiments include more people in more places, so the rules, laws, and norms are inconsistent and overlapping.

Principles (Section 6.4)

Some example principles that can be used when running experiments come from the U.S. Department of Homeland Security’s Menlo Report, a digital-age counterpart to the Belmont Report.

Respect for Persons

Treat individuals as autonomous. Those who are less autonomous get more protections.

e.g. informed consent

Beneficence

Risk-benefit analysis. Likelihood vs severity. Try to increase benefit while decreasing risks.

e.g. Smaller study size. Exclude minors

Justice

Protection: don't experiment on the vulnerable.

Access: include minorities so they can benefit

e.g. Don't experiment on the poor for medical improvements to the rich.

Respect for Law + Public Interest

Compliance: Follow terms and conditions, and laws.

Transparency-based accountability: share what you're doing.

Two ethical frameworks (Section 6.5)

The book mentions two ethical frameworks that help explain why “reasonable people will disagree”: deontology and consequentialism. The “quick and crude” way to differentiate them is that deontologists focus on means while consequentialists focus on ends.

Areas of Difficulty (Section 6.6)

Rule of thumb: “Some form of informed consent for most research.” The default should be “informed consent for everything!” There are cases where informed consent isn’t required, but you should have a good argument for why.

When informed consent is difficult

  • Some important research couldn’t happen if informed consent were required. For example, discrimination studies that submit the same resumes with different names wouldn’t work if resume reviewers were informed before the experiment. In these cases, it’s good to debrief afterwards.
  • Other times, informed consent can lead to more risk. For example, research on people being censored can endanger those being censored. In these cases, one approach is to make information public and provide a way to opt-out. Researchers can also seek consent from groups that represent participants.
  • It’s hard to get consent in anonymous settings. One approach is to contact a sample of participants.

Informational Risk

Rule of thumb: all data is “potentially identifiable, and potentially dangerous”.

Data Protection Plan (Desai, Ritchie, and Welpton 2016)

  • Safe projects: e.g. make the project ethical
  • Safe people: e.g. limit who has access
  • Safe data: e.g. de-identify and aggregate data
  • Safe settings: e.g. data is stored on computers with passwords
  • Safe output: e.g. review the experiment output and prevent privacy breaches

Release strategies (Goroff 2015)

  • If you don’t share, there’s little benefit to society and little risk to participants.
  • If you share data with a small number of people, the benefits to society grow a lot and the risk to participants doesn’t increase as much.
  • If you “release and forget”, there is a smaller marginal benefit to society and much higher marginal risk to participants

Privacy

Some say invading privacy is a harm in itself, while others say it is harmful only if bad things result.

One framework that can be used is context-relative informational norms (Nissenbaum 2010), which has you frame the problem as:

Information of [attributes: types of information] about [subjects] flows from [actor: sender] to [actor: recipient] under [transmission principles].

Examples:

  • U.S. Voter records are public, but shouldn’t be published in a newspaper.
  • Full data given to a sketchy government for any possible use is different from giving partially anonymized data to a university that is subject to an ethics committee.

Making decisions

The uncertainty around whether things are ethical leads to the precautionary principle, or “better safe than sorry.” However, being overly cautious means not doing the best experiments. Some approaches for making decisions under this uncertainty:

  • Minimal risk standard: compare the experiment’s risk to the risks of everyday activities.
  • Power analysis: run the experiment on as few participants as needed to detect the effect (see the sketch below).
  • Ethical-response survey: ask people in a survey whether they consider the experiment ethical.
  • Staged trials: start experiments on smaller groups at first.
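
A minimal sketch of the power-analysis idea, using the standard normal-approximation formula for comparing two means: choose the smallest effect you care about, then compute roughly how many participants per group are needed, so the experiment doesn’t expose more people than necessary.

```python
from scipy.stats import norm

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided, two-sample comparison.

    effect_size is Cohen's d: the difference in means divided by the
    standard deviation.
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return int(round(2 * (z_alpha + z_power) ** 2 / effect_size ** 2))

print(n_per_group(0.5))  # a medium effect: roughly 63 participants per group
print(n_per_group(0.2))  # a small effect: roughly 392 participants per group
```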

The Future (Chapter 7)

  • Blend readymade and custommade approaches.
  • Design more for participants.
  • Research will be designed with ethics in mind.

See Also