SSDS Syllabus

Seminar for the Study of Development Strategies

General Information

The course centers on close reading and re-analysis of emerging research in the political economy of development, broadly construed. The emphasis is on well-identified research, whether based on experimental or observational data. It is intended for advanced graduate students (3rd or 4th year) who already have strong analytic skills. Auditors are welcome as long as they put in the work. Second-time takers and auditors are also welcome.

The overall structure is that in most weeks an external speaker comes to discuss new or in-progress research. The speaker does not present the work, however; instead they share their papers, data, and code with the class in advance, and a “replication team” has a week to put together a detailed discussion of the work. In other weeks we do something similar with work in progress by students in the class.

Note that this course has an unusual format, meeting roughly once every two weeks over the course of a year. It meets in room 711 of the IAB building on Wednesdays from 4:10 to 6:00, generally followed by a dinner for a group of participants. It is led by Macartan Humphreys (mh2245@columbia.edu). If you want to see how this document was made, you can see the code here. Thanks to Jasper Cooper and Tara Slough, who have done enormous work on the schedule and on thinking through the structure and workflow of the class.

Expectations

Reading

The reading loads are not especially heavy; typically the speaker will provide 1 or 2 readings that give a sense of their research agenda. You should read these carefully. You should also look at the data whether or not you are on the “rep” team. There is no point coming to the class unprepared. My thoughts on reading and on acting as a discussant are here http://www.macartan.nyc/teaching/how-to-read/ and here http://www.macartan.nyc/teaching/a-checklist-for-discussants/.

Participation

The course will alternate between External and Internal weeks. During external weeks, guest speakers will present research to the class. Student research will be presented during Internal weeks.

External weeks:

Guest speakers will be asked to share data in advance, and students are encouraged to replicate results and submit the results to robustness checks before each class.

Key elements of this are:

  1. Be in touch with authors and be sure you have the data, papers, and all you need at least a week in advance
  2. Make sure you can make sense of the data and run a basic replication.
  3. When you have a feel for things, jot down a brief pre-replication plan. What do you plan to look at? What do you expect to find? Archive this on Dropbox.
  4. Then there are two ways to expand the analysis;
  5. Meet me briefly on the Monday before class to go over your main material.
  6. Generate a presentation that
  7. Note that while we focus a lot on statistical replication and re-analysis there are many sides to a paper. Your presentation should not shy from discussing more fundamental conceptual or interpretational issues as appropriate.

Internal weeks

During Internal weeks, student research will be presented.

Each student should expect to serve as a discussant for a guest speaker once per semester and to have his or her research presented once in the year and to act as both a defender and a devil's advocate for another student's research once in the year.

Writing requirement

You will be expected to write a paper displaying original research to be presented during one of the internal weeks. These research papers will contain (i) a theoretical argument or motivation, (ii) an empirical test of that argument and (iii) a discussion of policy prescriptions resulting from the argument. A draft of this paper should be the paper used for your “internal” week; it does not have to have been written for this class specifically. The final paper, however, should be the paper as revised in light of the internal week discussions. Some thoughts on writing are here: http://www.macartan.nyc/teaching/on-writing/.

The Speakers

The Agenda

Our current speaker line up is as follows:

Date    Speaker             Provisional Topic
16-Sep  Shira Mitchell      Millennium Development Villages
23-Sep  Rich Nielsen        Violent Extremism
14-Oct  Eli Berman          Economics and Conflict
28-Oct  Donald Green        Vote-buying
4-Nov   Pablo Querubin      Accountability
18-Nov  Leonard Wantchekon  Deliberation
9-Dec   Graeme Blair        Nollywood or oil in the delta
3-Feb   Jessica Gottlieb    TBC
10-Feb  Gwyneth McClendon   TBC
24-Feb  Thomas Fujiwara     TBC
9-Mar   Peter Bergman       Education
23-Mar  Daniel Hidalgo      TBC
6-Apr   Maarten Voors       Health Systems Sierra Leone
13-Apr  Jens Hainmueller    TBC
27-Apr  Francesco Trebbi    TBC

The Rules

It is a very unusual thing for speakers to come and share data on unpublished work. It makes for terrific feedback and learning, but can also bring some risks to speakers. This cannot be thought of as a public presentation of research in the usual way and different rules apply. In particular:

Workflow and Tools

We are going to be pretty hardcore about the workflow and using a set of very recent research tools to make sure all the work in the class is transparent and replicable.

The main tools that we will employ are:

GitHub

GitHub will serve four main purposes:

  1. Collaborating on code together

  2. Publishing replications as web pages

  3. Discussing and managing issues in the course using the 'issues' feature

  4. Sharing code, functions, packages

To get started with GitHub, you will need two things:

  1. a GitHub account
  2. the GitHub desktop app

Markdown

Please write all class reports in Markdown. Information on this is here: http://rmarkdown.rstudio.com/. R Markdown is fairly simple but has the advantage of letting you a) write \(\LaTeX\) as needed, b) integrate your R code directly, and c) compile to a PDF, HTML, or even Word file. For transparency and error reduction, b) is particularly important, since we want to stay close to the data and set things up so that everyone in the class, plus other presenters, can follow your code and analysis.

To create a Markdown document in R:
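One way to do this from the R console is sketched below. It is a minimal sketch, not the class template: the file name, title, and chunk contents are placeholders, and it assumes the rmarkdown package (with pandoc) is installed if you want the file compiled. In RStudio, File > New File > R Markdown produces a similar skeleton.

```r
# Minimal sketch: write a skeleton R Markdown report to disk.
# The file name, title, and chunk contents are placeholders.
fence <- strrep("`", 3)  # three backticks, the R Markdown chunk delimiter

rmd_lines <- c(
  "---",
  "title: \"Replication report\"",
  "output: html_document",
  "---",
  "",
  "Some text, with math such as $\\beta$ where needed.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",   # any R code can go in the chunk
  fence
)

report_path <- file.path(tempdir(), "replication_report.Rmd")
writeLines(rmd_lines, report_path)

# Compile to HTML only if rmarkdown and pandoc are available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(report_path, quiet = TRUE)
}
```

Changing `output:` to `pdf_document` or `word_document` switches the compiled format.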

R

Analysis should be done in R. If you don't know R you should teach yourself. There are various online courses which you can take; have a look at http://tryr.codeschool.com/ and https://www.datacamp.com/. If you love your Stata or Excel and just cannot get on top of R, make sure you are on a team with someone who can so that final analyses can be implemented in R.

We will keep an updated list of packages that you will need in the install_packages.R script. Run this script to get the new version of all of them.
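For orientation, a script of this kind typically checks which packages are missing and installs only those. The sketch below is an assumption about what such a script could look like, not the contents of install_packages.R; the function name `install_missing` and the package list are illustrative.

```r
# Hypothetical helper: report (and optionally install) packages that are
# not yet available. This is NOT the actual install_packages.R script.
install_missing <- function(pkgs, dry_run = TRUE) {
  missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
  if (!dry_run && length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  invisible(missing_pkgs)
}

# Illustrative list only; with dry_run = TRUE nothing is installed.
missing_pkgs <- install_missing(c("devtools", "knitr", "rmarkdown", "repmis"))
missing_pkgs
```

Calling it with `dry_run = FALSE` would perform the installation.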

Using Dropbox

We will keep all data on Dropbox so that it can be sourced in from a single location. This is good practice and means that everything has to run off core data and not from individually customized files.

The easiest way to share data on Dropbox is to:

Using this method we don't have to each store the data on our own computer, but can just temporarily use it in R. This avoids over-burdening our hard drives with large datasets.

Using R, Markdown and Dropbox together

So for example here is some data:

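# Start from a clean workspace
rm(list = ls(all = TRUE))
# repmis provides source_DropboxData()
library(repmis)
# Read the CSV directly from the shared Dropbox location
data <- source_DropboxData(key = "5zqvxaz6evtc16d", file = "dummydata.csv")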
## Downloading data from: https://dl.dropboxusercontent.com/s/5zqvxaz6evtc16d/dummydata.csv 
## 
## SHA-1 hash of the downloaded data file is:
## 3e82bde102084dc6cac7f14558ab1add4c4cf786

It looks like this

data 
##   ID Age Voted
## 1  1  21     1
## 2  2  NA     1
## 3  3  25     0
## 4  4  60     1
## 5  5  30     0
## 6  6  15     0

And here is some analysis:

# Age difference between voters and non voters:
t.test(Age ~ Voted, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  Age by Voted
## t = -0.85866, df = 1.1034, p-value = 0.5372
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -221.2860  186.9526
## sample estimates:
## mean in group 0 mean in group 1 
##        23.33333        40.50000

All claims in your text should come from the data. For example the average age is 30.2.
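In R Markdown such a number is best produced by inline code rather than typed by hand, so that it updates whenever the data change. The sketch below recreates the dummy data from the values printed above (so it runs without Dropbox access) and computes the figure; the object name `avg_age` is illustrative.

```r
# Dummy data copied from the table printed above
data <- data.frame(
  ID    = 1:6,
  Age   = c(21, NA, 25, 60, 30, 15),
  Voted = c(1, 1, 0, 1, 0, 0)
)

# Average age, ignoring the one missing value
avg_age <- round(mean(data$Age, na.rm = TRUE), 1)
avg_age

# In the .Rmd text this would be written inline as:
#   For example the average age is `r avg_age`.
```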

See here for an example of what a replication published to GitHub might look like.

Using DeclareDesign to formally characterize the research designs

For each analysis we want to try to formally characterize the design and wring it through the alpha version (alpha as in struggling, not as in tough) of DeclareDesign. DeclareDesign is a package that I am working on with Graeme Blair, Jasper Cooper and Alex Coppock. It is designed to let you describe the core elements of a research design in an abstract way and then you get a set of outputs that provide information on the features of the design — bias, power, coverage — as well as objects useful for registration such as dummy data and mock analyses.

In the DeclareDesign framework, there are six core elements of a research design. You should be able to identify each of these for each replication:

  1. The population. The set of units about which inferences are sought;
  2. The potential outcomes function. The outcomes that each unit might exhibit depending on how the causal process being studied changes the world;
  3. The sampling strategy. The strategy used to select units to include in the study sample;
  4. The estimands. The specification of the things that we want to learn about the world, described in terms of potential outcomes;
  5. The assignment function. The manner in which units are assigned to reveal one potential outcome or another;
  6. The estimator function. The procedure for generating estimates of quantities we want to learn about.
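To make the six elements concrete, here is a minimal base R sketch of a single simulated study. It does not use DeclareDesign; the sample sizes, the noise term, and all object names are assumptions chosen for illustration, with effect sizes echoing the example formula used later in this section.

```r
set.seed(42)

# 1. Population: 1,000 hypothetical units with one covariate
N <- 1000
population <- data.frame(id = 1:N, income = rnorm(N))

# 2. Potential outcomes: untreated (Y_Z0) and treated (Y_Z1) outcome per unit
population$Y_Z0 <- 0.01 + 0.1 * population$income + rnorm(N)
population$Y_Z1 <- population$Y_Z0 + 0.2

# 3. Sampling strategy: simple random sample of 200 units
sampled <- population[sample(N, 200), ]

# 4. Estimand: population average treatment effect
estimand <- mean(population$Y_Z1 - population$Y_Z0)

# 5. Assignment: complete random assignment of half the sample to treatment
sampled$Z <- sample(rep(0:1, each = 100))

# Reveal the observed outcome implied by each unit's assignment
sampled$Y <- ifelse(sampled$Z == 1, sampled$Y_Z1, sampled$Y_Z0)

# 6. Estimator: difference in means
estimate <- mean(sampled$Y[sampled$Z == 1]) - mean(sampled$Y[sampled$Z == 0])

c(estimand = estimand, estimate = estimate)
```

For a replication, steps 1 and 3 are usually fixed by the data you receive, while steps 2 and 4 are where your assumptions about the causal process have to be stated explicitly.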

In a replication, you will typically already have the data. The instructions below demonstrate how DeclareDesign can be used with pre-existing data.

To install the package, use devtools in combination with the access key. Please do not share the key during this alpha phase.

# Use this code here to install the DeclareDesign package 
rm(list=ls())
devtools::install_github(repo = "egap/DeclareDesign", 
                         auth_token = "7c4a0e3d05e33bd9bc15eae4a198a69f614e77ac"
                         )

We generate some example data using DeclareDesign DGP functions. You should already have data, so this step will not be necessary.

population_user <- declare_population(
  individuals = list(
    income = declare_variable()),
  villages = list(
    development_level = declare_variable(multinomial_probabilities = 1:5/sum(1:5))
  ),
  group_sizes_per_level = list(
    individuals = rep(1,1000), 
    villages = rep(5,200)
  ))

user_data <- draw_population(population = population_user)

save(user_data, file = "baseline_data.RData")

First, we load the baseline data created by the user, and then define a set of covariates that will be simulated to conduct power analysis and for simulated analyses.

load("baseline_data.RData")

kable(head(user_data), digits = 3)
     villages_ID  income  individuals_ID  development_level
1              1  -0.939               1                  5
314           63  -0.042               2                  4
636          128   0.829               3                  5
681          137  -0.439               4                  2
627          126  -0.314               5                  5
692          139  -2.129               6                  5

Second, we define the potential outcomes, which will be simulated based on the baseline covariate data.

potential_outcomes     <-  declare_potential_outcomes(
  condition_names = c("Z0","Z1"),
  outcome_formula = Y ~ .01 + 0*Z0 + .2*Z1 + .1*income
)

Third, we resample (bootstrap) from the user data, respecting levels.

population <- declare_population(
  individuals = list(),
  villages = list(),
  N_per_level = c(500, 10),
  data = user_data)

Fourth, we define one or more analyses we will run based on simulated data. This analysis will also be used for power analysis.

estimand <- declare_estimand(declare_ATE(), target = "population", label = "ATE")

Then we declare the design of the experiment, in this case a simple one without clusters or blocking.

assignment <- declare_assignment(potential_outcomes = potential_outcomes)

Then declare the estimator.

estimator <- declare_estimator(formula = Y ~ Z, estimates = difference_in_means, estimand = estimand)

Before finalizing the design, we conduct a power analysis to determine whether 500 units and 10 clusters (villages) are sufficient. To do this, we use the diagnose function.

The output of the diagnose() function is a summary of important statistical properties of the design, including the statistical power, bias, and frequentist coverage (among other uses, an indicator of whether the statistical power is calculated correctly). Here is the diagnosis summary for our simple experiment:

diagnosis <- diagnose(population = population, assignment = assignment, 
                      estimator = estimator, potential_outcomes = potential_outcomes, sims = 1000)
kable(summary(diagnosis), digits = 3)
                                 PATE  sd(SATE)  Power   RMSE  Bias  Coverage
Y~Z1-Z0_diff_in_means_estimator   0.2         0      1  0.006     0      0.96

The information that diagnose outputs can be very useful for characterizing designs ex post.

The output has six important pieces of information. The first is the population average treatment effect, or PATE, the causal effect of the treatment on those in a finite population from which we have sampled. The sample average treatment effect, or SATE, is different: when we sample a particular set of units, the true average difference in potential outcomes might deviate from the PATE. In this example, we are treating the sample as the population, so there is no deviation of the SATE. Power in this simulation is defined as the probability of obtaining a statistically significant difference-in-means; this occurred in 100% of the simulations. Reassuringly, the difference-in-means estimator does not exhibit any bias. Moreover, the coverage is very close to the theoretical target of 0.95, implying that the estimated confidence interval covers the true effect roughly 95% of the time, as it should.
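The logic behind these diagnostics can be sketched in base R by re-running the assignment and estimation many times over a fixed set of potential outcomes. This is an illustration of the idea only, not how `diagnose()` is implemented; the noise term, the use of a Welch t-test, and all object names are assumptions, so the numbers will not match the table above.

```r
set.seed(1)

# Fixed potential outcomes for 500 units (sample treated as the population)
N <- 500
true_effect <- 0.2
income <- rnorm(N)
Y_Z0 <- 0.01 + 0.1 * income + rnorm(N)
Y_Z1 <- Y_Z0 + true_effect

# One simulated experiment: assign, reveal outcomes, estimate
one_run <- function() {
  Z <- sample(rep(0:1, each = N / 2))
  Y <- ifelse(Z == 1, Y_Z1, Y_Z0)
  fit <- t.test(Y[Z == 1], Y[Z == 0])
  c(estimate = unname(fit$estimate[1] - fit$estimate[2]),
    p_value  = fit$p.value,
    covered  = fit$conf.int[1] <= true_effect && true_effect <= fit$conf.int[2])
}

# Repeat the experiment 1,000 times; one row per simulation
sims <- as.data.frame(t(replicate(1000, one_run())))

diagnosis <- c(
  power    = mean(sims$p_value < 0.05),                    # share significant
  bias     = mean(sims$estimate) - true_effect,            # average error
  rmse     = sqrt(mean((sims$estimate - true_effect)^2)),  # typical error size
  coverage = mean(sims$covered)                            # share of CIs containing the truth
)
round(diagnosis, 3)
```

Because only the assignment varies across runs, the spread in `sims$estimate` reflects randomization alone, which is the sense in which the SATE does not deviate from the PATE in the example above.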

For more details on how to use DeclareDesign, visit the alpha version of the website here.