Seminar for the Study of Development Strategies
This course centers on close reading and re-analysis of emerging research in the political economy of development, broadly construed. The emphasis is on well-identified research, whether based on experimental or observational data. It is intended for advanced graduate students (3rd - 4th year) who already have strong analytic skills. Auditors are welcome as long as they put in the work. Second-time takers and auditors are also welcome.
The overall structure is that in most weeks an external speaker comes to discuss new or in-progress research. The speaker does not present the work however; instead they share their papers, data and code in advance with the class and a “replication team” has a week to put together a detailed discussion of the work. In other weeks we do something similar with work in progress of students in the class.
Note this course has an unusual format, meeting roughly once every two weeks over the course of a year. This course meets in room 711 of IAB building on Wednesdays from 4:10 - 6:00, generally followed by a dinner for a group of participants. It is led by Macartan Humphreys (email@example.com). If you want to see how this document was made, you can see the code here. Thanks to Jasper Cooper and Tara Slough who have done enormous work on the schedule and thinking through the structure and workflow of the class.
The reading loads are not especially heavy; typically the speaker will provide 1 or 2 readings that give a sense of their research agenda. You should read these carefully. You should also look at the data whether or not you are on the “rep” team. There is no point coming to the class unprepared. My thoughts on reading and discussanting are here http://www.macartan.nyc/teaching/how-to-read/ and here http://www.macartan.nyc/teaching/a-checklist-for-discussants/.
The course will alternate between External and Internal weeks. During external weeks, guest speakers will present research to the class. Student research will be presented during Internal weeks.
Guest speakers will be asked to share data in advance, and students are encouraged to replicate results and submit the results to robustness checks before each class.
Every registered student will be expected to write a one-page response paper in advance of the talk each week. This is due in the class Dropbox by midnight on the Monday before class. If you are presenting in a given week this is not required.
A “rep” team of two students will be assigned a formal role as discussants and prepare oral and written commentary for the guest speaker.
Key elements of this are:
During Internal weeks, student research will be presented.
Students that are not at that stage will be expected to provide an advanced draft of a research design by the end of the year. An advanced design means not only theory, hypothesis and identification strategy but also draft instruments and protocols and a dummy dataset and analysis.
In internal weeks, two students will be assigned to present the research. The first will be assigned to act as the defender of the research and will prepare a presentation and defense of the research. The second student will serve as a devil's advocate, preparing a critique of the presented research.
Each student should expect to serve as a discussant for a guest speaker once per semester and to have his or her research presented once in the year and to act as both a defender and a devil's advocate for another student's research once in the year.
You will be expected to write a paper displaying original research to be presented during one of the internal weeks. These research papers will contain (i) a theoretical argument or motivation, (ii) an empirical test of that argument and (iii) a discussion of policy prescriptions resulting from the argument. A draft of this paper should be the paper used for your “internal” week; it does not have to have been written for this class specifically. The final paper, however, should be the paper as revised in light of the internal week discussions. Some thoughts on writing are here: http://www.macartan.nyc/teaching/on-writing/.
Our current speaker line up is as follows:
| Date | Speaker | Topic |
|---|---|---|
| 16-Sep | Shira Mitchell | Millennium Development Villages |
| 23-Sep | Rich Nielsen | Violent Extremism |
| 14-Oct | Eli Berman | Economics and Conflict |
| 9-Dec | Graeme Blair | Nollywood or oil in the delta |
| 6-Apr | Maarten Voors | Health Systems Sierra Leone |
It is a very unusual thing for speakers to come and share data on unpublished work. It makes for terrific feedback and learning, but can also bring some risks to speakers. This cannot be thought of as a public presentation of research in the usual way and different rules apply. In particular:
We are going to be pretty hardcore about the workflow and using a set of very recent research tools to make sure all the work in the class is transparent and replicable.
The main tools that we will employ are:
GitHub will serve four main purposes:
Collaborating on code together
Publishing replications as web pages
Convert your .Rmd file to readme.md using knitr in R (see Rmd_to_md.R for an example; feel free to add to this script).
GitHub then turns the readme into a webpage. When you convert an .Rmd file to an .md file, you've told R to take the .Rmd, compile all the R code, and make a Markdown file out of it. In each subdirectory, GitHub reads the readme file and turns it into a webpage which everyone in the class can read and which you can use for the presentations.
Discussing and managing issues in the course using the 'issues' feature
Sharing code, functions, packages
To get started with GitHub, you will need two things:
Please write all class reports in Markdown. Information on this here: http://rmarkdown.rstudio.com/. R markdown is fairly simple but has the advantage of letting you a) write \(\LaTeX\) as needed b) integrate your R code directly c) compile to either a pdf, html or even word file. For transparency and error reduction b) is particularly important since we want to stay close to the data and set things up so that everyone in the class plus other presenters can follow your code and analysis.
To create a Markdown document in R:
Use the knit() function in R (this is how we make .md files), or simply click the “Knit HTML / PDF / Word” button on the top panel of RStudio.
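As a rough illustration of the knit step, the sketch below writes a tiny .Rmd file to a temporary folder and converts it to Markdown. The file names and the chunk contents are made up for the example, and this is not the class's Rmd_to_md.R script. It assumes the knitr package is installed and skips the conversion if it is not.

```r
# Write a tiny R Markdown file to a temporary folder (hypothetical file names).
work_dir <- tempfile("rmd_demo_")
dir.create(work_dir)
rmd_file <- file.path(work_dir, "readme.Rmd")
md_file  <- file.path(work_dir, "readme.md")

# Build the chunk delimiter (three backticks) without typing it literally.
fence <- strrep("`", 3)

writeLines(c(
  "# Replication notes",
  "",
  paste0(fence, "{r}"),
  "ages <- c(21, 25, 60, 30, 15)",
  "mean(ages)",
  fence
), rmd_file)

# knit() runs the R chunk and writes a plain Markdown file with code and output.
# The step is skipped when knitr is not available.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::knit(input = rmd_file, output = md_file, quiet = TRUE)
}
```

Opening the resulting .md file shows the chunk's code followed by its printed result, which is what GitHub renders as a page.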
Analysis should be done in R. If you don't know R you should teach yourself. There are various online courses which you can take; have a look at http://tryr.codeschool.com/ and https://www.datacamp.com/. If you love your Stata or Excel and just cannot get on top of R, make sure you are on a team with someone who can so that final analyses can be implemented in R.
We will keep an updated list of packages that you will need in the install_packages.R script. Run this script to get the new version of all packages.
We will keep all data on dropbox so that it can be sourced in from a single location. This is good practice and means that everything has to run off core data and not from individually customized files.
The easiest way to share data on Dropbox is to:
use the source_DropboxData() function, from the repmis package.
Using this method we don't have to each store the data on our own computer, but can just temporarily use it in R. This avoids over-burdening our hard drives with large datasets.
So for example here is some data:
rm(list = ls(all = TRUE))
library(repmis)
data <- source_DropboxData(key = "5zqvxaz6evtc16d", file = "dummydata.csv")

## Downloading data from: https://dl.dropboxusercontent.com/s/5zqvxaz6evtc16d/dummydata.csv
##
## SHA-1 hash of the downloaded data file is:
## 3e82bde102084dc6cac7f14558ab1add4c4cf786
It looks like this
##   ID Age Voted
## 1  1  21     1
## 2  2  NA     1
## 3  3  25     0
## 4  4  60     1
## 5  5  30     0
## 6  6  15     0
And here is some analysis:
# Age difference between voters and non voters:
t.test(Age ~ Voted, data = data)

##
##  Welch Two Sample t-test
##
## data:  Age by Voted
## t = -0.85866, df = 1.1034, p-value = 0.5372
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -221.2860  186.9526
## sample estimates:
## mean in group 0 mean in group 1
##        23.33333        40.50000
All claims in your text should come from the data. For example, the average age is 30.2 (computed from the data above, ignoring the missing value).
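To make the "claims come from the data" idea concrete, here is a small sketch in base R. It rebuilds the six rows printed above by hand (so it runs without Dropbox access; the object name dummy is made up for the example), computes the average age, and then reproduces the Welch statistic and degrees of freedom step by step so you can see where the numbers in the t-test output come from.

```r
# Rebuild the six rows printed above so the example runs without Dropbox access.
dummy <- data.frame(
  ID    = 1:6,
  Age   = c(21, NA, 25, 60, 30, 15),
  Voted = c(1, 1, 0, 1, 0, 0)
)

# A claim such as "the average age is ..." should be computed, not typed.
mean_age <- mean(dummy$Age, na.rm = TRUE)

# Welch t-test by hand: split ages by group, dropping the missing value.
age0 <- dummy$Age[dummy$Voted == 0 & !is.na(dummy$Age)]
age1 <- dummy$Age[dummy$Voted == 1 & !is.na(dummy$Age)]

# Squared standard error of each group mean.
v0 <- var(age0) / length(age0)
v1 <- var(age1) / length(age1)

# Test statistic and Welch-Satterthwaite degrees of freedom.
t_stat   <- (mean(age0) - mean(age1)) / sqrt(v0 + v1)
welch_df <- (v0 + v1)^2 /
  (v0^2 / (length(age0) - 1) + v1^2 / (length(age1) - 1))

# The built-in test should agree with the hand calculation.
fit <- t.test(Age ~ Voted, data = dummy)
```

In an R Markdown report the same idea applies to running text: refer to an object such as mean_age through inline R code rather than copying the number by hand, so the text updates whenever the data do.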
See here for an example of what a replication published to GitHub might look like.
Using DeclareDesign to formally characterize the research designs
For each analysis we want to try to formally characterize the design and wring it through the alpha version (alpha as in struggling, not as in tough) of DeclareDesign.
DeclareDesign is a package that I am working on with Graeme Blair, Jasper Cooper and Alex Coppock. It is designed to let you describe the core elements of a research design in an abstract way and then you get a set of outputs that provide information on the features of the design — bias, power, coverage — as well as objects useful for registration such as dummy data and mock analyses.
In the DeclareDesign framework, there are six core elements of a research design. You should be able to identify each of these for each replication:
In a replication, you will typically already have the data. The instructions below demonstrate how DeclareDesign can be used with pre-existing data.
To install the package, use devtools in combination with the access key. Please do not share the key during this alpha phase.

# Use this code here to install the DeclareDesign package
rm(list=ls())
devtools::install_github(repo = "egap/DeclareDesign",
                         auth_token = "7c4a0e3d05e33bd9bc15eae4a198a69f614e77ac")
We generate some example data using DeclareDesign DGP functions. You should already have data, so this step will not be necessary.

population_user <- declare_population(
  individuals = list(
    income = declare_variable()),
  villages = list(
    development_level = declare_variable(multinomial_probabilities = 1:5/sum(1:5))
  ),
  group_sizes_per_level = list(
    individuals = rep(1,1000),
    villages = rep(5,200)
  ))

user_data <- draw_population(population = population_user)

save(user_data, file = "baseline_data.RData")
First, we load the baseline data created by the user, and then define a set of covariates that will be simulated to conduct power analysis and for simulated analyses.
load("baseline_data.RData")
kable(head(user_data), digits = 3)
Second, we define the potential outcomes, which will be simulated based on the baseline covariate data.
potential_outcomes <- declare_potential_outcomes( condition_names = c("Z0","Z1"), outcome_formula = Y ~ .01 + 0*Z0 + .2*Z1 + .1*income )
Third, we resample (bootstrap) from the user data, respecting levels.
population <- declare_population( individuals = list(), villages = list(), N_per_level = c(500, 10), data = user_data)
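The idea of resampling while "respecting levels" can be sketched in base R. This is only an illustration of a two-stage bootstrap, not DeclareDesign's implementation: the helper resample_by_level(), the column names village_id and boot_village, and the toy data are all hypothetical. It draws villages with replacement first and then draws individuals with replacement within each drawn village, so individuals always stay with their own village.

```r
set.seed(42)

# Toy two-level data: 20 villages with 5 individuals each (hypothetical).
toy_data <- data.frame(
  village_id = rep(1:20, each = 5),
  income     = rnorm(100)
)

# Two-stage bootstrap: villages first, then individuals within each village.
resample_by_level <- function(data, n_villages, n_per_village) {
  sampled_villages <- sample(unique(data$village_id), n_villages, replace = TRUE)
  draws <- lapply(seq_along(sampled_villages), function(k) {
    rows   <- data[data$village_id == sampled_villages[k], , drop = FALSE]
    picked <- rows[sample(nrow(rows), n_per_village, replace = TRUE), , drop = FALSE]
    # A village drawn twice counts as two separate bootstrap villages.
    picked$boot_village <- k
    picked
  })
  do.call(rbind, draws)
}

# 10 bootstrap villages of 50 individuals each gives 500 rows in total.
boot <- resample_by_level(toy_data, n_villages = 10, n_per_village = 50)
```

Resampling whole villages rather than pooling all individuals keeps the within-village dependence of the original data in every bootstrap draw.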
Fourth, we define one or more analyses we will run based on simulated data. This analysis will also be used for power analysis.
estimand <- declare_estimand(declare_ATE(), target = "population", label = "ATE")
Then we declare the design of the experiment, in this case a simple one without clusters or blocking.
assignment <- declare_assignment(potential_outcomes = potential_outcomes)
Then declare the estimator.
estimator <- declare_estimator(formula = Y ~ Z, estimates = difference_in_means, estimand = estimand)
Before finalizing the design, we conduct a power analysis to determine whether 500 units and 10 clusters (villages) are sufficient. To do this, we use the diagnose() function.
The output of the diagnose() function is a summary of important statistical properties of the design, including the statistical power, bias, and frequentist coverage (among other uses, an indicator of whether the statistical power is calculated correctly). Here is the diagnosis summary for our simple experiment:
diagnosis <- diagnose(population = population, assignment = assignment,
                      estimator = estimator,
                      potential_outcomes = potential_outcomes, sims = 1000)
kable(summary(diagnosis), digits = 3)
The information that diagnose outputs can be very useful for characterizing designs ex post.
The output has six important pieces of information. The first is the population average treatment effect, or PATE, the causal effect of the treatment on those in a finite population from which we have sampled. The sample average treatment effect, or SATE, is different: when we sample a particular set of units, the true average difference in potential outcomes might deviate from the PATE. In this example, we are treating the sample as the population, so there is no deviation of the SATE. Power in this simulation is defined as the probability of obtaining a statistically significant difference-in-means – this occurred in 100% of the simulations. Reassuringly, the difference-in-means estimator does not exhibit any bias. Moreover, the coverage is very close to the theoretical target of 0.95, implying that the estimated confidence interval covers the true effect roughly 95% of the time, as it should.
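To see what bias, power and coverage mean operationally, here is a minimal simulation sketch in base R. It is not how diagnose() is implemented. It borrows the coefficients from the potential outcomes formula above (intercept .01, effect .2, income slope .1), while the standard normal income and noise terms are assumptions made for the illustration, so the numbers it produces will generally differ from the diagnosis summary discussed above.

```r
set.seed(1)
n <- 500

# Fixed potential outcomes for a finite population of n units.
income <- rnorm(n)                        # assumed covariate distribution
noise  <- rnorm(n)                        # assumed unit-level noise
Y0 <- 0.01 + 0.1 * income + noise         # outcome under control
Y1 <- Y0 + 0.2                            # outcome under treatment
true_ate <- mean(Y1 - Y0)                 # the estimand

# One run of the design: assign, reveal outcomes, estimate, record.
simulate_once <- function() {
  Z <- sample(rep(c(0, 1), each = n / 2))   # complete random assignment
  Y <- ifelse(Z == 1, Y1, Y0)               # only one outcome is observed
  fit <- t.test(Y[Z == 1], Y[Z == 0])       # difference in means
  c(estimate = unname(fit$estimate[1] - fit$estimate[2]),
    p_value  = fit$p.value,
    covered  = fit$conf.int[1] <= true_ate && true_ate <= fit$conf.int[2])
}

# Repeat the design many times and summarise its properties.
sims <- t(replicate(1000, simulate_once()))
diagnosis_sketch <- c(
  bias     = mean(sims[, "estimate"]) - true_ate,
  power    = mean(sims[, "p_value"] < 0.05),
  coverage = mean(sims[, "covered"])
)
print(round(diagnosis_sketch, 3))
```

Bias is the average gap between the estimate and the estimand across runs, power is the share of runs with a significant result, and coverage is the share of runs whose confidence interval contains the estimand.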
For more details on how to use DeclareDesign, visit the alpha version of the website here.