Brett J. Gall
PhD Candidate | Duke University | http://brettgall.com
2020-10-06
If your study or grant benefitted from this guide, I would greatly appreciate your acknowledgement of its contribution to your work through appropriate citation:
Gall, Brett J. 2020. “Simulation-based Power Calculations for Conjoint Experiments.” OSF Preprints. October 6. osf.io/bv6ug.
Why calculate power?
Social scientists increasingly rely on conjoint experiments to estimate population-level preferences, beliefs, and behavioral propensities. Despite the proliferation of conjoint experiments in social scientific research, much of the design and implementation of conjoint experiments is based on largely ad-hoc procedures, rules-of-thumb, and fuzzy intuitions. This is particularly common when researchers are choosing (a) the number of choices each participant must make and (b) the number of participants for a study. This is problematic because these approaches can lead to choices about sampling and design that generate under-powered studies incapable of detecting hypothesized effects or the inefficient over-allocation of resources to over-powered studies. This latter consequence carries addditional risk of the preclusion of study implementation due to the inaccurate perception sufficiently-powered studies are cost-prohibitive.
A priori power calculations help researchers better select study design parameters - including the number of participants and number of choices each participant makes - by formalizing these intuitions. This formalization produces a set of assumptions about the processes assigning treatment status and outcomes to experimental units that can help researchers study how well suited a particular experimental design is to providing sufficiently precise estimates of the effect(s) of interest to them, given their assumptions about the “true” effect(s).
Approaches to calculating power
By far the most common approach to calculating power is to use canned software, such as G*Power, to calculate power based on analytical solutions. Given a set of assumptions about effect sizes and their distributions, these tools typically quickly tell users the exact power of their studies. While convenient and exact, these analytical tools are limited in that they are typically only available for a limited set of models and estimands or impose a narrow set of distributional assumptions on the data generating process.
Simulation provides an alternative to analytical-solution-based power calculations. Simulation is appealing because we only need to specify the expected distributions of variables and their relationship to each other to calculate power with arbitrary distributional assumptions, models, and estimands. The intuition behind simulation is to generate many data-sets that one might observe if the data were generated according to a specified process the researcher believes reasonably captures the “true” process, estimate the effect of interest in each data set, then calculate power based on how often the estimates detect the assumed effect.
While the simulation approach may prove computationally-demanding because of the need to generate many data sets and fit potentially-complex models on those data sets, computational constraints are rarely binding due to the typical size of experimental data, complexity of models, and availability of computing power and storage. Furthermore, although simulation approaches provide only approximations of “true” power (in contrast with the exact power offered by analytical solutions), this approximation error is typically small in practice and decreases with the number of simulated data sets.
Simulating conjoint power
There are many guides for simulating data to calculate power for certain types of experiments and observational studies. For example, McConnell and Vera-Hernandez (2017) provide the formulas and code for simulation-based power calculations for a variety of designs in Stata, Coppock (2013) provides examples in R, Hughes (2017) offers analytic and simulation-based approaches to power calculations, and a variety of examples are found in Arnold et al. (2011). These resources tend to ignore some of the issues that arise from discrete choice experiments, such as the deterministic interdependence of outcomes (choice status) for a given choice, binary dependent variable, or estimation of certain estimands. In contrast, I provide code and instruction for rapidly simulating and analyzing data from the type of conjoint designs typically used in the social sciences.
While the code in the following is flexible enough for users to modify it for a larger number of use cases, I impose some restrictions on the types of conjoint designs and estimands where this code can be applied for instructional purposes. Specifically, I assume participants engage in a forced, binary choice between two profiles and researchers are interested in estimating one of two estimands that dominate applications of conjoint experimental designs: the average marginal component effect or the average component interaction effect (Hainmueller et al. 2013). In practice, alternative models and estimands are as easily adopted in the simulation as they are in an analysis with a single dataset.
Our approach is straightforward and largely follows that recommended by the creators of DeclareDesign:
- Create data based on proposed values for parameters describing the main structure of the experiment: (a) number of study participants, (b) choice tasks per participant, and (c) number of profiles per choice task.
- Create profile attributes and attribute levels.
- Create dummy variables for each profile attribute level.
- Assign potential outcomes to different profile combinations.
- Reveal realized outcome for each observation in light of the assigned attributes and interdependence of outcomes for each set of profiles.
- Estimate model(s), save p-value(s) and point estimate(s).
- Repeat for many simulated data sets.
- Calculate power from model results.
In the following, I’ll introduce code for each of these steps and we’ll slowly build up our code until we have something that fairly easily can simulate power of conjoint experiments via simulation. All examples use the R programming language. While I do not plan to maintain the code within this document, I aim to provide enough detail and intuition that you can update code as necessarily in response to any breaking updates to the tools I employ.
One notable limitation of the code below is its focus on the case of a participant selecting one profile from a set of two profiles. Some conjoint designs ask each participant to choose from more than two profiles, allow participants to select multiple profiles, or provide opportunities to make non-binary choices over profiles. While the code in some cases already accommodates such designs, consider it a starting point for writing your own functions.
Set-up
Let’s first load some packages to make everything faster and easier. We’ll load our packages with a function we’ll define named loadpkg()
, which checks if our packages are installed and installs them if not, then loads all required packages.
#########################################################################
# 'loadpkg' function
# Checks if a vector of packages are installed. If not, installs the
# package. Then loads all packages in vector.
#
# Argument:
# toLoad - character vector of package names
#########################################################################
loadpkg <- function(toLoad){
for(lib in toLoad){
if(! lib %in% installed.packages()[,1]) {
install.packages(lib, repos='http://cran.rstudio.com/')
}
suppressMessages( library(lib, character.only=TRUE) )
}
}
We’ll want the dplyr
package for cleaner code and the piping operator (%>%
), ggplot2
for plotting power at different values of our parameters, and estimatr
for its fast implementation of ordinary least squares estimation with clustered standard errors. Although none of these packages are required (you could replace them with base R equivalents), they will this much quicker and more readable.
Let’s load these packages and set our seed while we’re at it to ensure we can reproduce our data later.
Creating a design “skeleton”
Next, let’s define a create_design_df()
function to generate a data frame containing the high-level parameters of the conjoint design - a “skeleton” upon which we will build: the number of participants, number of choice tasks (or screens) each participant completes, and the number of profiles per task. We’ll also add an indicator if a profile was the first profile on the screen since this helps us introduce profile ordering effects and makes some of the later steps easier.
#########################################################
# Define function: create_design_df
#
# Description: Generates a data frame with participant
# IDs, screen ID, profile_ID (unique identifier for each
# profile) based on specified parameters, and an indicator
# for the "first" profile for a given screen. Data are tidy.
#
# Arguments:
# N = # of participants
# K = # choice tasks observed
# W = # of profiles per choice task
#########################################################
create_design_df <- function(N, K, W) {
temp <- data.frame(
participant_id = sort(rep(seq_len(N), K * W)),
screen_id = sort(rep(seq_len(K * N), W)),
profile_id = seq_len(N * K * W)
)
# Add first_profile == 1 if first in profile set, 0 otherwise
# dplyr::if_else() faster than base::ifelse
temp %>%
dplyr::group_by(screen_id) %>%
dplyr::mutate(first_profile = dplyr::if_else(row_number() == 1, 1, 0)) %>%
dplyr::ungroup()
}
Note that the function is designed to produce the long data structure that is typically used for the analysis of repeated-observation data and conjoint data: each row is one of the profiles from the study and columns provide information about each profile - such as the participant that observed the profile, whether the profile was chosen, profile attributes, and so forth. Since we’re assuming each participant observes two profiles per screen, there will be two rows in our data per screen.
Example: create_design_df()
Creating profile attributes
We now need attributes for each profile. We’ll define a function called create_attributes()
that takes a vector or list of vectors and randomly draws one value from each vector for each profile in our conjoint. We can provide a vector of names for the new attributes via the attr_names
argument and we can specify a list (if we have multiple attributes) or vector (if we have one attribute) of probabilities a value is drawn from the attribute-level vector via the prob
argument. The function produces default names if no names are provided and draws attribute values with equal probability if no probabilities are provided.
##############################################################
# Define function: create_attributes()
#
# Creates randomly generated vectors of values from sets
# of potential values for a specified number of attributes
#
# *Arguments*
#
# attr_names (optional)
# vector of attribute names of attr_n length. If no values
# provided, attributes named attrib1, attrib2, etc.
#
# attr_levels
# vector or list of vectors of length N, each containing the
# unique levels each attribute can take. May include NA values,
# which may be useful for experiments randomizing the presence
# of attributes themselves.
#
# n_profiles
# number of profiles, i.e. # of values to draw for each
# attribute.
#
# probs (optional)
# A vector or list of numeric vectors containing probabilities
# for sampling from the set of unique attribute values. Probabilities
# are assigned to each element in the attrinute value set based
# on vector index. If no value supplied, defaults to assigning
# equal sampling probabilities to all attribute levels.
#
# NOTE: Currently no test is implemented for whether the number
# of probabilities passed to probs() is equal to the number of
# levels for the attribute or that the probabilities provided
# sum to 1 for each vector. Function does check that the number
# of probability vectors is equal to the number of attributes.
##############################################################
create_attributes <-
function(attr_levels,
n_profiles,
attr_names = NULL,
probs = NULL) {
# Convert attr_levels and probs to list to work with lists
# throughout for simplicity/speed. Since vectors are list
# but lists not necessarily vectors, use is.list(X) rather
# than is.vector(X) since, passing a list to is.vector is
# also TRUE.
if (!is.list(attr_levels)) {
attr_levels <- list(attr_levels)
}
if (!is.null(probs) & !is.list(probs)) {
probs <- list(probs)
}
# Number of attributes to generate
attr_n <- length(attr_levels)
# Test inputs are correct length
stopifnot(is.null(attr_names) | attr_n == length(attr_names))
stopifnot(is.null(probs) | attr_n == length(probs))
# If attribute names not provided, generate
# default attribute names, e.g. attrib1, attrib2, ...
if (is.null(attr_names)) {
attr_names <- paste0("attrib", seq_len(attr_n))
}
# If probs not provided, generate default
# equal probabilities of assignment
if (is.null(probs)) {
probs <- list()
for (i in 1:attr_n) {
attr_i_levels_n <- length(attr_levels[[i]])
probs[[i]] <- rep(1 / attr_i_levels_n, attr_i_levels_n)
}
}
# Initialize list to store each attribute vector
attrib_list <- list()
# Generate attributes
for (i in 1:length(attr_names)) {
# If an attribute has only one level, is numeric, and
# >0, the x = argument in sample() thinks the level
# refers to the number of levels, incorrectly. So,
# separate these out since prob = 1 for the single value.
# See ?sample > Details for more information.
if (is.numeric(attr_levels[[i]]) &
length(attr_levels[[i]]) == 1) {
attrib_list[[i]] <- rep(attr_levels[[i]], n_profiles)
} else {
attrib_list[[i]] <- sample(
x = attr_levels[[i]],
size = n_profiles,
replace = TRUE,
prob = probs[[i]]
)
}
}
# Collapse list into data frame, replace names
attrib_df <- dplyr::bind_cols(attrib_list)
colnames(attrib_df) <- attr_names
# Return attribute data frame
attrib_df
}
Example: create_attributes()
# n_profiles <- 50
#
# # One attribute, default probs
# attr_levels <- c("a","b","c")
# create_attributes(attr_levels = attr_levels,
# n_profiles = n_profiles)
#
#
# # One attribute,specified probs
# attr_levels <- c("a","b","c")
# probs <- c(0.10,0.10,0.80)
# create_attributes(attr_levels = attr_levels,
# n_profiles = n_profiles,
# probs = probs)
#
#
# # Two attributes, default probs
# attr_levels <- list(c("a","b","c"),
# c(seq_len(6)))
#
# create_attributes(attr_levels = attr_levels,
# n_profiles = n_profiles)
#
#
# # Two attributes, specified probs
# attr_levels <- list(c("a","b","c"),
# c(seq_len(6)))
# probs <- list(c(0.50, 0.30, 0.20),
# c(0.05, 0.05, 0.05, 0.05, 0.40, 0.40))
#
# create_attributes(attr_levels = attr_levels,
# n_profiles = n_profiles,
# probs = probs)
#
#
# # Two attributes, specified probs and names
# attr_levels <- list(c("a","b","c"),
# c(seq_len(6)))
#
# probs <- list(c(0.50, 0.30, 0.20),
# c(0.05, 0.05, 0.05, 0.05, 0.40, 0.40))
#
# attr_names <- c("foo", "bar")
#
# attr_df <- create_attributes(attr_levels = attr_levels,
# n_profiles = n_profiles,
# attr_names = attr_names,
# probs = probs)
Create dummy variables
While we now know each profile’s value for each attribute, we typically want dummy variables for each potential value an attribute may assume. This makes generating potential outcomes and model estimation easier. We cannot simply use commmon solutions, such asfastDummies::dummy_cols()
or model.matrix()
, on our data to generate these indicators because these approaches produce columns for each value that each attribute actually assumes in our data. We want column for each value in the set of potential values an attribute might assume. This is important since there is typically a non-zero probability an attribute is never assigned to any profile!
For this purpose, we write our own function, create_value_dummies()
, that will take the output of our create_attributes()
function and add dummy variables.
#########################################################
# Define function: create_value_dummies()
#
# Create dummy variables indicating if a given
# profile takes on a specific value from its
# possible values, independent of whether the actual
# random sampling does (not) draw that value and assign
# it to a profile.
#
# Arguments
# attr_df - the output of create_attributes()
# attr_levels - the input list (or vector) of levels to
# create_attributes()
#########################################################
create_value_dummies <-
function(attr_df,
attr_levels) {
# Convert attribute levels to a list for
# easing referencing if not already list
if (!is.list(attr_levels)) {
attr_levels <- list(attr_levels)
}
# Initialize list to store output data frame
attr_cols <- list()
# Generate dummy variables
for (i in 1:ncol(attr_df)) {
# List to store current attribute's
# output columns
temp <- list()
# Levels of current attribute
this_attr_levels <- attr_levels[[i]]
# Generate dummies
for (j in 1:length(this_attr_levels)) {
temp[[j]] <- dplyr::if_else(attr_df[, i] == attr_levels[[i]][j], 1, 0)
}
# Convert each attribute's vectors from list to data frame
temp <- dplyr::bind_cols(temp)
# Generate column names for each
# Add variable names to columns
colnames(temp) <-
paste0(colnames(attr_df)[i], "_", attr_levels[[i]])
# Save all dummies for each attribute into an element
# of the attr_cols list
attr_cols[[i]] <- temp
}
# Collapse each attribute's columns into a single
# data frame, return
dummy_df <- dplyr::bind_cols(attr_cols)
# Joint the columns to the original data
dplyr::bind_cols(attr_df, dummy_df)
}
Combining our non-outcome functions
So far, we have several functions that generates all of the data we need except for the actual outcomes for each profile, i.e. whether a profile was chosen (or not). To make things even simpler, let’s combine all of those functions into a single function called create_predictors()
. The nice thing about doing this is that we don’t need to repeatedly provide arguments that are common across our functions. Instead, we can generate everything with one function call. Below, we combine everything we’ve done so far into a single function. We add two additional arguments to augment the functions:
- sims: determines how many data sets to generate (defaults to 1)
- seed: specifies the randomization seed (defaults to 123).
################################################
# Define function: create_predictors
# Creates randomly generated data sets containing
# all design variables, attributes, and dummy
# variables for the right-hand side of the conjoint
# analysis (excludes the outcome variable),
# based on specified data parameter values.
# Data are "long," such that each row of the data
# is a conjoint profile.
#
# Arguments:
# N: number of participants
# K: number of choices completed
# W: number of profiles per screen
# attr_n: number of attributes overall
# attr_levels: list (or vector)
# of possible levels for each attribute.
# attr_names: (optional): names of attributes. defaults a generic
# name followed by the index of the attribute
# in attr_levels.
# probs (optional): list (or vector) of probabilities used to randomly
# draw values from the attribute levels vectors and assign
# them to profiles.
# sims (optional): number of times to simulate the data, default 1
# seed (optional): randomization seed, default 123
################################################
create_predictors <- function(N,
K,
W,
attr_n,
attr_levels,
attr_names = NULL,
probs = NULL,
sims = 1,
seed = 123) {
# Defines loadpkg()
loadpkg <- function(toLoad) {
for (lib in toLoad) {
if (!lib %in% installed.packages()[, 1]) {
install.packages(lib, repos = 'http://cran.rstudio.com/')
}
suppressMessages(library(lib, character.only = TRUE))
}
}
# Load packages
pkgs <- c("dplyr",
"estimatr",
"ggplot2")
loadpkg(pkgs)
# Set seed for reproducibility
set.seed(seed)
# Define create_design_df
create_design_df <- function(N, K, W) {
temp <- data.frame(
participant_id = sort(rep(seq_len(N), K * W)),
screen_id = sort(rep(seq_len(K * N), W)),
profile_id = seq_len(N * K * W)
)
temp %>%
dplyr::group_by(screen_id) %>%
dplyr::mutate(first_profile = dplyr::if_else(row_number() == 1, 1, 0)) %>%
dplyr::ungroup()
}
# Define create_attributes()
create_attributes <-
function(attr_levels,
n_profiles,
attr_names = NULL,
probs = NULL) {
if (!is.list(attr_levels)) {
attr_levels <- list(attr_levels)
}
if (!is.null(probs) & !is.list(probs)) {
probs <- list(probs)
}
attr_n <- length(attr_levels)
stopifnot(is.null(attr_names) | attr_n == length(attr_names))
stopifnot(is.null(probs) | attr_n == length(probs))
if (is.null(attr_names)) {
attr_names <- paste0("attrib", seq_len(attr_n))
}
if (is.null(probs)) {
probs <- list()
for (i in 1:attr_n) {
attr_i_levels_n <- length(attr_levels[[i]])
probs[[i]] <- rep(1 / attr_i_levels_n, attr_i_levels_n)
}
}
attrib_list <- list()
for (i in 1:length(attr_names)) {
if (is.numeric(attr_levels[[i]]) &
length(attr_levels[[i]]) == 1) {
attrib_list[[i]] <- rep(attr_levels[[i]], n_profiles)
} else {
attrib_list[[i]] <- sample(
x = attr_levels[[i]],
size = n_profiles,
replace = TRUE,
prob = probs[[i]]
)
}
}
attrib_df <- dplyr::bind_cols(attrib_list)
colnames(attrib_df) <- attr_names
attrib_df
}
# Define create_value_dummies
create_value_dummies <-
function(attr_df,
attr_levels) {
if (!is.list(attr_levels)) {
attr_levels <- list(attr_levels)
}
attr_cols <- list()
for (i in 1:ncol(attr_df)) {
temp <- list()
this_attr_levels <- attr_levels[[i]]
for (j in 1:length(this_attr_levels)) {
temp[[j]] <- dplyr::if_else(attr_df[, i] == attr_levels[[i]][j], 1, 0)
}
temp <- dplyr::bind_cols(temp)
colnames(temp) <-
paste0(colnames(attr_df)[i], "_", attr_levels[[i]])
attr_cols[[i]] <- temp
}
dummy_df <- dplyr::bind_cols(attr_cols)
dplyr::bind_cols(attr_df, dummy_df)
}
# Generate data
# Create design data frame
design_df <- create_design_df(N = N, K = K, W = W)
# Initialize list to store data
attribs <- vector("list", length = sims)
# Number of profiles
n_profiles <- N * K * W
# Create sims number of data sets with the attributes values
# and value dummies randomized across data sets
for (i in seq_len(sims)) {
attribs[[i]] <- create_attributes(
attr_levels = attr_levels,
n_profiles = n_profiles,
attr_names = attr_names,
probs = probs
) %>%
create_value_dummies(attr_df = .,
attr_levels = attr_levels)
}
# Join the attribute data to the design data
dat_out <- lapply(attribs, function(attrib_df) {
dplyr::bind_cols(x = design_df, y = attrib_df)
})
# Return list of generate data frames
dat_out
}
With the above code, we can generate arbitrary numbers of data sets with the structure we want in a single call. To reduce the demands on our computers, only the parts of the code dependent on randomization are repeated for each additional simulation.
Example: create_predictors()
# N <- 50
# K <- 3
# W <- 2
# attr_n <- 3
# attr_levels <- list(c(1, 2, 3), c("a", "b", "c", "d", "e"), c(5))
# attr_names <- NULL
# probs <- NULL
# sims <- 2
# seed <- 123
#
#
# dat <- create_predictors(
# N = N,
# K = K,
# W = W,
# attr_n = attr_n,
# attr_levels = attr_levels,
# attr_names = attr_names,
# probs = probs,
# sims = sims,
# seed = seed
# )
Creating (potential) outcomes
We next need to assign binary outcomes (whether a profile was chosen or not) to each profile in our data. We will follow the standard approach for experiments where participants choose only one profile out of at least two profiles and assume our outcome \(Y\) is distribution \(Y \sim Binomial(n,p)\), where \(n = 1\) and \(p\) is the probability a profile is chosen over the other profile. \(Y\) is therefore a dichotomous variable and we will allow for \(p\) to be an arbitrary function of the attribute values for that profile. Attribute values are allowed to have both additive and non-additive relationships with \(p\).
Revealing outcomes
Since we need to specify all of the potential outcomes for arbitrary functions of the profile attribute values, we’ll have to generate potential outcomes manually. Since the manual generation of outcomes will depend on the number of attributes and attribute levels, we’re going to want to create our code in response to a specific conjoint design.
For exemplary purposes, we’ll generate data assuming we have a conjoint experiment with 3 attributes and each attribute will have two levels. We’ll call the first attribute atta
, the second attribute attb
, and the third attribute attc
. We’ll furthermore allow for the effect of atta
on the probability a profile is selected to vary with the value of attb
. Such interactions are commonly of interest to researchers and we may be interested in calculating power for detecting such inter-attribute interactions. To keep things simple, we’ll also assume that our third variable attc
has a deterministic effect of exactly 0 on our outcome. This saves us some manual coding since we don’t need to let \(Y\) vary with this variable.
# Generate data to which we'll assign outcomes
# Specify data parameters
N <- 100
K <- 3
W <- 2
attr_n <- 3
attr_levels <- list(c(0, 1), c("x", "z"), c(0,1))
attr_names <- c("atta","attb","attc")
probs <- NULL
sims <- 5
seed <- 123
# Simulate data
dat <- create_predictors(
N = N,
K = K,
W = W,
attr_n = attr_n,
attr_levels = attr_levels,
attr_names = attr_names,
probs = probs,
sims = sims,
seed = 123
)
Now that we have a list of simulated data frames, we’ll want to simulate outcomes for each profile in each simulated data set. To do this, we’ll first need to assign p
to each observation based on the joint distributions of attributes atta
and attb
. We don’t need to include attc
because it does not affect p
. Normally, we would calculate p
based on the marginal distributions of atta
and attb
, but we embrace the complexity of using the joint distribution because we are allowing for an inter-attribute interaction between atta
and attb
that is of interest to us.
To assign p
to different cells of the joint distributions of atta
and attb
, let’s assume that atta = 1
is chosen 4% more often than atta = 0
when attb = x
and 1% more often when attb = z
. We can then take half of the differences in potential outcomes, add it to 0.50 for the the corresponding profile, then take a random draw from the corresponding distribution. The reason we take half of the difference in the probabilities of selection is that we want symmetry around 50% for the probability a profile is selected. In other words, if we want participants to choose a profile assigned a value of \(J\) for some attribute \(4%\) more often than a profile assigned a value of \(K\) for the same attribute, we would want the probability the former profile is selected to be \(0.52\) and the probability the latter profile is selected to be \(1 - 0.52 = 0.48\). If we don’t allow for this symmetry around 50% probability for the different attribute combinations, we would end up with a scenario where the sum of the probabilities of selecting profiles across all attribute values does not sum to 1.
Let’s now produce the mapping of p
to different attribute combinations for each set of profiles.
# Assign p to different distributions of attribute
# values across the two profiles in each set of
# profiles.
#
# Note: it is best to keep notation straightforward,
# e.g. the values below indicate that the first profile
# has a value of 1 and the second profile a value
# of 0 for attribute atta while the first profile has
# value of x and the second profile a value of x for
# attribute attb.
atta_10_attb_xx <- 0.02
atta_10_attb_xz <- 0.02
atta_10_attb_zz <- 0.005
atta_10_attb_zx <- 0.005
atta_01_attb_xx <- -0.02
atta_01_attb_xz <- -0.02
atta_01_attb_zz <- -0.005
atta_01_attb_zx <- -0.005
# Indifferent between attribute b values
# independent of effect on atta. This
# is not necessary but can be useful
# for note-keeping.
atta_00_attb_xx <- 0
atta_00_attb_xz <- 0
atta_00_attb_zz <- 0
atta_00_attb_zx <- 0
atta_11_attb_xx <- 0
atta_11_attb_xz <- 0
atta_11_attb_zz <- 0
atta_11_attb_zx <- 0
The above code is necessarily a manual process and we can often make coding mistakes. To make sure we’re producing the effects (AMCEs and ACIEs typically) we want, let’s calculate different AMCEs and examine whether AMCEs are conditional on other attributes as expected. The above code should make the results below self-evident, but this can be a good check that you did not make any coding mistakes.
# Check outcomes match the desired effects
# Can add identical profiles to obtain standard AMCE,
#~~~~~~~~~~~~~~~~
# atta AMCEs
#~~~~~~~~~~~~~~~~
# Unconditional AMCE: atta = 1 vs 0
# (0 vs 1 is the AMCE * -1)
c(atta_10_attb_xx - atta_01_attb_xx,
atta_10_attb_xz - atta_01_attb_xz,
atta_10_attb_zz - atta_01_attb_zz,
atta_10_attb_zx - atta_01_attb_zx) %>% mean()
# AMCE | attb = x: atta = 1 vs 0
c(atta_10_attb_xx - atta_01_attb_xx,
atta_10_attb_xz - atta_01_attb_xz) %>% mean()
# AMCE | attb = z: atta = 1 vs 0
c(atta_10_attb_zz - atta_01_attb_zz,
atta_10_attb_zx - atta_01_attb_zx) %>% mean()
#~~~~~~~~~~~~~~~~
# attb AMCEs
#~~~~~~~~~~~~~~~~
# xz
c(atta_10_attb_xz - atta_10_attb_xx,
atta_11_attb_xz - atta_11_attb_xx,
atta_01_attb_xz - atta_01_attb_xx,
atta_00_attb_xz - atta_00_attb_xx) %>% mean() # 0
c(atta_10_attb_xz - atta_10_attb_zx,
atta_11_attb_xz - atta_11_attb_zx,
atta_01_attb_xz - atta_01_attb_zx,
atta_00_attb_xz - atta_00_attb_zx) %>% mean() # 0
c(atta_10_attb_xz - atta_10_attb_zz,
atta_11_attb_xz - atta_11_attb_zz,
atta_01_attb_xz - atta_01_attb_zz,
atta_00_attb_xz - atta_00_attb_zz) %>% mean() # 0
# zz
c(atta_10_attb_zz - atta_10_attb_xx,
atta_11_attb_zz - atta_11_attb_xx,
atta_01_attb_zz - atta_01_attb_xx,
atta_00_attb_zz - atta_00_attb_xx) %>% mean() # 0
c(atta_10_attb_zz - atta_10_attb_zx,
atta_11_attb_zz - atta_11_attb_zx,
atta_01_attb_zz - atta_01_attb_zx,
atta_00_attb_zz - atta_00_attb_zx) %>% mean() # 0
c(atta_10_attb_zz - atta_10_attb_xz,
atta_11_attb_zz - atta_11_attb_xz,
atta_01_attb_zz - atta_01_attb_xz,
atta_00_attb_zz - atta_00_attb_xz) %>% mean() # 0
# zx
c(atta_10_attb_zx - atta_10_attb_xx,
atta_11_attb_zx - atta_11_attb_xx,
atta_01_attb_zx - atta_01_attb_xx,
atta_00_attb_zx - atta_00_attb_xx) %>% mean() # 0
c(atta_10_attb_zx - atta_10_attb_xz,
atta_11_attb_zx - atta_11_attb_xz,
atta_01_attb_zx - atta_01_attb_xz,
atta_00_attb_zx - atta_00_attb_xz) %>% mean() # 0
c(atta_10_attb_zx - atta_10_attb_zz,
atta_11_attb_zx - atta_11_attb_zz,
atta_01_attb_zx - atta_01_attb_zz,
atta_00_attb_zx - atta_00_attb_zz) %>% mean() # 0
# xx
c(atta_10_attb_xx - atta_10_attb_xz,
atta_11_attb_xx - atta_11_attb_xz,
atta_01_attb_xx - atta_01_attb_xz,
atta_00_attb_xx - atta_00_attb_xz) %>% mean() # 0
c(atta_10_attb_xx - atta_10_attb_zz,
atta_11_attb_xx - atta_11_attb_zz,
atta_01_attb_xx - atta_01_attb_zz,
atta_00_attb_xx - atta_00_attb_zz) %>% mean() # 0
c(atta_10_attb_xx - atta_10_attb_zx,
atta_11_attb_xx - atta_11_attb_zx,
atta_01_attb_xx - atta_01_attb_zx,
atta_00_attb_xx - atta_00_attb_zx) %>% mean() # 0
Before proceeding, it is worth emphasizing what we are capturing in this quantity: this is a natural decision given we want to estimate differences in the probabilities of choosing profiles randomly assigned to different attribute values rather than the effect of randomly assigning attribute values to a profile. However, depending upon how one defines the population and choice problem, this is not necessarily the typical quantity researchers estimate (the typical AMCE). If you want to estimate the AMCE, you’ll need to add additional elements to the mean formulas above. However, if you want to estimate an estimand better capturing preferences, you can keep the above but remove the profiles prior to fitting models. More details regarding the mapping between preferences, the AMCE, and typical estimators can be found here, but the key implication is that the effect above are actually larger than the effects expected using the typical estimation procedure (which is incorporated into the code later).
We can at last define our function to generate values of p
for draws from a binomial distribution and assign values of \(Y\) based on these draws. We’ll pass this add_outcomes()
function a data frame (of simulated data) and it will add p
and Y
to the data. Since this also requires some more manual coding to check which value of \(p\) the function should assign, I’ve included some commented-out code that can help with checking you did not make any manual errors. The function is defined assuming we have already loaded our create_predictors()
function, meaning we do not need to re-load other dependencies again.
############################################
# Define function: add_outcomes()
# Generates values p for draws from binomial
# distribution and assigns observed outcomes
# for each unit for a given data frame
#
# Arguments
# - data: data frame(s) of predictor variables,
# commonly a data frame in the list of
# produced via create_predictors()
############################################
# For each profile, calculate p in Binom(n,p). We'll
# use a fairly inelegant for() loop. Start with a vector
# of missing values, then replace the value for the first
# profile on a screen and take 1 - this value for the second
# profile on the screen. Could easily write a function to
# take the variables for each of the four inputs, but
# this took < 5 minutes (albeit prone to humane error, harder
# to read). We'll wrap this into a function so we can apply
# it to each simulated data set.
add_outcomes <- function(df) {
# Vector of missing values to store p
df$p <- NA_real_
# Number of screens (for looping)
screens <- max(df$screen_id)
# Loop through profiles in each screen, calculate p
# based on whether the distribution of attribute
# values matches those for the different values of
# p we previously specified. We can go through
# each of the possible outcomes, multiply logical checks
# for whether an attribute value matches, then add the
# value to 0.50 (indifference) if it matches.
for (i in 1:screens) {
# atta_10_attb_xx
atta_10_attb_xx_out <- atta_10_attb_xx *
df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
df[df$screen_id == i & df$first_profile == 0, "attb_x"]
# atta_10_attb_xz
atta_10_attb_xz_out <- atta_10_attb_xz *
df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
df[df$screen_id == i & df$first_profile == 0, "attb_z"]
# atta_10_attb_zz
atta_10_attb_zz_out <- atta_10_attb_zz *
df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
df[df$screen_id == i & df$first_profile == 1, "attb_z"] *
df[df$screen_id == i & df$first_profile == 0, "attb_z"]
# atta_10_attb_zx
atta_10_attb_zx_out <- atta_10_attb_zx *
df[df$screen_id == i &
df$first_profile == 1, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_z"] *
df[df$screen_id == i & df$first_profile == 0, "attb_x"]
# atta_01_attb_xx
atta_01_attb_xx_out <- atta_01_attb_xx *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_x"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_x"]
# atta_01_attb_xz
atta_01_attb_xz_out <- atta_01_attb_xz *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_x"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_z"]
# atta_01_attb_zz
atta_01_attb_zz_out <- atta_01_attb_zz *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_z"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_z"]
# atta_01_attb_zx
atta_01_attb_zx_out <- atta_01_attb_zx *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_z"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_x"]
# Given the matches, add them to 0.50 and assign sum to p
df[df$screen_id == i &
df$first_profile == 1, "p"] <- 0.50 +
atta_10_attb_xx_out +
atta_10_attb_xz_out +
atta_10_attb_zz_out +
atta_10_attb_zx_out +
atta_01_attb_xx_out +
atta_01_attb_xz_out +
atta_01_attb_zz_out +
atta_01_attb_zx_out
}
# In our example, we chose effects so the range of p is
# between 0.50 +/- 0.02. Let's check this is true!
# df$p %>% table()
# Draw Y for each profile where we have a value for p,
# ignoring the warning that NAs were produced since we WANT
# exactly half of our profiles (half of the rows in our
# input data should be NA)
df <- df %>% dplyr::mutate(Y = rbinom(n = n(),
size = 1,
prob = p))
# Let's make sure our function returns the right # of NA cases!
stopifnot( df$Y %>% is.na() %>% sum() == screens)
# Within each pair of profiles, assign a value of 1 - Y
# to the profile where first_profile == 0, where Y is the
# value of Y when first_profile == 1 within that screen
for (j in 1:screens) {
df[df$screen_id == j &
df$first_profile == 0, "Y"] <-
1 - df[df$screen_id == j & df$first_profile == 1, "Y"]
}
df
}
We can now simply apply our add_outcomes()
function to the list of data frames we obtained as output of create_predictors()
. Below shows an example where we add outcomes to the list of data frames we previously obtained.
Example: add_outcomes()
Creating model results
Model Choice
We can now fit our preferred statistical models to the data. The most common approach in political science is to fit a linear probability model predicting \(Y\) as a function of indicators for profile attributes. As Hainmueller et al. (2013) recommend, standard errors are adjusted for arbitrary intra-participant error correlations via a clustering correction. We’ll fit the model with the lm_robust()
function from the estimatr
package, which is fast and implements the “CR2” clustered correction. See the function’s documentation for more information. While I choose this model, feel free change the model to any other common type for the analysis of discrete choice data.
Adding modeling
Rather than generating our data, storing it in memory, then fitting it on the model object, we’ll save some computing time and avoid running out of memory by adding the modeling step at the end of our add_outcomes()
function, returning only parameters desired for power calculations. Let’s therefore augment our function to produce a new function: create_model_output()
. For clarity, I’ve some deleted comments from our previous example.
##########################################
# Define function: create_model_ouput()
#
# Description: Generates a data frame
# of selected p-values and point
# estimates for linear probability
# model fit on each simulated data
# set.
#
# Arguments
# df: data frame(s) of predictor variables,
# commonly a data frame in the list of
# produced via create_predictors()
# formula: string containing model formula asin
# in the case of lm(), e.g. "y ~ x"
# clusters: bare name of cluster variable,
# typically participant_id
# pvars: character vector of model terms for
# which we want to return the p-values
# alpha: numeric value for significance threshold,
# defaults to 0.05
##########################################
create_model_output <- function(df, formula, clusters, pterms, alpha = 0.05) {
df$p <- NA_real_
screens <- max(df$screen_id)
for (i in 1:screens) {
# atta_10_attb_xx
atta_10_attb_xx_out <- atta_10_attb_xx *
df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
df[df$screen_id == i & df$first_profile == 0, "attb_x"]
# atta_10_attb_xz
atta_10_attb_xz_out <- atta_10_attb_xz *
df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
df[df$screen_id == i & df$first_profile == 0, "attb_z"]
# atta_10_attb_zz
atta_10_attb_zz_out <- atta_10_attb_zz *
df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
df[df$screen_id == i & df$first_profile == 1, "attb_z"] *
df[df$screen_id == i & df$first_profile == 0, "attb_z"]
# atta_10_attb_zx
atta_10_attb_zx_out <- atta_10_attb_zx *
df[df$screen_id == i &
df$first_profile == 1, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_z"] *
df[df$screen_id == i & df$first_profile == 0, "attb_x"]
# atta_01_attb_xx
atta_01_attb_xx_out <- atta_01_attb_xx *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_x"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_x"]
# atta_01_attb_xz
atta_01_attb_xz_out <- atta_01_attb_xz *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_x"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_z"]
# atta_01_attb_zz
atta_01_attb_zz_out <- atta_01_attb_zz *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_z"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_z"]
# atta_01_attb_zx
atta_01_attb_zx_out <- atta_01_attb_zx *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_z"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_x"]
# Given the matches, add them to 0.50 and assign sum to p
df[df$screen_id == i &
df$first_profile == 1, "p"] <- 0.50 +
atta_10_attb_xx_out +
atta_10_attb_xz_out +
atta_10_attb_zz_out +
atta_10_attb_zx_out +
atta_01_attb_xx_out +
atta_01_attb_xz_out +
atta_01_attb_zz_out +
atta_01_attb_zx_out
}
# Draw Y for each profile where we have a value for p
df <- df %>% dplyr::mutate(Y = rbinom(n = n(),
size = 1,
prob = p))
# Within each pair of profiles, assign a value of 1 - Y
# to the profile where first_profile == 0, where Y is the
# value of Y when first_profile == 1 within that screen
for (j in 1:screens) {
df[df$screen_id == j &
df$first_profile == 0, "Y"] <-
1 - df[df$screen_id == j & df$first_profile == 1, "Y"]
}
# Fit statistical model (linear probability model, clustered SEs)
# we need to convert our formula string to a formula with as.formula()
# and convert clusters to a name. However, won't work unless we wrap
# this in do-call to pass to the lm_robust() function, which requires
# bare names for the cluster variable
model_out <- do.call(estimatr::lm_robust,
list(
formula = as.formula(formula),
data = df,
clusters = as.name(clusters)
))
# Create df of p-values for the preferred model terms
model_df <- data.frame(term = names(model_out$p.value[pterms]),
pval = model_out$p.value[pterms])
# Return data frame of p-values and significant indicator
model_df %>% dplyr::mutate(pvar_sig = dplyr::if_else(pval < alpha, 1, 0))
}
Example: create_model_output()
# Using our list of simulated data frames
# as our data source...
# Example using previously generated data,
# let's estimate a model with interactions
# between two attributes, participant clusters,
# and a significance threshold of 0.10, returning
# model information for two terms.
create_model_output(
df = dat[[1]],
formula = "Y ~ atta_0*attb_z",
clusters = "participant_id",
pterms = c("atta_0", "atta_0:attb_z"),
alpha = 0.10
)
term pval pvar_sig
1 atta_0 0.3896093 0
2 atta_0:attb_z 0.3853775 0
Calculating power
With our function, calculating power is as simple as applying our function to all of the simulated data, combining the results, and calculating the share of simulation for which the p-value is statistically significant at the specified threshold (alpha
). We can create a wrapper to make this simpler.
#########################################################
# Define function: calculate_power_wrapper
# Description: calculates power given a list of data
# frames
#
# Arguments
# sim_list: list of simulated data frames
# formula: string of model formulas
# clusters: string of variable identifying clusters
# pterms: character vector for terms whose power should
# be returned
# alpha: numeric significance threshold
#########################################################
calculate_power_wrapper <-
function(sim_list,
formula,
clusters,
pterms,
seed = 123,
alpha = 0.05) {
# Set seed
set.seed(seed)
# Fit models
models <- lapply(sim_list, function(x) {
create_model_output(
df = x,
formula = formula,
clusters = clusters,
pterms = pterms,
alpha = alpha
)
})
# Combine results into a single data frame,
# calculate power for each term
models <- models %>%
dplyr::bind_rows() %>%
dplyr::group_by(term) %>%
summarize(power_pct = 100 * mean(pvar_sig, na.rm = TRUE))
return(models)
}
Let’s now see our new power calculator work on the output of our data simulation functions.
Example: calculate_power_wrapper()
# Given data, calculate power
power_df <- calculate_power_wrapper(
sim_list = all_data,
formula = "Y ~ atta_0*attb_z",
clusters = "participant_id",
pterms = c("atta_0", "atta_0:attb_z"),
seed = 123,
alpha = 0.10
)
# Show results
power_df
# A tibble: 2 x 2
term power_pct
<fct> <dbl>
1 atta_0 20
2 atta_0:attb_z 20
The output contains the power for detecting the interaction of atta_0
and attb_z
and for the indicator for atta_0
. While we only used 5 simulations for example purposes, we typically want to produce many (i.e. 1000+ simulations).
Final workflow
We’ve now created functions to generate our data and calculate power from the data. Let’s show the full workflow now. We’ll breakdown the workflow into three categories:
- Functions
- Inputs
- Calls
Functions
First, we’ll need to load all of our functions. The code gives several options for each of the functions and should be able to accommodate a wide array of conjoint designs without much change, but you will typically need to change the variables used to calculate p
in the create_model_output()
function to the dimensionality of your specific data with regard to attributes and the levels of those attributes.
################################################
# Define function: create_predictors
# Creates randomly generated data sets containing
# all design variables, attributes, and dummy
# variables for the right-hand side of the conjoint
# analysis (excludes the outcome variable),
# based on specified data parameter values.
# Data are "long," such that each row of the data
# is a conjoint profile.
#
# Arguments:
# N: number of participants
# K: number of choices completed
# W: number of profiles per screen
# attr_n: number of attributes overall
# attr_levels: list (or vector)
# of possible levels for each attribute.
# attr_names: (optional): names of attributes. defaults a generic
# name followed by the index of the attribute
# in attr_levels.
# probs (optional): list (or vector) of probabilities used to randomly
# draw values from the attribute levels vectors and assign
# them to profiles.
# sims (optional): number of times to simulate the data, default 1
# seed (optional): randomization seed, default 123
################################################
create_predictors <- function(N,
K,
W,
attr_n,
attr_levels,
attr_names = NULL,
probs = NULL,
sims = 1,
seed = 123) {
# Defines loadpkg()
loadpkg <- function(toLoad) {
for (lib in toLoad) {
if (!lib %in% installed.packages()[, 1]) {
install.packages(lib, repos = 'http://cran.rstudio.com/')
}
suppressMessages(library(lib, character.only = TRUE))
}
}
# Load packages
pkgs <- c("dplyr",
"estimatr",
"ggplot2")
loadpkg(pkgs)
# Set seed for reproducibility
set.seed(seed)
# Define create_design_df
create_design_df <- function(N, K, W) {
temp <- data.frame(
participant_id = sort(rep(seq_len(N), K * W)),
screen_id = sort(rep(seq_len(K * N), W)),
profile_id = seq_len(N * K * W)
)
temp %>%
dplyr::group_by(screen_id) %>%
dplyr::mutate(first_profile = dplyr::if_else(row_number() == 1, 1, 0)) %>%
dplyr::ungroup()
}
# Define create_attributes()
create_attributes <-
function(attr_levels,
n_profiles,
attr_names = NULL,
probs = NULL) {
if (!is.list(attr_levels)) {
attr_levels <- list(attr_levels)
}
if (!is.null(probs) & !is.list(probs)) {
probs <- list(probs)
}
attr_n <- length(attr_levels)
stopifnot(is.null(attr_names) | attr_n == length(attr_names))
stopifnot(is.null(probs) | attr_n == length(probs))
if (is.null(attr_names)) {
attr_names <- paste0("attrib", seq_len(attr_n))
}
if (is.null(probs)) {
probs <- list()
for (i in 1:attr_n) {
attr_i_levels_n <- length(attr_levels[[i]])
probs[[i]] <- rep(1 / attr_i_levels_n, attr_i_levels_n)
}
}
attrib_list <- list()
for (i in 1:length(attr_names)) {
if (is.numeric(attr_levels[[i]]) &
length(attr_levels[[i]]) == 1) {
attrib_list[[i]] <- rep(attr_levels[[i]], n_profiles)
} else {
attrib_list[[i]] <- sample(
x = attr_levels[[i]],
size = n_profiles,
replace = TRUE,
prob = probs[[i]]
)
}
}
attrib_df <- dplyr::bind_cols(attrib_list)
colnames(attrib_df) <- attr_names
attrib_df
}
# Define create_value_dummies
create_value_dummies <-
function(attr_df,
attr_levels) {
if (!is.list(attr_levels)) {
attr_levels <- list(attr_levels)
}
attr_cols <- list()
for (i in 1:ncol(attr_df)) {
temp <- list()
this_attr_levels <- attr_levels[[i]]
for (j in 1:length(this_attr_levels)) {
temp[[j]] <- dplyr::if_else(attr_df[, i] == attr_levels[[i]][j], 1, 0)
}
temp <- dplyr::bind_cols(temp)
colnames(temp) <-
paste0(colnames(attr_df)[i], "_", attr_levels[[i]])
attr_cols[[i]] <- temp
}
dummy_df <- dplyr::bind_cols(attr_cols)
dplyr::bind_cols(attr_df, dummy_df)
}
# Generate data
# Create design data frame
design_df <- create_design_df(N = N, K = K, W = W)
# Initialize list to store data
attribs <- vector("list", length = sims)
# Number of profiles
n_profiles <- N * K * W
# Create sims number of data sets with the attributes values
# and value dummies randomized across data sets
for (i in seq_len(sims)) {
attribs[[i]] <- create_attributes(
attr_levels = attr_levels,
n_profiles = n_profiles,
attr_names = attr_names,
probs = probs
) %>%
create_value_dummies(attr_df = .,
attr_levels = attr_levels)
}
# Join the attribute data to the design data
dat_out <- lapply(attribs, function(attrib_df) {
dplyr::bind_cols(x = design_df, y = attrib_df)
})
# Return list of generate data frames
dat_out
}
##########################################
# Define function: create_model_ouput()
#
# Description: Generates a data frame
# of selected p-values and point
# estimates for linear probability
# model fit on each simulated data
# set.
#
# Arguments
# df: data frame(s) of predictor variables,
# commonly a data frame in the list of
# produced via create_predictors()
# formula: string containing model formula asin
# in the case of lm(), e.g. "y ~ x"
# clusters: bare name of cluster variable,
# typically participant_id
# pvars: character vector of model terms for
# which we want to return the p-values
# seed: randomization seed, default 123
# alpha: numeric value for significance threshold,
# defaults to 0.05
##########################################
create_model_output <- function(df,
formula,
clusters,
pterms,
seed = 123,
alpha = 0.05) {
set.seed(seed)
df$p <- NA_real_
screens <- max(df$screen_id)
for (i in 1:screens) {
# atta_10_attb_xx
atta_10_attb_xx_out <- atta_10_attb_xx *
df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
df[df$screen_id == i & df$first_profile == 0, "attb_x"]
# atta_10_attb_xz
atta_10_attb_xz_out <- atta_10_attb_xz *
df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
df[df$screen_id == i & df$first_profile == 0, "attb_z"]
# atta_10_attb_zz
atta_10_attb_zz_out <- atta_10_attb_zz *
df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
df[df$screen_id == i & df$first_profile == 1, "attb_z"] *
df[df$screen_id == i & df$first_profile == 0, "attb_z"]
# atta_10_attb_zx
atta_10_attb_zx_out <- atta_10_attb_zx *
df[df$screen_id == i &
df$first_profile == 1, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_z"] *
df[df$screen_id == i & df$first_profile == 0, "attb_x"]
# atta_01_attb_xx
atta_01_attb_xx_out <- atta_01_attb_xx *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_x"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_x"]
# atta_01_attb_xz
atta_01_attb_xz_out <- atta_01_attb_xz *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_x"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_z"]
# atta_01_attb_zz
atta_01_attb_zz_out <- atta_01_attb_zz *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_z"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_z"]
# atta_01_attb_zx
atta_01_attb_zx_out <- atta_01_attb_zx *
df[df$screen_id == i &
df$first_profile == 1, "atta_0"] *
df[df$screen_id == i &
df$first_profile == 0, "atta_1"] *
df[df$screen_id == i &
df$first_profile == 1, "attb_z"] *
df[df$screen_id == i &
df$first_profile == 0, "attb_x"]
# Given the matches, add them to 0.50 and assign sum to p
df[df$screen_id == i &
df$first_profile == 1, "p"] <- 0.50 +
atta_10_attb_xx_out +
atta_10_attb_xz_out +
atta_10_attb_zz_out +
atta_10_attb_zx_out +
atta_01_attb_xx_out +
atta_01_attb_xz_out +
atta_01_attb_zz_out +
atta_01_attb_zx_out
}
# Draw Y for each profile where we have a value for p
df <- df %>% dplyr::mutate(Y = rbinom(n = n(),
size = 1,
prob = p))
# Within each pair of profiles, assign a value of 1 - Y
# to the profile where first_profile == 0, where Y is the
# value of Y when first_profile == 1 within that screen
for (j in 1:screens) {
df[df$screen_id == j &
df$first_profile == 0, "Y"] <-
1 - df[df$screen_id == j & df$first_profile == 1, "Y"]
}
# Fit statistical model (linear probability model, clustered SEs)
# we need to convert our formula string to a formula with as.formula()
# and convert clusters to a name. However, won't work unless we wrap
# this in do-call to pass to the lm_robust() function, which requires
# bare names for the cluster variable
model_out <- do.call(estimatr::lm_robust,
list(
formula = as.formula(formula),
data = df,
clusters = as.name(clusters)
))
# Create df of p-values for the preferred model terms
model_df <- data.frame(term = names(model_out$p.value[pterms]),
pval = model_out$p.value[pterms])
# Return data frame of p-values and significant indicator
model_df %>% dplyr::mutate(pvar_sig = dplyr::if_else(pval < alpha, 1, 0))
}
#########################################################
# Define function: calculate_power_wrapper
# Description: calculates power given a list of data
# frames
#
# Arguments
# sim_list: list of simulated data frames
# formula: string of model formulas
# clusters: string of variable identifying clusters
# pterms: character vector for terms whose power should
# be returned
# seed : randomization seed, default 123
# alpha: numeric significance threshold
#########################################################
calculate_power_wrapper <-
function(sim_list,
formula,
clusters,
pterms,
seed = 123,
alpha = 0.05) {
# Fit models
models <- lapply(sim_list, function(x) {
create_model_output(
df = x,
formula = formula,
clusters = clusters,
pterms = pterms,
seed = seed,
alpha = alpha
)
})
# Combine results into a single data frame,
# calculate power for each term
models <- models %>%
dplyr::bind_rows() %>%
dplyr::group_by(term) %>%
summarize(power_pct = 100 * mean(pvar_sig, na.rm = TRUE))
models
}
Inputs
Second, we’ll want to assign values to the inputs to our functions. We can categorize the inputs based on their purpose: creating predictors, creating outcomes, and statistical modeling/selection of results. These will change with the conditions under which you wish to calculate power. For example, if we wanted to calculate power when we have 500 participants, all we need to do is change N <- 100
to N <- 500
.
# Parameters for predictor creation
N <- 100
K <- 3
W <- 2
attr_n <- 3
attr_levels <- list(c(0, 1), c("x", "z"), c(0,1))
attr_names <- c("atta","attb","attc")
probs <- NULL
sims <- 5
seed <- 123
# Parameters for outcome creation
atta_10_attb_xx <- 0.02
atta_10_attb_xz <- 0.02
atta_10_attb_zz <- 0.005
atta_10_attb_zx <- 0.005
atta_01_attb_xx <- -0.02
atta_01_attb_xz <- -0.02
atta_01_attb_zz <- -0.005
atta_01_attb_zx <- -0.005
atta_00_attb_xx <- 0
atta_00_attb_xz <- 0
atta_00_attb_zz <- 0
atta_00_attb_zx <- 0
atta_11_attb_xx <- 0
atta_11_attb_xz <- 0
atta_11_attb_zz <- 0
atta_11_attb_zx <- 0
# Parameters for modeling
# and power calculation
#sim_list <- not needed due to directly passing data to function
formula <- "Y ~ atta_0*attb_z"
clusters <- "participant_id"
pterms <- c("atta_0", "atta_0:attb_z")
alpha <- 0.10
Calls
Third, and finally, use the inputs to generate our data and pass the data to our functions to calculate power.
# Create predictors
predictors <- create_predictors(
N = N,
K = K,
W = W,
attr_n = attr_n,
attr_levels = attr_levels,
attr_names = attr_names,
probs = probs,
sims = sims,
seed = seed
)
# Add outcomes, fit models
power_calc <- calculate_power_wrapper(
sim_list = predictors,
formula = formula,
clusters = clusters,
pterms = pterms,
seed = seed,
alpha = alpha
)
# Show power
power_calc
# A tibble: 2 x 2
term power_pct
<fct> <dbl>
1 atta_0 20
2 atta_0:attb_z 20
Conclusion
This document hopefully provides (a) the intuition for how to calculate power via simulation, (b) several functions you can take away to calculate power for typical conjoint designs, and (c) a starting point for calculating power for more complicated designs.
Using the above approach, researchers can provides pass different parameter values to our functions and calculate power across a variety of different designs. While we have only calculated power for one set of parameters (e.g. sample size) and with a small number of simulations, it is straightforward to change these parameters to calculate power across many parameters and with many simulations.
There are several areas where entrepreneurial programmers might improve on the code. First, nearly everything is processed sequentially and therefore the code is quite slow. One might modify the code for parallel processing or use some basic C++ to speed things up (e.g. see these examples of how to optimize code). Second, one might add additional functionality. Specific areas where features are wanting is the inclusion of intraclass correlation coefficients for each participant’s choices, ordering effects, automated generation of potential outcomes, and choices over alternative statistical models, such as generalized linear models, mixed models, and Bayesian implementations.