Brett J. Gall

PhD Candidate | Duke University | http://brettgall.com

2020-10-06

If your study or grant benefitted from this guide, I would greatly appreciate your acknowledgement of its contribution to your work through appropriate citation:

Gall, Brett J. 2020. “Simulation-based Power Calculations for Conjoint Experiments.” OSF Preprints. October 6. osf.io/bv6ug.

Why calculate power?

Social scientists increasingly rely on conjoint experiments to estimate population-level preferences, beliefs, and behavioral propensities. Despite the proliferation of conjoint experiments in social scientific research, much of the design and implementation of conjoint experiments is based on largely ad-hoc procedures, rules-of-thumb, and fuzzy intuitions. This is particularly common when researchers are choosing (a) the number of choices each participant must make and (b) the number of participants for a study. This is problematic because these approaches can lead to choices about sampling and design that generate under-powered studies incapable of detecting hypothesized effects or the inefficient over-allocation of resources to over-powered studies. This latter consequence carries addditional risk of the preclusion of study implementation due to the inaccurate perception sufficiently-powered studies are cost-prohibitive.

A priori power calculations help researchers better select study design parameters - including the number of participants and number of choices each participant makes - by formalizing these intuitions. This formalization produces a set of assumptions about the processes assigning treatment status and outcomes to experimental units that can help researchers study how well suited a particular experimental design is to providing sufficiently precise estimates of the effect(s) of interest to them, given their assumptions about the “true” effect(s).

Approaches to calculating power

By far the most common approach to calculating power is to use canned software, such as G*Power, to calculate power based on analytical solutions. Given a set of assumptions about effect sizes and their distributions, these tools typically quickly tell users the exact power of their studies. While convenient and exact, these analytical tools are limited in that they are typically only available for a limited set of models and estimands or impose a narrow set of distributional assumptions on the data generating process.

Simulation provides an alternative to analytical-solution-based power calculations. Simulation is appealing because we only need to specify the expected distributions of variables and their relationship to each other to calculate power with arbitrary distributional assumptions, models, and estimands. The intuition behind simulation is to generate many data-sets that one might observe if the data were generated according to a specified process the researcher believes reasonably captures the “true” process, estimate the effect of interest in each data set, then calculate power based on how often the estimates detect the assumed effect.

While the simulation approach may prove computationally-demanding because of the need to generate many data sets and fit potentially-complex models on those data sets, computational constraints are rarely binding due to the typical size of experimental data, complexity of models, and availability of computing power and storage. Furthermore, although simulation approaches provide only approximations of “true” power (in contrast with the exact power offered by analytical solutions), this approximation error is typically small in practice and decreases with the number of simulated data sets.

Simulating conjoint power

There are many guides for simulating data to calculate power for certain types of experiments and observational studies. For example, McConnell and Vera-Hernandez (2017) provide the formulas and code for simulation-based power calculations for a variety of designs in Stata, Coppock (2013) provides examples in R, Hughes (2017) offers analytic and simulation-based approaches to power calculations, and a variety of examples are found in Arnold et al. (2011). These resources tend to ignore some of the issues that arise from discrete choice experiments, such as the deterministic interdependence of outcomes (choice status) for a given choice, binary dependent variable, or estimation of certain estimands. In contrast, I provide code and instruction for rapidly simulating and analyzing data from the type of conjoint designs typically used in the social sciences.

While the code in the following is flexible enough for users to modify it for a larger number of use cases, I impose some restrictions on the types of conjoint designs and estimands where this code can be applied for instructional purposes. Specifically, I assume participants engage in a forced, binary choice between two profiles and researchers are interested in estimating one of two estimands that dominate applications of conjoint experimental designs: the average marginal component effect or the average component interaction effect (Hainmueller et al. 2013). In practice, alternative models and estimands are as easily adopted in the simulation as they are in an analysis with a single dataset.

Our approach is straightforward and largely follows that recommended by the creators of DeclareDesign:

Create data based on proposed values for parameters describing the main structure of the experiment: (a) number of study participants, (b) choice tasks per participant, and (c) number of profiles per choice task.
Create profile attributes and attribute levels.
Create dummy variables for each profile attribute level.
Assign potential outcomes to different profile combinations.
Reveal realized outcome for each observation in light of the assigned attributes and interdependence of outcomes for each set of profiles.
Estimate model(s), save p-value(s) and point estimate(s).
Repeat for many simulated data sets.
Calculate power from model results.

In the following, I’ll introduce code for each of these steps and we’ll slowly build up our code until we have something that fairly easily can simulate power of conjoint experiments via simulation. All examples use the R programming language. While I do not plan to maintain the code within this document, I aim to provide enough detail and intuition that you can update code as necessarily in response to any breaking updates to the tools I employ.

One notable limitation of the code below is its focus on the case of a participant selecting one profile from a set of two profiles. Some conjoint designs ask each participant to choose from more than two profiles, allow participants to select multiple profiles, or provide opportunities to make non-binary choices over profiles. While the code in some cases already accommodates such designs, consider it a starting point for writing your own functions.

Set-up

Let’s first load some packages to make everything faster and easier. We’ll load our packages with a function we’ll define named loadpkg(), which checks if our packages are installed and installs them if not, then loads all required packages.

#########################################################################
# 'loadpkg' function
# Checks if a vector of packages are installed. If not, installs the
# package. Then loads all packages in vector.
#
# Argument:
# toLoad - character vector of package names
#########################################################################

loadpkg <- function(toLoad){
  for(lib in toLoad){
    if(! lib %in% installed.packages()[,1]) {
      install.packages(lib, repos='http://cran.rstudio.com/')
    }
    suppressMessages( library(lib, character.only=TRUE) )
  }
}

We’ll want the dplyr package for cleaner code and the piping operator (%>%), ggplot2 for plotting power at different values of our parameters, and estimatr for its fast implementation of ordinary least squares estimation with clustered standard errors. Although none of these packages are required (you could replace them with base R equivalents), they will this much quicker and more readable.

Let’s load these packages and set our seed while we’re at it to ensure we can reproduce our data later.

# Load packages
pkgs <- c("dplyr",    # for data processing and pipe operators
          "estimatr", # for fast lm() with clustered SEs
          "ggplot2"  # for visualizing power
          ) 

loadpkg(pkgs)

# Set seed for reproducibility
set.seed(123)

Creating a design “skeleton”

Next, let’s define a create_design_df() function to generate a data frame containing the high-level parameters of the conjoint design - a “skeleton” upon which we will build: the number of participants, number of choice tasks (or screens) each participant completes, and the number of profiles per task. We’ll also add an indicator if a profile was the first profile on the screen since this helps us introduce profile ordering effects and makes some of the later steps easier.

#########################################################
# Define function: create_design_df
#
# Description: Generates a data frame with participant
# IDs, screen ID, profile_ID (unique identifier for each
# profile) based on specified parameters, and an indicator
# for the "first" profile for a given screen. Data are tidy.
#
# Arguments:
# N = # of participants
# K = # choice tasks observed
# W = # of profiles per choice task
#########################################################

create_design_df <- function(N, K, W) {
  temp <- data.frame(
    participant_id = sort(rep(seq_len(N), K * W)),
    screen_id = sort(rep(seq_len(K * N), W)),
    profile_id = seq_len(N * K * W)
  )
  
  # Add first_profile == 1 if first in profile set, 0 otherwise
  # dplyr::if_else() faster than base::ifelse
  temp %>%
    dplyr::group_by(screen_id) %>%
    dplyr::mutate(first_profile = dplyr::if_else(row_number() == 1, 1, 0)) %>%
    dplyr::ungroup()
}

Note that the function is designed to produce the long data structure that is typically used for the analysis of repeated-observation data and conjoint data: each row is one of the profiles from the study and columns provide information about each profile - such as the participant that observed the profile, whether the profile was chosen, profile attributes, and so forth. Since we’re assuming each participant observes two profiles per screen, there will be two rows in our data per screen.

Example: `create_design_df()`

# Generate example parameter values
# N = 200
# K = 2
# W = 2
#
# Create design data
# create_design_df(N = N, K = K, W = W)

Creating profile attributes

We now need attributes for each profile. We’ll define a function called create_attributes() that takes a vector or list of vectors and randomly draws one value from each vector for each profile in our conjoint. We can provide a vector of names for the new attributes via the attr_names argument and we can specify a list (if we have multiple attributes) or vector (if we have one attribute) of probabilities a value is drawn from the attribute-level vector via the prob argument. The function produces default names if no names are provided and draws attribute values with equal probability if no probabilities are provided.

##############################################################
# Define function: create_attributes()
#
# Creates randomly generated vectors of values from sets
# of potential values for a specified number of attributes
#
# *Arguments*
#
# attr_names (optional)
# vector of attribute names of attr_n length. If no values
# provided, attributes named attrib1, attrib2, etc.
#
# attr_levels
# vector or list of vectors of length N, each containing the
# unique levels each attribute can take. May include NA values,
# which may be useful for experiments randomizing the presence
# of attributes themselves.
#
# n_profiles
# number of profiles, i.e. # of values to draw for each
# attribute.
#
# probs (optional)
# A vector or list of numeric vectors containing probabilities
# for sampling from the set of unique attribute values. Probabilities
# are assigned to each element in the attrinute value set based
# on vector index. If no value supplied, defaults to assigning
# equal sampling probabilities to all attribute levels.
#
# NOTE: Currently no test is implemented for whether the number 
# of probabilities passed to probs() is equal to the number of 
# levels for the attribute or that the probabilities provided
# sum to 1 for each vector. Function does check that the number
# of probability vectors is equal to the number of attributes.
##############################################################

create_attributes <-
  function(attr_levels,
           n_profiles,
           attr_names = NULL,
           probs = NULL) {
    # Convert attr_levels and probs to list to work with lists
    # throughout for simplicity/speed. Since vectors are list
    # but lists not necessarily vectors, use is.list(X) rather
    # than is.vector(X) since, passing a list to is.vector is
    # also TRUE.
    if (!is.list(attr_levels)) {
      attr_levels <- list(attr_levels)
    }
    
    if (!is.null(probs) & !is.list(probs)) {
      probs <- list(probs)
    }
    
    # Number of attributes to generate
    attr_n <- length(attr_levels)
    
    # Test inputs are correct length
    stopifnot(is.null(attr_names) | attr_n == length(attr_names))
    stopifnot(is.null(probs) | attr_n == length(probs))
    
    # If attribute names not provided, generate
    # default attribute names, e.g. attrib1, attrib2, ...
    if (is.null(attr_names)) {
      attr_names <- paste0("attrib", seq_len(attr_n))
    }
    
    # If probs not provided, generate default
    # equal probabilities of assignment
    if (is.null(probs)) {
      probs <- list()
      for (i in 1:attr_n) {
        attr_i_levels_n <- length(attr_levels[[i]])
        probs[[i]] <- rep(1 / attr_i_levels_n, attr_i_levels_n)
      }
    }
    
    # Initialize list to store each attribute vector
    attrib_list <- list()
    
    # Generate attributes
    for (i in 1:length(attr_names)) {
      # If an attribute has only one level, is numeric, and 
      # >0, the x = argument in sample() thinks the level 
      # refers to the number of levels, incorrectly. So, 
      # separate these out since prob = 1 for the single value.
      # See ?sample > Details for more information.
      if (is.numeric(attr_levels[[i]]) &
          length(attr_levels[[i]]) == 1) {
        attrib_list[[i]] <- rep(attr_levels[[i]], n_profiles)
      } else {
        attrib_list[[i]] <- sample(
          x = attr_levels[[i]],
          size = n_profiles,
          replace = TRUE,
          prob = probs[[i]]
        )
      }
    }
    # Collapse list into data frame, replace names
    attrib_df <- dplyr::bind_cols(attrib_list)
    colnames(attrib_df) <- attr_names
    
    # Return attribute data frame
    attrib_df
  }

Example: `create_attributes()`

# n_profiles <- 50
# 
# # One attribute, default probs
# attr_levels <- c("a","b","c")
# create_attributes(attr_levels = attr_levels,
#                   n_profiles = n_profiles)
# 
#
# # One attribute,specified probs
# attr_levels <- c("a","b","c")
# probs <- c(0.10,0.10,0.80)
# create_attributes(attr_levels = attr_levels, 
#                   n_profiles = n_profiles,
#                   probs = probs)
# 
#
# # Two attributes, default probs
# attr_levels <- list(c("a","b","c"), 
#                     c(seq_len(6)))
# 
# create_attributes(attr_levels = attr_levels,
#                   n_profiles = n_profiles)
# 
#
# # Two attributes, specified probs
# attr_levels <- list(c("a","b","c"), 
#                     c(seq_len(6)))
# probs <- list(c(0.50, 0.30, 0.20), 
#               c(0.05, 0.05, 0.05, 0.05, 0.40, 0.40))
# 
# create_attributes(attr_levels = attr_levels,
#                   n_profiles = n_profiles,
#                   probs = probs)
#
#
# # Two attributes, specified probs and names
# attr_levels <- list(c("a","b","c"),
#                     c(seq_len(6)))
# 
# probs <- list(c(0.50, 0.30, 0.20),
#               c(0.05, 0.05, 0.05, 0.05, 0.40, 0.40))
# 
# attr_names <- c("foo", "bar")
# 
# attr_df <- create_attributes(attr_levels = attr_levels,
#                             n_profiles = n_profiles,
#                             attr_names = attr_names,
#                             probs = probs)

Create dummy variables

While we now know each profile’s value for each attribute, we typically want dummy variables for each potential value an attribute may assume. This makes generating potential outcomes and model estimation easier. We cannot simply use commmon solutions, such asfastDummies::dummy_cols() or model.matrix(), on our data to generate these indicators because these approaches produce columns for each value that each attribute actually assumes in our data. We want column for each value in the set of potential values an attribute might assume. This is important since there is typically a non-zero probability an attribute is never assigned to any profile!

For this purpose, we write our own function, create_value_dummies(), that will take the output of our create_attributes() function and add dummy variables.

#########################################################
# Define function: create_value_dummies()
#
# Create dummy variables indicating if a given
# profile takes on a specific value from its 
# possible values, independent of whether the actual
# random sampling does (not) draw that value and assign
# it to a profile.
#
# Arguments
# attr_df - the output of create_attributes()
# attr_levels - the input list (or vector) of levels to 
#               create_attributes()
#########################################################

create_value_dummies <-
  function(attr_df,
           attr_levels) {
    # Convert attribute levels to a list for
    # easing referencing if not already list
    if (!is.list(attr_levels)) {
      attr_levels <- list(attr_levels)
    }
    
    # Initialize list to store output data frame
    attr_cols <- list()
    
    # Generate dummy variables
    for (i in 1:ncol(attr_df)) {
      # List to store current attribute's
      # output columns
      temp <- list()
      
      # Levels of current attribute
      this_attr_levels <- attr_levels[[i]]
      
      # Generate dummies
      for (j in 1:length(this_attr_levels)) {
        temp[[j]] <- dplyr::if_else(attr_df[, i] == attr_levels[[i]][j], 1, 0)
      }
      
      # Convert each attribute's vectors from list to data frame
      temp <- dplyr::bind_cols(temp)
      
      # Generate column names for each
      
      # Add variable names to columns
      colnames(temp) <-
        paste0(colnames(attr_df)[i], "_", attr_levels[[i]])
      
      # Save all dummies for each attribute into an element
      # of the attr_cols list
      attr_cols[[i]] <- temp
    }
    
    # Collapse each attribute's columns into a single
    # data frame, return
    dummy_df <- dplyr::bind_cols(attr_cols)
    
    # Joint the columns to the original data
    dplyr::bind_cols(attr_df, dummy_df)
  }

Example: `create_value_dummies()`

## Generate example data: 2 attributes,
## different number of levels per 
## attribute
#df <- data.frame(foo = rep(1:4,10), bar = rep(c("a","b"),20))
#dflevels <- list(c(1:4), c("a","b"))

## Produce dummies
#create_value_dummies(attr_df = df, attr_levels = dflevels)

Combining our non-outcome functions

So far, we have several functions that generates all of the data we need except for the actual outcomes for each profile, i.e. whether a profile was chosen (or not). To make things even simpler, let’s combine all of those functions into a single function called create_predictors(). The nice thing about doing this is that we don’t need to repeatedly provide arguments that are common across our functions. Instead, we can generate everything with one function call. Below, we combine everything we’ve done so far into a single function. We add two additional arguments to augment the functions:

sims: determines how many data sets to generate (defaults to 1)
seed: specifies the randomization seed (defaults to 123).

################################################
# Define function: create_predictors
# Creates randomly generated data sets containing
# all design variables, attributes, and dummy
# variables for the right-hand side of the conjoint
# analysis (excludes the outcome variable),
# based on specified data parameter values.
# Data are "long," such that each row of the data 
# is a conjoint profile.
#
# Arguments:
# N: number of participants
# K: number of choices completed
# W: number of profiles per screen
# attr_n: number of attributes overall
# attr_levels: list (or vector)
#              of possible levels for each attribute.
# attr_names: (optional): names of attributes. defaults a generic
#             name followed by the index of the attribute
#             in attr_levels.
# probs (optional): list (or vector) of probabilities used to randomly
#                   draw values from the attribute levels vectors and assign 
#                   them to profiles.
# sims (optional): number of times to simulate the data, default 1
# seed (optional): randomization seed, default 123
################################################

create_predictors <- function(N,
                              K,
                              W,
                              attr_n,
                              attr_levels,
                              attr_names = NULL,
                              probs = NULL,
                              sims = 1,
                              seed = 123) {
  # Defines loadpkg()
  loadpkg <- function(toLoad) {
    for (lib in toLoad) {
      if (!lib %in% installed.packages()[, 1]) {
        install.packages(lib, repos = 'http://cran.rstudio.com/')
      }
      suppressMessages(library(lib, character.only = TRUE))
    }
  }
  
  # Load packages
  pkgs <- c("dplyr",
            "estimatr",
            "ggplot2")
  
  loadpkg(pkgs)
  
  # Set seed for reproducibility
  set.seed(seed)
  
  # Define create_design_df
  create_design_df <- function(N, K, W) {
    temp <- data.frame(
      participant_id = sort(rep(seq_len(N), K * W)),
      screen_id = sort(rep(seq_len(K * N), W)),
      profile_id = seq_len(N * K * W)
    )
    
    temp %>%
      dplyr::group_by(screen_id) %>%
      dplyr::mutate(first_profile = dplyr::if_else(row_number() == 1, 1, 0)) %>%
      dplyr::ungroup()
  }
  
  # Define create_attributes()
  create_attributes <-
    function(attr_levels,
             n_profiles,
             attr_names = NULL,
             probs = NULL) {
      if (!is.list(attr_levels)) {
        attr_levels <- list(attr_levels)
      }
      
      if (!is.null(probs) & !is.list(probs)) {
        probs <- list(probs)
      }
      
      attr_n <- length(attr_levels)
      
      stopifnot(is.null(attr_names) | attr_n == length(attr_names))
      stopifnot(is.null(probs) | attr_n == length(probs))
      
      if (is.null(attr_names)) {
        attr_names <- paste0("attrib", seq_len(attr_n))
      }
      
      if (is.null(probs)) {
        probs <- list()
        for (i in 1:attr_n) {
          attr_i_levels_n <- length(attr_levels[[i]])
          probs[[i]] <- rep(1 / attr_i_levels_n, attr_i_levels_n)
        }
      }
      
      attrib_list <- list()
      for (i in 1:length(attr_names)) {
        if (is.numeric(attr_levels[[i]]) &
            length(attr_levels[[i]]) == 1) {
          attrib_list[[i]] <- rep(attr_levels[[i]], n_profiles)
        } else {
          attrib_list[[i]] <- sample(
            x = attr_levels[[i]],
            size = n_profiles,
            replace = TRUE,
            prob = probs[[i]]
          )
        }
      }
      attrib_df <- dplyr::bind_cols(attrib_list)
      colnames(attrib_df) <- attr_names
      attrib_df
    }
  
  # Define create_value_dummies
  create_value_dummies <-
    function(attr_df,
             attr_levels) {
      if (!is.list(attr_levels)) {
        attr_levels <- list(attr_levels)
      }
      
      attr_cols <- list()
      for (i in 1:ncol(attr_df)) {
        temp <- list()
        
        this_attr_levels <- attr_levels[[i]]
        for (j in 1:length(this_attr_levels)) {
          temp[[j]] <- dplyr::if_else(attr_df[, i] == attr_levels[[i]][j], 1, 0)
        }
        
        temp <- dplyr::bind_cols(temp)
        
        
        colnames(temp) <-
          paste0(colnames(attr_df)[i], "_", attr_levels[[i]])
        
        attr_cols[[i]] <- temp
      }
      
      dummy_df <- dplyr::bind_cols(attr_cols)
      
      dplyr::bind_cols(attr_df, dummy_df)
    }
  
  # Generate data
  
  # Create design data frame
  design_df <- create_design_df(N = N, K = K, W = W)
  
  # Initialize list to store data
  attribs <- vector("list", length = sims)
  
  # Number of profiles
  n_profiles <- N * K * W
  
  # Create sims number of data sets with the attributes values
  # and value dummies randomized across data sets
  for (i in seq_len(sims)) {
    attribs[[i]] <- create_attributes(
      attr_levels = attr_levels,
      n_profiles = n_profiles,
      attr_names = attr_names,
      probs = probs
    ) %>%
      
      create_value_dummies(attr_df = .,
                           attr_levels = attr_levels)
  }
  
  # Join the attribute data to the design data
  dat_out <- lapply(attribs, function(attrib_df) {
    dplyr::bind_cols(x = design_df, y = attrib_df)
  })
  
  # Return list of generate data frames
  dat_out
}

With the above code, we can generate arbitrary numbers of data sets with the structure we want in a single call. To reduce the demands on our computers, only the parts of the code dependent on randomization are repeated for each additional simulation.

Example: `create_predictors()`

# N <- 50
# K <- 3
# W <- 2
# attr_n <- 3
# attr_levels <- list(c(1, 2, 3), c("a", "b", "c", "d", "e"), c(5))
# attr_names <- NULL
# probs <-  NULL
# sims <- 2
# seed <- 123
# 
# 
# dat <- create_predictors(
#   N = N,
#   K = K,
#   W = W,
#   attr_n = attr_n,
#   attr_levels = attr_levels,
#   attr_names = attr_names,
#   probs = probs,
#   sims = sims,
#   seed = seed
# )

Creating (potential) outcomes

We next need to assign binary outcomes (whether a profile was chosen or not) to each profile in our data. We will follow the standard approach for experiments where participants choose only one profile out of at least two profiles and assume our outcome \(Y\) is distribution \(Y \sim Binomial(n,p)\), where \(n = 1\) and \(p\) is the probability a profile is chosen over the other profile. \(Y\) is therefore a dichotomous variable and we will allow for \(p\) to be an arbitrary function of the attribute values for that profile. Attribute values are allowed to have both additive and non-additive relationships with \(p\).

Correlated outcomes

One tricky aspect of the typical implementation of conjoint experiments (and some other types of discrete forced choice designs) is that outcomes for every profile within each choice task are correlated: if one profile is selected from a set of profiles on a given screen, it must be true that the other profiles on that screen are not selected. For example, let’s proceed by assuming we have presented each of \(N\) participants with \(K\) screens/tasks containing \(W = 2\) profiles. Due to the deterministic relationship, if profile is assigned \(Y = 1\), then the other profile in the task must be assigned \(Y = 0\). If we ignore this dependence and generate outcomes for each profile (based on its attribute values) independent of the outcome for the other profile, we will not have data that looks like the data we expect from a conjoint experiment.

To this end, we will ensure that the average differences in \(Pr(Y = y)\) at different levels of attributes (i.e. differences in potential outcomes) are preserved while maintaining this mechanical correlation in outcomes. We’ll accomplish this by assigning potential outcomes based on the joint distribution of profile attributes (thanks to Flavien Ganter for sharing thoughts on this approach).

For each task, we will begin by assigning one profile the value \(p = 0.50\), which is the expected value of \(p\) when participants are indifferent between the two profiles in a choice task. Next, we will add the effects of differences in attribute values across profiles to the expected outcome for that profile. For example, if we think \((p|A = 1) - (p|A = 0) = 0.10\), and have a profile \(X\) where \(A_x = 1\) and a profile \(Z\) where \(A_z = 0\), we would draw \(Y_x\) - the outcome for profile \(X\) - from \(Binomial(1, 0.55)\) and let \(Y_z = 1 - Y_x\). In cases where the attribute value assignment is reversed, as in \(A_x = 0\) and \(A_z = 1\), we simply subtract the amount we would normally add.

Before proceeding, I should note that I will make several simplifying assumptions about the determinants of choice in the experiment. Specifically, I assume (1) the ordering of attributes on the screen, (2) the ordering of profiles on the screen, and (3) the ordering of tasks do not affect the probability a profile is chosen. It is relatively straightforward to introduce such effects, but for brevity I do not include them here. I furthermore do not include participant-specific errors, but will correct for arbitrary correlation of errors for each participant’s choices through post-hoc adjustment of the co-variance matrix by clustering at the participant level for all statistical inferences. Participant-specific errors are as simple to incorporate as assuming the distribution of effects is itself drawn from an individiual-specific distribution.

Revealing outcomes

Since we need to specify all of the potential outcomes for arbitrary functions of the profile attribute values, we’ll have to generate potential outcomes manually. Since the manual generation of outcomes will depend on the number of attributes and attribute levels, we’re going to want to create our code in response to a specific conjoint design.

For exemplary purposes, we’ll generate data assuming we have a conjoint experiment with 3 attributes and each attribute will have two levels. We’ll call the first attribute atta, the second attribute attb, and the third attribute attc. We’ll furthermore allow for the effect of atta on the probability a profile is selected to vary with the value of attb. Such interactions are commonly of interest to researchers and we may be interested in calculating power for detecting such inter-attribute interactions. To keep things simple, we’ll also assume that our third variable attc has a deterministic effect of exactly 0 on our outcome. This saves us some manual coding since we don’t need to let \(Y\) vary with this variable.

# Generate data to which we'll assign outcomes

# Specify data parameters
N <- 100
K <- 3
W <- 2
attr_n <- 3
attr_levels <- list(c(0, 1), c("x", "z"), c(0,1))
attr_names <- c("atta","attb","attc")
probs <- NULL
sims <- 5
seed <- 123

# Simulate data
dat <- create_predictors(
   N = N,
   K = K,
   W = W,
   attr_n = attr_n,
   attr_levels = attr_levels,
   attr_names = attr_names,
   probs = probs,
   sims = sims,
   seed = 123
   )

Now that we have a list of simulated data frames, we’ll want to simulate outcomes for each profile in each simulated data set. To do this, we’ll first need to assign p to each observation based on the joint distributions of attributes atta and attb. We don’t need to include attc because it does not affect p. Normally, we would calculate p based on the marginal distributions of atta and attb, but we embrace the complexity of using the joint distribution because we are allowing for an inter-attribute interaction between atta and attb that is of interest to us.

To assign p to different cells of the joint distributions of atta and attb, let’s assume that atta = 1 is chosen 4% more often than atta = 0 when attb = x and 1% more often when attb = z. We can then take half of the differences in potential outcomes, add it to 0.50 for the the corresponding profile, then take a random draw from the corresponding distribution. The reason we take half of the difference in the probabilities of selection is that we want symmetry around 50% for the probability a profile is selected. In other words, if we want participants to choose a profile assigned a value of \(J\) for some attribute \(4%\) more often than a profile assigned a value of \(K\) for the same attribute, we would want the probability the former profile is selected to be \(0.52\) and the probability the latter profile is selected to be \(1 - 0.52 = 0.48\). If we don’t allow for this symmetry around 50% probability for the different attribute combinations, we would end up with a scenario where the sum of the probabilities of selecting profiles across all attribute values does not sum to 1.

Let’s now produce the mapping of p to different attribute combinations for each set of profiles.

# Assign p to different distributions of attribute
# values across the two profiles in each set of 
# profiles.
# 
# Note: it is best to keep notation straightforward,
# e.g. the values below indicate that the first profile
# has a value of 1 and the second profile a value
# of 0 for attribute atta while the first profile has
# value of x and the second profile a value of x for
# attribute attb.
atta_10_attb_xx <- 0.02 
atta_10_attb_xz <- 0.02
atta_10_attb_zz <- 0.005
atta_10_attb_zx <- 0.005

atta_01_attb_xx <- -0.02
atta_01_attb_xz <- -0.02
atta_01_attb_zz <- -0.005
atta_01_attb_zx <- -0.005

# Indifferent between attribute b values 
# independent of effect on atta. This
# is not necessary but can be useful
# for note-keeping.
atta_00_attb_xx <- 0 
atta_00_attb_xz <- 0 
atta_00_attb_zz <- 0 
atta_00_attb_zx <- 0 
  
atta_11_attb_xx <- 0  
atta_11_attb_xz <- 0 
atta_11_attb_zz <- 0 
atta_11_attb_zx <- 0

The above code is necessarily a manual process and we can often make coding mistakes. To make sure we’re producing the effects (AMCEs and ACIEs typically) we want, let’s calculate different AMCEs and examine whether AMCEs are conditional on other attributes as expected. The above code should make the results below self-evident, but this can be a good check that you did not make any coding mistakes.

# Check outcomes match the desired effects
# Can add identical profiles to obtain standard AMCE,

#~~~~~~~~~~~~~~~~
# atta AMCEs
#~~~~~~~~~~~~~~~~

# Unconditional AMCE: atta = 1 vs 0
# (0 vs 1 is the AMCE * -1)
c(atta_10_attb_xx - atta_01_attb_xx,  
  atta_10_attb_xz - atta_01_attb_xz,
  atta_10_attb_zz - atta_01_attb_zz,
  atta_10_attb_zx - atta_01_attb_zx) %>% mean() 

# AMCE | attb = x:  atta = 1 vs 0
c(atta_10_attb_xx - atta_01_attb_xx,  
  atta_10_attb_xz - atta_01_attb_xz) %>% mean()

# AMCE | attb = z: atta = 1 vs 0
c(atta_10_attb_zz - atta_01_attb_zz,
  atta_10_attb_zx - atta_01_attb_zx) %>% mean()

#~~~~~~~~~~~~~~~~
# attb AMCEs
#~~~~~~~~~~~~~~~~

# xz
c(atta_10_attb_xz - atta_10_attb_xx,
  atta_11_attb_xz - atta_11_attb_xx,
  atta_01_attb_xz - atta_01_attb_xx,
  atta_00_attb_xz - atta_00_attb_xx) %>% mean() # 0

c(atta_10_attb_xz - atta_10_attb_zx,
  atta_11_attb_xz - atta_11_attb_zx,
  atta_01_attb_xz - atta_01_attb_zx,
  atta_00_attb_xz - atta_00_attb_zx) %>% mean() # 0

c(atta_10_attb_xz - atta_10_attb_zz,
  atta_11_attb_xz - atta_11_attb_zz,
  atta_01_attb_xz - atta_01_attb_zz,
  atta_00_attb_xz - atta_00_attb_zz) %>% mean() # 0

# zz
c(atta_10_attb_zz - atta_10_attb_xx,
  atta_11_attb_zz - atta_11_attb_xx,
  atta_01_attb_zz - atta_01_attb_xx,
  atta_00_attb_zz - atta_00_attb_xx) %>% mean() # 0

c(atta_10_attb_zz - atta_10_attb_zx,
  atta_11_attb_zz - atta_11_attb_zx,
  atta_01_attb_zz - atta_01_attb_zx,
  atta_00_attb_zz - atta_00_attb_zx) %>% mean() # 0

c(atta_10_attb_zz - atta_10_attb_xz,
  atta_11_attb_zz - atta_11_attb_xz,
  atta_01_attb_zz - atta_01_attb_xz,
  atta_00_attb_zz - atta_00_attb_xz) %>% mean() # 0

# zx
c(atta_10_attb_zx - atta_10_attb_xx,
  atta_11_attb_zx - atta_11_attb_xx,
  atta_01_attb_zx - atta_01_attb_xx,
  atta_00_attb_zx - atta_00_attb_xx) %>% mean() # 0

c(atta_10_attb_zx - atta_10_attb_xz,
  atta_11_attb_zx - atta_11_attb_xz,
  atta_01_attb_zx - atta_01_attb_xz,
  atta_00_attb_zx - atta_00_attb_xz) %>% mean() # 0

c(atta_10_attb_zx - atta_10_attb_zz,
  atta_11_attb_zx - atta_11_attb_zz,
  atta_01_attb_zx - atta_01_attb_zz,
  atta_00_attb_zx - atta_00_attb_zz) %>% mean() # 0

# xx
c(atta_10_attb_xx - atta_10_attb_xz,
  atta_11_attb_xx - atta_11_attb_xz,
  atta_01_attb_xx - atta_01_attb_xz,
  atta_00_attb_xx - atta_00_attb_xz) %>% mean() # 0

c(atta_10_attb_xx - atta_10_attb_zz,
  atta_11_attb_xx - atta_11_attb_zz,
  atta_01_attb_xx - atta_01_attb_zz,
  atta_00_attb_xx - atta_00_attb_zz) %>% mean() # 0

c(atta_10_attb_xx - atta_10_attb_zx,
  atta_11_attb_xx - atta_11_attb_zx,
  atta_01_attb_xx - atta_01_attb_zx,
  atta_00_attb_xx - atta_00_attb_zx) %>% mean() # 0

Before proceeding, it is worth emphasizing what we are capturing in this quantity: this is a natural decision given we want to estimate differences in the probabilities of choosing profiles randomly assigned to different attribute values rather than the effect of randomly assigning attribute values to a profile. However, depending upon how one defines the population and choice problem, this is not necessarily the typical quantity researchers estimate (the typical AMCE). If you want to estimate the AMCE, you’ll need to add additional elements to the mean formulas above. However, if you want to estimate an estimand better capturing preferences, you can keep the above but remove the profiles prior to fitting models. More details regarding the mapping between preferences, the AMCE, and typical estimators can be found here, but the key implication is that the effect above are actually larger than the effects expected using the typical estimation procedure (which is incorporated into the code later).

We can at last define our function to generate values of p for draws from a binomial distribution and assign values of \(Y\) based on these draws. We’ll pass this add_outcomes() function a data frame (of simulated data) and it will add p and Y to the data. Since this also requires some more manual coding to check which value of \(p\) the function should assign, I’ve included some commented-out code that can help with checking you did not make any manual errors. The function is defined assuming we have already loaded our create_predictors() function, meaning we do not need to re-load other dependencies again.

############################################
# Define function: add_outcomes()
# Generates values p for draws from binomial
# distribution and assigns observed outcomes
# for each unit for a given data frame
#
# Arguments
# - data: data frame(s) of predictor variables,
#         commonly a data frame in the list of
#         produced via create_predictors()
############################################

# For each profile, calculate p in Binom(n,p). We'll
# use a fairly inelegant for() loop. Start with a vector
# of missing values, then replace the value for the first
# profile on a screen and take 1 - this value for the second
# profile on the screen. Could easily write a function to
# take the variables for each of the four inputs, but
# this took < 5 minutes (albeit prone to humane error, harder
# to read). We'll wrap this into a function so we can apply
# it to each simulated data set.

add_outcomes <- function(df) {
  # Vector of missing values to store p
  df$p <- NA_real_
  
  # Number of screens (for looping)
  screens <- max(df$screen_id)
  
  # Loop through profiles in each screen, calculate p
  # based on whether the distribution of attribute
  # values matches those for the different values of
  # p we previously specified. We can go through
  # each of the possible outcomes, multiply logical checks
  # for whether an attribute value matches, then add the
  # value to 0.50 (indifference) if it matches.
  
  for (i in 1:screens) {
    #  atta_10_attb_xx
    atta_10_attb_xx_out  <- atta_10_attb_xx *
      df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_x"]
    
    # atta_10_attb_xz
    atta_10_attb_xz_out  <- atta_10_attb_xz *
      df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_z"]
    
    # atta_10_attb_zz
    atta_10_attb_zz_out  <- atta_10_attb_zz *
      df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i & df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_z"]
    
    # atta_10_attb_zx
    atta_10_attb_zx_out  <- atta_10_attb_zx *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_x"]
    
    #  atta_01_attb_xx
    atta_01_attb_xx_out  <- atta_01_attb_xx *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_x"]
    
    #  atta_01_attb_xz
    atta_01_attb_xz_out  <- atta_01_attb_xz *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_z"]
    
    #  atta_01_attb_zz
    atta_01_attb_zz_out  <- atta_01_attb_zz *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_z"]
    
    #  atta_01_attb_zx
    atta_01_attb_zx_out  <- atta_01_attb_zx *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_x"]
    
    # Given the matches, add them to 0.50 and assign sum to p
    df[df$screen_id == i &
         df$first_profile == 1, "p"] <- 0.50 +
      atta_10_attb_xx_out +
      atta_10_attb_xz_out +
      atta_10_attb_zz_out +
      atta_10_attb_zx_out +
      atta_01_attb_xx_out +
      atta_01_attb_xz_out +
      atta_01_attb_zz_out +
      atta_01_attb_zx_out
  }
  
  # In our example, we chose effects so the range of p is
  # between 0.50 +/- 0.02. Let's check this is true!
  # df$p %>% table()
  
  # Draw Y for each profile where we have a value for p,
  # ignoring the warning that NAs were produced since we WANT
  # exactly half of our profiles (half of the rows in our
  # input data should be NA)
  df <- df %>% dplyr::mutate(Y = rbinom(n = n(),
                                        size = 1,
                                        prob = p))
  
  # Let's make sure our function returns the right # of NA cases!
  stopifnot( df$Y %>% is.na() %>% sum() == screens)
  
  # Within each pair of profiles, assign a value of 1 - Y
  # to the profile where first_profile == 0, where Y is the
  # value of Y when first_profile == 1 within that screen
  for (j in 1:screens) {
    df[df$screen_id == j &
         df$first_profile == 0, "Y"] <-
      1 - df[df$screen_id == j & df$first_profile == 1, "Y"]
  }
  df
}

We can now simply apply our add_outcomes() function to the list of data frames we obtained as output of create_predictors(). Below shows an example where we add outcomes to the list of data frames we previously obtained.

Example: `add_outcomes()`

# Add outcomes to all data in the list of
# simulated data frames
all_data <- lapply(dat, add_outcomes)

Creating model results

Model Choice

We can now fit our preferred statistical models to the data. The most common approach in political science is to fit a linear probability model predicting \(Y\) as a function of indicators for profile attributes. As Hainmueller et al. (2013) recommend, standard errors are adjusted for arbitrary intra-participant error correlations via a clustering correction. We’ll fit the model with the lm_robust() function from the estimatr package, which is fast and implements the “CR2” clustered correction. See the function’s documentation for more information. While I choose this model, feel free change the model to any other common type for the analysis of discrete choice data.

Adding modeling

Rather than generating our data, storing it in memory, then fitting it on the model object, we’ll save some computing time and avoid running out of memory by adding the modeling step at the end of our add_outcomes() function, returning only parameters desired for power calculations. Let’s therefore augment our function to produce a new function: create_model_output(). For clarity, I’ve some deleted comments from our previous example.

##########################################
# Define function: create_model_ouput()
#
# Description: Generates a data frame
# of selected p-values and point
# estimates for linear probability
# model fit on each simulated data
# set.
#
# Arguments
# df: data frame(s) of predictor variables,
#     commonly a data frame in the list of
#     produced via create_predictors()
# formula: string containing model formula asin
#          in the case of lm(), e.g. "y ~ x"
# clusters: bare name of cluster variable,
#           typically participant_id
# pvars: character vector of model terms for
#        which we want to return the p-values
# alpha: numeric value for significance threshold, 
#        defaults to 0.05
##########################################

create_model_output <- function(df, formula, clusters, pterms, alpha = 0.05) {
  df$p <- NA_real_
  screens <- max(df$screen_id)
  
  for (i in 1:screens) {
    #  atta_10_attb_xx
    atta_10_attb_xx_out  <- atta_10_attb_xx *
      df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_x"]
    
    # atta_10_attb_xz
    atta_10_attb_xz_out  <- atta_10_attb_xz *
      df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_z"]
    
    # atta_10_attb_zz
    atta_10_attb_zz_out  <- atta_10_attb_zz *
      df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i & df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_z"]
    
    # atta_10_attb_zx
    atta_10_attb_zx_out  <- atta_10_attb_zx *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_x"]
    
    #  atta_01_attb_xx
    atta_01_attb_xx_out  <- atta_01_attb_xx *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_x"]
    
    #  atta_01_attb_xz
    atta_01_attb_xz_out  <- atta_01_attb_xz *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_z"]
    
    #  atta_01_attb_zz
    atta_01_attb_zz_out  <- atta_01_attb_zz *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_z"]
    
    #  atta_01_attb_zx
    atta_01_attb_zx_out  <- atta_01_attb_zx *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_x"]
    
    # Given the matches, add them to 0.50 and assign sum to p
    df[df$screen_id == i &
         df$first_profile == 1, "p"] <- 0.50 +
      atta_10_attb_xx_out +
      atta_10_attb_xz_out +
      atta_10_attb_zz_out +
      atta_10_attb_zx_out +
      atta_01_attb_xx_out +
      atta_01_attb_xz_out +
      atta_01_attb_zz_out +
      atta_01_attb_zx_out
  }
  
  # Draw Y for each profile where we have a value for p
  df <- df %>% dplyr::mutate(Y = rbinom(n = n(),
                                        size = 1,
                                        prob = p))
  
  # Within each pair of profiles, assign a value of 1 - Y
  # to the profile where first_profile == 0, where Y is the
  # value of Y when first_profile == 1 within that screen
  for (j in 1:screens) {
    df[df$screen_id == j &
         df$first_profile == 0, "Y"] <-
      1 - df[df$screen_id == j & df$first_profile == 1, "Y"]
  }
  
  # Fit statistical model (linear probability model, clustered SEs)
  # we need to convert our formula string to a formula with as.formula()
  # and convert clusters to a name. However, won't work unless we wrap
  # this in do-call to pass to the lm_robust() function, which requires
  # bare names for the cluster variable
  model_out <- do.call(estimatr::lm_robust,
                       list(
                         formula = as.formula(formula),
                         data = df,
                         clusters = as.name(clusters)
                       ))
  
  # Create df of p-values for the preferred model terms
  model_df <- data.frame(term = names(model_out$p.value[pterms]),
                         pval = model_out$p.value[pterms])
  
  # Return data frame of p-values and significant indicator
  model_df %>% dplyr::mutate(pvar_sig = dplyr::if_else(pval < alpha, 1, 0))
  }

Example: `create_model_output()`

# Using our list of simulated data frames
# as our data source...

# Example using previously generated data,
# let's estimate a model with interactions
# between two attributes, participant clusters,
# and a significance threshold of 0.10, returning
# model information for two terms.

create_model_output(
  df = dat[[1]],
  formula = "Y ~ atta_0*attb_z",
  clusters = "participant_id",
  pterms = c("atta_0", "atta_0:attb_z"),
  alpha = 0.10
)

           term      pval pvar_sig
1        atta_0 0.3896093        0
2 atta_0:attb_z 0.3853775        0

Calculating power

With our function, calculating power is as simple as applying our function to all of the simulated data, combining the results, and calculating the share of simulation for which the p-value is statistically significant at the specified threshold (alpha). We can create a wrapper to make this simpler.

#########################################################
# Define function: calculate_power_wrapper
# Description: calculates power given a list of data
#              frames
#
# Arguments
# sim_list: list of simulated data frames
# formula: string of model formulas
# clusters: string of variable identifying clusters
# pterms: character vector for terms whose power should 
#         be returned
# alpha: numeric significance threshold
#########################################################

calculate_power_wrapper <-
  function(sim_list,
           formula,
           clusters,
           pterms,
           seed = 123,
           alpha = 0.05) {
    
    # Set seed
    set.seed(seed)
    
    # Fit models
    models <- lapply(sim_list, function(x) {
      create_model_output(
        df = x,
        formula = formula,
        clusters = clusters,
        pterms = pterms,
        alpha = alpha
      )
    })
    
    # Combine results into a single data frame,
    # calculate power for each term
    models <- models %>%
      dplyr::bind_rows() %>%
      dplyr::group_by(term) %>%
      summarize(power_pct = 100 * mean(pvar_sig, na.rm = TRUE))
    
    return(models)
  }

Let’s now see our new power calculator work on the output of our data simulation functions.

Example: `calculate_power_wrapper()`

# Given data, calculate power
power_df <- calculate_power_wrapper(
  sim_list = all_data,
  formula = "Y ~ atta_0*attb_z",
  clusters = "participant_id",
  pterms = c("atta_0", "atta_0:attb_z"),
  seed = 123,
  alpha = 0.10
)

# Show results
power_df

# A tibble: 2 x 2
  term          power_pct
  <fct>             <dbl>
1 atta_0               20
2 atta_0:attb_z        20

The output contains the power for detecting the interaction of atta_0 and attb_z and for the indicator for atta_0. While we only used 5 simulations for example purposes, we typically want to produce many (i.e. 1000+ simulations).

Final workflow

We’ve now created functions to generate our data and calculate power from the data. Let’s show the full workflow now. We’ll breakdown the workflow into three categories:

Functions
Inputs
Calls

Functions

First, we’ll need to load all of our functions. The code gives several options for each of the functions and should be able to accommodate a wide array of conjoint designs without much change, but you will typically need to change the variables used to calculate p in the create_model_output() function to the dimensionality of your specific data with regard to attributes and the levels of those attributes.

################################################
# Define function: create_predictors
# Creates randomly generated data sets containing
# all design variables, attributes, and dummy
# variables for the right-hand side of the conjoint
# analysis (excludes the outcome variable),
# based on specified data parameter values.
# Data are "long," such that each row of the data 
# is a conjoint profile.
#
# Arguments:
# N: number of participants
# K: number of choices completed
# W: number of profiles per screen
# attr_n: number of attributes overall
# attr_levels: list (or vector)
#              of possible levels for each attribute.
# attr_names: (optional): names of attributes. defaults a generic
#             name followed by the index of the attribute
#             in attr_levels.
# probs (optional): list (or vector) of probabilities used to randomly
#                   draw values from the attribute levels vectors and assign 
#                   them to profiles.
# sims (optional): number of times to simulate the data, default 1
# seed (optional): randomization seed, default 123
################################################

create_predictors <- function(N,
                              K,
                              W,
                              attr_n,
                              attr_levels,
                              attr_names = NULL,
                              probs = NULL,
                              sims = 1,
                              seed = 123) {
  # Defines loadpkg()
  loadpkg <- function(toLoad) {
    for (lib in toLoad) {
      if (!lib %in% installed.packages()[, 1]) {
        install.packages(lib, repos = 'http://cran.rstudio.com/')
      }
      suppressMessages(library(lib, character.only = TRUE))
    }
  }
  
  # Load packages
  pkgs <- c("dplyr",
            "estimatr",
            "ggplot2")
  
  loadpkg(pkgs)
  
  # Set seed for reproducibility
  set.seed(seed)
  
  # Define create_design_df
  create_design_df <- function(N, K, W) {
    temp <- data.frame(
      participant_id = sort(rep(seq_len(N), K * W)),
      screen_id = sort(rep(seq_len(K * N), W)),
      profile_id = seq_len(N * K * W)
    )
    
    temp %>%
      dplyr::group_by(screen_id) %>%
      dplyr::mutate(first_profile = dplyr::if_else(row_number() == 1, 1, 0)) %>%
      dplyr::ungroup()
  }
  
  # Define create_attributes()
  create_attributes <-
    function(attr_levels,
             n_profiles,
             attr_names = NULL,
             probs = NULL) {
      if (!is.list(attr_levels)) {
        attr_levels <- list(attr_levels)
      }
      
      if (!is.null(probs) & !is.list(probs)) {
        probs <- list(probs)
      }
      
      attr_n <- length(attr_levels)
      
      stopifnot(is.null(attr_names) | attr_n == length(attr_names))
      stopifnot(is.null(probs) | attr_n == length(probs))
      
      if (is.null(attr_names)) {
        attr_names <- paste0("attrib", seq_len(attr_n))
      }
      
      if (is.null(probs)) {
        probs <- list()
        for (i in 1:attr_n) {
          attr_i_levels_n <- length(attr_levels[[i]])
          probs[[i]] <- rep(1 / attr_i_levels_n, attr_i_levels_n)
        }
      }
      
      attrib_list <- list()
      for (i in 1:length(attr_names)) {
        if (is.numeric(attr_levels[[i]]) &
            length(attr_levels[[i]]) == 1) {
          attrib_list[[i]] <- rep(attr_levels[[i]], n_profiles)
        } else {
          attrib_list[[i]] <- sample(
            x = attr_levels[[i]],
            size = n_profiles,
            replace = TRUE,
            prob = probs[[i]]
          )
        }
      }
      attrib_df <- dplyr::bind_cols(attrib_list)
      colnames(attrib_df) <- attr_names
      attrib_df
    }
  
  # Define create_value_dummies
  create_value_dummies <-
    function(attr_df,
             attr_levels) {
      if (!is.list(attr_levels)) {
        attr_levels <- list(attr_levels)
      }
      
      attr_cols <- list()
      for (i in 1:ncol(attr_df)) {
        temp <- list()
        
        this_attr_levels <- attr_levels[[i]]
        for (j in 1:length(this_attr_levels)) {
          temp[[j]] <- dplyr::if_else(attr_df[, i] == attr_levels[[i]][j], 1, 0)
        }
        
        temp <- dplyr::bind_cols(temp)
        
        
        colnames(temp) <-
          paste0(colnames(attr_df)[i], "_", attr_levels[[i]])
        
        attr_cols[[i]] <- temp
      }
      
      dummy_df <- dplyr::bind_cols(attr_cols)
      
      dplyr::bind_cols(attr_df, dummy_df)
    }
  
  # Generate data
  
  # Create design data frame
  design_df <- create_design_df(N = N, K = K, W = W)
  
  # Initialize list to store data
  attribs <- vector("list", length = sims)
  
  # Number of profiles
  n_profiles <- N * K * W
  
  # Create sims number of data sets with the attributes values
  # and value dummies randomized across data sets
  for (i in seq_len(sims)) {
    attribs[[i]] <- create_attributes(
      attr_levels = attr_levels,
      n_profiles = n_profiles,
      attr_names = attr_names,
      probs = probs
    ) %>%
      
      create_value_dummies(attr_df = .,
                           attr_levels = attr_levels)
  }
  
  # Join the attribute data to the design data
  dat_out <- lapply(attribs, function(attrib_df) {
    dplyr::bind_cols(x = design_df, y = attrib_df)
  })
  
  # Return list of generate data frames
  dat_out
}

##########################################
# Define function: create_model_ouput()
#
# Description: Generates a data frame
# of selected p-values and point
# estimates for linear probability
# model fit on each simulated data
# set.
#
# Arguments
# df: data frame(s) of predictor variables,
#     commonly a data frame in the list of
#     produced via create_predictors()
# formula: string containing model formula asin
#          in the case of lm(), e.g. "y ~ x"
# clusters: bare name of cluster variable,
#           typically participant_id
# pvars: character vector of model terms for
#        which we want to return the p-values
# seed: randomization seed, default 123
# alpha: numeric value for significance threshold, 
#        defaults to 0.05
##########################################

create_model_output <- function(df,
                                formula, 
                                clusters,
                                pterms,
                                seed = 123,
                                alpha = 0.05) {
  set.seed(seed)
  df$p <- NA_real_
  screens <- max(df$screen_id)
  
  for (i in 1:screens) {
    #  atta_10_attb_xx
    atta_10_attb_xx_out  <- atta_10_attb_xx *
      df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_x"]
    
    # atta_10_attb_xz
    atta_10_attb_xz_out  <- atta_10_attb_xz *
      df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i & df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_z"]
    
    # atta_10_attb_zz
    atta_10_attb_zz_out  <- atta_10_attb_zz *
      df[df$screen_id == i & df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i & df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i & df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_z"]
    
    # atta_10_attb_zx
    atta_10_attb_zx_out  <- atta_10_attb_zx *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i & df$first_profile == 0, "attb_x"]
    
    #  atta_01_attb_xx
    atta_01_attb_xx_out  <- atta_01_attb_xx *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_x"]
    
    #  atta_01_attb_xz
    atta_01_attb_xz_out  <- atta_01_attb_xz *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_x"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_z"]
    
    #  atta_01_attb_zz
    atta_01_attb_zz_out  <- atta_01_attb_zz *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_z"]
    
    #  atta_01_attb_zx
    atta_01_attb_zx_out  <- atta_01_attb_zx *
      df[df$screen_id == i &
           df$first_profile == 1, "atta_0"] *
      df[df$screen_id == i &
           df$first_profile == 0, "atta_1"] *
      df[df$screen_id == i &
           df$first_profile == 1, "attb_z"] *
      df[df$screen_id == i &
           df$first_profile == 0, "attb_x"]
    
    # Given the matches, add them to 0.50 and assign sum to p
    df[df$screen_id == i &
         df$first_profile == 1, "p"] <- 0.50 +
      atta_10_attb_xx_out +
      atta_10_attb_xz_out +
      atta_10_attb_zz_out +
      atta_10_attb_zx_out +
      atta_01_attb_xx_out +
      atta_01_attb_xz_out +
      atta_01_attb_zz_out +
      atta_01_attb_zx_out
  }
  
  # Draw Y for each profile where we have a value for p
  df <- df %>% dplyr::mutate(Y = rbinom(n = n(),
                                        size = 1,
                                        prob = p))
  
  # Within each pair of profiles, assign a value of 1 - Y
  # to the profile where first_profile == 0, where Y is the
  # value of Y when first_profile == 1 within that screen
  for (j in 1:screens) {
    df[df$screen_id == j &
         df$first_profile == 0, "Y"] <-
      1 - df[df$screen_id == j & df$first_profile == 1, "Y"]
  }
  
  # Fit statistical model (linear probability model, clustered SEs)
  # we need to convert our formula string to a formula with as.formula()
  # and convert clusters to a name. However, won't work unless we wrap
  # this in do-call to pass to the lm_robust() function, which requires
  # bare names for the cluster variable
  model_out <- do.call(estimatr::lm_robust,
                       list(
                         formula = as.formula(formula),
                         data = df,
                         clusters = as.name(clusters)
                       ))
  
  # Create df of p-values for the preferred model terms
  model_df <- data.frame(term = names(model_out$p.value[pterms]),
                         pval = model_out$p.value[pterms])
  
  # Return data frame of p-values and significant indicator
  model_df %>% dplyr::mutate(pvar_sig = dplyr::if_else(pval < alpha, 1, 0))
}

#########################################################
# Define function: calculate_power_wrapper
# Description: calculates power given a list of data
#              frames
#
# Arguments
# sim_list: list of simulated data frames
# formula: string of model formulas
# clusters: string of variable identifying clusters
# pterms: character vector for terms whose power should 
#         be returned
# seed : randomization seed, default 123
# alpha: numeric significance threshold
#########################################################
calculate_power_wrapper <-
  function(sim_list,
           formula,
           clusters,
           pterms,
           seed = 123,
           alpha = 0.05) {
    # Fit models
    models <- lapply(sim_list, function(x) {
      create_model_output(
        df = x,
        formula = formula,
        clusters = clusters,
        pterms = pterms,
        seed = seed,
        alpha = alpha
      )
    })
    
    # Combine results into a single data frame,
    # calculate power for each term
    models <- models %>%
      dplyr::bind_rows() %>%
      dplyr::group_by(term) %>%
      summarize(power_pct = 100 * mean(pvar_sig, na.rm = TRUE))
    
    models
  }

Inputs

Second, we’ll want to assign values to the inputs to our functions. We can categorize the inputs based on their purpose: creating predictors, creating outcomes, and statistical modeling/selection of results. These will change with the conditions under which you wish to calculate power. For example, if we wanted to calculate power when we have 500 participants, all we need to do is change N <- 100 to N <- 500.

# Parameters for predictor creation
N <- 100
K <- 3
W <- 2
attr_n <- 3
attr_levels <- list(c(0, 1), c("x", "z"), c(0,1))
attr_names <- c("atta","attb","attc")
probs <- NULL
sims <- 5
seed <- 123

# Parameters for outcome creation
atta_10_attb_xx <- 0.02 
atta_10_attb_xz <- 0.02
atta_10_attb_zz <- 0.005
atta_10_attb_zx <- 0.005

atta_01_attb_xx <- -0.02
atta_01_attb_xz <- -0.02
atta_01_attb_zz <- -0.005
atta_01_attb_zx <- -0.005

atta_00_attb_xx <- 0 
atta_00_attb_xz <- 0 
atta_00_attb_zz <- 0 
atta_00_attb_zx <- 0 
  
atta_11_attb_xx <- 0  
atta_11_attb_xz <- 0 
atta_11_attb_zz <- 0 
atta_11_attb_zx <- 0

# Parameters for modeling
# and power calculation
#sim_list <- not needed due to directly passing data to function
formula <- "Y ~ atta_0*attb_z"
clusters <- "participant_id"
pterms <- c("atta_0", "atta_0:attb_z")
alpha <- 0.10

Calls

Third, and finally, use the inputs to generate our data and pass the data to our functions to calculate power.

# Create predictors
predictors <- create_predictors(
  N = N,
  K = K,
  W = W,
  attr_n = attr_n,
  attr_levels = attr_levels,
  attr_names = attr_names,
  probs = probs,
  sims = sims,
  seed = seed
)

# Add outcomes, fit models
power_calc <- calculate_power_wrapper(
  sim_list = predictors,
  formula = formula,
  clusters = clusters,
  pterms = pterms,
  seed = seed,
  alpha = alpha
)

# Show power
power_calc

# A tibble: 2 x 2
  term          power_pct
  <fct>             <dbl>
1 atta_0               20
2 atta_0:attb_z        20

Conclusion

This document hopefully provides (a) the intuition for how to calculate power via simulation, (b) several functions you can take away to calculate power for typical conjoint designs, and (c) a starting point for calculating power for more complicated designs.

Using the above approach, researchers can provides pass different parameter values to our functions and calculate power across a variety of different designs. While we have only calculated power for one set of parameters (e.g. sample size) and with a small number of simulations, it is straightforward to change these parameters to calculate power across many parameters and with many simulations.

There are several areas where entrepreneurial programmers might improve on the code. First, nearly everything is processed sequentially and therefore the code is quite slow. One might modify the code for parallel processing or use some basic C++ to speed things up (e.g. see these examples of how to optimize code). Second, one might add additional functionality. Specific areas where features are wanting is the inclusion of intraclass correlation coefficients for each participant’s choices, ordering effects, automated generation of potential outcomes, and choices over alternative statistical models, such as generalized linear models, mixed models, and Bayesian implementations.

Simulation-based power calculations for conjoint experiments