--- title: "Randomness" author: "JJB + Course" date: "02/18/2019" output: html_document: toc: true toc_float: collapsed: false --- # Randomness ## Example: Random Numbers {r} set.seed(55) rnorm(1) set.seed(55) rnorm(1)  ## Example: Recall the Modulus operator {r example-mod} 12 %% 7  {r create-mod-combination-table} outer(1:5, 1:5, *)  {r format-mod-table} mod_table = data.frame(outer(1:9, 2:9, %%)) colnames(mod_table) = paste0("mod ", seq_len(ncol(mod_table)) + 1) rownames(mod_table) = paste0("x = ", seq_len(nrow(mod_table))) knitr::kable(mod_table)  ## Example: Linear Congruential Method (LCM) The LCM method **was** a popular method for generating random numbers. The generator provides the ability to create sequences of deterministic values given a seed. For instance, the generator would create: ${X_0},{\text{ }}{X_1},{\text{ }}{X_2},{\text{ }}{X_3}, \cdots ,{\text{ }}{X_n}$ where: - $X_0$ is the **seed specified**. - $X_1$ is the **first** random number generated. - $X_2$ is the **second** random number generated. - ... - $X_n$ is the **n**th random number generated. The generator creates random values by the following _deterministic_ process: \begin{align*} X_0 &= \text{seed} \\ {X_{i + 1}} &= \left( {a{X_i} + c} \right)\bmod m \end{align*} where: - $X_{i}$ is the **previous** random number generated - $X_{i+1}$ is the **next** random number generated - $a$ is the multiplier - requires $0 < a < m$ - $c$ is the increment or shifted value - requires $0 \le c < m$ - $m$ is the modulus - requires $m > 0$. The choices of these values are crucial in ensuring good periodicity. (See slides for bad choices vs. good.) Generally, LCG works nicely if: 1. $m$ and $c$ are relatively prime. (e.g. only 1 divides both) 1. Only variants of $m$ divide $a - 1$ 1. If $m$ is divisible by $4$, then $a-1$ is divisible by $4$. **Note:** Mersenne-Twister RNG by M. Matsumoto and T. Nishimura (1997), provides sequence generation of $2^{19937}-1$. For more information, please see ?RNGkind. {r lcm-method} # Setup values a = 5181215173 c = 12581 m = 2^32 # Initial Seed seed = as.numeric(Sys.time()) * 1000 # Pre-allocation of numeric vector x = numeric(length = 2) # Set initial value x[1] = seed x[2] = (a * x[1] + c) %% m x # Real number r = x[2]/m r  **Question:** How can we ensure $R_i$ is able to generate values between $\left[ {0,1} \right]$ and not $\left[ {0,1} \right)$? # Sampling There are many different ways to break up a sample. ## Example: Sample without Replacement {r sample-without-replacement-numbers} # Set seed for reproducibility set.seed(90210) # Sample values without replacement sample(x = 8, size = 4, replace = FALSE) # Sample values without replacement sample(x = 8, size = 8, replace = FALSE)  What happens if we sample values **without** replacement with a sample size that is _larger_ than our initial population? {r, error = TRUE} sample(x = 8, size = 10, replace = FALSE)  Alternatively, we can use actual vectors for x in sample(). In this case, we're sampling from a vector containing two values. {r sample-without-replacement-characters} # Set seed for reproducibility set.seed(42) # Sample values without replacement sample(x = c("heads", "tails"), size = 1, replace = FALSE)  ### Exercise: Die Roll {r ex-sample-w-o-replace} # Set a seed to control reproducibility set.seed(385)  ## Example: Sampling with Replacement {r sample-with-replacement-numbers} # Set seed for reproducibility set.seed(5318008) # Sample numbers with replacement sample(x = 10, size = 20, replace = TRUE)  {r sample-with-replacement-characters} # Set seed for reproducibility set.seed(376006) # Sample characters with replacement y = sample(x = c("heads", "tails"), size = 10, replace = TRUE, prob = c(0.4, 0.6)) y  Count observations and/or obtain the proportion: {r show-frequencies} table(y)  Obtain the proportions of how often the observations show up in the data. {r show-proportions} prop.table(table(y))  {r} # Sample characters with replacement y_100 = sample(x = c("heads", "tails"), size = 100, replace = TRUE, prob = c(0.4, 0.6)) prop.table(table(y_100)) # Sample characters with replacement y_1000 = sample(x = c("heads", "tails"), size = 1000, replace = TRUE, prob = c(0.4, 0.6)) prop.table(table(y_1000))  ### Exercise: Creating a "Loaded" die Create a call to sample such that it can roll a "loaded" six-sided die twice with probabilities: 1 := 0.1, 2 := 0.3, 3 = 0.1, 4 := 0.15, 5 := 0.15, 6 := 0.2 {r rng-probs-loaded-dice} set.seed(8888) # your code here probs = c(0.1, 0.3, 0.1, 0.15, 0.15, 0.2) sum(probs) sample(6, size = 2, replace = TRUE, prob = c(0.1, 0.3, 0.1, 0.15, 0.15, 0.2))  **Question:** What happens if the sum of probability doesn't equal 1? {r rng-probs-loaded-dice} set.seed(8888) # your code here probs = c(0.1, 0.3, 0.1, 0.15, 0.15, 0.9) sum(probs) my_sample = sample(6, size = 1e8, replace = TRUE, prob = probs) prop.table(table(my_sample))  # Probability Distributions ## Example: d, p, q, r Distribution Functions Within _R_ there are four different kinds of distribution functions. These functions are known by: | Function | Description | |:-------------------|:------------------------------------------------------------------| | d*(x) | Compute the probability density function height at $x$ | |--------------------|-------------------------------------------------------------------| | p*(x) | Calculate $q$ given $x$ under $P\left(X\le x\right) = q$ | |--------------------|-------------------------------------------------------------------| | q*(q) | Calculate $x$ given $q$ under $P\left(X\le x\right) = q$ | |--------------------|-------------------------------------------------------------------| | r*(n) | Generate $n$ random values from the distribution | |--------------------|-------------------------------------------------------------------| where *() refers to the different kinds of distribution. For instance: +--------------+-------------+---------------+------------------------------------+ | Distribution | *() | Parameters | Parameter Description | +==============+=============+===============+====================================+ | Normal | norm() | mean = 0 | Mean or Center of distribution | | | | sd = 1 | Standard deviation or spread | +--------------+-------------+---------------+------------------------------------+ | Uniform | unif() | min = 0 | Distribution minimum value | | | | max = 1 | Distribution maximum value | +--------------+-------------+---------------+------------------------------------+ As an example, consider the Normal distributions' d, p, q, and r-variants. {r example-distribution-calls} # Normal PDF dnorm(c(-1, 0, 1)) # Normal CDF pnorm(c(-1, 0, 1)) # Normal iCDF qnorm(c(0.15, 0.5, 0.84)) # Normal RNG set.seed(15) # Set seed to ensure reproducibility rnorm(2)  ### Probability Density Function Consider a probability distribution under: $f\left( {x|\mu ,\sigma } \right) = \frac{1}{{\sigma \sqrt {2\pi } }}\exp \left( { - \frac{{{{\left( {x - \mu } \right)}^2}}}{{2{\sigma ^2}}}} \right)$ where: - $\mu$ is the mean - $\sigma$ is the standard deviation - $\sigma^2$ is the variance {r standard-normal} library("ggplot2") ggplot(data.frame(x = c(-5, 5)), aes(x)) + stat_function(fun = dnorm) + labs(title = "Normal: PDF", y = "f(x)", x = "x") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16))  {r dnorm-ex} library("ggplot2") ggplot(data.frame(x = c(-5, 5)), aes(x)) + stat_function(fun = dnorm)+ stat_function(fun = dnorm, xlim = c(-5, 1.96), geom = "area", fill = "blue") + geom_vline(xintercept = 1.96, color = "orange") + labs(title = "Normal: PDF", y = "f(x)", x = "x") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16))  ### Cumulative Density Function The CDF function is given as: ${F_X}\left( x \right) = P\left( {X \le x} \right)$ The CDF function is shown with the PDF when we see "shaded" area under the graph. {r pnorm-ex} ggplot(data.frame(x = c(-5, 5)), aes(x)) + stat_function(fun = pnorm) + geom_point(data = data.frame(x = 1.96, y = pnorm(1.96)), aes(y = y), color = "blue", size = 3) + geom_vline(xintercept = 1.96, color = "orange") + labs(title = "Normal: CDF", y = "F(x)", x = "x") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16))  ### Quantile Function The quantile function is defined as: $$x = F_X^{ - 1}\left( p \right)$$ The quantile approach allow us to convert a probability value (output) into an input value for the CDF function. As such, this is often referred to as the the 'inverse CDF' since: \begin{align*} x &= F_X^{ - 1}\left( p \right) \\ {F_X}\left( {F_X^{ - 1}\left( p \right)} \right) &= p \end{align*} {r qnorm-ex} ggplot(data.frame(x = c(0, 1)), aes(x)) + stat_function(fun = qnorm) + labs(title = "Normal: Quantile", y = "x", x = "F(x)") + geom_point(data = data.frame(x = 0.975, y = qnorm(0.975)), aes(y = y), color = "blue", size = 3) + geom_vline(xintercept = 0.975, color = "orange") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16))  ### Quantile Function - Z-Score {r qnorm-ex-zscore} ggplot(data.frame(x = c(0, 1)), aes(x)) + stat_function(fun = qnorm) + labs(title = "Normal: Quantile", y = "x", x = "F(x)") + geom_point(data = data.frame(x = 0.975, y = qnorm(0.975)), aes(y = y), color = "blue", size = 3) + geom_vline(xintercept = 0.975, color = "orange") + geom_hline(yintercept = qnorm(0.975), color = "red") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16))  ### Exercise: Finding quantiles Using the qnorm() function, determine what the quantiles are at probabilities: - 0.5 - 0.95 - 0.975 **Have you seen these values before?** {r find-probs}  ### Random Number Generation Function {r rnorm-ex} set.seed(1) # Generate 1000 values from a # Normal distribution centered at mean = 0 and sd = 1 generate_normal_values = rnorm(1000) # Generate 1000 values from a # Normal distribution centered at mean = 0 and sd = 4 generate_normal_values_sd = rnorm(1000, sd = 4) # Generate 1000 values from a # Normal distribution centered at mean = 2 and sd = 9 generate_normal_values_msd = rnorm(1000, mean = 2, sd = 9) set.seed(1) ggplot(data.frame(x = rnorm(1000)), aes(x = x)) + geom_histogram(aes(y = ..density..), fill = "orange") + stat_function(fun = dnorm, color = "blue") + geom_vline(xintercept = 0, color = "red") + labs(title = "Normal: Random Numbers", y = "f(x)", x = "Samples") + theme_bw() + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 13), plot.title = element_text(size = 16))  ### Exercise: Finding quantiles How can we recreate a **coin flip** for 100 observations using: - Uniform Distribution - Binomial Distribution {r flip-uniform} uniform_draws = runif(10) uniform_draws # [0, 1] ifelse(uniform_draws > 0.5, "heads", "tails")  {r flip-binomial} set.seed(999) # Problematic because the values range from [0, 2] binom_draws = rbinom(10, size = 2, prob = 0.5) # We need to ccontrol the randomness such that # values come from [0, 1] binom_draws_correct = rbinom(10, size = 1, prob = 0.5) ifelse(binom_draws_correct, "heads", "tails")