Afterthoughts

For many psychology students, Bayesian statistics remains shrouded in mystery. At the undergraduate level, Bayes’ theorem may be taught as part of probability theory, but the link between probability theory and scientific inference is almost never made. This is unfortunate, as this link—first made almost a century ago—provides a mathematically elegant and robust basis for the quantification of scientific knowledge.

As argued by Wrinch and Jeffreys (1921) and later in the works of Harold Jeffreys, probability theory is extended logic. Jaynes (2003) calls it “the logic of science.” Indeed, it is easy to see how probability theory maps directly to propositional logic if all statements are fully “true” or “false” – that is, all probabilities are either 0 or 1. Take for example the statement P(A|B) = 1. “If B is true, then the probability of A is 1” is simply another way of saying that “B implies A” (B → A). Similarly, P(A|B) = 0 is the same as B → ¬A. Probability theory extends this concept to include uncertainty, but the rules of probability have the same status as the rules of logic: they can be used to derive statements that are guaranteed to be correct if the premises are correct. Paraphrasing Edwards, Lindman, and Savage (1963, p. 194): Probability is orderly uncertainty and inference from data is revision of uncertainty in the light of relevant new information. Bayesian statistics, then, is nothing more—and nothing less—than the application of probability theory to real problems of inference.
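
To make the mapping concrete for one case: by the product rule, if P(A|B) = 1, then

P(¬A ∧ B) = P(B) − P(A ∧ B) = P(B) − P(A|B)P(B) = 0,

so whenever B is true, ¬A is impossible, which is exactly the content of B → A.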

The close relationship of probability theory and logic leads to further fertile insights. For example, a common misunderstanding regarding Bayesian methods is that they are somehow invalidated by the fact that conclusions may depend on the prior probabilities assigned to parameter values or hypotheses. Translated to the terminology of formal logic, this claim is that logical deduction is somehow invalidated because conclusions depend on premises. Clearly, an inferential procedure is not pathological because its conclusions depend on its assumptions – rather the inverse is true. Conclusions that do not depend on assumptions may be robust, but they cannot be rational any more than conclusions that do not depend on observations.

However, the dependence on prior probabilities involves another dimension that is often misunderstood: At first glance, it appears that the prior introduces the analyst’s beliefs—an element of subjectivity—into the inference, and this is clearly undesirable if we are to be objective in our scientific inquiries. Two observations address this issue. First, it is important to emphasize—lest we forget—that “subjective” is not synonymous with “arbitrary.” Rather than beliefs, we may think of probability as conveying information. It is not at all peculiar to say that relevant information may be subjective – after all, not all humans have access to the same information. Accordingly, the information that is encoded in probability distributions may be subjective, but that does not mean it is elective. Belief—in the sense in which it is used in probability theory—is not an act of will, but merely a state in which the individual passively finds itself. Accordingly, different scientists using different sources of information can rationally reach different conclusions.

The second observation regarding the subjectivity of the prior follows from inspection of Bayes’ theorem:

P(Θ|y) = P(y|Θ)P(Θ)/P(y).

In the right hand side numerator appears the product P(y|Θ)P(Θ): likelihood and prior side by side determine the relative density of all possible values of Θ. In a typical cognitive-modeling scenario, researchers will specify these distributions with some care – much defense and reasoning will often go into the selection of which prior to use, possibly using arguments from previous literature and graphical exploration of the prior predictive distribution; criticism of prior decisions is common and expected. The likelihood is defined also.

The way these components of Bayes’ theorem are specified is somewhat reminiscent of the Biblical description of the creation of the heavens, in which “God made two great lights; the greater light to rule the day, and the lesser light to rule the night: he made the stars also” (Gen 1:16, KJV). Much as the billions upon trillions of stars are created as an afterthought in this verse, far less argument is usually deemed necessary for the definition of the likelihood function, even though it is usually much more consequential than the definition of the prior – after all, given even moderate amounts of data the prior will typically wash out in favor of the likelihood. To see any argument at all for the choice of likelihood is not typical, and the tacit assumptions of sequential independence and normally distributed residuals are ubiquitous. Jaynes (2003) writes that “if one fails to specify the prior information, a problem of inference is just as ill-posed as if one had failed to specify the data” (p. 373), but the same emphasis can apply to both factors in the right-hand-side numerator of Bayes’ theorem: if we fail to question the likelihood, it is as if we fail to question the prior.

In some contexts, however, questioning the likelihood is common: we ask whether this or that is the “right model for the data.” For example, in the reaction time modeling world, we might wonder whether a set of observations is best described by a standard linear ballistic accumulator or by some stochastic variant. In more conventional scenarios, we sometimes worry whether a t test with equal variances is appropriate or whether an unequal-variance procedure should be used instead. This invites a question: What if we want to estimate the magnitude of some manipulation effect but are unwilling to commit to model E (equal variance) or U (unequal variance)? Perhaps unsurprisingly, probability theory has an answer. If the posterior distribution of the effect size assuming some model M (M ∈ {E,U}) is p(δ|y,M) and the posterior probability that E is the correct model of the two is P(E|y) = 1 – P(U|y), then the posterior distribution of δ, averaged over these two models, is immediately given by the sum rule of probability:

p(δ|y) = p(δ|y,E)P(E|y) + p(δ|y,U)P(U|y).

One interpretation of this equation is that the exact identity of the model is a nuisance variable, and we can “integrate it out” by taking an average weighted by the posterior probability of each model. It provides a posterior distribution of δ that does not assume that model E is true or that model U is true, only that one of them is. This technique of marginalizing over models is a direct consequence of probability theory that is often called Bayesian model averaging. It can be applied in a staggering variety of circumstances.
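
In practice, one can draw from the model-averaged posterior by first picking a model with probability equal to its posterior probability and then drawing from the corresponding posterior p(δ|y,M). A minimal sketch, with made-up numbers standing in for the model-specific posteriors and for P(E|y):

 % A minimal sketch of Bayesian model averaging (stand-in inputs):
 % delta_E and delta_U stand in for posterior samples of delta under
 % models E and U, and pE stands in for P(E|y).
 n       = 1e5;                                % number of posterior draws
 delta_E = 0.40 + 0.10 * randn(n, 1);          % stand-in posterior p(delta|y,E)
 delta_U = 0.30 + 0.20 * randn(n, 1);          % stand-in posterior p(delta|y,U)
 pE      = 0.70;                               % stand-in P(E|y); P(U|y) = 1 - pE
 useE    = rand(n, 1) < pE;                    % pick a model for each draw
 delta   = useE .* delta_E + ~useE .* delta_U; % draws from the model-averaged posterior
 hist(delta, 51)                               % p(delta|y), marginalized over E and U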

While most psychologists readily draw conclusions that are based on an often arbitrary and tenuously appropriate likelihood, one who is uncomfortable with any of the assumptions can apply Bayesian model averaging to assuage their concerns. This way, we can avoid having to commit to any particular set of model assumptions by averaging over likelihood functions – and so it goes with priors also.

References
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3), 193–242.
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge University Press.
Wrinch, D., & Jeffreys, H. (1921). On certain fundamental principles of scientific inquiry. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 42(249), 369–390.

(c) 2016 Joachim Vandekerckhove (CC-By Attribution 4.0 International)

Simulating like a Bayesian

Thinking like a Bayesian is often different from thinking like an orthodox frequentist statistician. To be a frequentist is to think about long-run frequency distributions under certain assumptions: What would the data X look like if T were true? Often, no attention is paid to the possibility that T might not be true, and the focus is exclusively on false alarm rates: How often will I conclude that an effect exists when there really is none?

To be a Bayesian is completely different in this respect. Bayesians are rarely interested in such long-run behaviors, and will never condition their conclusions on the existence or nonexistence of an effect. While frequentists ask what is the probability of observing data X given truth T, Bayesians will instead ask what is the probability of truth T given that data X are observed? I contend that the latter is an interesting question for scientists in general, and the former is only of niche interest to a small minority — process engineers, perhaps.

This difference in focus shines through when we perform simulation studies of the behavior of statistical algorithms and procedures. While it is often more straightforward to assume a certain truth T (say, the null hypothesis) and generate data from that to study its long-run behavior, this leads inevitably to frequentist questions for which few researchers have any use at all. A more interesting simulation study is an attempt to emulate the situation real scientists are in: having the data as given and drawing conclusions from them.

The standard example

The distinction between conditioning on truth and conditioning on data is emphasized in nearly every first introduction to Bayesian methods, often through a medical example. Consider the scenario of a rare but serious illness, with an incidence of perhaps one in a million, and the development of a new test, with a sensitivity of 99% but a false alarm rate of 2%. What is the probability that a person with a positive diagnosis has the illness? We can study this issue with a simple simulation:

 n = 1e8;                            % the simulation sample size
 truth = rand(n, 1) < 1e-6;          % the true diagnosis is positive in 1/1,000,000 cases
 hit = truth & (rand(n, 1) < .99);   % 99% of positive cases are correctly identified
 FA = ~truth & (rand(n, 1) < .02);   % 2% of negative cases are incorrectly flagged
 diag = hit | FA;                    % the test reads positive if a hit or false alarm

We can summarize the outcome of this simulation study in a small two-by-two table:

          T = 1       T = 0
D = 1       101     1999093
D = 0         1    98000805
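
The counts can be read off the simulation directly; a minimal way to cross-tabulate them, reusing the vectors truth and diag from above:

 % rows: D = 1 (positive diagnosis), D = 0; columns: T = 1 (ill), T = 0
 counts = [sum( truth &  diag), sum(~truth &  diag)
           sum( truth & ~diag), sum(~truth & ~diag)]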

At this point, there are two ways of looking at this table. The frequentist view considers each column of the table separately: if the illness is present (T = 1, left column), then in approximately 99% of cases the diagnosis will be positive. If the illness is not present (T = 0, right column), then in approximately 2% of cases the diagnosis will be positive. Now that’s all good and well, but it does not address the interesting question: what is the probability that the illness is present given a positive diagnosis? For this, we need the Bayesian view that considers only one row of the table, namely where D = 1. In that row, only about 0.005% of cases are true, and this is the correct answer.

This answer is slightly counterintuitive because the probability is so low (after all, the test is supposedly 99% accurate); this impression comes from a cognitive fallacy known as base rate neglect. Consider that the probability of the illness being present is about 50 times larger after the diagnosis than before, but it remains small. The base rate is a necessary component to address the question of interest; we could not have answered the question without it.

Of course any Bayesian worth their salt would scoff at all the effort we just put in to demonstrate a basic result of probability theory. After all,


 .99 * 1e-6 / (.99 * 1e-6 + .02 * (1-1e-6))   % P(illness | positive diagnosis), by Bayes' rule
 ans = 4.9498e-05

On to bigger things

We can apply the same analysis to significance testing! Let’s generate some data for a one-sample t test with n = 15. Some proportion of the data are generated from the null hypothesis:

 clear
 nt = 1e6;                                          % number of simulated experiments
 nx = 15;                                           % sample size per experiment
 prior_h0 = 0.50;                                   % proportion of cases in which the null is true
 truth = randn(1, nt) .* (rand(1, nt) > prior_h0);  % true effect size: 0 with probability prior_h0, N(0,1) otherwise
 hist(truth, 31)                                    % histogram of the true effect sizes

[Figure 1: histogram of the simulated true effect sizes]

Now that we’ve generated “true” effect sizes, we can do t tests (feel free to check the math). Let’s do them one-sided for simplicity:

 e = randn(nx, nt);                                 % noise for nx observations in each of nt experiments
 meanE = mean(e);                                   % sample mean of the noise in each experiment
 stdE = std(e);                                     % sample standard deviation in each experiment
 observedX = meanE + truth;                         % observed sample means: true effect plus mean noise
 observedT = observedX .* sqrt(nx) ./ stdE;         % one-sample t statistics
 criticalT = 1.761;                                 % one-sided critical value (alpha = .05, df = 14)
 significant = observedT > criticalT;               % which tests come out significant
 false_alarm_rate = 100*mean(significant(~truth));  % percent significant among cases where the null is true

The false alarm rate in the simulation is 4.9959%.

And now we can look at the outcomes with our Bayesian goggles, examining the distribution of true effects conditional on the outcome measure (in this case, whether the test was significant). Let’s plot the two histograms together, with the “nonsignificant” one flipped upside-down. Because I saw this for the first time in one of Jeff Rouder’s talks, I think of this as a Rouder plot:

 xbins = linspace(-4, 4, 31);
 [ys, xs] = hist(truth(significant), xbins);
 [yn, xn] = hist(truth(~significant), xbins);
 set(bar(xs, ys/nt), 'facecolor', [.25 .25 1], 'edgecolor', 'none')
 hold on
 set(bar(xn, -yn/nt), 'facecolor', [1 .25 .25], 'edgecolor', 'none')
 xlabel 'true effect size'
 legend 'p < .05' 'p > .05' location southwest
 hold off

[Figure 2: Rouder plot of true effect sizes, with significant results plotted upward and nonsignificant results flipped downward]

I like this graph because you can read off interesting conditional distributions: if p > .05, many more true effects are 0 than when p < .05, which is comforting. But how many more?

Well, we can make that same table as above (now with proportions):

          T = 1    T = 0
D = 1      0.17     0.02
D = 0      0.33     0.47
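
These proportions come straight out of the simulation as well; a minimal way to compute them, reusing truth and significant from above:

 % rows: D = 1 (significant), D = 0; columns: T = 1 (effect present), T = 0 (null true)
 effect = truth ~= 0;               % which simulated experiments have a true effect
 props  = [mean( effect &  significant), mean(~effect &  significant)
           mean( effect & ~significant), mean(~effect & ~significant)]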

Note here that, looking at the right column only, we can see that the frequentist guarantee is maintained: if the null is true, we see a significant result only roughly 5% of the time. However, if we decide the null is false on the basis of a significant effect (i.e., looking only at the top row), we will be mistaken at a different rate: the value in the upper right divided by the sum of the top row, which here is considerably larger than 5%.
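
In the simulation, that rate is simply the proportion of significant results whose true effect size is zero; one line, reusing the vectors above:

 error_rate = 100 * mean(truth(significant) == 0)   % percent of significant results for which the null is true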

The continuous domain

So far, we’ve looked only at discrete events: the effect is either zero or it is not (T) and the test is either significant or it is not (D). If we call x the observed t statistics and t the true effect sizes, we can plot the joint distribution p(x, t) of their exact values:

 close all
 bwx = .25; bwy = .25; % plotting resolution
 [rho, xvec, yvec] = density(truth, observedT, bwx, bwy, [-4 4 -4 4]); % density() is a custom helper, not built-in MATLAB (a stand-in is sketched below)
 title('joint {\itp}(data, truth)')

[Figure 3: joint density p(data, truth)]
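
The density function is not part of standard MATLAB; it is a small custom helper. A minimal stand-in with the assumed behavior (a normalized two-dimensional histogram, with the first input on the horizontal axis and the second on the vertical axis) might look like this; save it as density.m:

 function [rho, xvec, yvec] = density(x, y, bwx, bwy, lims)
     % Assumed stand-in for the custom density() helper used above:
     % a simple normalized 2-D histogram over a regular grid.
     xvec = lims(1):bwx:lims(2);                % grid for the first input (truth)
     yvec = lims(3):bwy:lims(4);                % grid for the second input (data)
     ix   = round((x(:) - lims(1)) / bwx) + 1;  % nearest x bin for each point
     iy   = round((y(:) - lims(3)) / bwy) + 1;  % nearest y bin for each point
     keep = ix >= 1 & ix <= numel(xvec) & iy >= 1 & iy <= numel(yvec);
     rho  = accumarray([iy(keep) ix(keep)], 1, [numel(yvec) numel(xvec)]);
     rho  = rho / (sum(rho(:)) * bwx * bwy);    % normalize counts to a density
     imagesc(xvec, yvec, rho), axis xy          % display: truth on x, data on y
 end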

Here, too, we can inspect the results of our simulation in a frequentist fashion by inspecting only columns: we can condition on a particular model (e.g., t = 0) and study the behavior of the data x under this model. Alternatively, and more interestingly, we can be Bayesian and see what distribution of true values t is associated with a particular observation x.

In making figures, we can again make it explicit that you can condition on model (the frequentist idea; left) or on data (the Bayesian idea; right):

 idx = ceil(length(xvec)/2);           % index of the middle column (truth = 0)
 idy = ceil(length(yvec)/2);           % index of the middle row (data = 0)
 normto = @(x, t) x ./ sum(x(:)) / t;  % normalize a slice to sum 1, then scale by bin width t
 subplot(1,2,1)
 bar(yvec, normto(rho(:,idx), bwy))    % p(data | truth = 0)
 title(sprintf('frequentist {\\itp}(data|truth = %g)', xvec(idx)))
 uistack(line(yvec, tpdf(yvec, 14), 'color', [1 .2 .2], 'linewidth', 3), 'bottom')  % underlay the t(14) density
 subplot(1,2,2)
 bar(xvec, normto(rho(idy,:), bwx))    % p(truth | data = 0)
 title(sprintf('bayesian {\\itp}(truth|data = %g)', yvec(idy)))

[Figure 4: left, frequentist p(data|truth = 0); right, Bayesian p(truth|data = 0)]

The distribution on the left is a vertical slice from the joint distribution above — right out of the middle. It is the expected distribution of the observed data if the null hypothesis is true (t = 0). This is commonly known as the t distribution (here, with 14 degrees of freedom), which I’ve underlaid in red for emphasis. The distribution on the right is a horizontal slice: the distribution of true values that generated a particular observation (in this case, the ones that generated x values close to 0). This distribution is commonly called the posterior distribution.

I can think of no real-world use for the distribution on the left, but the distribution on the right tells you what you want to know: the distribution of the true effect size, given a possible data outcome. Obviously, you can make this figure for many possible outcomes, including the one from your experiment.

And that is how you simulate like a Bayesian.

(c) 2016 Joachim Vandekerckhove, all rights reserved