In early 2021, I used some of our internal security team training time to go over Bayesian reasoning. This document doesn’t go into why that is relevant to security. Instead, it looks at some places where typical human intuition can be very wrong about certain sorts of probability assessments, and it introduces tools and ways of thinking that can help correct for some of the erroneous judgments that people typically make.

This document displays only some of the R code that is used for these computations. The RMarkdown source used to generate it is available upon request.

This file was last built on 2024-08-28 16:33 CDT

The Scenario

A group of police officers have breathalyzers displaying false drunkenness in 5% of the cases in which the driver is sober. However, the breathalyzers never fail to detect a truly drunk person. One in a thousand drivers is driving drunk. Suppose the police officers then stop a driver at random to administer a breathalyzer test. It indicates that the driver is drunk. We assume you do not know anything else about them. How high is the probability they really are drunk?

Generate drunk data

Let’s get some of the important numbers from the scenario into some variables:

# "One in a thousand drivers is driving drunk."
drunk_rate <- 1/1000

# "false drunkenness in 5% of the cases in which the driver is sober"
false_pos_rate <- 0.05

# "breathalyzers never fail to detect a truly drunk person"
false_neg_rate <- 0.0

We are going to generate a random sample from those probabilities, but will set a seed for the random number generator so that we get the same results each time through (as I am copying data from here into Keynote slides).

We now generate random data for 30000 drivers.
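The generation code is one of the chunks not displayed, but a minimal sketch of it (the seed value here is arbitrary, not necessarily the one used to produce the numbers below) might look like:

set.seed(2021)   # arbitrary seed so reruns give the same sample
n_drivers <- 30000

# Each driver is drunk with probability drunk_rate
drunk <- runif(n_drivers) < drunk_rate

# Drunk drivers test positive whenever the test doesn't fail
# (false_neg_rate is 0, so always); sober drivers test positive
# at false_pos_rate
positive <- ifelse(drunk,
                   runif(n_drivers) >= false_neg_rate,
                   runif(n_drivers) < false_pos_rate)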

And look at what we get from this simulation in a contingency table:

reality   negative  positive   Total
drunk            0        29      29
sober        28543      1428   29971
Total        28543      1457   30000

From that table in the simulation, we see that a total of 1457 drivers tested positive, but of those only 29 were really drunk, which works out to about 1.99% of those who tested positive.
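As a quick check of that arithmetic:

29 / 1457
## [1] 0.01990391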

One way to calculate what we should expect (without having to run simulations) is to represent the problem in a tree diagram, which lets us see all of the ways that someone can test positive, along with the probabilities of falling into one of the four possible outcomes (true positive, false negative, false positive, true negative).

Tree diagram with paths to all positive and negative breath test results

If the tree diagram is just confusing at this point, ignore it. You might find that coming back to it later will help.

If we want to know the probability that someone selected at random who tests positive actually is drunk, we need to first consider all the ways that someone can test positive. There are two ways:

  1. They can be drunk and test positive
  2. They can be sober and test positive

The probability of being drunk and testing positive is the product of the probability of being drunk (1/1000) times the probability that the test gives the correct result in that case (which is 100%), so

true_pos <- drunk_rate * (1 - false_neg_rate)

So the chance that someone in this population (where 1/1000 of drivers are drunk) is drunk and correctly tests positive works out to 0.1%. That is, they are drunk (1/1000) and, when drunk, test positive 100% of the time.

Now we look at the probability of testing positive and being sober. That is going to be the probability of being sober (1 - 1/1000) times the false positive rate, 0.05.

false_pos <- (1 - drunk_rate) * false_pos_rate

So the probability of being sober and testing positive in this population is about 5% (4.995%, to be precise).

I am going to call the probability of a randomly selected driver testing positive \(\Pr(B)\), and use the variable name B.prob.

B.prob <- true_pos + false_pos

The event B is when someone tests positive (either truly or falsely). So the probability that someone from this population tests positive (true positives plus false positives) is \(\Pr(B)\), 5.1%.

We (well, the police) know whether some random driver tested positive because we (well, the police) tested them. What we want to know is the probability that someone is drunk (event A) given that they tested positive (event B). This is called a “conditional probability”. It is written “\(\Pr(A\vert B)\)” and is read “probability of A given B.”

Here we need to look at what proportion of B is true positives. We’ve already calculated that, so

drunk_given_positive <- true_pos / B.prob

This results in a probability of being drunk given a positive test of 0.0196271. That is only 1.96%.

Putting this all together, we can calculate \(\Pr(A\vert B)\) from the probability of being drunk in the first place, \(\Pr(A)\); the true positive rate, \(\Pr(B \vert A)\); and the probability of testing positive (either falsely or correctly), \(\Pr(B)\). This gives us a nice neat formula.

\[\begin{equation} \Pr(A \vert B) = \frac{\Pr(B \vert A) \Pr(A)}{\Pr(B)} \end{equation}\]

We calculated \(\Pr(B)\) using true_pos + false_pos, leaving us with \(\Pr(B) = \Pr(B \vert A)\Pr(A) + \Pr(B \vert \neg A)\Pr(\neg A)\). And so we end up with a less pretty formula, but one that only requires us to plug in the information that we have:

\[\begin{equation} \Pr(A \vert B) = \frac{\Pr(B \vert A) \Pr(A)}{\Pr(B \vert A)\Pr(A) + \Pr(B \vert \neg A)\Pr(\neg A)} \end{equation}\]

And all of this allows us to write a simple function in R for this:

bayes <- function(prior, false_positive_rate, false_negative_rate) {
  A <- prior                                # Pr(A), the prior
  not_A <- 1 - A                            # Pr(-A)
  true_B <- (1 - false_negative_rate) * A   # Pr(B|A) Pr(A), true positives
  false_B <- false_positive_rate * not_A    # Pr(B|-A) Pr(-A), false positives

  B <- true_B + false_B                     # Pr(B), all positives

  true_B / B                                # Pr(A|B), by Bayes' Rule
}

Now that we have the bayes function defined, we can use it.

A_given_B.prob <- bayes(prior = 1/1000,
                        false_positive_rate = 0.05,
                        false_negative_rate = 0)

This computes the same answer we got earlier. The chance of someone being drunk given that they test positive in this scenario is 1.96%. Roughly 49 times out of 50, someone who tests positive in this situation is sober.

Updating

The example used above is contrived to illustrate what happens when Pr(A) is very small. But now let’s modify the situation to one in which it is small, but not so small. So the police checkpoint for random stops is on a Friday night in a dry county, on the road leading from a bar in the neighboring (wet) county. Suppose that 1 in 100 drivers are drunk. Now let’s see what we get.

posterior <- bayes(prior = 1/100,
                   false_positive_rate = false_pos_rate,
                   false_negative_rate = false_neg_rate)

With the prior probability of 1/100, a positive breath test gives us a 16.8% chance that the randomly tested driver is drunk. This is about eight times higher than what we had with the 1/1000 base rate, but it still is not something you’d bet on.

Now suppose that there is a blood test that is more accurate and sensitive than the breath test.

blood.false_pos_rate <- 0.005 # B given -A
blood.false_neg_rate <- 0 # -B given A

Suppose that the police were only to perform blood tests on those who had tested positive on the breath test (and that errors for each kind of test are independent of each other).

Remember that those on this Friday night who test positive on the breath test have 16.8% chance of being drunk. That posterior probability from our last computation becomes our new prior probability for our next use of Bayes’ Rule.

updated <- bayes(prior = posterior,
                 false_positive_rate = blood.false_pos_rate,
                 false_negative_rate = blood.false_neg_rate)

By using the posterior, 16.8%, from our previous testing as the new prior, we get an updated posterior chance of 97.6% that this driver is drunk.

Now let’s look at the case where we just randomly perform the blood test on drivers, without first doing the breath test. One out of 100 drivers is drunk, and the blood test is highly accurate, with only a 0.5% false positive rate.

blood_only <- bayes(1/100,
    false_positive_rate = blood.false_pos_rate,
    false_negative_rate = blood.false_neg_rate)

This gives us a probability of 0.6688963. If we randomly blood tested all drivers on this road, there would be a 66.9% chance that a driver who tested positive is actually drunk. At least it is better than chance, but these aren’t numbers you would want to take to court. (Again, all of these numbers are fabricated. I have no idea what any of the input probabilities actually are.)

So what we have is:

Testing                         Prior    Pr(drunk | positive)
breath test only                1/1000   1.96%
breath test only                1/100    16.8%
blood test only                 1/100    66.9%
breath test, then blood test    1/100    97.6%

The point is to illustrate that while you must always consider the prior probability (which can have a bigger impact than you might initially think), additional evidence can be added into the assessment. Bayesian thinking is not just about considering the priors; it is about considering how new information should update our probability assessments.

Monty Hall

Now let’s look at a very different problem.

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

Note that the host, Monty Hall, will always open a door with a goat before asking whether you want to switch. He will not reveal the door that you originally picked and he will not reveal the door with the car.

Tree diagram Monty Hall problem (3 doors)

Let’s test out a few cases, for which we will need to define a win() function that takes the true location of the prize, the player’s initial pick, and whether the player switches or not. To make the presentation of the win() function clearer, we will first create a separate host_opens() function [not shown] that determines which doors the host opens.
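The host_opens() definition is not displayed, but its behavior is pinned down by the rules above: the host opens every door except the player’s pick, the prize door, and one further door left closed for a possible switch. A minimal sketch consistent with those rules (not necessarily the original code) is:

host_opens <- function(prize_door, player_pick, count = 3) {
  # Doors the host may open: anything that is neither the
  # player's pick nor the prize door
  candidates <- setdiff(1:count, c(player_pick, prize_door))

  # The host opens all but one of the remaining doors, that is,
  # count - 2 doors. When the player picked the prize door, the
  # host has a choice of which door to leave closed, so we have
  # him choose at random.
  if (length(candidates) > count - 2) {
    candidates <- sample(candidates, count - 2)
  }
  candidates
}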

With host_opens() (to give us which door(s) the host opens) we can now define the function for whether the player wins given their choices and where the prize is.

win <- function(pick, prize_door, switch = FALSE, doors = 3, verbose = FALSE)
{
  
  # Get the list of doors that the host opens
  opens <- host_opens(prize_door =  prize_door,
                      player_pick =  pick,
                      count = doors)
  if (verbose) {
    message('host opens ', paste(opens, collapse = " "))
  }
  
  ### Now we move on to switching or not
  
  if (switch == FALSE) {
    ## If the player doesn't switch, then we don't need
    ## to know or care about which door the host opened
    return (pick == prize_door)
  } else {
    ## If the player switches, we need to have them switch
    ## to the door that isn't their original pick
    ## and isn't the one that was opened.
    
    switch_to <- 1:doors
    switch_to <- switch_to[c(-pick,-opens)]
    
    if (verbose) { message('player switches to ', switch_to) }
    
    return(switch_to == prize_door)
  }
}
win <- Vectorize(win)

So let’s look at a couple of cases. Suppose the prize is behind door 1. Let’s look at what happens if the player picks the correct door initially.

win(prize_door = 1, pick = 1, switch = FALSE, verbose = TRUE)
## host opens 2
## [1] TRUE
win(prize_door = 1, pick = 1, switch = TRUE, verbose = TRUE)
## host opens 3
## player switches to 2
## [1] FALSE

So in the case that the player picked the correct door initially, switching makes them lose. Now suppose the player initially picks the wrong door, door 2 in this example.

win(prize_door = 1, pick = 2, switch = FALSE, verbose = TRUE)
## host opens 3
## [1] FALSE
win(prize_door = 1, pick = 2, switch = TRUE, verbose = TRUE)
## host opens 3
## player switches to 1
## [1] TRUE

When the player picked the wrong door initially, switching helped. Let’s see if this holds when the player initially picks the other wrong door (door 3).

win(prize_door = 1, pick = 3, switch = FALSE, verbose = TRUE)
## host opens 2
## [1] FALSE
win(prize_door = 1, pick = 3, switch = TRUE, verbose = TRUE)
## host opens 2
## player switches to 1
## [1] TRUE

With the prize being behind one of three doors, the player initially picking one of three doors, and the player choosing whether or not to switch, we have 3 * 3 * 2 = 18 possible cases. We can generate a data table that covers each possible case once [not shown].

And we can compute which cases the player wins.
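Neither of those chunks is displayed, but a minimal sketch of both steps (relying on the vectorized win() function defined above) could be:

# Every combination of prize location, initial pick, and switching
cases <- expand.grid(prize_door = 1:3,
                     pick = 1:3,
                     switch = c(FALSE, TRUE))

# win() was Vectorize()d, so it can evaluate all 18 cases at once
cases$wins <- win(pick = cases$pick,
                  prize_door = cases$prize_door,
                  switch = cases$switch)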

Let’s produce a table just for the cases where the player doesn’t switch.

Wins and losses when player does not switch

Prize Location  Initial pick  Wins
1               1             TRUE
1               2             FALSE
1               3             FALSE
2               1             FALSE
2               2             TRUE
2               3             FALSE
3               1             FALSE
3               2             FALSE
3               3             TRUE

So when the player doesn’t switch, they win in three of the nine possible combinations and lose in six. They have a 1/3 chance of winning:

##   wins n percent
##  FALSE 6   66.7%
##   TRUE 3   33.3%
##  Total 9  100.0%

Let’s create the parallel tables for the cases where the player always switches.

Wins and losses when player switches

Prize Location  Initial pick  Wins
1               1             FALSE
1               2             TRUE
1               3             TRUE
2               1             TRUE
2               2             FALSE
2               3             TRUE
3               1             TRUE
3               2             TRUE
3               3             FALSE

And we can count those:

##   wins n percent
##  FALSE 3   33.3%
##   TRUE 6   66.7%
##  Total 9  100.0%
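The counting code is also not displayed; a simple way to produce counts like these from the cases table sketched above (assuming that table exists) is:

# Tabulate wins among the cases where the player switches
switching <- subset(cases, switch == TRUE)
table(switching$wins)
prop.table(table(switching$wins))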

You can work through all 18 cases to see that it works out this way, but to help build the intuition, recall that if the player isn’t going to switch, then the information about opening doors is irrelevant, and so their initial 1/3 chance remains.

Looking through the cases where the player does not switch, only their first pick and which door has the prize matter at all. They start with a 1/3 chance of winning, and end with a 1 in 3 chance of winning.

If you look at the switch == FALSE branch of the win() function defined above, you will see that if the player doesn’t switch then no information about which door was opened is used. So we can modify and simplify that whole function as

win2 <-
  function(pick,
           prize_door,
           switch = FALSE,
           doors = 3,
           verbose = FALSE)
  {
    if (switch == FALSE) {
      ## If the player doesn't switch, then we don't need
      ## to know or care about which door the host opened
      return (pick == prize_door)
    }
    # If we get this far in this function, then the player is switching
    
    # Get the list of doors that the host opens
    opens <- host_opens(prize_door =  prize_door,
                        player_pick =  pick,
                        count = doors)
    if (verbose) {
      message('host opens ', paste(opens, collapse = " "))
    }
    
    # player switches to the unopened door that is not the
    # one they first picked.
    switch_to <- 1:doors
    switch_to <- switch_to[c(-pick, -opens)]
    if (verbose) {
      message('player switches to ', switch_to)
    }
    
    return(switch_to == prize_door)
  }
win2 <- Vectorize(win2)

The function win2 should behave identically to win except that it doesn’t even bother computing which door(s) get opened when the player doesn’t switch. Here again is what we see with the original function:

win(prize_door = 1, pick = 3, switch = FALSE, verbose = TRUE)
## host opens 2
## [1] FALSE

And now with the modified one:

win2(prize_door = 1, pick = 3, switch = FALSE, verbose = TRUE)
## [1] FALSE

The results are the same, but win2 never calculates which door the host opens. I am emphasizing this to help make the Monty Hall problem more intuitive and to illustrate that if the player does not switch, their chances of winning are unaffected by the host opening other non-prize doors. So it remains 1 out of 3.

More doors

Suppose that instead of starting with three doors, we start with 100 doors. There is a prize behind one of the doors, the player makes an initial pick, and the host opens 98 non-prize doors, leaving just the player’s original pick and one other door closed.

Let’s see what happens when the prize is behind, say, door 51, and the player picks door 91.

win2(switch = FALSE,
     pick = 91, prize_door = 51, doors = 100, verbose = TRUE )
## [1] FALSE
win2(switch = TRUE,
     pick = 91, prize_door = 51, doors = 100, verbose = TRUE )
## host opens 35 61 53 17 15 89 1 62 79 36 13 47 41 65 54 45 16 6 37 68 82 73 72 5 86 74 18 60 29 19 40 96 67 34 57 10 27 31 83 39 55 81 71 24 80 42 56 97 26 43 63 95 59 87 88 3 21 25 99 22 8 44 90 28 33 64 9 38 48 84 98 49 4 50 30 46 7 77 2 76 58 100 12 23 11 94 69 75 20 92 70 66 14 32 52 78 85 93
## player switches to 51
## [1] TRUE
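With 100 doors, the player’s initial pick wins 1 time in 100, so always switching to the one remaining closed door wins the other 99 times in 100. A quick simulation sketch (the seed is arbitrary) to check that:

set.seed(20)  # arbitrary seed for reproducibility
n <- 10000
prize <- sample(1:100, n, replace = TRUE)
pick <- sample(1:100, n, replace = TRUE)

# Fraction of simulated games won when always switching;
# this should come out near 0.99
mean(win2(pick = pick, prize_door = prize, switch = TRUE, doors = 100))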

Two psychologies

People’s intuitions about both the drunk test example and the game show example are wildly off the mark. (There is a lot of research and debate about what triggers the errors, but it is clear that these are systematic errors.) In the first case, we have people ignoring the prior probability and judging only based on the new information. In the second case, we have people relying solely on the prior probability and not recognizing that the new information is actually relevant.

What is important for us in the security field is to be aware of both of these human psychological tendencies and to be on guard against making these errors. Knowing how to formally model such problems may help, but it isn’t necessary unless you are trying to compute precise probabilities. What is necessary is learning that your intuitions may be mistaken, and ideally reshaping those intuitions through practice with similar problems.

It is also important to understand that most people do get this stuff wrong, and so they may systematically make erroneous judgments. Knowing this can help us better understand where people are coming from.