Bayes Theorem and Naive Bayes
Outline
- Bayes theorem
- Naive Bayes
- Naive Bayes in machine learning
- Naive Bayes for continuous input attributes
- Advantages and disadvantages
- Applications
Bayes Theorem is everywhere in our life
What is the chance of getting wet on our way to the university?
Probability
Probability gives a numerical description of how likely an event is to occur:
Probability of event (outcome) A: $P(A)$
- P(rainy) = 0.3
- P(snowy) = 0.05
- P(a coin lands on heads) = 0.5
$P(A) + P(\hat{A}) = 1.0$
- P(rainy) + P(not rainy) = 1, so P(not rainy) = 0.7
Random variables
A random variable is a variable that describes the outcome of a random experiment
Consider a discrete random variable Weather
- P(Weather = rainy) = 0.3
- P(Weather = snowy) = 0.05
- P(Weather = sunny) = 0.4
- P(Weather = cloudy) = 0.25
The sum of the probabilities of all the outcomes is 1
Probability of multiple random variables
We may want to know the probability of two or more events occurring, e.g.,
- P(Weather = rainy, Season = winter) = 0.1
- P(Weather = rainy, Season = spring) = 0.15
- P(Weather = snowy, Season = summer) = 0
Joint probability of events A and B: $P(A, B)$ or $P(A \cap B)$
Conditional Probability
We may also be interested in the probability of an event given the occurrence of another event: P(A given B) or P(A $\mid$ B)
- P(Weather = rainy $\mid$ Season = winter) = 0.2
- P(Weather = rainy $\mid$ Season = spring) = 0.3
- P(Weather = snowy $\mid$ Season = winter) = 0.5
Calculating a conditional probability without the joint probability (Bayes theorem):
$P(A \mid B) = \cfrac{P(A) \cdot P(B \mid A)}{P(B)}$
Where:
- $A$: Hypothesis
- $B$: Evidence
- $P(A \mid B)$: Posterior Probability
- $P(A)$: Prior Probability
- $P(B \mid A)$: Likelihood
- $P(B)$: Evidence Probability
Example
What is the chance of getting wet on our way to the university?
- Probability of raining at Birmingham in March is 0.4
- Probability of taking an umbrella when raining is 0.9
- Probability of taking an umbrella when not raining is 0.2
Question
$P(raining \mid \hat {umbrella}) = ?$
Analyse
- $H = raining,\: E = \hat{umbrella}$
- $P(raining) = 0.4, \: P(\hat{raining}) = 0.6$
- $P(umbrella \mid raining) = 0.9, \: P(\hat{umbrella}\mid raining) = 0.1$
- $P(umbrella\mid \hat{raining}) = 0.2, \: P(\hat{umbrella} \mid \hat{raining}) = 0.8$
$\therefore \: P(raining \mid \hat{umbrella}) = \cfrac{P(raining) \cdot P(\hat{umbrella} \mid raining)}{P(\hat{umbrella})} = \cfrac{0.4 \times 0.1}{0.4 \times 0.1 + 0.6 \times 0.8} = \cfrac{0.04}{0.52} \approx 7.7\%$
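These numbers can be checked with a few lines of Python. Below is a minimal sketch, not part of the original slides: the illustrative helper `posterior()` applies Bayes theorem for a binary hypothesis, expanding $P(E)$ by the law of total probability.

```python
# Minimal sketch of Bayes theorem for a binary hypothesis H:
# P(H | E) from the prior P(H), likelihood P(E | H), and P(E | not H).
def posterior(prior, likelihood, likelihood_not):
    evidence = prior * likelihood + (1 - prior) * likelihood_not  # P(E)
    return prior * likelihood / evidence                          # P(H | E)

# Umbrella example: P(raining | no umbrella), with
# P(raining) = 0.4, P(no umbrella | raining) = 0.1,
# P(no umbrella | not raining) = 0.8
print(posterior(0.4, 0.1, 0.8))  # ~0.077, i.e. about 7.7%
```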
Bayes Theorem
$P(H \mid E) = \cfrac{P(H) \cdot P(E \mid H)}{P(E)}$, for hypothesis $H$ and evidence $E$
Example 2: Drug Testing
Suppose that a drug test will produce 99% true positive results for drug users and 99% true negative results for non-drug users. Suppose that 0.5% of people are users of the drug. What is the probability that a randomly selected individual with a positive test is a drug user? What is the probability that a randomly selected individual with a negative test is a drug user?
- Drug user: $drug$
- Non-drug user: $\hat{drug}$
- Positive: $+$
- Negative: $-$
Question
- $P(drug\mid +) = ?$
- $P(drug\mid -) = ?$
Analyse
$P(drug\mid +) = ?$
$P(drug\mid +) = P(drug) \cdot P(+\mid drug) \:/\: P(+)$
$P(+) = P(drug) \cdot P(+\mid drug) + P(\hat{drug}) \cdot P(+ \mid \hat{drug})=1.49\%$
$P(drug\mid +) = 0.5\% \times 99\% \:/\: 1.49\% \approx 33.2\%$
$P(drug\mid -) = ?$
$P(drug\mid -) = P(drug) \cdot P(-\mid drug) \:/\: P(-)$
$P(-) = P(drug) \cdot P(-\mid drug) + P(\hat{drug}) \cdot P(- \mid \hat{drug}) = 98.51\%$
$P(drug\mid -) = 0.5\% \times 1\% \:/\: 98.51\% \approx 0.005\%$
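The same check works for the drug-test numbers. The sketch below re-defines the illustrative `posterior()` helper so it runs on its own:

```python
# posterior() as in the earlier sketch: Bayes theorem for a binary hypothesis.
def posterior(prior, likelihood, likelihood_not):
    evidence = prior * likelihood + (1 - prior) * likelihood_not
    return prior * likelihood / evidence

print(posterior(0.005, 0.99, 0.01))  # P(drug | +)  ~0.332
print(posterior(0.005, 0.01, 0.99))  # P(drug | -)  ~0.00005, i.e. ~0.005%
```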
Revisit Drug Testing - Extension
Suppose that a drug test will produce 99% true positive results for drug users and 99% true negative results for non‐drug users. Suppose that 0.5% of people are users of the drug. What is the probability that a randomly selected individual with two positive tests is a drug user?
(assume that the two tests are mutually independent given whether the person is a drug user or not)
Question
$P(drug\mid ++) = ?$
Analyse
$P(drug\mid ++) = P(drug) \cdot P(++\mid drug) \:/\: P(++)$
$P(++\mid drug) = P(+\mid drug) \cdot P(+\mid drug) = 99\% \times 99\% = 98.01\%$
$P(++\mid \hat{drug}) = P(+\mid \hat{drug}) \cdot P(+\mid \hat{drug}) = 1\% \times 1\% = 0.01\%$
$P(++) = P(drug)\cdot P(++\mid drug) + P(\hat{drug})\cdot P(++\mid \hat{drug}) = 0.5\% \times 98.01\% + 99.5\% \times 0.01\% \approx 0.5\%$
$P(drug\mid ++) = 0.5\% \times 98.01\% \:/\: 0.5\% \approx 98.01\%$
Interpretation of Bayes Theorem
- Probability measures a “degree of belief”
- Bayes theorem links the degree of belief in a hypothesis (proposition) before and after accounting for evidence
- $P(H \mid E) = P(H) \cdot P(E \mid H)\:/\:P(E)$, hypothesis $H$ and evidence $E$
- $P(H)$ is the initial degree of belief in $H$ (i.e. prior probability)
- $P(H\mid E)$ is the degree of belief having accounted for $E$
- Bayes theorem is not to determine our belief, but to update our belief
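This updating view is easy to see in code: the posterior after one piece of evidence becomes the prior for the next. A small sketch (again with the illustrative `posterior()` helper) reproduces the two-positive-test result:

```python
def posterior(prior, likelihood, likelihood_not):
    evidence = prior * likelihood + (1 - prior) * likelihood_not
    return prior * likelihood / evidence

belief = 0.005                          # initial degree of belief P(drug)
belief = posterior(belief, 0.99, 0.01)  # after the first +:  ~0.332
belief = posterior(belief, 0.99, 0.01)  # after the second +: ~0.9801
print(belief)                           # matches P(drug | ++) = 98.01%
```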
Bayes Theorem for Classification
- A machine learning model predicts relationships in the data, such as the relationship between features (i.e. attributes) (e.g. round, juicy, orange skin) and classes (e.g. orange, apple)
- Bayes theorem provides a probabilistic way to model the relationship in data: evidence = feature (input), hypothesis = class (output)
- It calculates the probability that an object with a number of features (input $f_1, f_2, …, f_n$) belongs to a particular class (output $c$): $P(Y=c \mid X_1 = f_1, X_2 = f_2, …, X_n = f_n)$
Example: Tooth cavity testing
- To predict, using Bayes theorem, whether a person has a cavity when they do not use mouth wash
- Difference from the previous examples: the prior probability, likelihood, etc. are unknown and need to be learned from training examples
Analyse
- Input variable: Wash
- Output variable: Cavity
Training Set
Person | Wash | Cavity |
---|---|---|
P1 | no | yes |
P2 | no | yes |
P3 | yes | yes |
P4 | yes | no |
P5 | yes | no |
P6 | no | no |
Frequency Table
 | Cavity = no | Cavity = yes | Total |
---|---|---|---|
Wash = no | 1 | 2 | 3 |
Wash = yes | 2 | 1 | 3 |
Total | 3 | 3 | 6 |
Probabilities based on frequency table
$P(\hat{wash} \mid cavity) = \cfrac{2}{3}$
$P(wash \mid cavity) = \cfrac{1}{3}$
$P(cavity) = \cfrac{3}{6}$
$P(\hat{cavity}) = \cfrac{3}{6}$
$P(wash) = \cfrac{3}{6}$
$P(\hat{wash}) = \cfrac{3}{6}$
Now we can predict whether a person has a cavity when they do not use mouth wash.
$P(cavity \mid \hat{wash}) = P(cavity) \cdot P(\hat{wash}\mid cavity) \:/\: P(\hat{wash}) = \cfrac{3}{6} \times \cfrac{2}{3} \:/\: \cfrac{3}{6} \approx 67\%$
$P(\hat{cavity} \mid \hat{wash}) \approx 33\%$
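The frequency-table estimates and the prediction can be reproduced in a short sketch; the tuple encoding of the training set is an illustrative assumption:

```python
# (wash, cavity) pairs for persons P1..P6 from the training set.
data = [("no", "yes"), ("no", "yes"), ("yes", "yes"),
        ("yes", "no"), ("yes", "no"), ("no", "no")]

n = len(data)
n_cavity = sum(c == "yes" for _, c in data)                # 3
p_cavity = n_cavity / n                                    # 3/6
p_nowash_given_cavity = (sum(w == "no" and c == "yes"
                             for w, c in data) / n_cavity) # 2/3
p_nowash = sum(w == "no" for w, _ in data) / n             # 3/6

# P(cavity | no wash) = P(cavity) * P(no wash | cavity) / P(no wash)
print(p_cavity * p_nowash_given_cavity / p_nowash)  # ~0.667
```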
Bayes Theorem for multiple input attributes
To predict whether a person has a cavity when they do not use mouth wash and have pain: $P(Y=cavity \mid X_1=\hat{wash}, X_2=pain)$
- Input variables: Wash, Pain
- Output variable: Cavity
Frequency table of joint input attributes
Training Set
Person | Wash | Pain | Cavity |
---|---|---|---|
P1 | no | yes | yes |
P2 | no | yes | yes |
P3 | yes | yes | yes |
P4 | yes | no | no |
P5 | yes | no | no |
P6 | no | no | no |
Frequency Table
 | Cavity = no | Cavity = yes | Total |
---|---|---|---|
Wash = no, Pain = no | 1 | 0 | 1 |
Wash = no, Pain = yes | 0 | 2 | 2 |
Wash = yes, Pain = no | 2 | 0 | 2 |
Wash = yes, Pain = yes | 0 | 1 | 1 |
Total | 3 | 3 | 6 |
- Downside: The number of possible combinations of input attributes can become very large
Naive Bayes
Naive Bayes = Bayes Theorem + Independence Assumption
Bayes Theorem:
$P(c \mid f_1, …, f_n) = \cfrac{P(c) \cdot P(f_1, …, f_n \mid c)}{P(f_1, …, f_n)}$
Independence Assumption: Input attributes $f_1, …, f_n$ are mutually independent, conditional on any output, that is
$P(f_1, …, f_n \mid c) = \prod_{i=1}^{n} P(f_i \mid c)$
- Naive Bayes
If there are $m$ possible classes $c_1, c_2, …, c_m$, then based on the independence assumption we have, for each class $c_j$,
$P(c_j \mid f_1, …, f_n) = \cfrac{P(c_j) \cdot \prod_{i=1}^{n} P(f_i \mid c_j)}{P(f_1, …, f_n)}$
As $P(f_1, …, f_n)$ is the same (i.e. a constant) for any class $c$, we can denote $P(f_1, …, f_n) = \cfrac{1}{\alpha}$,
where $\alpha$ is a normalisation factor
Therefore
$P(c_j \mid f_1, …, f_n) = \alpha \cdot P(c_j) \cdot \prod_{i=1}^{n} P(f_i \mid c_j)$
Frequency table of joint input attributes
Training Set
Person | Wash | Pain | Cavity |
---|---|---|---|
P1 | no | yes | yes |
P2 | no | yes | yes |
P3 | yes | yes | yes |
P4 | yes | no | no |
P5 | yes | no | no |
P6 | no | no | no |
Frequency Table
 | Cavity = no | Cavity = yes | Total |
---|---|---|---|
Wash = no, Pain = no | 1 | 0 | 1 |
Wash = no, Pain = yes | 0 | 2 | 2 |
Wash = yes, Pain = no | 2 | 0 | 2 |
Wash = yes, Pain = yes | 0 | 1 | 1 |
Total | 3 | 3 | 6 |
Therefore, under the independence assumption, the joint table decomposes into frequency tables of the individual input attributes: $P(\hat{wash}, pain \mid cavity) = P(\hat{wash} \mid cavity) \cdot P(pain \mid cavity)$
Frequency Table
 | Cavity = no | Cavity = yes | Total |
---|---|---|---|
Wash = no | 1 | 2 | 3 |
Wash = yes | 2 | 1 | 3 |
Total | 3 | 3 | 6 |
 | Cavity = no | Cavity = yes | Total |
---|---|---|---|
Pain = no | 3 | 0 | 3 |
Pain = yes | 0 | 3 | 3 |
Total | 3 | 3 | 6 |
Naive Bayes classifier
$P(c_j \mid f_1, …, f_n) = \alpha \cdot P(c_j) \cdot \prod_{i=1}^{n} P(f_i \mid c_j)$
and the predicted class is the one with the highest posterior probability:
$c^* = \arg\max_{c_j} \: P(c_j) \cdot \prod_{i=1}^{n} P(f_i \mid c_j)$
Where $\alpha = 1 \:/\: P(f_1, …, f_n)$ is the normalisation factor.
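A minimal categorical naive Bayes sketch over the Wash/Pain training set follows. The data encoding and the `predict()` helper are illustrative; no smoothing is applied, so zero counts give zero probability, exactly as in the tables above.

```python
from collections import Counter, defaultdict

X = [("no", "yes"), ("no", "yes"), ("yes", "yes"),
     ("yes", "no"), ("yes", "no"), ("no", "no")]  # (wash, pain)
y = ["yes", "yes", "yes", "no", "no", "no"]       # cavity

class_counts = Counter(y)
feature_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
for features, label in zip(X, y):
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def predict(features):
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(y)                              # P(c)
        for i, value in enumerate(features):
            score *= feature_counts[(i, c)][value] / n_c  # P(f_i | c)
        scores[c] = score    # proportional to P(c | f_1, ..., f_n)
    return max(scores, key=scores.get)

print(predict(("no", "yes")))  # wash = no, pain = yes -> "yes" (cavity)
```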
Categorical naive Bayes
Categorical naive Bayes: all the input attributes are categorical variables.
- A categorical variable is a variable that can take on one of a limited number of possible values
- This contrasts with numerical (continuous) naive Bayes, where some input variables are represented by real numbers (i.e. infinitely many possible attribute values)
Categorical vs Numerical attributes
Training Set
Person | Wash | Pain | Cavity |
---|---|---|---|
P1 | no | yes | yes |
P2 | no | yes | yes |
P3 | yes | yes | yes |
P4 | yes | no | no |
P5 | yes | no | no |
P6 | no | no | no |
Categorical
Person | Wash | Cavity |
---|---|---|
P1 | no | yes |
P2 | no | yes |
P3 | yes | yes |
P4 | yes | no |
P5 | yes | no |
P6 | no | no |
Numerical
Person | Sugar (grams) | Cavity |
---|---|---|
P1 | 40 | yes |
P2 | 35 | yes |
P3 | 60 | yes |
P4 | 20 | no |
P5 | 30 | no |
P6 | 17 | no |
Deal with numerical input attributes by discretising values
Transform numerical input values into categories (i.e., discrete values)
E.g.: Sugar $\rightarrow$ small, medium, large
However, one person may consider "large" to mean >30g, while another may consider it >25g
If the amount of sugar is "large" when $\ge$ 30g, should 29.9g be considered medium, not large?
This subjectivity and these crisp boundaries may lead to loss of information (a small sketch of such binning follows below)
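The 25 g and 30 g thresholds in this sketch are hypothetical, chosen only to show the crisp-boundary problem:

```python
# Hypothetical crisp discretisation of Sugar into small/medium/large.
def discretise_sugar(grams):
    if grams < 25:
        return "small"
    elif grams < 30:
        return "medium"
    return "large"

print([discretise_sugar(g) for g in (40, 35, 60, 20, 30, 17)])
print(discretise_sugar(29.9))  # "medium", although it is almost 30 g
```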
Probability Function
Probability Function describes the probability distribution of a random variable, i.e., the probabilities of occurrence of different possible outcomes in an experiment.
Probability Mass Function: the random variable is discrete
Person | Wash | Cavity |
---|---|---|
P1 | no | yes |
P2 | no | yes |
P3 | yes | yes |
P4 | yes | no |
P5 | yes | no |
P6 | no | no |
Probability Mass Function for $P(Wash \mid cavity)$:
- X axis represents possible values of the feature
- Y axis represents the probability of the feature being a value given a particular class
Probability Density Function
Probability Density Function is a probability function where the variable is continuous.
Person | Sugar (grams) | Cavity |
---|---|---|
P1 | 40 | yes |
P2 | 35 | yes |
P3 | 60 | yes |
P4 | 20 | no |
P5 | 30 | no |
P6 | 17 | no |
- Shape of the function: Assume the examples are drawn from a Gaussian distribution (aka normal distribution).
- Shape is controlled by two parameters, mean ($\mu$) and standard deviation ($\sigma$), estimated from the training examples.
Gaussian Distribution $N(x \mid \mu , \sigma )$:
$N(x \mid \mu, \sigma) = \cfrac{1}{\sqrt{2\pi}\:\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Probability Density Function calculation for people having cavity
$x_i$ is the amount of sugar a person takes and $n$ is the number of people having cavity in the training examples:
$\mu = \cfrac{1}{n}\sum_{i=1}^{n} x_i = \cfrac{40+35+60}{3} = 45$
$\sigma = \sqrt{\cfrac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2} = \sqrt{\cfrac{(40-45)^2 + (35-45)^2 + (60-45)^2}{3}} \approx 10.8$
Use $\mu$ and $\sigma$ to plot the Gaussian distribution graph:
Sugar is $x$ and $P(Sugar\mid cavity)$ is $N(x \mid \mu, \sigma)$.
Probability Density Function calculation for people not having cavity
$x_i$ is the amount of sugar a person takes and $n$ is the number of people not having cavity in the training examples:
$\mu = \cfrac{1}{n}\sum_{i=1}^{n} x_i = \cfrac{20+30+17}{3} \approx 22.3$
$\sigma = \sqrt{\cfrac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2} = \sqrt{\cfrac{(20-22.3)^2 + (30-22.3)^2 + (17-22.3)^2}{3}} \approx 5.6$
Use $\mu$ and $\sigma$ to plot the Gaussian distribution graph:
Sugar is $x$ and $P(Sugar\mid \hat{cavity})$ is $N(x \mid \mu, \sigma)$.
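Both parameter estimates and the densities used later can be reproduced in a few lines. The `fit()` and `gaussian()` helpers are illustrative, using the population formulas (dividing by $n$), which match the slides' numbers:

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density N(x | mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def fit(values):
    """Estimate mu and sigma from training values (dividing by n)."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma

mu_c, sigma_c = fit([40, 35, 60])   # cavity = yes: mu = 45,   sigma ~ 10.8
mu_n, sigma_n = fit([20, 30, 17])   # cavity = no:  mu ~ 22.3, sigma ~ 5.6
print(gaussian(20, mu_c, sigma_c))  # P(Sugar=20 | cavity)    ~ 0.0025
print(gaussian(20, mu_n, sigma_n))  # P(Sugar=20 | no cavity) ~ 0.0657
```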
Probability density functions
Probability density function of Sugar given cavity or no cavity:
Person | Sugar (grams) | Cavity |
---|---|---|
P1 | 40 | yes |
P2 | 35 | yes |
P3 | 60 | yes |
P4 | 20 | no |
P5 | 30 | no |
P6 | 17 | no |
General idea
Decide on a type of probability density function (usually Gaussian distribution) for each numerical input attribute $X$.
Determine the parameters of the probability function for each possible output attribute value (i.e.$c_1, c_2, …, c_m$) according to the training examples.
Use the obtained probability function for $c_i$ to calculate $P(X =f \mid Y = c_i)$ for the corresponding input value $f$.
Example
We want to predict whether a person has a cavity when they do not use mouth wash and take 20 grams of sugar every day.
We need to calculate $P(cavity \mid \hat{wash}, Sugar=20)$ and $P(\hat{cavity}\mid \hat{wash}, Sugar=20)$.
Person | Wash | Sugar (grams) | Cavity |
---|---|---|---|
P1 | no | 40 | yes |
P2 | no | 35 | yes |
P3 | yes | 60 | yes |
P4 | yes | 20 | no |
P5 | yes | 30 | no |
P6 | no | 17 | no |
Calculate $P(Sugar=20 \mid cavity)$:
- Using the Probability Density Function of the variable Sugar given cavity: $P(Sugar=20 \mid cavity) = N(20 \mid 45, 10.8) \approx 0.0025$
Calculate $P(Sugar=20 \mid \hat{cavity})$:
- Using the Probability Density Function of the variable Sugar given no cavity: $P(Sugar=20 \mid \hat{cavity}) = N(20 \mid 22.3, 5.6) \approx 0.0657$
Result
Predict whether a person has a cavity when they do not use mouth wash and take 20 grams of sugar every day.
$P(cavity \mid \hat{wash}, Sugar=20)
= \alpha \times \cfrac{3}{6} \times \cfrac{2}{3} \times0.0025 = 0.00083\alpha$
$P(\hat{cavity} \mid \hat{wash}, Sugar=20)
= \alpha \times \cfrac{3}{6} \times \cfrac{1}{3} \times0.0657 = 0.01095\alpha$
Predicted result: no cavity (since $0.01095\alpha > 0.00083\alpha$)
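The comparison itself is a one-liner, plugging in the densities computed above:

```python
# Categorical terms times Gaussian densities (alpha cancels in the comparison).
score_cavity    = (3/6) * (2/3) * 0.0025  # ~0.00083 (times alpha)
score_no_cavity = (3/6) * (1/3) * 0.0657  # ~0.01095 (times alpha)
print("cavity" if score_cavity > score_no_cavity else "no cavity")  # no cavity
```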
Advantages and disadvantages of naive Bayes
Advantages:
- Does very well in classification, especially on high-dimensional datasets
- Requires a small amount of training data; training is very quick
- Easy to build/implement
Disadvantages:
- Requires the assumption of independent input attributes
- For continuous input attributes, it assumes a certain probability distribution for input attributes
- Does not work very well for regression
Applications
To categorise emails, e.g., spam or not spam
To check whether a piece of text expresses positive or negative emotions (sentiment analysis)
To classify a news article about technology, politics, or sports
Software defect prediction
Medical diagnosis
…
Logic
Outline
- Propositional Logic
- First Order Logic
Propositional Logic
- A proposition is simply a statement, either true or false
- There is no concept of a degree of truth
- Propositional logic deals with logical relationships between propositions (statements, claims) taken as wholes
- Propositions are denoted by letters (e.g. P, Q); such letters are not variables, but propositional constants
P = “John is wearing a red coat”
Compound proposition
A compound proposition is constructed from individual propositions and connectives (logical operators)
P and Q (conjunction of two propositions, denoted by $\wedge$)
- John is wearing a red coat and he’s stolen a jeep
P or Q (disjunction of two propositions, denoted by $\vee$)
- John is at the library or he is studying
If P then Q (conditional of one proposition implying the other, denoted by $\rightarrow$)
- If John is at the library, then he is preparing for the exam
Not P (contradictory of the proposition, denoted by $\neg$)
- It is not the case that John is at the library
Connectives
If (A or B) then (C and not D)
- $(A \vee B) \rightarrow (C \wedge \neg D)$
Given the truth value of A, B, C, D, is this compound proposition true or false?
Truth table
A truth table shows how the truth or falsity of a compound statement depends on the truth or falsity of the individual statements from which it’s constructed.
P | Q | P $\wedge$ Q | P $\vee$ Q | P$\rightarrow$Q | $\neg$P |
---|---|---|---|---|---|
T | T | T | T | T | F |
T | F | F | T | F | F |
F | T | F | T | T | T |
F | F | F | F | T | T |
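A truth table like this can be generated by enumerating all truth assignments. In the sketch below, the material conditional $P \rightarrow Q$ is encoded as $\neg P \vee Q$:

```python
from itertools import product

print("P     Q     P∧Q   P∨Q   P→Q   ¬P")
for p, q in product([True, False], repeat=2):
    row = [p, q, p and q, p or q, (not p) or q, not p]
    print("  ".join(f"{'T' if v else 'F':<4}" for v in row))
```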
Inclusive/exclusive OR
Inclusive or: A triangle can be defined as a polygon with three sides or a polygon with three vertices
Exclusive or (denoted xor): The coin landed either heads or tails
P | Q | P $\vee$ Q | P $\oplus$ Q |
---|---|---|---|
T | T | T | F |
T | F | T | T |
F | T | T | T |
F | F | F | F |
Conditionals
A conditional can be thought of as an “if” statement
If I miss the bus, then I will be late for work
P | Q | P $\rightarrow$ Q |
---|---|---|
T | T | T |
T | F | F |
F | T | T |
F | F | T |
When is a conditional false?
If I study hard, then I will pass the test
P: I study hard; Q: I pass the test
- Case 1: P is true, Q is true
- I studied hard, I passed the test
- Case 2: P is true, Q is false
- I studied hard, I did not pass the test
- Case 3: P is false, Q is true
- I did not study hard, I passed the test
- Case 4: P is false, Q is false
- I did not study hard, I did not pass the test
P | Q | P $\rightarrow$ Q |
---|---|---|
T | T | T |
T | F | F |
F | T | T |
F | F | T |
The conditional is false only in Case 2: P is true but Q is false (the promise is broken).
Biconditionals
The biconditional (denoted P $\leftrightarrow$ Q) implies in both directions:
“(If P then Q) and (If Q then P)”
P iff Q (read “P if and only if Q”)
(During the day) The sun is shining iff there are no clouds covering the sun
If I have a pet goat, then my homework will be eaten (true)
If my homework is eaten, then I have a pet goat (not true)
P | Q | P $\leftrightarrow$ Q |
---|---|---|
T | T | T |
T | F | F |
F | T | F |
F | F | T |
Contradictories
Contradictory of P (¬P or not P): a claim that always has the opposite truth value of P
P: “John is at the movies”
$\neg$P can be read “it is not the case that John is at the movies”
Arguments
An argument is a set of propositions, called the premises, intended to determine the truth of another statement, called the conclusion
Multiple statements
Premises: Supporting statements
Conclusion: Statement that follows from premises
- $P \leftrightarrow Q$
- $\cfrac{Q}{\therefore P}$
P: The river is flooded; Q: It rained heavily
Validity
- An argument is valid if, whenever all of its premises are true, the conclusion must also be true
- P: I run a marathon, Q: I am exhausted; the argument “$P \rightarrow Q$, $P$, $\therefore Q$” is valid
Soundness
A sound argument is valid and all of its premises are true
An unsound argument lacks validity or has a false premise
Common arguments: Valid or not
$((P \rightarrow Q) \wedge Q) \therefore P$
- P implies Q and Q, therefore P
$((P \rightarrow Q) \wedge \neg Q) \therefore \neg P$
- P implies Q and not Q, therefore not P
$((P \rightarrow Q) \wedge (Q \rightarrow R)) \therefore P \rightarrow R$
- P implies Q and Q implies R, therefore P implies R
$(P \leftrightarrow Q) \therefore (P \wedge Q) \vee (\neg P \wedge \neg Q)$
- P iff Q, therefore (P and Q) or (not P and not Q)
Argument $((P \rightarrow Q) \wedge Q) \therefore P$
P | Q | $P \rightarrow Q$ | $(P \rightarrow Q) \wedge Q$ | $((P \rightarrow Q) \wedge Q) \rightarrow P$ |
---|---|---|---|---|
T | T | T | T | T |
T | F | F | F | T |
F | T | T | T | F |
F | F | T | F | T |
Not valid
Argument $((P \rightarrow Q) \wedge \neg Q) \therefore \neg P$
P | Q | $P \rightarrow Q$ | $\neg Q$ | $(P \rightarrow Q) \wedge \neg Q$ | $\neg P$ | $((P \rightarrow Q) \wedge \neg Q) \rightarrow \neg P$ |
---|---|---|---|---|---|---|
T | T | T | F | F | F | T |
T | F | F | T | F | F | T |
F | T | T | F | F | T | T |
F | F | T | T | T | T | T |
Valid
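Validity can also be checked mechanically: an argument form is valid iff “premises $\rightarrow$ conclusion” is true under every truth assignment. The `implies()` and `is_valid()` helpers below are illustrative:

```python
from itertools import product

def implies(a, b):
    return (not a) or b  # material conditional

def is_valid(argument):
    """Valid iff the argument formula is a tautology over P, Q."""
    return all(argument(p, q) for p, q in product([True, False], repeat=2))

# Affirming the consequent: ((P -> Q) and Q) -> P  -- not valid
print(is_valid(lambda p, q: implies(implies(p, q) and q, p)))            # False
# Modus tollens: ((P -> Q) and not Q) -> not P  -- valid
print(is_valid(lambda p, q: implies(implies(p, q) and (not q), not p)))  # True
```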
First Order Logic
First-order logic (aka predicate logic) is a generalisation of propositional logic
First-order logic is a powerful language that uses quantified variables over objects and allows the use of sentences that contain variables
First-order logic adds predicates and quantifiers to propositional logic
Predicates
“Socrates is a philosopher”
“Plato is a philosopher”
The predicate “is a philosopher” occurs in both sentences, which have a common structure “$x$ is a philosopher”, where “$x$” can be Socrates or Plato
Such a sentence can be represented as Predicate(term1, term2, …); for example, philosopher(𝑥), philosopher(Socrates)
First-order logic uses Predicate(term1, term2…) to represent basic sentences, where terms can be variables/constants
—
Birds fly:
- fly(birds)
John respects his parents:
- respect(John, parents)
x is greater than y:
- greater(x, y)
1 is greater than 2:
- greater(1, 2)
Connectives
We can use logic connectives on predicates
$philosopher(x) \rightarrow scholar(x)$
- “x is a philosopher” implies that “x is a scholar”
$respect(John, parents) \wedge respect(John, wife)$
- “John respects his parents” and “John respects his wife”
Quantifiers
A quantifier turns a sentence about something having some property into a sentence about the number (quantity) of things having the property
Universal quantifier is “for all” (i.e. “for every”)
- $\forall x \:\: P(x)$: “For all x, x is P”.
- $\forall x \:\: philosopher(x)$: “For all x, x is a philosopher”.
Existential quantifier is “for some” (i.e. “there exists”)
- $\exists x \:\: P(x)$: “For some x, x is P”.
- $\exists x \:\: philosopher(x)$: “For some x, x is a philosopher”.
Universal quantifier
- All men drink coffee
- $\forall x \:\: man(x) \rightarrow drink(x, coffee)$
- With the universal quantifier we use implication “$\rightarrow$”
Existential quantifier
- Some boys are intelligent
- $\exists x\:\: boys(x) \wedge intelligent(x)$
- With the existential quantifier we use conjunction “$\wedge$”
Examples
All philosophers are scholars
- For all x, where “x is a philosopher” then “x is a scholar”
- $\forall x\:\: philosopher(x) \rightarrow scholar(x)$
Every man respects his parents
- $\forall x\:\: man(x) \rightarrow respect(x, parents)$
Some boys play football
- $\exists x\:\: boy(x) \wedge play(x, football)$
Not all students like both mathematics and science
- $\neg \forall x\:\: [student(x) \rightarrow like(x, mathematics) \wedge like(x, science)]$
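Over a finite domain, the two quantifiers correspond to Python's `all()` and `any()`. The domain and predicate sets below are hypothetical illustrations:

```python
domain = ["Socrates", "Plato", "John"]
philosopher = {"Socrates", "Plato"}
scholar = {"Socrates", "Plato", "John"}

def implies(a, b):
    return (not a) or b

# forall x: philosopher(x) -> scholar(x)  ("all philosophers are scholars")
print(all(implies(x in philosopher, x in scholar) for x in domain))  # True
# exists x: philosopher(x) and scholar(x) ("some philosopher is a scholar")
print(any(x in philosopher and x in scholar for x in domain))        # True
```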