Introduction
In this page, I shall present a version of standard classical probability theory that I shall call CPT. Just as in the case of CPL in chapter 2, it is based on standard elementary algebra and English, and it has all the standard theorems of standard probability theory, on a slightly different new basis.
Here is an outline and preview with links:

Sections

1. A new set of axioms for probability (*)
2. Kolmogorov's standard axioms for probability are derived
3. Over 20 standard theorems of probability theory:
   A. Basic unconditional theorems
   B. Basic conditional theorems
   C. Basic theorems about irrelevance
4. How probability theory explains learning from experience: Confirmation, Undermining, Competition, Support
5. The utility of probability theory
You can take the first for granted and go straight to the second or third, while the fourth is the most interesting and involves some fundamental applications of the basic ideas of probability theory to reasoning and to learning from experience. Something much like this, and much more, is in G. Polya's two volumes on Plausible Reasoning.
In particular, in section 4 it will be shown that there are a number of important and intuitive principles of confirmation which are always used by people reasoning about matters of fact and which can be proved with the help of probability theory.
These principles are of four kinds and may be summarised as follows:

Confirmation: The probability of a theory increases as its consequences are verified.

Support: The probability of a theory increases as relevant circumstances are verified.

Competition: The probability of a theory increases as its competing theories are falsified.

Undermining: The probability of a theory decreases as its assumptions are falsified.
These principles are
then proved on the basis of what was established
in the earlier sections.
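To preview the Confirmation principle numerically, here is a small sketch (my illustration, with made-up numbers, not the author's): a theory that entails some evidence becomes more probable when that evidence is verified.

```python
# Worlds are pairs (theory true?, evidence true?). Since the theory entails
# the evidence, the world (True, False) gets probability 0.
# The weights are hypothetical numbers chosen for illustration.
weights = {(True, True): 0.2, (True, False): 0.0,
           (False, True): 0.3, (False, False): 0.5}

def pr(prop):
    """Absolute probability: sum of the weights of the worlds where prop holds."""
    return sum(w for world, w in weights.items() if prop(world))

def pr_given(p, q):
    """Conditional probability pr(p|q) = pr(p&q) : pr(q)."""
    return pr(lambda w: p(w) and q(w)) / pr(q)

theory = lambda w: w[0]
evidence = lambda w: w[1]

# Confirmation: verifying a consequence of the theory raises its probability,
# here from 0.2 to 0.4.
assert abs(pr(theory) - 0.2) < 1e-9
assert abs(pr_given(theory, evidence) - 0.4) < 1e-9
assert pr_given(theory, evidence) > pr(theory)
```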
All reasoning and mathematics in what follows is elementary, but some knowledge of propositional logic is presupposed, if not strictly necessary, since most formulas are (initially) given English readings.
1. New
axioms for probability theory
What I shall provide is a set of three new axioms that imply the standard axioms for probability of Kolmogorov. These new axioms make it easier to join probability theory to propositional logic than Kolmogorov's axioms do, and they are more elementary and simpler than his in several respects, as shall be shown. (*)
Here are the axioms, where all that is assumed about "pr(A)" is that it is equal to some number and is read as "the probability of A". This means that one must add syntactical rules to the following effect:
Notation:

"pr(A)" is "the probability of A"
"|-T A" is "A is a theorem of theory T"

CPT Syntax:

As for CPL, plus:

CPTpr(): If "A" is a proposition of CPL and "a" any number between 0 and 1 inclusive, "pr(A)=a" is a proposition of CPT.

CPT|-T: If "A" is a proposition of CPL and "T" a name for a set of statements of CPL, "|-T A" is a proposition of CPT.

This means that CPT is syntactically an extension of CPL: "pr(A)" refines [A] in that (as we shall prove) 0 <= pr(A) <= 1.
The notation "|-T" is introduced to facilitate the link to Kolmogorov's statement and to have a convenient abbreviation for "A is a theorem of theory T". Introducing it is not necessary, for "[A]=1 holds in theory T" or "pr(A)=1 holds in theory T" are taken to mean the same. Also, it is noteworthy that it does not follow that one can iterate either "|-T", as in "|-T(|-T A)", or "pr()", as in "pr(pr(A)=a)=b".
Now the semantical axioms for CPT are:

If A and B are any propositions in CPL:

AxA.  (|-A) > pr(A)=1                  Alternatively expressed:  [A]=1 > pr(A)=1

AxB.  (|-(A > B)) > pr(A) <= pr(B)     Alternatively expressed:  [A>B]=1 > pr(A) <= pr(B)

AxC.  pr(A) = pr(A&B) + pr(A&~B)

Here "|-A" formalizes the notion that "A is a theorem in the presumed theory", where "a theory" is "a set of assumptions added to the axioms of logic" and the "theorems of the theory" are all statements that can be deduced from the theory by the inference rules of a presumed logic, such as CPL.
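These axioms can be checked in a miniature numerical model (my illustration, not part of the original text): probabilities are assigned to the four state-descriptions of two atoms A and B, and pr() sums the weights of the state-descriptions where a proposition holds.

```python
# The four state-descriptions of two atoms A, B get probabilities summing to 1
# (a hypothetical model of the CPT axioms; the numbers are my own illustration).
weights = {(True, True): 0.3, (True, False): 0.2,
           (False, True): 0.4, (False, False): 0.1}

def pr(prop):
    """pr(prop): sum of the weights of the state-descriptions where prop holds."""
    return sum(w for (a, b), w in weights.items() if prop(a, b))

A = lambda a, b: a
B = lambda a, b: b
AandB = lambda a, b: a and b
AandnotB = lambda a, b: a and not b

# AxA: a theorem of the logic (here the tautology A V ~A) gets probability 1.
assert abs(pr(lambda a, b: a or not a) - 1) < 1e-9
# AxB: (A&B) > A is a theorem, so pr(A&B) <= pr(A); likewise for B.
assert pr(AandB) <= pr(A) and pr(AandB) <= pr(B)
# AxC: pr(A) = pr(A&B) + pr(A&~B).
assert abs(pr(A) - (pr(AandB) + pr(AandnotB))) < 1e-9
```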
Note that in what follows the reference to a theory T is abstracted from (though in any application this will be what one wants to find logical consequences from), and that therefore, while [A]=1 iff |-A is useful, the notation "|-T A" makes reference to a theory T, which "[A]=1" doesn't (though it could easily be added).
Also, it is noteworthy that the mere factual truth of A is not sufficient to make the hypotheses of AxA and AxB true: Indeed, what one normally wants is an assurance (and so a proof) that a given theory T does logically imply or fail to imply a certain proposition P, after which one has an external check on theory T, by finding whether the proposition P is in fact true or false.
I abstract from reference to theories to simplify and eliminate clutter, but it is useful to state a version with such references and provide readings, i.a. because this shows how neatly the axioms tie PT to PL in the present formulations:

If A and B are any propositions in CPL:

AxA.  (|-T A) > pr(A|T)=1                    Alternatively expressed:  [A|T]=1 > pr(A|T)=1

AxB.  (|-T (A > B)) > pr(A|T) <= pr(B|T)     Alternatively expressed:  [A>B|T]=1 > pr(A|T) <= pr(B|T)

AxC.  pr(A|T) = pr(A&B|T) + pr(A&~B|T)

Here is a reading with the various optional references to a supposed theory T (a sequence of statements of CPL) left out:

CPT axioms in words:

AxA:  A is a theorem only if the probability of A is 1.

AxB:  (A only if B) is a theorem only if the probability of A is less than or equal to the probability of B.

AxC:  The probability of A is the sum of the probabilities of (A and B) and (A and not-B).

It is from the formal statement of these axioms, dropping references to T, that we shall now derive Kolmogorov's axioms, which also do not explicitly refer to a theory that may be used in the hypotheses of their axioms.
2. The
proof of the standard Kolmogorov axioms for probability theory:
These standard Kolmogorov axioms for
probability are normally stated in such terms as:

Kolmogorov axioms for probability theory:

Suppose that $ is a set of propositions P, Q, R etc. and that this set is closed under negation, conjunction and disjunction, which is to say that whenever (P e $) and (Q e $), so are ~P, (P&Q) and (PVQ). Now we introduce pr(.) as a function that maps the propositions in $ into the real numbers in the following way, that is, satisfying the following three axioms:

A1.  For all P e $ the probability of P, written as pr(P), is some nonnegative real number.

A2.  If P is logically valid, pr(P)=1.

A3.  If ~(P&Q) is logically valid, pr(PVQ) = pr(P) + pr(Q).

In fact, we don't need the initial statement, since we simply presume CPL, which meets the specifications of the initial statement. What we do need is proofs of A1, A2 and A3. Here they come.
First, there is the fundamental theorem that permits inferences from logical equivalences to probabilities:

T*1:  |-(A iff B) > pr(A)=pr(B)    (Equivalent propositions have the same probability)

(1)  |-(A iff B) > |-(A > B) > pr(A) <= pr(B)    AxB
(2)  |-(A iff B) > |-(B > A) > pr(B) <= pr(A)    AxB
(3)  |-(A iff B) > pr(A) <= pr(B) & pr(B) <= pr(A)    (1), (2)
(4)  |-(A iff B) > pr(A) = pr(B)    (3), Algebra

Next, it is proved that contradictions have probability 0:

T*2:  pr(A&~A)=0    (Contradictory propositions have zero probability)

(1)  pr(A) = pr(A&A) + pr(A&~A)    AxC
(2)  pr(A) = pr(A&A)    T*1, with |-(A iff (A&A))
(3)  pr(A) = pr(A) + pr(A&~A)    (1), (2)
(4)  pr(A&~A) = 0    (3), Algebra

It is often helpful to have in propositional logic two special constants, such as Taut (from "tautology") and Contrad (from "contradiction"). These are defined as: Taut iff (AV~A) and Contrad iff (A&~A). Taking this for granted:
T*3:  0 <= pr(A) <= 1    (Probabilities are between 0 and 1 inclusive)

(1)  |-A > pr(A)=1    AxA
(2)  pr(Taut)=1    (1) and |-Taut
(3)  |-(A > Taut)    Logic
(4)  pr(A) <= pr(Taut)    (3), AxB
(5)  pr(A) <= 1    (2), (4)
(6)  pr(Contrad)=0    T*2
(7)  |-(Contrad > A)    Logic
(8)  pr(Contrad) <= pr(A)    (7), AxB
(9)  0 <= pr(A)    (6), (8)
(10)  0 <= pr(A) <= 1    (5), (9)

Next, we need to prove the probabilistic theorem for denial. We do it in two steps:

T*4:  pr(AV~A) = pr(A) + pr(~A)    (Probability of a disjunction of exclusives is the sum of the probabilities of the disjuncts)

(1)  pr(AV~A) = pr((AV~A)&A) + pr((AV~A)&~A)    AxC
(2)  pr(A) = pr((AV~A)&A)    T*1, as |-(((AV~A)&A) iff A)
(3)  pr(~A) = pr((AV~A)&~A)    T*1, as |-(((AV~A)&~A) iff ~A)
(4)  pr(AV~A) = pr(A) + pr(~A)    (1), (2), (3)

T*5:  pr(~A) = 1 - pr(A)    (Probability of denial is the complementary probability)

(1)  pr(AV~A) = pr(A) + pr(~A)    T*4
(2)  1 = pr(A) + pr(~A)    AxA, since |-(AV~A)
(3)  pr(~A) = 1 - pr(A)    (2), Algebra

Next, we have this parallel to AxA:

T*6:  (|-~A) > pr(A)=0    (Provable non-truths have zero probability)

(1)  |-~A    Assumption
(2)  pr(~A)=1    (1), AxA
(3)  1 - pr(A) = 1    (2), T*5
(4)  pr(A) = 0    (3), Algebra
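The theorems proved so far can be confirmed numerically in a small model (my own illustration, not the author's): weights are placed on the four state-descriptions of two atoms, and T*2/T*6, T*3 and T*5 are checked directly.

```python
# A hypothetical model: probabilities assigned to the four state-descriptions
# of two atoms; the particular numbers are my own choice for illustration.
weights = {(True, True): 0.15, (True, False): 0.35,
           (False, True): 0.05, (False, False): 0.45}

def pr(prop):
    """Sum the weights of the state-descriptions where prop holds."""
    return sum(w for (a, b), w in weights.items() if prop(a, b))

A = lambda a, b: a
notA = lambda a, b: not a
contrad = lambda a, b: a and not a   # A & ~A: holds in no state-description

# T*5: pr(~A) = 1 - pr(A)
assert abs(pr(notA) - (1 - pr(A))) < 1e-9
# T*2 / T*6: a provable non-truth (here a contradiction) has probability 0
assert pr(contrad) == 0
# T*3: all probabilities lie between 0 and 1 inclusive
for prop in (A, notA, contrad):
    assert 0 <= pr(prop) <= 1
```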

The main point of T*6 and AxA is that if one can prove that |-A (or |-~A), then it thereby follows that pr(A)=1 (or pr(A)=0 if |-~A). This is normally important in comparing the supposed truths and non-truths one can logically infer from a theory with what the facts are (so that if one can prove that |-T A, while in fact one finds ~A, one has thereby learned that the assumptions of theory T can't all be true, if the proof of |-T A was without mistakes in reasoning. Incidentally, this shows one should not define "|-T A" as "Nec A", with "Nec" the modality of necessary truth: That amounts to the presumption that T is true.)
Next, we need a theorem that serves as a lemma to the next theorem, but that needs a remark itself. The theorem is:

T*7:  pr(A&B) + pr(A&~B) + pr(~A&B) + pr(~A&~B) = 1    (Full disjunctive probabilistic sum of two factors)

(1)  pr(A) + pr(~A) = 1    T*5
(2)  pr(A&B) + pr(A&~B) + pr(~A&B) + pr(~A&~B) = 1    (1), AxC

The promised remark is that T*7 differs essentially from the similar theorem in CPL minus the probabilities: In CPL, [A&B]+[A&~B]+[~A&B]+[~A&~B]=1 is true and implies that precisely one of the four factors is true. In PT, pr(A&B)+pr(A&~B)+pr(~A&B)+pr(~A&~B)=1 is true, but normally none of the four alternatives is provably true by itself, and normally several or all of the alternatives will have a probability between 0 and 1 (conforming to T*3).
Indeed, a very
interesting aspect of PT is that it assigns numerical measures to all
alternatives the underlying logic can distinguish, regardless of
whether these alternatives are true or have ever been true. And part of
the interest is that there normally are far more logically possible
alternatives than logically provable alternatives.
To finish the proof that CPT indeed implies all of Kolmogorov's axioms for PT, we need to derive his A3:

T*8:  (|-~(A&B)) > pr(AVB) = pr(A) + pr(B)    (Conditional sums)

(1)  |-~(A&B)    Assumption
(2)  pr(A&B)=0    (1), T*6
(3)  pr(A) = pr(A&~B)    (2), AxC
(4)  pr(B) = pr(~A&B)    (2), AxC, T*1
(5)  pr(AVB) = 1 - pr(~A&~B)    T*5, T*1 with |-((~(~A&~B)) iff (AVB))
(6)  = pr(A&B) + pr(A&~B) + pr(~A&B)    T*7
(7)  = pr(A&~B) + pr(~A&B)    (2), (6)
(8)  = pr(A) + pr(B)    (3), (4), (7)

I have now proved all of Kolmogorov's axioms for the finite case: A1
follows from T*3; A2 is AxA; and A3 is T*8.
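The three derived Kolmogorov axioms can also be confirmed in a numerical model (my own sketch; the weights are hypothetical). For A3, the exclusive propositions chosen are A&B and A&~B, whose conjunction is a contradiction.

```python
# A hypothetical model over two atoms; the numbers are my own illustration.
weights = {(True, True): 0.25, (True, False): 0.25,
           (False, True): 0.3, (False, False): 0.2}

def pr(prop):
    """Sum the weights of the state-descriptions where prop holds."""
    return sum(w for (a, b), w in weights.items() if prop(a, b))

P = lambda a, b: a and b        # A&B
Q = lambda a, b: a and not b    # A&~B, so ~(P&Q) is logically valid
PorQ = lambda a, b: P(a, b) or Q(a, b)

assert pr(P) >= 0                                   # A1: nonnegativity
assert abs(pr(lambda a, b: b or not b) - 1) < 1e-9  # A2: the tautology BV~B
assert abs(pr(PorQ) - (pr(P) + pr(Q))) < 1e-9       # A3, i.e. T*8
```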
3. Some fundamental theorems of CPT
Irrespective of the axiomatization or interpretation of probability, there are a number of important theorems which we shall need, just as we need laws like (a+b)=(b+a) for counting, irrespective of the axioms used to prove them or of what we choose to count. The advantage and use of axioms is that one can use them to prove the theorems one needs; and having given a valid proof, one knows that any objection against the theorem must be directed against the axioms, for the theorem was proved to follow from them. So what we shall do first is to derive some useful theorems.
A. Basic
unconditional theorems
First, then, there is a group of theorems that the reader may derive from Kolmogorov's axioms (from which they do follow) and that I derived above from my axioms:

T1  pr(~P) = 1 - pr(P)    T*5
T2  0 <= pr(P) <= 1    T*3
T3  If P |- Q, then pr(P) <= pr(Q)    AxB
T4  If P is logically equivalent to Q, then pr(P) = pr(Q)    T*1
T5  pr(P) = pr(P&Q) + pr(P&~Q)    AxC
T6  pr(PVQ) = pr(P) + pr(Q) - pr(P&Q)    T*7, T5

These were all proved in section (4.1). We only add

T7  pr(P&Q) <= pr(P) <= pr(PVQ)

that is: The probability of a conjunction is not larger than the probability of any of its conjuncts, and the probability of a disjunction is not smaller than the probability of any of its disjuncts. It follows from T5 and T6, or from AxB and logic.
In what follows I'll state and prove
the most important theorems of elementary finite probability theory,
firstly because I have never seen this done properly in one paper,
secondly because it seems to me one of the cornerstones of human
reasoning, and thirdly to be able to show how we can learn from
experience using probability theory. (The last subject starts in
section 4.6. It deserves to be better known than it is, for it could
help to defuse, refute or ridicule much improbable nonsense that people
believe in.)
In what follows, proofs that refer to axioms refer to Kolmogorov's. Readers thoroughly familiar with elementary probability theory may choose to skip the rest of this chapter, but are advised to read the last sections, 4.11 and 4.12.
B.
Basic
conditional theorems
Most probabilities are not, as they were in this chapter so far, absolute, but conditional: Rather than saying "the probability of Q = x" we usually introduce a condition and say "the probability of Q, if P is true, = y". This idea, the probability of a proposition Q given that one or more propositions P1, P2 etc. are true, is formalised by the following important definition:

Definition 1:  pr(Q|P) = pr(P&Q) : pr(P)

That is: The conditional probability of Q, given or assumed that P is true, equals the probability that (P&Q) is true, divided by the probability that (P) is true. NB, as this fact has important implications for the interpretation and application of probability theory: A conditional probability is defined in terms of absolute probabilities, so we need absolute probabilities to establish conditional ones.
Definition 1 has many applications, and many of these turn on the fact that it also provides an implicit definition of pr(P&Q), namely as pr(P)pr(Q|P) (simply by multiplying both sides of Def 1 by pr(P)). Consequently, we have as a theorem (if pr(P)>0 and pr(Q)>0):
T8  pr(P&Q) = pr(P)pr(Q|P) = pr(Q)pr(P|Q)

The second equality is, of course, also an application of Def 1, and T8 accordingly says that the probability of a conjunction equals the probability of one conjunct times the probability of the other given that the one is true. Another consequence of Def 1 is

T9  pr(Q|P) + pr(~Q|P) = 1

which results from T5 and Def 1 upon division by pr(P), and says that the probability of Q if P plus the probability of ~Q if P equals 1. Of course, this admits of a statement like T1:

T10  pr(~Q|P) = 1 - pr(Q|P)

which shows that conditional probabilities are like unconditional ones. A theorem to the same effect, that parallels T2, is

T11  0 <= pr(Q|P) <= 1

That 0 <= pr(Q|P) follows from D1, because the components of a conditional probability are both >= 0 by A1; and that pr(Q|P) <= 1 is equivalent to pr(P&Q) <= pr(P), which holds by T7. A theorem in the vein of T4 is
T12  If P |- Q, then pr(P&~Q) = 0

This is proved by noting that if P |- Q holds, then so does |-~(P&~Q), which, by A3, entails that pr(PV~Q) = pr(P) + pr(~Q). As by T6 pr(PV~Q) = pr(P) + pr(~Q) - pr(P&~Q), it follows that pr(P&~Q)=0 if P |- Q. From this it easily follows that
T13  If P |- Q, then pr(Q|P) = 1, provided pr(P) > 0

which is to say that if Q is a logical consequence of P, the probability of Q if P is true is 1. The proviso is interesting, for it denies the possibility of inferring Q from a logical contradiction or known falsehood. This means that the definition P |- Q =df pr(Q|P)=1 strengthens the logical "|-" by adding that proviso. T13 immediately follows from T5, T12 and Def 1.
Def 1 may, of course, list any finite number of premises, as in pr(Q|P1&....&Pn) = pr(Q&P1&....&Pn) : pr(P1&....&Pn). Such long conjunctions admit of a theorem like T8:

T14  pr(P1&.....&Pn) = pr(P1)pr(P2|P1)pr(P3|P1&P2).....pr(Pn|P1&.....&Pn-1)

This says that the probability that n propositions are true equals the probability that the first (in any convenient order) is true, times the probability that the second is true if the first is true, times the probability that the third is true if the first and the second are true, etc. The pattern of proof can be seen by noting that for n=3, pr(P1)pr(P2|P1)pr(P3|P1&P2) = pr(P1&P2)pr(P3|P1&P2) = pr(P3&P2&P1), because the denominators successively drop out by Def 1. That the premises can be taken in any order is a consequence of T4: Conjuncts taken in any order are equivalent to the same conjuncts in any other order.
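The n=3 case of T14 can be verified numerically (my own sketch, with arbitrary hypothetical weights):

```python
from itertools import product

# Worlds assign truth values to three atoms; the weights are arbitrary
# positive numbers, normalized to sum to 1 (my illustration).
worlds = list(product([True, False], repeat=3))
raw = [1, 2, 3, 4, 5, 6, 7, 8]
weights = {w: r / sum(raw) for w, r in zip(worlds, raw)}

def pr(prop):
    """Sum the weights of the worlds where prop holds."""
    return sum(v for w, v in weights.items() if prop(w))

def pr_given(q, p):
    """Def 1: pr(q|p) = pr(p&q) : pr(p)."""
    return pr(lambda w: p(w) and q(w)) / pr(p)

P1 = lambda w: w[0]
P2 = lambda w: w[1]
P3 = lambda w: w[2]
P1andP2 = lambda w: w[0] and w[1]

# T14 for n=3: the conjunction unfolds into a chain of conditionals.
lhs = pr(lambda w: w[0] and w[1] and w[2])
rhs = pr(P1) * pr_given(P2, P1) * pr_given(P3, P1andP2)
assert abs(lhs - rhs) < 1e-9
```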
T11 and T13, together with T9 and T10, show that conditional probabilities are probabilities. We need just one further theorem:

T15  If R |- ~(P&Q), then pr(PVQ|R) = pr(P|R) + pr(Q|R)

which parallels A3. It is easily proved by noting that pr(PVQ|R) = (pr(P&R) + pr(Q&R) - pr(P&Q&R)) : pr(R) by Def 1, T4 and T6, and that pr(P&Q&R)=0 by T12 and T4 on the hypothesis. The conclusion then follows by Def 1.
C. Basic theorems about irrelevance
A second important concept which can now be defined is that of irrelevance. Two propositions P and Q are said to be (probabilistically) irrelevant, abbreviated PirrQ, if the following is true:

Def 2  PirrQ iff pr(P&Q) = pr(P)pr(Q)

Evidently, irrelevance is symmetric:

T16  PirrQ iff QirrP

But there are more interesting results.
Let's call a logically valid statement a tautology and a logically false statement a contradiction. Then we can say:

T17  Any proposition is irrelevant to any tautology and to any contradiction.

Note that this entails that tautologies are also mutually irrelevant. To prove T17, first suppose that P is a tautology. By A2, pr(P)=1. Since tautologies are logically entailed by any proposition, Q |- P, and so pr(Q&~P)=0 by T12. Consequently, pr(Q)=pr(Q&P) by T5, and so pr(P)pr(Q) = 1·pr(Q&P) = pr(P&Q), and we have irrelevance. Next, suppose (P) is a contradiction. If so, ~(P) is a tautology, and so pr(P)=0 by T1. By T7, pr(P&Q) <= pr(P), and as by A1 all probabilities are >= 0, it follows that pr(P&Q)=0. But then pr(P)pr(Q) = 0·pr(Q) = 0 = pr(P&Q), and again we have irrelevance.
Def 2 is often stated in two other forms, which are both slightly less general, as they require respectively that pr(P)>0, or that pr(P)>0 and pr(~P)>0, in both cases to prevent division by 0. Both alternative definitions depend on Def 1, and the first is given by

T18  If pr(P)>0, then PirrQ iff pr(Q|P) = pr(Q)

This is an immediate consequence of Defs 1 and 2. It states clearly the important property that irrelevance signifies: If P is irrelevant to Q, the fact that P is true does not alter anything about the probability that Q is true, and conversely, by T16, supposing that Q is not also a contradiction. So irrelevance of one proposition to another is always mutual, and means that the truth of the one makes no difference to the probability of the truth of the other.
This can again be stated in yet another form, with once again a slightly strengthened premise, for now it is required that both pr(P) and pr(~P) are > 0:

T19  If 0 < pr(P) < 1, then PirrQ iff pr(Q|P) = pr(Q|~P)

Suppose the hypothesis, which may be taken as meaning that P is an empirical proposition, is true. T19 may now be proved by noting the following: pr(Q|P) = pr(Q|~P) iff pr(Q&P):pr(P) = pr(Q&~P):(1 - pr(P)) iff pr(Q&P) - pr(P)pr(Q&P) = pr(P)pr(Q&~P) iff pr(Q&P) = pr(P)(pr(Q&P) + pr(Q&~P)) iff pr(Q&P) = pr(P)pr(Q).
Another important property of irrelevance is that if P and Q are irrelevant, then so are their denials:

T20  PirrQ iff (~P)irrQ iff Pirr(~Q) iff (~P)irr(~Q)

This too can be proved by noting series of equivalences that yield irrelevance. First consider pr(P&~Q), assuming PirrQ. Then pr(P&~Q) = pr(P) - pr(P&Q) = pr(P) - pr(P)pr(Q) = pr(P)(1 - pr(Q)) = pr(P)pr(~Q). So Pirr(~Q) if PirrQ. The converse can be proved by running the argument in reverse order, and so PirrQ iff Pirr(~Q). The other equivalences are proved similarly.
Finally, the concept of irrelevance, which so far has been used in an unconditional form, may be given a conditional form, when we want to say that P and Q are irrelevant if T is true:

Def 3  PirrQ|T iff pr(Q|T&P) = pr(Q|T)

This says that the probability that Q is true if T is true is just the same as when T and P are both true, i.e. P's truth makes no difference to Q's probability, if T is true. It should be noted that Def 3 requires that pr(T&P) > 0 (which makes pr(T) > 0), but that on this condition T18 shows that Def 3 is just a simple extension of Def 2. And as with Def 2 there is symmetry:

T21  PirrQ|T iff QirrP|T

For suppose PirrQ|T. By Def 3, pr(Q|T&P) = pr(Q|T) iff pr(Q&T&P):pr(T&P) = pr(Q&T):pr(T) by Def 1. This is so iff pr(Q&T&P):pr(Q&T) = pr(T&P):pr(T) iff pr(P|Q&T) = pr(P|T) iff QirrP|T by Def 3.
And this conditional irrelevance of Q from P if T does not only hold in case P is true, but also in case P is false. That is:

T22  PirrQ|T iff (~P)irrQ|T

For suppose PirrQ|T, i.e. pr(Q|T&P) = pr(Q|T). By Def 1 this is equivalent to pr(Q&T&P):pr(T&P) = pr(Q&T):pr(T) iff pr(Q&T&P) = pr(T&P)pr(Q&T):pr(T). Now pr(Q&T&P) = pr(Q&T) - pr(Q&T&~P), and so we obtain the equivalent pr(Q&T&~P) = pr(Q&T) - pr(T&P)pr(Q&T):pr(T) = pr(Q&T)(1 - pr(T&P):pr(T)) = pr(Q&T)((pr(T) - pr(T&P)) : pr(T)) = pr(Q&T)(pr(T&~P):pr(T)), from which we finally obtain, as equivalent to PirrQ|T, pr(Q&T&~P):pr(T&~P) = pr(Q&T):pr(T), which is by Def 3 the same as (~P)irrQ|T. Qed.
And finally, T21 and T22 yield the same result for conditional irrelevance as for irrelevance:

T23  PirrQ|T
(1)  iff QirrP|T    T21
(2)  iff (~P)irrQ|T    T22
(3)  iff Pirr(~Q)|T    T21, T22, (1)
(4)  iff (~P)irr(~Q)|T    (2)

The proof is: The first line is T21, the second T22. The third results thus: By both theorems, QirrP|T iff Pirr(~Q)|T, whence PirrQ|T iff Pirr(~Q)|T by (1). The fourth results from (2) by substituting (~Q) for Q. Qed.
