Stochastic Calculus (Lawler) — Chapter 1

Study notes for Lawler Stochastic Calculus (Chapter 1).

Section 1.1 — Conditional Expectation

Notation at a Glance

Symbol Meaning
\(\Omega\) Sample space — the set of all possible outcomes
\(\mathcal{F}\) \(\sigma\)-algebra — the collection of all observable events
\(P\) Probability measure — assigns numbers in \([0,1]\) to events in \(\mathcal{F}\)
\((\Omega, \mathcal{F}, P)\) Probability space — the full mathematical arena
\(A \in \mathcal{F}\) A is a measurable event (we can assign it a probability)
\(\Omega \setminus A\) Complement of A — all outcomes not in A
\(\mathcal{F}_n\) \(\sigma\)-algebra generated by \(X_1,\ldots,X_n\) — information at time n
\(\mathcal{F}_m \subseteq \mathcal{F}_n\) All events knowable at time m are also knowable at time n
\(E[Y]\) Unconditional expectation — best guess for Y with no information
\(E[Y \mid \mathcal{F}_n]\) Conditional expectation — best guess for Y given information \(\mathcal{F}_n\)
\(\mathbf{1}_A\) Function equal to 1 on event A, 0 elsewhere
i.i.d. Independent and identically distributed
\(\mu \ll P\) \(\mu\) absolutely continuous w.r.t. \(P\) — every P-null set is also a \(\mu\)-null set
\(S_n\) Partial sum \(X_1 + X_2 + \cdots + X_n\)

Part 1 — The Core Intuition

Conditional expectation is the central object of stochastic calculus. At its core it answers one question: given that we have observed some (but not all) information, what is our best guess for a random variable $Y$? The answer is not a single number but another random variable — one that changes as our information changes.

CORE IDEAS

\(E[Y]\) is the best guess for \(Y\) given no information at all. The unconditional expectation is the baseline: if you know nothing about the outcome of an experiment, your single best guess (in the mean-squared-error sense) is \(E[Y]\).

\(E[Y \mid \mathcal{F}_n]\) is the best guess for Y given the information \(\mathcal{F}_n\). As data arrives one variable at a time — \(X_1, X_2, \ldots, X_n\) — we collect more information. The conditional expectation updates our best guess for Y using whatever is currently known.

\(E[Y \mid \mathcal{F}_n]\) is itself a random variable, not a fixed number. Because it depends on the observed values of \(X_1, \ldots, X_n\) — which are random — it is a function of those observations, hence random.

The formal definition bypasses explicit computation via one key property. \(E[Y \mid \mathcal{F}_n]\) is defined as the unique \(\mathcal{F}_n\)-measurable random variable satisfying \(E[E[Y \mid \mathcal{F}_n] \cdot \mathbf{1}_A] = E[Y \cdot \mathbf{1}_A]\) for all \(\mathcal{F}_n\)-measurable events A.

⚠️Common Misconception
Wrong: "$E[Y \mid \mathcal{F}_n]$ is just a number, like $E[Y]$ but computed with less data."
Correct: $E[Y \mid \mathcal{F}_n]$ is a random variable. Its value changes depending on which values $X_1, \ldots, X_n$ take. If you observe different data, you get a different conditional expectation. $E[Y]$ is the special case where zero data is observed — a single fixed number. $E[Y \mid \mathcal{F}_n]$ is a whole function of the observations.

Part 2 — The Probability Space

Before defining conditional expectation rigorously, we need the underlying mathematical arena: a probability space $(\Omega, \mathcal{F}, P)$. Everything — random variables, events, filtrations — lives inside this structure.

We assume that the random variables \(X_1, X_2, \ldots\) are defined on a probability space \((\Omega, \mathcal{F}, P)\). Here \(\mathcal{F}\) is a \(\sigma\)-algebra or \(\sigma\)-field of subsets of \(\Omega\), that is, a collection of subsets satisfying:

i. \(\emptyset \in \mathcal{F}\)
ii. \(A \in \mathcal{F}\) implies \(\Omega \setminus A \in \mathcal{F}\)
iii. \(A_1, A_2, \ldots \in \mathcal{F}\) implies \(\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}\).

This defines the three \(\sigma\)-algebra axioms. (i) The empty set — the impossible event — must be an event. (ii) If A is observable, so is its complement: “A did not happen” must also be observable. (iii) Countable unions of events are events: “at least one of \(A_1, A_2, \ldots\) happened” is observable. These three rules make \(\mathcal{F}\) a self-consistent collection of questions we can ask about the experiment.

Minimal example: two fair coin flips

Sample space: \(\Omega = \{HH, HT, TH, TT\}\)

The full \(\sigma\)-algebra \(\mathcal{F} = 2^\Omega\) contains all 16 subsets. This always satisfies all three axioms and encodes complete information.

A smaller valid \(\sigma\)-algebra encoding only the first flip:

\[\mathcal{F}_1 = \{ \emptyset, \{HH, HT\}, \{TH, TT\}, \Omega \}\]
  • \(\emptyset \in \mathcal{F}_1\) ✓  (axiom i)
  • \(\{HH,HT\}^c = \{TH,TT\} \in \mathcal{F}_1\) ✓  (axiom ii)
  • Any union of these four sets stays in \(\mathcal{F}_1\) ✓  (axiom iii)

\(\mathcal{F}_1\) lets us distinguish “first flip = H” from “first flip = T” — nothing more.

Key point: A $\sigma$-algebra is a mathematical encoding of a state of knowledge. Larger $\sigma$-algebra = more information. $\mathcal{F}_1 \subseteq 2^\Omega$ reflects that knowing one flip is strictly less information than knowing both.

Part 3 — The Filtration

A filtration is a growing sequence of σ-algebras modelling information accumulating over time. At time $n$, $F_n$ records everything observed up to time $n$ — and once something is known it is never forgotten.
Let $X_1, X_2, \ldots$ be random variables which we think of as a time series with the data arriving one at a time. At time $n$ we have viewed the values $X_1, \ldots, X_n$. … We will write $\mathcal{F}_n$ for 'the information contained in $X_1, \ldots, X_n$.

\(\mathcal{F}_n\) is not a number or a random variable — it is a \(\sigma\)-algebra: a collection of events. “Information in \(X_1,\ldots,X_n\)” means all questions of the form “did the observations satisfy some condition?” that can be answered once \(X_1,\ldots,X_n\) are known.

“The information \(\mathcal{F}_n\) is the smallest sub \(\sigma\)-algebra \(G\) of \(\mathcal{F}\) such that \(X_1, \ldots, X_n\) are \(G\)-measurable. The latter statement means that for all \(t \in \mathbb{R}\), the event \(\{X_j \leq t\} \in \mathcal{F}_n\).”

”\(X_j\) is \(\mathcal{F}_n\)-measurable” means: for every threshold \(t\), the set of outcomes where \(X_j \leq t\) is an event in \(\mathcal{F}_n\). Knowing \(X_1,\ldots,X_n\) is enough to determine \(X_j\). The word “smallest” is important — we take only events genuinely required to describe \(X_1,\ldots,X_n\), not any extras.

⚠️Common Misconception
Wrong: "$X$ is $\mathcal{F}_n$-measurable just means $X$ is one of the variables $X_1,\ldots,X_n$."
Correct: Any function of $X_1,\ldots,X_n$ is also $\mathcal{F}_n$-measurable — e.g., $X_1 + X_2$, $\max(X_1,\ldots,X_n)$, or $S_n^2$. The condition is that X's value is fully determined once $X_1,\ldots,X_n$ are known, not that X appears explicitly in the list.
Example of a Measurable random variable on a finite space

Let \(\Omega=\{\omega_1,\omega_2,\omega_3\}\) and define the \(\sigma\)-algebra \(\mathcal{F}=\{\emptyset, \Omega, \{\omega_1,\omega_2\}, \{\omega_3\}\}\).

Define the random variable \(X(\omega_1)=0,\; X(\omega_2)=0,\; X(\omega_3)=1\).

To check measurability, verify that \(\{X\le t\}\in\mathcal{F}\quad \forall t\in\mathbb{R}\).

For example:

\(\{X\le 0\}=\{\omega_1,\omega_2\}\in\mathcal{F}\),

\(\{X\le 0.5\}=\{\omega_1,\omega_2\}\in\mathcal{F}\),

\(\{X\le 2\}=\Omega\in\mathcal{F}\).

Hence \(X\) is \(\mathcal{F}\)-measurable.

Key point: A random variable is measurable if the information structure $\mathcal{F}$ can distinguish exactly the events needed to determine its value. Here, $\mathcal{F}$ separates $\omega_3$ from $\{\omega_1,\omega_2\}$, which is exactly what $X$ depends on.
Example of a non-measurable function

Let \(\Omega=\{\omega_1,\omega_2,\omega_3\}\) and consider the same \(\sigma\)-algebra \(\mathcal{F}=\{\emptyset,\Omega,\{\omega_1,\omega_2\},\{\omega_3\}\}\)

Define the function \(Y(\omega_1)=0,\; Y(\omega_2)=1,\; Y(\omega_3)=0\).

Now check measurability.

Consider the event \(\{Y\le 0\}=\{\omega_1,\omega_3\}\).

But \(\{\omega_1,\omega_3\}\notin\mathcal{F}\).

Hence \(Y\) is not \(\mathcal{F}\)-measurable.

Key point: The $\sigma$-algebra $\mathcal{F}$ cannot distinguish $\omega_1$ from $\omega_2$ in a way consistent with $Y$. Since $Y$ separates these outcomes differently while $\mathcal{F}$ does not, the function is “too fine” for the available information structure.

Part 4 — Motivating the Definition (Density Case)

"Suppose that $(X, Y)$ have a joint density $f(x, y)$, $0 < x, y < \infty$, with marginal densities $f(x) = \int f(x,y)\,dy$, $g(y) = \int f(x,y)\,dx$. The conditional density $f(y\mid x)$ is defined by $f(y\mid x) = \frac{f(x,y)}{f(x)}$."

\(f(x,y)\) is the joint probability density near \((x,y)\). \(f(x)\) is the marginal density of \(X\) — the probability near \(x\) regardless of \(Y\), obtained by integrating out all \(y\)-values. \(f(y\mid x)\) is the relative weight on each \(y\)-value given \(X = x\). Division by \(f(x)\) normalises so that \(f(y\mid x)\) integrates to 1 over \(y\).

This gives the familiar undergraduate formula:

\[E[Y \mid X = x] = \int_{-\infty}^{\infty} y\, f(y \mid x)\, dy = \frac{\int_{-\infty}^{\infty} y\, f(x,y)\, dy}{f(x)}\]

Note that \(E[Y \mid X]\) is a random variable which is determined by the value of the random variable \(X\).

\(E[Y \mid X = x]\) for a fixed \(x\) is a number. But \(E[Y \mid X]\) — without fixing \(x\) — is a function of \(X\). Since \(X\) is random, this function is itself random. This is the key conceptual leap the formal definition must capture.

The tower property emerges naturally:

\[E\bigl[E[Y \mid X]\bigr] = \int_{-\infty}^{\infty} E[Y \mid X = x]\, f(x)\, dx = \iint y\, f(x,y)\, dy\, dx = E[Y]\]

Averaging the conditional best-guess over all possible observations recovers the unconditional expectation. This is the tower property, generalised by the abstract definition below.


Part 5 — The Formal Definition

The conditional expectation $E[Y \mid \mathcal{F}_n]$ is the unique random variable satisfying the following.
(i) $E[Y \mid \mathcal{F}_n]$ is $\mathcal{F}_n$-measurable.
(ii) For every $\mathcal{F}_n$-measurable event A, $E[E[Y \mid \mathcal{F}_n] \cdot \mathbf{1}_A] = E[Y \cdot \mathbf{1}_A]$.

Condition (i): The output is computable from \(X_1,\ldots,X_n\) alone — it cannot use future information.

Condition (ii): On every slice of \(\Omega\) that can be identified from current information (any \(A \in \mathcal{F}_n\)), the average of \(E[Y \mid \mathcal{F}_n]\) over that slice equals the average of Y over the same slice.

These two conditions uniquely determine \(E[Y \mid \mathcal{F}_n]\) without requiring a closed-form formula.

What is \(\mathbf{1}_A\)?

\[\mathbf{1}_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A \end{cases}\]

So \(E[Z \cdot \mathbf{1}_A]\) = probability-weighted average of Z over outcomes where A occurs = \(\int_A Z \, dP\).

Why not give an explicit formula? In general probability spaces (uncountable \(\Omega\), continuous distributions), no single formula works universally. The two-condition characterisation is both sufficient and rigorous. Existence follows from the Radon-Nikodym theorem; uniqueness from: if \(Z_1\) and \(Z_2\) both satisfy the conditions then \(E[(Z_1 - Z_2)^2] = 0\), so \(Z_1 = Z_2\) almost surely.

⚠️Common Misconception
Wrong: Condition (ii) $E[E[Y \mid \mathcal{F}_n] \cdot \mathbf{1}_A] = E[Y \cdot \mathbf{1}_A]$ is just saying $E[Y \mid \mathcal{F}_n] = Y$
Correct: It says they agree on average over every observable event A — not pointwise. $E[Y \mid \mathcal{F}_n]$ is a smoothed version of $Y$: it preserves the same probability mass on every $\mathcal{F}_n$-identifiable slice, but replaces $Y$'s within-slice variation with a single average value. The pointwise equality $E[Y \mid \mathcal{F}_n](\omega) = Y(\omega)$ only holds when $Y$ is itself $\mathcal{F}_n$-measurable.

Part 6 — Properties of Conditional Expectation

Proposition 1.1.1: Suppose \(X_1, X_2, \ldots\) is a sequence of random variables and \(\mathcal{F}_n\) denotes the information at time n. The conditional expectation \(E[Y \mid \mathcal{F}_n]\) satisfies the following properties.

Property 1 — If Y is already known, conditioning changes nothing

\[\text{If } Y \text{ is } \mathcal{F}_n\text{-measurable}, \quad E[Y \mid \mathcal{F}_n] = Y.\]

Why: Y is determined by \(X_1,\ldots,X_n\). No uncertainty remains — the best guess is the value itself.

Property 2 — Tower property (law of iterated expectations)

\[m < n \implies E\bigl[\,E[Y \mid \mathcal{F}_n]\;\big|\;\mathcal{F}_m\bigr] = E[Y \mid \mathcal{F}_m].\]

Why: Compute the best guess for \(Y\) using information up to time \(n\), then downgrade it to only use information up to time \(m < n\). The result is identical to computing the best guess with time \(m\) information directly.

Special case \(m = 0\): \(E[E[Y \mid \mathcal{F}_n]] = E[Y]\).

Property 3 — Independence means conditioning is useless

\[X_1,\dots,X_n \perp Y \implies E[Y \mid \mathcal{F}_n] = E[Y].\]

Why: If the observations carry zero information about \(Y\), the best guess remains the unconditional mean.

Property 4 — Linearity

\[E[aY + bZ \mid \mathcal{F}_n] = a\,E[Y \mid \mathcal{F}_n] + b\,E[Z \mid \mathcal{F}_n].\]

Why: Conditional expectation is an integral, and integrals are linear.

Property 5 — Known factors pull out (constants rule)

\[Z \text{ is } \mathcal{F}_n\text{-measurable} \implies E[YZ \mid \mathcal{F}_n] = Z\cdot E[Y \mid \mathcal{F}_n].\]

Why: Z is already determined by current information — it plays the role of a known constant. Only \(Y\) carries residual randomness to average over.

⚠️Common Misconception — Tower Property
Wrong: "$E[E[Y \mid \mathcal{F}_n] \mid \mathcal{F}_m] = E[Y \mid \mathcal{F}_n]$ — the inner conditioning dominates because it has more information."
Correct: The outer conditioning dominates: the result is $E[Y \mid \mathcal{F}_m]$. When you condition on less information ($\mathcal{F}_m \subseteq \mathcal{F}_n$), you lose the fine detail provided by $\mathcal{F}_n$. The coarser $\sigma$-algebra always wins. Think of it as: the final answer can only use what the outermost conditioning permits.

Part 7 — Worked Examples

Example 1.1.1 — $E[S_n \mid \mathcal{F}_m]$ for independent increments

Setup: \(X_1, X_2, \ldots\) independent with \(E[X_j] = \mu\). Let \(S_n = X_1 + \cdots + X_n\), \(\mathcal{F}_m = \sigma(X_1,\ldots,X_m)\), \(m < n\).

\[E[S_n \mid \mathcal{F}_m] = E[S_m \mid \mathcal{F}_m] + E[S_n - S_m \mid \mathcal{F}_m]\]

\(S_m = X_1+\cdots+X_m\) is fully determined by \(X_1,\ldots,X_m\), so it is \(\mathcal{F}_m\)-measurable:

\[E[S_m \mid \mathcal{F}_m] = S_m\] \[S_n - S_m = X_{m+1}+\cdots+X_n\] \[E[S_n - S_m \mid \mathcal{F}_m] = E[S_n - S_m] = (n-m)\mu\]

\(E[S_n \mid \mathcal{F}_m] = S_m + (n-m)\mu\)

Example 1.1.2 — $E[S_n^2 \mid \mathcal{F}_m]$ for zero-mean increments

Setup: \(\mu = 0\), \(E[X_j^2] = \sigma^2 < \infty\). Same \(S_n\), \(\mathcal{F}_m\) as above.

Write \(S_n = S_m + (S_n - S_m)\) and expand the square:

\[E[S_n^2 \mid \mathcal{F}_m] = E[S_m^2 \mid \mathcal{F}_m] + 2 E[S_m(S_n-S_m) \mid \mathcal{F}_m] + E[(S_n-S_m)^2 \mid \mathcal{F}_m]\]

Term 1: \(S_m^2\) is \(\\mathcal{F}_m\)-measurable so \(E[S_m^2 \mid \mathcal{F}_m] = S_m^2\)

Term 2: \(S_m\) pulls out (\(\mathcal{F}_m\)-measurable); \(S_n-S_m\) is independent of \(\mathcal{F}_m\) with mean 0:

\[2 E[S_m(S_n-S_m) \mid \mathcal{F}_m] = 2 S_m \cdot E[S_n-S_m] = 2 S_m \cdot 0 = 0\]

Term 3: \((S_n-S_m)^2\) is independent of \(\mathcal{F}_m\); its expectation is the variance of \((n-m)\) increments:

\[E[(S_n-S_m)^2 \mid \mathcal{F}_m] = \text{Var}(S_n-S_m) = (n-m)\sigma^2\]

Combine Terms 1, 2, 3:

\[E[S_n^2 \mid \mathcal{F}_m] = S_m^2 + (n-m)\sigma^2\]

Part 8 — Filtration (Formal Definition)

"Definition If $X_1, X_2, \ldots$ is a sequence of random variables, then the associated (discrete time) filtration is the collection $\{\mathcal{F}_n\}$ where $\mathcal{F}_n$ denotes the information in $X_1, \ldots, X_n$. One assumption in the definition of a filtration, which may sometimes not reflect reality, is that information is never lost. If m < n, then everything known at time m is still known at time n."

The key mathematical consequence of “information is never lost” is the set inclusion \(\mathcal{F}_m \subseteq \mathcal{F}_n\) for all \(m < n\). Every event in \(\mathcal{F}_m\) is also in \(\mathcal{F}_n\). This is not a philosophical claim — it is a precise constraint that the sequence of \(\sigma\)-algebras must satisfy to qualify as a filtration.

Formally: A filtration is an increasing sequence of \(\sigma\)-algebras:

\[\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}\]

\(\mathcal{F}_0 = \{\emptyset, \Omega\}\) — the trivial \(\sigma\)-algebra, representing the state before any observation.


Term Glossary

Probability space $(\Omega, \mathcal{F}, P)$ Definition
The triple: $\Omega$ = sample space (all outcomes); $\mathcal{F}$ = $\sigma$-algebra (observable events); $P : \mathcal{F} \to [0,1]$ probability measure with $P(\Omega) = 1$ and countable additivity. All random variables and stochastic processes in this book live on such a triple.
$\sigma$-algebra Definition
A collection $\mathcal{F}$ of subsets of $\Omega$ closed under complementation and countable unions, containing $\emptyset$. Encodes "which events are distinguishable." The trivial $\sigma$-algebra $\{\emptyset, \Omega\}$ encodes no information; the power set $2^\Omega$ encodes complete information.
Filtration $\{\mathcal{F}_n\}$ Definition
An increasing sequence $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$ modelling accumulating information. $\mathcal{F}_n = \sigma(X_1,\ldots,X_n)$ — the smallest $\sigma$-algebra making $X_1,\ldots,X_n$ measurable. Information never shrinks: $m < n$ implies $\mathcal{F}_m \subseteq \mathcal{F}_n$.
$\mathcal{F}_n$-measurable Measurability
A random variable Z is $\mathcal{F}_n$-measurable if $\{Z \leq t\} \in \mathcal{F}_n$ for every $t \in \mathbb{R}$. Equivalently: $Z = \varphi(X_1,\ldots,X_n)$ for some measurable $\varphi$. Z's value is fully determined by the first n observations.
Conditional expectation $E[Y \mid \mathcal{F}_n]$ Definition
The unique $\mathcal{F}_n$-measurable random variable satisfying $E[E[Y \mid \mathcal{F}_n]\cdot\mathbf{1}_A] = E[Y\cdot\mathbf{1}_A]$ for all $A \in \mathcal{F}_n$. The minimum-MSE predictor of Y given information $\mathcal{F}_n$. A random variable — not a number — because its value depends on the observations $X_1,\ldots,X_n$.
Indicator function $\mathbf{1}_A$ Notation
$\mathbf{1}_A(\omega) = 1$ if $\omega \in A$, else $0$. So $E[Z\cdot\mathbf{1}_A] = \int_A Z \, dP$ = the probability-weighted average of Z over outcomes where A occurs. Central to the formal two-condition definition of conditional expectation.
Tower property Property
$E[E[Y \mid \mathcal{F}_n] \mid \mathcal{F}_m] = E[Y \mid \mathcal{F}_m]$ for $m < n$. The outer (coarser) conditioning always governs. Special case: $E[E[Y \mid \mathcal{F}_n]] = E[Y]$. Proved using the defining property of conditional expectation and the inclusion $\mathcal{F}_m \subseteq \mathcal{F}_n$.
Constants rule (pull-out property) Property
If Z is $\mathcal{F}_n$-measurable then $E[YZ \mid \mathcal{F}_n] = Z\cdot E[Y \mid \mathcal{F}_n]$. Z behaves as a known constant. Proved first for $Z = \mathbf{1}_A$ ($A \in \mathcal{F}_n$) using the definition, extended to simple random variables by linearity, then to general Z by monotone convergence.
Radon-Nikodym theorem Theorem
Guarantees existence of conditional expectation. The function $\mu(A) = E[Y\cdot\mathbf{1}_A]$ is a signed measure on $(\Omega, \mathcal{F}_n, P)$ with $\mu \ll P$. By Radon-Nikodym, there exists an $\mathcal{F}_n$-measurable Z with $\mu(A) = E[Z\cdot\mathbf{1}_A]$ for all $A \in \mathcal{F}_n$. This Z is $E[Y \mid \mathcal{F}_n]$.

Section 1.2 — Martingales

Notation at a Glance

Symbol Meaning
\(M_n\) The martingale process at time \(n\)
\(\mathcal{F}_n\) Filtration at time \(n\) — information in \(M_0, M_1, \ldots, M_n\)
\(E[M_n \mid \mathcal{F}_m] = M_m\) The defining martingale condition for \(m < n\)
\(\Delta M_n = M_n - M_{n-1}\) Increment of the martingale at step \(n\)
\(B_n\) The “bet” at time \(n\) — an \(\mathcal{F}_{n-1}\)-measurable process
\(W_n = \sum_{j=1}^n B_j \Delta M_j\) Winnings under a betting strategy — the discrete stochastic integral
\(S_n = X_1 + \cdots + X_n\) Partial sum of i.i.d. mean-zero random variables
\(A_n = \sigma_1^2 + \cdots + \sigma_n^2\) Cumulative variance (predictable compensator)
\(E[M_n \mid \mathcal{F}_m] \geq M_m\) Submartingale condition
\(E[M_n \mid \mathcal{F}_m] \leq M_m\) Supermartingale condition
\(\mathcal{F}_{n-1}\)-measurable \(B_n\) is known before time \(n\) — the “non-anticipating” condition

Part 1 — The Core Intuition

A martingale is the mathematical model of a fair game. At every moment, no matter what has happened so far, the expected future value of the process equals its current value.

CORE IDEAS

A martingale models a fair game. If \(M_n\) represents cumulative winnings, “fair” means: regardless of the history of play, the expected winnings at any future time equal the current winnings. No strategy can give you a systematic advantage over a martingale in finite time.

The martingale condition is a statement about conditional expectations. The defining equation \(E[M_n \mid \mathcal{F}_m] = M_m\) for \(m < n\) is a direct application of §1.1: given everything observed up to time \(m\), the best prediction of \(M_n\) is simply \(M_m\) itself.

To verify the martingale property it suffices to check one step at a time (The One-Step Criterion). Rather than checking \(E[M_n \mid \mathcal{F}_m] = M_m\) for all pairs \(m < n\), it is enough to verify \(E[M_{n+1} \mid \mathcal{F}_n] = M_n\) for every \(n\). The tower property of §1.1 propagates this to all future times.</div>

A martingale has constant expected value. Taking the full expectation: $\mathbb{E}[M_n] = \mathbb{E}[E[M_n \mid \mathcal{F}_0]] = \mathbb{E}[M_0]$ for all $n$. The mean is time-invariant — a necessary (but not sufficient) condition for fairness.


Part 2 — The Formal Definition

Suppose \(X_1, X_2, \ldots\) is a sequence of random variables to which we associate the filtration \(\{\mathcal{F}_n\}\) where \(\mathcal{F}_n\) is the information contained in \(X_1, \ldots, X_n\).

A sequence of random variables \(M_0, M_1, \ldots\) is called a martingale with respect to the filtration \(\{\mathcal{F}_n\}\) if:

(i) For each \(n\), \(M_n\) is an \(\mathcal{F}_n\)-measurable random variable with \(E[\lvert M_n \rvert] < \infty\).

(ii) If \(m < n\), then

\(E[M_n \mid \mathcal{F}_m] = M_m. \tag{1.4}\)

Unpacking condition (i):

  • \(\mathcal{F}_n\)-measurable: the value of \(M_n\) is fully determined by the information available at time \(n\). It does not look into the future.
  • \(E[\lvert M_n \rvert] < \infty\): integrability — the process cannot take infinitely large values on average. This is needed for the conditional expectation \(E[M_n \mid \mathcal{F}_m]\) to be well-defined.

Unpacking condition (ii): Given everything observed up to time \(m\), the best prediction of \(M_n\) at any later time \(n > m\) is simply the current value \(M_m\). The process has no predictable drift in either direction.

Equivalent one-step form: It suffices to check $E[M_{n+1} \mid \mathcal{F}_n] = M_n$ for every $n \geq 0$. The tower property then gives $E[M_{n+2} \mid \mathcal{F}_n] = E[E[M_{n+2} \mid \mathcal{F}_{n+1}] \mid \mathcal{F}_n] = E[M_{n+1} \mid \mathcal{F}_n] = M_n$, and so on for all future times.

Part 3 — Sub- and Supermartingales

If the condition of martingale (\(E[M_n \mid \mathcal{F}_m] = M_m\)) is replaced with \(E[M_n \mid \mathcal{F}_m] \geq M_m\), then the process is called a submartingale. If it is replaced with \(E[M_n \mid \mathcal{F}_m] \leq M_m\), then it is called a supermartingale.

Process Condition Interpretation
Martingale \(E[M_n \mid \mathcal{F}_m] = M_m\) Fair game — no systematic gain or loss
Submartingale \(E[M_n \mid \mathcal{F}_m] \geq M_m\) Favourable game — expected value grows over time
Supermartingale \(E[M_n \mid \mathcal{F}_m] \leq M_m\) Unfavourable game — expected value shrinks over time

In other words, games that are always in one’s favour are submartingales and games that are always against one are supermartingales. (At most games in Las Vegas, one’s winnings give a supermartingale.)

⚠️Common Misconception — Sub vs Super
Wrong: "A supermartingale is 'super' — it grows faster than a martingale."
Correct: The naming is counterintuitive. A supermartingale has $E[M_n \mid \mathcal{F}_m] \leq M_m$ — its expected value decreases over time (unfavourable game). A submartingale has $E[M_n \mid \mathcal{F}_m] \geq M_m$ — its expected value increases (favourable game). The terminology is inherited from the analogy with superharmonic and subharmonic functions, not from a comparison of growth rates. A martingale is simultaneously both a submartingale and a supermartingale.

Part 4 — The Discrete Stochastic Integral

Suppose that \(M_0, M_1, \ldots\) is a martingale with respect to the filtration \(\mathcal{F}_n\) . For \(n \geq 1\), let \(\Delta M_n = M_n - M_{n-1}\). Let \(B_j\) denote the ‘bet’ on the \(j\)th game. We allow negative values of \(B_j\) which indicate betting that the price will go down or the game will be lost. Let \(W_n\) denote the winnings in this strategy: \(W_0 = 0\) and for \(n \geq 1\),

\(W_n = \sum_{j=1}^n B_j [M_j - M_{j-1}] = \sum_{j=1}^n B_j \Delta M_j.\)”

Three conditions on the betting strategy \(B_n\):

  1. Boundedness: \(\lvert B_n \rvert \leq K_n < \infty\) for some finite constant \(K_n\) — bets cannot be infinite.
  2. Non-anticipating (predictability): \(B_n\) is \(\mathcal{F}_{n-1}\)-measurable — the bet at time \(n\) can only use information from before time \(n\). You cannot see the outcome before placing the bet.
  3. Integrability: The bound above ensures \(E[\lvert W_n \rvert] < \infty\).

\(W_n\) is a martingale — verification:

\[E[W_{n+1} \mid \mathcal{F}_n] = E[W_n + B_{n+1}(M_{n+1} - M_n) \mid \mathcal{F}_n].\]

Since \(W_n\) is \(\mathcal{F}_n\)-measurable, \(E[W_n \mid \mathcal{F}_n] = W_n\). Also, since \(B_{n+1}\) is \(\mathcal{F}_n\)-measurable and \(M\) is a martingale,

\[E[B_{n+1}(M_{n+1} - M_n) \mid \mathcal{F}_n] = B_{n+1} E[M_{n+1} - M_n \mid \mathcal{F}_n] = 0.\]

Therefore, \(E[W_{n+1} \mid \mathcal{F}_n] = W_n\).

\(W_n\) is known at time \(n\). \(B_{n+1}\) is known at time \(n\) and pulls out. The remaining factor \(E[M_{n+1} - M_n \mid \mathcal{F}_n] = 0\) by the martingale property of \(M\).


Part 5 — Worked Examples

Example 1.2.1 — Simple random walk as a martingale Example

Suppose \(X_1, X_2, \ldots\) are independent random variables with \(E[X_j] = 0\) for each \(j\). Let \(S_0 = 0\) and \(S_n = X_1 + \cdots + X_n\). In the last section we showed that if \(m < n\), then \(E[S_n \mid \mathcal{F}_m] = S_m\). Hence, \(S_n\) is a martingale with respect to \(\mathcal{F}_n\), the information in \(X_1, \ldots, X_n\).

Verification using the one-step criterion:

Check \(E[S_{n+1} \mid \mathcal{F}_n] = S_n\):

\[E[S_{n+1} \mid \mathcal{F}_n] = E[S_n + X_{n+1} \mid \mathcal{F}_n] = E[S_n \mid \mathcal{F}_n] + E[X_{n+1} \mid \mathcal{F}_n].\]
  • \(S_n\) is \(\mathcal{F}_n\)-measurable \(\Rightarrow\) \(E[S_n \mid \mathcal{F}_n] = S_n\).
  • \(X_{n+1}\) is independent of \(\mathcal{F}_n\) and \(E[X_{n+1}] = 0\) \(\Rightarrow\) \(E[X_{n+1} \mid \mathcal{F}_n] = 0\).
\[E[S_{n+1} \mid \mathcal{F}_n] = S_n + 0 = S_n. \checkmark\]
Key point: Mean-zero independent increments are the prototype martingale. The increments $X_j$ play the role of "fair coin tosses" — no single step has a predictable direction, so the running sum has no predictable drift.
Example 1.2.2 — $S_n^2 - A_n$ as a martingale Example

Suppose \(X_n, S_n, \mathcal{F}_n\) are as in Example 1.2.1 and also assume \(\text{Var}[X_j] = E[X_j^2] = \sigma_j^2 < \infty\). Let

\[A_n = \sigma_1^2 + \cdots + \sigma_n^2, \qquad M_n = S_n^2 - A_n,\]

where \(M_0 = 0\). Then \(M_n\) is a martingale with respect to \(\mathcal{F}_n\).

Verification — one-step check:

\[E[S_{n+1}^2 \mid \mathcal{F}_n] = E[(S_n + X_{n+1})^2 \mid \mathcal{F}_n]\] \[= E[S_n^2 \mid \mathcal{F}_n] + 2E[S_n X_{n+1} \mid \mathcal{F}_n] + E[X_{n+1}^2 \mid \mathcal{F}_n].\]
  • Term 1: \(S_n^2\) is \(\mathcal{F}_n\)-measurable \(\Rightarrow\) \(E[S_n^2 \mid \mathcal{F}_n] = S_n^2\).
  • Term 2: \(S_n\) pulls out; \(E[X_{n+1} \mid \mathcal{F}_n] = E[X_{n+1}] = 0\) \(\Rightarrow\) \(2S_n \cdot 0 = 0\).
  • Term 3: \(X_{n+1}\) independent of \(\mathcal{F}_n\) \(\Rightarrow\) \(E[X_{n+1}^2 \mid \mathcal{F}_n] = E[X_{n+1}^2] = \sigma_{n+1}^2\).

Therefore \(E[S_{n+1}^2 \mid \mathcal{F}_n] = S_n^2 + \sigma_{n+1}^2\), and:

\[\begin{aligned} E[M_{n+1} \mid \mathcal{F}_n] &= E[S_{n+1}^2 - A_{n+1} \mid \mathcal{F}_n] \\ &= S_n^2 + \sigma_{n+1}^2 - (A_n + \sigma_{n+1}^2) = S_n^2 - A_n = M_n. \checkmark \end{aligned}\]
Key point: $S_n^2$ alone is a submartingale (it grows by $\sigma_{n+1}^2$ in expectation at each step). Subtracting the cumulative variance $A_n$ exactly compensates this drift and restores the martingale property. The sequence $A_n$ is called the predictable compensator of $S_n^2$. This is the discrete analogue of the quadratic variation.
Example 1.2.4 — The martingale betting strategy: infinite time beats the game Example

Martingale betting strategy. Let \(X_1, X_2, \ldots\) be independent random variables with

\[P\{X_j = 1\} = P\{X_j = -1\} = \tfrac{1}{2}\]

… We will consider the following betting strategy. We start by betting $1. If we win, we quit; otherwise, we bet $2 on the next game. If we win the second game, we quit; otherwise we double our bet to $4 and play. Each time we lose, we double our bet. At the time that we win, we will be ahead $1d.

The bets are:

\(B_1 = 1\), \(B_j = 2^{j-1} \text{ if } X_1 = X_2 = \cdots = X_{j-1} = -1\), otherwise \(B_j = 0\).

The winnings \(W_n\): either \(+1\) (if we won at some point), or \(-(1 + 2 + 4 + \cdots + 2^{n-1}) = -(2^n - 1)\) (if we lost all \(n\) rounds).

Direct verification that \(E[W_n] = 0\):

\[E[W_n] = 1 \cdot \Bigl(1 - \tfrac{1}{2^n}\Bigr) + \bigl(-(2^n - 1)\bigr) \cdot \tfrac{1}{2^n} = 1 - \tfrac{1}{2^n} - 1 + \tfrac{1}{2^n} = 0.\]

The paradox: With probability one we eventually win, so \(W_\infty = \lim_{n \to \infty} W_n = 1\), giving \(E[W_\infty] = 1 \neq 0 = E[W_0]\).

This does not contradict the martingale property of \(W_n\). The issue is that the strategy requires potentially infinite bets (\(B_n = 2^{n-1}\)) — the boundedness condition \(\lvert B_n \rvert \leq K_n\) is satisfied for each fixed \(n\), but the constants \(K_n\) grow without bound. In finite time \(W_n\) is always a martingale; the “beating” of the game only emerges after infinitely many steps.

Key point: You can beat a fair game in infinite time by tolerating potentially unbounded bets and losses. The martingale property of $W_n$ — which holds for every finite $n$ — does not prevent $E[W_\infty] \neq E[W_0]$. This distinction motivates the Optional Sampling Theorem in §1.3, which gives conditions under which $E[W_T] = E[W_0]$ is preserved at stopping times $T$.

Part 6 — Connection to §1.1 Properties

Every step in §1.2 relies directly on the five properties of conditional expectation from §1.1. The table below makes these dependencies explicit.

Martingale argument §1.1 property used
\(E[S_n \mid \mathcal{F}_n] = S_n\) — current sum is known Property 1 — Known \(Y\)
\(E[X_{n+1} \mid \mathcal{F}_n] = E[X_{n+1}] = 0\) — future increment is independent Property 3 — Independence
\(E[S_n X_{n+1} \mid \mathcal{F}_n] = S_n E[X_{n+1} \mid \mathcal{F}_n]\) — \(S_n\) pulls out Property 5 — Constants rule
\(E[W_n \mid \mathcal{F}_n] = W_n\) — current winnings are known Property 1 — Known \(Y\)
\(E[B_{n+1}(M_{n+1} - M_n) \mid \mathcal{F}_n] = B_{n+1} \cdot 0\) — \(B_{n+1}\) pulls out Property 5 — Constants rule
One-step check implies all-future check Property 2 — Tower property

Term Glossary

Adapted process Definition
A sequence $M_0, M_1, \ldots$ is adapted to $\{\mathcal{F}_n\}$ if $M_n$ is $\mathcal{F}_n$-measurable for every $n$. Equivalently, the value of $M_n$ is fully determined by the information $X_1, \ldots, X_n$ available at time $n$. All martingales must be adapted — they cannot depend on future observations.
Integrability condition $E[\lvert M_n \rvert] < \infty$ Property
Required so that $E[M_n \mid \mathcal{F}_m]$ is well-defined. Without this, the conditional expectation may not exist. It is a weak condition — it rules out processes that take infinitely large values with positive probability but allows unbounded processes as long as their absolute mean is finite.
Increment $\Delta M_n = M_n - M_{n-1}$ Notation
The change in the martingale from step $n-1$ to step $n$. For a martingale, $E[\Delta M_n \mid \mathcal{F}_{n-1}] = 0$ — increments have conditional mean zero given all past information. This is the one-step restatement of the martingale condition.
Predictable process (non-anticipating) Definition
A sequence $B_1, B_2, \ldots$ where $B_n$ is $\mathcal{F}_{n-1}$-measurable for every $n$. The bet at time $n$ can only use information strictly before time $n$. This is the key condition that prevents "cheating" in the discrete stochastic integral and ensures $W_n$ remains a martingale.
Discrete stochastic integral $W_n = \sum_{j=1}^n B_j \Delta M_j$ Definition
The cumulative winnings from applying a predictable betting strategy $\{B_j\}$ to a martingale $\{M_j\}$. When $B_j$ is predictable and bounded, $W_n$ is itself a martingale. This is the discrete-time prototype of the Itô integral $\int_0^t A_s \, dB_s$ in Chapter 3.

Section 1.3 — Optional Sampling Theorem

Notation at a Glance

Symbol Meaning
\(T\) A stopping time — a random time that depends only on past and present observations
\(\{T = n\} \in \mathcal{F}_n\) The defining measurability condition for a stopping time
\(n \wedge T\) \(\min\{n, T\}\) — the process stopped at time \(T\)
\(M_{n \wedge T}\) The stopped process — equals \(M_T\) once \(T\) is reached, stays there afterwards
\(E[M_T] = E[M_0]\) The optional sampling conclusion — martingale mean preserved at stopping time \(T\)
\(P\{T \leq k\} = 1\) \(T\) is bounded — the strongest assumption guaranteeing \(E[M_T] = E[M_0]\)
\(P\{T < \infty\} = 1\) \(T\) is almost surely finite — weaker condition, needs extra hypotheses
\(\mathbf{1}_{\{T > n\}}\) Indicator function equal 1 if \(T > n\) and 0 if \(T < n\)
\(E[\lvert M_n \rvert \mathbf{1}_{\{T>n\}}] \to 0\) The uniform integrability condition in OST II
\(E[M_{n \wedge T}^2] \leq C\) The \(L^2\) boundedness condition in OST III
\(S_n\) Simple symmetric random walk \(X_1 + \cdots + X_n\) with \(P\{X_j = \pm 1\} = \tfrac{1}{2}\)

Part 1 — The Core Intuition

The Optional Sampling Theorem (OST) answers the question: if you are allowed to stop a martingale at a random time of your choosing, can you change its expected value? The answer is no — under appropriate conditions. This is the mathematical statement that you cannot beat a fair game even by choosing when to stop playing.

CORE IDEAS

Stopping a martingale produces another martingale. The stopped process \(Y_n = M_{n \wedge T}\) is always a martingale, regardless of what the stopping time \(T\) is.

\(E[M_T] = E[M_0]\) requires extra conditions beyond just \(P\{T < \infty\} = 1\). The stopped process has constant mean \(E[M_{n \wedge T}] = E[M_0]\) for all finite \(n\). Passing this to the limit \(n \to \infty\) — to get \(E[M_T] = E[M_0]\) — requires uniform integrability or \(L^2\) boundedness.

The OST is a tool for computing probabilities and expectations. By choosing a clever stopping time $T$ and applying OST to two different martingales simultaneously — \(S_n\) and \(S_n^2 - n\) — one can extract exact formulas for hitting probabilities and expected hitting times. Examples 1.3.1 and 1.3.2 demonstrate this technique completely.


Part 2 — Stopping Times

Definition — Stopping Time Definition

A nonnegative integer-valued random variable \(T\) is a stopping time with respect to the filtration \(\{\mathcal{F}_n\}\) if for each \(n\) the event \(\{T = n\}\) is \(\mathcal{F}_n\)-measurable.

What this means: The decision to stop at time \(n\) can only use information available up to and including time \(n\). You cannot decide to stop “because something will happen tomorrow.”

Equivalent condition: \(\{T \leq n\} \in \mathcal{F}_n\) for every \(n\), since

\[\{T \leq n\} = \{T = 0\} \cup \{T = 1\} \cup \cdots \cup \{T = n\},\]

and each event \(\{T = k\} \in \mathcal{F}_k \subseteq \mathcal{F}_n\).

The betting interpretation: \(T\) is a stopping time if and only if the strategy “bet 1 on rounds \(1, 2, \ldots, T\) and bet 0 afterwards” is an allowable (predictable) betting strategy in the sense of §1.2. The bet \(B_j = \mathbf{1}_{\{j \leq T\}}\) is \(\mathcal{F}_{j-1}\)-measurable precisely because \(\{T \geq j\} = \{T \leq j-1\}^c \in \mathcal{F}_{j-1}\).

⚠️Common Misconception
Wrong: "Any random time $T$ with $P\{T < \infty\} = 1$ is a stopping time."
Correct: A random time is a stopping time only if the event $\{T = n\}$ is $\mathcal{F}_n$-measurable for every $n$ — it can only depend on the observations up to time $n$, not on future values. The condition $P\{T < \infty\} = 1$ is a statement about the distribution of $T$, not about its measurability structure.

Part 3 — The Stopped Process

Let $T$ be the stopping time for the betting strategy in §1.2. Then the winnings at time $t$ is

\[M_0 + \sum_{j=1}^n B_j [M_j - M_{j-1}],\]

where \(B_j = 1\) if \(j \leq T\) and \(B_j = 0\) if \(j > T\). We can write this as \(M_{n \wedge T}\), where \(n \wedge T\) is shorthand for \(\min\{n, T\}\).

The stopped process \(M_{n \wedge T}\): at time $n$, the process equals:

  • \(M_n\) if \(T > n\) (we have not stopped yet),
  • \(M_T\) if \(T \leq n\) (we have already stopped and the value is frozen at \(M_T\)).

Since \(B_j = \mathbf{1}_{\{j \leq T\}}\) is predictable (it is \(\mathcal{F}_{j-1}\)-measurable because \(\{T \geq j\} \in \mathcal{F}_{j-1}\)), the discrete stochastic integral result from §1.2 immediately gives:

The stopped process $Y_n = M_{n \wedge T}$ is always a martingale with respect to $\{\mathcal{F}_n\}$. In particular, $E[M_{n \wedge T}] = E[M_0]$ for every finite $n$.

Part 4 — The Three Versions of the OST

Theorem 1.3.1 — OST I - Bounded Stopping TimesTheorem

Suppose \(T\) is a stopping time and \(M_n\) is a martingale with respect to \(\{\mathcal{F}_n\}\). Then \(Y_n = M_{n \wedge T}\) is a martingale. In particular, for each \(n\),

\[E[M_{n \wedge T}] = E[M_0].\]

If \(T\) is bounded, that is, if there exists \(k < \infty\) such that \(P\{T \leq k\} = 1\), then

\[E[M_T] = E[M_0]. \tag{1.7}\]

Why: Since \(P\{T \leq k\} = 1\), we have \(n \wedge T = T\) for all \(n \geq k\). Therefore \(E[M_{n \wedge T}] = E[M_T]\) for \(n \geq k\). Combined with \(E[M_{n \wedge T}] = E[M_0]\) for all \(n\), we get \(E[M_T] = E[M_0]\).

No extra conditions are needed when \(T\) is bounded — the martingale property of the stopped process is all that is required.

Theorem 1.3.2 — OST II (a.s. finite $T$ with UI condition) Theorem

Suppose \(T\) is a stopping time and \(M_n\) is a martingale with respect to \(\{\mathcal{F}_n\}\). Suppose that \(P\{T < \infty\} = 1\), \(E[\lvert M_T \rvert] < \infty\), and for each \(n\),

\[\lim_{n \to \infty} E[\lvert M_n \rvert \mathbf{1}_{\{T > n\}}] = 0. \tag{1.8}\]

Then, \(E[M_T] = E[M_0]\).

Where the condition comes from: For every finite \(n\),

\[E[M_0] = E[M_{n \wedge T}] = E[M_T \mathbf{1}_{\{T \leq n\}}] + E[M_n \mathbf{1}_{\{T > n\}}].\]

As \(n \to \infty\), the first term converges to \(E[M_T]\). The second term vanishes by condition (1.8). Hence \(E[M_0] = E[M_T]\).

Condition (1.8) in plain English: The contribution to the expected value from paths that have not yet stopped by time \(n\) must vanish as \(n \to \infty\). Paths that take very long to stop and reach very large values can violate this.

The doubling strategy fails here: In Example 1.2.4, if \(T = \min\{n : W_n = 1\}\), then \(P\{T < \infty\} = 1\) but \(E[\lvert W_n \rvert \mathbf{1}_{\{T > n\}}] = (2^n - 1) \cdot 2^{-n} \to 1 \neq 0\). Condition (1.8) is violated, consistently with \(E[W_T] = 1 \neq 0 = E[W_0]\).

Theorem 1.3.3 — OST III ($L^2$ bounded stopped process) Theorem

Suppose \(T\) is a stopping time and \(M_n\) is a martingale with respect to \(\{\mathcal{F}_n\}\). Suppose that \(P\{T < \infty\} = 1\), \(E[\lvert M_T \rvert] < \infty\), and that there exists \(C < \infty\) such that for each \(n\),

\[E[M_{n \wedge T}^2] \leq C. \tag{1.9}\]

Then, \(E[M_T] = E[M_0]\).

Why (1.9) implies (1.8) — the key argument:

For any \(b > 0\), split the expectation:

\[E[\lvert M_n \rvert \mathbf{1}_{\{T > n\}}] = E[\lvert M_n \rvert \mathbf{1}_{\{T > n,\, \lvert M_n \rvert \geq b\}}] + E[\lvert M_n \rvert \mathbf{1}_{\{T > n,\, \lvert M_n \rvert < b\}}].\]
  • First term: By the H ̈older inequality and (1.9), \(E[\lvert M_n \rvert \mathbf{1}_{\{\lvert M_n \rvert \geq b,\, T > n\}}] \leq \frac{C}{b}\).
  • Second term: \(E[\lvert M_n \rvert \mathbf{1}_{\{T > n,\, \lvert M_n \rvert < b\}}] \leq b \cdot P\{T > n\} \to 0\) since \(P\{T < \infty\} = 1\).

Therefore \(\limsup_{n \to \infty} E[\lvert M_n \rvert \mathbf{1}_{\{T > n\}}] \leq C/b\) for every \(b > 0\), so the limit is \(0\).

Practical value: Condition (1.9) is often easier to verify than (1.8) directly — it suffices to bound the second moment of the stopped process uniformly in \(n\).

Summary — Three OST Versions

Version Assumption on $T$ Extra condition Conclusion
OST I \(P\{T \leq k\} = 1\) (bounded) None \(E[M_T] = E[M_0]\)
OST II \(P\{T < \infty\} = 1\) \(E[M_{n}\mathbf{1}_{\{T>n\}}] \to 0\) \(E[M_T] = E[M_0]\)
OST III \(P\{T < \infty\} = 1\) \(E[M_{n \wedge T}^2] \leq C < \infty\) \(E[M_T] = E[M_0]\)

Part 5 — Worked Examples

Example 1.3.1 — Gambler's ruin: hitting probability for random walk Example

Gambler’s ruin for random walk. Let \(X_1, X_2, \ldots\) be independent, coin-tosses as in (1.6) and let \(S_n = 1 + X_1 + \cdots + X_n\). $S_n$ is called simple (symmetric) random walk starting at 1. … Let \(K > 1\) be a positive integer and let $T$ denote the first time $n$ such that \(S_n = 0\) or \(S_n = K\).

Setup:

  • \(S_n\) is a martingale (zero-mean increments, §1.2 Example 1.2.1).
  • \(T = \min\{n : S_n = 0 \text{ or } S_n = K\}\) is a stopping time.
  • \(S_n\) stays in \([0, K]\) for all \(n \leq T\), so \(0 \leq M_{n \wedge T} \leq K\) — the stopped process is bounded.
  • Therefore (1.9) holds with \(C = K^2\), and OST III applies.

Apply OST to \(S_n\):

\[1 = E[S_0] = E[S_T] = 0 \cdot P\{S_T = 0\} + K \cdot P\{S_T = K\}.\]

Solving:

\[\boxed{P\{S_T = K\} = \frac{1}{K}, \qquad P\{S_T = 0\} = \frac{K-1}{K}.}\]

Interpretation: Starting at 1, with a fair game, the probability of reaching \(K\) before 0 is \(1/K\). As \(K \to \infty\), \(P\{S_T = K\} \to 0\): a gambler with $1 playing against a casino with $K almost surely goes broke. This is the gambler’s ruin estimate.

Key technique: The OST converts a martingale identity ($E[S_T] = E[S_0]$) into a linear equation in the unknown probability $P\{S_T = K\}$. The bounded stopping time (or $L^2$ bound) is what justifies passing the martingale identity through to time $T$.
Example 1.3.2 — Expected hitting time via the $S_n^2 - n$ martingale Example

Let \(S_n = X_1 + \cdots + X_n\) be simple random walk starting at 0. We have seen that \(M_n = S_n^2 - n\) is a martingale. Let \(J, K\) be positive integers and let \(T = \min\{n : S_n = -J \text{ or } S_n = K\}\).

Step 1 — Find \(P\{S_T = K\}\) using \(S_n\):

Apply OST to \(S_n\) (same argument as Example 1.3.1, starting at 0):

\[0 = E[S_T] = (-J) P\{S_T = -J\} + K \cdot P\{S_T = K\}.\] \[\boxed{P\{S_T = K\} = \frac{J}{J+K}, \qquad P\{S_T = -J\} = \frac{K}{J+K}.}\]

Step 2 — Find \(E[T]\) using \(M_n = S_n^2 - n\):

Exercise 1.13 establishes that \(E[M_{n \wedge T}^2] \leq C < \infty\), so OST III applies to \(M_n\):

\[E[M_T] = E[M_0] \implies E[S_T^2] - E[T] = 0 \implies E[T] = E[S_T^2].\]

Compute \(E[S_T^2]\) directly:

\[\begin{aligned} E[S_T^2] = J^2 P\{S_T = -J\} + K^2 P\{S_T = K\} &= J^2 \cdot \frac{K}{J+K} + K^2 \cdot \frac{J}{J+K} \\ &= \frac{JK(J+K)}{J+K} = JK. \end{aligned}\] \[\boxed{E[T] = JK.}\]

Interpretation: The expected time for symmetric random walk to exit \((-J, K)\) starting from 0 is exactly \(JK\). In particular, to travel distance \($K\) from the origin in either direction, the expected time is \(K^2\) (set \(J = K\)).

Example 1.3.3 — OST fails: $P\{T < \infty\} = 1$ but $E[S_0] \neq E[S_T]$ Counterexample

As in Example 1.3.2, let \(S_n = X_1 + \cdots + X_n\) be simple random walk starting at 0. Let \(T = \min\{n : S_n = 1\}\).

\(P\{T < \infty\} = 1\): From Example 1.3.2, Probility random walk hits 1 before \(-J\) \(P\{S_T = 1\}= J/(J+1) \to 1\) as \(J \to \infty\). So with probability one the walk reaches 1.

But \(E[T] = \infty\): From Example 1.3.2, the expected time to exit \((-J, 1)\) starting at 0 is \(J \cdot 1 = J\). Since this holds for all \(J\) and \(T \geq T_J\) (where \(T_J\) is the exit time from \((-J,1)\)), we have \(E[T] \geq J\) for all \(J\), so \(E[T] = \infty\).

OST fails: \(S_T = 1\) almost surely, so \(E[S_T] = 1 \neq 0 = E[S_0]\).

Why: Neither condition (1.8) nor (1.9) holds for this \(T\). In particular, \(E[M_{n \wedge T}^2] = E[S_{n \wedge T}^2] \to \infty\) as \(n \to \infty\), violating (1.9).

Key lesson: $P\{T < \infty\} = 1$ is not sufficient on its own. An infinite expected stopping time ($E[T] = \infty$) is a warning sign that the extra conditions of OST II or III may fail. Always verify one of the three conditions before applying OST.

Part 6 — Why the Doubling Strategy Does Not Contradict OST

The martingale betting strategy from §1.2 Example 1.2.4 has:

  • \(W_n\) is a martingale for every finite \(n\), so \(E[W_n] = 0\).
  • \(T = \min\{n : W_n = 1\}\) satisfies \(P\{T < \infty\} = 1\).
  • \(W_T = 1\), so \(E[W_T] = 1 \neq 0 = E[W_0]\).

Checking that OST conditions fail:

\[E[\lvert W_n \rvert \mathbf{1}_{\{T > n\}}] = (2^n - 1) \cdot 2^{-n} \to 1 \neq 0,\]

so condition (1.8) is violated. Since \(E[W_{n \wedge T}^2] \to \infty\) as well, condition (1.9) also fails. And \(T\) is clearly unbounded. All three OST conditions fail, consistently with \(E[W_T] \neq E[W_0]\).


Term Glossary

Bounded stopping time Definition
A stopping time $T$ with $P\{T \leq k\} = 1$ for some finite $k$. The strongest assumption — guarantees OST with no additional conditions. In Examples 1.3.1 and 1.3.2 the stopping time is not bounded (the walk may take arbitrarily long to exit the interval) but the stopped process is bounded in value, which gives $L^2$ boundedness.
Uniform integrability condition (1.8) Property
The condition $\lim_{n \to \infty} E[\lvert M_n \rvert \mathbf{1}_{\{T > n\}}] = 0$ required by OST II. Ensures that paths which have not yet stopped by time $n$ contribute negligibly to the expectation as $n \to \infty$. Violated by the martingale doubling strategy: $E[\lvert W_n \rvert \mathbf{1}_{\{T > n\}}] \to 1$.
$L^2$ boundedness condition (1.9) Property
The condition $E[M_{n \wedge T}^2] \leq C < \infty$ for all $n$, required by OST III. Implies the uniform integrability condition (1.8) via Cauchy-Schwarz. Often easier to verify in practice — it suffices to show the stopped martingale has uniformly bounded second moment.
Gambler's ruin estimate Theorem
For simple symmetric random walk $S_n$ starting at $x \in \{0, 1, \ldots, K\}$ and $T = \min\{n : S_n = 0 \text{ or } S_n = K\}$: $$P\{S_T = K \mid S_0 = x\} = \frac{x}{K}.$$ Derived by applying OST to $S_n$ and solving the resulting linear equation. Shows that starting at 1 against a casino at $K$, the ruin probability is $(K-1)/K \to 1$ as $K \to \infty$.
Expected hitting time Theorem
For simple symmetric random walk starting at 0 and $T = \min\{n : S_n = -J \text{ or } S_n = K\}$: $$E[T] = JK.$$ Derived by applying OST to the martingale $M_n = S_n^2 - n$ together with the hitting probabilities from the gambler's ruin. In particular, starting at 0, the expected time to first exit the interval $(-K, K)$ is $K^2$.
Recurrence of random walk Theorem
Simple symmetric random walk returns to the origin with probability one: $P\{\tau < \infty\} = 1$ where $\tau = \min\{n \geq 1 : S_n = 0\}$. Follows from the gambler's ruin estimate by letting $K \to \infty$: $P\{S_T = 0\} = (K-1)/K \to 1$.

Section 1.4 — Martingale Convergence Theorem

Notation at a Glance

Symbol Meaning
\(M_\infty\) The almost-sure limit of \(M_n\) — guaranteed by the MCT when \(E[\lvert M_n \rvert] \leq C\)
\(\sup_n E[\lvert M_n \rvert] \leq C\) Uniform \(L^1\) bound — the hypothesis of the MCT
\(\liminf_{n\to\infty} M_n\), \(\limsup_{n\to\infty} M_n\) Smallest and largest cluster points of the sequence \((M_n)\)
\(U_n(a,b)\) Number of upcrossings of \([a,b]\) by time \(n\)
\(U_\infty(a,b)\) Total upcrossings of \([a,b]\); MCT proof shows \(E[U_\infty] < \infty\)
\(S_j,\, T_j\) Buy and sell stopping times in the upcrossing strategy
\(W_n\) Winnings from the buy-low-sell-high strategy used in the proof
\(R_n,\, G_n\) Number of red and green balls in Pólya’s urn after \(n\) draws
\(M_n = R_n/(n+2)\) Fraction of red balls — a bounded nonneg martingale converging a.s.
\(M_\infty \sim \mathrm{Uniform}[0,1]\) The limiting distribution of Pólya’s urn
\(f_{n,k}(\theta)\) Posterior Beta density for \(\theta\) after \(n\) Bernoulli trials with \(k\) successes

Part 1 — The Core Intuition

The Martingale Convergence Theorem (MCT) answers: if a martingale is uniformly bounded in $L^1$, must it converge? The answer is yes — it converges almost surely to a finite limit. The proof uses a financial argument: if the sequence did not converge, it would oscillate infinitely between some values $a$ and $b$, and a "buy low, sell high" strategy would extract infinite expected profit from a fair game. Since that is impossible, convergence must hold. Crucially, the limit $M_\infty$ need not satisfy $E[M_\infty] = E[M_0]$ — the martingale property can fail in the limit even when it holds at every finite time.

CORE IDEAS

Uniform \(L^1\) boundedness forces almost-sure convergence. The condition \(\sup_n E[\lvert M_n \rvert] \leq C < \infty\) prevents the martingale from wandering to \(\pm\infty\), and the upcrossing argument shows it cannot oscillate indefinitely between any two levels \(a < b\). Together these force \(\lim_{n\to\infty} M_n\) to exist with probability one.

The proof is a financial argument via upcrossings. If \(M_n\) crossed between $a$ and $b$ infinitely often, buying every time \(M_n \leq a\) and selling every time \(M_n \geq b\) would produce infinite expected profit. Since \(M_n\) is a fair game, that is impossible. The upcrossing inequality makes this precise and quantitative.

\(E[M_\infty] = E[M_0]\) does NOT follow from the MCT. The theorem guarantees a.s. convergence but not convergence of expectations. The martingale doubling strategy from §1.2 illustrates this: \(E[\lvert W_n \rvert] \leq 2\) for all \(n\), so the MCT applies and \(W_\infty = 1\) a.s., yet \(E[W_\infty] = 1 \neq 0 = E[W_0]\). Preserving the mean requires the extra conditions of the OST.

Nonnegative martingales automatically satisfy the MCT hypothesis. If \(M_n \geq 0\) for all \(n\), then \(E[\lvert M_n \rvert] = E[M_n] = E[M_0]\), so the \(L^1\) bound holds with \(C = E[M_0]\). Pólya’s urn is the prime example.


Part 2 — The Theorem

Theorem 1.4.1 — Martingale Convergence Theorem Theorem

Suppose \(M_n\) is a martingale with respect to \(\{\mathcal{F}_n\}\) and there exists \(C < \infty\) such that \(E[\lvert M_n \rvert] \leq C\) for all \(n\). Then there exists a random variable \(M_\infty\) such that with probability one

\(\lim_{n \to \infty} M_n = M_\infty.\)

Unpacking the hypothesis — \(E[\lvert M_n \rvert] \leq C\) uniformly in \(n\): The absolute mean of \(M_n\) stays bounded no matter how far the process runs. Since \(E[\lvert M_n \rvert] \geq \lvert E[M_n] \rvert = \lvert E[M_0] \rvert\), this implicitly requires \(E[M_0]\) to be finite, but it is strictly stronger — it controls the size of fluctuations, not just the level.

Unpacking the conclusion — “with probability one”: There exists a single event \(\Omega^* \subseteq \Omega\) with \(P(\Omega^*) = 1\) such that for every \(\omega \in \Omega^*\), the real sequence \((M_n(\omega))\) converges to a finite number \(M_\infty(\omega)\).

What the theorem does NOT say:

  • It does not say \(E[M_\infty] = E[M_0]\).
  • It does not say convergence is in \(L^1\) or \(L^2\).
  • It does say \(E[\lvert M_\infty \rvert] \leq C\) — this follows from Fatou’s lemma: \(E[\lvert M_\infty \rvert] = E[\liminf \lvert M_n \rvert] \leq \liminf E[\lvert M_n \rvert] \leq C\).

Part 3 — The Upcrossing Proof

The proof is built on the "buy low, sell high" strategy. The key object is the number of upcrossings — the number of times the sequence rises from below $a$ to above $b$. The Upcrossing Inequality bounds the expected number of such crossings by a quantity that is finite whenever $E[\lvert M_n \rvert] \leq C$. Finiteness of expected upcrossings then forces a.s. convergence.

Step 1 — Define the upcrossing stopping times

Fix \(a < b\). Define stopping times implementing “buy at \(a\), sell at \(b\)”:

\[S_1 = \min\{n : M_n \leq a\}, \qquad T_1 = \min\{n > S_1 : M_n \geq b\},\] \[S_j = \min\{n > T_{j-1} : M_n \leq a\}, \qquad T_j = \min\{n > S_j : M_n \geq b\}.\]

The number of completed upcrossings by time \(n\) is:

\[U_n = U_n(a,b) = \max\{j : T_j \leq n\}.\]

Each completed upcrossing — going from \(\leq a\) up to \(\geq b\) — produces a profit of at least \(b - a\).

Step 2 — Construct the predictable betting strategy

Define bets \(B_k = \mathbf{1}_{\{S_j \leq k-1 < T_j \text{ for some } j\}}\), i.e., \(B_k = 1\) when a position is held (bought but not yet sold), \(B_k = 0\) otherwise. This is \(\mathcal{F}_{k-1}\)-measurable since whether we hold at step \(k\) depends only on \(M_0, \ldots, M_{k-1}\).

The winnings from this strategy satisfy:

\[W_n \geq U_n(b - a) - (a - M_n).\]

Here \(U_n(b-a)\) is the profit from completed upcrossings, and \((a - M_n)\) is the potential loss on a position still open at time \(n\) (held below \(a\)).

Step 3 — Apply the martingale property of \(W_n\)

Since \(B_k\) is predictable and bounded, \(W_n\) is a martingale (§1.2), so \(E[W_n] = 0\):

\[0 = E[W_n] \geq (b-a)\, E[U_n] - E[(a - M_n)].\]

Using \((a - M_n) \leq \lvert a \rvert + \lvert M_n \rvert\):

\[\boxed{E[U_n(a,b)] \leq \frac{\lvert a \rvert + E[\lvert M_n \rvert]}{b - a} \leq \frac{\lvert a \rvert + C}{b - a} < \infty.}\]

This is Doob’s Upcrossing Inequality.

Step 4 — Conclude almost-sure convergence

Since \(E[U_n] \leq (\lvert a \rvert + C)/(b-a)\) for every \(n\), monotone convergence gives:

\[E[U_\infty(a,b)] \leq \frac{\lvert a \rvert + C}{b - a} < \infty \implies U_\infty(a,b) < \infty \quad \text{a.s.}\]

Now let \(a, b\) range over all rational pairs \(a < b\). There are only countably many such pairs, so:

\[P\bigl(\exists\, a < b \text{ rational}: U_\infty(a,b) = \infty\bigr) = 0.\]

On the complementary event (probability one), \(U_\infty(a,b) < \infty\) for every rational pair, which forces \(\liminf_{n \to \infty} M_n = \limsup_{n \to \infty} M_n\) a.s. Hence \(M_\infty = \lim_{n \to \infty} M_n\) exists and is finite a.s.


Part 4 — Pólya’s Urn

Pólya's urn is the canonical application of the MCT. The fraction of red balls $M_n = R_n/(n+2)$ is a bounded nonneg martingale, so the MCT guarantees its a.s. convergence to a limit $M_\infty$. The remarkable fact: $M_\infty$ is uniformly distributed on $[0,1]$ — the urn settles to a random stable fraction that is completely unpredictable in advance.

Setup

Start with one red and one green ball. At each step: draw a ball uniformly at random, observe its colour, return it plus one new ball of the same colour.

  • \(R_0 = G_0 = 1\), and \(R_n + G_n = n + 2\) after \(n\) draws.
  • \(M_n = R_n/(n+2)\): fraction of red balls.
  • Transition: \(P\{R_{n+1} = R_n + 1 \mid \mathcal{F}_n\} = M_n\).

\(M_n\) is a martingale — verification

\(\begin{aligned}\)E[M_{n+1} \mid \mathcal{F}_n] &= M_n \cdot \frac{R_n + 1}{n+3} + (1 - M_n) \cdot \frac{R_n}{n+3}
&= \frac{R_n(n+3)}{(n+2)(n+3)}
&= \frac{R_n}{n+2}
&= M_n. \checkmark\(\end{aligned}\)

MCT applies

Since \(0 \leq M_n \leq 1\), we have \(E[\lvert M_n \rvert] = E[M_n] = E[M_0] = \tfrac{1}{2}\). The uniform \(L^1\) bound holds with \(C = \tfrac{1}{2}\), so the MCT gives:

\[M_\infty = \lim_{n \to \infty} M_n \quad \text{exists a.s.}\]

The limiting distribution is \(\mathrm{Uniform}[0,1]\)

Exercise 1.11 establishes that for each \(n\), \(M_n\) is uniform on the finite set \(\bigl\{\tfrac{1}{n+2}, \tfrac{2}{n+2}, \ldots, \tfrac{n+1}{n+2}\bigr\}\). As \(n \to \infty\) this discrete uniform converges in distribution to \(\mathrm{Uniform}[0,1]\), so \(M_\infty \sim \mathrm{Uniform}[0,1]\).

Starting from one red and one green ball, the long-run fraction of red balls $M_\infty$ is uniformly distributed on $[0,1]$. The urn stabilises to a definite proportion, but that proportion is itself completely random.

Part 5 — Connection to Bayesian Statistics

Pólya's urn is not merely a toy model — it is exactly the Bayesian update rule for a Bernoulli parameter $\theta$ with a uniform prior. The MCT, reinterpreted, is a Bayesian consistency theorem: the posterior mean converges a.s. to the true $\theta$ as data accumulates.

Setup: Run Bernoulli trials with unknown success probability \(\theta\), starting from a \(\mathrm{Uniform}[0,1]\) prior. After \(n\) trials with \(S_n = k\) successes, the Bayes update gives the posterior:

\[f_{n,k}(\theta) \propto \theta^k (1-\theta)^{n-k}, \quad \theta \in (0,1).\]

This is the \(\mathrm{Beta}(k+1,\, n-k+1)\) density, with posterior mean:

\[E[\theta \mid S_n = k] = \frac{k+1}{n+2} = \frac{S_n + 1}{n+2}.\]

This is identical to the Pólya urn fraction \(M_n = R_n/(n+2)\) when \(R_n = S_n + 1\) (successes plus the one initial red ball). The urn fraction equals the posterior mean at every step.

MCT as Bayesian law of large numbers: By the strong law, \(S_n/n \to \theta\) a.s. Combined with \(M_n \to M_\infty\) a.s., we get \(M_\infty = \theta\) a.s. — the posterior mean converges to the true parameter.


Part 6 — MCT vs OST: Complementary Tools

  MCT OST
Question Does \(M_n\) converge as \(n \to \infty\)? Does \(E[M_T] = E[M_0]\) at a stopping time?
Hypothesis \(\sup_n E[\lvert M_n \rvert] \leq C\) Bounded \(T\), or (1.8), or (1.9)
Conclusion \(M_n \to M_\infty\) a.s. \(E[M_T] = E[M_0]\)
Mean preserved? Not guaranteed Yes — that is the conclusion
Counterexample Simple random walk: \(E[\lvert S_n \rvert] \to \infty\), no convergence Doubling strategy: \(P\{T<\infty\}=1\) but \(E[W_T] \neq E[W_0]\)
Key tool Upcrossing inequality Stopped process is a martingale

Term Glossary

Martingale Convergence Theorem (MCT) Theorem
If $M_n$ is a martingale with $\sup_n E[\lvert M_n \rvert] \leq C < \infty$, then $M_\infty = \lim_{n\to\infty} M_n$ exists and is finite a.s. The limit satisfies $E[\lvert M_\infty \rvert] \leq C$ by Fatou's lemma, but $E[M_\infty]$ need not equal $E[M_0]$.
Uniform $L^1$ bound Property
The condition $\sup_{n \geq 0} E[\lvert M_n \rvert] \leq C < \infty$. The hypothesis of the MCT. Automatically satisfied for nonneg martingales (since $E[\lvert M_n \rvert] = E[M_n] = E[M_0]$) and for $L^\infty$-bounded martingales. Fails for simple random walk.
Upcrossing $U_n(a,b)$ Definition
The number of times $(M_0, M_1, \ldots, M_n)$ rises from $\leq a$ up through $\geq b$ (a complete upward crossing of the interval $[a,b]$). A sequence converges if and only if $U_\infty(a,b) < \infty$ for every rational $a < b$. The MCT proof shows this holds whenever $\sup_n E[\lvert M_n \rvert] \leq C$.
Doob's Upcrossing Inequality Theorem
For any martingale $M_n$ with $\sup_n E[\lvert M_n \rvert] \leq C$ and any $a < b$: $$E[U_n(a,b)] \leq \frac{\lvert a \rvert + C}{b - a}.$$ The right side is finite and independent of $n$. Monotone convergence then gives $E[U_\infty(a,b)] \leq (\lvert a \rvert + C)/(b-a) < \infty$, forcing $U_\infty(a,b) < \infty$ a.s. and hence a.s. convergence of $M_n$.
Almost-sure (a.s.) convergence Definition
$M_n \to M_\infty$ a.s. means $P(\{\omega : M_n(\omega) \to M_\infty(\omega)\}) = 1$. Stronger than convergence in probability; weaker than $L^1$ convergence. The MCT delivers a.s. convergence. Upgrading to $L^1$ convergence (and hence $E[M_\infty] = E[M_0]$) requires the additional condition of uniform integrability.
Uniform integrability Property
A sequence $(M_n)$ is uniformly integrable if $\lim_{K\to\infty} \sup_n E[\lvert M_n \rvert \mathbf{1}_{\{\lvert M_n \rvert \geq K\}}] = 0$. Strictly stronger than $\sup_n E[\lvert M_n \rvert] \leq C$. Under UI, a.s. convergence implies $L^1$ convergence and $E[M_\infty] = E[M_0]$. The doubling strategy $W_n$ is not uniformly integrable.

Sections 1.5 – Square Integrable Martingales, Random Walk Integrals, and Maximal Inequality

Notation at a Glance

Symbol Meaning
\(E[M_n^2] < \infty\) Square integrability — the hypothesis of §1.5
\(L^2(\Omega, \mathcal{F}, P)\) Hilbert space of square-integrable random variables with inner product \((X,Y) = E[XY]\)
\((X, Y) = E[XY]\) Inner product on \(L^2\) — orthogonality means \((X,Y) = 0\)
\(\Delta M_n = M_n - M_{n-1}\) Increment of \(M_n\) at step \(n\)
\(E[\Delta M_{n+1} \cdot \Delta M_{m+1}] = 0,\; m \neq n\) Orthogonality of martingale increments
\(J_n\) Predictable integrand — \(J_n\) is \(\mathcal{F}_{n-1}\)-measurable
\(Z_n = \sum_{j=1}^n J_j X_j\) Discrete stochastic integral of \(J\) with respect to the random walk \(S_n\)
\(\sigma^2 = E[X_j^2]\) Variance of each i.i.d. increment
\(\mathrm{Var}[Z_n] = \sigma^2 \sum_{j=1}^n E[J_j^2]\) Variance rule for the discrete stochastic integral
\(\bar{Y}_n = \max\{Y_0, Y_1, \ldots, Y_n\}\) Running maximum of a nonneg submartingale \(Y_n\)
\(\overline{M}_n = \max\{\lvert M_0 \rvert, \ldots, \lvert M_n \rvert\}\) Running maximum of \(\lvert M_n \rvert\)
\(P\{\bar{Y}_n \geq a\} \leq a^{-1} E[Y_n]\) Doob’s maximal inequality for submartingales
\(P\{\overline{M}_n \geq a\} \leq a^{-2} E[M_n^2]\) Doob’s \(L^2\) maximal inequality for square integrable martingales

Part 1 — Square Integrable Martingales

Definition — Square Integrable Martingale Definition

A martingale \(M_n\) is called square integrable if for each \(n\), \(E[M_n^2] < \infty\).

This is the condition that \(M_n \in L^2(\Omega, \mathcal{F}_n, P)\) at every time \(n\).

Why stronger than integrability: Square integrability \(E[M_n^2] < \infty\) implies integrability \(E[\lvert M_n \rvert] < \infty\) by Jensen’s inequality, but not vice versa.

Why weaker than uniform \(L^2\) boundedness: The definition requires \(E[M_n^2] < \infty\) for each fixed \(n\), but the bound may grow with \(n\). The stronger condition \(\sup_n E[M_n^2] \leq C < \infty\) (used in OST III and the MCT) is a separate, stricter requirement.

Orthogonality of increments

Random variables \(X, Y\) are orthogonal if \(E[XY] = E[X]\, E[Y]\).

For zero-mean random variables, orthogonality reduces to \(E[XY] = 0\). Independent random variables are orthogonal, but orthogonal random variables need not be independent.

Proposition 1.5.1 — Orthogonality of Martingale Increments Proposition

Suppose that \(M_n\) is a square integrable martingale with respect to \(\{\mathcal{F}_n\}\). Then if \(m < n\),

\[E[(\Delta M_{n+1})(\Delta M_{m+1})] = 0,\]

where \(\Delta M_k = M_k - M_{k-1}\). Moreover, for all \(n\),

\[E[M_n^2] = E[M_0^2] + \sum_{j=1}^n E\bigl[(\Delta M_j)^2\bigr].\]

Proof of orthogonality:

For \(m < n\), the increment \(\Delta M_{m+1} = M_{m+1} - M_m\) is \(\mathcal{F}_n\)-measurable (since \(m+1 \leq n\)). Therefore:

\[E[(\Delta M_{n+1})(\Delta M_{m+1}) \mid \mathcal{F}_n] = (\Delta M_{m+1})\, E[\Delta M_{n+1} \mid \mathcal{F}_n] = (\Delta M_{m+1}) \cdot 0 = 0.\]

The second equality uses the martingale property: \(E[\Delta M_{n+1} \mid \mathcal{F}_n] = E[M_{n+1} - M_n \mid \mathcal{F}_n] = 0\). Taking full expectations gives \(E[(\Delta M_{n+1})(\Delta M_{m+1})] = 0\).

Proof of the Pythagorean identity:

Write \(M_n = M_0 + \sum_{j=1}^n \Delta M_j\) and expand the square:

\[M_n^2 = M_0^2 + \sum_{j=1}^n (\Delta M_j)^2 + \sum_{j \neq k} (\Delta M_j)(\Delta M_k).\]

Taking expectations and using orthogonality (all cross terms vanish):

\[E[M_n^2] = E[M_0^2] + \sum_{j=1}^n E[(\Delta M_j)^2].\]
Interpretation: This is the Pythagorean theorem in $L^2$. The variance of $M_n$ equals the sum of variances of all its increments — because the increments are mutually orthogonal (uncorrelated), there are no cross-term contributions. This is the exact analogue of $\lvert a_1 e_1 + \cdots + a_n e_n \rvert^2 = a_1^2 + \cdots + a_n^2$ for orthonormal vectors.
⚠️Common Misconception
Wrong: "Orthogonal martingale increments are independent."
Correct: Orthogonality ($E[\Delta M_{n+1} \cdot \Delta M_{m+1}] = 0$ for $m \neq n$) is weaker than independence. It says the increments are uncorrelated, not that they have no dependence structure. Independence implies orthogonality (for zero-mean variables), but the converse fails. The proof only uses the martingale property — no independence assumption is needed.

Part 2 — Integrals with Respect to Random Walk

This section defines the discrete stochastic integral and establishes its three fundamental properties. The setting is a predictable integrand $J_n$ and a random walk $S_n$ with i.i.d. mean-zero increments. The integral $Z_n = \sum_{j=1}^n J_j X_j$ is the discrete prototype of the Itô integral $\int_0^t A_s\, dB_s$.

Setup

Suppose that \(X_1, X_2, \ldots\) are independent, identically distributed random variables with mean zero and variance \(\sigma^2\).

The two main examples are:

  • Coin-tossing: \(P\{X_j = 1\} = P\{X_j = -1\} = \tfrac{1}{2}\), giving \(\sigma^2 = 1\).
  • Normal increments: \(X_j \sim N(0, \sigma^2)\).

Let \(S_n = X_1 + \cdots + X_n\) and let \(\{\mathcal{F}_n\}\) be the filtration generated by \(X_1, \ldots, X_n\).

A sequence of random variables \(J_1, J_2, \ldots\) is called predictable (with respect to \(\{\mathcal{F}_n\}\)) if for each \(n\), \(J_n\) is \(\mathcal{F}_{n-1}\)-measurable.

This is the non-anticipating condition from §1.2: The integrand \(J_n\) is determined by observations strictly before time \(n\).

The discrete stochastic integral is defined by:

\[Z_n = \sum_{j=1}^n J_j X_j = \sum_{j=1}^n J_j \,\Delta S_j.\]

Three fundamental properties

Three Properties of the Discrete Stochastic Integral Proposition

Property 1 — Martingale property

\[Z_n \text{ is a martingale with respect to } \{\mathcal{F}_n\}.\]

Proof:

\[E[Z_{n+1} \mid \mathcal{F}_n] = E[Z_n + J_{n+1} X_{n+1} \mid \mathcal{F}_n] = Z_n + J_{n+1}\, E[X_{n+1} \mid \mathcal{F}_n] = Z_n + J_{n+1} \cdot 0 = Z_n.\]

Here: \(Z_n\) is \(\mathcal{F}_n\)-measurable (Property 1 of §1.1); \(J_{n+1}\) is \(\mathcal{F}_n\)-measurable and pulls out (Property 5); \(X_{n+1}\) is independent of \(\mathcal{F}_n\) with \(E[X_{n+1}] = 0\) (Property 3).


Property 2 — Linearity

If \(J_n, K_n\) are predictable sequences and \(a, b\) constants, then \(aJ_n + bK_n\) is predictable and:

\[\sum_{j=1}^n (aJ_j + bK_j) X_j = a \sum_{j=1}^n J_j X_j + b \sum_{j=1}^n K_j X_j.\]

Property 3 — Variance rule

\[\mathrm{Var}[Z_n] = E[Z_n^2] = \sigma^2 \sum_{j=1}^n E[J_j^2].\]

Proof:

Using orthogonality of martingale increments (§1.5), the cross terms \(E[J_j X_j \cdot J_k X_k]\) vanish for \(j \neq k\):

\[E[Z_n^2] = \sum_{j=1}^n E[J_j^2 X_j^2].\]

Since \(J_j\) is \(\mathcal{F}_{j-1}\)-measurable and \(X_j\) is independent of \(\mathcal{F}_{j-1}\):

\[E[J_j^2 X_j^2] = E\bigl[E[J_j^2 X_j^2 \mid \mathcal{F}_{j-1}]\bigr] = E\bigl[J_j^2\, E[X_j^2 \mid \mathcal{F}_{j-1}]\bigr] = E[J_j^2]\, \sigma^2.\]

Summing over \(j\) gives \(E[Z_n^2] = \sigma^2 \sum_{j=1}^n E[J_j^2]\).

Why the variance rule matters: It gives an explicit formula for the second moment of $Z_n$ purely in terms of the integrand $J_j$ and the variance $\sigma^2$ of the increments. This is the discrete analogue of the Itô isometry $E\!\left[\left(\int_0^t A_s\, dB_s\right)^2\right] = \int_0^t E[A_s^2]\, ds$, which is the cornerstone of the $L^2$ theory of stochastic integration.

Comparison: discrete stochastic integral vs. Itô integral

Feature Discrete: \(Z_n = \sum J_j X_j\) Continuous: \(\int_0^t A_s\, dB_s\)
Integrand condition \(J_n\) is \(\mathcal{F}_{n-1}\)-measurable (predictable) \(A_s\) is adapted, square-integrable
Martingale property \(Z_n\) is a martingale \(\int_0^t A_s\, dB_s\) is a martingale
Variance rule \(E[Z_n^2] = \sigma^2 \sum E[J_j^2]\) \(E\!\left[\left(\int_0^t A_s\, dB_s\right)^2\right] = \int_0^t E[A_s^2]\, ds\)
Linearity ✓ direct from summation ✓ by construction

Part 3 — A Maximal Inequality

Theorem 1.7.1 — Doob's Maximal Inequality for Submartingales Theorem

Suppose \(Y_n\) is a nonneg submartingale with respect to \(\{\mathcal{F}_n\}\), and \(\bar{Y}_n = \max\{Y_0, Y_1, \ldots, Y_n\}\). Then for every \(a > 0\),

\[P\{\bar{Y}_n \geq a\} \leq \frac{1}{a}\, E[Y_n].\]

Proof:

Let \(T = \min\{k \leq n : Y_k \geq a\}\) (with \(T = n+1\) if no such \(k\) exists). Then:

\[\{\bar{Y}_n \geq a\} = \bigsqcup_{k=0}^n A_k, \quad A_k = \{T = k\}.\]

Each \(A_k \in \mathcal{F}_k\). Since \(Y_n\) is a submartingale, \(E[Y_n \mid \mathcal{F}_k] \geq Y_k\) for \(k \leq n\), so:

\[E[Y_n \mathbf{1}_{A_k}] = E\bigl[E[Y_n \mid \mathcal{F}_k]\, \mathbf{1}_{A_k}\bigr] \geq E[Y_k\, \mathbf{1}_{A_k}] \geq a\, P(A_k).\]

Summing over \(k = 0, 1, \ldots, n\):

\[E[Y_n] \geq E\!\left[Y_n\, \mathbf{1}_{\{\bar{Y}_n \geq a\}}\right] = \sum_{k=0}^n E[Y_n\, \mathbf{1}_{A_k}] \geq a\, P\{\bar{Y}_n \geq a\}.\]

Dividing by \(a\) gives the result.

Corollary 1.7.2 — Doob's $L^2$ Maximal Inequality Corollary

If \(M_n\) is a square integrable martingale with respect to \(\{\mathcal{F}_n\}\) and \(\overline{M}_n = \max\{\lvert M_0 \rvert, \ldots, \lvert M_n \rvert\}\), then for every \(a > 0\),

\[P\{\overline{M}_n \geq a\} \leq \frac{E[M_n^2]}{a^2}.\]

Proof: Exercise 1.15 shows that if \(M_n\) is a martingale and \(\varphi\) is a convex function, then \(\varphi(M_n)\) is a submartingale. Taking \(\varphi(x) = x^2\) gives that \(M_n^2\) is a nonneg submartingale. Apply Theorem 1.7.1 to \(Y_n = M_n^2\) with threshold \(a^2\):

\[P\{\bar{Y}_n \geq a^2\} \leq \frac{E[M_n^2]}{a^2}.\]

Since \(\{\bar{Y}_n \geq a^2\} = \{\max_k M_k^2 \geq a^2\} = \{\overline{M}_n \geq a\}\), the result follows.

Why this matters: The $L^2$ maximal inequality is used throughout Chapter 3 to show that stochastic integrals defined on dyadic times extend continuously to all times (the Kolmogorov continuity argument). It is also used in Chapter 4 to prove the OST under the $L^2$ boundedness condition (Theorem 1.3.3). Controlling the running maximum by the second moment at the final time is the key tool.
⚠️Common Misconception
Wrong: "Doob's maximal inequality says $P\{\overline{M}_n \geq a\} \leq E[\lvert M_n \rvert]/a$ for any martingale."
Correct: Theorem 1.7.1 applies to nonneg submartingales, not arbitrary martingales. For a general martingale $M_n$, we apply it to the submartingale $Y_n = \lvert M_n \rvert$ (since $|\cdot|$ is convex) to get $P\{\overline{M}_n \geq a\} \leq E[\lvert M_n \rvert]/a$. The $L^2$ version in Corollary 1.7.2 applies $Y_n = M_n^2$ and replaces $a$ by $a^2$, giving the sharper $1/a^2$ bound under the stronger $L^2$ hypothesis.

Part 4 — Worked Example: Verifying the Variance Rule

Variance rule for coin-tossing random walk Example

Setup: \(X_1, X_2, \ldots\) i.i.d. with \(P\{X_j = \pm 1\} = \tfrac{1}{2}\), so \(\sigma^2 = 1\). Let \(S_n = X_1 + \cdots + X_n\).

Integrand: \(J_j = S_{j-1}\) (the running sum just before step \(j\), which is \(\mathcal{F}_{j-1}\)-measurable ✓).

Integral: \(Z_n = \sum_{j=1}^n S_{j-1} X_j.\)

Apply the variance rule:

\[E[Z_n^2] = \sigma^2 \sum_{j=1}^n E[J_j^2] = \sum_{j=1}^n E[S_{j-1}^2] = \sum_{j=1}^n (j-1) = \frac{n(n-1)}{2}.\]

Here we used \(E[S_{j-1}^2] = j - 1\) (since \(\mathrm{Var}[S_{j-1}] = j-1\) for zero-mean i.i.d. increments with \(\sigma^2 = 1\)).

Direct check with Itô’s formula analogy: The integral \(Z_n = \sum S_{j-1} X_j\) is the discrete analogue of \(\int_0^t B_s\, dB_s\), which by Itô’s formula equals \(\tfrac{1}{2}(B_t^2 - t)\). The variance of \(\tfrac{1}{2}(B_t^2 - t)\) at time \(t\) is \(\tfrac{t^2}{2}\), consistent with \(n(n-1)/2 \approx n^2/2\) for large \(n\).

Key point: The variance rule allows computing $E[Z_n^2]$ without expanding the square and tracking all cross terms — orthogonality kills them all. The only computation needed is $E[J_j^2]$ for each $j$, which is a much simpler task.

Part 5 — The Three Properties as a Unified Blueprint

All three sections prepare the same three-property package that will recur throughout Chapters 3 and 4:

Property §1.6 Discrete version Chapter 3 Continuous version
Martingale \(Z_n = \sum J_j X_j\) is a martingale \(\int_0^t A_s\, dB_s\) is a martingale
Linearity \(\sum (aJ_j + bK_j) X_j = a Z_n^J + b Z_n^K\) \(\int (aA + bC)\, dB = a\int A\, dB + b \int C\, dB\)
Variance rule (Itô isometry) \(E[Z_n^2] = \sigma^2 \sum E[J_j^2]\) \(E\!\left[\left(\int_0^t A_s\, dB_s\right)^2\right] = \int_0^t E[A_s^2]\, ds\)
Maximal inequality \(P\{\overline{M}_n \geq a\} \leq E[M_n^2]/a^2\) \(P\{\sup_{s \leq t} \lvert M_s \rvert \geq a\} \leq E[M_t^2]/a^2\)

Term Glossary

Square integrable martingale Definition
A martingale $M_n$ with $E[M_n^2] < \infty$ for each $n$. Stronger than ordinary integrability ($E[\lvert M_n \rvert] < \infty$), weaker than uniform $L^2$ boundedness ($\sup_n E[M_n^2] \leq C$). The natural domain for the Pythagorean identity and the variance rule.
$L^2(\Omega, \mathcal{F}, P)$ Definition
The Hilbert space of square-integrable random variables on $(\Omega, \mathcal{F}, P)$, with inner product $(X,Y) = E[XY]$ and norm $\lVert X \rVert_2 = \sqrt{E[X^2]}$. The conditional expectation $E[Y \mid \mathcal{F}_n]$ is the orthogonal projection of $Y$ onto the closed subspace $L^2(\Omega, \mathcal{F}_n, P)$, minimising $E[(Y-Z)^2]$ over all $\mathcal{F}_n$-measurable $Z$.
Orthogonal random variables Definition
$X$ and $Y$ are orthogonal if $E[XY] = E[X]\,E[Y]$. For zero-mean variables this reduces to $E[XY] = 0$, i.e., $(X,Y) = 0$ in $L^2$. Independent zero-mean variables are orthogonal; the converse fails. Proposition 1.5.1 shows martingale increments $\Delta M_n$ are mutually orthogonal for $n \neq m$ using only the martingale property.
Pythagorean identity for martingales Property
$E[M_n^2] = E[M_0^2] + \sum_{j=1}^n E[(\Delta M_j)^2]$ for any square integrable martingale. Follows from orthogonality of increments: expanding $M_n^2 = (M_0 + \sum \Delta M_j)^2$ and taking expectations, all cross terms $E[\Delta M_j \cdot \Delta M_k]$ for $j \neq k$ vanish. This is the discrete analogue of the Itô isometry.
Predictable process Definition
A sequence $J_1, J_2, \ldots$ where $J_n$ is $\mathcal{F}_{n-1}$-measurable for every $n$. The value of $J_n$ is known strictly before time $n$. This is the allowable betting condition from §1.2, and the exact discrete analogue of the adapted condition for Itô integrands.
Discrete stochastic integral $Z_n = \sum_{j=1}^n J_j X_j$ Definition
The sum of a predictable integrand $J_j$ against the i.i.d. increments $X_j$ of a random walk. Satisfies three properties: (1) martingale; (2) linearity; (3) variance rule $E[Z_n^2] = \sigma^2 \sum E[J_j^2]$. The discrete prototype of the Itô integral $\int_0^t A_s\, dB_s$ in Chapter 3.
Variance rule (Itô isometry, discrete form) Property
$\mathrm{Var}[Z_n] = E[Z_n^2] = \sigma^2 \sum_{j=1}^n E[J_j^2]$. Proved by using: (a) orthogonality of martingale increments to kill cross terms; (b) the pull-out property to separate $J_j^2$ from $X_j^2$; (c) independence of $X_j$ from $\mathcal{F}_{j-1}$ to replace $E[X_j^2 \mid \mathcal{F}_{j-1}]$ by $\sigma^2$.
Doob's maximal inequality Theorem
For a nonneg submartingale $Y_n$ with running maximum $\bar{Y}_n = \max_{k \leq n} Y_k$: $$P\{\bar{Y}_n \geq a\} \leq \frac{E[Y_n]}{a}.$$ Proved by partitioning $\{\bar{Y}_n \geq a\}$ into the disjoint events $A_k = \{T = k\}$ (first time $Y$ hits $a$), and using the submartingale inequality $E[Y_n \mathbf{1}_{A_k}] \geq a\,P(A_k)$ for each $k$.
Doob's $L^2$ maximal inequality Theorem
For a square integrable martingale $M_n$ with $\overline{M}_n = \max_{k \leq n} \lvert M_k \rvert$: $$P\{\overline{M}_n \geq a\} \leq \frac{E[M_n^2]}{a^2}.$$ Follows from Theorem 1.7.1 applied to the submartingale $Y_n = M_n^2$ (convexity of $x^2$ makes $M_n^2$ a submartingale). Controls the running maximum by the terminal second moment — used in the Kolmogorov continuity argument and in the proof of OST III.