Statistical Distances and Their Implications to GAN Training

Max Daniels
Northeastern University



Some figures in this article are interactive: you can manipulate them by tapping, clicking, and using the arrow keys.

Introduction

Generative Adversarial Networks are a class of neural architectures that can produce high quality samples of structured, high dimensional data. In particular, GANs make the assumption that this data is controlled by a small number of "factors of variation," which themselves can be modeled with a simple parametric distribution [1]. During training, they learn a map from the factors of variation to the data space – like predicting rain given the humidity and temperature.

At its core, the training process relies on defining some notion of distance between a partially trained model and a hypothetical "optimal model" that can connect the factors of variation to the data perfectly¹. Minimizing this distance implies bringing the two closer together, so that eventually, you have a nearly perfect model. The feasibility of this process at scale depends on how we choose to determine this distance – for example, we could decide that two models have distance zero if they are exactly the same and infinite distance otherwise. In this case, if we don't already know the optimal model, this distance is neither computable nor informative.

Instead, this article will explore some of the more creative ways to define distances for GAN training, focusing in particular on the differences between information-theoretic distance functions and the Wasserstein distance. Wasserstein distance has recently been proposed as an improvement over information-theoretic distances, one that can improve training stability and output quality in regimes typical of common GAN applications, so it is important to understand the intuition behind why this is the case.

What does it mean to have a Statistical Distance?

The GAN itself is a map, and since the input to this map is a random variable, so is its output. The data itself is also a random variable, since there are certain outcomes that are fairly likely ("heavy rain" or "a slight drizzle") and others that are impossible ("raining cats and dogs"). So, to determine the distance between a given GAN and the optimal model, we're really asking about the distance between two probability distributions.

As a simple example, we can consider distributions in the form of a histogram, where each column represents an event and the boxes in a column represent the relative likelihood of that event. One of the simplest ways to compare histograms is by the total bin-wise difference (the $\ell_1$ distance between them): $$D(x, y) = \sum_{b \in \text{bins}} \vert b_x - b_y \vert$$

Here, we simply use the magnitude of difference between each bin to say how far apart the distributions are. In the figure, this corresponds to the length of each line above or below a bin.

The grey markers indicate where these two distributions differ. One can compare the two distributions by summing the total line length.

When each bin matches exactly, the distance is zero, since the distribution over bins is the same. More importantly, if we make a small change to distribution $X$ that brings it closer to $Y$, by moving a box from one bin to another for example, this distance will reflect that change and decrease. We can use the distance as a guide to iteratively update $X$ until it matches $Y$. This is the approach that GANs take when approximating $P_d$, the true distribution over data, by the generator distribution $P_{\theta}$. The parameters $\theta$ are like the boxes and the statistical distance function is the guide for manipulating $\theta$ to make $P_{\theta}$ faithful to $P_d$.
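To make the formula concrete, here is a minimal NumPy sketch of the bin-wise distance for two made-up histograms over the same ten bins (the box counts are purely illustrative):

```python
import numpy as np

# Two made-up histograms over the same ten bins (heights are box counts).
x = np.array([1, 2, 4, 6, 5, 3, 2, 1, 1, 1])
y = np.array([1, 1, 2, 4, 6, 5, 3, 2, 1, 1])

# Total bin-wise difference: sum over bins of |b_x - b_y|.
D = np.abs(x - y).sum()
print(D)   # 10 boxes' worth of difference between the two histograms
```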

This bin-wise distance is good for illustrating the concept, but it isn't used in practice. The original GAN formulation was shown to reduce an upper bound of the Jensen–Shannon Divergence between $P_d$ and $P_{\theta}$, an information-theoretic distance between distributions based on the idea of entropy. These GANs are notoriously difficult to train and typically require a lot of fine tuning to be implemented successfully. Within the last two years, there has been increasing interest in Wasserstein distance as a more stable alternative.

Entropy

The Jensen–Shannon Divergence (JSD) between two distributions is a distance function that has its roots in information entropy. To understand JSD, we must first introduce the notion of an "information channel." An observer Bob has an open channel to Alice, so that he can tell her about samples that he draws from a probability distribution $P$. For simplicity, we assume that Bob is communicating with bits and that he and Alice have agreed ahead of time on their choice of codes. The intuition is that two distributions are similar when you can communicate their events in similar ways. Under this abstraction, entropy is the minimum number of bits on average that Bob needs to tell Alice about which samples he observes from $P$.

This quantity depends completely on the "shape" of $P$ and in turn, it serves as a representation of that shape. For example, under a distribution where only two equally likely events can happen, a single bit is enough for unambiguous communication, so the entropy is $1$.

Since there are only two possible events in this distribution, its events can be described with a single bit.

As a general rule of thumb, any distribution where most of the probability mass is consolidated around relatively few events has lower entropy than a distribution with evenly distributed probability mass.
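As a quick numerical check of this rule of thumb, here is a small sketch (with made-up probabilities) comparing a uniform distribution over eight events to a peaked one:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability events."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

uniform = np.full(8, 1 / 8)                     # evenly spread mass
peaked  = np.array([0.70, 0.10, 0.05, 0.05,     # mass consolidated on one event
                    0.04, 0.03, 0.02, 0.01])

print(entropy(uniform))  # 3.0 bits
print(entropy(peaked))   # ~1.64 bits, lower than the uniform case
```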

Example distributions and their entropies.

To understand why, we can explore the process behind constructing an optimal code. First, since entropy is an average over codes for different events, it's helpful to start with a special case where constructing the optimal codes is easy: the uniform distribution over events $x_i$. Then, interpreting $P(x_i)$ as a ratio $\frac{s_{x_i}}{N}$ of successes to trials, we see that $\frac{1}{P(x_i)} = \frac{N}{s_{x_i}}$ is the number of unique outcomes that we need to distinguish with our code². Each outcome corresponds to all occurrences of a particular $x_i$, which we assume happens with frequency $s_{x_i}$ out of $N$ trials.

To distinguish between the events, we want at least $N^* = \frac{1}{P(x_i)}$ unique binary codes to assign to different events. These codes can be visualized as paths in a binary decision tree:

Paths in a binary tree are equivalent to binary codes. Each $1$ is equivalent to a path taking the right branch and each $0$ corresponds to a path which takes the left branch.

Each layer of the tree represents one bit and each leaf corresponds to the binary code for the path to that leaf. Adding a layer increases the number of leaves exponentially, so we need a minimum of $d = \log_2 N^* $ layers to guarantee at least $N^*$ unique codes.
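For a concrete instance, suppose there are $N^* = 8$ equally likely events. Then

$$ d = \log_2 N^* = \log_2 8 = 3 $$

layers suffice, giving the eight codes $000, 001, \dots, 111$, one per event.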

This means that we can trivially construct an optimal coding when all events have the same likelihood. According to the rule of thumb, this case has the highest entropy of all distributions with $N^*$ events. So, we can now consider the effect of introducing a non-uniform distribution over the events and verify that this is indeed the case. At this point, we will introduce the notation $x^{(1)}_i$, $x^{(2)}_i$, and so on. The upper indices identify a set $x^{(k)}$ of unique events, each having the same probability. We will build an optimal code for each of the different $x^{(k)}_i$ iteratively, where the probability of the $x^{(k)}$ will determine the point at which those events have their codes assigned.

We start with the least likely event, say $x^{(1)}_1$. We want to match this event with a unique path to a leaf of the tree, in a way that allocates an optimal number of bits to that event. As in the previous example, we can start with a uniform distribution of copy events $x^{(1)}_i$ and assign codes with a decision tree of depth $\log_2 \frac{1}{P(x^{(1)})}$. Then, to represent an event more likely than $x^{(1)}$ – say, $x^{(2)}_1$ – we can simply merge a few of the copies $x^{(1)}_i$, so that a path to any of them is an equivalent way to communicate $x^{(2)}_1$. This is advantageous whenever we can find sibling copies to merge, since it allows us to save a bit by opting not to distinguish those siblings at all.

Replacing two sibling $x^{(1)}$ events with an $x^{(2)}$ event eliminates a bit.

Iterating this process, we can create a leaf for any $x^{(k)}$ event by merging two $x^{(k-1)}$ leaves. From the bottom up, this means starting with a tree of all $x^{(1)}$ leaves and merging as many as possible, keeping only enough $x^{(1)}$ leaves to represent the $x^{(1)}$ events in $P$. Then, in the next layer, the algorithm freezes leaves for any $x^{(2)}$ events in $P$, merging the leftover nodes. This repeats until all $x^{(k)}_i$ have a place in the tree. In the diagram below, we can see all the merged nodes for events in the histogram, along with the fully colored nodes representing actual $x^{(k)}_i$ leaves.

The fully colored nodes represent leaves at which an event is unambiguously specified in the tree. Assigning shorter paths to more common events reduces the expected number of bits for communication.

This procedure is called the Huffman Algorithm and it can be used to efficiently construct optimal codes for the events in $P$. Whenever all the probabilities are of the form $\frac{1}{2^k}$, the tree can always be constructed perfectly, in the sense that a node for $x^{(n)}$ can be constructed out of repeated merges having probability exactly $P(x^{(n)})$. Then, the path length to $x^{(n)}$ is always $d(x^{(n)}) = \log_2\left(\frac{1}{P(x^{(n)})}\right)$, which will be an integer quantity. More realistically, the probabilities won't line up exactly, so $d(x^{(n)})$ will be a real quantity that bounds the optimal attainable path length from below. Despite this, Huffman codes generate the optimal attainable coding for individual events³ and the lower bound on the expected number of bits transferred is the entropy, $H$: $$H(P) = \mathbb{E}_{x \sim P}\left[\log_2{\frac{1}{P(x)}}\right]$$
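Here is a minimal sketch of this bottom-up construction, using a heap to always merge the two least likely nodes. The probabilities are hypothetical and chosen to be of the form $\frac{1}{2^k}$ so the resulting code lengths come out exact:

```python
import heapq

def huffman_code(probs):
    """Build a prefix code by repeatedly merging the two least likely nodes.

    `probs` maps events to probabilities; returns a dict of binary code strings.
    """
    # Each heap entry: (probability, tiebreaker, {event: partial code}).
    heap = [(p, i, {x: ""}) for i, (x, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, i, c2 = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code inside them.
        merged = {x: "0" + code for x, code in c1.items()}
        merged.update({x: "1" + code for x, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, i, merged))
    return heap[0][2]

# Probabilities of the form 1/2^k, so each code length equals log2(1/p) exactly.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
codes = huffman_code(P)
print(codes)                                    # e.g. {'a': '0', 'b': '10', ...}
print(sum(P[x] * len(codes[x]) for x in P))     # 1.75 bits, which equals H(P)
```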

Kullback–Leibler, Cross Entropy, and Jensen–Shannon Divergence

Once you understand entropy as a characteristic of probability distributions, the cross entropy, Kullback–Leibler Divergence, and Jensen–Shannon Divergence are all short steps away. We can start with the typical definition of KL Divergence:

$$ \begin{align} D_{KL}(P \vert \vert Q) & = - \sum_{x} P(x) \log_2 \frac{Q(x)}{P(x)} \\ \end{align} $$

We can coerce this equation into one which aligns with the path-length understanding of entropy:

$$ \begin{align} D_{KL}(P \vert \vert Q) & = - \sum_{x} P(x) \log_2 \frac{Q(x)}{P(x)} \\ & = \sum_{x} P(x) (\log_2 \frac{1}{Q(x)} - \log_2 \frac{1}{P(x)}) \\ & = \sum_{x}{P(x)(d_q(x) - d_p(x))} \end{align}$$

For each event $x$, $d_q(x) - d_p(x)$ is the difference in the code length assigned to $x$ under distributions $P$ and $Q$. The KL Divergence is the expected difference over all events $x$, under the distribution $P$⁴. This brings us back to the information-theoretic perspective on distances between distributions: $P_d$ and $P_{\theta}$ are similar when you can communicate their events in similar ways. More specifically, they are similar when we expect their codes for each event to have similar lengths.
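The path-length form of the sum translates directly into code. Below is a small sketch computing $D_{KL}$ for two made-up histograms:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits: the expected extra code length under P."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with P(x) = 0 contribute nothing
    d_p = np.log2(1 / p[mask])        # "path length" of x under P's code
    d_q = np.log2(1 / q[mask])        # "path length" of x under Q's code
    return float((p[mask] * (d_q - d_p)).sum())

P = np.array([0.5, 0.25, 0.125, 0.125])
Q = np.array([0.7, 0.1, 0.1, 0.1])
print(kl_divergence(P, Q))   # ~0.17 bits
print(kl_divergence(Q, P))   # ~0.14 bits: the two directions differ
```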

It's important to note that this distance is asymmetric, so it would be more accurate to characterize it as the extra number of bits we expect to send by communicating events from $P$ using codes from $Q$. In fact, KL is also known as the relative entropy between $P$ and $Q$. Cross entropy is just the total expected number of bits when communicating events from $P$ using the code for $Q$. Defining it in terms of $D_{KL}$ is simple:

$$H(P, Q) = H(P) + D_{KL}(P \vert \vert Q)$$

This is the entropy of $P$, plus the extra bits we have to send when the code for $Q$ is not optimal for $P$. Minimizing either of these differences forces $P$ to be similar to $Q$, which is exactly what we need to train a GAN. You can try minimizing KL yourself by manipulating the left distribution in the figure below:

KL Divergence is the expected difference in path lengths for items in each tree. Select an event to see its path in the Huffman Tree.

Jensen–Shannon Divergence

The last measurement, the Jensen–Shannon Divergence, is intended to solve a numerical stability problem shared by Cross Entropy and KL, namely that neither is defined when $P(x)$ or $Q(x)$ is $0$. When $P(x) = 0$, a simple solution is to just omit $x$ from the sum, as though $(0)(\log_2(0)) = 0$⁵. But if $Q(x) = 0$, this trick doesn't apply.

And, following the intuition behind KL, it shouldn't. If we want to know how much further in $Q$'s tree we have to travel to find some event $x$ in $P$'s tree, there is no good answer in this case, because $x$ doesn't exist in $Q$'s tree. JSD solves this problem by introducing a new distribution, $P_m(x)$, that averages over the two. This guarantees that $P_m(x) \not = 0$ whenever either $Q(x) \not = 0$ or $P(x) \not = 0$. The quantity itself is also an average, over the KL Divergences $D_{KL}(P \vert \vert P_m)$ and $D_{KL}(Q \vert \vert P_m)$.

$$ P_m(x) = \frac{P(x) + Q(x)}{2} \\ JSD(P, Q) = \frac{1}{2}(D_{KL}(P \vert \vert P_m) + D_{KL}(Q \vert \vert P_m)) $$
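A minimal sketch of this definition, using two made-up histograms with disjoint support to show that JSD stays finite where KL would not:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

def jsd(p, q):
    """Jensen–Shannon divergence in bits, defined even where P or Q is zero."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2                     # P_m is nonzero wherever P or Q is
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = np.array([0.5, 0.5, 0.0, 0.0])
Q = np.array([0.0, 0.0, 0.5, 0.5])      # disjoint support: KL(P || Q) would blow up
print(jsd(P, Q))                        # 1.0 bit, the maximum possible value
```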

Entropy is a very elegant idea for comparing two distributions and JSD is the band-aid fix that makes it robust to edge cases. Cross entropy and KL are also often used in classification tasks, where it's easy to define discrete events for the domain of the network's output distribution and it's reasonable to assume that none of those events have exactly $0$ probability. As we will see later, JSD still has an Achilles' heel: even though it is defined everywhere, its gradient may not always be useful.

Transport

Like entropy, Wasserstein distance relies on an abstraction to determine the distance between probability distributions, and in this case, that abstraction is the "transport plan." Simply put, a transport plan is a literal blueprint for turning one distribution into another – for each event in the source distribution, it defines the locations and quantities in the target distribution to which some amount of "mass" should be transported.

A transport plan defines how one probability distribution can be transformed into another.

In other words, the transport plan gives each of the source events its own distribution over the target events, shown in black in the diagram above. It may not be obvious at first, but these types of transport plans are exactly equivalent to joint probability distributions for which the source and target distributions are the marginals. Here's the same transport plan, but this time represented as a joint distribution together with its marginals.

A transport plan from distribution $P$ to $Q$ is equivalent to a joint distribution with marginals $P$ and $Q$.

For each cell, its location $(i, j)$ indicates that the cell corresponds to source event $x_i$ and target event $x_j$. The joint distribution assigns a value $p(i, j) = s$, which can be interpreted as a transfer of $s$ units of "probability mass" from $x_i$ to $x_j$.

Before we can use a transport plan as a distance function, we need a way to squeeze a number out of it that represents the actual distance quantity. The most natural way is to assign a cost as each element is transported, proportional both to the amount of mass being moved and to the distance it is moved over. Think of the actual costs involved in, for example, moving heavy boxes across the city – every mile traveled costs a certain amount of gas and if you take one trip per box then the cost is multiplied by the total number of boxes. Provided that one can actually define a distance $d(x_i, x_j)$ between statistical events, it can be used to define the cost of a transport plan in the same way:

$$ C = \sum_{x_i, x_j} d(x_i, x_j) p(x_i, x_j) $$

Of course, this distance has more to do with a particular transport plan than it does with either of its marginal distributions. In order to define the Wasserstein distance for probability distributions, we add the condition that $W(P, Q)$ is the minimum cost over all possible transport plans that convert $P$ to $Q$. This makes the quantity representative of the similarities between $P$ and $Q$, similar to how entropy becomes a meaningful value under the stipulation that Bob and Alice communicate to each other using optimal codes. So, the full formulation of Wasserstein distance is given by:

$$ W(P, Q) = \inf_{p \in \Gamma(P, Q)} [ \sum_{x_i, x_j} d(x_i, x_j) p(x_i, x_j) ] $$

where $\Gamma(P, Q)$ is the set of all joint distributions whose marginals are $P$ and $Q$.
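For small discrete distributions, the infimum over $\Gamma(P, Q)$ can be computed directly as a linear program. Here is a sketch using SciPy, with made-up marginals over five bins and $d(x_i, x_j) = \vert i - j \vert$:

```python
import numpy as np
from scipy.optimize import linprog

# Two made-up distributions over 5 bins at integer positions 0..4.
P = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
Q = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
n = len(P)

# Cost of moving mass from bin i to bin j: the distance |i - j|.
cost = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).ravel()

# Equality constraints: the joint p(i, j) must have marginals P (rows) and Q (columns).
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1          # sum_j p(i, j) = P(i)
    A_eq[n + i, i::n] = 1                   # sum_i p(i, j) = Q(j)
b_eq = np.concatenate([P, Q])

# Minimize the transport cost over all joint distributions in Gamma(P, Q).
res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)   # the Wasserstein distance W(P, Q), 2.0 for these marginals
```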

For comparison, we can take the same two distributions in the previous example and compute an optimal transport plan between them by searching over $\Gamma(P, Q)$.

An optimal transport plan for the given two distributions.

This plan has a total cost of $30$ box-columns, whereas the original plan costs $70$ box-columns. Under the optimal plan, we see that the nonzero cells in the joint distribution are clustered around the diagonal, since minimizing the transportation distance means bringing $i$ as close to $j$ as possible.

As a bonus, the Wasserstein distance is more robust than the entropy-based measurements. If $P(x_i) = 0$, then the transport plan just doesn't move any mass away from $x_i$, and if $Q(x_j) = 0$, the plan just doesn't move any mass to $x_j$. In either case, the only effect is that $p(x_i, x_j) = 0$ for all $x_i$ with $P(x_i) = 0$ and all $x_j$ with $Q(x_j) = 0$. This is a general property of joint distributions.

Manifold Learning: The Pathological Case

Earlier, we assumed that typical data tends to depend on relatively few factors of variation compared to its actual dimensionality. In other words, we assume most data lies on a low dimensional manifold. We can now analyze the expected behavior of the Jensen–Shannon Divergence and Wasserstein distance in this setting.

First, let's consider what a simple manifold distribution might actually look like. Consider a distribution $P$ in $\mathbf{R}^2$ that assigns uniform probability to points in the closed line segment starting at $a$ and ending at $b$. This distribution lies on a $1$-dimensional manifold in $\mathbf{R}^2$ – if we want, we can parametrize the line segment as $f_P(t) = t a + (1-t) b$, for $t \in [0, 1]$, so that $t$ is our single factor of variation. Then, the probability of drawing a point on the segment from $f_P(t_1)$ to $f_P(t_2)$ is just $\frac{\vert t_2 - t_1 \vert}{1} = \vert t_2 - t_1 \vert$.

The distribution $P$ represents $P_d$, the distribution we want to match. For $P_{\theta}$, we introduce a second manifold distribution, $Q$. Like $P$, we will assume $Q$ is uniform over a line segment, this time connecting the points $u$ and $v$. This distribution has its own parametrized line segment, $f_Q(t) = t u + (1-t) v$.

Because these two distributions lie on $1$-dimensional manifolds, they obviously can't have an intersection with more than $1$ dimension. The intersection can only be $1$-dimensional when $a$, $b$, $u$, and $v$ all lie on the same line and the two segments overlap. Otherwise, even if they're just slightly misaligned, the intersection is either a single point or the empty set.

Two low dimensional manifold distributions can have an even lower dimensional intersection, which is trivial.

The effect is that any misalignment causes either $P(x)$ or $Q(x)$ to be $0$ at almost all points, except for the potential singleton point where $f_P(t) = f_Q(s)$. In this case, JSD attains its maximum value, $\log_2(2) = 1$, so that the distributions are "as far apart as possible." But any incremental update from $Q$ to a new distribution $Q'$ that can't immediately fix the misalignment will also have $JSD(P, Q') = \log_2(2)$, which means that the gradient of JSD becomes $0$. It no longer gives any information about how to incrementally bring $Q$ to $P$, so for manifold distributions JSD becomes as helpful as the binary distance.

This is the Achilles' heel of JSD, mentioned earlier. Even though JSD fixes the numerical instabilities of KL, its relationship to entropy means that two distributions whose events intersect only negligibly are, where entropy is concerned, as different as they can possibly be. Meanwhile, as long as the Euclidean distance between events is defined, this case is nothing special for Wasserstein distance.
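This behavior is easy to reproduce numerically with discretized stand-ins for the two segments. The sketch below (the grid and offsets are made up) shows that Wasserstein distance tracks the misalignment while JSD stays pinned at its maximum:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def jsd(p, q):
    """Jensen–Shannon divergence in bits between two histograms."""
    m = (p + q) / 2
    kl = lambda a, b: float((a[a > 0] * np.log2(a[a > 0] / b[a > 0])).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# P is uniform over 11 grid points on a segment; Q is the same segment shifted
# by eps (chosen so the two supports never share a point).
grid = np.linspace(0, 1, 11)
uniform = np.full(11, 1 / 11)
for eps in [0.55, 0.05, 0.005]:
    w = wasserstein_distance(grid, grid + eps)      # equals eps here
    # Histograms over the union of the two (disjoint) supports.
    P = np.concatenate([uniform, np.zeros(11)])
    Q = np.concatenate([np.zeros(11), uniform])
    print(f"eps = {eps:<6}  W = {w:.3f}   JSD = {jsd(P, Q):.3f} bits")
# W shrinks with the misalignment; JSD stays pinned at 1 bit throughout.
```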

How does this distinction affect training of generative models?

When discussing entropy, we were able to build a rich understanding of the measurement by constructing the Huffman tree and were in turn able to derive closed forms for Entropy, KL, and JSD. Unfortunately, Wasserstein distance is much more difficult to compute exactly and there is no analogous closed form that we can use to compute the distance between $P$ and $Q$ directly.

Even if there were such a formula, there is another problem that has been glossed over up to this point: calculating the distance between $P_{\theta}$ and $P_d$ requires knowledge of $P_d(x)$, which we would typically not be able to calculate exactly. This in turn prevents us from calculating both the distance $d(P_{\theta}, P_d)$ and its gradient. No matter the computation power, resource access, or network capacity, the key problem in training a GAN at scale is constructing a training scheme that can circumvent this calculation and still minimize a distance between $P_{\theta}$ and $P_d$.

In the original and now famous method, a discriminator network $D(x)$ has the goal of predicting whether a given sample was generated by $P_d$. It is trained to maximize the joint log-likelihood of predicting false on $x \sim P_{\theta}(x)$ and true on $x^* \sim P_d(x)$, while the generator network $G(z)$ that induces $P_{\theta}$ is simultaneously trained to maximize the likelihood that its samples $x \sim P_{\theta}(x)$ are predicted as true [1].

The paper shows that, assuming $D(x)$ is optimal, this log likelihood loss is equivalent to JSD up to a constant. That assumption isn't true during training, but partially optimizing the discriminator gives a good enough approximation that the generator can approach $P_d$.

The more recent WGAN paper presents a similar training methodology, in which training steps applied to the discriminator turn it into a differentiable approximation of the Wasserstein distance between $P_{\theta}$ and $P_d$ [2].
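To make the contrast concrete, here is a heavily simplified PyTorch-style sketch of the two objectives. The networks, batch size, and data distribution are toy placeholders for illustration, not the architectures or training loops used for the animations below:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the generator and discriminator/critic.
latent_dim, data_dim, batch = 2, 2, 64
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

z = torch.randn(batch, latent_dim)
x_fake = G(z)                                    # samples from P_theta
x_real = torch.randn(batch, data_dim) * 0.5 + 1  # stand-in samples from P_d

# Original GAN: D is a classifier trained on the joint log-likelihood;
# G is trained to make its samples be classified as "true".
bce = nn.BCEWithLogitsLoss()
d_loss = bce(D(x_real), torch.ones(batch, 1)) + bce(D(x_fake.detach()), torch.zeros(batch, 1))
g_loss = bce(D(x_fake), torch.ones(batch, 1))

# WGAN: D acts as a critic whose score gap approximates the Wasserstein
# distance between P_theta and P_d.
critic_loss = -(D(x_real).mean() - D(x_fake.detach()).mean())
g_loss_w = -D(x_fake).mean()

# WGAN constrains the critic to be (approximately) Lipschitz, e.g. by
# clipping its weights after each update.
for p in D.parameters():
    p.data.clamp_(-0.01, 0.01)
```

In the full training loops, both formulations alternate discriminator and generator updates; the WGAN paper additionally updates the critic several times per generator step.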

As a control example, we can observe the training of a GAN and a WGAN on "nice" distributions, in this case a fixed Gaussian distribution. The following animation plots the output statistics of points sampled from two separate generator networks with the same architecture, as the left generator is trained to minimize the approximate upper bound of the Wasserstein distance and the right is trained to minimize the approximate upper bound of the Jensen–Shannon Divergence.

Minimizing Jensen–Shannon Divergence

Minimizing Wasserstein Distance





As we can see, both generators have a fairly easy time learning the target distribution. But if we instead try to fit the generators to a $1$-dimensional manifold embedded in $2$ dimensions, we see that minimizing the Wasserstein upper bound is successful where minimizing the JSD upper bound is not.

Minimizing Jensen–Shannon Divergence

Minimizing Wasserstein Distance





When minimizing JSD, the generator output stays in its initial distribution throughout optimization, because no change to the generator's parameters can reduce the estimated JSD. Technically, this is intentional – the optimization makes no changes because it has already found a local minimum, which is any point in parameter space where the generator distribution has a trivial intersection with the true distribution. This is not a flaw in JSD so much as a demonstration that JSD can break when it is applied outside the setting it was designed for.

This effect has been discussed in various treatments as a thought experiment and it is especially well illustrated in [2], Example 1. The animations above were generated directly by training simple neural networks on synthetic data as an empirical demonstration. The scripts to do so are available in this article's repository.

Conclusion

As we have seen, statistical distances play a key role in GAN training by giving information about how to incrementally shift $P_{\theta}$, a GAN output distribution, into $P_d$, the data generating distribution. In nice cases, where both distributions assign probability to significantly overlapping events, both JSD and Wasserstein distance can fill this role. But, under the assumption that typical data lies on a manifold, having relatively few factors of variation with respect to the data dimensionality, JSD is no longer able to provide useful gradients.

In theory, this should make many of the most impressive demonstrations of GAN technology impossible. Instead, because of both the noise introduced in minimizing an upper bound of JSD and the many tricks applied to reduce training instability, traditional GAN training is just difficult, but possible.

At the same time, theory seems to suggest that Wasserstein distance gives improved training stability for free, which is also not borne out empirically. The proposed framework for minimizing an approximated upper bound requires severe constraints to be imposed on the discriminator, so finding alternative methods to minimize Wasserstein distance is an open area for exploration⁶.

For the interested reader, we provide a few recommended resources for digging deeper into the information presented here. We also recommend two papers [4, 5], which provide a more general view of the Wasserstein and JSD distances as special cases of Integral Probability Metrics and $f$-divergences respectively. Finally, we include a reference to the recent BEGAN architecture [3], which is a good example of a design that avoids imposing harmful constraints on the discriminator.


Footnotes

  1. The term "distance" is used here as a general term, encompassing statistical distances, divergences, various kinds of entropy, and metrics in the formal sense.
  2. Whether or not probabilities should be interpreted as ratios is contentious and the Bayesian view of statistics casts probabilities as degrees of certainty instead. Here, the frequentist view is just convenient for explanation.
  3. There is an alternative, called Arithmetic Coding, which encodes strings of events instead of individual events.
  4. It is natural to ask whether the KL divergence could be negative, since any individual event might have a shorter code in $Q$'s tree than in $P$'s tree. If this were the case, then transmitting events drawn from $P$ with codes from $Q$ would require fewer bits on average than transmitting events using $P$'s own code. But we assume that events from $P$ are transmitted optimally, so this is impossible.
  5. $0 \log_2(0)$ is undefined on its own, since $\log_2(0)$ is not a real number. But $ \displaystyle \lim_{x \to 0^+} x \log_2(x) = 0$, so treating it as $0$ is still a sensible choice.
  6. In particular, the discriminator must approximate a Lipschitz continuous function, so Lipschitz continuity must be enforced as a constraint. Arjovsky's WGAN formulation accomplishes this by clipping the discriminator's weights.

Citations

  1. Generative adversarial nets
    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014
  2. Wasserstein generative adversarial networks
    Martin Arjovsky, Soumith Chintala, and Léon Bottou. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR
  3. BEGAN: boundary equilibrium generative adversarial networks
    David Berthelot, Tom Schumm, and Luke Metz. CoRR, abs/1703.10717, 2017
  4. A note on integral probability metrics and $\phi$-divergences
    Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Gert R. G. Lanckriet, and Bernhard Schölkopf. CoRR, abs/0901.2698, 2009.
  5. f-GAN: Training generative neural samplers using variational divergence minimization
    Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 271–279. Curran Associates, Inc., 2016.
  6. Elements of Information Theory.
    Thomas M. Cover and Joy A. Thomas. (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006
  7. Optimal Transport: Old and New.
    C. Villani. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008
  8. Probability and random processes
    G.R. Grimmett and D.R. Stirzaker. volume 80. Oxford university press, 2001.

Further Reading

For learning more about the material in this article, the beginning chapters of [8] are a great introduction to formal probability theory, which underlies analysis of Wasserstein distance. Wasserman's notes (below, see other articles) give a clear and concise introduction at this level.

Here are other examples of fantastic articles on entropy and Earth Mover's (Wasserstein) distance.

  1. Optimal Transport and Wasserstein Distance by Larry Wasserman
  2. Wasserstein GAN and the Kantorovich-Rubinstein Duality by Vincent Herrmann
  3. Visual Information Theory by Chris Olah

Thanks

Thanks to Karthik Boyareddygari, Ilze Greeves, Ben Knower, Akash Srivastava, Hendrik Strobelt, and Caroline Zeng, without whom this article would not have been created. Thanks also to my reviewers for their helpful comments.