Data Compression and Flow-based Models

Notes from Week 2: Deep Unsupervised Learning, UC Berkeley

Compression (Pieter)

A prefix code is a type of code system (typically a variable-length code) distinguished by its possession of the “prefix property”, which requires that no whole code word in the system is a prefix (initial segment) of any other code word in the system. For example, $\{0, 10, 110, 111\}$ is a prefix code, while $\{0, 01\}$ is not, since $0$ is a prefix of $01$.

Entropy $H(X)$ is the expected encoding length when each symbol $x_i$ is encoded with $\log_2\frac{1}{p(x_i)}$ bits:

$$H(X) = \sum_i p(x_i)\log_2\frac{1}{p(x_i)}$$

Intuitively, entropy is the expected number of bits per symbol under this encoding scheme.
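
As a quick numerical sanity check (not from the notes), here is a minimal Python sketch of the entropy formula for a made-up four-symbol source:

```python
import math

def entropy(p):
    """Expected code length sum_i p_i * log2(1/p_i), in bits per symbol."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

# Example source: symbols with probabilities 1/2, 1/4, 1/8, 1/8 would ideally
# get codewords of length 1, 2, 3, 3 bits respectively.
p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))  # 1.75 bits per symbol
```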

For an order-0 Markov (i.i.d.) model $P(X)$, Shannon’s theorem states that any lossless compression scheme must use $\geq H(X)$ bits per symbol on average. Huffman later showed that an optimal prefix code achieves an expected length $\leq H(X)+1$.
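
To make the Huffman bound concrete, here is a minimal Huffman-code construction (my own sketch, not from the lecture) using Python's `heapq`; for the dyadic distribution below the expected codeword length exactly equals $H(X)$:

```python
import heapq

def huffman_code(probs):
    """Build an optimal prefix code for a {symbol: probability} dict."""
    # Each heap entry: (probability, unique tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        # Merge the two least-probable subtrees, prepending a bit to each codeword.
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code, avg_len)  # expected length 1.75 bits = H(X) for this dyadic distribution
```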

If we don’t know the actual distribution $p$ of the data being compressed and code with $\hat{p}$ instead, we incur an extra $KL(p \,\|\, \hat{p})$ bits on average.
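
A small numerical illustration of this extra cost (the distributions below are made up): coding data from $p$ with codeword lengths chosen for $\hat{p}$ costs the cross-entropy, which is $H(p) + KL(p\,\|\,\hat{p})$ bits on average.

```python
import math

def entropy(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits used when data ~ p is coded with lengths log2(1/q_i)."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

p     = [0.5, 0.25, 0.125, 0.125]   # true distribution
p_hat = [0.25, 0.25, 0.25, 0.25]    # model used for coding
extra = cross_entropy(p, p_hat) - entropy(p)   # = KL(p || p_hat)
print(extra)  # 0.25 extra bits per symbol
```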

The closer $\hat{p}$ is to $p$, the fewer bits we require to compress our data on average. However, if $p$ itself has high entropy, a better code cannot help; we have to reduce the entropy of the distribution we are coding against.

We can decrease this entropy by conditioning the distribution on previously seen datapoints, since conditioning never increases entropy on average: $H(X \mid \text{context}) \leq H(X)$.

This is exactly what autoregressive models aim to do.

In scenarios where $H(X)$ is small compared to the $+1$ bit that the Huffman code adds, we encode many symbols together so that the $+1$ overhead is paid less often. Run-length coding, used in fax machines, is one example of this.
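
A toy run-length coder, just to show the idea (real fax standards additionally entropy-code the run lengths):

```python
from itertools import groupby

def rle_encode(bits):
    """Turn a bit string like '0000011100' into (symbol, run_length) pairs."""
    return [(b, len(list(group))) for b, group in groupby(bits)]

def rle_decode(runs):
    return "".join(b * n for b, n in runs)

scanline = "0000000000111100000000"
runs = rle_encode(scanline)
print(runs)                      # [('0', 10), ('1', 4), ('0', 8)]
assert rle_decode(runs) == scanline
```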

Arithmetic coding avoids this $+1$ overhead by indexing into a distribution (it assumes the receiver knows the prior probabilities). Here, the interval $[0,1)$ is broken down into sub-intervals representing the probabilities of the possible messages; indexing into these sub-intervals gives uniquely decodable codes, as long as the decoder knows the probabilities.

Binary numbers correspond to intervals of all possible completions: $.00$ maps to $[0, 0.25)$, $.11$ maps to $[0.75, 1)$, and so on. It turns out that for an interval of size $s$, we can always find a codeword of length $\left\lceil \log_2\frac{1}{s} \right\rceil + 1$.
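
A sketch of the interval-narrowing step and the codeword-length bound, assuming a fixed three-symbol distribution known to both encoder and decoder (the bit-level bookkeeping of a real arithmetic coder is omitted):

```python
import math

# Fixed model shared by encoder and decoder: sub-intervals of [0, 1).
probs = {"a": 0.5, "b": 0.25, "c": 0.25}

def encode_interval(message, probs):
    """Narrow [low, high) once per symbol; the message maps to the final interval."""
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        cum = 0.0
        for s, p in probs.items():
            if s == sym:
                low, high = low + cum * width, low + (cum + p) * width
                break
            cum += p
    return low, high

low, high = encode_interval("aba", probs)
size = high - low                          # = 0.5 * 0.25 * 0.5 = 0.0625
bits = math.ceil(math.log2(1 / size)) + 1
print((low, high), bits)                   # some 5-bit codeword's interval fits inside
```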

We can integrate autoregressive models into this setup by using different probabilities at each interval-splitting step, namely the autoregressive model’s conditional probability estimates. However, the neural network itself can be a massive overhead, since the decoder has to have a copy of the model too.

The LZ algorithm is an example of an adaptive compression scheme: the estimated probability distribution keeps changing to better fit the data actually seen. It is also nonparametric and simply uses a sliding window over recently seen data to find matches for the upcoming text; a bare-bones version is sketched below.
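
A bare-bones LZ77-style encoder/decoder, only meant to show the sliding-window idea (real implementations such as DEFLATE add hashing and entropy-code the output):

```python
def lz77_encode(data, window=16):
    """Emit (offset, length, next_char) triples by matching against a sliding window."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):     # search the window of past text
            length = 0
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])                  # copy byte-by-byte from the window
        out.append(ch)
    return "".join(out)

text = "abcabcabcd"
triples = lz77_encode(text)
print(triples)                                     # [(0,0,'a'), (0,0,'b'), (0,0,'c'), (3,6,'d')]
assert lz77_decode(triples) == text
```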

Flows (Jonathan)

Density models work with continuous data and fit probability density functions instead of probability mass functions. We could use mixtures of Gaussians to estimate pdfs, but samples from such models tend to look like training datapoints with Gaussian noise added, which for images means poorly defined, blurry samples. Cumulative distribution functions, on the other hand, offer an invertible, differentiable representation of continuous data.

A flow is an invertible, differentiable function $f_{\theta}$ from $x$ to $z$. The idea is to work with flows instead of directly parameterizing pdfs, since flows make sampling easy. We fit flows with maximum likelihood.

A flow $f_{\theta}$ defines a distribution over $x$ via sampling: draw $z \sim p_Z(z)$ from a simple base distribution and set $x = f_{\theta}^{-1}(z)$.

By conservation of probability, $p_{\theta}(x) = p_Z(f_{\theta}(x))\left|\frac{\partial f_{\theta}(x)}{\partial x}\right|$: the more the flow ‘stretches’ space around a point (the larger the derivative there), the higher the density the model assigns to that point.

For a CDF flow, $z = F_{\theta}(x)$ is distributed as $\mathrm{Uniform}[0,1]$, and the change of variables gives $p_{\theta}(x) = \frac{dF_{\theta}(x)}{dx}$, i.e. the model’s density is just the derivative of its CDF.
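
A 1-D illustration of the last few points, using a made-up two-component Gaussian-mixture CDF as the flow (Gaussian CDF/pdf from `scipy.stats.norm`): the CDF maps $x$ to $z \sim \mathrm{Uniform}[0,1]$, the density is its derivative, and sampling inverts the flow, here by bisection.

```python
import numpy as np
from scipy.stats import norm

# Flow f(x): CDF of a two-component Gaussian mixture, mapping x -> z in [0, 1].
weights = np.array([0.3, 0.7])
means, stds = np.array([-2.0, 1.0]), np.array([0.5, 1.0])

def flow(x):      # z = f(x), monotone increasing, hence invertible
    return np.sum(weights * norm.cdf(x, loc=means, scale=stds))

def density(x):   # p(x) = df/dx: change of variables with a uniform base distribution
    return np.sum(weights * norm.pdf(x, loc=means, scale=stds))

def sample(n_steps=60):   # sample z ~ Uniform[0,1], then invert f by bisection
    z = np.random.rand()
    lo, hi = -10.0, 10.0
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if flow(mid) < z else (lo, mid)
    return 0.5 * (lo + hi)

print(flow(0.0), density(0.0), sample())
```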

For high-dimensional data, the sampling process transforms a small cube $dz$ into a parallelepiped $dx$, and probability is conserved: $p_{\theta}(x) = p_Z(f_{\theta}(x))\left|\det \frac{\partial f_{\theta}(x)}{\partial x}\right|$. Instead of a single derivative, we now need the determinant of the Jacobian matrix, which is often prohibitively slow to compute. To overcome this, we can use elementwise flows, whose Jacobians are diagonal and whose determinants are therefore easy to compute.
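
A sketch of the log-likelihood computation for an elementwise flow, here an assumed per-dimension affine map with a standard-normal base distribution; the Jacobian is diagonal, so its log-determinant is a simple sum.

```python
import numpy as np

def elementwise_flow_logprob(x, mu, log_sigma):
    """log p(x) for the elementwise flow z_i = (x_i - mu_i) / sigma_i
    with a standard-normal base distribution over z.
    The Jacobian dz/dx is diagonal, so log|det| = -sum(log_sigma)."""
    z = (x - mu) * np.exp(-log_sigma)
    log_base = -0.5 * np.sum(z**2 + np.log(2 * np.pi))  # log N(z; 0, I)
    log_det = -np.sum(log_sigma)                         # diagonal Jacobian
    return log_base + log_det

x = np.array([0.3, -1.2, 2.0])
mu, log_sigma = np.zeros(3), np.log(np.array([1.0, 2.0, 0.5]))
print(elementwise_flow_logprob(x, mu, log_sigma))
```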

We can also compose many flows to get a new flow, and the log-determinants of the Jacobians simply add up. This gives us a very easy way to increase expressiveness.
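
Composition in code (schematic; each flow is assumed to return its output together with its log |det Jacobian|, so that the log-determinants can just be added):

```python
import numpy as np

def scale_flow(s):
    """Elementwise scaling flow z = s * x, with log|det Jacobian| = sum(log|s|)."""
    return lambda x: (s * x, np.sum(np.log(np.abs(s))))

def compose(x, flows):
    """Apply flows in sequence; the log-determinants of the Jacobians add up."""
    total_logdet = 0.0
    for f in flows:
        x, logdet = f(x)
        total_logdet += logdet
    return x, total_logdet

x = np.array([1.0, -2.0])
z, logdet = compose(x, [scale_flow(np.array([2.0, 3.0])),
                        scale_flow(np.array([0.5, 0.5]))])
print(z, logdet)   # z = [1.0, -3.0], logdet = log(6) + log(0.25) = log(1.5)
```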

RealNVP - A flow-based model that uses affine coupling layers: the data is split into two halves, the first half is passed through an identity flow, and the second half undergoes an affine transformation whose parameters are computed from the first half. This gives rise to a triangular Jacobian, whose determinant is just the product of its diagonal entries. Coupling layers allow unrestricted neural nets to be used inside flows while preserving invertibility and tractability.
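
A numpy sketch of a single affine coupling layer in the RealNVP spirit (the `conditioner` here is a toy stand-in for the unrestricted neural net, and the real model alternates which half is passed through across layers):

```python
import numpy as np

def conditioner(x1):
    """Stand-in for the unrestricted neural net; returns (log_scale, shift).
    Invertibility of the layer never requires inverting this function."""
    return np.tanh(x1), x1 ** 2

def coupling_forward(x):
    x1, x2 = np.split(x, 2)
    log_s, t = conditioner(x1)
    z1 = x1                              # first half: identity
    z2 = x2 * np.exp(log_s) + t          # second half: affine, conditioned on x1
    log_det = np.sum(log_s)              # triangular Jacobian: product of diagonal entries
    return np.concatenate([z1, z2]), log_det

def coupling_inverse(z):
    z1, z2 = np.split(z, 2)
    log_s, t = conditioner(z1)           # z1 == x1, so the same parameters are recovered
    x2 = (z2 - t) * np.exp(-log_s)
    return np.concatenate([z1, x2])

x = np.array([0.5, -1.0, 2.0, 0.1])
z, log_det = coupling_forward(x)
assert np.allclose(coupling_inverse(z), x)
print(z, log_det)
```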

Written on October 6, 2019