## Project on Dictionary learning


**Problem**

Consider a set of signals $\{Y_1, Y_2, \dots, Y_L\} \subset \mathbb{R}^N$. Suppose that there exists an overcomplete dictionary $D \in \mathbb{R}^{N \times M}$ with $M \ge N$ such that each signal can be sparsely approximated under the dictionary $D$. Dictionary learning aims at finding such a dictionary so that each signal $Y_\ell$ can be well approximated by $D c_\ell$, where $c_\ell \in \mathbb{R}^M$ is a sparse vector. The dictionary learning problem can be formulated as solving the following optimization problem:

$$\min_{D \in \mathbb{R}^{N\times M},\; C \in \mathbb{R}^{M\times L}} \|Y - DC\|_F^2, \tag{1}$$

subject to

$$\|D_j\|_1 = 1, \quad \text{for } j = 1, 2, \dots, M; \qquad \|C_\ell\|_0 \le K, \quad \text{for } \ell = 1, 2, \dots, L.$$

**Goal**

In the lecture notes, the K-SVD method is presented for solving the dictionary learning problem. The goal of this project is to implement the K-SVD method, with possible modifications, to obtain your own version of a dictionary learning method that solves (1).

**Implementation**

In this project, the following routine is expected.

    DICTLEARN  Compute the sparsity-based dictionary of input signals.
      D = dictlearn(Y, M, K) computes the dictionary D with M atoms
      that can sparsify the set of signals Y with sparsity degree K.
      For L signals of size N x 1, the dimensionality of Y is N x L.
      The dimensionality of D is N x M, with M >= N.
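The deliverable is a MATLAB routine (dictlearn.m). As a language-neutral, non-authoritative sketch only, the following NumPy version outlines one possible K-SVD-style implementation of the same interface; the helper `omp`, the iteration count, and the random initialization are illustrative assumptions, not part of the project specification.

```python
import numpy as np

def omp(D, y, K):
    """Orthogonal matching pursuit: a greedy K-sparse code of y over D (illustrative)."""
    r, support = y.copy(), []
    c = np.zeros(D.shape[1])
    for _ in range(K):
        support.append(int(np.argmax(np.abs(D.T @ r))))   # most correlated atom
        sub = D[:, support]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)    # refit on the support
        r = y - sub @ coef
    c[support] = coef
    return c

def dictlearn(Y, M, K, n_iter=20, seed=0):
    """K-SVD-style dictionary learning sketch: D is N x M, codes C are M x L."""
    rng = np.random.default_rng(seed)
    N, L = Y.shape
    D = rng.standard_normal((N, M))
    D /= np.linalg.norm(D, axis=0)                        # normalize atoms
    for _ in range(n_iter):
        # sparse coding stage
        C = np.column_stack([omp(D, Y[:, i], K) for i in range(L)])
        # dictionary update stage, atom by atom
        for k in range(M):
            users = np.nonzero(C[k, :])[0]
            if users.size == 0:
                continue
            E = Y[:, users] - D @ C[:, users] + np.outer(D[:, k], C[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                             # rank-1 SVD update
            C[k, users] = s[0] * Vt[0, :]
    return D, C
```

A fixed iteration count is used here; a real submission would add a convergence test and a smarter initialization.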

National University of Singapore, Department of Mathematics
02/2017, MA5232 Modeling and Numerical Simulations, Project on Dictionary Learning


**Grading Policy**

This is an individual project. You may discuss the ideas with your classmates, but the sharing of code is strictly prohibited. You are not allowed to use any code from online sources or from other classmates' projects; this will be checked via a standard code plagiarism checking system (e.g., theory.stanford.edu/~aiken/moss). The grade of the submitted project is based on the following three factors: (1) whether the submission meets the guidelines; (2) the correctness of the results; and (3) the computational efficiency in terms of running time in MATLAB.

**Submissions**

You are required to submit the following:

1. (optional) A PDF file that describes any idea or technique for either speeding up the computation or improving the stability of the algorithm.
2. The source code of DICTLEARN in program format, e.g., dictlearn.m.

Please package all your files in a single zip file named with your student ID and the suffix "-DL", and upload the zip file to IVLE, e.g., U1234567-DL.zip. The deadline for submission is Sunday, 23 April 2017. Any submission after the deadline will lead to a grade penalty.

If you have any question, please contact me by email: matjh@nus.edu.sg for help.

**4 Sparse recovery and convex relaxation**

A vector $x \in \mathbb{R}^N$ is called $s$-sparse if at most $s$ of its entries are nonzero. Seeking a sparse solution of an under-determined system $Ax = b$ can be done by solving

$$\min_z \|z\|_0, \quad \text{s.t. } Az = Ax.$$

**Theorem 32.** Consider a matrix $A \in \mathbb{R}^{M\times N}$ with normalized columns. Suppose that every $M \times M$ submatrix of $A$ is full-rank. Then $x$ is the only global minimizer of the above $\ell_0$-norm based model, provided that $s < \frac{M}{2}$.
Proof. Suppose that $\bar x$ is a global minimizer; then it has at most $s$ nonzero elements. The vector $\delta = x - \bar x$ then has at most $2s < M$ nonzero elements, and $A\delta = 0$. Since every $M \times M$ submatrix of $A$ is full-rank, any set of fewer than $M$ columns of $A$ is linearly independent, so $\delta = 0$. The proof is done.
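As a tiny numerical illustration of Theorem 32 (the helper name is illustrative; the search is brute-force and exponential-time, so only for toy sizes), one can enumerate supports to solve the $\ell_0$ model and check that a $1$-sparse $x$ (here $s = 1 < M/2 = 2$) is recovered exactly from a generic $A$:

```python
import numpy as np
from itertools import combinations

def l0_min(A, b, max_s, tol=1e-9):
    """Brute-force the l0 model: return a z with minimal support size and Az = b."""
    N = A.shape[1]
    for t in range(1, max_s + 1):                  # try supports of growing size
        for S in combinations(range(N), t):
            cols = list(S)
            zS, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
            if np.linalg.norm(A[:, cols] @ zS - b) < tol:   # exact fit found
                z = np.zeros(N)
                z[cols] = zS
                return z
    return None

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))   # generic: every 4x4 submatrix full-rank a.s.
x = np.zeros(8); x[5] = 2.0       # s = 1 < M/2
z = l0_min(A, A @ x, max_s=3)
```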
Although the theorem above shows that the $\ell_0$-norm based model can find the truth under a very mild condition (as long as the sparsity degree of $x$ is less than half of the number of measurements available), the problem is known to be NP-hard, and thus OMP cannot guarantee that it finds the correct solution. One often-used strategy is to use the $\ell_1$ norm as a convex relaxation of the $\ell_0$ norm. The advantage is that there exist many numerical solvers that find a global solution (or obtain a solution with the desired accuracy in finitely many iterations). However, the question remains whether a global minimizer corresponds to the truth, as there might be many global minimizers, the optimization problem being not strictly convex. In what follows, we consider the noise-free case. We will provide two conditions that guarantee that the truth is the only global minimizer of the convex model

$$\min_z \|z\|_1, \quad \text{s.t. } Az = Ax.$$

The first is a sufficient condition, which is easy to check through computation, but is so strong that it is difficult to design such a matrix. The other is a sufficient and necessary condition, which is much weaker than the first, but very difficult to verify through computation.

**Definition 33 (Mutual coherence).** Consider a matrix $A \in \mathbb{R}^{M\times N}$ whose columns are normalized to 1, i.e., $\|A_j\|_2 = 1$ for $j = 1, 2, \dots, N$. The mutual coherence of the matrix $A$ is defined as

$$\mu(A) = \max_{i \ne j} |\langle A_i, A_j \rangle|.$$

Clearly, if $A$ is a unitary matrix, we have $\mu(A) = 0$. Next, we give a lower bound on the mutual coherence when the column dimension is larger than the row dimension.

**Theorem 34.** Consider a full-rank matrix $A \in \mathbb{R}^{M\times N}$ with $M < N$. Then we have

$$\mu(A) \ge \sqrt{\frac{N - M}{M(N-1)}} = O\!\left(\frac{1}{\sqrt{M}}\right).$$

Proof. See the references.

Consider a vector $x \in \mathbb{R}^N$. For any index set $S$, let $x_S$ denote the restriction of $x$ to the indices in $S$, and let $S^c$ denote the complement of $S$ in $\{1, 2, \dots, N\}$.
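Mutual coherence is straightforward to compute from the Gram matrix; a small sketch (the helper name is hypothetical) that also lets one check the lower bound of Theorem 34 numerically:

```python
import numpy as np

def mutual_coherence(A):
    """mu(A) = max_{i != j} |<A_i, A_j>|, after normalizing the columns."""
    A = A / np.linalg.norm(A, axis=0)   # normalize columns defensively
    G = np.abs(A.T @ A)                 # pairwise inner products
    np.fill_diagonal(G, 0.0)            # ignore i = j
    return float(G.max())
```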
**Theorem 35 (Sufficient condition).** Consider a full-rank matrix $A \in \mathbb{R}^{M\times N}$ with normalized columns. For any $s$-sparse vector $x \in \mathbb{R}^N$ satisfying $s < \frac{1}{2}\left(1 + \mu(A)^{-1}\right)$, we have $x = \bar x$ for any $\bar x \in \{\arg\min \|z\|_1 \ \text{s.t. } Az = Ax\}$.

Proof. Let $S$ denote the support of $x$. Suppose there is a global minimizer $z$ different from $x$, and define $\delta = z - x$. Then $A\delta = 0$. Notice that $\|z\|_1 \le \|x\|_1$, and

$$\begin{aligned}
\|z\|_1 - \|x\|_1 &= \sum_{k=1}^N |z(k)| - \sum_{k=1}^N |x(k)| \\
&= \sum_{k \in S^c} |z(k)| + \sum_{k \in S} \big(|z(k)| - |x(k)|\big) \\
&\ge \sum_{k \in S^c} |z(k)| - \sum_{k \in S} |z(k) - x(k)| \\
&= \sum_{k \in S^c} |\delta(k)| - \sum_{k \in S} |\delta(k)| \\
&= \|\delta\|_1 - 2\sum_{k \in S} |\delta(k)|.
\end{aligned}$$

We then have

$$\|\delta\|_1 \le 2\|\delta_S\|_1.$$

From $A^\top A\delta = 0$, we have

$$|\delta_j| = \Big|\sum_{i\ne j} (A^\top A)_{j,i}\,\delta_i\Big| \le \sum_{i\ne j} |(A^\top A)_{j,i}|\,|\delta_i| \le \mu(A)\big(\|\delta\|_1 - |\delta_j|\big),$$

which gives

$$|\delta_j| \le (1 + \mu(A))^{-1}\mu(A)\,\|\delta\|_1.$$

Summing $|\delta_j|$ over the index set $S$, we have

$$\|\delta_S\|_1 = \sum_{j\in S} |\delta_j| \le s\,(1+\mu(A))^{-1}\mu(A)\,\|\delta\|_1 < \frac{1}{2}\big(1 + \mu(A)^{-1}\big)\frac{\mu(A)}{1+\mu(A)}\,\|\delta\|_1 = \frac{1}{2}\|\delta\|_1.$$

This contradicts $\|\delta\|_1 \le 2\|\delta_S\|_1$. The proof is done.
Next, we provide a necessary and sufficient condition for exact recovery of sparse signals using the $\ell_1$-norm based model.

**Definition 36 (Null space property).** A matrix $A$ is said to satisfy the null space property of sparsity degree $s$ if for every nonzero $x \in \ker(A)$,

$$\|x_S\|_1 < \|x_{S^c}\|_1$$

for every index set $S$ with cardinality $|S| \le s$.

**Theorem 37 (Sufficient and necessary condition).** Consider a matrix $A \in \mathbb{R}^{M\times N}$. The sufficient and necessary condition that guarantees, for every $s$-sparse vector $x$,

$$x = \bar x, \quad \bar x = \arg\min_{Az = Ax} \|z\|_1,$$

is that $A$ satisfies the null space property.
Proof. Sufficiency. For any $s$-sparse vector $x$, let $S$ be its support. For any $z \ne x$ that obeys $Az = Ax$, we have $\delta = z - x \in \ker(A)$, $\delta \ne 0$, and

$$\|z\|_1 = \|x_S + \delta_S\|_1 + \|\delta_{S^c}\|_1 \ge \|x_S\|_1 - \|\delta_S\|_1 + \|\delta_{S^c}\|_1 = \|x_S\|_1 + \big(\|\delta_{S^c}\|_1 - \|\delta_S\|_1\big). \tag{29}$$

As $A$ satisfies the null space property, $\|\delta_{S^c}\|_1 > \|\delta_S\|_1$. Thus $\|x\|_1 = \|x_S\|_1 < \|z\|_1$, which shows that $x = \arg\min_{Az=Ax}\|z\|_1$.

Necessity. Take any index set $S$ of size $|S| \le s$ and any nonzero $\delta \in \ker(A)$. Define $\delta_S(k) = \delta(k)$ if $k \in S$, and $0$ otherwise. Then $\delta_S$ is an $s$-sparse vector, so by assumption $\delta_S = \arg\min_{Az = A\delta_S} \|z\|_1$. On the other hand, $A\delta = 0$ implies that $A(-\delta_{S^c}) = A\delta_S$, i.e., $-\delta_{S^c}$ is feasible. By uniqueness of the minimizer, $\|\delta_{S^c}\|_1 > \|\delta_S\|_1$, which means that $A$ satisfies the null space property.
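The null space property is hard to verify in general, but in the special case where $\ker(A)$ is one-dimensional it reduces to a simple test on the kernel vector: the sum of its $s$ largest absolute entries must be less than the sum of the rest. A sketch under that assumption (the helper name and the test matrix are hypothetical):

```python
import numpy as np

def satisfies_nsp_1d(A, s, tol=1e-12):
    """Check the null space property of degree s, assuming ker(A) is 1-dimensional.

    Every delta in the kernel is then a scalar multiple of one vector v, so the
    NSP holds iff the sum of the s largest |v_i| is strictly less than the sum
    of the remaining |v_i|.
    """
    v = np.linalg.svd(A)[2][-1]           # kernel direction (assumes rank d-1)
    a = np.sort(np.abs(v))[::-1]          # magnitudes in decreasing order
    return bool(a[:s].sum() < a[s:].sum() - tol)
```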

When the system is contaminated by noise, i.e.,

$$Ax = b + \epsilon, \quad \text{where } \|\epsilon\|_2 \le \delta,$$

the following model will be used:

$$\min_z \|z\|_1, \quad \text{s.t. } \|Az - b\|_2^2 \le \delta^2.$$

**Theorem 38.** The matrix $A$ is said to satisfy the robust null space property if there exist two constants $D > 0$ and $0 \le \beta < 1$ such that for every index set $S$ with cardinality $|S| \le s$ and every $x$,

$$\|x_S\|_2 \le D\|Ax\|_2 + \beta s^{-1/2}\|x_{S^c}\|_1.$$

Suppose that $A$ satisfies the robust null space property. Then the solution of the constrained optimization model above, denoted by $\bar x$, satisfies

$$\|x - \bar x\|_2 \le C\delta$$

for some constant $C$.

Proof. The proof is similar to that of the null space property.

It can be seen that the robust null space property implies the null space property, by considering $x \in \ker(A)$ and using the fact that $\|x_S\|_1 \le \sqrt{s}\,\|x_S\|_2$.
**References**

[1] H. L. Royden, P. Fitzpatrick. Real Analysis. Macmillan, New York, 1988.
[2] W. Rudin. Functional Analysis. International Series in Pure and Applied Mathematics, 1991.
[3] I. Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia, 1992.
[4] Z. Fan, H. Ji, Z. Shen. Dual Gramian analysis: duality principle and unitary extension principle. Mathematics of Computation, 2016, 85(297): 239-270.
[5] H. Ji, Z. Shen, Y. Zhao. Directional frames for image recovery: multi-scale discrete Gabor frames.
[6] B. Dong, Z. Shen. MRA-based wavelet frames and applications. IAS Lecture Notes Series, Summer Program on "The Mathematics of Image Processing", Park City Mathematics Institute.
[7] D. L. Donoho, M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization. Proceedings of the National Academy of Sciences, 2003, 100(5): 2197-2202.
**Review**

In past classes, we studied the problem of sparsity. The sparsity problem is: given a set of basis vectors $D$ (a dictionary, or design matrix) and a signal $y$, we want to find a vector $x$ of high dimension but with only a few nonzero elements. If the measurements we have are linear measurements, the problem can be expressed as $y = Dx$, where $x$ is the data we want to recover from the linear measurements $y$ and the matrix $D$. Written as a formulation, it is

$$\min_x \|x\|_0 \quad \text{s.t. } y = Dx,$$

or approximately $y \approx Dx$, satisfying $\|y - Dx\|_p \le \varepsilon$. Alternatively, if we limit the sparsity to be $s$, it is

$$\min_x \|y - Dx\|_2^2 \quad \text{s.t. } \|x\|_0 \le s.$$
**1. Dictionary Learning**

Now we look at the reverse problem: could we design dictionaries based on learning? Our goal is to find the dictionary $D$ that yields sparse representations for the training signals. We believe that such dictionaries have the potential to outperform commonly used predetermined dictionaries. With ever-growing computational capabilities, computational cost may become secondary in importance to the improved performance achievable by methods that adapt dictionaries to special classes of signals.

Given $y_1, y_2, \dots, y_N$, where $N$ is large, find a dictionary matrix $D$ whose number of columns $K \ll N$, such that every $y_i$ is a sparse combination of the columns of $D$. From now on, denote the dictionary matrix by $D$, with columns $\{d_i\}_{i=1}^K$; the collection of signals by the $m \times N$ matrix $Y$, whose columns are the points $y_i$; and the representation matrix by $X$, a $K \times N$ matrix whose columns $x_i$ are the representations of the $y_i$ in the dictionary $D$.
**Dictionary learning and sparse coding**

Intuitively, the natural formulation is:

$$\min_{D,\{x_i\}} \sum_i \|y_i - Dx_i\|^2 \quad \text{s.t. } \|x_i\|_0 \le s \tag{24.1}$$

But formulation (24.1) has several issues to be considered:
- Issue 1: there is a combinatorial sparsity constraint;
- Issue 2: the optimization is over both $D$ and $\{x_i\}$, so the problem is non-convex due to the bilinearity in the objective function.
**2. Dictionary Learning, Idea 1**

Applying a Lagrangian relaxation to formulation (24.1) and relaxing the $\ell_0$ norm to the $\ell_1$ norm, we have

$$\min_{D,\{x_i\}} \|Y - DX\|_F^2 + \lambda \sum_i \|x_i\|_1 \tag{24.2}$$

Here $\|Y - DX\|_F^2 = \sum_i \|y_i - Dx_i\|^2$. Formulation (24.2) penalizes the $\ell_1$ norm of the $x_i$, but does not penalize the entries of $D$ as it does for the $x_i$. Thus, the solution will tend to increase the dictionary entry values in order to allow the coefficients to become closer to zero. This difficulty is handled by constraining the $\ell_2$ norm of each basis element, so that the output variance of the coefficients is kept at an appropriate level.

An iterative method was suggested for solving (24.2). It includes two main steps in each iteration:
- Sparse coding: calculate the coefficients $x_i$;
- Dictionary update: fix $\{x_i\}$, update $D$.
The iterative method leads to a local minimum, so the performance of the solution is sensitive to the initial $D$ and $\{x_i\}$ we choose and to the method we use to update the dictionary and the coefficients. For the sparse coding part, there is a range of methods that can be used: OMP, $\ell_1$ minimization, iterative thresholding, etc., which, more or less, have already been covered in former classes.
In the dictionary update part, with $\{x_i\}$ fixed, we are left with solving

$$\min_D \|Y - DX\|_F^2 \tag{24.3}$$

Option 1: directly solve the least-squares problem.

Issue: the "guess" $X$ might be lousy. By doing the "exact" solving step, the whole process would be "pushed" to the local minimum nearest the initial guess. The situation is similar to "over-fitting" the current observations in a regression model, where noise is unsuitably considered in the optimization and "pushes" the regression model in a lousy direction.
Option 2: a simple gradient descent procedure with a small step size,

$$D^{(n+1)} = D^{(n)} - \eta \sum_{i=1}^N \big(D^{(n)} x_i - y_i\big)\,x_i^\top.$$
**Initializing D: greedy algorithm**

1. Pick the largest ($\ell_2$ norm) column of $Y$ and move it from $Y$ to $D$;
2. From all columns that remain in $Y$, subtract their orthogonal projection onto $\mathrm{range}(D)$.

Repeat (1) and (2) until a full basis $D$ is found.
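The two steps above can be sketched as follows; a hedged NumPy illustration (the routine name is hypothetical, and it assumes $Y$ has at least $M$ linearly independent columns). Note that the deflation step keeps the chosen atoms orthonormal by construction:

```python
import numpy as np

def greedy_init(Y, M):
    """Greedy initialization: move the largest-norm column of Y into D,
    then deflate the remaining columns by their projection onto range(D)."""
    R = Y.astype(float).copy()
    atoms = []
    for _ in range(M):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))   # step 1: largest column
        d = R[:, j] / np.linalg.norm(R[:, j])
        atoms.append(d)
        Q = np.column_stack(atoms)                       # orthonormal by construction
        R = R - Q @ (Q.T @ R)                            # step 2: remove projection
    return np.column_stack(atoms)
```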
**3. Dictionary Learning, Idea 2: unions of orthonormal dictionaries**

Consider a dictionary composed as a union of orthonormal bases, $D = [D_1, \dots, D_L]$, where $D_i \in O(m)$ ($m$ is the dimension of the $y_i$), that is, $D_i^\top D_i = D_i D_i^\top = I_{m\times m}$, for $i = 1, \dots, L$.

Correspondingly, divide $X$ into $L$ submatrices, with each submatrix $X_i$ containing the coefficients of $D_i$:

$$X = [X_1^\top, X_2^\top, \dots, X_L^\top]^\top.$$

The update algorithm needs to preserve the orthonormality of the $D_i$ in each iteration. Assuming known coefficients, the proposed algorithm updates each orthonormal basis $D_i$ sequentially. The update of $D_i$ is done by first computing the residual matrix

$$E_i = Y - \sum_{j \ne i} D_j X_j.$$

Then we solve $\min_{D_i \in O(m)} \|E_i - D_i X_i\|_F^2$ for the updated $D_i$ (an orthogonal Procrustes problem, solvable via the SVD of $E_i X_i^\top$).
**4. K-SVD algorithm**

Overall, this algorithm is a "generalization" of the K-means algorithm for clustering the points $\{y_i\}_{i=1}^N$.

**K-means algorithm**

In K-means, we are first given $K$ centers, denoted $C = [c_1, c_2, \dots, c_K]$. This can also be seen as a basis or dictionary. The index $j$ of point $y_i$ is selected by finding $j$ such that

$$\|y_i - Ce_j\|_2^2 \le \|y_i - Ce_k\|_2^2, \quad \forall k \ne j.$$

This is analogous to the sparse coding stage: we force the sparsity of the coefficients to be 1, and the nonzero component must equal 1. Written as a formulation in matrix form,

$$\min_X \|Y - CX\|_F^2 \quad \text{s.t. } x_i = e_k \text{ for some } k, \ \forall i \quad (\|x_i\|_0 = 1).$$

In the stage updating $\{c_i\}$, let $R_k$ be the set of indices of the points clustered into the $k$-th cluster:

$$c_k = \frac{1}{|R_k|}\sum_{i \in R_k} y_i = \arg\min_c \sum_{i \in R_k} \|y_i - c\|_2^2.$$

Combining all $K$ such optimization problems, the update of $C$ is

$$C = \arg\min_C \|Y - CX\|_F^2.$$
Remember that here $X$ comes from the sparse coding stage, defining which cluster each point in $Y$ lies in. Denote the $k$-th row of $X$ by $x_T^k$. Then

$$Y - CX = Y - \sum_k c_k x_T^k = \underbrace{Y - \sum_{j\ne k} c_j x_T^j}_{E_k} - \,c_k x_T^k.$$

Since the whole problem is decomposable into subproblems of finding each $c_k$, it is equivalent to sequentially searching for the $c_k$ that minimizes $\|E_k - c_k x_T^k\|_F^2 = \sum_{i\in R_k} \|y_i - c_k\|_2^2$.

Back to the dictionary learning problem,

$$Y - DX = \Big(\underbrace{Y - \sum_{j\ne k} d_j x_T^j}_{E_k}\Big) - d_k x_T^k,$$

so

$$\|Y - DX\|_F^2 = \|E_k - d_k x_T^k\|_F^2.$$

In the dictionary update step, analogous to K-means, assuming $X$ known, we can sequentially solve

$$\min_{d_k} \|E_k - d_k x_T^k\|_F^2$$

for the update of the columns of $D$.
**K-SVD algorithm:**

Data: $\{y_i\}_{i=1}^N$
Initialization: $D^{(0)} \in \mathbb{R}^{m\times K}$, with each column normalized; $t = 1$.
Repeat until convergence:
1. Sparse coding: solve for $\{x_i\}_{i=1}^N$ using a suitable algorithm;
2. For each column $k = 1, \dots, K$ in $D^{(t-1)}$:
   1. Define $R_k = \{i \mid 1 \le i \le N, \ x_T^k(i) \ne 0\}$;
   2. Compute $E_k = Y - \sum_{j\ne k} d_j x_T^j$ and extract the columns with indices in $R_k$, denoted $E_k^{R_k} \in \mathbb{R}^{m\times |R_k|}$;
   3. Apply the SVD $E_k^{R_k} = U\Delta V^\top$; choose the updated dictionary column $d_k$ as the first column of $U$, and update the nonzero entries of $x_T^k$ by $x_{R_k}^k$, the first column of $V$ multiplied by $\sigma_1(E_k^{R_k})$;
3. Set $t = t + 1$.
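Step 2 of the algorithm (the single-atom update) can be sketched as follows. This is an illustrative NumPy fragment, not a required deliverable; it assumes row $k$ of $X$ holds the current code row $x_T^k$, and it updates $D$ and $X$ in place:

```python
import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """One K-SVD atom update via the rank-1 SVD of the restricted residual E_k^{R_k}."""
    Rk = np.nonzero(X[k, :])[0]                       # signals that use atom k
    if Rk.size == 0:
        return D, X                                   # unused atom: leave as is
    E = Y - D @ X + np.outer(D[:, k], X[k, :])        # E_k = Y - sum_{j != k} d_j x_T^j
    U, s, Vt = np.linalg.svd(E[:, Rk], full_matrices=False)
    D[:, k] = U[:, 0]                                 # best rank-1 left factor
    X[k, Rk] = s[0] * Vt[0, :]                        # sigma_1 times first column of V
    return D, X
```

Because the current $(d_k, x_{R_k}^k)$ pair is a feasible rank-1 candidate, this update can never increase $\|Y - DX\|_F$.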
**1 Introduction**
Background and goals: The goal of this lecture is to give you an introduction to the mathematics, algorithms, and applications in the ﬁeld of sparse analysis,
including sparse approximation and compressive sensing. Despite learning in high school mathematics that solving an under-determined linear system of the form Ax = b
is impossible, mathematicians, computer scientists, and engineers attempt to do so in a myriad of ﬁelds, applications, and settings. This problem arises in signal and
image processing where A plays the role of a redundant dictionary, b is the image or signal to compress, and x is the collection of “transform” coeﬃcients that
captures the exact linear combination of dictionary elements which closely approximate b. In this case, solving Ax = b amounts to ﬁnding a sparse or parsimonious
linear representation over A that is close to the input signal b. Given A and b, a streaming algorithm strives to compute quickly statistical or combinatorial
information about x. In these applications, the number of rows of the matrix A is smaller than the number of columns, sometimes exponentially so. In compressive
sensing, one designs a matrix A, takes linear measurements of a signal using A, Ax = b, and from these measurements and knowledge of A, one tries to reconstruct
information about x. One mathematical feature that allows us to solve many variants of this under-determined linear system problem is sparsity; roughly speaking, as
long as the vector x does not contain too many non-zero components (or has a few dominating components), we can “solve” the under-determined system Ax = b. Common
variants of the problem are collectively called sparse approximation (SA) problems. In this course we will highlight the multidisciplinary nature of sparse signal
analysis, covering topics ranging from its mathematical core to its potential impact and applications, as well as the role of algorithms. Content: An approximate list
of topics follows.
1. background: sparse and compressible signal models, mathematical approximation theory, the sparse representation problem, computational hardness;
2. algorithms: convex optimization, greedy pursuit methods;
3. applications: image compression, machine learning, coding theory.
and discrete math), electrical engineering (signal processing and coding theory), computer science (algorithms and complexity theory), and statistics (statistical
learning theory). The minimum requirements for a student to feel somewhat comfortable following the lectures are and linear algebra. There will almost certainly be
topics for which you do not have the prerequisite courses so my main prerequisite is a willingness to learn new material, to forge ahead even when the topic is
unfamiliar to you.
1.1 Basic image compression: transform coding We begin our study of sparse signal analysis with a simple example of image compression. In Figure 1, we take an image
and compute its 2d wavelet transform. This transform is an orthogonal transform so the transform produces a set of wavelet coeﬃcients that are nothing other than a
change of basis from the original image. The left panel of Figure 1 shows both the original image and its 2d wavelet transform. Notice that many of the wavelet
coeﬃcients in the lower left ﬁgure are dark or close to zero. If we were to quantize all of the coeﬃcient values, we would spend a considerable fraction of our total
bits quantizing the small coeﬃcient values. Instead, let’s spend most of our bits on the large (in absolute value) coeﬃcients and set to zero those coeﬃcients that are
smaller (in absolute value) than a threshold θ. If we then compute the inverse wavelet transform of the thresholded coeﬃcients, we obtain a close approximation to the
original image (top right panel). Those few coeﬃcients we retained capture almost all of the important features of the original image and there are considerably fewer
of these coeﬃcients than pixels in the original image so we have produced a lossy compressed representation that consists of a few non-zero wavelet coeﬃcients.
Figure 1: original image (top left); 2D wavelet transform of the image (bottom left); thresholded wavelet coefficients (bottom right); inverse wavelet transform of the thresholded wavelet coefficients (top right).
Let's make this discussion a bit more formal. Let Φ denote the orthogonal matrix that corresponds to our orthonormal basis; the vectors of the basis are the columns of the matrix Φ. And let x denote the signal that we wish to compress. The encoding algorithm that we used above is:

    Input: Φ and x
    Output: t (sparse vector)
    c = Φ*x    // compute orthogonal transform
    t = Θ(c)   // threshold small coefficients

The decoding (or reconstruction) algorithm that we use to synthesize the image from its compressed representation is straightforward:

    Input: Φ and t
    Output: x̂
    x̂ = Φt
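The two boxes above translate directly into code; a minimal sketch (the function names are illustrative) using a real orthogonal Φ, for which Φ* = Φᵀ:

```python
import numpy as np

def encode(Phi, x, theta):
    """Nonlinear encoder: c = Phi^T x, then keep only coefficients above theta."""
    c = Phi.T @ x                              # compute orthogonal transform
    return np.where(np.abs(c) > theta, c, 0.0) # threshold small coefficients

def decode(Phi, t):
    """Linear decoder: synthesize x_hat = Phi t."""
    return Phi @ t
```

Since Φ is orthogonal, the reconstruction error equals the ℓ2 norm of the discarded coefficients.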
This encoding algorithm is an example of a nonlinear encoding algorithm paired with a linear decoding algorithm. The encoding algorithm is nonlinear in the sense that which coefficients are retained in the thresholding process depends on the input signal x. More precisely, for two different input signals x and x′,

$$\Theta(\Phi(x + x')) = \Theta(\Phi x + \Phi x') \ne \Theta(\Phi x) + \Theta(\Phi x').$$

This configuration is a typical one for modern codecs; more computational resources are spent encoding signals or images so that the decoder can be relatively simple. Later in this course, we will see that compressive sensing attempts to invert this typical allocation of resources, with a linear encoder and a nonlinear decoder.

Nonlinear encoding with a fixed orthonormal basis Φ for signal compression has several advantages and one large disadvantage. Both the encoding and decoding algorithms are relatively easy to compute and, given a signal x, the compression that the encoder produces is the best set of coefficients for that signal; i.e., the compression is instance optimal. It is important to note that the compressibility of the input signal x depends on the fixed orthonormal basis Φ, and the compression is only as good as the basis Φ is for representing x. Given a large class of input signals x, it is challenging to design a single orthonormal basis that compresses all of the input signals well. This is especially true if the class consists of natural images. If we fix a single input x and then design one orthonormal basis Φ, it is easy to produce a basis that compresses x perfectly (simply take the first vector in the basis to be x, and all the other vectors orthogonal to x); however, the orthonormal basis we construct may not compress many other signals! Throughout this discussion, we have referred informally to the terms sparsity and
compressibility somewhat interchangeably, but it's important that we have precise definitions of these terms before we proceed.

**Definition 1.** The sparsity of a vector $x \in \mathbb{R}^d$ or $\mathbb{C}^d$ is the number of non-zero entries in $x$. We denote this quantity by $\|x\|_0$ despite the fact that this is not a proper norm.

**Definition 2.** Suppose that we sort the entries of a vector $x \in \mathbb{R}^d$ in decreasing absolute value. The rate of decay of these entries is the compressibility of $x$. Let $|x(\ell)|$ denote the $\ell$-th entry of the sorted vector; then the compressibility is the exponent $\alpha$ in the relationship $|x(\ell)| \sim \ell^{-\alpha}$.

See Figure 2 for an example of a compressible signal and the effect that thresholding the entries of a compressible vector has on its sparsity.
Figure 2: The set Λ consists of those entries of a vector x that are larger (in absolute value) than a threshold.
These deﬁnitions assume that the basis in which we represent or compress the signal x is the canonical basis but, if instead we use a diﬀerent orthonormal basis Φ, the
vector in question is the vector of coeﬃcients Φx. Sparsity or compressibility are two ways of measuring the number of items we need to encode to represent a signal;
the fewer the items, the fewer the bits, and the greater the compression. On a more philosophical level, sparsity or compressibility is a mathematical version of
Occam’s razor; we strive to capture the essential features of a signal.
**1.2 Complete and redundant dictionaries**

One of the biggest drawbacks of using a single orthonormal basis for compression, especially for a large class of input signals, is that the compression is only as good as the basis is "right" for that class, and it is difficult to design a single orthonormal basis that compresses a large class of signals well. It is much easier, instead, to use more than one orthonormal basis to represent a single input signal x or a class of signals. With apologies to George Orwell: if one orthonormal basis is good, surely two (or more) are better... especially for images. Figure 3 shows the different features that are present in an image, from the brush-like texture of the mandrill's hair to the pointillist-like residual that remains after we remove a linear combination of both a few wavelets and a few wavelet packets. No single orthonormal basis is rich enough (or large enough!) to include vectors that resemble all of these features. This brings us to a rigorous definition of a dictionary for signal representation.

**Definition 3.** A dictionary $\Phi$ in $\mathbb{R}^d$ (or $\mathbb{C}^d$) is a collection $\{\varphi_\ell\}_{\ell=1}^N \subset \mathbb{R}^d$ (or $\mathbb{C}^d$) of unit-norm vectors: $\|\varphi_\ell\|_2 = 1$ for all $\ell$.

- The elements are called atoms.
- If $\mathrm{span}\{\varphi_\ell\} = \mathbb{R}^d$, the dictionary is complete.
- If the $\{\varphi_\ell\}$ are linearly dependent and the dictionary is complete, the dictionary is redundant.

We represent a dictionary as a matrix $\Phi = [\varphi_1\ \varphi_2\ \cdots\ \varphi_N]$ with the atoms as the columns of the matrix, so that $\Phi c = \sum_\ell c_\ell \varphi_\ell$. See Figure 4 for the various linear-algebraic operations.
Figure 3: Diﬀerent bases capture diﬀerent features of an image.
Figure 4: The operation Φc produces a linear combination of dictionary atoms and is used to synthesize a signal in $\mathbb{R}^d$ (left), while the operation $\Phi^\top x$ produces a vector of dot products of x with the dictionary atoms, i.e., the analysis of a signal over a dictionary (right).
**1.1.1 Examples**

1. Fourier-Dirac: $\Phi = [F \mid I]$. In $\mathbb{C}^d$, we define the dictionary as

$$\varphi_\ell(t) = \frac{1}{\sqrt d}\, e^{2\pi i \ell t/d}, \quad \ell = 1, 2, \dots, d; \qquad \varphi_\ell(t) = \delta_\ell(t), \quad \ell = d+1, d+2, \dots, 2d.$$

It is of size $2d$, and the atoms are shown in Figure 5.

Figure 5: The first four rows are the Fourier basis and the second four rows show the Dirac basis for d = 32.

2. DCT-Wavelets: $\Phi = [F \mid W]$. In $\mathbb{R}^{d\times d}$ (or two-dimensional images), we have the union of the discrete cosine transform (DCT, a basis traditionally used to compress video frames) and two-dimensional wavelets (another plausible basis for compressing images). This dictionary is of size $2d^2$, or twice as large as the $d\times d$ images themselves. See Figure 6 for several representative atoms in this dictionary.

Figure 6: The DCT of an image (left) and three different 2-dimensional wavelets (right).

3. Wavelet packets: $\Phi$ in $\mathbb{R}^d$ consists of $d\log d$ different wavelet packets, which can be arranged into $2^{O(d)}$ different orthonormal bases. See Figure 7 for an example.
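As a quick sketch (the helper name is hypothetical), the Fourier-Dirac dictionary can be built explicitly; its mutual coherence comes out to $1/\sqrt d$, attained between a Fourier atom and a Dirac atom:

```python
import numpy as np

def fourier_dirac(d):
    """Build the Fourier-Dirac dictionary Phi = [F | I] in C^d (2d atoms)."""
    t = np.arange(d)
    F = np.exp(2j * np.pi * np.outer(t, np.arange(d)) / d) / np.sqrt(d)
    return np.hstack([F, np.eye(d)])
```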
**1.3 Sparse approximation basic problems**

If we use the convention that we represent a signal as a linear combination Φc over a redundant dictionary, then there are two competing metrics for the quality of this representation: (i) the cost of the representation itself (typically, how sparse or how compressible c is) and (ii) the error in the approximation x ≈ Φc. In signal or image compression, we usually trade off the rate of the encoding against the distortion that it produces as compared to the original signal. We will define these terms slightly differently from typical source coding; nonetheless, our definitions address the inherent trade-off between how expensive and how accurate the representation or compression is. The following three problems capture precisely three types of compression or signal representation: lossless compression, lossy compression that meets an error (or distortion) tolerance, and lossy compression that meets a memory (or rate) tolerance.
Figure 7: The Haar wavelet packet dictionary for d = 8.
- **Exact.** Given a vector $x \in \mathbb{R}^d$ and a complete dictionary $\Phi$, solve
  $$\arg\min_c \|c\|_0 \quad \text{s.t. } x = \Phi c,$$
  i.e., find a sparsest representation of $x$ over $\Phi$.
- **Error.** Given $\varepsilon \ge 0$, solve
  $$\arg\min_c \|c\|_0 \quad \text{s.t. } \|x - \Phi c\|_2 \le \varepsilon,$$
  i.e., find a sparsest approximation of $x$ that achieves error $\varepsilon$.
- **Sparse.** Given $k \ge 1$, solve
  $$\arg\min_c \|x - \Phi c\|_2 \quad \text{s.t. } \|c\|_0 \le k,$$
  i.e., find the best approximation of $x$ using $k$ atoms.

Now that we have formalized three fundamental sparse approximation problems, there are two directions to pursue: (i) how to compute sparse approximations, and (ii) if we could compute them, how accurate would they be (or how should we evaluate the output of the algorithms)? In the next lecture, we will discuss (ii): how do we evaluate the compressibility of signals and signal classes in different bases, an excursion into mathematical approximation theory.
**2.1 Agenda**

1. SA algorithms for orthonormal bases (in $\mathbb{R}^n$)
2. Arbitrary complete or redundant dictionaries: bad news for algorithms

In this lecture, we turn to algorithms and how to compute these sparse representations. We begin with a simple setting, restricting Φ to be an orthonormal basis, so as to understand (i) the basic architecture of one of the two major classes of algorithms, (ii) how to analyze algorithms, (iii) how to prove that the algorithms perform as advertised (and what that advertisement should say), and (iv) what pitfalls we might encounter in generalizing both the analysis and the algorithms to less simple complete dictionaries. We also begin a (short!) crash course in complexity theory so that we can both prove the hardness of the SA problems and understand what that result tells us.
**2.2 SA algorithms for orthonormal bases**

Let $\Phi$ be an arbitrary orthonormal basis in $\mathbb{R}^n$, and let's try to solve (with provable, efficient algorithms) the three sparse approximation problems. Because $\Phi$ is an orthonormal basis, the matrix $\Phi$ whose columns are the vectors of the basis is an orthogonal matrix, and it satisfies $\Phi^{-1} = \Phi^\top$.

**2.2.1 Exact**

Exact: Given an orthonormal basis $\Phi$ and an arbitrary vector $x$, find the sparsest vector $c$ such that $\Phi c = x$. This problem is trivial to solve when $\Phi$ is an orthonormal basis. We set

$$c = \Phi^\top x = \Phi^{-1} x;$$

i.e., each entry of the vector is $c_\ell = \langle x, \varphi_\ell\rangle$. This representation is unique, as $\Phi$ is an orthonormal basis. The algorithm consists of multiplying $\Phi^\top$ by $x$, which can be done in time $O(n^2)$ for an arbitrary basis, or faster if the transform has special structure (e.g., $O(n\log n)$ for the Fourier transform or $O(n)$ for wavelets).
2.2.2 Sparse

Sparse: Given an orthonormal basis Φ, an arbitrary vector x, and a sparsity parameter k, find the vector c such that ‖Φc − x‖_2 is minimized and ‖c‖_0 ≤ k. We will analyze Algorithm 1 and show that it solves the Sparse problem. First, we recall from our previous lectures that for an orthonormal basis, the k-term representation that minimizes the ℓ_2 distance to the vector x consists of the k vectors that enjoy the largest dot products with x, so our algorithm should return that representation. We must check that Algorithm 1 produces this vector c. The straightforward algorithm would be to compute the orthonormal transform Φ^T x, sort the coefficients, and return the k largest; it is not immediately clear that Algorithm 1 returns the same answer, as it is not the straightforward algorithm. It is, however, the same architecture as one of the two main algorithmic techniques in sparse approximation, so we present it for the simple case of an orthonormal basis for expository reasons. It is what
Algorithm 1: Prototype greedy algorithm for Sparse when Φ is an orthonormal basis.
Input: Φ, x, and k
Output: c
Initialize a = 0, Λ = ∅
for j = 1 to k do
    r = x − a
    λ_j = argmax_ℓ |⟨r, φ_ℓ⟩|
    Λ = Λ ∪ {λ_j}
    c_{λ_j} = ⟨r, φ_{λ_j}⟩
    a = Σ_{λ∈Λ} c_λ φ_λ
end
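A direct transcription of Algorithm 1 into code might look as follows (a sketch assuming NumPy; `greedy_sparse` and its signature are illustrative, not part of the notes):

```python
import numpy as np

def greedy_sparse(Phi, x, k):
    """Algorithm 1: greedy k-term approximation over an orthonormal basis Phi."""
    n = Phi.shape[1]
    c = np.zeros(n)
    a = np.zeros_like(x)
    Lam = set()                              # index set Lambda
    for _ in range(k):
        r = x - a                            # residual
        dots = Phi.T @ r                     # correlations <r, phi_l>
        lam = int(np.argmax(np.abs(dots)))   # best atom this iteration
        Lam.add(lam)
        c[lam] = dots[lam]                   # c_lam = <r, phi_lam>
        a = Phi @ c                          # a = sum over Lambda of c_lam phi_lam
    return c
```

For an orthonormal Φ this returns the k largest-magnitude transform coefficients, as the analysis that follows shows.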
is known as a greedy algorithm; it chooses the vector that is best at each step of an iterative procedure. The best vector is the one that has the largest dot product (or highest correlation) with the residual r = x − a at the current iteration. Finally, this algorithm builds up the approximation a one vector at a time and iterates for k steps. Let's analyze what this algorithm achieves; i.e., let's prove that it efficiently computes the best k-term representation of x over Φ. When j = 1, the approximation is a = 0 (by initialization) and so r = x. The maximization step selects λ_1 = argmax_ℓ |⟨x, φ_ℓ⟩|, the index of the biggest (in absolute value) dot product or transform coefficient. Then the algorithm includes the index λ_1 in the index set Λ, sets the entry c_{λ_1} equal to the maximizing dot product c_{λ_1} = ⟨x, φ_{λ_1}⟩, and the approximation at the end of the first iteration is a = c_{λ_1} φ_{λ_1}. When j = 2, the approximation consists of one term, so the residual is no longer simply equal to x: r = x − c_{λ_1} φ_{λ_1}. The maximization step selects λ_2 = argmax_ℓ |⟨x − c_{λ_1} φ_{λ_1}, φ_ℓ⟩|. Because Φ is an orthonormal basis, x has a unique expansion over Φ, x = Σ_j c_j φ_j, and dot products amongst the vectors in Φ simplify: ⟨φ_j, φ_ℓ⟩ = 1 if j = ℓ and 0 otherwise. Using these two pieces of information, we can simplify the maximization step

λ_2 = argmax_ℓ |⟨x − c_{λ_1} φ_{λ_1}, φ_ℓ⟩|
    = argmax_ℓ |⟨Σ_{j≠λ_1} c_j φ_j, φ_ℓ⟩|
    = argmax_ℓ |Σ_{j≠λ_1} c_j ⟨φ_j, φ_ℓ⟩|
    = argmax_{ℓ≠λ_1} |c_ℓ|
and observe that λ_2 is the index of the second largest (in absolute value) dot product or transform coefficient of x. We then update the index set of the approximation, Λ = {λ_1, λ_2}, and the approximation itself consists of two terms: a = c_{λ_1} φ_{λ_1} + c_{λ_2} φ_{λ_2}.
The run time of the algorithm, if implemented as written above, would be, per iteration, O(n^2) to compute the dot products with the residual r (depending on the particular basis Φ) and find the largest, plus O(n) to update the vectors r and a. As the dominating term is O(n^2) per iteration and there are k iterations, the total running time is O(kn^2). If we let T denote the time to compute the transform Φ^T x, the total time is O(kT). Note that updating the vectors r and a in each iteration is simple, as the change in a each iteration consists of adding one new vector φ_λ (scaled by the constant c_λ). We have shown:

Theorem 1. Algorithm 1 solves the Sparse problem; i.e., it returns a vector c such that ‖x − Φc‖_2 is minimized and ‖c‖_0 ≤ k. It does so in time O(kT), where T is the time to compute the orthonormal transform.

It is also easy to see from our analysis above that we don't have to compute a transform every iteration. A better implementation would first compute all the coefficients Φ^T x (in time T) and then find the k largest (in magnitude), either by sorting all of the coefficients (in time O(n log n)) and returning the k largest, or by a more sophisticated procedure. A textbook solution for returning the k biggest items in a list of size n is to maintain a priority queue of size k and slide it along the list, inserting new entries into the queue according to their size if they are larger than the smallest existing item in the queue. This algorithm runs in time O(n log k). There is a faster deterministic algorithm in [1] that runs in time O(n) but is somewhat complicated, and a randomized version that is much simpler with a similar running time O(n).
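In Python, for instance, this sliding priority queue is exactly what `heapq.nlargest` implements; a small sketch (the helper name is illustrative):

```python
import heapq

def k_largest_indices(coeffs, k):
    """Indices of the k largest-magnitude coefficients in O(n log k).

    heapq.nlargest maintains a size-k heap while scanning the sequence once,
    which is the priority-queue technique described above.
    """
    return [i for _, i in heapq.nlargest(k, ((abs(v), i) for i, v in enumerate(coeffs)))]

print(k_largest_indices([0.1, -3.0, 0.5, 2.0, -0.2], 2))  # [1, 3]
```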
2.2.3 Error

Error: Given an orthonormal basis Φ, an arbitrary vector x, and a fidelity parameter ε, find the sparsest vector c such that ‖Φc − x‖_2 ≤ ε. After our previous discussion, it is easy to see that a provable, efficient algorithm for Error is a simple modification of Algorithm 1: we change the for loop to a while loop with the condition that we iterate while ‖x − a‖_2 > ε.

2.3 Arbitrary complete, redundant dictionaries

When the dictionary Φ is an orthonormal basis, all three sparse

approximation problems are easy to solve; they admit efficient algorithms that return the optimal solutions. Unfortunately, this is not true for general or arbitrary complete or redundant dictionaries. In the next lecture, we will discuss how hard it is to solve the general sparse approximation problems, how difficult they are compared to well-studied problems such as Primes and SAT, and whether or not there is a feasible algorithm that is guaranteed to approximate the optimal solution. Before we proceed with a crash course in complexity theory, we have some bad news: sparse approximation problems are at least as hard to solve as a long list of challenging problems, including SAT, Traveling Salesman, Knapsack, Subset Sum, etc.

References

[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 3rd edition, 2009.


3.1 Agenda

1. Coherence of a redundant dictionary

2. Approximation algorithms for sparse approximation problems

In this lecture, we will discuss sufficient geometric conditions on a redundant dictionary that permit efficient (approximation) algorithms for sparse approximation problems. The simplest geometric condition is coherence, which measures how correlated the atoms of the dictionary are with one another. We will focus on the Sparse problem first and show that several greedy, iterative algorithms solve it approximately if the redundant dictionary is fairly incoherent. In class this lecture was divided into two pieces, so consider this write-up Lectures 5 and 6.

3.2 Coherence of a redundant dictionary

To motivate the definition of coherence of a dictionary, let us recall the iterative algorithm we employed to solve the Sparse problem when Φ is an orthonormal basis.

Algorithm 1: Prototype greedy algorithm for Sparse.
Input: Φ, x, and k
Output: c
Initialize a = 0, Λ = ∅
for j = 1 to k do
    r = x − a
    λ_j = argmax_ℓ |⟨r, φ_ℓ⟩|
    Λ = Λ ∪ {λ_j}
    c_{λ_j} = ⟨r, φ_{λ_j}⟩
    a = Σ_{λ∈Λ} c_λ φ_λ
end

The key step in the algorithm is the maximization step. If an atom φ_i with coefficient c_i has already been selected, expanding the dot product

⟨r, φ_j⟩ = ⟨x − c_i φ_i, φ_j⟩ = ⟨x, φ_j⟩ − c_i ⟨φ_i, φ_j⟩

illustrates what can go wrong when we do not have an orthonormal basis: the cross term ⟨φ_i, φ_j⟩ need not vanish. This example leads us to define how coherent the atoms of a dictionary are.

Definition 1. The coherence of a dictionary is the largest dot product between distinct pairs of atoms:

µ = max_{j≠ℓ} |⟨φ_j, φ_ℓ⟩|.

Figure 1 illustrates two different collections of atoms in R^2: one with small coherence (which is a good property) and one with large coherence (which is a bad property). One might wonder just how small the coherence of a collection of N atoms can be in dimension d, especially if N is quite large compared to d. Fortunately, there are a number of results of this type from coding theory (as redundant dictionaries are nothing other than spherical codes). The most straightforward lower bound on the coherence is known as the Welch bound.


Figure 1: Coherence of two different collections of atoms in R^2: one with small coherence (good) and one with large coherence (bad).

Theorem 2 (Welch bound). For a d×N redundant dictionary,

µ ≥ √( (N − d) / (d(N − 1)) ) ≈ 1/√d.

It is always useful to have at hand several examples of large, incoherent dictionaries from which to draw intuition.

• Fourier–Dirac: N = 2d, µ = 1/√d.
• Wavelet packets: N = d log d, µ = 1/√2.
• There are large dictionaries with coherence close to the lower (Welch) bound; e.g., Kerdock codes: N = d^2, µ = 1/√d.
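These quantities are easy to check numerically. The sketch below (assuming NumPy; the function names are illustrative) computes the coherence of a Fourier–Dirac dictionary and compares it with the Welch bound:

```python
import numpy as np

def coherence(Phi):
    """Largest absolute inner product between distinct (unit-norm) atoms."""
    Phi = Phi / np.linalg.norm(Phi, axis=0)
    G = np.abs(Phi.conj().T @ Phi)
    np.fill_diagonal(G, 0.0)   # ignore each atom's product with itself
    return G.max()

def welch_bound(d, N):
    """Lower bound on the coherence of any N-atom dictionary in dimension d."""
    return np.sqrt((N - d) / (d * (N - 1)))

d = 16
F = np.fft.fft(np.eye(d)) / np.sqrt(d)            # unitary DFT: Fourier atoms
Phi = np.hstack([np.eye(d).astype(complex), F])   # Fourier-Dirac, N = 2d atoms

print(coherence(Phi))         # 1/sqrt(d) = 0.25 (up to floating point)
print(welch_bound(d, 2 * d))  # ~0.18, below the coherence as the bound requires
```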

3.3 Approximation algorithms for sparse approximation

The word approximation has several uses in this topic: one is a mathematical meaning, we approximate a vector, and the other is an algorithmic meaning, the output of an algorithm is an approximation to the optimal output. The optimal output is what an algorithm would return if it took all the time necessary to find the best representation for the input over the dictionary. Another way to describe such a "perfect" algorithm is to say that an oracle would give us the best representation

for free (simply for asking). It is this representation against which we want to compare the output of our algorithm.

Figure 2: Kerdock codes are a union of orthonormal bases; pairs of bases are Fourier duals of one another under an appropriate group action.

Let's denote the optimal solution as c_OPT and the error of this solution as the optimal error, E_OPT = ‖Φ c_OPT − x‖_2. Our algorithm returns a coefficient vector ĉ with

1. ‖ĉ‖_0 = k;
2. E = ‖Φ ĉ − x‖_2 ≤ C_1 E_OPT;

that is, the error of the output is guaranteed to be less than a constant C_1 times the optimal error E_OPT. So, our algorithm doesn't necessarily return the best

representation but it is guaranteed to be a fairly good one, where fairly good is measured in terms of the optimal error. If the optimal error is zero, however, our

approximation algorithm is guaranteed to ﬁnd the optimal solution. This is an important point that may be easy to overlook in one’s ﬁrst encounter with approximation

algorithms. If an input vector x has an exact sparse representation with k atoms over a dictionary Φ, an approximation algorithm will ﬁnd it. Frequently, in

approximation algorithms we measure the approximation ratio of the algorithm, i.e., the ratio of the output error to the optimal error:

E / E_OPT = (C_1 E_OPT) / E_OPT = C_1.

We will see later that for some algorithms the approximation ratio is truly a constant and for others it depends on the coherence µ of the dictionary and the desired sparsity k. A constant-factor approximation ratio is clearly better than one that depends on several input parameters, but an algorithm with such a guarantee may have a

longer running time or may require more stringent conditions than an algorithm with a weaker approximation guarantee. These types of analyses are all important in the

analysis of sparse approximation algorithms.

3.3.1 Greedy algorithms

We will focus our attention on greedy, iterative algorithms first (and return to different algorithmic formulations later). By greedy, iterative algorithms we mean those that choose the best atom (or collection thereof) at each step of the algorithm. The main algorithm that we analyze is Orthogonal Matching Pursuit (OMP). Notice that this algorithm is very similar to our prototype greedy algorithm but differs in one important place: the method by which we calculate the coefficients c_λ that participate in the representation is more complicated. At each step, the coefficients are the solution to a least-squares problem. That is, we compute the orthogonal projection of x onto the subspace spanned by the columns indexed by Λ, the active set of atoms in


the representation. The orthogonal projection step is precisely where OMP gets the leading “O” in its name. Because at each step of the algorithm, the representation

is the orthogonal projection of x onto the span of the selected atoms, each selected atom in the representation appears only once. That is, the residual is orthogonal

to the representation so atoms that already participate in the representation cannot be selected again at a later step of the algorithm. When one implements this

algorithm, one should be careful about implementing the least-squares step. Naive implementations will result in poor empirical performance for OMP on large inputs.

Algorithm 2: Orthogonal Matching Pursuit (OMP).
Input: Φ, x, and k
Output: c
Initialize a = 0, Λ = ∅
for j = 1 to k do
    r = x − a
    λ_j = argmax_ℓ |⟨r, φ_ℓ⟩|
    Λ = Λ ∪ {λ_j}
    Find c_λ for λ ∈ Λ that minimize ‖x − Σ_{λ∈Λ} c_λ φ_λ‖_2
    Update a = Σ_{λ∈Λ} c_λ φ_λ
end

Many greedy algorithms have a similar outline, and we enumerate a few adjustments one could make to this basic architecture. There are several adjustments to make in how coefficients are calculated and what the selection criterion should be.

• Matching Pursuit: replace the coefficient update step with c_{λ_t} ← c_{λ_t} + ⟨r, φ_{λ_t}⟩.
• Thresholding: choose the m atoms for which |⟨x, φ_λ⟩| is among the m largest.

There are alternate stopping rules as well.

• Terminate the loop when the residual norm is smaller than a threshold: ‖r‖_2 ≤ ε.
• Terminate the loop when the largest dot product with the residual is smaller than a threshold: max_ℓ |⟨r_t, φ_ℓ⟩| ≤ ε.
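The thresholding variant fits in a few lines (a sketch assuming NumPy; the names are illustrative, and fitting the selected coefficients by least squares is one common choice, since the selection rule above leaves the coefficients unspecified):

```python
import numpy as np

def thresholding(Phi, x, m):
    """Keep the m atoms whose dot products with x are largest in magnitude,
    then fit their coefficients by least squares."""
    dots = np.abs(Phi.T @ x)
    Lam = np.argsort(dots)[-m:]           # indices of the m largest dot products
    c = np.zeros(Phi.shape[1])
    c_sub, *_ = np.linalg.lstsq(Phi[:, Lam], x, rcond=None)
    c[Lam] = c_sub
    return c
```

Unlike OMP, this makes a single pass over the dictionary, trading approximation quality for speed.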
