Project on Dictionary learning

Project on Dictionary learning

Problem
Consider a set of signals fY1; Y2 : : : ; YLg ⊂ IRN, Suppose that there exits an overcomplete dictionary D 2 IRN×M with M ≥ N such that each signal can be sparsely
approximated under the dictionary D. The dictionary learning aims at find such a
dictionary so that each signal Y‘ can be well approximated by Dc‘ where c‘ 2 IRM
is a sparse vector. The dictionary learning problem can be formulated as solving the
following optimization problem
min
D2IRN×M;C2IRM×L
kY – DCk2 F; (1)
subject to
kDjk1 = 1; for j = 1; 2; : : : ; M;
kC‘k0 ≤ K; for ‘ = 1; 2; : : : ; L:
Goal
In Lecture notes, The K-SVD method is presented for solving dictionary learning
problem. The goal of this project is to implement the K-SVD method with possible
modifications to have your version of the dictionary learning method that solves (1).
Implementation
In this project, the following routine is expected.
DICTLEARN Computing the sparsity-based dictionary of input signals
D = dictlearn (Y, M, K) computes the dictionary D with M atoms
that can sparsify the set of signals Y with degree K.
For L signals with size Nx1, the dimensionality of Y is NxL.
The dimensionality of D is NxM with M≥N.

National University of Singapore Department of Mathematics
02/2017 MA5232 MODELING AND NUMERICAL SIMULATIONS Project on Dictionary learning
Problem Consider a set of signals {Y1,Y2 …,YL} ⊂ IRN, Suppose that there exits an over- complete dictionary D ∈ IRN×M with M ≥ N such that each signal can be

sparsely approximated under the dictionary D. The dictionary learning aims at find such a dictionary so that each signal Y` can be well approximated by Dc` where c` ∈

IRM is a sparse vector. The dictionary learning problem can be formulated as solving the following optimization problem
min D∈IRN×M,C∈IRM×LkY −DCk2 F, (1)
subject to
kDjk1 = 1,for j = 1,2,…,M; kC`k0 ≤ K,for ` = 1,2,…,L.
Goal In Lecture notes, The K-SVD method is presented for solving dictionary learning problem. The goal of this project is to implement the K-SVD method with possible

modifications to have your version of the dictionary learning method that solves (1).
Implementation In this project, the following routine is expected.
DICTLEARN Computing the sparsity-based dictionary of input signals D = dictlearn (Y, M, K) computes the dictionary D with M atoms that can sparsify the set of signals

Y with degree K.
For L signals with size Nx1, the dimensionality of Y is NxL. The dimensionality of D is NxM with M≥N.
Grading Policy This is an individual project. You may discuss the ideas with your classmates, but the sharing of code is strictly prohibited. You are not allowed to

use any code from
online sources or from other classmates’ projects, which will be checked via some standard code plagiarism checking system (e.g. theory.stanford.edu/∼aiken/moss) The

grade of the submitted project is based on the following three factors: 1) whether meets the guidelines; 2) the correctness of the results; and (3) the computational

efficiency in terms of running time in MATLAB.
Submissions You are required to submit the following:
1. (optional) The PDF file that describes any idea or technique for either speeding up the computation of the algorithm or stability of the algorithm
2. the source code of DICTLEARN in program format, e.g., dictlearn.m
Please package all your files in a single zip file named by your student ID and the suffix ”-DL”, and upload the zip file into IVLE, e.g., U1234567-DL.zip. The deadline for

submission is Sunday, 23-April-2017. Any submission after the deadline will lead to some grade penalty.
If you have any question, please contact me by email: matjh@nus.edu.sg for help.

4 Sparse recovery and convex relaxation A vector x ∈RN is called s-sparse if at most s entries are not zero. To seek a sparse solution from an under-determined system,

Ax = b, can be done by solving
min z kzk0, s.t. Az = Ax. Theorem 32. Consider a matrix A ∈RM×N with normalized columns. Suppose that every sub- matrix ∈ RM×M of A is full-rank. Then, x is the only

global minimizer of the above `0-norm based model, provided that s < M 2 . Proof. Surpose that ¯ x is a global minimizer, then it has at most s non-zero elements. Then, the vector δ = x− ¯ x has at most 2s < M non-zero elements, and Aδ = 0. By the fact that the matrix is full rank, thus δ = 0. The proof is done. Although the theorem above shows that we can use the `0 norm based model to find the truth with very mild condition, i.e. as long as the sparsity degree of x is less than half of the measurements availble. It is known that it is an NP-hard problem, and thus the OMP cannot guarantee that it can find the correct solution. One often used strategy is to consider using `1 norm as the convex relaxation of `0 norm. The advantage is that there exist many numerical solvers to find a global solution (or obtain a solution with desired accuracy in finite iterations). However, the question remains that whether a global minimizer is corresponding to the truth, as there might be many global minimizers considering the fact that the optimization problem is not strictly convex. In the next, we consider the noise-free case. We will provide two conditions that guarantee the following convex model whose only global minimizer is the truth: min z kzk1, s.t. Az = Ax. The first is a sufficient condition, which is easy to check through computation, but is very strong such that it is difficult to design such matrix. Another one is a sufficient and necessary condition, which is much weaker than the first one, but is very difficult to verify through computation. Definition 33 (Mutual coherence). Consider a matrix A ∈RM×N whose columns are normalized to 1, i.e., kAjk2 = 1, for j = 1,2,...,N. The mutual coherence of the matrix A is defined as µ(A) = max i6=j |hAi,Aji|. Clearly, if A is an unitary matrix, we have µ(A) = 0. In the next, we show that the lower bound of mutual coherence when the column dimensionality is larger than row dimensionality. Theorem 34. Consider a full-rank matrix A ∈RM×N with M < N. Then, we have µ(A) ≥s N −M M(N −1) = O( 1 √M ). Proof. See the references. Consider a vector x ∈ RN, for any index set S, let xS denote the elements of x whose index is in S, and let Sc denote the complementation to S in [1,2,...,N]. 19 Theorem 35 (Sufficient condition). Consider a full rank matrix A ∈ RM×N with normalized columns. For any s-sparse vector x ∈RN satisfying s ≤ 1 2(1 + µ(A)−1), we have x = ¯ x, ¯ x ∈{argminkzk1,s.t.Az = Ax}. Proof. Let S denote the support of x. Consider a global minimizer, z, which is different from x. Define δ = z−x. Then, we have Aδ = 0. Notice that kzk1 ≤kxk1, and kzk1 −kxk1 = N X k=1 |z(k)|− N X k=1 |x(k)| =X Sc |z(k)|+X S (|z(k)|−|x(k)|) ≥X Sc |z(k)|−X S (|z(k)−x(k)|) =X Sc |δ(K)|−X S |δ(k)| = kδk1 −2X S |δ(k)|, We have then kδk1 ≤ 2kδSk1. From A>Aδ = 0, we have |δj| = |X i6=j
(A>A)j,iδi|≤X i6=j
|(A>A)j,i||δi|≤ µ(A)(kek1 −|δj|),
which gives
|δj|≤ (1 + µ(A))−1µ(A)kδk1. Taking the summation of |δjk over the index set S, we have kδSk1 =X j∈S |δj|≤ s(1 + µ(A))−1µ(A)kδk1 < 1 2 1 + µ(A)−1 1 + µ(A)−1|δk1 = 1 2kδk1. Thus, we have the contradiction. The proof is done. In the next, we provide a necessary and sufficient condition for exact recovery of sparse signals using `1-norm based model. Definition 36 (Null space property). A matrix A is said to satify the null space property with sparse degree S, if for any x ∈ ker(A): kxSk1 < kxSck1, for all index set S with its cardinality ≤ s. Theorem 37 (Sufficient and necessary condition). Consider a matrix A ∈RM×N. The sufficient and necessary condition that guarantees for any s- sparse vector x, x = ¯ x, ¯ x = argminAz=Axkzk1. is A satisfies null space property. 20 Proof. Sufficiency. For any s-sparse vector x, let S be the support of x. For any z 6= x that obeys Az = Ax, we have δ = z−x ∈ ker(A) and kzk1 = kxS + δSk1 +kδSck1,, ≥kxSk1 −kδSk1 +kδSck1, (29) = kxSk1 + (kδSck1 −kδSk1). As A satisfies null space property, kδSck1 ≥kδSk1. Thus, kxk1 = kxSk1 < kzk1, which leads to x = argminAz=Axkzk1. Necessity. For any index set S of size |S|≤ s and any δ ∈ ker(A). Define δS(k) = δ(k), if k ∈S; and 0 otherwise. Then, the vector δS is a s-sparse vector. Thus, we have δS = argminAz=AδSkzk1,. On the other hand, Aδ = 0 implies that A(−δSc) = AδS. Thus, we have kδSck1 > kδSk1, which means that A satisfies null space property.
When the system is contaminated by the noise, i.e.,
Ax = b + ,
where kzk2 ≤ δ. The following model will be used min z kzk1, s.t. kAz−bk2 2 ≤ δ2
Theorem 38. The matrix A is said to satisfy the robust null space property, if there exist two constants D > 0 and 0 ≤ β < 1 such that for any index set S with cardinality s, kxSk2 ≤ DkAxk2 + βs−1/2kxSck1. Suppose that A satisfies the robust null space property, then, the solution from the constrained optimization model above. denoted by ¯ x satisfies kx− ¯ xk2 ≤ Cδ, for some constant C. Proof. The proof is similar to null space property. It can be seen that robust null space property implies null space property by considering x ∈ ker(A) and the fact that kxSk1 ≤√skxSk2. 21 References [1] Royden H L, Fitzpatrick P. Real analysis. New York: Macmillan, 1988. [2] Rudin W. Functional analysis. International series in pure and applied mathematics[J]. 1991. [3] Daubechies I. Ten lectures on wavelets. Philadelphia: Society for industrial and applied math- ematics, 1992. [4] Fan Z, Ji H, Shen Z. Dual Gramian analysis: duality principle and unitary extension principle. Mathematics of Computation, 2016, 85(297): 239-270. [5] Ji H, Shen Z, Zhao Y. Directional Frames for Image Recovery: Multi-scale Discrete Gabor Frames. [6] B. Dong, Z. Shen, MRA based wavelet frames and applications, IAS Lecture Notes Series, Summer Program on ”The Mathematics of Image Processing”, Park City Mathematics Insti- tute. [7] Donoho D L, Elad M. Optimally sparse representation in general (nonorthogonal) dictionaries via `1 minimization. Proceedings of the National Academy of Sciences, 2003, 100(5): 2197- 2202. 22 Review In past classes, we studied the problem of sparsity. Sparsity problem is that we are given a set of basis D (dictionary, design matrix) and a signal y, and we want to find a data x of high dimensions, but of few elements being nonzero. If the measurements we have are linear measurements, then the problem can be expressed as y = Dx, where x is the data we want to recover from the linear measurements y and matrix D. If written down as a formulation, it is represented as minx ||x||0 s.t. y = Dx, or approximately y ≈ Dx, satisfying ky − Dxkp ≤ ε. Or if we limited the sparsity to be s, then it is represented as minx ||y − Dx|| 2 2 s.t. ||x||0 ≤ s. 1. Dictionary Learning Now we look at the reverse problem: could we design dictionaries based on learning? Our goal is to find the dictionary D that yields sparse representations for the training signals. We believe that such dictionaries have the potential to outperform commonly used predeter- mined dictionaries. With evergrowing computational capabilities, computational cost may become secondary in importance to the improved performance achievable by methods that adapt dictionaries for special classes of signals. Given y1, y2, · · · , yN , where N is large, find a dictionary matrix D with columns number K  N, such that every yi is a sparse combination of columns of D. From now on, denote dictionary matrix as D, with columns {di} K 1=1; collection of signals as a m × N matrix Y , whose columns are points yis; representation matrix X, a n × N matrix with columns xis being representation of yis in dictionary D. 24-1 Dictionary learning and sparse coding Intuitively, under natural idea, we would have the formulation represented as: min D,{xi}X i kyi −Dxik2 s.t. kxik0 ≤ s (24.1) But representation (24.1) has several issues to be considered: • Issue 1: there is a combinatorial sparsity constraint; • Issue 2: Optimization is over both D and {xi}, so this is a non-convex problem due to bi-linearity in objective function. 2. Dictionary Learning, Idea 1 Do Lagrangian relaxation to the formulation (24.1) and relax the `0 norm to `1 norm, we have min D,{xi}kY −DXk2 F + λkxk1 (24.2) Here kY −DXk2 F =Pikyi −Dxik2. Formulation (24.2) does penalize `1 norm of x, butdoes not penalize the entries of D as it does for x. Thus, the solution will tend to increase the dictionary entries values, in order to allow the coefficients to become closer to zero. This difficulty has been handled by constraining the `2 norm of each basis element, so that the output variance of the coefficients is kept at an appropriate level. An iterative method was suggested for solving (24.2). It includes two main steps in each iteration: • Sparse coding: calculate the coefficients xis; • Dictionary update: fix {xi}, update D. 24-2 The iterative method would lead to a local-minimum, so the performance of the solution is sensitive to the initial D and {xi} we choose and the method we use to upgrade dictionary and coefficients. In sparse coding part, there is a bunch of methods that could be used: OMP, `1 minimiza- tion, iterative thresholding, etc, which, more or less, were already been covered in former classes. In dictionary update part, fix {xi}, we are left with solving min D kY −DXk2 F (24.3) Option 1: directly solve the least square problem. Issue: the “guess” X might be lousy. By doing the “exact” solving step, the whole process would be “pushed” to local minimum nearest the initial guess. The situation is similar to “over-fitting” current observation in regression model, where noise is unsuitably considered in optimization and “push” regression model to a lousy direction. Option 2: simple gradient descent procedure with small step D(n+1) = D(n) −η N X i=1 (D(n)xi −yi)x0 i Initializing D: greedy algorithm 1 Pick largest (`2 norm) column of Y , move to D from Y ; 2 To all columns that remain in Y , subtract their orthogonal projection to range(D) repeat (1) and (2) till a full basis D is found. 3. Dictionary Learning, Idea 2 Unions of orthonormal dictionaries Consider a dictionary composed as a union of orthonormal bases. D = [D1,··· ,DL] where Di ∈O(m) (m is the dimension of yis), that is, D0 iDi = DiD0 i = Im×m, for i = 1,··· ,L. 24-3 Correspondingly, divide X to L submatrices, with each submatix Xi containing the co- efficients of of Di X = [X0 1,X0 2,··· ,X0 L]0 The update algorithm need to preserve the orthonormality of Dis in each iteration. As- suming known coefficients, the proposed algorithm updates each orthonormal basis Di se- quentially. The update of Di is done by first computing the residual matrix Ei = Y −X j6=i DjXj Then we solve minDi∈O(m) kEi − DiXik2F for updated Di 4. K-SVD algorithm Overall, this algorithm is a “generalization” of K-means algorithm for clustering points {yi}iN=1. K-means algorithm In K-means, first we are offered K centers, denoted as C = [c1,c2,··· ,cK]. This could also be seen as a basis or dictionary. The index j of point yi is selected by finding j such that kyi −Cejk2 2 ≤kyi −Cekk,∀k 6= j This has analogy to sparse coding stage. We force the sparsity of coefficients to be 1 and the nonzero component should be 1. Written as formulation and in matrix form min x kY −CXk2 F s.t. xi = ek for some k,∀i (kxik0 = 1) And in updating {ci} stage, let Rk be the set of index of points clustered into kth cluster ck = 1 |Rk|X i∈Rk yi = argmincX i∈Rk kyi −ck2 2 combine all K such optimization problem, we have update of C is C = argmin C kY −CXk2 F 24-4 remember here X comes from sparse coding stage, defining which cluster each point in Y lies. Denote kth row of X as xk T, then Y −CX = Y −X k ckxk T = Y −X j6=k cjxj T Ek −ckxk T Since the whole problem is decomposable to subproblems of finding each ck,∀k, so it is equivalent to sequentially search ck that minimizes kEk −ckxk Tk2 F =Pi∈Rk kyi −ckk2 2. Back to dictionary learning problem, Y −DX = (Y −X j6=k djxj T Ek )−dkxk T So kY −DXk2 F = kEk −dkxk Tk2 F In dictionary update step, analogous to K-means, assuming known X, we could sequentially solve min dk kEk −dkxk Tk2 F for update of columns of D. K-SVD algorithm: Data: {yi}N i=1 Initialization: D(0) ∈Rm×K, with each column normalized; t = 1 Repeat until convergence: 1 Do sparse coding, solve for each {xi}N i=1 using suitable algorithm; 2 for each column k = 1,··· ,K in D(t−1) 1 Define Rk = {i|1 ≤ i ≤ N,xk T(i) 6= 0} 2 Compute Ek = Y −Pj6=k djxj T, extract columns with indices in Rk, denote as ERk k ; ERk k ∈Rm×|Rk| 3 Apply SVD, ERk k = U∆V T, choose updated dictionary column dk as the first column of U, update the nonzero items of coefficient xk T by xk Rk, xk Rk is the first column of V multiplied by σ1(ERk k ) 3 Set t = t + 1 24-5 1 Introduction Background and goals: The goal of this lecture is to give you an introduction to the mathematics, algorithms, and applications in the field of sparse analysis, including sparse approximation and compressive sensing. Despite learning in high school mathematics that solving an under-determined linear system of the form Ax = b is impossible, mathematicians, computer scientists, and engineers attempt to do so in a myriad of fields, applications, and settings. This problem arises in signal and image processing where A plays the role of a redundant dictionary, b is the image or signal to compress, and x is the collection of “transform” coefficients that captures the exact linear combination of dictionary elements which closely approximate b. In this case, solving Ax = b amounts to finding a sparse or parsimonious linear representation over A that is close to the input signal b. Given A and b, a streaming algorithm strives to compute quickly statistical or combinatorial information about x. In these applications, the number of rows of the matrix A is smaller than the number of columns, sometimes exponentially so. In compressive sensing, one designs a matrix A, takes linear measurements of a signal using A, Ax = b, and from these measurements and knowledge of A, one tries to reconstruct information about x. One mathematical feature that allows us to solve many variants of this under-determined linear system problem is sparsity; roughly speaking, as long as the vector x does not contain too many non-zero components (or has a few dominating components), we can “solve” the under-determined system Ax = b. Common variants of the problem are collectively called sparse approximation (SA) problems. In this course we will highlight the multidisciplinary nature of sparse signal analysis, covering topics ranging from its mathematical core to its potential impact and applications, as well as the role of algorithms. Content: An approximate list of topics follows. 1. background: sparse and compressible signal models, mathematical approximation theory, the sparse representation problem, computational hardness; 2. algorithms: convex optimization, greedy pursuit methods 3. applications: image compression , machine learning, coding theory; 1 and discrete math), electrical engineering (signal processing and coding theory), computer science (algorithms and complexity theory), and statistics (statistical learning theory). The minimum requirements for a student to feel somewhat comfortable following the lectures are and linear algebra. There will almost certainly be topics for which you do not have the prerequisite courses so my main prerequisite is a willingness to learn new material, to forge ahead even when the topic is unfamiliar to you. 1.1 Basic image compression: transform coding We begin our study of sparse signal analysis with a simple example of image compression. In Figure 1, we take an image and compute its 2d wavelet transform. This transform is an orthogonal transform so the transform produces a set of wavelet coefficients that are nothing other than a change of basis from the original image. The left panel of Figure 1 shows both the original image and its 2d wavelet transform. Notice that many of the wavelet coefficients in the lower left figure are dark or close to zero. If we were to quantize all of the coefficient values, we would spend a considerable fraction of our total bits quantizing the small coefficient values. Instead, let’s spend most of our bits on the large (in absolute value) coefficients and set to zero those coefficients that are smaller (in absolute value) than a threshold θ. If we then compute the inverse wavelet transform of the thresholded coefficients, we obtain a close approximation to the original image (top right panel). Those few coefficients we retained capture almost all of the important features of the original image and there are considerably fewer of these coefficients than pixels in the original image so we have produced a lossy compressed representation that consists of a few non-zero wavelet coefficients. Figure 1: original image (top left); 2D wavelet transform of image (bottom left); threshold wavelet coefficients (bottom right); inverse wavelet transform of thresholded wavelet coefficients (top right) Let’s make this discussion a bit more formal. Let Φ denote orthogonal matrix that corresponds to our orthonormal basis; the vectors of the basis are the columns of the matrix Φ. And let x denote the signal that we wish to compress. The encoding algorithm that we used above The 2 Input: Φ and x Output: t sparse vector c = Φ∗x // compute orthogonal transform t = Θ(c) // threshold small coefficients decoding (or reconstruction) algorithm that we use to synthesize the image from its compressed representation is straightforward. Input: Φ and t Output: x ˆ x = Φt This encoding algorithm is an example of a nonlinear encoding algorithm and a linear decoding algorithm. The encoding algorithm is nonlinear in the sense that which coefficients are retained in the thresholding process depend on the input signal x. Or, more precisely, for two different input signals x and x0 Θ(Φ(x + x0)) = Θ(Φx + Φx0) 6= Θ(Φx) + Θ(Φx0). This configuration is a typical one for modern codecs; more computational resources are spent encoding signals or images so that the decoder can be relatively simple. Later in this course, we will see that compressive sensing attempts to invert this typical allocation of resources with a linear encoder and a nonlinear decoder. Nonlinear encoding with a fixed orthonormal basis Φ for signal compression has several advan- tages and one large disadvantage. Both the encoding and decoding algorithms are relatively easy to compute and, given a signal x, the compression that the encoder produces is the best set of coefficients for that signal; i.e., the compression is instance optimal. It is important to note that the compressibility of the input signal x depends on the fixed orthonormal basis Φ and that the compression is only as good as the basis Φ is for representing x. Given a large class of input signals x, it is challenging to design a single orthonormal basis that compresses all of the input signals well. This is especially true if the class consists of natural images. If we fix a single input x and then design one orthonormal basis Φ, it is easy to produce a basis that compresses x perfectly (simply take the first vector in the basis to be x, and all the other vector orthogonal to x); however, the orthonormal basis we construct may not compress many other signals! Throughout this discussion, we have referred informally to the terms sparsity and compressibility somewhat interchangeably but it’s important that we have precise definitions of these terms before we proceed. Definition 1 The sparsity of a vector x ∈ Rd or Cd is the number of non-zero entries in x. We denote this quantity by kxk0 despite the fact that this is not a proper norm. Definition 2 Suppose that we sort in decreasing absolute value the entries of a vector x ∈ Rd. The rate of decay of these entries is the compressibility of x. Let |x(`)| be the sorted vector x, then the compressibility is the exponent α in the relationship |x(`)| ∼ `−α. See Figure 2 for an example of a compressible signal and what effect thresholding entries in a compressible vector has on its sparsity. 3 Figure 2: The set Λ consists of those entries of a vector x that are larger (in absolute value) than a threshold . These definitions assume that the basis in which we represent or compress the signal x is the canonical basis but, if instead we use a different orthonormal basis Φ, the vector in question is the vector of coefficients Φx. Sparsity or compressibility are two ways of measuring the number of items we need to encode to represent a signal; the fewer the items, the fewer the bits, and the greater the compression. On a more philosophical level, sparsity or compressibility is a mathematical version of Occam’s razor; we strive to capture the essential features of a signal. 1.2 Complete and redundant dictionaries One of the biggest drawbacks of using a single orthonormal basis for compression, especially for a large class of input signals, is that the compression is only as good as the basis is “right” for that class and it is difficult to design a single orthonormal basis that compresses well a large class of signals. It is much easier, instead, to use more than orthonormal basis from which to represent a single input signal x or a class of signals. With apologies to George Orwell, if one orthonormal basis is good, surely two (or more) are better...especially for images. Figure 3 shows the different features that are present in an image, from the brush-like texture of the mandrill’s hair to the pointillist-like residual that remains after we remove a linear combination of both a few wavelets and a few wavelet packets. No single orthonormal basis is rich enough (or large enough!) to include vectors that resemble all of these features. Which bring us to a rigorous definition of a dictionary for signal representation. Definition 3 A dictionary Φ in Rd (or Cd) is a collection {φ`}N `=1 ⊂ Rd (or Cd) of unit- normvectors: kφ`k2 = 1 for all `. • Elements are called atoms • If span{φ`}= Rd, the dictionary is complete • If {φ`} are linearly dependent and the dictionary is complete, the dictionary is redundant We represent a dictionary as a matrix Φ =φ1 φ2 ... φNwith atoms as the columns of the matrix so that Φc =X ` c`φ`. See Figure 4 for the various linear algebraic operations. 4 Figure 3: Different bases capture different features of an image. x Φᵀ Φ c Figure 4: The operation Φc produces a linear combination of dictionary atoms and is used to synthesize a signal in Rd (left) while the operation ΦTx produces a vector of dot products of x with the dictionary atoms or the analysis of a signal over a dictionary (right). 1.1.1 Examples 1. Fourier-Dirac: Φ = [F|I ]. In Cd, we define the dictionary as: φ`(t) = 1 √de2πi`t/d ` = 1,2,...,d φ`(t) = δ`(t) ` = d +1,d +2,...,2d. It is of size 2d and the atoms are shown in Figure 5 . 2. DCT-Wavelets: Φ = [F|W]. In Rd×d (or two-dimensional images), we have the union of the discrete cosine transform (DCT) (a basis traditionally used to compress video frames) and 5 Figure 5: The first four rows are the Fourier basis and the second four rows show the Dirac basis for d = 32. two-dimensional wavelets (another plausible basis for compressing images). This dictionary is of size 2d2, or twice as large as d×d images themselves. See Figure 6 for several representative atoms in this dictionary. Figure 6: The DCT of an image (left) and three different 2-dimensional wavelets (right). 3. Wavelet packets: Φ in Rd consists of dlogd different wavelet packets which can be arranged into 2O(d) different orthonormal bases. See Figure 7 for an example. 1.3 Sparse approximation basic problems If we use the convention that we represent a signal as a linear combination over a redundant dictionary Φc, then there are two competing metrics for the quality of this representation: (i) the cost of the representation itself (typically, how sparse or how compressible c is) and (ii) the error in the approximation x ≈ Φc. In signal or image compression, we usually trade-off the rate of the encodingwiththedistortionthatitproducesascomparedtotheoriginalsignal. Wewilldefinethese terms slightly differently from that of typical source coding; nonetheless, our definitions address this inherent trade-off in how expensive versus how accurate the representation or compression is. The following three problems capture precisely three types of compression or signal represen- tation: lossless compression, lossy compression that meets an error (or distortion) tolerance, and lossy compression that meets a memory (or rate) tolerance. 6 Figure 7: The Haar wavelet packet dictionary for d = 8. • Exact.Given a vector x ∈Rd and a complete dictionary Φ, solve argminckck0 s.t. x = Φc i.e., find a sparsest representation of x over Φ. • Error.Given  ≥0, solve argminckck0 s.t. kx−Φck2 ≤  i.e., find a sparsest approximation of x that achieves error . • Sparse.Given k ≥1, solve argminckx−Φck2 s.t. kck0 ≤ k i.e., find the best approximation of x using k atoms. Now that we have formalized three fundamental sparse approximation problems, there are two directions to pursue: (i) how to compute sparse approximations and (ii) if we could compute them, how accurate would they be (or how should we evaluate the output of the algorithms)? In the next lecture, we will discuss (ii), how do we evaluate the compressibility of signals and signal classes in different bases, an excursion into mathematical approximation theory. 7 2.1 Agenda 1. SA algorithms for orthonormal bases (in Rn) 2. Arbitrary complete or redundant dictionaries: Bad news for algorithms In this lecture, we turn to algorithms and how to compute these sparse representations. We begin with a simple setting, we restrict Φ to be an orthonormal basis, so as to understand (i) the basic architecture of one of the two major classes of algorithms, (ii) how to analyze algorithms, (iii) how to prove that the algorithms perform as advertised (and what that advertisement should say), and (iv) what pitfalls we might encounter in generalizing both the analysis and algorithms to less simple complete dictionaries. We also begin a (short!) crash course in complexity theory so that we can both prove the hardness of the SA problems and understand what that result tells us. 2.2 SA algorithms for orthonormal bases Let Φ be an arbitrary orthonormal basis in Rn and let’s try to solve (with provable, efficient algorithms) the three sparse approximation problems. Because Φ is an orthonormal basis, the matrix Φ which columns are the vectors in the basis is an orthogonal matrix and it satisfies Φ−1 = ΦT. 2.2.1 Exact Exact: Given Φ an orthonormal basis and x an arbitrary vector, find the sparsest vector c such that Φc = x. This problem is trivial to solve when Φ is an orthonormal basis. We set c = ΦTx = Φ−1x; i.e., each entry in the vector c` =hx,φ`i. This representation is unique as Φ is an orthonormal basis. The algorithm consists of multiplying ΦT by x which can be done in time O(n2) for an arbitrary basis, or faster if the transform has special structure (e.g., O(nlogn) for the Fourier transform or O(n) for wavelets). 2.2.2 Sparse Sparse: Given Φ an orthonormal basis, x an arbitrary vector, and k a sparsity parameter, find the vector c such that kΦc−xk2 is minimized and kck0 ≤ k. We will analyze Algorithm 1 and show that it solves the Sparse problem. First, we remember from our previous lectures that for an orthonormal basis, the k-term representation that minimizes the `2 distance to the vector x consists of the k vectors that enjoy the largest dot products with x so our algorithm should return that representation. We must check that the above algorithm produces the correct vector c. Furthermore, the straightforward algorithm would be to compute the orthonormal transform Φx, sort the coefficients, and return the k largest but it’s not immediately clear that the above algorithm returns this answer as it’s not the straightforward algorithm. It is, however, the same architecture as one of the two main algorithmic techniques in sparse approxima- tion so we present it for the simple case of an orthonormal basis for expository reasons. It is what 8 Algorithm 1: Prototype greedy algorithm for Sparse when Φ is an orthonormal basis. Input: Φ, x, and k Output: c Initialize a = 0, Λ =∅ for j = 1 to k do r = x−a λj = argmax`|hr,φ`i| Λ = Λ∪{λj} cλj =hr,φλji a =Pλ∈Λ cλφλend is known as a greedy algorithm; it chooses the vector that is the best at each step of an iterative procedure. The best vector is the one that has largest dot product (or highest correlation) with the residual r = x−a at the current iteration. Finally, this algorithm builds up the approximation a one vector at a time and iterates for k steps. Let’s analyze what this algorithm achieves; i.e., let’s prove that this algorithm computes effi- ciently the best k-term representation of x over Φ. When j = 1, the approximation a = 0 (by initialization) and so r = x. The maximization step selects λ1 = argmax`|hx,φ`i| the index of the biggest (in absolute value) dot product or transform coefficient. Then the algorithm includes the index λ1 in the index set Λ, sets entry of cλ1 equal to the maximizing dot product cλ1 =hx,φ`i, and the approximation at the end of the first iteration is a = cλ1φλ1. When j = 2, the approximation consists of one term so the residual is not simply equal to x, r = x −cλ1φλ1. The maximization step selects λ2 = argmax`|hx−cλ1φλ1,φ`i|. Because Φ is an orthonormal basis, x has a unique expansion over Φ, x =Pj cjφj, and dot products amongst the vectors in Φ simplify as hφj,φ`i= 1 ⇐⇒ j = `. Using these two pieces of information, we can simplify the maximization step λ2 = argmax`|hx−cλ1φλ1,φ`i| = argmax`|hX6 =λ1 cjφj,φ`i| = argmax`|X6 =λ1 cjhφj,φ`i| = argmax`6=λ1|c`| 9 and observe that λ2 is the index of the second largest (in absolute value) dot product or transform coefficient of x. We then update the index set of the approximation a, Λ = {λ1, λ2}, and the approximation itself consists of two terms a = cλ1φλ1 + cλ2φλ2 . The run time of the algorithm if implemented as written above would be, per iteration, O(n2) to compute the dot products (depending on the particular basis Φ) with the residual r and find the largest, plus O(n) to update the vectors r and a. As the dominating term is O(n2) per iteration and there are k iterations, the total running time is O(kn2). If we let T denote the time to compute the transform Φ, the total time is O(kT ). Note that updating the vectors r and a in each iteration is simple as the change in a each iteration consists of adding one new vector cλφλ (scaled by a constant). We have shown Theorem 1 Algorithm 1 solves the Sparse problem; i.e., returns a vector c such that kx−Φck2 is minimized and kck0 ≤ k. It does so in time O(kT ) where T is the time to compute the orthonormal transform Φ. It is also easy to see from our analysis above that we don’t have to compute a Φ transform every iteration. A better implementation would first compute all the coefficients Φx (in time T ) and then find the k largest (in magnitude), either by sorting all of the coefficients (in time O(n log n)) and returning the k largest or by a more sophisticated procedure. A textbook solution for returning the k biggest items in a list of size n is to maintain a priority queue of size k and slide that along the list, inserting new entries in the queue according to their size if they are larger than the smallest existing item in the queue. This algorithm runs in time O(n log k). There is a faster deterministic algorithm in [1]that runs in time O(n) but is somewhat complicated and a randomized version that is much simpler with similar running time O(n). 10 2.2.3 Error Error: Given Φ an orthonormal basis, x an arbitrary vector, and  a fidelity parameter, find the sparsest vector c such that kΦc−xk2 ≤ . After our previous discussion, it is easy to see that to obtain a provable, efficient algorithm for Error: is a simple modification of Algorithm 1 ; we change the for loop to a while loop with the conditional that while kx−ak2 >> , we iterate. 2.3 Arbitrary complete, redundant dictionaries When the dictionary Φ is an orthonormal basis, all three sparse

approximation problems are easy to solve; they admit efficient algorithms that return the optimal solutions. Unfortunately, this is not true for general or arbitrary

complete or redundant dictionaries. In the next lecture, we will discuss how hard is it to solve the general sparse approximation problems, how difficult are they

compared to well-studied problems such as Primes, SAT, and whether or not there is a feasible algorithm that is guaranteed to approximate the optimal solution. Before

we proceed with a crash course in complexity theory, we have some bad news: Sparse approximation problems are at least as hard to solve as a long list of challenging

problems, including SAT, Traveling Salesman, Knapsack, Subset Sum, etc.
References
[1] Charles E. Rivest Ronald L. Stein Clifford Cormen, Thomas H.; Leiserson. Introduction to Algorithms. MIT Press and McGraw-Hill, 3rd edition, 2009. 3
11
3.1 Agenda
1. Coherence of a redundant dictionary
2. Approximation algorithms for sparse approximation problems
In this lecture, we will discuss sufficient geometric conditions on a redundant dictionary that will permit efficient (approximation) algorithms for sparse approximation

problems. The simplest geometric condition is that of coherence, how correlated are the atoms of the dictionary with one another. We will focus on the Sparse problem

first and show that several greedy, iterative algorithms solve it approximately if the redundant dictionary is fairly incoherent. In class this lecture was divided into

two pieces, so consider this write-up Lectures 5 and 6.
3.2 Coherence of a redundant dictionary To motivate the definition of coherence of a dictionary, let us recall the iterative algorithm we employed to solve the Sparse

problem when Φ is an orthonormal basis. The key step in the
Algorithm 1: Prototype greedy algorithm for Sparse. Input: Φ, x, and k Output: c Initialize a = 0, Λ = ∅ for j = 1 to k do r = x−a λj = argmax`|hr,φ`i| Λ = Λ∪{λj} cλj

= hr,φλji a =Pλ∈Λ cλφλend
algorithm is the maximization step
hr,φji = hx−ciφi,φji = hx,φji−cihφi,φji illustrates what can go wrong when we do not have an orthonormal basis. This example leads us to define how coherent the atoms

of a dictionary are.
Definition 1 The coherence of a dictionary is the largest dot product between distinct pairs of atoms µ = max j6=` |hφj,φ`i|. Figure 1 illustrates two different

collections of atoms in R2; one with small coherence (which is a good property) and one with large coherence (which is a bad property). One might wonder just how small

the coherence of a collection of N atoms can be in dimension d, especially if N is quite large compared to d. Fortunately, there are a number of results of this type

from coding theory (as redundant dictionaries are nothing other than spherical codes). The most straightforward lower bound on the coherence is known as the Welch

bound.
12
φ1
φ2 φ3
φ1
φ2
φ3
Small coherence (good)
Large coherence (bad)
Figure 1: Coherence of two different collections of atoms in R2.
Theorem 2 For a d×N redundant dictionary, µ ≥s N −d d(N −1) ≈
1 √d.
It is always useful to have at hand several examples of large, incoherent dictionaries from which to draw intuition. • Fourier–Dirac, N = 2d, µ = 1 √d • wavelet

packets, N = dlogd, µ = 1 √2 • There are large dictionaries with coherence close to the lower (Welch) bound; e.g., Kerdock codes, N = d2, µ = 1/√d
3.3 Approximation algorithms for sparse approximation
The word approximation has several uses in this topic: one is a mathematical meaning, we ap- proximate a vector, and the other is an algorithmic meaning, the output of

an algorithm is an approximation to the optimal output. The optimal output is what an algorithm would return if it took all the necessary time to find the best

representation for the input over the dictionary. Another term to describe such a “perfect” algorithm is to say that an oracle would give us the best representation

for free (simply for asking). It is this representation against which we want to compare the output of our algorithm. Let’s denote the optimal solution as cOPT and the

error of this solution as the optimal error, EOPT = kΦcopt −xk2. Our algorithm returns a coefficient vectorb c with 1. kb ck0 = k 13
Figure 2: Kerdock codes are a union of orthonormal bases; pairs of bases are Fourier duals of one another under an appropriate group action. 2. E = kΦb c−xk2 ≤ C1EOPT;

that is, the error of output is guaranteed to be less than a constant C1 times the optimal error EOPT. So, our algorithm doesn’t necessarily return the best

representation but it is guaranteed to be a fairly good one, where fairly good is measured in terms of the optimal error. If the optimal error is zero, however, our

approximation algorithm is guaranteed to find the optimal solution. This is an important point that may be easy to overlook in one’s first encounter with approximation

algorithms. If an input vector x has an exact sparse representation with k atoms over a dictionary Φ, an approximation algorithm will find it. Frequently, in

approximation algorithms we measure the approximation ratio of the algorithm or the ratio of the output error as compared to the optimal error
E EOPT
=
C1EOPT EOPT
= C1.
We will see later that for some algorithms the approximation ratio is truly a constant and for others it depends on the coherence µ of the dictionary and k the desired

sparsity. A constant factor approximation ratio is clearly better than one that depends on several input parameters but an algorithm with such a guarantee may have a

longer running time or may require more stringent conditions than an algorithm with a weaker approximation guarantee. These types of analyses are all important in the

analysis of sparse approximation algorithms.
3.3.1 Greedy algorithms We will focus our attention on greedy, iterative algorithms first (and return to different algorithmic formulations later). By greedy, iterative

algorithms we mean those that choose the best atom (or collection thereof) at each step of the algorithm. The main algorithm that we analyze is Orthogonal Matching

Pursuit. Notice that this algorithm is very similar to our prototype greedy algorithm but it differs in one important place; the method by which we calculate the

coefficients cλ that participate in the representation is more complicated. At each step, the coefficients are the solution to a least-squares problem. That is, we compute

the orthogonal projection of x onto the subspace spanned by the columns indexed by Λ, the active set of atoms in
14
the representation. The orthogonal projection step is precisely where OMP gets the leading “O” in its name. Because at each step of the algorithm, the representation

is the orthogonal projection of x onto the span of the selected atoms, each selected atom in the representation appears only once. That is, the residual is orthogonal

to the representation so atoms that already participate in the representation cannot be selected again at a later step of the algorithm. When one implements this

algorithm, one should be careful about implementing the least- squares step. Naive implementations will result in poor empirical performance for OMP on large input.
Algorithm 2: Orthogonal Matching Pursuit (OMP). Input: Φ, x, and k Output: c Initialize a = 0, Λ = ∅ for j = 1 to k do r = x−a λj = argmax`|hr,φ`i| Λ = Λ∪{λj} Find cλ

for λ ∈ Λ that minimizes kx−Pλ∈Λ cλφλk2 Update a =Pλ∈Λ cλφλend
Many greedy algorithms have a similar outline and we enumerate a few adjustments one could make to this basic architecture. There are several adjustments to make in

how coefficients are calculated and what the selection criterion should be. • Matching Pursuit: replace the coefficient update step with cλt ←− cλt +hr,φλti. •

Thresholding: Choose m atoms where |hx,φλi| are among m largest. There are alternate stopping rules as well. • Terminate the loop when the residual norm is smaller

than a threshold: krk2 ≤ . • Terminate the loop when the largest dot product with the residual is smaller than a threshold: maxλ|hrt,φ`i|≤ .
15

find the cost of your paper