## Project on Dictionary Learning

Problem

Consider a set of signals $\{Y_1, Y_2, \dots, Y_L\} \subset \mathbb{R}^N$. Suppose that there exists an overcomplete dictionary $D \in \mathbb{R}^{N\times M}$ with $M \ge N$ such that each signal can be sparsely approximated under the dictionary $D$. Dictionary learning aims at finding such a dictionary so that each signal $Y_\ell$ can be well approximated by $Dc_\ell$, where $c_\ell \in \mathbb{R}^M$ is a sparse vector. The dictionary learning problem can be formulated as solving the following optimization problem:

$$\min_{D \in \mathbb{R}^{N\times M},\, C \in \mathbb{R}^{M\times L}} \|Y - DC\|_F^2, \tag{1}$$

subject to

$$\|D_j\|_1 = 1, \quad \text{for } j = 1, 2, \dots, M; \qquad \|C_\ell\|_0 \le K, \quad \text{for } \ell = 1, 2, \dots, L.$$

Goal

In the lecture notes, the K-SVD method is presented for solving the dictionary learning problem. The goal of this project is to implement the K-SVD method, with possible modifications, to obtain your own version of a dictionary learning method that solves (1).

Implementation

In this project, the following routine is expected:

    DICTLEARN   Compute the sparsity-based dictionary of input signals.
    D = dictlearn(Y, M, K) computes the dictionary D with M atoms
    that can sparsify the set of signals Y with sparsity degree K.
    For L signals of size Nx1, the dimensionality of Y is NxL.
    The dimensionality of D is NxM with M >= N.
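Although the submission itself must be MATLAB (dictlearn.m), the alternating structure of K-SVD — sparse coding of every signal, then one SVD-based rank-1 update per atom — can be sketched compactly. The following Python/NumPy sketch is illustrative only: every function name and default in it is an assumption of this write-up, not part of the project specification, and it normalizes atoms in the $\ell_2$ norm as standard K-SVD does (adapt the normalization to the constraint in (1) as needed).

```python
import numpy as np

def omp(D, y, K):
    """Greedy sparse coding: pick up to K atoms of D to approximate y."""
    N, M = D.shape
    r, support = y.copy(), []
    c = np.zeros(M)
    for _ in range(K):
        j = int(np.argmax(np.abs(D.T @ r)))      # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        r = y - D[:, support] @ coef             # new residual
    c[support] = coef
    return c

def dictlearn(Y, M, K, n_iter=20, seed=0):
    """K-SVD-style dictionary learning sketch: alternate sparse coding
    and per-atom SVD updates. Returns D of shape (N, M)."""
    rng = np.random.default_rng(seed)
    N, L = Y.shape
    D = rng.standard_normal((N, M))
    D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
    for _ in range(n_iter):
        # Sparse coding stage: code every signal with the current dictionary.
        C = np.column_stack([omp(D, Y[:, l], K) for l in range(L)])
        # Dictionary update stage: refresh each atom with a rank-1 SVD fit.
        for j in range(M):
            users = np.nonzero(C[j, :])[0]       # signals that use atom j
            if users.size == 0:
                continue
            Cj = C.copy()
            Cj[j, users] = 0.0
            E = Y[:, users] - D @ Cj[:, users]   # residual without atom j
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]                    # best rank-1 direction
            C[j, users] = s[0] * Vt[0, :]        # matching coefficients
    return D
```

A MATLAB port follows the same structure; the inner lstsq call corresponds to MATLAB's backslash operator.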

National University of Singapore, Department of Mathematics
MA5232 Modeling and Numerical Simulations, 02/2017
Grading Policy

This is an individual project. You may discuss the ideas with your classmates, but the sharing of code is strictly prohibited. You are not allowed to use any code from online sources or from other classmates' projects; this will be checked via a standard code plagiarism detection system (e.g., theory.stanford.edu/~aiken/moss). The grade of the submitted project is based on the following three factors: 1) whether it meets the guidelines; 2) the correctness of the results; and 3) the computational efficiency in terms of running time in MATLAB.

Submissions

You are required to submit the following:

1. (optional) A PDF file that describes any idea or technique for either speeding up the computation of the algorithm or improving its stability.
2. The source code of DICTLEARN in program format, e.g., dictlearn.m.

Please package all your files in a single zip file named by your student ID with the suffix "-DL", and upload the zip file to IVLE, e.g., U1234567-DL.zip. The deadline for

4 Sparse recovery and convex relaxation

A vector $x \in \mathbb{R}^N$ is called $s$-sparse if at most $s$ of its entries are nonzero. Seeking a sparse solution of an under-determined system $Ax = b$ can be done by solving

$$\min_z \|z\|_0, \quad \text{s.t. } Az = Ax.$$

Theorem 32. Consider a matrix $A \in \mathbb{R}^{M\times N}$ with normalized columns. Suppose that every $M \times M$ submatrix of $A$ is full-rank. Then $x$ is the only global minimizer of the above $\ell_0$-norm based model, provided that $s < \frac{M}{2}$.

Proof. Suppose that $\bar{x}$ is a global minimizer; then it has at most $s$ nonzero elements. The vector $\delta = x - \bar{x}$ has at most $2s < M$ nonzero elements, and $A\delta = 0$. Since every $M \times M$ submatrix of $A$ is full-rank, any $M$ or fewer columns of $A$ are linearly independent, and thus $\delta = 0$. The proof is done.

Although the theorem above shows that the $\ell_0$-norm based model finds the truth under a very mild condition — namely, as long as the sparsity degree of $x$ is less than half the number of available measurements — solving this model is known to be NP-hard, and thus OMP cannot guarantee that it finds the correct solution. One often-used strategy is to use the $\ell_1$ norm as a convex relaxation of the $\ell_0$ norm. The advantage is that there exist many numerical solvers that find a global solution (or obtain a solution with the desired accuracy in finitely many iterations). However, the question remains whether a global minimizer corresponds to the truth, as there might be many global minimizers, since the optimization problem is not strictly convex.

In what follows we consider the noise-free case. We will provide two conditions that guarantee that the truth is the only global minimizer of the convex model

$$\min_z \|z\|_1, \quad \text{s.t. } Az = Ax.$$

The first is a sufficient condition, which is easy to check through computation, but so strong that it is difficult to design matrices satisfying it. The other is a sufficient and necessary condition, which is much weaker than the first, but very difficult to verify through computation.

Definition 33 (Mutual coherence). Consider a matrix $A \in \mathbb{R}^{M\times N}$ whose columns are normalized to 1, i.e., $\|A_j\|_2 = 1$ for $j = 1, 2, \dots, N$. The mutual coherence of the matrix $A$ is defined as

$$\mu(A) = \max_{i \ne j} |\langle A_i, A_j \rangle|.$$

Clearly, if $A$ is a unitary matrix, we have $\mu(A) = 0$. Next, we give a lower bound on the mutual coherence when the column dimensionality is larger than the row dimensionality.

Theorem 34. Consider a full-rank matrix $A \in \mathbb{R}^{M\times N}$ with $M < N$. Then we have

$$\mu(A) \ge \sqrt{\frac{N-M}{M(N-1)}} = O\!\left(\frac{1}{\sqrt{M}}\right).$$

Proof. See the references.

Consider a vector $x \in \mathbb{R}^N$. For any index set $S$, let $x_S$ denote the entries of $x$ whose indices are in $S$, and let $S^c$ denote the complement of $S$ in $\{1, 2, \dots, N\}$.

Theorem 35 (Sufficient condition). Consider a full-rank matrix $A \in \mathbb{R}^{M\times N}$ with normalized columns. For any $s$-sparse vector $x \in \mathbb{R}^N$ satisfying $s < \frac{1}{2}(1 + \mu(A)^{-1})$, we have $x = \bar{x}$ for every $\bar{x} \in \{\arg\min \|z\|_1, \text{ s.t. } Az = Ax\}$.

Proof. Let $S$ denote the support of $x$. Suppose $z$ is a global minimizer different from $x$, and define $\delta = z - x$; then $A\delta = 0$. Since $\|z\|_1 \le \|x\|_1$,

$$\|z\|_1 - \|x\|_1 = \sum_{k=1}^N |z(k)| - \sum_{k=1}^N |x(k)| = \sum_{S^c} |z(k)| + \sum_S \big(|z(k)| - |x(k)|\big) \ge \sum_{S^c} |z(k)| - \sum_S |z(k) - x(k)| = \sum_{S^c} |\delta(k)| - \sum_S |\delta(k)| = \|\delta\|_1 - 2\sum_S |\delta(k)|,$$

and therefore $\|\delta\|_1 \le 2\|\delta_S\|_1$. From $A^\top A\delta = 0$, for every $j$ we have

$$|\delta_j| = \Big|\sum_{i \ne j} (A^\top A)_{j,i}\,\delta_i\Big| \le \sum_{i \ne j} |(A^\top A)_{j,i}|\,|\delta_i| \le \mu(A)\big(\|\delta\|_1 - |\delta_j|\big),$$

which gives

$$|\delta_j| \le (1 + \mu(A))^{-1}\mu(A)\,\|\delta\|_1.$$

Summing $|\delta_j|$ over the index set $S$, we have

$$\|\delta_S\|_1 = \sum_{j \in S} |\delta_j| \le s\,\frac{\mu(A)}{1+\mu(A)}\,\|\delta\|_1 < \frac{1}{2}\big(1 + \mu(A)^{-1}\big)\,\frac{\mu(A)}{1+\mu(A)}\,\|\delta\|_1 = \frac{1}{2}\|\delta\|_1.$$

This contradicts $\|\delta\|_1 \le 2\|\delta_S\|_1$. The proof is done.

Next, we provide a necessary and sufficient condition for exact recovery of sparse signals using the $\ell_1$-norm based model.

Definition 36 (Null space property). A matrix $A$ is said to satisfy the null space property of order $s$ if for every nonzero $x \in \ker(A)$ and every index set $S$ of cardinality at most $s$,

$$\|x_S\|_1 < \|x_{S^c}\|_1.$$

Theorem 37 (Sufficient and necessary condition). Consider a matrix $A \in \mathbb{R}^{M\times N}$. Every $s$-sparse vector $x$ satisfies

$$x = \arg\min_{Az = Ax} \|z\|_1,$$

with the minimizer unique, if and only if $A$ satisfies the null space property of order $s$.

Proof. Sufficiency. For any $s$-sparse vector $x$, let $S$ be its support. For any $z \ne x$ obeying $Az = Ax$, we have $\delta = z - x \in \ker(A)\setminus\{0\}$ and

$$\|z\|_1 = \|x_S + \delta_S\|_1 + \|\delta_{S^c}\|_1 \ge \|x_S\|_1 - \|\delta_S\|_1 + \|\delta_{S^c}\|_1 = \|x_S\|_1 + \big(\|\delta_{S^c}\|_1 - \|\delta_S\|_1\big).$$

Since $A$ satisfies the null space property, $\|\delta_{S^c}\|_1 > \|\delta_S\|_1$. Thus $\|x\|_1 = \|x_S\|_1 < \|z\|_1$, which shows that $x = \arg\min_{Az = Ax} \|z\|_1$.

Necessity. Take any index set $S$ of size $|S| \le s$ and any nonzero $\delta \in \ker(A)$. Define $\delta_S(k) = \delta(k)$ if $k \in S$, and $0$ otherwise. Then $\delta_S$ is an $s$-sparse vector, so by assumption $\delta_S = \arg\min_{Az = A\delta_S} \|z\|_1$. On the other hand, $A\delta = 0$ implies $A(-\delta_{S^c}) = A\delta_S$, so $-\delta_{S^c}$ is feasible and $\|\delta_{S^c}\|_1 > \|\delta_S\|_1$, which means that $A$ satisfies the null space property.
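Both the mutual coherence of Definition 33 and the sparsity bound of Theorem 35 are cheap to evaluate numerically. A small illustrative sketch (Python/NumPy; the function names are this write-up's, not the notes'):

```python
import numpy as np

def mutual_coherence(A):
    """mu(A) = max_{i != j} |<A_i, A_j>| for a matrix with unit-norm columns."""
    A = A / np.linalg.norm(A, axis=0)    # normalize columns defensively
    G = np.abs(A.T @ A)                  # |<A_i, A_j>| for all pairs
    np.fill_diagonal(G, 0.0)             # drop the i == j terms
    return float(G.max())

def coherence_sparsity_bound(A):
    """Sparsity threshold of Theorem 35: recovery holds for s < (1/2)(1 + 1/mu)."""
    mu = mutual_coherence(A)
    return np.inf if mu == 0.0 else 0.5 * (1.0 + 1.0 / mu)
```

For a unitary matrix the coherence is 0 and the bound is vacuous (infinite); as the coherence grows toward 1, the guaranteed sparsity level shrinks toward 1.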
When the system is contaminated by noise, i.e.,

$$Ax = b + \varepsilon,$$

where $\|\varepsilon\|_2 \le \delta$, the following model is used instead:

$$\min_z \|z\|_1, \quad \text{s.t. } \|Az - b\|_2^2 \le \delta^2.$$
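In the noise-free setting, the $\ell_1$ model above is a linear program: splitting $z = u - v$ with $u, v \ge 0$ turns $\min \|z\|_1$ s.t. $Az = b$ into $\min \sum_i (u_i + v_i)$ subject to $A(u - v) = b$. A sketch using scipy.optimize.linprog (SciPy is an assumed dependency here; the notes themselves do not prescribe a solver):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """Solve min ||z||_1 s.t. Az = b as an LP via the split z = u - v, u, v >= 0."""
    M, N = A.shape
    c = np.ones(2 * N)                   # objective: sum(u) + sum(v) = ||z||_1
    A_eq = np.hstack([A, -A])            # equality constraint A(u - v) = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    u, v = res.x[:N], res.x[N:]
    return u - v
```

For the noisy model with the quadratic constraint, a second-order cone solver is needed instead of a plain LP solver.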

approximation problems are easy to solve; they admit efficient algorithms that return the optimal solutions. Unfortunately, this is not true for general or arbitrary complete or redundant dictionaries. In the next lecture, we will discuss how hard the general sparse approximation problems are to solve, how difficult they are compared to well-studied problems such as Primes and SAT, and whether or not there is a feasible algorithm that is guaranteed to approximate the optimal solution. Before we proceed with a crash course in complexity theory, we have some bad news: sparse approximation problems are at least as hard to solve as a long list of challenging problems, including SAT, Traveling Salesman, Knapsack, Subset Sum, etc.
References

[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 3rd edition, 2009.
3.1 Agenda
1. Coherence of a redundant dictionary
2. Approximation algorithms for sparse approximation problems
In this lecture, we will discuss sufficient geometric conditions on a redundant dictionary that permit efficient (approximation) algorithms for sparse approximation problems. The simplest geometric condition is coherence: how correlated the atoms of the dictionary are with one another. We will focus on the Sparse problem first and show that several greedy, iterative algorithms solve it approximately if the redundant dictionary is fairly incoherent. In class this lecture was divided into two pieces, so consider this write-up Lectures 5 and 6.
3.2 Coherence of a redundant dictionary

To motivate the definition of the coherence of a dictionary, let us recall the iterative algorithm we employed to solve the Sparse problem when $\Phi$ is an orthonormal basis.

    Algorithm 1: Prototype greedy algorithm for Sparse.
      Input: Φ, x, and k
      Output: c
      Initialize a = 0, Λ = ∅
      for j = 1 to k do
          r = x − a
          λ_j = argmax_ℓ |⟨r, φ_ℓ⟩|
          Λ = Λ ∪ {λ_j}
          c_{λ_j} = ⟨r, φ_{λ_j}⟩
          a = Σ_{λ∈Λ} c_λ φ_λ
      end

The key step in the algorithm is the maximization step. The expansion

$$\langle r, \varphi_j \rangle = \langle x - c_i\varphi_i, \varphi_j \rangle = \langle x, \varphi_j \rangle - c_i\langle \varphi_i, \varphi_j \rangle$$

illustrates what can go wrong when we do not have an orthonormal basis: the cross term $c_i\langle \varphi_i, \varphi_j \rangle$ vanishes only when the atoms are orthogonal. This example leads us to define how coherent the atoms of a dictionary are.
Definition 1. The coherence of a dictionary is the largest dot product between distinct pairs of atoms:

$$\mu = \max_{j \ne \ell} |\langle \varphi_j, \varphi_\ell \rangle|.$$

Figure 1 illustrates two different collections of atoms in $\mathbb{R}^2$: one with small coherence (which is a good property) and one with large coherence (which is a bad property). One might wonder just how small the coherence of a collection of $N$ atoms can be in dimension $d$, especially if $N$ is quite large compared to $d$. Fortunately, there are a number of results of this type from coding theory (as redundant dictionaries are nothing other than spherical codes). The most straightforward lower bound on the coherence is known as the Welch bound.
Figure 1: Coherence of two different collections of atoms $\varphi_1, \varphi_2, \varphi_3$ in $\mathbb{R}^2$; one collection has small coherence (good), the other has large coherence (bad).
Theorem 2. For a $d \times N$ redundant dictionary,

$$\mu \ge \sqrt{\frac{N-d}{d(N-1)}} \approx \frac{1}{\sqrt{d}}.$$
It is always useful to have at hand several examples of large, incoherent dictionaries from which to draw intuition.

- Fourier–Dirac: $N = 2d$, $\mu = 1/\sqrt{d}$.
- Wavelet packets: $N = d\log d$, $\mu = 1/\sqrt{2}$.
- There are large dictionaries with coherence close to the lower (Welch) bound; e.g., Kerdock codes: $N = d^2$, $\mu = 1/\sqrt{d}$.
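The Fourier–Dirac example is easy to verify numerically; a quick sketch (Python/NumPy), assuming the unitary DFT as the Fourier basis:

```python
import numpy as np

d = 16
dirac = np.eye(d)                              # standard (Dirac) basis
fourier = np.fft.fft(np.eye(d)) / np.sqrt(d)   # unitary DFT basis
Phi = np.hstack([dirac, fourier])              # Fourier-Dirac dictionary, N = 2d

G = np.abs(Phi.conj().T @ Phi)                 # all pairwise |<phi_i, phi_j>|
np.fill_diagonal(G, 0.0)                       # ignore self inner products
mu = float(G.max())                            # equals 1/sqrt(d): every Dirac atom
                                               # meets every Fourier atom at modulus 1/sqrt(d)
```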
3.3 Approximation algorithms for sparse approximation

The word approximation has several uses in this topic: one is a mathematical meaning, we approximate a vector, and the other is an algorithmic meaning, the output of an algorithm is an approximation to the optimal output. The optimal output is what an algorithm would return if it took all the necessary time to find the best representation for the input over the dictionary. Another term to describe such a "perfect" algorithm is to say that an oracle would give us the best representation for free (simply for asking). It is this representation against which we want to compare the output of our algorithm. Let us denote the optimal solution as $c_{\mathrm{OPT}}$ and the error of this solution as the optimal error, $E_{\mathrm{OPT}} = \|\Phi c_{\mathrm{OPT}} - x\|_2$. Our algorithm returns a coefficient vector $\hat{c}$ with

1. $\|\hat{c}\|_0 = k$
Figure 2: Kerdock codes are a union of orthonormal bases; pairs of bases are Fourier duals of one another under an appropriate group action.

2. $E = \|\Phi \hat{c} - x\|_2 \le C_1 E_{\mathrm{OPT}}$; that is, the error of the output is guaranteed to be at most a constant $C_1$ times the optimal error $E_{\mathrm{OPT}}$.

So our algorithm does not necessarily return the best representation, but it is guaranteed to be a fairly good one, where "fairly good" is measured in terms of the optimal error. If the optimal error is zero, however, our approximation algorithm is guaranteed to find the optimal solution. This is an important point that may be easy to overlook in one's first encounter with approximation algorithms: if an input vector $x$ has an exact sparse representation with $k$ atoms over a dictionary $\Phi$, an approximation algorithm will find it. Frequently, in approximation algorithms we measure the approximation ratio of the algorithm, i.e., the ratio of the output error to the optimal error:

$$\frac{E}{E_{\mathrm{OPT}}} = \frac{C_1 E_{\mathrm{OPT}}}{E_{\mathrm{OPT}}} = C_1.$$
We will see later that for some algorithms the approximation ratio is truly a constant, and for others it depends on the coherence $\mu$ of the dictionary and the desired sparsity $k$. A constant-factor approximation ratio is clearly better than one that depends on several input parameters, but an algorithm with such a guarantee may have a longer running time or may require more stringent conditions than an algorithm with a weaker approximation guarantee. These types of analyses are all important in the analysis of sparse approximation algorithms.
3.3.1 Greedy algorithms

We will focus our attention on greedy, iterative algorithms first (and return to different algorithmic formulations later). By greedy, iterative algorithms we mean those that choose the best atom (or collection thereof) at each step of the algorithm. The main algorithm that we analyze is Orthogonal Matching Pursuit (OMP). Notice that this algorithm is very similar to our prototype greedy algorithm, but it differs in one important place: the method by which we calculate the coefficients $c_\lambda$ that participate in the representation is more complicated. At each step, the coefficients are the solution to a least-squares problem. That is, we compute the orthogonal projection of $x$ onto the subspace spanned by the columns indexed by $\Lambda$, the active set of atoms in the representation. The orthogonal projection step is precisely where OMP gets the leading "O" in its name. Because at each step of the algorithm the representation is the orthogonal projection of $x$ onto the span of the selected atoms, each selected atom appears in the representation only once. That is, the residual is orthogonal to the representation, so atoms that already participate in the representation cannot be selected again at a later step of the algorithm. When one implements this algorithm, one should be careful about the least-squares step: naive implementations will result in poor empirical performance for OMP on large input.
    Algorithm 2: Orthogonal Matching Pursuit (OMP).
      Input: Φ, x, and k
      Output: c
      Initialize a = 0, Λ = ∅
      for j = 1 to k do
          r = x − a
          λ_j = argmax_ℓ |⟨r, φ_ℓ⟩|
          Λ = Λ ∪ {λ_j}
          Find c_λ for λ ∈ Λ that minimizes ‖x − Σ_{λ∈Λ} c_λ φ_λ‖₂
          Update a = Σ_{λ∈Λ} c_λ φ_λ
      end
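Algorithm 2 translates almost line-for-line into code. A Python/NumPy sketch (illustrative only; here the least-squares step uses lstsq, one of several reasonable implementations, rather than a naive normal-equations solve):

```python
import numpy as np

def omp(Phi, x, k):
    """Orthogonal Matching Pursuit: k greedy selections, each followed by an
    orthogonal projection of x onto the span of the selected atoms."""
    Lam = []                                     # active set of atom indices
    a = np.zeros_like(x)                         # current approximation
    c = np.zeros(Phi.shape[1])
    coef = np.zeros(0)
    for _ in range(k):
        r = x - a                                # residual
        lam = int(np.argmax(np.abs(Phi.T @ r)))  # best-correlated atom
        if lam not in Lam:                       # defensively skip re-selection;
            Lam.append(lam)                      # the residual is orthogonal to Λ
        coef, *_ = np.linalg.lstsq(Phi[:, Lam], x, rcond=None)
        a = Phi[:, Lam] @ coef                   # orthogonal projection of x
    c[Lam] = coef
    return c
```

With an orthonormal $\Phi$ the projection reduces to plain inner products, and OMP coincides with the prototype greedy algorithm.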
Many greedy algorithms have a similar outline, and we enumerate a few adjustments one could make to this basic architecture. There are several adjustments to how the coefficients are calculated and what the selection criterion should be:

- Matching Pursuit: replace the coefficient update step with $c_{\lambda_t} \leftarrow c_{\lambda_t} + \langle r, \varphi_{\lambda_t} \rangle$.
- Thresholding: choose the $m$ atoms for which $|\langle x, \varphi_\lambda \rangle|$ is among the $m$ largest.

There are alternate stopping rules as well:

- Terminate the loop when the residual norm is smaller than a threshold: $\|r\|_2 \le \epsilon$.
- Terminate the loop when the largest dot product with the residual is smaller than a threshold: $\max_\lambda |\langle r, \varphi_\lambda \rangle| \le \epsilon$.
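The two coefficient-rule variants above can be sketched the same way (Python/NumPy, illustrative names; the thresholding coefficients below are the raw inner products, one simple choice — a least-squares refit over the selected atoms is another):

```python
import numpy as np

def thresholding(Phi, x, m):
    """Pick the m atoms most correlated with x; coefficients are the raw
    inner products."""
    scores = np.abs(Phi.T @ x)
    idx = np.argsort(scores)[-m:]            # indices of the m largest scores
    c = np.zeros(Phi.shape[1])
    c[idx] = Phi[:, idx].T @ x
    return c

def matching_pursuit(Phi, x, tol=1e-6, max_iter=100):
    """Matching Pursuit with the residual-norm stopping rule ||r||_2 <= tol:
    coefficient update c_lambda += <r, phi_lambda>, no orthogonal projection."""
    c = np.zeros(Phi.shape[1])
    r = x.copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        lam = int(np.argmax(np.abs(Phi.T @ r)))
        coef = r @ Phi[:, lam]
        c[lam] += coef                       # accumulate, unlike OMP's refit
        r = r - coef * Phi[:, lam]
    return c
```

On an orthonormal basis both variants recover the coefficients exactly; on a coherent dictionary Matching Pursuit may revisit atoms and converge slowly, which is precisely what OMP's projection step avoids.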