
One sketch for all: Fast algorithms for Compressed Sensing
A. C. Gilbert, M. J. Strauss, J. A. Tropp
Department of Mathematics, University of Michigan, Ann Arbor, MI 48109
{annacg,martinjs,jtropp}@umich.edu

R. Vershynin
Department of Mathematics, University of California at Davis, Davis, CA 95616
vershynin@math.ucdavis.edu

ABSTRACT

Compressed Sensing is a new paradigm for acquiring the compressible signals that arise in many applications. These signals can be approximated using an amount of information much smaller than the nominal dimension of the signal. Traditional approaches acquire the entire signal and process it to extract the information. The new approach acquires a small number of nonadaptive linear measurements of the signal and uses sophisticated algorithms to determine its information content. Emerging technologies can compute these general linear measurements of a signal at unit cost per measurement.

This paper exhibits a randomized measurement ensemble and a signal reconstruction algorithm that satisfy four requirements:

1. The measurement ensemble succeeds for all signals, with high probability over the random choices in its construction.
2. The number of measurements of the signal is optimal, except for a factor polylogarithmic in the signal length.
3. The running time of the algorithm is polynomial in the amount of information in the signal and polylogarithmic in the signal length.
4. The recovery algorithm offers the strongest possible type of error guarantee. Moreover, it is a fully polynomial approximation scheme with respect to this type of error bound.

Emerging applications demand this level of performance. Yet no other algorithm in the literature simultaneously achieves all four of these desiderata.

Categories and Subject Descriptors
G.4 [Mathematical Software]: Algorithm design and analysis

General Terms
Algorithms, Theory

Keywords
Approximation, embedding, group testing, sketching, sparse approximation, sublinear algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. STOC'07, June 11–13, 2007, San Diego, California, USA. Copyright 2007 ACM 978-1-59593-631-8/07/0006 ...$5.00.

1. INTRODUCTION

Compressed Sensing is a new paradigm for acquiring signals, images, and other types of compressible data. These data have the property that they can be approximated using much less information than their nominal dimension would suggest. At present, the standard approach to signal acquisition is to measure a complete copy of the signal and then process it to extract the important information. For example, one typically measures an image in the pixel basis and then applies JPEG compression to obtain a more efficient representation. Instead, the new approach collects a small number of carefully chosen (but nonadaptive) linear measurements that condense the information in the signal. Sophisticated algorithms are used to approximately reconstruct the signal from these measurements.

Some exciting new technological applications are driving the theoretical work on Compressed Sensing. In these applications, it is possible to compute general linear measurements of the signal with unit cost per measurement. Therefore, the acquisition cost is proportional to the number of signal measurements that we take. (This setting stands in contrast with the digital computation of a dot product component by component.) In traditional signal acquisition models, measurements of the signal have a straightforward interpretation. On the other hand, Compressed Sensing uses measurements that have no real meaning. In particular, there is no simple map from measurement data back to the signal domain. As a result, we are also very concerned about the time it takes to reconstruct signals from measurements.

Scientists and engineers are developing technologies where the computational model of Compressed Sensing applies. They are building cameras [15, 18], analog-to-digital converters [8, 13, 12], and other sensing devices [20, 19] that can obtain a general linear measurement of a signal at unit cost. Compressive imaging cameras use a digital micro-mirror array to optically compute inner products of the image with pseudorandom binary patterns. The image is digitally reconstructed from the projections. Popular media have showcased this application. See Business Week (Oct. 16, 2006) and The Economist (Oct. 28, 2006).

In fact, certain types of Compressed Sensing devices are already widespread, namely CT and MRI scanners. The detector in a Computed Tomography (CT) scanner takes a number of snapshots or profiles of an attenuated X-ray beam as it passes through a patient. The profiles are used to reconstruct a two-dimensional image. Each snapshot of the X-ray beam is, in essence, the line integral of the X-ray beam through the patient (i.e., an inner product).

1.1 Desiderata for Compressed Sensing

Our premise is that, if one measures a highly compressible signal, it is pointless to reconstruct a full-length copy of the signal because it will include a huge number of small, noisy components that bear no information. Instead, a recovery algorithm should directly identify those few components of the signal that are significant. The algorithm should output this compressed representation directly, and its runtime should be roughly proportional to the size of the representation.

Let us be more formal. We are interested in acquiring signals in R^d that are well approximated by sparse signals with m nonzero components, where m ≪ d. The measurement process can be represented by an n × d matrix Ψ, where n is roughly proportional to m rather than d. Each signal f yields a sketch v = Ψf. The recovery algorithm uses the sketch and a description of the measurement matrix to construct a signal approximation f̂ that has only O(m) nonzero components. We want the measurements and the algorithm to satisfy the following properties:

1. One (randomly generated) measurement matrix Ψ is used to measure all signals. With high probability over the random choices in its construction, it must succeed for all signals.
2. The number of measurements is nearly optimal, namely n = m polylog(d).
3. The algorithm must run in time poly(m, log d).
4. Given the sketch of an arbitrary input signal, the algorithm must return a nearly optimal m-term approximation of that signal.

1.2 Our results

We present a linear measurement procedure that takes a near-optimal number of measurements of a signal. We also present HHS Pursuit,^1 a fully polynomial approximation scheme that uses these measurements to construct a sparse estimate of the signal with an optimal error bound. Moreover, this algorithm is exponentially faster than known recovery algorithms that offer equivalent guarantees. This section states our major theorem and two important corollaries. We establish these results in Section 4 (and the appendices). We discuss the error bounds in Sections 1.3 and 1.4. Section 1.5 provides a comparison with related work.

1 The initials HHS stand for "Heavy Hitters on Steroids," which reflects the strong demands on the algorithm.

Given a signal f, we write f_m to denote the signal obtained by zeroing all the components of f except the m components with largest magnitude. (Break ties lexicographically.) We refer to f_m as the head of the signal; it is the best approximation of the signal using at most m terms with respect to any monotonic norm (such as ℓ_p). The vector f − f_m is called the tail of the signal since it contains the entries with small magnitude.

Theorem 1. Fix an integer m and a number ε ∈ (0, 1). With probability at least 0.99, the random measurement matrix Ψ has the following property. Suppose that f is a d-dimensional signal, and let v = Ψf be the signal sketch. Given m, ε, and v, the HHS Pursuit algorithm produces a signal approximation f̂ with O(m/ε²) nonzero entries. The approximation satisfies

\| f - \hat{f} \|_2 \le \frac{\varepsilon}{\sqrt{m}} \, \| f - f_m \|_1 .

The signal sketch has size (m/ε²) polylog(d/ε), and HHS Pursuit runs in time (m²/ε⁴) polylog(d/ε). The algorithm uses working space (m/ε²) polylog(d/ε), including storage of the matrix Ψ. In particular, note that the algorithm recovers every m-term signal without error.

The first corollary shows that we can construct an m-term signal approximation whose ℓ₂ error is within an additive ℓ₁ term of the optimal ℓ₂ error. One can show that this corollary is equivalent with the theorem.

Corollary 2. Let f̂_m be the best m-term approximation to the output f̂ of HHS Pursuit. Then

\| f - \hat{f}_m \|_2 \le \| f - f_m \|_2 + \frac{2\varepsilon}{\sqrt{m}} \, \| f - f_m \|_1 .

This result should be compared with Theorem 2 of [2], which gives an analogous bound for the (superlinear) ℓ₁-minimization algorithm. A second corollary provides an ℓ₁ error estimate.

Corollary 3. Let f̂_m be the best m-term approximation to the output f̂ of HHS Pursuit. Then

\| f - \hat{f}_m \|_1 \le (1 + 3\varepsilon) \, \| f - f_m \|_1 .

The error bound in Corollary 3 is more intuitive but substantially weaker than the bound in Theorem 1. One may check this point by considering a signal whose first component equals m^{−1/4} and whose remaining components equal d^{−1}. The ℓ₁ error bound holds even if an algorithm fails to identify the first signal component, but the mixed-norm error bounds do not.
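As a small illustration of the head/tail split and of the shape of the mixed-norm bound in Theorem 1, the following Python sketch (an illustration, not part of the paper) computes the head f_m of a toy compressible signal and evaluates the right-hand side (ε/√m)·‖f − f_m‖₁; the stand-in f̂ = f_m is only a placeholder for the output of HHS Pursuit.

```python
import numpy as np

def best_m_term(f, m):
    """Zero out all but the m largest-magnitude entries of f (the 'head' f_m)."""
    head = np.zeros_like(f)
    idx = np.argsort(np.abs(f))[::-1][:m]   # indices of the m largest magnitudes
    head[idx] = f[idx]
    return head

# Toy compressible signal: a few large spikes plus a small noisy tail.
rng = np.random.default_rng(0)
d, m, eps = 1024, 10, 0.5
f = np.zeros(d)
f[rng.choice(d, m, replace=False)] = rng.normal(0, 10, m)   # m significant spikes
f += rng.normal(0, 0.01, d)                                  # small tail everywhere

f_m = best_m_term(f, m)
tail_l1 = np.linalg.norm(f - f_m, 1)

# Theorem 1 has the mixed-norm form  ||f - f_hat||_2 <= (eps/sqrt(m)) * ||f - f_m||_1.
# Any candidate output f_hat can be checked against it; here the candidate is just f_m.
f_hat = f_m
lhs = np.linalg.norm(f - f_hat, 2)
rhs = (eps / np.sqrt(m)) * tail_l1
print(f"||f - f_hat||_2 = {lhs:.4f},  (eps/sqrt(m)) * ||f - f_m||_1 = {rhs:.4f}")
```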

1.3 Compressible signals

A compressible signal has the property that its components decay when sorted by magnitude. These signals arise in numerous applications because one can compress the wavelet and Fourier expansions of certain classes of natural signals [7]. A common measure of compressibility is the weak-ℓ_p norm, which is defined for 0 < p < ∞ as

\| f \|_{\mathrm{w}\ell_p} \;\overset{\mathrm{def}}{=}\; \inf \big\{ r : |f|_{(k)} \le r \cdot k^{-1/p} \ \text{for } k = 1, 2, \dots, d \big\} .

The notation |f|_{(k)} indicates the kth largest magnitude of a signal component. When the weak-ℓ_p norm is small for some p < 2, the signal can be approximated efficiently by a sparse signal because

\| f - f_m \|_1 \le m^{1 - 1/p} \, \| f \|_{\mathrm{w}\ell_p} .

Theorem 1 shows that the computed approximation f̂ satisfies the error bound

\| f - \hat{f} \|_2 \le \varepsilon \, m^{1/2 - 1/p} \, \| f \|_{\mathrm{w}\ell_p} .

In particular, when p = 1, the error decays like m^{−1/2}.
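The weak-ℓ_p quasinorm is straightforward to compute from the definition, since the infimum is attained at max_k |f|_{(k)} · k^{1/p}. The following sketch (an illustration, not from the paper) does so:

```python
import numpy as np

def weak_lp_norm(f, p):
    """Weak-l_p quasinorm: smallest r with |f|_(k) <= r * k**(-1/p) for every k,
    i.e. max_k |f|_(k) * k**(1/p), where |f|_(k) is the k-th largest magnitude."""
    mags = np.sort(np.abs(f))[::-1]           # |f|_(1) >= |f|_(2) >= ...
    k = np.arange(1, len(f) + 1)
    return np.max(mags * k ** (1.0 / p))

# A signal whose sorted magnitudes decay exactly like k**(-1/p) has weak-l_p norm 1.
d, p = 1000, 0.5
f = np.arange(1, d + 1, dtype=float) ** (-1.0 / p)
print(weak_lp_norm(f, p))       # ~1.0
print(weak_lp_norm(f, 0.25))    # ~d**2: the decay is too slow for the stricter weak-l_{1/4} scale
```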

1.4 Optimality of error bounds

The error guarantees may look strange at first view. Indeed, one might hope to take (m/ε²) polylog(d) measurements of a signal f and produce an m-sparse approximation f̂ that satisfies the error bound

\| f - \hat{f} \|_2 \le (1 + \varepsilon) \, \| f - f_m \|_2 .

It has been established [9] that this guarantee is possible if we construct a random measurement matrix for each signal. On the other hand, Cohen, Dahmen, and DeVore have shown [4] that it is impossible to obtain this error bound simultaneously for all signals unless the number of measurements is Ω(d).

The same authors also proved a more general lower bound [4]. For each p in the range [1, 2), it requires Ω(m (d/m)^{2−2/p}) measurements to achieve

\| f - \hat{f} \|_2 \le C_p \, m^{1/2 - 1/p} \, \| f - f_m \|_p    (1.1)

simultaneously for all signals. This result holds for all possible recovery algorithms. It becomes vacuous when p = 1, which is precisely the case delineated in Theorem 1. It is not hard to check that the number of measurements required by our algorithm is within a polylogarithmic factor of the lower bound.

Proposition 4. Fix p in the range [1, 2). With m (d/m)^{2−2/p} polylog(d) measurements, the HHS Pursuit algorithm produces for every signal f an m-term estimate f̂ such that (1.1) holds.

1.5 Related work

The major difference between our work and other algorithms for Compressed Sensing is that we simultaneously provide (i) a uniform guarantee for all signals, (ii) an optimal error bound, (iii) a near-optimal number of measurements, and (iv) a sublinear running time. We discuss these points in turn and summarize this discussion in Table 1.

Some additional comments on this table may help clarify the situation. If the signal is f and the output is f̂, let E = E(f) = f − f̂ denote the error vector of the output and let E_opt = E_opt(f) = f − f_m denote the error vector for the optimal output. Also, let C_{opt,p} denote max_g ‖E_opt(g)‖_p, where g is the worst possible signal in the class where f lives. The two results in [5] refer to the two deterministic constructions, which are uniform on a class of functions (noted in the result). LP(md) denotes the resources needed to solve a linear program with Θ(md) variables, plus minor overhead. We suppress big-O notation for legibility.

First, we focus on the algorithm's guarantees, both a uniform guarantee for all signals and an optimal error bound. Typically, randomized sketches guarantee that "on each signal, with high probability, the algorithm succeeds." When the application involves adaptiveness or iteration, it is much better to have a uniform guarantee of the form "with high probability, on all signals, the algorithm succeeds." Most approaches to Compressed Sensing yield uniform guarantees; exceptions include work on Orthogonal Matching Pursuit (OMP) due to Tropp–Gilbert [16] and the randomized algorithm of Cormode–Muthukrishnan [5], which achieves the strongest error bounds. Our algorithm achieves a uniform bound because, unlike "for each" algorithms, HHS uses a stronger estimation matrix and a combination of sifting and noise reduction matrices (see below) tailored to the mixed-norm bound of Theorem 1. (We include in Table 1 uniform results only.) Chaining Pursuit is the only algorithm in the literature that achieves the first three desiderata [10]. The error bound in Chaining Pursuit, however, is less than optimal. Not only is this error bound worse than the HHS error bound, but also Chaining Pursuit is not an approximation scheme. Our algorithm achieves a mixed-norm approximation scheme because, unlike Chaining Pursuit, HHS uses separate matrices for estimation, sifting, and noise reduction.

Next, we examine the number of measurements. A major selling point for Compressed Sensing is that it uses only m polylog(d) measurements to recover an entire class of compressible signals. Candès–Romberg–Tao [2] and Donoho [6] have shown that a linear programming algorithm achieves this goal. The Chaining Pursuit algorithm of the current authors [10] also has this property. On the other hand, the algorithms of Cormode–Muthukrishnan that yield a uniform guarantee require Ω(m²) measurements [5]. (The determinism in [5] is an important desideratum not achieved by HHS.) Our algorithm manages with only m polylog(d) measurements because, unlike [5], HHS recovers only a fraction of the spikes at a time (see below).

Finally, we discuss the running time of the different Compressed Sensing algorithms. The major advantage of our work is its running time: most recovery algorithms for Compressed Sensing have runtimes that are at least linear in the length of the input signal. In particular, the linear programming technique has cost Ω(d^{3/2}). Cormode–Muthukrishnan have developed some sublinear algorithms whose runtimes are comparable with HHS Pursuit [5]. The Chaining Pursuit algorithm has running time m polylog(d), so it is even faster than HHS Pursuit.

1.6 Roadmap

The next three sections give an overview of our approach. Section 2 provides a detailed description of the measurement matrix required by HHS Pursuit. Section 3 states the HHS algorithm, along with implementation details and pseudocode. Section 4 shows how to draw the corollaries from the main theorem, and it explains how the analysis of the algorithm breaks into two cases. The bulk of the proof is deferred to the journal version of this extended abstract.

2. THE MEASUREMENTS

This section describes a random construction of a measurement matrix Ψ. Afterward, we explain how to store and apply the matrix efficiently. For clarity, we focus on the case ε = 1. To obtain an approximation scheme, we substitute m/ε² for m, which increases the costs by (1/ε)^{O(1)}.

| Approach, Refs | Error bound | # Measurements | Time |
|---|---|---|---|
| ℓ₁ min. + Gauss [3] | ‖E‖₂ ≤ m^{-1/2} ‖E_opt‖₁ | m log(d/m) | LP(md) |
| ℓ₁ min. + Fourier | ‖E‖₂ ≤ m^{-1/2} ‖E_opt‖₁ | m log⁴ d | d log d (empirical) |
| Combinatorial [5] | ‖E‖₂ ≤ C·C_{opt,p}, 0 < p < 1 | m^{(3-p)/(1-p)} log² d | m^{(4-2p)/(1-p)} log³ d |
| Combinatorial [5] | ‖E‖₂ ≤ C·C_{opt,2}, exp. decay | m² polylog d | m² polylog d |
| Chaining Pursuit [10] | ‖E‖_{weak-1} ≤ ‖E_opt‖₁ | m log² d | m log² d |
| HHS (this result) | ‖E‖₂ ≤ ε m^{-1/2} ‖E_opt‖₁ | (m/ε²) polylog(d/ε) | (m²/ε⁴) polylog(d/ε) |

Table 1: Comparison of algorithmic results for compressed sensing

The matrix Ψ consists of two pieces: an identification matrix Ω and an estimation matrix Φ. We view the matrix as a linear map that acts on a signal f in R^d to produce a two-part sketch:

\Psi f = \begin{bmatrix} \Omega \\ \Phi \end{bmatrix} f = \begin{bmatrix} v_{\mathrm{id}} \\ v_{\mathrm{est}} \end{bmatrix} .

The first part of the sketch, v_id = Ωf, is used to identify large components of the signal quickly. The second part, v_est = Φf, is used to estimate the size of the identified components. Decoupling the identification and estimation steps allows us to produce strong error guarantees.
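A minimal sketch of the two-part measurement, with placeholder matrices: the random 0–1 matrix below merely stands in for the identification matrix Ω = B ⊗_r A constructed in Sections 2.1–2.2, while Φ is built as in Section 2.3 from random rows of the DFT matrix. The helper names and sizes are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_id, q = 64, 20, 16

# Stand-in for the identification matrix Omega (the real one is B (x)_r A, Sections 2.1-2.2).
Omega = rng.integers(0, 2, size=(n_id, d))

# Estimation matrix Phi: q random rows of the d x d DFT matrix, scaled by q**(-1/2)
# so that each column has unit l2 norm (Section 2.3).
rows = rng.choice(d, size=q, replace=False)
F = np.exp(-2j * np.pi * np.outer(np.arange(d), np.arange(d)) / d)
Phi = F[rows] / np.sqrt(q)

f = rng.normal(size=d)
v_id, v_est = Omega @ f, Phi @ f       # the two-part sketch Psi f = [v_id; v_est]
print(v_id.shape, v_est.shape)
```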

2.1 Row tensor products

The identification matrix zeroes out many different subsets of the signal components to isolate large components from each other, and then it computes inner products between these restricted signals and a group testing matrix. We construct this restriction map by applying several different restrictions in sequence. We introduce notation for this operation.

If q and r are 0–1 vectors, we can view them as masks that determine which entries of a signal appear and which ones are zeroed out. For example, the signal q ∗ f is the signal f restricted to the components in q that equal one. (The notation ∗ indicates the Hadamard, or componentwise, product.) The sequential restriction by q and r can be written as (r ∗ q) ∗ f. Given 0–1 matrices Q and R, we can form a matrix that encodes sequential restrictions by all pairs of their rows. We express this matrix using the row tensor product, as in [10, 5].

Definition 5. Let Q be a q × d matrix and R an r × d matrix with rows {q_i : 0 ≤ i < q} and {r_k : 0 ≤ k < r}, respectively. The row tensor product A = Q ⊗_r R is a qr × d matrix whose rows are {q_i ∗ r_k : 0 ≤ i < q, 0 ≤ k < r}.

2.2 The identification operator

The identification matrix Ω is a 0–1 matrix with dimensions O(m log²(m) log(d/m) log²(d)) × d. It consists of a combination of a structured deterministic matrix and ensembles of simple random matrices. Formally, Ω is the row tensor product Ω = B ⊗_r A. The bit-test matrix B has dimensions O(log d) × d, and the isolation matrix A has dimensions O(m log²(m) log(d/m) log d) × d.

2.2.1 The bit-test matrix

The matrix B is a deterministic matrix that consists of a row of 1s appended to a 0–1 matrix B₀. The matrix B₀ has dimensions log₂ d × d. Its kth column contains the binary expansion of k. Therefore, the inner product of the ith row of B₀ with a signal f sums the components of f whose index has bit i equal to one. The bit-test matrix with d = 8 is

B = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1
\end{bmatrix} .

In coding theory, B is called the parity check matrix for the extended Hamming code.

2.2.2 The isolation matrix

The isolation matrix A is a randomly constructed 0–1 matrix with a hierarchical structure. It consists of O(log₂ m) blocks A^{(j)}, labeled by j = 1, 2, 4, 8, …, J, where J = O(m). See Figure 1. Each block, in turn, has further substructure as a row tensor product of two 0–1 matrices: A^{(j)} = R^{(j)} ⊗_r S^{(j)}. The second matrix S^{(j)} is called the sifting matrix, and its dimensions are O(j log(d/j)) × d. The first matrix R^{(j)} is called the noise reduction matrix, and its dimensions are O((m/j) log(m) log d) × d.

2.2.3 The sifting matrix

The purpose of the sifting matrix S^{(j)} is to isolate about j distinguished signal positions from each other. It is a random 0–1 block matrix, as shown in Figure 1. Each submatrix of S^{(j)} has dimensions O(j) × d, and the number of submatrices is T_j = O(log(d/j)). The T_j submatrices are fully independent from each other. Each of the submatrices encodes an O(j)-wise independent random assignment of each signal position to a row. The (i, k) entry of the matrix equals one when the kth signal component is assigned to the ith row. Therefore, with high probability, the componentwise product of the ith row of the matrix with f generates a copy of the signal with d/O(j) components selected and the others zeroed out. For example,

S_t^{(j)} = \begin{bmatrix}
0 & 1 & 0 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1
\end{bmatrix} .

This submatrix can also be viewed as a random linear hash function from the space of d keys onto a set of O(j) buckets.
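The row tensor product of Definition 5 and the bit-test matrix of Section 2.2.1 are both mechanical to construct. The sketch below (hypothetical helper functions, using the most-significant-bit-first column convention of the d = 8 example) builds them:

```python
import numpy as np

def row_tensor(Q, R):
    """Row tensor product Q (x)_r R (Definition 5): all componentwise (Hadamard)
    products of a row of Q with a row of R, stacked into a (q*r) x d matrix."""
    q, d = Q.shape
    r, _ = R.shape
    return (Q[:, None, :] * R[None, :, :]).reshape(q * r, d)

def bit_test_matrix(d):
    """Bit-test matrix B: a row of ones on top of B0, whose k-th column is the
    binary expansion of k (log2(d) rows, most significant bit first)."""
    bits = int(np.log2(d))
    k = np.arange(d)
    B0 = np.array([(k >> (bits - 1 - i)) & 1 for i in range(bits)])
    return np.vstack([np.ones(d, dtype=int), B0])

print(bit_test_matrix(8))          # reproduces the d = 8 example above

# Row-tensor example with tiny masks: each output row zeroes out a different subset.
Q = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
R = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
print(row_tensor(Q, R))
```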

A = \begin{bmatrix} A^{(1)} \\ A^{(2)} \\ \vdots \\ A^{(J)} \end{bmatrix}
  = \begin{bmatrix} S^{(1)} \otimes_r R^{(1)} \\ S^{(2)} \otimes_r R^{(2)} \\ \vdots \\ S^{(J)} \otimes_r R^{(J)} \end{bmatrix},
\qquad \text{where} \quad
S^{(j)} = \begin{bmatrix} S_1^{(j)} \\ S_2^{(j)} \\ \vdots \\ S_{T_j}^{(j)} \end{bmatrix}
\quad \text{and} \quad
R^{(j)} = \begin{bmatrix} R_1^{(j)} \\ R_2^{(j)} \\ \vdots \\ R_{U_j}^{(j)} \end{bmatrix} .

Figure 1: The structure of the isolation matrix A. See Section 2.2.2 for details.
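To illustrate the role of a single sifting submatrix S_t^{(j)}, the toy sketch below replaces the O(j)-wise independent hash of Section 2.2.3 with a plain uniform random assignment (a simplification made only for illustration) and checks that a handful of spikes typically land in distinct buckets:

```python
import numpy as np

def sifting_submatrix(d, n_rows, rng):
    """One sifting trial: assign each of the d signal positions to one of n_rows = O(j)
    rows ('buckets'); entry (i, k) = 1 iff position k lands in row i."""
    assignment = rng.integers(0, n_rows, size=d)
    S = np.zeros((n_rows, d), dtype=int)
    S[assignment, np.arange(d)] = 1
    return S

rng = np.random.default_rng(2)
d, j = 40, 4
S = sifting_submatrix(d, n_rows=8 * j, rng=rng)      # O(j) buckets
f = np.zeros(d)
f[[3, 11, 27, 35]] = [5.0, -4.0, 6.0, 3.0]           # j distinguished spikes
restricted = S * f                                    # row i is f restricted to bucket i
spikes_per_row = (restricted != 0).sum(axis=1)
print(spikes_per_row.max())   # with enough buckets this is typically 1: the spikes are isolated
```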

2.2.4 The noise reduction matrix

The purpose of the noise reduction matrix R^{(j)} is to attenuate the noise in a signal that has a single large component. It is also a random 0–1 block matrix, as seen in Figure 1. Each submatrix R_u^{(j)} has dimensions O(√((m/j) log m)) × d, and the total number of submatrices is U_j = O(√((m/j) log m) log d). The submatrices are fully independent from each other. Each one encodes a pairwise independent assignment of each signal position to a row. The (i, k) entry of the matrix equals one when the kth signal component is assigned to the ith row, as in the sifting matrix. Each submatrix can be viewed as a random linear hash function from d keys to O(√((m/j) log m)) buckets.

2.3 The estimation matrix

The estimation matrix Φ is a randomly constructed matrix with complex entries. Let λ = O(log⁴ d) and L = O(m √(log m)). Choose q ≥ λL. The estimation matrix consists of q rows drawn independently at random from the d × d discrete Fourier transform (DFT) matrix. The matrix Φ is scaled by q^{−1/2} so its columns have unit ℓ₂ norm.

2.4 Storage costs

The bit-test matrix requires no storage, as it is straightforward to generate as needed. The isolation and estimation matrices can be generated from short pseudorandom seeds, as needed. The total storage for the estimation matrix is q log(d) = O(m √(log m) log⁵ d) because it takes log d bits to store the index of each of the q rows drawn from the DFT matrix. The total storage for the isolation matrix A is O(m log³ d), which is negligible compared with the cost of the estimation matrix.

To obtain the bound for the isolation matrix, we examine the sifting matrices and the noise reduction matrices separately. First, observe that each block S_t^{(j)} of the sifting matrix requires O(j log d) bits. Since there are T_j = O(log(d/j)) independent blocks for each j, we have a space bound O(j log² d) for S^{(j)}. Summing over j = 1, 2, 4, …, Cm, we find that the sifting matrices require O(m log² d) bits. Meanwhile, each block R_u^{(j)} of the noise reduction matrix requires O(log d) bits. There are U_j = O(√((m/j) log m) log d) blocks, giving a space bound O(√((m/j) log m) log² d) for R^{(j)}. Summing on j, we see that the noise reduction matrices require space O(m^{1/2+o(1)} log² d).

We generate the random variables using a polynomial of degree j over a field of size around d. Without loss of generality, we may assume that m and d are powers of two.

2.5 Encoding time

The encoding time depends on the technology for computing measurements. If we can compute inner products in constant time, the encoding time is proportional to the number of measurements. This section focuses on the case where the cost is proportional to the minimal sparsity of the vectors that appear in the inner product. This analysis plays a role in determining the runtime of the algorithm.

We show that the time required to measure a signal f that has exactly one nonzero component is O(m √(log m) log⁴(d)) word operations. This analysis implies a time bound for measuring a vector with more nonzeros. The time (in word operations) to generate the estimation matrix Φ and to multiply Φ by f dominates the time (in bit operations) to generate the identification matrix Ω and to multiply Ω by f.

The time cost for measuring a nonzero component of a signal with the estimation operator Φ is q = O(m √(log m) log⁴ d), assuming the corresponding column of Φ has been computed. The (k, ℓ) entry of Φ is simply q^{−1/2} exp{−2πi t_k ω_ℓ / d}, which is computed from t_k and ω_ℓ in a constant number of word operations.

We now turn to the identification matrix Ω. Each of its columns contains roughly √m nonzeros, so we can ignore the cost to apply the matrix. To generate a column of Ω, we form the sifting matrices and noise reduction matrices and then compute their row tensor product. Afterward, we form the row tensor product with the bit-test matrix. The total cost is O(m log⁴ d) bit operations, as follows. To construct S_t^{(j)}, one can use a polynomial of degree O(j) over a field of size approximately d to obtain O(j)-wise independent random variables. This step must be repeated T_j = O(log(d/j)) times, for a total of O(j log(d/j)) field operations over a field of size around d. Therefore, to generate S^{(j)} costs O(j log³ d) bit operations. To generate each R_u^{(j)} requires O(log d) operations, so the cost to generate R^{(j)} is roughly √m, which is negligible. To build R^{(j)} ⊗_r S^{(j)} from its factors costs T_j · U_j = O(√m log³ d), so it can also be ignored. Summing on j, we find that the time to construct A is O(m log³ d) bit operations. The row tensor product with the bit-test matrix gives an additional factor of O(log d) bit operations.

To summarize, the total time cost to use Ψ to measure a signal with k nonzero entries is O(km polylog(d)). During the HHS Pursuit algorithm, we must encode a list L of O(m √(log m)) signals with one nonzero component each. The time cost for this encoding procedure is m² polylog(d) if we use the straightforward algorithm for matrix–vector multiplication.^3

3 Note that encoding requires us to compute a partial discrete Fourier transform with unequally spaced points on the domain and codomain of the transform. We are not aware of any nontrivial algorithm for this problem, despite the existence of faster algorithms [1] for problems that are superficially similar.
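For concreteness, the "summing on j" steps invoked in the storage and construction accounting above are geometric sums over the doubling sequence j = 1, 2, 4, …, O(m). Under the block sizes stated in the text, they work out as

\sum_{j = 1, 2, 4, \dots, O(m)} O(j \log^2 d) = O(m \log^2 d)
\qquad\text{and}\qquad
\sum_{j = 1, 2, 4, \dots, O(m)} O\!\big( \sqrt{(m/j)\log m}\, \log^2 d \big) = O\big( \sqrt{m \log m}\, \log^2 d \big) = O\big( m^{1/2 + o(1)} \log^2 d \big),

and similarly Σ_j O(j log³ d) = O(m log³ d) for the construction cost of A. The first sum is dominated by its largest term j = Θ(m), and the second by its smallest term j = 1.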

3. THE HHS ALGORITHM

The HHS algorithm is an iterative procedure, where each iteration has several stages. The first stage is designed to identify a small set of signal positions that carry a constant proportion of the remaining energy. The second stage estimates the values of the signal on these components. The third stage adds the new approximation to the old approximation and prunes it so that it contains only O(m) nonzero components. The fourth stage encodes the new approximation using the measurement matrix and subtracts it from the initial sketch to obtain a sketch of the current residual signal. See Figure 2 for pseudocode. For brevity, we use the term spike to refer to the location and size of a single signal component.

The algorithm also employs a preprocessing phase. The preprocessing step encodes the signal with the Chaining Pursuit measurement matrix C and executes the Chaining Pursuit algorithm to produce a good initial approximation a_init of the input signal. This initial approximation has at most m nonzero components. We encode a_init with the HHS measurement matrix and subtract it from the original sketch Ψf to obtain a sketch s of the initial residual f − a_init. See Figure 3 for pseudocode. The ℓ₂ norm of the residual after preprocessing is proportional to m^{−1/2} times the optimal ℓ₁ error with m terms. This fact ensures that the algorithm recovers sparse signals exactly and that it requires only O(log m) iterations to reduce the error by a polynomial factor in m. The correctness of the preprocessing phase follows from our previous work [10].

If we do not run the optional preprocessing step, then the initial residual sketch s and list L of spike locations and values are Ψf and empty, respectively. In that case, one can run the algorithm for O(log Δ) iterations, where Δ = ‖f‖₁ / ‖f − f_m‖₁ is the dynamic range of the problem, which we assume is known.

3.1 Implementation

The HHS Pursuit algorithm is easily implemented with standard data structures. There are a few steps that require a short discussion. The Jacobi iteration is a standard algorithm from numerical analysis. The bit tests also require explanation.

Each row of the isolation matrix A effectively generates a copy of the input signal with many locations zeroed out. The bit-test matrix calculates inner products between its rows and the restricted signal. The bit tests attempt to use these numbers to find the location of the largest entry in the restricted signal. Suppose that the bit tests yield the following log₂ d + 1 numbers:

c, b(0), b(1), …, b(log₂ d − 1).

The number c arises from the top row of the bit-test matrix, so it is the sum of the components of the restricted signal. We estimate a spike location as follows. If |b(i)| ≥ |c − b(i)|, then the ith bit of the estimated location is one. Otherwise, the ith bit of the estimated location is zero. It is clear that the estimated location is correct if the restricted signal contains one large component and the remaining components have ℓ₁ norm smaller than the magnitude of the large component.

We encode the recovered spikes by accessing the columns of the identification and estimation matrices corresponding to the locations of these spikes and then re-scaling these columns by the spike values. Note that this step requires us to generate arbitrary columns.
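As a sketch of this decoding rule (hypothetical helpers, not the authors' code), the following applies the bit tests to a restricted signal with one dominant spike; it repeats the bit-test construction from Section 2.2.1 for self-containedness and assumes the same most-significant-bit-first column convention.

```python
import numpy as np

def bit_test_matrix(d):
    bits = int(np.log2(d))
    k = np.arange(d)
    B0 = np.array([(k >> (bits - 1 - i)) & 1 for i in range(bits)])
    return np.vstack([np.ones(d, dtype=int), B0])

def decode_spike_location(measurements):
    """Recover the index of the dominant component of a restricted signal from the
    bit-test values c, b(0), ..., b(log2 d - 1): bit i is 1 when |b(i)| >= |c - b(i)|."""
    c, b = measurements[0], measurements[1:]
    bits = np.where(np.abs(b) >= np.abs(c - b), 1, 0)
    return int("".join(map(str, bits)), 2)   # most significant bit first

d = 8
B = bit_test_matrix(d)
g = np.array([0.0, 0.1, 0.0, -0.2, 0.0, 5.0, 0.0, 0.1])   # spike at index 5, light noise
print(decode_spike_location(B @ g))                        # -> 5
```

The decoding succeeds here because the ℓ₁ norm of the noise (0.4) is smaller than the spike magnitude, exactly the condition stated above.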

3.2 Resource Requirements

In Section 2.4, we showed that we need space m polylog(d) to store pseudorandom seeds from which columns of the measurement operator can be generated as needed, in time m polylog(d) each. We recall that we can apply Φ_L^† to s_est via Jacobi iteration in time m² polylog(d). It follows that our algorithm requires just m polylog(d) working space and m² polylog(d) time. Our algorithm becomes an approximation scheme by substituting m/ε² for m. This increases the space to (m/ε²) polylog(d/ε) and the overall time cost to (m²/ε⁴) polylog(d/ε).

4. ANALYSIS OF THE ALGORITHM

This section describes, at the highest level, why HHS works. We establish the following result.

Theorem 6. Fix m. Assume that Ψ is a measurement matrix that satisfies the conclusions of Lemmas 9 and 15. Suppose that f is a d-dimensional signal. Given the sketch v = Ψf, the HHS Pursuit algorithm produces a signal f̂ with at most 8m nonzero entries. This signal estimate satisfies

\| f - \hat{f} \|_2 \le \frac{20}{\sqrt{m}} \, \| f - f_m \|_1 .

Let m and ε be fixed. Observe that we can apply the theorem with m′ = m/ε² to obtain a signal estimate f̂ with 8m′ terms that satisfies the error bound

\| f - \hat{f} \|_2 \le \frac{20\varepsilon}{\sqrt{m}} \, \| f - f_m \|_1 .

The running time increases by a factor of (1/ε⁴) polylog(1/ε). This leads to Theorem 1.

We give an overview of the proof in the next subsections. The goal of the algorithm is to identify a small set of signal components that carry most of the energy in the signal and to estimate the magnitudes of those components well. We argue that, when our signal estimate is poor, the algorithm makes substantial progress toward this goal. When our estimate is already good, the algorithm does not make it much worse. We focus on the analysis of the algorithm in the case when our signal estimate is poor, as this is the critical case. The most important portion of the analysis is the identification of the energetic signal components, and, as such, we concentrate on this part of the analysis in Section 4.2. We omit many of the details of the rest of the analysis from this extended abstract.

Algorithm: HHS Pursuit

Inputs: The number m of spikes, the HHS measurement matrix Ψ, the initial sketch v = Ψf, the initial list L of m spikes, the initial residual sketch s
Output: A list L of O(m) spikes

For each iteration k = 0, 1, …, O(log m) {
    For each scale j = 1, 2, 4, …, O(m) {
        Initialize L′ = ∅.
        For each row of A^{(j)} {
            Use the O(log d) bit tests to identify one spike location
        }
        Retain a list L_j of the spike locations that appear Ω(√((m/j) log m) log(d/j) log d) times each
        Update L′ ← L′ ∪ L_j
    }
    Estimate values for the spikes in L′ by forming Φ_{L′}^† s_est with Jacobi iteration
    Update L by adding the spikes in L′; if a spike is duplicated, add the two values together
    Prune L to retain the O(m) largest spikes
    Encode these spikes with the measurement matrix Ψ
    Subtract the encoded spikes from the original sketch v to form a new residual sketch s
}

Figure 2: Pseudocode for the HHS Pursuit algorithm

Algorithm: (Optional) Chaining Pursuit Preprocessing

Inputs: The number m of spikes, the Chaining measurement matrix C, the Chaining sketch w = Cf, the HHS measurement matrix Ψ, the HHS sketch v = Ψf
Outputs: A list L of m spikes, the residual sketch s

Run ChainingPursuit(m, w, C) to obtain a list L of m spikes
Encode the spikes in L using the HHS measurement matrix Ψ
Subtract the encoded spikes from v to form the residual sketch s

Figure 3: Pseudocode for Chaining Pursuit Preprocessing

4.1 Preliminaries

In a given iteration, the performance of the algorithm depends on the size of the residual. We establish that if the approximation is poor then the algorithm improves it substantially. More precisely, assume that the current approximation a satisfies

\| f - a \|_2 > \frac{1}{\sqrt{m}} \, \| f - f_m \|_1 .    (Case 1)

Then one iteration produces a new approximation a_new for which ‖f − a_new‖₂ ≤ ½ ‖f − a‖₂. On the other hand, when the approximation is good, the algorithm produces a new approximation that is not too bad. Suppose that a satisfies

\| f - a \|_2 \le \frac{1}{\sqrt{m}} \, \| f - f_m \|_1 .    (Case 2)

Then the next approximation a_new satisfies ‖f − a_new‖₂ ≤ (20/√m) ‖f − f_m‖₁.

Suppose that (Case 1) is in force at the beginning of an iteration. We describe some generic properties of signals that are important in the analysis. We show that the ℓ₂ norm of the tail of a signal is much smaller than the ℓ₁ norm of the entire signal. The pruning step of the algorithm ensures that we have the loop invariant ‖a‖₀ ≤ 8m. We abbreviate p = 8m to make the argument clearer.

Lemma 7. For any signal g, it holds that

\| g - g_t \|_2 \le \frac{1}{2\sqrt{t}} \, \| g \|_1 .

In particular, for g = f − f_m, we have ‖f − f_p‖₂ ≤ (1/√(7m)) ‖f − f_m‖₁ since p − m = 7m.

When the condition (Case 1) holds, most of the energy in the signal is concentrated in its largest components. Let r = f − a denote the residual signal. The number α in this lemma is a constant that will be fixed later.

Lemma 8 (Heads and Tails). Suppose that (Case 1) is in force. Fix a number α ≥ 1, and let M be the smallest power of two that exceeds 16α²m + 9m. Then the following bounds hold:

\| r \|_2 \ge \alpha \, \| r - r_M \|_2
\qquad \text{and} \qquad
\| r \|_2 \ge \frac{\alpha}{\sqrt{M}} \, \| r \|_1 .
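The head–tail inequality of Lemma 7 is purely deterministic, so it can be sanity-checked numerically; the short sketch below (illustrative only) evaluates the ratio of the two sides on random heavy-tailed vectors.

```python
import numpy as np

def tail_l2(g, t):
    """l2 norm of g minus its best t-term approximation."""
    mags = np.sort(np.abs(g))[::-1]
    return np.linalg.norm(mags[t:], 2)

rng = np.random.default_rng(5)
d, t = 2000, 50
worst = 0.0
for _ in range(200):
    g = rng.standard_cauchy(d)          # heavy-tailed, so the head really matters
    ratio = tail_l2(g, t) / (np.linalg.norm(g, 1) / (2 * np.sqrt(t)))
    worst = max(worst, ratio)
print(worst)    # stays <= 1, consistent with Lemma 7
```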

4.2 Identification

As the bulk of the innovation in our result lies in the identification of significant signal components for Case 1, we explain this portion of the analysis in more detail (at the expense of the other portions). The identification matrix Ω = B ⊗_r A is a complicated object, and in it lies the majority of the analysis of the algorithm. We can best understand its behavior by studying its pieces separately. The isolation matrix A consists of log₂ M blocks A^{(j)}, where j = 1, 2, 4, …, M/2. Each block A^{(j)} = R^{(j)} ⊗_r S^{(j)}, where S^{(j)} is the sifting matrix and R^{(j)} is the noise reduction matrix. It is best to think about the action of the block A^{(j)} in two phases.

1. First, S^{(j)} takes an input signal and generates a collection of output signals of the same length by zeroing out different collections of components. The idea is that most of the distinguished components will appear in an output signal that contains no other distinguished component.

2. Then R^{(j)} takes each of these signals and generates a further collection of output signals by zeroing out additional subsets of components. The idea is that, in many of the output signals, a distinguished component will survive, but the ℓ₁ norm of the other components will be substantially reduced.

Afterward, the bit-test matrix B forms the inner product between each of its rows and each of the numerous output signals. Whenever an output signal contains a distinguished component and a small amount of noise, the log d bit tests allow us to determine the location of the distinguished component correctly. The bit-test process can always identify the largest component of a signal, provided that the ℓ₁ norm of the remaining components is not too large.

Let us consider a fixed collection I of components in a signal g, where j ≤ |I| < 2j. The next result shows that A^{(j)} succeeds in generating a lot of output signals where a large proportion of the components in I are isolated from each other. Moreover, the ℓ₁ norm of the other components in these signals is small in comparison with the total norm of the signal. The number ρ in this lemma is a constant (depending only on α) that will be determined shortly.

Lemma 9. Except with probability O(d^{−1} log m), the random isolation operator A^{(j)} satisfies the following property. Let g be a signal, and let I be an arbitrary subset of {1, 2, …, d} with j ≤ |I| < 2j. For at least (1 − ρ)|I| of the components i ∈ I, the operator A^{(j)} generates at least O(√((M/j) log M) log(d/j) log d) signals of the form g_i e_i + ν where

\| \nu \|_1 \le \frac{1}{4\,|I|} \sqrt{\frac{j}{M \log_2 M}} \; \| g \|_1 .

The proof of this result takes several long steps, and we focus on the sifting and the noise reduction matrices to highlight the lemmas that form the significant steps in the proof. We can think about the action of one submatrix S_t^{(j)} as a map

S_t^{(j)} : g \mapsto \big[ h_1 \;\; h_2 \;\; \cdots \;\; h_N \big],

sending each input signal to a collection of output signals. The first result shows that one trial of sifting is very likely to isolate all but a constant proportion of the distinguished indices.

Lemma 10 (Sifting: One Trial). Let g be a signal, and let I ⊂ {1, 2, …, d} with j ≤ |I| < 2j. Write k = |I|. Suppose we apply the random operator S_t^{(j)} to g. Except with probability e^{−1.7−ρk/5}, for at least (1 − ρ)k of the indices i ∈ I, there is an output signal h of the form h = g_i e_i + ν with

\| \nu \|_1 \le \frac{2}{\rho k} \, \| g \|_1 .

Proof. We can think of the sifting operator as assigning each of the k distinguished positions (balls) to one of the N output signals (bins) uniformly at random. We hope that the balls are isolated from one another. We will see that the result holds if the number of bins satisfies N ≥ max{10kρ^{−1}, 850ρ^{−1}}.

For n = 1, 2, …, N, let X_n be the indicator variable for the event that the nth bin is empty, and write X = Σ_n X_n for the total number of empty bins. The symbols μ and σ² will denote the expectation and variance of X. To understand large deviations of X requires some effort because the set of indicators {X_n} is not stochastically independent. Nevertheless, X satisfies a rather strong tail bound.

Fact 11 (Theorem 6, [11]).

\mathbb{P}\{ X \ge \mathbb{E} X + a \} \le \exp\Big( a - \big(\sigma^2 + a\big) \log\Big(1 + \frac{a}{\sigma^2}\Big) \Big).

This result is based on the surprising fact, due to Vatutin and Mikhailov [17], that X can be expressed as a sum of independent indicators. The content of our argument is to develop explicit bounds on the expectation and variance of X, which will allow us to apply Janson's result. By calculating the means and covariances of the variables X_n, we determine that

\mu = N \Big( 1 - \frac{1}{N} \Big)^{k} < (N - k) + \frac{k^2}{2N}

and that

\sigma^2 = N \Big( 1 - \frac{1}{N} \Big)^{k} + N(N-1) \Big( 1 - \frac{2}{N} \Big)^{k} - N^2 \Big( 1 - \frac{1}{N} \Big)^{2k} .

The variance bound takes some work. First, regroup terms and factor, and then apply Bernoulli's inequality (1 + x)^k ≥ 1 + kx, which is valid for x ≥ −1. Finally, we obtain the bound

\sigma^2 \le k \cdot h(1/N) \le \frac{k(k-1)}{N} < \frac{k^2}{N} ,

provided that N ≥ 2. Depending on the size of k, we need to choose a different number N of bins to obtain the required probabilities.

First, assume that 0.2ρk > 1.7. In this case, we select N ≥ 10k/ρ, which yields the following estimates on the mean and variance of X:

\mu \le (N - k) + 0.05\rho k
\qquad \text{and} \qquad
\sigma^2 < 0.1\rho k .

We invoke Fact 11 with the value a = 0.2ρk to reach

\mathbb{P}\{ X > (N - k) + 0.25\rho k \} < e^{-0.4\rho k} < e^{-1.7 - 0.2\rho k}    (4.1)

using 0.2ρk > 1.7. This estimate allows us to bound the number Y of balls that fail to be isolated. It takes at least (N − X) balls to fill the nonempty bins. The remaining (k − (N − X)) balls can be placed in no more than (k − (N − X)) bins, where they will result in no more than Y = 2(k − (N − X)) = 2(X − (N − k)) collisions. Using the deviation bound (4.1), we conclude that

\mathbb{P}\Big\{ Y > \frac{\rho k}{2} \Big\} < e^{-1.7 - 0.2\rho k} .

Second, we assume that k is small. Precisely, consider the case where 0.2ρk ≤ 1.7. Now, select N ≥ 100k². The mean and variance of X satisfy

\mu \le (N - k) + 0.005
\qquad \text{and} \qquad
\sigma^2 \le 0.01 .

Apply Fact 11 with the value a = 0.995 to reach

\mathbb{P}\{ X \ge (N - k) + 1 \} < e^{-3.6} < e^{-1.7 - 0.2\rho k}

using 0.2ρk ≤ 1.7. When X < (N − k) + 1, the maximum number of bins are empty, and so all k of the balls are isolated. Furthermore, we observe that the number N of bins required here satisfies N ≤ 850ρ^{−1}.

Finally, we need to argue that few of the output signals have a lot of noise. Let u_n denote the ℓ₁ norm of the nth output signal. Since each position in the input signal g is assigned to exactly one output signal, Σ_n u_n = ‖g‖₁. By Markov's inequality,

\#\Big\{ n : u_n \ge \frac{2}{\rho k} \, \| g \|_1 \Big\}
\le \frac{\rho k}{2 \, \| g \|_1} \sum_n u_n
= \frac{\rho k}{2} .

In particular, no more than ρk/2 of the isolated balls can appear in a bin whose ℓ₁ norm exceeds (2/ρk)‖g‖₁. Therefore, the total number of positions in I lost to collisions or noise is at most ρk, except with probability e^{−1.7−0.2ρk}.

The failure probability for one trial is not small enough to take a union bound over all possible sets I. We perform T_j = O(log(d/j)) repeated trials to drive down the failure probability.

Lemma 12 (Sifting: All Trials). Except with probability exp(−j log d), the sifting operator S^{(j)} has the following property. Let g be an arbitrary signal, and let I ⊂ {1, 2, …, d} satisfy j ≤ |I| < 2j. For some set of (1 − ρ)|I| indices i ∈ I, there are at least 0.5 T_j output signals h of the form h = g_i e_i + ν with ‖ν‖₁ ≤ (2/ρk) ‖g‖₁.

Proof. Let g be a signal, and fix a set I ⊂ {1, 2, …, d} that contains k or more indices. Lemma 10 shows that each isolation trial succeeds for at least (1 − ρ)|I| of the distinguished indices, except with probability p = e^{−1.7−0.2ρk}. We repeat this experiment T_j times. Let X_t be the indicator variable for the event that trial t fails, so E X_t ≤ p. Then the random variable X = Σ_t X_t counts the total number of trials in which ρk balls fail to be isolated, and its mean satisfies μ ≤ p T_j. Chernoff's bound shows that

\mathbb{P}\{ X > 0.5\,T_j \}
< \Big[ \frac{e}{0.5\,T_j / (p\,T_j)} \Big]^{0.5\,T_j}
< e^{-0.2\rho k T_j} .

Choose T_j = 15ρ^{−1} log(ed/j), and use the fact that k ≥ j to obtain P{X > 0.5 T_j} < e^{−3j log(ed/j)}. Next, we must count the total number of subsets whose size is between j and (2j − 1). We bound

\sum_{r=j}^{2j-1} \binom{d}{r}
\le \sum_{r=j}^{2j-1} e^{\,r \log(ed/r)}
\le \int_{j}^{2j} e^{\,x \log(ed/x)} \, dx
\le e^{\,2j \log(ed/j)} .

Finally, we take a union bound over all sets I with size between j and (2j − 1) to obtain a failure probability of e^{2j log(ed/j)} · e^{−3j log(ed/j)} = e^{−j log(ed/j)}. In other words, for every such I, the sifting matrix S^{(j)} isolates at least (1 − ρ)|I| of the distinguished indices in at least half the trials.

Next, we establish that, if g contains a single distinguished component (a spike) plus noise, then a large number of the output signals contain that spike along with a reduced amount of noise.

Lemma 13 (Noise Reduction). Except with probability d^{−1}, the noise reduction matrix R^{(j)} has the following property. Let g be an input signal that satisfies (i) g = δ + ν, (ii) ‖δ‖₀ = 1, and (iii) supp(δ) ∩ supp(ν) = ∅. Then there are at least 0.5 C_r log d output signals h of the form h = δ + ν̃ where

\| \tilde{\nu} \|_1 \le \frac{10}{C_r} \, \| \nu \|_1
= 10 \sqrt{\frac{j}{8 \rho^{-1} M \log_2 M}} \; \| \nu \|_1 .

Proof. Let g be a signal of the form g = δ + ν, where δ is a spike at position i. Consider the submatrix X of R^{(j)} constructed by extracting the rows of R^{(j)} where the index i appears and then removing the ith column. This submatrix contains exactly C_r log d rows and (d − 1) columns. Let ν̄ be the vector ν without its ith component (which equals zero by hypothesis). Note that ν̄ has the same ℓ₁ norm as ν.

Let x be a column of X. The entries of x are independent binary random variables with mutual expectation (C_r)^{−1} because each one comes from a different submatrix. Therefore, E ‖x‖₁ = (C_r log d)/C_r = log d. Chernoff's bound shows that

\mathbb{P}\{ \| x \|_1 > 5 \log d \} \le \Big[ \frac{e^4}{5^5} \Big]^{\log d} < d^{-3} .

Applying the union bound over all (d − 1) columns of X, we find that P{‖X‖_{1→1} > 5 log d} ≤ d^{−2}. Therefore, the number of output signals in which position i appears and where the noise is large satisfies

\#\Big\{ n : \big| (X \bar{\nu})_n \big| > \frac{10}{C_r} \, \| \nu \|_1 \Big\}
\le \frac{C_r}{10} \cdot \frac{\| X \bar{\nu} \|_1}{\| \nu \|_1}
\le 0.5\, C_r \log d .

Since position i appears in exactly C_r log d output signals, we discover that the number of output signals where position i appears and the noise is less than (10/C_r) ‖ν‖₁ is at least 0.5 C_r log d. So far, we have only established the result for a single spike location i. To complete the proof, we perform a union bound over the d possible locations for the spike to obtain a final failure probability of d^{−1}.

Combine Lemma 12 and Lemma 13 to obtain the announced Lemma 9.

Let us instantiate our fixed collection I of signal components as those which fall in a significant band B_s. Let s be a power of two between 1 and M/2, and note that s takes log₂ M values. We define the sth band of the residual to be the set

B_s = \Big\{ i : \frac{1}{2s} \, \| r \|_1 < | r_i | \le \frac{1}{s} \, \| r \|_1 \Big\} .

We say that the sth band is significant if

\| r |_{B_s} \|_2 > \frac{\alpha^{-1}}{\sqrt{2 \log_2 M}} \, \| r \|_2 .

First, we check that B_s meets our size constraints. Next, we use our guarantee on the isolation operator A^{(j)} to conclude that

1. the identification process finds at least (1 − ρ)|B_s| of the components in the band, at least O(√((M/j) log M) log(d/j) log d) times each;
2. the total ℓ₂ norm of the lost components is at most α^{−1} ‖r‖₂; and
3. the final list of identified components contains O(M √(log M)) items.

Now we argue that the signal positions in the list of identified components carry most of the energy in the residual signal. This fact ensures that the iteration is making progress toward finding significant signal positions. Moreover, it guarantees that the estimation step can accurately predict the values of the signal positions listed in L. The parts of the residual that we miss fall into several categories. The challenging piece requires the serious work presented above. The remaining pieces follow from the head–tail relationships and the definition of significant band.
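The core probabilistic mechanism in the sifting analysis (Lemma 10) is a balls-in-bins computation: k distinguished positions are hashed into N ≥ 10kρ^{−1} bins, and we ask how many are lost to collisions. The toy Monte Carlo below is illustrative only; the lemma itself uses limited independence and Janson's inequality rather than simulation.

```python
import numpy as np

def isolation_failures(k, N, trials, rng):
    """Throw k distinguished positions ('balls') into N rows ('bins') uniformly and
    count, per trial, how many balls share a bin with another ball."""
    fails = np.empty(trials, dtype=int)
    for t in range(trials):
        bins = rng.integers(0, N, size=k)
        _, counts = np.unique(bins, return_counts=True)
        fails[t] = counts[counts > 1].sum()     # balls involved in collisions
    return fails

rng = np.random.default_rng(3)
k, rho = 64, 0.25
N = int(10 * k / rho)                # N >= 10 k / rho, as in the proof of Lemma 10
fails = isolation_failures(k, N, trials=2000, rng=rng)
print(fails.mean())                  # average number of balls lost to collisions (small)
print(np.mean(fails > rho * k))      # fraction of trials losing more than rho*k balls (tiny)
```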

4.3 Estimation

The estimation step produces an approximation b to the residual r that lies relatively close to the residual, even though it contains O(M √(log M)) nonzero entries. There are two technical results that are essential to the proof. The first appeared in the work of Rudelson and Vershynin [14].

Lemma 14 (Restricted Isometry). The estimation matrix Φ has the property that every |L|-column submatrix A satisfies, for every vector x,

\tfrac{1}{2} \, \| x \|_2 \le \| A x \|_2 \le \tfrac{3}{2} \, \| x \|_2 .

Lemma 15. The estimation matrix Φ has the property that, for every vector x,

\| \Phi x \|_2 \le \tfrac{3}{2} \Big[ \| x \|_2 + \tfrac{1}{\sqrt{M}} \, \| x \|_1 \Big] .
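Lemma 14 is a uniform statement over all |L|-column submatrices. As a purely illustrative check of the conditioning it describes, the sketch below samples one random column subset of a random partial-DFT estimation matrix and prints its extreme singular values; the specific sizes are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(4)
d, q, L = 512, 200, 20            # ambient dimension, number of DFT rows, submatrix columns

rows = rng.choice(d, size=q, replace=False)
# q random DFT rows, scaled so the columns of the estimation matrix have unit l2 norm.
F = np.exp(-2j * np.pi * np.outer(rows, np.arange(d)) / d) / np.sqrt(q)

cols = rng.choice(d, size=L, replace=False)
A = F[:, cols]                    # one |L|-column submatrix of the estimation matrix
s = np.linalg.svd(A, compute_uv=False)
print(s.min(), s.max())           # Lemma 14 asserts these stay within [1/2, 3/2] for the relevant |L|
```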

Acknowledgments

We wish to thank Mark Rudelson for showing us this lovely proof of Lemma 15. We thank Howard Karloff and Piotr Indyk for many insightful discussions and, finally, we thank the anonymous referees for helping us clarify the related work. ACG is an Alfred P. Sloan Research Fellow. ACG has been supported in part by NSF DMS 0354600. JAT has been supported by NSF DMS 0503299. MJS has been supported in part by NSF DMS 0354600. RV is an Alfred P. Sloan Research Fellow. He was also partially supported by the NSF grant DMS 0401032.

5. REFERENCES

[1] G. Beylkin. On the Fast Fourier Transform of functions with singularities. Appl. Comp. Harmonic Anal., 2:363–381, 1995.
[2] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1208–1223, 2006.
[3] E. J. Candès and T. Tao. Near optimal signal recovery from random projections: Universal encoding strategies? Submitted for publication, Nov. 2004.
[4] A. Cohen, W. Dahmen, and R. DeVore. Compressed Sensing and best k-term approximation. Submitted for publication, July 2006.
[5] G. Cormode and S. Muthukrishnan. Combinatorial algorithms for Compressed Sensing. In Proc. of SIROCCO, pages 280–294, 2006.
[6] D. L. Donoho. Compressed Sensing. IEEE Trans. Info. Theory, 52(4):1289–1306, Apr. 2006.
[7] D. L. Donoho, I. Daubechies, R. DeVore, and M. Vetterli. Data compression and harmonic analysis. IEEE Trans. Info. Theory, 44(6):2435–2476, 1998.
[8] M. F. Duarte, M. B. Wakin, and R. G. Baraniuk. Fast reconstruction of piecewise smooth signals from random projections. In Proc. SPARS05, Rennes, France, Nov. 2005.
[9] A. C. Gilbert, S. Muthukrishnan, and M. J. Strauss. Improved time bounds for near-optimal sparse Fourier representation via sampling. In Proc. SPIE Wavelets XI, San Diego, 2005.
[10] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. Algorithmic linear dimension reduction in the ℓ1 norm for sparse vectors. Submitted for publication, 2006.
[11] S. Janson. Large deviation inequalities for sums of indicator variables. Technical report, Mathematics Dept., Uppsala Univ., 1994.
[12] S. Kirolos, J. Laska, M. Wakin, M. Duarte, D. Baron, T. Ragheb, Y. Massoud, and R. Baraniuk. Analog-to-information conversion via random demodulation. In IEEE Dallas Circuits and Systems Workshop (DCAS), Dallas, Texas, Oct. 2006.
[13] J. Laska, S. Kirolos, Y. Massoud, R. Baraniuk, A. Gilbert, M. Iwen, and M. Strauss. Random sampling for analog-to-information conversion of wideband signals. In IEEE Dallas Circuits and Systems Workshop (DCAS), Dallas, Texas, Oct. 2006.
[14] M. Rudelson and R. Vershynin. Sparse reconstruction by convex relaxation: Fourier and Gaussian measurements. In Proc. 40th Ann. Conf. Information Sciences and Systems, Princeton, Mar. 2006.
[15] D. Takhar, J. Laska, M. B. Wakin, M. F. Duarte, D. Baron, S. Sarvotham, K. Kelly, and R. G. Baraniuk. A new compressive imaging camera architecture using optical-domain compression. In Proc. IS&T/SPIE Symposium on Electronic Imaging, 2006.
[16] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via Orthogonal Matching Pursuit. Submitted for publication, Apr. 2005. Revised, Nov. 2006.
[17] V. A. Vatutin and V. G. Mikhailov. Limit theorems for the number of empty cells in an equiprobable scheme for group allocation of particles. Theor. Probab. Appl., 27:734–743, 1982.
[18] M. Wakin, J. Laska, M. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. Kelly, and R. Baraniuk. Compressive imaging for video representation and coding. In Proc. Picture Coding Symposium 2006, Beijing, China, Apr. 2006.
[19] Y. Zheng, D. J. Brady, M. E. Sullivan, and B. D. Guenther. Fiber-optic localization by geometric space coding with a two-dimensional gray code. Applied Optics, 44(20):4306–4314, 2005.
[20] Y. H. Zheng, N. P. Pitsianis, and D. J. Brady. Nonadaptive group testing based fiber sensor deployment for multiperson tracking. IEEE Sensors Journal, 6(2):490–494, 2006.



