Harmonic Analysis On Graphs

Sarah Constantin December 1, 2011

1 Lecture 2

1.1 Multidimensional scaling

Given points $x_i \in \mathbb{R}^n$, $i = 1, \dots, m$, with squared distances $d_{ij}^2 = \|x_i - x_j\|^2$, we can do an embedding $x_i \to \tilde{x}_i$. We can compute the matrix of inner products $C_{ij} = x_i \cdot x_j$ and diagonalize it:

$$C = O^t \Lambda^2 O = (\Lambda O)^t (\Lambda O), \qquad C_{ij} = \tilde{x}_i \cdot \tilde{x}_j,$$

where $\tilde{x}_i$ is the $i$-th column of $\Lambda O$. Another way to think of it: $C_{ij} = \sum_l \lambda_l^2 v_l(i) v_l(j)$, inner products between eigenvector coordinates. The vector associated with $x_i$ is $\tilde{x}_i = \{\lambda_l v_l(i),\ l = 1, \dots, m\}$. If the $x_i$ lie in a low-rank space, then only a few $\lambda_l$ will be nonzero.

Here we associated $x_i$ and $x_j$ with the kernel $\langle x_i, x_j \rangle$. But we could have had a different kernel $k(x_i, x_j)$.

Abstract metric space $X$ with distance $d(x, y)$: we want to define a mapping $\phi : X \to \mathbb{R}^n$ such that $\|\phi(x) - \phi(y)\| \approx d(x, y)$. There is a lot of literature on the subject. You have some estimate on the ratio between the distances. If $|X| = 2^L$, then there is a map into $\mathbb{R}^{cL}$, for any metric space. This is a "coding" theorem. In reality $L$ is never really bigger than 50. Notice that in the multidimensional scaling example it didn't really matter if $N$ was big.


1.2 Diffusion geometry

Instead of looking at $X_i \cdot X_j$, we could have looked at $[X_i \cdot X_j]_\epsilon$, cut off to be 0 unless $1 - \frac{x_i \cdot x_j}{\|x_i\|\,\|x_j\|} \le \epsilon$. That also identifies nearby points.

Suppose my point $X_i$ in $\mathbb{R}^{121}$ is an image of 11 by 11 pixels. How do I compare images? I have a database of images. Perhaps we have a big image and each patch is a subimage; $\nu(p)$ is the patch centered at $p$. Also, we shouldn't consider the image to be smooth. If it were, then the patches would describe a 2-d surface in $\mathbb{R}^{121}$. But instead you're going to see pixels all over the place, around some surface: a point cloud, from which we'd like to recover an underlying manifold.

Take a pixel $p$ and patch $\nu(p)$, take its inner product with the patch $\nu(q)$, and define an affinity $\alpha(p, q) = [\nu(p) \cdot \nu(q)]_\epsilon$, truncated to be zero unless the patches are close. Then renormalize:

$$A_{p,q} = \frac{\alpha(p,q)}{\omega(p)}, \qquad \omega(p) = \sum_q \alpha(p, q).$$

This produces a smoothing filter: replace a patch by the average of its neighbors. It denoises beautifully. Or you can let the features be local variances rather than pixel values in a patch around each pixel.

How do you convert a point cloud to the underlying manifold? Take the neighborhood of each point and replace each point by the center of mass of its neighbors. This "cleans up" the data. Or, rather, take inner products between pairs of points; if they're close enough, accept them, and embed them into Euclidean space. This is "diffusion geometry."

$$A_{p,q} = \frac{\exp(-\|\nu(p)-\nu(q)\|^2/\epsilon)}{\omega(p)\,\omega(q)}, \qquad \bar{I}(q) = \sum_p A_{q,p}\, I(p).$$

This is called non-local means: weighted means in which only the close points count. What about rotated patches? They'll look uncorrelated when they're just off-center. Texture will obviously not pick out nearby patches.

Define a graph on the image connecting nearby points. Weight each edge with $A_{p,q}$. Let $\bar\phi(p) = \sum_q a_{p,q}\,\phi(q)$; then if the function is smooth-ish, $\bar\phi(p) - \phi(p)$ is small:

$$\bar\phi(p) - \phi(p) = \epsilon\,\Delta\phi + O(\epsilon^2).$$

We'll prove this next time.
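The non-local means filter described above can be tried on a toy example. The following is a minimal sketch, not the lecturer's implementation: it assumes a 1-D signal instead of an image, a Gaussian affinity on patch differences without truncation, and the row normalization $\alpha(p,q)/\omega(p)$. The function name `nonlocal_means`, the parameters `half_width` and `eps`, and the test signal are all hypothetical.

```python
import math

def nonlocal_means(signal, half_width=2, eps=0.2):
    """Replace each sample by an affinity-weighted mean of all samples."""
    n = len(signal)

    def patch(p):
        # Patch nu(p): samples around p, clamped at the boundary.
        return [signal[min(max(p + k, 0), n - 1)]
                for k in range(-half_width, half_width + 1)]

    patches = [patch(p) for p in range(n)]
    out = []
    for p in range(n):
        weights = []
        for q in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(patches[p], patches[q]))
            weights.append(math.exp(-d2 / eps))   # affinity alpha(p, q)
        omega = sum(weights)                      # omega(p) = sum_q alpha(p, q)
        out.append(sum(w * v for w, v in zip(weights, signal)) / omega)
    return out

# Hypothetical test signal: a step plus a small deterministic perturbation.
clean = [0.0] * 20 + [1.0] * 20
noisy = [c + 0.05 * math.sin(7.0 * i) for i, c in enumerate(clean)]
denoised = nonlocal_means(noisy)
err_noisy = sum((a - b) ** 2 for a, b in zip(noisy, clean))
err_denoised = sum((a - b) ** 2 for a, b in zip(denoised, clean))
print(err_noisy, err_denoised)
```

Because patches on opposite sides of the step are far apart in patch space, the averaging mostly stays within each flat region, which is why the edge survives while the perturbation is smoothed.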


2 Lecture 3

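Symmetric affinity matrix; associate a Markov process with it, or a graph. Start with the matrix $a(i, j)$, assumed symmetric and positive (positive spectrum, which is not the same as the entries being positive; it is equivalent to $\sum_{i,j} a(i, j) u_i u_j$ always being positive). View

$$a(i, j) = \sum_l \lambda_l^2\, \psi_l(i)\,\psi_l(j)$$

as the inner product matrix of $\tilde{x}(i) = (\lambda_l \psi_l(i))$ and $\tilde{x}(j) = (\lambda_l \psi_l(j))$. This matrix also defines a metric $\|\tilde{x}(i) - \tilde{x}(j)\|$.

Define $\omega(i) = \sum_j a(i, j)$. Define a new matrix

$$A_{ij} = \frac{a(i,j)}{\sqrt{\omega(i)}\sqrt{\omega(j)}},$$

which is symmetric, and another matrix $P_{i,j} = \frac{a(i,j)}{\omega(i)}$. Then

$$\sum_j P_{i,j} P_{j,k} = P^2_{i,k}$$

is the probability of going from $i$ to $k$ in 2 steps. In general, $P^m_{i,k}$ is the probability of going from $i$ to $k$ in $m$ steps.

$P$ itself is not symmetric in general, but it is conjugate to the symmetric matrix $A$: $P = \omega^{-1/2} A\, \omega^{1/2}$. Writing $A_{ij} = \sum_l \lambda_l^2 \phi_l(i)\phi_l(j)$ (from here on $\lambda_l$ and $\psi_l$ refer to $P$),

$$P_{i,j} = \sum_l \lambda_l^2\, \omega^{-1/2}(i)\,\phi_l(i)\,\phi_l(j)\,\omega^{1/2}(j), \qquad \psi_l = \omega^{-1/2}\phi_l.$$

Suppose $x, y$ are points distributed in the plane, and you have $P(x, y)$, the probability of going from $x$ to $y$, Gaussian distributed around each point like $e^{-|x-y|^2/2}$. We can measure the distance between $x$ and $x'$ by measuring the distance between the bumps around them:

$$d(x, x') = \Big(\int |P(x, y) - P(x', y)|^2\,dy\Big)^{1/2}.$$

We can also think of $P_{i,j}$ as giving a distance, suitably defined:

$$d_m^2(i, i') = \sum_j |P^m(i, j) - P^m(i', j)|^2\,\frac{1}{\omega(j)} = \sum_j \Big|\sum_l \lambda_l^{2m}\big(\psi_l(i) - \psi_l(i')\big)\,\phi_l(j)\,\omega^{1/2}(j)\Big|^2 \frac{1}{\omega(j)}$$

$$= \sum_l \lambda_l^{4m}\,|\psi_l(i) - \psi_l(i')|^2 = \|\tilde{x}^m(i) - \tilde{x}^m(i')\|^2,$$

where $\tilde{x}^m(i) = \{\lambda_l^{2m}\psi_l(i)\}$. Need to let $m$ propagate to have a distance. How far is a question.

Consider random points in $\mathbb{R}^n$ distributed according to a density $q(x)$:

$$\frac{1}{N}\sum_{i=1}^N f(x_i) \approx \int f(x)\,q(x)\,dx.$$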

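This is an approximate integral. With $a(i, j) = e^{-|x_i - x_j|^2/\epsilon}$,

$$\sum_j a(i, j) f(j) \approx cN \int e^{-|x_i - y|^2/\epsilon} f(y)\,q(y)\,dy.$$

Define $a_\epsilon(x, y) = e^{-|x-y|^2/\epsilon}$ and

$$A_\epsilon^0 f = \int e^{-|x-y|^2/\epsilon} f(y)\,q(y)\,dy,$$

$$\omega_\epsilon(x) = \int e^{-|x-y|^2/\epsilon}\,q(y)\,dy,$$

$$A_\epsilon^\alpha f = \int \frac{e^{-|x-y|^2/\epsilon}}{\omega_\epsilon^\alpha(x)\,\omega_\epsilon^\alpha(y)}\, f(y)\,q(y)\,dy.$$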
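The discrete construction from the start of this lecture (affinity, Markov normalization $P = a/\omega$, and the diffusion distance $d_m$) can be sketched in a few lines. This is an illustration under assumptions: points on a line, a Gaussian affinity, and plain nested lists for matrices. The helper names and the sample points are hypothetical.

```python
import math

def gaussian_affinity(points, eps):
    # a(i, j) = exp(-|x_i - x_j|^2 / eps)
    return [[math.exp(-(x - y) ** 2 / eps) for y in points] for x in points]

def markov_matrix(a):
    # P[i][j] = a(i, j) / omega(i), with omega(i) = sum_j a(i, j)
    omega = [sum(row) for row in a]
    n = len(a)
    p = [[a[i][j] / omega[i] for j in range(n)] for i in range(n)]
    return p, omega

def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(p, m):
    result = p
    for _ in range(m - 1):
        result = matmul(result, p)
    return result

def diffusion_distance(pm, omega, i, k):
    # d_m(i, i')^2 = sum_j |P^m(i, j) - P^m(i', j)|^2 / omega(j)
    return math.sqrt(sum((pm[i][j] - pm[k][j]) ** 2 / omega[j]
                         for j in range(len(omega))))

points = [0.0, 0.1, 0.2, 2.0, 2.1]   # hypothetical: two clusters on a line
a = gaussian_affinity(points, eps=0.5)
P, omega = markov_matrix(a)
P3 = matrix_power(P, 3)              # three-step transition probabilities
d_near = diffusion_distance(P3, omega, 0, 1)
d_far = diffusion_distance(P3, omega, 0, 3)
print(d_near, d_far)
```

Points in the same cluster have nearly identical rows of $P^3$, so their diffusion distance is small, while points in different clusters are far apart in this metric.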

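$P_\epsilon^\alpha(f)$ is a convolution and two multiplications. In

$$\frac{c_n}{\epsilon^{n/2}} \int e^{-|x-y|^2/\epsilon} f(y)\,dy$$

make the change of variable $t = \frac{x-y}{\epsilon^{1/2}}$, $y = x - \epsilon^{1/2} t$, to get

$$c_n \int e^{-|t|^2} f(x - \sqrt{\epsilon}\,t)\,dt.$$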

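Assume $f$ is $C^\infty$ with compact support. We can expand $f(x - \sqrt{\epsilon}\,t)$ in a Taylor series to get

$$\Big(c_n \int e^{-|t|^2} dt\Big) f(x) - c_n\,\epsilon^{1/2} \int e^{-|t|^2}\,\nabla f(x) \cdot t\,dt + c_n\,\frac{\epsilon}{2} \int e^{-|t|^2} \sum_{i,j} \frac{\partial^2 f}{\partial x_i \partial x_j}(x)\,t_i t_j\,dt + O(\epsilon^2).$$

The first-order term vanishes by symmetry, as do the cross terms $t_i t_j$ with $i \ne j$, and $c_n$ normalizes the first integral to 1, so this is

$$= f(x) + \epsilon\,m_2\,\Delta f(x) + O(\epsilon^2), \qquad m_2 = \frac{c_n}{2} \int e^{-|t|^2}\,t_1^2\,dt.$$

Hence

$$g_\epsilon * f = f + \epsilon\,m_2\,\Delta f + O(\epsilon^2),$$

$$\frac{1}{\epsilon}\,(f - g_\epsilon * f) = -m_2\,\Delta f + O(\epsilon),$$

$$\lim_{\epsilon \to 0} \frac{1}{\epsilon}\,(f - g_\epsilon * f) = -m_2\,\Delta f.$$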
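The expansion $g_\epsilon * f = f + \epsilon\,m_2\,\Delta f + O(\epsilon^2)$ can be checked numerically in one dimension. The sketch below assumes the normalization $c_1 = 1/\sqrt{\pi}$, for which the constant in front of $\epsilon f''$ works out to $1/4$, and uses a plain midpoint rule; `gauss_smooth` and its parameters are hypothetical names. For $f = \cos$ the smoothed function is exactly $e^{-\epsilon/4}\cos x$, which gives an independent check.

```python
import math

def gauss_smooth(f, x, eps, half_width=8.0, steps=4000):
    """Midpoint rule for (pi*eps)^(-1/2) * int exp(-(x-y)^2/eps) f(y) dy."""
    # Substitute y = x - sqrt(eps) * t and integrate over |t| <= half_width.
    h = 2.0 * half_width / steps
    root = math.sqrt(eps)
    total = 0.0
    for k in range(steps):
        t = -half_width + (k + 0.5) * h
        total += math.exp(-t * t) * f(x - root * t)
    return total * h / math.sqrt(math.pi)

x0 = 0.3
for eps in (0.1, 0.01, 0.001):
    quotient = (gauss_smooth(math.cos, x0, eps) - math.cos(x0)) / eps
    # The quotient should approach (1/4) * f''(x0) = -cos(x0) / 4 as eps -> 0.
    print(eps, quotient, -0.25 * math.cos(x0))
```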

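Schrödinger operator. What does $(g_\epsilon * f)$ mean? On the Fourier side (up to the constant $m_2$ in the exponent),

$$(g_\epsilon * f)^\wedge(\xi) = e^{-|\xi|^2 \epsilon}\,\hat{f}(\xi),$$

and with $t = \epsilon n$, applying it $n$ times gives

$$(G_t f)^\wedge(\xi) = e^{-|\xi|^2 t}\,\hat{f}(\xi).$$

$G_t G_s = G_{t+s}$: a semigroup. Moreover

$$\frac{\partial}{\partial t} G_t(f) = \Delta G_t f,$$

so $u(x, t) = G_t(f)(x)$ satisfies $\frac{\partial u}{\partial t} = \Delta u$, the diffusion equation, with

$$\lim_{t \to 0} u(x, t) = f, \qquad G_t(f) = e^{t\Delta} f.$$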


3 Lecture 4

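Up to normalization,

$$\frac{1}{\sqrt{\epsilon}} \sum_i f(y_i)\,e^{-|x-y_i|^2/\epsilon} \approx g_\epsilon * (fp), \qquad \omega_\epsilon(x) = \frac{1}{\sqrt{\epsilon}} \sum_i e^{-|x-y_i|^2/\epsilon} \approx g_\epsilon * p,$$

with $\omega_i = \omega_\epsilon(y_i)$. For $f$ smooth,

$$\frac{1}{\sqrt{\epsilon}} \int e^{-|x-y|^2/\epsilon} f(y)\,dy = f(x) + \epsilon\,m_2\,\Delta f(x) + O(\epsilon^2) = g_\epsilon * f.$$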

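This is the continuous version. Define $\omega_\epsilon(x) = g_\epsilon * p$. Define

$$d_\epsilon^\alpha = \frac{1}{\omega_\epsilon^\alpha}\; g_\epsilon * \Big(\frac{p}{\omega_\epsilon^\alpha}\Big), \qquad P_\epsilon^\alpha(f) = \frac{1}{d_\epsilon^\alpha}\,\frac{1}{\omega_\epsilon^\alpha}\; g_\epsilon * \Big(\frac{fp}{\omega_\epsilon^\alpha}\Big).$$

What is $\omega_\epsilon$? The convolution of $p$ with $g_\epsilon$: close to $p$, up to an epsilon. So $P_\epsilon^\alpha$ is like getting rid of $p$; if $\alpha = 1$, it is just $g_\epsilon * f$, but normalized with $1/d$ because we need it to be a probability measure. You want to know the geometry of the dataset, which has nothing to do with the statistics of the dataset, $p$. Dividing by $\omega$ is uniformizing.

THEOREM.

$$P_\epsilon^\alpha(f) = f + \epsilon\,m_2 \Big(\frac{\Delta(f p^{1-\alpha})}{p^{1-\alpha}} - \frac{\Delta(p^{1-\alpha})}{p^{1-\alpha}}\,f\Big) + O(\epsilon^2).$$

When $\alpha = 1$, you just get $P_\epsilon^1 f = f + \epsilon\,m_2\,\Delta f + O(\epsilon^2)$. Interesting options: $\alpha = 1, 0, 1/2$. From now on, let $m_2$ be part of the Laplacian.

Proof of theorem:

$$\omega_\epsilon = g_\epsilon * p = p + \epsilon\Delta p + O(\epsilon^2) = p\,(1 + \epsilon\,\Delta p/p) + O(\epsilon^2),$$

$$\omega_\epsilon^\alpha = p^\alpha\,(1 + \alpha\epsilon\,\Delta p/p) + O(\epsilon^2),$$

$$\omega_\epsilon^\alpha d_\epsilon^\alpha = g_\epsilon * \Big(\frac{p}{\omega_\epsilon^\alpha}\Big) = p^{1-\alpha}\,(1 - \alpha\epsilon\,\Delta p/p) + \epsilon\,\Delta p^{1-\alpha} + O(\epsilon^2).$$

Carrying out the same expansion for $g_\epsilon * (fp/\omega_\epsilon^\alpha)$ and dividing gives the theorem:

$$P_\epsilon^\alpha f = f + \epsilon \Big[\frac{\Delta(f p^{1-\alpha})}{p^{1-\alpha}} - f\,\frac{\Delta p^{1-\alpha}}{p^{1-\alpha}}\Big] + O(\epsilon^2).$$

The operator in brackets is

$$\Delta_\alpha f = \frac{1}{p^{1-\alpha}}\,\Delta(f p^{1-\alpha}) - \frac{\Delta p^{1-\alpha}}{p^{1-\alpha}}\,f.$$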

Suppose I have points lying uniformly on a curve. $f(s)$ is a function of the arclength $s$ parametrizing the curve, and $y(s)$ is the parametrization. The points could also not be distributed uniformly; just look at $P^1$ to make it uniform. If $y(s)$ is the parametrization, then

$$\frac{1}{\sqrt{\epsilon}} \int e^{-|y(s)|^2/\epsilon} f(s)\,ds = f(0) + \epsilon \Big(f''(0) + \frac{a^2}{2} f(0)\Big).$$

Divide by what happens when $f = 1$, and you'll get rid of the $f(0)$ term in the parentheses. You'll get the second derivative in arclength: $f(0) + \epsilon f''(0)$.

4 Lecture 5

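The point of last time's calculation:

$$g_\epsilon * f = \frac{1}{\epsilon^{n/2}} \int e^{-|x-y|^2/\epsilon} f(y)\,dy = f(x) + \epsilon\,m_2\,\Delta f(x) + O(\epsilon^2),$$

where, up to normalization, $m_2 = \int_{-\infty}^{\infty} e^{-t^2} t^2\,dt$. From now on we swallow the $m_2$.

We started with that to define operators which we called

$$P_\epsilon^\alpha f = \frac{1}{d_\epsilon^\alpha(x)} \int \frac{e^{-|x-y|^2/\epsilon}}{\omega_\epsilon^\alpha(x)\,\omega_\epsilon^\alpha(y)}\, f(y)\,p(y)\,dy,$$

where $\omega_\epsilon(x) = g_\epsilon * p = p + \epsilon\Delta p + O(\epsilon^2)$. We use $p$ to refer to the distribution of the points in $\mathbb{R}^n$. If you randomly pick points out of the density, the averages over the points will converge to the integral above. Last time we showed

$$P_\epsilon^\alpha f = f(x) + \epsilon \Big(\frac{\Delta(f p^{1-\alpha})}{p^{1-\alpha}} - \frac{\Delta(p^{1-\alpha})}{p^{1-\alpha}}\,f\Big) + O(\epsilon^2).$$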

If $\alpha = 1$, the second term doesn't exist, so it approaches the Laplacian. That makes it independent of the density.

What do I do if the points are constrained to lie on a set? On a curve or a manifold or something? Today we let $\alpha = 1$. Let's start with a curve; we know the points lie on a curve. Assume it is rectifiable. Then you can assume the points are distributed by some density $p(s)\,ds$ ON THE CURVE. Let's let $p = 1$, uniformly distributed by arclength. Pick a point on the curve, look at the tangent line to the curve, and model what's going on. Map that point to the origin, the second derivative of the curve to the y-axis and the tangent of the curve to the x-axis. ("Osculating plane.")

$$P_\epsilon(f)(0) = \frac{\int e^{-r^2(s)/\epsilon} f(s)\,ds}{\int e^{-r^2(s)/\epsilon}\,1\,ds}.$$

The curve is $y(t) = at^2$. Distance to the origin:

$$r(t) = \sqrt{t^2 + a^2 t^4} = t\,(1 + a^2 t^2)^{1/2} = t + \tfrac{1}{2} a^2 t^3 + O(\delta^4).$$

Movement along the curve (arclength):

$$s(t) = \int_0^t \sqrt{1 + (2au)^2}\,du = \int_0^t (1 + 2a^2 u^2)\,du + O(\delta^4) = t + \tfrac{2a^2}{3} t^3 + O(\delta^4).$$

Hence

$$s(r) = r - \tfrac{1}{2} a^2 r^3 + \tfrac{2a^2}{3} r^3 + O(\delta^4), \qquad t = s - \tfrac{2a^2}{3} s^3 + O(\delta^4),$$

and changing variables from $s$ to $r$,

$$\int e^{-r^2/\epsilon} f(s(r))\,s'(r)\,dr = f(0) + \epsilon\,\big(f(s(r))\,s'(r)\big)''(0) + O(\epsilon^2),$$

$$\big(f(s(r))\,s'(r)\big)'' = [f(s(r))]''\,s'(r) + 2\,s''(r)\,[f(s(r))]' + f(s)\,s'''(r),$$

evaluated at $r = 0$. But at $r = 0$ we have $s = 0$, $s' = 1$, $s'' = 0$ and $s''' = a^2$, so this is $f''(0) + a^2 f(0)$, and the above integral is

$$f(0) + \epsilon\,(f''(0) + a^2 f(0)) = (1 + \epsilon a^2) f(0) + \epsilon f''(0).$$

Now the normalized integral:

$$\frac{\int e^{-r^2/\epsilon} f(s)\,ds}{\int e^{-r^2/\epsilon}\,ds} = f(0) + \frac{\epsilon f''(0)}{1 + \epsilon a^2} + O(\epsilon^2) = f(0) + \epsilon f''(0) + O(\epsilon^2).$$

Same thing as before.

This can be generalized (as an exercise) to two variables: a 2-d surface, an osculating plane. The height is $a_1 t_1^2 + a_2 t_2^2$ and you should get a Laplacian in the $s$ variables. This is sort of the definition of the Laplace-Beltrami operator on the surface. The final result should be

$$f(0, 0) + \epsilon \Big(\frac{d^2}{ds_1^2} + \frac{d^2}{ds_2^2}\Big) f(0, 0) + O(\epsilon^2).$$

Returning to the curve: observe that if we had a density, and normalized as we had before with $\alpha = 1$, we'd be able to get rid of the density.

$$\frac{\int e^{-r^2(s)/\epsilon} f(s)\,p(s)\,ds}{\int e^{-r^2/\epsilon}\,p(s)\,ds}$$

does NOT converge to the Laplacian, but to the Laplacian plus a potential. On the other hand,

$$\frac{\int e^{-r^2(s)/\epsilon} f(s)\,p(s)/\omega_\epsilon(s)\,ds}{\int e^{-r^2/\epsilon}\,p(s)/\omega_\epsilon(s)\,ds}$$

will get rid of the density, converging to $f(0) + \epsilon f''(0)$. This gets rid of statistics and gives us geometry. (The first, unnormalized case corresponds to taking $\alpha = 0$.)

Claim: if we let $P_\epsilon f = f(0) + \epsilon f''(0) + O(\epsilon^2)$, i.e. the operator $I + \epsilon\Delta + O(\epsilon^2)$, take $\epsilon = 1/n$, and take $P_{1/n}^n$, then this converges to $e^{\Delta}$. Note that the spectrum of $I - P$ is positive, while the spectrum of the second derivative is negative. Since the eigenvectors of $e^{\Delta}$ are just the eigenvectors of the Laplacian, this tells us the eigenvectors of $P_\epsilon^{1/\epsilon}$ should converge to the eigenvectors of the Laplacian. This "parametrizes" points without parametrizing. You get the arclength parametrization of the curve JUST from this averaging operator. Diffusion distance between two points is Euclidean distance in the ambient plane times something bounded.

We took $P_\epsilon$ as an operator, $I + \epsilon\Delta + R_\epsilon$, with $R_\epsilon$ the residual. The residual is terrible. It's of order $\epsilon^2$ because we assumed the function has two well-behaved derivatives. But what if you can't do Taylor series? It doesn't work if the function is not differentiable. Restrict our attention to a space of eigenvectors of the Laplacian whose eigenvalue does not exceed a certain number, i.e. bandlimited functions. Think of trigonometric polynomials. Claim:

$$\|\Pi_m (P_{1/n}^n - e^{\Delta})\| \to 0,$$

where $\Pi_m$ is the orthogonal projection on the space of bandlimited functions with eigenvalue less than $m$.

$$\Pi_m P_{1/n}^n \Pi_m = \Pi_m \Big(I + \frac{1}{n}\Delta + \frac{1}{n} R_n\Big)^n \Pi_m.$$

Now $\|(A + B)^n - A^n\| \le n\,(1 + \|B\|)^{n-1}\,\|B\|$ if $\|A\| \le 1$. If $A = I + \frac{1}{n}\Delta$ and $B = \frac{1}{n} R_n$, you apply the above and get a bound.

5 Lecture 5

$$\frac{1}{\omega_\epsilon(t)} \int e^{-(x - y(s))^2/\epsilon}\, \frac{f(s)\,p(s)}{p_\epsilon(t)\,p_\epsilon(s)}\,ds \to -f''(t)$$

as $\epsilon \to 0$; we showed this last time. If we hadn't canceled by the $p_\epsilon$'s, we'd have a different operator, corresponding to the situation $\alpha = 0$: second derivative plus some second order thing. $p$ is the density function along the curve, the "statistics." The limit shows that the "statistics" don't matter to the geometry of the curve. The conventional machine learning embedding blends the statistics with the geometry. But this is an intrinsic parametrization, independent of where the data is most densely sampled.

The above integral, if we took it discretely, would mean just averaging the points against a Gaussian. We would expect the eigenvectors of the discrete operators to approximate the eigenvectors of the continuous operator.

One of the goals is to go to surfaces and Riemannian manifolds. If I have data which is on some surface, we want to understand the data independently of how we measure the data: some way of defining the operators so that they're intrinsic.

$$P_\epsilon^{1/\epsilon} \to e^{\Delta},$$

so in particular, if $\epsilon_n = 1/n$, then $P_{1/n}^n \to e^{\Delta}$. What does that mean? Here $\Delta = \frac{d^2}{dx^2}$.

$f = \sum_k \hat{f}_k e^{ikx}$ means that

$$\Delta f = \sum_k -k^2 \hat{f}_k e^{ikx}, \qquad F(\Delta) f = \sum_k F(-k^2)\,\hat{f}_k e^{ikx}.$$

Now if $u(x, t) = e^{t\Delta} f$, then

$$\frac{du}{dt} = \sum_k e^{-k^2 t}(-k^2)\,\hat{f}_k e^{ikx}, \qquad u = e^{t\Delta} f = \sum_k e^{-k^2 t} \hat{f}_k e^{ikx} = \frac{1}{2\pi} \int_{-\pi}^{\pi} q_t(x - y)\,f(y)\,dy,$$

where $q_t(x) = \sum_k e^{-k^2 t} e^{ikx}$. This is real, so it's equal to its real part: $q_t(x - y) = \sum_k e^{-k^2 t} \cos k(x - y)$. This is the heat kernel!!! Take a Gaussian kernel and periodize it to have period $2\pi$ and you have this $q_t$. It's actually a theta function. For small $t$, this really does look like a Gaussian. We define a diffusion distance on the circle as follows:

$$d_t^2(x, x') = \int |q_t(x - y) - q_t(x' - y)|^2\,dy = \Big\|\sum_k e^{-k^2 t}\,(e^{ikx} - e^{ikx'})\,e^{-iky}\Big\|^2 = 2\pi \sum_k e^{-2k^2 t}\,|e^{ikx} - e^{ikx'}|^2$$

$$= 4\pi\,e^{-2t}\,|e^{ix} - e^{ix'}|^2 \Big(1 + \sum_{k \ge 2} e^{-2(k^2 - 1)t}\, \frac{|e^{ikx} - e^{ikx'}|^2}{|e^{ix} - e^{ix'}|^2}\Big).$$

This is the diffusion distance, not the geodesic distance. Short diffusion distance means there are many paths between the two points which are not too long.

You have a surface; you want an intrinsic relationship between points that you can measure directly. You can put a Riemannian metric on it. We'll be able to embed it explicitly into Euclidean space. Our embedding is not going to preserve the metric; it's going to preserve the diffusion distances.

Let's assume now that this manifold $M$ is given to us in $\mathbb{R}^3$. I'm going to do, on the manifold,

$$\int_M e^{-|x-y|^2/\epsilon} f(y)\,dy.$$

We want this to be $f(x) + \epsilon\Delta f + O(\epsilon^2)$, so we need to rescale it appropriately. We can do what we had before:

$$\int_M e^{-|x-y|^2/\epsilon} f(y)\,p(s)\,ds.$$

Think of our manifold as a surface in $\mathbb{R}^3$. Look at the tangent plane at a point and pick coordinates $u$, so that the surface is the graph $y(u) = \sum_i a_i u_i^2$. Define $s$ as the length of the geodesic curve in the direction $u$ connecting 0 to $(u, y(u))$. This is called the exponential map. The direction we pick is $s(u)\,u/|u|$.

$$\frac{dS_i}{du_i} = 1 + 2a_i^2 u_i^2 + O(\delta^3), \qquad \frac{dS_i}{du_j} = O(\delta^3) \quad (i \ne j).$$

It's a diagonal matrix up to $O(\delta^3)$, so

$$\Big|\det\Big(\frac{dS_i}{du_j}\Big)\Big| = 1 + 2\sum_i a_i^2 u_i^2 + O(\delta^3).$$

Taylor series in the $s$-coordinate:

$$f(S) = f(0) + \sum_i s_i\,\frac{df}{ds_i}(0) + \frac{1}{2} \sum_{i,j} s_i s_j\,\frac{d^2 f}{ds_i\,ds_j}(0) + O(\delta^3).$$

This is all extrinsic. All in the space we're embedded in.
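Returning to the circle for a moment: the statement that the heat kernel $q_t$ is a periodized Gaussian, and the Fourier form of the diffusion distance, can both be checked numerically. The sketch below assumes the normalization $q_t(x) = \sum_k e^{-k^2 t} e^{ikx}$, for which Poisson summation gives $q_t(x) = \sqrt{\pi/t}\,\sum_j e^{-(x + 2\pi j)^2/4t}$. The function names and truncation parameters `kmax`, `jmax` are hypothetical.

```python
import math

def heat_kernel_fourier(x, t, kmax=50):
    # q_t(x) = sum_k exp(-k^2 t) e^{ikx} = 1 + 2 sum_{k>=1} exp(-k^2 t) cos(kx)
    return 1.0 + 2.0 * sum(math.exp(-k * k * t) * math.cos(k * x)
                           for k in range(1, kmax + 1))

def heat_kernel_periodized(x, t, jmax=10):
    # sqrt(pi/t) * sum_j exp(-(x + 2 pi j)^2 / (4t)): a Gaussian made 2pi-periodic
    return math.sqrt(math.pi / t) * sum(
        math.exp(-(x + 2.0 * math.pi * j) ** 2 / (4.0 * t))
        for j in range(-jmax, jmax + 1))

def diffusion_distance_sq(x, xp, t, kmax=50):
    # Parseval: int |q_t(x-y) - q_t(x'-y)|^2 dy
    #   = 2 pi sum_k exp(-2 k^2 t) |e^{ikx} - e^{ikx'}|^2,
    # with |e^{ikx} - e^{ikx'}|^2 = 2 - 2 cos(k (x - x')); k and -k contribute equally.
    return 2.0 * math.pi * sum(
        2.0 * math.exp(-2.0 * k * k * t) * (2.0 - 2.0 * math.cos(k * (x - xp)))
        for k in range(1, kmax + 1))

print(heat_kernel_fourier(0.7, 0.1), heat_kernel_periodized(0.7, 0.1))
print(diffusion_distance_sq(0.0, 0.1, 0.5), diffusion_distance_sq(0.0, 1.0, 0.5))
```

The Fourier form converges fast for large $t$ and the periodized Gaussian converges fast for small $t$, which is the practical reason to keep both expressions around.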

6 Fast algorithms and potential theory in scientific computing

Wilbur Cross Lecture by Leslie Greengard.

Fast, robust algorithms for engineering and applied physics are necessary. They should be designed so that precision is a "knob you can turn." Also, you want them to be automatically adaptive. That is, you want to be able to refine and then repeat the calculation.

"Tool-building" perspective: a hierarchy of tools from the most basic linear algebra modules to the full application. Out there in the world, there's either MATLAB, or a specific application. We'd like to change that.

Integral equation (Green's function) methods. We like these integral equation methods because they're good at handling complicated geometry. You need a description of the geometry and that becomes your discretization; you don't need to build a mesh. The integral equations are as well-conditioned as the underlying physics allows. But in the absence of fast algorithms they're intractable. And they need significant "quadrature design" (integrands are not smooth).

What are Green's functions? Diffusion: $G(x, y, t) = e^{-\|x-y\|^2/4t}/(4\pi t)^{d/2}$. Gravitation: $1/\|x - y\|$, etc.

Integral equations are data driven: if I want to solve the electrostatic problem with a bunch of point sources,

$$U(S) = \sum_j \frac{q_j}{\|S - S_j\|},$$

I don't need a mesh, I need to compute this sum quickly. That's not just particle interactions. Even for a continuous problem, discretizing the differential equation yields a sparse linear system.

$$\Delta U(x) = \rho(x) \ \text{in } R, \qquad U(x) = f(x) \ \text{on } S,$$

where $S$ is the boundary and $R$ is the system. The integral formulation is

$$U(x) = V[\rho](x) + D[\mu](x), \qquad V[\rho](x) = \int_R \frac{\rho(y)}{\|x - y\|}\,dy.$$

Problem 1: Dense Linear Algebra. Consider $Y = AX$, $Y_n = \sum_m A_{nm} X_m$. It takes $N^2$ operations. To solve, that takes $N^3$ operations using Gaussian elimination. Can we do this faster? Take

$$A_{nm} = \cos(t_n - s_m) = \cos(t_n)\cos(s_m) + \sin(t_n)\sin(s_m),$$

so first compute $W_1 = \sum_m \cos(s_m) X_m$ and $W_2 = \sum_m \sin(s_m) X_m$. Then let $Y_n = \cos(t_n) W_1 + \sin(t_n) W_2$.

Fast Gauss Transform:

$$Y_n = \sum_m e^{-|t_n - s_m|^2/4T} X_m.$$

It turns out that this function has a decomposition into Hermite functions times polynomial moments. It is just a Taylor expansion about the center, in fact, but we have explicit functions. This is rapidly decaying, and $T$ controls the decay. If we only want 14 digits, we know how far we have to go to make it safe. Just compute moments and build an expansion.

N-body interactions:

$$U(T_k) = \sum_j \frac{q_j}{\|T_k - S_j\|}.$$

Do you have $nm$ work (where there are $n$ sources and $m$ targets)? This can be written as a multipole expansion, with moments given by sums

$$M_n^m = \sum_j q_j\,Y_n^{-m}(\theta_j, \phi_j)\,r_j^n,$$

where the $Y$'s are spherical harmonics. Truncate that at $p$ terms, and show the error decays geometrically, like $Q\,(R/D)^p$, where $D$ is the distance to the nearest target. This is the origin of the fast multipole methods.

Instead of computing the N-body calculation directly, we compute the moments $M_n^m$ up to order $p$, that's $O(Np^2)$ work, and evaluate the expansion at each target, that's $O(Mp^2)$. And then combine them. FMM uses various length scales to cluster.

Pictorially: imagine you're a box in the middle. There are some boxes a reasonable distance from you and you want to compute interactions. You can't interact with your nearest neighbors. Chop up every box, and you're left with a region of the children of the parent's nearest neighbors. $O(N)$ calculations at the finest level, $O(N \log N)$ work overall. Performance is independent of the distribution. The more clustered it is, the better it performs.

Modern FMMs: waves, Laplace equation, etc. Cylindrical waves, plane waves, solid harmonics, etc. Suppose you want the solution to the Poisson equation $\Delta U = f$. Standard methods try to do this with meshes. "Volume potential FMMs" can do it faster than the FMM handles point interactions.

Applications: capacitance, inductance, full wave analysis for chips. Also quantum chemistry, molecular dynamics, astrophysics. Thermal modeling of fuel cells; lots of holes; 530,000 degrees of freedom. Simulation can have direct impact on engineering and applied problems.
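The rank-two cosine kernel from Problem 1 above is the simplest instance of the idea behind these fast methods: compute a few moments of the sources once, then evaluate cheaply at every target. A minimal sketch follows; the function names and the sample data are hypothetical.

```python
import math

def dense_apply(t, s, x):
    # Direct evaluation of Y_n = sum_m cos(t_n - s_m) X_m: O(N * M) work.
    return [sum(math.cos(tn - sm) * xm for sm, xm in zip(s, x)) for tn in t]

def fast_apply(t, s, x):
    # Two moments of the sources, then a rank-2 evaluation: O(N + M) work.
    w1 = sum(math.cos(sm) * xm for sm, xm in zip(s, x))
    w2 = sum(math.sin(sm) * xm for sm, xm in zip(s, x))
    return [math.cos(tn) * w1 + math.sin(tn) * w2 for tn in t]

# Hypothetical targets, sources and source strengths.
targets = [0.1 * n for n in range(50)]
sources = [0.37 * m for m in range(40)]
strengths = [math.sin(m) for m in range(40)]
y_dense = dense_apply(targets, sources, strengths)
y_fast = fast_apply(targets, sources, strengths)
print(max(abs(a - b) for a, b in zip(y_dense, y_fast)))
```

Here the separation of variables is exact. For the Gaussian and $1/r$ kernels it is only approximate, which is where the truncation order $p$ and the error estimate enter.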

7 Lecture 6

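$(I - P_\epsilon) f = \epsilon\Delta f + O(\epsilon^{3/2})$. This came from the Taylor formula. For that to hold you need, for instance, third derivatives to be bounded. Write $P_\epsilon f = (I - \epsilon\Delta) f + R_\epsilon f$ with residual $R_\epsilon$. Then

$$P_{1/n}^n f = \Big[\Big(I - \frac{1}{n}\Delta\Big) + R_{1/n}\Big]^n f = \Big(I - \frac{1}{n}\Delta\Big)^n f + \rho_n f, \qquad \rho_n \to 0,$$

using the inequality that if $\|A\| \le 1$ then

$$\|(A + B)^n - A^n\| \le n\,\|B\|\,(1 + \|B\|)^{n-1},$$

with $B = R_{1/n}$ and $\|B\| = O(n^{-3/2})$. All inequalities are okay provided they are restricted to band-limited functions for $\Delta$. What that means is:

$$H_M = \Big\{\sum_{l:\,\lambda_l \le M^2} \alpha_l \phi_l\Big\}, \qquad \Delta\phi_l = \lambda_l \phi_l.$$

The point of band-limited functions is that they're differentiable to all orders. The constant in $O(\epsilon^{3/2})$ is dominated by $M^4$. If we fix $M$ and let $\epsilon \to 0$ then this goes to 0.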

If I'm on the circle and I'm looking at the discrete operator and want to compare it to the integral operator, there's going to be an error. If I find eigenvalues of the discrete operator, I want to relate them to the eigenvalues of the continuous operator. And that's good so long as $\ell < 1/\delta$, where $\delta$ is the spacing and $\ell$ is the index of the eigenvector. (Personal note: Nyquist sampling rate? Is this what it is?)

$$P_{1/n}^n f = \sum_\ell \alpha_\ell \Big(1 - \frac{1}{n}\lambda_\ell\Big)^n \phi_\ell + \rho_n f, \qquad \sum_\ell \alpha_\ell \Big(1 - \frac{1}{n}\lambda_\ell\Big)^n \phi_\ell \to \sum_\ell \alpha_\ell\,e^{-\lambda_\ell} \phi_\ell = e^{-\Delta} f,$$

and the error $\rho_n(f)$ does converge to 0. This is not completely obvious (the bit about the error) on the sphere or other non-circle things. You do get a bound, and it doesn't matter how big that bound is. So we're saying, if $f \in H_M$,

$$P_{1/n}^n f \to e^{-\Delta} f$$

as $n \to \infty$. To what extent are the eigenfunctions of $P_{1/n}^n$ close to the eigenfunctions of the Laplacian? Do we have convergence?

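NO. NOT EVEN ON THE CIRCLE. The functions $\alpha\cos n\theta + \beta\sin n\theta$ with $\alpha^2 + \beta^2 = 1$ all have the same eigenvalue. (So there's more than one eigenvector for each eigenvalue.) So you don't necessarily get a limit: you can't expect the eigenvectors to converge, but the eigenspaces converge. The eigenvalues will be okay once you organize them in decreasing order. So will the eigenprojections.

$$A_n = \pi_M P_{1/n}^n \pi_M \to \pi_M e^{-\Delta} \pi_M = A_0.$$

Exercise: prove this. $\max_{|x|=1} A_n x \cdot x = \lambda_0^n$, the top eigenvalue. And pick the next highest to get the next highest eigenvalue, etc.

We claim the eigenvector $\phi_\ell$ is equal to

$$\phi_\ell^0 + \sum_{m \ne \ell} \frac{\langle R\phi_\ell, \phi_m^0\rangle}{\lambda_\ell - \lambda_m^0}\,\phi_m^0,$$

where $R$ is the residual. If the gap is not large, this error term is not great. If there's multiplicity, then you need to replace $R\phi_\ell$ by the projection on the corresponding space.

$A\phi_\ell = (A_0 + R)\phi_\ell$. Here $A\phi_\ell = \lambda_\ell \phi_\ell$, so $(\lambda_\ell I - A_0)\phi_\ell = R\phi_\ell$. Expand

$$R\phi_\ell = \sum_m \langle R\phi_\ell, \phi_m^0\rangle\,\phi_m^0, \qquad \phi_\ell = \sum_m \langle \phi_\ell, \phi_m^0\rangle\,\phi_m^0.$$

So

$$\phi_\ell = \langle \phi_\ell, \phi_\ell^0\rangle\,\phi_\ell^0 + \sum_{m \ne \ell} \frac{\langle R\phi_\ell, \phi_m^0\rangle}{\lambda_\ell - \lambda_m^0}\,\phi_m^0.$$

Look over this again! The goal was to show that defining those Markov operators on a discrete set of points isn't crazy.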

7.1 And now for something completely different.

Problem: gravitational force from multiple point masses. $10^{20}$ interactions from $10^{10}$ points. Very big problem. But the interaction between the Earth and the Moon doesn't depend on how many particles are in the Earth or the Moon. That's the whole reasoning behind fast multipole methods; if two clusters of diameter $D$ are separated by more than $D$, you can calculate the interaction while ignoring the number of points. The only thing that matters is how the points are distributed.

We have a kernel $k(x_i, x_j)$, the interaction between the two points. In fact, we have a matrix $a(i, j)$. How do we organize the points through their mutual interactions? Organize the columns and rows. Unpermute everything.

Suppose I have a matrix that has only one row. Possibly Gaussian noise. But it's very easy to make this function smooth: just reorder the values, largest to smallest. A monotone function can be written as

$$f(x) = \int_0^x \tilde{f}(t)\,dt + \int_0^x d\nu(t),$$

an integrable function plus a singular function. The support of a singular measure can be covered by arbitrarily small intervals. You need a Calderon-Zygmund decomposition, but you can prove that this is a B.V. function, modulo an arbitrarily small set. Given any $\lambda > 0$, we can write $f = g_\lambda + b_\lambda$, where $|g_\lambda(x) - g_\lambda(y)| \le \lambda\,|x - y|$ ("good function") and $b_\lambda(x)$ is supported on a set of measure less than or equal to $c/\lambda$.

We want to do the same thing with a matrix. With the columns, simultaneously, and then symmetrically with the rows. Exercise: prove the Rising Sun Lemma. We want the geometry to emerge.

8 Lecture 7

From last time: think of the data as a random function on $[0, 1]$ (or a discrete collection of samples). Is there a way of reorganizing it so that this is smooth? A permutation? Organize it in increasing order. An increasing function can be replaced by a function that has a derivative. Pick a slope $\lambda$; what regions are in the "shadow" (portions of the function where the slope is steeper than that)? Everything "in the sun" has a smaller slope. Every shadow region corresponds to an interval, with

$$\frac{f(\beta_k) - f(\alpha_k)}{\beta_k - \alpha_k} = \lambda$$

if $\alpha_k$ and $\beta_k$ are the endpoints of the shadow interval. If you replace $f$ with the polygonal line that has no "shadow," you get a new function called $g_\lambda(x)$. Then $F = g_\lambda(x) + b_\lambda(x)$, where $b_\lambda$ is the residual. The support of the residual is just the union of the "shadow" intervals $I_k$. Since $\lambda(\beta_k - \alpha_k) = F(\beta_k) - F(\alpha_k)$,

$$\sum_k |I_k| \le \frac{\max F - \min F}{\lambda} \qquad \text{and} \qquad g_\lambda'(x) \le \lambda.$$

So this is a decent BV function formed just from binning.
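The good part $g_\lambda$ of the rising sun construction can be computed on samples with one pass from the right. This is a discrete sketch under assumptions: the data are given as increasing sample points `xs` with values `fs`, and the shadow is taken to be the set of samples from which some later sample is seen at slope greater than $\lambda$. The function name `rising_sun` and the sample data are hypothetical.

```python
def rising_sun(xs, fs, lam):
    """Return (g, shadow): g equals f in the sun and has slope lam in the shadow."""
    n = len(xs)
    g = [0.0] * n
    shadow = [False] * n
    best = float("-inf")   # running max of f(y) - lam * y over samples y >= x
    for i in range(n - 1, -1, -1):
        h = fs[i] - lam * xs[i]
        if h < best:
            shadow[i] = True   # a later sample lies above the line of slope lam
        else:
            best = h
        g[i] = lam * xs[i] + best
    return g, shadow

# Hypothetical sorted data with one steep jump between the 4th and 5th samples.
xs = [i / 9.0 for i in range(10)]
fs = [0.0, 0.01, 0.02, 0.03, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55]
g, shadow = rising_sun(xs, fs, lam=1.0)
print(shadow)
print(g)
```

By construction $g(x) - \lambda x$ is non-increasing, so every difference quotient of $g$ is at most $\lambda$, and $g$ agrees with $f$ at every sample that is in the sun.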

We need to learn to do this for vector-valued functions. We want to organize the vectors in $\mathbb{R}^n$ so as to have a good part with a small slope and a bad part supported on a small set of intervals. This is the Riesz decomposition or the Calderon-Zygmund decomposition in one variable.

Now we define a new, more extensible distance on $[0, 1]$. Define the dyadic tree obtained by binary splits. Define $d_T(x, y)$ as the length of the smallest dyadic interval containing both. You can get unlucky if you're on either side of a branch, so it's not Euclidean distance. You can assign a weight to each node of this tree, monotone as you go farther down the tree. Any collection of weights which are monotone can be assigned to be the weight of the smallest dyadic interval containing a pair. This is a large family of distances. Advantages over Euclidean distances: fast to compute, especially in high dimension. You can approximate the Euclidean distance by a combination of tree distances: instead of always splitting into two halves, make lopsided trees as well, and then average the distances over a collection of trees.

$f$ on $[0, 1]$ is Hölder or Lipschitz of order $\alpha$ if $|f(x) - f(y)| \le C\,d(x, y)^\alpha$. Before, we claimed we had a bounded derivative on the good function.

Consider the functions $\chi_{[0,1]}$ etc. that are characteristic functions of dyadic intervals, $\chi(2^\ell x - j)$. These are orthogonal for fixed $\ell$ because they have disjoint support. Define $V_\ell$ as the span of the $\chi(2^\ell x - j)$. These, by the way, are always Hölder with respect to $d_T$. The projection onto $V_\ell$ is

$$P_\ell f = \sum_j \langle f, 2^{\ell/2}\chi(2^\ell x - j)\rangle\,2^{\ell/2}\chi(2^\ell x - j) = \sum_j \Big(\frac{1}{|I_j^\ell|} \int_{I_j^\ell} f(x)\,dx\Big)\,\chi(2^\ell x - j).$$

If $\chi = \chi_0 + \chi_1$ splits an interval into its two halves, then $\chi_0 - \chi_1$ is orthogonal to $\chi$. These are the Haar functions $h_j^\ell = 2^{\ell/2} h(2^\ell x - j)$. They form an orthonormal basis of $L^2(0, 1)$. If $P_\ell$ is the projection onto $V_\ell$ then $\|P_\ell f - f\|_p \to 0$, and the projection can be expanded in the Haar functions. This shows that linear combinations of Haar functions are dense.

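The following is completely WRONG for Fourier series, but works for Haar. Suppose $|f(x) - f(y)| \le d_T^\alpha(x, y)$. Then $|\langle f, h_{I_j^\ell}\rangle| \le C\,|I_j^\ell|^{1/2+\alpha}$. The Haar function is $|I|^{-1/2}$ on the left half $I_+$ of the interval and $-|I|^{-1/2}$ on the right half $I_-$. For any constant $c$,

$$\langle f, h_I\rangle = \frac{1}{|I|^{1/2}} \Big(\int_{I_+} f(y)\,dy - \int_{I_-} f(y)\,dy\Big) = \frac{1}{|I|^{1/2}} \Big(\int_{I_+} (f(x) - c)\,dx - \int_{I_-} (f(x) - c)\,dx\Big),$$

and altogether this is less than $\frac{1}{|I|^{1/2}} \int_I |f(x) - c|\,dx$. Taking $c$ to be a value of $f$ on $I$ bounds the integrand by $|I|^\alpha$, which gives the claim.

The size doesn't depend on $j$, but does depend on $\ell$. We claim the converse is true: if

$$f(x) = \sum_{\ell, j} \alpha_j^\ell\,h_j^\ell(x)$$

with $|\alpha_j^\ell| \le C\,|I_j^\ell|^{1/2+\alpha}$, then $|f(x) - f(y)| \le C'\,d_T(x, y)^\alpha$. That is, the coefficients can tell us if the function is going to be Hölder. Synthesize the function at $x$ and at $y$. The only Haar functions we'll need are those supported inside the smallest interval containing $x$ and $y$. Tabulating a function by its Haar coefficients is incredibly efficient (compared to Fourier coefficients). Complete the proof as an exercise.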
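The Haar coefficients of sampled data can be computed by repeatedly taking normalized sums and differences of neighboring averages. The sketch below is a discrete orthonormal Haar transform on $2^L$ samples; the function names are hypothetical, and the sampled linear function is only an illustration of the decay of coefficients from one level to the next for a Lipschitz function ($\alpha = 1$).

```python
import math

def haar_forward(samples):
    """Orthonormal discrete Haar transform of 2^L samples."""
    averages = list(samples)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details.append([(a - b) / math.sqrt(2.0) for a, b in pairs])
        averages = [(a + b) / math.sqrt(2.0) for a, b in pairs]
    # Scaling coefficient, then detail coefficients ordered coarse to fine.
    return averages[0], details[::-1]

def haar_inverse(scaling, details):
    averages = [scaling]
    for level in details:
        finer = []
        for s, d in zip(averages, level):
            finer.append((s + d) / math.sqrt(2.0))
            finer.append((s - d) / math.sqrt(2.0))
        averages = finer
    return averages

samples = [i / 16.0 for i in range(16)]   # f(x) = x sampled on [0, 1)
scaling, details = haar_forward(samples)
reconstructed = haar_inverse(scaling, details)
# For f(x) = x, coefficients shrink by 2^(3/2) per level, matching |I|^(1/2 + 1).
print(abs(details[-2][0]) / abs(details[-1][0]))
```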

9 Lecture 8

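Define the Haar functions: $h(x) = 1$ for $0 < x < 1/2$ and $h(x) = -1$ for $1/2 < x < 1$. The functions $2^{m/2} h(2^m x - j)$ for $m \le \ell - 1$ and $j \le 2^m - 1$, together with the constant function, are an orthonormal basis of $V_\ell$. The $h_j^\ell$ are the basis of $W_\ell$. Write

$$V_{\ell+1} = V_\ell + W_\ell, \qquad P_{\ell+1} - P_\ell = \Delta_\ell,$$

the orthogonal projection on $W_\ell$:

$$\Delta_\ell f = \sum_j \langle f, h_{I_j^\ell}\rangle\,h_{I_j^\ell}.$$

Then

$$f = \sum_\ell \Delta_\ell f + P_0 f, \qquad \text{since} \qquad \sum_\ell (P_{\ell+1} - P_\ell) f = P_\infty f - P_0 f$$

and $P_\infty f = f$.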

Telescoping series. We defined the dyadic tree metric $d_T(x, y)$ between two points to be the length of the smallest dyadic interval containing both $x$ and $y$. With shifted trees $d_\tau(x, y) = d_T(x - \tau, y - \tau)$,

$$\int_0^1 d_\tau(x, y)^\alpha\,d\tau \approx |x - y|^\alpha$$

for $0 < \alpha < 1$. For $\alpha = 1$, unknown. We proved that $|f(x) - f(y)| \le c\,d_T^\alpha(x, y)$ when

$$f(x) = \sum_\ell \sum_{|I| = 2^{-\ell}} d_I\,h_I(x), \qquad d_I = \langle f, h_I\rangle, \qquad |d_I| \le C\,|I|^{1/2+\alpha}.$$

I only sum over the intervals that contain $x$, not anywhere else. The average of $f$ over $I$ is $1/2$ times the sum of the average of $f$ over $I_-$ and the average of $f$ over $I_+$, so on $I$

$$(P_{\ell+1} - P_\ell)(f) = [m_{I_-}(f) - m_I(f)]\,\chi_{I_-} + [m_{I_+}(f) - m_I(f)]\,\chi_{I_+} = \tfrac{1}{2}\,[m_{I_+}(f) - m_{I_-}(f)]\,(\chi_{I_+} - \chi_{I_-}).$$

You have to follow the path that leads down to $x$ and $y$. To compute a function at a given point from the Haar coefficients, I just follow the path and add the coefficients. The number of coefficients is the number of levels. If the function satisfies a Hölder condition, and you know the function along the path down to the bottom level, you can just resynthesize the function from samples. Store the differences as Haar coefficients. The differences on the next level are Haar coefficients on the next finer level.

Suppose you've filled the tree with data: data and differences between neighboring points. Suppose we have a matrix mapping $L : \mathbb{R}^N \to \mathbb{R}^N$. Write $L$ as

$$L = \sum_k (P_{k+1} L P_{k+1} - P_k L P_k) + P_0 L P_0$$

just by telescoping. Each term is

$$(P_{k+1} - P_k) L P_{k+1} + P_k L P_{k+1} - P_k L P_k = \Delta_k L P_k + \Delta_k L \Delta_k + P_k L \Delta_k.$$

Three matrices; each moves one scale to itself. This completely decouples the scales. Then you sum over the scales. The original matrix becomes a block matrix in which each scale is mapped into itself and they're completely independent.

How do you build a dyadic tree on a square? Four corners (subsquares), then 16 subsquares, etc.: a quadtree. $V_\ell$ is of dimension $4^\ell$, and $P_\ell(f) = \sum_Q m_Q(f)\,\chi_Q(x)$.

10 Lecture 9

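Once again, we're on the hierarchical tree. $d_T(x, y)$ is the length of the smallest dyadic interval $I$ such that $x, y \in I$, and $|f(x) - f(y)| \le C\,d_T(x, y)^\alpha$. Suppose $|x - y| \approx 2^{-m}$. Then

$$\int_0^1 d_T(x - \tau, y - \tau)^\alpha\,d\tau \approx |x - y|^\alpha$$

for $\alpha < 1$. If we look at shifted trees, two nearby points will have small distance in most of the shifted trees (though in a few shifted trees the distance may be large, if the points fall on either side of a boundary). The points are separated at level $\ell$ for a fraction of about $2^\ell\,2^{-m}$ of the shifts, so the average is about

$$\sum_{\ell < m} 2^{-\ell\alpha}\,2^\ell\,2^{-m} = 2^{-m} \sum_{\ell=0}^{m} 2^{\ell(1-\alpha)},$$

which is $\approx 2^{-\alpha m}$ if $\alpha < 1$, $\approx m\,2^{-m}$ if $\alpha = 1$, and $\approx 2^{-m} \approx |x - y|$ if $\alpha > 1$. For $\alpha < 1$, the averaged sum converges to the conventional distance to the power $\alpha$; otherwise not.

If

$$f = \sum_I \langle f, h_I\rangle\,h_I,$$

then we claim $|f(x) - f(y)| \le d_T(x, y)^\alpha$ if $|\langle f, h_I\rangle| \le |I|^{1/2}\,|I|^\alpha$. First note that if $f$ is Hölder with exponent $\alpha$ in general, then it's automatically "Hölder" with respect to the tree and all shifted versions of the tree. Conversely, if a function satisfies the condition $|f(x) - f(y)| \le C\,d_T(x - \tau, y - \tau)^\alpha$ for all shifted trees, then integrating over $\tau$ gives

$$|f(x) - f(y)| \le C \int_0^1 d_T^\alpha(x - \tau, y - \tau)\,d\tau \approx C\,|x - y|^\alpha.$$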
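The averaging over shifted trees can be tried numerically. The sketch below assumes the shifts wrap around, so $[0, 1)$ is treated as a circle, and it replaces the integral over $\tau$ by a midpoint Riemann sum. The function names and the parameters `max_level` and `shifts` are hypothetical.

```python
import math

def tree_distance(x, y, max_level=40):
    """Length of the smallest dyadic interval [j/2^l, (j+1)/2^l) containing x and y."""
    level = 0
    while (level < max_level
           and math.floor(x * 2 ** (level + 1)) == math.floor(y * 2 ** (level + 1))):
        level += 1
    return 2.0 ** (-level)

def shifted_average(x, y, alpha, shifts=4096):
    # Midpoint Riemann sum for int_0^1 d_T(x - tau, y - tau)^alpha d tau.
    total = 0.0
    for k in range(shifts):
        tau = (k + 0.5) / shifts
        total += tree_distance((x - tau) % 1.0, (y - tau) % 1.0) ** alpha
    return total / shifts

x, y = 0.3, 0.3 + 2.0 ** -8
ratio = shifted_average(x, y, alpha=0.5) / abs(x - y) ** 0.5
print(tree_distance(0.49, 0.51), ratio)
```

The points 0.49 and 0.51 are an unlucky pair: they are close, but the first split separates them, so their tree distance is 1. After averaging over shifts, the ratio to $|x - y|^\alpha$ stays bounded, as the sum above predicts for $\alpha < 1$.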

If it's Hölder for all shifted trees, then it's Hölder in the usual sense; and not otherwise.

Example. Assume

$$\frac{1}{|I|} \int_I |f(x) - m_I(f)|\,dx \le |I|^\alpha,$$

where $m_I$ is the mean value on the interval. This is the mean oscillation. $I$ is any interval, not necessarily dyadic. Then the claim is that this implies $|\langle f, h_I\rangle| \le |I|^{1/2+\alpha}$. To check this,

$$\int f(x)\,h_I(x)\,dx = \int_I (f(x) - m_I(f))\,h_I(x)\,dx,$$

since the average value of $h_I$ is 0 on the interval, and so

$$|\langle f, h_I\rangle| \le \frac{1}{|I|^{1/2}} \int_I |f - m_I f|\,dx \le |I|^{1/2+\alpha}.$$

Back to basics. Partition tree: we broke up the interval into unions of subsets.

Back to the circle. Take a periodic function of period $2\pi$; you can write every such function in terms of its Fourier series.

$$u(\theta, t) = A_t(f) = \sum_{k=-\infty}^{\infty} e^{-tk^2} \hat{f}_k e^{ik\theta} = e^{t\frac{d^2}{d\theta^2}} f = \frac{1}{2\pi} \int_{-\pi}^{\pi} g_t(\theta)\,f(\psi - \theta)\,d\theta,$$

where

$$g_t(\theta) = \sum_k e^{-tk^2} e^{ik\theta} = c_t \sum_j e^{-(\theta - 2\pi j)^2/4t},$$

a sum of Gaussian kernels: a periodized Gaussian kernel. This is the dear old Jacobi theta function. This is the closest to an analytic expression you're ever gonna get.

$$P_t(f) = \sum_k \hat{f}_k\,e^{-|k|t} e^{ik\theta} = \sum_k r^{|k|} \hat{f}_k e^{ik\theta} = \frac{1}{2\pi} \int_{-\pi}^{\pi} P_r(\theta)\,f(\psi - \theta)\,d\theta$$

with $r = e^{-t}$, where

$$P_r(\theta) = \frac{1 - r^2}{|1 - e^{i\theta} r|^2}.$$
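The closed form of the Poisson kernel can be checked against its Fourier series $\sum_k r^{|k|} e^{ik\theta}$. A small sketch, with hypothetical function names and a truncation parameter `kmax`:

```python
import math

def poisson_series(theta, r, kmax=400):
    # sum_k r^{|k|} e^{ik theta} = 1 + 2 sum_{k>=1} r^k cos(k theta)
    return 1.0 + 2.0 * sum(r ** k * math.cos(k * theta)
                           for k in range(1, kmax + 1))

def poisson_closed(theta, r):
    # (1 - r^2) / |1 - r e^{i theta}|^2, with |1 - r e^{i theta}|^2 = 1 - 2 r cos(theta) + r^2
    return (1.0 - r * r) / (1.0 - 2.0 * r * math.cos(theta) + r * r)

for r in (0.5, 0.9):
    for theta in (0.0, 1.0, 3.0):
        print(r, theta, poisson_series(theta, r), poisson_closed(theta, r))
```

The series form also makes the semigroup property visible: composing the multipliers $r^{|k|}$ and $s^{|k|}$ gives $(rs)^{|k|}$, which is $P_t P_s = P_{t+s}$ with $r = e^{-t}$, $s = e^{-s}$.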

$P_t$ is the Poisson semigroup, and $A_t$ is the diffusion semigroup. Why semigroup? $P_t P_s = P_{t+s}$ and $A_t A_s = A_{t+s}$. Construct rings in the circle, of different levels of coarseness: $B_k = P_{1-2^{-k}}$. Then $|B_k f - B_{k+1} f| \le 2^{-k\alpha}$ is equivalent to $|f(x) - f(y)| \le |x - y|^\alpha$; here, the $B_k$ are the Poisson kernels $P_{1-2^{-k}}$. An estimate of the variation of the function in the radial variable can give us the boundary function; this is a theme from antiquity.

We started with a function on the line, made it "nice" by averaging it on different interval scales, and did the analysis on the multi-scale tree.

Poisson semigroup: $\Delta u(x, y) = 0$, $u(x, 0) = f(x)$,

$$u(x, y) = \frac{1}{\pi} \int \frac{y\,f(t)}{(x - t)^2 + y^2}\,dt = \int e^{ix\xi}\,e^{-|\xi| y}\,\hat{f}(\xi)\,d\xi.$$

Diffusion semigroup:

$$\frac{1}{\sqrt{t}} \int e^{-|x-y|^2/t} f(y)\,dy.$$

In both cases, we have a fixed $\phi(x)$, integrable, with $\int \phi(x)\,dx = 1$, and $\phi_\lambda(x) = \frac{1}{\lambda}\phi(x/\lambda)$. In the Poisson case, $\phi = \frac{1}{\pi}\,\frac{1}{1 + x^2}$ and $\lambda = y$. For the diffusion, $\phi = e^{-|x|^2}$ (up to normalization) and $\lambda = \sqrt{t}$.

$$A_\lambda(f)(x) = \int \phi_\lambda(u)\,f(x - u)\,du$$

is the general rule. It doesn't matter what $\phi$ is; it can be a characteristic function. If $\phi = \chi_{[-1/2, 1/2]}$ then

$$A_\lambda(f) = \frac{1}{\lambda} \int_{|u| < \lambda/2} f(x - u)\,du.$$

In general, $|A_\lambda f - A_{\lambda/2} f| \le c\lambda^\alpha$ is equivalent to $|f(x) - f(y)| \le |x - y|^\alpha$. We already proved this for the case where $\phi$ is the characteristic function. The claim is that this is true for all possible $\phi$.

11 Lecture 10

On $\mathbb{R}^1$, let $\int \phi(x)\,dx = 1$, $\phi_\lambda(x) = \frac{1}{\lambda}\phi(x/\lambda)$, and

$$A_\lambda(f) = \int f(x - t)\,\phi_\lambda(t)\,dt.$$

Take $\phi(t) = 1$ on $[-1/2, 1/2]$ and zero elsewhere. Then

$$A_\lambda(f) = \frac{1}{\lambda} \int_{|t| < \lambda/2} f(x - t)\,dt,$$

averaging on an interval. Other kernels:

Gaussian: $\frac{1}{\lambda}\,e^{-t^2/\lambda^2}$

Poisson: $\frac{1}{\pi}\,\frac{\lambda}{t^2 + \lambda^2}$

These allow you to, e.g., solve the heat equation at time $t$: $A_\lambda(f) = u(x, \lambda^2)$, the solution of the heat equation at time $\lambda^2$.

What does this have to do with the tree basis? Assign the function $f$ the average value on each interval of the partition. Analyzing how the average changes as we go vertically tells us something about the smoothness of the original function. Two discretizations: on the one hand, we discretized the intervals; on the other hand we also discretized the $\lambda$'s, to be $2^{-n}$.

The Haar expansion on the tree was just taking the differences of adjacent averages on each level. Regularity was just measured by the size of the differences of the averages. Generalizing this principle,

$$\Delta_\lambda f = A_\lambda(f) - A_{\lambda/2}(f) = \int \psi_\lambda(t)\,f(x - t)\,dt$$

or

$$Q_\lambda f = \lambda\,\frac{d}{d\lambda} A_\lambda f = \int \psi_\lambda(t)\,f(x - t)\,dt,$$

where the former $\psi_\lambda$ is actually just $\phi_\lambda - \phi_{\lambda/2}$, and the latter, $\frac{1}{\lambda}\psi(t/\lambda)$, is $\lambda\,\frac{d}{d\lambda}\big[\frac{1}{\lambda}\phi(t/\lambda)\big]$.

If $|\Delta_\lambda f| \le \lambda^\alpha$, then $|f(x) - f(y)| \le |x - y|^\alpha$, and vice versa, as we showed last time.

Theorem: we claim this is true in general. Assume $\psi(x)$ is integrable or is a measure with compact support. Could be like $\delta_0 - \delta_1$ or $\delta_{-1} + \delta_{+1} - 2\delta_0$. Also assume

$$\int_0^\infty |\hat\psi(\xi)|^2\,\frac{d\xi}{|\xi|} > 0.$$

Let

$$Q_\lambda(f)(x) = \int_{-\infty}^{\infty} f(x - t)\psi_\lambda(t)\,dt = \int_{-\infty}^{\infty} f(x - \lambda t)\psi(t)\,dt.$$

Then $|Q_\lambda(f)| \le \lambda^\alpha$ iff $|f(x) - f(y)| \le c|x - y|^\alpha$.

Take $d\psi = \delta_{+1} + \delta_{-1} - 2\delta_0$. Then $Q_\lambda(f) = f(x + \lambda) + f(x - \lambda) - 2f(x)$. This implies $|f(x + \lambda) - f(x)| \le \lambda^\alpha$. (Note: this approach is due to Calderón; $Q_\lambda(f)$ is called the continuous wavelet transform.) What we're doing is convolving the function with a wavelet at different locations and scales.

Build a function $\rho(\xi)$ that is identically 1 on an interval and zero outside an open interval around it. Now define $\hat\eta(\xi) = \hat\psi(\xi)\rho(\xi)\,c_\pm$, chopped off outside a certain range, where $c_+$ is a constant for positive $\xi$ and $c_-$ is a constant for negative $\xi$:

$$\int_0^\infty |\hat\psi(t)|^2 \rho^2(t)\,\frac{dt}{t} = 1/c_+, \qquad \int_0^\infty |\hat\psi(-t)|^2 \rho(t)\,\frac{dt}{t} = 1/c_-.$$

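So this function is compactly supported in the Fourier domain. So we can write

$$f(x) = \int_0^\infty \eta_\lambda * u_\lambda\,\frac{d\lambda}{\lambda}$$

with $u_\lambda(x) = Q_\lambda(f)(x) = u(x, \lambda)$. You cannot reconstruct $f$ the usual way, by dividing by $\hat\psi$, since it vanishes, which is why we do it as above.

Assume that $|u(x, \lambda)| \le c\lambda^\alpha$, for some arbitrary function $u$. Then

$$f(x) = \int_0^\infty \eta_\lambda * u(\cdot, \lambda)\,\frac{d\lambda}{\lambda}$$

satisfies $|f(x) - f(y)| \le |x - y|^\alpha$. Take $|x - y| = r$. Then the part of $|f(x) - f(y)|$ coming from scales $\lambda > r$ is

$$\int_{\lambda > r} \frac{1}{\lambda}\left[\eta\left(\frac{x - u}{\lambda}\right) - \eta\left(\frac{y - u}{\lambda}\right)\right] u(\cdot, \lambda)\,\frac{d\lambda}{\lambda},$$

which is bounded using the difference $\eta\left(\frac{x - y - s}{\lambda}\right) - \eta\left(\frac{s}{\lambda}\right)$, integrated against $\frac{1}{\lambda}\,ds$.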
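As a small illustration of the second-difference wavelet $d\psi = \delta_{+1} + \delta_{-1} - 2\delta_0$ from the theorem above, the sketch below evaluates $Q_\lambda(f)(x) = f(x + \lambda) + f(x - \lambda) - 2f(x)$ at a few scales and reads off the decay exponent. The two test functions and the name `second_difference` are assumptions chosen for illustration.

```python
import math


def second_difference(f, x, lam):
    """Q_lambda f(x) for the measure delta_{+1} + delta_{-1} - 2 delta_0."""
    return f(x + lam) + f(x - lam) - 2.0 * f(x)


def holder_half(x):
    # Holder of order 1/2 at the origin.
    return math.sqrt(abs(x))


def smooth(x):
    # Twice differentiable, so Q_lambda decays like lambda^2.
    return x * x


def decay_exponent(f, x, lam):
    """Estimate alpha from |Q_lambda f| ~ lambda^alpha using scales lam and lam/2."""
    q1 = abs(second_difference(f, x, lam))
    q2 = abs(second_difference(f, x, lam / 2))
    return math.log(q1 / q2) / math.log(2.0)


rough_alpha = decay_exponent(holder_half, 0.0, 0.01)
smooth_alpha = decay_exponent(smooth, 0.3, 0.01)
print(rough_alpha, smooth_alpha)
```

For $f(x) = |x|^{1/2}$ at the origin, $Q_\lambda f = 2\lambda^{1/2}$ exactly, so the estimated exponent is $1/2$; for $x^2$ it is 2.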

12

Lecture 11

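What did we show? We showed that if $\psi$ is compactly supported with mean 0, and $\psi_\lambda(x) = \frac{1}{\lambda}\psi(x/\lambda)$, then $|\psi_\lambda * f| \le c\lambda^\alpha$ is equivalent to $|f(x) - f(y)| \le C|x - y|^\alpha$.

Let

$$f(x) = \int_0^\infty \int e^{ix\xi} \hat f(\xi)\hat\psi(\xi\lambda)\,d\xi\,\frac{d\lambda}{\lambda}$$

provided $\int_0^\infty \hat\psi(\xi\lambda)\,\frac{d\lambda}{\lambda} = 1$. Write

$$f(x) = \int_0^\infty u(x, \lambda)\,\frac{d\lambda}{\lambda}.$$

Replacing $u(x, \lambda)$ by $u(x, \lambda)\lambda^{-\beta}$, write

$$D^\beta f = \int_0^\infty \lambda^{-\beta} u(x, \lambda)\,\frac{d\lambda}{\lambda} = \int \hat f(\xi) e^{ix\xi} \left[\int_0^\infty \hat\psi(\xi\lambda)(|\xi|\lambda)^{-\beta}\,\frac{d\lambda}{\lambda}\right] |\xi|^\beta\,d\xi = \int \hat f(\xi)|\xi|^\beta e^{ix\xi}\,d\xi,$$

up to a constant factor coming from the bracketed integral. This is the fractional derivative of $f$: multiplication by $\xi$ is what $d/dx$ becomes under the Fourier transform. You look at the wavelet coefficients $\psi_\lambda * f$ of the function and multiply by $\lambda^{-\beta}$, which is like differentiating $\beta$ times. $|u(x, \lambda)\lambda^{-\beta}| \le \lambda^{\alpha - \beta}$ if $|u(x, \lambda)| \le \lambda^\alpha$. So the new function is Hölder if the old function was Hölder. Only the exponent changes.

Take a function $f$ on $[0, 1]$. Look at

$$f = \sum_I \langle f, h_I \rangle h_I(x).$$

If $|\langle f, h_I \rangle| \le C|I|^\alpha$ then $|f(x) - f(y)| \le d_T^\alpha(x, y)$ in the dyadic tree metric $d_T$. Then

$$D^\beta f = \sum_I \frac{1}{|I|^\beta} \langle f, h_I \rangle h_I(x) = g, \qquad |g(x) - g(y)| \le d_T^{\alpha - \beta}(x, y):$$

you make it worse by exactly $\beta$ when you differentiate $\beta$ times.

Littlewood-Paley function (with $d_I = \langle f, h_I \rangle$):

$$S_2(x) = \left(\sum_I d_I^2\,\frac{\chi_I(x)}{|I|}\right)^{1/2}.$$

Measures "activity" around $x$. It's like an envelope around $f$. One of its properties is that if you integrate $S_2^2$,

$$\int S_2^2 = \int \sum_I d_I^2\,\frac{\chi_I(x)}{|I|}\,dx = \sum_I d_I^2 = \int f^2(x)\,dx.$$

It measures all the coefficients around $x$. In general $\|S_2\|_p \approx \|f\|_p$, $1 < p < \infty$. Generalized:

$$S_p(x) = \left(\sum_I |d_I|^p\,\frac{\chi_I(x)}{|I|}\right)^{1/p}.$$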
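The identity $\sum_I d_I^2 = \int f^2$ can be checked on samples. The sketch below is a discrete stand-in, assuming $f$ is sampled at $2^n$ equally spaced points and using the orthonormal discrete Haar transform on the sample vector; the function names and the test signal are illustrative, not from the lecture.

```python
import math

SQRT_HALF = 1.0 / math.sqrt(2.0)


def haar_transform(values):
    """Orthonormal Haar transform of a list of length 2^n.

    Returns (scaling_coefficient, details) where details[0] is the coarsest
    level (one coefficient) and details[-1] the finest (len(values) / 2).
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = [(averages[2 * i], averages[2 * i + 1]) for i in range(len(averages) // 2)]
        details.insert(0, [(a - b) * SQRT_HALF for a, b in pairs])
        averages = [(a + b) * SQRT_HALF for a, b in pairs]
    return averages[0], details


def inverse_haar(scaling, details):
    """Synthesis: walk from the root down, adding and subtracting differences."""
    averages = [scaling]
    for level in details:
        finer = []
        for a, d in zip(averages, level):
            finer.append((a + d) * SQRT_HALF)
            finer.append((a - d) * SQRT_HALF)
        averages = finer
    return averages


samples = [math.sqrt(abs(i / 16 - 0.3)) for i in range(16)]
scaling, details = haar_transform(samples)
rebuilt = inverse_haar(scaling, details)

energy_signal = sum(v * v for v in samples)
energy_coeffs = scaling ** 2 + sum(d * d for level in details for d in level)
print(energy_signal, energy_coeffs)
```

The two printed energies agree up to rounding, which is the discrete form of integrating $S_2^2$.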

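Take $f \in B_1$, the space of $f = \sum d_I h_I$ with $\|f\|_{B_1} = \sum |d_I|$. It seems there's no regularity here. But if we write $f$ as $\sum d_I \cdot |I|^{1/2} \cdot |I|^{-1/2} h_I$ we can show something. If $\sum |d_I| \le 1$ then given any $\lambda > 0$, $f = g_\lambda + b_\lambda$ where $|g_\lambda(x) - g_\lambda(y)| \le \lambda\,d_T(x, y)^{1/2}$ and the support of $b_\lambda$ has measure less than $1/\lambda$. Sparsity implies smoothness outside of a small set!

Let $E_\lambda = \{s_1(x) > \lambda\}$ where $s_1(x) = \sum_I |d_I|\,\chi_I(x)/|I|$. Then

$$|E_\lambda| \le \int \frac{s_1(x)}{\lambda}\,dx = \frac{\sum |d_I|}{\lambda}.$$

Define

$$g_\lambda(x) = \sum_{I \cap E_\lambda^c \ne \emptyset} d_I h_I(x).$$

For these $I$, $|d_I| \le \lambda |I| = \lambda |I|^{1/2 + 1/2}$. The other $I$'s are completely contained in $E_\lambda$, which is bigger than the support of $b_\lambda$.

Take $f = \sum d_I h_I(x)$. Look at

$$D^{1/2} f = \sum_I \frac{d_I}{|I|^{1/2}} h_I, \qquad |D^{1/2} f| \le \sum_I |d_I|\,\frac{\chi_I}{|I|},$$

which is integrable if $\sum |d_I| < \infty$. We're using the fact that $|h_I| = \chi_I/|I|^{1/2}$.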

Let $K \subset \mathbb{R}^n$. Let's try to build a geometry on the points so that all the coordinates of the points satisfy this kind of condition. Take a small discretization scale $\epsilon$. Cover the set of points by a maximal collection of balls with diameter $\epsilon$ which cover the set. Any point in the set is at distance $\le \epsilon$ from some $x_i$, the center of a ball. Replace this by a partition: assign to each point the nearest $x_i$. This is called a Voronoi diagram. Then connect the cells into supercells (all the cells bordering a cell). This creates a tree. $|x - y| \le d_T(x, y)$ by definition! I can evaluate how good my tree is by how small the coefficients are. What if I'm in a high dimensional space? And we want to organize coordinates *and* data?
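A rough sketch of the bottom-up construction just described, under simplifying assumptions: the centers are chosen greedily (a maximal $\epsilon$-separated set), points are assigned to the nearest center, and the next level is built by repeating the same step on the centers at scale $2\epsilon$ instead of merging bordering Voronoi cells. The function names and the sample grid are hypothetical.

```python
import math


def greedy_net(points, eps):
    """Maximal eps-separated subset: every point ends up within eps of some center."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def assign_to_nearest(points, centers):
    """Voronoi-style partition: index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k])) for p in points]


# Illustrative data: a 10 x 10 grid in the unit square.
points = [(i * 0.1, j * 0.1) for i in range(10) for j in range(10)]
eps = 0.25

centers = greedy_net(points, eps)               # level-l folders
labels = assign_to_nearest(points, centers)     # which folder each point falls in
coarse = greedy_net(centers, 2 * eps)           # level-(l-1) folders
parents = assign_to_nearest(centers, coarse)    # tree edges between the two levels
print(len(points), len(centers), len(coarse))
```

Repeating the doubling step until a single center remains gives the whole tree; the ratio of folder counts between levels is what the balanced-tree condition controls.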

13

Lecture 11

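$$\left(\int_0^x \int_0^t \int_0^s f(u)\,du\,ds\,dt\right)''' = f(x):$$

integrate and differentiate; fundamental theorem. For the $n$-fold integral $I^n$,

$$\left[\frac{1}{(n-1)!} \int_0^x (x - a)^{n-1} f(a)\,da\right]^{(n)} = f(x).$$

Define the fractional integral

$$I_\alpha f(x) = \frac{1}{\Gamma(\alpha)} \int_0^x (x - u)^{\alpha - 1} f(u)\,du.$$

And define

$$D^\alpha f(x) = C_\alpha \int_{-\infty}^{\infty} \frac{\operatorname{sgn}(x - y)}{|x - y|^\alpha} f'(y)\,dy = c_\alpha \int e^{ix\xi} |\xi|^\alpha \hat f(\xi)\,d\xi.$$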
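A quick numerical check of the fractional integral $I_\alpha$, as a sketch only: it uses a plain midpoint rule (an assumption, chosen for brevity) and compares against the closed form $I_\alpha 1 = x^\alpha / \Gamma(\alpha + 1)$. The midpoint rule is only accurate here for $\alpha \ge 1$, where the kernel $(x - u)^{\alpha - 1}$ is bounded.

```python
import math


def fractional_integral(f, x, alpha, n=20000):
    """I_alpha f(x) = 1/Gamma(alpha) * integral_0^x (x - u)^(alpha - 1) f(u) du.

    Midpoint rule; intended for alpha >= 1, where the kernel has no singularity.
    """
    h = x / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        total += (x - u) ** (alpha - 1) * f(u)
    return total * h / math.gamma(alpha)


def one(u):
    return 1.0


x = 1.5
twice = fractional_integral(one, x, 2.0)      # ordinary double integral: x^2 / 2
thrice = fractional_integral(one, x, 3.0)     # triple integral: x^3 / 6
one_and_half = fractional_integral(one, x, 1.5)
print(twice, thrice, one_and_half)
```

For integer $\alpha = n$ this reduces to the Cauchy formula for the $n$-fold integral, since $\Gamma(n) = (n-1)!$.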

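Special cases: $\alpha = 0$ means $|D|^\alpha f = f$. $\alpha = 1$ means $|D|^\alpha f = \int \frac{1}{x - y} f'(y)\,dy$, a Hilbert transform.

$$f(x) = \int e^{ix\xi} \hat f(\xi)\,d\xi, \qquad L(e^{ix\xi}) = \hat L(\xi) e^{ix\xi},$$

$$\frac{1}{i}\frac{d}{dx} e^{ix\xi} = \xi e^{ix\xi}, \qquad F\left(\frac{1}{i}\frac{d}{dx}\right) e^{ix\xi} = F(\xi) e^{ix\xi}.$$

Now back to where we were. If $f$ is expanded in a Haar series

$$f = \sum_I \langle f, h_I \rangle h_I + f_0 h_0, \qquad \Delta_l f = (E_{l+1} - E_l) f = \sum_{|I| = 2^{-l}} \langle f, h_I \rangle h_I.$$

Now we define a regularity operator

$$D f = \sum_l 2^l \Delta_l(f).$$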

Last time: if $\sum |d_I| < 1$ where $f = \sum d_I h_I$ then $D^{1/2} f$ is integrable. Half a derivative is in $L^1$ when the coefficients are sparse.

We say that $X$ is a balanced partition tree if for each $l$, $X = \cup_j X_j^l$ with $X_j^l \cap X_{j'}^l = \emptyset$ for $j \ne j'$, every $X_j^l$ is contained in some $X_{j'}^{l-1}$, and the ratio of the numbers of folders on successive levels is bounded between constants.

Write

$$f = \sum_l (E_{l+1} - E_l) f + E_0 f = \sum_l \Delta_l(f) + E_0 f$$

where ∆l are orthogonal projections on Wl . No Haar functions here. Haar transform is just the diﬀerence between coeﬃcients of successive levels. To each node it assigns its average, and assign to each edge the diﬀerence between the two nodes. Where do you have large coeﬃcients? small coeﬃcients? The synthesis is just integration. Suppose the function is noisy. When I do the averages I repress a lot of the noise. So we can resynthesize the function with less noise. If |f (x) − f (y)| ≤ dT (x, y)α , then |∆l f | ≤ Clα

14

Lecture 12

Let $X$ be a balanced partition tree, $X = \cup_k X_k^l$ with the $X_k^l$ disjoint, and every $X_k^l$ contained in some $X_{k'}^{l-1}$. We have the condition that

$$\delta |X_{k'}^{l-1}| \ge |X_k^l| \ge c |X_{k'}^{l-1}|$$

uniformly over the whole tree. So the folder size decays exponentially, but not too fast. A space of homogeneous type is a metric space which is also a measure space such that balls satisfy $|B_{r,x}| \ge c|B_{2r,x}|$. Given a function defined on $X$ I can find various tree transforms of the function $f$. Transform defined on the edges: differences of samples at the endpoints. Synthesis is integration, or addition of differences along a path. $|f(x) - f(y)| \le d_T(x, y)^\alpha$ is equivalent to

$$|d_k^l| \le |X_k^l|^\alpha$$

by definition, since $d_T(x, y) = |X_k^l|$ for the smallest folder $X_k^l$ containing both $x$ and $y$.

Return to $[0, 1] \times [0, 1]$. The square. We organize functions of two variables $f(x, y)$. For instance, a document database, with words and documents. Suppose I have $10^5$ words and $10^5$ documents. Can I build a table for all of this that has only $10^5$ numbers? Can I compress? Build a conceptual tree of words, and a tree of documents. If there is any justice, things will go together. We want to jointly rearrange the whole system. We know that you can reorder a one-d function to make it smooth, and make it Hölder plus a small set. We want to do something similar on a function of two variables. Permuting the vectors in one dimension and then in the other? We want to find a geometry on the columns and rows so the function will be as smooth as possible in both the $x$ and $y$ variables. We also want to do this efficiently.

Somebody gave you two trees: a dyadic tree in one direction and a dyadic tree in the other. Define the notion of regularity as follows. I have a Haar system $h_I(x)$ and a Haar system $h_J(y)$. First write

$$f(x, y) = \sum_I d_I(y) h_I(x)$$

and then write

$$d_I(y) = \sum_J d_{I,J} h_J(y), \qquad \text{so} \qquad f(x, y) = \sum_{I,J} d_{I,J} h_I(x) h_J(y).$$

Look at four points, $x_0, x_1, y_0, y_1$, the corners of a rectangle $R$:

$$\Delta_R^2 f = f(x_0, y_1) - f(x_1, y_1) - f(x_0, y_0) + f(x_1, y_0).$$

Like a second derivative: $\frac{\partial^2 f}{\partial x \partial y}\left(\tfrac{1}{2}(x_0 + x_1), \tfrac{1}{2}(y_0 + y_1)\right)|R|$. A function $f$ is said to be bi-Hölder with exponent $\alpha$ if $|\Delta_R^2 f| \le C|R|^\alpha$.

Theorem: if $\sum |d_R| \le 1$, then for any $\lambda > 0$, $f = g_\lambda + b_\lambda$ where $|\Delta_R^2 g_\lambda| \le \lambda |R|^{1/2}$ and the support of $b_\lambda$ has measure $\le 1/\lambda$. The classical version of this in one dimension would be the Rising Sun Lemma. The multivariate classical analogue is unproven.

Example: suppose this were actually a matrix, and the matrix is not full. Missing values. The missing value can be given by the value just below, plus the y-derivative (taken from the y-distance of adjacent values), plus an error of the order of $|R|^\alpha$. Estimating a missing value from nearby values. You can only do that if the function is bi-Hölder. This could be, for instance, a recommendation engine.
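The missing-value example can be sketched directly from the mixed difference. Setting $\Delta_R^2 f \approx 0$ and solving for one corner gives the estimate below; the helper names and the additive test table are assumptions for illustration. For a table of the form $a(x) + b(y)$ the mixed difference vanishes, so the estimate is exact; for a general bi-Hölder function the error is of order $|R|^\alpha$.

```python
def mixed_difference(f, x0, x1, y0, y1):
    """Delta^2_R f = f(x0, y1) - f(x1, y1) - f(x0, y0) + f(x1, y0)."""
    return f(x0, y1) - f(x1, y1) - f(x0, y0) + f(x1, y0)


def estimate_missing(table, i, j):
    """Estimate table[i][j] from the other three corners of the rectangle
    with rows i-1, i and columns j-1, j (mixed difference set to zero)."""
    return table[i - 1][j] + table[i][j - 1] - table[i - 1][j - 1]


# Illustrative additive table: entry = row effect + column effect.
row_effect = [0.0, 1.5, 2.0, 4.0]
col_effect = [10.0, 10.5, 13.0]
table = [[r + c for c in col_effect] for r in row_effect]

true_value = table[2][1]
guess = estimate_missing(table, 2, 1)

# For f(x, y) = x * y the mixed difference equals -(x1 - x0) * (y1 - y0), i.e. -|R|.
product_diff = mixed_difference(lambda x, y: x * y, 0.2, 0.5, 0.1, 0.4)
print(true_value, guess, product_diff)
```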


(Note: this procedure can be extended to truly sparse matrices to do recommendation engines properly.) We'll prove the theorem later. But it says that functions that are well represented in the basis have Hölder regularity, and the converse is also true. The set of samples we need is the set of all centers of dyadic rectangles. You don't need more than $k 2^k$ centers if $2^{-k}$ is your smallest size.
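To make the $k 2^k$ count concrete, here is a small sketch that counts dyadic rectangles $I \times J$ in the unit square with area at least $2^{-m}$, assuming $|I| = 2^{-i}$ and $|J| = 2^{-j}$ with $i + j \le m$. The function name and the comparison with a full $2^m \times 2^m$ grid are illustrative assumptions.

```python
def count_dyadic_rectangles(m):
    """Number of dyadic rectangles I x J with |I| = 2^-i, |J| = 2^-j and i + j <= m.

    There are 2^i intervals of length 2^-i, so each pair (i, j) contributes 2^(i + j).
    """
    return sum(2 ** i * 2 ** j for i in range(m + 1) for j in range(m + 1 - i))


m = 10
sparse_count = count_dyadic_rectangles(m)   # grows like m * 2^m
full_count = 4 ** m                         # full grid at resolution 2^-m in both variables
print(sparse_count, full_count)
```

With `m = 10` the sparse grid has 20481 centers against 1048576 for the full grid, which is the saving the bi-Hölder condition buys.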

15

Lecture 13

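$$|f(x, y) - f(x', y) - f(x, y') + f(x', y')| \le d_1(x, x')^\alpha\,d_2(y, y')^\alpha$$

Equivalent to a condition on coefficients. With $h_R(x, y) = h_I(x) h_J(y)$,

$$f(x, y) = \sum_I d_I(y) h_I(x) = \sum_{I,J} d_{I,J} h_I(x) h_J(y) = \sum_R d_R h_R.$$

Claim:

$$\frac{1}{|I|^{1/2}} |d_I(y) - d_I(y')| \le |I|^\alpha d_2(y, y')^\alpha$$

using the same theorem as always. Then

$$\frac{1}{|I|^{1/2} |J|^{1/2}} |\langle d_I(y), h_J(y) \rangle| \le |J|^\alpha |I|^\alpha;$$

note that all of this is independent of dimension! Recall that $|g(x) - g(x')| \le d_T(x, x')^\alpha$ is equivalent, for $g(x) = \sum d_I h_I$, with $\frac{1}{|I|^{1/2}} |d_I| \le c|I|^\alpha$.

Truncate at scale $\epsilon = 2^{-m}$:

$$f(x, y) = \sum_{|R| > \epsilon} d_R h_R + e_\epsilon,$$

$$|e_\epsilon| \le \sum_{|R| \le \epsilon} |d_R|\,\frac{\chi_R(x, y)}{|R|^{1/2}} \le \sum_{|R| \le \epsilon} |R|^\alpha \chi_R(x, y) \le \sum_{l \ge m} l\,2^{-l\alpha} \sim m\,2^{-m\alpha}.$$

Theorem: If $f$ is bi-Hölder with exponent $\alpha$, and $f(x_i, y_i)$ is known on a sparse grid corresponding to rectangles of area $\ge \epsilon$, then $f$ can be approximated to error $\epsilon^\alpha \log(1/\epsilon)$.

Exercise: prove Smolyak's theorem.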

16

Lecture 13

Questionnaires and questions, $X_p$ and $Y_q$, in a matrix.

$$X = \cup_{k=0}^{m_l} X_k^l$$

The $X_k^l$ are disjoint in $k$, and $X_k^{l+1}$ is contained in $X_{k'}^l$ for some $k'$. A sparse grid is just a selection of a single point $x_k^l \in X_k^l$. Define an approximation $P_l(f)$ to be the sampling of the function at that level. $\chi_k^l$ is the characteristic function of $X_k^l$.

$$P_l(f) = \sum_k f(x_k^l)\,\chi_k^l(x)$$

$$|f(x) - f(x')| \le d_T(x, x')^\alpha$$

where $d_T$ is the size of the smallest $X_k^l$ containing $x, x'$.

$$f = \sum_l \left[P_{l+1}(f) - P_l(f)\right] + P_0(f) = \sum_l \sum_k \left[f(x_k^{l+1}) - f(x_{k'}^l)\right] \chi_k^{l+1}(x) + f(x_0^0) = \sum_l \sum_k \delta_k^l\,\chi_k^{l+1} + f(x_0^0)$$

$f$ is Hölder iff $|\delta_k^l| \le C|X_k^l|^\alpha$.

Now, back to the questionnaires. Say you have a tree $X_k^l$ on the questions and $Y_j^m$ on the respondents.

$$f(x, y) = \sum_{l,k} \sum_{m,j} \chi_k^{l+1}(x)\,\chi_j^{m+1}(y) \left[f(x_k^{l+1}, y_j^{m+1}) - f(x_{k'}^l, y_j^{m+1}) - f(x_k^{l+1}, y_{j'}^m) + f(x_{k'}^l, y_{j'}^m)\right] = \sum \delta_{k,j}^{l,m}\,\chi_{k,j}^{l,m}(x, y)$$

Number of rectangles such that $|R| > \epsilon$: about $\frac{1}{\epsilon} \log(1/\epsilon)$, which is the number of terms in $\sum_{|R| > \epsilon} d_R h_R$.

17

Lecture 14

Find a tree on each dimension of the 2-d matrix. If you had a function that was bi-Hölder, you could sample it more sparsely, and reconstruct it from mixed derivatives. If you were to take a Euclidean distance hierarchical quantization tree, so that each row is Lipschitz relative to that tree, you can organize the tree on the other dimension with respect to that organization. The size of the constant depends on how much one dimension depends on the other. You can't necessarily do it in high dimension.

Fundamental question: You have a collection of functions on a population. Want to organize the analysis of the population so that the organization will be as smooth as possible. Want to be able to say something about values of functions in the data.

Suppose you look at an atlas. Points are points on the globe. Each point has a collection of numbers attached to it. Could also have a demographic or political profile. Every location has a collection of functions attached to it. If I want to organize an atlas (a collection of maps on the globe), you build a tree. Globe, continents, countries, etc. A different atlas for climate. Variability in climate is dominated by latitude. It's a different geometry than Euclidean distance.

Consider the spiral. $S(x, y)$ is distance along the curve. This is LOUSY in terms of Euclidean coordinates. You need intrinsic coordinates. Or consider $\sin\left(\frac{1}{x + \delta}\right)$. It's bad: it oscillates a lot. But if you look at it as a function on the graph of itself, it's nice. $|d/ds(\sin x(s))| \le 1$. Lipschitz! You can map the spiral to a curve in 2 dimensions, via the arclength parametrization.

More general situation:

$$X_i \to \tilde x_i^{(t)} = \left(\lambda_1^t \phi_1(x_i), \lambda_2^t \phi_2(x_i), \ldots, \lambda_N^t \phi_N(x_i)\right)$$

$$\tilde x_i \cdot \tilde x_j = \sum_l \lambda_l^{2t} \phi_l(x_i)\phi_l(x_j) = A^{2t}(i, j) - \phi_0(x_i)\phi_0(x_j)$$

$$\|\tilde x_i - \tilde x_j\|^2 = A^{2t}(i, i) + A^{2t}(j, j) - 2A^{2t}(i, j) = \sum_k |A^t(i, k) - A^t(j, k)|^2$$

Geometric interpretation: link each point to its neighbors; can link to higher order by taking the adjacency matrix to a power. Distance is diffusion.
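The identity between the two forms of the diffusion distance holds for any symmetric matrix $A$ and can be checked on a toy example. The sketch below assumes a 4-node path graph with self-loops at the ends (a hypothetical choice that keeps $A$ symmetric) and uses plain nested lists for the matrix algebra.

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matrix_power(a, t):
    """a raised to the integer power t >= 0."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, a)
    return result


def diffusion_distance_sq(a, i, j, t):
    """sum_k |A^t(i, k) - A^t(j, k)|^2."""
    at = matrix_power(a, t)
    return sum((at[i][k] - at[j][k]) ** 2 for k in range(len(a)))


def kernel_form(a, i, j, t):
    """A^{2t}(i, i) + A^{2t}(j, j) - 2 A^{2t}(i, j); equals the above when a is symmetric."""
    a2t = matrix_power(a, 2 * t)
    return a2t[i][i] + a2t[j][j] - 2.0 * a2t[i][j]


# Symmetric affinity matrix of a 4-node path graph (illustrative).
A = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]

near = diffusion_distance_sq(A, 0, 1, 1)   # neighboring nodes
far = diffusion_distance_sq(A, 0, 3, 1)    # opposite ends of the path
print(near, far)
```

At time $t = 1$ the two ends of the path are farther apart than two neighbors, as the geometric interpretation suggests.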

Distance $d_1 = d_1(\tilde x_i, \tilde x_j)$, just nearest neighbors. Everything that is not a neighbor is at distance 2. Shrink this down into the first embedding, and then again take a maximal subcollection such that $d_2(x_i, x_j) \ge 1$. These are the distances after time two. Doing exactly what we were doing in the Euclidean case, but the distance at different levels is different. So you can view this as a tree of points. Each folder is points that are linked at the scale of the folder. Probabilistic interpretation: the probability that you've diffused out that distance by that time. For small $t$, the folder is a small spherical cap, and may as well be geodesic. Large $t$ is not. For example, think of a dumbbell: diffusion distance across the neck is large, and much smaller within each bell.

18

Lecture 14

Pick a black spot. The organization of local patches is naturally parametrized by their average and the orientation of the edge. The "folders" actually extract portions of the curve on the edge. Started with small folders and then agglomerated them bigger and bigger.

Alternatively, organize a domain in the plane by breaking it into partitions. How do we form the partitions? The first eigenvector gives the direction of greatest variance, and that forms the direction of the dividing line. Divide and divide again. It will generate regions which are as "fat" as possible. This is a top-down hierarchical construction of the tree. Another method: sampling the data at finer and finer points. This is the bottom-up way.

Suppose you have points on a line. $f(x_i)$ known. Want to extend the function to every other point. Think of this as a classifier. $d_I = \langle f, h_I \rangle$, $f(x) = \sum d_I h_I$. Look at all possible expansions $\sum a_I h_I(x_i)$ which agree with the given $f(x_i)$. Best fit. Minimize $\sum |a_I|$. Want your function described in the simplest, sparsest way you can. Easiest way to extrapolate: $\sum f(x_i)\chi_{I_i}(x)$, extrapolate with a constant. But this is not necessarily sparse or simple.

Exercise: I have a function which takes only two values: $S_1$ and $S_2$. I have three possible functions in which to represent it. I have intervals $I_1$, $I_2$, and $I_1 \cup I_2$. Want to represent $f = \alpha_1 \chi_{I_1} + \alpha_2 \chi_{I_2} + \alpha_3 \chi_I$, where $I = I_1 \cup I_2$. Want to minimize $2|\alpha_1| + 2|\alpha_2| + |\alpha_3|$. Hint: pick the mean value of $S_1, S_2$ on the full interval, and pick the correction. Haar expansion. That should be better.

In a sense we're also imposing smoothness when we impose sparsity. A minimal representation of $f$ based on characteristic functions will be defined everywhere. It's an extrapolation. But I want the extrapolation to be as consistent and as smooth as possible. Haar expansions, recall, do NOT satisfy standard properties of Fourier expansions. As we've seen, though, the fact that $\sum |a_I|$ is finite means that $f$ is Hölder of order 1/2, except for a set of small measure.

Classifier on zip codes. You build a graph, then a tree, then a function on that tree. Fit the Haar coefficients to the samples. This gets you to state-of-the-art classification error.

Let's go back to the questionnaire. People vs. questions, and the function for the depression score, known at $d(p_i)$, and we can predict it for new people. Get a $\bar d$, the simplest organization of the depression score based on the data we have. Candidate score. I can add this as an extra question now! I can give this question a very large weight. Now two people who used to be close will be farther; and two people who used to be farther become closer. It will reorganize the tree geometry of people. This then changes the relationship between questions.

Consider bumps: $e^{-|x-j|^2/2}$. What is the class of functions that can be represented as $\sum_j \alpha_j e^{-|x-j|^2/2}$? These functions are far from being orthogonal. But I can find the representation which is simplest: minimize $\sum |\alpha_k|$. If I have a function in this space, what is a good grid of $x_k$'s to sample so that I'll be able to reconstruct exactly? Define $g(\theta) = \sum_k e^{ik\theta} e^{-k^2}$. Look at all the shifts:

$$\int g(\theta - \psi)\alpha(\psi)\,d\psi = F_\alpha(\theta).$$

What is the dimension of $F$? Up to a certain degree of precision. The answer is trivial:

$$F_\alpha(\theta) = \sum_k \hat\alpha_k e^{-k^2} e^{ik\theta}.$$

Well, if $|k| \le 5$, we'll have an error of $O(e^{-24})$. The dimension of this is 11. Eleven numbers tell me practically everything. The local rank of the projection on this Gaussian-bell space is the same as on the circle. It's not an infinite dimensional collection; it's very low rank.

19

Lecture 15

Let $\bar f(x) = \sum_i f(x_i)\chi_{I_i}(x)$. Just take step functions from the given points. If $|f(x) - f(y)| \le d_T^\alpha(x, y)$ then $\bar f$ has the same property.

Suppose instead we take the triangle function, whose height is equal to $|I|$, the length of the interval, around each point $x_i$. More generally, take points $x_i^l$, centers of dyadic intervals, and take $\phi\left(\frac{x - x_i^l}{|I^l|}\right)$. Then

$$\sum f(x_i^l)\,\phi\left(\frac{x - x_i^l}{|I_i^l|}\right)$$

is another way of interpolating. The derivative of the triangle function $\Delta$ is a (rescaled) Haar function:

$$\Delta'\left(\frac{x - x_i^l}{|I^l|}\right) = |I|^{-1/2} h_{I^l}.$$

Suppose I have a function represented as

$$\bar f(x) = \sum a_i^l\,\phi\left(\frac{x - x_i^l}{\epsilon_i^l}\right), \qquad \bar f'(x) = \sum a_i^l\,\phi'\left(\frac{x - x_i^l}{\epsilon_i^l}\right)\frac{1}{\epsilon_i^l}.$$

And the integral

$$\int |\bar f'(x)|\,dx \le \left(\sum |a_i^l|\right)\left(\int |\phi'(t)|\,dt\right).$$

So if $\sum |a_i^l| \le 1$ then $f$ is of bounded variation. If you're given a $\phi$ and all possible scales, try to find a representation for $f$ which minimizes the $\ell^1$ norm of the coefficients. Take the orthogonal expansion, and this will give you the minimum.

Note: Hardy spaces are defined by the fact that every function has a decomposition into slightly more general functions. This is a simple version.

Now we want to get off the line. Suppose we have $e^{-|x - \alpha|^2}$. What is the dimension of this collection of functions? Suppose $|\alpha| \le 1$, $|x| \le 1$. $\lambda_0 > \lambda_1 > \ldots > 10^{-10}$. This gives about 10 digits. It's infinite dimensional, but up to good precision it's almost finite dimensional. The $\lambda_k = e^{-k^2}$, so since this decays so fast, 10 digits is plenty.

$$f(x) = \sum a_i e^{-|x - \alpha_i|^2}$$

The function that you measure might be a bandlimited function. I can project into the space of bandlimited functions:

$$\sum_{k : \lambda_k > 10^{-10}} \langle f(x), \phi_k \rangle \phi_k.$$

But what if I want to represent it in the original kernel?

$$\phi_k = \frac{1}{\lambda_k} \int e^{-|x - \alpha|^2} \phi_k(\alpha)\,d\alpha$$

But if I could write the integral as a discrete sum, we'd be ok: $f$ would be represented in the kernel. It'll be overkill (too many coefficients) but it works. Need about 30 grid points $\alpha$ to get the same error.

What's really going on here? I have a matrix $e^{-|x - y|^2}$, $x$ and $y$ in a dense collection of points, $(x, y) \in \Gamma \subset \mathbb{R}^n$. But we know the matrix has low rank! This is really a Gaussian. In each variable, this operator is a Gaussian in that variable. In each variable, the Fourier modes look like $e^{-z^2}$. So the rank is fixed. It does not exceed the minimal number of balls of radius $\epsilon$ you need to cover the area.

How do I subsample this matrix in a way that guarantees I'll find samples which cover the whole range of the matrix? The theorem (Rokhlin, Tygert, Martinsson) applies when you have a matrix $a_{ij}$ of large size but low rank. The reduction is basically: encode the rows (or columns) of this matrix in a random code. Take a random vector of plus-minus ones. We're building a random matrix $\epsilon_1, \ldots, \epsilon_L$, where $L$ is the rank. Take the inner product of the matrix with the code. Orthogonalize the rows of the resulting matrix. We build a dictionary that way. The rows you select at every step are the ones far away from the preceding ones. "Far away" means projecting in the orthogonal direction. So you've selected $L$ points: $L$ rows in the original matrix. This is a way to select a subset of rows in the matrix which span the same space as the original matrix. Equivalently, this picks the correct $\alpha_i$. (It's not the same as compressed sensing, but it is a projection pursuit, and it's in the same "spirit.")

But how do you compute the coefficients?

$$\sum_i e^{-|x - \alpha_i|^2} a_i = f(x)$$

Normally I would do it by integration, but I don't want to integrate. I want to, for instance, find $a_i$ so that $\sum |a_i|$ is minimal:

$$\min\left(\left\|f(x) - \sum_i a_i e^{-|x - \alpha_i|^2}\right\|^2 + \mu \sum_i |a_i|\right).$$

Penalize error and complexity. Or, instead of working with a Gaussian kernel at scale 1, rescale it to be half as wide:

$$\sum_i a_i e^{-4|x - \alpha_i|^2}.$$

Sharper Gaussian: use the points you already have and augment them with the next generation of points. The region affected shrinks. I can break up the space into boxes, and add points box by box: it's parallelizable, so to speak. Describe $f$ as being in the range of a coarse Gaussian, plus the range of a finer Gaussian, plus the range of an even finer Gaussian, etc.

What's the point? We had data; we built a graph. We can look at the eigenvectors of this graph, and embed into a Euclidean space. The problem is those were eigenvectors of a matrix defined on the data. If I take a new point, where do I map it? I don't know.
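The row-selection procedure described above can be sketched in a few lines. This is only an illustration in the spirit of the description (random plus-minus-one codes followed by pivoted Gram-Schmidt on the encoded rows); it is not claimed to be the published algorithm, and the function name, the optional `codes` argument, and the tolerance are assumptions.

```python
import random


def select_rows(matrix, rank, codes=None, seed=0, tol=1e-9):
    """Pick up to `rank` row indices whose span covers the row space of `matrix`.

    Each row is encoded by its inner products with +-1 code vectors; rows are
    then chosen greedily, always taking the encoded row with the largest
    component orthogonal to the rows already chosen.
    """
    n_cols = len(matrix[0])
    if codes is None:
        rng = random.Random(seed)
        codes = [[rng.choice((-1.0, 1.0)) for _ in range(n_cols)] for _ in range(rank)]
    # Short sketch of every row: one number per code vector.
    residual = [[sum(r * c for r, c in zip(row, code)) for code in codes] for row in matrix]
    chosen = []
    for _ in range(rank):
        norms = [sum(v * v for v in r) for r in residual]
        best = max(range(len(residual)), key=norms.__getitem__)
        if norms[best] <= tol:
            break  # everything left is (numerically) in the span of the chosen rows
        chosen.append(best)
        pivot = list(residual[best])
        for r in residual:
            coeff = sum(a * b for a, b in zip(r, pivot)) / norms[best]
            for k in range(len(r)):
                r[k] -= coeff * pivot[k]
    return chosen


# Illustrative rank-2 matrix: every row is a combination of two fixed vectors.
base_a = [1.0, 2.0, 3.0, 4.0, 5.0]
base_b = [1.0, 0.0, -1.0, 0.0, 1.0]
low_rank = [[(i + 1) * base_a[k] + (i % 3 - 1) * base_b[k] for k in range(5)] for i in range(6)]

picked = select_rows(low_rank, rank=3)
print(picked)
```

Asking for more rows than the true rank is harmless in this sketch: the loop stops once no encoded row has a significant component left.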

If I were to take those eigenvectors f , as before, I’d have an extension to the rest of the universe. For example: consider the circle. There are Gaussian extensions of cos(θ), cos(2θ), . . .. The higher the frequency, the closer the peaks are to approximating the circle. A highly oscillating function will not extend much beyond the circle.

20

Lecture 16

Given a graph which is a curve, when you embed it into the first two eigenvectors, it's mapped to a circle. The first two eigenvectors are $\phi_1 = \cos\theta$, $\phi_2 = \sin\theta$. All the others are $\cos k\theta$, $\sin k\theta$, where the $k$ are integers. They correspond to $\lambda_k = e^{-k^2}$. Suppose $F(x) = \cos 8\theta$. Can I find a function $f(x) = \int_{|\xi| \le 16} e^{ix\xi}\eta(\xi)\,d\xi$ such that the $L^2$ norm of $f$ is minimal? In other words, is there a band-limited, minimal norm extension of $f(x)$ beyond the circle? On the other hand, if we look at $\cos 100\theta$, we can't have it band-limited by 16.

What we've proved: if $\Delta\psi = \lambda^2\psi$ then

$$\psi(x) = \int_{|\xi| \le c\lambda} e^{ix\xi}\eta(\xi)\,d\xi,$$

where the $c$ depends on how the manifold is sitting in the surrounding ambient space. You can always approximate it by a bandlimited function whose band is bounded by a multiple of the square root of the eigenvalue. You can also take a bandlimited function, restrict it to the manifold, and approximate it by eigenvectors of the Laplacian. How far you can extend depends on how wiggly the eigenfunction is, roughly. How it's embedded.

We have data points and outside points. So we're going to build two graphs. I have a million points, say. Let's randomly select 10,000 points. Pick those points which are selected randomly as a reference library. Every patch in 121 dimensions can be compared to one of them. A library of representatives. I have a metric $\|x - y_i\|_i^2 = \Omega_i(x - y_i) \cdot (x - y_i)$ for some $\Omega_i > 0$.

$$\Omega(x) = \sum_i e^{-|x - y_i|_i^2}, \qquad \langle f, g \rangle_{\mathbb{R}^d} = \int f(x) g(x)\,\Omega^2(x)\,dx$$

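$$\langle f, g \rangle_{\mathrm{Ref}} = \sum_i f_i g_i, \qquad Af(x) = \sum_i A(x, y_i) f(y_i) = F(x), \qquad A^t A\phi_l = \lambda_l^2 \phi_l$$

The entries of $A^t A$ are $\int A(x, y_i) A(x, y_j)\,\Omega^2(x)\,dx$, which up to the weights $\frac{1}{\omega_i\omega_j}$ and a constant is

$$\int e^{-|x - y_i|_i^2} e^{-|x - y_j|_j^2}\,dx = \int e^{-\Omega_i(x - y_i)\cdot(x - y_i)} e^{-\Omega_j(x - y_j)\cdot(x - y_j)}\,dx$$

$$= \int e^{-\Omega_i(x - (y_i - y_j))\cdot(x - (y_i - y_j))} e^{-\Omega_j x \cdot x}\,dx = \left(e^{-\Omega_i x \cdot x} * e^{-\Omega_j x \cdot x}\right)(\delta), \quad \text{where } \delta = y_i - y_j,$$

$$= e^{-(\Omega_i^{-1} + \Omega_j^{-1})^{-1}\delta\cdot\delta},$$

so the entries of $A^t A$ are

$$\frac{e^{-\|y_i - y_j\|_{i,j}^2}}{\omega_i\omega_j}.$$

Here

$$A(x, y_i) = \frac{e^{-|x - y_i|^2}}{\Omega(x)\,\omega_i} = \sum_l \lambda_l \psi_l(x)\phi_l(y_i), \qquad A^t A\phi_l = \lambda_l^2\phi_l, \qquad \psi_l(x) = \frac{1}{\lambda_l} A(\phi_l).$$

We showed that

$$\sum_i \frac{e^{-|y_i - y_j|^2}}{\omega_i\omega_j}\phi_l(y_i) = \lambda_l^2\phi_l(y_j)$$

would be the same as before if the Gaussians were the same. Now I can compare any point to my reference set via this kernel. I know the eigenvectors of the outside world don't depend on how many points I have. The whole image is a function of the pixels; it can be expanded in eigenfunctions of all the pixels. Any patch can be checked to see if it's similar to something in the image.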

21

Lecture 16

Suppose I have a data set $x \in \Gamma$. Map it into a new set $y = Ax \in A\Gamma$. We want to define a distance or a graph structure that is invariant under such transformations. How do we do this? Mahalanobis distance. Let $x^p \in \Gamma$, $x^p = (x_1^p, \ldots, x_q^p)$. So we write our data as a matrix $X$. Take the matrix $C = XX^T$ and diagonalize it. Write it as $C_{q,q'} = \sum_l \lambda_l^2 O_l(q) O_l(q')$, that is, $C = ODO^t$. Define the distance between $x^p$ and $x^{p'}$ to be $d(x, x') = C^{-1}(x^p - x^{p'}) \cdot (x^p - x^{p'})$. If we look at $y$'s instead of $x$'s, $d(y, y')$, we have $C_A = AXX^tA^t$, so $C_A^{-1} = (A^t)^{-1}C^{-1}A^{-1}$ and

$$C_A^{-1}(Ax^p - Ax^{p'}) \cdot (Ax^p - Ax^{p'}) = C^{-1}(x^p - x^{p'}) \cdot (x^p - x^{p'}),$$

which is just the original distance. The $\lambda_l$ express how much that coordinate varies over the data. The data that originally looked like an ellipsoid becomes a sphere.

Suppose you build a graph $A$ of the data. You want a metric that does not depend on the function of the data, even if it is nonlinear. Local Mahalanobis? By taking only the data near the point. If I have some extra information I can do it. Suppose I have this nonlinear map $f(x)$. Want to define a graph on the new data which is independent of $f$: the same graph as the old data. Near $f(x_0)$ it looks like $f(x_0) + f'(x_0)(x - x_0) + O((x - x_0)^2)$. The inverse covariance matrix of the data near $x_0$ gives you the local Mahalanobis distance.

The real problem we want to solve: the so-called black box problem. Suppose I have data in a black box, mapped by some nonlinear transformation to some other place, where we see a collection of ellipsoids. The results of some experiment. $e^{-(\|y_j - y_i\|_i^2 + \|y_i - y_j\|_j^2)}$ can be our distance. Define a graph based on this distance. When I build a graph in the black box, the eigenvectors are products of a function of $x$ and a function of $y$. (This whole process is known as Nonlinear Independent Components Analysis.) What is the relationship between eigenvectors and the initial coordinates? Compute the discrete graph Laplacian from the Mahalanobis metric graph. The eigenvalue of $\psi_l(x_1)\psi_m(x_2)$ is a sum of the other two eigenvalues; they're orthogonal.
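The invariance of the Mahalanobis distance under an invertible linear map can be checked numerically. The sketch below works in two dimensions with $C = \sum_p x^p (x^p)^T$ (uncentered, as in the notes); the sample points, the matrix `A`, and the helper names are illustrative assumptions.

```python
def covariance(points):
    """C = sum over data points of x x^T, for 2-D points."""
    c = [[0.0, 0.0], [0.0, 0.0]]
    for x in points:
        for a in range(2):
            for b in range(2):
                c[a][b] += x[a] * x[b]
    return c


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def mahalanobis_sq(x, y, c_inv):
    """C^{-1}(x - y) . (x - y)."""
    d = [x[0] - y[0], x[1] - y[1]]
    return sum(d[a] * c_inv[a][b] * d[b] for a in range(2) for b in range(2))


def apply_map(a, x):
    return [a[0][0] * x[0] + a[0][1] * x[1], a[1][0] * x[0] + a[1][1] * x[1]]


data = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-2.0, 1.0], [3.0, -1.0]]
A = [[2.0, 1.0], [0.5, 3.0]]  # any invertible matrix
mapped = [apply_map(A, x) for x in data]

before = mahalanobis_sq(data[0], data[3], inverse_2x2(covariance(data)))
after = mahalanobis_sq(mapped[0], mapped[3], inverse_2x2(covariance(mapped)))
print(before, after)
```

The two printed values agree up to rounding, even though the Euclidean distance between the two points changes under `A`.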

41

Sarah Constantin December 1, 2011

1

1.1

Lecture 2

Multidimensional scaling

xi ∈ Rn

i = 1...m d2 = ||xi − xj ||2 ij ˜ We can do an embedding xi → Xi . We can compute the matrix Cij = xi ·xj inner products and diagonalize it. ˜ Ot Λ2 O = (ΛO)t ΛO = Xj · ΛO Another way to think of it: Cij = λ2 vl (i)vl (j) l inner products between eigenvectors. The vector associated with xi is xi = {λl vl (i), l = ˜ 1 . . . m}. If the xi lie in a low-rank space, then only few λl will be nonzero. Here we associated xi and xj with the kernel < xi , xj >. But we could have had a diﬀerent kernel k(xi , xj ). Abstract metric space X, distance d(x, y), want to deﬁne a mapping φ : X → Rn such that ||φ(x) − φ(y)|| d(x, y). There is a lot of literature on the subject. You have some estimate on the ratio between the distances. If |X| = 2L , then there is a map into RcL . Any metric space. This is a ”coding” theorem. In reality L is never really bigger than 50. Notice that in the multidimensional scaling example it didn’t really matter if N was big.

1

1.2

Diﬀusion geometry

Instead of looking at Xi · Xj , we could have looked at [Xi · Xj ] , cut oﬀ to be 0 unless xi xj ||xi ||||xj || ≤ . That also identiﬁes nearby points. Suppose my point Xi in R121 is an image of 11 by 11 pixels. How do I compare images? I have a database of images. Perhaps we have a big image and each patch is a subimage. ν(p) is the image centered at p. Also, we shouldn’t consider the image to be smooth. If it were, then it would describe a 2-d surface in R121 . But instead you’re going to see pixels all over the place, around some surface. A point cloud, from which we’d like to recover an underlying manifold. Take a pixel p and patch ν(p) and take its inner product with the patch ν(q), and deﬁne an aﬃnity α(p, q) = [ν(p) · ν(q)] truncate it to be zero unless they’re close. Then renormalize: Ap,q = α(p,q) where ω(p) = q α(p, q). ω(p) This produces a smoothing ﬁlter – replace a patch by the average of its neighbors. Denoises beautifully. Or you can let features be local variances rather than pixel values in a patch around each pixel. How do you convert a point cloud to the underlying manifold? Take the neighborhood of each point, average out the points to replace each point by its center of mass. This ’cleans up’ the data. Or, rather, take inner product between pairs of points; if they’re close enough, accept them, embed them into Euclidean space. This is ”diﬀusion geometry.” exp( −||ν(p)−ν(q)|| ) = ω(p)ω(q) Aq,p I(p)

p

2

Ap,q

I (q) =

This is called non-local means. Weighted means – only the close points count. What about rotated patches? They’ll look uncorrelated when they’re just oﬀ-center. Texture will obviously not pick out nearby patches. ¯ Deﬁne a graph on the image connecting nearby points. Weight each edge with Ap,q . φ(p) = ¯ − φ(p) is small. φ(p) − φ(p) = ∆φ + ¯ q ap,q φ(q) then if the function is smooth-ish, φ(p) O( 2 ). We’ll prove this next time.

2

2

Lecture 3

Symmetric aﬃnity matrix; associate a Markov process with it, or a graph. Start with the matrix a(i, j), assume symmetric and positive. (positive spectrum, not the same as entries being positive. Equivalent to a(i, j)ui uj is always positive. ) View a(i, j) = λ2 ψl (i)ψl (j) l

inner product matrix of x(i) = (λl ψl (i)) and x(j) = (λl ψl (j)). This matrix also deﬁnes a ˜ ˜ ˜ ˜ metric ||X(i) − X(j)||. Deﬁne ω(i) = j a(i, j) Deﬁne a new matrix Aij = √ which is symmetric, and another matrix Pi,j = a(i, j) ω(i) a(i, j) √ ω(i) ω(j)

2 Pi,j Pj,k = Pi,k j m is the probability of going from i to k in 2 steps. In general, Pi,k is the probability of going from i to k in m steps.

P is symmetric, and is ω −1/2 Aω 1/2 . P = λ2 ω −1/2 φl (i)φj (j)ω 1/2 (j) l 1/ωφl = ψl Suppose x, y are points distributed in the plane, and you have P (x, y) probability of going from x to y. 2 e−|x−y| /2 Gaussian distributed around each point. Can measure the distance between the bumps around x and x’ by measuring distance between the bumps. d(x, x ) = ( |P (x, y) − P (x , y)|2 dy)1/2

We can also think of Pi,i as a distance, suitably deﬁned: dm (i, i ) = |P m (i, j) − P m (i , j)|2 3 1 ω(j)

=

j l

λ2m (ψl (i) − ψl (i ))2 1/ω(j) l = λ4m |ψl (i) − ψl (j)|2 l ˜ = ||X m (i) − xm (i)|| ˜

where x(i) = {λ2m ψl (i)}. ˜ l Need to let m propagate to have a distance. How far is a question. Consider random points in Rn distributed along a density q(x).

N

1/N

i=1

f (xi )

f (x)q(x)dx

approximate integral. a(i, j) = e−|xi −xj | a(i, j)f (j)

j

2 /2

cN

e|xi −g|

2/

f (y)q(y)dy

Deﬁne a (x, y)e−|x−y| A0 f = ω (x) = A f=

α

2/

e−|x−y|

2/

f (y)q(y)dy

2/

e−|x−y|

q(y)dy

e−|x−y| / f (y)dy ω α (x)ω α (y)

2

P α (f ) is a convolution and two multiplications. cn en/2 change of variable: t =

x−y

1/2

e−|x−y|

1/2 t.

2

2/

f (y)dy

= t, y = x − cn

e−|t| f (x −

√

t)dt

assume f is c∞ with compact support. Can expand f as

4

(cn

e−t dt)f (x) + cn

2

e−t ∆f (x)

2

1/2

tdt + cn

e−|t| /2

2

d2 f (x)ti tj dt + O( 2 ) dxi dxj

Taylor expansion. = f (x) + m2 ∆f (x) + O( 2 ) m2 = cn e−|t| t2 dt 1

2

g ∗ f = f + m2 ∆f + O( 2 ) 1/ (f − g (f )) = −m2 ∆f + O( ) lim 1/ (f − g ∗ f ) = −m2 ∆f

→0

Schrodinger operator. What does (g ∗ f ) mean? (g ∗ f )hξ) = e−|ξ| t = νn

2 ˆ Gt (f )= e−|xi| t f (ξ) 2

ν)n

ˆ f (ξ)

Gt Gs = Gt+s semigroup. ∂ Gt (f ) = ∆Gt f ∂t u(x, t) = Gt (f )(x) ∂u = ∆u ∂t diﬀusion equation.

t→0

lim u(x, t) → f Gt (f ) = et∆ f

5

3

Lecture 4

f (yi )e−|x−yi | ω(x) =

i

2/

√

2/

e−|x−yi |

g ∗ (f p) √ g ∗p

ωi = ω(yi ) f smooth e−|x−y|

2/

f (y)dy = f (x) + ml ∆f (x) + O( 2 ) =g ∗f

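continuous version. Define $\omega_\varepsilon(x) = g_\varepsilon * p$. Define $d_\varepsilon^\alpha = \frac{1}{\omega_\varepsilon^\alpha}\,g_\varepsilon * (p/\omega_\varepsilon^\alpha)$ and

\[ P_\varepsilon^\alpha(f) = \frac{1}{d_\varepsilon^\alpha}\,\frac{1}{\omega_\varepsilon^\alpha}\;g_\varepsilon * \big(f\,p/\omega_\varepsilon^\alpha\big). \]

What is $\omega_\varepsilon$? The convolution of $p$ with $g_\varepsilon$. Close to $p$, up to an $\varepsilon$. So $P_\varepsilon^\alpha$ is like getting rid of $p$: if $\alpha = 1$, we just have $g_\varepsilon * f$, but normalized with $1/d$ because we need it to be a probability measure. You want to know the geometry of the dataset, which has nothing to do with the statistics of the dataset, $p$. Dividing by $\omega$ is uniformizing.

THEOREM.

\[ P_\varepsilon^\alpha(f) = f + \varepsilon\,m_2\left(\frac{\Delta(f\,p^{1-\alpha})}{p^{1-\alpha}} - \frac{\Delta(p^{1-\alpha})}{p^{1-\alpha}}\,f\right) + O(\varepsilon^2). \]

When $\alpha = 1$, you just get $P_\varepsilon^1f = f + \varepsilon\,m_2\,\Delta f + O(\varepsilon^2)$. Interesting options: $\alpha = 1, 0, 1/2$. From now on, let $m_2$ be part of the Laplacian.

Proof of theorem (to first order in $\varepsilon$):

\[ \omega_\varepsilon = g_\varepsilon * p = p + \varepsilon\Delta p = p\,(1 + \varepsilon\,\Delta p/p), \qquad \omega_\varepsilon^\alpha = p^\alpha(1 + \alpha\,\varepsilon\,\Delta p/p), \]
\[ \omega_\varepsilon^\alpha\,d_\varepsilon^\alpha = g_\varepsilon * (p/\omega_\varepsilon^\alpha) = p^{1-\alpha}(1 - \alpha\,\varepsilon\,\Delta p/p) + \varepsilon\,\Delta(p^{1-\alpha}). \]

The same expansion applied to $g_\varepsilon * (fp/\omega_\varepsilon^\alpha)$, divided by the line above, gives the theorem:

\[ P_\varepsilon^\alpha f = f + \varepsilon\left[\frac{\Delta(f\,p^{1-\alpha})}{p^{1-\alpha}} - f\,\frac{\Delta(p^{1-\alpha})}{p^{1-\alpha}}\right] + O(\varepsilon^2). \]

Integrand:

\[ \Delta_\alpha f = \frac{1}{p^{1-\alpha}}\Delta(f\,p^{1-\alpha}) - \frac{\Delta(p^{1-\alpha})}{p^{1-\alpha}}\,f. \]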
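The effect of the exponent $\alpha$ can be seen on a discrete sample. The sketch below is an illustration under stated assumptions, not the lecturer's code: the function names are hypothetical, the points are a deterministic non-uniform sample of $[0,1]$ (the squares of a uniform grid, so the density is $p(x) = 1/(2\sqrt{x})$), and the Gaussian is $e^{-|x-y|^2/\varepsilon}$, for which $m_2 = 1/4$ in one dimension. For $f(x) = x^2$ the theorem predicts $(P^1f - f)/\varepsilon \approx f''/4 = 0.5$, while for $\alpha = 0$ the density term $2f'p'/p = -2$ cancels $f''$ and the result is close to $0$.

```python
import numpy as np

def alpha_normalized_markov(points, eps, alpha):
    """Row-stochastic matrix P^alpha built from a Gaussian kernel on 1-D sample points."""
    diff = points[:, None] - points[None, :]
    kernel = np.exp(-diff**2 / eps)
    omega = kernel.sum(axis=1)                         # discrete version of g_eps * p
    k_alpha = kernel / np.outer(omega**alpha, omega**alpha)
    d_alpha = k_alpha.sum(axis=1)                      # row sums: the normalization d
    return k_alpha / d_alpha[:, None]

def generator_at(points, f_values, eps, alpha, index):
    """(P^alpha f - f) / eps evaluated at one sample point."""
    p_alpha = alpha_normalized_markov(points, eps, alpha)
    return (p_alpha[index] @ f_values - f_values[index]) / eps

# Non-uniform sample: squares of a uniform grid, density p(x) = 1 / (2 sqrt(x)).
grid = (np.arange(2000) + 0.5) / 2000
points = grid**2
f_values = points**2
center = int(np.argmin(np.abs(points - 0.5)))
with_density_removed = generator_at(points, f_values, 0.005, 1.0, center)
with_density_kept = generator_at(points, f_values, 0.005, 0.0, center)
```

The two estimates differ only through the normalization by $\omega$, which is the "uniformizing" step described above.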

Suppose I have points lying uniformly on a curve. $f(s)$ is a function of the arclength $s$ that parametrizes the curve; $y(s)$ is a parametrization. The points could also fail to be uniformly distributed; just use $P^1$ to make them uniform. If $y(s)$ is the parametrization,

\[ \frac{1}{\sqrt{\varepsilon}}\int e^{-|y(s)|^2/\varepsilon} f(s)\,ds = f(0) + \varepsilon\big(f''(0) + \tfrac{a^2}{2}f(0)\big). \]

Divide by what happens when $f = 1$, and you'll get rid of the extra $f(0)$ term. You'll get the second derivative in arclength: $f(0) + \varepsilon f''(0)$.

4

Lecture 5

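The point of last time's calculation:

\[ g_\varepsilon * f = \frac{1}{\varepsilon^{n/2}}\int e^{-|x-y|^2/\varepsilon} f(y)\,dy = f(x) + \varepsilon\,m_2\,\Delta f(x) + O(\varepsilon^2), \qquad\text{where } m_2 = \int_{-\infty}^{\infty} e^{-t^2}t^2\,dt. \]

From now on we swallow the $m_2$. We started with that to define operators which we called

\[ P_\varepsilon^\alpha f(x) = \frac{1}{d_\varepsilon^\alpha(x)}\int \frac{e^{-|x-y|^2/\varepsilon}}{\omega_\varepsilon^\alpha(x)\,\omega_\varepsilon^\alpha(y)}\,f(y)\,p(y)\,dy, \qquad\text{where } \omega_\varepsilon(x) = g_\varepsilon * p = p + \varepsilon\Delta p + O(\varepsilon^2). \]

We use $p$ to refer to the distribution of the points in $\mathbb{R}^n$. If you randomly pick points out of the density, the averages over the points will converge to the integral above. Last time we showed

\[ P_\varepsilon^\alpha f = f(x) + \varepsilon\left(\frac{\Delta(f\,p^{1-\alpha})}{p^{1-\alpha}} - \frac{\Delta(p^{1-\alpha})}{p^{1-\alpha}}\,f\right) + O(\varepsilon^2). \]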

If $\alpha = 1$, the second term vanishes (because $p^{1-\alpha} = 1$), so the operator approaches the Laplacian. That makes it independent of the density. What do I do if the points are constrained to lie on a set? On a curve or a manifold or something? Today we let $\alpha = 1$. Let's start with a curve; we know the points lie on a curve. Assume it is rectifiable. Then you can assume the points are distributed by some density $p(s)\,ds$ ON THE CURVE. Let's let $p = 1$, uniformly distributed by arclength. Pick a point on the curve, look at the tangent line to the curve, and model what's going on. Map that point to the origin, the second derivative of the curve to the $y$-axis and the tangent of the curve to the $x$-axis. ("Osculating plane.")

\[ P_\varepsilon(f)(0) = \frac{\int e^{-r^2(s)/\varepsilon} f(s)\,ds}{\int e^{-r^2(s)/\varepsilon}\,ds}. \]

Movement along the curve: $y(t) = at^2$. Distance to the origin:

\[ r(t) = \sqrt{t^2 + a^2t^4} = t\,(1 + a^2t^2)^{1/2} = t + \tfrac12a^2t^3 + O(\delta^4). \]

Arclength:

\[ s(t) = \int_0^t\sqrt{1 + (2au)^2}\,du = \int_0^t(1 + 2a^2u^2)\,du + O(\delta^4) = t + \tfrac{2a^2}{3}t^3 + O(\delta^4). \]

Hence

\[ s(r) = r - \tfrac12a^2r^3 + \tfrac{2a^2}{3}r^3 + O(\delta^4), \qquad t = s - \tfrac{2a^2}{3}s^3 + O(\delta^4), \]

and, with the normalization of the Gaussian swallowed as before,

\[ \int e^{-r^2/\varepsilon} f(s(r))\,s'(r)\,dr = f(0) + \varepsilon\,\big(f(s(r))\,s'(r)\big)''(0) + O(\varepsilon^2), \]
\[ \big(f(s(r))\,s'(r)\big)'' = f''(s(r))\,s'(r)^3 + 3f'(s(r))\,s'(r)\,s''(r) + f(s(r))\,s'''(r), \]

evaluated at $r = 0$. But at $r = 0$ we have $s(0) = 0$, $s'(0) = 1$, $s''(0) = 0$ and $s'''(0) = a^2$, so this is $f''(0) + a^2f(0)$, and the above integral is

\[ f(0) + \varepsilon\,(f''(0) + a^2f(0)). \]

Equivalently, the integral is $(1 + \varepsilon a^2)\,f(0) + \varepsilon f''(0)$. Now the normalized integral is

\[ \frac{\int e^{-r^2/\varepsilon} f(s)\,ds}{\int e^{-r^2/\varepsilon}\,ds} = f(0) + \frac{\varepsilon f''(0)}{1 + \varepsilon a^2} + O(\varepsilon^2) = f(0) + \varepsilon f''(0) + O(\varepsilon^2). \]

Same thing as before. This can be generalized (as an exercise) to two variables: a 2-d surface with an osculating plane. The height is $a_1t_1^2 + a_2t_2^2$, and you should get a Laplacian in the $s$ variable. This is sort of the definition of the Laplace-Beltrami operator on the surface. The final result should be

\[ f(0,0) + \varepsilon\left(\frac{d^2}{ds_1^2} + \frac{d^2}{ds_2^2}\right)f(0,0) + O(\varepsilon^2). \]

Returning to the curve: observe that if we had a density, and normalized as we had before with $\alpha = 1$, we'd be able to get rid of the density. The quotient

\[ \frac{\int e^{-r^2(s)/\varepsilon} f(s)\,p(s)\,ds}{\int e^{-r^2(s)/\varepsilon} p(s)\,ds} \]

does NOT converge to the Laplacian, but to the Laplacian plus a potential. (This case corresponds to taking $\alpha = 0$.) On the other hand,

\[ \frac{\int e^{-r^2(s)/\varepsilon} f(s)\,p(s)/\omega_\varepsilon(s)\,ds}{\int e^{-r^2(s)/\varepsilon} p(s)/\omega_\varepsilon(s)\,ds} \]

will get rid of the density, converging to $f(0) + \varepsilon f''(0)$. This gets rid of statistics and gives us geometry.

Claim: if $P_\varepsilon f = f(0) + \varepsilon f''(0) + O(\varepsilon^2)$, that is, $P_\varepsilon$ is the operator $I + \varepsilon\Delta + O(\varepsilon^2)$, and we take $\varepsilon = 1/n$ and form $P_\varepsilon^{1/\varepsilon} = P_{1/n}^n$, then this converges to $e^{\Delta}$. Note that the spectrum of $I - P$ is positive, while the spectrum of the second derivative is negative. Since the eigenvectors are just the eigenvectors of the Laplacian, this tells us the eigenvectors of $P_\varepsilon^{1/\varepsilon}$ should converge to the eigenvectors of the Laplacian. This "parametrizes" points without parametrizing. You get the arclength parametrization of the curve JUST from this averaging operator. Diffusion distance between two points is Euclidean distance in the ambient plane times something bounded.

We took $P_\varepsilon$ as an operator, $I + \varepsilon\Delta + \varepsilon R_\varepsilon$, with $R_\varepsilon$ the residual. The residual is terrible. The remainder is of order $\varepsilon^2$ because we assumed the function has two well-behaved derivatives. But what if you can't do Taylor series? It doesn't work if the function is not differentiable. Restrict our attention to a space of eigenvectors of the Laplacian whose eigenvalue does not exceed a certain number, i.e. bandlimited functions. Think of trigonometric polynomials. Claim:

\[ \|\Pi_m\,(P_{1/n}^n - e^{\Delta})\| \to 0, \]

where $\Pi_m$ is the orthogonal projection on the eigenspace of bandlimited functions with eigenvalue less than $m$.

\[ \Pi_mP_{1/n}^n\Pi_m = \Pi_m\big(I + \tfrac1n\Delta + \tfrac1nR_n\big)^n\Pi_m. \]

Now $\|(A+B)^n - A^n\| \le n\,(1 + \|B\|)^{n-1}\,\|B\|$ if $\|A\| \le 1$. If $A = I + \tfrac1n\Delta$ and $B = \tfrac1nR_n$, you apply the above, and get a bound.
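The perturbation inequality follows from the telescoping identity $(A+B)^n - A^n = \sum_{k=0}^{n-1}(A+B)^kBA^{n-1-k}$. A small numerical sanity check, as a sketch with a hypothetical helper name and random matrices chosen only for illustration:

```python
import numpy as np

def power_perturbation_bound(a, b, n):
    """Return (lhs, rhs) of ||(A+B)^n - A^n|| <= n (1 + ||B||)^(n-1) ||B|| in the spectral norm."""
    difference = np.linalg.matrix_power(a + b, n) - np.linalg.matrix_power(a, n)
    lhs = np.linalg.norm(difference, 2)
    norm_b = np.linalg.norm(b, 2)
    rhs = n * (1.0 + norm_b) ** (n - 1) * norm_b
    return lhs, rhs

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
a /= np.linalg.norm(a, 2)                 # enforce ||A|| <= 1
b = 0.01 * rng.standard_normal((6, 6))    # small residual
bounds = [power_perturbation_bound(a, b, n) for n in (1, 5, 50)]
```

In the application above, $\|B\|$ is of order $1/n$ times the residual norm, so the right-hand side stays bounded as $n$ grows.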

5

Lecture 5.

\[ \frac{1}{\varepsilon}\left(f(t) - \frac{1}{\omega_\varepsilon(t)}\int e^{-(y(t)-y(s))^2/\varepsilon}\,\frac{f(s)\,p(s)}{p_\varepsilon(t)\,p_\varepsilon(s)}\,ds\right) \to -f''(t) \]

as $\varepsilon \to 0$. We showed this last time. If we hadn't canceled by the $p_\varepsilon$'s, we'd have a different operator, corresponding to the situation $\alpha = 0$: second derivative plus some second order thing. $p$ is the density function along the curve, the "statistics." The limit shows that the "statistics" don't matter to the geometry of the curve. The conventional machine learning embedding blends the statistics with the geometry. But this is an intrinsic parametrization, independent of where the data is most densely sampled. The above integral, if we took it discretely, would mean just averaging the points against a Gaussian. We would expect the eigenvectors of the discrete operators to approximate the eigenvectors of the continuous operator. One of the goals is to go to surfaces: Riemannian manifolds. If I have data which is on some surface, we want to understand the data independently of how we measure the data. We want some way of defining the operators so that they're intrinsic.

\[ P_\varepsilon^{1/\varepsilon} \to e^{\Delta} \]

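so in particular, if $\varepsilon_n = 1/n$, $P_{1/n}^n \to e^{\Delta}$.

What does that mean? Here $\Delta = \frac{d^2}{dx^2}$. Writing

\[ f = \sum_k\hat f_ke^{ikx} \quad\text{means that}\quad \Delta f = \sum_k -k^2\hat f_ke^{ikx}, \qquad F(\Delta)f = \sum_kF(-k^2)\,\hat f_ke^{ikx}. \]

Now if $u(x,t) = e^{t\Delta}f$,

\[ \frac{du}{dt} = \sum_ke^{-k^2t}(-k^2)\,\hat f_ke^{ikx}, \qquad u = e^{t\Delta}f = \sum_ke^{-k^2t}\hat f_ke^{ikx} = \frac{1}{2\pi}\int_{-\pi}^{\pi}q_t(x-y)\,f(y)\,dy, \]

where $q_t(x) = \sum_ke^{-k^2t}e^{ikx}$.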

This is real, so it's equal to its real part: $q_t(x-y) = \sum_ke^{-k^2t}\cos k(x-y)$. This is the heat kernel!!! Take a Gaussian kernel and periodize it to have period $2\pi$ and you have this $q_t$. It's actually a theta function. For small $t$, this really does look like a Gaussian. We define a diffusion distance on the circle as follows:

\[ d_t^2(x,x') = \int_{-\pi}^{\pi}|q_t(x-y) - q_t(x'-y)|^2\,dy = \Big\|\sum_ke^{-k^2t}\,(e^{ikx} - e^{ikx'})\,e^{-iky}\Big\|_2^2 = 2\pi\sum_{k\in\mathbb{Z}}e^{-2k^2t}\,|e^{ikx} - e^{ikx'}|^2 \]
\[ = 4\pi\,e^{-2t}\,|e^{ix} - e^{ix'}|^2\left(1 + \sum_{k\ge 2}e^{-(k^2-1)2t}\,\frac{|e^{ikx} - e^{ikx'}|^2}{|e^{ix} - e^{ix'}|^2}\right). \]
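The Parseval step in the diffusion distance can be verified directly. This is a sketch under stated assumptions: the Fourier series is truncated at a hypothetical cutoff `kmax`, the function names are illustrative, and the integral over the circle is replaced by a uniform Riemann sum (which is exact for trigonometric polynomials of degree below the grid size).

```python
import numpy as np

def heat_kernel_circle(theta, t, kmax=60):
    """Truncated series q_t(theta) = sum_k exp(-k^2 t) exp(i k theta), written with cosines."""
    k = np.arange(1, kmax + 1)
    return 1.0 + 2.0 * np.sum(np.exp(-k**2 * t)[:, None] * np.cos(np.outer(k, theta)), axis=0)

def diffusion_distance_sq_quadrature(x, xp, t, n_grid=1024, kmax=60):
    """Integral of |q_t(x - y) - q_t(x' - y)|^2 over y in [-pi, pi)."""
    y = -np.pi + 2.0 * np.pi * np.arange(n_grid) / n_grid
    diff = heat_kernel_circle(x - y, t, kmax) - heat_kernel_circle(xp - y, t, kmax)
    return np.sum(diff**2) * (2.0 * np.pi / n_grid)

def diffusion_distance_sq_fourier(x, xp, t, kmax=60):
    """4 pi * sum_{k>=1} exp(-2 k^2 t) |e^{ikx} - e^{ikx'}|^2."""
    k = np.arange(1, kmax + 1)
    gaps = np.abs(np.exp(1j * k * x) - np.exp(1j * k * xp)) ** 2
    return 4.0 * np.pi * np.sum(np.exp(-2.0 * k**2 * t) * gaps)

x, xp = 0.3, 1.1
small_t = (diffusion_distance_sq_quadrature(x, xp, 0.05), diffusion_distance_sq_fourier(x, xp, 0.05))
large_t_leading = 4.0 * np.pi * np.exp(-6.0) * abs(np.exp(1j * x) - np.exp(1j * xp)) ** 2
```

For large $t$ only the $k = \pm 1$ modes survive, so the diffusion distance becomes proportional to the chord length $|e^{ix} - e^{ix'}|$.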

This is the diffusion distance, not the geodesic distance. Short diffusion distance means there are many paths between the two points which are not too long. You have a surface; you want an intrinsic relationship between points that you can measure directly. You can put a Riemannian metric on it. We'll be able to embed it explicitly into Euclidean space. Our embedding is not going to preserve the metric; it's going to preserve the diffusion distances. Let's assume now that this manifold is given to us in $\mathbb{R}^3$. I'm going to compute, on the manifold,

\[ \int_Me^{-|x-y|^2/\varepsilon}f(y)\,dy. \]

We want this to be $f(x) + \varepsilon\Delta f + O(\varepsilon^2)$, so we need to rescale it appropriately. We can do what we had before:

\[ \int_Me^{-|x-y(s)|^2/\varepsilon}f(y(s))\,p(s)\,ds. \]

Think of our manifold as a surface in $\mathbb{R}^3$. Look at the tangent plane at a point. Pick coordinates $u$ so that the height is $y(u) = \sum_ia_iu_i^2$. Define $s$ as the length of the geodesic curve in the direction $u$ connecting $0$ to $(u, y(u))$. This is called the exponential map. The direction we pick is $s(u)\,u/|u|$.

\[ \frac{dS_i}{du_i} = 1 + 2a_i^2u_i^2 + O(\delta^3), \qquad \frac{dS_i}{du_j} = O(\delta^3). \]

It's a diagonal matrix, so

\[ \Big|\det\Big(\frac{dS_i}{du_j}\Big)\Big| = 1 + 2\sum_ia_i^2u_i^2 + O(\delta^3). \]

Taylor series in the $s$-coordinate:

\[ f(S) = f(0) + \sum_is_i\frac{\partial f}{\partial s_i}(0) + \frac12\sum_{i,j}s_is_j\frac{\partial^2f}{\partial s_i\,\partial s_j}(0) + O(\delta^3). \]

This is all extrinsic. All in the space we're embedded in.

6

Fast algorithms and potential theory in scientiﬁc computing

Wilbur Cross Lecture by Leslie Greengard. Fast, robust algorithms for engineering and applied physics are necessary. They should be designed so that precision is a "knob you can turn." Also, you want them to be automatically adaptive. That is, you want to be able to refine and then repeat the calculation. "Tool-building" perspective: a hierarchy of tools from the most basic linear algebra modules to the full application. Out there in the world, there's either MATLAB, or a specific application. We'd like to change that.

Integral equation (Green's function) methods. We like these integral equation methods because they're good at handling complicated geometry. You need a description of the geometry and that becomes your discretization; you don't need to build a mesh. The integral equations are as well-conditioned as the underlying physics allows. But in the absence of fast algorithms they're intractable. And they need significant "quadrature designs" (integrands are not smooth).

What are Green's functions? Diffusion: $G(x,y,t) = e^{-\|x-y\|^2/4t}/(4\pi t)^{d/2}$. Gravitation: $1/\|x-y\|$, etc. Integral equations are data driven: if I want to solve the electrostatic problem with a bunch of point sources,

\[ U(S) = \sum_j\frac{q_j}{\|S - S_j\|}, \]

I don't need a mesh, I need to compute this sum quickly. That's not just particle interactions. Even for a continuous problem, discretizing the differential equation yields a sparse linear system.

\[ \Delta U(x) = \rho(x) \ \text{in } R, \qquad U(x) = f(x) \ \text{on } S, \]

where $S$ is the boundary and $R$ is the system.

\[ U(x) = V[\rho](x) + D[\mu](x), \qquad V[\rho](x) = \int_R\frac{\rho(y)}{\|x-y\|}\,dy. \]

This is the integral formulation.

Problem 1: Dense Linear Algebra. Consider $Y = AX$, $Y_n = \sum_mA_{nm}X_m$. It takes $N^2$ operations. To solve, that takes $N^3$ operations using Gaussian elimination. Can we do this faster? Suppose

\[ A_{nm} = \cos(t_n - s_m) = \cos(t_n)\cos(s_m) + \sin(t_n)\sin(s_m), \]

so first compute $W_1 = \sum_m\cos(s_m)X_m$ and $W_2 = \sum_m\sin(s_m)X_m$. Then let $Y_n = \cos(t_n)W_1 + \sin(t_n)W_2$, which costs $O(N)$ operations in total.

Fast Gauss Transform.

\[ Y_n = \sum_me^{-|t_n-s_m|^2/4T}X_m. \]
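The cosine example from Problem 1 is the simplest case of a separable (rank-two) kernel. A minimal sketch comparing the dense product with the two-moment evaluation; the function names and the random test data are assumptions for illustration only:

```python
import numpy as np

def cosine_matvec_dense(t, s, x):
    """Form A[n, m] = cos(t_n - s_m) explicitly and multiply: O(N^2) work."""
    a = np.cos(t[:, None] - s[None, :])
    return a @ x

def cosine_matvec_fast(t, s, x):
    """Use cos(t - s) = cos t cos s + sin t sin s: two moments, then O(N) evaluation."""
    w1 = np.dot(np.cos(s), x)      # W_1 = sum_m cos(s_m) X_m
    w2 = np.dot(np.sin(s), x)      # W_2 = sum_m sin(s_m) X_m
    return np.cos(t) * w1 + np.sin(t) * w2

rng = np.random.default_rng(1)
t, s, x = rng.uniform(0, 2 * np.pi, 300), rng.uniform(0, 2 * np.pi, 400), rng.standard_normal(400)
dense, fast = cosine_matvec_dense(t, s, x), cosine_matvec_fast(t, s, x)
```

The fast transforms discussed next follow the same pattern, with a truncated expansion in place of the exact two-term identity.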

It turns out that the Gaussian kernel has a decomposition into Hermite functions times polynomial moments. It is just a Taylor expansion about the center, in fact, but we have explicit functions. The kernel is rapidly decaying, and $T$ controls the decay. If we only want 14 digits, we know how far we have to go to make it safe. Just compute moments and build an expansion.

N-body interactions:

\[ U(T_k) = \sum_j\frac{q_j}{\|T_k - S_j\|}. \]

Do you have to do $nm$ work (with $n$ sources and $m$ targets)? This can be written as a multipole expansion, built from the moments

\[ M_n^m = \sum_jq_j\,Y_n^{-m}(\theta_j,\phi_j)\,r_j^n, \]

where the $Y$'s are spherical harmonics. Truncate that at $p$ terms, and show the error decays geometrically, like $Q\,(R/D)^p$, where $D$ is the distance to the nearest target. This is the origin of the fast multipole methods.

Instead of computing the N-body calculation directly, we compute the $p^2$ moments $M_n^m$, which is $O(Np^2)$ work, and evaluate the expansion at each target, which is $O(Mp^2)$. And then combine them. FMM uses various length scales to cluster.

Pictorially: imagine you're a box in the middle. There are some boxes a reasonable distance from you and you want to compute interactions. You can't interact with your nearest neighbors. Chop up every box, and you're left with a region of the children of the parent's nearest neighbors. $O(N)$ calculations at the finest level. $O(N\log N)$ work overall. Performance is independent of the distribution. The more clustered it is, the better it performs. Modern FMMs: waves, Laplace equation, etc. Cylindrical waves, plane waves, solid harmonics, etc. Suppose you want the solution to the Poisson equation $\Delta U = f$. Standard methods try to do this with meshes. Faster than the FMM is for point interactions. "Volume potential FMMs." Applications: capacitance, inductance, full wave analysis for chips. Also quantum chemistry, molecular dynamics, astrophysics. Thermal modeling of fuel cells; lots of holes. 530,000 degrees of freedom. Simulation can have direct impact on engineering and applied problems.

7

Lecture 6

$(I - P_\varepsilon)f = \varepsilon\Delta f + O(\varepsilon^{3/2})$. This came from the Taylor formula. For that to hold you need, for instance, third derivatives to be bounded. Write $P_\varepsilon f = (I - \varepsilon\Delta)f + R_\varepsilon f$ with residual $R_\varepsilon$. Then

\[ P_{1/n}^nf = \big[(I - \tfrac1n\Delta) + R_{1/n}\big]^nf = (I - \tfrac1n\Delta)^nf + \rho_nf, \qquad \rho_n \to 0, \]

using the inequality that if $\|A\| \le 1$ then

\[ \|(A+B)^n - A^n\| \le n\,\|B\|\,(1 + \|B\|)^{n-1}. \]

All inequalities are okay provided they are restricted to band-limited functions for $\Delta$. What that means is:

\[ H_M = \Big\{\sum_{l:\,\lambda_l\le M^2}\alpha_l\phi_l\Big\}, \qquad \Delta\phi_l = \lambda_l\phi_l. \]

The point of band-limited functions is that they're differentiable to all orders. The $O(\varepsilon^{3/2})$ term is dominated by $M^4$. If we fix $M$ and let $\varepsilon \to 0$ then this goes to 0.

If I'm on the circle and I'm looking at the discrete operator and want to compare it to the integral operator, there's going to be an error. If I find eigenvalues of the discrete operator, I want to relate them to the eigenvalues of the continuous operator. And that's good so long as $l < 1/\delta$, if $\delta$ is the spacing and $l$ is the index of the eigenvector. (Personal note: Nyquist sampling rate? Is this what it is?)

For $f = \sum_l\alpha_l\phi_l$,

\[ P_{1/n}^nf = \sum_l\alpha_l\,(1 - \tfrac1n\lambda_l)^n\phi_l + \rho_nf \;\to\; \sum_l\alpha_le^{-\lambda_l}\phi_l + \rho_nf = e^{-\Delta}f + \rho_n(f), \]

and the error does converge to 0. This is not completely obvious (the bit about the error) on the sphere or other non-circle things. You do get a bound, and it doesn't matter how big that bound is. So we're saying, if $f \in H_M$,

\[ P_{1/n}^nf \to e^{-\Delta}f \]

as $n \to \infty$. To what extent are the eigenfunctions of $P_n$ close to the eigenfunctions of the Laplacian? Do we have convergence?

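NO. NOT EVEN ON THE CIRCLE. The functions $\alpha\cos n\theta + \beta\sin n\theta$ with $\alpha^2 + \beta^2 = 1$ all have the same eigenvalue. (So there's more than one eigenvector for each eigenvalue.) So you don't necessarily get a limit. The eigenspaces converge, but you can't expect the eigenvectors to converge. The eigenvalues will be okay once you organize them in decreasing order. So will the eigenprojections.

\[ A_n = \pi_MP_{1/n}^n\pi_M \to \pi_Me^{-\Delta}\pi_M = A_0. \]

Exercise: prove this. $\max_{|x|=1}A_nx\cdot x = \lambda_0^n$, the top eigenvalue. And pick the next highest to get the next highest eigenvalue, etc. We claim the eigenvector $\phi$ of $A = A_0 + R$ is equal to

\[ \phi = \langle\phi,\phi^0\rangle\,\phi^0 + \sum_m\frac{\langle R\phi,\phi_m^0\rangle}{\lambda - \lambda_m^0}\,\phi_m^0, \]

where the sum runs over the eigenvectors $\phi_m^0$ of $A_0$ other than $\phi^0$. $R$ is the residual. If the gap is not large, this error term is not great. If there's multiplicity, then you need to replace $R\phi$ by the projection on the corresponding space. To see the claim, write $A\phi = (A_0 + R)\phi$ with $A\phi = \lambda\phi$, so $(\lambda I - A_0)\phi = R\phi$. Expand

\[ R\phi = \sum_m\langle R\phi,\phi_m^0\rangle\,\phi_m^0, \qquad \phi = \sum_m\langle\phi,\phi_m^0\rangle\,\phi_m^0, \]

and compare coefficients: $(\lambda - \lambda_m^0)\,\langle\phi,\phi_m^0\rangle = \langle R\phi,\phi_m^0\rangle$. Look over this again! The goal was to show that defining those Markov operators on a discrete set of points isn't crazy.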
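The coefficient identity $(\lambda - \lambda_m^0)\langle\phi,\phi_m^0\rangle = \langle R\phi,\phi_m^0\rangle$ behind the eigenvector expansion holds exactly for symmetric matrices, so it can be checked numerically. A minimal sketch, with a hypothetical function name and a made-up symmetric test matrix standing in for $A_0$ and a small symmetric $R$:

```python
import numpy as np

def perturbation_coefficients(a0, r, index=-1):
    """Compare <phi, phi_m^0> with <R phi, phi_m^0> / (lambda - lambda_m^0) for one eigenpair of A0 + R."""
    lam0, phi0 = np.linalg.eigh(a0)           # columns of phi0 are the eigenvectors of A_0
    lam, phi = np.linalg.eigh(a0 + r)         # eigenpairs of A = A_0 + R
    lam_i, phi_i = lam[index], phi[:, index]
    lhs = phi0.T @ phi_i                      # coefficients <phi, phi_m^0>
    rhs = (phi0.T @ (r @ phi_i)) / (lam_i - lam0)
    return lhs, rhs

rng = np.random.default_rng(2)
a0 = np.diag([1.0, 2.0, 3.5, 5.0, 7.0])      # well separated spectrum (illustrative)
noise = rng.standard_normal((5, 5))
r = 1e-3 * (noise + noise.T)                  # small symmetric residual
lhs, rhs = perturbation_coefficients(a0, r)
```

When two unperturbed eigenvalues are close, the denominators $\lambda - \lambda_m^0$ become small, which is the gap condition mentioned above.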


7.1

And now for something completely diﬀerent.

Problem: gravitational force from multiple point masses. $10^{20}$ interactions from $10^{10}$ points. Very big problem. But the interaction between the Earth and the Moon doesn't depend on how many particles are in the Earth or the Moon. That's the whole reasoning behind fast multipole methods; if two clusters of diameter $D$ are separated by more than $D$, you can calculate the interaction while ignoring the number of points. The only thing that matters is how the points are distributed. We have a kernel $k(x_i,x_j)$, the interaction between the two points. In fact, we have a matrix $a(i,j)$. How do we organize the points through their mutual interactions? Organize the columns and rows. Unpermute everything. Suppose I have a matrix that has only one row. Possibly Gaussian noise. But it's very easy to make this function smooth: just reorder the values, largest to smallest.

\[ f(x) = \int_0^x\tilde f(t)\,dt + \int_0^xd\nu(t), \]

an integrable function plus a singular function. The support of a singular measure can be covered by arbitrarily small intervals. You need a Calderon-Zygmund decomposition, but you can prove that this is a B.V. function, modulo an arbitrarily small set. Given any $\lambda > 0$, we can write $f = g_\lambda + b_\lambda$, where $|g_\lambda(x) - g_\lambda(y)| < \lambda\,|x-y|$ (the "good function") and $b_\lambda(x)$ is supported on a set of measure less than or equal to $c/\lambda$. We want to do the same thing with a matrix: with the columns, simultaneously, and then symmetrically with the rows. Exercise: prove the Rising Sun Lemma. We want the geometry to emerge.

8

Lecture 7

From last time: think of the data as a random function on $[0,1]$ (or a discrete collection of samples). Is there a way of reorganizing it so that this is smooth? A permutation? Organize it in increasing order. An increasing function can be replaced by a function that has a derivative. Pick a slope $\lambda$; what regions are in the "shadow" (portions of the function where the slope is steeper than that)? Everything "in the sun" has a smaller slope. Every shadow region corresponds to an interval, and

\[ \frac{f(\beta_k) - f(\alpha_k)}{\beta_k - \alpha_k} = \lambda \]

if $\alpha_k$ and $\beta_k$ are the endpoints of the shadow interval $I_k$. If you replace $f$ with the polygonal line that has no "shadow," you get a new function called $g_\lambda(x)$. Then $F = g_\lambda(x) + b_\lambda(x)$, where $b_\lambda$ is the residual. The support of the residual is just the union of the "shadow" intervals. Since $\lambda(\beta_k - \alpha_k) = F(\beta_k) - F(\alpha_k)$,

\[ \sum_k|I_k| \le \frac{\max F - \min F}{\lambda} \qquad\text{and}\qquad g_\lambda'(x) \le \lambda. \]

So this is a decent BV function formed just from binning.
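The rising sun construction has a direct discrete analogue. The sketch below is one possible implementation under stated assumptions (uniformly spaced sample points, a non-decreasing sample vector, hypothetical function name): a point is in the shadow when some later point lies above the line of slope $\lambda$ through it, which is a running maximum of $F(x) - \lambda x$ taken from the right.

```python
import numpy as np

def rising_sun_split(x, values, lam):
    """Split a non-decreasing sampled function into good part g (slope <= lam) and residual b."""
    tilted = values - lam * x
    running_max = np.maximum.accumulate(tilted[::-1])[::-1]   # max of tilted over indices >= i
    shadow = running_max > tilted                             # points hidden behind a later, higher point
    good = lam * x + running_max                              # polygonal line with slope lam over shadows
    bad = values - good                                       # supported on the shadow set
    return good, bad, shadow

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 501)
values = np.sort(rng.random(501))         # random data reorganized in increasing order
lam = 2.0
good, bad, shadow = rising_sun_split(x, values, lam)
dx = x[1] - x[0]
```

Raising `lam` shrinks the shadow set at the price of a steeper good part, which is the trade-off expressed by the bound $\sum_k|I_k| \le (\max F - \min F)/\lambda$.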

We need to learn to do this for vector-valued functions. We want to organize the vectors in $\mathbb{R}^n$ so as to have a good part with a small slope and a bad part supported on a small set of intervals. This is the Riesz decomposition or the Calderon-Zygmund decomposition in one variable.

Now we define a new, more extensible distance on $[0,1]$. Define the dyadic tree obtained by binary splits. Define $d_T(x,y)$ as the length of the smallest dyadic interval containing both. You can get unlucky if you're on either side of a branch. So it's not Euclidean distance. You can assign a weight to each node of this tree, increasing as you go farther down the tree. Any collection of weights which are monotone can be assigned to be the weight of the smallest dyadic interval containing a pair. This is a large family of distances. Advantages over Euclidean distances: fast to compute, especially in high dimension. You could define the Euclidean distance as a combination of tree distances (instead of splitting into two halves, make a lopsided tree) and then average the distances over a collection of trees.

$f$ on $[0,1]$ is Holder or Lipschitz of order $\alpha$ if $|f(x) - f(y)| \le C\,d(x,y)^\alpha$. Before, we claimed we had a bounded derivative on the good function. Consider functions $\chi_{[0,1]}$ etc. that are characteristic functions of a dyadic interval: $\chi(2^lx - j)$. These are orthogonal for fixed $l$ because they have disjoint support. Define $V_l$ as the span of the $\chi(2^lx - j)$. These, by the way, are always Holder. The projection onto $V_l$ is

\[ P_lf = \sum_j\langle f, 2^{l/2}\chi(2^lx - j)\rangle\,2^{l/2}\chi(2^lx - j) = \sum_j\Big(\frac{1}{|I_j|}\int_{I_j}f(x)\,dx\Big)\chi(2^lx - j). \]

If $\chi = \chi_0 + \chi_1$, then $\chi_0 - \chi_1$ is orthogonal to $\chi$. Haar functions: $h_j^l = 2^{l/2}h(2^lx - j)$. These form an orthonormal basis of $L^2(0,1)$. If $P_l$ is the projection onto $V_l$ then $\|P_lf - f\|_p \to 0$, but the projection can be expanded in the Haar functions. This shows that linear combinations of Haar functions are dense.


The following is completely WRONG for Fourier series, but works for Haar. Suppose $|f(x) - f(y)| \le d_T^\alpha(x,y)$. Then $|\langle f, h_I\rangle| \le C\,|I_j^l|^{1/2+\alpha}$. The Haar function is $1$ on the left half of the interval and $-1$ on the right half; call the halves $I_+$ and $I_-$. For any constant $c$,

\[ \langle f, h_I\rangle = \frac{1}{|I_j^l|^{1/2}}\Big(\int_{I_+}f(y)\,dy - \int_{I_-}f(y)\,dy\Big) = \frac{1}{|I_j^l|^{1/2}}\Big(\int_{I_+}(f(x) - c)\,dx - \int_{I_-}(f(x) - c)\,dx\Big), \]

and choosing $c$ to be a value of $f$ on $I$ bounds each integrand by $|I|^\alpha$. This gives the claim.

The size doesn't depend on $j$; it does depend on $l$. We claim the converse is true. Let

\[ f(x) = \sum_{l,j}\alpha_j^l\,h_j^l(x) \]

with coefficients of that size. Then $|f(x) - f(y)| \le d_T(x,y)^\alpha$, up to a constant. That is, the coefficients can tell us if the function is going to be Holder. Synthesize the function at $x$ and at $y$. The only Haar functions we'll need are those in the smallest interval containing $x$ and $y$. Tabulating a function by its Haar coefficients is incredibly efficient (compared to Fourier coefficients). Complete the proof as an exercise.
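The decay of Haar coefficients for a Holder function can be observed on sampled data. The following is a sketch under stated assumptions: the function is sampled at the midpoints of $2^J$ equal cells, interval averages are replaced by sample averages, and the helper name is hypothetical. With $h_I = |I|^{-1/2}$ on the left half and $-|I|^{-1/2}$ on the right half, the coefficient is $\langle f, h_I\rangle = \tfrac12|I|^{1/2}(m_{I_+} - m_{I_-})$.

```python
import numpy as np

def haar_coefficients(samples):
    """Haar coefficients per level, from 2**J samples on a uniform grid of [0, 1)."""
    means = np.asarray(samples, dtype=float)
    level = means.size.bit_length() - 1          # means.size == 2**level
    coeffs = {}
    while means.size > 1:
        level -= 1
        left, right = means[0::2], means[1::2]   # averages over the two halves of each interval
        length = 2.0 ** (-level)                 # |I| for intervals at this level
        coeffs[level] = 0.5 * np.sqrt(length) * (left - right)
        means = 0.5 * (left + right)             # averages over the parent intervals
    return coeffs, float(means[0])

# Example: f(x) = |x - 0.3|^(1/2) is Holder with exponent 1/2 and constant 1.
grid = (np.arange(1024) + 0.5) / 1024
coeffs, mean_value = haar_coefficients(np.abs(grid - 0.3) ** 0.5)
```

In this example every coefficient at level $l$ is bounded by $\tfrac12|I|^{1/2+\alpha}$ with $|I| = 2^{-l}$ and $\alpha = 1/2$, independently of the position $j$.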

9

Lecture 8

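Define the Haar functions: $h(x) = 1$ for $0 < x < 1/2$ and $h(x) = -1$ for $1/2 < x < 1$. The functions $2^{m/2}h(2^mx - j)$ for $m \le l-1$ and $j \le 2^m - 1$, together with the constant function, form an orthonormal basis of $V_l$. The $h_j^l$ are the basis of $W_l$. Write $V_{l+1} = V_l + W_l$, so that $P_{l+1} - P_l = \Delta_l$ is the orthogonal projection on $W_l$:

\[ \Delta_lf = \sum_j\langle f, h_{I_j^l}\rangle\,h_{I_j^l}, \qquad f = \sum_l\Delta_lf + P_0f = \sum_l(P_{l+1} - P_l)f + P_0f, \]

since $\sum_l(P_{l+1} - P_l)f = P_\infty f - P_0f$.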


Telescoping series. We defined the dyadic tree metric $d_T(x,y)$ between two points to be the length of the smallest dyadic interval containing both $x$ and $y$. For the shifted tree, $d_\tau(x,y) = d_T(x - \tau, y - \tau)$, and

\[ \int_0^1d_\tau(x,y)^\alpha\,d\tau \approx |x-y|^\alpha \]

for $0 < \alpha < 1$. For $\alpha = 1$, unknown. We proved that $|f(x) - f(y)| \le c\,d_T^\alpha(x,y)$ if

\[ f(x) = \sum_l\sum_{|I|=2^{-l}}d_Ih_I, \qquad d_I = \langle f, h_I\rangle, \qquad |d_I| \le C\,|I|^{1/2+\alpha}. \]

I only sum over the intervals that contain $x$, not anywhere else. The average of $f$ over $I$ is $1/2$ times the sum of the average of $f$ over $I_-$ and the average of $f$ over $I_+$. On $I$,

\[ (P_{l+1} - P_l)(f) = (m_{I_-}(f) - m_I(f))\,\chi_{I_-} + (m_{I_+}(f) - m_I(f))\,\chi_{I_+} = \tfrac12\,[m_{I_+}(f) - m_{I_-}(f)]\,(\chi_{I_+} - \chi_{I_-}). \]

You have to follow the path that leads down to $x$ and $y$. To compute a function at a given point from the Haar coefficients, I just follow the path and add the coefficients. The number of coefficients is the number of levels. If the function satisfies a Holder condition, and you know the function path up to the bottom level, you can just resynthesize the function from samples. Store the differences as Haar coefficients. The differences on the next level are Haar coefficients on the next finer level. Suppose you've filled the tree with data: data and differences between neighboring points. Suppose we have a matrix mapping $L : \mathbb{R}^N \to \mathbb{R}^N$. Write $L$ as

\[ L = \sum_k(P_{k+1}LP_{k+1} - P_kLP_k) + P_0LP_0 \]

just by telescoping. Each term is

\[ (P_{k+1} - P_k)LP_{k+1} + P_kLP_{k+1} - P_kLP_k = \Delta_kLP_k + \Delta_kL\Delta_k + P_kL\Delta_k. \]
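The telescoping decomposition of a matrix can be written out for the dyadic averaging projections on $\mathbb{R}^N$ with $N = 2^J$. This is a sketch with hypothetical helper names; the projections are formed as dense matrices purely for clarity, which a practical implementation would avoid.

```python
import numpy as np

def averaging_projection(n, level):
    """Orthogonal projection P_level on R^n onto vectors constant on 2**level dyadic blocks."""
    block = n // (2 ** level)
    p = np.zeros((n, n))
    for b in range(2 ** level):
        sl = slice(b * block, (b + 1) * block)
        p[sl, sl] = 1.0 / block
    return p

def scale_decomposition(l_matrix):
    """Terms P_0 L P_0 and Delta_k L P_k + Delta_k L Delta_k + P_k L Delta_k for each scale k."""
    n = l_matrix.shape[0]
    depth = n.bit_length() - 1                       # n == 2**depth, and P_depth is the identity
    p0 = averaging_projection(n, 0)
    terms = [p0 @ l_matrix @ p0]
    for k in range(depth):
        p_k = averaging_projection(n, k)
        delta_k = averaging_projection(n, k + 1) - p_k
        terms.append(delta_k @ l_matrix @ p_k + delta_k @ l_matrix @ delta_k + p_k @ l_matrix @ delta_k)
    return terms

rng = np.random.default_rng(4)
l_matrix = rng.standard_normal((16, 16))
terms = scale_decomposition(l_matrix)
```

Summing the terms recovers the original matrix, since the sum telescopes to $P_JLP_J$ with $P_J = I$.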

Three matrices; each moves one scale to itself. This completely decouples the scales. Then you sum over the scales. The original matrix becomes a block matrix in which each scale is mapped into itself and they're completely independent. How do you build a dyadic tree on a square? Four corners (subsquares), then 16 subsquares, etc. Quadtree. $V_l$ is of dimension $4^l$, and $P_l(f) = \sum_Qm_Q(f)\,\chi_Q(x)$.

10

Lecture 9

Once again, we're on the hierarchical tree. $d_T(x,y)$ is the length of the smallest $I$ such that $x, y \in I$, and we consider $|f(x) - f(y)| \le C\,d_T(x,y)^\alpha$. Suppose $|x-y| \approx 2^{-m}$. Then

\[ \int_0^1d_T(x - \tau, y - \tau)^\alpha\,d\tau \approx |x-y|^\alpha \]

for $\alpha < 1$. If we look at shifted trees, two nearby points will have small distance in most of the shifted trees (though in a few shifted trees the distance may be large, if the points fall on either side of a boundary). A boundary at scale $2^{-l}$ separates the points for a fraction of about $2^l\,2^{-m}$ of the shifts, so the average is about

\[ \sum_{l<m}2^{-l\alpha}\,2^l\,2^{-m} = 2^{-m}\sum_{l=0}^{m}2^{l(1-\alpha)}, \]

which is of order $2^{-\alpha m}$ if $\alpha < 1$, $m\,2^{-m}$ if $\alpha = 1$, and $2^{-m}$ (which is larger than $|x-y|^\alpha$) if $\alpha > 1$. For $\alpha < 1$, the averaged sum converges to the conventional distance to the power $\alpha$; otherwise not. If

\[ f = \sum_I\langle f, h_I\rangle\,h_I, \]

then we claim $|f(x) - f(y)| \le d_T(x,y)^\alpha$ if $|\langle f, h_I\rangle| \le |I|^{1/2}|I|^\alpha$. First note that if $f$ is Holder with exponent $\alpha$ in general, then it's automatically "Holder" with respect to the tree and all shifted versions of the tree. Conversely, if a function satisfies the condition $|f(x) - f(y)| \le C\,d_T(x - \tau, y - \tau)^\alpha$ for all shifted trees, then integrating over $\tau$ gives

\[ |f(x) - f(y)| \le C\int_0^1d_T^\alpha(x - \tau, y - \tau)\,d\tau \approx C\,|x-y|^\alpha. \]
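The behavior of the shift-averaged tree distance can be explored numerically. The sketch below makes several assumptions that are not in the notes: shifts wrap around modulo 1 (so $[0,1)$ is treated as a circle), the average over $\tau$ is a Riemann sum over equally spaced shifts, the tree is cut off at a hypothetical `max_level`, and the function names are illustrative.

```python
import numpy as np

def tree_distance(x, y, max_level=30):
    """Length of the smallest dyadic interval of [0, 1) containing both x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dist = np.full(np.broadcast(x, y).shape, 2.0 ** (-max_level))
    undecided = np.ones(dist.shape, dtype=bool)
    for level in range(1, max_level + 1):
        # x and y are split at this level if they fall in different cells of width 2**-level
        split = undecided & (np.floor(x * 2**level) != np.floor(y * 2**level))
        dist = np.where(split, 2.0 ** (-(level - 1)), dist)
        undecided = undecided & ~split
    return dist

def shifted_tree_average(x, y, alpha, n_shifts=1 << 14):
    """Average of d_T(x - tau, y - tau)**alpha over shifts tau, with wrap-around."""
    tau = (np.arange(n_shifts) + 0.5) / n_shifts
    return float(np.mean(tree_distance((x - tau) % 1.0, (y - tau) % 1.0) ** alpha))

x, y = 0.3, 0.3 + 2.0 ** -10
ratio_half = shifted_tree_average(x, y, 0.5) / abs(x - y) ** 0.5
ratio_one = shifted_tree_average(x, y, 1.0) / abs(x - y)
```

For $\alpha = 1/2$ the ratio to $|x-y|^\alpha$ stays bounded, while for $\alpha = 1$ it grows like the number of levels $m$, in line with the $m\,2^{-m}$ estimate above.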

If it's Holder for all shifted trees, then it's Holder in the usual sense; and not otherwise.

Example. Assume $\frac{1}{|I|}\int_I|f(x) - m_I(f)|\,dx \le |I|^\alpha$, where $m_I$ is the mean value on the interval. This is the mean oscillation. $I$ is any interval, not necessarily dyadic. Then the claim is that this implies $|\langle f, h_I\rangle| \le |I|^{1/2+\alpha}$. To check this,

\[ \int f(x)\,h_I(x)\,dx = \int_I(f(x) - m_I(f))\,h_I(x)\,dx, \]

since the average value of $h_I$ is 0 on the interval, and this is

\[ \le \frac{1}{|I|^{1/2}}\int_I|f - m_If|\,dx. \]

Back to basics. Partition tree: broke up the interval into unions of subsets. Back to the circle. Take a periodic function of period $2\pi$; you can write every such function in terms of its Fourier series.

\[ u(\theta,t) = A_t(f) = \sum_{k=-\infty}^{\infty}e^{-tk^2}\hat f_ke^{ik\theta} = e^{t\frac{d^2}{d\theta^2}}f = \frac{1}{2\pi}\int_{-\pi}^{\pi}g_t(\theta)\,f(\psi - \theta)\,d\theta, \]

where

\[ g_t(\theta) = \sum_ke^{-tk^2}e^{ik\theta} = c\sum_je^{-(\theta - 2\pi j)^2/4t}, \qquad c = \sqrt{\pi/t}, \]

a sum of Gaussian kernels: the periodized Gaussian kernel. This is the dear old Jacobi theta function. This is the closest to an analytic expression you're ever gonna get.

\[ P_t(f) = \sum_k\hat f_ke^{-|k|t}e^{ik\theta} = \sum_kr^{|k|}\hat f_ke^{ik\theta} = \frac{1}{2\pi}\int_{-\pi}^{\pi}P_r(\theta)\,f(\psi - \theta)\,d\theta \]

with $r = e^{-t}$, where

\[ P_r(\theta) = \frac{1 - r^2}{|1 - e^{i\theta}r|^2}. \]

$P_t$ is the Poisson semigroup, and $A_t$ is the diffusion semigroup. Why semigroup? $P_tP_s = P_{t+s}$ and $A_tA_s = A_{t+s}$. Construct rings in the circle, of different levels of coarseness: $B_k = P_{1-2^{-k}}$. Then $|B_kf - B_{k+1}f| \le 2^{-k\alpha}$ is equivalent to $|f(x) - f(y)| \le |x-y|^\alpha$; here, the $B_k$ are Poisson kernels $P_{1-2^{-k}}$. An estimate of the variation of the function in the radial variable can give us the boundary function; this is a theme from antiquity. We started with a function on the line, made it "nice" by averaging it on different interval scales, and did the analysis on the multi-scale tree.

Poisson semigroup: $\Delta u(x,y) = 0$, $u(x,0) = f(x)$, and, up to normalizing constants,

\[ u(x,y) = \int\frac{y}{(x-t)^2 + y^2}\,f(t)\,dt = \int e^{ix\xi}e^{-|\xi|y}\hat f(\xi)\,d\xi. \]

Diffusion semigroup:

\[ \frac{1}{\sqrt t}\int e^{-|x-y|^2/t}f(y)\,dy. \]

In both cases, we have a fixed $\phi(x)$, integrable, with $\int\phi(x)\,dx = 1$, and $\phi_\lambda(x) = \frac1\lambda\phi(x/\lambda)$. In the Poisson case, we have $\phi = \frac{1}{1+x^2}$, $\lambda = y$. For the diffusion, $\phi = e^{-|x|^2}$ and $\lambda = \sqrt t$.

\[ A_\lambda(f)(x) = \int\phi_\lambda(u)\,f(x-u)\,du \]

is the general rule. It doesn't matter what $\phi$ is; it can be a characteristic function. If $\phi = \chi_{[-1/2,1/2]}$ then

\[ A_\lambda(f) = \frac1\lambda\int_{|u|<\lambda/2}f(x-u)\,du. \]

In general, $|A_\lambda f - A_{\lambda/2}f| \le c\lambda^\alpha$ is equivalent to $|f(x) - f(y)| \le |x-y|^\alpha$. We already proved this for the case of $A$ as the characteristic function. The claim is that this is true for all possible $A$.

11

Lecture 10

On $\mathbb{R}^1$, let $\int\phi(x)\,dx = 1$, $\phi_\lambda(x) = \frac1\lambda\phi(x/\lambda)$, and

\[ A_\lambda(f) = \int f(x-t)\,\phi_\lambda(t)\,dt. \]

Take $\phi(t) = 1$ on $[-1/2, 1/2]$ and zero elsewhere. Then

\[ A_\lambda(f) = \frac1\lambda\int_{|t|<\lambda/2}f(x-t)\,dt, \]

averaging on an interval. Other kernels (up to normalizing constants): Gaussian, $\frac1\lambda e^{-t^2/\lambda^2}$; Poisson, $\frac1\pi\,\frac{\lambda}{t^2 + \lambda^2}$. These allow you to, e.g., solve the heat equation at time $t$: $A_\lambda(f) = u(x,\lambda^2)$, the solution of the heat equation at time $\lambda^2$.

What does this have to do with the tree basis? Assign the function $f$ the average value on each interval of the partition. Analyzing how the average changes as we go vertically tells us something about the smoothness of the original function. Two discretizations: on the one hand, we discretized the intervals; on the other hand we also discretized the $\lambda$'s, to be $2^{-n}$. The Haar expansion on the tree was just taking the differences of adjacent averages on each level. Regularity was just measured by the size of the differences of the averages. Generalizing this principle,

\[ \Delta_\lambda f = A_\lambda(f) - A_{\lambda/2}(f) = \int\psi_\lambda(t)\,f(x-t)\,dt \]

or

\[ Q_\lambda f = \lambda\frac{d}{d\lambda}A_\lambda = \int\psi_\lambda(t)\,f(x-t)\,dt, \]

where the former $\psi_\lambda$ is actually just $\phi_\lambda - \phi_{\lambda/2}$, and the latter $\frac1\lambda\psi(t/\lambda)$ is $\lambda\frac{d}{d\lambda}\big[\frac1\lambda\phi(t/\lambda)\big]$.

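If $|\Delta_\lambda f| \le \lambda^\alpha$, then $|f(x) - f(y)| \le |x-y|^\alpha$, and vice versa, as we showed last time. Theorem: we claim this is true in general. Assume $\psi(x)$ is integrable or is a measure with compact support. It could be like $\delta_0 - \delta_1$ or $\delta_{-1} + \delta_{+1} - 2\delta_0$. Also assume

\[ \int_0^\infty|\hat\psi(\xi)|^2\,\frac{d\xi}{|\xi|} > 0. \]

Let

\[ Q_\lambda(f) = \int_{-\infty}^{\infty}f(x-t)\,\psi_\lambda(t)\,dt = \int_{-\infty}^{\infty}f(x - \lambda t)\,\psi(t)\,dt. \]

Then $|Q_\lambda(f)| \le \lambda^\alpha$ iff $|f(x) - f(y)| \le c\,|x-y|^\alpha$. Take $d\psi = \delta_{+1} + \delta_{-1} - 2\delta_0$. Then $Q_\lambda(f) = f(x+\lambda) + f(x-\lambda) - 2f(x)$. This implies $|f(x+\lambda) - f(x)| \le \lambda^\alpha$. (Note: this approach is due to Calderon; $Q_\lambda(f)$ is called the continuous wavelet transform.) What we're doing is convolving the function with a wavelet at different locations and scales.

Build a function $\rho(\xi)$ that's identically 1 on an interval and zero outside an open interval around it. Now define $\hat\eta(\xi) = \hat\psi(\xi)\,\rho(\xi)\,c_\pm$, chopped off outside a certain range, where $c_+$ is a constant for positive $\xi$ and $c_-$ is a constant for negative $\xi$:

\[ \int_0^\infty|\hat\psi(t)|^2\rho(t)\,\frac{dt}{t} = \frac{1}{c_+}, \qquad \int_0^\infty|\hat\psi(-t)|^2\rho(t)\,\frac{dt}{t} = \frac{1}{c_-}. \]

So this function is compactly supported in the Fourier domain. So we can write

\[ f(x) = \int_0^\infty\eta_\lambda * u_\lambda\,\frac{d\lambda}{\lambda} \]

with $u_\lambda(x) = Q_\lambda(f)(x) = u(x,\lambda)$. You cannot reconstruct $f$ the usual way, by dividing by $\hat\psi$, since it vanishes, which is why we do it as above. Assume that $|u(x,\lambda)| \le c\lambda^\alpha$ for some arbitrary function $u$. Then

\[ f(x) = \int_0^\infty\eta_\lambda * u(\cdot,\lambda)\,\frac{d\lambda}{\lambda} \]

satisfies $|f(x) - f(y)| \le |x-y|^\alpha$. Take $|x-y| = r$. Then the part of $|f(x) - f(y)|$ coming from $\lambda > r$ is

\[ \Big|\int_{\lambda>r}\int\frac1\lambda\Big[\eta\Big(\frac{x-s}{\lambda}\Big) - \eta\Big(\frac{y-s}{\lambda}\Big)\Big]\,u(s,\lambda)\,ds\,\frac{d\lambda}{\lambda}\Big| \le \int_{\lambda>r}c\lambda^\alpha\int\Big|\eta\Big(\frac{x-y-s}{\lambda}\Big) - \eta\Big(\frac{-s}{\lambda}\Big)\Big|\,\frac1\lambda\,ds\,\frac{d\lambda}{\lambda}. \]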

12

Lecture 11

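What did we show? We showed that if you have $\psi$, compactly supported with mean 0, and $\psi_\lambda = \frac1\lambda\psi(x/\lambda)$, then $|\psi_\lambda * f| \le c\lambda^\alpha$ is equivalent to $|f(x) - f(y)| \le C\,|x-y|^\alpha$. Let

\[ f(x) = \int_0^\infty\!\!\int e^{ix\xi}\hat f(\xi)\,\hat\psi(\xi\lambda)\,d\xi\,\frac{d\lambda}{\lambda}, \qquad\text{provided}\quad \int_0^\infty\hat\psi(\xi\lambda)\,\frac{d\lambda}{\lambda} = 1. \]

Write

\[ f(x) = \int_0^\infty u(x,\lambda)\,\frac{d\lambda}{\lambda}. \]

Replacing $u(x,\lambda)$ by $u(x,\lambda)\lambda^{-\beta}$, write

\[ D^\beta f = \int_0^\infty\lambda^{-\beta}u(x,\lambda)\,\frac{d\lambda}{\lambda} = \int\hat f(\xi)\,e^{ix\xi}\Big(\int_0^\infty\hat\psi(\xi\lambda)\,(\xi\lambda)^{-\beta}\,\frac{d\lambda}{\lambda}\Big)|\xi|^\beta\,d\xi = c\int\hat f(\xi)\,|\xi|^\beta e^{ix\xi}\,d\xi. \]

Fractional derivative of $f$. $\xi$ is the Fourier transform of $d/dx$. You look at the wavelet coefficients $\psi_\lambda * f$ of the function and multiply by $\lambda^{-\beta}$, which is like differentiating $\beta$ times. $|u(x,\lambda)\lambda^{-\beta}| \le \lambda^{\alpha-\beta}$ if $|u(x,\lambda)| \le \lambda^\alpha$. So the new function is Holder if the old function was Holder. Just changes the degree of integrability. Take a function $f$ on $[0,1]$. Look at

\[ f = \sum_I\langle f, h_I\rangle\,h_I(x). \]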

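If $|\langle f, h_I\rangle| \le C\,|I|^{1/2+\alpha}$ then $|f(x) - f(y)| \le d_T^\alpha(x,y)$ in the dyadic tree metric. Define

\[ D^\beta f = \sum_I\frac{1}{|I|^\beta}\langle f, h_I\rangle\,h_I(x) = g. \]

Then $|g(x) - g(y)| \le d_T^{\alpha-\beta}(x,y)$: you make it worse by exactly $\beta$ when you differentiate $\beta$ times.

Littlewood-Paley function:

\[ S_2(x) = \Big(\sum_Id_I^2\,\frac{\chi_I(x)}{|I|}\Big)^{1/2}. \]

It measures "activity" around $x$. It's like an envelope around $f$. One of its properties is that if you integrate $S_2^2$,

\[ \int S_2^2 = \sum_Id_I^2\int\frac{\chi_I(x)}{|I|}\,dx = \sum_Id_I^2 = \int f^2(x)\,dx. \]

It measures all the coefficients around $x$. In general $\|S_2\|_p \approx \|f\|_p$ for $1 < p < \infty$. Generalized:

\[ S_p(x) = \Big(\sum_I|d_I|^p\,\frac{\chi_I(x)}{|I|}\Big)^{1/p}. \]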

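Take $f \in B_1$, the space of $f = \sum_Id_Ih_I$ with $\|f\|_{B_1} = \sum_I|d_I|$. It seems there's no regularity here. But if we write $f$ as $\sum_Id_I\,|I|^{1/2}\cdot|I|^{-1/2}h_I$ we can show something. If $\sum_I|d_I| \le 1$ then given any $\lambda > 0$, $f = g_\lambda + b_\lambda$ where $|g_\lambda(x) - g_\lambda(y)| \le \lambda\,d_T(x,y)$ and the support of $b_\lambda$ has measure less than $1/\lambda$. Sparsity implies smoothness outside of a small set!

Let $E_\lambda = \{s_1(x) > \lambda\}$ where $s_1(x) = \sum_I|d_I|\,\chi_I(x)/|I|$. Then

\[ |E_\lambda| \le \frac{\int s_1(x)\,dx}{\lambda} = \frac{\sum_I|d_I|}{\lambda}. \]

Define

\[ g_\lambda(x) = \sum_{I\cap E_\lambda^c\ne\emptyset}d_Ih_I(x). \]

For these $I$, $|d_I| \le \lambda\,|I|^{1/2+1/2} = \lambda\,|I|$. The other $I$'s are completely contained in $E_\lambda$, which is bigger than the support of $b_\lambda$. Take $f = \sum_Id_Ih_I$ and look at

\[ D^{1/2}f = \sum_I\frac{d_I}{|I|^{1/2}}\,h_I, \qquad |D^{1/2}f|(x) \le \sum_I|d_I|\,\frac{\chi_I(x)}{|I|}, \]

so $\|D^{1/2}f\|_1 \le \sum_I|d_I| < \infty$. We're using the fact that $|h_I| = \chi_I/|I|^{1/2}$.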

Let $K \subset \mathbb{R}^n$. Let's try to build a geometry on the points so that all the coordinates of the points satisfy this kind of condition. Take a small discretization scale $\varepsilon$. Cover the set of points by a maximal collection of balls with diameter $\varepsilon$ which cover the set. Any point in the set is at distance $\le \varepsilon$ from some $x_i$, the center of a ball. Replace this by a partition; assign to each point the nearest $x_i$. This is called a Voronoi diagram. Then connect them into supercells (all the cells bordering a cell). This creates a tree. $|x-y| \le d_T(x,y)^1$ by definition! I can evaluate how good my tree is by how small the coefficients are. What if I'm in a high dimensional space? And we want to organize coordinates *and* data?

13

Lecture 11

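\[ \Big(\int_0^x\!\!\int_0^t\!\!\int_0^sf(u)\,du\,ds\,dt\Big)''' = f(x): \]

integrate and differentiate; fundamental theorem. More generally,

\[ I^nf(x) = \frac{1}{(n-1)!}\int_0^x(x-u)^{n-1}f(u)\,du, \qquad (I^nf)^{(n)} = f(x). \]

Define the fractional integral

\[ I^\alpha f(x) = \frac{1}{\Gamma(\alpha)}\int_0^x(x-u)^{\alpha-1}f(u)\,du. \]

And define

\[ D^\alpha f(x) = C_\alpha\int_{-\infty}^{\infty}\frac{\operatorname{sgn}(x-y)}{|x-y|^\alpha}\,f(y)\,dy = c_\alpha\int e^{ix\xi}\,|\xi|^\alpha\hat f(\xi)\,d\xi. \]

Special cases: $\alpha = 0$ means $|D|^\alpha f = f$. $\alpha = 1$ means $|D|^\alpha f = \int\frac{1}{x-y}\,f(y)\,dy$, the Hilbert transform.

\[ f(x) = \int e^{ix\xi}\hat f(\xi)\,d\xi, \qquad L(e^{ix\xi}) = \hat L(\xi)\,e^{ix\xi}, \qquad \frac1i\frac{d}{dx}e^{ix\xi} = \xi\,e^{ix\xi}, \qquad F\Big(\frac1i\frac{d}{dx}\Big)e^{ix\xi} = F(\xi)\,e^{ix\xi}. \]

Now back to where we were. If $f$ is expanded in a Haar series

\[ f = \sum_I\langle f, h_I\rangle\,h_I + f_0h_0, \qquad \Delta_lf = (E_{l+1} - E_l)f = \sum_{|I|=2^{-l}}\langle f, h_I\rangle\,h_I, \]

now we define a regularity operator

\[ \sqrt{D}f = \sum_l2^{l/2}\,\Delta_l(f). \]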

Last time: if $\sum_I|d_I| < 1$ where $f = \sum_Id_Ih_I$ then $D^{1/2}f$ is integrable. Half a derivative is in $L^1$ when the coefficients are sparse.

We say that $X$ is a balanced partition tree if for each $l$, $X = \bigcup_jX_j^l$ with $X_j^l \cap X_{j'}^l = \emptyset$ for $j \ne j'$, every $X_j^l$ is contained in some $X_{j'}^{l-1}$, and the ratios of the numbers of folders on successive levels are bounded between constants.

Write

\[ f = \sum_l(E_{l+1} - E_l)f + E_0f = \sum_l\Delta_l(f) + E_0f, \]

where the $\Delta_l$ are orthogonal projections on $W_l$. No Haar functions here. The Haar transform is just the difference between coefficients of successive levels. To each node it assigns its average, and it assigns to each edge the difference between the two nodes. Where do you have large coefficients? Small coefficients? The synthesis is just integration. Suppose the function is noisy. When I do the averages I repress a lot of the noise. So we can resynthesize the function with less noise. If $|f(x) - f(y)| \le d_T(x,y)^\alpha$, then $|\Delta_lf| \le C\,|X_j^l|^\alpha$ on $X_j^l$.

14

Lecture 12

Let $X$ be a balanced partition tree: $X = \bigcup_k X_k^l$ with the $X_k^l$ disjoint, and every $X_k^l$ contained in some $X_{k'}^{l-1}$. We have the condition that

\[ \delta\,|X_{k'}^{l-1}| \ge |X_k^l| \ge c\,|X_{k'}^{l-1}| \]

uniformly over the whole tree. So the folder size decays exponentially, but not too fast. A space of homogeneous type is a metric space which is also a measure space such that balls satisfy $|B_r(x)| \ge c\,|B_{2r}(x)|$. Given a function $f$ defined on $X$, I can find various tree transforms of $f$. Transform defined on the edges: differences of samples at the endpoints. Synthesis is integration, or addition of differences along a path. $|f(x) - f(y)| \le d_T(x, y)^\alpha$ is equivalent to

\[ |d_k^l| \le |X_k^l|^\alpha \]

by definition, since $d_T(x, y) = |X_k^l|$ for the smallest folder $X_k^l$ containing both $x$ and $y$.

Return to $[0, 1] \times [0, 1]$. The square. We organize functions of two variables $f(x, y)$. For instance, a document database, with words and documents. Suppose I have $10^5$ words and $10^5$ documents. Can I build a table for all of this that has only $10^5$ numbers? Can I compress? Build a conceptual tree of words, and a tree of documents. If there is any justice, things will go together. We want to jointly rearrange the whole system. We know that you can reorder a one-dimensional function to make it smooth, that is, Hölder outside a small set. We want to do something similar for a function of two variables. Permuting the vectors in one dimension and then in the other? We want to find a geometry on the columns and rows so that the function will be as smooth as possible in both the $x$ and $y$ variables. We also want to do this efficiently. Somebody gave you two trees: a dyadic tree in one direction and a dyadic tree in the other. Define the notion of regularity as follows. I have a Haar system $h_I(x)$ and a Haar system $h_J(y)$. First write

\[ f(x, y) = \sum_I d_I(y)\, h_I(x) \]

and then write

\[ d_I(y) = \sum_J d_{I,J}\, h_J(y) \]

so

\[ f(x, y) = \sum_{I,J} d_{I,J}\, h_I(x)\, h_J(y). \]

Look at four points, $x_0, x_1, y_0, y_1$, the corners of a rectangle $R$.

\[ \Delta_R^2 f = f(x_0, y_1) - f(x_1, y_1) - f(x_0, y_0) + f(x_1, y_0). \]

Like a second derivative: up to sign it is approximately $\frac{\partial^2 f}{\partial x \partial y}\left( \tfrac12 (x_0 + x_1), \tfrac12 (y_0 + y_1) \right) |R|$. A function $f$ is said to be bi-Hölder with exponent $\alpha$ if $|\Delta_R^2 f| \le C\,|R|^\alpha$.

Theorem: if $\sum_R |d_R| \le 1$, then for any $\lambda > 0$, $f = g_\lambda + b_\lambda$ where $|\Delta_R^2 g_\lambda| \le \lambda |R|^{1/2}$ and the support of $b_\lambda$ has measure $\le 1/\lambda$.

The classical version of this in one dimension would be the Rising Sun Lemma. The multivariate classical analogue is unproven.

Example: suppose this were actually a matrix, and the matrix is not full. Missing values. The missing value can be given by the value just below, plus the $y$-derivative (taken from the $y$-difference of adjacent values), plus an error of the order of $|R|^\alpha$. Estimating a missing value from nearby values. You can only do that if the function is bi-Hölder. This could be, for instance, a recommendation engine.
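A small numerical sketch of the mixed difference and of the missing-value estimate may help here. The two test functions and the helper names are illustrative assumptions; the sign convention for the mixed difference is the one written above.

```python
import math

def mixed_difference(f, x0, x1, y0, y1):
    """Mixed second difference of f over the rectangle [x0, x1] x [y0, y1]."""
    return f(x0, y1) - f(x1, y1) - f(x0, y0) + f(x1, y0)

def fill_missing(f, x0, x1, y0, y1):
    """Estimate f(x0, y1) from the value just below, f(x0, y0), plus the
    y-increment read off the adjacent column x1."""
    return f(x0, y0) + (f(x1, y1) - f(x1, y0))

additive = lambda x, y: math.sin(x) + y * y   # no x-y interaction
product = lambda x, y: x * y                  # mixed derivative equal to 1

d_add = mixed_difference(additive, 0.0, 0.5, 0.0, 0.25)
d_prod = mixed_difference(product, 0.0, 0.5, 0.0, 0.25)
estimate = fill_missing(product, 0.0, 0.5, 0.0, 0.25)
error = product(0.0, 0.25) - estimate         # equals the mixed difference
```

For a function of the form $g(x) + h(y)$ the mixed difference vanishes and the estimate is exact; in general the error of the estimate is exactly the mixed difference, which is what the bi-Hölder condition controls.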


(Note: this procedure can be extended to truly sparse matrices to do recommendation engines properly.) We'll prove the theorem later. But it says that functions that are well represented in the basis have Hölder regularity, and the converse is also true. The set of samples we need is the set of all centers of dyadic rectangles. You don't need more than $k 2^k$ centers if $2^{-k}$ is your smallest size.

15

Lecture 13

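\[ |f(x, y) - f(x', y) - f(x, y') + f(x', y')| \le d_1(x, x')^\alpha\, d_2(y, y')^\alpha \]

Equivalent to a condition on the coefficients in the basis $h_R(x, y) = h_I(x) h_J(y)$, $R = I \times J$, where

\[ f(x, y) = \sum_I d_I(y)\, h_I(x) = \sum_{I,J} d_{I,J}\, h_I(x)\, h_J(y) = \sum_R d_R h_R. \]

Claim:

\[ \frac{1}{|I|^{1/2}}\, |d_I(y) - d_I(y')| \le |I|^\alpha\, d_2(y, y')^\alpha \]

using the same theorem as always.

\[ \frac{1}{|I|^{1/2} |J|^{1/2}}\, |\langle d_I(y), h_J(y) \rangle| \le |J|^\alpha |I|^\alpha \]

Note that all of this is independent of dimension! $|g(x) - g(x')| \le d_1(x, x')^\alpha$ is equivalent to

\[ g(x) = \sum_I d_I h_I, \qquad \frac{1}{|I|^{1/2}}\, |d_I| \le c\,|I|^\alpha. \]

Split the expansion at scale $\varepsilon = 2^{-m}$:

\[ f(x, y) = \sum_{|R| > \varepsilon} d_R h_R + e_\varepsilon \]

\[ |e_\varepsilon| \le \sum_{|R| \le \varepsilon} |d_R|\, \frac{\chi_R(x, y)}{|R|^{1/2}} \le \sum_{|R| \le \varepsilon} |R|^\alpha \chi_R(x, y) \le \sum_{l \ge m} l\, 2^{-l\alpha} \sim m\, 2^{-m\alpha} \]

Theorem: If $f$ is bi-Hölder with exponent $\alpha$ and $f(x_i, y_i)$ is known on a sparse grid corresponding to rectangles of area $\ge \varepsilon$, then $f$ can be approximated to error $\varepsilon^\alpha \log(1/\varepsilon)$. Exercise: prove Smolyak's theorem.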

16

Lecture 13

Questionnaires and questions, $X_p$ and $Y_q$, in a matrix.

\[ X = \bigcup_{k=0}^{m_l} X_k^l \]

The $X_k^l$ are disjoint in $k$, and $X_k^{l+1}$ is contained in $X_{k'}^l$ for some $k'$. A sparse grid is just a selection of a single point $x_k^l \in X_k^l$. Define an approximation $P_l(f)$ to be the sampling of the function at that level. $\chi_k^l$ is the characteristic function of $X_k^l$.

\[ P_l(f) = \sum_k f(x_k^l)\, \chi_k^l(x) \]

\[ |f(x) - f(x')| \le d_T(x, x')^\alpha \]

where $d_T$ is the size of the smallest $X_k^l$ containing $x, x'$.

\[ f = \sum_l \left[ P_{l+1}(f) - P_l(f) \right] + P_0(f) = \sum_l \sum_k \left[ f(x_k^{l+1}) - f(x_{k'}^l) \right] \chi_k^{l+1}(x) + f(x_0^0) = \sum_l \sum_k \delta_k^l\, \chi_k^{l+1} + f(x_0^0) \]

$f$ is Hölder iff $|\delta_k^l| \le C\,|\chi_k^l|^\alpha$.

Now, back to the questionnaires. Say you have a tree $X_k^l$ on the questions and $Y_j^m$ on the respondents.

\[ f(x, y) = \sum_{l,k} \sum_{m,j} \chi_k^{l+1}(x)\, \chi_j^{m+1}(y) \left[ f(x_k^{l+1}, y_j^{m+1}) - f(x_{k'}^l, y_j^{m+1}) - f(x_k^{l+1}, y_{j'}^m) + f(x_{k'}^l, y_{j'}^m) \right] = \sum \delta_{k,j}^{l,m}\, \chi_{k,j}^{l,m}(x, y) \]


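Number of rectangles $R$ such that $|R| > \varepsilon$, that is, the number of terms in $\sum_{|R| > \varepsilon} d_R h_R$: about $\frac{1}{\varepsilon} \log \frac{1}{\varepsilon}$.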
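The count of rectangles can be checked directly. The sketch below assumes dyadic rectangles $I \times J$ in the unit square with $|I| = 2^{-i}$, $|J| = 2^{-j}$, and takes $\varepsilon = 2^{-m}$ (counting area $\ge \varepsilon$); the function names are hypothetical.

```python
def sparse_grid_size(m):
    """Number of dyadic rectangles I x J in the unit square with area >= 2**-m.
    There are 2**i intervals of length 2**-i, so 2**(i + j) rectangles per (i, j)."""
    return sum(2 ** (i + j) for i in range(m + 1) for j in range(m + 1 - i))

def full_grid_size(m):
    """Number of cells in the full grid with side 2**-m in each variable."""
    return 4 ** m

m = 10
n_sparse = sparse_grid_size(m)   # grows like m * 2**m, i.e. (1/eps) log(1/eps)
n_full = full_grid_size(m)       # grows like 4**m, i.e. 1/eps**2
```

The closed form of the sum is $m\,2^{m+1} + 1$, so the sparse count is of order $\frac{1}{\varepsilon}\log\frac{1}{\varepsilon}$, against $1/\varepsilon^2$ for the full grid.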

17

Lecture 14

Find a tree on each dimension of the 2-d matrix. If you had a function that was bi-Hölder, you could sample it more sparsely, and reconstruct it from mixed derivatives. If you were to take a Euclidean distance hierarchical quantization tree, and each row is Lipschitz relative to that tree, you can organize the tree on the other dimension with respect to that organization. The size of the constant depends on how much one dimension depends on the other. You can't necessarily do it in high dimension.

Fundamental question: You have a collection of functions on a population. Want to organize the analysis of the population so that the organization will be as smooth as possible. Want to be able to say something about values of functions in the data.

Suppose you look at an atlas. Points are points on the globe. Each point has a collection of numbers attached to it. Could also have a demographic or political profile. Every location has a collection of functions attached to it. If I want to organize an atlas, a collection of maps on the globe, you build a tree. Globe, continents, countries, etc. A different atlas for climate. Variability in climate is dominated by latitude. It's a different geometry than Euclidean distance.

Consider the spiral. $S(x, y)$ is the distance along the curve. This is LOUSY in terms of Euclidean coordinates. You need intrinsic coordinates. Or consider $\sin\left(\frac{1}{x + \delta}\right)$. It's bad: it oscillates a lot. But if you look at it as a function on the graph of itself, it's nice. $\left| \frac{d}{ds} \sin(x(s)) \right| \le 1$. Lipschitz! You can map the spiral to a curve in 2 dimensions, via the arclength parametrization.

More general situation:

\[ X_i \to \tilde x_i^{(t)} = \left( \lambda_1^t \phi_1(x_i),\ \lambda_2^t \phi_2(x_i),\ \dots,\ \lambda_N^t \phi_N(x_i) \right) \]

\[ \tilde x_i \cdot \tilde x_j = \sum_{l \ge 1} \lambda_l^{2t} \phi_l(x_i) \phi_l(x_j) = A^{2t}(i, j) - \phi_0(x_i) \phi_0(x_j) \]

(the sum starts at $l = 1$, so the term of the top eigenvector $\phi_0$, with $\lambda_0 = 1$, is removed; it cancels in the distance when $\phi_0$ is constant)

\[ \|\tilde x_i - \tilde x_j\|^2 = A^{2t}(i, i) + A^{2t}(j, j) - 2 A^{2t}(i, j) = \sum_k |A^t(i, k) - A^t(j, k)|^2 \]

Geometric interpretation: link each point to its neighbors; can link to higher order by taking the adjacency matrix to a power. Distance is diffusion.
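The identity between the embedded distance and the difference of rows of $A^t$ can be checked numerically. The sketch below is an illustration under stated assumptions: random points in the plane, a Gaussian affinity with an arbitrary bandwidth, a symmetric normalization of the kernel, and all eigenvectors kept in the embedding (including the top one) so that the identity is exact.

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.random((30, 2))                        # assumed sample data
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / 0.1)                            # Gaussian affinity, assumed bandwidth
d = K.sum(axis=1)
A = K / np.sqrt(np.outer(d, d))                  # symmetric normalization

lam, phi = np.linalg.eigh(A)                     # A = phi @ diag(lam) @ phi.T
t = 3
emb = phi * lam ** t                             # row i is (lam_l**t * phi_l(x_i))_l

At = np.linalg.matrix_power(A, t)
A2t = np.linalg.matrix_power(A, 2 * t)
i, j = 0, 7
embedded = np.sum((emb[i] - emb[j]) ** 2)        # squared distance in the embedding
from_powers = A2t[i, i] + A2t[j, j] - 2 * A2t[i, j]
diffusion = np.sum((At[i] - At[j]) ** 2)         # squared difference of rows of A**t
```

All three quantities agree up to rounding, which is the statement that Euclidean distance in the embedding is the diffusion distance at time $t$.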

Distance $d_1 = d_1(\tilde x_i, \tilde x_j)$: just nearest neighbors. Everything that is not a neighbor is at distance 2. Shrink this down into the first embedding, and then again take a maximal subcollection such that $d_2(x_i, x_j) \ge 1$. These are the distances after time two. We are doing exactly what we were doing in the Euclidean case, but the distance at different levels is different. So you can view this as a tree of points. Each folder is the set of points that are linked at the scale of the folder. Probabilistic interpretation: the probability that you've diffused out that distance by that time. For small $t$, the folder is a small spherical cap, and may as well be geodesic. For large $t$ it is not. For example, think of a dumbbell: diffusion distance across the neck is large, while points within one end of the dumbbell are much closer.

18

Lecture 14

Pick a black spot. The organization of local patches is naturally parametrized by their average and the orientation of the edge. The "folders" actually extract portions of the curve on the edge. Started with small folders and then agglomerated them bigger and bigger. Alternatively, organize a domain in the plane by breaking it into partitions. How do we form the partitions? The first eigenvector gives the direction of greatest variance, and that forms the direction of the dividing line. Divide and divide again. It will generate regions which are as "fat" as possible. This is a top-down hierarchical construction of the tree. Another method: sampling the data at finer and finer points. This is the bottom-up way. Suppose you have points on a line, with $f(x_i)$ known. Want to extend the function to every other point. Think of this as a classifier. $d_I = \langle f, h_I \rangle$, $f(x) = \sum_I d_I h_I$. Look at all possible expansions $\sum_I a_I h_I(x_i)$ which agree with the given $f(x_i)$. Best fit. Minimize $\sum_I |a_I|$. Want your function described in the simplest, sparsest way you can. Easiest way to extrapolate: $\sum_i f(x_i) \chi_{I_i}(x)$, extrapolate with a constant. But this is not necessarily sparse or simple.

Exercise: I have a function which takes only two values: $S_1$ and $S_2$. I have three possible functions in which to represent it. I have intervals $I_1$, $I_2$, and $I_1 \cup I_2$. Want to represent $f = \alpha_1 \chi_{I_1} + \alpha_2 \chi_{I_2} + \alpha_3 \chi_I$, where $I = I_1 \cup I_2$. Want to minimize $2|\alpha_1| + 2|\alpha_2| + |\alpha_3|$. Hint: pick the mean value of $S_1, S_2$ on the full interval, and pick the correction. Haar expansion. That should be better. In a sense we're also imposing smoothness when we impose sparsity. A minimal representation of $f$ based on characteristic functions will be defined everywhere. It's an extrapolation. But I want the extrapolation to be as consistent and as smooth as possible. Haar expansions, recall, do NOT satisfy standard properties of Fourier expansions. As we've seen, though, the fact that $\sum_I |a_I|$ is finite means that $f$ is Hölder of order 1/2, except for a set of small measure.

Classifier on zip codes. You build a graph, then a tree, then a function on that tree. Fit the Haar coefficients to the samples. This gets you to state-of-the-art classification error.

Let's go back to the questionnaire. People vs. questions, and the function for the depression score, known as $d(p_i)$ on the people $p_i$ we have, and we can predict it for new people. Get a $\bar d$, the simplest organization of the depression score based on the data we have. Candidate score. I can add this as an extra question now! I can give this question a very large weight. Now two people who used to be close will be farther, and two people who used to be farther become closer. It will reorganize the tree geometry of people. This then changes the relationship between questions.

Consider bumps: $e^{-|x-j|^2/2}$. What is the class of functions that can be represented as $\sum_j \alpha_j e^{-|x-j|^2/2}$? These functions are far from being orthogonal. But I can find the representation which is simplest: minimize $\sum_k |\alpha_k|$. If I have a function in this space, what is a good grid of $x_k$'s to sample so that I'll be able to reconstruct exactly? Define $g(\theta) = \sum_k e^{ik\theta} e^{-k^2}$. Look at all the shifts:

\[ \int g(\theta - \psi)\, \alpha(\psi)\,d\psi = F_\alpha(\theta) \]

What is the dimension of the space of such $F$? Up to a certain degree of precision. The answer is trivial:

\[ F_\alpha(\theta) = \sum_k \hat\alpha_k\, e^{-k^2} e^{ik\theta} \]

Well, if we keep only $|k| \le 5$, we'll have an error of $O(e^{-24})$. The dimension of this is 11. Eleven numbers tell me practically everything. The local rank of the projection on this Gaussian-bell space is the same as on the circle. It's not an infinite dimensional collection; it's very low rank.
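The dimension count can be made concrete with a few lines of arithmetic. The sketch assumes the Fourier multipliers are exactly $e^{-k^2}$ as in the text; the function name and the cutoff on the tail sum are hypothetical.

```python
import math

def truncation(K, tail_terms=50):
    """Keep frequencies |k| <= K of the multiplier exp(-k**2).
    Returns the number of retained coefficients and the sum of the
    multipliers that were dropped (both signs of k)."""
    dim = 2 * K + 1
    tail = sum(2.0 * math.exp(-k * k) for k in range(K + 1, K + 1 + tail_terms))
    return dim, tail

dim, tail = truncation(5)
```

With $K = 5$ there are 11 retained coefficients, and the total weight of the dropped multipliers is far below the $O(e^{-24})$ quoted above.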

19

Lecture 15

Let $\bar f(x) = \sum_i f(x_i) \chi_{I_i}(x)$. Just take step functions from the given points. If $|f(x) - f(y)| \le d_T(x, y)^\alpha$ then $\bar f$ has the same property. Suppose instead we take the triangle function, whose height is equal to $|I|$, the length of the interval, around each point $x_i$. More generally, take points $x_i^l$, centers of dyadic intervals, and take $\phi\left( \frac{x - x_i^l}{|I_i^l|} \right)$.

\[ \sum_{l,i} f(x_i^l)\, \phi\left( \frac{x - x_i^l}{|I_i^l|} \right) \]

is another way of interpolating. The derivative of the triangle function is a (rescaled) Haar function:

\[ \frac{d}{dx}\, \phi\left( \frac{x - x_i^l}{|I^l|} \right) = |I|^{-1/2} h_{I^l} \]

Suppose I have a function represented as

\[ \bar f(x) = \sum_{l,i} a_i^l\, \phi\left( \frac{x - x_i^l}{|I_i^l|} \right) \]

Then

\[ \bar f'(x) = \sum_{l,i} a_i^l\, \phi'\left( \frac{x - x_i^l}{|I_i^l|} \right) \frac{1}{|I_i^l|} \]

And the integral

\[ \int |\bar f'(x)|\,dx \le \left( \sum_{l,i} |a_i^l| \right) \left( \int |\phi'(t)|\,dt \right) \]

So if I want $\sum |a_i^l| \le 1$ then $f$ is of bounded variation. If you're given a $\phi$ and all possible scales, try to find a representation for $f$ which minimizes the $\ell^1$ norm of the coefficients. Take the orthogonal expansion, and this will give you the minimum. Note: Hardy spaces are defined by the fact that every function has a decomposition into slightly more general functions. This is a simple version.

Now we want to get off the line. Suppose we have $e^{-|x - \alpha|^2}$. What is the dimension of this collection of functions? Suppose $|\alpha| \le 1$, $|x| \le 1$. Keep the eigenvalues $\lambda_0 > \lambda_1 > \dots > 10^{-10}$. This gives about 10 digits. It's infinite dimensional, but up to good precision it's almost finite dimensional. The $\lambda_k = e^{-k^2}$, so since this decays so fast, 10 digits is plenty.

\[ f(x) = \sum_i a_i\, e^{-|x - \alpha_i|^2} \]

The function that you measure might be a bandlimited function. I can project into the space of bandlimited functions.

\[ \sum_{k:\ \lambda_k > 10^{-10}} \langle f, \phi_k \rangle\, \phi_k \]

But what if I want to represent it in the original kernel?

\[ \phi_k(x) = \frac{1}{\lambda_k} \int e^{-|x - \alpha|^2} \phi_k(\alpha)\,d\alpha \]

But if I could write the integral as a discrete sum, we'd be ok: $f$ would be represented in the kernel. It'll be overkill (too many coefficients) but it works. Need about 30 grid points $\alpha$ to get the same error.

What's really going on here? I have a matrix $e^{-|x-y|^2}$, with $x$ and $y$ in a dense collection of points, $(x, y) \in \Gamma \subset \mathbb{R}^n$. But we know the matrix has low rank! This is really a Gaussian. In each variable, this operator is a Gaussian in that variable. In each variable, the Fourier modes look like $e^{-z^2}$. So the rank is fixed. It does not exceed the minimal number of balls of radius $\varepsilon$ you need to cover the area.

How do I subsample this matrix in a way that guarantees I'll find samples which cover the whole range of the matrix? The theorem (Rokhlin, Tygert, Martinsson) concerns a matrix $a_{ij}$ of large size but low rank. The reduction is basically: encode the rows (or columns) of this matrix in a random code. Take a random vector of plus-minus ones. We're building a random matrix $\varepsilon_1, \dots, \varepsilon_L$, where $L$ is the rank. Take the inner product of the matrix with the code. Orthogonalize the rows of the resulting matrix. We build a dictionary that way. The rows you select at every step are the ones far away from the preceding ones. "Far away" means projecting in the orthogonal direction. So you've selected $L$ points: $L$ rows in the original matrix. This is a way to select a subset of rows in the matrix which span the same space as the original matrix. Equivalently, this picks the correct $\alpha_i$. (It's not the same as compressed sensing, but it is a projection pursuit, and it's in the same "spirit.")

But how do you compute the coefficients?

\[ \sum_i a_i\, e^{-|x - \alpha_i|^2} = f(x) \]

Normally I would do it by integration, but I don't want to integrate. I want to, for instance, find $a_i$ so that $\sum |a_i|$ is minimal.

\[ \min \left( \left\| f(x) - \sum_i a_i\, e^{-|x - \alpha_i|^2} \right\|^2 + \mu \sum_i |a_i| \right) \]
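The row-selection procedure described above (random plus-minus code, then greedy orthogonalization) can be sketched as follows. This is a simplified illustration and not the published algorithm: the grid of 200 points on $[-1, 1]$, the choice $L = 10$, and the function name `select_rows` are assumptions made for the example.

```python
import numpy as np

def select_rows(M, L, rng):
    """Pick L rows of M using a random +-1 code and greedy orthogonalization."""
    E = rng.choice([-1.0, 1.0], size=(M.shape[1], L))  # random code, L sign vectors
    R = M @ E                                          # each row encoded by L numbers
    chosen = []
    for _ in range(L):
        norms = np.linalg.norm(R, axis=1)
        p = int(norms.argmax())        # row farthest from the span of chosen rows
        chosen.append(p)
        q = R[p] / norms[p]
        R = R - np.outer(R @ q, q)     # project every row orthogonally to q
    return chosen

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
M = np.exp(-(x[:, None] - x[None, :]) ** 2)    # Gaussian kernel matrix, low numerical rank
L = 10
chosen = select_rows(M, L, rng)

# Check that the selected rows nearly span the row space of M.
S = M[chosen]
coef, *_ = np.linalg.lstsq(S.T, M.T, rcond=None)
rel_resid = np.linalg.norm(M - coef.T @ S) / np.linalg.norm(M)
```

The points `x[chosen]` play the role of the centers $\alpha_i$: every row of the kernel matrix is reproduced, up to a small relative residual, as a combination of the selected rows.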

Penalize error and complexity. Or, instead of working with a Gaussian kernel at scale 1, rescale it to be half as wide: $\sum_i a_i\, e^{-4|x - \alpha_i|^2}$. Sharper Gaussian: use the points you already have and augment them with the next generation of points. The region affected shrinks. I can break up the space into boxes, and add points box by box; it's parallelizable, so to speak. Describe $f$ as being in the range of a coarse Gaussian, plus the range of a finer Gaussian, plus the range of an even finer Gaussian, etc.

What's the point? We had data; we built a graph. We can look at the eigenvectors of this graph, and embed into a Euclidean space. The problem is that those were eigenvectors of a matrix defined on the data. If I take a new point, where do I map it? I don't know.

If I were to take those eigenvectors f , as before, I’d have an extension to the rest of the universe. For example: consider the circle. There are Gaussian extensions of cos(θ), cos(2θ), . . .. The higher the frequency, the closer the peaks are to approximating the circle. A highly oscillating function will not extend much beyond the circle.

20

Lecture 16

Given a graph which is a curve, when you embed it into the first two eigenvectors, it's mapped to a circle. The first two eigenvectors are $\phi_1 = \cos\theta$, $\phi_2 = \sin\theta$. All the others are $\cos k\theta$, $\sin k\theta$, where the $k$ are integers. They correspond to $\lambda_k = e^{-k^2}$. Suppose $F(x) = \cos 8\theta$. Can I find a function $f(x) = \int_{|\xi| \le 16} e^{ix\xi} \eta(\xi)\,d\xi$ such that the $L^2$ norm of $f$ is minimal? In other words, is there a band-limited, minimal norm extension of $F$ beyond the circle? On the other hand, if we look at $\cos 100\theta$, we can't have it band-limited by 16. What we've proved: if $\Delta\psi = \lambda^2 \psi$ then

\[ \psi(x) = \int_{|\xi| \le c\lambda} e^{ix\xi} \eta(\xi)\,d\xi \]

Here the $c$ depends on how the manifold is sitting in the surrounding ambient space. You can always approximate $\psi$ by a bandlimited function whose band is bounded by a constant times the square root of the eigenvalue. You can also take a bandlimited function, restrict it to the manifold, and approximate it by eigenvectors of the Laplacian. How far you can extend depends on how wiggly the eigenfunction is, roughly. How it's embedded.

We have data points and outside points. So we're going to build two graphs. I have a million points, say. Let's randomly select 10,000 points. Take those randomly selected points as a reference library. Every patch in 121 dimensions can be compared to one of them. A library of representatives. I have a metric $\|x - y_i\|_i^2 = \Omega_i (x - y_i) \cdot (x - y_i)$ for some $\Omega_i > 0$.

\[ \Omega(x) = \sum_i e^{-\|x - y_i\|_i^2} \]

\[ \langle f, g \rangle_{\mathbb{R}^d} = \int f(x)\, g(x)\, \Omega^2(x)\,dx, \qquad \langle f, g \rangle_{\mathrm{Ref}} = \sum_i f_i g_i \]

\[ Af = \sum_i A(x, y_i)\, f(y_i) = F(x), \qquad A^t A \phi_l = \lambda_l^2 \phi_l \]

\[ \int A(x, y_i)\, A(x, y_j)\, \Omega^2(x)\,dx = \int e^{-\|x - y_i\|_i^2}\, e^{-\|x - y_j\|_j^2}\,dx \]

\[ = \int e^{-\Omega_i (x - y_i) \cdot (x - y_i)}\, e^{-\Omega_j (x - y_j) \cdot (x - y_j)}\,dx \]

\[ = \int e^{-\Omega_i (x - (y_i - y_j)) \cdot (x - (y_i - y_j))}\, e^{-\Omega_j x \cdot x}\,dx \]

\[ = \left( e^{-\Omega_i x \cdot x} * e^{-\Omega_j x \cdot x} \right)(\delta), \quad \text{where } \delta = y_i - y_j \]

\[ = e^{-(\Omega_i^{-1} + \Omega_j^{-1})^{-1} \delta \cdot \delta} \quad \text{(up to a normalizing constant)} \]

\[ = \frac{e^{-\|y_i - y_j\|_{i,j}^2}}{\omega_i \omega_j} \]

where

\[ A(x, y_i) = \frac{e^{-\|x - y_i\|_i^2}}{\Omega(x)\, \omega_i} = \sum_l \lambda_l\, \psi_l(x)\, \phi_l(y_i) \]

\[ A^t A \phi_l = \lambda_l^2 \phi_l, \qquad \psi_l(x) = \frac{1}{\lambda_l} A(\phi_l) \]

We showed that

\[ \sum_i \frac{e^{-\|y_i - y_j\|_{i,j}^2}}{\omega_i \omega_j}\, \phi_l(y_i) = \lambda_l^2\, \phi_l(y_j) \]

which would be the same as before if the Gaussians were the same. Now I can compare any point to my reference set via this kernel. I know the eigenvectors of the outside world don't depend on how many points I have. The whole image is a function of the pixels; it can be expanded in eigenfunctions of all the pixels. Any patch can be checked to see if it's similar to something in the image.

21

Lecture 16

Suppose I have a data set $x \in \Gamma$. Map it into a new set $y = Ax \in A\Gamma$. We want to define a distance or a graph structure that is invariant under such transformations. How do we do this? Mahalanobis distance.

\[ x^p \in \Gamma, \qquad x^p = (x_1^p, \dots, x_q^p) \]

So we write our data as a matrix $X$. Take the matrix $C = X X^t$ and diagonalize it. Write it as $C(q, q') = \sum_l \lambda_l^2\, O_l(q)\, O_l(q')$, that is, $C = O D O^t$. Define the distance between $x^p$ and $x^{p'}$ to be $d(x, x') = C^{-1}(x^p - x^{p'}) \cdot (x^p - x^{p'})$. If we look at $y$'s instead of $x$'s, $d(y, y')$, we have $C_A = A X X^t A^t = A C A^t$, so

\[ C_A^{-1}(A x^p - A x^{p'}) \cdot (A x^p - A x^{p'}) = (A^t)^{-1} C^{-1} A^{-1} A (x^p - x^{p'}) \cdot A (x^p - x^{p'}) = C^{-1}(x^p - x^{p'}) \cdot (x^p - x^{p'}) \]

which is just the original distance. The $\lambda_l$ express how much that coordinate varies over the data. The data that originally looked like an ellipsoid becomes a sphere.

Suppose you build a graph on the transformed data. You want a metric that does not depend on the function applied to the data, even if it is nonlinear. Local Mahalanobis? By taking only the data near the point. If I have some extra information I can do it. Suppose I have this nonlinear map $f(x)$. Want to define a graph on the new data which is independent of $f$: the same graph as for the old data. Near $x_0$, $f(x)$ looks like $f(x_0) + f'(x_0)(x - x_0) + O((x - x_0)^2)$. The inverse covariance matrix of the data near $x_0$ gives you the local Mahalanobis distance.

The real problem we want to solve: the so-called black box problem. Suppose I have data in a black box, mapped by some nonlinear transformation to some other place, where we see a collection of ellipsoids. The results of some experiment. $e^{-(\|y_j - y_i\|_i^2 + \|y_i - y_j\|_j^2)}$ can be our distance. Define a graph based on this distance. When I build a graph in the black box, the eigenvectors are products of a function of $x$ and a function of $y$. (This whole process is known as Nonlinear Independent Components Analysis.) What is the relationship between eigenvectors and the initial coordinates? Compute the discrete graph Laplacian from the Mahalanobis metric graph. The eigenvalue of $\psi_l(x_1)\, \psi_m(x_2)$ is a sum of the other two eigenvalues... they're orthogonal.
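The invariance of the Mahalanobis distance under an invertible linear map can be verified numerically. In this sketch the data are random Gaussian points stored as columns of $X$, and the matrix $A$ is an arbitrary invertible example (its determinant is 5); both are assumptions made for illustration.

```python
import numpy as np

def mahalanobis_sq(Z, p, q):
    """Squared Mahalanobis distance between columns p and q of Z,
    using C = Z Z^t computed from the data itself."""
    C = Z @ Z.T
    d = Z[:, p] - Z[:, q]
    return d @ np.linalg.solve(C, d)   # C^{-1} d . d without forming the inverse

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))          # columns are the data points x^p
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])        # assumed invertible linear map
Y = A @ X                              # transformed data y = A x

d_x = mahalanobis_sq(X, 0, 1)
d_y = mahalanobis_sq(Y, 0, 1)
euclid_x = np.sum((X[:, 0] - X[:, 1]) ** 2)
euclid_y = np.sum((Y[:, 0] - Y[:, 1]) ** 2)
```

`d_x` and `d_y` agree up to rounding, while the plain Euclidean distances `euclid_x` and `euclid_y` generally differ. A local version would compute `C` from the neighbors of each point only.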
