gbh qed

machine learning + computer vision + facts + proofs

Computing the dual


Here I will work out the derivation for the dual, and dual update equations, presented in Frome, Singer, Sha, and Malik, Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification, ICCV 2007.

Deriving the Dual

From equation (4) in the paper, we can add Lagrange multipliers to get the following objective function:

\min_{W,\xi} \max_{\beta,\alpha,\mu\geq 0} \frac{1}{2} \|W\|^2 + C\sum_{ijk} \xi_{ijk} - \sum_{ijk}\beta_{ijk}\xi_{ijk} - \sum_m \mu_m W_m + \sum_{ijk} \alpha_{ijk} (1 - \xi_{ijk} - W \cdot X_{ijk})

All the dual variables \beta,\alpha,\mu are constrained to be greater than or equal to 0, and they correspond to the original inequality constraints (\xi_{ijk} \geq 0, the margin constraints, and W \geq 0, respectively): if one of those constraints is violated, the inner maximum can be made to be +\infty, and so a violated constraint now corresponds to a positive infinite value in the minimization problem.

We form the dual by switching the \min and \max.  Since \max\min \leq \min\max, the dual is a lower bound on the original primal problem (weak duality).  For a convex optimization problem that satisfies a constraint qualification such as Slater's condition, strong duality holds and the dual optimum equals the primal optimum, so we can solve the dual instead of the primal.

To get from the above to equation (5) in the paper, we first take the derivative with respect to \xi_{ijk}.  This yields, \forall ijk,

C - \beta_{ijk} - \alpha_{ijk} = 0

So \beta_{ijk} = C - \alpha_{ijk}.  Since \beta_{ijk} \geq 0, we now have the constraint \alpha_{ijk} \leq C, and we can substitute the expression for \beta into the above equation, and simplify to get:

\max_{\alpha,\mu} \min_{W} \frac{1}{2} \|W\|^2 - \sum_m \mu_m W_m + \sum_{ijk} \alpha_{ijk} (1 - W \cdot X_{ijk})

s.t. \forall ijk : 0 \leq \alpha_{ijk} \leq C

\forall m: \mu_m \geq 0

Now we take the derivative with respect to each component W_m and get:

W_m - \mu_m - \sum_{ijk} \alpha_{ijk} X_{ijk,m} = 0

Therefore, W = \sum_{ijk} \alpha_{ijk}X_{ijk} + \mu.

If we now plug this expression for W back into the objective, expand the L2 norm, and simplify, we get equation (5) in the paper:

\max_{\alpha,\mu} -\frac{1}{2} \|\sum_{ijk} \alpha_{ijk}X_{ijk} + \mu\|^2 + \sum_{ijk} \alpha_{ijk}

s.t. \forall ijk : 0 \leq \alpha_{ijk} \leq C

\forall m: \mu_m \geq 0
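For completeness, here is the algebra behind that substitution.  Writing W = \sum_{ijk} \alpha_{ijk}X_{ijk} + \mu, the inner objective becomes

\frac{1}{2} \|W\|^2 - \sum_m \mu_m W_m + \sum_{ijk} \alpha_{ijk} (1 - W \cdot X_{ijk}) = \frac{1}{2} \|W\|^2 - W \cdot \left(\mu + \sum_{ijk} \alpha_{ijk}X_{ijk}\right) + \sum_{ijk} \alpha_{ijk} = -\frac{1}{2} \|W\|^2 + \sum_{ijk} \alpha_{ijk}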

Deriving the Updates

Now we can just do coordinate ascent on the dual variables.  One thing to note is that, in the ordinary SVM formulation, there is an equality constraint that ties together all the dual variables \alpha_i.  Therefore, if you try to do coordinate ascent by just optimizing a single dual variable at a time, its value will be determined by all the rest, and the objective will stay the same.  Dual solvers for SVMs thus optimize a pair of dual variables at a time.

In this case, there is no such equality constraint, so we can use ordinary coordinate ascent.  Taking the derivative with respect to \mu_m, we have

\sum_{ijk} \alpha_{ijk}X_{ijk,m} + \mu_m = 0

We also know that \mu_m \geq 0, so for any component of \sum_{ijk} \alpha_{ijk}X_{ijk} that is negative, the corresponding component of \mu is set to make the sum equal 0; that is, \mu_m = \max(0, -\sum_{ijk} \alpha_{ijk}X_{ijk,m}).  Since W = \sum_{ijk} \alpha_{ijk}X_{ijk} + \mu, this is just enforcing the non-negativity of W.

Now, taking the derivative with respect to \alpha_{ijk}, we get

- \left(\sum_{i'j'k'} \alpha_{i'j'k'}X_{i'j'k'}\right)\cdot X_{ijk} - X_{ijk}\cdot \mu + 1 = 0

Solving for \alpha_{ijk} by pulling its term out of the sum, we get

\alpha_{ijk} = \frac{1 - (\sum_{i'j'k' \neq ijk} \alpha_{i'j'k'}X_{i'j'k'} + \mu)\cdot X_{ijk}}{\|X_{ijk}\|^2}

Adding and subtracting an additional \alpha_{ijk}X_{ijk}\cdot X_{ijk} on the right, and using our expression for W and \mu above, we get:

\alpha_{ijk} \leftarrow \frac{1-W\cdot X_{ijk}}{\|X_{ijk}\|^2} + \alpha_{ijk}

Since the dual objective is concave and quadratic in \alpha_{ijk}, the updated value is then clipped to the interval [0, C] to satisfy the box constraint.

Andrea Frome’s thesis gives more details on the algorithm.  Essentially, there is both an outer loop and an inner loop.  In the inner loop, dual variables \alpha_{ijk} are randomly reordered and updated.  If a variable becomes bounded, i.e. \alpha_{ijk} = 0 or C, then it is no longer updated in the next iteration of the inner loop.  The rationale is that, once a variable becomes bounded, it is not likely to change much, so time should be spent updating the other variables instead.  The inner loop ends once the unbounded variables do not change, or the change in the dual objective is less than some threshold.  At this point, all variables are marked as unbounded and the outer loop repeats.
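To make the procedure concrete, here is a minimal numpy sketch of these updates.  This is not Frome's implementation: I assume the X_{ijk} vectors are stacked as rows of a dense array, the stopping rule is simplified (change in the unbounded dual variables rather than in the dual objective), and the names are my own.

import numpy as np

def dual_coordinate_ascent(X, C, tol=1e-6, max_outer=50):
    # X: (n_triplets, d) array whose row t is X_ijk for one triplet.
    # C: box constraint on the dual variables alpha.
    n, d = X.shape
    alpha = np.zeros(n)
    sq_norms = (X ** 2).sum(axis=1)        # ||X_ijk||^2 for each triplet
    v = np.zeros(d)                        # v = sum_ijk alpha_ijk X_ijk
    W = np.maximum(v, 0.0)                 # mu takes up the slack, so W = max(v, 0)

    for _ in range(max_outer):             # outer loop: mark everything unbounded again
        active = np.arange(n)
        while True:                        # inner loop over the unbounded variables
            np.random.shuffle(active)      # random reordering
            max_change = 0.0
            bounded = []
            for t in active:
                if sq_norms[t] == 0.0:
                    continue
                old = alpha[t]
                # alpha update derived above, clipped to the box [0, C]
                new = np.clip(old + (1.0 - W @ X[t]) / sq_norms[t], 0.0, C)
                if new != old:
                    v += (new - old) * X[t]
                    W = np.maximum(v, 0.0) # re-impose W >= 0 via mu
                    alpha[t] = new
                    max_change = max(max_change, abs(new - old))
                if new == 0.0 or new == C:
                    bounded.append(t)      # skip this variable on the next inner pass
            active = np.setdiff1d(active, bounded)
            if active.size == 0 or max_change < tol:
                break
    return W, alpha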

Written by gbhqed

October 2, 2010 at 7:39 pm

nohup matlab


I recently had to run some long matlab jobs, so I started them with nohup so that they would keep going after I signed out.  However, I soon found that the jobs died as soon as I logged out, despite the nohup.

Turns out the secret to getting nohup to work with matlab is to unset the DISPLAY environment variable as described on this Mathworks help page.

Actually, I find it better to run nohup matlab jobs as described on this page of pointers, by redirecting standard input from a file rather than using the “-r” matlab option.

In summary, to run the script “commands.m” in matlab with nohup, execute the following command:

unset DISPLAY; nohup nice matlab < commands.m > output.txt &

where the first command is to clear out the DISPLAY variable in the bash shell, nice is thrown in since it is a long job, and output is redirected to “output.txt”.

Written by gbhqed

March 28, 2010 at 11:35 pm




A little off subject, but I recently came across this blog, Emacs-fu, while searching for how to add line numbers to the left of each line in Emacs (the answer is in this post).

Emacs is my editor of choice when I need to code.  I suppose there will always be the battle between Emacs and vi on one hand, and between these editors and the more sophisticated editors that come as part of an IDE on the other.

For my part, I prefer Emacs because it is available on all three operating systems I use (Windows, Mac OS X, Unix), it can easily be customized for different languages' syntax highlighting and indentation, and it minimizes the number of times I need to switch from the keyboard to the mouse while coding.  I suppose you could say the same things about vi; honestly, I am just more proficient with Emacs, as it was the default choice in my first CS class at Stanford.

At any rate, the Emacs-fu blog seems to have a number of tips for getting the most out of Emacs, and I’ll be looking into it in more detail.

Written by gbhqed

March 13, 2010 at 7:21 pm


Fast Bit Counting


A while ago, I was reading papers on various hashing algorithms, such as Weiss, Torralba, and Fergus's Spectral Hashing.  One use for these algorithms is in computing approximate distances and nearest neighbors in very large data sets, where exact computation would be too slow.  Data points are instead mapped to bit codes that are constructed such that nearby points will map to similar bit codes.

One question this raised, and one which is apparently also a potential interview question, is: what is a fast way to compute the number of one bits in a number (allowing you to compute the Hamming distance between bit codes)?

It’s an interesting question, and this page – Puzzle: Fast Bit Counting – gives a number of clever algorithms for computing the number of one bits.
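As one example of the kind of trick collected there, the classic approach often attributed to Kernighan clears the lowest set bit on each iteration, so the loop body runs once per one bit rather than once per bit position.  A quick Python version:

def popcount(x):
    # x & (x - 1) clears the lowest set bit of x.
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count

def hamming(a, b):
    # Hamming distance between two bit codes: count the positions where they differ.
    return popcount(a ^ b)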

Of course, this isn't necessarily relevant when using the hashed bit codes to compute nearest neighbors.  Instead, given a query, you'd probably want to find all data points within a Hamming distance of k.  For this, I imagine you would pre-compute all bit codes with k or fewer one bits, run through this list, XOR (^ in C) the query with each such code to get a new code within Hamming distance k of the query, and then check whether any data point was hashed to the new code.
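Here is a rough Python sketch of that lookup.  The hash table is assumed to be a plain dict from bit code to a list of data point ids, and the helper names are mine:

from itertools import combinations

def masks_up_to_k(n_bits, k):
    # All bit patterns over n_bits positions with at most k one bits.
    masks = [0]
    for r in range(1, k + 1):
        for positions in combinations(range(n_bits), r):
            m = 0
            for p in positions:
                m |= 1 << p
            masks.append(m)
    return masks

def neighbors_within_k(query, table, masks):
    # XOR-ing the query with each mask enumerates every code within
    # Hamming distance k of the query; collect whatever was hashed there.
    results = []
    for m in masks:
        results.extend(table.get(query ^ m, []))
    return results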

Written by gbhqed

February 25, 2010 at 1:37 am

Conditional and marginal distributions of a multivariate Gaussian


While reading up on Gaussian Processes (GPs), I decided it would be useful to be able to prove some of the basic facts about multivariate Gaussian distributions that are the building blocks for GPs.  Namely, how to prove that the conditional and marginal distributions of a multivariate Gaussian are also Gaussian, and to give their forms.


First, we know that the density of a k-dimensional multivariate normal distribution with mean \mu and covariance \Sigma is given by

\frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right).

For simplicity of notation, I’ll now assume that the distribution has zero-mean, but everything should carry over in a straightforward manner to the more general case.

Writing out x as two components \left[ \begin{array}{c} a\\ b \end{array} \right], we are now interested in two distributions, the conditional p(a|b) and the marginal p(b).

Separate the components of the covariance matrix \Sigma into a block matrix \left[ \begin{array}{cc} A & C^T \\ C & B \end{array}\right], such that A corresponds to the covariance for a, similarly for B, and C contains the cross-terms.

Rewriting the Joint

We’d now like to be able to write out the form for the inverse covariance matrix \left[ \begin{array}{cc} A & C^T \\ C & B \end{array}\right]^{-1}.  We can make use of the Schur complement and write this as

\left[ \begin{array}{cc} A & C^T \\ C & B  \end{array}\right]^{-1} = \left[ \begin{array}{cc} I & 0 \\ -B^{-1}C & I  \end{array}\right] \left[ \begin{array}{cc} (A-C^T B^{-1} C)^{-1} & 0 \\ 0 & B^{-1}  \end{array}\right] \left[ \begin{array}{cc} I & -C^T B^{-1} \\ 0 & I  \end{array}\right].

I’ll explain below how this can be derived.

Now, we know that the joint distribution can be written as

p(a,b) \propto \exp \left(-\frac{1}{2} \left[ \begin{array}{c} a\\ b \end{array} \right]^T \left[ \begin{array}{cc} A & C^T \\ C & B   \end{array}\right]^{-1} \left[ \begin{array}{c} a\\ b \end{array} \right] \right).

We can substitute in the above expression of the inverse of the block covariance matrix, and if we simplify by multiplying the outer matrices, we obtain

p(a,b) \propto \exp \left(-\frac{1}{2} \left[ \begin{array}{c}  a - C^T B^{-1} b \\ b \end{array} \right]^T \left[ \begin{array}{cc} (A-C^T B^{-1} C)^{-1} & 0 \\ 0 & B^{-1}   \end{array}\right] \left[ \begin{array}{c}  a - C^T B^{-1} b \\ b \end{array} \right] \right).

Using the fact that the center matrix is block diagonal, we have

p(a,b) \propto \exp \left(-\frac{1}{2} (a  - C^T B^{-1} b)^T (A-C^T B^{-1} C)^{-1} (a  - C^T B^{-1} b)\right) \exp \left( -\frac{1}{2} b^T B^{-1} b\right).

Wrapping up

At this point, we’re pretty much done.  If we condition on b, the second exponential term drops out as a constant, and we have

p(a|b) \sim \mathcal{N}\left(C^T B^{-1} b, (A-C^T B^{-1} C)\right).

Note that if a and b are uncorrelated, C = 0, and we just get the marginal distribution of a.

If we marginalize over a, we can pull the second exponential term outside the integral, and the first term is proportional to a Gaussian density in a, so it integrates to a constant that does not depend on b, and we find that

p(b) = \int_a p(a,b) \sim \mathcal{N}(0,B).
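As a quick numerical sanity check of this factorization, one can verify at a random point that the joint density equals the product of the conditional and marginal densities derived above.  A small numpy/scipy sketch, with arbitrary dimensions and a random covariance:

import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
ka, kb = 2, 3                                # dimensions of a and b
M = rng.standard_normal((ka + kb, ka + kb))
Sigma = M @ M.T + np.eye(ka + kb)            # random positive definite covariance
A, B = Sigma[:ka, :ka], Sigma[ka:, ka:]
C = Sigma[ka:, :ka]                          # so Sigma = [[A, C^T], [C, B]]
Binv = np.linalg.inv(B)

x = rng.standard_normal(ka + kb)             # an arbitrary point (a, b)
a, b = x[:ka], x[ka:]

joint = mvn(mean=np.zeros(ka + kb), cov=Sigma).pdf(x)
cond = mvn(mean=C.T @ Binv @ b, cov=A - C.T @ Binv @ C).pdf(a)
marg = mvn(mean=np.zeros(kb), cov=B).pdf(b)

assert np.isclose(joint, cond * marg)        # p(a, b) = p(a | b) p(b)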

Schur complement

Above, I wrote that you could use the Schur complement to get the block matrix form of the inverse covariance matrix.  How would one actually derive that?  As mentioned in the wikipedia page, the expression for the inverse can be derived using Gaussian elimination.

If you right-multiply the covariance by the left-most matrix in the expression, you obtain

\left[ \begin{array}{cc} A & C^T \\ C & B   \end{array}\right] \left[ \begin{array}{cc} I & 0 \\ -B^{-1}C & I    \end{array}\right] = \left[ \begin{array}{cc} A-C^T B^{-1} C & C^T \\ 0 & B    \end{array}\right]

zeroing out the bottom left block.  Multiplying by the center matrix gives you the identity in the diagonal blocks, and the right-most matrix zeros out the remaining top right block, giving you the identity, so the whole expression is the inverse of the covariance matrix.
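The factorization is also easy to check numerically.  This sketch builds the three factors from a random covariance and compares their product against the inverse computed directly (dimensions are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
ka, kb = 2, 3
M = rng.standard_normal((ka + kb, ka + kb))
Sigma = M @ M.T + np.eye(ka + kb)            # random positive definite covariance
A, B = Sigma[:ka, :ka], Sigma[ka:, ka:]
C = Sigma[ka:, :ka]
Binv = np.linalg.inv(B)
S = A - C.T @ Binv @ C                       # Schur complement of B

left = np.block([[np.eye(ka), np.zeros((ka, kb))], [-Binv @ C, np.eye(kb)]])
mid = np.block([[np.linalg.inv(S), np.zeros((ka, kb))], [np.zeros((kb, ka)), Binv]])
right = np.block([[np.eye(ka), -C.T @ Binv], [np.zeros((kb, ka)), np.eye(kb)]])

assert np.allclose(left @ mid @ right, np.linalg.inv(Sigma))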

Further Reading

I got started on this train of thought after reading the wikipedia page on Gaussian processes.  The external link on the page to a gentle introduction to GPs was somewhat helpful as a quick primer.  The video lectures by MacKay and Rasmussen were both good and helped to give a better understanding of GPs.

MacKay also has a nice short essay on the Humble Gaussian distribution, which gives more information on the covariance and inverse covariance matrices of Gaussian distributions.  In particular, the inverse covariance matrix tells you the relationship between two variables, conditioned on all other variables, and therefore changes if you marginalize out some of the variables.  The sign of the off diagonal elements in the inverse covariance matrix is opposite the sign of the correlation between the two variables, conditioned on all the other variables.

To go deeper into Gaussian Processes, one can read the book Gaussian Processes for Machine Learning, by Rasmussen and Williams, which is available online.  The appendix contains useful facts and references on Gaussian identities and matrix identities, such as the matrix inversion lemma, another application of Gaussian elimination to determine the inverse, in this case the inverse of a matrix sum.
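For reference, the matrix inversion lemma (also known as the Woodbury identity) states, for generic conformable matrices (the letters here are unrelated to the blocks A, B, C above),

(Z + UDV)^{-1} = Z^{-1} - Z^{-1}U(D^{-1} + VZ^{-1}U)^{-1}VZ^{-1}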

Written by gbhqed

February 21, 2010 at 5:53 pm