Sunday, March 21, 2010

Embedding LaTeX in Blogger take 2

After further consideration of Embedding LaTeX in Blogger, I have the following concerns about using jsLaTeX:

  • LaTeX images not rendered in RSS readers

  • Stability of all the components over (a long) time

jsMath improves on point 2, but looks way too hard to host on Blogger. So I am falling back to using images generated by the CodeCogs online equation editor and hosted on Picasa, which is where Blogger uploads images. I put the LaTeX source in the img alt attribute.

Blogger resizing images
After adding a few equations through the standard Blogger mechanism, I noticed that they weren't as clear when previewing. The following two equations are the same image hosted on Picasa, but embedded with different src URLs.



This one uses the Blogger-generated URL in the image tag and has size 400 x 285 pixels.



This one uses the Picasa-generated URL and has the original size of 415 x 296 pixels.

Notice how the first image is both smaller and blurry. It seems that Blogger is automatically resizing it. Unfortunately this means a little more work for each equation, as it requires generating the link in Picasa and replacing the Blogger-generated image tag.

Computing variance

Given a list of n numbers x_1,x_2,...,x_n, the variance is defined as

\frac{1}{n}\sum_{i=1}^n{\left(x_i - \bar{x}\right)^2}\quad\quad\quad\text{(1)}

where \bar{x} is the arithmetic mean.

\bar{x} = \frac{1}{n}\sum_{i=1}^n{x_i}

Expression 1 stated in words is the average squared difference of each data point from the mean. Implemented as an algorithm, this requires two passes over the data set: the first to calculate the mean and the second to sum the squared differences.
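As a rough sketch of the two-pass approach (in Python; the function name and use of a plain list are my own, illustrative choices):

def two_pass_variance(xs):
    """Population variance via expression 1: two passes over the data."""
    n = len(xs)
    mean = sum(xs) / n                           # first pass: arithmetic mean
    return sum((x - mean) ** 2 for x in xs) / n  # second pass: average squared difference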

Let's derive the more commonly used expression for variance.

\begin{align*}
\frac{1}{n}\sum_{i=1}^n{\left(x_i - \bar{x}\right)^2} & = \frac{1}{n}\sum_{i=1}^n{\left({x_i}^2 - 2x_i\bar{x} + \bar{x}^2\right)} \\
& = \frac{1}{n}\left(\sum_{i=1}^n{x_i}^2 - 2\bar{x}\sum_{i=1}^n{x_i} + n\bar{x}^2\right) \\
& = \frac{1}{n}\sum_{i=1}^n{x_i}^2 - 2\bar{x}^2 + \bar{x}^2 \\
& = \frac{1}{n}\sum_{i=1}^n{x_i}^2 - \bar{x}^2 \\
& = \frac{1}{n}\sum_{i=1}^n{x_i}^2 - \left(\frac{1}{n}\sum_{i=1}^n{x_i}\right)^2 && \text{(2)}
\end{align*}

This only requires a single pass over the data set, computing the incremental sum of the values and the incremental sum of their squares. However, if the two terms in the subtraction are large and close enough, catastrophic cancellation can occur. Essentially the precision of the floating-point representation is exceeded, yielding unacceptable results. See Algorithms for calculating variance and Theoretical explanation for numerical results for examples (they calculate the sample variance, which is slightly different). More detailed information on catastrophic cancellation can be found in What Every Computer Scientist Should Know About Floating-Point Arithmetic (1991).
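For concreteness, a minimal sketch of the single-pass version based on equation 2 (illustrative Python, not production code; the cancellation caveat applies exactly as described above):

def naive_one_pass_variance(xs):
    """Population variance via equation 2: running sums of values and of squares.

    If total_sq / n and mean * mean are large and nearly equal, the final
    subtraction can suffer catastrophic cancellation in floating point.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean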

A more numerically stable (but computationally more expensive) single-pass algorithm was published by Knuth in The Art of Computer Programming, volume 2. Define \bar{x}_n and s_n as the mean and sum of squared differences of the first n values respectively.

\begin{align*}
s_0 & = 0\\
\\
s_{n+1} & = \sum_{i=1}^{n+1}\left(x_i-\bar{x}_{n+1}\right)^2\\
& = s_n + \left(x_{n+1}-\bar{x}_n\right)\left(x_{n+1}-\bar{x}_{n+1}\right)\quad\quad\quad(3)
\end{align*}

For m > 0 values, the variance is then \frac{s_m}{m}.
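Recurrences 3 and 4 translate directly into a single-pass algorithm (often attributed to Welford). A sketch in Python, with variable names of my own choosing:

def knuth_one_pass_variance(xs):
    """Population variance via recurrences 3 and 4: one pass, numerically stable."""
    n = 0
    mean = 0.0   # running mean, xbar_n
    s = 0.0      # running sum of squared differences, s_n
    for x in xs:
        n += 1
        delta = x - mean           # x_{n+1} - xbar_n
        mean += delta / n          # equation 4
        s += delta * (x - mean)    # equation 3, using the updated mean xbar_{n+1}
    return s / n                   # variance = s_m / m for m > 0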

Derivation of Equation 3
This is really just a more verbose version of the one-pass algorithm to compute sample variance. First up is the running mean.

\begin{align*}
\bar{x}_0 & = 0\\
\\
\bar{x}_{n+1} & = \frac{n\bar{x}_n + x_{n+1}}{n+1}\\
& = \frac{\left(n+1\right)\bar{x}_n-\bar{x}_n+x_{n+1}}{n+1}\\
& = \bar{x}_n + \frac{x_{n+1}-\bar{x}_n}{n+1} && (4)
\end{align*}
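A quick numerical sanity check of equation 4 against the batch mean (illustrative Python; the tolerance is arbitrary):

import random

xs = [random.random() for _ in range(1000)]

mean = 0.0
for n, x in enumerate(xs, start=1):
    mean += (x - mean) / n   # incremental update from equation 4

assert abs(mean - sum(xs) / len(xs)) < 1e-9   # matches the two-pass mean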

If you think of the mean visually, i.e. as a horizontal line through the data on a scatter plot, then it is intuitive that the sum of the signed differences between each point and the mean is zero.

\sum_{i=1}^n\left(x_i-\bar{x}_n\right) = \sum_{i=1}^n{x_i}-n\left(\frac{1}{n}\sum_{i=1}^n{x_i}\right)=0\quad\quad\quad(5)

Now define \gamma as the difference between consecutive incremental means, along with a few useful variations.

\begin{align*}
\gamma & = \bar{x}_{n+1} - \bar{x}_n && (6)\\
\\
\gamma & = \bar{x}_n + \frac{x_{n+1}-\bar{x}_n}{n+1} - \bar{x}_n && \text{by (4)}\\
\gamma & = \frac{x_{n+1}-\bar{x}_n}{n+1} \\
\left(n+1\right)\gamma & = x_{n+1}-\bar{x}_n && (7)\\
\\
x_{n+1}-\bar{x}_{n+1} & = x_{n+1}-\bar{x}_n+\bar{x}_n-\bar{x}_{n+1} \\
& = \left(n+1\right)\gamma-\gamma && \text{by (7), (6)} \\
& = n\gamma && (8)
\end{align*}

The pieces are now all in place to derive equation 3.

\begin{align*}
s_{n+1} & = \sum_{i=1}^{n+1}\left(x_i-\bar{x}_{n+1}\right)^2\\
& = \sum_{i=1}^{n}\left(x_i-\bar{x}_{n+1}\right)^2 + \left(x_{n+1}-\bar{x}_{n+1}\right)^2\\
& = \sum_{i=1}^{n}\left(x_i-\bar{x}_n+\bar{x}_n-\bar{x}_{n+1}\right)^2 + \left(n\gamma\right)^2 && \text{by (8)}\\
& = \sum_{i=1}^{n}\left(\left(x_i-\bar{x}_n\right)-\gamma\right)^2+n^2\gamma^2 && \text{by (6)}\\
& = \sum_{i=1}^{n}\left(x_i-\bar{x}_n\right)^2 -2\gamma\sum_{i=1}^{n}\left(x_i-\bar{x}_n\right)+n\gamma^2+n^2\gamma^2\\
& = s_n-0+n\gamma^2\left(1+n\right) && \text{by (5)}\\
& = s_n+\left(x_{n+1}-\bar{x}_n\right)n\gamma && \text{by (7)}\\
& = s_n+\left(x_{n+1}-\bar{x}_n\right)\left(x_{n+1}-\bar{x}_{n+1}\right) && \text{by (8)}
\end{align*}
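As a final sanity check of the derivation, the recurrence can be compared numerically against the definition of s_m (illustrative Python; test data and tolerance are arbitrary):

import random

xs = [random.gauss(0.0, 1.0) for _ in range(500)]

# Direct evaluation of the definition: s_m = sum of (x_i - xbar_m)^2.
xbar = sum(xs) / len(xs)
s_direct = sum((x - xbar) ** 2 for x in xs)

# Incremental evaluation via equations 3 and 4.
mean, s = 0.0, 0.0
for n, x in enumerate(xs, start=1):
    delta = x - mean
    mean += delta / n
    s += delta * (x - mean)

assert abs(s - s_direct) < 1e-6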

Friday, March 12, 2010

Embedding LaTeX in Blogger

The goal is to include mathematical equations in blog posts. My plan was to create images from the output of MacTeX. Apart from being painfully slow to crop and upload when there are multiple distinct equations, this approach means the LaTeX source is not contained in the post.

Embedding LaTeX in HTML led me to jsLaTeX. Option 4 from Solutions for JavaScript Hosting in Blogger solved the Blogger hosting issue. After some CSS shenanigans we have Bayes' rule right here, right now.

\mathbb{P}(B_j|A) = \frac{\mathbb{P}(A|B_j)\mathbb{P}(B_j)}{\sum_{i=1}^n\mathbb{P}(A|B_i)\mathbb{P}(B_i)}

It is easier to write simple equations in the CodeCogs online equation editor first, as the Blogger preview doesn't seem to execute the JavaScript to render them.

Edit 23/03/2010: No longer using jsLaTeX, see Embedding LaTeX in Blogger take 2.