You can see what the VT looks like in these demo videos I just recorded.
So how does it work?
To illustrate how the VT is implemented, let's imagine that our friend Felix the cat is the input to the VT at a certain moment in time. The state of the wave function is then obtained by computing two projections of the image: one onto the X axis and one onto the Y axis. To compute them I simply take p(x) = sum(img(x, y)) over all possible y values, and vice versa for p(y). Each projection plays a different role in computing the waveform: the projection in Y is used to represent a spectral analysis, whereas the projection in X is used to calculate a base frequency. For the spectral analysis I just discretize the projection down to a reasonable limit (16 partials), and for the base frequency I compute the mean of the distribution given by the projection.
Before doing the projections, I binarize the image and optionally apply gradient detection to it.
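The steps above can be sketched in a few lines of Python with NumPy. This is only a rough illustration, not the actual VT code. The binarization threshold, the way the Y projection is folded into 16 bins, the 110-880 Hz range the mean is mapped onto, and all function names are assumptions made for the example. The optional gradient detection step is left out.

```python
import numpy as np

N_PARTIALS = 16  # the limit mentioned in the post


def binarize(img, threshold=0.5):
    """Turn a grayscale image (values in 0..1) into 0/1 pixels.
    The threshold value is an assumption."""
    return (img > threshold).astype(float)


def projections(binary):
    """Return (p_x, p_y) for an image indexed as [y, x]."""
    p_x = binary.sum(axis=0)  # p(x): sum over all y for each x
    p_y = binary.sum(axis=1)  # p(y): sum over all x for each y
    return p_x, p_y


def partial_amplitudes(p_y, n_partials=N_PARTIALS):
    """Discretize the Y projection into n_partials amplitudes.
    Summing contiguous bins is one possible choice (assumed)."""
    amps = np.array([chunk.sum() for chunk in np.array_split(p_y, n_partials)])
    total = amps.sum()
    return amps / total if total > 0 else amps


def base_frequency(p_x, f_min=110.0, f_max=880.0):
    """Map the mean of the X projection onto a frequency range.
    The range and the linear mapping are assumptions."""
    total = p_x.sum()
    if total == 0:
        return None  # empty image: no sound
    xs = np.arange(len(p_x))
    mean_x = (xs * p_x).sum() / total
    return f_min + (f_max - f_min) * mean_x / max(len(p_x) - 1, 1)


def waveform(amps, f0, n_samples=2048, sample_rate=44100):
    """Additive synthesis: partial k has frequency k * f0."""
    t = np.arange(n_samples) / sample_rate
    k = np.arange(1, len(amps) + 1)
    return (amps[:, None] * np.sin(2 * np.pi * f0 * k[:, None] * t)).sum(axis=0)
```

For each frame you would call binarize, then projections, then pass p_y to partial_amplitudes and p_x to base_frequency, and finally feed both results to waveform. With this mapping, moving the bright part of the image to the right raises the pitch, and changing its vertical distribution changes the timbre.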
Hasn't someone done something like this before?
I am aware that very similar things exist. There are very complex proprietary pieces of software that can do almost anything with live video. As for simple software like mine, this and this use a different approach: they take the input of the waveform from something inside the webcam output, whereas in the VT the whole webcam output is the input of the waveform. I think this is a big difference. It was probably simpler to implement (otherwise you first have to recognize some kind of object to get an input), and it opens up endless possibilities, since any image can be the input of a waveform.
Finally, this thing implemented using the Wii is really cool.