<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Kragen's Blog Thing</title>
    <link>http://www.bentwookie.org/blog</link>
    <description>Kragen's Blog Thing.</description>
    <language>en</language>

  <item>
    <title>speech codecs: can a sinusoidal codec get below 500 bits per second and produce comprehensible speech?</title>
    <link>http://www.bentwookie.org/blog/2010/03/17#000911</link>
    <description>

&lt;PRE&gt;David Rowe (VK5DGR) has been doing some absolutely awesome work on
open-source speech codec development (for lossy compression of speech,
for example for certain radio frequency bands where available bandwidth
is very limited):

&amp;lt;&lt;A HREF=&quot;http://www.rowetel.com/ucasterisk/codec2.html&quot;&gt;http://www.rowetel.com/ucasterisk/codec2.html&lt;/A&gt;&amp;gt;
&amp;lt;&lt;A HREF=&quot;http://www.rowetel.com/blog/?p=132&quot;&gt;http://www.rowetel.com/blog/?p=132&lt;/A&gt;&amp;gt;

He's shooting for good speech quality at 2400 bits per second, which is
apparently close to what the best proprietary codecs manage. And he's
using an approach fairly similar in many ways to what the rest of this
post describes.

But today I ran into this truly astonishing research project at Haskins
Laboratories at Yale, by Robert Remez and Philip Rubin:

&amp;lt;&lt;A HREF=&quot;http://www.haskins.yale.edu/featured/sws/sws.html&quot;&gt;http://www.haskins.yale.edu/featured/sws/sws.html&lt;/A&gt;&amp;gt;

There is some further exploration, which unfortunately I couldn't listen
to successfully, at Remez's site:

&amp;lt;&lt;A HREF=&quot;http://www.columbia.edu/~remez/Site/Musical%20Sinewave%20Speech.html&quot;&gt;http://www.columbia.edu/~remez/Site/Musical%20Sinewave%20Speech.html&lt;/A&gt;&amp;gt;

And there's a Wikipedia page:

&amp;lt;&lt;A HREF=&quot;http://en.wikipedia.org/wiki/Sinewave_synthesis&quot;&gt;http://en.wikipedia.org/wiki/Sinewave_synthesis&lt;/A&gt;&amp;gt;

They're synthesizing comprehensible --- if slow --- English speech out
of nothing more than three or four formants, realized purely as sine
waves. The sound recordings are truly astonishing to listen to. They
don't sound like human speech; they don't sound like synthesized speech;
they sound like the whistling and squealing sounds when you're trying to
tune an AM radio. But you can understand them.

(And apparently they started this research in the 1970s, their Fortran
code is from 1980, they put up a web page about it in 1996, and the
current Matlab version of the code is from 2003. The publications on
their publications page are from 1980 to 1994.)

This made me wonder how low you can really push the bandwidth of a
usable speech codec. Presumably you could do k-nearest-neighbors
averaging on a database of recorded speech sounds in order to get back
from the sine waves to something that more closely resembles human
speech. But how much bandwidth would it take to transmit the sine-wave
information?

They've put online the parameters they used to synthesize their sample
utterances; one of them is at
&amp;lt;&lt;A HREF=&quot;http://www.haskins.yale.edu/featured/sws/swssentences/S7pars.html&quot;&gt;http://www.haskins.yale.edu/featured/sws/swssentences/S7pars.html&lt;/A&gt;&amp;gt;. It
encodes the sentence &amp;quot;Please say what this word is,&amp;quot; about 19 phones, in
1.68 seconds.  The text file is 15045 bytes, which isn't particularly
good; that's about 72kbps. But even `gzip` can compress it to 3333
bytes, which brings it down below 16 kilobits per second.

That is, even without intending to, their research amounts to a speech
codec that can produce comprehensible speech at 16 kilobits per second,
slightly more than the 13 kilobits per second of full-rate GSM (although
GSM produces very realistic speech).
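
Here's the arithmetic behind those bit rates as a minimal Python sketch.
The function name `bits_per_second` is made up for illustration; the
byte counts and the duration are the ones quoted above.

```python
def bits_per_second(n_bytes, seconds):
    # 8 bits per byte, spread over the duration of the utterance
    return n_bytes * 8 / seconds

raw_rate = bits_per_second(15045, 1.68)       # the plain text parameter file
gzipped_rate = bits_per_second(3333, 1.68)    # the same file after gzip
print(round(raw_rate), round(gzipped_rate))
```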

Beyond that, though, we'd need to discard more information. The
parameters file in its current form divides the utterance up into 10ms
frames, with pitch and amplitude information for each formant in each
frame; between frames, the pitch and amplitude are linearly
interpolated.  The pitch information in that file ranges from 136Hz to
4597Hz, and is already quantized to, apparently, 1Hz.  The amplitude
information is represented to six significant figures.

Suppose that, instead, we reduced the frames to a few &amp;quot;keyframes&amp;quot; and
interpolated between them with cubic splines. We probably need at least
one keyframe per phone, and probably 1.5 or so. That would give us about
17 keyframes per second, which is an improvement of a factor of 6. If
that didn't affect its gzip-compressibility, that alone would get us
down to 2700 bits per second.
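
A rough check of that estimate, as a sketch in Python. The variable
names are illustrative, and the 1.5 keyframes per phone is the guess
from the paragraph above, not a measured value.

```python
phones, seconds = 19, 1.68
frames_per_second = 100                 # 10ms frames in the original file

keyframes_per_second = 1.5 * phones / seconds
reduction = frames_per_second / keyframes_per_second

# Assumes the gzipped size shrinks in proportion to the frame count
gzipped_rate = 3333 * 8 / seconds
keyframe_rate = gzipped_rate / reduction
print(round(keyframes_per_second), round(reduction), round(keyframe_rate))
```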

But there's no need to spend 13 or 16 bits per formant per keyframe on
the frequency; we can almost certainly quantize the frequency
logarithmically to within one semitone. The range in question is almost
61 semitones, so you only need six bits.
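
A sketch of that logarithmic quantization in Python. The names
`hz_to_code` and `code_to_hz` are hypothetical, and anchoring code zero
at the lowest frequency in the file is an assumption about how you'd
lay out the codes.

```python
import math

F_MIN = 136.0     # lowest frequency in the parameter file, in Hz
F_MAX = 4597.0    # highest frequency in the parameter file, in Hz

def hz_to_code(freq_hz):
    # Whole semitones above F_MIN, rounded to the nearest step
    return round(12 * math.log2(freq_hz / F_MIN))

def code_to_hz(code):
    return F_MIN * 2 ** (code / 12)

span = 12 * math.log2(F_MAX / F_MIN)          # range in semitones
top_code = hz_to_code(F_MAX)
bits = math.ceil(math.log2(top_code + 1))     # bits to hold codes 0..top_code
print(round(span, 2), top_code, bits)
```

Rounding to the nearest semitone means the decoded frequency is never
more than half a semitone, about 3%, away from the original.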

Similarly, the amplitude probably doesn't need ten bits of precision.
Probably four bits (logarithmic; maybe 2dB each, for a total dynamic
range of 32dB) would do fine.
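
Something like the following Python sketch would do the amplitude
quantization. The names, the full-scale reference of 1.0, and clamping
anything quieter to the bottom code are all assumptions for
illustration.

```python
import math

STEP_DB = 2.0     # size of one quantization step
LEVELS = 16       # four bits

def amp_to_code(amplitude, full_scale=1.0):
    # Decibels below full scale; the tiny floor avoids log10(0)
    db_below = -20 * math.log10(max(amplitude, 1e-9) / full_scale)
    code = LEVELS - 1 - round(db_below / STEP_DB)
    return min(max(code, 0), LEVELS - 1)

def code_to_amp(code, full_scale=1.0):
    return full_scale * 10 ** (-(LEVELS - 1 - code) * STEP_DB / 20)

print([amp_to_code(a) for a in (1.0, 0.5, 0.1, 0.0)])
```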

And the interval between keyframes can probably be quantized to 10ms,
and range up to, say, 160ms, which would require four bits of keyframe
duration.

So a keyframe consisting of four bits of timing, and four formants each
with ten bits of pitch and amplitude information (six of pitch and four
of amplitude), would occupy a total of 44 bits, for a total of about 750
bits per second, or 210 bits per word in this case (since you'd need
fewer keyframes when the speech was slower).  That's about five times
worse than ASCII text.
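
To make the 44-bit layout concrete, here's a sketch in Python that packs
one keyframe into an integer and unpacks it again. The field order
(timing first, then pitch and amplitude for each formant) and the
function names are assumptions; nothing above fixes them.

```python
FORMANTS = 4
BITS_TIME, BITS_PITCH, BITS_AMP = 4, 6, 4

def pack_keyframe(interval, formants):
    # interval is a 4-bit code; formants is a list of four
    # (pitch code, amplitude code) pairs
    word = interval
    for pitch, amp in formants:
        word = word * 2 ** BITS_PITCH + pitch
        word = word * 2 ** BITS_AMP + amp
    return word

def unpack_keyframe(word):
    formants = []
    for _ in range(FORMANTS):
        # Fields come back out in the reverse of the packing order
        word, amp = divmod(word, 2 ** BITS_AMP)
        word, pitch = divmod(word, 2 ** BITS_PITCH)
        formants.append((pitch, amp))
    return word, formants[::-1]

FRAME_BITS = BITS_TIME + FORMANTS * (BITS_PITCH + BITS_AMP)
frame = pack_keyframe(9, [(12, 3), (30, 15), (44, 7), (61, 1)])
print(FRAME_BITS, FRAME_BITS * 17, unpack_keyframe(frame))
```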

Also, in many of the frames, not all of the formants were present; out
of the 169 frames in that file, there were an average of 2.5 formants
present. If the average number of formants were the same for keyframes,
and you used two bits per frame to indicate the number of formants, the
average frame size would fall from 44 bits to 4+2+25 = 31 bits, and the
total bit rate would fall to 530 bits per second. But in practice, you
would probably tend to choose fewer keyframes in segments with fewer
formants. (This would also reduce the bit rate for silence to 6 bits per
keyframe, a little over 6 times per second: almost 38 bits per second.)
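
The same budget with a variable number of formants, as a small Python
sketch; the constant and function names are illustrative. One caveat
that is my own observation: a two-bit count distinguishes only four
values, so covering zero through four formants would need one value to
do double duty.

```python
BITS_TIME = 4         # keyframe interval
BITS_COUNT = 2        # number of formants present
BITS_FORMANT = 10     # six bits of pitch plus four of amplitude

def frame_bits(n_formants):
    return BITS_TIME + BITS_COUNT + BITS_FORMANT * n_formants

KEYFRAMES_PER_SECOND = 17
average_rate = frame_bits(2.5) * KEYFRAMES_PER_SECOND

# Silence: empty keyframes at the longest interval, 160ms
silence_rate = frame_bits(0) * (1000 / 160)
print(frame_bits(2.5), average_rate, silence_rate)
```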

If you had some kind of entropy or linear-predictive-plus-quantized-
residuals coding for the keyframes, you might be able to do still
better; in essence, you could take advantage of the kinds of
redundancies that phonotactics enforces --- consonants tend to alternate
with vowels, for example.

How to choose keyframes? A simple greedy approach would be to start with
100 frames per second, and then iteratively remove the frame that would
produce the smallest error, until removing further frames would generate
unacceptable levels of error. Another simple greedy approach would be to
start with no frames, and then iteratively add keyframes at the point
where the interpolated spectrograms were the farthest from the real
signal, until the spectrogram was close enough.
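
The first greedy approach can be sketched as follows, in Python, under
some loudly simplifying assumptions: a single one-dimensional parameter
track stands in for the spectrogram, linear interpolation stands in for
the cubic splines, and the error measure is the worst absolute
deviation. All the function names are hypothetical.

```python
import bisect

def reconstruct(track, keys):
    # Rebuild every frame by linear interpolation between kept keyframes
    out = []
    for i in range(len(track)):
        j = min(bisect.bisect_right(keys, i), len(keys) - 1)
        a, b = keys[j - 1], keys[j]
        t = (i - a) / (b - a)
        out.append(track[a] + t * (track[b] - track[a]))
    return out

def max_error(track, keys):
    rebuilt = reconstruct(track, keys)
    return max(abs(x - y) for x, y in zip(track, rebuilt))

def greedy_remove(track, tolerance):
    keys = list(range(len(track)))      # start with every frame
    while keys[1:-1]:                   # while interior keyframes remain
        # Try dropping each interior keyframe; pick the cheapest removal
        trials = [(max_error(track, keys[:n] + keys[n + 1:]), n)
                  for n in range(1, len(keys) - 1)]
        error, n = min(trials)
        if max(error, tolerance) != tolerance:
            break                       # cheapest removal exceeds tolerance
        del keys[n]
    return keys

print(greedy_remove([0, 1, 2, 3, 4, 3, 2, 1, 0], 0.001))
```

This retries every candidate on every pass, so it is slow on long
utterances; it is meant to show the idea, not to be efficient.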

Neither of those approaches to keyframe selection can work well in real
time, since both need the whole utterance in hand. A simple approach
that would probably work well in real time would be to maintain two
&amp;quot;possible keyframes&amp;quot;, one at the present moment and one just before,
and whenever the spectrogram interpolated from the last emitted
keyframes to the &amp;quot;possible keyframes&amp;quot; becomes too far from the real
signal, emit a keyframe at the point in the recent past where the error
is greatest.

All of these approaches, of course, have to be adjusted to not exceed
the maximum representable keyframe interval.
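
Here's one possible reading of the real-time approach, again as a hedged
Python sketch on a single track with linear interpolation. It is
simplified to one candidate keyframe (the current frame) rather than
two, and the names and the `max_gap` parameter (16 frames, that is,
160ms of 10ms frames) are assumptions.

```python
def stream_keyframes(track, tolerance, max_gap=16):
    if not track:
        return []
    keys = [0]                          # always emit the first frame
    for now in range(1, len(track)):
        last = keys[-1]
        # Error of each frame on the straight line from last to now
        errors = [
            abs(track[last] + (i - last) / (now - last)
                * (track[now] - track[last]) - track[i])
            for i in range(last + 1, now)
        ]
        worst = max(errors, default=0.0)
        if max(worst, tolerance) != tolerance:
            # Too far off: emit a keyframe where the error is greatest
            keys.append(last + 1 + errors.index(worst))
        elif now - last == max_gap:
            # Longest representable interval reached: force a keyframe
            keys.append(now)
    if keys[-1] != len(track) - 1:
        keys.append(len(track) - 1)     # close off the utterance
    return keys

print(stream_keyframes([0, 1, 2, 3, 4, 3, 2, 1, 0], 0.001))
```

Each keyframe is emitted at most a few frames after the moment it
describes, so the added latency is bounded by `max_gap` frames.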

Other tweaks to try:

- Add a bit per keyframe to indicate that the spectrogram has a
  discontinuous break at that point, rather than interpolating. (This
  could avoid transmitting as many as three more closely spaced
  keyframes.)
- Add a bit per keyframe to indicate the presence or absence of voicing,
  as almost all vocoder algorithms do.
- More generally, add two or three bits per keyframe to indicate the
  average bandwidth of the formants.
- Transmit parameters for a voice model toward the beginning of the
  connection or periodically throughout the recording, in parallel with
  the formant frequency data, so that the synthesized voice can sound at
  least vaguely like the speaker instead of like someone else. If you
  had a perfect model of the range of variation of human voices, 36 bits
  would be enough to uniquely specify the voice of any person who's ever
  lived, and another 26 bits would be enough to specify a minute of
  their life. How close to that can you get with some kind of
  parametric model? Can you come up with a model that describes the
  unique timbre of a person's vocal tract in a small number of
  ruthlessly quantized coefficients, say, 25 dimensions of four bits
  each?
- Since the formants can be constrained to be transmitted in sorted
  order, transmit formant frequencies as intervals (ratios) from the
  previous formant's frequency rather than independently. This could
  reduce the size of each frequency transmitted from 6 bits to 5 or 4.
- Nonuniform encoding for the per-frame formant parameters. This would
  probably require some pretty heavy-duty psychoacoustic research to
  validate (and someone has probably already done it), but perhaps, say,
  there is less tolerance for error in the interval between two formants
  when they are close together, because the difference between a perfect
  fifth and a perfect fourth is more audible than the difference between
  a 16:3 and an 18:3 interval --- which are a perfect fourth and fifth
  plus two octaves. Or perhaps amplitude variation is more important at
  high frequencies.
- Update only the higher-frequency formants in some keyframes. The
  frequency and amplitude of a 200Hz formant can't change very rapidly.
  In a 10ms frame you only get two full cycles! So if you're looking at
  10ms, unless I'm confused about the math, your first few discrete
  Fourier transform coefficients are DC, 100Hz, 200Hz, and 300Hz. So you
  can't detect even fairly large shifts in its frequency --- if it were
  to drop or rise by a whole fifth, seven semitones, you wouldn't even
  notice until you're looking at a longer period of time.  On the other
  hand, if a 4000Hz formant drops to 3900Hz --- less than half a
  semitone --- you could detect that in the DFT of those 10ms.
  Presumably similar constraints apply to your ear: you can't detect if
  a 200Hz signal jumps to 216Hz over a 10ms period; you need a longer
  period of time. So you could emit updates for the high-frequency
  formants more frequently.  This would add a couple of bits per
  keyframe (to indicate which formants were being updated), but most
  keyframes would only contain one formant.
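
The frequency-resolution argument in the last item can be checked
numerically. This is a small Python sketch with made-up function names;
it only restates the arithmetic and says nothing about what the ear
actually does.

```python
import math

def bin_spacing_hz(window_seconds):
    # DFT bins of a window of length T are spaced 1/T apart
    return 1.0 / window_seconds

def semitones(f1, f2):
    return abs(12 * math.log2(f2 / f1))

spacing = bin_spacing_hz(0.010)         # a 10ms frame
low_shift = semitones(200, 300)         # one bin up from 200Hz
high_shift = semitones(4000, 3900)      # one bin down from 4000Hz
print(spacing, round(low_shift, 2), round(high_shift, 2))
```

One bin of spacing is about a fifth at 200Hz but less than half a
semitone at 4000Hz, which is the asymmetry the item relies on.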

Klatt 1987 reports that the Speak 'n' Spell stored its speech at about
1000 bits per second, using linear predictive coding:

&amp;lt;&lt;A HREF=&quot;http://americanhistory.si.edu/archives/speechsynthesis/dk_749.htm&quot;&gt;http://americanhistory.si.edu/archives/speechsynthesis/dk_749.htm&lt;/A&gt;&amp;gt;

Dan Ellis, who wrote the current Matlab version on the Haskins Lab site,
talks about the connection with LPC vocoders:

&amp;lt;&lt;A HREF=&quot;http://labrosa.ee.columbia.edu/matlab/sws/&quot;&gt;http://labrosa.ee.columbia.edu/matlab/sws/&lt;/A&gt;&amp;gt;
&lt;/PRE&gt;
</description>
  </item>
  <item>
    <title>mailing lists, blog posts, and Git: what to do next with kragen-tol?</title>
    <link>http://www.bentwookie.org/blog/2010/03/15#001127</link>
    <description>

&lt;PRE&gt;
On 13 March 2010 at 03:45, Kragen Javier Sitaker wrote:
&amp;gt;&lt;i&gt; In order for this to work, though, there needs to be a record of these
&lt;/I&gt;&amp;gt;&lt;i&gt; things having been published at a particular time; things published
&lt;/I&gt;&amp;gt;&lt;i&gt; after a patent application has been filed are not &amp;quot;prior art&amp;quot; and
&lt;/I&gt;&amp;gt;&lt;i&gt; do not invalidate patent claims.
&lt;/I&gt;
A long time ago there was someone (ex-Bell?) who was offering a  
timestamping service; IIRC documents and metadata were securely  
hashed, and these hashes then hashed, until finally the daily root  
hash was published as a classified ad in the New York Times (yes,  
back when newspapers existed and still printed classified ads :-)

-Dave
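
The hashing scheme described above resembles what is now usually called
a hash tree. A minimal sketch in Python, assuming SHA-256 and pairwise
hashing with the odd hash duplicated; those details and the function
names are assumptions, not a description of how the actual service
worked.

```python
import hashlib

def sha256(data):
    return hashlib.sha256(data).digest()

def merkle_root(documents):
    if not documents:
        raise ValueError('need at least one document')
    # Hash each document, then hash pairs of hashes until one remains
    level = [sha256(doc) for doc in documents]
    while len(level) != 1:
        if len(level) % 2:
            level.append(level[-1])     # duplicate the odd one out
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()

print(merkle_root([b'first document', b'second document', b'third']))
```

Publishing only the root commits to every document under it, since
changing any document changes the root.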

&lt;/PRE&gt;
</description>
  </item>
  <item>
    <title>mailing lists, blog posts, and Git: what to do next with kragen-tol?</title>
    <link>http://www.bentwookie.org/blog/2010/03/15#001126</link>
    <description>

&lt;PRE&gt;On Sun, Mar 14, 2010 at 01:17:44PM +0100, Aristotle Pagaltzis wrote:
&amp;gt;&lt;i&gt; * Kragen Javier Sitaker &amp;lt;&lt;A HREF=&quot;http://lists.canonical.org/mailman/listinfo/kragen-discuss&quot;&gt;kragen at canonical.org&lt;/A&gt;&amp;gt; [2010-03-13 03:50]:
&lt;/I&gt;&amp;gt;&lt;i&gt; &amp;gt; Is there some feature of Git or the court system I'm not
&lt;/I&gt;&amp;gt;&lt;i&gt; &amp;gt; familiar with that makes this scenario implausible?
&lt;/I&gt;&amp;gt;&lt;i&gt; 
&lt;/I&gt;&amp;gt;&lt;i&gt; There is a Git feature you *are* already familiar with. I suppose
&lt;/I&gt;&amp;gt;&lt;i&gt; you&amp;#8217;d have to convince people to keep their reflogs forever (and
&lt;/I&gt;&amp;gt;&lt;i&gt; maybe sign them) &amp;#8211; since that records which head was set to which
&lt;/I&gt;&amp;gt;&lt;i&gt; commit at which point in time.
&lt;/I&gt;
Yeah, it's too bad gc.reflogExpire isn't replicated upon git-clone, or
that would be simple!

&amp;gt;&lt;i&gt; (I wish there was a one-stop configuration setting to tell Git to
&lt;/I&gt;&amp;gt;&lt;i&gt; never expire any of its data *ever*.)
&lt;/I&gt;
That would be ideal for this, but only if you could get it to be turned
on by default for clones of a given repo.
&lt;/PRE&gt;

</description>
  </item>
  <item>
    <title>mailing lists, blog posts, and Git: what to do next with kragen-tol?</title>
    <link>http://www.bentwookie.org/blog/2010/03/14#001125</link>
    <description>

&lt;PRE&gt;* Kragen Javier Sitaker &amp;lt;&lt;A HREF=&quot;http://lists.canonical.org/mailman/listinfo/kragen-discuss&quot;&gt;kragen at canonical.org&lt;/A&gt;&amp;gt; [2010-03-13 03:50]:
&amp;gt;&lt;i&gt; Is there some feature of Git or the court system I'm not
&lt;/I&gt;&amp;gt;&lt;i&gt; familiar with that makes this scenario implausible?
&lt;/I&gt;
There is a Git feature you *are* already familiar with. I suppose
you&amp;#8217;d have to convince people to keep their reflogs forever (and
maybe sign them) &amp;#8211; since that records which head was set to which
commit at which point in time.

(I wish there was a one-stop configuration setting to tell Git to
never expire any of its data *ever*.)

Regards,
-- 
Aristotle Pagaltzis // &amp;lt;&lt;A HREF=&quot;http://plasmasturm.org/&quot;&gt;http://plasmasturm.org/&lt;/A&gt;&amp;gt;
&lt;/PRE&gt;

</description>
  </item>
  <item>
    <title>mailing lists, blog posts, and Git: what to do next with kragen-tol?</title>
    <link>http://www.bentwookie.org/blog/2010/03/12#000910</link>
    <description>

&lt;PRE&gt;kragen-tol is a mailing list rather than a blog for a couple of reasons.

The first was that, when I [set up the list and made my first post][0]
(about prisons, crime, policing, democracy, and humanity), it was
November 12th, 1998.  Some people had blogs; I read some blogs; but
blogging wasn't yet the default means of publication that it has become
now.
There's a blog version of the list at &amp;lt;&lt;A HREF=&quot;http://bentwookie.org/blog/kragen-tol/&quot;&gt;http://bentwookie.org/blog/kragen-tol/&lt;/A&gt;&amp;gt;,
but because it's not primary, I haven't made the formatting on it work
well.

The second was that I want my list mail to serve as prior art to stop
obvious patents from being granted, or to revoke them or the obvious
claims in them in court. (Re-examination didn't exist yet, if I recall
correctly.) Some examples: I posted [a simple system architecture for
MMORPGs] [1], [an article about networked automated fabrication] [4],
and [some thoughts on interactive kinematic modeling] [2] that month.
Later that year, I wrote about [uses for ubiquitous computing] [3],
[ballistic transport in evacuated tunnels] [5], [a technique for
transparent CPU virtualization] [6], [fault-tolerant distributed
computation on untrusted CPUs] [7], [a technique for time-travel
debugging] [8] (which Michael Elizabeth Chastain implemented around the
same time as mec-replay), and [a lightweight device for stopping
bullets] [9].

In order for this to work, though, there needs to be a record of these
things having been published at a particular time; things published
after a patent application has been filed are not &amp;quot;prior art&amp;quot; and do not
invalidate patent claims.

My thought was that publishing these things on a mailing list, rather
than merely on the web, would create a number of distributed copies in
different subscribers' mailboxes, each timestamped with the time that it
had originally been received. This way, there would be more than just my
word to go on. A number of people with different interests would have
records of the publication.

There are some big disadvantages to publishing in mailing-list form.
It's not very observable (what mailing lists are your friends on?), so
it doesn't spread very fast; the formatting sucks unless you send HTML
email; reading the archives is a pain; and I have to waste my time
trying to get my domain un-categorized as a spam source in order for
people to get the mail.

[0]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-November/000296.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-November/000296.html&lt;/A&gt;
[1]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-November/000300.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-November/000300.html&lt;/A&gt;
[2]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-November/000303.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-November/000303.html&lt;/A&gt;
[3]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000307.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000307.html&lt;/A&gt;
[4]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-November/000299.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-November/000299.html&lt;/A&gt;
[5]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000309.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000309.html&lt;/A&gt;
[6]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000313.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000313.html&lt;/A&gt;
[7]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000314.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000314.html&lt;/A&gt;
[8]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000315.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000315.html&lt;/A&gt;
[9]: &lt;A HREF=&quot;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000316.html&quot;&gt;http://lists.canonical.org/pipermail/kragen-tol/1998-December/000316.html&lt;/A&gt;

A better solution?
------------------

I wonder if there's a better solution today. For example, if I put all
this stuff into a Git repository, then I could add new articles, other
people could add new articles too, I could edit existing articles
without losing the records of the old versions, and everybody who edited
it would have a copy that was protected from external corruption.

Unfortunately I don't think there's a good way to prove when somebody
first received a particular thing using Git. Git commits are authored
and dated, but there is no authentication of the author or the date. Git
supports cryptographically-signed tags, but it doesn't create them in
its normal workflow, so typically only a few people make them; they're
used for things like major software releases.
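
To illustrate why the hashes alone don't authenticate a date, here is a
sketch in Python of how Git names an object, assuming the SHA-1 object
format. The function name `git_object_id` and the sample article text
are made up. A date is just more bytes in the hashed content, so
content with a false date gets an id that looks every bit as valid.

```python
import hashlib

def git_object_id(kind, body):
    # Git names an object by the SHA-1 of a short header plus its bytes
    header = kind.encode() + b' ' + str(len(body)).encode() + b'\0'
    return hashlib.sha1(header + body).hexdigest()

empty_blob = git_object_id('blob', b'')
honest = git_object_id('blob', b'article text, dated 2012\n')
forged = git_object_id('blob', b'article text, dated 2010\n')
print(empty_blob, honest != forged)
```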

So the threat model is something like this.

It's 2025. FooCorp is suing BarCorp for patent infringement; tens of
millions of dollars are at stake. BarCorp wants to show that the
supposedly infringed patent claims are invalid because they were
published openly in 2012 on my git-based equivalent to Halfbakery, four
years before the patent was filed in 2016.

BarCorp presents the publication of this technique as evidence, along
with the Git commit in which it was created and the commits since
then, by a variety of authors, and shows that the same commit history is
found in several different contributors' replicas; they explain how
Git's content-based blob store prevents tampering with the history of
any commit.

FooCorp has branched from a very old commit, set their computer clock
back to 2010, added an article with that 2010 date describing a
well-known celebrity scandal of 2023, and then constructed a similar,
but fake, commit history with a variety of imaginary authors, over the
next fifteen years. (As it happens, `git rebase` can be used to do this
fairly easily, but if it didn't exist it would be easy to write it
yourself.)

Sometime in 2025, during this court case, they managed to get some
legitimate contributor somewhere to merge their branch in, and that
contributor's branch got merged into the cloud of the popular versions
of the repository.  So in the very same commit history that BarCorp
presented to the court, the FooCorp lawyer locates this obviously recent
article supposedly from 2010, dated to 2010 using the same techniques
that BarCorp was using to prove that the publication of the patented
technique happened in 2012 and not 2022.

Consequently, the court ignores BarCorp's evidence and finds in favor of
FooCorp.

Any ideas? Is there some feature of Git or the court system I'm not
familiar with that makes this scenario implausible? Or provides a way to
prevent it?

By contrast, the Received: line on people's email would probably stand
up, if you could find half a dozen people whose mail was stored in
different places.
&lt;/PRE&gt;

</description>
  </item>
  </channel>
</rss>