Thu, 25 Mar 2010

Le 25 mars 10 à 08:37, Kragen Javier Sitaker a écrit :
> This curriculum would need to be administered by a piece of software
> that statistically modeled the current state of your knowledge, so as
> to be able to drill you on what needed drilling at that moment, and
> kept drilling you at the edge of your competency continuously for many
> hours a day.

cf http://www.quisition.com/ for a similar piece of software.

off the top of my head, my guess is that your bit rate estimate may  
be right for learning but very optimistic for retention.

-Dave

PS have you seen Europanto?
http://www.europanto.be/cabillot1.html

How quickly is it possible to learn a foreign language?

I estimate that, with an optimally structured curriculum, learning a
completely foreign language with no structure or vocabulary in common
with any language you know should take 70 hours or less --- about a
week --- to get to a level of full, if slow, reading competency. This
is such an astonishingly short estimate that it must surely be wrong,
but I don’t know exactly how. Alternatively, proper curriculum design
could enable people to learn things such as foreign languages two
orders of magnitude faster than they currently do.

Understanding a certain minimal vocabulary size is needed for fluency.
----------------------------------------------------------------------

One of the larger tasks in learning a foreign language is vocabulary
memorization. The morphemic inventory of a language is essentially
arbitrary and thus must be memorized, along with a collection of
idiomatic phrases with arbitrary meanings.

But memorizing the entire vocabulary of any language is impossible,
because vocabularies are open, not closed, and learning words or
morphemes at random (say, by choosing random pages in a dictionary)
will tend to be very inefficient, because the vast majority of
distinct morphemes (and words) are hapaxes: they only occur once in
the entire corpus of the language, so learning them is a waste of
time.

There’s a threshold effect in vocabulary learning, vaguely analogous
to the threshold effect in percolation theory. If you know only 80% of
the words (or morphemes) in a text, it’s likely to be almost
completely impenetrable. But if you know 95%, you’re likely to be able
to guess the other 5% of the words by context.  Somewhere between the
two is a critical threshold. The question then arises, how many
morphemes do you need to know in order to understand 85% or 90% or 95%
of the morphemes in a text?

Of course, the size varies by text; some texts use much larger
vocabularies than others.

That vocabulary size is about 6000 morphemes.
---------------------------------------------

<http://www.balancedreading.com/vocabulary.html> makes the following
claims about vocabulary sizes:

    | Group                                     | Vocabulary size                          | Source                             |
    |-------------------------------------------+--------------------------------------+------------------------------------|
    | 5 to 6 year olds                          | 2500-5000 words                          | Beck and McKeown (1991)            |
    | literate adults                           | 50000 words                              | (his own guess)                    |
    | an average high school senior             | 45000 words                              | Nagy and Anderson (1984)           |
    | same                                      | 17000 words                              | D’Anna, Zechmeister, & Hall (1991) |
    | same                                      | 5000 words                               | Hirsh & Nation (1992)              |
    | the English language                      | 50000-60000 “word     families” (lemmas) | “most linguists”                   |
    | adults who “don’t read very much”         | 5000-10000 word families                 | (his own guess)                    |
    | “a ‘typical’ college graduate”            | 20000 word families                      | “I have seen estimates”            |
    | “people who read 3 to 4 hours a day”      | 25000 word families                      | (his own guess)                    |
    | “somebody who dropped out of high school” | 5000-6000 word families                  | (his own guess)                    |
    

You should also be able to make a sort of ballpark estimate by doing
the statistics on a large corpus.

I have handy the word frequencies from the British National Corpus for
all words that occur 5 or more times. I ought to do these statistics
over the entire corpus, and using morphological analysis, but this is
a first cut.

> Aside: Google’s teraword corpus
> -------------------------------
> 
> I wanted to redo this analysis with the corpus of N-grams over a
> trillion tokens that Google has published, but the only word
> frequency list I have handy is the one Norvig posted on his web page
> at <http://norvig.com/ngrams/>, which only covers 333,333 types,
> which together only add up to about 588 billion words: less than 60%
> of the total corpus. I don’t know why the most common third of a
> million types cover less than two-thirds of the corpus. The last
> “words” in there are nonword things like “lolge”, “oooglo”,
> “klonipan”, “britishairs”, “lgoggle”, “magapass”, “offool”,
> “antemortem”, “iconw”, and “sshool”, occurring at about 13 parts per
> billion each. I’m quite sure that it is not the case that two out of
> every five words in the Web *I’m* reading are weirder and more
> uncommon than “lgoggle”, and no significant amount of what I'm
> reading consists of near-anagrams of “google”.
>
> So I don’t know what’s going on here but I don’t think this corpus
> is useful for my purposes at the moment.

In this part of the BNC corpus, the total number of tokens (of 109 557
distinct types, most of which are misspellings) is 90 080 933, about
90 million.

80% of this number is 72 064 746. To know 80% of the words (tokens) in
this part of the corpus, you would have to know all 2028 words that
occur at least as frequently as “lie”, which occurs 4610 times (one
out of every 19 540 words). The most frequent words you wouldn’t know
yet would be: stress severe liked mentioned contains.

85% of this number is 76 568 793. To know 85% of these words, you
would have to know all 3335 words that occur at least as frequently as
“leather”, which occurs 2564 times (one out of every 35 132
words). The most frequent words you wouldn’t know yet would be:
recommendations appreciate alexander cool solutions.

90% of this number is 81 072 839. To know 90% of these words, you’d
need 5982 words up to “fry”, 1155 times (one out of every 78 195
words), leaving words like: subjective rage pencil cheerful cd.

95% of this number is 85 576 886. To know 95%, you’d need 13 141 words
up to “loser”, 347 times (one out of every quarter-million words),
leaving words like: hammersmith graded exporters englishmen endured.

Now, there are some difficulties that may make these estimates
meaningless, quite aside from the large spread between 3000 and 13000
words! First, the number of morphemes is quite a bit smaller than the
total number of words (“lose” occurs in five obvious forms: lose,
loser, loses, losers, losing, plus lost and lostness, although “loser”
has an idiomatic meaning that must be memorized, while “fry” occurs in
nine forms: frying, fryer, fryers, frys, fried, fries, frier, and
frypan); second, the words that occur fewer than 5 times in the BNC
may actually shift the 95th percentile by quite a large number of
distinct morphemes.

Memorizing 6000 morphemes should take about 50 hours.
-----------------------------------------------------

For now I will choose to pretend that the number of essential
morphemes to reach the critical threshold is somewhere around
6000. How to estimate how difficult this learning task is?

If we model human long-term memory as a simple store that takes input
at about 0.5 bits per second, we need only estimate how many bits are
contained in 6000 or so arbitrary meanings. Clearly each one of these
requires about 13 bits to distinguish it from any of the others, but
perhaps a couple more bits to distinguish it from other equally basic
concepts that happen not to be encoded as single morphemes in a
particular language.  So that gives us around 90 000 bits, which
should require around 180 000 seconds of optimal memorization: only
about 50 hours!  This would involve learning about two new vocabulary
words per minute, which seems like a plausible rate.

People currently take much, much more than 50 hours to learn a new language.
----------------------------------------------------------------------------

50 hours is a rather remarkably small number. If it’s really accurate,
it should be possible to learn the entire essential vocabulary of a
new foreign language each week or so, or possibly even twice a week,
to the point of being able to glark the meanings of new terms from
context. This is at least one order of magnitude better than
commonly-observed performance in foreign-language learning. Why might
this be?

One possible explanation is that people are usually learning not only
the vocabulary of the language, but also its alphabet, orthography,
phonology, morphology, syntax, and pragmatics at the same time.

A second possible explanation is that typical vocabulary memorization
is very badly structured: much memorization time is wasted on
overlearning of common morphemes (after a certain point, you don’t
encounter new morphemes often enough any more, or re-encounter them
early enough to reinforce the previous learning of them), while much
of the rest is wasted on learning morphemes that aren’t among the most
frequent few thousand (words like “reverse”, “retaliate”, “railroad”,
and the like, which could likely be learned from context later on);
and students often have to uselessly learn different compounds of the
same morpheme because they are learning the morpheme before they learn
the productive morphology that would allow them to represent the
compound words more efficiently in their memory.

A third possibility is that the model is wrong: either representing
the meaning of a morpheme takes many more than 15 bits, many more than
6000 morphemes are needed to reach the critical threshold, or human
long-term memory is not capable of accepting input at such a high
sustained rate, or it is only capable of doing so under special
circumstances.

If each new morpheme must be presented 20 times (at exponentially
increasing intervals) before being properly learned, that leaves time
for about 1.5 seconds per repetition, which seems challenging and
exhausting, but not implausible.

If such a system worked properly, then once you had learned the
alphabet, orthography, phonology, and morphology of a language (along
with perhaps the couple hundred most common morphemes), you could
spend 8 hours a day for 6 or 7 days being drilled on vocabulary
morphemes, starting with the most common ones --- in a featureless
white room with no distractions --- and go home exhausted
afterwards. And in that week, you’d go from almost no vocabulary to a
working vocabulary that allowed you to read and speak the language
with apparent fluency.  If the language had many cognates or calques
with a language you already knew, you could do it much faster, perhaps
in two or three days. This would be an incredible feat.

Other parts of the language learning task would total 12 to 20 hours.
---------------------------------------------------------------------

How long would it take to learn the rest of the language?  The other
learning tasks are essentially procedural rather than declarative
knowledge, and so they must be learned by usage and practice, rather
than drills. However, as I asserted at the beginning, they require
memorizing much less information than the vocabulary does.

I think you could probably learn a new alphabetic script in two to
eight hours, although the orthography, phonology, and morphology of
the language may be considerably more arbitrary than that. Still, if
you could be explicitly given the information needed about these
aspects of the language, then practice at the edge of your competency
with continuous feedback until you’d reached a reasonable level of
mastery, you ought to be able to reach a minimal level of mastery of
these other aspects all together in time comparable to that needed to
learn the vocabulary.

This curriculum would need to be administered by a piece of software
that statistically modeled the current state of your knowledge, so as
to be able to drill you on what needed drilling at that moment, and
kept drilling you at the edge of your competency continuously for many
hours a day. No human instructor could be expected to keep such a
complete model of your competency in their limited human mind, and
even if they could, they would have to work just as hard as you were.

Constructing the curriculum would require a careful linguistic
analysis of a particular dialect of the language in question, put into
a machine-readable form.

First, the learner would study the phonology of the language. This
requires training a speech-recognition system for the language, then
having the student repeat back common words, correcting them (“bit,
not beat!”) when they mispronounce them. This also requires analysis
of the most frequent phonemes in the language, and also perhaps an
analysis of which distinctions are the most difficult for new learners
to learn to recognize and reproduce.

This should probably overlap somewhat with the study of the language’s
orthography: once the basics of the phonetic system are down (maybe 40
phonemes: 10 minutes to be able to distinguish them all?) you can
present the names of the letters and grade the learner on their
ability to name the letters and draw the letters named, again
correcting them when they get it wrong. This should require perhaps
another 10 minutes.

At this point, you can also present the spellings of the words that
are being used for pronunciation practice, now that the learner knows
how to recognize and produce the different phonemes; and you can start
to teach a keyboard layout that will allow the learner to indicate
their recognition of a word by typing it rather than handwriting it or
repeating it aloud.

Next would be phonotactics: how phonemes can fit together, the rules
that govern when particular allophones are produced, which is crucial
to reproducing a particular accent correctly. These rules can be
relatively complex, so this might require a few hours of drill.

Once the learner can read, write, pronounce, and hear the language up
to the phonetic level with a reasonable level of accuracy, it’s time
to present morphology. There may be a few hundred rules to learn, each
with particular morphemes they necessarily involve.

Learning a few hundred morphological rules might take a while. Since
the learner has internalized the phonotactic rules of the language to
some degree, they can compress the phonetic realization of each
morpheme using a first- or second-order Markov chain model, which I
think should compress the pronunciation of a typical morpheme to 8
bits or so. So learning, say, 300 morphemes requires memorizing 2400
bits, which should take about 80 minutes --- the rules themselves will
be substantially more complex to memorize and practice, requiring
considerably more than 8 bits to represent. So this could take 10
hours or more.

At this point the learner has spent maybe 12 to 20 hours learning the
basics of the language, and knows the 500 or so most common morphemes,
enough to recognize maybe two-thirds or three-quarters of the
morphemes in most texts. At this point they must embark upon the task
estimated earlier at 50 hours.

Mon, 22 Mar 2010

I was lying in bed thinking about 20Q.net. What's the ideal strategy for
playing 20 Questions, given some statistical knowledge base about the world?

The traditional approach, which you can assign in a beginning programming
class, is to build a binary tree of questions, adding onto the tip when
you guess wrong. This is probably good for a start. For slightly better
performance, you can rebalance the tree using the same left and right
rotations used to manage a red-black tree, either to reduce the maximum
height of the tree or the average height of answers.

The main trouble with this approach is that sometimes different people
will answer the same question differently for the same item, because
it's vague or they don't know the answer. (Do you hold the movie
THX-1138 when you use it? Is the movie carbon-based? Can linoleum cheer
you up? Does plaster weigh more than 1 ton, and is it annoying? Can you
hold the font named Papyrus, and do you use it at night? Does Django
Reinhardt have cash value or win races? Do you hold a lentil when you
use it?) So you end up failing to guess the item right, and eventually
building big potentially duplicate question trees.

20Q.net uses a neural network to cope with this problem, and supplies
additional answers; in addition to the usual "yes" and "no", it has
"Unknown", "Irrelevant", "Sometimes", "Probably", and "Doubtful".
Nevertheless, it doesn't do very well; I thought of linoleum, plaster,
THX-1138, Django Reinhardt, and a lentil, and it failed all four times
with 30 questions.  Yet all four of these items, like most of the things
20Q tries to guess, have articles in the English Wikipedia, which only
has 3.2 million articles --- 22 bits' worth.

Here's a sample analysis of a game I played, where unfortunately 20Q's
knowledge base failed pretty badly:

> You were thinking of linoleum.
> 
> You said it's classified as Vegetable, 20Q was taught by other
> players that the answer is Other.
> 
> Is it orange colored? You said Sometimes, 20Q was taught by
> other players that the answer is Doubtful.
> 
> Is it used with animals? You said Sometimes, 20Q was taught by
> other players that the answer is Doubtful.
> 
> Does it live above ground? You said No, 20Q was taught by other
> players that the answer is Yes.
> 
> Does it come from a plant? You said Yes, 20Q was taught by other
> players that the answer is No.
> 
> Does it fit in your wallet? You said Sometimes, 20Q was taught
> by other players that the answer is No.
> 
> Do you use it at work? You said Yes, 20Q was taught by other
> players that the answer is No.
> 
> Can it be painted? You said Yes, 20Q was taught by other players
> that the answer is No.
> 
> Does it have writing on it? You said Sometimes, 20Q was taught
> by other players that the answer is No.
> 
> Does it come in a box? You said Yes, 20Q was taught by other
> players that the answer is No.

Now, some of the answers here are simply wrong (linoleum can easily be
painted, and when sold at retail it is almost always boxed) perhaps
these are not the optimal questions 20Q could have asked.

Is there an optimal approach to this problem?

Consider this model. You have a database of N items and M questions.
Each of the N items has a prior probability: an estimate of how likely a
person is to choose that item. For each item-question pair, you have a
yes-probability P(yes | Q, X): the probability that the person would
answer "yes" to question Q if they are thinking of item X.  Among the M
questions are N questions of the form "Is it X?", whose yes-probability
is defined as 1.

Given this model, a naïve Bayesian approach to estimating the
probabilities of items, given a particular set of answers to questions,
seems quite viable.  If only 5% of players think that the answer to "Do
you make things with it?" is "no" when the object is "plaster", then it
is probably reasonable, given a "no" answer, to diminish the estimate of
the probability that the object is plaster by a factor of 20.

Where naïve Bayesian models can fall down badly (uh, or so I hear; I
don't actually know this stuff) is when there's correlation among their
inputs. For example, 20Q.net asked me both, "Does [plaster] come in many
varieties?" and "Is [plaster] made in many different styles?" There are
very few items for which the answers to these two questions would
differ, so it's a mistake to treat these two as independent data points.

One approach to choosing questions is to try to maximize the information
gained with each question.

Suppose you have a particular probability distributions for the N items
(not necessarily the prior probabilities). Then you can calculate the
information content of someone telling you, "I was thinking of item X";
it's simply -lg P(X) bits. The *expected* information content of that
statement, before you know what X is, is the sum of -P(X) lg P(X) over
all the possible values of X. (This is the same as the information per
symbol on a channel.) There must be a name for this metric; I'll
confusingly call it the "remaining bit rate" of that probability
distribution.

In a sense, what you want to do is choose a next question to reduce the
remaining bit rate of the resulting distribution as much as possible ---
approaching one bit as closely as possible, for a yes/no question.

So, given some way of updating your probabilities given the answer to a
question, you can build a question tree as follows:

1. Start with the prior probabilities at the root node.

2. To choose the question for a node:

    1. For each question Q that has not been asked:

        1. Calculate the probability that the answer to the question
	will be "yes", given the prior probabilities at that node and
	the item-question pair probabilities: the sum of 
	P(X) * P(yes | Q, X) for that question Q for all items X.

	2. Calculate the new probability distributions given a yes
	answer to that question and given a no answer to that question.

	3. Calculate the remaining bit rates of those probability
	distributions, and then multiply by the answer probabilities in
	order to come up with an expected bit rate after the question
	has been answered.

    2. Choose the question Q with the lowest expected remaining bit
    rate. Create two new nodes with the probability distributions given
    previously.

    3. Choose the questions for those two new nodes using the same
    method, unless the question was of the form "Is it X?", in which
    case the "yes" node is terminal, or unless the tree has gotten too
    deep.

(This can be straightforwardly generalized for more than two answers to
each question.)

Now, given a database of past games, you can probably do better than
naïve Bayesian updating.  Given that you're examining a node at which
you know, say, that the item is round, hard, and not usually sliced or
carved (from ancestor questions), you can come up with a possibly more
accurate set of probability distributions for, say, "Does it fit in an
envelope?" by tabulating the previously-chosen items that were round,
hard, not usually sliced or carved, and fit in envelopes, versus those
that had those same attributes but did. (And, when there aren't very
many in one category or the other, you can "smooth" by adjusting the
probability a bit with other similar games that differed on one
attribute or another.)

This procedure should avoid asking repetitive questions that are
repetitive, like the "Does it come in many varieties?" example earlier,
because it will discover that asking a second version of that question
will provide very little additional information. Even if it doesn't
discover this at first, after the repetitive questions make it into some
tree paths that are played through a number of times, it will discover
it then.

One downside is that the computation described above is at least O(NM),
which is probably O(N²). So it's probably feasible for a thousand or a
million items, at least as a batch process to update a question tree
periodically, but not a billion.

Like the Open Mind Common Sense project and the Verbosity game on Games
With A Purpose, the twenty-questions game could quickly produce a
database of common-sense beliefs about objects: not only which things
are believed to be "round", but also a statistical strength of belief
for each such predicate. 20Q.net claims to have 76 million games played
already, which ought to give it a pretty decent database for this
purpose, but I don't think it's published.

Wed, 17 Mar 2010

David Rowe (VK5DGR) has been doing some absolutely awesome work on
open-source speech codec development (for lossy compression of speech,
for example for certain radio frequency bands where available bandwidth
is very limited):

<http://www.rowetel.com/ucasterisk/codec2.html>
<http://www.rowetel.com/blog/?p=132>

He's shooting for good speech quality at 2400 bits per second, which is
apparently close to the best proprietary codecs. And he's using an
approach fairly similar in many ways to what the rest of this post
describes.

But today I ran into this truly astonishing research project at Haskins
Laboratories at Yale, by Robert Remez and Philip Rubin:

<http://www.haskins.yale.edu/featured/sws/sws.html>

There is some further exploration, which unfortunately I couldn't listen
to successfully, at Remez's site:

<http://www.columbia.edu/~remez/Site/Musical%20Sinewave%20Speech.html>

And there's a Wikipedia page:

<http://en.wikipedia.org/wiki/Sinewave_synthesis>

They're synthesizing comprehensible --- if slow --- English speech out
of nothing more than three or four formants, realized purely as sine
waves. The sound recordings are truly astonishing to listen to. They
don't sound like human speech; they don't sound like synthesized speech;
they sound like the whistling and squealing sounds when you're trying to
tune an AM radio. But you can understand them.

(And apparently they started this research in the 1970s, their Fortran
code is from 1980, they put up a web page about it in 1996, and the
current Matlab version of the code is from 2003. The publications on
their publications page are from 1980 to 1994.)

This made me wonder how low you can really push the bandwidth of a
usable speech codec. Presumably you could do k-nearest-neighbors
averaging on a database of recorded speech sounds in order to get back
from the sine waves to something that more closely resembles human
speech. But how much bandwidth would it take to transmit the sine-wave
information?

They've put online the parameters they used to synthesize their sample
utterances; one of them is at
<http://www.haskins.yale.edu/featured/sws/swssentences/S7pars.html>. It
encodes the sentence "Please say what this word is," about 19 phones, in
1.68 seconds.  The text file is 15045 bytes, which isn't particularly
good; that's almost 80kbps. But even `gzip` can compress it to 3333
bytes, which brings it down below 16 kilobits per second.

That is, even without intending to, their research comprises a speech
codec that can produce comprehensible speech at 16 kilobits per second,
slightly more than GSM uses (although GSM produces very realistic
speech).

Beyond that, though, we'd need to discard more information. The
parameters file in its current form divides the utterance up into 10ms
frames, with pitch and amplitude information for each formant in each
frame; between frames, the pitch and amplitude are linearly
interpolated.  The pitch information in that file ranges from 136Hz to
4597Hz, and is already quantized to, apparently, 1Hz.  The amplitude
information is represented to six significant figures.

Suppose that, instead, we reduced the frames to a few "keyframes" and
interpolated between them with cubic splines. We probably need at least
one keyframe per phone, and probably 1.5 or so. That would give us about
17 keyframes per second, which is an improvement of a factor of 6. If
that didn't affect its gzip-compressibility, that alone would get us
down to 2700 bits per second.

But there's no need to spend 13 or 16 bits per formant per keyframe on
the frequency; we can almost certainly quantize the frequency
logarithmically to within one semitone. The range in question is almost
61 semitones, so you only need six bits.

Similarly, the amplitude probably doesn't need ten bits of precision.
Probably four bits (logarithmic; maybe 2dB each, for a total dynamic
range of 32dB) would do fine.

And the interval between keyframes can probably be quantized to 10ms,
and range up to, say, 160ms, which would require four bits of keyframe
duration.

So a keyframe consisting of four bits of timing, and four formants each
with ten bits of pitch and amplitude information, would occupy a total
of 44 bits, for a total of about 750 bits per second, or 210 bits per
word in this case (since you'd need fewer keyframes when the speech was
slower).  That's about five times worse than ASCII text.

Also, in many of the frames, not all of the formants were present; out
of the 169 frames in that file, there were an average of 2.5 formants
present. If the average number of formants were the same for keyframes,
and you used two bits per frame to indicate the number of formants, the
average frame size would fall from 44 bits to 4+2+25 = 31 bits, and the
total bit rate would fall to 530 bits per second. But in practice, you
would probably tend to choose fewer keyframes in segments with fewer
formants. (This would also reduce the bit rate for silence to 6 bits per
keyframe, a little over 6 times per second: almost 38 bits per second.)

If you had some kind of entropy or linear-predictive-plus-quantized-
residuals coding for the keyframes, you might be able to do still
better; in essence, you could take advantage of the kinds of
redundancies that phonotactics enforces --- consonants tend to alternate
with vowels, for example.

How to choose keyframes? A simple greedy approach would be to start with
100 frames per second, and then iteratively remove the frame that would
produce the smallest error, until removing further frames would generate
unacceptable levels of error. Another simple greedy approach would be to
start with no frames, and then iteratively add keyframes at the point
where the interpolated spectrograms were the farthest from the real
signal, until the spectrogram was close enough.

Neither of those approaches to keyframe selection can work well in
real-time. A simple approach that would probably work well in real-time
would be to maintain two "possible keyframes" at the present moment and
just before, and whenever the spectrogram interpolated from the last
emitted keyframes to the "possible keyframes" becomes too far from the
real signal, emit a keyframe at the point in the recent past where the
error is greatest.

All of these approaches, of course, have to be adjusted to not exceed
the maximum representable keyframe interval.

Other tweaks to try:

- Add a bit per keyframe to indicate that the spectrogram has a
  discontinuous break at that point, rather than interpolating. (This
  could avoid transmitting as many as three more closely spaced
  keyframes.)
- Add a bit per keyframe to indicate the presence or absence of voicing,
  as almost all vocoder algorithms do.
- More generally, add two or three bits per keyframe to indicate the
  average bandwidth of the formants.
- Transmit parameters for a voice model toward the beginning of the
  connection or periodically throughout the recording, in parallel with
  the formant frequency data, so that the synthesized voice can sound at
  least vaguely like the speaker instead of like someone else. If you
  had a perfect model of the range of variation of human voices, 36 bits
  would be enough to uniquely specify the voice of any person who's ever
  lived, and another 26 bits would be enough to specify a minute of
  their life. How close to that can you get with some kind of
  parametric model? Can you come up with a model that describes the
  unique timbre of a person's vocal tract in a small number of
  ruthlessly quantized coefficients, say, 25 dimensions of four bits
  each?
- Since the formants can be constrained to be transmitted in sorted
  order, transmit formant frequencies as intervals (ratios) from the
  previous formant's frequency rather than independently. This could
  reduce the size of each frequency transmitted from 6 bits to 5 or 4.
- Nonuniform encoding for the per-frame formant parameters. This would
  probably require some pretty heavy-duty psychoacoustic research to
  validate (and someone has probably already done it), but perhaps, say,
  there is less tolerance for error in the interval between two formants
  when they are close together, because the difference between a perfect
  fifth and a perfect fourth is more audible than the difference between
  a 16:3 and an 18:3 interval --- which are a perfect fourth and fifth
  plus two octaves. Or perhaps amplitude variation is more important at
  high frequencies.
- Update only the higher-frequency formants in some keyframes. The
  frequency and amplitude of a 200Hz formant can't change very rapidly.
  In a 10ms frame you only get two full cycles! So if you're looking at
  10ms, unless I'm confused about the math, your first few discrete
  Fourier transform coefficients are DC, 100Hz, 200Hz, and 300Hz. So you
  can't detect even fairly large shifts in its frequency --- if it were
  to drop or rise by a whole fifth, seven semitones, you wouldn't even
  notice until you're looking at a longer period of time.  On the other
  hand, if a 4000Hz formant drops to 3900Hz --- less than half a
  semitone --- you could detect that in the DFT of those 10ms.
  Presumably similar constraints apply to your ear: you can't detect if
  a 200Hz signal jumps to 216Hz over a 10ms period; you need a longer
  period of time. So you could emit updates for the high-frequency
  formants more frequently.  This would add a couple of bits per
  keyframe (to indicate which formants were being updated), but most
  keyframes would only contain one formant.

Klatt 1987 reports that the Speak 'n' Spell had stored about 1000 bits
per second of speech, using linear predictive coding:

<http://americanhistory.si.edu/archives/speechsynthesis/dk_749.htm>

Dan Ellis, who wrote the current Matlab version on the Haskins Lab site,
talks about the connection with LPC vocoders:

<http://labrosa.ee.columbia.edu/matlab/sws/>

Mon, 15 Mar 2010

Le 13 mars 10 à 03:45, Kragen Javier Sitaker a écrit :
> In order for this to work, though, there needs to be a record of these
> things having been published at a particular time; things published
> after a patent application has been filed are not "prior art" and  
> do not
> invalidate patent claims.

A long time ago there was someone (ex-Bell?) who was offering a  
timestamping service; IIRC documents and metadata were securely  
hashed, and these hashes then hashed, until finally the daily root  
hash was published as a classified ad in the New York Times (yes,  
back when newspapers existed and still printed classified ads :-)

-Dave

On Sun, Mar 14, 2010 at 01:17:44PM +0100, Aristotle Pagaltzis wrote:
> * Kragen Javier Sitaker <kragen at canonical.org> [2010-03-13 03:50]:
> > Is there some feature of Git or the court system I'm not
> > familiar with that makes this scenario implausible?
> 
> There is a Git feature you *are* already familiar with. I suppose
> you’d have to convince people to keep their reflogs forever (and
> maybe sign them) – since that records which head was set to which
> commit at which point in time.

Yeah, it's too bad gc.reflogExpire isn't replicated upon git-clone, or
that would be simple!

> (I wish there was a one-stop configuration setting to tell Git to
> never expire any data of its data *ever*.)

That would be ideal for this, but only if you could get it to be turned
on by default for clones of a given repo.

Sun, 14 Mar 2010

* Kragen Javier Sitaker <kragen at canonical.org> [2010-03-13 03:50]:
> Is there some feature of Git or the court system I'm not
> familiar with that makes this scenario implausible?

There is a Git feature you *are* already familiar with. I suppose
you’d have to convince people to keep their reflogs forever (and
maybe sign them) – since that records which head was set to which
commit at which point in time.

(I wish there was a one-stop configuration setting to tell Git to
never expire any data of its data *ever*.)

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

Fri, 12 Mar 2010

kragen-tol is a mailing list rather than a blog for a couple of reasons.

The first was that, when I [set up the list and made my first post][0],
(about prisons, crime, policing, democracy, and humanity), it was
November 12th, 1998.  Some people had blogs; I read some blogs; but it
wasn't yet the default means of publication that it has become now.
There's a blog version of the list at <http://bentwookie.org/blog/kragen-tol/>,
but because it's not primary, I haven't made the formatting on it work
well.

The second was that I want my list mail to serve as prior art to stop
obvious patents from being granted, or to revoke them or the obvious
claims in them in court. (Re-examination didn't exist yet, if I recall
correctly.) Some examples: I posted [a simple system architecture for
MMORPGs] [1], [an article about networked automated fabrication] [4],
and [some thoughts on interactive kinematic modeling] [2] that month.
Later that year, I wrote about [uses for ubiquitous computing] [3],
[ballistic transport in evacuated tunnels] [5], [a technique for
transparent CPU virtualization] [6], [fault-tolerant distributed
computation on untrusted CPUs] [7], [a technique for time-travel
debugging] [8] (which Michael Elizabeth Chastain implemented around the
same time as mec-replay), and [a lightweight device for stopping
bullets] [9].

In order for this to work, though, there needs to be a record of these
things having been published at a particular time; things published
after a patent application has been filed are not "prior art" and do not
invalidate patent claims.

My thought was that publishing these things on a mailing list, rather
than merely on the web, would create a number of distributed copies in
different subscribers' mailboxes, each timestamped with the time that it
had originally been received. This way, there would be more than just my
word to go on. A number of people with different interests would have
records of the publication.

There are some big disadvantages to publishing in mailing-list form.
It's not very observable (what mailing lists are your friends on?), so
it doesn't spread very fast; the formatting sucks unless you send HTML
email; reading the archives is a pain; and I have to waste my time
trying to get my domain un-categorized as a spam source in order for
people to get the mail.

[0]: http://lists.canonical.org/pipermail/kragen-tol/1998-November/000296.html
[1]: http://lists.canonical.org/pipermail/kragen-tol/1998-November/000300.html
[2]: http://lists.canonical.org/pipermail/kragen-tol/1998-November/000303.html
[3]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000307.html
[4]: http://lists.canonical.org/pipermail/kragen-tol/1998-November/000299.html
[5]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000309.html
[6]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000313.html
[7]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000314.html
[8]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000315.html
[9]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000316.html

A better solution?
------------------

I wonder if there's a better solution today. For example, if I put all
this stuff into a Git repository, then I could add new articles, other
people could add new articles too, I could edit existing articles
without losing the records of the old versions, and everybody who edited
it would have a copy that was protected from external corruption.

Unfortunately I don't think there's a good way to prove when somebody
first received a particular thing using Git. Git commits are authored
and dated, but there is no authentication of the author or the date. Git
supports cryptographically-signed tags, but it doesn't create them in
its normal workflow, so typically only a few people make them; they're
used for things like major software releases.

So the threat model is something like this.

It's 2025. FooCorp is suing BarCorp for patent infringement; tens of
millions of dollars are at stake. BarCorp wants to show that the
supposedly infringed patent claims are invalid because they were
published openly in 2012 on my git-based equivalent to Halfbakery, four
years before the patent was filed in 2016.

BarCorp presents the publication of this technique as evidence, along
with the Git commit in which they were created and the commits since
then, by a variety of authors, and shows that the same commit history is
found in several different contributors' replicas; they explain how
Git's content-based blob store prevents tampering with the history of
any commit.

FooCorp has branched from a very old commit, set their computer clock
back to 2010, added an article with that 2010 date describing a
well-known celebrity scandal of 2023, and then constructed a similar,
but fake, commit history with a variety of imaginary authors, over the
next fifteen years. (As it happens, `git rebase` can be used to do this
fairly easily, but if it didn't exist it would be easy to write it
yourself.)

Sometime in 2025, during this court case, they managed to get some
legitimate contributor somewhere to merge their branch in, and that
contributor's branch got merged into the cloud of the popular versions
of the repository.  So in the very same commit history that BarCorp
presented to the court, the FooCorp lawyer locates this obviously recent
article supposedly from 2010, dated to 2010 using the same techniques
that BarCorp was using to prove that the publication of the patented
technique happened in 2012 and not 2022.

Consequently, the court ignores BarCorp's evidence and finds in favor of
FooCorp.

Any ideas? Is there some feature of Git or the court system I'm not
familiar with that makes this scenario implausible? Or provides a way to
prevent it?

By contrast, the Received: line on people's email would probably stand
up, if you could find half a dozen people whose mail was stored in
different places.