Wed, 17 Mar 2010

David Rowe (VK5DGR) has been doing some absolutely awesome work on
open-source speech codec development (for lossy compression of speech,
for example for certain radio frequency bands where available bandwidth
is very limited):

<http://www.rowetel.com/ucasterisk/codec2.html>
<http://www.rowetel.com/blog/?p=132>

He's shooting for good speech quality at 2400 bits per second, which is
apparently close to the best proprietary codecs. And he's using an
approach fairly similar in many ways to what the rest of this post
describes.

But today I ran into this truly astonishing research project at Haskins
Laboratories at Yale, by Robert Remez and Philip Rubin:

<http://www.haskins.yale.edu/featured/sws/sws.html>

There is some further exploration, which unfortunately I couldn't listen
to successfully, at Remez's site:

<http://www.columbia.edu/~remez/Site/Musical%20Sinewave%20Speech.html>

And there's a Wikipedia page:

<http://en.wikipedia.org/wiki/Sinewave_synthesis>

They're synthesizing comprehensible --- if slow --- English speech out
of nothing more than three or four formants, realized purely as sine
waves. The sound recordings are truly astonishing to listen to. They
don't sound like human speech; they don't sound like synthesized speech;
they sound like the whistling and squealing sounds when you're trying to
tune an AM radio. But you can understand them.

(And apparently they started this research in the 1970s, their Fortran
code is from 1980, they put up a web page about it in 1996, and the
current Matlab version of the code is from 2003. The publications on
their publications page are from 1980 to 1994.)

This made me wonder how low you can really push the bandwidth of a
usable speech codec. Presumably you could do k-nearest-neighbors
averaging on a database of recorded speech sounds in order to get back
from the sine waves to something that more closely resembles human
speech. But how much bandwidth would it take to transmit the sine-wave
information?

They've put online the parameters they used to synthesize their sample
utterances; one of them is at
<http://www.haskins.yale.edu/featured/sws/swssentences/S7pars.html>. It
encodes the sentence "Please say what this word is," about 19 phones, in
1.68 seconds.  The text file is 15045 bytes, which isn't particularly
good; that's about 72kbps. But even `gzip` can compress it to 3333
bytes, which brings it down below 16 kilobits per second.
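That arithmetic is easy to check. The following is only a sketch in Python using the file sizes and duration quoted above; the function name `bits_per_second` is made up for illustration and has nothing to do with the Haskins code.

```python
# Figures quoted in the post for the S7 parameter file.
raw_bytes = 15045        # size of the plain-text parameter file
gzipped_bytes = 3333     # size after gzip
duration_s = 1.68        # length of the utterance in seconds

def bits_per_second(n_bytes, seconds):
    """Convert a file size and a duration into a bit rate."""
    return n_bytes * 8 / seconds

raw_bps = bits_per_second(raw_bytes, duration_s)
gz_bps = bits_per_second(gzipped_bytes, duration_s)
print(f"raw:     {raw_bps / 1000:.1f} kbps")
print(f"gzipped: {gz_bps / 1000:.1f} kbps")
```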

That is, even without intending to, their research comprises a speech
codec that can produce comprehensible speech at 16 kilobits per second,
slightly more than GSM uses (although GSM produces very realistic
speech).

Beyond that, though, we'd need to discard more information. The
parameters file in its current form divides the utterance up into 10ms
frames, with pitch and amplitude information for each formant in each
frame; between frames, the pitch and amplitude are linearly
interpolated.  The pitch information in that file ranges from 136Hz to
4597Hz, and is already quantized to, apparently, 1Hz.  The amplitude
information is represented to six significant figures.

Suppose that, instead, we reduced the frames to a few "keyframes" and
interpolated between them with cubic splines. We probably need at least
one keyframe per phone, and probably 1.5 or so. That would give us about
17 keyframes per second, which is an improvement of a factor of 6. If
that didn't affect its gzip-compressibility, that alone would get us
down to 2700 bits per second.

But there's no need to spend 13 or 16 bits per formant per keyframe on
the frequency; we can almost certainly quantize the frequency
logarithmically to within one semitone. The range in question is almost
61 semitones, so you only need six bits.
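As a rough sketch of what semitone quantization could look like, assuming the code simply counts semitones above the lowest frequency in the file (the names `hz_to_code` and `code_to_hz` are hypothetical):

```python
import math

F_MIN = 136.0    # lowest formant frequency in the parameter file, Hz
F_MAX = 4597.0   # highest formant frequency in the parameter file, Hz

def hz_to_code(freq_hz):
    """Quantize a frequency to the nearest whole semitone above F_MIN."""
    return round(12 * math.log2(freq_hz / F_MIN))

def code_to_hz(code):
    """Reconstruct the frequency at the center of a semitone step."""
    return F_MIN * 2 ** (code / 12)

span = 12 * math.log2(F_MAX / F_MIN)
print(f"range: {span:.2f} semitones")
print("largest code:", hz_to_code(F_MAX), "(six bits allow 0..63)")
```

Rounding to the nearest semitone means the reconstructed frequency is never more than half a semitone away from the original.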

Similarly, the amplitude probably doesn't need ten bits of precision.
Probably four bits (logarithmic; maybe 2dB each, for a total dynamic
range of 32dB) would do fine.

And the interval between keyframes can probably be quantized to 10ms,
and range up to, say, 160ms, which would require four bits of keyframe
duration.

So a keyframe consisting of four bits of timing, and four formants each
with ten bits of pitch and amplitude information, would occupy a total
of 44 bits, for a total of about 750 bits per second, or 210 bits per
word in this case (since you'd need fewer keyframes when the speech was
slower).  That's about five times worse than ASCII text.
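To make the 44-bit figure concrete, here is a hedged sketch of one possible bit layout: four bits of timing followed by four formants of six pitch bits and four amplitude bits each. The field order and the function names are assumptions for illustration; nothing above fixes a particular layout.

```python
def pack_keyframe(interval_ms, formants):
    """Pack one keyframe into a 44-bit integer.

    interval_ms: time since the previous keyframe, 10..160 in 10 ms steps.
    formants: four (pitch_code, amp_code) pairs; pitch 0..63, amplitude 0..15.
    """
    assert interval_ms % 10 == 0 and 10 <= interval_ms <= 160
    assert len(formants) == 4
    word = interval_ms // 10 - 1            # 4 bits of timing
    for pitch_code, amp_code in formants:
        assert 0 <= pitch_code < 64 and 0 <= amp_code < 16
        word = (word << 6) | pitch_code     # 6 bits of pitch
        word = (word << 4) | amp_code       # 4 bits of amplitude
    return word

def unpack_keyframe(word):
    """Inverse of pack_keyframe: returns (interval_ms, formants)."""
    formants = []
    for _ in range(4):
        amp_code = word & 0xF
        word >>= 4
        pitch_code = word & 0x3F
        word >>= 6
        formants.append((pitch_code, amp_code))
    formants.reverse()                      # fields were read last-to-first
    return (word + 1) * 10, formants

BITS_PER_KEYFRAME = 4 + 4 * (6 + 4)
print(BITS_PER_KEYFRAME, "bits per keyframe,",
      BITS_PER_KEYFRAME * 17, "bits per second at 17 keyframes per second")
```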

Also, in many of the frames, not all of the formants were present;
across the 169 frames in that file, there was an average of 2.5 formants
present per frame. If the average number of formants were the same for keyframes,
and you used two bits per frame to indicate the number of formants, the
average frame size would fall from 44 bits to 4+2+25 = 31 bits, and the
total bit rate would fall to 530 bits per second. But in practice, you
would probably tend to choose fewer keyframes in segments with fewer
formants. (This would also reduce the bit rate for silence to 6 bits per
keyframe, a little over 6 times per second: almost 38 bits per second.)
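The same estimate as a small calculation, using only the figures above (the constant and function names are made up for this sketch):

```python
KEYFRAMES_PER_S = 17       # estimate above: about 1.5 keyframes per phone
TIMING_BITS = 4            # keyframe interval, 10..160 ms
COUNT_BITS = 2             # the post's figure; 2 bits distinguish only four counts
BITS_PER_FORMANT = 6 + 4   # pitch + amplitude

def keyframe_bits(n_formants):
    """Size of a keyframe carrying the given (possibly average) formant count."""
    return TIMING_BITS + COUNT_BITS + n_formants * BITS_PER_FORMANT

average_bits = keyframe_bits(2.5)
speech_bps = average_bits * KEYFRAMES_PER_S
# During silence, one empty keyframe every 160 ms (the longest interval).
silence_bps = keyframe_bits(0) * (1000 / 160)

print(f"{average_bits:.0f} bits per average keyframe")
print(f"{speech_bps:.0f} bits per second for speech")
print(f"{silence_bps:.1f} bits per second for silence")
```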

If you had some kind of entropy or linear-predictive-plus-quantized-
residuals coding for the keyframes, you might be able to do still
better; in essence, you could take advantage of the kinds of
redundancies that phonotactics enforces --- consonants tend to alternate
with vowels, for example.

How to choose keyframes? A simple greedy approach would be to start with
100 frames per second, and then iteratively remove the frame that would
produce the smallest error, until removing further frames would generate
unacceptable levels of error. Another simple greedy approach would be to
start with no frames, and then iteratively add keyframes at the point
where the interpolated spectrograms were the farthest from the real
signal, until the spectrogram was close enough.
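The first greedy approach might look something like the sketch below. It makes two simplifying assumptions that are not in the text above: it works on a single scalar track (say, one formant's pitch per 10ms frame) rather than on spectrograms, and it interpolates linearly rather than with cubic splines. It is also deliberately naive about running time.

```python
def interpolate(keys, track, t):
    """Linearly interpolate track at frame t from the kept frame indices."""
    for left, right in zip(keys, keys[1:]):
        if left <= t <= right:
            frac = (t - left) / (right - left)
            return track[left] + frac * (track[right] - track[left])
    raise ValueError("t lies outside the kept frames")

def worst_error(keys, track):
    """Largest absolute error over all frames when only `keys` are kept."""
    return max(abs(track[t] - interpolate(keys, track, t))
               for t in range(len(track)))

def greedy_remove(track, tolerance):
    """Start with every frame; repeatedly drop the cheapest interior frame."""
    keys = list(range(len(track)))
    while len(keys) > 2:
        # Try dropping each interior frame and keep the least harmful removal.
        trials = [keys[:i] + keys[i + 1:] for i in range(1, len(keys) - 1)]
        best = min(trials, key=lambda trial: worst_error(trial, track))
        if worst_error(best, track) > tolerance:
            break                       # any further removal is unacceptable
        keys = best
    return keys

# A track that rises and then falls needs only its endpoints and its peak.
print(greedy_remove([0, 1, 2, 3, 4, 3, 2, 1, 0], tolerance=0.01))
```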

Neither of those approaches to keyframe selection can work well in
real time. A simple approach that would probably work well in real time
would be to maintain two "possible keyframes" at the present moment and
just before, and whenever the spectrogram interpolated from the last
emitted keyframes to the "possible keyframes" becomes too far from the
real signal, emit a keyframe at the point in the recent past where the
error is greatest.

All of these approaches, of course, have to be adjusted to not exceed
the maximum representable keyframe interval.
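A single-pass version could be sketched as follows. This simplifies the description above: it keeps only one candidate keyframe (the current frame) rather than two, works on one scalar track with linear interpolation, and the default `max_interval` of 16 frames assumes 10ms frames and the 160ms limit suggested earlier.

```python
def stream_keyframes(track, tolerance, max_interval=16):
    """Choose keyframe indices in one left-to-right pass over the track."""
    keys = [0]
    for t in range(1, len(track)):
        last = keys[-1]
        # Interpolate from the last emitted keyframe to the current frame
        # and find the intermediate frame that is reproduced worst.
        worst_err, worst_t = 0.0, None
        for u in range(last + 1, t):
            frac = (u - last) / (t - last)
            guess = track[last] + frac * (track[t] - track[last])
            err = abs(track[u] - guess)
            if err > worst_err:
                worst_err, worst_t = err, u
        if worst_err > tolerance:
            keys.append(worst_t)   # emit at the point of greatest error
        elif t - last >= max_interval:
            keys.append(t)         # forced by the maximum keyframe interval
    if keys[-1] != len(track) - 1:
        keys.append(len(track) - 1)  # always close with the final frame
    return keys

print(stream_keyframes([0, 1, 2, 3, 4, 3, 2, 1, 0], tolerance=0.01))
```

The latency of this scheme is bounded by the maximum interval, since a keyframe is always emitted by the time that many frames have passed.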

Other tweaks to try:

- Add a bit per keyframe to indicate that the spectrogram has a
  discontinuous break at that point, rather than interpolating. (This
  could avoid transmitting as many as three more closely spaced
  keyframes.)
- Add a bit per keyframe to indicate the presence or absence of voicing,
  as almost all vocoder algorithms do.
- More generally, add two or three bits per keyframe to indicate the
  average bandwidth of the formants.
- Transmit parameters for a voice model toward the beginning of the
  connection or periodically throughout the recording, in parallel with
  the formant frequency data, so that the synthesized voice can sound at
  least vaguely like the speaker instead of like someone else. If you
  had a perfect model of the range of variation of human voices, 36 bits
  would be enough to uniquely specify the voice of any person who's ever
  lived, and another 26 bits would be enough to specify a minute of
  their life. How close to that can you get with some kind of
  parametric model? Can you come up with a model that describes the
  unique timbre of a person's vocal tract in a small number of
  ruthlessly quantized coefficients, say, 25 dimensions of four bits
  each?
- Since the formants can be constrained to be transmitted in sorted
  order, transmit formant frequencies as intervals (ratios) from the
  previous formant's frequency rather than independently. This could
  reduce the size of each frequency transmitted from 6 bits to 5 or 4.
- Nonuniform encoding for the per-frame formant parameters. This would
  probably require some pretty heavy-duty psychoacoustic research to
  validate (and someone has probably already done it), but perhaps, say,
  there is less tolerance for error in the interval between two formants
  when they are close together, because the difference between a perfect
  fifth and a perfect fourth is more audible than the difference between
  a 16:3 and an 18:3 interval --- which are a perfect fourth and fifth
  plus two octaves. Or perhaps amplitude variation is more important at
  high frequencies.
- Update only the higher-frequency formants in some keyframes. The
  frequency and amplitude of a 200Hz formant can't change very rapidly.
  In a 10ms frame you only get two full cycles! So if you're looking at
  10ms, unless I'm confused about the math, your first few discrete
  Fourier transform coefficients are DC, 100Hz, 200Hz, and 300Hz. So you
  can't detect even fairly large shifts in its frequency --- if it were
  to drop or rise by a whole fifth, seven semitones, you wouldn't even
  notice until you're looking at a longer period of time.  On the other
  hand, if a 4000Hz formant drops to 3900Hz --- less than half a
  semitone --- you could detect that in the DFT of those 10ms.
  Presumably similar constraints apply to your ear: you can't detect if
  a 200Hz signal jumps to 216Hz over a 10ms period; you need a longer
  period of time. So you could emit updates for the high-frequency
  formants more frequently.  This would add a couple of bits per
  keyframe (to indicate which formants were being updated), but most
  keyframes would only contain one formant.
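The frequency-resolution argument in the last item can be checked numerically. This sketch assumes a plain rectangular window, so that DFT bins are spaced at the reciprocal of the window length; the function names are made up here.

```python
import math

def bin_spacing_hz(window_s):
    """Spacing between DFT bins for a window of the given length."""
    return 1.0 / window_s

def semitones_per_bin(freq_hz, window_s):
    """Musical interval covered by one bin's width, starting at freq_hz."""
    return 12 * math.log2((freq_hz + bin_spacing_hz(window_s)) / freq_hz)

window = 0.010   # 10 ms frame
print(f"bin spacing: {bin_spacing_hz(window):.0f} Hz")
for freq in (200, 4000):
    print(f"at {freq} Hz one bin spans "
          f"{semitones_per_bin(freq, window):.2f} semitones")
```

So with a fixed window, the resolution in musical terms is more than ten times coarser at 200Hz than at 4000Hz.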

Klatt 1987 reports that the Speak & Spell stored its speech at about
1000 bits per second, using linear predictive coding:

<http://americanhistory.si.edu/archives/speechsynthesis/dk_749.htm>

Dan Ellis, who wrote the current Matlab version on the Haskins Lab site,
talks about the connection with LPC vocoders:

<http://labrosa.ee.columbia.edu/matlab/sws/>

Fri, 12 Mar 2010

kragen-tol is a mailing list rather than a blog for a couple of reasons.

The first was that, when I [set up the list and made my first post][0]
(about prisons, crime, policing, democracy, and humanity), it was
November 12th, 1998.  Some people had blogs; I read some blogs; but
blogging wasn't yet the default means of publication that it has become now.
There's a blog version of the list at <http://bentwookie.org/blog/kragen-tol/>,
but because it's not primary, I haven't made the formatting on it work
well.

The second was that I want my list mail to serve as prior art to stop
obvious patents from being granted, or to revoke them or the obvious
claims in them in court. (Re-examination didn't exist yet, if I recall
correctly.) Some examples: I posted [a simple system architecture for
MMORPGs] [1], [an article about networked automated fabrication] [4],
and [some thoughts on interactive kinematic modeling] [2] that month.
Later that year, I wrote about [uses for ubiquitous computing] [3],
[ballistic transport in evacuated tunnels] [5], [a technique for
transparent CPU virtualization] [6], [fault-tolerant distributed
computation on untrusted CPUs] [7], [a technique for time-travel
debugging] [8] (which Michael Elizabeth Chastain implemented around the
same time as mec-replay), and [a lightweight device for stopping
bullets] [9].

In order for this to work, though, there needs to be a record of these
things having been published at a particular time; things published
after a patent application has been filed are not "prior art" and do not
invalidate patent claims.

My thought was that publishing these things on a mailing list, rather
than merely on the web, would create a number of distributed copies in
different subscribers' mailboxes, each timestamped with the time that it
had originally been received. This way, there would be more than just my
word to go on. A number of people with different interests would have
records of the publication.

There are some big disadvantages to publishing in mailing-list form.
It's not very observable (what mailing lists are your friends on?), so
it doesn't spread very fast; the formatting sucks unless you send HTML
email; reading the archives is a pain; and I have to waste my time
trying to get my domain un-categorized as a spam source in order for
people to get the mail.

[0]: http://lists.canonical.org/pipermail/kragen-tol/1998-November/000296.html
[1]: http://lists.canonical.org/pipermail/kragen-tol/1998-November/000300.html
[2]: http://lists.canonical.org/pipermail/kragen-tol/1998-November/000303.html
[3]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000307.html
[4]: http://lists.canonical.org/pipermail/kragen-tol/1998-November/000299.html
[5]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000309.html
[6]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000313.html
[7]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000314.html
[8]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000315.html
[9]: http://lists.canonical.org/pipermail/kragen-tol/1998-December/000316.html

A better solution?
------------------

I wonder if there's a better solution today. For example, if I put all
this stuff into a Git repository, then I could add new articles, other
people could add new articles too, I could edit existing articles
without losing the records of the old versions, and everybody who edited
it would have a copy that was protected from external corruption.

Unfortunately I don't think there's a good way to prove when somebody
first received a particular thing using Git. Git commits are authored
and dated, but there is no authentication of the author or the date. Git
supports cryptographically-signed tags, but it doesn't create them in
its normal workflow, so typically only a few people make them; they're
used for things like major software releases.
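To see why the date is unauthenticated, it helps to look at how a commit ID is derived. The sketch below recomputes the SHA-1 that Git assigns to a simple one-parent commit object; the author identity, the all-zero parent ID, and the message are placeholders invented for the example. The timestamp is just text inside the hashed object, so anyone can write whatever date they like and still get a perfectly valid ID.

```python
import hashlib

def commit_id(tree, parent, author, timestamp, message):
    """SHA-1 ID of a simple one-parent Git commit object."""
    ident = f"{author} {timestamp} +0000"
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        f"author {ident}\n"
        f"committer {ident}\n"
        f"\n"
        f"{message}\n"
    ).encode()
    header = b"commit %d\x00" % len(body)   # object type, size, NUL byte
    return hashlib.sha1(header + body).hexdigest()

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # Git's empty tree
PARENT = "0" * 40                                        # placeholder parent
AUTHOR = "A. Contributor <someone@example.org>"          # hypothetical

honest = commit_id(EMPTY_TREE, PARENT, AUTHOR, 1735689600, "Add article")
backdated = commit_id(EMPTY_TREE, PARENT, AUTHOR, 1262304000, "Add article")
print("dated 2025:", honest)
print("dated 2010:", backdated)
```

The two IDs differ, which protects a commit from being altered after the fact, but nothing in either one shows which date is true.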

So the threat model is something like this.

It's 2025. FooCorp is suing BarCorp for patent infringement; tens of
millions of dollars are at stake. BarCorp wants to show that the
supposedly infringed patent claims are invalid because they were
published openly in 2012 on my git-based equivalent to Halfbakery, four
years before the patent was filed in 2016.

BarCorp presents the publication of this technique as evidence, along
with the Git commit in which it was created and the commits since
then, by a variety of authors, and shows that the same commit history is
found in several different contributors' replicas; they explain how
Git's content-based blob store prevents tampering with the history of
any commit.

FooCorp has branched from a very old commit, set their computer clock
back to 2010, added an article with that 2010 date describing a
well-known celebrity scandal of 2023, and then constructed a similar,
but fake, commit history with a variety of imaginary authors, over the
next fifteen years. (As it happens, `git rebase` can be used to do this
fairly easily, but if it didn't exist it would be easy to write it
yourself.)

Sometime in 2025, during this court case, they manage to get some
legitimate contributor somewhere to merge their branch in, and that
contributor's branch gets merged into the cloud of the popular versions
of the repository.  So in the very same commit history that BarCorp
presented to the court, the FooCorp lawyer locates this obviously recent
article supposedly from 2010, dated to 2010 using the same techniques
that BarCorp was using to prove that the publication of the patented
technique happened in 2012 and not 2022.

Consequently, the court ignores BarCorp's evidence and finds in favor of
FooCorp.

Any ideas? Is there some feature of Git or the court system I'm not
familiar with that makes this scenario implausible? Or provides a way to
prevent it?

By contrast, the Received: line on people's email would probably stand
up, if you could find half a dozen people whose mail was stored in
different places.
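For what it's worth, pulling those timestamps out of a stored message is straightforward with Python's standard `email` package. The message below is a made-up example with placeholder host names; this is only a sketch of the extraction step, not of how a court would weigh the result.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

def received_times(raw_message):
    """Return the timestamp from each Received: header, top to bottom."""
    msg = message_from_string(raw_message)
    times = []
    for header in msg.get_all("Received", []):
        # The date follows the last semicolon in a Received: header.
        _, _, stamp = header.rpartition(";")
        times.append(parsedate_to_datetime(stamp.strip()))
    return times

raw = (
    "Received: from mail.example.org (mail.example.org [192.0.2.1])\n"
    "\tby mx.example.net; Thu, 12 Nov 1998 10:15:02 -0500\n"
    "Received: from localhost by mail.example.org;"
    " Thu, 12 Nov 1998 10:14:58 -0500\n"
    "Subject: example\n"
    "\n"
    "body\n"
)
for when in received_times(raw):
    print(when.isoformat())
```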

Mon, 11 Jan 2010

There's a certain kind of "research" that consists of spending long
hours in the library, or with EDGAR, or what have you, and digesting the
available information to produce useful summaries of the current state
of things.  Aaron Swartz is looking for someone to do this sort of
research on political issues, at
<http://www.aaronsw.com/weblog/researcherjob>.

I think there's an unfortunate lack of this sort of "research" being
produced and made public, and much of what *is* produced comes from
partisan research institutes, which limits its credibility
somewhat. (The job Aaron is offering is partisan, too; he cites a
political orientation as one of the job requirements.)

Supposedly this is one of the purposes of journalism, as well, but
self-described journalists seem to be pretty terrible at carrying out
this kind of research, by and large.

So the following mechanism occurred to me as a way to aggregate demand
for such research.

A research institute sets up a web site where anyone can post a request
for a report analyzing some issue, coupled with a pledge to pay an
amount of money of their choosing if the institute produces such a
report.  Visitors to the site see a list of open requests and previously
produced reports, and can pledge their own money to any open request.

When someone at the institute finishes a report, they post it on the
site for anyone to read, and the institute calls in the pledges on that
report. Then that person chooses a new report to work on: the one with
the largest total amount pledged, or perhaps the largest total amount
pledged per hour that they estimate it will take.
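The two selection rules can be sketched in a few lines. The `Request` class, the report titles, and the amounts below are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    title: str
    estimated_hours: float                 # the researcher's own estimate
    pledges: list = field(default_factory=list)

    def total(self):
        return sum(self.pledges)

def next_by_total(requests):
    """Pick the open request with the largest total amount pledged."""
    return max(requests, key=lambda r: r.total())

def next_by_rate(requests):
    """Pick the open request with the most money pledged per estimated hour."""
    return max(requests, key=lambda r: r.total() / r.estimated_hours)

open_requests = [
    Request("Report A", estimated_hours=80, pledges=[500, 300, 200]),
    Request("Report B", estimated_hours=10, pledges=[150, 100]),
]
print(next_by_total(open_requests).title)
print(next_by_rate(open_requests).title)
```

The two rules can disagree, as they do here: the first favors big, well-funded questions, the second favors quick ones.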

In a way, this is similar to Sourcexchange, CoSource, pubsoft.org (the
Public Software Fund), eLance, and so on, except that there is only a
single provider of what is being funded --- so all the issues of
bidding, choice of service providers, and quality feedback are greatly
simplified.

It's a sort of auction: the next week (or whatever unit) of the
researcher's time is "auctioned off" to the *issue* with the highest
*total* bid on it.  Contributors influence the priorities of the
organization by submitting bids.

The Dominant Assurance Contract Variant
---------------------------------------

The above also bears some resemblance to an assurance contract: nobody's
pledges are called in until there's enough money pledged to fund work at
the institute's usual level of quality (whatever that may be). The
incentives are slightly different, since whatever the amount currently
pledged happens to be, you can increase the likelihood that your desired
report will be the next one produced by putting in money; and there's no
point at which any particular report definitely fails to get written,
just a matter of being indefinitely postponed. But there's still an
incentive to "free-ride": as long as the reports are made available to
the public, you still get them at about the same time if you don't pay
for them.

On the other hand, only providing the reports to those who paid for them
destroys the vast majority of their potential social value, and it also
damages your institute's ability to market itself to potential new
funders.

Alex Tabarrok came up with a variant of an assurance contract in which,
if the contract fails, everyone who pledged money gets a small amount
back. This is supposed to give people an incentive to pledge money to
any cause that they think will fail. He analyzes it in
<http://mason.gmu.edu/~atabarro/PrivateProvision.pdf>.

I've written about these before in
<http://lists.canonical.org/pipermail/kragen-tol/2005-June/000783.html>.

I think you can apply the same idea here: if the institute wants a
particular report to be produced, it can offer an up-front payment of,
say, 10% of your pledge, in exchange for you making the pledge --- sort
of like buying a put option from you. This way, as long as the report
hasn't been produced, you're ahead financially, so you have an incentive
to pledge money to a report if you think it is unlikely to be produced.
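In cash terms, and ignoring whatever the report itself is worth to the pledger, the position looks like this. The 10% rate comes from the example above; the function name is made up.

```python
def pledger_net(pledge, produced, upfront_rate=0.10):
    """Cash position of one pledger, ignoring the value of the report."""
    upfront = upfront_rate * pledge           # paid by the institute right away
    called_in = pledge if produced else 0.0   # pledges are called in on delivery
    return upfront - called_in

print(pledger_net(100, produced=False))   # report never written
print(pledger_net(100, produced=True))    # report written and pledge called in
```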

(Alex's paper envisions competing entrepreneurs funding different
dominant assurance contracts. I'm not sure how that would work here.)

Fri, 10 Jul 2009

<http://canonical.org/~kragen/search-comparison-2009.html>

Some guy from ask.com just made the totally implausible claim that
their search results are “just as good if not better” than Google’s,
and that their search engine has another advantage: they are willing
to put paid advertising someplace Google won’t (specifically, on
searches about abortion).

So I thought I would do a comparison.

Here are the last ten Google queries from my browser history:

1. [morning-after pill]
2. [len tower lawnmower]
3. [melting point of solder]
4. [melting point of silicon]
5. [david phillip oster]
6. [1998 blogs], with a drill-down to [1998 weblogs] and [history of
   weblogs]
7. [emacs tags file syntax], with drill-down to [emacs tags table
   syntax] and [site:www.gnu.org emacs tags table syntax]
8. [Eric Stoltz]
9. [cytocomputer]
10. [zHosting Ltd]

I evaluated them on Google, Ask.com, Yahoo Search, and Bing. I more or
less have ads turned off with AdBlock Plus and NoScript, and I’m
viewing everything in Firefox 3.0 with Gnash for my Flash player. So
there may be annoyances that affect other people but not me.

Summary
-------

So here are the grades for the different queries:

[morning-after pill]
Grades: Google **B**, Ask.com **B-**, Yahoo Search **D**, Bing **F**.  
[len tower lawnmower]
Grades: Google **A**, Ask.com **A**, Yahoo Search **A+**, Bing **A**.  
[melting point of solder]
Grades: Google **A**, Ask.com **B**, Yahoo Search **A+**, Bing **C**.  
[melting point of silicon]
Grades: Google **A+**, Ask.com **A**, Yahoo Search **D**, Bing **C**.  
[david phillip oster]
Grades: Google **C**, Ask.com **B**, Yahoo Search **A+**, Bing **B**.  
[1998 blogs]
Grades: Google **D**, Ask.com **D**, Yahoo Search **B**, Bing **C**.  
[emacs tags file syntax]
Grades: Google **F**, Ask.com **F**, Yahoo Search **F**, Bing **F**.  
[Eric Stoltz]
Grades: Google **A**, Ask.com **C**, Yahoo Search **B**, Bing **A+**.  
[cytocomputer]
Grades: Google **B**, Ask.com **D**, Yahoo Search **D**, Bing **F**.  
[zHosting Ltd]
Grades: Google **A**, Ask.com **A+**, Yahoo Search **A**, Bing **B**.  

**Google**’s median grade is **A- or B+**, the best of the four.  It
only failed on a query where all four search engines failed.  However,
it was only the best search engine of the four **30%** of the time.
It was clearly better than the others on dealing with a controversial
topic and providing search results from beyond the Web: books and
academic papers.

**Ask.com**’s median grade is **B**.  It, too, only failed on the
query where all four search engines failed.  Its results were worse
than Google’s 50% of the time, equally good 30% of the time, and
better than Google’s 20% of the time.  So the claim by the guy from
Ask.com isn’t as implausible as it appeared at first, but it still
isn’t true for my query mix.  It was only the best search engine of
the four **10%** of the time.

I’m really surprised at how well Ask.com did, because I always thought
of their search engine as a joke.

**Yahoo Search**’s median grade is **B**.  It, too, only failed on a
query where all four search engines failed.  It was the best search
engine of the four **40%** of the time, more than any other search
engine, so I am going to switch to it as my default search engine.  It
was better than Ask.com less often than Google, though: it was better
40% of the time, equally good 20% of the time, and worse 30% of the
time.

**Bing**’s median grade is **C**, the worst of any engine, and unlike
any other engine, it failed badly on two of the nine queries the other
search engines were able to answer: in one case by privileging
misinformation and scaremongering over reliable information, and in a
second case by simply failing to find anything relevant. It was the
best search engine of the four only **10%** of the time, like Ask;
that was on a celebrity query.  I’m sad to say this because my friend
Barney Pell has been working really hard on it for years, but Bing’s
performance is pathetic.

(The percentages of “best of the four” 30% + 10% + 40% + 10% add up to
only 90%; that’s because one of the ten queries was failed by all four
search engines, and in that case none was “the best”.)
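These tallies can be reproduced from the grade table above. The sketch below assumes conventional grade-point values for the letter grades (my choice, since the summary only gives letters), and counts an engine as “best” only when it strictly beats the other three.

```python
from statistics import median

ENGINES = ["Google", "Ask.com", "Yahoo Search", "Bing"]
POINTS = {"A+": 4.3, "A": 4.0, "B": 3.0, "B-": 2.7,
          "C": 2.0, "D": 1.0, "F": 0.0}   # assumed grade-point scale

GRADES = {   # one row per query, in ENGINES order
    "morning-after pill":       ["B", "B-", "D", "F"],
    "len tower lawnmower":      ["A", "A", "A+", "A"],
    "melting point of solder":  ["A", "B", "A+", "C"],
    "melting point of silicon": ["A+", "A", "D", "C"],
    "david phillip oster":      ["C", "B", "A+", "B"],
    "1998 blogs":               ["D", "D", "B", "C"],
    "emacs tags file syntax":   ["F", "F", "F", "F"],
    "Eric Stoltz":              ["A", "C", "B", "A+"],
    "cytocomputer":             ["B", "D", "D", "F"],
    "zHosting Ltd":             ["A", "A+", "A", "B"],
}

def median_points():
    """Median grade-point value per engine across all queries."""
    return {engine: median(POINTS[row[i]] for row in GRADES.values())
            for i, engine in enumerate(ENGINES)}

def best_counts():
    """Number of queries on which each engine strictly beat the others."""
    counts = dict.fromkeys(ENGINES, 0)
    for row in GRADES.values():
        scores = [POINTS[grade] for grade in row]
        top = max(scores)
        if scores.count(top) == 1:          # ties, including all-F, count for nobody
            counts[ENGINES[scores.index(top)]] += 1
    return counts

print(median_points())
print(best_counts())
```

A median of 3.5 falls halfway between B and A, which is where the “A- or B+” for Google comes from.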

So there isn’t really a clear winner; Yahoo Search, Google, and
Ask.com are pretty even overall, even though some did much better than
others on particular queries.  There is a clear *loser*, though:
Bing. Maybe I should have included Cuil to make Bing look better. I
mean, I feel kind of bad.  

(Actually, I did try [morning-after pill] and [david phillip oster] on
Cuil. It did better than Bing.)

The rest of this document (4000 words) is taken up with explanations
of the particular queries.

[morning-after pill]
--------------------

Here I wanted to see if I could find accurate information about
emergency contraception without having to cope with abortion-scare
sites providing misinformation.

Google:

* hit 1 is Wikipedia: ideal; explains both sides of the debate
  objectively, along with lots of detailed information.
* hit 2 is morningafterpill.org, an abortion-scare site: not so
  good. However, the snippet says, “Site asserts that “morning after”
  emergency contraception is just another abortion approach that kills
  a human life.”, so it’s not a surprise shock.
* hit 3 is morningafterpill.org also, with health-scare information
  which is not actually accurate. Not good.
* hit 4 is news results, saying, “Legal fight continues on sale of
  “morning after” pill”.
* hit 5 is some UK site with what appears to be accurate information.

Ask.com:

* hit 1 is something on healthline.com, with apparently accurate
  information and an unhelpful blurry IUD diagram.
* hit 2 is Google hit #5.
* hit 3 is Google hit #2. Not good.
* hit 4 is getthepill.com, apparently an online OTC pharmacy for
  morning-after pills.
* hit 5 is Google hit #1, Wikipedia.

Yahoo Search:

* provides lots of drop-down suggestions before I even finish typing
  the search query!
* hit 1 is Google hit #2, with the more misleading snippet, “Rejects
  ideas that the Morning After Pill is not an abortifacient and argues
  instead that MAP use is tantamount to abortion. From the American
  Life League.”. Very bad.
* hit 2 is Google hit #3.
* hit 3 is Google hit #1, Wikipedia.
* hit 4 is Google hit #4.
* hit 5 is a Mayo Clinic page.

Bing:

* hit 1 is Google hit #2, very bad.
* hit 2 is a dictionary definition. Worthless.
* hit 3 is the Mayo Clinic page.
* hit 4 is another page from morningafterpill.org, but with no visual
  indication that it’s the same site or that it doesn’t have reliable
  information. Very bad.
* hit 5 is from sexuality.about.com. Similar to the Wikipedia page,
  but shorter, except that it doesn’t cover the controversy at all.

Grades on this query: Google B, Ask.com B-, Yahoo Search D, Bing F.

[len tower lawnmower]
---------------------

I wanted to find a photo of Len Tower on a human-powered riding mower
that I had seen a few days ago.

Google: hit #1 is a page with the photo and background information,
instantly recognizable as such.

Ask.com: same.

Yahoo Search: same, but hits #2 and #3 are also about it, with more
information.

Bing: same as Google.

Grades on this query: Google A, Ask.com A, Yahoo Search A+, Bing A.

[melting point of solder]
-------------------------

I wanted to find out the melting point of traditional eutectic
lead-tin solder as well as the melting point of common modern
RoHS-compliant solders.

Google: 

* hit 1 is Wikipedia page for “Solder”, which is a very general page
  with a uselessly large range in the snippet.
* hit 2 is Wikipedia page for “Soldering”.
* hit 3 is “RF Cafe - Solder Properties Melting Point”, with the
  snippet “These values are for some of the most common solders...”,
  so I clicked on that. It has precise melting points for a rather
  larger number of solders than I wanted, but I got the information I
  needed.

Ask.com:

* hit 1 is some journal article from 1996 about a new solder
  formulation that, as far as I know, nobody uses today. Useless.
* hit 2 is Google hit #1.
* hit 3 is Google hit #2.
* hit 4 is “EPE “Basic Soldering Guide””, which says in the snippet,
  “The melting point of most solder is in the region of 188°C (370°F)
  and the iron tip temperature is typically 330-350°C
  (626°-662°F). The latest lead-free solders typically require a
  higher temperature.”. You would think this was better, but if you
  follow the link to the (rather large) page, it never actually tells
  you what the higher temperature is.
* hit 5 is Google hit #3.

Yahoo Search says, “Did you mean: melting point of soldier?”

* hit 1 is Google hit #1, with a nice little graphic.
* hit 2 is Google hit #2, with the same nice little graphic.
* hit 3 is somebody asking a question about what kind of solder was in
  common use in 1969, and how hot it melts.
* hit 4 is some sort of “tips on soldering” page.
* hit 5 is Ask.com hit #1.
* hit 7 actually looks promising, but has only questions and no
  answers.

So I followed the Wikipedia link, and it has the answer for eutectic
lead-tin solder above the fold and a section on “lead-free solders”
with a whole big discussion of which ones are most common and what
their melting points are. So I probably should have followed that link
from Google instead of hit #3.

Bing:

* hit 1 is a brand new US patent on a type of solder. Trash.
* hit 2 is another one. Trash.
* hit 3 is ask.com hit #1. Trash.
* hit 4 is a 2007 article by Zhenhua Chen about lower-temperature
  lead-free solder formulations that aren’t yet in wide use, also
  summarizing the melting points of the widely-used modern solders and
  their various advantages and disadvantages. Pure gold. (The article,
  metaphorically speaking, not the solders.)
* hit 5 is the Wikipedia page.

Grades on this query: Google A, Ask.com B, Yahoo Search A+, Bing C —
would be an F except for hit 4.

[melting point of silicon]
--------------------------

Google has the answer in big letters above the search results: 1687
K. Wikipedia article is hit #2, and the correct answer in °C is in 
hit #4.

Ask.com has the answer in the snippets for hits 1 and 2, slightly
wrong answers in the snippets for hits 4 and 5, and hit #3 presumably
has it if I click through.

Yahoo Search hit 1 is Wikipedia. Hits 2 and 3 give the wrong
answer. Snippets for hits 4 and 5 have the right answer.

Bing:

* hit 1 is a patent. Trash.
* hit 2 is about the melting point of silicon dioxide. Trash.
* hit 3 is roughly a duplicate of hit 2. Trash.
* hit 4 has the answer in a snippet from another Wikipedia page.
* hit 5 is another irrelevant thing about SiO₂.

Grades: Google A+, Ask.com A, Yahoo Search D, Bing C.

[david phillip oster]
---------------------

I wanted to find his home page, thence to find his current email
address, to email him.

Google: no home page, but hits 5-7 look vaguely promising. Hit 5 leads
to a blog post that links to
<http://groups.google.com/groups/search?q=%22david+phillip+oster%22&start=0&scoring=d>,
which does actually link to
<http://groups.google.com/group/iphonesdkdevelopment/browse_thread/thread/5c9cd5561d7b0d64/da37b38ede21148d?q=%22david+phillip+oster%22#da37b38ede21148d>
which links to
<http://groups.google.com/groups/profile?enc_user=szRVXBsAAABguGT__oukXrijYyXRsYeu3jKajrjPH-s4VDv7fhNHSg>,
which says “davidphillipos... at gmail.com”, which is close enough. Hit
7, his Amazon reviewer page, actually has “oster at ieee.org” on the page.

Hit 9 links to a RISKS page that gives the email address he had in
1988.

In practice I gave up when I saw the page of snippets; instead I
searched my email.

Ask.com: hit #2 is Google’s hit #7.

Yahoo Search: turbozen.com is hits #1 and #2, with “oster at ieee.org” in
both snippets.  Hit #4 is mosaiccodes.com, which links to turbozen.com.

Bing: hit #3 is Yahoo hit #2 (without the email address in the
snippet, but clear that it’s his software company), and hit #5 is
Google hit #7.

Grades: Google C, Ask.com B, Yahoo Search A+, Bing B.

[1998 blogs]
------------

I was trying to remember the state of the blogosphere in 1998 when I
started kragen-tol in order to justify my claim that it wasn’t very
surprising that I didn’t start it as a blog.

Google: top ten hits are all trash — things that happen to be a blog
or mention blogs and mention 1998. Hit #11 looks more promising but is
also trash.  Somewhere around hit #20 there’s [Psychology of Blogs
(Weblogs)](http://psychcentral.com/blogs/blog.htm), from 1998, which
is a pretty good snapshot of how things were in 1998 — except a little
bit polluted by a 2001 update.

Ask.com: same trash as Google, except only ten hits of it. (I have
Google set to display 100.)

Yahoo Search: mostly the same trash, but Psychology of Blogs is 
hit #4. Yahoo Search used to display 20 hits by default, but now it
seems it’s down to 10, just like Google.

Bing: hit #1 talks about what the web was like in 1998, in Spanish,
but doesn’t shed any light on my actual question, which is what the
blogosphere was like in 1998. Hit #2 is the Spanish Wikipedia page for
“blog”, which has a pretty good “Historia” section. Hit #7 is
somebody’s presentation on SlideShare, which loses pretty badly (not
accessible without Flash and fails freakishly in Gnash) but there’s
some good information in the title.

None of these really gave me what I was looking for, which was Rebecca
Blood’s “History of Weblogs” from 2000, which I couldn’t remember the
title of.  So when I was doing this search “for real”, the first time,
instead of looking at hit #20 or trying multiple search engines, I
glanced at the page full of trash and reformulated my search. The word
“blog” wouldn’t be coined until 1999 (by The Brand Peter Me.) and at
the time they were called “weblogs”, a term Jorn Barger had invented in
1997 for what are now called “linklogs” or sometimes “microblogs” or
“tumblelogs”.

So I searched for [1998 weblogs].

On Google, “Psychology of Weblogs” is hit #1, and Jason Kottke’s blog
archives for 1998 are hit #3.  The snippet for hit #6, from a blog I’d
never heard of that ended in 2005, says, “I started this weblog in
August 1998, when it was one of the first 25 or so weblogs in
existence,” which is a piece of the information I was looking for but
not the comprehensive overview of Wikipedia or Blood’s piece.

Ask.com is essentially identical to Google, with the same hits #1 
and #3, and Google’s hit #6 moved up to #4.  However, it also has a
sidebar of “Related Searches”, which includes a suggestion for
“history of weblogs”.

Yahoo Search has “Psychology of Weblogs” as hit #1, but also has
Blood’s essay as hit #8! Also, hit #4 is “Computer History for 1998”,
with some minimal information.  Hit #9 mentions that Scripting News’s
comments section started in October 1998, and hit #10 is “Jorn Barger,
the NewsPage Network, and the Emergence of the Weblog Community”,
which offers a somewhat deeper history even than Blood’s essay.

Bing gives essentially exactly the same results as for [1998 blogs].

So, since I was using Google instead of Yahoo Search, I searched a
third time for [history of weblogs]. 

On Google, below the Google Scholar hits, which don’t have enough
information on the page to tell me if they’re the right thing, Blood’s
article is #1. English Wikipedia articles are the next couple of hits,
followed by more articles about the early history of weblogs
(1997-2000). Pure gold.

Ask.com gives basically the same results.

Yahoo Search puts Blood’s article at the top, followed by a
self-promotional post short on detail by Dave Winer, the Wikipedia
article, etc.

Bing gives Blood’s essay at the top, followed by a Spanish Wikipedia
article, some random irrelevant stuff, a German page (which I don’t
understand), some more irrelevant stuff, and what appears to be an SEO
spam page (“Interested in history? At weblogs.hu you find posts and
information relevant to history.  www.weblogs.hu/posts/tags/history”.)

So, grades: Google D, Ask.com D, Yahoo Search B, Bing C.  On my earlier
queries Yahoo Search does dramatically better than the others, well
enough that I wouldn’t have proceeded to the third query and maybe not
past the first.

[emacs tags file syntax]
------------------------

I wanted to look up the syntax of Emacs `TAGS` files so I could write
a program to generate one (introspectively from the state of a Python
program, rather than by parsing a bunch of source code).  This search
originally was completely unsuccessful, although I’m not totally
stymied; there is one free-software consumer of `TAGS` and two
free-software generators of `TAGS` already on my machine, so I can
just look at the source. If I’m lucky, it will reference a file format
spec.

Google: all of the hits relate to how to invoke `etags`, which
generates `TAGS` files, or how to use them in Emacs. The “syntax”
being referenced is invariably the syntax of the source files, not of
`TAGS` itself (which is called a “tags table”, apparently.) Most of
them are a zillion copies of the Emacs manual and the man pages for
`etags` and Exuberant Ctags.

Ask.com: identically useless results, except for a bunch of irrelevant
“Related Searches” at the top.

Yahoo Search: same.

Bing: same.

My next attempt was to be more specific in my query: I’m looking for
information about the *tags table*. In retrospect, I should have
looked for information about the “file format”, not “syntax”, but my
next search was [emacs tags table syntax].

All four search engines give basically the same results as before.

So my next attempt was to click on “more results from www.gnu.org »”,
with the thought that this would give me each section of the Emacs
manual only once, and many more of them. It did, on Google, but the
Emacs manual does not contain the answer. I am not trying the query on
the other search engines.

Searching for [emacs tags table format] does not seem to help.

I thought I would try using natural-language search on Ask.com and
Bing. [how do i generate an emacs tags table?] on Ask.com yields
mostly `etags` man pages, but also a link to
<http://www.emacswiki.org/cgi-bin/wiki/EmacsTags>, which doesn’t help
but is usually a better resource than the Emacs manual.  Bing has it
at the top.

Grades: Google F, Ask.com F, Yahoo Search F, Bing F.
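
A minimal sketch of the kind of generator described at the top of this section, assuming the tags-table layout that `etags` is commonly described as producing: each file's section starts with a form feed and a `filename,size` header, followed by one line per tag of the form `text<DEL>name<SOH>line,offset`. None of the searches above confirmed this layout, so treat it as an assumption to check against the `etags` source. The function name `tags_section` and its arguments are made up for illustration.

```python
def tags_section(filename, entries):
    """Return one tags-table section for `filename` as a string.

    `entries` is an iterable of (definition_text, tag_name, line, offset)
    tuples: the start of the defining source line, the name being
    defined, its 1-based line number, and the byte offset of that line.
    """
    # Assumed tag line: text, DEL (0x7f), name, SOH (0x01), "line,offset".
    body = "".join(
        f"{text}\x7f{name}\x01{line},{offset}\n"
        for text, name, line, offset in entries
    )
    # Assumed header: form feed, newline, then "filename,size-of-body-in-bytes".
    size = len(body.encode("utf-8"))
    return f"\x0c\n{filename},{size}\n{body}"


# Hand-written entry for a hypothetical file whose first line is "def f(...".
section = tags_section("example.py", [("def f(", "f", 1, 0)])
print(repr(section))
```

A real generator would gather the entries by introspection (for example with the standard `inspect` module) rather than by hand, and would write the result out as bytes so the control characters survive unchanged.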

[Eric Stoltz]
-------------

I had read that Eric Stoltz had been originally cast in Back To The
Future, and I wondered who he was.

Google gives me four photos of him at the top, which is sufficient for
me to know I don’t recognize him.  Hit #1 is his IMDB page and hit #3
is the Wikipedia page, which outlines his acting career in
sufficient detail to satisfy me.

Ask.com has a bunch of irrelevant “related searches” at the top,
followed by product images from Amazon which are too small to see the
guy’s face.  Then there’s the IMDB page, some TV listings for ZIP code
10010 in the US (utterly pathetic; I’m in Argentina), and then a
Wikipedia page with a too-small image.

Yahoo Search has only three photos, of smaller size than Google’s, but
they’re recognizable. Top few hits are from IMDB and Wikipedia.

Bing has six photos, including a closeup shot, which are highly
recognizable. Then the top hit is some other guy Eric Stoltz who’s a
web designer, followed by Wikipedia entries from English and Spanish,
an IMDB page, and then a French Wikipedia article.

Grades: Google A, Ask.com C, Yahoo Search B, Bing A+.

[cytocomputer]
--------------

I wanted to know what had been written recently about Bob Lougheed et
al.’s image processing device.

Google:

* hit 1: The abstract of Lougheed and McCubbrey’s 1980 paper, without
  the full text. Fail.
* hit 2: Some paper from 1982 that referenced it, also without the
  full text. Fail.
* hit 3: Thesaurus.com. Not just fail but spam; thesaurus.com (an
  ask.com service) is wasting Google users’ time by directing them to
  a page that says, “No results found for *cytocomputer*: Did you mean
  strumpet?” No, I certainly did not.
* hit 4: a 2001 book on Google Books about image processing,
  describing the Cytocomputer architecture in the context of image
  processing architectures of the time. OK.
* hit 5: another book on Google Books, this one from 1993.
* hit 6: from IEEE Xplore: an abstract, without the text, of a paper
  from 2001 that referenced it. Fail.
* hit 7: also from IEEE Xplore: The same paper as hit 2, again
  without the text, and also with the wrong title. Fail.
* hit 8: a spam page from reference.com (an ask.com service), saying,
  “No results found for *cytocomputer*: Did you mean supercomputer (in
  dictionary) or Cart computer (in reference)?”
* hit 9: the full text of the paper from hit 6, which turns out to be
  a 2001 emulation of the Cytocomputer in an FPGA, getting a 10×
  speedup over the software emulation they had been using.  This is
  the version of the paper that was submitted to the government
  sponsors and thenceforth freely disseminated. MADE OF WIN.
* hit 10: the full text of Barry Bruce Megdal’s 1983 dissertation on
  VLSI fingerprint recognition. WIN. Particularly impressive since the
  PDF contains no text; it’s scanned from prints.

Later Google hits include crap from linkinghub.elsevier.com, expired
US patents describing the Cytocomputer in some detail, and so on.  So
even though 60% of the top 10 Google hits are basically spam
(duplicate teasers from ACM and IEEE, and ask.com SEO spam pages)
there’s some good stuff in there.

Also, Google offers “Cited by 57” on the original Cytocomputer
paper. Among other things, that links me to the Cheops paper from 1995
and the 400-page Image Algebra book from 1986. These only mention the
Cytocomputer in passing, but they look pretty interesting.

Ask.com:

* hit 1: Google hit #1. Fail.
* hit 2: Google hit #2. Fail.
* hit 3: Google hit #6. FAIL.
* hit 4: Google hit #7. FAIL.
* hit 5: Google hit #9. MADE OF WIN.
* hit 6: Google hit #16, one of the patents. OK.
* hit 7: crap from linkinghub. Fail.
* hit 8: a European Cytocomputer patent, probably a dupe of one of the
  US patents. OK.
* hit 9: a DBLP conference proceedings page for ISCA 1980, which
  included the paper that is hit #1 and Ask.com hit #3. OK.
* hit 10: some crap from ingentaconnect that offers to sell you Google
  hit #9 for US$47.00 plus tax. FAIL.

So Ask’s first ten results are almost indistinguishable from Google’s,
except:

1. They’re 90% garbage instead of 60%;
2. They omit the spam pages produced by Ask.com properties like
   reference.com and thesaurus.com;
3. They don’t have Google Books hits (naturally);
4. As a result of lacking Google Books and spam from Ask.com, Google’s
   hit #9 (the jackpot) moves up to hit #5.

Yahoo Search:

* hit 1: some PDF from gaianxaos.com. It’s 9MB, so I clicked the “view
  as HTML” link, which didn’t work.
* hit 2: apparently the same PDF from quantumconsciousness.org, which
  makes me suspect that the paper is written by a nutcase. It turns
  out to be a 272-page book that seems mostly sane but is primarily
  concerned with the nature of consciousness, and therefore is
  somewhat speculative. It mentions the word “cytocomputer” once in
  the title of Chapter 5 but never explains what it means in the text.
* hit 3: an IEEE page without the full text of some paper about
  CLIP7A.
* hits 4 and 5: HTML and PDF versions of the Cheops paper I got off
  Google Scholar.
* hit 6: a blog comment I made last year about unusual computing
  hardware, which might be interesting to anybody interested in the
  Cytocomputer, except me.
* hit 7: Chip Morningstar’s resume. He worked on software for the
  Cytocomputer in the early 1980s.
* hit 8: a copy of some paper citing the Cytocomputer that
  somebody uploaded to “docstoc”, maybe the Image Algebra book. The
  page has broken Flash on it and offers to let me download the
  document if I register.
* hit 9: Ask.com hit #9.
* hit 10: A mailing list post of mine from 2005.

So Yahoo Search found a lot of interesting stuff, but it’s marginally
related to the Cytocomputer. I guess I should be flattered that two
things I wrote are in the top 10, but I’m more frustrated than
flattered.  The most relevant items — the US patent and the 2001
Cytocomputer emulation in an FPGA — are missing entirely.

Bing:

* hit 1: a variant of Google hit #1 but with a useless snippet and
  two-word title. FAIL.
* hit 2: citations for the 1980 paper from Citeseer. Citeseer finds 10
  to Google Scholar’s 57, but they’re 10 that it’s guaranteed to have
  downloadable copies of. Unfortunately none of them look like they
  say anything interesting about the Cytocomputer. Fail.
* hit 3: some teaser page from IEEE Xplore. FAIL.
* hit 4: something from CiteSeer with no title or author; turns out to
  be a 100-page chunk from the middle of some book on image
  processing; I think it’s the “Image Algebra” book I got from Google
  Scholar. FAIL.
* hit 5: The Cheops paper via CiteSeer. OK.
* hit 6: A 1988 ERIM paper on a use of the Cytocomputer with a
  Symbolics 3600 for machine vision for automated orbital
  navigation. OK.
* hit 7: Yahoo Search hit #10.
* hit 8: The Cheops paper, not via CiteSeer. OK.
* hit 9: some teaser page from ACM. FAIL.
* hit 10: Ask.com’s hit #9, the DBLP page.

So Bing basically gave me none of what I want.

Grades: Google B, Ask.com D, Yahoo Search D, Bing F.

I wish I could give Ask.com an F for spamming Google’s search results,
but that wouldn’t accurately represent the quality of their own search
results, which is at issue here.  If they get successful enough at it,
I guess I’ll have to stop using Google, after all.

[zHosting Ltd.]
---------------

Charlie Stross wrote about his attempt to start up a virtual Linux
hosting company on an IBM mainframe in 2000.  Before I got to the part
where the company folded before even getting angel funding, I searched
to see what the company was up to now.  So “success” in this search
would be a clear statement that the company had folded without
customers or revenue.

On Google, hit 4 is Charlie’s story of the company. None of the other
top 10 or 20 hits suggest that zHosting Ltd. of the UK has ever
existed. This is somewhat confused by some guy who uses “zHosting” as
his screen name when posting on webmaster-oriented forums, including
some that are related to virtualization.

Ask.com has Charlie’s story as hit 2.

Yahoo Search doesn’t have Charlie’s story, but its hit #1 is from
checksure.biz, which lists a zHosting Ltd. at 54 Easter Road,
Edinburgh, Midlothian EH7 5RQ.  I’m pretty sure that’s Charlie’s
company. It offers to sell me a “report” on the company for £9.95. I’m
not sure whether I should treat this as a spectacular success (I got
the incorporation address of a company that folded in 2000 and never
had a customer!) or a failure to filter spam (somebody tried to charge
me US$15 for a “report” on a company that folded in 2000 and never had
a customer!)

Bing doesn’t have Charlie’s story or anything interesting, just the
guy who posts on web forums.

Grades: Google A, Ask.com A+, Yahoo Search A, Bing B.

<link rel="stylesheet" href="http://canonical.org/~kragen/style.css" />

Sun, 26 Apr 2009

<http://canonical.org/~kragen/costs-lives.html>

How False Rumors Can Cost Lives
===============================

I have said that spreading false rumors in time of epidemic costs
lives.  People have asked me how.

The Tuskegee Experiment
-----------------------

Let me first explain how the Tuskegee Experiment cost lives.

A group of US Public Health Service scientists at Tuskegee recruited a
group of patients with syphilis.  Before penicillin was widespread,
syphilis treatments tended to kill people and didn’t work well, so the
scientists wanted to see whether people were better off without them.
They began an experiment treating a group of 399 syphilitic men
with placebo, that is, a fake treatment that had no real effect.

All of the subjects in the study were black men.  This, plus the
institutionalized racism in the United States in that time period, is
crucial to what follows.

15 years into the study, penicillin had been shown effective, and had
become the standard treatment for syphilis.  The researchers should
have halted the study by then and given their subjects the effective
treatment; instead, with the agreement of the AMA, the CDC, and
Tuskegee University, they lied to them for 25 years, as the patients
continued to infect their wives and children, died young, and went
insane.  The study was halted immediately when the press found out in
1972; a Congressional investigation was called, and medical research
changed a lot.

The Tuskegee Experiment cost about 140 lives directly.  Reporting on
it probably saved some lives by ending the experiment early, and may
have saved hundreds more by preventing other such depraved
experiments.

You can read more in the [Wikipedia Tuskegee Experiment article] [4]
or [the CDC’s Tuskegee web page] [3].

But those 140-or-so murders are not what I’m talking about; there was
another way in which it cost many more lives.

In 1991, ten years into the AIDS pandemic, 97000 white people in the
US had AIDS, while 51000 black people in the US did.  If
African-Americans had gotten infected at the same rate as the whites,
only 21000 of them would have had AIDS; so there were about 30000
cases more than we would expect with equal treatment.  There was an
article published in the American Journal of Public Health about this
in 1991:

> [The Tuskegee Syphilis Study] [1], 1932 to 1972: Implications for HIV
> Education and AIDS Risk Education Programs in the Black Community,
> by Stephen B. Thomas and Sandra Crouse Quinn.  American Journal of
> Public Health, November 1991, Vol. 81, No. 11, pp. 1498–1504

It turns out that a lot of black people didn’t trust the US government
public health programs that had been trying to encourage them to use
condoms and not get pregnant when they were HIV-positive, and to
distribute clean needles to heroin addicts in their neighborhoods.  In fact, a
lot of them thought that both HIV and these public-health measures
were efforts to exterminate black people.  The Tuskegee Experiment was
strong evidence that such a thing was possible — and it was by the
very same agency, showing that its people lied and killed and could
not be trusted.

Perhaps some of the higher AIDS prevalence in the black population was
due to other reasons: worse health care, lower circumcision rates,
poverty, whatever.  But some of it was a direct result of this
distrust.  Let’s say a third, conservatively.

Nearly everyone who had AIDS in 1991 died of it.  This distrust cost
the lives of over ten thousand people from AIDS: perhaps a hundred
times worse than the direct death toll from the Tuskegee Experiment
itself.
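
The arithmetic behind that estimate can be laid out explicitly. This is a minimal sketch using only the figures quoted above; the one-third share is the guess made in the text, not a measured quantity, and the function and variable names are made up for illustration.

```python
from fractions import Fraction


def distrust_estimate(actual_cases, expected_cases, distrust_share, direct_deaths):
    """Return (excess cases, deaths attributed to distrust, ratio to direct deaths)."""
    excess = actual_cases - expected_cases   # cases beyond the equal-rate expectation
    deaths = excess * distrust_share         # portion blamed on distrust
    return excess, deaths, deaths / direct_deaths


# 51000 actual and 21000 expected cases, a one-third share (a guess),
# and about 140 direct deaths, all taken from the text above.
excess, deaths, ratio = distrust_estimate(51_000, 21_000, Fraction(1, 3), 140)
print(excess, int(deaths), round(float(ratio)))
```

With these inputs the ratio comes out around 70, the same order of magnitude as the “perhaps a hundred times” above; the result scales directly with whatever share is attributed to distrust.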

There are similar distrust problems with recruiting African-Americans
for clinical trials or organ or bone marrow donation.

A Thought Experiment
--------------------

So let me suggest an alternate universe: one in which the Tuskegee
Experiment never happened, due to researchers having a higher ethical
standard than in our own universe, or perhaps due to an effective
oversight system of IRBs like the one that was put in place in 1974,
after Tuskegee came to light.

Suppose, in this universe, someone imagined the story of the Tuskegee
Experiment, and created a widely-believed hoax about it.  Suppose that
this created the same distrust in the 1980s that the knowledge of the
actual Tuskegee Experiment created in our universe.  Then, in that
universe, at least ten thousand more African-Americans would have died
of AIDS, just as in the real world.

In that universe, the creators and repeaters of this hoax would be
responsible for the deaths of ten thousand innocent people.

The Flu
-------

Now consider our own universe again.  We are facing a flu outbreak.
It seems most likely that it started at an [overcrowded pig farm in
Veracruz] [2].  I estimate it has about a 40% chance of going
pandemic, a 59% chance of fizzling like SARS, and a 1% chance of
something else entirely.  It’s in a critical stage right now; in the
next month or so, it could go either way.

Maybe, like the 1976 Fort Dix flu virus, it’s not as dangerous as the
1968, 1957, and 1918 viruses.  Maybe it has no chance of going
pandemic, regardless of what we do.  Or maybe it’s so contagious (and
we’re so mobile) that we’re going to suffer a worldwide flu pandemic
regardless of what we do; we can only mitigate its severity, not spare
any place.

Or maybe it does matter.  Maybe it’s still present in few enough
places that the right prevention measures can cause it to fizzle out
before touching most of the population, whereas it might flourish in
their absence.  If we can stop it, we can save not just the tens of
thousands of lives that were senselessly wasted in the
African-American AIDS epidemic, but millions or tens of millions of
lives.  

If the pandemic is possible but not inevitable, it won’t be stopped by
individual action.  It can only be stopped by entire countries, united,
acting rapidly to take preventive measures: wearing facemasks, washing
hands, not shaking hands, using alcohol hand sanitizer gel, social
distancing, soldiers going on leave instead of living in barracks,
administration of antiviral drugs like Tamiflu to those in affected
areas, quarantining travelers and the sick, and so on.  Maybe it will
turn out that experimental use of OX40-Ig or something stops the damn
thing from drowning you in your own plasma.

In the US, there is a system in place for taking such decisive united
action on issues of public health.  It depends on the government: the
CDC, the PHS, FEMA, the TSA, and so on.  They have to decide what to
do; there’s no system in place for democratic deliberation about these
issues.  But once they make their choice, they can’t implement it
without the trust of the population.

Of course, if it happens that the government agencies are corrupt and
unconcerned with public welfare — especially if they were actually
complicit in creating the problem, as they were in New Orleans after
Katrina — there is no hope for such decisive action.

Suppose, though, that the agencies actually do try to take effective
action.  Suppose that, unlike in 1976, their action is necessary and
sufficient to keep this damn thing from taking off.  But suppose there
are a bunch of hoaxes floating around.  Hoaxes that claim, say, that
the virus was created in government laboratories and then released —
on no factual basis, with no plausible theory of motivation, and no
plausible explanation of how such a thing was possible.  

If people believe such hoaxes, the agencies will find themselves
unable to act — paralyzed by the distrust of the public.  

And the hoax will kill millions, or tens of millions, of people.

Everyone’s Responsibility
-------------------------

So when you’re sending around something you read about the flu,
please, stop to think.  Don’t forward wildly speculative ideas about
government conspiracies to your friends or to the world.  When someone
proposes an idea, think about whether it makes sense.  Here are some
things to think about.  I’ve provided examples mostly from [an
article by Paul Joseph Watson on InfoWars.com] [0]:

- Did the person do a thorough investigation before making the
  material public?  For example, if they’re putting a surprising
  interpretation on something a public official said, have they
  contacted the official’s office to ask for clarification?

- Did the person make basic errors of fact?  For example, do they
  assert that “mixing a live ... virus with vaccine material by
  accident is virtually impossible”, or refer to Tamiflu as a
  “vaccine”?  If you don’t know anything about the science, ask
  someone who does, or check in Wikipedia. (It’s not infallible, but
  it’s a lot better than the New York Times.)

- Does it contradict other things you know?  For example, does it
  assert without comment that “programs of mass vaccination are
  already being prepared”, while the CDC’s web site and the New York
  Times claim that the CDC has developed seed stock but has not yet
  decided whether to deliver it to vaccine manufacturers?

- Have you found the person (either the forwarder or the original
  source) to be unreliable in the past, writing or forwarding things
  that turn out to be false?

- Is the original source obscured — e.g. just a person’s name, with no
  URL, email address, or other contact information, or explicitly
  anonymous?

- Does the person believe in other highly improbable theories, like
  the Time Cube, extraterrestrial lizards controlling the world,
  creationism, homeopathy, the idea that vaccines are ineffective and
  part of a conspiracy of silence among all doctors to poison our
  children, or the idea that Barack Obama is secretly a Muslim?  Maybe
  their judgment isn’t
  very good.

- Does the person propose conspiracy theories without exploring the
  plausibility of the motivations?  For example, proposing that a flu
  virus capable of a worldwide deadly pandemic was intended as a
  biological weapon — even though it would inevitably devastate the
  friends and family of its designers — isn’t a credible conspiracy
  theory, unless you also propose that the designers are collectively
  suicidal.

- Does the person propose theories of conspiracies that would be
  implausibly difficult to keep secret?  For example, a conspiracy
  involving dozens of public health officials from a variety of
  politically-unfriendly countries would inevitably get ratted out
  fairly soon.

- Do they fail at basic assessments of how plausible human behavior is?
  For example, given the choice between the explanation that some lab
  technicians cut corners on safety procedures when manufacturing a
  flu vaccine, and the explanation that evil upper management
  instructed the evil lab technicians to try to create a deadly flu
  virus mixture and sell it as vaccine, possibly killing millions of
  people (including the families of the management and lab
  technicians) in a global flu pandemic in order to sell more flu
  vaccine, do they think that the second one is plausible?

- Do they fail to link to their sources so you can’t find their
  errors, particularly when the sources are already public?

- Do they occasionally point to manifestly irrelevant information in
  support of their thesis?  Maybe they’re borderline psychotic and are
  struggling to maintain any semblance at all of coherent thought, or
  maybe they just think you’re stupid and will be impressed by a lot
  of words.

- Are you thinking of forwarding something that includes pleas that
  you should forward it, especially urgent ones?  That probably means
  the material wouldn’t have been forwarded to you on its own merit.
  The latest version of this is “PLEASE RETWEET”.

- Is the material highly emotionally charged, for example, inspiring
  outrage?  Then probably the person who wrote it knew that it
  wouldn’t get forwarded much if it had to stand on its logical
  merits, and the person who forwarded it to you wouldn’t have
  forwarded it to you just based on its logical merits.  (Some
  material is just inherently highly emotionally charged, but a
  responsible writer will do their best to treat it dispassionately so
  that you can use your own judgment about its merit.)

- Does the person who wrote the information have a vested interest in
  getting it widely distributed — for example, do they run a site
  covered in banner ads, make a living as a writer, or sell
  self-published videos or books on their website?  Do they spend a
  lot of effort on self-promotion?

- If it turns out that it’s false, would it hurt someone to distribute
  it?  It’s not so bad to forward around a funny kitten photo that
  turns out to be fake, but in the particular case I’ve been writing
  about, it will at least damage the reputations of some innocent
  people (who ought to sue you for defamation, but probably won’t),
  and at worst kill tens of millions of people.

If a few of these red flags pop up, don’t just forward the thing.
Investigate it first!

[0]: http://www.infowars.com/medical-director-swine-flu-was-cultured-in-a-laboratory/
[1]: http://www.ajph.org/cgi/reprint/81/11/1498
[2]: http://biosurveillance.typepad.com/biosurveillance/2009/04/swine-flu-in-mexico-timeline-of-events.html
[3]: http://www.cdc.gov/tuskegee/
[4]: http://en.wikipedia.org/wiki/Tuskegee_Study_of_Untreated_Syphilis_in_the_Negro_Male

<link rel="stylesheet" href="http://canonical.org/~kragen/style.css" />