Mon, 11 Jan 2010

There's a certain kind of "research" that consists of spending long
hours in the library, or with EDGAR, or what have you, and digesting the
available information to produce useful summaries of the current state
of things.  Aaron Swartz is looking for someone to do this sort of
research on political issues, at
<http://www.aaronsw.com/weblog/researcherjob>.

I think there's an unfortunate lack of this sort of "research" being
produced and made public, and much of what *is* produced is being
produced by partisan research institutes, which limits its credibility
somewhat. (The job Aaron is offering is partisan, too; he cites a
political orientation as one of the job requirements.)

Supposedly this is one of the purposes of journalism, as well, but
self-described journalists seem to be pretty terrible at carrying out
this kind of research, by and large.

So the following mechanism occurred to me as a way to aggregate demand
for such research.

A research institute sets up a web site where anyone can post a request
for a report analyzing some issue, coupled with a pledge to pay an
amount of money of their choosing if the institute produces such a
report.  Visitors to the site see a list of open requests and previously
produced reports, and can pledge their own money to any open request.

When someone at the institute finishes a report, they post it on the
site for anyone to read, and the institute calls in the pledges on that
report. Then that person chooses a new report to work on: the one with
the largest total amount pledged, or perhaps the largest total amount
pledged per hour that they estimate it will take.
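
As a rough sketch of the selection rule just described (the `Request`
class, the `next_report` function, and all the numbers are invented for
illustration; nothing here comes from a real site):

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """An open request for a report (hypothetical structure)."""
    title: str
    estimated_hours: float  # the researcher's own guess at the effort needed
    pledges: dict = field(default_factory=dict)  # pledger name -> amount pledged

    def pledge(self, who, amount):
        # Anyone can add money to any open request; repeat pledges accumulate.
        self.pledges[who] = self.pledges.get(who, 0) + amount

    @property
    def total(self):
        return sum(self.pledges.values())


def next_report(open_requests, per_hour=False):
    """Pick the request with the largest total pledged, or the largest per estimated hour."""
    if per_hour:
        return max(open_requests, key=lambda r: r.total / r.estimated_hours)
    return max(open_requests, key=lambda r: r.total)


# Invented example data.
subsidies = Request("farm subsidies", estimated_hours=40)
subsidies.pledge("alice", 300)
subsidies.pledge("bob", 200)
transit = Request("transit funding", estimated_hours=10)
transit.pledge("carol", 200)

print(next_report([subsidies, transit]).title)                 # by total pledged
print(next_report([subsidies, transit], per_hour=True).title)  # by pledged per hour
```

With these made-up numbers the two rules disagree: the first request
has more money behind it in total, but the second has more per
estimated hour of work.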

In a way, this is similar to Sourcexchange, CoSource, pubsoft.org (the
Public Software Fund), eLance, and so on, except that there is only a
single provider of what is being funded --- so all the issues of
bidding, choice of service providers, and quality feedback are greatly
simplified.

It's a sort of auction: the next week (or whatever unit) of the
researcher's time is "auctioned off" to the *issue* with the highest
*total* bid on it.  Contributors influence the priorities of the
organization by submitting bids.

The Dominant Assurance Contract Variant
---------------------------------------

The above also bears some resemblance to an assurance contract: nobody's
pledges are called in until there's enough money pledged to fund work at
the institute's usual level of quality (whatever that may be). The
incentives are slightly different, since whatever the amount currently
pledged happens to be, you can increase the likelihood that your desired
report will be the next one produced by putting in money; and there's no
point at which any particular report definitely fails to get written,
just a matter of being indefinitely postponed. But there's still an
incentive to "free-ride": as long as the reports are made available to
the public, you still get them at about the same time if you don't pay
for them.

On the other hand, only providing the reports to those who paid for them
destroys the vast majority of their potential social value, and it also
damages your institute's ability to market itself to potential new
funders.

Alex Tabarrok came up with a variant of an assurance contract in which,
if the contract fails, everyone who pledged money gets a small amount
back. This is supposed to give people an incentive to pledge money to
any cause that they think will fail. He analyzes it in
<http://mason.gmu.edu/~atabarro/PrivateProvision.pdf>.

I've written about these before in
<http://lists.canonical.org/pipermail/kragen-tol/2005-June/000783.html>.

I think you can apply the same idea here: if the institute wants a
particular report to be produced, it can offer an up-front payment of,
say, 10% of your pledge, in exchange for you making the pledge --- sort
of like buying a put option from you. This way, as long as the report
hasn't been produced, you're ahead financially, so you have an incentive
to pledge money to a report if you think it is unlikely to be produced.
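
The arithmetic behind that incentive can be sketched like this; the
function name is made up, the 10% figure is just the example rate
above, and the value of the report to the pledger is deliberately left
out:

```python
def pledger_net(pledge, report_produced, upfront_percent=10):
    """Net cash position of one pledger, ignoring what the report is worth to them.

    The institute pays upfront_percent of the pledge when the pledge is made;
    the pledge itself is only called in if the report is actually produced.
    """
    upfront = pledge * upfront_percent / 100
    if report_produced:
        return upfront - pledge
    return upfront


print(pledger_net(100, report_produced=False))  # the pledger keeps the up-front payment
print(pledger_net(100, report_produced=True))   # the pledge is called in
```

So on a pledge of 100, someone who expects the report never to be
written pockets 10, and someone whose pledge is called in is out 90 and
comes out ahead only if the report is worth more than that to them.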

(Alex's paper envisions competing entrepreneurs funding different
dominant assurance contracts. I'm not sure how that would work here.)

Thu, 03 Dec 2009

Lots of things have happened since the previous kragen-journal post.

Today we got temporary residency in Argentina for 12 months. For the
previous year we'd only had "residencia precaria", precarious residency,
which could be revoked at any time and in any case was only valid for
three months at a time. I shaved my head to celebrate.

A number of dear friends have visited us, including some family members.

We helped out a bit with Wikimania 2009 (especially Beatrice).

I turned 33.

My former roommate Payton Whitaker died after a long illness. I hadn't
talked to him in many years.

Last week my client launched <http://knx.to/>, the site I've been
working on with them for more than a month. It's a way to search through
your contacts on a number of different social networks at once, and even
though it's an interactive web site, it doesn't store any private
information out "in the cloud" --- it's all on your own machine.

My dad gave us a little netbook, which is proving extremely useful.

The list of things I did for the first time in my 32nd year is around
here somewhere, along with a shorter list of things I did for the first
time in my 33rd year. After some polishing, I'll post them.

I seem to be in good health, but now I have this huge paunch.  I've been
dancing contact improv lately, which has been wonderful and physically
challenging.

I translated a poem from Spanish:
<http://canonical.org/~kragen/cosmologya.html>.

During the early part of the North American swine flu, I wrote an essay
called "How False Rumors Cost Lives"; I've been told it's good:
<http://canonical.org/~kragen/costs-lives.html> 

Several of my friends have had the swine flu, but so far all have
recovered without incident.

I'm not aware of anything having been stolen from me or Beatrice this
year, which is a nice improvement over the last few.

Wed, 12 Aug 2009

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""I was surprised early one morning in May 2009, lying awake in bed,
to realize that the graph of 2d6 probabilities was a sort of
piecewise-linear approximation to the bell curve,
and that the 3d6 probability graph sort of looked like
a piecewise-quadratic approximation.

So I wrote this program to compute and display
vanishing triangles of dice probabilities.
The bottom row is the number on the dice;
the next row up is the number of combinations of the dice giving it,
and therefore would be a probability if you divided by the total;
the next row up is differences between adjacent numbers of combinations;
the next row up is differences between adjacent differences
(from the row just mentioned);
and so on.

In the process I wrote yet another table-layout algorithm,
as if the world doesn’t already have enough of those.
Oh well.
It only took half an hour.

The vanishing triangle here shows
(by default)
that the probability distribution of 4d6
is piecewise-cubic in four segments.
Similarly 3d6 is indeed piecewise-quadratic (in three segments),
as 2d6 is piecewise-linear in two segments.

Interestingly, a chunk of Pascal’s triangle
appears in the corners of the table.

"""

import sys, re

def diffs(seq):
    "Yield differences of adjacent items in seq."
    seq = iter(seq)
    last = next(seq)
    for item in seq:
        yield item - last
        last = item

def vanishing_triangle(alist):
    "Compute a vanishing triangle for the elements of alist."
    for depth in range(len(alist)):
        yield alist
        alist = list(diffs(alist))

def dice_combos(dice, sides):
    "Yield all the possible combinations of N dice with M sides."
    if dice == 0:
        yield ()
        return
    for number in range(1, sides+1):
        for combo in dice_combos(dice=dice-1, sides=sides):
            yield (number,) + combo

def layout_table(rows):
    "Takes a sequence of sequences of strings; yields a sequence of strings."
    assert rows
    for row in rows:
        assert len(row) == len(rows[0])

    column_widths = [max(len(row[ii]) for row in rows)
                     for ii in range(len(rows[0]))]
    formatstr = ''.join('%%%ds' % width for width in column_widths) + '\n'
    for row in rows:
        yield formatstr % tuple(row)

def triangle_to_table(xlab, triangle):
    "Yield a sequence of strings formatting a triangle in a table."
    width = len(xlab) * 2 - 1
    rows = []
    xlab_row = [item for label in xlab for item in [str(label), '']][:-1]
    rows.append(xlab_row)

    for data_row in triangle:
        number_blank_columns = len(xlab) - len(data_row)
        assert number_blank_columns >= 0
        
        row = (('',) * number_blank_columns +
               tuple([item for datum in data_row
                           for item in [str(datum), '']][:-1]) +
               ('',) * number_blank_columns)
        assert len(row) == len(xlab_row), (data_row, len(row), row, xlab_row)
        rows.append(row)

    rows.reverse()
    return layout_table(rows)

def vanishing_triangle_table(xlab, sequence):
    tri = list(vanishing_triangle(sequence))
    return triangle_to_table(xlab, tri)

def dice_combo_frequencies(dice, sides):
    freqs = {}
    for combo in dice_combos(dice=dice, sides=sides):
        total = sum(combo)
        freqs.setdefault(total, 0)
        freqs[total] += 1
    keys = sorted(freqs.keys())

    return keys, [freqs[key] for key in keys]

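def usage():
    "Write a usage message to stderr and return a nonzero status."
    sys.stderr.write("usage: %s [NdM], e.g. %s 4d6\n" % (sys.argv[0], sys.argv[0]))
    return 1

def main(argv):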
    if len(argv) == 1:
        dice, sides = 4, 6
    elif len(argv) == 2:
        mo = re.match(r'(\d+)d(\d+)$', argv[1])
        if not mo:
            return usage()
        dice = int(mo.group(1))
        sides = int(mo.group(2))
    else:
        return usage()

    print(__doc__)

    keys, freqs = dice_combo_frequencies(dice=dice, sides=sides)
    sys.stdout.writelines(vanishing_triangle_table(keys, freqs))

if __name__ == '__main__':
    main(sys.argv)
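
A shorter way to check the docstring's claims, as a sketch for Python 3
that enumerates rolls with `itertools.product` instead of the recursive
generator above; `sum_counts` and `differences` are names made up here:

```python
from collections import Counter
from itertools import product


def sum_counts(dice, sides):
    "Number of ways to roll each total, listed from the lowest total to the highest."
    counts = Counter(sum(roll) for roll in product(range(1, sides + 1), repeat=dice))
    return [counts[total] for total in sorted(counts)]


def differences(seq, order):
    "Apply the adjacent-difference step `order` times."
    for _ in range(order):
        seq = [b - a for a, b in zip(seq, seq[1:])]
    return seq


# 2d6: the second differences are zero except where the two linear segments meet.
print(differences(sum_counts(2, 6), 2))
# 3d6: the third differences are zero except at the two joins between three segments.
print(differences(sum_counts(3, 6), 3))
```

A run of zeroes in the Nth differences is what "piecewise polynomial of
degree N-1" looks like in this table; the nonzero entries mark the
places where one segment hands over to the next.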

Fri, 10 Jul 2009

<http://canonical.org/~kragen/search-comparison-2009.html>

Some guy from ask.com just made the totally implausible claim that
their search results are “just as good if not better” than Google’s,
and that their search engine also has another advantage: they are
willing to put paid advertising someplace Google won’t (specifically,
on searches about abortion).

So I thought I would do a comparison.

Here are the last ten Google queries from my browser history:

1. [morning-after pill]
2. [len tower lawnmower]
3. [melting point of solder]
4. [melting point of silicon]
5. [david phillip oster]
6. [1998 blogs], with a drill-down to [1998 weblogs] and [history of
   weblogs]
7. [emacs tags file syntax], with drill-down to [emacs tags table
   syntax] and [site:www.gnu.org emacs tags table syntax]
8. [Eric Stoltz]
9. [cytocomputer]
10. [zHosting Ltd]

I evaluated them on Google, Ask.com, Yahoo Search, and Bing. I more or
less have ads turned off with AdBlock Plus and NoScript, and I’m
viewing everything in Firefox 3.0 with Gnash for my Flash player. So
there may be annoyances that affect other people but not me.

Summary
-------

So here are the grades for the different queries:

[morning-after pill]
Grades: Google **B**, Ask.com **B-**, Yahoo Search **D**, Bing **F**.  
[len tower lawnmower]
Grades: Google **A**, Ask.com **A**, Yahoo Search **A+**, Bing **A**.  
[melting point of solder]
Grades: Google **A**, Ask.com **B**, Yahoo Search **A+**, Bing **C**.  
[melting point of silicon]
Grades: Google **A+**, Ask.com **A**, Yahoo Search **D**, Bing **C**.  
[david phillip oster]
Grades: Google **C**, Ask.com **B**, Yahoo Search **A+**, Bing **B**.  
[1998 blogs]
Grades: Google **D**, Ask.com **D**, Yahoo Search **B**, Bing **C**.  
[emacs tags file syntax]
Grades: Google **F**, Ask.com **F**, Yahoo Search **F**, Bing **F**.  
[Eric Stoltz]
Grades: Google **A**, Ask.com **C**, Yahoo Search **B**, Bing **A+**.  
[cytocomputer]
Grades: Google **B**, Ask.com **D**, Yahoo Search **D**, Bing **F**.  
[zHosting Ltd]
Grades: Google **A**, Ask.com **A+**, Yahoo Search **A**, Bing **B**.  

**Google**’s median grade is **A- or B+**, the best of the four.  It
only failed on a query where all four search engines failed.  However,
it was only the best search engine of the four **30%** of the time.
It was clearly better than the others on dealing with a controversial
topic and providing search results from beyond the Web: books and
academic papers.

**Ask.com**’s median grade is **B**.  It, too, only failed on the
query where all four search engines failed.  Its results were worse
than Google’s 50% of the time, equally good 30% of the time, and
better than Google’s 20% of the time.  So the claim by the guy from
Ask.com isn’t as implausible as it appeared at first, but it still
isn’t true for my query mix.  It was only the best search engine of
the four **10%** of the time.

I’m really surprised at how well Ask.com did, because I always thought
of their search engine as a joke.

**Yahoo Search**’s median grade is **B**.  It, too, only failed on a
query where all four search engines failed.  It was the best search
engine of the four **40%** of the time, more than any other search
engine, so I am going to switch to it as my default search engine.  It
beat Ask.com no more often than Google did, though: it was better
50% of the time, equally good 20% of the time, and worse 30% of the
time.

**Bing**’s median grade is **C**, the worst of any engine, and unlike
any other engine, it failed badly on two of the nine queries the other
search engines were able to answer: in one case by privileging
misinformation and scaremongering over reliable information, and in a
second case by simply failing to find anything relevant. It was the
best search engine of the four only **10%** of the time, like Ask;
that was on a celebrity query.  I’m sad to say this because my friend
Barney Pell has been working really hard on it for years, but Bing’s
performance is pathetic.

(The percentages of “best of the four” 30% + 10% + 40% + 10% add up to
only 90%; that’s because one of the ten queries was failed by all four
search engines, and in that case none was “the best”.)
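
The medians and the “best of the four” counts can be recomputed from
the grade table with a short script. This is only a sketch: the
13-step grade scale, the rule that a query has a “best” engine only
when one engine's grade is strictly higher than all the others, and
all the function names are assumptions made for the example.

```python
SCALE = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]
RANK = {grade: i for i, grade in enumerate(SCALE)}
ENGINES = ["Google", "Ask.com", "Yahoo Search", "Bing"]

# Grades copied from the summary table, in ENGINES order.
GRADES = {
    "morning-after pill":       ["B",  "B-", "D",  "F"],
    "len tower lawnmower":      ["A",  "A",  "A+", "A"],
    "melting point of solder":  ["A",  "B",  "A+", "C"],
    "melting point of silicon": ["A+", "A",  "D",  "C"],
    "david phillip oster":      ["C",  "B",  "A+", "B"],
    "1998 blogs":               ["D",  "D",  "B",  "C"],
    "emacs tags file syntax":   ["F",  "F",  "F",  "F"],
    "Eric Stoltz":              ["A",  "C",  "B",  "A+"],
    "cytocomputer":             ["B",  "D",  "D",  "F"],
    "zHosting Ltd":             ["A",  "A+", "A",  "B"],
}


def middle_grades(engine):
    "The two middle grades of an engine's ten grades; the median lies between them."
    col = ENGINES.index(engine)
    ranked = sorted((row[col] for row in GRADES.values()), key=RANK.get)
    n = len(ranked)
    return ranked[(n - 1) // 2], ranked[n // 2]


def best_counts():
    "How many queries each engine won outright (ties count for nobody)."
    wins = dict.fromkeys(ENGINES, 0)
    for row in GRADES.values():
        ranks = [RANK[grade] for grade in row]
        top = max(ranks)
        if ranks.count(top) == 1:
            wins[ENGINES[ranks.index(top)]] += 1
    return wins


def head_to_head(a, b):
    "Counts of queries where engine a was (better than, equal to, worse than) engine b."
    ia, ib = ENGINES.index(a), ENGINES.index(b)
    better = sum(RANK[row[ia]] > RANK[row[ib]] for row in GRADES.values())
    equal = sum(RANK[row[ia]] == RANK[row[ib]] for row in GRADES.values())
    return better, equal, len(GRADES) - better - equal


for engine in ENGINES:
    print(engine, middle_grades(engine))
print(best_counts())
print(head_to_head("Ask.com", "Google"))
```

With ten queries, each count is a multiple of 10%. Under these
assumptions Ask.com's median falls between B- and B, which the summary
rounds to B.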

So there isn’t really a clear winner; Yahoo Search, Google, and
Ask.com are pretty even overall, even though some did much better than
others on particular queries.  There is a clear *loser*, though:
Bing. Maybe I should have included Cuil to make Bing look better. I
mean, I feel kind of bad.  

(Actually, I did try [morning-after pill] and [david phillip oster] on
Cuil. It did better than Bing.)

The rest of this document (4000 words) is taken up with explanations
of the particular queries.

[morning-after pill]
--------------------

Here I wanted to see if I could find accurate information about
emergency contraception without having to cope with abortion-scare
sites providing misinformation.

Google:

* hit 1 is Wikipedia: ideal; explains both sides of the debate
  objectively, along with lots of detailed information.
* hit 2 is morningafterpill.org, an abortion-scare site: not so
  good. However, the snippet says, “Site asserts that “morning after”
  emergency contraception is just another abortion approach that kills
  a human life.”, so it’s not a surprise shock.
* hit 3 is morningafterpill.org also, with health-scare information
  which is not actually accurate. Not good.
* hit 4 is news results, saying, “Legal fight continues on sale of
  “morning after” pill”.
* hit 5 is some UK site with what appears to be accurate information.

Ask.com:

* hit 1 is something on healthline.com, with apparently accurate
  information and an unhelpful blurry IUD diagram.
* hit 2 is Google hit #5.
* hit 3 is Google hit #2. Not good.
* hit 4 is getthepill.com, apparently an online OTC pharmacy for
  morning-after pills.
* hit 5 is Google hit #1, Wikipedia.

Yahoo Search:

* provides lots of drop-down suggestions before I even finish typing
  the search query!
* hit 1 is Google hit #2, with the more misleading snippet, “Rejects
  ideas that the Morning After Pill is not an abortifacient and argues
  instead that MAP use is tantamount to abortion. From the American
  Life League.”. Very bad.
* hit 2 is Google hit #3.
* hit 3 is Google hit #1, Wikipedia.
* hit 4 is Google hit #4.
* hit 5 is a Mayo Clinic page.

Bing:

* hit 1 is Google hit #2, very bad.
* hit 2 is a dictionary definition. Worthless.
* hit 3 is the Mayo Clinic page.
* hit 4 is another page from morningafterpill.org, but with no visual
  indication that it’s the same site or that it doesn’t have reliable
  information. Very bad.
* hit 5 is from sexuality.about.com. Similar to the Wikipedia page,
  but shorter, except that it doesn’t cover the controversy at all.

Grades on this query: Google B, Ask.com B-, Yahoo Search D, Bing F.

[len tower lawnmower]
---------------------

I wanted to find a photo of Len Tower on a human-powered riding mower
that I had seen a few days ago.

Google: hit #1 is a page with the photo and background information,
instantly recognizable as such.

Ask.com: same.

Yahoo Search: same, but hits #2 and #3 are also about it, with more
information.

Bing: same as Google.

Grades on this query: Google A, Ask.com A, Yahoo Search A+, Bing A.

[melting point of solder]
-------------------------

I wanted to find out the melting point of traditional eutectic
lead-tin solder as well as the melting point of common modern
RoHS-compliant solders.

Google: 

* hit 1 is Wikipedia page for “Solder”, which is a very general page
  with a uselessly large range in the snippet.
* hit 2 is Wikipedia page for “Soldering”.
* hit 3 is “RF Cafe - Solder Properties Melting Point”, with the
  snippet “These values are for some of the most common solders...”,
  so I clicked on that. It has precise melting points for a rather
  larger number of solders than I wanted, but I got the information I
  needed.

Ask.com:

* hit 1 is some journal article from 1996 about a new solder
  formulation that, as far as I know, nobody uses today. Useless.
* hit 2 is Google hit #1.
* hit 3 is Google hit #2.
* hit 4 is “EPE “Basic Soldering Guide””, which says in the snippet,
  “The melting point of most solder is in the region of 188°C (370°F)
  and the iron tip temperature is typically 330-350°C
  (626°-662°F). The latest lead-free solders typically require a
  higher temperature.”. You would think this was better, but if you
  follow the link to the (rather large) page, it never actually tells
  you what the higher temperature is.
* hit 5 is Google hit #3.

Yahoo Search says, “Did you mean: melting point of soldier?”

* hit 1 is Google hit #1, with a nice little graphic.
* hit 2 is Google hit #2, with the same nice little graphic.
* hit 3 is somebody asking a question about what kind of solder was in
  common use in 1969, and how hot it melts.
* hit 4 is some sort of “tips on soldering” page.
* hit 5 is Ask.com hit #1.
* hit 7 actually looks promising, but has only questions and no
  answers.

So I followed the Wikipedia link, and it has the answer for eutectic
lead-tin solder above the fold and a section on “lead-free solders”
with a whole big discussion of which ones are most common and what
their melting points are. So I probably should have followed that link
from Google instead of hit #3.

Bing:

* hit 1 is a brand new US patent on a type of solder. Trash.
* hit 2 is another one. Trash.
* hit 3 is ask.com hit #1. Trash.
* hit 4 is a 2007 article by Zhenhua Chen about lower-temperature
  lead-free solder formulations that aren’t yet in wide use, also
  summarizing the melting points of the widely-used modern solders and
  their various advantages and disadvantages. Pure gold. (The article,
  metaphorically speaking, not the solders.)
* hit 5 is the Wikipedia page.

Grades on this query: Google A, Ask.com B, Yahoo Search A+, Bing C —
would be an F except for hit 4.

[melting point of silicon]
--------------------------

Google has the answer in big letters above the search results: 1687
K. Wikipedia article is hit #2, and the correct answer in °C is in 
hit #4.

Ask.com has the answer in the snippets for hits 1 and 2 and slightly
wrong answers in the snippets for hits 4 and 5; hit #3 presumably has
it if I click through.

Yahoo Search hit 1 is Wikipedia. Hits 2 and 3 are the wrong
answer. Snippets for hits 4 and 5 have the right answer.

Bing:

* hit 1 is a patent. Trash.
* hit 2 is about the melting point of silicon dioxide. Trash.
* hit 3 is roughly a duplicate of hit 2. Trash.
* hit 4 has the answer in a snippet from another Wikipedia page.
* hit 5 is another irrelevant thing about SiO₂.

Grades: Google A+, Ask.com A, Yahoo Search D, Bing C.

[david phillip oster]
---------------------

I wanted to find his home page, thence to find his current email
address, to email him.

Google: no home page, but hits 5-7 look vaguely promising. Hit 5 leads
to a blog post that links to
<http://groups.google.com/groups/search?q=%22david+phillip+oster%22&start=0&scoring=d>,
which does actually link to
<http://groups.google.com/group/iphonesdkdevelopment/browse_thread/thread/5c9cd5561d7b0d64/da37b38ede21148d?q=%22david+phillip+oster%22#da37b38ede21148d>
which links to
<http://groups.google.com/groups/profile?enc_user=szRVXBsAAABguGT__oukXrijYyXRsYeu3jKajrjPH-s4VDv7fhNHSg>,
which says “davidphillipos... at gmail.com”, which is close enough. Hit
7, his Amazon reviewer page, actually has “oster at ieee.org” on the page.

Hit 9 links to a RISKS page that gives the email address he had in
1988.

In practice I gave up when I saw the page of snippets; instead I
searched my email.

Ask.com: hit #2 is Google’s hit #7.

Yahoo Search: turbozen.com is hits #1 and #2, with “oster at ieee.org” in
both snippets.  Hit #4 is mosaiccodes.com, which links to turbozen.com.

Bing: hit #3 is Yahoo hit #2 (without the email address in the
snippet, but clear that it’s his software company), and hit #5 is
Google hit #7.

Grades: Google C, Ask.com B, Yahoo Search A+, Bing B.

[1998 blogs]
------------

I was trying to remember the state of the blogosphere in 1998 when I
started kragen-tol in order to justify my claim that it wasn’t very
surprising that I didn’t start it as a blog.

Google: top ten hits are all trash — things that happen to be a blog
or mention blogs and mention 1998. Hit #11 looks more promising but is
also trash.  Somewhere around hit #20 there’s [Psychology of Blogs
(Weblogs)](http://psychcentral.com/blogs/blog.htm), from 1998, which
is a pretty good snapshot of how things were in 1998 — except a little
bit polluted by a 2001 update.

Ask.com: same trash as Google, except only ten hits of it. (I have
Google set to display 100.)

Yahoo Search: mostly the same trash, but Psychology of Blogs is 
hit #4. Yahoo Search used to display 20 hits by default, but now it
seems it’s down to 10, just like Google.

Bing: hit #1 talks about what the web was like in 1998, in Spanish,
but doesn’t shed any light on my actual question, which is what the
blogosphere was like in 1998. Hit #2 is the Spanish Wikipedia page for
“blog”, which has a pretty good “Historia” section. Hit #7 is
somebody’s presentation on SlideShare, which loses pretty badly (not
accessible without Flash and fails freakishly in Gnash) but there’s
some good information in the title.

None of these really gave me what I was looking for, which was Rebecca
Blood’s “History of Weblogs” from 2000, which I couldn’t remember the
title of.  So when I was doing this search “for real”, the first time,
instead of looking at hit #20 or trying multiple search engines, I
glanced at the page full of trash and reformulated my search. The word
“blog” wouldn’t be coined until 1999 (by The Brand Peter Me.) and at
the time they were called “weblog”, a term Jorn Barger had invented in
1997 for what are now called “linklogs” or sometimes “microblogs” or
“tumblelogs”.

So I searched for [1998 weblogs].

On Google, “Psychology of Weblogs” is hit #1, and Jason Kottke’s blog
archives for 1998 are hit #3.  The snippet for hit #6, from a blog I’d
never heard of that ended in 2005, says, “I started this weblog in
August 1998, when it was one of the first 25 or so weblogs in
existence,” which is a piece of the information I was looking for but
not the comprehensive overview of Wikipedia or Blood’s piece.

Ask.com is essentially identical to Google, with the same hits #1 
and #3, and Google’s hit #6 moved up to #4.  However, it also has a
sidebar of “Related Searches”, which includes a suggestion for
“history of weblogs”.

Yahoo Search has “Psychology of Weblogs” as hit #1, but also has
Blood’s essay as hit #8! Also, hit #4 is “Computer History for 1998”,
with some minimal information.  Hit #9 mentions that Scripting News’s
comments section started in October 1998, and hit #10 is “Jorn Barger,
the NewsPage Network, and the Emergence of the Weblog Community”,
which offers a somewhat deeper history even than Blood’s essay.

Bing gives essentially exactly the same results as for [1998 blogs].

So, since I was using Google instead of Yahoo Search, I searched a
third time for [history of weblogs]. 

On Google, below the Google Scholar hits, which don’t have enough
information on the page to tell me if they’re the right thing, Blood’s
article is #1. English Wikipedia articles are the next couple of hits,
followed by more articles about the early history of weblogs
(1997-2000). Pure gold.

Ask.com gives basically the same results.

Yahoo Search puts Blood’s article at the top, a self-promotional post
short on detail by Dave Winer, the Wikipedia article, etc.

Bing gives Blood’s essay at the top, followed by a Spanish Wikipedia
article, some random irrelevant stuff, a German page (which I don’t
understand), some more irrelevant stuff, and what appears to be an SEO
spam page (“Interested in history? At weblogs.hu you find posts and
information relevant to history.  www.weblogs.hu/posts/tags/history”.)

So, grades: Google D, Ask.com D, Yahoo Search B, Bing C.  On my earlier
queries Yahoo Search does dramatically better than the others, well
enough that I wouldn’t have proceeded to the third query and maybe not
past the first.

[emacs tags file syntax]
------------------------

I wanted to look up the syntax of Emacs `TAGS` files so I could write
a program to generate one (introspectively from the state of a Python
program, rather than by parsing a bunch of source code).  This search
originally was completely unsuccessful, although I’m not totally
stymied; there is one free-software consumer of `TAGS` and two
free-software generators of `TAGS` already on my machine, so I can
just look at the source. If I’m lucky, it will reference a file format
spec.

Google: all of the hits relate to how to invoke `etags`, which
generates `TAGS` files, or how to use them in Emacs. The “syntax”
being referenced is invariably the syntax of the source files, not of
`TAGS` itself (which is called a “tags table”, apparently.) Most of
them are a zillion copies of the Emacs manual and the man pages for
`etags` and Exuberant Ctags.

Ask.com: identically useless results, except for a bunch of irrelevant
“Related Searches” at the top.

Yahoo Search: same.

Bing: same.

My next attempt was to be more specific in my query: I’m looking for
information about the *tags table*. In retrospect, I should have
looked for information about the “file format”, not “syntax”, but my
next search was [emacs tags table syntax].

All four search engines give basically the same results as before.

So my next attempt was to click on “more results from www.gnu.org »”,
with the thought that this would give me each section of the Emacs
manual only once, and many more of them. It did, on Google, but the
Emacs manual does not contain the answer. I am not trying the query on
the other search engines.

Searching for [emacs tags table format] does not seem to help.

I thought I would try using natural-language search on Ask.com and
Bing. [how do i generate an emacs tags table?] on Ask.com yields
mostly `etags` man pages, but also a link to
<http://www.emacswiki.org/cgi-bin/wiki/EmacsTags>, which doesn’t help
but is usually a better resource than the Emacs manual.  Bing has it
at the top.

Grades: Google F, Ask.com F, Yahoo Search F, Bing F.

[Eric Stoltz]
-------------

I had read that Eric Stoltz had been originally cast in Back To The
Future, and I wondered who he was.

Google gave me four photos of him at the top, which was sufficient for
me to know I didn’t recognize him.  Hit #1 was his IMDB page and hit #3
was the Wikipedia page, which outlined his acting career in
sufficient detail to satisfy me.

Ask.com has a bunch of irrelevant “related searches” at the top,
followed by product images from Amazon which are too small to see the
guy’s face.  Then there’s the IMDB page, some TV listings for ZIP code
10010 in the US (utterly pathetic; I’m in Argentina), and then a
Wikipedia page with a too-small image.

Yahoo Search has only three photos, of smaller size than Google’s, but
they’re recognizable. Top few hits are from IMDB and Wikipedia.

Bing has six photos, including a closeup shot, which are highly
recognizable. Then the top hit is some other guy Eric Stoltz who’s a
web designer, followed by Wikipedia entries from English and Spanish,
an IMDB page, and then a French Wikipedia article.

Grades: Google A, Ask.com C, Yahoo Search B, Bing A+.

[cytocomputer]
--------------

I wanted to know what had been written recently about Bob Lougheed et
al.’s image processing device.

Google:

* hit 1: The abstract of Lougheed and McCubbrey’s 1980 paper, without
  the full text. Fail.
* hit 2: Some paper from 1982 that referenced it, also without the
  full text. Fail.
* hit 3: Thesaurus.com. Not just fail but spam; thesaurus.com (an
  ask.com service) is wasting Google users’ time by directing them to
  a page that says, “No results found for *cytocomputer*: Did you mean
  strumpet?” No, I certainly did not.
* hit 4: a 2001 book on Google Books about image processing,
  describing the Cytocomputer architecture in the context of image
  processing architectures of the time. OK.
* hit 5: another book on Google Books, this one from 1993.
* hit 6: from IEEE Xplore: an abstract, without the text, of a paper
  from 2001 that referenced it. Fail.
* hit 7: also from IEEE Xplore: The same paper as hit 2, again
  without the text, and also with the wrong title. Fail.
* hit 8: a spam page from reference.com (an ask.com service), saying,
  “No results found for *cytocomputer*: Did you mean supercomputer (in
  dictionary) or Cart computer (in reference)?”
* hit 9: the full text of the paper from hit 6, which turns out to be
  a 2001 emulation of the Cytocomputer in an FPGA, getting a 10×
  speedup over the software emulation they had been using.  This is
  the version of the paper that was submitted to the government
  sponsors and thenceforth freely disseminated. MADE OF WIN.
* hit 10: the full text of Barry Bruce Megdal’s 1983 dissertation on
  VLSI fingerprint recognition. WIN. Particularly impressive since the
  PDF contains no text; it’s scanned from prints.

Later Google hits include crap from linkinghub.elsevier.com, expired
US patents describing the Cytocomputer in some detail, and so on.  So
even though 60% of the top 10 Google hits are basically spam
(duplicate teasers from ACM and IEEE, and ask.com SEO spam pages)
there’s some good stuff in there.

Also, Google offers “Cited by 57” on the original Cytocomputer
paper. Among other things, that links me to the Cheops paper from 1995
and the 400-page Image Algebra book from 1986. These only mention the
Cytocomputer in passing, but they look pretty interesting.

Ask.com:

* hit 1: Google hit #1. Fail.
* hit 2: Google hit #2. Fail.
* hit 3: Google hit #6. FAIL.
* hit 4: Google hit #7. FAIL.
* hit 5: Google hit #9. MADE OF WIN.
* hit 6: Google hit #16, one of the patents. OK.
* hit 7: crap from linkinghub. Fail.
* hit 8: a European Cytocomputer patent, probably a dupe of one of the
  US patents. OK.
* hit 9: a DBLP conference proceedings page for ISCA 1980, which
  included the paper that is hit #1 and Ask.com hit #3. OK.
* hit 10: some crap from ingentaconnect that offers to sell you Google
  hit #9 for US$47.00 plus tax. FAIL.

So Ask’s first ten results are almost indistinguishable from Google’s,
except:

1. They’re 90% garbage instead of 60%;
2. They omit the spam pages produced by Ask.com properties like
   reference.com and thesaurus.com;
3. They don’t have Google Books hits (naturally);
4. As a result of lacking Google Books and spam from Ask.com, hit #9
   (the jackpot) moves up to hit #5.

Yahoo Search:

* hit 1: some PDF from gaianxaos.com. It’s 9MB, so I clicked the “view
  as HTML” link, which didn’t work.
* hit 2: apparently the same PDF from quantumconsciousness.org, which
  makes me suspect that the paper is written by a nutcase. It turns
  out to be a 272-page book that seems mostly sane but is primarily
  concerned with the nature of consciousness, and therefore is
  somewhat speculative. It mentions the word “cytocomputer” once in
  the title of Chapter 5 but never explains what it means in the text.
* hit 3: an IEEE page without the full text of some paper about
  CLIP7A.
* hits 4 and 5: HTML and PDF versions of the Cheops paper I got off
  Google Scholar.
* hit 6: a blog comment I made last year about unusual computing
  hardware, which might be interesting to anybody interested in the
  Cytocomputer, except me.
* hit 7: Chip Morningstar’s resume. He worked on software for the
  Cytocomputer in the early 1980s.
* hit 8: a copy of some paper citing the Cytocomputer that somebody
  uploaded to “docstoc”, maybe the Image Algebra book. The page has
  broken Flash on it and offers to let me download the document if I
  register.
* hit 9: Ask.com hit #9.
* hit 10: A mailing list post of mine from 2005.

So Yahoo Search found a lot of interesting stuff, but it’s marginally
related to the Cytocomputer. I guess I should be flattered that two
things I wrote are in the top 10, but I’m more frustrated than
flattered.  The most relevant items — the US patent and the 2001
Cytocomputer emulation in an FPGA — are missing entirely.

Bing:

* hit 1: a variant of Google hit #1 but with a useless snippet and
  two-word title. FAIL.
* hit 2: citations for the 1980 paper from CiteSeer. CiteSeer finds 10
  to Google Scholar’s 57, but they’re 10 that it’s guaranteed to have
  downloadable copies of. Unfortunately none of them look like they
  say anything interesting about the Cytocomputer. Fail.
* hit 3: some teaser page from IEEE Xplore. FAIL.
* hit 4: something from CiteSeer with no title or author; turns out to
  be a 100-page chunk from the middle of some book on image
  processing; I think it’s the “Image Algebra” book I got from Google
  Scholar. FAIL.
* hit 5: The Cheops paper via CiteSeer. OK.
* hit 6: A 1988 ERIM paper on a use of the Cytocomputer with a
  Symbolics 3600 for machine vision for automated orbital
  navigation. OK.
* hit 7: Yahoo Search hit #10.
* hit 8: The Cheops paper, not via CiteSeer. OK.
* hit 9: some teaser page from ACM. FAIL.
* hit 10: Ask.com’s hit #9, the DBLP page.

So Bing basically gave me none of what I want.

Grades: Google B, Ask.com D, Yahoo Search D, Bing F.

I wish I could give Ask.com an F for spamming Google’s search results,
but that wouldn’t accurately represent the quality of their own search
results, which is at issue here.  If they get successful enough at it,
I guess I’ll have to stop using Google, after all.

[zHosting Ltd.]
---------------

Charlie Stross wrote about his attempt to start up a virtual Linux
hosting company on an IBM mainframe in 2000.  Before I got to the part
where the company folded before even getting angel funding, I searched
to see what the company was up to now.  So “success” in this search
would be a clear statement that the company had folded without
customers or revenue.

On Google, hit 4 is Charlie’s story of the company. None of the other
top 10 or 20 hits suggest that zHosting Ltd. of the UK has ever
existed. This is somewhat confused by some guy who uses “zHosting” as
his screen name when posting on webmaster-oriented forums, including
some that are related to virtualization.

Ask.com has Charlie’s story as hit 2.

Yahoo Search doesn’t have Charlie’s story, but its hit #1 is from
checksure.biz, which lists a zHosting Ltd. at 54 Easter Road,
Edinburgh, Midlothian EH7 5RQ.  I’m pretty sure that’s Charlie’s
company. It offers to sell me a “report” on the company for £9.95. I’m
not sure whether I should treat this as a spectacular success (I got
the incorporation address of a company that folded in 2000 and never
had a customer!) or a failure to filter spam (somebody tried to charge
me US$15 for a “report” on a company that folded in 2000 and never had
a customer!)

Bing doesn’t have Charlie’s story or anything interesting, just the
guy who posts on web forums.

Grades: Google A, Ask.com A+, Yahoo Search A, Bing B.

<link rel="stylesheet" href="http://canonical.org/~kragen/style.css" />

Sun, 26 Apr 2009

<http://canonical.org/~kragen/costs-lives.html>

How False Rumors Can Cost Lives
===============================

I have said that spreading false rumors in time of epidemic costs
lives.  People have asked me how.

The Tuskegee Experiment
-----------------------

Let me first explain how the Tuskegee Experiment cost lives.

A group of US Public Health Service scientists at Tuskegee recruited a
group of patients with syphilis.  Before penicillin was widespread,
syphilis treatments tended to kill people and didn’t work well, so the
researchers wanted to see whether people were better off without them.
They began an experiment treating a group of 399 syphilitic men with a
placebo, that is, a fake treatment that had no real effect.

All of the subjects in the study were black men.  This, plus the
institutionalized racism in the United States in that time period, is
crucial to what follows.

15 years into the study, penicillin had been shown effective, and had
become the standard treatment for syphilis.  The researchers should
have halted the study by then and given their subjects the effective
treatment; instead, with the agreement of the AMA, the CDC, and
Tuskegee University, they lied to them for 25 years, as the patients
continued to infect their wives and children, died young, and went
insane.  The study was halted immediately when the press found out in
1972; a Congressional investigation was called, and medical research
changed a lot.

The Tuskegee Experiment cost about 140 lives directly.  Reporting on
it probably saved some lives by ending the experiment early, and may
have saved hundreds more by preventing other such depraved
experiments.

You can read more in the [Wikipedia Tuskegee Experiment article] [4]
or [the CDC’s Tuskegee web page] [3].

But those 140-or-so murders are not what I’m talking about; there was
another way in which it cost many more lives.

In 1991, ten years into the AIDS pandemic, 97000 white people in the
US had AIDS, while 51000 black people in the US did.  If
African-Americans had gotten infected at the same rate as the whites,
only 21000 of them would have had AIDS; so there were about 30000
cases more than we would expect with equal treatment.  There was an
article published in the American Journal of Public Health about this
in 1991:

> [The Tuskegee Syphilis Study] [1], 1932 to 1972: Implications for HIV
> Education and AIDS Risk Education Programs in the Black Community,
> by Stephen B. Thomas and Sandra Crouse Quinn.  American Journal of
> Public Health, November 1991, Vol. 81, No. 11, pp. 1498–1504

It turns out that a lot of black people didn’t trust the US government
public health programs that had been trying to encourage them to use
condoms and not get pregnant when they were HIV-positive, and to
distribute clean needles to heroin addicts in their neighborhoods.  In
fact, a lot of them thought that both HIV and these public-health
measures were efforts to exterminate black people.  The Tuskegee
Experiment was strong evidence that such a thing was possible, and it
had been carried out by the very same agency, showing that its people
lied and killed and could not be trusted.

Perhaps some of the higher AIDS prevalence in the black population was
due to other reasons: worse health care, lower circumcision rates,
poverty, whatever.  But some of it was a direct result of this
distrust.  Let’s say a third, conservatively.

Nearly everyone who had AIDS in 1991 died of it.  This distrust cost
the lives of over ten thousand people from AIDS: perhaps a hundred
times worse than the direct death toll from the Tuskegee Experiment
itself.
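
The arithmetic behind these estimates can be sketched in a few lines
of Python.  The case counts and the one-third share are the figures
given above; the variable names are made up for illustration, and
counting every excess case as a death is a simplification of “nearly
everyone”.

```python
# Figures quoted in the text (US AIDS cases in 1991).
black_cases_actual = 51_000
black_cases_at_white_rate = 21_000  # the text's figure for equal infection rates
tuskegee_direct_deaths = 140        # "about 140 lives directly"

# Assumption from the text: a third of the excess is due to distrust.
share_due_to_distrust = 1 / 3

# Simplifying assumption: every excess case ended in death.
excess_cases = black_cases_actual - black_cases_at_white_rate
deaths_from_distrust = excess_cases * share_due_to_distrust
ratio_to_direct_toll = deaths_from_distrust / tuskegee_direct_deaths

print(f"excess cases: {excess_cases}")
print(f"deaths attributed to distrust: {deaths_from_distrust:.0f}")
print(f"ratio to direct toll: {ratio_to_direct_toll:.0f}x")
```

With these inputs the distrust toll comes out to about ten thousand,
roughly seventy times the 140 direct deaths, which is the same order
of magnitude as the “perhaps a hundred times” above.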

There are similar distrust problems with recruiting African-Americans
for clinical trials or organ or bone marrow donation.

A Thought Experiment
--------------------

So let me suggest an alternate universe: one in which the Tuskegee
Experiment never happened, due to researchers having a higher ethical
standard than in our own universe, or perhaps due to an effective
oversight system of IRBs like the one that was put in place in 1972,
after Tuskegee came to light.

Suppose, in this universe, someone imagined the story of the Tuskegee
Experiment, and created a widely-believed hoax about it.  Suppose that
this created the same distrust in the 1980s that the knowledge of the
actual Tuskegee Experiment created in our universe.  Then, in that
universe, at least ten thousand more African-Americans would have died
of AIDS, just as in the real world.

In that universe, the creators and repeaters of this hoax would be
responsible for the deaths of ten thousand innocent people.

The Flu
-------

Now consider our own universe again.  We are facing a flu outbreak.
It seems most likely that it started at an [overcrowded pig farm in
Veracruz] [2].  I estimate it has about a 40% chance of going
pandemic, a 59% chance of fizzling like SARS, and a 1% chance of
something else entirely.  It’s in a critical stage right now; in the
next month or so, it could go either way.

Maybe, like the 1976 Fort Dix flu virus, it’s not as dangerous as the
1968, 1957, and 1918 viruses.  Maybe it has no chance of going
pandemic, regardless of what we do.  Or maybe it’s so contagious (and
we’re so mobile) that we’re going to suffer a worldwide flu pandemic
regardless of what we do; we can only mitigate its severity, not spare
any place.

Or maybe it does matter.  Maybe it’s still contagious in few enough
places that the right prevention measures can cause it to fizzle out
before touching most of the population, where it might flourish in
their absence.  If we can stop it, we can save not just the tens of
thousands of lives that were senselessly wasted in the
African-American AIDS epidemic, but millions or tens of millions of
lives.  

If the pandemic is possible but not inevitable, it won’t be stopped by
individual action.  It can only be stopped by entire countries, united,
acting rapidly to take preventive measures: wearing facemasks, washing
hands, not shaking hands, using alcohol hand sanitizer gel, social
distancing, soldiers going on leave instead of living in barracks,
administration of antiviral drugs like Tamiflu to those in affected
areas, quarantining travelers and the sick, and so on.  Maybe it will
turn out that experimental use of OX40-Ig or something stops the damn
thing from drowning you in your own plasma.

In the US, there is a system in place for taking such decisive united
action on issues of public health.  It depends on the government: the
CDC, the PHS, FEMA, the TSA, and so on.  They have to decide what to
do; there’s no system in place for democratic deliberation about these
issues.  But once they make their choice, they can’t implement it
without the trust of the population.

Of course, if it happens that the government agencies are corrupt and
unconcerned with public welfare — especially if they were actually
complicit in creating the problem, as they were in New Orleans after
Katrina — there is no hope for such decisive action.

Suppose, though, that the agencies actually do try to take effective
action.  Suppose that, unlike in 1976, their action is necessary and
sufficient to keep this damn thing from taking off.  But suppose there
are a bunch of hoaxes floating around.  Hoaxes that claim, say, that
the virus was created in government laboratories and then released —
on no factual basis, with no plausible theory of motivation, and no
plausible explanation of how such a thing was possible.  

If people believe such hoaxes, the agencies will find themselves
unable to act — paralyzed by the distrust of the public.  

And the hoax will kill millions, or tens of millions, of people.

Everyone’s Responsibility
-------------------------

So when you’re sending around something you read about the flu,
please, stop to think.  Don’t forward wildly speculative ideas about
government conspiracies to your friends or to the world.  When someone
proposes an idea, think about whether it makes sense.  Here are some
things to think about.  I’ve provided examples mostly from [an
article by Paul Joseph Watson on InfoWars.com] [0]:

- Did the person do a thorough investigation before making the
  material public?  For example, if they’re putting a surprising
  interpretation on something a public official said, have they
  contacted the official’s office to ask for clarification?

- Did the person make basic errors of fact?  For example, do they
  assert that “mixing a live ... virus with vaccine material by
  accident is virtually impossible”, or refer to Tamiflu as a
  “vaccine”?  If you don’t know anything about the science, ask
  someone who does, or check in Wikipedia. (It’s not infallible, but
  it’s a lot better than the New York Times.)

- Does it contradict other things you know?  For example, does it
  assert without comment that “programs of mass vaccination are
  already being prepared”, while the CDC’s web site and the New York
  Times claim that the CDC has developed seed stock but has not yet
  decided whether to deliver it to vaccine manufacturers?

- Have you found the person (either the forwarder or the original
  source) to be unreliable in the past, writing or forwarding things
  that turn out to be false?

- Is the original source obscured — e.g. just a person’s name, with no
  URL, email address, or other contact information, or explicitly
  anonymous?

- Does the person believe in other highly improbable theories, like
  the Time Cube, extraterrestrial lizards controlling the world,
  creationism, homeopathy, the idea that vaccines are ineffective and
  a conspiracy of silence among all doctors to poison our children, or
  that Barack Obama is secretly a Muslim?  Maybe their judgment isn’t
  very good.

- Does the person propose conspiracy theories without exploring the
  plausibility of the motivations?  For example, proposing that a flu
  virus capable of a worldwide deadly pandemic was intended as a
  biological weapon — even though it would inevitably devastate the
  friends and family of its designers — isn’t a credible conspiracy
  theory, unless you also propose that the designers are collectively
  suicidal.

- Does the person propose theories of conspiracies that would be
  implausibly difficult to keep secret?  For example, a conspiracy
  involving dozens of public health officials from a variety of
  politically-unfriendly countries would inevitably get ratted out
  fairly soon.

- Do they fail at basic assessments of how people are likely to behave?
  For example, given the choice between the explanation that some lab
  technicians cut corners on safety procedures when manufacturing a
  flu vaccine, and the explanation that evil upper management
  instructed the evil lab technicians to try to create a deadly flu
  virus mixture and sell it as vaccine, possibly killing millions of
  people (including the families of the management and lab
  technicians) in a global flu pandemic in order to sell more flu
  vaccine, do they think that the second one is plausible?

- Do they fail to link to their sources, so that you can’t check them
  for errors, particularly when the sources are already public?

- Do they occasionally point to manifestly irrelevant information in
  support of their thesis?  Maybe they’re borderline psychotic and are
  struggling to maintain any semblance at all of coherent thought, or
  maybe they just think you’re stupid and will be impressed by a lot
  of words.

- Are you thinking of forwarding something that includes pleas that
  you should forward it, especially urgent ones?  That probably means
  the material wouldn’t have been forwarded to you on its own merit.
  The latest version of this is “PLEASE RETWEET”.

- Is the material highly emotionally charged, for example, inspiring
  outrage?  Then probably the person who wrote it knew that it
  wouldn’t get forwarded much if it had to stand on its logical
  merits, and the person who forwarded it to you wouldn’t have
  forwarded it to you just based on its logical merits.  (Some
  material is just inherently highly emotionally charged, but a
  responsible writer will do their best to treat it dispassionately so
  that you can use your own judgment about its merit.)

- Does the person who wrote the information have a vested interest in
  getting it widely distributed — for example, do they run a site
  covered in banner ads, make a living as a writer, or sell
  self-published videos or books on their website?  Do they spend a
  lot of effort on self-promotion?

- If it turns out that it’s false, would it hurt someone to distribute
  it?  It’s not so bad to forward around a funny kitten photo that
  turns out to be fake, but in the particular case I’ve been writing
  about, it will at least damage the reputations of some innocent
  people (who ought to sue you for defamation, but probably won’t),
  and at worst kill hundreds of millions of people.

If a few of these red flags pop up, don’t just forward the thing.
Investigate it first!

[0]: http://www.infowars.com/medical-director-swine-flu-was-cultured-in-a-laboratory/
[1]: http://www.ajph.org/cgi/reprint/81/11/1498
[2]: http://biosurveillance.typepad.com/biosurveillance/2009/04/swine-flu-in-mexico-timeline-of-events.html
[3]: http://www.cdc.gov/tuskegee/
[4]: http://en.wikipedia.org/wiki/Tuskegee_Study_of_Untreated_Syphilis_in_the_Negro_Male

<link rel="stylesheet" href="http://canonical.org/~kragen/style.css" />