Amazon's new search: "Whoah"

Wow, Amazon now lets you search every word in every book that they sell. More than 33 million pages. I’m just stunned.

Any bets on what the most popular search words will be? :)

abstinence and celibacy? ;)

The dirty parts, of course.

Wow, Woolen, you’re right, that’s pretty amazing.

It seems like without advanced search options it will only be useful in a very limited number of settings. I would imagine the number of hits for most words would be huge. Maybe I’m wrong, though: Google seems to work for the web, which I assume is much more than 33 million pages. Of course, Google depends on its page link algorithm to determine the “best” match. I wonder what Amazon uses to order their search results? Perhaps sales ranking?

The thing I’m trying to figure out is how they catalogued all those pages. I mean, either they got the text in electronic form from the publishers, or someone, somewhere, was doing a helluva lot of OCR scanning.

Google is impressive, but in a way, this is even more impressive, because someone had to actually convert millions of pages of text into digital form.

They had to have gotten the data from the publishers. I’m sure as part of a modern publishing process the book is prepared electronically. Those digital text copies would be pretty simple to get into a database if you had any way to convert them to ASCII/UTF.

They must have a decent database team to set up a good full-text search system for that massive pile of data. Kudos to the Amazon DBA people.
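For anyone curious what "full-text search" boils down to, here is a toy sketch of an inverted index in Python. It is only an illustration and certainly not how Amazon does it: the sample pages, the `sales_rank` table, and the function names are all made up, and ordering hits by sales rank is just the guess floated earlier in the thread.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(pages):
    """Map each word to the set of (book, page_number) pairs that contain it."""
    index = defaultdict(set)
    for (book, page_number), text in pages.items():
        for word in tokenize(text):
            index[word].add((book, page_number))
    return index


def search(index, query, sales_rank):
    """Return pages containing every query word, best-selling books first.

    sales_rank is a hypothetical mapping of book -> rank (1 = best seller);
    books without a rank sort last.
    """
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(
        hits,
        key=lambda hit: (sales_rank.get(hit[0], float("inf")), hit[1]),
    )


# Made-up sample data, purely for illustration.
pages = {
    ("Book A", 1): "It was a dark and stormy night.",
    ("Book A", 2): "The night was quiet.",
    ("Book B", 7): "A stormy night at sea.",
}
sales_rank = {"Book A": 2, "Book B": 1}

index = build_index(pages)
print(search(index, "stormy night", sales_rank))
```

The index is built once up front, so a query only has to intersect a few sets instead of rereading every page. A real system at 33 million pages would need a lot more than this (on-disk storage, phrase matching, smarter ranking), but the basic idea is the same.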

No, they were actually scanned in.

According to the Washington Post, you can actually see the scanned version of a page that contains a word you’re looking for.

http://www.washingtonpost.com/wp-dyn/articles/A15020-2003Oct25.html

Not terribly surprising.

A few years back, I put together a web server (IIS, sadly, but it was a requirement since hardware was donated by M$) that was used to store court decisions on Refugee Law. We used Adobe’s PDF IFilter add-in for IIS to allow full text searching of decisions that were scanned in and OCR-ed by Acrobat.

Granted, there are several orders of magnitude of difference in scale, but it’s not exactly new technology.

My wife works in the E-Reserves at the U of M, and is responsible for running the group that takes requests to put stuff online from print (though I think that’s w/o OCR). They have huge book scanners that account for the spinal curve in the page and so forth.

It’s still cool, though.

Here’s a related question: As a Canadian, am I better off ordering from amazon.com or amazon.ca? To me, the prices end up almost exactly the same.

What do the other Canadians here do?

I thought Amazon got into trouble with one of those weird Canadian laws about American companies trying to sell stuff in Canada? Hence the reason for Amazon.ca.

Then how do you explain Best Buy, Starbucks, AMC…

I guess .ca it is then.

Canada has (had?) weird publishing laws that restrict sales of magazines and such across the border. Books fell into that, so Amazon got in trouble.

This sounds like a really neat service, great for tracking down those books where you can remember a memorable phrase but not the title or author.

It does raise some interesting hacking possibilities, like retrieving entire books (not that I condone such things, it’s just interesting to think about).