Wednesday, January 28, 2009

N + 1 Select Revisited

I ran into an old problem the other day whose solution is common knowledge to even neophyte programmers, and is always used by default to make a database driven app more performant; but now it doesn't necessarily apply anymore, and I had to unlearn what I had always known to be true.

The problem: Suppose I have a table of books and a table of authors; and suppose I want all of the names of the authors of all of the books whose titles start with "A". The amateurish way to do this is the N + 1 Select:

select book_id from book where title like "A%"

then for each book_id

select name from author where book_id = ?

If I have 100 books that start with "A" (N = 100) then I have 101 queries (N + 1) to get my results.

Now suppose my database can return every query in constant time of 20 milliseconds. The above operation takes 101 * 20ms, which is 2020ms, or about 2 seconds. That's pretty bad, right? Right.

The "correct" way to do the above operation is with a join

select name from book, author where title like "A%" and book.book_id = author.book_id

As I said above, all of my queries return in constant time, so the above query gives me the correct results in 20 ms.

But what if I could execute the "select name from author where book_id = ?" query for all of the books concurrently (all at the same time) with threads rather than serially (one after another). Each query still takes 20ms to return, but that 20ms is all at the same time, so really the total elapsed time of the N + 1 queries is 40ms, which isn't even in the same ballpark as the N + 1 queries executed serially. In fact it's more in the ballpark of the "correct" way.

The reason why we don't do this is because most database driven applications involve some sort of connection pooling, so your threads would have to block while waiting for a connection, and even if there were enough connections for one query, running two queries at the same time would completely stop your application, let alone 100 users browsing your app at the same time.

However, that's your typical database. Amazon's SimpleDB has no such limitations. All of the money on your credit card could not buy enough server power to make enough connections to SimpleDB to even phase it. It will simply plug along at the same speed no matter what you do. So when that media campaign hits you hard, the data side of your app keeps chugging along at the same speed it always did. Oh and SimpleDB is super redundant. There is no single point of failure. To get that with a typical database you have to cluster at least three of them together, and guess what happens then? Your database performance drastically drops when performing joins.

SimpleDB, as the above discussion indicates, does not have support for joins, so you must do N + 1 selects and compile your "joins" via concurrency. However, on multi-core processors, this type of behavior is encouraged because you get the most bang for your buck when processing concurrently.

This makes me wonder, "how many other things have I always held to be true, but were really just conditions of being carried out in serial?" The next decade is going to be interesting.