Wednesday, January 28, 2009

N + 1 Select Revisited

I ran into an old problem the other day whose solution is common knowledge to even neophyte programmers, and is always used by default to make a database driven app more performant; but now it doesn't necessarily apply anymore, and I had to unlearn what I had always known to be true.

The problem: Suppose I have a table of books and a table of authors; and suppose I want all of the names of the authors of all of the books whose titles start with "A". The amateurish way to do this is the N + 1 Select:

select book_id from book where title like "A%"

then for each book_id

select name from author where book_id = ?

If I have 100 books that start with "A" (N = 100) then I have 101 queries (N + 1) to get my results.

Now suppose my database can return every query in constant time of 20 milliseconds. The above operation takes 101 * 20ms, which is 2020ms, or about 2 seconds. That's pretty bad, right? Right.

The "correct" way to do the above operation is with a join

select name from book, author where title like "A%" and book.book_id = author.book_id

As I said above, all of my queries return in constant time, so the above query gives me the correct results in 20 ms.

But what if I could execute the "select name from author where book_id = ?" query for all of the books concurrently (all at the same time) with threads rather than serially (one after another). Each query still takes 20ms to return, but that 20ms is all at the same time, so really the total elapsed time of the N + 1 queries is 40ms, which isn't even in the same ballpark as the N + 1 queries executed serially. In fact it's more in the ballpark of the "correct" way.

The reason why we don't do this is because most database driven applications involve some sort of connection pooling, so your threads would have to block while waiting for a connection, and even if there were enough connections for one query, running two queries at the same time would completely stop your application, let alone 100 users browsing your app at the same time.

However, that's your typical database. Amazon's SimpleDB has no such limitations. All of the money on your credit card could not buy enough server power to make enough connections to SimpleDB to even phase it. It will simply plug along at the same speed no matter what you do. So when that media campaign hits you hard, the data side of your app keeps chugging along at the same speed it always did. Oh and SimpleDB is super redundant. There is no single point of failure. To get that with a typical database you have to cluster at least three of them together, and guess what happens then? Your database performance drastically drops when performing joins.

SimpleDB, as the above discussion indicates, does not have support for joins, so you must do N + 1 selects and compile your "joins" via concurrency. However, on multi-core processors, this type of behavior is encouraged because you get the most bang for your buck when processing concurrently.

This makes me wonder, "how many other things have I always held to be true, but were really just conditions of being carried out in serial?" The next decade is going to be interesting.

Tuesday, January 6, 2009

Amazon EC2 and Cloud Tools Saves the Day

I don't know if I've written about it here or not, but I have spent some time over the past couple of months contributing to an open source project started by Chris Richardson, of "POJOs in Action" fame, called Cloud Tools. It's a set of tools to launch Java based web applications onto Amazon EC2. I've been in charge of moving a lot of the functionality that exists in Maven commands over to Gant scripts for use with Grails. It's pretty rad.

You can launch your Grails application onto a server by simply typing: "grails cloud-tools-deploy". It support all sorts of clustering and master/slave arrangements with DBs and so forth... totally rad.

Anyway, back in October my company, -yoink-, got the go ahead from our client to run one of our web sites on EC2 -yoink-. I elected to use Cloud Tools because it looked great, and I had read Chris's book in the past, so I knew it would be quality.

I can't tell you how much time it saves to be able to bootstrap a server from the command line with a simple "grails cloud-tools-deploy" If you've built any web sites then you know what a pain in the ass deployment is, and you know what a pain in the ass server configuration is. Cloud Tools and EC2 eliminate the need for both and allows you to focus on what you're good at: writing code.

Continuing... today we had a massive email drop, which drove a ton of traffic to the site, that I had no idea was going to happen. The only way I found out is an external monitoring service I had hooked up alerted me to the fact that my server was unable to serve pages. The solution was as simple as opening my app, changing the config to run with 4 app servers, and running "grails cloud-tools-deploy." Once deployed, I switched the Elastic IP address of the old deployment to the load balancing server of my new deployment and voila! My site was hauling ass again.

Had we been using the traditional managed hosting services that we typically use, like those offered by Rackspace, we'd have been totally screwed. It would have taken them at least a week to get another server up and running, which would have been way too long. Using EC2 and Cloud Tools, I simply started servers as I needed them, and I'll shut them down tomorrow when the traffic dies down. It will probably only cost our client about $20 at the most, and think of the time and money that would have been lost if the server configuration had not been able to handle the load being thrown at it.

EC2 is pretty rad.