Background
There are two main sources of latency in the backend of web
applications: rendering (HTML templating or data serialization) and IO
(database or external service calls.) Today we’ll look at the latter and
more specifically focus getting rid of extraneous database round trips.
The fastest query possible is one that doesn’t happen.
Before we get in to the details I’ll mention the importance of schema,
but won’t spend time on it here. No single factor will affect the
overall responsiveness of your application, and more specifically your
ability to tune it, than an effective data(base) schema. This is
something we’ll return to with detail in the future, but I bring it up
now as schema issues can often limit your options.
I’ll also mention caching. In an optimal situation the things we’ll talk
about will help to reduce the amount of time required to serve a result
that can’t be cached or is a cache miss, but just as reducing the number
of database round trips by optimizing querying helps, avoiding them
altogether via caching is a big win.
A final note, I’ll discuss this topic in the context of Python (Django,)
but it applies equally well in any language and ORM. I cut my teeth with
Java (Hibernate) and have worked with Perl (DBIx::Class,) Ruby (Rails,)
and several others since.
The N+1 Selects Problem
The most common and egregious round-trip problem is the n+1
selects
problem. In it’s simplest form it involves an object with children, a
one-to-many
relationship. A minimal example follows.
from django.db import models
class State(models.Model):
name = models.CharField(max_length=64)
country = models.ForeignKey(Country, related_name='states')
class Meta:
ordering = ('name',)
class City(models.Model):
name = models.CharField(max_length=64)
state = models.ForeignKey(State, related_name='cities')
class Meta:
ordering = ('name',)
We have states which have zero or more cities and our use case is to
print out a nested list of states and their cities.
Alaska
Anchorage
Fairbanks
Willow
California
Berkeley
Monterey
Palo Alto
San Diego
San Francisco
Santa Cruz
Kentucky
Albany
Monticello
Lexington
Louisville
Somerset
Stamping Ground
The code to accomplish this looks something like:
from django.shortcuts import render_to_response
from django.template.context import RequestContext
from locations.models import State
def list_locations(request):
data = {'states': State.objects.all()}
return render_to_response('list_locations.html', data,
RequestContext(request))
If we run the above code to generate an HTML page and look at the
database queries, using
django-debug-toolbar,
we’ll see a single query to list the states and then a query for each of
the states to retrieve the cities. With 3 states that’s not a huge deal,
but if it were all 50 we’d be looking at 1 query, the +1, to get the
states and 50, the n, to get all of the cities.
2N+1 (no that’s not a thing)
Before we fix the N+1 queries I want to add an additional piece of
information for each state, it’s country. This one will go in the other
direction on a one-to-many relationship. Each state will have exactly
one country.
Alaska (United States)
...
...
class Country(models.Model):
name = models.CharField(max_length=64)
class State(models.Model):
name = models.CharField(max_length=64)
country = models.ForeignKey(Country, related_name='states')
...
If we take a look at the SQL tab of django-debug-toolbar we’ll see there
are now queries happening to pull in the country object for each state.
Also note that we’re retrieving the same country over and over, since
it’s shared by all of the states.
Now we have a couple interesting problems to solve, conveniently one for
each of Django ORM’s primary solutions.
states = State.objects.select_related('country').all()
select_related
works by joining in SQL between the primary object (state) and the other
object (country.) In this case that will get rid of the extra queries to
fetch the country objects for each state. So if a database round-trip
takes 20ms to transit the network, run, and return we’ll see a N * 20ms
speedup. For any reasonable size of N that’ll be noticeable.
The new query to retrieve the state objects will look something like the
following.
SELECT ... FROM "locations_state"
INNER JOIN "locations_country" ON
("locations_state"."country_id" = "locations_country"."id")
ORDER BY "locations_state"."name" ASC
...
The above query replaces the original state fetching query and rids us
of the secondary queries to retrieve the country objects. There is
however a potential downside to solving the problem this way, namely
we’re getting back the same country object repeatedly and thus having to
send it’s columns through the ORM code over and over and creating lots
of duplicate objects. We’ll come back to this in a moment.
One last thing before we move on, in Django’s ORM select_related does
not work when there are multiple objects on the other end of an
association. We can make use of it to grab the country for a state,
but if we were to add ‘cities’ to the select_related call, nothing
would happen. Other ORMs (Hibernate for example) don’t share this
limitation, but you must be very careful when making use of it in those
situations. In those cases the join will duplicate the primary object’s
data for each secondary object. That can quickly balloon out of control
and leave the ORM processing lots of data/rows.
Given all of this, one of the best uses of select_related is when
you’re fetching a single object and want to pull back (a) related object
in a single shot. It’ll do so in a single round-trip to the database and
won’t end up duplicating too much data, any in the case of Django ORM’s
one-to-one only support.
states = State.objects.prefetch_related('country', 'cities').all()
In contrast
prefetch_related
works by collecting all of the ids for the related objects, fetches them
in a single batch, and then transparently attaches them to the objects.
The best part is that it can be used on one-to-many relationships, which
is what the state to cities relationship is in our example.
The SQL will now look something like the following.
SELECT ... FROM "locations_state" ORDER BY "locations_state"."name" ASC
SELECT ... FROM "locations_country" WHERE "locations_country"."id" IN (1)
SELECT ... FROM "locations_city"
WHERE "locations_city"."state_id" IN (1, 2, 3)
ORDER BY "locations_city"."name" ASC
So we’ve gone from 2N+1 queries to 3. Getting rid of N is a big deal. 3
* 20ms is way less than (2 * 50 + 1) * 20ms or even (50 + 1) * 20ms
in the case of using select_related alone.
The above example uses prefetch on both country and cities. As mentioned
in the previous section all of the states share a single country and
when using select_related that means we pulled back and processed its
data N times along side each state object. In contrast using
prefetch_related we’ll only pull back the country object once. This
however comes at the cost of an extra db round trip so it may be
preferable to use a combination, you’ll have to test in your specific
circumstances. However, in this case using both select_related and
prefetch_related takes us down to 2 * 20ms, it’ll probably be
faster than 3 queries, but there are a lot of potential factors.
states = State.objects.select_related('country') \
.prefetch_related('cities').all()
How Deep
What about when you want to span multiple levels? Both select_related
and prefetch_related can traverse relationships using
double-underscores. When you make use of the ability the intermediate
objects will be included as well. This can be extremely powerful, but a
little hard to grok in more complex object models.
# only works when there's a single object at each step
city = City.objects.select_related('state__country').all()[0]
# 1 query, no further db queries
print('{0} - {1} - {2}'.format(city.name, city.state.name,
city.state.country.name)
# works for both single and multiple object relationships
countries = Country.objects.prefetch_related('states__cities')
# 3 queries, no further db queries
for country in countries:
for state in country.states:
for city in state.cities:
print('{0} - {1} - {2}'.format(city.name, city.state.name,
city.state.country.name)
One final Django specific note. In last week’s post on efficiently
querying for nearby
things I
worked through a complicated SQL query to find the closest objects to a
given lat/lon point. The best way to go about doing this sort of query
with Django is to use a raw sql
query. The issue
is that raw query sets don’t support prefetch_related, which is a
bummer. There is however a work-around, you can directly make use of
prefetch_related_objects, which Django uses under the hood to power
prefetch_related.
from django.db.models.query import prefetch_related_objects
# prefetch_related_objects requires a list, it won't work on a QuerySet so
# we need to convert with list()
cities = list(City.objects.raw('<sql-query-for-nearby-cities>'))
prefetch_related_objects(cities, ('state__country',))
# 3 queries, no further db queries
for city in cities:
print('{0} - {1} - {2}'.format(city.name, city.state.name,
city.state.country.name)
That’s pretty cool.