Django's ORM is extremely powerful, allowing you to manage your data without ever going near a line of SQL and hiding a multitude of complexities. But its power can sometimes be a curse rather than a blessing, multiplying queries without your knowledge and bringing your database to its knees.
In this session I explain what's going on behind the scenes and present some techniques to make your ORM use more efficient, showing how to monitor what's going on and how to better deal with relationships, indexes and more.
This talk was presented at EuroPython 2010 in Birmingham.
2. About Me
• Python user for five years
• Discovered Django four years ago
• Worked full-time with Python/Django since
2008.
• Top Django answerer on StackOverflow!
• Occasionally blog on Django, concentrating
on efficient use of the ORM.
3. Contents
• Behind the scenes: models and fields
• How model relationships work
• More efficient relationships
• Other optimising techniques
9. Defining a model
• Model structure initialised via metaclass
• Called when model is first defined
• Resulting model class stored in cache to
use when instantiated
10. Fields
• Fields have contribute_to_class
• Adds methods, eg get_FOO_display()
• Enables use of descriptors for field access
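A rough sketch of how a field might add a method to its model through contribute_to_class. The `Field`, `ModelBase` and `Article` classes here are simplified stand-ins written for this example, and the real Django implementation does considerably more.

```python
class Field:
    def __init__(self, choices=None):
        self.choices = choices or []

    def contribute_to_class(self, cls, name):
        # The field learns its own name and can attach extras to the model.
        self.name = name
        if self.choices:
            lookup = dict(self.choices)

            def get_display(instance, _name=name, _lookup=lookup):
                value = getattr(instance, _name)
                return _lookup.get(value, value)

            setattr(cls, "get_%s_display" % name, get_display)


class ModelBase(type):
    def __new__(mcs, name, bases, attrs):
        fields = {k: v for k, v in attrs.items() if isinstance(v, Field)}
        plain = {k: v for k, v in attrs.items() if k not in fields}
        cls = super().__new__(mcs, name, bases, plain)
        # Each field gets the chance to modify the class being built.
        for field_name, field in fields.items():
            field.contribute_to_class(cls, field_name)
        return cls


class Article(metaclass=ModelBase):
    status = Field(choices=[("d", "Draft"), ("p", "Published")])

    def __init__(self, status):
        self.status = status


label = Article("p").get_status_display()
```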
12. Model instantiation
• Instance is populated from database initially
• Has no subsequent relationship with db
until save
• No identity between models
13. Querysets
• Model manager returns a queryset:
foos = Foo.objects.all()
• Queryset is an ordered list of instances
of a single model
• No database access yet
• Slice: foos[0]
• Iterate: {% for foo in foos %}
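The laziness described above can be mimicked with a small class. This is a simplified sketch, not Django's QuerySet: `LazyQuerySet` and `fake_query` are hypothetical names, and only iteration is modelled (slicing is left out, since Django handles it differently from plain iteration).

```python
class LazyQuerySet:
    def __init__(self, fetch):
        self._fetch = fetch          # callable standing in for the SQL query
        self._result_cache = None    # filled on first evaluation

    def __iter__(self):
        if self._result_cache is None:
            self._result_cache = list(self._fetch())
        return iter(self._result_cache)


hits = []  # records each simulated database access


def fake_query():
    hits.append("SELECT ...")
    return ["foo1", "foo2"]


foos = LazyQuerySet(fake_query)
hits_after_creation = len(hits)     # nothing has been fetched yet

names = list(foos)                  # first iteration runs the query
names_again = list(foos)            # second iteration reuses the cache
```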
14. Where do all those
queries come from?
• Repeated queries
• Lack of caching
• Relational lookup
• Templates as well as views
15. Repeated queries
def get_absolute_url(self):
    return "%s/%s" % (
        self.category.slug,
        self.slug
    )
Same category, but query is
repeated for each article
16. Repeated queries
• Same link on every
page
• Dynamic, so can't
go in urlconf
• Could be cached
or memoized
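One possible way to avoid repeating the category lookup is to share a small cache between the articles rendered on a page. The sketch below simulates the database with a dict; `fetch_category_slug`, `CATEGORY_SLUGS` and the `slug_cache` argument are all assumptions made for the example.

```python
query_log = []                  # records every simulated database hit
CATEGORY_SLUGS = {1: "news"}    # hypothetical stand-in for the category table


def fetch_category_slug(category_id):
    query_log.append(category_id)
    return CATEGORY_SLUGS[category_id]


class Article:
    def __init__(self, slug, category_id, slug_cache):
        self.slug = slug
        self.category_id = category_id
        self._slug_cache = slug_cache   # shared between articles on one page

    def get_absolute_url(self):
        # Only hit the "database" the first time a category is seen.
        if self.category_id not in self._slug_cache:
            self._slug_cache[self.category_id] = fetch_category_slug(
                self.category_id)
        return "%s/%s" % (self._slug_cache[self.category_id], self.slug)


page_cache = {}
articles = [Article("a", 1, page_cache), Article("b", 1, page_cache)]
urls = [article.get_absolute_url() for article in articles]
```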
19. Example models
class Foo(models.Model):
    name = models.CharField(max_length=10)

class Bar(models.Model):
    name = models.CharField(max_length=10)
    foo = models.ForeignKey(Foo)
22. Forwards relationships
• Relational access implemented via a
descriptor:
django.db.models.fields.related.
SingleRelatedObjectDescriptor
• __get__ tries to access _foo_cache
• If it doesn't exist, does lookup and creates
cache
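The caching behaviour of the descriptor can be sketched in plain Python. `ForwardDescriptor`, `FOO_TABLE` and `lookups` are illustrative names only; the dict stands in for the database and this is not the Django class itself.

```python
lookups = []                      # records each simulated database lookup
FOO_TABLE = {1: "first foo"}      # hypothetical stand-in for the foo table


class ForwardDescriptor:
    def __init__(self, name):
        self.cache_name = "_%s_cache" % name
        self.id_name = "%s_id" % name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        try:
            # Cached on the instance after the first access.
            return getattr(instance, self.cache_name)
        except AttributeError:
            pk = getattr(instance, self.id_name)
            lookups.append(pk)
            related = FOO_TABLE[pk]
            setattr(instance, self.cache_name, related)
            return related


class Bar:
    foo = ForwardDescriptor("foo")

    def __init__(self, foo_id):
        self.foo_id = foo_id


bar = Bar(1)
bar.foo
bar.foo            # second access is served from bar._foo_cache
other = Bar(1)
other.foo          # a different instance has its own, empty cache
```

Because the cache lives on the instance, two instances pointing at the same row each pay for their own lookup.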
23. select_related
• Automatically follows foreign keys in SQL
query
• Prepopulates _foo_cache
• Doesn't follow null=True relationships by
default
• Makes query more expensive, so be sure
you need it
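At the SQL level, the difference is roughly one joined query versus one extra query per row. The sketch below shows this with the standard-library sqlite3 module rather than the ORM; the `foo` and `bar` tables and the `run` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, name TEXT,
                      foo_id INTEGER REFERENCES foo(id));
    INSERT INTO foo VALUES (1, 'f1'), (2, 'f2');
    INSERT INTO bar VALUES (1, 'b1', 1), (2, 'b2', 1), (3, 'b3', 2);
""")

queries = []  # every statement sent through run() is recorded here


def run(sql, params=()):
    queries.append(sql)
    return conn.execute(sql, params).fetchall()


# Without the join: one query for the bars, then one more per bar.
bars = run("SELECT id, name, foo_id FROM bar ORDER BY id")
naive = [(name, run("SELECT name FROM foo WHERE id = ?", (foo_id,))[0][0])
         for _id, name, foo_id in bars]
naive_count = len(queries)

# With the join: the related rows arrive in the same query.
queries.clear()
joined = run("SELECT bar.name, foo.name FROM bar "
             "INNER JOIN foo ON bar.foo_id = foo.id ORDER BY bar.id")
joined_count = len(queries)
```

The joined query returns wider rows, which is the extra cost the last bullet warns about.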
24. Backwards relationships
{% for foo in my_foos %}
{% for bar in foo.bar_set.all %}
{{ bar.name }}
{% endfor %}
{% endfor %}
25. Backwards relationships
• One query per foo
• If you iterate over bar_set again, you
generate a new set of db hits
• No _foo_cache
• select_related does not work here
26. Optimising backwards
relationships
• Get all related objects at once
• Sort by ID of parent object
• Then cache in hidden attribute as with
select_related
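27.
qs = Foo.objects.filter(criteria=whatever)
# Map each parent object by its id
obj_dict = dict([(obj.id, obj) for obj in qs])
# Get all related objects in a single query
objects = Bar.objects.filter(foo__in=qs)
# Group the related objects by the id of their parent
relation_dict = {}
for obj in objects:
    relation_dict.setdefault(obj.foo_id, []).append(obj)
# Cache each group in a hidden attribute on its parent
for id, related in relation_dict.items():
    obj_dict[id]._related = related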
33. Optimising backwards
[{'time': '0.000', 'sql': u'SELECT
"foobar_foo"."id", "foobar_foo"."name" FROM
"foobar_foo"'},
{'time': '0.000', 'sql': u'SELECT
"foobar_bar"."id", "foobar_bar"."name",
"foobar_bar"."foo_id" FROM "foobar_bar"
WHERE "foobar_bar"."foo_id" IN (SELECT
U0."id" FROM "foobar_foo" U0)'}]
34. Optimising backwards
• Still quite expensive, as can mean large
dependent subquery – MySQL in particular
very bad at these
• But now just two queries instead of n
• Not automatic – need to remember to use
the _related attribute
35. Generic relations
• Foreign key to ContentType, object_id
• Descriptor to enable direct access
• Iterating through creates n+m
queries (n = number of source objects,
m = number of different content types)
• ContentType objects automatically cached
• Forwards relationship creates _foo_cache
• But select_related doesn't work
36.
generics = {}
for item in queryset:
    # Group the object ids we need by content type
    generics.setdefault(item.content_type_id, set()).add(item.object_id)
content_types = ContentType.objects.in_bulk(generics.keys())
relations = {}
for ct, fk_list in generics.items():
    # One query per content type fetches all of its related objects
    ct_model = content_types[ct].model_class()
    relations[ct] = ct_model.objects.in_bulk(list(fk_list))
for item in queryset:
    # Pre-populate the cached attribute so later access needs no query
    setattr(item, '_content_object_cache',
            relations[item.content_type_id][item.object_id])
46. Memoizing
• Cache property on first access
• Can cache within instance, if multiple
accesses within same request
def get_expensive_items(self):
    if not hasattr(self, '_cache'):
        self._cache = self.expensive_op()
    return self._cache
47. DB Indexes
• Pay attention to slow query log and
debug toolbar output
• Add extra indexes where necessary -
especially for multiple-column lookup
• Use EXPLAIN
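As a small illustration of checking a query plan before and after adding a multiple-column index, here is a sketch using SQLite's EXPLAIN QUERY PLAN through the standard-library sqlite3 module. SQLite is swapped in only to keep the example self-contained; the `article` table, the index name and the `plan` helper are assumptions, and the plan output of MySQL or PostgreSQL looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, "
             "category_id INTEGER, slug TEXT, body TEXT)")


def plan(sql):
    # The last column of each row is a readable description of one plan step.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


lookup = "SELECT body FROM article WHERE category_id = 1 AND slug = 'x'"

before = plan(lookup)   # no usable index yet
conn.execute("CREATE INDEX article_cat_slug ON article (category_id, slug)")
after = plan(lookup)    # the plan should now mention the new index
```

Comparing `before` and `after` shows whether the database actually picks up the index for that lookup.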
48. Outsourcing
• Does all the logic need to go in the web
app?
• Services - via eg Piston
• Message queues
• Distributed tasks, eg Celery
49. Summary
• Understand where queries are coming
from
• Optimise where necessary, within Django
or in the database
• and...
(background: montage of Limmud, rosemanblog, Capital, Classic, Heart, GlassesDirect)
Some of same ideas in Guido's Appstats talk this morning
It's a model, in a field, geddit?
For more, see Marty Alchin, Pro Django (Apress)
descriptors used especially in related objects - see later
Very useful for introspection and working out what's going on
explain identity: multiple instances relating to same model row aren't the same object, changes made to one don't reflect the other; even saving one with new values won't be reflected in others.
Update, Aggregates, Q, F
Find repeated queries with my branch of the django-debug-toolbar, or SimonW's original query debug middleware
Actually in 1.2 there's an extra _state object in __dict__, which is used for the multiple DB support (which I'm not covering here).
Lack of model identity means that accessing the related item on one instance does not cause cache to be created on other instances that might reference the same db row
Note: backwards cache does work on OneToOne as of 1.2
+----+--------------------+-----------+-----------------+---------------+---------+---------+------+------+-------------+
| id | select_type        | table     | type            | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+--------------------+-----------+-----------------+---------------+---------+---------+------+------+-------------+
| 1  | PRIMARY            | sandy_bar | ALL             | NULL          | NULL    | NULL    | NULL | 100  | Using where |
| 2  | DEPENDENT SUBQUERY | U0        | unique_subquery | PRIMARY       | PRIMARY | 4       | func | 1    | Using where |
+----+--------------------+-----------+-----------------+---------------+---------+---------+------+------+-------------+