Active Record

Guidelines for fast & reusable Active Record Edit

Table of Contents

  1. Principles & style
    1. Controllers, views, presenters, decorators
    2. Models
      1. Don’t write SQL
      2. Attribute naming
    3. Migrations
      1. Use raw SQL
      2. Relation to record cache
      3. Zero-downtime migrations
      4. Avoiding migration issues
    4. Query objects
  2. Performance considerations
    1. Indices
    2. Using the database client
    3. Multi-table queries
    4. Memory swapping
    5. Contention

You may have previously heard somebody say something like this:

I need to write a custom SQL query to do this!

or:

I’ve heard that ActiveRecord doesn’t scale…

This guide is here to help you write ActiveRecord (or any ORM code, really) that lets your app scale with good performance and without prematurely optimising the hell out of it.

There should be no exceptions to this guide. If you think you found one, you’re probably wrong, but still feel free to issue a pull request and be ready to defend it!

Principles & style

Layering and separation of concerns are overarching principles here. There should be a clear separation between your database and your application: the former is there to persist data, period. When you put something in, it should come out exactly as is, and pulling it out should be reasonably simple.

In particular your database should not have any logic: all the (business) rules should be in your application. This means that triggers, stored procedures should be forbidden in general.

We make two exceptions to “logic in the database”:

The only case where your database will “do” something for you beyond storage is when counting, or otherwise aggregating persisted data. This is essentially for performance reasons.

Controllers, views, presenters, decorators

Controllers, views, helpers, view objects, presenters, decorators should never contain any SQL. At all. Even SQL expressions. Seriously, you’ll thank us later.

The rationale: a class/object that contains SQL

They should not use relation builders (the where, order, group methods and friends) but instead rely on named scopes; unless the named scopes are trivial.

Good:

@posts = current_user.posts.created_after(1.year.ago)

Okay:

@posts = current_user.posts.where(created_at: 1.year.ago .. Time.current)

Bad:

@posts = Posts.where('user_id = ? AND created_at > ?', current_user.id, 1.year.ago)

Worse:

@posts = Post.find_by_sql(OH_GOD_KILL_ME)

Models

Don’t write SQL

Models should not contain any SQL queries.

They can, however, exceptionally contain SQL expressions in the form of where conditions for instance.

Using Arel to express conditions can also be used; bear in mind Arel is a somewhat private API that changes between Rails versions, so unit-testing scopes is even more important there.

The only place where either should happen is named scopes, which you should use extensively.

Okay:

scope :created_after, -> { |timestamp| where('created_at > ?', timestamp) }

Better (no SQL, no issue with using the scope in joins):

scope :created_after, -> { |timestamp|
  where self.class.arel_table[:created_at].gt timestamp
}
Scope chains

If a scope contains a SQL snippet (ie. if it’s not pure Arel), it should be unit tested.

It’s okay to define class methods for often-used chains of scopes, as long as they return a Relation or something scopish:

module ClassMethods
  def recently_created_in_account(date:, account:)
    created_after(date).account_is(account)
  end
end
extend ClassMethods

Be careful though, if you’re doing anything more complex than chaining a few scopes it probably needs to go to a query object (see below).

For the sake of reuse, remember that scopes are code: make scopes generic as necessary, avoid scope proliferation.

Good:

scope :created_after, -> { |timestamp| where('created_at > ?', timestamp) }

Bad:

scope :created_since_last_year, -> { where('created_at > ?', 1.year.ago) }
scope :created_since_yesterday, -> { where('created_at > ?', 1.day.ago.beginning_of_day) }
Scope Naming

We choose to name scopes using the {attribute}_{operator} pattern when possible.

Scope parameters should be records, not IDs, wherever possible and not hurtful for performance.

Good: User.account_is(account), User.created_after(date)

Bad: User.member_of(account), User.recently_created(date), User.account_id_is(account.id)

Attribute naming

No abbreviations or lingo

Like anything else in a codebase, attributes should avoid abbreviations and never include company lingo, for the sake of readability.

Good:

add_column :cargo_container_type, :volume_cubic_centimeters, :integer
add_column :restaurant_assignment, :restaurant_ack_needed, :boolean

Bad:

add_column :cargo_container_type, :volume_cc, :integer
add_column :restaurant_assignment, :rom_ack_needed, :boolean

Examples of tolerable exceptions include non-domain-specific / conventional abbreviations (geo_lat, geo_long) and SI units when unambiguous (radius_m).

Suffix conventions

Respect Rails conventions for column naming based on data type, plus a few of our own:

Denormalised column naming

It’s sometimes necessary to denormalise a relation for performance reasons.

Unless the consistency of the relations are guaranteed by (tested) code, the column name should be prefixed with cached_ to make it obvious it’s non-authoritative.

Example: Neighborhoods can be reshaped, and this does not automatically trigger a change to the user address -> neighborhood association within the same transaction.

The only authoritative information on user addresses (geolocation wise) is their latitude/longitude.

Migrations

Use raw SQL

Migrations are one of the only places which should contain SQL queries. In fact, migrations should contain only SQL queries, and no code using ActiveRecord models.

This is because your migration has to work up and down even if your model no longer exists, has been renamed, or has had and internal API change!

Migrations may exceptionally use PORO models to facilitate complex data changes, though.

Good (roughly):

def up
  add_column :users, :age, :integer
  update %{
    UPDATE users
    SET age = TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE())
  }
end

Bad:

def up
  add_column :users, :age, :integer
  User.find_each do |u|
    u.age = Date.today.year - u.date_of_birth.year
    u.save!
  end
end

Relation to record cache

If applicable, remember to clear the cache1 after running migrations that change data if the affected model is cached; or to send update events on the event bus2 to refresh subscribers.

Zero-downtime migrations

Because we do rolling deploys (with Heroku’s “preboot” feature), both the old and new code will be running at the same time.

There are two options to make this work without causing exceptions:

Preferred:

Okay:

Removing columns safely

In the model file, add a columns patch for the column being removed, as well as virtual getter/setter methods:

  def self.columns
    super.reject { |col| col.name == 'column_to_be_removed' }
  end

  def column_to_be_removed; end

  def column_to_be_removed=(value)
    value
  end

This safeguards against ActiveRecord caching the column, and prevents errors if there happens to be code still referencing the column for whatever reason.

Once the migration is run and all servers have restarted, these methods can be removed.

Avoiding migration issues

Migrations can cause major issues when run in a production environment, and should always be reviewed with extreme care.

In particular, be on the lookout for:

Query objects

An application of one of the PoEAA, this is meant to encapsulate a query:

This is the only type of classes where ActiveRecord::Base#find_by_sql or ActiveRecord::Base#connection are allowed. You’ll typically use your connection’s #select_values, #select_rows, and #update methods to do something useful.

Example (roughly):

class User::RecentlyCreatedFinder
  def initialize(account: nil, timestamp: nil)
    @account = account or raise ArgumentError
    @timestamp = timestamp || 1.week.ago
  end

  def call
    ids = User.connection.select_values(sanitize([%{
      SELECT id FROM users
      WHERE created_at < ? AND account_id = ?
    }, @timestamp, @account.id]))
    ids.each { |id| yield User.find(id) }
  end
end

Note that it’s still vastly preferred to use Railsy querying and Arel in query objects.

Improved example:

class User::RecentlyCreatedFinder
  def initialize(account: nil, timestamp: nil)
    @account = account or raise ArgumentError
    @timestamp = timestamp || 1.week.ago
  end

  def call
    User.
    where(account_id: @account.id).
    where('created_at < ?', @timestamp).
    find_each do |user|
      yield user
    end
  end
end

In terms of usage, this translates into:

Good:

query = User::RecentlyCreatedFinder.new(account: foo, timestamp: bar)
query.call do |user|
  do_stuff user
end

Bad:

relation = User.where(account_id: foo.id).where('created_at < ?', bar)
relation.find_each do |user|
  do_stuff user
end

Using a query object is better because:

Performance considerations

You don’t need a database administrator—if you follow a few simple rules. Otherwise carelessly crafted queries can easily blow up your app servers, your database servers, or both.

Indices

Starting with the obvious. If there’s no index for your particular query, it will be slow.

Rules of thumb:

More indices do not always help: the more indices, the slower the updates, and your RDBMS will get confused and pick the wrong one.

Using the database client

Do more in the client side: unlike the database, the Ruby client can be scaled out without any particular limit.

A typical example is sorting client-side, whether or not the result set is large (sorting will be less costly than just allocating the ActiveRecord instances):

# Good:
DeliveryZone.where(country: { tld: 'fr' }).sort_by(&:code)

# Bad:
DeliveryZone.where(country: { tld: 'fr' }).order(:code)

In many cases, grouping or filtering on the client-side might be just as fast for users, and lighter for the database, ie. preferable.

Multi-table queries

Rule of thumb: count one for each group, join, having, where, order in your query. More than three? Things will :boom: blow up.

In particular, multiple joins are a symptom of over-normalized data modeling.

If that isn’t enough: precalculate, use caching, and do your math in Ruby as described in the previous section. Your app code can scale very well (possibly through scheduled jobs), the database cannot.

Example, listing 3 restaurants a user recently ordered from and didn’t rate poorly:

# Good: precalculated "thin" model

# Nightly: update model
Order.where(created_at: 25.hours.ago .. Time.current).find_each do |order|
  RecentlyOrderedFromAndNotPoorlyRatedRestaurants.find_or_create_by!(
    user_id:        order.user_id,
    restaurant_id:  order.restaurant_id,
    created_at:     order.created_at
  )
end
OrderRating.where(created_at: 25.hours.ago .. Time.current).find_each do |rating|
  next unless (1..3).include? rating.stars
  RecentlyOrderedFromAndNotPoorlyRatedRestaurants.where(order_id:
  rating.order_id).destroy_all
end

# At point of request
ids = RecentlyOrderedFromAndNotPoorlyRatedRestaurants.
  where(user_id: current_user.id).
  order(:created_at).
  limit(3).
  pluck(:restaurant_id)
restaurants = Restaurant.where(id: ids)
# Bad: trying to do it "live" (pseudocode)
restaurants = Restaurant.
  joins('RIGHT JOIN orders ON orders.restaurant_id = restaurants.id').
  joins('LEFT JOIN ratings ON ratings.order_id = orders.id').
  where(
    order: {
      user_id: current_user, status: 'delivered',
    },
    ratings: {
      stars: [nil, 4, 5]
    }
  ).
  group(:id).
  order('MAX(orders.created_at)').
  limit(3)

The “live” option might look fast enough locally, because there’s no competition for resources — but the complicated query causes a lot of locking, reads a lot of data, and is particularly hard to index well.

Avoiding multi-table queries is not premature optimisation as it usually also makes the code vastly simpler to read.

Memory swapping

Rule of thumb: if you’re not certain how many records your query will retrieve, eventually it’s going to retreive too many and you’ll blow up your machine’s memory (known as swapping).

Always paginate or limit unless domain knowledge tells you clearly you don’t have to.

# Good:
User.paginate(page:1, per_page:10)
User.limit(10)

# Fine:
User.find_in_batches { |batch| ... }

# Bad:
User.all

Contention

It’s a lot easier for any (database) server to schedule and parallelise lots of small tasks than a few monstrous ones.

If a given query touches (reads from or writes to) more than a few 1000 rows or 5% of a given table at a time (whichever is the smaller), you’re going to blow things up.

Of course this is particularly true if the table in question is heavily used.

Rule of thumb: use batches to process groups of records.

# Good:
User.find_each { |u| puts u.id }

# Bad:
User.each { |u| puts u.id }