Drupal's database schema model

We shall do a deep dive of Drupal's database schema. For the purpose of simplicity, we shall deal exclusively with SQL queries and not step out of DB land unless its required. By doing this exercise, we can derive Drupal's schema from first principles. Let's start with a humble node, more specifically, an article content type which ships by default with Drupal's core. It consists of the node ID, the node type and node properties, i.e. the node title and node status. We are assuming only one bundle, the "article" bundle for now.

-- a placeholder user table
create table users (
  uid serial primary key,
  name text not null
);

create type content_status as enum ('draft', 'published');

create table node (
   nid serial primary key,
   title text not null,
   status content_status,
   uid integer not null references users(uid)
);

The article node has a body field which we can add as a column, as in:

alter table node add column body text;

This would get clumsy if we want to reuse the body field across different bundles or entities later on. So, let's add a separate table and refer to the node's ID.

create table body (
  nid integer,
  body_content text
);

The article CT has a tags field, which is a multiple valued field. This tag field can hold terms which fall under the "tags" vocabulary. We need 2 tables, one for representing all the terms in the vocabulary.

create table terms (
  tid serial primary key,
  vid varchar(32),
  name text
);

Why can't we have 1 table for each vocabulary instead of the vid column as above? The reason is, its hard to figure out which table is a vocabulary table and which table isn't, unless we are prefixing our table names with a convention, like vocabulary_tags. We could do that, because Drupal actually does something similar. Another assumption we are making here is a flat hierarchy of tag terms, as opposed to the nested structure found in actual Drupal terms.

As terms can be reused by other bundles, we need another table to store the terms for the article node.

create table node_tags (
  nid integer,
  delta integer,
  tid integer
);

The delta is the order of occurence of the tag term for the article. Let's add a couple of nodes.

-- add a user first
insert into users (name) values ('Badri');
insert into node (title, status, uid) values ('Hello world', 'published', 1);
insert into node (title, status, uid) values ('Ea ante nobis', 'published', 1);

-- add a few tag terms
-- vids used to be integers in previous versions of Drupal, 
-- now its the machine name of the vocabulary.
insert into terms (vid, name) values ('tags', 'Drupal 8');
insert into terms (vid, name) values ('tags', 'Drupal Planet');
insert into terms (vid, name) values ('tags', 'DB Schema');

-- the body content for 2 nodes.
insert into body (nid, body_content) values (1, 'Hendrerit ab, quis molestiae. Dignissim tempor, semper eget sollicitudin, aute! Non nobis, fuga! Quam, mauris in veniam, nobis ligula fermentum.');
insert into body (nid, body_content) values (2, 'Assumenda cupiditate orci ratione dictumst similique deleniti posuere ullam doloremque deleniti impedit ante minima ad similique neque wisi, molestie inceptos.');

-- add 2 tags each for 2 nodes.
insert into node_tags (nid, delta, tid) values (1, 0, 1);
insert into node_tags (nid, delta, tid) values (1, 1, 2);
insert into node_tags (nid, delta, tid) values (2, 0, 3);
insert into node_tags (nid, delta, tid) values (2, 1, 2);

Now, its query time. Let's get all the fields and data associated with a single node, using SQL Joins.

Let's associate body and node first and fetch both details for nid 1.

select n.nid as nid, n.title as title, n.status as status, n.uid as uid, b.body_content as body
from node n
left join body b on b.nid=n.nid
where n.nid=1;

Adding tags as well, we get,

select n.nid as nid, n.title as title, n.status as status, n.uid as uid, (select body.body_content from body where  body.nid=1) as body,
array_agg(t.name) as tag
from node n
inner join node_tags nt on nt.nid=n.nid
inner join terms t on nt.tid=t.tid
where n.nid=1
group by n.nid;

We shall add one last field, the article image. For images/files, we will have a table to manage the files,

create table files (
  fid serial primary key,
  uid integer not null references users(uid),
  filename varchar(255),
  uri varchar(255),
  filemime varchar(255)
);

and another table to relate files to the node.

create table node_images (
  nid integer,
  fid integer
);

Cat pic

Add a picture of a cute cat.

-- insert new image
insert into files(uid, filename, uri, filemime) values(1, 'cute-cat.jpg', 'public://pics/cute-cat.jpg', 'image/jpeg');

-- associate the image with node
insert into node_images (nid,fid) values (1,1);

Adding images to the final query, we get:

select n.nid as nid, n.title as title, n.status as status, n.uid as uid, 
(select body.body_content from body where  body.nid=1) as body,
(select f.uri from files f
inner join node_images ni on ni.fid=f.fid
where ni.nid=1) as image,
array_agg(t.name) as tag
from node n
inner join node_tags nt on nt.nid=n.nid
inner join terms t on nt.tid=t.tid
where n.nid=1
group by n.nid;

This is roughly the kind of query executed when Drupal fetches a node. Drupal also has the concept of revisions in nodes. This implies that we split the node table into 2 separate tables, one for storing the info about all the nodes and the latest revision ID, another table which stores the revisions and its related data(revision log, vid/revision ID, created timestamp for revision)

While we are at it, let's also add a bundle type column for nodes.

create table node (
   nid serial primary key,
   vid integer,
   bundle varchar(255),
   title text not null,
   status content_status,
   uid integer not null references users(uid)
);

create table node_revision (
   vid serial integer,
   nid integer,
   title text not null,
   revision_uid integer not null references users(uid),
   revision_log text
);

Now, revisions apply not only to node properties, but fields as well. Let's add a revision table for each of the article fields.

create table body_revision (
  nid integer,
  vid integer,
  body_content text
);

create table node_tags_revision (
  nid integer,
  vid integer,
  delta integer,
  tid integer
);

create table node_images_revision (
  nid integer,
  vid integer,
  fid integer
);

These tables will store the respective field content for all revisions. We also need to amend the existing field tables to add a revision ID column, so that it stores the latest revision content alone for that field. For instance, the body_revision contains 4 revisions of body field.

insert into body_revision (nid, vid, body_content) values (1,1, 'Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.');
insert into body_revision (nid, vid, body_content) values (1,2, 'Revision 2. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.');
insert into body_revision (nid, vid, body_content) values (1,3, '#3. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.');
insert into body_revision (nid, vid, body_content) values (1,4, 'Final Copy. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.');

The body table will only contain the latest version of the field's content.

alter table body add column vid integer not null;
update body set vid=4, body_content='Final Copy. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.' where nid=1;

Rinse and repeat for tag and image field tables as well.

alter table node_tags add column vid integer not null;
alter table node_images add column vid integer not null;

Note that a new revision for the node adds a revision entry in all the revision tables and updates the revision id in all the field tables, irrespective of which field/property has changed. This is the most efficient way to preserve the field changes among different revisions.

Let's add a 2nd revision of node 1 where we change the title.

insert into node_revision (vid, nid, title, revision_uid, revision_log) values (1, 2, 'Hello world v2', 1, 'testing revisions.');
update node set title='Hello world v2', vid=2 where nid=1;

We have to bump up versions for other fields as well, even if we don't change it. Let's change the body text in every revision going forward, just to illustrate how revisions work.

insert into body_revision (nid, vid, body_content) values (1,2, 'Revision 2. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.');
update body set vid=2, body_content='Revision 2. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.' where nid=1;

-- other stuff unchanged
-- tags
insert into node_tags_revision (nid, vid, delta, tid) values (1, 2, 0, 1);
insert into node_tags_revision (nid, vid, delta, tid) values (1, 2, 1, 2);

update node_tags set vid=2,delta=0 where nid=1 and tid=1;
update node_tags set vid=2,delta=1 where nid=1 and tid=2;

-- image
insert into node_images_revision (nid, vid, fid) values (1, 2, 1);
update node_images set vid=2,fid=1 where nid=1;

For the 3rd revision, we shall add an extra tag and change the order.

-- node
insert into node_revision (vid, nid, title, revision_uid, revision_log) values (1, 3, 'Hello world v2', 1, 'tag change.');
update node set vid=3 where nid=1;

-- body changes
insert into body_revision (nid, vid, body_content) values (1,3, '#3. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.');
update body set vid=3, body_content='#3. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.' where nid=1;

-- tags
-- add a new tag and change the order(delta)
insert into node_tags (nid, vid, delta, tid) values (1, 3, 0, 3);

insert into node_tags_revision (nid, vid, delta, tid) values (1, 3, 0, 3);
insert into node_tags_revision (nid, vid, delta, tid) values (1, 3, 1, 1);
insert into node_tags_revision (nid, vid, delta, tid) values (1, 3, 2, 2);

update node_tags set vid=3,delta=1 where nid=1 and tid=1;
update node_tags set vid=3,delta=2 where nid=1 and tid=2;

-- image, unchanged
insert into node_images_revision (nid, vid, fid) values (1, 3, 1);
update node_images set vid=3,fid=1 where nid=1;

Lastly, let's add a new image for the image field as the fourth revision. I'm thinking puppies.

Puppy

-- node
insert into node_revision (vid, nid, title, revision_uid, revision_log) values (1, 3, 'Hello world v2', 1, 'updating image.');
update node set vid=4 where nid=1;

-- body changes
insert into body_revision (nid, vid, body_content) values (1,4, 'Final Copy. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.');
update body set vid=4, body_content='Final Copy. Placerat porro nec varius, accumsan elementum, nesciunt, bibendum. Bibendum perspiciatis.' where nid=1;

-- tags
-- add a new tag and change the order(delta)

insert into node_tags_revision (nid, vid, delta, tid) values (1, 4, 0, 3);
insert into node_tags_revision (nid, vid, delta, tid) values (1, 4, 1, 1);
insert into node_tags_revision (nid, vid, delta, tid) values (1, 4, 2, 2);


update node_tags set vid=4,delta=0 where nid=1 and tid=3;
update node_tags set vid=4,delta=1 where nid=1 and tid=1;
update node_tags set vid=4,delta=2 where nid=1 and tid=2;

-- image
-- first add a new image
insert into files(uid, filename, uri, filemime) values(1, 'puppy.jpg', 'public://pics/puppy.jpg', 'image/jpeg');

insert into node_images_revision (nid, vid, fid) values (1, 4, 2);
update node_images set vid=4,fid=2 where nid=1;

Let's alter the query to fetch a specific revision. It's a matter of adding an extra vid check for each field.

select n.nid as nid, n.vid as revision, n.title as title, n.status as status, n.uid as uid, 
(select body.body_content from body where  body.nid=1 and body.vid=1) as body,
(select f.uri from files f
inner join node_images ni on ni.fid=f.fid
where ni.nid=1 and ni.vid=1) as image,
array_agg(t.name) as tag
from node n
inner join node_tags nt on nt.nid=n.nid and nt.vid=n.vid
inner join terms t on nt.tid=t.tid
where n.nid=1 and n.vid=1
group by n.nid;

But in most cases, we shall fetch the latest revision of the node.

select n.nid as nid, n.vid as revision, n.title as title, n.status as status, n.uid as uid, 
(select body.body_content from body where  body.nid = 1 and body.vid = n.vid) as body,
(select f.uri from files f
inner join node_images ni on ni.fid = f.fid
where ni.nid = 1 and ni.vid = n.vid) as image,
array_agg(t.name) as tag
from node n
inner join node_tags nt on nt.nid = n.nid and nt.vid = n.vid
inner join terms t on nt.tid = t.tid
where n.nid = 1
group by n.nid;

Now, it would be hard to fetch all these values everytime a node is loaded from a performance perspective. This is especially true when there are hundreds of fields. To avoid this, Drupal creates a "cache" table and stores a dump of the latest node revision using a convenient key.

Let's create a function to wrap all node data into a JSON blob.

create or replace function get_node_data(node_id integer) returns json
  as $$
  select row_to_json(node_data) as data from (
  select n.nid as nid, n.vid as revision, n.title as title, n.status as status, n.uid as uid, 
  (select body.body_content from body where  body.nid = node_id and body.vid = n.vid) as body,
  (select f.uri from files f
  inner join node_images ni on ni.fid = f.fid
  where ni.nid = node_id and ni.vid = n.vid) as image,
  array_agg(t.name) as tag
  from node n
  inner join node_tags nt on nt.nid = n.nid and nt.vid = n.vid
  inner join terms t on nt.tid = t.tid
  where n.nid = node_id
  group by n.nid
  ) node_data;
$$ language sql;

And a node cache table to store the queried data.

create table node_cache (
  key text primary key,
  value json
);

Every time a node is added or updated, we cache the node in this table.

-- cache write
insert into node_cache (key, value) values ('node:1', get_node_data(1));

-- cache write 2
update node_cache set value=get_node_data(42) where key='node:42';

-- cache read
select value from node_cache where key='node:1';

Stuff that's left out or different from actual Drupal schema

postgres. The reason for using Posgres for this example is to emulate the business logic functions which happens in the PHP side. This could be emulated in any other database as well.

locale. The language of the node would be assigned a locale code, like en. This will be a separate column, similar to bundle, and will carry over to the associated fields as well.

UUIDs. Drupal has an entity ID and a separate UUID assigned for each node/entity in the system. This is to preserve the uniqueness of the entity. For example, to handle a case where a node whose nid is 12 is imported into another Drupal site, we should not have a collision between this node and a node already existing in the site with nid 12. More on this concept in a future post.

Taxonomy hierarchy. As mentioned earlier, we assume a flat taxonomy system, whereas Drupal has a nested structure of terms. It uses a separate table with parent ID and term ID to track this relationship. You may want to read up my other post about modelling trees in SQL.

Node and node data. Drupal separates node and node data into 2 different tables, as node and node_field_data. I'm not sure why this design is in place.

Users and Permissions. We just assume any user can get the node or update by changing the user id value. Drupal has a robust roles and permission system in place. This could be modelled by [[][Row level security]] policies in Postgres.

Field reuse. There is no separate place where the field related metadata is stored. The table naming convention hints at this. The norm is <entity_name>__<field_name> and <entity_name>_revision__<field_name>. We haen't seen a field reuse case in the above examples.

Links to the data models without revisions and with revisions in SQL fiddle.