Building a nested hierarchy of JSON in a relational DB

How to best store and retrieve a tree-style in a relational DB? This pattern is more common than we think. Examples would be hierarchical tags and threaded comments or discussions.

Let's start with the basic table structure with a self reference to a parent genre.

CREATE TABLE genres (
    id serial NOT NULL PRIMARY KEY,
    name character varying(255) NOT NULL,
    parent_id integer REFERENCES genres (id)
);

Here's how our proposed hierarchy looks like:

Genre hierarchy

followed by insert statements to populate the table,

INSERT INTO genres (name, parent_id) VALUES ('Arts & Photography',NULL);
INSERT INTO genres (name, parent_id) VALUES ('Architecture',1);
INSERT INTO genres (name, parent_id) VALUES ('Graphic Design',1);
INSERT INTO genres (name, parent_id) VALUES ('Music',1); -- 4
INSERT INTO genres (name, parent_id) VALUES ('Songbooks',4);
INSERT INTO genres (name, parent_id) VALUES ('Instruments & Performers',4); -- 6
INSERT INTO genres (name, parent_id) VALUES ('Brass',6);
INSERT INTO genres (name, parent_id) VALUES ('Woodwinds',6);

INSERT INTO genres (name, parent_id) VALUES ('Comics & Graphic Novels', NULL); -- 9
INSERT INTO genres (name, parent_id) VALUES ('Comic Strips',9);
INSERT INTO genres (name, parent_id) VALUES ('Graphic Novels',9);
INSERT INTO genres (name, parent_id) VALUES ('Manga',9);

INSERT INTO genres (name, parent_id) VALUES ('Comic Strips',NULL); -- 13
INSERT INTO genres (name, parent_id) VALUES ('Mystery, Thriller and Suspense', NULL); -- 14
INSERT INTO genres (name, parent_id) VALUES ('Mystery',14); -- 15
INSERT INTO genres (name, parent_id) VALUES ('Hard Boiled',15);
INSERT INTO genres (name, parent_id) VALUES ('Police Procedurals',15); -- 17
INSERT INTO genres (name, parent_id) VALUES ('British Detectives',17);
INSERT INTO genres (name, parent_id) VALUES ('FBI Agents',17);
INSERT INTO genres (name, parent_id) VALUES ('Police Officers',17);

INSERT INTO genres (name, parent_id) VALUES ('Nonfiction', NULL); -- 21
INSERT INTO genres (name, parent_id) VALUES ('Biographies and Memoirs',21);
INSERT INTO genres (name, parent_id) VALUES ('Business & Investing',21);
INSERT INTO genres (name, parent_id) VALUES ('Computers & Technology',21); -- 24
INSERT INTO genres (name, parent_id) VALUES ('Databases',24);
INSERT INTO genres (name, parent_id) VALUES ('Hardware',24);
INSERT INTO genres (name, parent_id) VALUES ('Software',24);

Now, we're all set. In order to facilitate finding of items in a tree hierarchy easier, we first build a materialized path of our items. This is a serialized column which stores all the ancestor ids of the item. We shall represent this using postgres array type. To construct a materialized path, we use a feature called CTEs(Common Table Expressions). It is a fancy term for the following type of expression:

WITH custom_table AS (
SELECT a,b,c FROM existing_table WHERE id = 1
)
SELECT a FROM custom_table;

Here's a sample CTE for fetching all the root genre items from the genres table.

WITH root_genres AS (
SELECT id, name FROM genres WHERE parent_id IS NULL
) SELECT * FROM root_genres;

CTEs can be recursive, in which case, they will be prefixed with WITH RECURSIVE. This implies that I can refer to the CTE table being constructed within the CTE itself, just like recursive functions in most programming languages. Let's take a stab at constructing the materialized path table using recursive CTEs.

WITH RECURSIVE genres_materialized_path AS (
  SELECT id, name, ARRAY[]::INTEGER[] AS path
  FROM genres WHERE parent_id IS NULL

  UNION ALL

  SELECT genres.id, genres.name, genres_materialized_path.path || genres.parent_id
  FROM genres, genres_materialized_path
  WHERE genres.parent_id = genres_materialized_path.id
) SELECT * FROM genres_materialized_path;

We can break it down into 2 parts. The first part fetches all root node items with an empty materialized path array. The second part fetches the children by appending the current parent to the path column. Note that || indicates array concatenation in PostgreSQL. For instance, ARRAY[4,5,6] || 7 gives {4,5,6,7}.

# SELECT ARRAY[4,5,6] || 7 ;
 ?column?  
-----------
 {4,5,6,7}

To get a subtree of a given item, all we have to add is an extra WHERE clause to check if the given id is the last item in the path.

WITH RECURSIVE genres_materialized_path AS (
  SELECT id, name, ARRAY[]::INTEGER[] AS path
  FROM genres WHERE parent_id IS NULL

  UNION ALL

  SELECT genres.id, genres.name, genres_materialized_path.path || genres.parent_id
  FROM genres, genres_materialized_path
  WHERE genres.parent_id = genres_materialized_path.id
) SELECT * FROM genres_materialized_path WHERE 15 = genres_materialized_path.path[array_upper(genres_materialized_path.path,1)];

This will fetch all the children of item whose id is 15.

 id |        name        |  path   
----+--------------------+---------
 16 | Hard Boiled        | {14,15}
 17 | Police Procedurals | {14,15}
(2 rows)

We might want to do this for every item in genres. So, we convert the above CTE into a PostgreSQL function. Now, we can apply it to every row in genres. Also, a tabular form might not fit a hierarchical representation. So, inside the function, we convert the above output to JSON while maintaining the hierarchy.

CREATE OR REPLACE FUNCTION get_children(genre_id integer)
RETURNS json AS $$
DECLARE
result json;
BEGIN
SELECT array_to_json(array_agg(row_to_json(t))) INTO result -- inject output into result variable
FROM ( -- same CTE as above
  WITH RECURSIVE genres_materialized_path AS (
    SELECT id, name, ARRAY[]::INTEGER[] AS path
    FROM genres WHERE parent_id IS NULL

    UNION ALL

    SELECT genres.id, genres.name, genres_materialized_path.path || genres.parent_id
    FROM genres, genres_materialized_path
    WHERE genres.parent_id = genres_materialized_path.id
  ) SELECT id, name, ARRAY[]::INTEGER[] AS children FROM genres_materialized_path WHERE $1 = genres_materialized_path.path[array_upper(genres_materialized_path.path,1)] -- some column polish for a cleaner JSON
) t;
RETURN result;
END;
$$ LANGUAGE PLPGSQL;

Running this through the genre table, we get:

SELECT id, name, COALESCE(get_children(id), '[]') AS children FROM genres;

Now, for the final act, i.e., representing the whole data from the above query as a single nested JSON object. For this, we cross the realm of PosgreSQL and move to Javascript land, without leaving the console ;) This can be faciliated by enabling the plv8 extension. Note that this requires PostgreSQL version >= 9.1.

CREATE EXTENSION plv8;

Why Javascript? For one, you need not learn another language to operate with your data. Besides, I personally find it easy to manipulate JSON in Javascript. The syntax for defining a Javascript function looks like this:

CREATE OR REPLACE FUNCTION test_func(data json) RETURNS json AS $$
  return JSON.stringify(data);
$$ LANGUAGE PLV8;

SELECT test_func('{"a": {"b":"foo"}}'::json);
     test_func     
-------------------
 {"a":{"b":"foo"}}

Our Javascipt function iterates through every row from the genre children table and builds a JSON object by inserting the row item under the appropriate parent. I found a recursive Javascript function at stackoverflow which finds an object given by key in a deeply nested JSON blob. I've adopted it for our needs. getObject finds an object by id and returns it.

CREATE OR REPLACE FUNCTION get_tree(data json) RETURNS json AS $$

var root = [];

for(var i in data) {
  build_tree(data[i]['id'], data[i]['name'], data[i]['children']);
}

function build_tree(id, name, children) {
  var exists = getObject(root, id);
  if(exists) {
       exists['children'] = children;
  }
  else {
    root.push({'id': id, 'name': name, 'children': children});
  }
}


function getObject(theObject, id) {
    var result = null;
    if(theObject instanceof Array) {
    for(var i = 0; i < theObject.length; i++) {
        result = getObject(theObject[i], id);
        if (result) {
        break;
        }   
    }
    }
    else
    {
    for(var prop in theObject) {
        if(prop == 'id') {
        if(theObject[prop] === id) {
            return theObject;
        }
        }
        if(theObject[prop] instanceof Object || theObject[prop] instanceof Array) {
        result = getObject(theObject[prop], id);
        if (result) {
            break;
        }
        } 
    }
    }
    return result;
}

    return JSON.stringify(root);
$$ LANGUAGE plv8 IMMUTABLE STRICT;

The buildTree function calls getObject for each and every row in the materialized table of genres. If the object is not found, then it is added at the root level. Let's call this and build our tree, again, using CTEs.

WITH data AS(
select array_to_json(array_agg(row_to_json(t))) as data
    from (
     SELECT id, name, COALESCE(get_children(id), '[]') as children from genres
    ) t
) SELECT get_tree(data) from data;

This will get us the nested JSON tree. Here's the prettified version of it.

[
  {
    "id": 1,
    "name": "Arts & Photography",
    "children": [
      {
    "id": 2,
    "name": "Architecture",
    "children": [

    ]
      },
      {
    "id": 3,
    "name": "Graphic Design",
    "children": [

    ]
      },
      {
    "id": 4,
    "name": "Music",
    "children": [
      {
        "id": 5,
        "name": "Songbooks",
        "children": [

        ]
      },
      {
        "id": 6,
        "name": "Instruments & Performers",
        "children": [
          {
        "id": 7,
        "name": "Brass",
        "children": [

        ]
          },
          {
        "id": 8,
        "name": "Woodwinds",
        "children": [

        ]
          }
        ]
      }
    ]
      }
    ]
  },
  {
    "id": 9,
    "name": "Comics & Graphic Novels",
    "children": [
      {
    "id": 10,
    "name": "Comic Strips",
    "children": [

    ]
      },
      {
    "id": 11,
    "name": "Graphic Novels",
    "children": [

    ]
      },
      {
    "id": 12,
    "name": "Manga",
    "children": [

    ]
      }
    ]
  },
  {
    "id": 13,
    "name": "Comic Strips",
    "children": [

    ]
  },
  {
    "id": 14,
    "name": "Mystery, Thriller and Suspense",
    "children": [
      {
    "id": 15,
    "name": "Mystery",
    "children": [
      {
        "id": 16,
        "name": "Hard Boiled",
        "children": [

        ]
      },
      {
        "id": 17,
        "name": "Police Procedurals",
        "children": [
          {
        "id": 18,
        "name": "British Detectives",
        "children": [

        ]
          },
          {
        "id": 19,
        "name": "FBI Agents",
        "children": [

        ]
          },
          {
        "id": 20,
        "name": "Police Officers",
        "children": [

        ]
          }
        ]
      }
    ]
      }
    ]
  },
  {
    "id": 21,
    "name": "Nonfiction",
    "children": [
      {
    "id": 22,
    "name": "Biographies and Memoirs",
    "children": [

    ]
      },
      {
    "id": 23,
    "name": "Business & Investing",
    "children": [

    ]
      },
      {
    "id": 24,
    "name": "Computers & Technology",
    "children": [
      {
        "id": 25,
        "name": "Databases",
        "children": [

        ]
      },
      {
        "id": 26,
        "name": "Hardware",
        "children": [

        ]
      },
      {
        "id": 27,
        "name": "Software",
        "children": [

        ]
      }
    ]
      }
    ]
  }
]

Further Tweaks

The genres table can contain other genre related dynamic properties, like date_added, count etc. This could be represented by a separate column in the table of type json. That's correct. You get the best of both NoSQL and SQL in PostgreSQL. This will be the subject of another post.

Resources