SQL JOIN

Russell Bateman
September 2010
last update:

Table of Contents

A LEFT JOIN example
Renaming a column...
A second LEFT JOIN example
JOIN methodology
Forgotten data rows
Optimizations
LEFT JOINs slower

This page contains a considerable number of JOIN examples to study. It will take me a while to finish polishing these notes to make them usable. In the meantime, here they are...

I want to give credit to a bright and patient work colleague, Al Wild, who gave me his own methodology as a way of thinking about how to write SQL JOIN statements.

Example 1

Follow the SQL statements below to see how two tables are set up. One is a table of e-mail addresses and the other of names. After altering the e-mail table, the two are then linked via the latter's name_id field that holds what is basically, but informally a foreign key into the name table.

A LEFT JOIN will be demonstrated to show how it compares in certain uses with an INNER JOIN. In this example, we're joining table A (t_names) to table B (t_emails). The LEFT JOIN always gives everything from table A and, if there the same column exists in the corresponding row in table B, shows that too.

The more tables you INNER JOIN, the greater the possibility that you wipe out the entire data set for a given name. This is the situation below with Al Wild, who has no e-mail address, while Russ Bateman has two. Observe that the INNER JOIN fails to include Al Wild in the results while a LEFT JOIN will include him.

In this situation therefore, LEFT JOIN is a more sure approach that won't result in leaving a name out.

I threw in an attempt at RIGHT JOIN that demonstrates that in this schema, at least, it does nothing different from INNER JOIN.

(Note: name_id below isn't actually a FOREIGN KEY, so it's only a loose or arbitrary association between t_names and t_emails.)

mysql> CREATE TABLE t_emails -> ( -> id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, -> email VARCHAR( 500 ) -> ); Query OK, 0 rows affected (0.01 sec) mysql> DESC t_emails; +-------+--------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +-------+--------------+------+-----+---------+----------------+ | id | int(11) | NO | PRI | NULL | auto_increment | | email | varchar(500) | YES | | NULL | | +-------+--------------+------+-----+---------+----------------+ 2 rows in set (0.00 sec) mysql> CREATE TABLE t_names -> ( -> id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, -> first VARCHAR( 500 ), -> last VARCHAR( 500 ) -> ); Query OK, 0 rows affected (0.01 sec) mysql> DESC t_names; +-------+--------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +-------+--------------+------+-----+---------+----------------+ | id | int(11) | NO | PRI | NULL | auto_increment | | first | varchar(500) | YES | | NULL | | | last | varchar(500) | YES | | NULL | | +-------+--------------+------+-----+---------+----------------+ 3 rows in set (0.00 sec) mysql> ALTER TABLE t_emails ADD COLUMN name_id INT NOT NULL; Query OK, 0 rows affected (0.01 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> DESC t_emails; +--------------+--------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +--------------+--------------+------+-----+---------+----------------+ | id | int(11) | NO | PRI | NULL | auto_increment | | email | varchar(500) | YES | | NULL | | | name_id | int(11) | NO | | NULL | | +--------------+--------------+------+-----+---------+----------------+ 3 rows in set (0.00 sec) mysql> INSERT INTO t_names ( first, last) VALUES( 'russ', 'bateman' ); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO t_names ( first, last) VALUES( 'al', 'wild' ); Query OK, 1 row affected (0.00 sec) mysql> select * from t_names; +----+-------+---------+ | id | first | last | +----+-------+---------+ | 1 | russ | bateman | | 2 | al | wild | +----+-------+---------+ 2 rows in set (0.00 sec) mysql> INSERT INTO t_emails ( email, name_id ) VALUES ( '[email protected]', 1 ); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO t_emails ( email, name_id ) VALUES ( '[email protected]', 1 ); Query OK, 1 row affected (0.00 sec) mysql> SELECT * FROM t_emails; +----+-----------------------+---------+ | id | email | name_id | +----+-----------------------+---------+ | 1 | [email protected] | 1 | | 2 | [email protected] | 1 | +----+-----------------------+---------+ 2 rows in set (0.00 sec) mysql> SELECT * FROM t_names INNER JOIN t_emails ON t_emails.name_id = names.id; +----+-------+---------+----+-----------------------+---------+ | id | first | last | id | email | name_id | +----+-------+---------+----+-----------------------+---------+ | 1 | russ | bateman | 1 | [email protected] | 1 | | 1 | russ | bateman | 2 | [email protected] | 1 | +----+-------+---------+----+-----------------------+---------+ 2 rows in set (0.00 sec) mysql> SELECT * FROM t_names LEFT JOIN t_emails ON t_emails.name_id = names.id; +----+-------+---------+------+-----------------------+---------+ | id | first | last | id | email | name_id | +----+-------+---------+------+-----------------------+---------+ | 1 | russ | bateman | 1 | [email protected] | 1 | | 1 | russ | bateman | 2 | [email protected] | 1 | | 2 | al | wild | NULL | NULL | NULL | +----+-------+---------+------+-----------------------+---------+ 3 rows in set (0.01 sec) mysql> SELECT * FROM t_names RIGHT JOIN t_emails ON t_emails.name_id = names.id; +----+-------+---------+----+-----------------------+---------+ | id | first | last | id | email | name_id | +----+-------+---------+----+-----------------------+---------+ | 1 | russ | bateman | 1 | [email protected] | 1 | | 1 | russ | bateman | 2 | [email protected] | 1 | +----+-------+---------+----+-----------------------+---------+ 2 rows in set (0.00 sec)

Renaming a column...

(Okay, kind of silly to put this example here. Maybe I'll relocate it someday.)

mysql> DESC email; +----------+-------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------+-------------+------+-----+---------+----------------+ | oid | int(11) | NO | PRI | NULL | auto_increment | | name | varchar(32) | YES | | NULL | | | domain | varchar(32) | YES | | NULL | | +----------+-------------+------+-----+---------+----------------+ 3 rows in set (0.00 sec) mysql> ALTER TABLE email CHANGE COLUMN name username VARCHAR(32); Query OK, 0 rows affected (0.01 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> DESC email; +----------+-------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------+-------------+------+-----+---------+----------------+ | oid | int(11) | NO | PRI | NULL | auto_increment | | username | varchar(32) | YES | | NULL | | | domain | varchar(32) | YES | | NULL | | +----------+-------------+------+-----+---------+----------------+ 3 rows in set (0.00 sec)

Example 3

Forget the tables in the previous examples and think of these as new ones.

mysql> SHOW TABLES; +---------------+ | Tables_in_fun | +---------------+ | acct | | acct_info | | email | +---------------+ 3 rows in set (0.00 sec) mysql> DESC acct; +------------+-------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +------------+-------------+------+-----+---------+----------------+ | oid | int(11) | NO | PRI | NULL | auto_increment | | email_oid | int(11) | YES | | NULL | | | password | varchar(32) | YES | | NULL | | | salutation | varchar(32) | YES | | NULL | | | subscribed | varchar(2) | YES | | NULL | | +------------+-------------+------+-----+---------+----------------+ 5 rows in set (0.00 sec) mysql> DESC acct_info; +----------+-------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------+-------------+------+-----+---------+----------------+ | oid | int(11) | NO | PRI | NULL | auto_increment | | acct_oid | int(11) | YES | | NULL | | | first | varchar(32) | YES | | NULL | | | last | varchar(32) | YES | | NULL | | +----------+-------------+------+-----+---------+----------------+ 4 rows in set (0.00 sec) mysql> DESC email; +--------+-------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +--------+-------------+------+-----+---------+----------------+ | oid | int(11) | NO | PRI | NULL | auto_increment | | name | varchar(32) | YES | | NULL | | | domain | varchar(32) | YES | | NULL | | +--------+-------------+------+-----+---------+----------------+ 3 rows in set (0.00 sec) mysql> SELECT * FROM acct; +-----+-----------+-----------+------------+------------+ | oid | email_oid | password | salutation | subscribed | +-----+-----------+-----------+------------+------------+ | 1 | 1 | saugrenue | Russ | n | | 2 | 2 | funstuff | Al | n | | 3 | 3 | bfsplt | Levi | y | +-----+-----------+-----------+------------+------------+ 3 rows in set (0.00 sec) mysql> SELECT * FROM acct_info; +-----+----------+---------+---------+ | oid | acct_oid | first | last | +-----+----------+---------+---------+ | 1 | 1 | Russell | Bateman | | 2 | 2 | Al | Wild | | 3 | 3 | Levi | Miller | +-----+----------+---------+---------+ 3 rows in set (0.00 sec) mysql> SELECT * FROM email; +-----+-----------------+------------------+ | oid | username | domain | +-----+-----------------+------------------+ | 1 | russell.bateman | windofkeltia.com | | 2 | al.wild | gmail.com | | 3 | levi.miller | yahoo.com | +-----+-----------------+------------------+ 3 rows in set (0.00 sec) mysql> SELECT ai.first, ai.last, a.password, e.username, e.domain FROM acct AS a -> LEFT JOIN email AS e ON a.email_oid = e.oid -> LEFT JOIN acct_info AS ai ON ai.acct_oid = a.oid; +---------+---------+-----------+-----------------+------------------+ | first | last | password | username | domain | +---------+---------+-----------+-----------------+------------------+ | Russell | Bateman | saugrenue | russell.bateman | windofkeltia.com | | Al | Wild | funstuff | al.wild | gmail.com | | Levi | Miller | bfsplt | levi.miller | yahoo.com | +---------+---------+-----------+-----------------+------------------+ 3 rows in set (0.00 sec)

Consider the collection of tables just described augmented with these additional ones.

mysql> DESC affiliates; +--------+--------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +--------+--------------+------+-----+---------+----------------+ | oid | int(11) | NO | PRI | NULL | auto_increment | | name | varchar(128) | YES | | NULL | | | domain | varchar(128) | YES | | NULL | | +--------+--------------+------+-----+---------+----------------+ 3 rows in set (0.00 sec) mysql> DESC affiliations; +---------------+---------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +---------------+---------+------+-----+---------+----------------+ | oid | int(11) | NO | PRI | NULL | auto_increment | | acct_oid | int(11) | YES | | NULL | | | affiliate_oid | int(11) | YES | | NULL | | +---------------+---------+------+-----+---------+----------------+ 3 rows in set (0.00 sec)

With this data set...

mysql> SELECT * FROM acct; +-----+-----------+-----------+------------+------------+ | oid | email_oid | password | salutation | subscribed | +-----+-----------+-----------+------------+------------+ | 1 | 1 | saugrenue | Russ | n | | 2 | 2 | funstuff | Al | n | | 3 | 3 | bfsplt | Levi | y | | 4 | NULL | poop | NULL | NULL | +-----+-----------+-----------+------------+------------+ 4 rows in set (0.00 sec) mysql> SELECT * FROM acct_info; +-----+----------+---------+---------+ | oid | acct_oid | first | last | +-----+----------+---------+---------+ | 1 | 1 | Russell | Bateman | | 2 | 2 | Al | Wild | | 3 | 3 | Levi | Miller | | 4 | 4 | Jay | Johnson | +-----+----------+---------+---------+ 4 rows in set (0.00 sec) mysql> SELECT * FROM email; +-----+-----------------+------------------+ | oid | username | domain | +-----+-----------------+------------------+ | 1 | russell.bateman | windofkeltia.com | | 2 | al.wild | gmail.com | | 3 | levi.miller | yahoo.com | +-----+-----------------+------------------+ 3 rows in set (0.00 sec) mysql> SELECT * FROM affiliates; +-----+---------------------------------+-------------------+ | oid | name | domain | +-----+---------------------------------+-------------------+ | 1 | Russ Cooks | www.russcooks.com | | 2 | National Hockey League | www.nhl.com | | 3 | National Basketball Association | www.nba.com | | 4 | Tattoos by Vic Back | www.vicback.com | +-----+---------------------------------+-------------------+ 4 rows in set (0.00 sec) mysql> SELECT * FROM affiliations; +-----+----------+---------------+ | oid | acct_oid | affiliate_oid | +-----+----------+---------------+ | 1 | 1 | 1 | | 2 | 1 | 2 | | 3 | 1 | 3 | | 4 | 1 | 4 | | 5 | 2 | 1 | | 6 | 2 | 2 | | 7 | 2 | 3 | | 8 | 3 | 1 | | 9 | 3 | 4 | +-----+----------+---------------+ 9 rows in set (0.00 sec)

Table affiliations is called a relational or associative table and implements a many-to-many relationship with acct.

The JOIN statements we might make are the following. We added a partially filled-out row for "Jay Johnson" to illustrate differences between INNER JOINs and LEFT OUTER JOINs. Jay will have no affiliations.

mysql> SELECT a.oid,ai.first,ai.last,at.name FROM acct AS a -> INNER JOIN acct_info AS ai ON ai.acct_oid = a.oid -> INNER JOIN affiliations AS af ON af.acct_oid = a.oid -> INNER JOIN affiliates AS at ON at.oid = af.affiliate_oid; +-----+---------+---------+---------------------------------+ | oid | first | last | name | +-----+---------+---------+---------------------------------+ | 1 | Russell | Bateman | Russ Cooks | | 1 | Russell | Bateman | National Hockey League | | 1 | Russell | Bateman | National Basketball Association | | 1 | Russell | Bateman | Tattoos by Vic Back | | 2 | Al | Wild | Russ Cooks | | 2 | Al | Wild | National Hockey League | | 2 | Al | Wild | National Basketball Association | | 3 | Levi | Miller | Russ Cooks | | 3 | Levi | Miller | Tattoos by Vic Back | +-----+---------+---------+---------------------------------+ 9 rows in set (0.00 sec) mysql> SELECT a.oid,ai.first,ai.last,at.name FROM acct AS a -> INNER JOIN acct_info AS ai ON ai.acct_oid = a.oid -> LEFT JOIN affiliations AS af ON af.acct_oid = a.oid -> LEFT JOIN affiliates AS at ON at.oid = af.affiliate_oid; +-----+---------+---------+---------------------------------+ | oid | first | last | name | +-----+---------+---------+---------------------------------+ | 1 | Russell | Bateman | Russ Cooks | | 1 | Russell | Bateman | National Hockey League | | 1 | Russell | Bateman | National Basketball Association | | 1 | Russell | Bateman | Tattoos by Vic Back | | 2 | Al | Wild | Russ Cooks | | 2 | Al | Wild | National Hockey League | | 2 | Al | Wild | National Basketball Association | | 3 | Levi | Miller | Russ Cooks | | 3 | Levi | Miller | Tattoos by Vic Back | | 4 | Jay | Johnson | NULL | +-----+---------+---------+---------------------------------+ 10 rows in set (0.00 sec) mysql> SELECT a.oid,ai.first,ai.last,at.name FROM acct AS a -> INNER JOIN acct_info AS ai ON ai.acct_oid = a.oid AND ai.first = 'Levi' -> LEFT JOIN affiliations AS af ON af.acct_oid = a.oid -> LEFT JOIN affiliates AS at ON at.oid = af.affiliate_oid; +-----+-------+--------+---------------------+ | oid | first | last | name | +-----+-------+--------+---------------------+ | 3 | Levi | Miller | Russ Cooks | | 3 | Levi | Miller | Tattoos by Vic Back | +-----+-------+--------+---------------------+ 2 rows in set (0.00 sec)

Comments

This AND clause, in the very last example above, which is equivalent to a WHERE clause, but up in the middle of things, speeds things up as it reaches the result (and limits the amount of data) faster.

Remove the unused column in affiliations. This will save 33% of space, but speed up JOINs by 80%. However, some ORM require it as do some (human) database administrators.

JOIN methodology

This is the candy store of this document: its whole contribution.

Here's how to think about things as you code a JOIN. It's a great idea to compose your JOIN in the SQL edit window of a good tool such as SQuirreL. You should invoke the SQL command at each incremental step to ensure that your statement is working (rather than wait until the end to see the whole thing fail without always realizing where it went wrong early on).

The following is illustrated using the data set developed in the example above.

  1. Begin with SELECT *...
  2.  
  3. Pick the most primary table (most central, etc.) to start with: acct
    SELECT * FROM acct;
  4.  
  5. Join the table from which you want data to come: acct_info which yields first and last names. Begin giving the tables small reference names.
    SELECT * FROM acct AS a INNER JOIN acct_info AS ai;
  6.  
  7. Compose the ON clause from the two columns (one from each of the two tables) that intersect: acct_info.acct_oid = acct.oid
    SELECT * FROM acct AS a INNER JOIN acct_info AS ai ON ai.acct_oid = a.oid;
     
  8. Stack up successively JOIN'd tables according to the two preceding steps.
  9. SELECT * FROM acct AS a INNER JOIN acct_info AS ai ON ai.acct_oid = a.oid INNER JOIN affiliations AS af ON af.acct_oid = a.oid;
     
  10. Until you've finally JOIN'd all the tables whose fields you need to get the result.
  11. SELECT * FROM acct AS a INNER JOIN acct_info AS ai ON ai.acct_oid = a.oid INNER JOIN affiliations AS af ON af.acct_oid = a.oid INNER JOIN affiliates AS at ON a.oid = af.affiliate_oid;
     
  12. Begin paring down to the desired result by changing out the '*' in the SELECT to just the list of columns whose data you want to display.
    SELECT a.oid,ai.first,ai.last,at.name FROM acct AS a INNER JOIN acct_info AS ai ON ai.acct_oid = a.oid INNER JOIN affiliations AS af ON af.acct_oid = a.oid INNER JOIN affiliates AS at ON a.oid = af.affiliate_oid;

More comments on forgotten data rows, optimizations, other speed issues

If you know your data, for example these tables, you know that each account has an affiliation. However, if an account did not have an affiliation, as the "Jay Johnson" account we added for this example did not, then using INNER JOIN would fail to list that account in the result. Use a LEFT JOIN to solve this.

Strangely (perhaps), just using a LEFT JOIN on affiliations won't do the trick because affiliates also won't supply a row in that case (even though the "problem" is between affiliations and acct). So, we have to use LEFT JOIN with affiliates as well.

Pinch it off early...

Note that by adding the AND below, as high in the statement as we can, will speed things up because the virtual table produced by the first JOIN will be much smaller. Smaller table, less data, faster running.

SELECT a.oid,ai.first,ai.last,at.name FROM acct AS a INNER JOIN acct_info AS ai ON ai.acct_oid = a.oid AND ai.first = 'Levi' INNER JOIN affiliations AS af ON af.acct_oid = a.oid INNER JOIN affiliates AS at ON a.oid = af.affiliate_oid;

LEFT JOINs are just slower...

LEFT JOINs are significantly slower than INNER JOINs. Why? INNER JOINs only produce a (virtual) table made up of matching rows (that are then filtered further by the ON clause) while LEFT [OUTER] JOINs use all of the rows in the first table in the virtual table output (before the ON clause trims them down).