Speeding up queries with semi joins and anti-joins

Speeding Up Queries with Semi-Joins and Anti-Joins

Speeding Up Queries with Semi-Joins and Anti-Joins: How
Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN
by Roger Schrag
Database Specialists, Inc.

About Database Specialists, Inc.
Database Specialists, Inc. provides remote DBA services and onsite database support for your mission critical
Oracle systems. Since 1995, we have been providing Oracle database consulting in Solaris, HP-UX, Linux, AIX,
and Windows environments. We are DBAs, speakers, educators, and authors. Our team is continually recognized
by Oracle, at national conferences and by leading trade publications. Learn more about our remote DBA,
database tuning, and consulting services. Or, call us at 415-344-0500 or 888-648-0500 .

Introduction

Optimizing SQL usually gives the most significant results when DBAs are called upon to “make the system run
faster.” Using tools like Statspack or Enterprise Manager, it is often easy to find the slow SQL. But how do you make
the queries run faster? That is the challenge! In this paper we will discuss the semi-join and the anti-join, two powerful
SQL constructs Oracle offers for use in your quest for faster queries. In particular, we will define these two terms,
discuss when and why you might want to use the [NOT] EXISTS or [NOT] IN constructs, and demonstrate how you
can use optimizer hints and make minor query changes in order to enable Oracle to use some very powerful and
efficient access paths.

For certain classes of queries, these features can dramatically reduce logical reads, physical reads, CPU time, and
elapsed time. But beware! There are some obscure (and not always easy to find) requirements that must be met in
order for Oracle to deploy the semi- and anti-join access paths. If you fail to dot an “I” or cross a “T,” you could be
banging your head against the wall for hours. In this paper we will look at SQL from a real application and
demonstrate the “before” and “after” versions of queries that ran orders of magnitude faster once semi-joins and anti-
joins were implemented correctly.

Before we get started, I want to point out that Oracle Corporation has improved its optimization techniques
significantly over the years, specifically in the area of semi-joins and anti-joins. This paper was written using Oracle 9i
release 9.2.0.4, and all examples were validated in such an environment. If you are using a newer or older release of
Oracle, it is quite possible that you will see different results. Please do not take anything you read in this paper—or in
any paper for that matter—for granted. Validate every hypothesis in your environment before relying upon it.

Semi-Join Defined

A “semi-join” between two tables returns rows from the first table where one or more matches are found in the second
table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most
once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be
returned. Semi-joins are written using the EXISTS or IN constructs.

Suppose you have the DEPT and EMP tables in the SCOTT schema and you want a list of departments with at least
one employee. You could write the query with a conventional join:
SELECT D.deptno, D.dname
FROM dept D, emp E
WHERE E.deptno = D.deptno
ORDER BY D.deptno;

Unfortunately, if a department has 400 employees then that department will appear in the query output 400 times. You
could eliminate the duplicate rows by using the DISTINCT keyword, but you would be making Oracle do more work
than necessary. Really what you want to do is specify a semi-join between the DEPT and EMP tables instead of a

http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]


conventional join:
FROM dept D
WHERE EXISTS
(
SELECT 1
FROM emp E
)
ORDER BY D.deptno;

The above query will list the departments that have at least one employee. Whether a department has one employee or
100, the department will appear only once in the query output. Moreover, Oracle will move on to the next department
as soon as it finds the first employee in a department, instead of finding all of the employees in each department.

Anti-Join Defined

An “anti-join” between two tables returns rows from the first table where no matches are found in the second table.
An anti-join is essentially the opposite of a semi-join: While a semi-join returns one copy of each row in the first table
for which at least one match is found, an anti-join returns one copy of each row in the first table for which no match is
found. Anti-joins are written using the NOT EXISTS or NOT IN constructs. These two constructs differ in how they
handle nulls—a subtle but very important distinction which we will discuss later.

Suppose you want a list of empty departments—departments with no employees. You could write a query that finds all
departments and subtracts off the department of each employee:
SELECT D1.deptno, D1.dname
FROM dept D1
MINUS
SELECT D2.deptno, D2.dname
FROM dept D2, emp E2
WHERE D2.deptno = E2.deptno
ORDER BY 1;

The above query will give the desired results, but it might be clearer to write the query using an anti-join:
FROM dept D
WHERE NOT EXISTS
(
SELECT 1
FROM emp E
)
ORDER BY D.deptno;

The above query is more efficient because Oracle can employ an anti-join access path. The difference in efficiency
here is akin to the difference between a nested loops join and a hash join when you are joining every row in one table
to another.

The EXISTS and IN Constructs

In this section we will discuss how Oracle evaluates EXISTS and IN clauses, prerequisites for using Oracle’s semi-
join access paths, and hints that influence semi-join query optimization. Then we’ll look at a few sample queries where
semi-join access paths can be used to make the query more efficient.

How Oracle Evaluates EXISTS and IN Clauses

The SQL language as implemented in Oracle does not include a syntax for explicitly performing semi-joins, so you
use an EXISTS or IN clause with a subquery to indicate to Oracle that a semi-join should be performed instead of a
conventional join. Oracle will typically transform the subquery into a join wherever possible.



An interesting characteristic of Oracle’s query optimizer is that (according to Metalink document 144967.1), in Oracle
8i and 9i the decision of whether or not to transform a query by merging its subqueries into the main body is a
heuristic one. Basically, Oracle will merge subqueries if it can, regardless of impact on cost. The assumption is that
merging is always a good thing. In fact, Oracle performs query transformations (merging subqueries into the main
body) before making a list of possible execution plans and evaluating their cost. This bit of trivia will come into play
when we discuss the second example in this section.

The Oracle Performance Tuning Guide in the Oracle 8i documentation set explains that EXISTS and IN are processed
differently by the query optimizer. The manual explains a rule of thumb for when to use each, and provides an
example to demonstrate. The Oracle 9i Performance Tuning Guide provides the same rule of thumb with a slightly
more polished example. The Oracle 10g Performance Tuning Guide contains substantially the same text as the Oracle
9i manual on this subject.

The rule of thumb goes like this: If the main body of your query is highly selective, then an EXISTS clause might be
more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the
subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate.

The example in the Oracle 8i Performance Tuning Guide shows a query where the main body is not selective, while
the target of the semi-join is highly selective. The example demonstrates that Oracle chooses a poor execution plan
when an EXISTS clause is used, and a much more efficient execution plan when an IN clause is used instead. I was
easily able to reproduce this example in an Oracle 8i environment with the same results as shown in the Performance
Tuning Guide.

However, when I reproduced this same example in an Oracle 9i environment, Oracle chose an identical (and efficient)
execution plan for both versions of the query. Furthermore, the example in the Oracle 9i Performance Tuning Guide
gave an identical (and efficient) execution plan for both versions of the query in my Oracle 9i environment. These
tests echo my previous experience—that Oracle 9i is smarter than Oracle 8i about EXISTS and IN clauses, and you
don’t typically need to worry about when to use EXISTS versus IN.

The upshot is that if you are writing queries for an Oracle 8i (or earlier) system, then you should think about the
selectivity of the predicates in the main query versus the subquery and choose whether to use an EXISTS or an IN
clause wisely. If you are using Oracle 9i or later, it doesn’t hurt to follow this rule of thumb, but you may not need to
worry as much.

Prerequisites and Hints that Impact Semi-Joins

There are a few situations in which Oracle cannot use a semi-join access path. If the query contains a DISTINCT
operation (explicitly or implicitly through a set operator such as UNION) then Oracle cannot transform EXISTS or IN
clauses into something that could use a semi-join access path. The same is true if the EXISTS or IN clause is part of an
OR branch. (Semi-join access paths can be used in queries that contain ORs in the WHERE clause, just as long as the
EXISTS or IN is not part of the OR.)

Assuming that a semi-join access path is possible, Oracle chooses the semi-join algorithm to use while evaluating
execution plans. In Oracle 9i a semi-join can be performed using the nested loops, hash join, or merge join algorithms.
As with a conventional join, Oracle 9i will choose the join algorithm with the lowest cost. In Oracle 8i a semi-join can
be performed using the hash join or merge join algorithms, but Oracle 8i will not choose the join algorithm with the
lowest cost. Instead, the always_semi_join instance parameter dictates whether or not an Oracle 8i database should
perform semi-joins and, if so, which algorithm to use. The always_semi_join instance parameter is obsolete in Oracle
9i.

If a semi-join access path is not possible (because of a DISTINCT operation or an OR predicate), or if instance
parameters specify a semi-join should not be used, then Oracle will choose an alternate data access path. This usually
amounts to performing a conventional join and sorting out the duplicates, or scanning the first table and using the
second table as a filter.



Oracle provides the NL_SJ, HASH_SJ, and MERGE_SJ hints in order for you to manipulate the semi-join process if
you need to. The hint is applied to the subquery of the EXISTS or IN clause, not the main body of the query itself. In
my experience Oracle does not need you to provide a hint in order to consider the efficient semi-join access paths.
However, in Oracle 9i or 10g, where you have a wider choice of semi-join algorithms, you might want to use hints if
Oracle chooses a hash semi-join when a nested loops semi-join would in fact be better, or vice versa.

Semi-Join Example #1

Suppose you want a list of your gold-status customers who have placed an order within the last three days. You might
start with a query that looks like:
SELECT DISTINCT C.short_name, C.customer_id
FROM customers C, orders O
WHERE C.customer_type = 'Gold'
AND O.customer_id = C.customer_id
AND O.order_date > SYSDATE - 3
ORDER BY C.short_name;

The execution plan (from a TKPROF report) for this query is:
Rows Row Source Operation
------- ---------------------------------------------------
2 SORT UNIQUE (cr=33608 r=30076 w=0 time=6704029 us)
20 HASH JOIN (cr=33608 r=30076 w=0 time=6703101 us)
10 TABLE ACCESS FULL CUSTOMERS (cr=38 r=36 w=0 time=31718 us)
2990 TABLE ACCESS FULL ORDERS (cr=33570 r=30040 w=0 time=6646420 us)

You can see that Oracle did a (conventional) hash join between the CUSTOMERS and ORDERS tables, and then
performed a sort for uniqueness to filter out the duplicate records. This sort was necessary because the query is
supposed to show a customer only once, even if they have more than one order placed within the last three days. (A
sort to satisfy the ORDER BY clause is also necessary, but Oracle tackles this at the same time as the sort for
uniqueness and thus it does not appear as a separate step in the execution plan.)

Using what we have discussed so far, you might want to rewrite the query using a semi-join:
SELECT C.short_name, C.customer_id
FROM customers C
AND EXISTS
(
SELECT 1
FROM orders O
WHERE O.customer_id = C.customer_id
)

The execution plan for this query becomes:
------- ---------------------------------------------------
2 SORT ORDER BY (cr=33608 r=29921 w=0 time=6422770 us)
2 HASH JOIN SEMI (cr=33608 r=29921 w=0 time=6422538 us)
2990 TABLE ACCESS FULL ORDERS (cr=33570 r=29921 w=0 time=6345754 us)

The “SEMI” that you see on the second line indicates that a semi-join access path is being used instead of a
conventional join. A semi-join is more efficient here for two reasons: (1) The search through the ORDERS table for a
given customer can stop as soon as the first qualifying order is found, and (2) There is no need to sort the results to
remove duplicate customer records in the case where a customer had multiple qualifying orders. (You can see that the
sort for uniqueness, which took 928 microseconds, has been replaced by a sort to satisfy the ORDER BY clause, which
took only 232 microseconds.)

Rewriting the query to use a semi-join access path has perhaps been a novel experience, but, so far, it hasn’t done us a
substantial amount of good. We brought the elapsed time down about 4%, but that doesn’t seem too significant, and it



could just be the result of chance variation in our test environment. But suppose you are aware of the fact that very few
of your customers are gold-status. Perhaps a nested loops semi-join would be more efficient. You can add an NL_SJ
hint and see the results for yourself:
FROM customers C
AND EXISTS
(
SELECT /*+ NL_SJ */ 1
FROM orders O
)
------- ---------------------------------------------------
2 NESTED LOOPS SEMI (cr=833 r=725 w=0 time=358232 us)
2 TABLE ACCESS BY INDEX ROWID ORDERS (cr=795 r=725 w=0 time=355822 us)
780 INDEX RANGE SCAN ORDERS_N1 (cr=15 r=13 w=0 time=5601 us)(object id 34176)

Indeed, the nested loops semi-join brought the consistent reads down to 833 from 33608 and reduced the execution
time by over 90%. So it appears that Oracle chose a hash semi-join when a nested loops semi-join was more efficient.
Once we get Oracle to see the light with the NL_SJ hint, we end up with a much more efficient query than we started
with. We achieved this performance gain by leveraging Oracle’s nested loops semi-join access path.

Semi-Join Example #2

I recently worked on a project where applying the concepts presented in this section brought the execution time for a
key query in my client’s application down from 46 minutes to 1.1 seconds. A screen in the application allowed a user
to enter a project owner and up to five participants, and a list of assignments involving all of the specified people
would be displayed. The users found that the response time was tolerable if they only selected one participant, but it
went downhill from there as they added more participants. The original version of the query looked like this:
SELECT DISTINCT A.name, A.code, A.description,
A.item_id, A.assignment_id, FI.string0, FI.string1
FROM relationships R, assignments A, format_items FI,
relationships R1, relationships R2, relationships R3,
relationships R4, relationships R5
WHERE R.user_id = 134546
AND R.account_id = 134545
AND R.type_code = 0
AND A.item_id = R.item_id
AND FI.item_id = A.item_id
AND R1.item_id = A.item_id AND R1.status = 5 AND R1.user_id = 137279
ORDER BY A.name ASC;

(I have changed the names of tables and columns, stripped away extraneous columns and predicates, and eliminated
bind variables in order to both protect my client’s confidentiality and make the example easier to understand.) The
execution plan looked like this:
------- ---------------------------------------------------
64339104 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=23520269 r=34 w=0
time=1838051782 us)
95184881 NESTED LOOPS (cr=7842642 r=23 w=0 time=907238095 us)



2046 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=7211 r=0 w=0 time=299181
us)
17788 INDEX RANGE SCAN RELATIONSHIPS_N3 (cr=71 r=0 w=0 time=81158 us)(object
id 34528)
3634 TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=0 w=0 time=25536 us)
1831 TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=163252
us)
1831 INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=115113
us)(object id 34316)
us)
4264 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=3678 r=0 w=0 time=224390
us)
id 34345)
us)
us)

It is no wonder that this query took so long when five participants were selected. With each join to the
RELATIONSHIPS table, the Cartesian product gets more and more out of hand. By the final sort to eliminate the
duplicates, we have over 64 million rows to reckon with.

This query struck me right away as a good candidate for a nested loops semi-join. We want a list of assignments that
meet certain criteria. Whether a specific user has one qualifying relationship record or 10, it makes no difference. We
are only concerned with the existence of at least one qualifying relationship record for each of the specified users.

Rewriting the query with EXISTS clauses to semi-join to the RELATIONSHIPS table five times is pretty easy. The
challenge here is that the query does have a DISTINCT operator, and this defeats Oracle’s semi-join access paths. Not
to worry—an inline view with a NO_MERGE hint saved the day. The revised query looked like:
SELECT /*+ NO_MERGE (M) */
DISTINCT M.name, M.code, M.description,
M.item_id, M.assignment_id, M.string0, M.string1
FROM (
SELECT A.name, A.code, A.description,
A.item_id, A.assignment_id, FI.string0, FI.string1
FROM relationships R, assignments A, format_items FI
WHERE R.user_id = 134546
AND R.account_id = 134545
AND R.type_code = 0
AND A.item_id = R.item_id
AND FI.item_id = A.item_id
AND EXISTS
(SELECT 1 FROM relationships R1
WHERE R1.item_id = A.item_id AND R1.status = 5
AND R1.user_id = 137279)
AND EXISTS
AND EXISTS
AND EXISTS
AND EXISTS
) M
ORDER BY M.name ASC;



The execution plan was as follows:
------- ---------------------------------------------------
1300 VIEW (cr=36315 r=89 w=0 time=1085116 us)
us)
id 34528)
3634 TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=89 w=0 time=91899 us)
1831 TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=141416
us)
1831 INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=100207
1322 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5377 r=0 w=0 time=92046 us)
id 34345)
id 34345)
id 34345)
2234 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=49373 us)(object id
34345)
1300 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=38832 us)(object id
34345)

The execution plan starts out the same as before, with a hash join between RELATIONSHIPS and ASSIGNMENTS
followed by a nested loops join to FORMAT_ITEMS. At this point there are 1831 rows in the candidate result set.
Whereas the original query does five conventional joins to the RELATIONSHIPS table and the 1831 rows balloon to
over 64 million, the revised query does semi-joins instead and whittles the candidate rows down to 1300. Joining 1300
rows is a lot faster than joining 64 million rows, so you can see how the semi-join access path really speeds up this
query.

The NO_MERGE hint and the inline view (aliased as “M”) causes Oracle to separate the DISTINCT operation from
the bulk of the query. This is how we get around the fact that a DISTINCT operation defeats semi-join access paths. In
the second version of the query, Oracle executes the query as if the DISTINCT keyword were not there. Then, as a
final step, it applies the DISTINCT operation to the interim result set that it has stored in memory (the output of the
VIEW operator).

If you are wondering why the NO_MERGE hint is necessary, recall what we discussed earlier about the heuristic
Oracle uses when deciding whether or not to merge a subquery into the main body of a query. Oracle assumes that
merging is a good thing and will do it whenever possible. Thus without the NO_MERGE hint Oracle would merge the
inline view into the DISTINCT operation (eliminating the inline view), and the semi-join access path would be
defeated by the presence of the DISTINCT operation.

This example demonstrates the value of the semi-join access paths. Your gains may not be as spectacular as we have
seen here, but hopefully you can see from this discussion how proper use of semi-joins can make a certain class of
queries much more efficient. This example also underscores the importance of knowing how to properly use semi-
joins. If you did not remember that semi-join access paths are defeated by the DISTINCT operation, you might have
changed the example query to use EXISTS clauses and wondered why the query still ran slowly.

The NOT EXISTS and NOT IN Constructs

In this section we will discuss how Oracle evaluates NOT EXISTS and NOT IN clauses, prerequisites for using



Oracle’s anti-join access paths, and hints that influence anti-join query optimization. Then we’ll look at a few
examples of how anti-join access paths can be used to make some queries more efficient.

How Oracle Evaluates NOT EXISTS and NOT IN Clauses

NOT EXISTS and NOT IN clauses are used to retrieve rows from a table for which no match is found in another
table. In the early days, Oracle would use the subquery of a NOT EXISTS or NOT IN clause as a filter. For each
candidate row retrieved by the main body of the query, Oracle would execute the subquery. If the subquery retrieved
one or more rows, then the candidate row would be discarded. Conceptually this approach seems similar to a nested
loops anti-join.

Somewhere around Oracle 7.3, the merge anti-join and hash anti-join access paths were added. These enabled Oracle
to leverage the merge and hash join algorithms when performing anti-joins. However, Oracle would only use these
algorithms if the always_anti_join instance parameter was set appropriately or if a hint was added to the query. In
Oracle 8i this still appears to be true. Oracle 9i, however, does away with the always_anti_join instance parameter and
instead the optimizer evaluates the costs of the different algorithms and chooses the one with the lowest cost.

In an earlier section of this paper we discussed when the EXISTS clause should be used versus when the IN clause
should be used. The two are functionally equivalent, and actively choosing one over the other is usually done in the
hopes that this will help the optimizer choose a better execution plan. When talking about NOT EXISTS and NOT IN,
however, it is important to note that these two constructs are not functionally equivalent—there is a significant
difference between the two in how null values are handled.

If the subquery of a NOT IN clause returns at least one null value, then the NOT IN predicate evaluates to false. The
NOT EXISTS construct, meanwhile, concerns itself only with whether a row is returned or not, and thus does not do
anything differently if a null value is returned. This may seem a bit abstract, so let’s clarify the point with an example.

Suppose you want a list of empty departments—departments that have no employees. You could write this query with
a NOT EXISTS clause or a NOT IN clause. We saw the NOT EXISTS case earlier:
FROM dept D
WHERE NOT EXISTS
(
SELECT 1
FROM emp E
)
ORDER BY D.deptno;

Written with a NOT IN clause, the query becomes:
FROM dept D
WHERE D.deptno NOT IN
(
SELECT E.deptno
FROM emp E
)
ORDER BY D.deptno;

The queries look like they would be functionally equivalent, but if you are using the EMP and DEPT tables from the
SCOTT schema of an Oracle 9i database, then the two queries are in fact different. Insert a row into the EMP table
with a null DEPTNO, run the two versions of the query, and you will see the difference. The NOT EXISTS version
will show you that department 40 has no employees. However, the NOT IN version of the query will say there are no
empty departments. Why? Because the subquery of the NOT IN clause retrieves a row with a null value, causing the
NOT IN clause to always evaluate to false. As strange as this NOT IN nuance may seem, it is fully documented in
Metalink document 28934.1.

The difference between NOT EXISTS and NOT IN with respect to handling of nulls is an important one—not just
from the standpoint of differing functionality, but also from a performance perspective as well. To evaluate a NOT IN



subquery that is capable of returning nulls, Oracle cannot scan an index because an implicit NVL() function must be
applied and this would defeat the index. (Null values aren’t stored in single-column non-unique indexes, anyway.)
We’ll see the impact of this implicit NVL() function in the examples later in this section.

If the subquery of a NOT IN clause is physically incapable of returning a null value (because the columns involved
have NOT NULL constraints or because predicates have been added to the WHERE clause requiring non-null values),
then the query with the NOT IN clause will return the same results as a query with an equivalent NOT EXISTS clause
—and indexes won’t be defeated in the process.

If we wanted to write the query that shows empty departments with a NOT IN clause that gave the same results as the
NOT EXISTS version, we would need to add a predicate to the subquery that rules out null values. The resulting query
is:
FROM dept D
WHERE D.deptno NOT IN
(
SELECT E.deptno
FROM emp E
WHERE E.deptno IS NOT NULL
)
ORDER BY D.deptno;

To choose whether to use NOT EXISTS or NOT IN in a query, therefore, you must first consider how you want null
values to be handled. If you need the presence of a null value in the subquery to cause the expression to always
evaluate to false, then you need to use the NOT IN construct. Otherwise you have a choice between NOT EXISTS and
NOT IN (with the extra WHERE predicate if the subquery would otherwise be capable of returning a null value).

If you don’t need the special null value semantics of NOT IN, then your choice of whether to use NOT EXISTS or
NOT IN hinges on what sort of access path you think will be best for the query. In a conventional join, the nested loops
algorithm is desirable when the predicates on the first table are very selective and the join columns in the second table
are selectively indexed. The merge and hash join algorithms, meanwhile, are more desirable when predicates are not
very selective or the join columns in the second table are not selectively indexed. A similar rule of thumb applies to
anti-joins. That is to say, merge and hash anti-joins are most valuable in situations where the first table has a lot of
candidate rows or the join columns are not selectively indexed.

If you think your query is going to be better off with a merge or hash anti-join than a nested loops anti-join or a filter
operation, then you might want to code your query with a NOT IN clause instead of a NOT EXISTS clause. Whenever
the Oracle documentation discusses anti-join access paths, it always seems to do so in the context of the NOT IN
construct with no mention of NOT EXISTS. While sometimes Oracle 9i will choose an anti-join access path for a
query with a NOT EXISTS clause, it does seem that Oracle 9i might be more likely to do so if NOT IN is used instead
of NOT EXISTS. In Oracle 8i, I believe the query optimizer will never choose an anti-join access path for a query
with a NOT EXISTS clause (even if always_anti_join is set and a hint appears in the subquery).

Meanwhile, if you don’t think your query will benefit from a merge or hash anti-join, then you might be best off with
the good old filter technique instead of the nested loops anti-join access path. It appears that, at least in Oracle 9i, the
nested loops anti-join might not be as efficient as it could be. In the example of the query for empty departments that
we have been discussing, the filter technique will rule out a department as soon as the first employee in that
department is located. The nested loops anti-join, meanwhile, appears to locate all employees in the department before
ruling it out. This is unnecessary work. Whether a department has one employee or 100, it does not qualify as an
empty department. So Oracle should cut to the chase and move on to the next department as soon as the first employee
is found. The filter technique does just that, but the nested loops anti-join (at least in the cases I have seen) does not.
The second example in this section will demonstrate this.

The upshot is that you must first consider how null values should be handled when deciding whether to use NOT
EXISTS or NOT IN. If you need the special semantics provided by NOT IN, then your decision has been made.
Otherwise, you should next consider whether or not the query might benefit from a merge or hash anti-join. If so, then
you probably ought to choose NOT IN. If you decide to go with NOT IN but do not want the expression to evaluate to



false if a null value is found, then make sure the subquery cannot return a null value. If there is no chance that the
query will benefit from a merge or hash anti-join and the special semantics of NOT IN are not desired, then you might
want to select the NOT EXISTS construct so that there is a better chance Oracle will perform an efficient filter instead
of an inefficient nested loops anti-join.

Prerequisites and Hints that Impact Anti-Joins

The key prerequisite for the anti-join access paths is that the subquery of a NOT IN clause cannot be capable of
returning a null value. As mentioned before, this means that either the columns being selected must have NOT NULL
constraints or there must be predicates in the WHERE clause of the subquery to ensure there are no nulls. Even if
every row in the table has a non-null value in a nullable column, you must still add the extra predicate to the WHERE
clause or else Oracle will refuse to use the merge and hash anti-join access paths.

Assuming that an anti-join access path is possible, Oracle chooses the anti-join algorithm to use while evaluating
execution plans. In Oracle 9i an anti-join can be performed using the nested loops, merge, or hash join algorithms. As
with a conventional join, Oracle will choose the join algorithm with the lowest cost, or it may choose to use the filter
technique instead of an anti-join altogether. In Oracle 8i an anti-join can be performed using the merge or hash join
algorithms, but as with semi-joins, Oracle 8i will not choose the join algorithm with the lowest cost. Instead, the
always_anti_join instance parameter dictates which approach Oracle 8i should use. (Oracle 8i does not implement the
nested loops anti-join access path, so if the always_anti_join parameter is set to nested_loops, the filter technique is
used instead.) The always_anti_join instance parameter is obsolete in Oracle 9i.

Oracle provides the NL_AJ, MERGE_AJ, and HASH_AJ hints in order for you to manipulate the anti-join process if
you need to. As with the semi-join hints, the hint is applied to the subquery and not the main body of the query itself.

Anti-Join Example #1

Suppose you want a list of customers who have not placed an order within the last ten days. You might start with a
query that looks like:
FROM customers C
WHERE NOT EXISTS
(
SELECT 1
FROM orders O
)

The execution plan looks like:
------- ---------------------------------------------------
11 FILTER (cr=18707 r=301 w=0 time=22491555 us)
989 VIEW (cr=18669 r=265 w=0 time=22365647 us)
100000 INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2207 r=208 w=0 time=277584 us)(object
id 34555)
5338680 INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16462 r=57 w=0 time=9036516

You can see that the query took over 22 seconds to run and performed 18,707 logical reads. Oracle did not use an anti-
join access path in this execution plan, instead opting to read every row in the CUSTOMERS table and use the
subquery as a filter by executing it once for each customer. Although Oracle found a clever way to hash join two
indexes on the ORDERS table and thereby make the subquery very efficient, the query overall is still resource
intensive because the subquery must execute 1000 times—once for each customer.

This query seems like a good candidate for an anti-join access path because executing the subquery once and



performing a merge or hash anti-join might be more efficient than performing the subquery once for each customer as
a filter operation. We could rewrite the query using a NOT IN clause instead of NOT EXISTS to see if Oracle chooses
an anti-join access path:
FROM customers C
WHERE C.customer_id NOT IN
(
SELECT O.customer_id
FROM orders O
WHERE O.order_date > SYSDATE - 10
)

Unfortunately, the results are quite disappointing—elapsed time shot up to 695 seconds! The execution plan looks like:
------- ---------------------------------------------------
989 TABLE ACCESS BY INDEX ROWID ORDERS (cr=5360711 r=4870595 w=0 time=695088325
us)
5359590 INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16520 r=0 w=0 time=22299946

You can see that Oracle did not choose an anti-join access path. Instead, Oracle is still reading every row from the
CUSTOMERS table and executing the subquery once for each of the 1000 customers. Moreover, Oracle is no longer
hash joining the two ORDERS indexes to make the subquery very efficient. For each customer, Oracle is now
grabbing all orders from the last 10 days and sifting through them to find those belonging to the customer currently
being processed. It turns out that the CUSTOMER_ID column in the ORDERS table is nullable, and so Oracle
conceptually rewrites the query in order to implement the special null value semantics of NOT IN. The rewritten query
looks like:
FROM customers C
WHERE NOT EXISTS
(
SELECT 1
FROM orders O
WHERE O.order_date > SYSDATE – 10
AND NVL (O.customer_id, C.customer_id) = C.customer_id
)

You can see how the implicit NVL() function that Oracle inserts during the transformation would be necessary in order
to cause the predicate to evaluate to false if a null value is returned. You can also see how this defeats the index on the
CUSTOMER_ID column of the ORDERS table.

You might wonder why the CUSTOMER_ID column in the ORDERS table can be null. Let’s suppose the developers
tell you that this handles a special case and you should not worry about it. The developers want the query to ignore
nulls. Thus you can add an extra predicate to the WHERE clause of the subquery in order to meet the prerequisite for
anti-join access paths. The resulting query is:
FROM customers C
WHERE C.customer_id NOT IN
(
SELECT O.customer_id
FROM orders O
AND O.customer_id IS NOT NULL
)

The execution plan becomes:



------- ---------------------------------------------------
11 HASH JOIN ANTI (cr=311 r=132 w=98 time=1463770 us)
20910 VIEW (cr=273 r=98 w=98 time=1318222 us)
20910 INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=58 r=0 w=0 time=43870 us)(object id
34556)
100000 INDEX FAST FULL SCAN ORDERS_N1_CUST_ID (cr=215 r=0 w=0 time=196953

You can see that Oracle now chooses to perform a hash anti-join. The CUSTOMERS table gets a full table scan as
before, but the critical difference is that now the subquery executes just once. The two indexes on the ORDERS table
are hash joined once in order to develop a list of order dates for each customer, and this is hash anti-joined to the
CUSTOMERS table in order to yield the list of customers with no orders in the last 10 days. The result is a query that
completes in 1.46 seconds and performs just 311 logical reads.

In this example we did not need to use hints to get Oracle to find the most efficient execution plan. All we had to do
was write the query in a way that made it eligible for the anti-join access paths. (This entailed using NOT IN instead of
NOT EXISTS, and adding a predicate to the subquery’s WHERE clause to make it impossible for the subquery to
return a null value.)


Along the lines of the previous example, suppose you needed to find out if a specific customer has not placed an order
within the last 10 days. Your query might look like:
FROM customers C
WHERE C.customer_id = 22
AND NOT EXISTS
(
SELECT 1
FROM orders O
AND O.customer_id = C.customer_id
)

The execution plan looks like:
------- ---------------------------------------------------
0 TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=6 r=0 w=0 time=341 us)
0 INDEX UNIQUE SCAN CUSTOMERS_PK (cr=6 r=0 w=0 time=333 us)(object id 34552)
2 INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=65 us)(object id 34555)

As you can see, Oracle uses a filter operation instead of an anti-join. This is not surprising, because we used the NOT
EXISTS construct instead of NOT IN. The query performed 6 logical reads and completed in 341 microseconds.
Customer 22 has 110 orders in the system. By looking at the row counts in the execution plan, you can see that Oracle
stopped the filter process as soon as it found the first order placed by the customer within the last 10 days. There is no
need to look at all of the orders placed by this customer because we’ve already found one that excludes the customer
from the result set.

This is an example of a query with an anti-join where the main query is extremely selective—we are only looking at
one customer record. This begs the question: Would this be a good opportunity to use the nested loops anti-join access
path? It is hard to imagine an execution plan for this query that is more efficient than what we have just seen, but let’s
try. The query rewritten with a NOT IN clause and an NL_AJ hint looks like:
FROM customers C
WHERE C.customer_id = 22
AND C.customer_id NOT IN
(
SELECT /*+ NL_AJ */ O.customer_id



FROM orders O
AND O.customer_id IS NOT NULL
)

The execution plan is:
------- ---------------------------------------------------
0 NESTED LOOPS ANTI (cr=115 r=86 w=0 time=16724 us)
1 TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=3 r=2 w=0 time=467 us)
1 INDEX UNIQUE SCAN CUSTOMERS_PK (cr=2 r=2 w=0 time=429 us)(object id34552)
110 INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=308 us)(object id
34555)

You can see that Oracle obeyed our hint and used the nested loops anti-join access path. Unfortunately, the number of
logical reads went up from 6 to 115 and the elapsed time went up by a factor of 50 to 16824 microseconds. Why? The
row counts in the execution plan tell the story. Oracle looked at all 110 orders from the customer and found that 10
were placed less than 10 days ago. Oracle did a lot of unnecessary work. In this regard the filter access path seems
much smarter than the nested loops anti-join access path. This behavior was seen on an Oracle 9.2.0.4 database. Let’s
hope Oracle fixes this in a future release. Until then, I can’t think of a case where the nested loops anti-join access
path would be useful.

By the way, it turns out that Oracle would have chosen the nested loops anti-join access path even without the NL_AJ
hint. So it turns out that this query was more efficient when written with a NOT EXISTS than a NOT IN. Again,
hopefully this is just an anomaly that will be fixed in a future release of Oracle.


I recently used the concepts presented in this section to bring a client’s query down from 33 minutes (where one CPU
on the server was pegged) to less than four seconds. It turned out to be another case of the implicit NVL() function in a
NOT IN clause defeating an index. This time it was a query that showed a count of calls to the customer service center
placed by users who did not belong to corporate customers. The original query looked like this:
SELECT COUNT(*)
FROM calls C
WHERE C.requestor_user_id NOT IN
(
SELECT CM.member_user_id
FROM corp_members CM
);

(Once again names have been changed to protect the innocent.) The execution plan looked like this:
------- ---------------------------------------------------
1 SORT AGGREGATE (cr=12784272 r=1864678 w=0 time=1978321835 us)
184965 TABLE ACCESS FULL CALLS (cr=3588 r=1370 w=0 time=979769 us)
61032 TABLE ACCESS FULL CORP_MEMBERS (cr=12780684 r=1863308 w=0 time=2948544268 us)

It appears that Oracle is reading rows from the CALLS table one at a time and, for each row, is scanning the
CORP_MEMBERS table until it finds a row with a matching MEMBER_USER_ID. Since there are over 180,000 calls
in the system, it takes over 180,000 partial scans of the CORP_MEMBERS table to complete this query. (The
CORP_MEMBERS table has 584 blocks below its high water mark.)

A quick check of the CORP_MEMBERS table shows that the MEMBER_USER_ID column is indexed. However, the
column is also nullable and so Oracle cannot use the index because of the implicit NVL() function associated with the
NOT IN clause.

Since the client saw records in the CORP_MEMBERS table with a null MEMBER_USER_ID as an unrelated special



case, they wanted the query to ignore such records. So my first step was to rewrite the query with the extra WHERE
predicate in the subquery that makes the NOT IN clause ignore nulls:
SELECT COUNT(*)
FROM calls C
WHERE C.requestor_user_id NOT IN
(
SELECT CM.member_user_id
WHERE CM.member_user_id IS NOT NULL
);

The resulting execution plan was:
------- ---------------------------------------------------
0 HASH JOIN ANTI (cr=3790 r=3906 w=420 time=5450591 us)
81945 INDEX FAST FULL SCAN CORP_MEMBERS_USER_ID (cr=202 r=6 w=0 time=113178

Now not only did Oracle use the index on the MEMBER_USER_ID column when accessing the CORP_MEMBERS
table, but more importantly Oracle also switched to a hash anti-join access path. This means that Oracle only had to
access the CORP_MEMBERS table once instead of more than 180,000 times. The execution time dropped by 99% to
5.45 seconds.

The users were thrilled with the improvement and the problem was considered resolved. But to better understand
Oracle’s data access paths and join methods, I benchmarked two more versions of the query. First I added a
MERGE_AJ hint to cause a merge anti-join instead of a hash anti-join. This version of the query performed the same
number of logical reads as the hash anti-join but used a little more CPU time. In this situation it looked like the merge
anti-join used up more CPU time on sorting than did the hash anti-join.

I also tried rewriting the query with a NOT EXISTS clause to see how it would perform using a filter operation instead
of an anti-join access path. This version of the query looked like:
SELECT COUNT(*)
FROM calls C
WHERE NOT EXISTS
(
SELECT 1
WHERE CM.member_user_id = C.requestor_user_id
);

The execution plan was:
------- ---------------------------------------------------
61032 INDEX RANGE SCAN CORP_MEMBERS_USER_ID (cr=122064 r=0 w=0 time=1842121

This version of the query performed a lot more logical reads than the hash anti-join version, but actually used less
CPU time. This query appeared to hammer on the blocks of the CORP_MEMBERS_USER_ID index very heavily.
Most likely the whole index segment was sitting in the buffer cache and thus the logical reads were relatively
inexpensive. Meanwhile, the large sort required by a hash anti-join was avoided and the result was a net reduction in
CPU time.

These examples demonstrate the value of the anti-join access paths and the significance of the way NOT IN treats null
values. Again, your gains may not be as spectacular as those shown here, but hopefully you can see from this
discussion how the proper use of anti-joins and NOT EXISTS and NOT IN clauses can make a certain class of queries
much more efficient. These examples also demonstrate the importance of the difference between NOT EXISTS and



NOT IN, and the apparent behavior (although not fully documented as far as I can tell) that Oracle typically will not
apply anti-join access paths to a NOT EXISTS clause.

Conclusion

The semi-join and anti-join are two special ways of joining data from multiple tables in a query. Although SQL does
not have a special symbol or keyword to designate a semi- or anti-join, we use the EXISTS, IN, NOT EXISTS, and
NOT IN constructs to indicate when one of these joins is desired. Over the years Oracle has gotten better at optimizing
semi-joins and anti-joins, and new data access paths make it possible to write very efficient queries that use these two
types of joins. However, there are certain prerequisites to be met before Oracle can use these access paths, and
sometimes hints or specific coding styles are necessary in order for the query optimizer to find the best execution plan
possible.

Understanding how semi-joins and anti-joins work—and how Oracle implements the relevant data access paths—will
enable you to write very efficient SQL and dramatically speed up an entire class of queries. This in turn will make
your employer (or client) very happy.

About the Author

Roger Schrag, OCP, has been an Oracle DBA and application architect for over 15 years. He began his career at
Oracle Corporation on the Oracle Financials development team. In 1995, he founded Database Specialists, Inc., a
boutique consulting firm specializing in Oracle database technology, remote administration, and performance tuning.
Since that time, Roger has focused his expertise in the area of performance optimization. Roger is a frequent speaker at
Oracle OpenWorld and the International Oracle Users Group (IOUG) Live conferences, where he has frequently been
voted in the Top 10% of speakers. Roger has been an Oracle Masters Class instructor for the IOUG and is the current
president of the Northern California Oracle Users Group (NoCOUG). He can be reached at rschrag@dbspecialists.com.

Still Looking for Help on this Subject?
Get a Consultation
We would be happy to talk with you about our services and how our senior-level database team might help
you. Call Database Specialists at 415-344-0500 or 888-648-0500 or fill out a free
consultation request form.
Complimentary Newsletter
If you'd like to receive our complimentary monthly newsletter with database tips and new white paper
announcements, sign up for The Specialist.

Copyright © 2005 Database Specialists, Inc. http://www.dbspecialists.com


Speeding up queries with semi joins and anti-joins

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Speeding up queries with semi joins and anti-joins

Similar to Speeding up queries with semi joins and anti-joins (20)

More from Kaing Menglieng

More from Kaing Menglieng (20)

Speeding up queries with semi joins and anti-joins