SlideShare a Scribd company logo
1 of 15
Download to read offline
Speeding Up Queries with Semi-Joins and Anti-Joins


  Speeding Up Queries with Semi-Joins and Anti-Joins: How
  Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN
  by Roger Schrag
  Database Specialists, Inc.

             About Database Specialists, Inc.
             Database Specialists, Inc. provides remote DBA services and onsite database support for your mission critical
             Oracle systems. Since 1995, we have been providing Oracle database consulting in Solaris, HP-UX, Linux, AIX,
             and Windows environments. We are DBAs, speakers, educators, and authors. Our team is continually recognized
             by Oracle, at national conferences and by leading trade publications. Learn more about our remote DBA,
             database tuning, and consulting services. Or, call us at      415-344-0500     or        888-648-0500      .


  Introduction

  Optimizing SQL usually gives the most significant results when DBAs are called upon to “make the system run
  faster.” Using tools like Statspack or Enterprise Manager, it is often easy to find the slow SQL. But how do you make
  the queries run faster? That is the challenge! In this paper we will discuss the semi-join and the anti-join, two powerful
  SQL constructs Oracle offers for use in your quest for faster queries. In particular, we will define these two terms,
  discuss when and why you might want to use the [NOT] EXISTS or [NOT] IN constructs, and demonstrate how you
  can use optimizer hints and make minor query changes in order to enable Oracle to use some very powerful and
  efficient access paths.

  For certain classes of queries, these features can dramatically reduce logical reads, physical reads, CPU time, and
  elapsed time. But beware! There are some obscure (and not always easy to find) requirements that must be met in
  order for Oracle to deploy the semi- and anti-join access paths. If you fail to dot an “I” or cross a “T,” you could be
  banging your head against the wall for hours. In this paper we will look at SQL from a real application and
  demonstrate the “before” and “after” versions of queries that ran orders of magnitude faster once semi-joins and anti-
  joins were implemented correctly.

  Before we get started, I want to point out that Oracle Corporation has improved its optimization techniques
  significantly over the years, specifically in the area of semi-joins and anti-joins. This paper was written using Oracle 9i
  release 9.2.0.4, and all examples were validated in such an environment. If you are using a newer or older release of
  Oracle, it is quite possible that you will see different results. Please do not take anything you read in this paper—or in
  any paper for that matter—for granted. Validate every hypothesis in your environment before relying upon it.

  Semi-Join Defined

  A “semi-join” between two tables returns rows from the first table where one or more matches are found in the second
  table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most
  once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be
  returned. Semi-joins are written using the EXISTS or IN constructs.

  Suppose you have the DEPT and EMP tables in the SCOTT schema and you want a list of departments with at least
  one employee. You could write the query with a conventional join:
               SELECT        D.deptno, D.dname
               FROM          dept D, emp E
               WHERE         E.deptno = D.deptno
               ORDER BY      D.deptno;

  Unfortunately, if a department has 400 employees then that department will appear in the query output 400 times. You
  could eliminate the duplicate rows by using the DISTINCT keyword, but you would be making Oracle do more work
  than necessary. Really what you want to do is specify a semi-join between the DEPT and EMP tables instead of a


http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  conventional join:
               SELECT   D.deptno, D.dname
               FROM     dept D
               WHERE    EXISTS
                        (
                        SELECT 1
                        FROM   emp E
                        WHERE E.deptno = D.deptno
                        )
               ORDER BY D.deptno;

  The above query will list the departments that have at least one employee. Whether a department has one employee or
  100, the department will appear only once in the query output. Moreover, Oracle will move on to the next department
  as soon as it finds the first employee in a department, instead of finding all of the employees in each department.

  Anti-Join Defined

  An “anti-join” between two tables returns rows from the first table where no matches are found in the second table.
  An anti-join is essentially the opposite of a semi-join: While a semi-join returns one copy of each row in the first table
  for which at least one match is found, an anti-join returns one copy of each row in the first table for which no match is
  found. Anti-joins are written using the NOT EXISTS or NOT IN constructs. These two constructs differ in how they
  handle nulls—a subtle but very important distinction which we will discuss later.

  Suppose you want a list of empty departments—departments with no employees. You could write a query that finds all
  departments and subtracts off the department of each employee:
               SELECT        D1.deptno, D1.dname
               FROM          dept D1
               MINUS
               SELECT        D2.deptno, D2.dname
               FROM          dept D2, emp E2
               WHERE         D2.deptno = E2.deptno
               ORDER BY      1;

  The above query will give the desired results, but it might be clearer to write the query using an anti-join:
               SELECT   D.deptno, D.dname
               FROM     dept D
               WHERE    NOT EXISTS
                        (
                        SELECT 1
                        FROM   emp E
                        WHERE E.deptno = D.deptno
                        )
               ORDER BY D.deptno;

  The above query is more efficient because Oracle can employ an anti-join access path. The difference in efficiency
  here is akin to the difference between a nested loops join and a hash join when you are joining every row in one table
  to another.

  The EXISTS and IN Constructs

  In this section we will discuss how Oracle evaluates EXISTS and IN clauses, prerequisites for using Oracle’s semi-
  join access paths, and hints that influence semi-join query optimization. Then we’ll look at a few sample queries where
  semi-join access paths can be used to make the query more efficient.

  How Oracle Evaluates EXISTS and IN Clauses

  The SQL language as implemented in Oracle does not include a syntax for explicitly performing semi-joins, so you
  use an EXISTS or IN clause with a subquery to indicate to Oracle that a semi-join should be performed instead of a
  conventional join. Oracle will typically transform the subquery into a join wherever possible.



http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  An interesting characteristic of Oracle’s query optimizer is that (according to Metalink document 144967.1), in Oracle
  8i and 9i the decision of whether or not to transform a query by merging its subqueries into the main body is a
  heuristic one. Basically, Oracle will merge subqueries if it can, regardless of impact on cost. The assumption is that
  merging is always a good thing. In fact, Oracle performs query transformations (merging subqueries into the main
  body) before making a list of possible execution plans and evaluating their cost. This bit of trivia will come into play
  when we discuss the second example in this section.

  The Oracle Performance Tuning Guide in the Oracle 8i documentation set explains that EXISTS and IN are processed
  differently by the query optimizer. The manual explains a rule of thumb for when to use each, and provides an
  example to demonstrate. The Oracle 9i Performance Tuning Guide provides the same rule of thumb with a slightly
  more polished example. The Oracle 10g Performance Tuning Guide contains substantially the same text as the Oracle
  9i manual on this subject.

  The rule of thumb goes like this: If the main body of your query is highly selective, then an EXISTS clause might be
  more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the
  subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate.

  The example in the Oracle 8i Performance Tuning Guide shows a query where the main body is not selective, while
  the target of the semi-join is highly selective. The example demonstrates that Oracle chooses a poor execution plan
  when an EXISTS clause is used, and a much more efficient execution plan when an IN clause is used instead. I was
  easily able to reproduce this example in an Oracle 8i environment with the same results as shown in the Performance
  Tuning Guide.

  However, when I reproduced this same example in an Oracle 9i environment, Oracle chose an identical (and efficient)
  execution plan for both versions of the query. Furthermore, the example in the Oracle 9i Performance Tuning Guide
  gave an identical (and efficient) execution plan for both versions of the query in my Oracle 9i environment. These
  tests echo my previous experience—that Oracle 9i is smarter than Oracle 8i about EXISTS and IN clauses, and you
  don’t typically need to worry about when to use EXISTS versus IN.

  The upshot is that if you are writing queries for an Oracle 8i (or earlier) system, then you should think about the
  selectivity of the predicates in the main query versus the subquery and choose whether to use an EXISTS or an IN
  clause wisely. If you are using Oracle 9i or later, it doesn’t hurt to follow this rule of thumb, but you may not need to
  worry as much.

  Prerequisites and Hints that Impact Semi-Joins

  There are a few situations in which Oracle cannot use a semi-join access path. If the query contains a DISTINCT
  operation (explicitly or implicitly through a set operator such as UNION) then Oracle cannot transform EXISTS or IN
  clauses into something that could use a semi-join access path. The same is true if the EXISTS or IN clause is part of an
  OR branch. (Semi-join access paths can be used in queries that contain ORs in the WHERE clause, just as long as the
  EXISTS or IN is not part of the OR.)

  Assuming that a semi-join access path is possible, Oracle chooses the semi-join algorithm to use while evaluating
  execution plans. In Oracle 9i a semi-join can be performed using the nested loops, hash join, or merge join algorithms.
  As with a conventional join, Oracle 9i will choose the join algorithm with the lowest cost. In Oracle 8i a semi-join can
  be performed using the hash join or merge join algorithms, but Oracle 8i will not choose the join algorithm with the
  lowest cost. Instead, the always_semi_join instance parameter dictates whether or not an Oracle 8i database should
  perform semi-joins and, if so, which algorithm to use. The always_semi_join instance parameter is obsolete in Oracle
  9i.

  If a semi-join access path is not possible (because of a DISTINCT operation or an OR predicate), or if instance
  parameters specify a semi-join should not be used, then Oracle will choose an alternate data access path. This usually
  amounts to performing a conventional join and sorting out the duplicates, or scanning the first table and using the
  second table as a filter.



http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  Oracle provides the NL_SJ, HASH_SJ, and MERGE_SJ hints in order for you to manipulate the semi-join process if
  you need to. The hint is applied to the subquery of the EXISTS or IN clause, not the main body of the query itself. In
  my experience Oracle does not need you to provide a hint in order to consider the efficient semi-join access paths.
  However, in Oracle 9i or 10g, where you have a wider choice of semi-join algorithms, you might want to use hints if
  Oracle chooses a hash semi-join when a nested loops semi-join would in fact be better, or vice versa.

  Semi-Join Example #1

  Suppose you want a list of your gold-status customers who have placed an order within the last three days. You might
  start with a query that looks like:
               SELECT        DISTINCT C.short_name, C.customer_id
               FROM          customers C, orders O
               WHERE         C.customer_type = 'Gold'
               AND           O.customer_id = C.customer_id
               AND           O.order_date > SYSDATE - 3
               ORDER BY      C.short_name;

  The execution plan (from a TKPROF report) for this query is:
               Rows           Row Source Operation
               -------        ---------------------------------------------------
                     2        SORT UNIQUE (cr=33608 r=30076 w=0 time=6704029 us)
                    20         HASH JOIN (cr=33608 r=30076 w=0 time=6703101 us)
                    10          TABLE ACCESS FULL CUSTOMERS (cr=38 r=36 w=0 time=31718 us)
                  2990          TABLE ACCESS FULL ORDERS (cr=33570 r=30040 w=0 time=6646420 us)

  You can see that Oracle did a (conventional) hash join between the CUSTOMERS and ORDERS tables, and then
  performed a sort for uniqueness to filter out the duplicate records. This sort was necessary because the query is
  supposed to show a customer only once, even if they have more than one order placed within the last three days. (A
  sort to satisfy the ORDER BY clause is also necessary, but Oracle tackles this at the same time as the sort for
  uniqueness and thus it does not appear as a separate step in the execution plan.)

  Using what we have discussed so far, you might want to rewrite the query using a semi-join:
               SELECT   C.short_name, C.customer_id
               FROM     customers C
               WHERE    C.customer_type = 'Gold'
               AND      EXISTS
                        (
                        SELECT 1
                        FROM   orders O
                        WHERE O.customer_id = C.customer_id
                        AND    O.order_date > SYSDATE - 3
                        )
               ORDER BY C.short_name;

  The execution plan for this query becomes:
               Rows           Row Source Operation
               -------        ---------------------------------------------------
                     2        SORT ORDER BY (cr=33608 r=29921 w=0 time=6422770 us)
                     2         HASH JOIN SEMI (cr=33608 r=29921 w=0 time=6422538 us)
                    10          TABLE ACCESS FULL CUSTOMERS (cr=38 r=0 w=0 time=61290 us)
                  2990          TABLE ACCESS FULL ORDERS (cr=33570 r=29921 w=0 time=6345754 us)

  The “SEMI” that you see on the second line indicates that a semi-join access path is being used instead of a
  conventional join. A semi-join is more efficient here for two reasons: (1) The search through the ORDERS table for a
  given customer can stop as soon as the first qualifying order is found, and (2) There is no need to sort the results to
  remove duplicate customer records in the case where a customer had multiple qualifying orders. (You can see that the
  sort for uniqueness, which took 928 microseconds, has been replaced by a sort to satisfy the ORDER BY clause, which
  took only 232 microseconds.)

  Rewriting the query to use a semi-join access path has perhaps been a novel experience, but, so far, it hasn’t done us a
  substantial amount of good. We brought the elapsed time down about 4%, but that doesn’t seem too significant, and it


http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  could just be the result of chance variation in our test environment. But suppose you are aware of the fact that very few
  of your customers are gold-status. Perhaps a nested loops semi-join would be more efficient. You can add an NL_SJ
  hint and see the results for yourself:
               SELECT   C.short_name, C.customer_id
               FROM     customers C
               WHERE    C.customer_type = 'Gold'
               AND      EXISTS
                        (
                        SELECT /*+ NL_SJ */ 1
                        FROM   orders O
                        WHERE O.customer_id = C.customer_id
                        AND    O.order_date > SYSDATE - 3
                        )
               ORDER BY C.short_name;
               Rows           Row Source Operation
               -------        ---------------------------------------------------
                      2       SORT ORDER BY (cr=833 r=725 w=0 time=358431 us)
                      2        NESTED LOOPS SEMI (cr=833 r=725 w=0 time=358232 us)
                     10         TABLE ACCESS FULL CUSTOMERS (cr=38 r=0 w=0 time=2210 us)
                      2         TABLE ACCESS BY INDEX ROWID ORDERS (cr=795 r=725 w=0 time=355822 us)
                    780          INDEX RANGE SCAN ORDERS_N1 (cr=15 r=13 w=0 time=5601 us)(object id 34176)

  Indeed, the nested loops semi-join brought the consistent reads down to 833 from 33608 and reduced the execution
  time by over 90%. So it appears that Oracle chose a hash semi-join when a nested loops semi-join was more efficient.
  Once we get Oracle to see the light with the NL_SJ hint, we end up with a much more efficient query than we started
  with. We achieved this performance gain by leveraging Oracle’s nested loops semi-join access path.

  Semi-Join Example #2

  I recently worked on a project where applying the concepts presented in this section brought the execution time for a
  key query in my client’s application down from 46 minutes to 1.1 seconds. A screen in the application allowed a user
  to enter a project owner and up to five participants, and a list of assignments involving all of the specified people
  would be displayed. The users found that the response time was tolerable if they only selected one participant, but it
  went downhill from there as they added more participants. The original version of the query looked like this:
               SELECT   DISTINCT A.name, A.code, A.description,
                                 A.item_id, A.assignment_id, FI.string0, FI.string1
               FROM     relationships R, assignments A, format_items FI,
                        relationships R1, relationships R2, relationships R3,
                        relationships R4, relationships R5
               WHERE    R.user_id = 134546
               AND      R.account_id = 134545
               AND      R.type_code = 0
               AND      A.item_id = R.item_id
               AND      FI.item_id = A.item_id
               AND      R1.item_id = A.item_id AND R1.status = 5 AND R1.user_id = 137279
               AND      R2.item_id = A.item_id AND R2.status = 5 AND R2.user_id = 134555
               AND      R3.item_id = A.item_id AND R3.status = 5 AND R3.user_id = 134546
               AND      R4.item_id = A.item_id AND R4.status = 5 AND R4.user_id = 137355
               AND      R5.item_id = A.item_id AND R5.status = 5 AND R5.user_id = 134556
               ORDER BY A.name ASC;

  (I have changed the names of tables and columns, stripped away extraneous columns and predicates, and eliminated
  bind variables in order to both protect my client’s confidentiality and make the example easier to understand.) The
  execution plan looked like this:
          Rows     Row Source Operation
          ------- ---------------------------------------------------
               642 SORT UNIQUE (cr=23520269 r=34 w=0 time=2759937104 us)
          64339104    TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=23520269 r=34 w=0
  time=1838051782 us)
          95184881     NESTED LOOPS (cr=7842642 r=23 w=0 time=907238095 us)
          2710288      NESTED LOOPS (cr=2266544 r=23 w=0 time=103840003 us)
           317688       NESTED LOOPS (cr=484734 r=11 w=0 time=23494451 us)
            50952        NESTED LOOPS (cr=43280 r=10 w=0 time=2688237 us)
             4146         NESTED LOOPS (cr=19016 r=3 w=0 time=988374 us)
             1831          NESTED LOOPS (cr=13353 r=0 w=0 time=608296 us)
             4121           HASH JOIN (cr=7395 r=0 w=0 time=399488 us)



http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

                    2046                    TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=7211 r=0 w=0 time=299181
  us)
                  17788                       INDEX RANGE SCAN RELATIONSHIPS_N3 (cr=71 r=0 w=0 time=81158 us)(object
  id 34528)
                    3634                   TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=0 w=0 time=25536 us)
                    1831                  TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=163252
  us)
              1831          INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=115113
  us)(object id 34316)
              4146        TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5663 r=3 w=0 time=349554
  us)
              4264         INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=3678 r=0 w=0 time=224390
  us)(object id 34345)
             50952       TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=24264 r=7 w=0 time=1538051
  us)
             70976        INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=8428 r=0 w=0 time=630831 us)(object
  id 34345)
            317688      TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=441454 r=1 w=0 time=19584539
  us)
          1631032        INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=108302 r=0 w=0 time=7213989
  us)(object id 34345)
          2710288      TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=1781810 r=12 w=0 time=72015925
  us)
          4237104       INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=643298 r=1 w=0 time=29018764
  us)(object id 34345)
          92474592     INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=5576098 r=0 w=0 time=503165776
  us)(object id 34345)

  It is no wonder that this query took so long when five participants were selected. With each join to the
  RELATIONSHIPS table, the Cartesian product gets more and more out of hand. By the final sort to eliminate the
  duplicates, we have over 64 million rows to reckon with.

  This query struck me right away as a good candidate for a nested loops semi-join. We want a list of assignments that
  meet certain criteria. Whether a specific user has one qualifying relationship record or 10, it makes no difference. We
  are only concerned with the existence of at least one qualifying relationship record for each of the specified users.

  Rewriting the query with EXISTS clauses to semi-join to the RELATIONSHIPS table five times is pretty easy. The
  challenge here is that the query does have a DISTINCT operator, and this defeats Oracle’s semi-join access paths. Not
  to worry—an inline view with a NO_MERGE hint saved the day. The revised query looked like:
               SELECT   /*+ NO_MERGE (M) */
                        DISTINCT M.name, M.code, M.description,
                                 M.item_id, M.assignment_id, M.string0, M.string1
               FROM     (
                        SELECT A.name, A.code, A.description,
                               A.item_id, A.assignment_id, FI.string0, FI.string1
                        FROM   relationships R, assignments A, format_items FI
                        WHERE R.user_id = 134546
                        AND    R.account_id = 134545
                        AND    R.type_code = 0
                        AND    A.item_id = R.item_id
                        AND    FI.item_id = A.item_id
                        AND    EXISTS
                               (SELECT 1 FROM relationships R1
                                WHERE R1.item_id = A.item_id AND R1.status = 5
                                AND   R1.user_id = 137279)
                        AND    EXISTS
                               (SELECT 1 FROM relationships R2
                                WHERE R2.item_id = A.item_id AND R2.status = 5
                                AND   R2.user_id = 134555)
                        AND    EXISTS
                               (SELECT 1 FROM relationships R3
                                WHERE R3.item_id = A.item_id AND R3.status = 5
                                AND   R3.user_id = 134546)
                        AND    EXISTS
                               (SELECT 1 FROM relationships R4
                                WHERE R4.item_id = A.item_id AND R4.status = 5
                                AND   R4.user_id = 137355)
                        AND    EXISTS
                               (SELECT 1 FROM relationships R5
                                WHERE R5.item_id = A.item_id AND R5.status = 5
                                AND   R5.user_id = 134556)
                        ) M
               ORDER BY M.name ASC;



http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  The execution plan was as follows:
               Rows           Row Source Operation
               -------        ---------------------------------------------------
                    642       SORT UNIQUE (cr=36315 r=89 w=0 time=1107054 us)
                  1300         VIEW (cr=36315 r=89 w=0 time=1085116 us)
                  1300          NESTED LOOPS SEMI (cr=36315 r=89 w=0 time=1082232 us)
                  1314           NESTED LOOPS SEMI (cr=32385 r=89 w=0 time=1002330 us)
                  1314            NESTED LOOPS SEMI (cr=28261 r=89 w=0 time=904654 us)
                  1314             NESTED LOOPS SEMI (cr=22822 r=89 w=0 time=737705 us)
                  1322              NESTED LOOPS SEMI (cr=18730 r=89 w=0 time=651196 us)
                  1831               NESTED LOOPS (cr=13353 r=89 w=0 time=530670 us)
                  4121                HASH JOIN (cr=7395 r=89 w=0 time=347584 us)
                  2046                 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=7211 r=0 w=0 time=186820
  us)
                  17788                       INDEX RANGE SCAN RELATIONSHIPS_N3 (cr=71 r=0 w=0 time=43770 us)(object
  id 34528)
                    3634                   TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=89 w=0 time=91899 us)
                    1831                  TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=141416
  us)
             1831            INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=100207
  us)(object id 34316)
             1322          TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5377 r=0 w=0 time=92046 us)
             2472           INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=3664 r=0 w=0 time=61077 us)(object
  id 34345)
             1314         TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=4092 r=0 w=0 time=63478 us)
             1582          INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2647 r=0 w=0 time=40433 us)(object
  id 34345)
             1314        TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5439 r=0 w=0 time=133620 us)
            11011         INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2639 r=0 w=0 time=65312 us)(object
  id 34345)
             1314       TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=4124 r=0 w=0 time=75657 us)
             2234        INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=49373 us)(object id
  34345)
             1300      TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=3930 r=0 w=0 time=58651 us)
             1300       INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=38832 us)(object id
  34345)

  The execution plan starts out the same as before, with a hash join between RELATIONSHIPS and ASSIGNMENTS
  followed by a nested loops join to FORMAT_ITEMS. At this point there are 1831 rows in the candidate result set.
  Whereas the original query does five conventional joins to the RELATIONSHIPS table and the 1831 rows balloon to
  over 64 million, the revised query does semi-joins instead and whittles the candidate rows down to 1300. Joining 1300
  rows is a lot faster than joining 64 million rows, so you can see how the semi-join access path really speeds up this
  query.

  The NO_MERGE hint and the inline view (aliased as “M”) causes Oracle to separate the DISTINCT operation from
  the bulk of the query. This is how we get around the fact that a DISTINCT operation defeats semi-join access paths. In
  the second version of the query, Oracle executes the query as if the DISTINCT keyword were not there. Then, as a
  final step, it applies the DISTINCT operation to the interim result set that it has stored in memory (the output of the
  VIEW operator).

  If you are wondering why the NO_MERGE hint is necessary, recall what we discussed earlier about the heuristic
  Oracle uses when deciding whether or not to merge a subquery into the main body of a query. Oracle assumes that
  merging is a good thing and will do it whenever possible. Thus without the NO_MERGE hint Oracle would merge the
  inline view into the DISTINCT operation (eliminating the inline view), and the semi-join access path would be
  defeated by the presence of the DISTINCT operation.

  This example demonstrates the value of the semi-join access paths. Your gains may not be as spectacular as we have
  seen here, but hopefully you can see from this discussion how proper use of semi-joins can make a certain class of
  queries much more efficient. This example also underscores the importance of knowing how to properly use semi-
  joins. If you did not remember that semi-join access paths are defeated by the DISTINCT operation, you might have
  changed the example query to use EXISTS clauses and wondered why the query still ran slowly.

  The NOT EXISTS and NOT IN Constructs

  In this section we will discuss how Oracle evaluates NOT EXISTS and NOT IN clauses, prerequisites for using

http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  Oracle’s anti-join access paths, and hints that influence anti-join query optimization. Then we’ll look at a few
  examples of how anti-join access paths can be used to make some queries more efficient.

  How Oracle Evaluates NOT EXISTS and NOT IN Clauses

  NOT EXISTS and NOT IN clauses are used to retrieve rows from a table for which no match is found in another
  table. In the early days, Oracle would use the subquery of a NOT EXISTS or NOT IN clause as a filter. For each
  candidate row retrieved by the main body of the query, Oracle would execute the subquery. If the subquery retrieved
  one or more rows, then the candidate row would be discarded. Conceptually this approach seems similar to a nested
  loops anti-join.

  Somewhere around Oracle 7.3, the merge anti-join and hash anti-join access paths were added. These enabled Oracle
  to leverage the merge and hash join algorithms when performing anti-joins. However, Oracle would only use these
  algorithms if the always_anti_join instance parameter was set appropriately or if a hint was added to the query. In
  Oracle 8i this still appears to be true. Oracle 9i, however, does away with the always_anti_join instance parameter and
  instead the optimizer evaluates the costs of the different algorithms and chooses the one with the lowest cost.

  In an earlier section of this paper we discussed when the EXISTS clause should be used versus when the IN clause
  should be used. The two are functionally equivalent, and actively choosing one over the other is usually done in the
  hopes that this will help the optimizer choose a better execution plan. When talking about NOT EXISTS and NOT IN,
  however, it is important to note that these two constructs are not functionally equivalent—there is a significant
  difference between the two in how null values are handled.

  If the subquery of a NOT IN clause returns at least one null value, then the NOT IN predicate evaluates to false. The
  NOT EXISTS construct, meanwhile, concerns itself only with whether a row is returned or not, and thus does not do
  anything differently if a null value is returned. This may seem a bit abstract, so let’s clarify the point with an example.

  Suppose you want a list of empty departments—departments that have no employees. You could write this query with
  a NOT EXISTS clause or a NOT IN clause. We saw the NOT EXISTS case earlier:
               SELECT   D.deptno, D.dname
               FROM     dept D
               WHERE    NOT EXISTS
                        (
                        SELECT 1
                        FROM   emp E
                        WHERE E.deptno = D.deptno
                        )
               ORDER BY D.deptno;

  Written with a NOT IN clause, the query becomes:
               SELECT   D.deptno, D.dname
               FROM     dept D
               WHERE    D.deptno NOT IN
                        (
                        SELECT E.deptno
                        FROM   emp E
                        )
               ORDER BY D.deptno;

  The queries look like they would be functionally equivalent, but if you are using the EMP and DEPT tables from the
  SCOTT schema of an Oracle 9i database, then the two queries are in fact different. Insert a row into the EMP table
  with a null DEPTNO, run the two versions of the query, and you will see the difference. The NOT EXISTS version
  will show you that department 40 has no employees. However, the NOT IN version of the query will say there are no
  empty departments. Why? Because the subquery of the NOT IN clause retrieves a row with a null value, causing the
  NOT IN clause to always evaluate to false. As strange as this NOT IN nuance may seem, it is fully documented in
  Metalink document 28934.1.

  The difference between NOT EXISTS and NOT IN with respect to handling of nulls is an important one—not just
  from the standpoint of differing functionality, but also from a performance perspective as well. To evaluate a NOT IN

http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  subquery that is capable of returning nulls, Oracle cannot scan an index because an implicit NVL() function must be
  applied and this would defeat the index. (Null values aren’t stored in single-column non-unique indexes, anyway.)
  We’ll see the impact of this implicit NVL() function in the examples later in this section.

  If the subquery of a NOT IN clause is physically incapable of returning a null value (because the columns involved
  have NOT NULL constraints or because predicates have been added to the WHERE clause requiring non-null values),
  then the query with the NOT IN clause will return the same results as a query with an equivalent NOT EXISTS clause
  —and indexes won’t be defeated in the process.

  If we wanted to write the query that shows empty departments with a NOT IN clause that gave the same results as the
  NOT EXISTS version, we would need to add a predicate to the subquery that rules out null values. The resulting query
  is:
               SELECT   D.deptno, D.dname
               FROM     dept D
               WHERE    D.deptno NOT IN
                        (
                        SELECT E.deptno
                        FROM   emp E
                        WHERE E.deptno IS NOT NULL
                        )
               ORDER BY D.deptno;

  To choose whether to use NOT EXISTS or NOT IN in a query, therefore, you must first consider how you want null
  values to be handled. If you need the presence of a null value in the subquery to cause the expression to always
  evaluate to false, then you need to use the NOT IN construct. Otherwise you have a choice between NOT EXISTS and
  NOT IN (with the extra WHERE predicate if the subquery would otherwise be capable of returning a null value).

  If you don’t need the special null value semantics of NOT IN, then your choice of whether to use NOT EXISTS or
  NOT IN hinges on what sort of access path you think will be best for the query. In a conventional join, the nested loops
  algorithm is desirable when the predicates on the first table are very selective and the join columns in the second table
  are selectively indexed. The merge and hash join algorithms, meanwhile, are more desirable when predicates are not
  very selective or the join columns in the second table are not selectively indexed. A similar rule of thumb applies to
  anti-joins. That is to say, merge and hash anti-joins are most valuable in situations where the first table has a lot of
  candidate rows or the join columns are not selectively indexed.

  If you think your query is going to be better off with a merge or hash anti-join than a nested loops anti-join or a filter
  operation, then you might want to code your query with a NOT IN clause instead of a NOT EXISTS clause. Whenever
  the Oracle documentation discusses anti-join access paths, it always seems to do so in the context of the NOT IN
  construct with no mention of NOT EXISTS. While sometimes Oracle 9i will choose an anti-join access path for a
  query with a NOT EXISTS clause, it does seem that Oracle 9i might be more likely to do so if NOT IN is used instead
  of NOT EXISTS. In Oracle 8i, I believe the query optimizer will never choose an anti-join access path for a query
  with a NOT EXISTS clause (even if always_anti_join is set and a hint appears in the subquery).

  Meanwhile, if you don’t think your query will benefit from a merge or hash anti-join, then you might be best off with
  the good old filter technique instead of the nested loops anti-join access path. It appears that, at least in Oracle 9i, the
  nested loops anti-join might not be as efficient as it could be. In the example of the query for empty departments that
  we have been discussing, the filter technique will rule out a department as soon as the first employee in that
  department is located. The nested loops anti-join, meanwhile, appears to locate all employees in the department before
  ruling it out. This is unnecessary work. Whether a department has one employee or 100, it does not qualify as an
  empty department. So Oracle should cut to the chase and move on to the next department as soon as the first employee
  is found. The filter technique does just that, but the nested loops anti-join (at least in the cases I have seen) does not.
  The second example in this section will demonstrate this.

  The upshot is that you must first consider how null values should be handled when deciding whether to use NOT
  EXISTS or NOT IN. If you need the special semantics provided by NOT IN, then your decision has been made.
  Otherwise, you should next consider whether or not the query might benefit from a merge or hash anti-join. If so, then
  you probably ought to choose NOT IN. If you decide to go with NOT IN but do not want the expression to evaluate to

http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  false if a null value is found, then make sure the subquery cannot return a null value. If there is no chance that the
  query will benefit from a merge or hash anti-join and the special semantics of NOT IN are not desired, then you might
  want to select the NOT EXISTS construct so that there is a better chance Oracle will perform an efficient filter instead
  of an inefficient nested loops anti-join.

  Prerequisites and Hints that Impact Anti-Joins

  The key prerequisite for the anti-join access paths is that the subquery of a NOT IN clause cannot be capable of
  returning a null value. As mentioned before, this means that either the columns being selected must have NOT NULL
  constraints or there must be predicates in the WHERE clause of the subquery to ensure there are no nulls. Even if
  every row in the table has a non-null value in a nullable column, you must still add the extra predicate to the WHERE
  clause or else Oracle will refuse to use the merge and hash anti-join access paths.

  Assuming that an anti-join access path is possible, Oracle chooses the anti-join algorithm to use while evaluating
  execution plans. In Oracle 9i an anti-join can be performed using the nested loops, merge, or hash join algorithms. As
  with a conventional join, Oracle will choose the join algorithm with the lowest cost, or it may choose to use the filter
  technique instead of an anti-join altogether. In Oracle 8i an anti-join can be performed using the merge or hash join
  algorithms, but as with semi-joins, Oracle 8i will not choose the join algorithm with the lowest cost. Instead, the
  always_anti_join instance parameter dictates which approach Oracle 8i should use. (Oracle 8i does not implement the
  nested loops anti-join access path, so if the always_anti_join parameter is set to nested_loops, the filter technique is
  used instead.) The always_anti_join instance parameter is obsolete in Oracle 9i.

  Oracle provides the NL_AJ, MERGE_AJ, and HASH_AJ hints in order for you to manipulate the anti-join process if
  you need to. As with the semi-join hints, the hint is applied to the subquery and not the main body of the query itself.

  Anti-Join Example #1

  Suppose you want a list of customers who have not placed an order within the last ten days. You might start with a
  query that looks like:
               SELECT   C.short_name, C.customer_id
               FROM     customers C
               WHERE    NOT EXISTS
                        (
                        SELECT 1
                        FROM   orders O
                        WHERE O.customer_id = C.customer_id
                        AND    O.order_date > SYSDATE - 10
                        )
               ORDER BY C.short_name;

  The execution plan looks like:
          Rows     Row Source Operation
          ------- ---------------------------------------------------
                11 SORT ORDER BY (cr=18707 r=301 w=0 time=22491917 us)
                11  FILTER (cr=18707 r=301 w=0 time=22491555 us)
              1000   TABLE ACCESS FULL CUSTOMERS (cr=38 r=36 w=0 time=15277 us)
               989   VIEW (cr=18669 r=265 w=0 time=22365647 us)
               989     HASH JOIN (cr=18669 r=265 w=0 time=22347234 us)
            100000      INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2207 r=208 w=0 time=277584 us)(object
  id 34555)
          5338680       INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16462 r=57 w=0 time=9036516
  us)(object id 34556)

  You can see that the query took over 22 seconds to run and performed 18,707 logical reads. Oracle did not use an anti-
  join access path in this execution plan, instead opting to read every row in the CUSTOMERS table and use the
  subquery as a filter by executing it once for each customer. Although Oracle found a clever way to hash join two
  indexes on the ORDERS table and thereby make the subquery very efficient, the query overall is still resource
  intensive because the subquery must execute 1000 times—once for each customer.

  This query seems like a good candidate for an anti-join access path because executing the subquery once and


http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  performing a merge or hash anti-join might be more efficient than performing the subquery once for each customer as
  a filter operation. We could rewrite the query using a NOT IN clause instead of NOT EXISTS to see if Oracle chooses
  an anti-join access path:
               SELECT   C.short_name, C.customer_id
               FROM     customers C
               WHERE    C.customer_id NOT IN
                        (
                        SELECT O.customer_id
                        FROM   orders O
                        WHERE O.order_date > SYSDATE - 10
                        )
               ORDER BY C.short_name;

  Unfortunately, the results are quite disappointing—elapsed time shot up to 695 seconds! The execution plan looks like:
               Rows           Row Source Operation
               -------        ---------------------------------------------------
                     11       SORT ORDER BY (cr=5360749 r=4870724 w=0 time=695232973 us)
                     11        FILTER (cr=5360749 r=4870724 w=0 time=695232475 us)
                  1000          TABLE ACCESS FULL CUSTOMERS (cr=38 r=129 w=0 time=61614 us)
                    989         TABLE ACCESS BY INDEX ROWID ORDERS (cr=5360711 r=4870595 w=0 time=695088325
  us)
          5359590      INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16520 r=0 w=0 time=22299946
  us)(object id 34556)

  You can see that Oracle did not choose an anti-join access path. Instead, Oracle is still reading every row from the
  CUSTOMERS table and executing the subquery once for each of the 1000 customers. Moreover, Oracle is no longer
  hash joining the two ORDERS indexes to make the subquery very efficient. For each customer, Oracle is now
  grabbing all orders from the last 10 days and sifting through them to find those belonging to the customer currently
  being processed. It turns out that the CUSTOMER_ID column in the ORDERS table is nullable, and so Oracle
  conceptually rewrites the query in order to implement the special null value semantics of NOT IN. The rewritten query
  looks like:
               SELECT   C.short_name, C.customer_id
               FROM     customers C
               WHERE    NOT EXISTS
                        (
                        SELECT 1
                        FROM   orders O
                        WHERE O.order_date > SYSDATE – 10
                        AND    NVL (O.customer_id, C.customer_id) = C.customer_id
                        )
               ORDER BY C.short_name;

  You can see how the implicit NVL() function that Oracle inserts during the transformation would be necessary in order
  to cause the predicate to evaluate to false if a null value is returned. You can also see how this defeats the index on the
  CUSTOMER_ID column of the ORDERS table.

  You might wonder why the CUSTOMER_ID column in the ORDERS table can be null. Let’s suppose the developers
  tell you that this handles a special case and you should not worry about it. The developers want the query to ignore
  nulls. Thus you can add an extra predicate to the WHERE clause of the subquery in order to meet the prerequisite for
  anti-join access paths. The resulting query is:
               SELECT   C.short_name, C.customer_id
               FROM     customers C
               WHERE    C.customer_id NOT IN
                        (
                        SELECT O.customer_id
                        FROM   orders O
                        WHERE O.order_date > SYSDATE - 10
                        AND    O.customer_id IS NOT NULL
                        )
               ORDER BY C.short_name;

  The execution plan becomes:
               Rows           Row Source Operation


http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

               -------        ---------------------------------------------------
                    11        SORT ORDER BY (cr=311 r=132 w=98 time=1464035 us)
                    11         HASH JOIN ANTI (cr=311 r=132 w=98 time=1463770 us)
                  1000          TABLE ACCESS FULL CUSTOMERS (cr=38 r=34 w=0 time=37976 us)
                 20910          VIEW (cr=273 r=98 w=98 time=1318222 us)
                 20910           HASH JOIN (cr=273 r=98 w=98 time=1172207 us)
                 20910            INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=58 r=0 w=0 time=43870 us)(object id
  34556)
           100000      INDEX FAST FULL SCAN ORDERS_N1_CUST_ID (cr=215 r=0 w=0 time=196953
  us)(object id 34555)

  You can see that Oracle now chooses to perform a hash anti-join. The CUSTOMERS table gets a full table scan as
  before, but the critical difference is that now the subquery executes just once. The two indexes on the ORDERS table
  are hash joined once in order to develop a list of order dates for each customer, and this is hash anti-joined to the
  CUSTOMERS table in order to yield the list of customers with no orders in the last 10 days. The result is a query that
  completes in 1.46 seconds and performs just 311 logical reads.

  In this example we did not need to use hints to get Oracle to find the most efficient execution plan. All we had to do
  was write the query in a way that made it eligible for the anti-join access paths. (This entailed using NOT IN instead of
  NOT EXISTS, and adding a predicate to the subquery’s WHERE clause to make it impossible for the subquery to
  return a null value.)

  Anti-Join Example #2

  Along the lines of the previous example, suppose you needed to find out if a specific customer has not placed an order
  within the last 10 days. Your query might look like:
               SELECT   C.short_name, C.customer_id
               FROM     customers C
               WHERE    C.customer_id = 22
               AND      NOT EXISTS
                        (
                        SELECT 1
                        FROM   orders O
                        WHERE O.order_date > SYSDATE - 10
                        AND    O.customer_id = C.customer_id
                        )
               ORDER BY C.short_name;

  The execution plan looks like:
               Rows           Row Source Operation
               -------        ---------------------------------------------------
                     0        TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=6 r=0 w=0 time=341 us)
                     0         INDEX UNIQUE SCAN CUSTOMERS_PK (cr=6 r=0 w=0 time=333 us)(object id 34552)
                     1          TABLE ACCESS BY INDEX ROWID ORDERS (cr=4 r=0 w=0 time=170 us)
                     2           INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=65 us)(object id 34555)

  As you can see, Oracle uses a filter operation instead of an anti-join. This is not surprising, because we used the NOT
  EXISTS construct instead of NOT IN. The query performed 6 logical reads and completed in 341 microseconds.
  Customer 22 has 110 orders in the system. By looking at the row counts in the execution plan, you can see that Oracle
  stopped the filter process as soon as it found the first order placed by the customer within the last 10 days. There is no
  need to look at all of the orders placed by this customer because we’ve already found one that excludes the customer
  from the result set.

  This is an example of a query with an anti-join where the main query is extremely selective—we are only looking at
  one customer record. This begs the question: Would this be a good opportunity to use the nested loops anti-join access
  path? It is hard to imagine an execution plan for this query that is more efficient than what we have just seen, but let’s
  try. The query rewritten with a NOT IN clause and an NL_AJ hint looks like:
               SELECT        C.short_name, C.customer_id
               FROM          customers C
               WHERE         C.customer_id = 22
               AND           C.customer_id NOT IN
                             (
                             SELECT /*+ NL_AJ */ O.customer_id


http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

                        FROM   orders O
                        WHERE O.order_date > SYSDATE - 10
                        AND    O.customer_id IS NOT NULL
                        )
               ORDER BY C.short_name;

  The execution plan is:
               Rows           Row Source Operation
               -------        ---------------------------------------------------
                      0       SORT ORDER BY (cr=115 r=86 w=0 time=16824 us)
                      0        NESTED LOOPS ANTI (cr=115 r=86 w=0 time=16724 us)
                      1         TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=3 r=2 w=0 time=467 us)
                      1          INDEX UNIQUE SCAN CUSTOMERS_PK (cr=2 r=2 w=0 time=429 us)(object id34552)
                     10         TABLE ACCESS BY INDEX ROWID ORDERS (cr=112 r=84 w=0 time=16213 us)
                    110          INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=308 us)(object id
  34555)

  You can see that Oracle obeyed our hint and used the nested loops anti-join access path. Unfortunately, the number of
  logical reads went up from 6 to 115 and the elapsed time went up by a factor of 50 to 16824 microseconds. Why? The
  row counts in the execution plan tell the story. Oracle looked at all 110 orders from the customer and found that 10
  were placed less than 10 days ago. Oracle did a lot of unnecessary work. In this regard the filter access path seems
  much smarter than the nested loops anti-join access path. This behavior was seen on an Oracle 9.2.0.4 database. Let’s
  hope Oracle fixes this in a future release. Until then, I can’t think of a case where the nested loops anti-join access
  path would be useful.

  By the way, it turns out that Oracle would have chosen the nested loops anti-join access path even without the NL_AJ
  hint. So it turns out that this query was more efficient when written with a NOT EXISTS than a NOT IN. Again,
  hopefully this is just an anomaly that will be fixed in a future release of Oracle.

  Anti-Join Example #3

  I recently used the concepts presented in this section to bring a client’s query down from 33 minutes (where one CPU
  on the server was pegged) to less than four seconds. It turned out to be another case of the implicit NVL() function in a
  NOT IN clause defeating an index. This time it was a query that showed a count of calls to the customer service center
  placed by users who did not belong to corporate customers. The original query looked like this:
               SELECT COUNT(*)
               FROM   calls C
               WHERE C.requestor_user_id NOT IN
                      (
                      SELECT CM.member_user_id
                      FROM    corp_members CM
                      );

  (Once again names have been changed to protect the innocent.) The execution plan looked like this:
               Rows           Row Source Operation
               -------        ---------------------------------------------------
                     1        SORT AGGREGATE (cr=12784272 r=1864678 w=0 time=1978321835 us)
                     0         FILTER (cr=12784272 r=1864678 w=0 time=1978321817 us)
                184965          TABLE ACCESS FULL CALLS (cr=3588 r=1370 w=0 time=979769 us)
                 61032          TABLE ACCESS FULL CORP_MEMBERS (cr=12780684 r=1863308 w=0 time=2948544268 us)

  It appears that Oracle is reading rows from the CALLS table one at a time and, for each row, is scanning the
  CORP_MEMBERS table until it finds a row with a matching MEMBER_USER_ID. Since there are over 180,000 calls
  in the system, it takes over 180,000 partial scans of the CORP_MEMBERS table to complete this query. (The
  CORP_MEMBERS table has 584 blocks below its high water mark.)

  A quick check of the CORP_MEMBERS table shows that the MEMBER_USER_ID column is indexed. However, the
  column is also nullable and so Oracle cannot use the index because of the implicit NVL() function associated with the
  NOT IN clause.

  Since the client saw records in the CORP_MEMBERS table with a null MEMBER_USER_ID as an unrelated special


http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  case, they wanted the query to ignore such records. So my first step was to rewrite the query with the extra WHERE
  predicate in the subquery that makes the NOT IN clause ignore nulls:
               SELECT COUNT(*)
               FROM   calls C
               WHERE C.requestor_user_id NOT IN
                      (
                      SELECT CM.member_user_id
                      FROM    corp_members CM
                      WHERE CM.member_user_id IS NOT NULL
                      );

  The resulting execution plan was:
          Rows     Row Source Operation
          ------- ---------------------------------------------------
                1 SORT AGGREGATE (cr=3790 r=3906 w=420 time=5450615 us)
                0   HASH JOIN ANTI (cr=3790 r=3906 w=420 time=5450591 us)
           184965    TABLE ACCESS FULL CALLS (cr=3588 r=3480 w=0 time=646554 us)
            81945    INDEX FAST FULL SCAN CORP_MEMBERS_USER_ID (cr=202 r=6 w=0 time=113178
  us)(object id 34580)

  Now not only did Oracle use the index on the MEMBER_USER_ID column when accessing the CORP_MEMBERS
  table, but more importantly Oracle also switched to a hash anti-join access path. This means that Oracle only had to
  access the CORP_MEMBERS table once instead of more than 180,000 times. The execution time dropped by 99% to
  5.45 seconds.

  The users were thrilled with the improvement and the problem was considered resolved. But to better understand
  Oracle’s data access paths and join methods, I benchmarked two more versions of the query. First I added a
  MERGE_AJ hint to cause a merge anti-join instead of a hash anti-join. This version of the query performed the same
  number of logical reads as the hash anti-join but used a little more CPU time. In this situation it looked like the merge
  anti-join used up more CPU time on sorting than did the hash anti-join.

  I also tried rewriting the query with a NOT EXISTS clause to see how it would perform using a filter operation instead
  of an anti-join access path. This version of the query looked like:
               SELECT COUNT(*)
               FROM   calls C
               WHERE NOT EXISTS
                      (
                      SELECT 1
                      FROM    corp_members CM
                      WHERE CM.member_user_id = C.requestor_user_id
                      );

  The execution plan was:
          Rows     Row Source Operation
          ------- ---------------------------------------------------
                1 SORT AGGREGATE (cr=125652 r=3489 w=0 time=3895569 us)
                0   FILTER (cr=125652 r=3489 w=0 time=3895547 us)
           184965    TABLE ACCESS FULL CALLS (cr=3588 r=3489 w=0 time=906699 us)
            61032    INDEX RANGE SCAN CORP_MEMBERS_USER_ID (cr=122064 r=0 w=0 time=1842121
  us)(object id 34580)

  This version of the query performed a lot more logical reads than the hash anti-join version, but actually used less
  CPU time. This query appeared to hammer on the blocks of the CORP_MEMBERS_USER_ID index very heavily.
  Most likely the whole index segment was sitting in the buffer cache and thus the logical reads were relatively
  inexpensive. Meanwhile, the large sort required by a hash anti-join was avoided and the result was a net reduction in
  CPU time.

  These examples demonstrate the value of the anti-join access paths and the significance of the way NOT IN treats null
  values. Again, your gains may not be as spectacular as those shown here, but hopefully you can see from this
  discussion how the proper use of anti-joins and NOT EXISTS and NOT IN clauses can make a certain class of queries
  much more efficient. These examples also demonstrate the importance of the difference between NOT EXISTS and


http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
Speeding Up Queries with Semi-Joins and Anti-Joins

  NOT IN, and the apparent behavior (although not fully documented as far as I can tell) that Oracle typically will not
  apply anti-join access paths to a NOT EXISTS clause.

  Conclusion

  The semi-join and anti-join are two special ways of joining data from multiple tables in a query. Although SQL does
  not have a special symbol or keyword to designate a semi- or anti-join, we use the EXISTS, IN, NOT EXISTS, and
  NOT IN constructs to indicate when one of these joins is desired. Over the years Oracle has gotten better at optimizing
  semi-joins and anti-joins, and new data access paths make it possible to write very efficient queries that use these two
  types of joins. However, there are certain prerequisites to be met before Oracle can use these access paths, and
  sometimes hints or specific coding styles are necessary in order for the query optimizer to find the best execution plan
  possible.

  Understanding how semi-joins and anti-joins work—and how Oracle implements the relevant data access paths—will
  enable you to write very efficient SQL and dramatically speed up an entire class of queries. This in turn will make
  your employer (or client) very happy.

  About the Author

  Roger Schrag, OCP, has been an Oracle DBA and application architect for over 15 years. He began his career at
  Oracle Corporation on the Oracle Financials development team. In 1995, he founded Database Specialists, Inc., a
  boutique consulting firm specializing in Oracle database technology, remote administration, and performance tuning.
  Since that time, Roger has focused his expertise in the area of performance optimization. Roger is a frequent speaker at
  Oracle OpenWorld and the International Oracle Users Group (IOUG) Live conferences, where he has frequently been
  voted in the Top 10% of speakers. Roger has been an Oracle Masters Class instructor for the IOUG and is the current
  president of the Northern California Oracle Users Group (NoCOUG). He can be reached at rschrag@dbspecialists.com.

             Still Looking for Help on this Subject?
             Get a Consultation
             We would be happy to talk with you about our services and how our senior-level database team might help
             you. Call Database Specialists at       415-344-0500     or        888-648-0500     or fill out a free
             consultation request form.
             Complimentary Newsletter
             If you'd like to receive our complimentary monthly newsletter with database tips and new white paper
             announcements, sign up for The Specialist.




  Copyright © 2005 Database Specialists, Inc. http://www.dbspecialists.com




http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]

More Related Content

What's hot (20)

Sub query_SQL
Sub query_SQLSub query_SQL
Sub query_SQL
 
Sql subquery
Sql subquerySql subquery
Sql subquery
 
Application sql issues_and_tuning
Application sql issues_and_tuningApplication sql issues_and_tuning
Application sql issues_and_tuning
 
Performance tuning
Performance tuningPerformance tuning
Performance tuning
 
TSQL Coding Guidelines
TSQL Coding GuidelinesTSQL Coding Guidelines
TSQL Coding Guidelines
 
plsql les10
 plsql les10 plsql les10
plsql les10
 
PL/SQL 3 DML
PL/SQL 3 DMLPL/SQL 3 DML
PL/SQL 3 DML
 
Database development coding standards
Database development coding standardsDatabase development coding standards
Database development coding standards
 
Subqueries
SubqueriesSubqueries
Subqueries
 
plsql Les08
plsql Les08 plsql Les08
plsql Les08
 
plsql les01
 plsql les01 plsql les01
plsql les01
 
plsql les06
 plsql les06 plsql les06
plsql les06
 
plsql Lec11
plsql Lec11 plsql Lec11
plsql Lec11
 
e computer notes - Subqueries
e computer notes - Subqueriese computer notes - Subqueries
e computer notes - Subqueries
 
PL/SQL CONDICIONALES Y CICLOS
PL/SQL CONDICIONALES Y CICLOSPL/SQL CONDICIONALES Y CICLOS
PL/SQL CONDICIONALES Y CICLOS
 
PL/SQL CURSORES
PL/SQL CURSORESPL/SQL CURSORES
PL/SQL CURSORES
 
plsql Les05
plsql Les05 plsql Les05
plsql Les05
 
Oracle views
Oracle viewsOracle views
Oracle views
 
Subqueries -Oracle DataBase
Subqueries -Oracle DataBaseSubqueries -Oracle DataBase
Subqueries -Oracle DataBase
 
SQL for ETL Testing
SQL for ETL TestingSQL for ETL Testing
SQL for ETL Testing
 

Similar to Speeding up queries with semi joins and anti-joins

White Paper for OMG! Identifying and Refactoring Common SQL...
White Paper for OMG! Identifying and Refactoring Common SQL...White Paper for OMG! Identifying and Refactoring Common SQL...
White Paper for OMG! Identifying and Refactoring Common SQL...Jeff Jacobs
 
SQL – The Natural Language for Analysis - Oracle - Whitepaper - 2431343
SQL – The Natural Language for Analysis - Oracle - Whitepaper - 2431343SQL – The Natural Language for Analysis - Oracle - Whitepaper - 2431343
SQL – The Natural Language for Analysis - Oracle - Whitepaper - 2431343Edgar Alejandro Villegas
 
Sql interview question part 12
Sql interview question part 12Sql interview question part 12
Sql interview question part 12kaashiv1
 
Sql interview question part 12
Sql interview question part 12Sql interview question part 12
Sql interview question part 12kaashiv1
 
Cursors, triggers, procedures
Cursors, triggers, proceduresCursors, triggers, procedures
Cursors, triggers, proceduresVaibhav Kathuria
 
Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008paulguerin
 
Oracle Sql Tuning
Oracle Sql TuningOracle Sql Tuning
Oracle Sql TuningChris Adkin
 
Applyinga blockcentricapproach
Applyinga blockcentricapproachApplyinga blockcentricapproach
Applyinga blockcentricapproachoracle documents
 
Excel for SEO -from Distilled UK
Excel for SEO -from Distilled UKExcel for SEO -from Distilled UK
Excel for SEO -from Distilled UKEldad Sotnick-Yogev
 
Oracle DB Performance Tuning Tips
Oracle DB Performance Tuning TipsOracle DB Performance Tuning Tips
Oracle DB Performance Tuning TipsAsanka Dilruk
 
Db performance optimization with indexing
Db performance optimization with indexingDb performance optimization with indexing
Db performance optimization with indexingRajeev Kumar
 
Dbms important questions and answers
Dbms important questions and answersDbms important questions and answers
Dbms important questions and answersLakshmiSarvani6
 
DB2 Interview Questions - Part 1
DB2 Interview Questions - Part 1DB2 Interview Questions - Part 1
DB2 Interview Questions - Part 1ReKruiTIn.com
 
Answer and provide 1 reference. Advanced SQL and PLSQLE.docx
Answer and provide 1 reference. Advanced SQL and PLSQLE.docxAnswer and provide 1 reference. Advanced SQL and PLSQLE.docx
Answer and provide 1 reference. Advanced SQL and PLSQLE.docxSHIVA101531
 
Advanced MySQL Query Optimizations
Advanced MySQL Query OptimizationsAdvanced MySQL Query Optimizations
Advanced MySQL Query OptimizationsDave Stokes
 
2012_893_Klier_doc_v2.pdf
2012_893_Klier_doc_v2.pdf2012_893_Klier_doc_v2.pdf
2012_893_Klier_doc_v2.pdfssuser2421e4
 
Sql interview question part 8
Sql interview question part 8Sql interview question part 8
Sql interview question part 8kaashiv1
 
CIS 515 Discussion post responses.Respondto the colleagu.docx
CIS 515 Discussion post responses.Respondto the colleagu.docxCIS 515 Discussion post responses.Respondto the colleagu.docx
CIS 515 Discussion post responses.Respondto the colleagu.docxsleeperharwell
 

Similar to Speeding up queries with semi joins and anti-joins (20)

White Paper for OMG! Identifying and Refactoring Common SQL...
White Paper for OMG! Identifying and Refactoring Common SQL...White Paper for OMG! Identifying and Refactoring Common SQL...
White Paper for OMG! Identifying and Refactoring Common SQL...
 
SQL – The Natural Language for Analysis - Oracle - Whitepaper - 2431343
SQL – The Natural Language for Analysis - Oracle - Whitepaper - 2431343SQL – The Natural Language for Analysis - Oracle - Whitepaper - 2431343
SQL – The Natural Language for Analysis - Oracle - Whitepaper - 2431343
 
Sql interview question part 12
Sql interview question part 12Sql interview question part 12
Sql interview question part 12
 
Ebook12
Ebook12Ebook12
Ebook12
 
Sql interview question part 12
Sql interview question part 12Sql interview question part 12
Sql interview question part 12
 
Cursors, triggers, procedures
Cursors, triggers, proceduresCursors, triggers, procedures
Cursors, triggers, procedures
 
Semi join
Semi joinSemi join
Semi join
 
Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008
 
Oracle Sql Tuning
Oracle Sql TuningOracle Sql Tuning
Oracle Sql Tuning
 
Applyinga blockcentricapproach
Applyinga blockcentricapproachApplyinga blockcentricapproach
Applyinga blockcentricapproach
 
Excel for SEO -from Distilled UK
Excel for SEO -from Distilled UKExcel for SEO -from Distilled UK
Excel for SEO -from Distilled UK
 
Oracle DB Performance Tuning Tips
Oracle DB Performance Tuning TipsOracle DB Performance Tuning Tips
Oracle DB Performance Tuning Tips
 
Db performance optimization with indexing
Db performance optimization with indexingDb performance optimization with indexing
Db performance optimization with indexing
 
Dbms important questions and answers
Dbms important questions and answersDbms important questions and answers
Dbms important questions and answers
 
DB2 Interview Questions - Part 1
DB2 Interview Questions - Part 1DB2 Interview Questions - Part 1
DB2 Interview Questions - Part 1
 
Answer and provide 1 reference. Advanced SQL and PLSQLE.docx
Answer and provide 1 reference. Advanced SQL and PLSQLE.docxAnswer and provide 1 reference. Advanced SQL and PLSQLE.docx
Answer and provide 1 reference. Advanced SQL and PLSQLE.docx
 
Advanced MySQL Query Optimizations
Advanced MySQL Query OptimizationsAdvanced MySQL Query Optimizations
Advanced MySQL Query Optimizations
 
2012_893_Klier_doc_v2.pdf
2012_893_Klier_doc_v2.pdf2012_893_Klier_doc_v2.pdf
2012_893_Klier_doc_v2.pdf
 
Sql interview question part 8
Sql interview question part 8Sql interview question part 8
Sql interview question part 8
 
CIS 515 Discussion post responses.Respondto the colleagu.docx
CIS 515 Discussion post responses.Respondto the colleagu.docxCIS 515 Discussion post responses.Respondto the colleagu.docx
CIS 515 Discussion post responses.Respondto the colleagu.docx
 

More from Kaing Menglieng

What is your sql server backup strategy tech_republic
What is your sql server backup strategy    tech_republicWhat is your sql server backup strategy    tech_republic
What is your sql server backup strategy tech_republicKaing Menglieng
 
Using sql server 2008's merge statement tech republic
Using sql server 2008's merge statement   tech republicUsing sql server 2008's merge statement   tech republic
Using sql server 2008's merge statement tech republicKaing Menglieng
 
Using object dependencies in sql server 2008 tech republic
Using object dependencies in sql server 2008   tech republicUsing object dependencies in sql server 2008   tech republic
Using object dependencies in sql server 2008 tech republicKaing Menglieng
 
Using hash fields in sql server tech republic
Using hash fields in sql server   tech republicUsing hash fields in sql server   tech republic
Using hash fields in sql server tech republicKaing Menglieng
 
Using grouping sets in sql server 2008 tech republic
Using grouping sets in sql server 2008   tech republicUsing grouping sets in sql server 2008   tech republic
Using grouping sets in sql server 2008 tech republicKaing Menglieng
 
Understand when to use user defined functions in sql server tech-republic
Understand when to use user defined functions in sql server   tech-republicUnderstand when to use user defined functions in sql server   tech-republic
Understand when to use user defined functions in sql server tech-republicKaing Menglieng
 
Sql server indexed views speed up your select queries part 1 - code-projec
Sql server indexed views   speed up your select queries  part 1 - code-projecSql server indexed views   speed up your select queries  part 1 - code-projec
Sql server indexed views speed up your select queries part 1 - code-projecKaing Menglieng
 
Sql server – query optimization – remove bookmark lookup – remove rid lookup
Sql server – query optimization – remove bookmark lookup – remove rid lookupSql server – query optimization – remove bookmark lookup – remove rid lookup
Sql server – query optimization – remove bookmark lookup – remove rid lookupKaing Menglieng
 
Sql server common interview questions and answers
Sql server   common interview questions and answersSql server   common interview questions and answers
Sql server common interview questions and answersKaing Menglieng
 
Sql server common interview questions and answers page 6
Sql server   common interview questions and answers page 6Sql server   common interview questions and answers page 6
Sql server common interview questions and answers page 6Kaing Menglieng
 
Sql server common interview questions and answers page 5
Sql server   common interview questions and answers page 5Sql server   common interview questions and answers page 5
Sql server common interview questions and answers page 5Kaing Menglieng
 
Sql server common interview questions and answers page 4
Sql server   common interview questions and answers page 4Sql server   common interview questions and answers page 4
Sql server common interview questions and answers page 4Kaing Menglieng
 
Sql server common interview questions and answers page 2
Sql server   common interview questions and answers page 2Sql server   common interview questions and answers page 2
Sql server common interview questions and answers page 2Kaing Menglieng
 
Sql server – 2008 – hardware and software requirements for installing sql se
Sql server – 2008 – hardware and software requirements for installing sql seSql server – 2008 – hardware and software requirements for installing sql se
Sql server – 2008 – hardware and software requirements for installing sql seKaing Menglieng
 
Speed up sql server apps - visual studio magazine
Speed up sql server apps  - visual studio magazineSpeed up sql server apps  - visual studio magazine
Speed up sql server apps - visual studio magazineKaing Menglieng
 
See sql server graphical execution plans in action tech republic
See sql server graphical execution plans in action   tech republicSee sql server graphical execution plans in action   tech republic
See sql server graphical execution plans in action tech republicKaing Menglieng
 
Reviewing sql server permissions tech republic
Reviewing sql server permissions   tech republicReviewing sql server permissions   tech republic
Reviewing sql server permissions tech republicKaing Menglieng
 
Query optimization how to search millions of record in sql table faster -
Query optimization   how to search millions of record in sql table faster  -Query optimization   how to search millions of record in sql table faster  -
Query optimization how to search millions of record in sql table faster -Kaing Menglieng
 
Optimize sql server queries with these advanced tuning techniques tech repu
Optimize sql server queries with these advanced tuning techniques   tech repuOptimize sql server queries with these advanced tuning techniques   tech repu
Optimize sql server queries with these advanced tuning techniques tech repuKaing Menglieng
 

More from Kaing Menglieng (20)

What is your sql server backup strategy tech_republic
What is your sql server backup strategy    tech_republicWhat is your sql server backup strategy    tech_republic
What is your sql server backup strategy tech_republic
 
Using sql server 2008's merge statement tech republic
Using sql server 2008's merge statement   tech republicUsing sql server 2008's merge statement   tech republic
Using sql server 2008's merge statement tech republic
 
Using object dependencies in sql server 2008 tech republic
Using object dependencies in sql server 2008   tech republicUsing object dependencies in sql server 2008   tech republic
Using object dependencies in sql server 2008 tech republic
 
Using hash fields in sql server tech republic
Using hash fields in sql server   tech republicUsing hash fields in sql server   tech republic
Using hash fields in sql server tech republic
 
Using grouping sets in sql server 2008 tech republic
Using grouping sets in sql server 2008   tech republicUsing grouping sets in sql server 2008   tech republic
Using grouping sets in sql server 2008 tech republic
 
Understand when to use user defined functions in sql server tech-republic
Understand when to use user defined functions in sql server   tech-republicUnderstand when to use user defined functions in sql server   tech-republic
Understand when to use user defined functions in sql server tech-republic
 
Sql server indexed views speed up your select queries part 1 - code-projec
Sql server indexed views   speed up your select queries  part 1 - code-projecSql server indexed views   speed up your select queries  part 1 - code-projec
Sql server indexed views speed up your select queries part 1 - code-projec
 
Sql server – query optimization – remove bookmark lookup – remove rid lookup
Sql server – query optimization – remove bookmark lookup – remove rid lookupSql server – query optimization – remove bookmark lookup – remove rid lookup
Sql server – query optimization – remove bookmark lookup – remove rid lookup
 
Sql server common interview questions and answers
Sql server   common interview questions and answersSql server   common interview questions and answers
Sql server common interview questions and answers
 
Sql server common interview questions and answers page 6
Sql server   common interview questions and answers page 6Sql server   common interview questions and answers page 6
Sql server common interview questions and answers page 6
 
Sql server common interview questions and answers page 5
Sql server   common interview questions and answers page 5Sql server   common interview questions and answers page 5
Sql server common interview questions and answers page 5
 
Sql server common interview questions and answers page 4
Sql server   common interview questions and answers page 4Sql server   common interview questions and answers page 4
Sql server common interview questions and answers page 4
 
Sql server common interview questions and answers page 2
Sql server   common interview questions and answers page 2Sql server   common interview questions and answers page 2
Sql server common interview questions and answers page 2
 
Sql server – 2008 – hardware and software requirements for installing sql se
Sql server – 2008 – hardware and software requirements for installing sql seSql server – 2008 – hardware and software requirements for installing sql se
Sql server – 2008 – hardware and software requirements for installing sql se
 
Speed up sql
Speed up sqlSpeed up sql
Speed up sql
 
Speed up sql server apps - visual studio magazine
Speed up sql server apps  - visual studio magazineSpeed up sql server apps  - visual studio magazine
Speed up sql server apps - visual studio magazine
 
See sql server graphical execution plans in action tech republic
See sql server graphical execution plans in action   tech republicSee sql server graphical execution plans in action   tech republic
See sql server graphical execution plans in action tech republic
 
Reviewing sql server permissions tech republic
Reviewing sql server permissions   tech republicReviewing sql server permissions   tech republic
Reviewing sql server permissions tech republic
 
Query optimization how to search millions of record in sql table faster -
Query optimization   how to search millions of record in sql table faster  -Query optimization   how to search millions of record in sql table faster  -
Query optimization how to search millions of record in sql table faster -
 
Optimize sql server queries with these advanced tuning techniques tech repu
Optimize sql server queries with these advanced tuning techniques   tech repuOptimize sql server queries with these advanced tuning techniques   tech repu
Optimize sql server queries with these advanced tuning techniques tech repu
 

Speeding up queries with semi joins and anti-joins

  • 1. Speeding Up Queries with Semi-Joins and Anti-Joins Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN by Roger Schrag Database Specialists, Inc. About Database Specialists, Inc. Database Specialists, Inc. provides remote DBA services and onsite database support for your mission critical Oracle systems. Since 1995, we have been providing Oracle database consulting in Solaris, HP-UX, Linux, AIX, and Windows environments. We are DBAs, speakers, educators, and authors. Our team is continually recognized by Oracle, at national conferences and by leading trade publications. Learn more about our remote DBA, database tuning, and consulting services. Or, call us at 415-344-0500 or 888-648-0500 . Introduction Optimizing SQL usually gives the most significant results when DBAs are called upon to “make the system run faster.” Using tools like Statspack or Enterprise Manager, it is often easy to find the slow SQL. But how do you make the queries run faster? That is the challenge! In this paper we will discuss the semi-join and the anti-join, two powerful SQL constructs Oracle offers for use in your quest for faster queries. In particular, we will define these two terms, discuss when and why you might want to use the [NOT] EXISTS or [NOT] IN constructs, and demonstrate how you can use optimizer hints and make minor query changes in order to enable Oracle to use some very powerful and efficient access paths. For certain classes of queries, these features can dramatically reduce logical reads, physical reads, CPU time, and elapsed time. But beware! There are some obscure (and not always easy to find) requirements that must be met in order for Oracle to deploy the semi- and anti-join access paths. If you fail to dot an “I” or cross a “T,” you could be banging your head against the wall for hours. In this paper we will look at SQL from a real application and demonstrate the “before” and “after” versions of queries that ran orders of magnitude faster once semi-joins and anti- joins were implemented correctly. Before we get started, I want to point out that Oracle Corporation has improved its optimization techniques significantly over the years, specifically in the area of semi-joins and anti-joins. This paper was written using Oracle 9i release 9.2.0.4, and all examples were validated in such an environment. If you are using a newer or older release of Oracle, it is quite possible that you will see different results. Please do not take anything you read in this paper—or in any paper for that matter—for granted. Validate every hypothesis in your environment before relying upon it. Semi-Join Defined A “semi-join” between two tables returns rows from the first table where one or more matches are found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be returned. Semi-joins are written using the EXISTS or IN constructs. Suppose you have the DEPT and EMP tables in the SCOTT schema and you want a list of departments with at least one employee. You could write the query with a conventional join: SELECT D.deptno, D.dname FROM dept D, emp E WHERE E.deptno = D.deptno ORDER BY D.deptno; Unfortunately, if a department has 400 employees then that department will appear in the query output 400 times. You could eliminate the duplicate rows by using the DISTINCT keyword, but you would be making Oracle do more work than necessary. Really what you want to do is specify a semi-join between the DEPT and EMP tables instead of a http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 2. Speeding Up Queries with Semi-Joins and Anti-Joins conventional join: SELECT D.deptno, D.dname FROM dept D WHERE EXISTS ( SELECT 1 FROM emp E WHERE E.deptno = D.deptno ) ORDER BY D.deptno; The above query will list the departments that have at least one employee. Whether a department has one employee or 100, the department will appear only once in the query output. Moreover, Oracle will move on to the next department as soon as it finds the first employee in a department, instead of finding all of the employees in each department. Anti-Join Defined An “anti-join” between two tables returns rows from the first table where no matches are found in the second table. An anti-join is essentially the opposite of a semi-join: While a semi-join returns one copy of each row in the first table for which at least one match is found, an anti-join returns one copy of each row in the first table for which no match is found. Anti-joins are written using the NOT EXISTS or NOT IN constructs. These two constructs differ in how they handle nulls—a subtle but very important distinction which we will discuss later. Suppose you want a list of empty departments—departments with no employees. You could write a query that finds all departments and subtracts off the department of each employee: SELECT D1.deptno, D1.dname FROM dept D1 MINUS SELECT D2.deptno, D2.dname FROM dept D2, emp E2 WHERE D2.deptno = E2.deptno ORDER BY 1; The above query will give the desired results, but it might be clearer to write the query using an anti-join: SELECT D.deptno, D.dname FROM dept D WHERE NOT EXISTS ( SELECT 1 FROM emp E WHERE E.deptno = D.deptno ) ORDER BY D.deptno; The above query is more efficient because Oracle can employ an anti-join access path. The difference in efficiency here is akin to the difference between a nested loops join and a hash join when you are joining every row in one table to another. The EXISTS and IN Constructs In this section we will discuss how Oracle evaluates EXISTS and IN clauses, prerequisites for using Oracle’s semi- join access paths, and hints that influence semi-join query optimization. Then we’ll look at a few sample queries where semi-join access paths can be used to make the query more efficient. How Oracle Evaluates EXISTS and IN Clauses The SQL language as implemented in Oracle does not include a syntax for explicitly performing semi-joins, so you use an EXISTS or IN clause with a subquery to indicate to Oracle that a semi-join should be performed instead of a conventional join. Oracle will typically transform the subquery into a join wherever possible. http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 3. Speeding Up Queries with Semi-Joins and Anti-Joins An interesting characteristic of Oracle’s query optimizer is that (according to Metalink document 144967.1), in Oracle 8i and 9i the decision of whether or not to transform a query by merging its subqueries into the main body is a heuristic one. Basically, Oracle will merge subqueries if it can, regardless of impact on cost. The assumption is that merging is always a good thing. In fact, Oracle performs query transformations (merging subqueries into the main body) before making a list of possible execution plans and evaluating their cost. This bit of trivia will come into play when we discuss the second example in this section. The Oracle Performance Tuning Guide in the Oracle 8i documentation set explains that EXISTS and IN are processed differently by the query optimizer. The manual explains a rule of thumb for when to use each, and provides an example to demonstrate. The Oracle 9i Performance Tuning Guide provides the same rule of thumb with a slightly more polished example. The Oracle 10g Performance Tuning Guide contains substantially the same text as the Oracle 9i manual on this subject. The rule of thumb goes like this: If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate. The example in the Oracle 8i Performance Tuning Guide shows a query where the main body is not selective, while the target of the semi-join is highly selective. The example demonstrates that Oracle chooses a poor execution plan when an EXISTS clause is used, and a much more efficient execution plan when an IN clause is used instead. I was easily able to reproduce this example in an Oracle 8i environment with the same results as shown in the Performance Tuning Guide. However, when I reproduced this same example in an Oracle 9i environment, Oracle chose an identical (and efficient) execution plan for both versions of the query. Furthermore, the example in the Oracle 9i Performance Tuning Guide gave an identical (and efficient) execution plan for both versions of the query in my Oracle 9i environment. These tests echo my previous experience—that Oracle 9i is smarter than Oracle 8i about EXISTS and IN clauses, and you don’t typically need to worry about when to use EXISTS versus IN. The upshot is that if you are writing queries for an Oracle 8i (or earlier) system, then you should think about the selectivity of the predicates in the main query versus the subquery and choose whether to use an EXISTS or an IN clause wisely. If you are using Oracle 9i or later, it doesn’t hurt to follow this rule of thumb, but you may not need to worry as much. Prerequisites and Hints that Impact Semi-Joins There are a few situations in which Oracle cannot use a semi-join access path. If the query contains a DISTINCT operation (explicitly or implicitly through a set operator such as UNION) then Oracle cannot transform EXISTS or IN clauses into something that could use a semi-join access path. The same is true if the EXISTS or IN clause is part of an OR branch. (Semi-join access paths can be used in queries that contain ORs in the WHERE clause, just as long as the EXISTS or IN is not part of the OR.) Assuming that a semi-join access path is possible, Oracle chooses the semi-join algorithm to use while evaluating execution plans. In Oracle 9i a semi-join can be performed using the nested loops, hash join, or merge join algorithms. As with a conventional join, Oracle 9i will choose the join algorithm with the lowest cost. In Oracle 8i a semi-join can be performed using the hash join or merge join algorithms, but Oracle 8i will not choose the join algorithm with the lowest cost. Instead, the always_semi_join instance parameter dictates whether or not an Oracle 8i database should perform semi-joins and, if so, which algorithm to use. The always_semi_join instance parameter is obsolete in Oracle 9i. If a semi-join access path is not possible (because of a DISTINCT operation or an OR predicate), or if instance parameters specify a semi-join should not be used, then Oracle will choose an alternate data access path. This usually amounts to performing a conventional join and sorting out the duplicates, or scanning the first table and using the second table as a filter. http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 4. Speeding Up Queries with Semi-Joins and Anti-Joins Oracle provides the NL_SJ, HASH_SJ, and MERGE_SJ hints in order for you to manipulate the semi-join process if you need to. The hint is applied to the subquery of the EXISTS or IN clause, not the main body of the query itself. In my experience Oracle does not need you to provide a hint in order to consider the efficient semi-join access paths. However, in Oracle 9i or 10g, where you have a wider choice of semi-join algorithms, you might want to use hints if Oracle chooses a hash semi-join when a nested loops semi-join would in fact be better, or vice versa. Semi-Join Example #1 Suppose you want a list of your gold-status customers who have placed an order within the last three days. You might start with a query that looks like: SELECT DISTINCT C.short_name, C.customer_id FROM customers C, orders O WHERE C.customer_type = 'Gold' AND O.customer_id = C.customer_id AND O.order_date > SYSDATE - 3 ORDER BY C.short_name; The execution plan (from a TKPROF report) for this query is: Rows Row Source Operation ------- --------------------------------------------------- 2 SORT UNIQUE (cr=33608 r=30076 w=0 time=6704029 us) 20 HASH JOIN (cr=33608 r=30076 w=0 time=6703101 us) 10 TABLE ACCESS FULL CUSTOMERS (cr=38 r=36 w=0 time=31718 us) 2990 TABLE ACCESS FULL ORDERS (cr=33570 r=30040 w=0 time=6646420 us) You can see that Oracle did a (conventional) hash join between the CUSTOMERS and ORDERS tables, and then performed a sort for uniqueness to filter out the duplicate records. This sort was necessary because the query is supposed to show a customer only once, even if they have more than one order placed within the last three days. (A sort to satisfy the ORDER BY clause is also necessary, but Oracle tackles this at the same time as the sort for uniqueness and thus it does not appear as a separate step in the execution plan.) Using what we have discussed so far, you might want to rewrite the query using a semi-join: SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_type = 'Gold' AND EXISTS ( SELECT 1 FROM orders O WHERE O.customer_id = C.customer_id AND O.order_date > SYSDATE - 3 ) ORDER BY C.short_name; The execution plan for this query becomes: Rows Row Source Operation ------- --------------------------------------------------- 2 SORT ORDER BY (cr=33608 r=29921 w=0 time=6422770 us) 2 HASH JOIN SEMI (cr=33608 r=29921 w=0 time=6422538 us) 10 TABLE ACCESS FULL CUSTOMERS (cr=38 r=0 w=0 time=61290 us) 2990 TABLE ACCESS FULL ORDERS (cr=33570 r=29921 w=0 time=6345754 us) The “SEMI” that you see on the second line indicates that a semi-join access path is being used instead of a conventional join. A semi-join is more efficient here for two reasons: (1) The search through the ORDERS table for a given customer can stop as soon as the first qualifying order is found, and (2) There is no need to sort the results to remove duplicate customer records in the case where a customer had multiple qualifying orders. (You can see that the sort for uniqueness, which took 928 microseconds, has been replaced by a sort to satisfy the ORDER BY clause, which took only 232 microseconds.) Rewriting the query to use a semi-join access path has perhaps been a novel experience, but, so far, it hasn’t done us a substantial amount of good. We brought the elapsed time down about 4%, but that doesn’t seem too significant, and it http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 5. Speeding Up Queries with Semi-Joins and Anti-Joins could just be the result of chance variation in our test environment. But suppose you are aware of the fact that very few of your customers are gold-status. Perhaps a nested loops semi-join would be more efficient. You can add an NL_SJ hint and see the results for yourself: SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_type = 'Gold' AND EXISTS ( SELECT /*+ NL_SJ */ 1 FROM orders O WHERE O.customer_id = C.customer_id AND O.order_date > SYSDATE - 3 ) ORDER BY C.short_name; Rows Row Source Operation ------- --------------------------------------------------- 2 SORT ORDER BY (cr=833 r=725 w=0 time=358431 us) 2 NESTED LOOPS SEMI (cr=833 r=725 w=0 time=358232 us) 10 TABLE ACCESS FULL CUSTOMERS (cr=38 r=0 w=0 time=2210 us) 2 TABLE ACCESS BY INDEX ROWID ORDERS (cr=795 r=725 w=0 time=355822 us) 780 INDEX RANGE SCAN ORDERS_N1 (cr=15 r=13 w=0 time=5601 us)(object id 34176) Indeed, the nested loops semi-join brought the consistent reads down to 833 from 33608 and reduced the execution time by over 90%. So it appears that Oracle chose a hash semi-join when a nested loops semi-join was more efficient. Once we get Oracle to see the light with the NL_SJ hint, we end up with a much more efficient query than we started with. We achieved this performance gain by leveraging Oracle’s nested loops semi-join access path. Semi-Join Example #2 I recently worked on a project where applying the concepts presented in this section brought the execution time for a key query in my client’s application down from 46 minutes to 1.1 seconds. A screen in the application allowed a user to enter a project owner and up to five participants, and a list of assignments involving all of the specified people would be displayed. The users found that the response time was tolerable if they only selected one participant, but it went downhill from there as they added more participants. The original version of the query looked like this: SELECT DISTINCT A.name, A.code, A.description, A.item_id, A.assignment_id, FI.string0, FI.string1 FROM relationships R, assignments A, format_items FI, relationships R1, relationships R2, relationships R3, relationships R4, relationships R5 WHERE R.user_id = 134546 AND R.account_id = 134545 AND R.type_code = 0 AND A.item_id = R.item_id AND FI.item_id = A.item_id AND R1.item_id = A.item_id AND R1.status = 5 AND R1.user_id = 137279 AND R2.item_id = A.item_id AND R2.status = 5 AND R2.user_id = 134555 AND R3.item_id = A.item_id AND R3.status = 5 AND R3.user_id = 134546 AND R4.item_id = A.item_id AND R4.status = 5 AND R4.user_id = 137355 AND R5.item_id = A.item_id AND R5.status = 5 AND R5.user_id = 134556 ORDER BY A.name ASC; (I have changed the names of tables and columns, stripped away extraneous columns and predicates, and eliminated bind variables in order to both protect my client’s confidentiality and make the example easier to understand.) The execution plan looked like this: Rows Row Source Operation ------- --------------------------------------------------- 642 SORT UNIQUE (cr=23520269 r=34 w=0 time=2759937104 us) 64339104 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=23520269 r=34 w=0 time=1838051782 us) 95184881 NESTED LOOPS (cr=7842642 r=23 w=0 time=907238095 us) 2710288 NESTED LOOPS (cr=2266544 r=23 w=0 time=103840003 us) 317688 NESTED LOOPS (cr=484734 r=11 w=0 time=23494451 us) 50952 NESTED LOOPS (cr=43280 r=10 w=0 time=2688237 us) 4146 NESTED LOOPS (cr=19016 r=3 w=0 time=988374 us) 1831 NESTED LOOPS (cr=13353 r=0 w=0 time=608296 us) 4121 HASH JOIN (cr=7395 r=0 w=0 time=399488 us) http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 6. Speeding Up Queries with Semi-Joins and Anti-Joins 2046 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=7211 r=0 w=0 time=299181 us) 17788 INDEX RANGE SCAN RELATIONSHIPS_N3 (cr=71 r=0 w=0 time=81158 us)(object id 34528) 3634 TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=0 w=0 time=25536 us) 1831 TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=163252 us) 1831 INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=115113 us)(object id 34316) 4146 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5663 r=3 w=0 time=349554 us) 4264 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=3678 r=0 w=0 time=224390 us)(object id 34345) 50952 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=24264 r=7 w=0 time=1538051 us) 70976 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=8428 r=0 w=0 time=630831 us)(object id 34345) 317688 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=441454 r=1 w=0 time=19584539 us) 1631032 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=108302 r=0 w=0 time=7213989 us)(object id 34345) 2710288 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=1781810 r=12 w=0 time=72015925 us) 4237104 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=643298 r=1 w=0 time=29018764 us)(object id 34345) 92474592 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=5576098 r=0 w=0 time=503165776 us)(object id 34345) It is no wonder that this query took so long when five participants were selected. With each join to the RELATIONSHIPS table, the Cartesian product gets more and more out of hand. By the final sort to eliminate the duplicates, we have over 64 million rows to reckon with. This query struck me right away as a good candidate for a nested loops semi-join. We want a list of assignments that meet certain criteria. Whether a specific user has one qualifying relationship record or 10, it makes no difference. We are only concerned with the existence of at least one qualifying relationship record for each of the specified users. Rewriting the query with EXISTS clauses to semi-join to the RELATIONSHIPS table five times is pretty easy. The challenge here is that the query does have a DISTINCT operator, and this defeats Oracle’s semi-join access paths. Not to worry—an inline view with a NO_MERGE hint saved the day. The revised query looked like: SELECT /*+ NO_MERGE (M) */ DISTINCT M.name, M.code, M.description, M.item_id, M.assignment_id, M.string0, M.string1 FROM ( SELECT A.name, A.code, A.description, A.item_id, A.assignment_id, FI.string0, FI.string1 FROM relationships R, assignments A, format_items FI WHERE R.user_id = 134546 AND R.account_id = 134545 AND R.type_code = 0 AND A.item_id = R.item_id AND FI.item_id = A.item_id AND EXISTS (SELECT 1 FROM relationships R1 WHERE R1.item_id = A.item_id AND R1.status = 5 AND R1.user_id = 137279) AND EXISTS (SELECT 1 FROM relationships R2 WHERE R2.item_id = A.item_id AND R2.status = 5 AND R2.user_id = 134555) AND EXISTS (SELECT 1 FROM relationships R3 WHERE R3.item_id = A.item_id AND R3.status = 5 AND R3.user_id = 134546) AND EXISTS (SELECT 1 FROM relationships R4 WHERE R4.item_id = A.item_id AND R4.status = 5 AND R4.user_id = 137355) AND EXISTS (SELECT 1 FROM relationships R5 WHERE R5.item_id = A.item_id AND R5.status = 5 AND R5.user_id = 134556) ) M ORDER BY M.name ASC; http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 7. Speeding Up Queries with Semi-Joins and Anti-Joins The execution plan was as follows: Rows Row Source Operation ------- --------------------------------------------------- 642 SORT UNIQUE (cr=36315 r=89 w=0 time=1107054 us) 1300 VIEW (cr=36315 r=89 w=0 time=1085116 us) 1300 NESTED LOOPS SEMI (cr=36315 r=89 w=0 time=1082232 us) 1314 NESTED LOOPS SEMI (cr=32385 r=89 w=0 time=1002330 us) 1314 NESTED LOOPS SEMI (cr=28261 r=89 w=0 time=904654 us) 1314 NESTED LOOPS SEMI (cr=22822 r=89 w=0 time=737705 us) 1322 NESTED LOOPS SEMI (cr=18730 r=89 w=0 time=651196 us) 1831 NESTED LOOPS (cr=13353 r=89 w=0 time=530670 us) 4121 HASH JOIN (cr=7395 r=89 w=0 time=347584 us) 2046 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=7211 r=0 w=0 time=186820 us) 17788 INDEX RANGE SCAN RELATIONSHIPS_N3 (cr=71 r=0 w=0 time=43770 us)(object id 34528) 3634 TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=89 w=0 time=91899 us) 1831 TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=141416 us) 1831 INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=100207 us)(object id 34316) 1322 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5377 r=0 w=0 time=92046 us) 2472 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=3664 r=0 w=0 time=61077 us)(object id 34345) 1314 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=4092 r=0 w=0 time=63478 us) 1582 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2647 r=0 w=0 time=40433 us)(object id 34345) 1314 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5439 r=0 w=0 time=133620 us) 11011 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2639 r=0 w=0 time=65312 us)(object id 34345) 1314 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=4124 r=0 w=0 time=75657 us) 2234 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=49373 us)(object id 34345) 1300 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=3930 r=0 w=0 time=58651 us) 1300 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=38832 us)(object id 34345) The execution plan starts out the same as before, with a hash join between RELATIONSHIPS and ASSIGNMENTS followed by a nested loops join to FORMAT_ITEMS. At this point there are 1831 rows in the candidate result set. Whereas the original query does five conventional joins to the RELATIONSHIPS table and the 1831 rows balloon to over 64 million, the revised query does semi-joins instead and whittles the candidate rows down to 1300. Joining 1300 rows is a lot faster than joining 64 million rows, so you can see how the semi-join access path really speeds up this query. The NO_MERGE hint and the inline view (aliased as “M”) causes Oracle to separate the DISTINCT operation from the bulk of the query. This is how we get around the fact that a DISTINCT operation defeats semi-join access paths. In the second version of the query, Oracle executes the query as if the DISTINCT keyword were not there. Then, as a final step, it applies the DISTINCT operation to the interim result set that it has stored in memory (the output of the VIEW operator). If you are wondering why the NO_MERGE hint is necessary, recall what we discussed earlier about the heuristic Oracle uses when deciding whether or not to merge a subquery into the main body of a query. Oracle assumes that merging is a good thing and will do it whenever possible. Thus without the NO_MERGE hint Oracle would merge the inline view into the DISTINCT operation (eliminating the inline view), and the semi-join access path would be defeated by the presence of the DISTINCT operation. This example demonstrates the value of the semi-join access paths. Your gains may not be as spectacular as we have seen here, but hopefully you can see from this discussion how proper use of semi-joins can make a certain class of queries much more efficient. This example also underscores the importance of knowing how to properly use semi- joins. If you did not remember that semi-join access paths are defeated by the DISTINCT operation, you might have changed the example query to use EXISTS clauses and wondered why the query still ran slowly. The NOT EXISTS and NOT IN Constructs In this section we will discuss how Oracle evaluates NOT EXISTS and NOT IN clauses, prerequisites for using http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 8. Speeding Up Queries with Semi-Joins and Anti-Joins Oracle’s anti-join access paths, and hints that influence anti-join query optimization. Then we’ll look at a few examples of how anti-join access paths can be used to make some queries more efficient. How Oracle Evaluates NOT EXISTS and NOT IN Clauses NOT EXISTS and NOT IN clauses are used to retrieve rows from a table for which no match is found in another table. In the early days, Oracle would use the subquery of a NOT EXISTS or NOT IN clause as a filter. For each candidate row retrieved by the main body of the query, Oracle would execute the subquery. If the subquery retrieved one or more rows, then the candidate row would be discarded. Conceptually this approach seems similar to a nested loops anti-join. Somewhere around Oracle 7.3, the merge anti-join and hash anti-join access paths were added. These enabled Oracle to leverage the merge and hash join algorithms when performing anti-joins. However, Oracle would only use these algorithms if the always_anti_join instance parameter was set appropriately or if a hint was added to the query. In Oracle 8i this still appears to be true. Oracle 9i, however, does away with the always_anti_join instance parameter and instead the optimizer evaluates the costs of the different algorithms and chooses the one with the lowest cost. In an earlier section of this paper we discussed when the EXISTS clause should be used versus when the IN clause should be used. The two are functionally equivalent, and actively choosing one over the other is usually done in the hopes that this will help the optimizer choose a better execution plan. When talking about NOT EXISTS and NOT IN, however, it is important to note that these two constructs are not functionally equivalent—there is a significant difference between the two in how null values are handled. If the subquery of a NOT IN clause returns at least one null value, then the NOT IN predicate evaluates to false. The NOT EXISTS construct, meanwhile, concerns itself only with whether a row is returned or not, and thus does not do anything differently if a null value is returned. This may seem a bit abstract, so let’s clarify the point with an example. Suppose you want a list of empty departments—departments that have no employees. You could write this query with a NOT EXISTS clause or a NOT IN clause. We saw the NOT EXISTS case earlier: SELECT D.deptno, D.dname FROM dept D WHERE NOT EXISTS ( SELECT 1 FROM emp E WHERE E.deptno = D.deptno ) ORDER BY D.deptno; Written with a NOT IN clause, the query becomes: SELECT D.deptno, D.dname FROM dept D WHERE D.deptno NOT IN ( SELECT E.deptno FROM emp E ) ORDER BY D.deptno; The queries look like they would be functionally equivalent, but if you are using the EMP and DEPT tables from the SCOTT schema of an Oracle 9i database, then the two queries are in fact different. Insert a row into the EMP table with a null DEPTNO, run the two versions of the query, and you will see the difference. The NOT EXISTS version will show you that department 40 has no employees. However, the NOT IN version of the query will say there are no empty departments. Why? Because the subquery of the NOT IN clause retrieves a row with a null value, causing the NOT IN clause to always evaluate to false. As strange as this NOT IN nuance may seem, it is fully documented in Metalink document 28934.1. The difference between NOT EXISTS and NOT IN with respect to handling of nulls is an important one—not just from the standpoint of differing functionality, but also from a performance perspective as well. To evaluate a NOT IN http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 9. Speeding Up Queries with Semi-Joins and Anti-Joins subquery that is capable of returning nulls, Oracle cannot scan an index because an implicit NVL() function must be applied and this would defeat the index. (Null values aren’t stored in single-column non-unique indexes, anyway.) We’ll see the impact of this implicit NVL() function in the examples later in this section. If the subquery of a NOT IN clause is physically incapable of returning a null value (because the columns involved have NOT NULL constraints or because predicates have been added to the WHERE clause requiring non-null values), then the query with the NOT IN clause will return the same results as a query with an equivalent NOT EXISTS clause —and indexes won’t be defeated in the process. If we wanted to write the query that shows empty departments with a NOT IN clause that gave the same results as the NOT EXISTS version, we would need to add a predicate to the subquery that rules out null values. The resulting query is: SELECT D.deptno, D.dname FROM dept D WHERE D.deptno NOT IN ( SELECT E.deptno FROM emp E WHERE E.deptno IS NOT NULL ) ORDER BY D.deptno; To choose whether to use NOT EXISTS or NOT IN in a query, therefore, you must first consider how you want null values to be handled. If you need the presence of a null value in the subquery to cause the expression to always evaluate to false, then you need to use the NOT IN construct. Otherwise you have a choice between NOT EXISTS and NOT IN (with the extra WHERE predicate if the subquery would otherwise be capable of returning a null value). If you don’t need the special null value semantics of NOT IN, then your choice of whether to use NOT EXISTS or NOT IN hinges on what sort of access path you think will be best for the query. In a conventional join, the nested loops algorithm is desirable when the predicates on the first table are very selective and the join columns in the second table are selectively indexed. The merge and hash join algorithms, meanwhile, are more desirable when predicates are not very selective or the join columns in the second table are not selectively indexed. A similar rule of thumb applies to anti-joins. That is to say, merge and hash anti-joins are most valuable in situations where the first table has a lot of candidate rows or the join columns are not selectively indexed. If you think your query is going to be better off with a merge or hash anti-join than a nested loops anti-join or a filter operation, then you might want to code your query with a NOT IN clause instead of a NOT EXISTS clause. Whenever the Oracle documentation discusses anti-join access paths, it always seems to do so in the context of the NOT IN construct with no mention of NOT EXISTS. While sometimes Oracle 9i will choose an anti-join access path for a query with a NOT EXISTS clause, it does seem that Oracle 9i might be more likely to do so if NOT IN is used instead of NOT EXISTS. In Oracle 8i, I believe the query optimizer will never choose an anti-join access path for a query with a NOT EXISTS clause (even if always_anti_join is set and a hint appears in the subquery). Meanwhile, if you don’t think your query will benefit from a merge or hash anti-join, then you might be best off with the good old filter technique instead of the nested loops anti-join access path. It appears that, at least in Oracle 9i, the nested loops anti-join might not be as efficient as it could be. In the example of the query for empty departments that we have been discussing, the filter technique will rule out a department as soon as the first employee in that department is located. The nested loops anti-join, meanwhile, appears to locate all employees in the department before ruling it out. This is unnecessary work. Whether a department has one employee or 100, it does not qualify as an empty department. So Oracle should cut to the chase and move on to the next department as soon as the first employee is found. The filter technique does just that, but the nested loops anti-join (at least in the cases I have seen) does not. The second example in this section will demonstrate this. The upshot is that you must first consider how null values should be handled when deciding whether to use NOT EXISTS or NOT IN. If you need the special semantics provided by NOT IN, then your decision has been made. Otherwise, you should next consider whether or not the query might benefit from a merge or hash anti-join. If so, then you probably ought to choose NOT IN. If you decide to go with NOT IN but do not want the expression to evaluate to http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 10. Speeding Up Queries with Semi-Joins and Anti-Joins false if a null value is found, then make sure the subquery cannot return a null value. If there is no chance that the query will benefit from a merge or hash anti-join and the special semantics of NOT IN are not desired, then you might want to select the NOT EXISTS construct so that there is a better chance Oracle will perform an efficient filter instead of an inefficient nested loops anti-join. Prerequisites and Hints that Impact Anti-Joins The key prerequisite for the anti-join access paths is that the subquery of a NOT IN clause cannot be capable of returning a null value. As mentioned before, this means that either the columns being selected must have NOT NULL constraints or there must be predicates in the WHERE clause of the subquery to ensure there are no nulls. Even if every row in the table has a non-null value in a nullable column, you must still add the extra predicate to the WHERE clause or else Oracle will refuse to use the merge and hash anti-join access paths. Assuming that an anti-join access path is possible, Oracle chooses the anti-join algorithm to use while evaluating execution plans. In Oracle 9i an anti-join can be performed using the nested loops, merge, or hash join algorithms. As with a conventional join, Oracle will choose the join algorithm with the lowest cost, or it may choose to use the filter technique instead of an anti-join altogether. In Oracle 8i an anti-join can be performed using the merge or hash join algorithms, but as with semi-joins, Oracle 8i will not choose the join algorithm with the lowest cost. Instead, the always_anti_join instance parameter dictates which approach Oracle 8i should use. (Oracle 8i does not implement the nested loops anti-join access path, so if the always_anti_join parameter is set to nested_loops, the filter technique is used instead.) The always_anti_join instance parameter is obsolete in Oracle 9i. Oracle provides the NL_AJ, MERGE_AJ, and HASH_AJ hints in order for you to manipulate the anti-join process if you need to. As with the semi-join hints, the hint is applied to the subquery and not the main body of the query itself. Anti-Join Example #1 Suppose you want a list of customers who have not placed an order within the last ten days. You might start with a query that looks like: SELECT C.short_name, C.customer_id FROM customers C WHERE NOT EXISTS ( SELECT 1 FROM orders O WHERE O.customer_id = C.customer_id AND O.order_date > SYSDATE - 10 ) ORDER BY C.short_name; The execution plan looks like: Rows Row Source Operation ------- --------------------------------------------------- 11 SORT ORDER BY (cr=18707 r=301 w=0 time=22491917 us) 11 FILTER (cr=18707 r=301 w=0 time=22491555 us) 1000 TABLE ACCESS FULL CUSTOMERS (cr=38 r=36 w=0 time=15277 us) 989 VIEW (cr=18669 r=265 w=0 time=22365647 us) 989 HASH JOIN (cr=18669 r=265 w=0 time=22347234 us) 100000 INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2207 r=208 w=0 time=277584 us)(object id 34555) 5338680 INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16462 r=57 w=0 time=9036516 us)(object id 34556) You can see that the query took over 22 seconds to run and performed 18,707 logical reads. Oracle did not use an anti- join access path in this execution plan, instead opting to read every row in the CUSTOMERS table and use the subquery as a filter by executing it once for each customer. Although Oracle found a clever way to hash join two indexes on the ORDERS table and thereby make the subquery very efficient, the query overall is still resource intensive because the subquery must execute 1000 times—once for each customer. This query seems like a good candidate for an anti-join access path because executing the subquery once and http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 11. Speeding Up Queries with Semi-Joins and Anti-Joins performing a merge or hash anti-join might be more efficient than performing the subquery once for each customer as a filter operation. We could rewrite the query using a NOT IN clause instead of NOT EXISTS to see if Oracle chooses an anti-join access path: SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_id NOT IN ( SELECT O.customer_id FROM orders O WHERE O.order_date > SYSDATE - 10 ) ORDER BY C.short_name; Unfortunately, the results are quite disappointing—elapsed time shot up to 695 seconds! The execution plan looks like: Rows Row Source Operation ------- --------------------------------------------------- 11 SORT ORDER BY (cr=5360749 r=4870724 w=0 time=695232973 us) 11 FILTER (cr=5360749 r=4870724 w=0 time=695232475 us) 1000 TABLE ACCESS FULL CUSTOMERS (cr=38 r=129 w=0 time=61614 us) 989 TABLE ACCESS BY INDEX ROWID ORDERS (cr=5360711 r=4870595 w=0 time=695088325 us) 5359590 INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16520 r=0 w=0 time=22299946 us)(object id 34556) You can see that Oracle did not choose an anti-join access path. Instead, Oracle is still reading every row from the CUSTOMERS table and executing the subquery once for each of the 1000 customers. Moreover, Oracle is no longer hash joining the two ORDERS indexes to make the subquery very efficient. For each customer, Oracle is now grabbing all orders from the last 10 days and sifting through them to find those belonging to the customer currently being processed. It turns out that the CUSTOMER_ID column in the ORDERS table is nullable, and so Oracle conceptually rewrites the query in order to implement the special null value semantics of NOT IN. The rewritten query looks like: SELECT C.short_name, C.customer_id FROM customers C WHERE NOT EXISTS ( SELECT 1 FROM orders O WHERE O.order_date > SYSDATE – 10 AND NVL (O.customer_id, C.customer_id) = C.customer_id ) ORDER BY C.short_name; You can see how the implicit NVL() function that Oracle inserts during the transformation would be necessary in order to cause the predicate to evaluate to false if a null value is returned. You can also see how this defeats the index on the CUSTOMER_ID column of the ORDERS table. You might wonder why the CUSTOMER_ID column in the ORDERS table can be null. Let’s suppose the developers tell you that this handles a special case and you should not worry about it. The developers want the query to ignore nulls. Thus you can add an extra predicate to the WHERE clause of the subquery in order to meet the prerequisite for anti-join access paths. The resulting query is: SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_id NOT IN ( SELECT O.customer_id FROM orders O WHERE O.order_date > SYSDATE - 10 AND O.customer_id IS NOT NULL ) ORDER BY C.short_name; The execution plan becomes: Rows Row Source Operation http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 12. Speeding Up Queries with Semi-Joins and Anti-Joins ------- --------------------------------------------------- 11 SORT ORDER BY (cr=311 r=132 w=98 time=1464035 us) 11 HASH JOIN ANTI (cr=311 r=132 w=98 time=1463770 us) 1000 TABLE ACCESS FULL CUSTOMERS (cr=38 r=34 w=0 time=37976 us) 20910 VIEW (cr=273 r=98 w=98 time=1318222 us) 20910 HASH JOIN (cr=273 r=98 w=98 time=1172207 us) 20910 INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=58 r=0 w=0 time=43870 us)(object id 34556) 100000 INDEX FAST FULL SCAN ORDERS_N1_CUST_ID (cr=215 r=0 w=0 time=196953 us)(object id 34555) You can see that Oracle now chooses to perform a hash anti-join. The CUSTOMERS table gets a full table scan as before, but the critical difference is that now the subquery executes just once. The two indexes on the ORDERS table are hash joined once in order to develop a list of order dates for each customer, and this is hash anti-joined to the CUSTOMERS table in order to yield the list of customers with no orders in the last 10 days. The result is a query that completes in 1.46 seconds and performs just 311 logical reads. In this example we did not need to use hints to get Oracle to find the most efficient execution plan. All we had to do was write the query in a way that made it eligible for the anti-join access paths. (This entailed using NOT IN instead of NOT EXISTS, and adding a predicate to the subquery’s WHERE clause to make it impossible for the subquery to return a null value.) Anti-Join Example #2 Along the lines of the previous example, suppose you needed to find out if a specific customer has not placed an order within the last 10 days. Your query might look like: SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_id = 22 AND NOT EXISTS ( SELECT 1 FROM orders O WHERE O.order_date > SYSDATE - 10 AND O.customer_id = C.customer_id ) ORDER BY C.short_name; The execution plan looks like: Rows Row Source Operation ------- --------------------------------------------------- 0 TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=6 r=0 w=0 time=341 us) 0 INDEX UNIQUE SCAN CUSTOMERS_PK (cr=6 r=0 w=0 time=333 us)(object id 34552) 1 TABLE ACCESS BY INDEX ROWID ORDERS (cr=4 r=0 w=0 time=170 us) 2 INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=65 us)(object id 34555) As you can see, Oracle uses a filter operation instead of an anti-join. This is not surprising, because we used the NOT EXISTS construct instead of NOT IN. The query performed 6 logical reads and completed in 341 microseconds. Customer 22 has 110 orders in the system. By looking at the row counts in the execution plan, you can see that Oracle stopped the filter process as soon as it found the first order placed by the customer within the last 10 days. There is no need to look at all of the orders placed by this customer because we’ve already found one that excludes the customer from the result set. This is an example of a query with an anti-join where the main query is extremely selective—we are only looking at one customer record. This begs the question: Would this be a good opportunity to use the nested loops anti-join access path? It is hard to imagine an execution plan for this query that is more efficient than what we have just seen, but let’s try. The query rewritten with a NOT IN clause and an NL_AJ hint looks like: SELECT C.short_name, C.customer_id FROM customers C WHERE C.customer_id = 22 AND C.customer_id NOT IN ( SELECT /*+ NL_AJ */ O.customer_id http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 13. Speeding Up Queries with Semi-Joins and Anti-Joins FROM orders O WHERE O.order_date > SYSDATE - 10 AND O.customer_id IS NOT NULL ) ORDER BY C.short_name; The execution plan is: Rows Row Source Operation ------- --------------------------------------------------- 0 SORT ORDER BY (cr=115 r=86 w=0 time=16824 us) 0 NESTED LOOPS ANTI (cr=115 r=86 w=0 time=16724 us) 1 TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=3 r=2 w=0 time=467 us) 1 INDEX UNIQUE SCAN CUSTOMERS_PK (cr=2 r=2 w=0 time=429 us)(object id34552) 10 TABLE ACCESS BY INDEX ROWID ORDERS (cr=112 r=84 w=0 time=16213 us) 110 INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=308 us)(object id 34555) You can see that Oracle obeyed our hint and used the nested loops anti-join access path. Unfortunately, the number of logical reads went up from 6 to 115 and the elapsed time went up by a factor of 50 to 16824 microseconds. Why? The row counts in the execution plan tell the story. Oracle looked at all 110 orders from the customer and found that 10 were placed less than 10 days ago. Oracle did a lot of unnecessary work. In this regard the filter access path seems much smarter than the nested loops anti-join access path. This behavior was seen on an Oracle 9.2.0.4 database. Let’s hope Oracle fixes this in a future release. Until then, I can’t think of a case where the nested loops anti-join access path would be useful. By the way, it turns out that Oracle would have chosen the nested loops anti-join access path even without the NL_AJ hint. So it turns out that this query was more efficient when written with a NOT EXISTS than a NOT IN. Again, hopefully this is just an anomaly that will be fixed in a future release of Oracle. Anti-Join Example #3 I recently used the concepts presented in this section to bring a client’s query down from 33 minutes (where one CPU on the server was pegged) to less than four seconds. It turned out to be another case of the implicit NVL() function in a NOT IN clause defeating an index. This time it was a query that showed a count of calls to the customer service center placed by users who did not belong to corporate customers. The original query looked like this: SELECT COUNT(*) FROM calls C WHERE C.requestor_user_id NOT IN ( SELECT CM.member_user_id FROM corp_members CM ); (Once again names have been changed to protect the innocent.) The execution plan looked like this: Rows Row Source Operation ------- --------------------------------------------------- 1 SORT AGGREGATE (cr=12784272 r=1864678 w=0 time=1978321835 us) 0 FILTER (cr=12784272 r=1864678 w=0 time=1978321817 us) 184965 TABLE ACCESS FULL CALLS (cr=3588 r=1370 w=0 time=979769 us) 61032 TABLE ACCESS FULL CORP_MEMBERS (cr=12780684 r=1863308 w=0 time=2948544268 us) It appears that Oracle is reading rows from the CALLS table one at a time and, for each row, is scanning the CORP_MEMBERS table until it finds a row with a matching MEMBER_USER_ID. Since there are over 180,000 calls in the system, it takes over 180,000 partial scans of the CORP_MEMBERS table to complete this query. (The CORP_MEMBERS table has 584 blocks below its high water mark.) A quick check of the CORP_MEMBERS table shows that the MEMBER_USER_ID column is indexed. However, the column is also nullable and so Oracle cannot use the index because of the implicit NVL() function associated with the NOT IN clause. Since the client saw records in the CORP_MEMBERS table with a null MEMBER_USER_ID as an unrelated special http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 14. Speeding Up Queries with Semi-Joins and Anti-Joins case, they wanted the query to ignore such records. So my first step was to rewrite the query with the extra WHERE predicate in the subquery that makes the NOT IN clause ignore nulls: SELECT COUNT(*) FROM calls C WHERE C.requestor_user_id NOT IN ( SELECT CM.member_user_id FROM corp_members CM WHERE CM.member_user_id IS NOT NULL ); The resulting execution plan was: Rows Row Source Operation ------- --------------------------------------------------- 1 SORT AGGREGATE (cr=3790 r=3906 w=420 time=5450615 us) 0 HASH JOIN ANTI (cr=3790 r=3906 w=420 time=5450591 us) 184965 TABLE ACCESS FULL CALLS (cr=3588 r=3480 w=0 time=646554 us) 81945 INDEX FAST FULL SCAN CORP_MEMBERS_USER_ID (cr=202 r=6 w=0 time=113178 us)(object id 34580) Now not only did Oracle use the index on the MEMBER_USER_ID column when accessing the CORP_MEMBERS table, but more importantly Oracle also switched to a hash anti-join access path. This means that Oracle only had to access the CORP_MEMBERS table once instead of more than 180,000 times. The execution time dropped by 99% to 5.45 seconds. The users were thrilled with the improvement and the problem was considered resolved. But to better understand Oracle’s data access paths and join methods, I benchmarked two more versions of the query. First I added a MERGE_AJ hint to cause a merge anti-join instead of a hash anti-join. This version of the query performed the same number of logical reads as the hash anti-join but used a little more CPU time. In this situation it looked like the merge anti-join used up more CPU time on sorting than did the hash anti-join. I also tried rewriting the query with a NOT EXISTS clause to see how it would perform using a filter operation instead of an anti-join access path. This version of the query looked like: SELECT COUNT(*) FROM calls C WHERE NOT EXISTS ( SELECT 1 FROM corp_members CM WHERE CM.member_user_id = C.requestor_user_id ); The execution plan was: Rows Row Source Operation ------- --------------------------------------------------- 1 SORT AGGREGATE (cr=125652 r=3489 w=0 time=3895569 us) 0 FILTER (cr=125652 r=3489 w=0 time=3895547 us) 184965 TABLE ACCESS FULL CALLS (cr=3588 r=3489 w=0 time=906699 us) 61032 INDEX RANGE SCAN CORP_MEMBERS_USER_ID (cr=122064 r=0 w=0 time=1842121 us)(object id 34580) This version of the query performed a lot more logical reads than the hash anti-join version, but actually used less CPU time. This query appeared to hammer on the blocks of the CORP_MEMBERS_USER_ID index very heavily. Most likely the whole index segment was sitting in the buffer cache and thus the logical reads were relatively inexpensive. Meanwhile, the large sort required by a hash anti-join was avoided and the result was a net reduction in CPU time. These examples demonstrate the value of the anti-join access paths and the significance of the way NOT IN treats null values. Again, your gains may not be as spectacular as those shown here, but hopefully you can see from this discussion how the proper use of anti-joins and NOT EXISTS and NOT IN clauses can make a certain class of queries much more efficient. These examples also demonstrate the importance of the difference between NOT EXISTS and http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]
  • 15. Speeding Up Queries with Semi-Joins and Anti-Joins NOT IN, and the apparent behavior (although not fully documented as far as I can tell) that Oracle typically will not apply anti-join access paths to a NOT EXISTS clause. Conclusion The semi-join and anti-join are two special ways of joining data from multiple tables in a query. Although SQL does not have a special symbol or keyword to designate a semi- or anti-join, we use the EXISTS, IN, NOT EXISTS, and NOT IN constructs to indicate when one of these joins is desired. Over the years Oracle has gotten better at optimizing semi-joins and anti-joins, and new data access paths make it possible to write very efficient queries that use these two types of joins. However, there are certain prerequisites to be met before Oracle can use these access paths, and sometimes hints or specific coding styles are necessary in order for the query optimizer to find the best execution plan possible. Understanding how semi-joins and anti-joins work—and how Oracle implements the relevant data access paths—will enable you to write very efficient SQL and dramatically speed up an entire class of queries. This in turn will make your employer (or client) very happy. About the Author Roger Schrag, OCP, has been an Oracle DBA and application architect for over 15 years. He began his career at Oracle Corporation on the Oracle Financials development team. In 1995, he founded Database Specialists, Inc., a boutique consulting firm specializing in Oracle database technology, remote administration, and performance tuning. Since that time, Roger has focused his expertise in the area of performance optimization. Roger is a frequent speaker at Oracle OpenWorld and the International Oracle Users Group (IOUG) Live conferences, where he has frequently been voted in the Top 10% of speakers. Roger has been an Oracle Masters Class instructor for the IOUG and is the current president of the Northern California Oracle Users Group (NoCOUG). He can be reached at rschrag@dbspecialists.com. Still Looking for Help on this Subject? Get a Consultation We would be happy to talk with you about our services and how our senior-level database team might help you. Call Database Specialists at 415-344-0500 or 888-648-0500 or fill out a free consultation request form. Complimentary Newsletter If you'd like to receive our complimentary monthly newsletter with database tips and new white paper announcements, sign up for The Specialist. Copyright © 2005 Database Specialists, Inc. http://www.dbspecialists.com http://www.dbspecialists.com/files/presentations/semijoins.html[08/29/2012 5:00:12 PM]