Friday, February 24, 2012

Compare data in two tables

Using a GROUP BY and some arithmetic it is possible to improve the performance of SQL comparing data of two tables. Even better, in Oracle this can be achieved with the PARTITION BY analytic function.
Below is an example comparing two tables (TAB1, TAB2) with two columns (col1 number, col2 varchar2) using a PARTITION BY analytic function:


create table tab1 (col1 number, col2 varchar2(3));
create table tab2 (col1 number, col2 varchar2(3));
 
insert into tab1 (col1, col2) values (3, 'ccc');
insert into tab1 (col1, col2) values (1, 'aaa');
insert into tab2 (col1, col2) values (1, 'aaa');
insert into tab2 (col1, col2) values (2, 'aaa');
 
select * from 
 -- list all results
 -- change to "select count(*) from" for a count of differences
(
   select rid, -- ROWID
   col1, -- column 1 of table to compare
   col2, -- column 2 of table to compare
   sum(summ) over (partition by col1, col2) summ
 -- must list all columns to compare in the PARTITION BY
   from
   (select 'In TAB1, NOT in TAB2' tab,
     rowid rid, -- need to alias ROWID
 -- list all/some columns of table to compare
     col1, -- column 1 of table to compare
     col2, -- column 2 of table to compare
 ------
     1 summ 
 -- summ is used for sum() operation,
 -- a matched row from other table will sum to 0
   from tab1
     UNION ALL 
 -- change to UNION for unique rows only
   select 'In TAB2, NOT in TAB1' tab,
     rowid rid, -- need to alias ROWID
 -- list all/some columns of table to compare
     col1, -- column 1 of table to compare
     col2, -- column 2 of table to compare
 ------
     -1 summ 
 -- summ used for sum() operation,
 -- a matched row from other table will sum to 0
   from tab2)
)
-- all the rows that didn't have a match have a non-zero sum()
where summ != 0
/   

RID                      COL1 COL       SUMM
------------------ ---------- --- ----------
AAAWK9AAGAAADNHAAB          2 aaa         -1
AAAWK8AAGAAADM/AAA          3 ccc          1

------------------------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |     4 |   164 |     7  (15)| 00:00:01 |
|*  1 |  VIEW                 |      |     4 |   164 |     7  (15)| 00:00:01 |
|   2 |   WINDOW SORT         |      |     4 |   124 |     7  (15)| 00:00:01 |
|   3 |    VIEW               |      |     4 |   124 |     6   (0)| 00:00:01 |
|   4 |     UNION-ALL         |      |       |       |            |          |
|   5 |      TABLE ACCESS FULL| TAB1 |     2 |    56 |     3   (0)| 00:00:01 |
|   6 |      TABLE ACCESS FULL| TAB2 |     2 |    56 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("SUMM"<>0) 


Two full table scans and a SORT. Compare this to what traditionally people have used:

(
 select 'IN TAB1, NOT TAB2', col1, col2
   from tab1
   MINUS
 select 'IN TAB1, NOT TAB2', col1, col2
   from tab2
 )
 UNION ALL
 (
 select 'IN TAB2, NOT TAB1', col1, col2
   from tab2
   MINUS
 select 'IN TAB2, NOT TAB1', col1, col2
   from tab1
 )
 /  

'INTAB1,NOTTAB2'        COL1 COL
----------------- ---------- ---
IN TAB1, NOT TAB2          3 ccc
IN TAB2, NOT TAB1          2 aaa

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     4 |   128 |    12  (75)| 00:00:01 |
|   1 |  UNION-ALL           |      |       |       |            |          |
|   2 |   MINUS              |      |       |       |            |          |
|   3 |    SORT UNIQUE       |      |     2 |    32 |            |          |
|   4 |     TABLE ACCESS FULL| TAB1 |     2 |    32 |     3   (0)| 00:00:01 |
|   5 |    SORT UNIQUE       |      |     2 |    32 |            |          |
|   6 |     TABLE ACCESS FULL| TAB2 |     2 |    32 |     3   (0)| 00:00:01 |
|   7 |   MINUS              |      |       |       |            |          |
|   8 |    SORT UNIQUE       |      |     2 |    32 |            |          |
|   9 |     TABLE ACCESS FULL| TAB2 |     2 |    32 |     3   (0)| 00:00:01 |
|  10 |    SORT UNIQUE       |      |     2 |    32 |            |          |
|  11 |     TABLE ACCESS FULL| TAB1 |     2 |    32 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------


Four full table scans, four sorts. For a large table the PARTITION BY wins.

No comments:

Post a Comment