jump to navigation

Counting the Cost #5 – accurate and fast July 9, 2009

Posted by mwidlake in performance.
Tags: , ,
trackback

I started off my blog by discussing quick ways of selecting count(*) from tables. I never completed the series, so I’m back to it.

This is the previous post on the topic.

It covers how using ALL_TAB_MODIFICATIONSand the NUM_ROWS column in ALL_TABLES can be combined to get a very accurate estimate of count(*) for most tables in most situations.

If this does not work for you {e.g no privilege to see the tables or flush the information), or you absolutely must have the count as accurately as you can right now, as a Manager is threatening you, this is another method using a count table. It can be expanded to hold more than just count(*) and is particularly useful for partitioned tables {which are often huge}.

Basically, every time you select count(*)on a table {or even eg select count(*) where status=0} you are probably scanning a lot of table you scanned last time you did count(*) on that table. If you are not deleting records from that table, that scanning effort is effectively a waste of time – it won’t have changed since last time you did select count(*). So, as a developer or DBA, you can decide you can trust the last count(*), so you need only count more recent records.

If you have an ascending primary key {traditionally a number sourced from a sequence} you can store the max(PK) and count(*) in a simple table. You then count all the records with a PK greater than the stored one. I’ve used count tables several times in the past and two enhancements invariably crop up, so I am going to include them in the below example of how you can implement this.

You need the partitioning option to run the test yourself, so sorry you will need Enterprise Edition with the option. If you downloaded Oracle to your PC/linux box to play on, you probably have it. {it never ceases to amaze me that Oracle Corp will only allow Partitions to be used in EE edition and it is an extra cost option!}

You need to create tablespaces to hold the partition tables {or alter the test script to use existing tablespaces}. Feel free to take and modify this script  to create the tablespaces.

This is my test script .

This is my basic setup:-
NAME VALUE
—————————— ——————–
compatible 10.2.0.1.0
cpu_count 1
db_block_size 8192
db_file_multiblock_read_count 8
optimizer_mode ALL_ROWS
sga_target 612368384
sort_area_size 65536

My test table:-

create table test_p
 id    number(8) not null
 ,status number(1) not null
,num_1 number(3) not null
 ,num_2  number(3)
,vc_pad varchar2(2000))
tablespace parts_curr
 partition by range (id)
 (partition id_10k values less than (10000)
tablespace parts_old
 ,partition id_20k values less than (20000)
tablespace parts_old
 ,partition id_30k values less than (30000)
tablespace parts_old
 ,partition id_40k values less than (40000)
tablespace parts_old
,partition id_max values less than (maxvalue)
tablespace parts_curr
 )
/
-- {41999 records created - see script}
alter table test_p
add constraint pt_pk primary key (id)
using index
local(
partition id_10k tablespace parts_old
 ,partition id_20k tablespace parts_old
,partition id_32k tablespace parts_old
 ,partition id_40k tablespace parts_old
,partition id_max tablespace parts_curr)
/
-- gather stats
begin
dbms_stats.gather_table_stats(ownname=>user
  ,tabname=>'TEST_P',estimate_percent=>100,granularity=>'ALL');
  end;
/-- create the count table
create table test_p_running_count
  (max_id	number(8) not null
  ,row_count number(8)
,constraint tprc_pk primary key(max_id)
  )
/

OK, let’s count the number of records in TEST_P and then do the same but insert the data collected into the count table:-

select count(*) from test_p;

  COUNT(*)
----------
     41999
Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------
|   0 | SELECT STATEMENT       |       |     1 |    26   (4)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |       |     1 |            |          |       |       |
|   2 |   PARTITION RANGE ALL  |       | 41999 |    26   (4)| 00:00:01 |     1 |     5 |
|   3 |    INDEX FAST FULL SCAN| PT_PK | 41999 |    26   (4)| 00:00:01 |     1 |     5 |
-----------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        111  consistent gets
          0  physical reads

insert into test_p_running_count (max_id,row_count)
select max(id),count(*) from test_p;

Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| P stop |
--------------------------------------------------------------
|   0 | INSERT STATEMENT       |       |     1 |     5 |    26   (4)| 00:00:01 |       |     |
|   1 |  SORT AGGREGATE        |       |     1 |     5 |            |          |       |     |
|   2 |   PARTITION RANGE ALL  |       | 41999 |   205K|    26   (4)| 00:00:01 |     1 |   5 |
|   3 |    INDEX FAST FULL SCAN| PT_PK | 41999 |   205K|    26   (4)| 00:00:01 |     1 |   5 |
---------------------------------------------------------------

Statistics
----------------------------------------------------------
         26  recursive calls
         21  db block gets
        115  consistent gets
          0  physical reads

So, it takes 111 consistent gets and an estimated cost of 26 to do a count(*) on this relatively small table, doing a fast full scan of the PK index.  It takes only slightly more effort to create a record to hold this information.

{As an aside, it is curious that the total estimated cost of the insert is the same as the select, suggesting naughtily that the insert is free, but that’s not my topic today}.

I now create 100 records by running a little insert script.

Let’s select the number of records and the new max(ID) created since we last stored information in the count table:

select max(id),count(*) from test_p
where id>(select max(max_id) from test_p_running_count)
 /
   MAX(ID)   COUNT(*)
---------- ----------
     42099        100

Execution Plan
----------------------------------------------------------
| Id  | Operation                     | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
Pstart| Pstop |
--------------------------------------------------------------
|   0 | SELECT STATEMENT              |         |     1 |     5 |     4   (0)| 00:00:01 |      |       |
|   1 |  SORT AGGREGATE               |         |     1 |     5 |            |          |      |       |
|   2 |   PARTITION RANGE ITERATOR    |         |  2100 | 10500 |     2   (0)| 00:00:01 |  KEY |     5 |
|   3 |    INDEX RANGE SCAN           | PT_PK   |  2100 | 10500 |     2   (0)| 00:00:01 |  KEY |     5 |
|   4 |     SORT AGGREGATE            |         |     1 |    13 |            |          |      |       |
| 5 | INDEX FULL SCAN (MIN/MAX)| TPRC_PK | 1 | 13 | 2 (0)| 00:00:01 | | |
---------------------------------------------------------------
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads

The CBO scanned the index of the TEST_P_RUNNING_COUNT table using an index full scan (min/max), which means Oracle basically worked down the index to get a min or max value, which is highty efficient. This value was then used to do an index range scan of only one index partition (partition number 5).

Estimated cost was 4 and the execution took 3 consistent gets {NB this is for the second execution, it is 12 consistent gets an 4 recursive calls the first time, when the statement is parsed}.

But that only got the new max(id) and how many records have been created since you last recorded it, 100. What you really want is the count of records since you last recorded the count and max(id) PLUS the count as it was then. ie the full count now.
This is not straightforward in a single SQL statement as you want the count, a goup function, from one table and the single row value from another. You can do this though. You basically write SQL statements to do the bits you want and then select from them. My example script shows how I build up to this, but the final statement I came up with {and you might well be able to come up with better options yourself} is:

select change.max_id new_max,orig.max_id orig_max
,change.rowcount chg_c,orig.rowcount orig_c 
,change.rowcount+orig.rowcount tot_c
from
(select max(id) max_id,count(*) rowcount,1 tabjoin
  from test_p
  where id>(select max(max_id) from test_p_running_count) ) change
,(select max(max_id) max_id,max(row_count) rowcount,1 tabjoin
  from test_p_running_count)                                            orig
where change.tabjoin=orig.tabjoin
 /

NEW_MAX ORIG_MAX  CHG_C     ORIG_C    TOT_C
---------- ---------- ---------- ---------- -----------------------------
     42099      41999        100      41999                         42099

Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CP
U)| Time | Pstart| Pstop |
----------------------------------------------------------
|   0 | SELECT STATEMENT                |                      |     1 |    58 |     7   (0)| 00:00:01 |       |       |
|   1 |  NESTED LOOPS                   |                      |     1 |    58 |     7   (0)| 00:00:01 |       |       |
|   2 |   VIEW                          |                      |     1 |    29 |     4   (0)| 00:00:01 |       |       |
|   3 |    SORT AGGREGATE               |                      |     1 |     5 |  |          |       |       |
|   4 |     PARTITION RANGE ITERATOR    |                      |  2100 | 10500 |     2   (0)| 00:00:01 |   KEY |     5 |
|*  5 |      INDEX RANGE SCAN           | PT_PK                |  2100 | 10500 |     2   (0)| 00:00:01 |   KEY |     5 |
|   6 |       SORT AGGREGATE            |                      |     1 |    13 |  |          |       |       |
| 7 | INDEX FULL SCAN (MIN/MAX)| TPRC_PK | 1 | 13 | 2 (0)| 00:00:01 | | |
|*  8 |   VIEW                          |                      |     1 |    29 |     3   (0)| 00:00:01 |       |       |
|   9 |    SORT AGGREGATE               |                      |     1 |    26 |  |          |       |       |
|  10 |     TABLE ACCESS FULL           | TEST_P_RUNNING_COUNT |     1 |    26 |     3   (0)| 00:00:01 |       |       |
--------------------------------------------------------

Statistics
---------------------------------------------------------
          0  recursive calls
          0  db block gets
         10  consistent gets
          0 physical reads

{you might want to click on “show plain” in the above code window for a clearer layout}
That’s an estimated cost of 7 by the CBO and 10 consistent gets {again, the second itteration}. Remember, scanning the whole partitioned table was costed at 27 and took 111 consistent gets.

You can convert that sql select statement into an insert and use it to update your count table:-

insert into test_p_running_count (max_id,row_count)
 select change.max_id
--,orig.max_id,change.rowcount,orig.rowcount
 ,change.rowcount+orig.rowcount
 from
(select max(id) max_id,count(*) rowcount,1 tabjoin
   from test_p
   where id>(select max(max_id) from test_p_running_count) ) change
,(select max(max_id) max_id,max(row_count) rowcount,1 tabjoin
   from test_p_running_count) orig
where change.tabjoin=orig.tabjoin
 /

That is one of the improvements usually asked for when you have a count table, people want to see what the counts were historically, seeing as you are now holding the data in a table {why is it so many people suddenly want to keep information once it is seen in a table?}. That is why the above does an insert and not an update and my code selects the max(max_id) and max(rowcount). It is not perfect, if no records have been inserted into your massive table since last the max(id) and row count was inserted, you will get a duplicate primary key error. You could add a datetime column to get around this.

Also, my SQL is a little naughty in that I select the max(id) and max(rowcount). There is nothing in the table design to enforce the unspoken rule that they both appear in the “last record” of the table, but as a human you can see that they do. Again, the use of a datetime can help with this.

The final bit for today. The above two sql statments look a bit nasty, certainly a lot more complex than “select count(*) from table”. Also, they are not as efficient as they could be, the selecting max(row_count) is not supported by an index and could slow down over time. I could tweak the SQL statement even further to work around this, but a simple piece of PL/SQL does the job better, and is what I usually end up implementing {within a package, along with a load of other count table functions for several large tables}.

Here is a script to create a function . that gets the current count(*) for the table.

--get_new_max
create or replace function get_tp_rowcount
return number
as
--
v_rtn number;
v_extra_rc number;
v_last_rc   number;
v_max_id    number;
v_last_max  number;
begin
  select max_id,row_count
into v_last_max,v_last_rc
  from test_p_running_count
  where max_id=(select max(max_id) from test_p_running_count);
  select count(*),max(id)
into v_extra_rc,v_max_id
  from test_p
  where id>v_last_max;
dbms_output.put_line(to_char(v_last_rc)
||'~'||to_char(v_last_rc)
              ||'~'||to_char(v_last_max)
              ||'~'||to_char(v_max_id));
v_rtn :=v_last_rc+v_extra_rc;
return v_rtn;
end;
/

And, as you can see below, it remains an efficient trick. I’ll leave it to you to create a procedure to update the count table and with the observation that you can then argue for several hours as to how often you update the count table… :-)

select get_tp_rowcount from dual;

GET_TP_ROWCOUNT
---------------
          42299

1 row selected.

Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
| Id  | Operation        | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------
|   0 | SELECT STATEMENT |      |     1 |     2   (0)| 00:00:01 |
|   1 |  FAST DUAL       |      |     1 |     2   (0)| 00:00:01 |
-------------------------------------------------------------

Statistics
----------------------------------------------------------
          2  recursive calls
          0  db block gets
          5  consistent gets
          0  physical reads

--
--

select count(*) from test_p;

  COUNT(*)
----------
     42299

Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------
|   0 | SELECT STATEMENT       |       |     1 |    26   (4)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |       |     1 |            |          |       |       |
|   2 |   PARTITION RANGE ALL  |       | 41999 |    26   (4)| 00:00:01 |     1 |     5 |
|   3 |    INDEX FAST FULL SCAN| PT_PK | 41999 |    26   (4)| 00:00:01 |     1 |     5 |
----------------------------- ----------------------------
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        111  consistent gets
          0  physical reads

About these ads

Comments»

1. Blogroll Report 03/07/2009 – 10/07/2006 « Coskan’s Approach to Oracle - July 10, 2009

[...] Martin Widlake – Counting the Cost #5 – accurate and fast [...]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 156 other followers

%d bloggers like this: