jump to navigation

Row Level Security 3 – In Pictures! November 22, 2012

Posted by mwidlake in database design, security.
Tags: , ,

<..Part one intro and examples
<….Part two Permissions

I’ve noticed that there has not been a lot of traffic on this series on Row Level Security (data masking) so far – maybe due to how I am presenting the material? So here is a summary to date in picture/diagram format:

This is the flow of logic as to whether Row Level Security is applied to the SQL statement. Usually this is a query and usually it is on a table but the policy can be applied to other DML and also on views and synonyms:

These are the steps you need to add RLS to a table:

Below is a summary of how oracle fires and uses RLS functions and the predicates they provide:

An finally, a summary of what happens if an RLS function is invalid:

If you have not looked at the previous posts on this topic, more details are in there.


Row Level Security Part 2 – permissions November 20, 2012

Posted by mwidlake in internals, security, SQL.
Tags: , ,

<..Part 1, introduction..
..Part 3 summary in pictures..>

In this second post on the topic of “an introduction to Row Level Security” I want to cover a few things about what permissions you need to implement RLS and some of the consequences. In my introduction in part one I just said my main user has “DBA type Privileges”.

{NB This is all on Oracle V11.2 and I believe everything below is applicable to V10 as well. Also, I should point out that I am not an Oracle security expert – but despite repeatedly saying this, it seems like at least once a year I am asked to improve a system’s security on the grounds of “more than we have now is an improvement”}.

Firstly, you need the permission to create and drop RLS policies. This is controlled by having the EXECUTE privilege on SYS.DBMS_RLS. That is it. If you can execute this package you can create and drop RLS Policies, enable and disable them and also create, manage and use Policy Groups.

Obviously EXECUTE on SYS.DBMS_RLS is a privilege you need to restrict to only “empowered” accounts- but you also want to be very careful about your definition of empowered. If an account has this privilege, it can alter (disable!) any RLS policies that ANYONE created on the instance. You may need to think carefully about having several accounts that can alter all the RLS policies across all schemas in your instance.

You might want to create a single user that has the EXECUTE on SYS.DBMS_RLS and use that to implement and control RLS across the instance {and even have the account locked when it is not needed}. Of course, under normal Oracle anyone with SYS access can run the procedure – but you have SYS access under control already, don’t you?… :-).

By default, who has this privilege?

MDW> @obj_privs
Enter value for obj_name: dbms_rls
Enter value for obj_own: sys
GRANTEE              OWNER        TABLE_NAME             GRANTOR  PRIVILEGE    GRA
-------------------- ------------ ---------------------- -------- ------------ ---
EXECUTE_CATALOG_ROLE SYS          DBMS_RLS               SYS      EXECUTE      NO
XDB                  SYS          DBMS_RLS               SYS      EXECUTE      NO
WMSYS                SYS          DBMS_RLS               SYS      EXECUTE      NO

-- and who has that EXECUTE_CATALOG_ROLE?
select * from dba_role_privs where granted_role='EXECUTE_CATALOG_ROLE'

GRANTEE                        GRANTED_ROLE                   ADM DEF
------------------------------ ------------------------------ --- ---
DBA                            EXECUTE_CATALOG_ROLE           YES YES
SYS                            EXECUTE_CATALOG_ROLE           YES YES

So, our old friend the DBA role has that privilege, via the EXECUTE_CATALOG_ROLE and IMP_/EXP_FULL_DATABASE.

Let’s go back to creating policies. I’m sticking to row-level RLS for now, not column-level. When you create a policy you basically are saying “when object X is accessed, run function Y and apply the resulting string to all access, as a predicate”

Lifted an example from the manual, but adding in a couple of bits the authors were too lazy to put in { :-) } :-

 object_schema   =>   'hr'
,object_name     => 'employees'
,policy_name     => 'emp_policy'
,function_schema => 'hr_rls'
,policy_function => 'emp_sec'
,statement_types => 'select');

The function HR_RLS.EMP_SEC returns a string P1

When someone issues a statement that access the HR.EMPLOYEES table the function HR_RLS.EMP_SEC is executed as part of the parse and, internally, the following view is created and placed into the orginal statement:

SELECT * FROM hr.employees WHERE P1;

As this takes place as part of the parse part of the statement processing, errors with the function are found before any effort to access the actual data occur. It is also why it is not simple to avoid RLS – it is done as part of the overall SQL processing carried out by Oracle.

If the function is invalid, generates an error or the supplied “WHERE predicate” P1 causes the final statement to be syntactically incorrect, you get an error.
At this point Oracle fails “secure” and simply won’t fire the SQL statement. Even if you own the tables and have full privilege access to them. I think it is worth a quick demonstration

I am logged on as the owner of the table and RLS function, MDW:-

MDW> select * from test1
        ID CP_NAME                        OTHER_NAME                        MASK_FL
---------- ------------------------------ ------------------------------ ----------
         1 ERIC                           ERIC THE RED                            1
         2 BROWN                          BOB THE BROWN                           2
         3 GREEN                          GARY THE GREEN                          1
         4 BLUE                           BILL THE BLUE                           3
         5 BLACK                          DAVE THE BLACK                          4
         6 PURPLE                         PETE THE PURPLE                         5
         7 RED                            ROGER THE RED                           6
--All works OK
MDW> @rls_func_damage
 v_schema IN VARCHAR2
,v_objname IN VARCHAR2)
con VARCHAR2 (200);
-- The line below states loser not user
con := (loser='''||v_schema||''' or MASK_FL > 3)';
RETURN (con);
END hide_rows;
Function created.
-- Let us check the predicate coming from the function
select hide_rows(v_schema=>user,v_objname=>'TEST1') from dual;

(loser='MDW' or MASK_FL > 3)

-- "loser" is not a valid piece of syntax or a known variable. So...
select * from TEST1
ERROR at line 1:
ORA-28113: policy predicate has error

-- I now damage the statement creating the function so that it is syntactically incorrect
-- missing quote
con := (loser='''||v_schema||''' or MASK_FL > 3)';

Warning: Function created with compilation errors.

select * from TEST1
ERROR at line 1:
ORA-28110: policy function or package MDW.HIDE_ROWS has error

--re-create the function as I started with

Function created.

select * from TEST1
        ID CP_NAME                        OTHER_NAME                        MASK_FL
---------- ------------------------------ ------------------------------ ----------
         1 ERIC                           ERIC THE RED                            1
         2 BROWN                          BOB THE BROWN                           2
         7 RED                            ROGER THE RED                           6
-- Back to a working system

As you can see, you get ORA-28110 if the function used by the RLS policy is damaged and ORA-28113 if the function returns a non-valid WHERE Predicate. In either case, the data in those tables is now inaccessible. That could be a problem…

It is of course very interesting that RLS is kind of “blind” to the functions it uses until it uses it. You can alter RLS activity if you can alter the functions.

Last post I mentioned that, as you RLS security is based on these functions, you should give thought to who can see them and change them. Hopefully the above highlights that. There is another access aspect to these functions that is key.

In the first example where I use DBMS_RLS.ADD_POLICY, I state function_schema – this is who owns the PL/SQL function. It defaults to the person who creates the Policy. The function is fired in the standard way with DEFINER RIGHTS – i.e. the rights of the function owner. {You could override this when creating the function such that it runs with executioner rights – but that strikes me as a potentially very confusing and a source of much pain and anguish}.

So the function can see what the owner can see, not the executioner. This has two key effects:

  • You can restrict the access to any DB objects that the function requires to the owner of the function. The end user does not need to have access and it may be more secure if they do not.
  • You have to grant access to objects directly to the function owner, via either object grants or system grants. Stored PL/SQL does not “see” roles if executed with definer rights.

So eg if your function is owned by MDW and references a master table called SEC_MASTER.MASKED_USERS then you need to grant select on SEC_MASTER.MASKED_USERS to MDW. No using roles.
{of course, calling a table MASKED_USERS might give anyone acquiring access to the DB a clue that it is important. I tend to call such “security important” tables things that are “obviously” boring, like ROTA_DATA.}

Finally, that WHERE Predicate P1 is in effect added to your SQL statement. It can be pretty much anything. It can reference other database objects. If it references other database objects THE EXECUTING USER MUST BE ABLE TO SEE THEM. Not the owner of the function; that function has been fired and the WHERE Predicate generated already. On actually running the SQL, the executing user must have access to all objects referenced, including those introduced by the P1 WHERE Predicate. Oh, and remember synonyms!

There is there a way to sidestep RLS and you sometimes need to.

If your logon has the EXEMPT ACCESS POLICY system privilege then RLS functions are utterly ignored by you. No check is made to see if policies apply to your SQL statements, let alone run them. You may need to use this privilege if you need to access functionality that RLS has issues with (see comments by Dom Brooks and Tony Sleight on my first post). You might also need it if you have errors with the policies but you need access to the data to keep the business moving. You probably need a locked-down user somewhere with the EXEMPT ACCESS POLICY privilege if you use RLS.
Of course, great care has to be taken in making use of this privilege as it side-steps all RLS_based security.
If you have master users that have full access to the data, this privilege also removes the overhead of the SQL engine identifying any policies, firing the function and including the extra predicates.

One very final point. I have found myself writing RLS functions that all check if the user was the owner of the table and, if so, allowing them to see all data. By granting EXEMPT ACCESS POLICY to that user I was able to do away with those checks and make the functions much simpler. In this case I did not so much think of EXEMPT ACCESS POLICY turning off security but the lack of it turning it on for restricted users.

That will do for part 2.

Row Level Security Part 1 November 15, 2012

Posted by mwidlake in database design, security.
Tags: , ,

I’ve been working a little on Row Level Security (RLS) recently and wanted to mention a few things, so first some groundwork.

If you want to limit the rows certain users can see, you might think to use views or you might think to use RLS (part of VPD – Virtual Private Database). You can also (from V10 I think) limit which columns users can see. An example is probably the best way to show this. I’m doing this on Oracle

I have two users, MDW and MDW_OFFSHORE. MDW has DBA-type privileges and MDW_OFFSHORE has connect, resource and one or two other simple privs. I will now demonstrate creating and populating a simple table under MDW, adding RLS to it and how it alters what MDW_OFFSHORE sees.

MDW> -- create a table and show row level and column level rls
MDW> --
MDW> drop table TEST1 purge;
MDW> --
MDW> create table test1
  2  (id         number       not null
  3  ,cp_name    varchar2(30) not null
  4  ,other_name varchar2(30) not null
  5  ,MASK_FL    number
  6  )
  7  /
MDW> insert into test1 VALUES (1,'ERIC','ERIC THE RED',1);
MDW> insert into test1 VALUES (2,'BROWN','BOB THE BROWN',2);
MDW> insert into test1 VALUES (3,'GREEN','GARY THE GREEN',1);
MDW> insert into test1 VALUES (4,'BLUE','BILL THE BLUE',3);
MDW> insert into test1 VALUES (5,'BLACK','DAVE THE BLACK',4);
MDW> insert into test1 VALUES (6,'PURPLE','PETE THE PURPLE',5);
MDW> insert into test1 VALUES (7,'RED','ROGER THE RED',6);
MDW> --

        ID CP_NAME                        OTHER_NAME                        MASK_FL
---------- ------------------------------ ------------------------------ ----------
         1 ERIC                           ERIC THE RED                            1
         2 BROWN                          BOB THE BROWN                           2
         3 GREEN                          GARY THE GREEN                          1
         4 BLUE                           BILL THE BLUE                           3
         5 BLACK                          DAVE THE BLACK                          4
         6 PURPLE                         PETE THE PURPLE                         5
         7 RED                            ROGER THE RED                           6
MDW> grant all on test1 to mdw_onshore;
MDW> grant select on test1 to mdw_offshore;
MDW> --
--This is where I create a function to generate the RLS rules
  2   v_schema IN VARCHAR2
  3  ,v_objname IN VARCHAR2)
  5  con VARCHAR2 (200);
  6  BEGIN
  7  -- note the use of backets to make it an atomic test
  8  con := '(user='''||v_schema||''' or MASK_FL > 3)';
  9  RETURN (con);
 10  END hide_rows;
 11  /
MDW> --
-- Now add the actual RLS policy to the table TEST1 using the function hide_rows
  2    sys.DBMS_RLS.ADD_POLICY (
  3      object_schema     => 'MDW'
  4     ,object_name       => 'TEST1'
  5     ,policy_name       => 'hide_rows_policy'
  6     ,policy_function   => 'hide_rows'
  7     ,function_schema   => 'MDW'
  8                            );
  9  END;
 10  /
MDW> --

That’s it. In effect you create a function, in this hide_rows, that creates a WHERE predicate (more on that later) and associate it with the table by creating a policy hide_ rows_policy.

Now let’s see what impact it has:

-- as the main user:
MDW> select * from test1
  2  /
        ID CP_NAME                        OTHER_NAME                        MASK_FL
---------- ------------------------------ ------------------------------ ----------
         1 ERIC                           ERIC THE RED                            1
         2 BROWN                          BOB THE BROWN                           2
         3 GREEN                          GARY THE GREEN                          1
         4 BLUE                           BILL THE BLUE                           3
         5 BLACK                          DAVE THE BLACK                          4
         6 PURPLE                         PETE THE PURPLE                         5
         7 RED                            ROGER THE RED                           6
7 rows selected.
-- now as the offshore user:
mdw_offshore> select * from mdw.test1
  2  /
        ID CP_NAME                        OTHER_NAME                        MASK_FL
---------- ------------------------------ ------------------------------ ----------
         5 BLACK                          DAVE THE BLACK                          4
         6 PURPLE                         PETE THE PURPLE                         5
         7 RED                            ROGER THE RED                           6
3 rows selected.

As you can see, mdw_offshore is seeing only a subset of rows in the table. This subset is determined by the function hide_rows. This function returns a WHERE PREDICATE that is added to any query run against the table. It is actually very simply to see what that predicate is, you just call the function passing in the table owner and name:

MDW> select hide_rows(v_schema => user,v_objname=>’TEST1′) from dual;

(user=’MDW’ or MASK_FL > 3)

Note that my refernce to v_owner in the function comes out as the text MDW – a way to ensure the predicate does not filter data looked at by the table owner. This implies of course that yes, you can stop the table owner from seeing data!

By default, this extra WHERE PREDICATE is applied to any SELECT, INSERT, UPDATE or DELETE statement run against this table. The function has to accept the v_schema and v_objname parameters and output a varchar2 string. At the time of being called as part of a RLS policy, oracle passes in the owner and name of the object being masked (which is usually a table and sometimes a view).
There are a few things I think it is important to keep in mind in respect of this extra Where Predicate:-

  • You have very little control over what other where predicates will be applied to DML against this table, so keep it self-contained – which is why I bound it in brackets.
  • It has to be syntactically correct but no check is made for this when you create the function – it is simply a string being outputted, as far as Oracle is concerned, at the time of creation.
  • It can be any valid Where Predicate you like, so long as it does not reference other rows in the table with the policy is applied to
  • This could be fired a lot. The performance impact could be considerable, which is why CONTEXTs are often used (as they are memory resident and can be configured by eg logon triggers) but I have skipped over that. Go look at the manuals.

I mentioned that the masking of data could be applied to just a column rather than whole rows. So let’s demonstrate that. I can do this by simply dropping and reapplying the same function in a different way:

-- first drop the policy, simply by stating the object owner, name and policy
  2    sys.dbms_rls.drop_policy(
  3      object_schema     => 'MDW'
  4     ,object_name       => 'TEST1'
  5     ,policy_name       => 'hide_rows_policy'
  6                            );
-- and recreate - note the two new paramters 
  7    sys.DBMS_RLS.ADD_POLICY (
  8      object_schema     => 'MDW'
  9     ,object_name       => 'TEST1'
 10     ,policy_name       => 'hide_rows_policy'
 11     ,policy_function   => 'hide_rows'
 12     ,function_schema   => 'MDW'
 13     ,sec_relevant_cols => 'OTHER_NAME,MASK_FL'
 14     ,sec_relevant_cols_opt => sys.dbms_rls.ALL_ROWS
 15                            );
 16  END;
 17  /
PL/SQL procedure successfully completed.
-- as the main user
MDW> select * from test1
  2  /
        ID CP_NAME                        OTHER_NAME                        MASK_FL
---------- ------------------------------ ------------------------------ ----------
         1 ERIC                           ERIC THE RED                            1
         2 BROWN                          BOB THE BROWN                           2
         3 GREEN                          GARY THE GREEN                          1
         4 BLUE                           BILL THE BLUE                           3
         5 BLACK                          DAVE THE BLACK                          4
         6 PURPLE                         PETE THE PURPLE                         5
         7 RED                            ROGER THE RED                           6
7 rows selected.
-- now as the offshore user:
mdw_offshore> select * from mdw.test1
  2  /
        ID CP_NAME                        OTHER_NAME                        MASK_FL
---------- ------------------------------ ------------------------------ ----------
         1 ERIC
         2 BROWN
         3 GREEN
         4 BLUE
         5 BLACK                          DAVE THE BLACK                          4
         6 PURPLE                         PETE THE PURPLE                         5
         7 RED                            ROGER THE RED                           6
7 rows selected.
-- but what can look a little odd is that if mdw_offshore describes the table..
mdw_offshore> desc mdw.test1
 Name                                                  Null?    Type
 ----------------------------------------------------- -------- --------------
 ID                                                    NOT NULL NUMBER
 CP_NAME                                               NOT NULL VARCHAR2(30)
 OTHER_NAME                                            NOT NULL VARCHAR2(30)
 MASK_FL                                                        NUMBER
-- they see that OTHER_NAME is NOT NULL but...
mdw_offshore> select count(*) from mdw.test1 where other_name is null;


That’s rather nice, I think. The same function can be used and rather than hide the rows it masks the stated columns. The sec_relevant_cols => ‘OTHER_NAME,MASK_FL’ parameter states the columns to apply the function on and the sec_relevant_cols_opt => sys.dbms_rls.ALL_ROWS parameter means “show a row even if a column or columns are hidden”. Without it, you only see rows if you do not reference the masked columns (even if indirectly via eg “select (*)”) – see this nice article by Tim Hall if you want an example of that.

With column masking the string from the function is not actually used as a Where Predicate but as a Boolean TRUE/FALSE test. If the test results in TRUE, you see the column(s), otherwise they are hidden.

As you can see from the example, this column-level RLS can result in some odd looking (though intentional) results. A DESC of the table shows the column OTHER_NAME is mandatory but records are found with it set to null. There can also be the situation where a user can insert a value for the masked column, but then cannot see it when they select the data back, though other users can see the populated column.

A limitation of column-level masking is that you can only mask columns with NULL and not, for example, a string of “MASKED”. Thus if a column you are masking is not manadatory, how would you distinguish between a true NULL and a masked NULL? {A nod of thanks goes to Pete Finnigan for confirming this “NULL only” limitation.}

One of the appeals of RLS over eg views is that it is hard to circumvent, as it is applied at the level of the SQL. Another is that the same function can be applied to many tables and might also be applied to both column and row masking, if designed well.

That will almost do for an introduction to the use of RLS to mask rows and columns. However I’ll finish with just a few important points that follow on from this being a security feature.

  • I demonstrated that I could see what the function responsible for this security actually returns. If you do not protect unauthorised execution of that function, that could be a security hole.
  • I masked the column that is used to control if the OTHER_NAME column is masked. Would you need to also do this?
  • Data that is missing is data that could be of interest.
  • Anyone who can alter that function can undo your security.
  • Anyone who can view DBA_SOURCE (eg has SELECT ANY DICTIONARY) can view the function and see how it works.
  • As you can appreciate from the above, apply Security securely is a right pain in the proverbial.

DBMS_APPLICATION_INFO for Instrumentation September 3, 2012

Posted by mwidlake in development, Instrumentation, performance.
Tags: , , , ,

I just wanted to put up a post about DBMS_APPLICATION_INFO. This is a fantastic little built-in PL/SQL package that Oracle has provided since Oracle 8 to allow you to instrument your code. i.e record what it is doing. I’m a big fan of DBMS_APPLICATION_INFO and have used it several times to help identify where in an application time is being spent and how that pattern of time has altered.

Some PL/SQL developers use it and some don’t. It seems to me that it’s use comes down to where you work, as most PL/SQL developers are aware of it – but not everyone uses it (a friend of mine made the comment recently that “all good PL/SQL developers use it“. I can understand his point but don’t 100% agree).

It is incredibly easy to use. You use the procedures SET_MODULE(module_name,action_name), SET_ACTION(action_name) and SET_CLIENT_INFO(client_info) to set the values of the corresponding columns in V$SESSION for your current session. This is done in memory, there is no inserting or updating of any table rows, so it is incredibly quick and light-weight. The below shows a subset of V$SESSION including the columns that get set:

desc v$session
Name Null? Type
----------------------------------------- -------- ----------------

Just something odd to note. MODULE, ACTION and CLIENT_INFO are shown at VC(64) but if you check the documentation you will see that:

MODULE is limited to 48 bytes
ACTION is limited to 32 bytes
CLIENT_INFO can be set to the whole 64 bytes

I have no idea why MODULE and ACTION are limited in this way but it might be historic, backward compatibility with prior versions.

As a quick example of their use:

test_11_2> --demo_dai.sql
test_11_2> -- quick demo of dbms_application_info
test_11_2> COL sid form 9999
test_11_2> col username form a12
test_11_2> col module form a18
test_11_2> col action form a22
test_11_2> col client_info form a15
test_11_2> -- set module and action
test_11_2> select sid,username,module,action,client_info from v$session where sid=650;

----- ------------ ------------------ ---------------------- ---------------
test_11_2> --
test_11_2> -- Update the action within a module
test_11_2> select sid,username,module,action,client_info from v$session where sid=650;

----- ------------ ------------------ ---------------------- ---------------
650 ERIC SALES_LOAD Update all Prices

test_11_2> -- clear them
test_11_2> select sid,username,module,action,client_info from v$session where sid=650;

----- ------------ ------------------ ---------------------- ---------------
650 ERIC

test_11_2> -- you can set just an action
test_11_2> EXEC DBMS_APPLICATION_INFO.SET_action(ACTION_NAME=>'orphan process')
test_11_2> select sid,username,module,action,client_info from v$session where sid=650;

----- ------------ ------------------ ---------------------- ---------------
650 ERIC orphan process

test_11_2> -- set them to something sensible
test_11_2> select sid,username,module,action,client_info from v$session where sid=650;

----- ------------ ------------------ ---------------------- ---------------

test_11_2> -- client_info can be longer
test_11_2> EXEC DBMS_APPLICATION_INFO.SET_client_info(client_info=>'entered this module at '||sysdate)
test_11_2> select sid,username,module,action,client_info from v$session where sid=650;

----- ------------ --------------- --------------- ----------------------------------------
650 ERIC RETURNS_LOAD MATCH_TO_SALE entered this module at 03-SEP-2012 13:07

The intention is that you set the MODULE as you enter a functional chunk of your application (be it a PL/SQL package or a chunk of JAVA, whatever {but be aware of the stateless nature of systems using middle tiers and pooled connections}) and then update the ACTION as you progress. You can set an action with no module, as demonstrated above, but I would recommend against it. On that topic I would make the following recommendations:

  • Set the MODULE at the start, eg as you enter a PL/SQL package’s main module
  • Set the MODULE before you do all the validation – you may wish to see if that validation is as quick as you think.
  • Use SET_ACTION to update the action prudently. ie when the activity of the code alters but not all the time. It will come clearer with use, but what you are after is to be able to judge the relative run durations of parts of the application. Too much detail can mask the overall picture. If you can alter code you can always add more instrumentation.
  • Clear the MODULE and ACTION explicitly as you exit a package. The activity is lightweight and you do not want any code that lacks this instrumentation to inherit values from calling code.
  • Do not set an ACTION without having set the MODULE earlier in the code.
  • Keep the text of these values short and specific.

You do not have to look at V$SESSION to see these values, which is good as you may lack the privs needed to do so. Since oracle 9 there have also been procedures to check what the values are set to. e.g:

test_11_2> BEGIN
5 begin
6 dbms_application_info.READ_MODULE(V_MODULE,V_ACTION);
8 END;
9 END;
10 /


So as you can see, these values can be set, they can be accessed via the data dictionary or the package itself and, if you believe my wild claims, they are very lightweight to use. However, these values really come into themselves with eg OEM’s performance screens. I’ll leave examples of that to a second post. Here I just wanted to cover how easy it is to set and get the information.

That does lead me onto what I think is the main reason that DBMS_APPLICATION_INFO is not as widely used as it could {and should} be. Developers often lack access to eg OEM and so do not see the full benefit of using it. Also, developers tend to be {note the get-out-clause) more concerned with making the code work and less with monitoring it after it is released. DBA-types tend to be more concerned with monitoring it. Thus the DBA types would want it in the code and developers would not want to spend the time coding it in. We are all one big team though, so some negotiation and beer-buying (or management enforcement) might well resolve that one.

Friday Philosophy – Lead or Lag (When to Upgrade)? January 20, 2012

Posted by mwidlake in development, Friday Philosophy, Testing.
Tags: , , ,

I was involved in a discussion recently with Debra Lilley which version of Oracle to use. You can see her blog about it here (and she would love any further feedback from others). Oracle now has a policy that it will release the quarterly PSUs for a given point release for 12 months once that point release is superseded. ie once came out, Oracle will only guarantee to provide PSUs for for 12 months. See “My Oracle Support” note ID 742060.1. However, an older Terminal release such as is not superseded and is supported until 2015 – and will get the quarterly PSU updates. This left the customer with an issue. Should they start doing their development on the latest and theoretically greatest version of Oracle and be forced to do a point upgrade “soon” to keep getting the PSUs, or use an older version of Oracle and avoid the need to upgrade?

This is in many ways a special case of the perennial issue of should you use the latest version of Oracle (or in fact any complex software solution) or go with the version you know and trust? Plus, should you patch up to the latest version which in theory gives you protection against bugs and vulnerabilities (along with the CPUs). Yes, they are two separate issues but people tend to sit on the same side of both points, for the same reasons.

The arguments to stay using an older version are that it is working, it is stable, you do not need the new features and upgrading is a lot of work and effort. Plus the new version will have new bugs that come along with the new features you do not need and things might be turned on by default that you could do without (like stats collecting or not creating the actual segments when a new table or partition is created). If you remain on your favourite version long enough, you get another issue which is that the latest version of Oracle might not be compatible with your ancient version of the OS or another package or programming language critical to your system (I got caught in a terrible web with old perl, old O/S and old DB that resulted in a need to upgrade all three together – ouch!).

The arguments to moving forward are that you get access to the latest features, that over all older features will have more bugs fixed in newer version, performance will be better {again, overall, exceptions allowing}. Also, if you do hit bugs and problems there are no issues in having to first upgrade to a fully supported version. Plus, fixes are made for current versions first and then back-ported to older ones. Those pack-ported fixes can cause real problems when you DO decide to upgrade.

The big sticking points are the effort involved in upgrading and living with the bugs that you find that Oracle Testing didn’t.

I’ve got a few of other considerations to throw into the pot.

Firstly, if you are developing something new, it is not a lot more effort to use the latest version. This allows you to learn the new version and eases the transition of older systems to it.

Secondly, Oracle like you if you use the latest version, especially if it is the latest-latest version or even beta. Yeah, the helpdesk will not have a clue about some of your issues but in my experience you get access to those really smart guys and gals in Oracle who do the third-line support or even the development work.

Thirdly, if you are on the latest version, if you do decide to freeze on that version for a while, for stability and a quiet life, you have a lot longer before your version (at least at a major level) drops out of support.

Fourthly, dynamic, inquisitive, flexible staff like new things. In my experience, environments that freeze on an old version have a higher percentage of staff who either like it dull and repetitive, or hate it being dull and repetitive – and itch to get out. If I’m in charge, I know which type of staff I like to have more of {NB there are some very good arguments for having some staff who like it dull and repetitive}.

As you can guess, I am in the “be on the latest version” side of the argument. I was ambivalent about it until a few years ago when I noticed a trend:

Sites that like to move forward tend to (a) do it in a controlled manner and (b) have the infrastructure to do proper regression testing.
Site that like to stay still lack the ability to do regression testing and move forward only when forced – and in a pressured, unplanned and frankly chaotic manner.

That was it, that was the real key thing for me. The further you lag behind the more likely you are to eventually be forced to upgrade and it won’t be a nice time doing it. I know, there are exceptions, systems still running Oracle 6 absolutely fine on an old DOS6.1 box. In the same way you also get the odd 95-year-old life-long smokers – and thousands of 45-year-old smokers with emphysema.

When I have any sway over the situation I now always strive to be on modern versions of Oracle {OS, language, whatever} and to patch small and regular. To support all this, have very good regression testing. I’ve only a couple of times been able to get the regression testing sorted out as well as I would like, but when you do the pain of patching and upgrading, as well as developing and integrating, is so much reduced that not patching seems madness.

So to sum up:

  • If it is a new development, go for the very latest version, play with the latest features if potentially beneficial and see if you can get Oracle to be interested in your attempts. ie (B)lead.
  • If you have good regression testing, plan and carry out patch and version upgrades as they come available and stay current. ie Lead
  • If you have a complex solution in place and no/poor regression testing, do not move to a new major release, leave it a while for the worst new bugs to be found and fixed. Then move. ie Lag
  • If your system is old AND critical and all the guys and gals who implemented it are long gone, stay on that version for ever. ie stagnate.

Oh, and if that last one applies to many of your systems – dust off the CV and start reading technical manuals. One day you will need a new job in a hurry.

Oracle Nostalgia December 15, 2011

Posted by mwidlake in database design, development.
Tags: ,

When preparing the material for my “Oracle Lego – an introduction to Database Design” presentation for the UKOUG last week, I was looking back at my notes from a course on the topic from “a few years back”. There were a few bits which made me smile.

Oracle’s [SQL] implementation conforms to ANSI standard, although referential integrity will not be enforced until version 7

Any other old geezers having flashbacks? I am so glad my first major Oracle development project swapped to using a Beta of V7 very early, so we had the integrity turned on during most of development. I had to help a few projects go from V6 to V7 and turn on the RI – it was usually very, very painful. Or impossible. I always think back to those nightmare experiences when some bright spark suggests turning off referential integrity for “ease of development” or “performance” reasons. There are good performance reasons for altering how you implement RI but, as I said during my presentation on database design, I have never, ever, ever seen a system with RI turned off that did not have damaged data.

Oracle’s optimiser is rule-based. Designing efficient queries involves taking advantage of the optimiser behaviour


You can tell this course was run in the UK due to the lack of ‘z’ in ‘optimiser’ :-). How many of use can now make a stab at the seven or eight significant rules from the 15 (16, 17 as versions advanced) in the list? Several rules were to do with Clusters so you did not care. Let’s think, what were the main things to keep in mind…

  • most significant table last in the FROM clause and order upwards in the order you wanted to visit {most significant being the one you felt you could most efficiently do the first filter against}
  •  WHERE clauses ordered downwards in the order you wanted them to be applied.
  • Order of preference to identify a row was something like ROWID, primary key, unique key, full non-unique key, partial unique key, partial non-unique key, full index scan, full table scan.
  • Disable index access by adding 0 to numeric columns and concatinating null to varchars.

I’ve not checked back in the manuals (I have a set of the V7 on my laptop) so I’m probably wrong.

Storage….selecting suitable values for storage parameters … will improve the final performance of the database

Considering the “suitable values for storage parameters” was perhaps my first real conscious step into being a performance/design guy {I was lucky to be on a project where designing for the RBO and matching those rules was just part of being any developer}, but the calculating of rows-per-block, initial and next extent, pctincrease (not always zero you know), initrans/maxtrans, segment to tablespace size… I learnt all about that and had spreadsheets for it all.

Now of course, all of the above about storage (and RBO) has pretty much disappeared. Oracle has made some of the contents of my brain redundant.

But some things have not changed at all in 18 years:

Users can be relied upon to know what they do NOT want, not what they want, which {unfortunately} is the premise from which analysis starts)

I think the above is the fundamental issue from which all iterative design methodologies spring. ie do not believe what the user says they want, show them something and fix it. It is probably human nature that we are not well able to express what we want but have no problem pointing out something is not at all what we want :-). Add in all the issues in respect of forgetting about the exceptions, assumed knowledge, incompatible vocabularies (the words your users say to you are as confusing as the techno-babble you fire back at them) and all analysis is fundamentally flawed.

Do some analysis – but then prototype like crazy. With real users.

Will the Single Box System make a Comeback? December 8, 2011

Posted by mwidlake in Architecture, future, Hardware.
Tags: , ,

For about 12 months now I’ve been saying to people(*) that I think the single box server is going to make a comeback and nearly all businesses won’t need the awful complexity that comes with the current clustered/exadata/RAC/SAN solutions.

Now, this blog post is more a line-in-the-sand and not a well researched or even thought out white paper – so forgive me the obvious mistakes that everyone makes when they make a first draft of their argument and before they check their basic facts, it’s the principle that I want to lay down.

I think we should be able to build incredible powerful machines based on PC-type components, machines capable of satisfying the database server requirements of anything but the most demanding or unusual business systems. And possibly even them. Heck, I’ve helped build a few pretty serious systems where the CPU, memory and inter-box communication is PC-like already. If you take the storage component out of needing to be centralise (and this shared), I think that is a major change is just over the horizon.

At one of his talks at the UKOUG conference this year, Julian Dyke showed a few tables of CPU performance, based on a very simple PL/SQL loop test he has been using for a couple of years now. The current winner is 8 seconds by a… Core i7 2600K. ie a PC chip and one that is popular with gamers. It has 4 cores and runs two threads per core, at 3.4GHz and can boost a single core to 3.8 GHz. These modern chips are very powerful. However, chips are no longer getting faster so much as wider – more cores. More ability to do lots of the same thing at the same speed.

Memory prices continue to tumble, especially with smart devices and SSD demands pushing up the production of memory of all types. Memory has fairly low energy demands so you can shove a lot of it in one box.

Another bit of key hardware for gamers is the graphics card – if you buy a top-of-the-range graphics card for a PC that is a couple of years old, the graphics card probably has more pure compute grunt than your CPU and a massive amount of data is pushed too and fro across the PCIe interface. I was saying something about this to some friends a couple of days ago but James Morle brought it back to mind when he tweeted about this attempt at a standard about using PCI-e for SSD. A PCI-e 16X interface has a theoretical throughput of 4000MB per second – each way. This compares to 600MB for SATA III, which is enough for a modern SSD. A single modern SSD. {what I am not aware of is the latency for PCI-e but I’d be surprised if it was not pretty low}. I think you can see where I am going here.

Gamers and image editors have probably been most responsible for pushing along this increase in performance and intra-system communication.

SSD storage is being produced in packages with a form factor and interface to enable an easy swap into the place of spinning rust, with for example a SATA3 interface and 3.5inch hard disk chassis shape. There is no reason that SSD (or other memory-based) storage cannot be manufactured in all sorts of different form factors, there is no physical constraint of having to house a spinning disc. Density per dollar of course keeps heading towards the basement. TB units will soon be here but maybe we need cheap 256GB units more than anything. So, storage is going to be compact and able to be in form factors like long, thin slabs or even odd shapes.

So when will we start to see cheap machines something like this: Four sockets for 8/16/32 core CPUs, 128GB main memory (which will soon be pretty standard for servers), memory-based storage units that clip to the external housing (to provide all the heat loss they require) that combine many chips to give 1Gb IO rates, interfaced via the PCIe 16X or 32X interface. You don’t need a HBA, your storage is internal. You will have multipath 10GbE going in and out of the box to allow for normal network connectivity and backup, plus remote access of local files if need be.

That should be enough CPU, memory and IO capacity for most business systems {though some quote from the 1960’s about how many companies could possible need a computer spring to mind}. You don’t need shared storage for this, in fact I am of the opinion that shared storage is a royal pain in the behind as you are constantly having to deal with the complexity of shared access and maximising contention on the flimsy excuse of “sweating your assets”. And paying for the benefit of that overly complex, shared, contended solution.

You don’t need a cluster as you have all the cpu, working memory and storage you need in a 1U server. “What about resilience, what if you have a failure?”. Well, I am swapping back my opinion on RAC to where I was in 2002 – it is so damned complex it causes way more outage than it saves. Especially when it comes to upgrades. Talking to my fellow DBA-types, the pain of migration and the number of bugs that come and go from version to version, mix of CRS, RDBMS and ASM versions, that is taking up massive amounts of their time. Dataguard is way simpler and I am willing to bet that for 99.9% of businesses other IT factors cause costly system outages an order of magnitude more times than the difference between what a good MAA dataguard solution can provide you compared to a good stretched RAC one can.

I think we are already almost at the point where most “big” systems that use SAN or similar storage don’t need to be big. If you need hundreds of systems, you can virtualize them onto a small number of “everything local”

A reason I can see it not happening is cost. The solution would just be too cheap, hardware suppliers will resist it because, hell, how can you charge hundreds of thousands of USD for what is in effect a PC on steroids? But desktop games machines will soon have everything 99% of business systems need except component redundancy and, if your backups are on fast SSD and you a way simpler Active/Passive/MAA dataguard type configuration (or the equivalent for your RDBMS technology) rather than RAC and clustering, you don’t need that total redundancy. Dual power supply and a spare chunk of solid-state you can swap in for a failed raid 10 element is enough.

IOT Part 6 – Inserts and Updates Slowed Down (part A) November 1, 2011

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , , ,

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
…………>IOT6(B) – OLTP Inserts

A negative impact of using Index Organized Tables is that inserts are and updates can be significantly slowed down. This post covers the former and the reasons why – and the need to always run tests on a suitable system. (I’m ignoring deletes for now – many systems never actually delete data and I plan to cover IOTs and delete later)

Using an IOT can slow down insert by something like 100% to 1000%. If the insert of data to the table is only part of a load process, this might result in a much smaller overall impact on load, such as 25%. I’m going to highlight a few important contributing factors to this wide impact spread below.

If you think about it for a moment, you can appreciate there is a performance impact on data creation and modification with IOTs. When you create a new record in a normal table it gets inserted at the end of the table (or perhaps in a block marked as having space). There is no juggling of other data.
With an IOT, the correct point in the index has to be found and the row has to be inserted at the right point. This takes more “work”. The inserting of the new record may also lead to an index block being split and the extra work this entails. Similar extra work has to be carried out if you make updates to data that causes the record to move within the IOT.
Remember, though, that an IOT is almost certainly replacing an index on the heap table which, unless you are removing indexes before loading data and recreating them after, would have to be maintained when inserting into the Heap table. So some of the “overhead” of the IOT would still occur for the heap table in maintaining the Primary Key index. Comparing inserts or updates between a heap table with no indexes and an IOT is not a fair test.

For most database applications data is generally written once, modified occasionally and read many times – so the impact an IOT has on insert/update is often acceptable. However, to make that judgement call you need to know

  • what the update activity is on the data you are thinking of putting into an IOT
  • the magnitude of the impact on insert and update for your system
  • the ratio of read to write.

There is probably little point putting data into an IOT if you constantly update the primary key values (NB see IOT-5 as to why an IOT’s PK columns might not be parts of a true Primary Key) or populate previously empty columns or hardly ever read the data.

There is also no point in using an IOT if you cannot load the data fast enough to support the business need. I regularly encounter situations where people have tested the response of a system once populated but fail to test the performance of population.

Now to get down to the details. If you remember the previous posts in this thread (I know, it has been a while) then you will remember that I create three “tables” with the same columns. One is a normal heap table, one is an Index Organized Table and one is a partitioned Index Organized Table, partitioned into four monthly partitions. All tables have two indexes on them, the Primary Key index (which is the table in the case of the IOTs) and another, roughly similar index, pre-created on the table. I then populate the tables with one million records each.

These are the times, in seconds, to create 1 million records in the the HEAP and IOT tables:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        171.9  
IOT table               1483.8

This is the average of three runs to ensure the times were consistent. I am using Oracle V11.1 on a machine with an Intel T7500 core 2 Duo 2.2GHz, 2GB memory and a standard 250GB 5000RPM disk. The SGA is 256MB and Oracle has allocated around 100MB-120MB to the buffer cache.

We can see that inserting the 1 million rows into the IOT takes 860% the time it does with a heap table. That is a significant impact on speed. We now know how large the impact is on Insert of using an IOT and presumably it’s all to do with juggling the index blocks. Or do we?

This proof-of-concept (POC) on my laptop {which you can also run on your own machine at home} did not match with a proof-of-concept I did for a client. That was done on V10.2.0.3 on AIX, on a machine with 2 dual-core CPUS with hyper-threading (so 8 virtual cores), 2GB SGA and approx 1.5GB buffer cache, with enterprise-level storage somewhere in the bowels of the server room. The results on that machine to create a similar number of records were:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        152.0  
IOT table                205.9

In this case the IOT inserts required 135% the time of the Heap table. This was consistent with other tests I did with a more complex indexing strategy in place, the IOT overhead was around 25-35%. I can’t go into too much more detail as the information belongs to the client but the data creation was more complex and so the actual inserts were only part of the process – this is how it normally is in real life. Even so, the difference in overhead between my local-machine POC and the client hardware POC is significant, which highlights the impact your platform can have on your testing.

So where does that leave us? What is the true usual overhead? Below are my more full results from the laptop POC.

                        Time in Seconds
Object type         Run_Normal    Run_quiet    Run_wrong_p
------------------  ----------    ---------    -----------
Normal Heap table        171.9        81.83         188.27  
IOT table               1483.8      1055.35        1442.82
Partitioned IOT          341.1       267.83         841.22 

Note that with the partitioned IOT the creation took 341 second, the performance ratio to a heap table is only 198% and is much better than the normal IOT. Hopefully you are wondering why!

I’m running this test on a windows laptop and other things are going on. The timings for Run_Quiet are where I took steps to shut down all non-essential services and applications. This yielded a significant increase for all three object types but the biggest impact was on the already-fastest Heap table.

The final set of figures is for a “mistake”. I created the partitions wrong such that half the data went into one partition and the rest into another and a tiny fraction into a third, rather than being spread over 4 partitions evenly. You can see that the Heap and normal IOT times are very similar to the Run_Normal results (as you would expect as these test are the same) but for the partitioned IOT the time taken is half way towards the IOT figure.

We need to dig into what is going on a little further to see where the effort is being spent, and it turns out to be very interesting. During my proof-of-concept on the laptop I grabbed the information from v$sesstat for the session before and after each object creation so I could get the figures just for the loads. I then compared the stats between each object population and show some of them below {IOT_P means Partitioned IOT}.

STAT_NAME                            Heap    	IOT	        IOT P
------------------------------------ ---------- -------------  -----------
CPU used by this session                  5,716         7,222        6,241
DB time                                  17,311       148,866       34,120
Heap Segment Array Inserts               25,538            10           10

branch node splits                           25            76           65
leaf node 90-10 splits                      752         1,463        1,466
leaf node splits                          8,127        24,870       28,841

consistent gets                          57,655       129,717      150,835
cleanout - number of ktugct calls        32,437        75,201       88,701
enqueue requests                         10,936        28,550       33,265

file io wait time                     4,652,146 1,395,970,993  225,511,491
session logical reads                 6,065,365     6,422,071    6,430,281
physical read IO requests                   123        81,458        3,068
physical read bytes                   2,097,152   668,491,776   25,133,056
user I/O wait time                          454       139,585       22,253
hot buffers moved to head of LRU         13,077       198,214       48,915
free buffer requested                    64,887       179,653      117,316

The first section shows that all three used similar amounts of CPU, the IOT and partitioned IOT being a little higher. Much of the CPU consumed was probably in generating the fake data.The DB Time of course pretty much matches the elapsed time well as the DB was doing little else.
It is interesting to see that the Heap insert uses array inserts which of course are not available to the IOT and IOT_P as the data has to be inserted in order. {I think Oracle inserts the data into the heap table as an array and then updates the indexes for all the entries in the array – and I am only getting this array processing as I create the data as an array from a “insert into as select” type load. But don’t hold me to any of that}.

In all three cases there are two indexes being maintained but in the case of the IOT and IOT_P, the primary key index holds the whole row. This means there has to be more information per key, less keys per block and thus more blocks to hold the same data {and more branch blocks to reference them all}. So more block splits will be needed. The second section shows this increase in branch node and leaf block splits. Double the branch blocks and triple the leaf block splits. This is probably the extra work you would expect for an IOT. Why are there more leaf block splits for the partitioned IOT? The same data of volume ends up taking up more blocks in the partitioned IOT – 200MB for the IOT_P in four partitions of 40-60MB as opposed to a single 170MB for the IOT. The larger overall size of the partition is just due to a small overhead incurred by using partitions and also a touch of random fluctuation.

So for the IOT and IOT_P there is about three times the index-specific work being done and a similar increase in related statistics such as enqueues, but not three times as it is not just index processing that contribute to these other statistics. However, the elapsed time is much more than three times as much. Also, the IOT_P is doing more index work than the IOT but it’s elapsed time is less. Why?

The fourth section shows why. Look at the file io wait times. This is the total time spent waiting on IO {in millionths of a second} and it is significantly elevated for the IOT and to a lesser degree for the IOT_P. Physical IO is generally responsible for the vast majority of time in any computer system where it has not been completely avoided.
Session logical reads are only slightly elevated, almost negligably so but the number of physical reads to support it increases from 123 for the Heap table insert to 81,458 for the IOT and 3,068 for the IOT_P. A clue as to why comes from the hot buffers moved to head of LRU and free buffer requested statistics. There is a lot more activity in moving blocks around in the buffer cache for the IOT and IOT_P.

Basically, for the IOT, all the blocks in the primary key segment are constantly being updated but eventually they won’t all fit in the block buffer cache – remember I said the IOT is eventually 170MB and the buffer cache on my laptop is about 100MB – so they are flushed down to disk and then have to be read back when altered again. This is less of a problem for the IOT_P as only one partition is being worked on at a time (the IOT_P is partitioned on date and the data is created day by day) and so more of it (pretty much all) will stay in memory between alterations. The largest partition only grows to 60MB and so can be worked on in memory.
For the heap, the table is simply appended to and only the indexes have to be constantly updated and they are small enough to stay in the block buffer cache as they are worked on.

This is why when I got my partitioning “wrong” the load took so much longer. More physical IO was needed as the larger partition would not fit into the cache as it was worked on – A quick check shows that logical reads and in fact almost all statistics were very similar but 26,000 IO requests were made (compared to 81,458 for the IOT and 3,068 for the correct IOT_P).

Of course, I set my SGA size and thus the buffer cache to highlight the issue on my laptop and I have to say even I was surprised by the magnitude of the impact. On the enterprise-level system I did my client’s proof of concept on, the impact on insert was less because the buffer cache could hold the whole working set, I suspect the SAN had a considerable cache on it, there was ample CPU resource to cope with the added latching effort and the time taken to actually create the data inserted was a significant part of the workload, reducing the overall impact of the slowness caused by the IOT.

{Update, in This little update I increase my block buffer cache and show that physical IO plummets and the IOT insert performance increases dramatically}.

This demonstrates that a POC, especially one for what will become a real system, has to be a realistic volume on realistic hardware.
For my client’s POC, I still did have to bear in mind the eventual size of the live working set and the probably size of the live block buffer cache and make some educated guesses.

It also explains why my “run_quiet” timings showed a greater benefit for the heap table than the IOT and IOT_P. A windows machine has lots of pretty pointless things running that take up cpu and a bit of memory, not really IO so much. I reduced the CPU load and it benefits activity that is not IO, so it has more impact on the heap table load. Much of the time for the IOT and IOT_P is taken hammering the disk and that just takes time.

So, in summary:

  • Using an IOT increases the index block splitting and, in turn, enqueues and general workload. The increase is in proportion to the size of the IOT compared to the size of the replaced PK.
  • The performance degredation across the whole load process may well be less than 50% but the only way to really find out is to test
  • You may lose the array processing load that may benefit a heap table load if you do the load via an intermediate table.
  • With an IOT you may run into issues with physical IO if the segment (or part of the segment) you are loading into cannot fit into the buffer cache (This may be an important consideration for partitioning or ordering of the data loaded)
  • If you do a proof of concept, do it on a system that is as similar to the real one as you can
  • Just seeing the elapsed time difference between test is sometimes not enough. You need to find out where that extra time is being spent

I’ve thrown an awful lot at you in this one post, so I think I will stop there. I’ve not added the script to create the test tables here, they are in IOT-5 {lacking only the grabbing of the v$sesstat information}.

Friday Philosophy – The One Absolute Requirement for System Success October 14, 2011

Posted by mwidlake in development, Friday Philosophy, Perceptions.
Tags: , , , ,

What must you always achieve for an IT system to be a success?

  • Bug free? Never happens.
  • Within budget/time frame? That would be nice.
  • Includes critical business functionality? Please define critical.
  • Secure? Well, it’s important for many systems but then it is often lacking (even when it is important).
  • That it is to specification? Well we all know that’s wrong.

There is only one thing that an IT system must always achieve to be a success.

User Acceptance.

For an individual system other considerations may well be very important, but the user acceptance is, I think, non-negotiable.

{Slight update – This blog was written in 2011 with corporate IT systems in mind, where the use of the system is almost if not explicitly mandated. With the explosion of Apps and web-based systems over the last 5 years, for a lot of IT systems the users, the customers, have to be attracted to the system. We now call User Acceptance User Experience and it is, if anything, even more vital. If the user does not find the system easy to use and of some benefit, they will click off somewhere else.}

The user must get enough out of using the system for it to be worth their while, otherwise at best they will resent using it and at worst… Well, at worst they will use it but will put in any old rubbish to fulfill the dictate that it be used. In this worst case you would be better off if they did not use the system at all. Here are a couple of examples from my working past.

In the first one, I was involved in extending a hospital management system so that it kept track of the expected arrival and departure times for ward patients, allowing a predication of when beds would become available and calculation of expected occupancy rates. Yes, this was a while ago (maybe 1990) and on an a system that was old then. The information was needed by someone with the title “bed sister” {or something similar – and would now be titled “head of occupancy management!…} so that they could better manage the flow of patients and keep a higher bed usage ratio. Was this to make the hospital more efficient? No, it was to satisfy a politically demanded report to the NHS executive. Oh, the overall intention was to increase efficiency – but the report soon became more important than any true efficiency. So, we added columns in tables and field on screens and prompts for the ward staff to fill in the required new information.

And they didn’t.

The nurses were busy, they were pretty demoralized {due to having recently been used by the government as a way to control public sector pay} and they had more nursing duties to do than they could manage. They were not going to waste a couple of minutes trying to check when Mrs Jenkins was going to be sent home when Mrs Leonard needed a bed pan. The nursing staff were given a hospital-wide telling off, this information had to be entered. So they put in the data – but guessed wildly, just to get management off their backs. Thus the design of the system enhancement was fine, the report was logically accurate, only the correct staff could run it, but we failed to achieve User Acceptance and thus the system was a failure.

So I added something else. It was a very crude screen (this was 1990) that showed a “diagram” of a ward – Down the left and right side of a VT220 screen you saw little oblong boxes labelled with a bed number and in it the patient name, the consultant’s initials, a medical specialty code and the arrival and departure date-time. This was based on some information we already had plus the new information we wanted and the screen was quite basic, limited and slow to draw. But it was useful to the ward staff. They could find any patient, they knew who to call if there was an emergency {not the actual consultant of course, but their secretary :-)}, they could check when they expected a patient to leave, they could even see when a new patient was expected. They used it. They put in the expected departure times {sobering thought, this might not be expected leaving alive}, patients would be booked into empty beds when they became available (often before they actually arrived) and the bed nurse could plan new arrivals and the oh-so-important report could be run. We had achieved User Acceptance and the system became a success.

The second example is also from healthcare but from a different system in a different hospital. We were putting together a system to schedule outpatient clinics. We knew what we were doing, it’s pretty simple. You have some medical staff holding the clinic (a consultant and probably a senior house officer), a period for the clinic (3 or 4 hours) and a set of people to be seen, say 40. We gave some flexibility in slot lengths (some people need 5 minutes, some 15), checked for and stopped double booking and verified against other data such as correct speciality for the outpatient’s clinical data. We did not go and ask the patient admin staff about the spec, we knocked up the design and the screens and asked them to test. After all, I was very experienced now, I’d been doing these systems for 3, 4 years…

They very quickly came back to us and said it was rubbish. Oh dear.

We went and saw them. “What’s the problem?” There were a few – but the main one was that you could not double book a slot. Why would you want to do that? Do two patients really want to be consulted at the same time with the same doctor? {NB this can happen in some special cases – such as fertility treatment, psychology and a couple of others, but not Urology or Orthopedics!}
“Err, maybe, it might happen, can we just be able to double book?” OK, we could maybe alter things to allow two patients to be seen at the same time… The patient admin staff are not looking happy. The hospital liaison is looking confused & concerned and interrupts – “You can’t do that! Patient confidentiality can’t be broken!” he says. It got worse. “We need to be able to book… several people in one slot”. How many? “We need to book all the patients into the first slot, with the consultant, so the letters go out to them saying come to see Mr Winders at 1pm”. The admin staff are now looking very shifty.

If any of you have worked in the health service you are probably way ahead of me. The admin staff needed to book all the patients in at this first slot so that the ill and suffering would all turn up at the start of the session. The consultant would then run through his notes and see the two or three he was interested in – and then go and play golf. The Senior House Officer (or whoever) would then work through the rest of the patients for the following three or four hours. If you have ever had to turn up at the start of a consultancy session and sat there for three hours, now you know why. You see, back then, the consultant was only a very small step away from deity level (and I leave it to you to decide if it was a step up or down). What they said went and if they wanted to go and play golf or store 200 medical records in the boot of their car or refuse to speak to “that stupid idiot in renal medicine” then you worked around it. {I’m assured that things are a lot better now, but I’d love to know how it really is}.

We had designed a sensible solution. The users of the system needed a non-sensible {to our mind} solution. Even the NHS liaison chap had never appreciated exactly how much the consultants abused the system, he thought they just booked the people s(he) wanted at the start of the session, but no. The consultant decided that day who was interesting and as a result every patient had to be there at the start.

I count myself lucky that I learnt from direct experience so soon in my working life that (a) you have to deliver what the user will accept and (b) the only way to know what they want is to show them the system and talk with them.

Friday Philosophy – The start of Computing October 7, 2011

Posted by mwidlake in Friday Philosophy, history.
Tags: , , ,
add a comment

This week I finally made a visit to Bletchley Park in the middle of England. Sue and I have been meaning to go there for several years, it is the site of the British code-breaking efforts during the second world war and, despite difficulties getting any funding, there has been a growing museum there for a number of years. {Hopefully, a grant from the Heritage Lottery Fund, granted only this month, will secure it’s future}.

Why is Bletchley Park so significant? Well, for us IT-types it is significant because Alan Turing did a lot of work there and it was the home of Colossus, one of the very first electrical, programmable computers. More generally of interrest, their efforts and success in cracking enemy ciphers during WW2 were incredibly important and beneficial to the UK and the rest of the allies.

In this post, I am not going to touch on Colossus or Alan Turing, but rather a machine called the “Bombe”. The Bombe was used to help discover the daily settings of the German Enigma machines, used for decrypting nearly all German and Italian radio messages. All the Bombes were destroyed after the war (at least, all the UK ones were) to help keep secret the work done to crack the cyphers – but at Bletchley Park the volunteers have recreated one. Just like the working model of Babbage’s Difference Engine, it looks more like a work of art than a machine. Here is a slightly rough video I took of it in action:

My slightly rough video of the bombe

{OK, if you want a better video try a clearer video by someone else.}

I had a chat with the gentleman you see in both videos about the machine and he explained something that the tour we had just been on did not make clear – the Bombe is a parallel processing unit. Enigma machines have three wheels. There are banks of three coloured disks in the bombe (see the picture below). eg, in the middle bank the top row of disks are black, middle are yellow and bottom are red. Each vertical set of three disks, black-yellow-red, is the equivalent of a single “enigma machine”. Each trio of disks is set to different starting positions, based on educated guesses as to what the likely start positions for a given message might be. The colour of the disk matches, I think, one of the known sets of wheels the enigma machines could be set up with. The machine is then set to run the encrypted message through up to 36 “Enigmas” at once. If the output exceeds a certain level of sense (in this case quite crucially, no letter is every encrypted back to itself) then the settings might be correct and are worth further investigation. This machine has been set up with the top set of “Enigmas” not in place, either to demonstrate the workings or because the machine is set up for one of the more complex deciphering attempts where only some of the banks can be used.

This is the bombe seen from the front

The reason the chap I was talking to really became fascinated with this machine is that, back in about 1999, a home PC programmed to do this work was no faster than the original electro-mechanical machines from 1944 were supposed to have taken. So as an engineer he wanted to help build one and find out why it was so fast. This struck a chord with me because back in the late 1990’s I came across several examples of bespoke computers designed to do specific jobs (either stuff to do with natural gas calorific value, DNA matching or protein folding), but by 2000, 2002 they had all been abandoned as a general PC could be programmed to be just as fast as these bespoke machines – because bespoke means specialist means longer and more costly development time means less bangs for your buck.

Admittedly the Bombe is only doing one task, but it did it incredibly fast, in parallel, and as a part of the whole deciphering process that some of the best minds of their time had come up with (part of the reason the Bletchley Park site was chosen was that it was equidistant between Oxford and Cambridge and, at that time, there were direct train links. {Thanks, Dr Beeching}. ).

Tuning and reliability was as important then as it is now. In the below picture of the back of the machine (sorry about the poor quality, it was dim in that room), you can see all the complex wiring in the “door” and, in the back of the machine itself, those three rows of bronze “pipes” are in fact…Pipes. Oil pipes. This is a machine, they quickly realised that it was worth a lot of effort to keep those disks oiled, both for speed and reliability.

All the workings of the Bombe from the back

Talking of reliability, one other thing my guide said to me. These machines are complex and also have some ability to cope with failures or errors built into them. But of course, you needed to know they were working properly. When these machines were built and set up, they came with a set of diagnostic tests. These were designed to push the machine, try the edge cases and to be as susceptible to mechanical error as possible. The first thing you did to a new or maintained machine was run your tests.

1943, you had awesome parallel processing, incredible speed and test-driven development and regression testing. We almost caught up with all of this in the early 21st Century.


Get every new post delivered to your Inbox.

Join 206 other followers