Compare tables

Dec 7, 2007

Recently I described the use of Oracle Streams to do an Oracle upgrade. It lets you create a new, empty database of a higher version and then use Streams replication to keep it in sync with the old one. The final migration then consists of stopping activity on the old database, making sure everything has been replicated to the new one, and then connecting all clients to the new database.

This is running fine now in test. However, sometimes you want that extra confirmation that the new database really is identical to the old one. Personally, I know that Streams will do its work correctly, but I have to prove it to others as well.

For small tables, you could easily write a “select from some_table minus select from some_table@old” in the new database and use this construction to verify that the rows are identical in the old and new databases. However, this becomes an expensive operation on large tables, and LOB columns cannot even be compared over a database link.
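A minimal sketch of that comparison, assuming a table called some_table and a database link called old (both names are just illustrations); the two-way minus catches rows missing on either side:

```sql
(select * from some_table
 minus
 select * from some_table@old)
union all
(select * from some_table@old
 minus
 select * from some_table);
```

An empty result means the two tables contain the same rows, but this fails as soon as the table contains LOB columns.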

So I came up with a PL/SQL package to calculate a checksum of a table. After some initial thought, it turned out not to be so difficult. dbms_utility is my friend. It contains a function “get_hash_value” that returns a number calculated from a string. The same string will always return the same number, and by choosing an output range between 0 and power(2,31), you will have a hard time finding two strings that return the same hash value.
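The function takes the input string, a base value and the size of the hash range, something like:

```sql
-- Hash a string into the range 0 .. power(2,31) - 1.
select dbms_utility.get_hash_value('some input string', 0, power(2,31)) as h
from dual;
```

The same input always yields the same number, so it can be used as a per-row fingerprint.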

The problem then reduces to finding a way to convert every column into a string. That is easy for char and varchar2, and fine for number, date and timestamp using a to_char function, but what about my CLOBs and BLOBs? Well, now dbms_crypto is my friend (10g). This package contains a “hash” function that accepts a clob or blob as its input and returns a 128-bit hash (using MD4). The only problem is that it returns a RAW. So you need to be familiar with raw and hex stuff to know that the rawtohex function converts this output to a string containing only hex characters (0..9, A..F), and a string is exactly what we needed, right?
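For a CLOB column this looks roughly as follows (some_table and some_clob_column are hypothetical names, and you need execute privilege on dbms_crypto):

```sql
-- Reduce a CLOB to a 32-character hex string via an MD4 hash.
select rawtohex(
         dbms_crypto.hash(some_clob_column, dbms_crypto.hash_md4)
       ) as clob_hash
from   some_table;
```

A BLOB column works the same way, since dbms_crypto.hash is overloaded for both types.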

Now it is just a question of generating some dynamic SQL, concatenating the output of all columns to each other and calling dbms_utility.get_hash_value on the result. This returns a number for every row in the table. You can then simply select sum(…) of this to obtain a global hash value for the table.
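Spelled out for a hypothetical three-column table (this is only a sketch of the idea, not the actual package, which builds the column list dynamically from the data dictionary):

```sql
declare
  l_sql  varchar2(4000);
  l_hash number;
begin
  -- Stringify each column, concatenate with a separator,
  -- hash per row, then sum over all rows.
  l_sql := 'select sum(dbms_utility.get_hash_value('
        || 'col1 || ''|'' || to_char(col2) || '
        || 'to_char(col3, ''YYYYMMDDHH24MISS''), 0, power(2,31)))'
        || ' from some_table';
  execute immediate l_sql into l_hash;
  dbms_output.put_line('table hash: ' || l_hash);
end;
/
```

The separator avoids two different rows concatenating to the same string, and LOB columns would be replaced by their rawtohex(dbms_crypto.hash(…)) string in the generated SQL.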

My experience is that these functions are all very fast, and that the time to calculate the table hash value is nearly equal to that of a full table scan, which is the minimum you have to do in order to check *all* data in the table.

Applying this to Streams, I look at the apply process to find the last applied SCN, and then use my hash function to calculate the hash number on the tables “as of scn <last apply scn>”. If this returns the same result in my old and my new database, I am as good as 100% sure that both tables are identical (except for a collision chance of 1 in power(2,31), but winning the lottery twice in a row is still more likely than that…).
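Something along these lines, assuming an apply process named MY_APPLY (the apply name, table and columns are placeholders):

```sql
-- Find the SCN up to which the apply process has applied changes.
select applied_message_number
from   dba_apply_progress
where  apply_name = 'MY_APPLY';

-- Hash the table as it was at that SCN, on both databases.
select sum(dbms_utility.get_hash_value(
             col1 || '|' || to_char(col2), 0, power(2,31)))
from   some_table as of scn :last_apply_scn;
```

The flashback “as of scn” clause makes both sides compute the hash over exactly the same point-in-time version of the data, so ongoing replication does not disturb the comparison.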

I also have another package that compares rows between two tables, with extended options, customization, logging and exact identification of the differences. It is slower than the table hash above, but that’s for later…