Oracle VM and multiple local disks

Jun 9, 2008

For my Oracle VM test environment I have a server available with multiple internal disks of different size and speed. So I was wondering if it is possible to have all these disks used together for my virtual machines in Oracle VM.

If all disks would have been the same size and speed, I could easily use the internal raid controller to put them in mirror, stripe or raid5 and end up with one large volume, alias disk, for my Oracle VM. However due to the differences in characteristics of the disks (speed/size) this is not a good idea. So I started to look in Oracle VM Manager (the java console) to see what is possible.

It turned out soon to me that Oracle VM is designed for a different architecture: in fact the desired setup is to dispose of a (large) SAN box with shared storage that is available to multiple servers. Then all these servers can be put in a server pool, sharing the same storage. This setup allows live migration of running machines to another physical server. Of course this makes sense because it fits nicely in the concept of grid computing: if any physical server fails, just restart your virtual machine on another one, and add machines according to your performance needs. But it doesn’t help me: I don’t have got one storage with multiple servers, but I have one server with multiple disks.

So I started to browse a little in all the executables of the OVM installation, and I found under /usr/lib/ovs the ovs-makerepo script. According to me the architecture is as follows (as far as I can find on the internet, because there is not much clear documentation on this): when installing OVM, you have a /boot a / and a swap partition (just as in traditional linux) and OVM requires one large partition to be used for virtual machines, which will be mounted under /OVS. In this partition you find subdirectories “running_pool” which contains all the virtual machines that you have created and that you can start, and a subdirectory “seed_pool” which contains templates you can start from for creating new machines. There is also “local”, “remote” and “publish_pool”, however they were irrelevant for me at the moment and I didn’t try to figure out what they are used for.

With this in mind I can install Oracle VM on my first disk and end up with 4 partitions on /dev/sda:

   Filesystem 1K-blocks     Used Available Use% Mounted on
   /dev/sda1     248895    25284    210761  11% /boot
   (sda2 is swap)
   /dev/sda3    4061572   743240   3108684  20% /
   /dev/sda4   24948864 22068864   2880000  89% /OVS

With this in mind I now want to add the space on my second disk (/dev/sdb) to this setup. So first I create one large partition on the disk using fdisk. Then I create an ocfs file system on it as follows:

[root@nithog ovs]# mkfs.ocfs2 /dev/sdb1
mkfs.ocfs2 1.2.7
Filesystem label=
Block size=4096 (bits=12)
Cluster size=4096 (bits=12)
Volume size=72793694208 (17771898 clusters) (17771898 blocks)
551 cluster groups (tail covers 31098 clusters, rest cover 32256 clusters)
Journal size=268435456
Initial number of node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 4 block(s)
Formatting Journals: done
Writing lost+found: done
mkfs.ocfs2 successful

Initially I created the file system as ext3 which worked well. However there was one strange thing. This is what you get:

  • Create a new (paravirtualized) (linux) virtual machine in this new (ext3-based) repository (see later how exactly)
  • Specify a disk of e.g. 2Gb
  • Complete the wizard
  • This prepares a machine where you can start using the linux installer on the console to install the machine (do not start to install yet)
  • Now look in …/running_pool/machine_name and see a file of 2Gb
  • Now do du -sk on …/running_pool/machine and see that only 20Kb is used
  • From the moment you start to partition your disk inside the virtual machine, the output of “du -sk” grows the same amount as the data you really put in it. So it behaves a bit like ‘dynamic provisioning’.
  • Note however that ls -l shows a file of 2Gb at any time

I don’t know for the moment if this behaviour is caused by the fact that the file system is ext3, but anyway, I leave it up to you to judge if this is an advantage or a disadvantage.

Now when trying to add my new sdb1 partition as an extra repository, I got:

Usage:

[root@nithog ~]# /usr/lib/ovs/ovs-makerepo
 usage: /usr/lib/ovs/ovs-makerepo <source> <shared> <description>
        source: block device or nfs path to filesystem
        shared: filesystem shared between hosts?  1 or 0
        description: descriptive text to be displayed in manager

Execution:

   [root@nithog ovs]# /usr/lib/ovs/ovs-makerepo /dev/sdb1 0 “Repo on disk 2″
   ocfs2_hb_ctl: Unable to access cluster service while starting heartbeat mount.ocfs2:
   Error when attempting to run /sbin/ocfs2_hb_ctl: “Operation not permitted”
   Error mounting /dev/sdb1

Seems like the script expects something like a cluster, but I just have a standalone node… I think that this script is intended to add a shared repository to a cluster of nodes. No problem, let’s try to convert our standalone machine to a one-node cluster: create the file /etc/ocfs2/cluster.conf:

cluster:
        node_count = 1
        name = ocfs2
node:
        ip_port = 7777
        ip_address = 10.7.64.160
        number = 1
        name = nithog
        cluster = ocfs2

Note that the indented lines MUST start with a <TAB> and then the parameter with its value. After creating this file I could do:

   [root@nithog ovs]# /etc/init.d/o2cb online ocfs2
   Starting O2CB cluster ocfs2: OK
and then
   [root@nithog ovs]# /usr/lib/ovs/ovs-makerepo /dev/sdb1 0 “Repo on disk 2″
   Initializing NEW repository /dev/sdb1
   SUCCESS: Mounted /OVS/877DECC5B658433D9E0836AFC8843F1B
   Updating local repository list.
   ovs-makerepo complete

As you can see, an extra subdirectory is created in the /OVS file system, with a strange UUID as its name. Under this directory my new file system /dev/sdb1 is mounted. This file system is a real new repository, because under /OVS/877DECC5B658433D9E0836AFC8843F1B you find as well the running_pool and seed_pool directories. It is also listed in /etc/ovs/repositories (but it is NOT recommended to edit this file manually).

Then I looked in the Oracle VM Manager (the java based web gui) but I didn’t find anything of this new repository. It looks as if this gui is not (yet) designed to handle multiple repositories. However I started to figure out if my new disk could really be used for virtual machines, and my results are:

  • When creating a new virtual machine, you have no chance of specifying in which repository it has to come
  • It seems to come in the repository where there is the most amount of free space (but I should do more testing to get 100% certainty)
  • When adding a new disk to an existing virtual machine (an extra file on oracle-vm level) the file will come in the same repository, even the same directory as where the initial files of your virtual machine are located. If there is NOT enough free space on the disk, Oracle VM will NOT put your file in another repository on another disk.
  • You can move the datafiles of your virtual machine to any other location while the machine is not running, and while changing the reference to the file in /etc/xen/<machine_name>
  • So actually it looks that on xen-level you can put your vm datafiles in any directory; the concept of the repositories seems to be oracle-vm specific.
  • So if you create a new virtual machine and Oracle puts it in the wrong repository, it is not difficult at all to move it afterwards to another filesystem/repostory. It just requires a little manual intervention. However it seems recommended to keep your machines always in an oracle-vm repository, in the running_pool, because only in that way it can be managed by the Oracle-vm gui.

I am sure that there are many things that have an abvious explanation, but I have to admit that I didn’t read the manuals of ocfs and oracle vm completely from the start to the end. Also I think that Oracle

Conclusion: Oracle VM seems to be capable of having multiple repositories on different disks, but the GUI is not ready to handle them. But with a minimum of manual intervention, it is easy to do all desired tasks in command-line mode.


ASM and large amount of files: follow-up

Apr 20, 2008

As promised I will keep you up to date about the problem in my previous post about the issues with ASM on Solaris when a large amount of files exists (or have existed). I can confirm that there is a fix now that completely solves the problem.

Read the rest of this entry »


Warning: ASM and large amount of files (on Solaris)

Mar 5, 2008

This post is to warn you of a potential problem you may encounter. A combination where it definitely occurs is 10.2 Rac and ASM on Solaris. The problem is the following: due to a lot of db activity, no cleanup of archivelog files, multiple databases present and multiple instances per database, the amount of archivelog files has grown up to +100.000 in ASM (in my case). As a result it turns out that a query on v$asm_file lasts 1 minute or even more, up to 3 minutes. Not a problem on itself, wasn’t it that the query on v$asm_file turns out to block all file manipulation operations in ASM, including the creation/deletion of archivelogs or registering archivelogs in the controlfile. This generates CF enqueue waits in the databases using asm (for lgwr and arc) and very soon waits by user processes on log file sync becasue the lgwr is blocked. In this way your production may be frozen until the query on v$asm_file ends or is interrupted. Knowing that emagent can access v$asm_file, and creation and registering or archivelogs as well, this problem may occur very often, especially in data guard environments where a lot of archivelog manipulation is done.

Read the rest of this entry »


Improve the world, start with yourself.

Jan 31, 2008

Today I wanted to open a Metalink SR for a high availability problem (in a RAC + Dataguard environment), and guess what I got.

It seems like theory is still different from practice.


The Oracle Enterprise Linux story of the master himself: Wim Coekaerts

Jan 25, 2008

Yesterday I had the uniqe opportunity to attend a Q&A session with Wim Coekaerts, the father of the whole linux story within Oracle. I agree completely with what Freek already posted in his blog (he attended as well). On top of that I want to add some personal thoughts as well. Read the rest of this entry »


My experiences with ocrmirror, voting disks and stretched clusters

Jan 24, 2008

If you are about to migrate to Oracle RAC, there is an important thing to know about the OCRmirror. As far as I know, this is documented nowhere in any Oracle doc. This post talks about the failover properties of the OCRmirror. Read the rest of this entry »


It’s my turn to tell 8 things about myself

Jan 14, 2008

I was tagged by Freek so I’ll do what is expected from me.

Read the rest of this entry »


Compare tables

Dec 7, 2007

Recently I described the use of Oracle Streams to do an Oracle upgrade. It allows to create a new empty database of a higher version, and then use streams replication to keep it in sync with the old one. The final migration then consists of stopping the activity on the old database, make sure it is all replicated on the new, and then connect all clients to the new.

This is running fine now in test. However, sometimes you want that extra confirmation that the new database is really identical to the old. Personally, I know that Streams will do its work correctly, but I do have to prove this to others as well.

For small tables, you could easily write in the new database a “select from some_table minus select from some_table@old” and use this construction to verify that rows are identical in old and new database. However this will become a complex operation on large table, and the use of LOBs is even not possible over a database link.

So I came up with a plsql package to calculate a checksum of a table. After some initial thoughts, it turned out not to be so difficult. dbms_utility is my friend. It contains a function “get_hash_value” that returns a number calculated out of a string. The same string will always return the same number, and by choosing an output range between 0 and power(2,31), you will have a hard time finding two strings that return the same hash value.

The problem then reduces to finding a way to convert every column in a string. Easy for char and varchar2, ok for number, date and timestamp using a to_char function, but what about my CLOBs and BLOBs. Well, now dbms_crypto is my friend (10g). This package contains a “hash” function that accepts clob and blob as its input and returns a 128-bit hash (using MD4). The only problem is that it returns a RAW. So you need to be familiar with raw and hex stuff in order to know that the rawtohex function can convert this output to a string containing only hex characters (0..9,A..F), and the string is what we needed, right?

Now it is just a question of generating some dynamic sql, concatenating all output of all columns to eachother and calling the dbms_utility.get_hash_value on it. This returns a number for every row in the table. You can then just select sum(…) of this, to obtain a global hash value of the table.

My experience is that these functions are all very fast, and that the time to calculate the table hash value is nearly equal to the full tabe scan, which is the minimum you have to do in order to check *all* data in the table.

Applying this to streams, I look at the apply process to find the last apply scn, and then use my hash function to calculate the hash number on the tables “as of scn <last apply scn>”. If this returns the same result in my old and my new database, I am 100% sure that both tables are identical (except for a chance of 1 out of power(2,31), but it is still easier than that to win the lottery 2 times in a row…).

I do have another package to compare rows between two tables, with extended options, customization, logging and exact identification of the differences, (however it is slower than the table hash above) but that’s for later…


Auditing. Nice, but how to get rid of those audit records…

Nov 27, 2007

I was very surprised today. I was working on a system (prototype) that has the intention to enable auditing on several databases and to collect the results in a large table in a central database. The idea would be to use database links from the central database to the monitored databases in order to read out the aud$ table and delete the entries after they are copied to the local database. Now my testcase showed a really strange case:

select * from aud$ where sessionid = … and entryid=… and scn=…;

1 row selected

delete from aud$ where sessionid = … and entryid=… and scn=…; (same values)

0 rows deleted

Strange, isn’t it. However my initial search on Metalink gave me the answer immediately, in Note 388169.1: since 10.2 a security enhancement has been made which states that all DML on the AUD$ table should be audited as well, and that as a consequence only user SYS can succesfully do DML on the AUD$ table.

For security reasons I can really appreciate this, but for database management it is not so easy. This means that SYS access is required to clean up the audit trail. So if you had planned not to use SYS anymore and to try doing all by a ‘normal’ dba like system, you are stuck. Note that on a recent poll question on DBA-Village, 40% of the respondents said that they use user SYSTEM or a separate dba account for dba maintenance tasks.


The beauty of Oracle Streams

Nov 24, 2007

I’ve said it already many times and I keep repeating it: Oracle Streams is a wondeful feature. Consider the following case I encountered recently:

A large (+250Gb) and very critical database 10.1.0.4 on single-node non-rac server needs to be migrated to a 10.2 RAC system on ASM with minimal downtime. How would you do this?

Option 1: export/import? Too slow. Database contains 80% of CLOBs which would take ages to import.

Option 2: upgrade using dbua or catpatch. Will be faster but

  • Requires time to transfer the datafiles to the new system
  • Requires rman intervention to get the files in ASM + a lot of testing
  • Turns out that the database has nearly all options installed (olap, data mining, spatial), while on the new server we don’t want them, because we don’t pay for them. So this would mean that we would have to drop these options before starting the migration, but this would give us no chance to test this in advance. And when doing a highly critical migration, I want to have everything tested carefully.

Option 3: data pump? Over a database link? Will be faster than exp/imp, but might still take several hours. Especially when you know that old and new servers are in different data centers with a not-so-fast connection inbetween (don’t know the exact speed).

Option 4: logical standby as 10.2 for the 10.1.0.4 primary? This would allow a quick switchover at the moment of migration, but gives the same problem as the catpatch option: how to get rid of the installed options in a safe way when you use 10.2 software in which these options are not installed.

Option 5: Stream? Of course… Look at the benefits:

  • This option allows to have only the schema(s) replicated containing application data. So the problem of the spatial, data mining and olap options is solved
  • Streams can work between 10.1 and 10.2 databases
  • We start with an empty database created by dbca in ASM, so no stuff to do with rman to copy files from file system into ASM
  • Can be combined with datapump over a database link, so no intermediate temp space required during the migration. Note that this is an operation in advance, so the time for copying the data does not matter. Streams will easily catch up afterwards.
  • Just like logical standby, the new database is open and readable before the migration, so you can continuously check if all changes are received correctly before doing the final switch, and you have the guarantee in advance that the new database is valid and correct.

However there was one concern I had. Adding a capture process on the old database is an important change to do on a highly critical database. What if this capture process would cause too much overhead? Could it cause any kind of process or even instance failure on the old database? Is it safe to add this to a highly critical database?

Well, there is a great solution: downstream capture. The architecture is as follows: the current production sends its archivelogs to the new 10.2 database. This requires only adding an extra, optional, log_archive_dest_x, which is safe to do. The new database receives the archivelogs and runs the capture process, which enqueues it logical changes on a queue in this 10.2 database, i.e. all the sql done on the application tables. (yes, a 10.2 capture can read 10.1 archivelogs). A local apply process reads these changes from the queue and applies them in the 10.2 database. In this way, the new database is kept in sync with the old without any overhead on the old, i.e. a very safe operation in a critical environment. The final switch will consist of stopping all activity on the old database, switch logfile and verify that the last one is applied, and within seconds all is done.

I have to admit that there is a minimal impact on the primary. In order for streams to work best, it is recommended to have primary or unique keys on all tables (which we have, our developers are really good). Additionally supplemental logging must be enabled. This will cause to write the primary key colums in the redolog for every update, even if only non-pk columns are updated. However, my experience tells me that this is also a very safe change with minimal overhead (at least if PK or UN keys exist, otherwise all columns will be written in the redolog).

So to conclude, Oracle Streams with downstream capture allows to prepare the migration completely, guarantees that the new database is correct before starting the final migration, perform no risky changes to the old database, gives no issues to get files from file systen into ASM, requires no intermediate temp space and allows the final switch to be done in a few seconds. That’s what I call the beauty of Streams.