Warning: ASM and large amount of files (on Solaris)

This post is to warn you of a potential problem you may encounter. One combination where it definitely occurs is 10.2 RAC with ASM on Solaris. The problem is the following: due to heavy database activity, no cleanup of archivelog files, multiple databases being present and multiple instances per database, the number of archivelog files in ASM had grown to over 100,000 (in my case). As a result, a query on v$asm_file takes a minute or more, up to 3 minutes. Not a problem in itself, were it not that the query on v$asm_file turns out to block all file manipulation operations in ASM, including the creation/deletion of archivelogs and the registering of archivelogs in the controlfile. This generates CF enqueue waits in the databases using ASM (for lgwr and arc), and very soon user processes start waiting on log file sync because the lgwr is blocked. In this way your production may be frozen until the query on v$asm_file ends or is interrupted. Knowing that the emagent can access v$asm_file, and that creation and registering of archivelogs happen all the time, this problem may occur very often, especially in Data Guard environments where a lot of archivelog manipulation is done.
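
To give an idea, here is a minimal sketch of the kind of check I used from the database side to see who is waiting and on what (event and column names as in 10.2; adapt to your own environment):

-- sessions waiting on the controlfile enqueue or on log file sync,
-- plus the session that is blocking them (if known)
select inst_id, sid, event, blocking_session, seconds_in_wait
from   gv$session
where  event in ('enq: CF - contention', 'log file sync')
order  by seconds_in_wait desc;

If, at the same moment, a session on the ASM instance is running a long query on v$asm_file, you are most likely hitting this problem.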

This behaviour is hard to believe, but I can confirm that I have seen it with my own eyes and analyzed, tested and reproduced it myself. The root cause is the combination of the slow ASM query and its blocking effect, two separate things. I have had no chance yet to test the same on Linux or any other platform. It might be Solaris specific, because there exists bug 6761100: Query on V$asm_files very slow on Solaris compared to Linux. If the query on v$asm_file were fast, you would probably never run into the wait events mentioned above. Nor can I confirm whether it only occurs in RAC or also in non-RAC installations.

So to me it looks as if ASM isn’t designed (at the moment) for very large numbers of files. However, if you have a 3-node RAC cluster with 4 databases on it, each instance doing a log switch every 15 minutes (because maybe there is a standby that should not lag behind too much), you produce 3 x 4 x 96 = 1,152 archives per day. Keeping these for three weeks already gives about 25,000 files. If then something accidentally goes wrong in the cleanup script, or you have batch jobs that generate a lot of redo, you may end up with an even larger number of files.
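
As an illustration, here is a sketch of the kind of RMAN cleanup that keeps the count down; the three-week window is just an assumption, and of course the archivelogs must already be backed up and/or applied on the standby before you delete them:

# hypothetical retention: keep three weeks of archivelogs, remove the rest
delete noprompt archivelog all completed before 'sysdate - 21';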

The most annoying thing about this is that you can’t get rid of it. It turns out that removing the files afterwards doesn’t really solve the problem. To me it looks as if, for each deleted file, something is left behind in ASM that still needs to be traversed during the query. Only emptying the diskgroup and recreating it with fewer files will solve the problem.

But I repeat: it is only the combination of the two issues (slow query and blocking effect) that causes trouble, and as far as I know, only in the combination of 10.2 RAC on Solaris. For me, the query on v$asm_file may be slow as hell, as long as it doesn’t block anything else.

So you are warned: it is not a bad idea to keep the number of files in ASM relatively low (I would say below 10,000).
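
If you want to keep an eye on the file count, something like the sketch below (run on the ASM instance) gives the number of files per diskgroup and type; keep in mind that on an affected system this very query is the slow and blocking one, so run it at a quiet moment:

-- on an affected system this query itself is the slow/blocking one
select g.name diskgroup, f.type, count(*) files
from   v$asm_diskgroup g, v$asm_file f
where  g.group_number = f.group_number
group  by g.name, f.type
order  by g.name, f.type;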

P.S. If you have experienced similar behaviour on another hardware/software configuration, I am very interested in the details.


4 Responses to Warning: ASM and large amount of files (on Solaris)

  1. Greg Rahn says:

    Have you filed an SR with Oracle support? What is their response?

    What is asm_diskstring set to?
    How many devices match that pattern?
    How many of those devices are provisioned for ASM?

    For instance, if asm_diskstring='/dev/rdsk/*' what does
    ls -1 /dev/rdsk/*|wc -l
    return?

    I would hesitate to call it a design issue, but it may be a bug that is exacerbated by a long list or something similar.

  2. pier00 says:

    The blocking effect is under investigation at Oracle Support. It is probably a bug. I just have to wait for further answers. The ASM diskstring refers to a directory (/dev/oracle) containing only 2 devices for the 2 LUNs on the storage.

  3. […] amount of files: follow-up As promised I will keep you up to date about the problem in my previous post about the issues with ASM on Solaris when a large amount of files exists (or have existed). I can […]

  4. odenysenko says:

    Hi.

    I have blogged about this situation in
    http://odenysenko.wordpress.com/2009/06/25/asm-performance-with-huge-number-of-files/

    The really bad thing is that, to get back to good
    performance without applying the specified patch,
    you need to recreate the diskgroup.

    Another issue that compounds it is the
    OEM GC agent, which queries the V$ASM_ views
    quite often… and from every node of the cluster.

    Oleksandr
