VLDB with ASM?

Today I attended the first day of the ASM course/workshop led by Martin Gosejacob. He is quite good and knows his stuff. Since I had played with ASM before and installed the practice clusters for the course, I already knew the general content. Nevertheless, I picked up enough new and interesting pieces of information. Today I will cover my concerns about managing large amounts of storage with ASM. Later I will try to post some other interesting twists that are not commonly mentioned.

A serious limitation of ASM is that it doesn’t have dirty region logs. Oracle claims that you don’t need them because the Oracle database is ASM-aware and recovery at the ASM level is not required. However, there is a serious flaw: resyncing/rebalancing/resilvering of a large part of a diskgroup has to be done from scratch.

Imagine a database of a few TB with data mirrored by ASM across two SAN boxes accessed via separate FC cards and separate switches. You will need to organize all disks into 2 failure groups (each group holding the disks from one SAN box) so that all extents are mirrored across the two SAN boxes.
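
To make the layout concrete, here is a minimal Python sketch of mirrored extents spread over failure groups. It is a toy model only, not ASM's actual allocator: the round-robin placement and the names `place_mirrored_extents`, `surviving_copies`, `san_a` and `san_b` are assumptions made for illustration.

```python
from itertools import cycle


def place_mirrored_extents(num_extents, failgroups):
    """Toy placement: the primary and mirror copy never share a failure group."""
    if len(failgroups) < 2:
        raise ValueError("mirroring needs at least two failure groups")
    placement = []
    primaries = cycle(range(len(failgroups)))
    for _ in range(num_extents):
        p = next(primaries)
        m = (p + 1) % len(failgroups)  # the next failure group holds the mirror
        placement.append((failgroups[p], failgroups[m]))
    return placement


def surviving_copies(placement, failed_group):
    """Count the copies of each extent left after one failure group is lost."""
    return [sum(1 for fg in pair if fg != failed_group) for pair in placement]


layout = place_mirrored_extents(6, ["san_a", "san_b"])
print(layout[0])                          # ('san_a', 'san_b')
print(surviving_copies(layout, "san_b"))  # [1, 1, 1, 1, 1, 1]
```

With only two failure groups, losing one box leaves every extent with a single copy and no second failure group to re-mirror into.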

Maintenance or an unplanned outage on one of those components will make one whole side of the mirror unavailable. ASM will try to rebalance, but with no available disks in one of the failure groups it won’t get anywhere. Database operational status is not impacted at this point (unless you need to add a new datafile or resize one - see below). ASM will then start “ejecting” the failed disks and their status will be HUNG, i.e. stuck in an attempt to drop them from the diskgroup. If multipathing is used and both switches are connected to both enclosures, then a switch or FC card failure is not that bad, but there is still no protection from a SAN box outage or firmware upgrade.

Let’s assume that access to the secondary box is back in 30 minutes and we are trying to put the disks back. Since there is no change log, there is no way to perform fast resilvering. But this is not the full story. As mentioned before, ASM has already initiated removal of the failed disks from the diskgroup, and the fact that the disks are back doesn’t affect ASM in any way now. You cannot drop the failed disks until ASM finishes rebalancing, and you cannot resync them because ASM is actually dropping them - there is no way back. The only way out is to add enough disks to the failure group with the HUNG disks so that ASM can finish rebalancing and finally drop the disks from the failed SAN box.

This translates into hours or days of rebalancing for several TB, plus additional space equal to the size of the failed disks. In reality, it’s probably possible to reuse the failed capacity, but that involves reorganizing it at the SAN level, presenting it as new devices, cleaning them up and adding them as new disks. There are two options for adding the same disks:
1. Clean up the ASM labels (dd zeroes over roughly the first 1 MB of the device) prior to adding.
2. Add the disks to the diskgroup with the force option.
The first method requires additional effort and the second is quite dangerous, as one can force-add a good disk that is used somewhere else, especially if you use wildcards (would you like to list all 100 devices manually?). Then again, with the first method you may just as well overwrite good disks.

As I mentioned already, if there is not enough free space for rebalancing in one of the failure groups, then additional space cannot be allocated - i.e. files cannot be extended and new files cannot be added in this diskgroup. So until enough space is provided, you are at risk of your tablespaces filling up.
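
The allocation rule behind this can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not ASM code: the function name `can_allocate_mirrored_extent`, the free-space figures and the 1 MB extent size are all illustrative.

```python
def can_allocate_mirrored_extent(free_mb, offline, extent_mb=1):
    """Return True if two distinct online failure groups can each hold one copy.

    free_mb maps failure group name -> free space in MB (hypothetical numbers).
    offline is the set of failure groups whose disks are currently unavailable.
    """
    candidates = [
        fg for fg, free in free_mb.items()
        if fg not in offline and free >= extent_mb
    ]
    return len(candidates) >= 2


two_boxes = {"san_a": 500_000, "san_b": 500_000}
print(can_allocate_mirrored_extent(two_boxes, offline=set()))        # True
print(can_allocate_mirrored_extent(two_boxes, offline={"san_b"}))    # False

three_boxes = dict(two_boxes, san_c=500_000)
print(can_allocate_mirrored_extent(three_boxes, offline={"san_b"}))  # True
```

The model only covers new allocations; growing a file inside space that is already allocated is a separate case.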

It would be interesting to see what happens with archive logs when one failure group is “out”. I need to test it, as I believe that new files will not be created until some disks are added.

In fact, ASM with its current methodology couldn’t really use the dirty region log principle. As soon as a failure is detected, ASM initiates the drop of those disks. To solve this issue, there should be an option to avoid automatic rebalancing, in addition to the introduction of dirty region logs.
So how can this problem be mitigated with the current ASM implementation?
It’s possible to use many small storage boxes instead of a couple of high-end enclosures. However, this will require more complex SAN networking with multiple switches and FC cards.

  1. Earl Cooper
    June 12th, 2006 at 23:05 | #1

    Perhaps someone could take a moment and drop me an email.
    We are testing 10G and going to GRID but we continue to be mystified as to why customers who have SANs choose ASM. When I read about ASM it sounds like a solution for customers who do not have RAID. What do you see as the main benefits?

  2. June 13th, 2006 at 10:04 | #2

    Earl,
    Well, you need to organize your storage in files/volumes/whatever you call them, since Oracle needs these chunks for datafiles, redo logs and controlfiles. It also needs an archivelog destination and a backup area. While the latter, strictly speaking, doesn’t have to be shared across all nodes, datafiles and the others must be shared for RAC.
    So the requirement for RAC is shared storage. Usually there are three ways to do it - a cluster filesystem, a cluster volume manager (often underneath a CFS) and NAS-type network storage. Without these pieces of software you can theoretically live with a pure SAN and use LUNs directly as volumes for Oracle. However, I have never seen anyone do that, for practical reasons - it’s unmanageable.
    So ASM gives you some hybrid of CVM and CFS that can be used only for Oracle. You can find plenty of white papers at http://www.oracle.com/technology/products/database/asm/index.html.

  3. June 13th, 2006 at 14:57 | #3

    Earl,

    as our host says, you need some kind of clustered disk for RAC, and ASM is at least no worse than OCFS. If you trust your SAN, it’s easy enough to set the ASM disk groups to external redundancy, which stops them trying to do the mirroring / resilvering which it seems not to be very good at.

  4. David Burnham
    July 31st, 2006 at 19:57 | #4

    Hi Alex,
    your blog article is very interesting as we are testing something very similar on Oracle 10.2.0.2 on Linux (Redhat AS 4). What we’ve found is that even if you have loads of space in your diskgroups - when the failure group LUN’s disappear the absent disks go into an offline hung status.

    We’ve tried this with a 5 GB database and a 75 GB DGDATA disk group (mirrored over 2 arrays so 150 GB storage) and a 30GB DGFRA mirrored archive and FRA space.

    It looks like ASM tries to kick off a rebalance operation which fails with a spurious disk group space error then the missing disks are hung. Once hung even when the storage is added back ASM never looks at them.

    It is possible to remove the failgroup and add the disks back with the ALTER DISKGROUP and the FORCE option - but this introduces corruption…

    Did you ever get mirroring to work in your tests ?

  5. July 31st, 2006 at 22:13 | #5

    Hi David,

    This is exactly what I mean. If you mirror over two boxes then you will want two failure groups. If one of your boxes is lost (even for a short time, but long enough to mark the disks failed and initiate rebalancing) then you are lost, and the only recovery, as you figured, is to add space to the lost failure group (or create a new one), possibly reusing the old disks (which are in HUNG state) with the force option.

    What you can explore is other mirroring software (using either storage vendor techniques or “normal” volume managers).

    Alternatively, you can consider a different organization of storage and, instead of large storage boxes, use many small boxes and, consequently, many failure groups. In this case, complete failure of one failure group won’t cause you any problems as long as you have enough space on the other failure groups to rebalance in such a way that the copies of every ASM extent can be placed in separate failure groups. But watch out for what it adds to your SAN infrastructure - it will be way more complex.

    If you find anything interesting or other solutions, let me know.

    Cheers,
    Alex

  6. August 12th, 2006 at 16:08 | #6

    Hi Alex,
    I have an update on our ASM mirroring problem that you and your readers may be interested in. We have had lengthy discussions with Oracle and have managed to speak with the designer of ASM at Redwood Shores. We have also done extensive testing with 2 failure group mirrored ASM configurations.

    Firstly I must say that the documentation for ASM is shockingly poor and the knowledge within Oracle Support of ASM is not much better. Anyway on to what we have found out.

    The reintegration of disks back into a failed ASM failure group is NOT automatic (although this is implied in the Admin Guide). If you are running with 2 failure groups and you lose access to the storage for one of them, all of its disk names will go into an offline, hung state as seen in V$ASM_DISK. You will also notice that there is no path to the disk devices in the view - that mapping has been lost, and there is another entry in V$ASM_DISK for each of the disk paths in a member, offline state.

    Once you know that your storage has returned and is stable, YOU are expected to add the disks back into the failed failure group with the following syntax:
    ALTER DISKGROUP dgdata ADD FAILGROUP fgname
    DISK '/dev/raw/raw4' NAME New_name_for_disk4 FORCE,
    '/dev/raw/raw5' NAME New_name_for_disk5 FORCE;

    With the following caveats:
    The failure group should have the same name as the failure group that held the disks you lost connectivity to.

    The names for the disks need to be new names, ones that do not appear in the V$ASM_DISK view - you cannot reuse the old names.

    If you are reusing the disks that failed (say you lost connectivity to them temporarily) you will need to use the force option. This basically tells the command: I know that the disks have an ASM header block on them, but I want you to reuse them anyway. An alternative is to use dd to zero out the headers, but I think this is more dangerous than using force. Using force also checks that your device is not actually in use by the particular ASM instance - so it gives a little more protection against finger trouble.

    I’ve found that the above command does reintegrate the failed disks back into the failure group, as a rebalance process is kicked off to perform the operation. When the rebalance completes, the original disk names (the ones left without a path) are deleted from V$ASM_DISK and everything is mirrored up again.

    As to the corruption we saw - we are almost sure that this was down to us using
    ALTER DISKGROUP dgdata DROP DISKS IN FAILGROUP fgname;
    to simulate loss of SAN storage without playing with the hardware.
    We could only replicate the corruption (on rebalancing) on the 64-bit Linux port of Oracle 10.2.0.2 and not on the 32-bit port.
    We’ve raised a bug with Oracle and they’re investigating.

    ASM has the potential to be an extremely powerful and useful volume management system for Oracle grid implementations - but it is still quite buggy and not quite mature yet.
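
For many devices, the re-add statement described in this comment can be generated from an explicit device list instead of being typed by hand or built from wildcards. The Python sketch below is only an illustration: `build_readd_statement` and the `NEWDISK` name prefix are hypothetical, and the output assumes a single `DISK` keyword followed by a comma-separated list of disk clauses, so check the result against the SQL reference for your release before running it.

```python
def build_readd_statement(diskgroup, failgroup, device_paths, existing_names,
                          prefix="NEWDISK"):
    """Build an ALTER DISKGROUP ... ADD FAILGROUP statement with FORCE per disk.

    existing_names should hold every disk name currently visible for the
    diskgroup (including the orphaned ones), because old names cannot be reused.
    """
    if not device_paths:
        raise ValueError("no devices given")
    taken = {name.upper() for name in existing_names}
    clauses = []
    counter = 0
    for path in device_paths:
        if "*" in path or "?" in path:
            raise ValueError("wildcards are refused; list each device explicitly")
        # Skip any candidate name that is already in use.
        while f"{prefix}_{counter:04d}".upper() in taken:
            counter += 1
        name = f"{prefix}_{counter:04d}"
        taken.add(name.upper())
        clauses.append(f"'{path}' NAME {name} FORCE")
    disks = ",\n  ".join(clauses)
    return f"ALTER DISKGROUP {diskgroup} ADD FAILGROUP {failgroup} DISK\n  {disks};"


print(build_readd_statement(
    "dgdata", "fgname",
    ["/dev/raw/raw4", "/dev/raw/raw5"],
    existing_names=["NEWDISK_0000"],
))
```

Refusing wildcards is a deliberate choice here: with FORCE, an overly broad pattern could pull in a healthy disk that belongs elsewhere.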

  7. August 15th, 2006 at 21:38 | #7

    Hi David,

    Thanks for sharing your experience. This is exactly the behavior I observed.

    Btw, I think you can add all the disks back under a new failure group name - the only requirement is that it must be different from the surviving FG; it can also be the same as your lost FG. The requirement is to place the mirror extent in any OTHER failure group, so using a third FG would be just fine. This is, as I mentioned, possibly a workaround for the mirroring issues in ASM - use more than 2 (like 10 :) ) SAN boxes, each in its own FG. In this case one broken FG won’t hurt, assuming you have enough free space to accommodate rebalancing.

    Regarding the 64-bit Linux platform - I also noticed that it’s not as stable, well tested and up to date (availability of one-offs) as 32-bit.

    Cheers,
    Alex

  8. Nitin Vengurlekar
    December 21st, 2006 at 23:50 | #8

    As David pointed out nicely, to recover from that failure group issue you can
    do: ALTER DISKGROUP ADD FAILGROUP
    DISK '/dev/raw/raw4' NAME New_name_for_disk4 FORCE,
    '/dev/raw/raw5' NAME New_name_for_disk5 FORCE;

    But just to be clear on this, for those new to ASM: this issue with failure groups is specific to ASM redundancy and is not an issue with external redundancy. Moreover, most VLDB customers are using external redundancy. For those that attended OOW 2006 in California, you may have heard several customers speak at sessions, claiming successful multi-terabyte deployments of ASM. I’m currently working with a customer that will have an initial system size of 30 TB and grow to 100 TB+ by the end of 2007, all within ASM.

    btw, 11g ASM will introduce Disk resilvering !!

  9. December 22nd, 2006 at 02:13 | #9

    Nitin,

    Actually, there is a design that can be used for very large databases with ASM mirroring - use many low-end storage boxes, each in its own failure group, with normal redundancy (i.e. 2 copies). This way, the loss of a whole storage box, and with it one complete failure group, wouldn’t compromise your system provided you have enough space on the rest of the failure groups. So if you have 10 failure groups - make sure you have at least 10% of the space free.

    On the surface, using low-end storage boxes should decrease the cost of the system, but this will lead to a more complex SAN network infrastructure that is, obviously, more expensive to set up and maintain.

    The resilvering move is a good one, but it will add complexity to the so far relatively simple ASM software. ASM is actually not as buggy as many other new features that Oracle delivers these days, so I hope that Oracle will improve the quality of ASM even more.

    Thanks for your response.

    Cheers,
    Alex

  10. Marcin Przepiorowski
    January 17th, 2007 at 06:48 | #10

    Hello,

    I ran some tests and I was able to resize a datafile even when one of the disks in the disk group was in HUNG status. I was using 10g R2 out of the box on Windows. What version do you use?

    regards,
    Marcin

  11. January 17th, 2007 at 09:44 | #11

    10.2.0.1 on Linux. The reason it should not be able to allocate more ASM extents (allocation units) is that there is no available failure group to mirror it.

  12. Marcin Przepiorowski
    January 18th, 2007 at 02:59 | #12

    Hello Alex,
    So it was a little misunderstanding. When one of the disks is in HUNG mode, you can’t allocate more ASM allocation units, but you can still resize your datafile inside the existing space.
    I’m going to implement that solution, but I have a concern about replication time because I will have two storage arrays with 1 km of fiber channel between them and, what is worse, it will be a RAC implementation.

    regards,
    Marcin

  13. January 19th, 2007 at 22:24 | #13

    Marcin,
    Right. If all of the disks in one of two failure groups are in hung state then there is a problem. Typically for your case you would have two SAN boxes on two sites and LUNs from those two boxes are grouped into two failure groups. Obviously, if one box/site has a problem - the WHOLE failure group is in HUNG state.
    Thanks for your comments.

  14. Shiva Raghunath
    February 26th, 2007 at 12:18 | #14

    Alex,

    Thanks for the information and article.

    Will not most people go for external redundancy? At least I am in that category. I appear to be fine.

    Thanks,
    Shiva

  15. February 26th, 2007 at 17:26 | #15

    Shiva,
    Well, yes, at least until 11g, which, as rumors say, is coming with fast resilvering. ;-)
    Cheers,
    Alex

  16. Tim Jacobs
    June 13th, 2007 at 17:21 | #16

    I have used ASM with a petabyte of storage. We used external redundancy. No issues. Initial start was slow due to many, many, many paths. We used HP’s EVAs and EMC storage. This was on 10.1.0.4. ASM worked great. We had problems with the BIG Tablespace not really being that BIG…. But it just gets better with each release.

  17. June 13th, 2007 at 17:37 | #17

    Tim, you didn’t mirror between the boxes then. Right?
    Actually, if you have many storage boxes and many failure groups, you are not that badly off even with ASM mirroring.

  18. Tim Jacobs
    June 18th, 2007 at 11:19 | #18

    Correct Alex, the SANs have enough redundancy. I do like the idea of having Oracle level the storage, but I think that would be better when the database is smaller (tera) and there might be different types of storage: NAS, and slow and fast storage.

  19. June 18th, 2007 at 11:39 | #19

    Tim,

    Peta-Byte with ASM is quite impressive. And with 10.1.0.4 - sounds like a dream. :)
    You mention different types of storage - did you mean it’s a good idea to manage them under ASM?
    I would disagree with that. Mixing fast and slow storage in one DG is a bad idea, as ASM assumes they all have the same performance and stripes based on the SAME approach.

    Also, it doesn’t look like ASM has a convenient provisioning scheme. At least, the storage admins that I’ve been working with do not favor provisioning in ASM. They are missing lots of the enterprise features they have in the Veritas stack, for example. And they are smart chaps, so they can’t be completely wrong.

    Anyway, thanks for sharing your experience!

  20. Anton Krasimirov
    August 16th, 2007 at 12:12 | #20

    Hello Alex

    I’m wondering what happens when I have 3 failure groups in normal redundancy. Will the diskgroup still be available if 2 of the failgroups fail? And how is data distributed in this configuration?

  21. August 20th, 2007 at 20:59 | #21

    With 3 FGs and normal redundancy the data will be distributed amongst all 3 failure groups so that primary and secondary ASM extents are in different FGs. I.e. if you have a normal redundancy diskgroup with 3 FGs of 100 GB each, then you get 150 GB of usable space in this DG. If you use 90 GB of this diskgroup (real size of datafiles before mirroring) then each FG will contain 60 GB of data, with extents randomly mirrored across the 3 FGs. The failure of one FG will cause ASM to re-balance the data so that everything fits on the two remaining failure groups - i.e. 90 GB in each FG. This should work just fine.

    If in the same configuration 120 GB is used, then each failure group contains 80 GB. In case of failure of one FG, ASM won’t be able to re-balance the data onto the two remaining FGs - it needs to place 120 GB on each failure group whereas each FG contains only 100 GB of disks.

    In order to sustain the failure of a single FG in a normal redundancy diskgroup with n failure groups, you have to keep at least 100/n percent of space free.
    2 FGs - 50%
    3 FGs - 33%
    4 FGs - 25%
    5 FGs - 20%
    and so on.
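
The arithmetic above can be checked with a short Python sketch. The function names are illustrative, and the model looks at capacity only: it assumes equally sized failure groups and two copies of every extent, and it does not model whether enough failure groups remain to keep the copies apart.

```python
def min_free_percent(num_failgroups):
    """Free space needed to absorb the loss of one failure group: 100/n percent."""
    if num_failgroups < 2:
        raise ValueError("normal redundancy needs at least two failure groups")
    return 100.0 / num_failgroups


def fits_after_one_failgroup_loss(num_failgroups, failgroup_size_gb, data_gb, copies=2):
    """Capacity check: do all mirrored copies fit on the surviving failure groups?"""
    raw_used_gb = data_gb * copies
    remaining_capacity_gb = (num_failgroups - 1) * failgroup_size_gb
    return raw_used_gb <= remaining_capacity_gb


print(min_free_percent(4))                         # 25.0
print(fits_after_one_failgroup_loss(3, 100, 90))   # True:  180 GB fits into 200 GB
print(fits_after_one_failgroup_loss(3, 100, 120))  # False: 240 GB does not fit into 200 GB
```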

  22. Anton Krasimirov
    August 21st, 2007 at 16:58 | #22

    Thank you Alex :) !

  23. November 26th, 2007 at 00:35 | #23

    Hi Alex,

    Can you shed some light on how to bring an ASM disk in HUNG state back to normal state?

    NAME                           HEADER_STATU MOUNT_S MODE_ST STATE
    ------------------------------ ------------ ------- ------- --------
    DATA1                          MEMBER       CACHED  ONLINE  NORMAL
    DATA2                          MEMBER       CACHED  ONLINE  NORMAL
    DATA3                          MEMBER       CACHED  ONLINE  NORMAL
    DATA4                          MEMBER       CACHED  ONLINE  NORMAL
    DATA5                          MEMBER       CACHED  ONLINE  HUNG
    DATA6                          MEMBER       CACHED  ONLINE  HUNG

  24. January 30th, 2008 at 23:21 | #24

    Lemu,

    Sorry, I noticed your question very late. You have probably already resolved it, but here is a quick explanation for others who get here via search or other references.

    The status of HUNG for an ASM disk is “awarded” when you request to drop the disk and ASM doesn’t have enough space on the other disks to finish rebalancing (copying extents off the disks that are being dropped). This might not be obvious when you use normal or high redundancy, so you need to consider how your disks are organized in failure groups.

    That’s it in a nutshell. I hope this comment leads to the right investigation path without overwhelming you with details here.

  25. Prakash
    January 28th, 2011 at 09:36 | #25

    Alex;
    Very impressive discussions here, thanks so much. I have a question, as I’m required to drop 2 of the disks from 2 different existing normal redundancy diskgroups in order to create another diskgroup with external redundancy. My SAN storage (DS4700) is configured with RAID 5, with proper failover capabilities. I’ve posted my configuration below for your perusal, kindly reply by today please.

    SQL> SELECT
           NVL(a.name, '[CANDIDATE]') disk_group_name
         , b.path disk_file_path
         , b.name disk_file_name
         , b.failgroup disk_file_fail_group
         FROM
           v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
         ORDER BY
           a.name;

    DISK_GROUP_NAME  DISK_FILE_PATH  DISK_FILE_NAME  DISK_FILE_FAIL_GROUP
    ---------------  --------------  --------------  --------------------
    ARCHDG           /dev/rhdisk26   ARCHDG_0001     GROUP2
    ARCHDG           /dev/rhdisk25   ARCHDG_0000     GROUP1
    DATADG1          /dev/rhdisk17   DATADG1_0001    GROUP2                -- want to take out
    DATADG1          /dev/rhdisk16   DATADG1_0000    GROUP1
    DATADG1          /dev/rhdisk15   DATADG1_0003    DATADG1_0003          -- want to take out
    DATADG1          /dev/rhdisk14   DATADG1_0002    DATADG1_0002
    DATADG2          /dev/rhdisk27   DATADG2_0002    DATADG2_0002
    DATADG2          /dev/rhdisk28   DATADG2_0003    DATADG2_0003          -- want to take out
    DATADG2          /dev/rhdisk19   DATADG2_0001    GROUP2                -- want to take out
    DATADG2          /dev/rhdisk18   DATADG2_0000    GROUP1
    INDEXDG          /dev/rhdisk22   INDEXDG_0001    GROUP2
    INDEXDG          /dev/rhdisk21   INDEXDG_0000    GROUP1
    TEMPDG           /dev/rhdisk24   TEMPDG_0002     TEMPDG_0002           -- want to take out
    TEMPDG           /dev/rhdisk11   TEMPDG_0000     TEMPDG_0000
    TEMPDG           /dev/rhdisk13   TEMPDG_0001     TEMPDG_0001
    [CANDIDATE]      /dev/rhdisk23
    [CANDIDATE]      /dev/rhdisk7
    [CANDIDATE]      /dev/rhdisk8
    [CANDIDATE]      /dev/rhdisk9
    [CANDIDATE]      /dev/rhdisk20
    [CANDIDATE]      /dev/rhdisk12

    21 rows selected.

    SQL>

    I’ve marked the disks I want to take out. Kindly reply Sonnest by today please.

    Regards
    Prakash

  26. January 28th, 2011 at 10:31 | #26

    @Prakash,

    I would like to emphasize that I do not do free consulting on the blog. The blog is created to help readers understand the technology and make the right decisions. If you need expert help with a specific production problem, we can do a proper analysis and solution as part of Pythian services.

    I’m also quite busy these days and usually ignore the requests with “Kindly reply Sonnest by today”.

    What you need to do is to understand the impact of the disks you are dropping and whether you still have enough capacity on your normal redundancy disk groups - not only to rebalance the data off the dropped disks but also to survive the failure of one of the disks that are left there. Based on your output, there are normal redundancy diskgroups with only two disks - you won’t be able to recover automatically from a disk failure in these diskgroups.
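
A quick way to spot such diskgroups is to count failure groups per diskgroup in the query output. The Python sketch below is only an illustration: the function names are hypothetical, the rows are a subset of the listing above, and it assumes every listed diskgroup uses normal redundancy (two copies).

```python
from collections import defaultdict


def failgroups_per_diskgroup(rows):
    """Count distinct failure groups for each diskgroup from (diskgroup, failgroup) rows."""
    groups = defaultdict(set)
    for diskgroup, failgroup in rows:
        groups[diskgroup].add(failgroup)
    return {dg: len(fgs) for dg, fgs in groups.items()}


def cannot_remirror_after_failure(rows, copies=2):
    """Diskgroups that would have fewer failure groups than copies after losing one."""
    counts = failgroups_per_diskgroup(rows)
    return sorted(dg for dg, n in counts.items() if n <= copies)


rows = [
    ("ARCHDG", "GROUP2"), ("ARCHDG", "GROUP1"),
    ("DATADG1", "GROUP2"), ("DATADG1", "GROUP1"),
    ("DATADG1", "DATADG1_0003"), ("DATADG1", "DATADG1_0002"),
    ("INDEXDG", "GROUP2"), ("INDEXDG", "GROUP1"),
]
print(cannot_remirror_after_failure(rows))  # ['ARCHDG', 'INDEXDG']
```

Removing the two marked DATADG1 rows from the input puts DATADG1 on the list as well, which is the impact to think through before dropping disks.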
