Friday, December 23, 2011

Golden Gate RMAN Interaction Causing RMAN-08137

**6/3/2013 Update: The following applies to Classic Capture. Things have changed a little bit for Integrated Capture. I hope to address this in a new blog post.**
We are test-running GG on our development instance, so we had to put the database in ARCHIVELOG mode.
Until this incident happened, I used to think the only reason the database needed to be in ARCHIVELOG mode was so that GoldenGate could read the archived log files if it lagged behind the online redo logs.

We ran an RMAN job via cron every 4 hours that executed just the following command:

delete noprompt archivelog all;
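
For context, here is a minimal sketch of how such a job might be wired up. The script name and log path are hypothetical; the ORACLE_SID and ORACLE_HOME are the ones from the parameter files later in this post:

# crontab entry: run the purge every 4 hours (hypothetical script/log paths)
0 */4 * * * /home/oracle/scripts/purge_arch.sh > /tmp/purge_arch.log 2>&1

# purge_arch.sh (hypothetical)
#!/bin/sh
export ORACLE_SID=testdb
export ORACLE_HOME=/oracle/software/rdbms/11.2.0.3
export PATH=$ORACLE_HOME/bin:$PATH
rman target / <<EOF
delete noprompt archivelog all;
exit
EOF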

We did not back up the archive logs since this was a DEV instance; the database was in ARCHIVELOG mode only for the GG capture process. We were also aware that GG will prevent deletion of any archive logs it still needs to read (for example, when a lag keeps it from keeping pace with the database's redo generation).

Lately the archive destination had been filling up to 100%, and when we ran the job manually we would see archive files as old as 8-10 hours still in the destination.
The RMAN delete would also fail for those files, complaining as follows:

RMAN> delete noprompt archivelog all;

using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=8 device type=DISK
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=/oracle/arch/test123/arch/1_3071_767059108.dbf thread=1 sequence=3071
...
Then we checked which redo log the capture process was currently reading (info ext <ext name>, detail), and it showed the sequence number of the current online redo log file.

So the big question was:

WHY IS GOLDEN GATE PREVENTING ARCHIVE LOGS FROM BEING DELETED IF IT IS NOT LAGGING BEHIND?

So today I am going to share a few things I learnt working with Oracle Support.

The first thing I came to know was that the capture process writes only committed transactions into the trail file. This turned out to be the root cause of the issue.

So technically, if it is 3pm now and there is an active transaction that has been running since 7am, all the changes between 7am and 3pm have to be recorded somewhere, right? They obviously cannot go into the trail file, as only committed transactions get written there. (I am still not sure why it was architected this way. Any thoughts from anybody?)

So it looks like the GoldenGate capture process keeps the details of a transaction in memory and writes them to the trail file only on commit. There is also a feature called Bounded Recovery (BR), whose interval is set to 4 hours by default. That means that when an active transaction runs beyond 4 hours, the transaction is copied from memory into a BR file under the $GG_HOME/BR/<extract name> directory.

This interval can be changed by adding the BR parameter to the extract parameter file. The following line sets it to 1 hour:

BR BRINTERVAL 1H

or, for 20 minutes:

BR BRINTERVAL 20M
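
For reference, BR also accepts a BRDIR option to control where the BR files are written. A sketch combining the two (the directory here is just an example):

BR BRDIR ./BR, BRINTERVAL 1H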

Now assume BR is set to 4 hours (the default value) and somebody stops the extract process at, say, 7pm, at which point there are many active transactions, the oldest of which started at 8am.

Without the BR feature, when the extract is restarted it has to read redo information starting from 8am, which would most probably have been archived by then. The extract simply will not start if it cannot access that archived file. This seems to be the most important reason for a database to be in ARCHIVELOG mode for the GG capture process to function effectively.

Please note that even reducing BR down to 1 hour or so is not going to resolve the RMAN-08137 issue.

The golden rule is:
However old an archive file is, if it contains a record from a still-active transaction, it cannot be deleted. Neither will any of the subsequent archive log files be deleted.



For example, if you wrote code like this (the inserted values are just illustrative):

BEGIN
  insert into t1 values (1);    -- redo sequence is 3500 on day 1
  dbms_lock.sleep(15 * 86400);  -- sleep for 15 days (the argument is in seconds)
  insert into t2 values (1);    -- redo sequence is 6000 on day 16
END;
/
RMAN will be unable to delete from sequence 3500 onwards, even though sequences 3501 through 5999 have nothing to do with this transaction. (I am testing this out next week and will update if I see any anomalies.)
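
Until then, a rough way to see which archived log covers the start of the oldest open transaction is to match its start time against each log's time range. This is a time-based sketch, so treat it as approximate:

select a.thread#, a.sequence#, a.name
  from v$archived_log a,
       (select min(to_date(start_time, 'MM/DD/YY HH24:MI:SS')) min_start
          from gv$transaction) t
 where t.min_start between a.first_time and a.next_time;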

And since it is not practical to ask people to commit their transactions frequently, I tend to favour decoupling extract and RMAN:

ggsci> stop ext <extract name>
ggsci> dblogin userid <id> password <pw>
ggsci> UNREGISTER EXTRACT <extract name> LOGRETENTION

and then editing the extract parameter file to add:

TRANLOGOPTIONS LOGRETENTION DISABLED

Again, this could lead to other issues, such as the extract abending because it needed to read from archive logs (when lagging) and could not find them, as RMAN had already deleted them.

Therefore a good compromise is to delete only archive logs older than, say, sysdate - 4 hours.
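
For example (the four-hour window is just the value we settled on):

RMAN> delete noprompt archivelog all completed before 'sysdate - 4/24';
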
Let me know if anybody has found a better way than decoupling extract and RMAN.

*********IMPORTANT 6/7/2012 UPDATE***************
PLEASE NOTE THAT THE DEFAULT BEHAVIOUR HAS CHANGED AS OF GOLDEN GATE VERSION 11.1.1.1.2. FROM THIS VERSION ON, GG AND RMAN ARE DECOUPLED BY DEFAULT, SO YOU HAVE TO COUPLE THEM MANUALLY IF YOU WANT THEM TO INTERACT WITH EACH OTHER. A GOOD WAY TO CHECK IS TO QUERY DBA_CAPTURE.
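
FOR EXAMPLE (NO ROWS RETURNED MEANS NO IMPLICIT CAPTURE PROCESS EXISTS):

select capture_name, status from dba_capture;
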
HERE IS A STATEMENT FROM ORACLE SUPPORT:

                            ++++++++
I have checked again and the fix for Bug 12648838 went into 11.1.1.1.2 (the note I looked at is not correct), so the default behaviour is to not create an implicit capture process unless you have explicitly specified the LOGRETENTION parameter. In your case you have not; therefore you do not have one.

So what you observe is correct in your version.


                         +++++++++++++++ 
*************************************************************************** 

A few useful commands

To check the time of the oldest transaction
============================

select vt.xidusn, vt.xidslot, vt.xidsqn, vs.sid, vs.username, vs.program,
       vs.machine, vt.used_ublk, vt.start_time,
       to_char(sysdate, 'MM-DD-YYYY HH24:MI:SS') current_time
  from gv$transaction vt, gv$session vs
 where vt.addr = vs.taddr
 order by vt.start_time;

To find the oldest dependent archive file
============================

send extract <extract name>, showtrans
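
Alternatively, from the database side, you can compare the capture process's required checkpoint SCN with the archived log SCN ranges. A sketch, assuming the extract is registered in dba_capture:

select a.thread#, a.sequence#, a.name
  from v$archived_log a, dba_capture c
 where a.next_change# > c.required_checkpoint_scn
 order by a.thread#, a.sequence#;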

Decouple RMAN and GoldenGate
===========================================

ggsci> stop ext <extract name>

Modify the extract param file as follows

Extract XTATHDV
----------------------------------------------------
--Extract Process for testdb Database
--List of Schemas
--SCOTT
----------------------------------------------------
SETENV (ORACLE_HOME="/oracle/software/rdbms/11.2.0.3")
SETENV (ORACLE_SID="testdb")
userid ggddlusr, password xxxx
TRANLOGOPTIONS ASMUSER "sys@asm", ASMPASSWORD "xxx" LOGRETENTION DISABLED
exttrail /oracle/goldengatedata/dirdat/at
Reportcount every 30 Minutes, Rate
Report at 11:00
ReportRollover at 11:15
DiscardFile /oracle/goldengatedata/dirrpt/xtathdv.dsc, Append
DiscardRollover at 06:00 ON MONDAY
DDL include Mapped
table SCOTT.*;

Make a backup of the checkpoint file (./dirchk/<extract>.cpe)

ggsci> dblogin userid <id> password <pw>
ggsci> UNREGISTER EXTRACT <extract name> LOGRETENTION

If the above reports that the extract was not registered, check whether the extract still has an entry in the dba_capture view that has been disconnected from Oracle GoldenGate:

select capture_name, queue_owner, capture_user, start_scn, status from dba_capture;

Such an orphaned entry could be preventing RMAN from removing the archives. To remove the entry, use the following command (substituting your own capture name):

exec DBMS_CAPTURE_ADM.DROP_CAPTURE ('OGG$_LGHTYC15666903B');

ggsci> start ext <extract name>

Couple GG and RMAN
=======================

Please note that from GG version 11.1.1.1.2 onwards, GG and RMAN are decoupled by default.
If you want to couple them, here are the steps:



ggsci> stop ext <extract name>

Modify the extract param file as follows

Extract XTATHDV
----------------------------------------------------
--Extract Process for ATHENADV Database
--List of Schemas
--PUBS
----------------------------------------------------
SETENV (ORACLE_HOME="/oracle/software/rdbms/11.2.0.3")
SETENV (ORACLE_SID="xxx")
userid ggddlusr, password xxx
TRANLOGOPTIONS ASMUSER "sys@asm", ASMPASSWORD "xxx" LOGRETENTION ENABLED
exttrail /oracle/goldengatedata/dirdat/at
Reportcount every 30 Minutes, Rate
Report at 11:00
ReportRollover at 11:15
DiscardFile /oracle/goldengatedata/dirrpt/xtathdv.dsc, Append
DiscardRollover at 06:00 ON MONDAY
DDL include Mapped
table PUBS.*;

Make a backup of the checkpoint file (./dirchk/<extract>.cpe)

ggsci> dblogin userid <id> password <pw>
ggsci> REGISTER EXTRACT <extract name> LOGRETENTION
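
To verify, the extract's capture process should now show up in dba_capture:

select capture_name, status from dba_capture;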

9 comments:

  1. Hi Gautham,

    Thanks for stopping by my blog on this topic at:

    http://otipstricks.blogspot.com/2011/07/be-careful-when-you-stop-using-v11.html

    I don't know if it's a BETTER way or not...but a different way is just to add the force option to your delete, i.e.:

    delete noprompt force archivelog all completed before 'sysdate-240/1440';

    ...which would delete everything more than 4 hours old. I hate doing that, though...because, like you said, it means your GG extract is going to fail. What I prefer to do is:

    1. Identify the transaction, and if it belongs to an ad-hoc user, let them know they need to commit or roll back (after they do that, everything will work as it should).

    2. If that person left a session open and went on vacation, or if it's an application issue, you can skip the transaction in GG or kill their session. The majority of the time we skip the transaction in GG, assuming the transaction will eventually be rolled back by other means. Under normal usage, the transactions should take hours to complete.

    3. You probably have scripts that notify you when you have a long-running transaction so you can manually take action. A third way to handle this is to automate it by sniping the sessions (creating a timeout in the user profile, as sketched below) or by setting up a session limit in ORM. You can do this per user, so you can separate the app user from the ad-hoc users. This might be a good solution in your dev environment.
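
    A minimal sketch of that profile-based timeout (the profile name is hypothetical; IDLE_TIME is in minutes, and RESOURCE_LIMIT must be TRUE for it to be enforced):

    -- hypothetical profile: snipe sessions idle for more than 60 minutes
    create profile adhoc_users limit idle_time 60;
    alter user scott profile adhoc_users;
    alter system set resource_limit = true;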

  2. Sorry...I meant to say: the transaction normally should not take hours to complete.

  3. Hi Gautam,

    Thanks for the blog as it is very helpful.

    We are facing an issue almost every day: our GG extract abends due to a missing archive log, and we have to manually restore that particular archive log. The client is now asking whether there is any resolution, as this issue is very frequent.

    Our policy is to take an archive log backup 4 times every day with an archive deletion policy; as soon as an archive log gets backed up, it gets deleted.

    As you have mentioned, RMAN and GG need to be coupled by adding LOGRETENTION ENABLED.

    I would be very thankful if you could assist me with the steps to couple RMAN and GG, as our version is 11.1.1.

    Kindly Suggest.

    Thanks and Regards

  4. Hi Dan,
    I have described a way to couple GG and RMAN at the very end of the blog. Does it not seem to work?
    But before that, you can try reducing the BR interval (to, say, 20 minutes) in your extract process by putting the following parameter in your extract file:
    BR BRINTERVAL 20M

    And then, when you remove the archive files, please remove only files older than sysdate - (1/24), which means archive files older than 1 hour. The combination of lowering BR and retaining archive logs for a few hours has worked very well for us. We haven't had the need to couple GG and RMAN.
    Also, please configure autorestart of the extract in your manager process (see the sketch below), or use my script (which autostarts the extract process).
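
    A minimal sketch of the Manager autorestart parameter (the retry counts and intervals are just example values):

    AUTORESTART EXTRACT *, RETRIES 5, WAITMINUTES 5, RESETMINUTES 60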

    Replies
    1. Thanks, Gautham, for the article; it helped clear a few doubts. But in your main article you mentioned that BR does not help, and in your reply to Dan you mentioned that lowering BR helped, so I am a little confused. Can you clarify, please?

      Thanks

    2. Sorry if that caused confusion.
      Lowering BR will help if your extract died and you had to restart it: it would then only need archive logs going back to the last BR checkpoint. Without it, if you were unlucky enough to crash just before a BR checkpoint took place, you would need to restore archive logs from up to the last 4 hours (i.e. the default BR interval).
      Where BR will not help is in allowing you to delete archive logs: just because BR is set to 20 minutes does not mean an RMAN delete will be allowed to remove files older than 20 minutes.
      So I am addressing two different issues.

  5. I re-read your question, and I am certain that the same steps should also work in your version, 11.1.1. Please check that the extract is registered in dba_capture to make sure it is coupled:

    select capture_name, queue_owner, capture_user, start_scn, status from dba_capture;

    Replies
    1. Hi Gautam,

      Thanks for the reply.

      I may be asking too much, as I am very new to GG. I am aware of the archive log deletion policy for standby.

      As per company policy, archive logs are backed up 4 times a day with a deletion policy.

      In my case, the archive logs are getting deleted successfully; however, archives that are still required by GG are also getting deleted by RMAN, and we have to manually restore them for GG. This happens very frequently. I don't want RMAN to delete archive logs that are still required by GG, so that I don't have to restore them manually.

      Will putting the BR interval to 2 hrs work, or do I then have to decouple RMAN and GG? As you have mentioned, coupling will not work.

      Kindly Suggest,

      Thanks and Regards,

  6. Hi Gautam,

    Thanks for your note. I have the same requirement: to prevent RMAN from deleting archive logs that are required by GG.

    I have also seen the GoldenGate LOGRETENTION note (Doc ID 1465847.1), in which they ask you to run the extra steps below.

    Please suggest whether these need to be run too.

    For Oracle EE version 10.2:

    Run this package to grant the Oracle Streams admin privilege:
    exec dbms_streams_auth.grant_admin_privilege('user')

    Grant INSERT on logmnr_restart_ckpt$:
    grant insert on system.logmnr_restart_ckpt$ to user;

    Grant UPDATE on streams$_capture_process:
    grant update on sys.streams$_capture_process to user;

    Grant the 'become user' privilege:
    grant become user to user;

    Regards,
    Mayur
