Friday, December 23, 2011

Golden Gate RMAN Interaction Causing RMAN-08137

**6/3/2013 Update: The following applies to Classic Capture. Things have changed a little bit for Integrated Capture. Hope to address this in a new blog *****
We are test-running GG on our development instance and therefore we had to put the DB in archivelog mode.
Till this incident hapenned I used to think the only reson why the database was in ARCHIVELOG mode was because the golden gate can read the archive log files if it lags behind the redo log files.

We used to run a RMAN job via cron every 4 hours which would just run the following command -

delete noprompt archivelog all;

We did not backup the archive logs as this was a DEV instance and the database was in ARCHIVELOG only for the GG capture process and we were also aware that GG will prevent archivelogs from being deleted (if necessary)which it had to read because of a lag where it could not keep in pace with redo generation of the database.

Lately the archive destination was getting 100% full and when we ran the job manually we would see archive files as old as 8-10 hours still in the destination.
The rman delete also would fail for the above files complaining as -

RMAN> delete noprompt archivelog all;

using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=8 device type=DISK
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=/oracle/arch/test123/arch/1_3071_767059108.dbf thread=1 sequence=3071
..
..
..
..
..
Then we checked which redo log the capture process (info ext <ext name>,detail) was currently reading and it would show the sequence number of the  online redo log file.

So the big question was

WHY IS GOLDEN GATE PREVENTING ARCHIVE LOGS FROM BEING DELETED IF IT IS NOT LAGGING BEHIND

So today I am going to share a few things I learnt working with Oracle Support.

The first thing I came to know was the capture process writes only committed transactions into a trail file. This seemed to be the root cause of the issue.

So technically if it is 3pm now and there is an active transaction running from 7am in the morning , all the changes between 7am and 3pm have to be recorded somewhere, right? This obviously cannot go into the trail file as only commited transactions can get into trail file (I am still not sure why  this was architected this way???Any thoughts from anybody?).

So it looks like the Golden Gate capture process stores the details of a transaction in memory and it is written into the trail file only when comitted. There is something called Bounded Recovery (BR) whose timeout is set to 4 hours(by default) . That means if an active transaction goes beyond 4 hours,  the transaction is copied from memory into the BR file under $$GG_HOME/BR/<extract name> directory.

This parameter can be changed by putting the following parameter into the extract parameter file.
The following line sets it to 1 hour.

BR BRINTERVAL 1H
or
BR BRINTERVAL 20M  -> 20 minutes

Now assuming the BR is set to 4 hours (default value) and somebody stopped the extract process at say 7pm and as of 7pm there were many active transactions of which the oldest transaction started at 8am .

Now without the BR feature, when the extract is restarted it has to read redo information starting from 8am which would most probably be archived. Therefore the extract just wouldn't start if it is not able to access this archived file. This seems to be the most important reason for any database to be in archivelog mode for GG capture prcess to function effectively.

Please note that even by reducing BR down to 1 hour or so, is not going to resolve the RMAN-08137 issue .

The golden rule is
However old the archive file is, if it contains a record from an active transaction , then it cannot be deleted.Also the subsequent archive logfiles will not be deleted.



For example , if you wrote a code like

BEGIN
insert into t1 values .... ; -->redo sequence is 3500 on day 1
dbms_lock.sleep('15 days'); -->sleep for 15 days
insert into t2 values(......); redo sequence is 6000 on day 16
END;
/
RMAN will be unable to delete from sequence 3500 onwards eventhough sequence 3501 ...5999 does not have anything to do with this transaction (I am testing this out next week and update if i see any anomalies).

And since it is not possible to ask people to commit their transactions frequently , I seem to favour to decouple extract and rman by adding the following parameter
 ggsci> stop ext <extract name>
ggsci> dbloging userid <id> password <pw>
ggsci> UNREGISTER EXTRACT <extract name> LOGRETENTION

and editing the extract parameter file
TRANLOGOPTIONS LOGRETENTION DISABLED

Again this could lead to other issues like extract abending as it had to read from archivelogs (as it was lagging)
and could not find the archived files as rman had deleted it.

Therefore a good way is to put a balance by deleting on like sysdate - 4 hours worth of archive log files.
Let me know if anybody has found out a better way instead of decoupling extract and rman.

*********IMPORTANT 6/7/2012 UPDATE ***************
PLEASE NOTE THAT THE DEFAULT BEHAVIOUR HAS CHANGED SINCE GOLDEN GATE VERSION 11.1.1.1.2. FROM THIS VERSION , GG AND RMAN ARE DE-COUPLED BY DEFAULT AND THEREFORE YOU HAVE TO MANUALLY COUPLE THEM (IF YOU WANT BOTH OF THEM TO INTERACT WITH EACH OTHER). A GOOD WAY TO CHECK IS TO QUERY DBA_CAPTURE.
HERE IS STATEMENT FROM ORACLE SUPPORT

                            ++++++++
I have checked again and the fix for Bug 12648838 went into 11.1.1.1.2 (the note I looked at is not correct) so the default behaviour is to not create an implicit capture process unless you have explicity specified
the LOGRETENTION parameter. In your case you have not therefore you do not have one.

So what you observe is correct in your version.


                         +++++++++++++++ 
*************************************************************************** 

A few useful commands

To check the time of the oldest transaction
============================

 select vt.xidusn, vt.xidslot, vt.xidsqn, vs.sid, vs.username, vs.program, vs.machine, vt.used_ublk,
     vt.start_time, TO_CHAR(SYSDATE, 'MM-DD-YYYY HH24:MI:SS') Current_Time
   from gv$transaction vt, gv$session vs where vt.addr = vs.taddr order by vt.
 start_time ;

To find the oldest dependent archive file
============================

send <extract name>, showtrans

Decouple RMAN and GoldenGate
===========================================

ggsci> stop ext <extract name>

Modify the extract param file as follows

Extract XTATHDV
----------------------------------------------------
--Extract Process for testdb Database
--List of Schemas
--SCOTT
----------------------------------------------------
SETENV (ORACLE_HOME="/oracle/software/rdbms/11.2.0.3")
SETENV (ORACLE_SID="testdb")
userid ggddlusr, password xxxx
TRANLOGOPTIONS ASMUSER "sys@asm",asmpassword "xxx" LOGRETENTION DISABLED
exttrail /oracle/goldengatedata/dirdat/at
Reportcount every 30 Minutes, Rate
Report at 11:00
ReportRollover at 11:15
DiscardFile /oracle/goldengatedata/dirrpt/xtathdv.dsc, Append
DiscardRollover at 06:00 ON MONDAY
DDL include Mapped
table SCOTT.*;

Make a backup of the checkpoint file (./dirchk/<extract>.cpe)

ggsci> dbloging userid <id> password <pw>
ggsci> UNREGISTER EXTRACT <extract name> LOGRETENTION

If the above returns that extract was not registered you can try the following to see if extract has an entry in the dba_capture table, which has been disconnected from Oracle GoldenGate.

select capture_name, queue_owner, capture_user, start_scn, status from dba_capture;

This entry could be stopping the RMAN from removing the archives. To remove the entry use the following command

exec DBMS_CAPTURE_ADM.DROP_CAPTURE ('OGG$_LGHTYC15666903B');

ggsci> start ext <extract name>

 Couple GG and RMAN
=======================

Please note that post GG version 11.1.1.1.2   and beyond, GG and RMAN are decoupled by default .
So If you want to couple them, here are the steps



ggsci> stop ext <extract name>

Modify the extract param file as follows

Extract XTATHDV
----------------------------------------------------
--Extract Process for ATHENADV Database
--List of Schemas
--PUBS
----------------------------------------------------
SETENV (ORACLE_HOME="/oracle/software/rdbms/11.2.0.3")
SETENV (ORACLE_SID="xxx")
userid ggddlusr, password xxx
TRANLOGOPTIONS ASMUSER "sys@asm",asmpassword "xxx" LOGRETENTION ENABLED
exttrail /oracle/goldengatedata/dirdat/at
Reportcount every 30 Minutes, Rate
Report at 11:00
ReportRollover at 11:15
DiscardFile /oracle/goldengatedata/dirrpt/xtathdv.dsc, Append
DiscardRollover at 06:00 ON MONDAY
DDL include Mapped
table PUBS.*;

Make a backup of the checkpoint file (./dirchk/<extract>.cpe)

ggsci> dblogin userid <id> password <pw>
ggsci> REGISTER EXTRACT <extract name> LOGRETENTION