Monday 30 June 2014

Importance of Bounded Recovery on GoldenGate

The Importance of Bounded Recovery


Bounded Recovery is a component of Oracle GoldenGate’s Extract process checkpointing facility. It guarantees an efficient recovery after Extract stops for any reason, planned or unplanned, no matter how many open (uncommitted) transactions there were at the time that Extract stopped, nor how old they were. Bounded Recovery sets an upper boundary for the maximum amount of time that it would take for Extract to recover to the point where it stopped and then resume normal processing.
Extract performs this recovery as follows:
·         If there were no open transactions when Extract stopped, the recovery begins at the current Extract read checkpoint. This is a normal recovery.
·         If there were open transactions whose start points in the log were very close in time to the time when Extract stopped, Extract begins recovery by re-reading the logs from the beginning of the oldest open transaction. This requires Extract to do redundant work for transactions that were already written to the trail or discarded before Extract stopped, but that work is an acceptable cost given the relatively small amount of data to process. This also is considered a normal recovery.
·         If there were one or more transactions that Extract qualified as long-running open transactions, Extract begins its recovery with a Bounded Recovery.

Bounded Recovery is new feature in OGG 11.1, this is how it works:
A transaction qualifies as long-running if it has been open longer than one Bounded Recovery interval, which is specified with the BRINTERVAL option of the BR parameter.
For example, if the Bounded Recovery interval is four hours, a long-running open transaction is any transaction that started more than four hours ago.
At each Bounded Recovery interval, Extract makes a Bounded Recovery checkpoint, which persists the current state and data of Extract to disk, including the state and data (if any) of long-running transactions. If Extract stops after a Bounded Recovery checkpoint, it will recover from a position within the previous Bounded Recovery interval or at the last Bounded Recovery checkpoint, instead of processing from the log position where the oldest open long-running transaction first appeared, which could be several trail files ago.

Bounded Recovery is enabled by default for Extract processes and has a 4 hour BR interval. To adjust the BR interval to say 24 hours, use the following syntax in your Extract parameter file:

BR BRINTERVAL 24, BRDIR BR

The default location for BR checkpoint files is the GoldenGate home directory. This can be altered by including a full path:

BR BRINTERVAL 24, BRDIR /ggsdata/brcheckpoint

Case Study

The Problem

In a recent case, Bounded Recovery was disabled through the following Extract parameter:
BR BROFF
Consequently the following behavior prevented the Extract process from recovering and starting.
1.       Firstly, GoldenGate had fallen behind due to a batch job and subsequently the Extract process was reading the archived redologs and not the online redologs. Also at this time an archived redolog was deleted by RMAN during a scheduled backup, that caused the Extract process to abend with OGG-00446 (caused by ORA-15173)
Error in ggserr.log
2012-07-04 11:03:03  ERROR OGG-00446  Oracle GoldenGate Capture for Oracle, euktds01.prm:  Getting attributes for ASM file +FRA/2_86717_716466928.dbf, SQL <BEGIN dbms_diskgroup.getfileattr('+FRA/2_86717_716466928.dbf', :filetype, :filesize, :lblksize); END;>: (15056) ORA-15056: additional error message ORA-15173: entry '2_86717_716466928.dbf' does not exist in directory '/' ORA-06512: at "SYS.X$DBMS_DISKGROUP", line 304 ORA-06512: at line 1Not able to establish initial position for sequence 86717, rba 122140688.
2012-07-04 11:03:03  ERROR OGG-01668  Oracle GoldenGate Capture for Oracle, euktds01.prm:  PROCESS ABENDING.


2.       Some hours later, the deleted archived redolog file was restored and the Extract process restarted. However, despite the process running, the RBA# and Sequence# were not incrementing. The Extract process was stuck!
The INFO GGSCI command with DETAIL option revealed the source redo was not available.
GGSCI (dbserver09a) 2> info EUKMDS01, detail

Extract Source Begin             End

Not Available 2012-07-04 23:30 2012-07-04 23:30
Not Available 2012-07-04 23:28 2012-07-04 23:30
Not Available 2012-07-01 05:35 2012-07-04 23:28
+DATA/ukhub/onlinelog/group_4.282.716467031  2012-06-24 05:28  2012-07-01 05:35
+DATA/ukhub/onlinelog/group_3.280.716467027  2012-06-23 21:06  2012-06-24 05:28

3.       The ggserr.log also revealed a long running transaction detected.
2012-07-04 23:31:47  WARNING OGG-01027  Oracle GoldenGate Capture for Oracle, euko1els.prm:  Long Running Transaction: XID 197.8.3521317, Items 0, Extract EUKO1ELS, Redo Thread 2, SCN 51.3925309013 (222968641109), Redo Seq #86717, Redo RBA 122140688.


The Solution

The Extract process was stuck in recovery mode, but could not find the starting RBA. In order to get the process up and running, the following steps were executed on the source system.
1.       First of all, the Extract process was stopped with the force option.
GGSCI (dbserver09a) 4> send extract EUKMDS01, forcestop
2.       The start position of the Extract process was altered to the beginning of the long running transaction.
GGSCI (dbserver09a) 5> alter extract EUKMDS01, begin 2012-07-04 23:31:47
3.       The extract process was started.
GGSCI (dbserver09a) 4> start extract EUKMDS01
4.       Sure enough, the Extract process was reinitialized and continued to process the backlog.
GGSCI (uklpdptoy09a) 2> info EUKMDS01, detail

Extract Source Begin             End

+DATA/ukhub/onlinelog/group_4.282.716467031  2012-07-04 23:31  2012-07-05 02:58
Not Available * Initialized *   2012-07-04 23:31
Not Available 2012-07-04 23:30 2012-07-04 23:30


Conclusion

Never disable Bounded Recovery else Extract processes may fail to recover automatically. Furthermore, to prevent RMAN from deleting archived log files that are still required. If you register the extract with LOGRETENTION then the GoldenGate will retain the archive logs that Extract needs for recovery.
To register Extract do the following:

1.        Stop the Extract ( Ensure that all the archive log files starting from recovery checkpoint till current checkpoint is available on all nodes )
2.       Execute the following GGSCI commands
GGSCI> dblogin userid <username>, password <password>
GGSCI> register extract <Extract-name>, LOGRETENTION

You can confirm whether Extract is registered or not using the query “select * from dba_capture”. (This sounds like Streams!) This should have an entry for Extract.

3.       Start the Extract
GGSCI>start extract <Extract-name>