Home Dataguard Troubleshooting Oracle Dataguard Transport and Apply issue

Troubleshooting Oracle Dataguard Transport and Apply issue

I share this post to explain how we identified and fixed an issue when our physical standby (Active Dataguard) apply was delayed and replication fell behind the primary database. Though this is a single unique scenario, the post should be a diagnostic approach when troubleshooting Oracle dataguard transport and apply issues.

Environment

  • Primary – Oracle RAC Database 11.2.0.4
  • Standby – Oracle RAC Database 11.2.0.4
  • Dataguard configuration – Total 3 databases (1 primary and 2 standby) on separate clusters. Realtime apply using standby redo logs.
  • Operating System – Oracle Enterprise Linux 6.6 – 64 bit.

Alert

Our monitoring system alerted us with this error – ORA-16810: multiple errors or warnings detected for the database

Troubleshooting approach

Dataguard status

What does the current Dataguard config report? Connect to DGMGRL and check the configuration.

rac1db01 | ORCL1 | /home/oracle
> dgmgrl /
DGMGRL for Linux: Version 11.2.0.4.0 - 64bit Production

Copyright (c) 2000, 2009, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected.


DGMGRL> show configuration

Configuration - ORCL

  Protection Mode: MaxPerformance
  Databases:
    ORCL     - Primary database
      Error: ORA-16810: multiple errors or warnings detected for the database

    ORCLDG1   - Physical standby database
      Warning: ORA-16857: standby disconnected from redo source for longer than specified threshold

    ORCLDG2 - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
ERROR

From the above results, we can see the primary database is reporting errors though it is not clear what the issue is. The first standby is now disconnected from the redo source (primary). The second standby appears to be fine.

Archiving status

Since error points to redo source, I connect to the primary database to check the archival status for any errors.

SQL> select dest_id, error from gv$archive_dest;

   DEST_ID ERROR
---------- -----------------------------------------------------------------
	 1
	 2
	 3
	 4 ORA-00270: error creating archive log
	 5

From the above output, we can see there was an error creating archive log on destination id 4. I know that dest id 4 is the remote service for the first standby database.

Dataguard log

Next check the dataguard broker log. It is usually in the same location (diagnostic dest) as the database alert log file. It basically recorded/confirms the message we saw in Dataguard status and v$archive_dest.error.

03/03/2020 18:46:48
Data Guard Broker Status Summary:
  Type                        Name                             Severity  Status
  Configuration               ORCL                               Warning  ORA-16607
  Primary Database            ORCL                                Error  ORA-16810
  Physical Standby Database   ORCLDG1                            Warning  ORA-16857
  Physical Standby Database   ORCLDG2                          Success  ORA-00000
03/03/2020 18:46:50
Redo transport problem detected: redo transport for database ORCLDG1 has the following error:
  ORA-00270: error creating archive log
03/03/2020 18:47:54
Redo transport problem detected: redo transport for database ORCLDG1 has the following error:
  ORA-00270: error creating archive log

What is ORA-16857

16857,0000, "standby disconnected from redo source for longer than specified threshold"
// *Cause: The amount of time the standby was disconnected from the
//         redo source database exceeded the value specified by the
//         'TransportDisconnectedThreshold' database property. It is caused by
//         no network connectivity between the redo source and the standby
//         databases.
// *Action: Ensure that there is network connectivity between the redo source
//          and standby databases, and the redo source is working properly.

The error points to network connectivity between the primary and standby database. I will check the network in a moment.

I’m doing a quick check on the primary database alert log for any additional information.

Primary Database Alert log

Tue Mar 03 18:26:46 2020
ARC4: Attempting destination LOG_ARCHIVE_DEST_4 network reconnect (270)
ARC4: Destination LOG_ARCHIVE_DEST_4 network reconnect abandoned
FAL[server, ARC4]: Error 270 creating remote archivelog file '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac2.domain.com)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=ORCLDG1)))'
Tue Mar 03 18:26:46 2020
ARC6: Attempting destination LOG_ARCHIVE_DEST_4 network reconnect (270)
ARC6: Destination LOG_ARCHIVE_DEST_4 network reconnect abandoned
FAL[server, ARC4]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance ORCL1 - Archival Error. Archiver continuing.
FAL[server, ARC6]: Error 270 creating remote archivelog file '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac2.domain.com)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=ORCLDG1)))'
FAL[server, ARC6]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance ORCL1 - Archival Error. Archiver continuing.
...


...
Tue Mar 03 18:42:02 2020
Suppressing further error logging of LOG_ARCHIVE_DEST_4.
Suppressing further error logging of LOG_ARCHIVE_DEST_4.
Suppressing further error logging of LOG_ARCHIVE_DEST_4.
Errors in file /u01/app/oracle/diag/rdbms/orcl/ORCL1/trace/ORCL1_nsa4_82618.trc:
ORA-00270: error creating archive log
Tue Mar 03 18:42:02 2020
Suppressing further error logging of LOG_ARCHIVE_DEST_4.
Suppressing further error logging of LOG_ARCHIVE_DEST_4.
Suppressing further error logging of LOG_ARCHIVE_DEST_4.
FAL[server, ARC6]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance ORCL1 - Archival Error. Archiver continuing.

Verify the network connectivity

From the primary database server, I try to connect to the remote standby database.

rac1db01 | ORCL1 | /u01/app/oracle/diag/rdbms/orcl/ORCL1/trace
> sqlplus system@'(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac2.domain.com)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=ORCLDG1)))'

SQL*Plus: Release 11.2.0.4.0 Production on Tue Mar 3 18:59:55 2020

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Enter password:

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL>

Network is fine (usually the case).

Standby Database Alert log

Since we know primary database is fine. So is the second standby database. Network between primary and first standby is fine. So moving on to check the standby database alert log.

Errors in file /u01/app/oracle/diag/rdbms/orcldg1/ORCLDG11/trace/ORCLDG11_arc1_159919.trc:
ORA-19815: WARNING: db_recovery_file_dest_size of 43980465111040 bytes is 100.00% used, and has 0 remaining bytes available.
************************************************************************
You have following choices to free up space from recovery area:
1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard,
   then consider changing RMAN ARCHIVELOG DELETION POLICY.
2. Back up files to tertiary device such as tape using RMAN
   BACKUP RECOVERY AREA command.
3. Add disk space and increase db_recovery_file_dest_size parameter to
   reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating
   system command was used to delete files, then use RMAN CROSSCHECK and
   DELETE EXPIRED commands.
************************************************************************
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance ORCLDG11 - Archival Error
ORA-16038: log 18 sequence# 202828 cannot be archived
ORA-19809: limit exceeded for recovery files
ORA-00312: online log 18 thread 1: '+DATA/orcldg1/onlinelog/group_18.613.912628973'
ORA-00312: online log 18 thread 1: '+RECO/orcldg1/onlinelog/group_18.5205.912628975'
Tue Mar 03 19:18:47 2020
ARC7: Archiving not possible: error count exceeded
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance ORCLDG11 - Archival Error

And now we know the root cause of the issue. The standby database could not archive any further since the recovery area filled up. For some reason, the archive log cleanup job was not running on the standby cluster. There is a gap here and we will address it. This is a very busy OLTP database, so we have a fair bit of archive logs that gets generated every day. Temporarily increasing the recovery dest size so archiving can continue.

SQL> show parameter reco

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
control_file_record_keep_time	     integer	 7
db_recovery_file_dest		     string	 +RECO
db_recovery_file_dest_size	     big integer 40960G
db_unrecoverable_scn_tracking	     boolean	 TRUE
recovery_parallelism		     integer	 0

SQL> alter system set db_recovery_file_dest_size=51200G scope=both sid='*';

System altered.

SQL> show parameter reco

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
control_file_record_keep_time	     integer	 7
db_recovery_file_dest		     string	 +RECO
db_recovery_file_dest_size	     big integer 51200G
db_unrecoverable_scn_tracking	     boolean	 TRUE
recovery_parallelism		     integer	 0

Dataguard status

If we check the dataguard status right away, it may still report the same error. Issue a show database command so the broker will check for current status and in doing so resumes the archiving right away since the issue has cleared now.

DGMGRL> show configuration

Configuration - ORCL

  Protection Mode: MaxPerformance
  Databases:
    ORCL     - Primary database
      Error: ORA-16778: redo transport error for one or more databases

    ORCLDG1   - Physical standby database
      Warning: ORA-16857: standby disconnected from redo source for longer than specified threshold

    ORCLDG2 - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
ERROR



DGMGRL> show database 'ORCLDG1'

Database - ORCLDG1

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   1 hour(s) 1 minute 32 seconds (computed 5 seconds ago)
  Apply Lag:       1 hour(s) 1 minute 54 seconds (computed 0 seconds ago)
  Apply Rate:      19.93 MByte/s
  Real Time Query: ON
  Instance(s):
    ORCLDG11
    ORCLDG12 (apply instance)
    ORCLDG13

Database Status:
SUCCESS



DGMGRL> show configuration

Configuration - ORCL

  Protection Mode: MaxPerformance
  Databases:
    ORCL     - Primary database
    ORCLDG1   - Physical standby database
    ORCLDG2 - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS

We are good now. The configuration does not report the error anymore. It will take a little bit of time for the recovery to catch up.

Even though the initial error pointed to TransportDisconnectedThreshold property configuration or network error, it was misleading.

Hope this helps. Have you encountered transport or apply issue in your DG setup. What was the issue and how did you resolve it. I would love to hear about your experiences.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Latest posts

How to rename a PDB in 19c

Let's look at how to rename a PDB in 19c. The steps are the same as 12c....

How to change the ethernet switch password on Exadata

Let's look at how to log in and change the ethernet switch (adm) password on an Exadata...

Install Oracle GoldenGate Plug-in on Oracle Enterprise Manager Cloud Control 13c

In this post, we will install and deploy Oracle GoldenGate Plug-in 13.4.1.0 on Oracle Enterprise Manager Cloud...

Fixed – EXPDP – OS/ Wallet Authentication does not work, prompts for password

I recently encountered a strange issue on a newly created multitenant database on 19.6 version. Our EXPDP...

You might also likeRELATED
Recommended to you