新闻中心
近年来,我们专注于提供企业级客户的数字化转型赋能/服务,以数据库/中间件运维咨询服务、数据库国产化改造服务、开源数据库技术栈服务、数据中心硬件带外监控服务、大数据服务、AI开发平台为核心业务,是IT智能运维服务提供商。
返回列表
首页 / 新闻资讯 / 行业资讯
干货 | Controlfile Sequence Number耗尽引起的控制文件损坏案例
来源:   日期:2018-06-26

作者:赵世雄 | 东方龙马 · 广州分公司

 

我们通常碰到过的ORACLE数据库控制文件损坏情况有哪些呢?你是否碰到过controlfile sequence number值被耗尽而引起损坏的情况?下面我来跟大家讲述一个东方龙马广州团队曾经碰到过的一个案例。



背景



在2017年国庆小长假快结束的时候,东方龙马广州团队接到了一个客户的紧急报障电话,客户电话里急促的说道:"我们数据库告警日志出现控制文件损坏的报错",我们的工程师立即远程连到客户系统,查看告警日志:




Completed checkpoint up to RBA [0x1dd32.2.10], SCN: 10184774533893

Fri Oct 06 12:08:31 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/xxx/trace/xxx_ora_37874.trc:

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Fri Oct 06 12:08:31 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_m000_32185.trc:

ORA-00235: control file read without a lock inconsistent due to concurrent update

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_37874.trc  (incident=4273683):

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Incident

details in: /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/incident/incdir_4273683/XXX_ora_37874_i4273683.trc

Fri Oct 06 12:08:32 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_70265.trc:

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_ora_70265.trc  (incident=4275459):

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Incident details in: /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/incident/incdir_4275459/XXX_ora_70265_i4275459.trc

.........................内容较多,省略中间部分内容

.........................内容较多,省略中间部分内容

Fri Oct 06 13:19:00 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'




告警日志里,我们看到ctrl01.ctl控制文件持续出现损坏报错,本以为用ctrl02.ctl控制文件copy一份就能修复ctrl01.ctl控制文件。心里嘀咕着:这是个小问题的时候,数据库突然宕掉了。




Fri Oct 06 13:19:00 2017

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

Errors in file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_lgwr_49423.trc:

ORA-00227: corrupt block detected in control file: (block 1, # blocks 1)

ORA-00202: control file: '/database/XXX/XXX1/XXX/ctrl01.ctl'

LGWR (ospid: 49423): terminating the instance due to error 227

Fri Oct 06 13:19:15 2017

opiodr aborting process unknown ospid (71057) as a result of ORA-1092

Fri Oct 06 13:19:17 2017

System state dump requested by (instance=1, osid=49423 (LGWR)), summary=[abnormal instance termination].

Fri Oct 06 13:19:25 2017

ORA-1092 : opitsk aborting process

Fri Oct 06 13:19:26 2017

opiodr aborting process unknown ospid (71043) as a result of ORA-1092

System State dumped to trace file /var/log/xxx/ORACLE/diag/rdbms/xxx/XXX/trace/XXX_diag_49394.trc

Fri Oct 06 13:19:27 2017

opiodr aborting process unknown ospid (71045) as a result of ORA-1092

Fri Oct 06 13:19:27 2017

ORA-1092 : opitsk aborting process

Fri Oct 06 13:19:27 2017

ORA-1092 : opitsk aborting process

Fri Oct 06 13:19:27 2017

opiodr aborting process unknown ospid (71047) as a result of ORA-1092

Fri Oct 06 13:22:47 2017

opiodr aborting process unknown ospid (71055) as a result of ORA-1092

Fri Oct 06 13:19:27 2017

ORA-1092 : opitsk aborting process

Fri Oct 06 13:19:27 2017

ORA-1092 : opitsk aborting process

Instance terminated by LGWR, pid = 49423




心里一阵拔凉,问题没那么简单,难道ctrl02.ctl控制文件也损坏了?

确实正如我们工程师所料,使用ctrl02.ctl控制文件重启数据库的时候,报ctrl02.ctl也损坏。




解决方案



此系统是客户的核心系统,先帮客户恢复系统再分析原因,于是利用备份的控制文件来修复控制文件。




$mv /database/XXX/XXX1/XXX/ctrl01.ctl /database/XXX/XXX1/XXX/ctrl01.ctl_bak

$mv /database/XXX/XXX2/XXX/ctrl02.ctl /database/XXX/XXX2/XXX/ctrl02.ctl_bak

RMAN>restore controlfile from '/backup/ctl.bak';




系统恢复后,开始查找原因。通常碰到控制文件损坏,我们的第一反应就是存储是不是出问题了。于是查看操作系统日志,没发现任何异常。在定位原因的过程时,控制文件再次损坏。我们通过把控制文件创建到本地硬盘、切换到备机、禁用HA Cluseter测试、断开容灾链路测试等一系列测试,问题依旧。在上面这些测试期间,反复跟原厂沟通确认是否是Bug 20324049引起,原厂回复比较含糊,不能百分百肯定。




Bug 20324049  ORA-227 Controlfile Corruption when reaching Maximum Value for Control Seq kccfhcsq


 This note gives a brief overview of bug 20324049. 

 The content was last updated on: 17-APR-2018

 Click here for details of each of the sections below.



Affects:

Product (Component)

Oracle Server (Rdbms)

Range of versions believed to be affected

(Not specified)

Versions confirmed as being affected

12.1.0.2 (Server Patch Set)

11.2.0.4

10.2.0.5

8.1.7.4

Platforms affected

Generic (all / most platforms affected)


Fixed:

The fix for 20324049 is first included in

18.1.0

12.2.0.1.170919 (Sep 2017) Database Release Update (DB RU)

12.2.0.1.DBOCT2017RUR:180417 (Apr 2018) Database Release Update Revision(DB RUR)

12.2.0.1.DBJAN2018RUR:180417(Apr 2018) Database Release Update Revision(DB RUR)

12.1.0.2.180116 (Jan 2018) Database Patch Set Update (DB PSU)

12.1.0.2.180116 (Jan 2018) Database Proactive Bundle Patch

11.2.0.4.180116 (Jan 2018) Database Patch Set Update (DB PSU)

11.2.0.4.180116 (Jan 2018) Exadata Database Bundle Patch

12.2.0.1.171017 (Oct 2017) Bundle Patch for Windows Platforms

12.1.0.2.180116 (Jan 2018) Bundle Patch for Windows Platforms

11.2.0.4.180116 (Jan 2018) Bundle Patch for Windows Platforms



Interim patches may be available for earlier versions - click here to check.

Symptoms:

Corruption (Logical)

Instance May Crash

HCHECK script reports this issue

Error May Occur

ORA-227 / HCKE-50


Related To:

Instance Startup




Description

The controlfile sequence number is not expected to increment so

 rapidly that it would ever reach the architectural limit during

 the database lifetime, but if the controlfile sequence number

 does reach the limit for some reason, then processes will fail

 with ORA-227 errors causing the instance to terminate, then after

 this, the database cannot be mounted again because the FG process

 doing the mount will always fail with the same ORA-00227 error.

 This fix writes regular warning messages to the alert log in the

 case where the controlfile sequence number approaches the

 architectural limit. And this fix also provides an event which

 can be set to modify the behavior of "CREATE CONTROLFILE" with

 "NORESETLOGS" so that it will reset the controlfile sequence

 number to 1 in the datafiles, the online redo logs, and the

 new controlfile.

 

If hcheck in Note:136697.1 is run, it may detect this issue

with error HCKE-0050, see Note:2128446.1.

 

Rediscovery Notes

Look for all the following:

 - various processes start reporting the following error:

     ORA-227: corrupt block detected in control file: (block 1,# blocks 1)

 - a common stack trace for the ORA-227 errors might be this:

     kcvsursuht kcc_begin_txn_internal kccocx kccchb kcccsi ...

 - then a fatal background process gets an ORA-227 error and terminates the instance

 - then the database cannot be mounted again

 - do a hex dump of the controlfile via

    $ xxd -g4 <controlfile>  > cf.hexdump

   and search forward in the hexdump for the first occurrence

   of the database name, eg "R12B" in the example below:

    ...

    0004000: 15c20000 01000000 00000000 00000104  ................

    0004010: 59320000 00000000 0000200b 24e7e472  Y2........ .$..r

    0004020: 52313242 00000000 00000000 00020000  R12B............

    ...        ^^^^^^^^

   the 3rd 32-byte word in the row with the database name is

   the controlfile sequence number, it will be zero (but it

   should never be zero, and it's the reason for the ORA-227's)

 - an "xxd" hex dump of a restored recent backup of the controlfile

   should show the controlfile sequence number is some high value

   eg e1ffffff (which is 0xffffffe1 after endian conversion).

 

Workaround

There is no workaround for the ORA-227 problem after it has

happened.

It is possible to proactively prevent the ORA-227 problem from

happening (before it happens) in the specific case where the

controlfile seq# is incrementing very quickly (eg many times per

second) due to storing of last scn/time of nologging operations

in the controlfile; in this specific case, the workaround is to

set:  db_unrecoverable_scn_tracking=FALSE

 

Solution:

After installing this fix:

1. verify the controlfile sequence# is at or above 0xFF000000:

set numwidth 15

select max(FHCSQ)

from x$kcvfh;

2. Generate Trace file to recreate the controlfile:

    alter database backup controlfile to trace noresetlogs;

3. shutdown

4. startup nomount

5. alter session set events '20324049 trace name context forever, level 1';

6. execute the commands in the tracefile generated by step#2

7. alter session set events '20324049 trace name context off';

8. confirm the controlfile sequence# is now low with the same query in 1.

 

Take new backups after applying this solution, otherwise recovery will not be possible

failing with these errors:

 

 ORA-00283: recovery session canceled due to errors

 ORA-01122: database file 1 failed verification check

 ORA-01110: data file 1: '/oracle/dbs/t_db1.f'

 ORA-01207: file is more recent than control file - old control file



此系统的数据库是11.2.0.3版本,Bug 20324049描述影响的版本不涉及到11.2.0.3版本,由于主机、存储上已经做了各种排查也未能发现原因跟其相关。于是我们建议把数据库升级到11.2.0.4版本,并打上Bug 20324049补丁,并重建控制文件,把controlfile sequence number的通过此补丁压回值到1。  



CREATE CONTROLFILE is being performed with event 20324049 set, this will now reset the controlfile sequence number to 1 in the datafiles, the online redo logs, and the new controlfile.

datafile 1 header updated: controlfile sequence# reset to 1

datafile 2 header updated: controlfile sequence# reset to 1

datafile 3 header updated: controlfile sequence# reset to 1

datafile 4 header updated: controlfile sequence# reset to 1

datafile 5 header updated: controlfile sequence# reset to 1

datafile 6 header updated: controlfile sequence# reset to 1

datafile 7 header updated: controlfile sequence# reset to 1

datafile 8 header updated: controlfile sequence# reset to 1

datafile 9 header updated: controlfile sequence# reset to 1

datafile 10 header updated: controlfile sequence# reset to 1

datafile 11 header updated: controlfile sequence# reset to 1

datafile 12 header updated: controlfile sequence# reset to 1

..........................................................

..........................................................

datafile 344 header updated: controlfile sequence# reset to 1

datafile 345 header updated: controlfile sequence# reset to 1

datafile 346 header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo01-1.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo01-2.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo02-1.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo02-2.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo03-1.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo03-2.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo04-1.log header updated: controlfile sequence# reset to 1

logfile /database/xxxx/xxxx/xxxx/redo04-2.log header updated: controlfile sequence# reset to 1 



当然,我们做这个升级是先利用一份存储镜像出来的数据进行测试,测试一切都进行顺利,且经过应用一个周期的测试,控制文件未出现损坏。于是,我们在国庆小长假结束前的凌晨立即对生产库进行修复。通过东方龙马广州团队的协力合作,终于在国庆小长假结束前帮客户恢复了系统的正常使用。



关于controlfile sequence number 


OERR: ORA-00227 corrupt block detected in controlfile: (block %s, # blocks %s) Master Note / Troubleshooting, Diagnostic and Solution (文档 ID 48808.1)

APPLIES TO:

Oracle Database - Enterprise Edition - Version 8.0.6.0 to 12.1.0.2 [Release 8.0.6 to 12.1]

Information in this document applies to any platform.

PURPOSE

This article provides information about error ORA-00227 and possible actions.

SCOPE

 This note is intended for general audience as initial starting point for beginning diagnosis of ORA-00227.

DETAILS


Error:   ORA 227

Text:    corrupt block detected in controlfile: (block %s, # blocks %s) 

..........................................................


Cause:  A block header corruption or checksum error was detected on reading the controlfile. 

Action: Use the CREATE CONTROLFILE or RECOVER DATABASE USING BACKUP CONTROLFILE command.

 

Cause

The controlfile is corrupted. This could be caused by a problem external to Oracle like a Hardware/OS problem or an Oracle Defect.


Identify if Bug 20324049 may be causing the error ORA-227

If Bug 20324049 is causing the error ORA-227; then the maximum sequence number in the Database will be close to the maximum value (4294967295):

set numwidth 15

select max(FHCSQ)

case trunc(max(FHCSQ)/4294967295,1) 

when 0.9 then 'WARNING: Reference Bug 20324049' 

else 'NO Warning' end "Bug 20324049"

from x$kcvfh;


控制文件序列值(controlfile sequence number)最大值为:4294967295(即43亿左右,2的32次方减1),可以利用上面的语句检查你的系统是否存在控制文件序列值超过极限值的风险。

 

Take a backup of the control files as they are now

Take a backup of the existent control files with a regular copy (cp) or any other mechanism.  This can be used for future diagnosis and in case they are needed for additional recovery.

 

Identify which control file is failing

ORA-00227 is normally accompanied by ORA-202 which prints the affected control file name.  Review the alert log for more details.

 

Execute DBVERIFY on all Control Files

DBVERIFY is a tool that is intended to identify corruptions in Datafiles but in some cases may help to identify block corruptions in Control Files. 


Identify control file block size by executing dbfsize. Example:

$ dbfsize /oradata/controlfile/control1.ctl


Database file: /oradata/controlfile/control1.ctl

Database file type: file system

Database file size: 614 16384 byte blocks

In this case the control file block size is 16384.

Alternatively identify the control file block size by querying view V$CONTROLFILE.


Execute DBVERIFY on each Control File copy using the block size identified in 1.1.  Example:

dbv file=/oradata/controlfile/control1.ctl blocksize=16384

 

Solution

Solution 1. Use another mirror copy of the control file

If it is determined that the control file is damaged (the corruption is persistent) and if the database is down, then take one of the other control files from the control_files parameter and copy it over the bad control file noted above or startup the database with one control file at a time.  


Try opening the database and if error persists then go to Solution 2:

Solution 2. Recreate the control file

Use the next article to recreate the control file:

Doc ID 735106.1 How to Recreate a Controlfile

Solution 3. Restore a backup of the control file and apply media recovery

Restore a backup of control file and apply media recovery using RECOVER DATABASE USING BACKUP CONTROLFILE

If the database is on IBM AIX, make sure to review the issue described in Note 2237498.1



触发控制文件序列值增长的情况有


发生检查点

日志文件的切换

归档online redolog

运行崩溃后的恢复

热备的开始和结束

DML通过nologging等选项执行对象时

对大象进行直接DML操作等

还有其他你能想到的方面吗?

欢迎补充








微信图片_20180626150221.jpg


|  北京    |    上海    |   广州    |   成都    |


4008-906-960



微信图片_20180626150229.png



4008-906-960

全国免费咨询电话
  • 官方微博
  • 官方微信
Copyright 1998-2024 版权所有 北京东方龙马软件发展有限公司 京ICP备14000200号-1