Oracle诊断案例：Job任务停止执行-红联Linux系统门户

　　摘要:

　　本文通过一次Oracle Job任务异常案例诊断，分析其原因及解决过程，从内部揭示Oracle Job任务调度及内部计时机制。

　　问题及环境

　　接到研发人员报告，数据库定时任务未正常执行，导致某些操作失败。

　　开始介入处理该事故.

　　系统环境:

　　以下为引用的内容：

[code]SunOS DB 5.8 Generic_108528-21 sun4u sparc SUNW,Ultra-4
Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production[/code]

　　解决过程

　　首先介入检查数据库任务

　　以下为引用的内容：

[code]$ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.3.0 - Production on Wed Nov 17 20:23:53 2004
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.3.0 - Production
SQL> select job,last_date,last_sec,next_date,next_sec,broken,failures from
dba_jobs;
JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC B FAILURES
INTERVAL
---------- --------- ---------------- --------- ---------------- - ----------
----------------------------
31 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N 0
trunc(sysdate+1)+1/24
27 16-NOV-04 00:00:04 17-NOV-04 00:00:00 N 0
TRUNC(SYSDATE) + 1
35 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N 0
trunc(sysdate+1)+1/24
29 16-NOV-04 00:00:04 17-NOV-04 00:00:00 N 0
TRUNC(SYSDATE) + 1
30 01-NOV-04 06:00:01 01-DEC-04 06:00:00 N 0
trunc(add_months(sysdate,1),’MM’)+6/24
65 16-NOV-04 04:00:03 17-NOV-04 04:00:00 N 0
trunc(sysdate+1)+4/24
46 16-NOV-04 02:14:27 17-NOV-04 02:14:27 N 0
sysdate+1
66 16-NOV-04 03:00:02 17-NOV-04 18:14:49 N 0
trunc(sysdate+1)+3/24
8 rows selected.[/code]

　　发现JOB任务是都没有正常执行，最早一个应该在17-NOV-04 01:00:00执行。但是没有执行。

　　建立测试JOB

　　以下为引用的内容：

[code]create or replace PROCEDURE pining

IS
BEGIN
NULL;
END;
/
variable jobno number;
variable instno number;
begin
select instance_number into :instno from v$instance;
dbms_job.submit(:jobno, ’pining;’, trunc(sysdate+1/288,’MI’),
’trunc(SYSDATE+1/288,’’MI’’)’, TRUE, :instno);
end;
/[/code]

　　发现同样的，不执行。

　　但是通过dbms_job.run()执行没有任何问题。

　　进行恢复尝试

　　怀疑是CJQ0进程失效，首先设置JOB_QUEUE_PROCESSES为0，Oracle会杀掉CJQ0及相应job进程

　　以下为引用的内容：

[code]SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 0;[/code]

　　等2~3分钟，重新设置以下为引用的内容：

[code]SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 5;[/code]

　　此时PMON会重起CJQ0进程以下为引用的内容：

[code]Thu Nov 18 11:59:50 2004

ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 12:01:30 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 12:01:30 2004
Restarting dead background process CJQ0
CJQ0 started with pid=8
但是Job仍然不执行，而且在再次修改的时候，CJQ0直接死掉了。
Thu Nov 18 13:52:05 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 14:09:30 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 14:10:27 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 14:10:42 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 14:31:07 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 14:40:14 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 14:40:28 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;
Thu Nov 18 14:40:33 2004
ALTER SYSTEM SET job_queue_processes=1 SCOPE=MEMORY;
Thu Nov 18 14:40:40 2004
ALTER SYSTEM SET job_queue_processes=10 SCOPE=MEMORY;
Thu Nov 18 15:00:42 2004
ALTER SYSTEM SET job_queue_processes=0 SCOPE=MEMORY;

Thu Nov 18 15:01:36 2004
ALTER SYSTEM SET job_queue_processes=15 SCOPE=MEMORY;[/code]

　　尝试重起数据库，这个必须在晚上进行：

　　以下为引用的内容：

[code]PMON started with pid=2
DBW0 started with pid=3
LGWR started with pid=4
CKPT started with pid=5
SMON started with pid=6
RECO started with pid=7
CJQ0 started with pid=8
QMN0 started with pid=9
....[/code]

　　CJQ0正常启动，但是Job仍然不执行。

　　没办法了...

　　继续研究...居然发现Oralce有这样一个bug 以下为引用的内容：

[code]1. Clear description of the problem encountered:
slgcsf() / slgcs() on Solaris will stop incrementing after
497 days 2 hrs 28 mins (approx) machine uptime.

2. Pertinent configuration information
No special configuration other than long machine uptime. .
3. Indication of the frequency and predictability of the problem
100% but only after 497 days.
4. Sequence of events leading to the problem
If the gethrtime() OS call returns a value > 42949672950000000
nanoseconds then slgcs() stays at 0xffffffff. This can
cause some problems in parts of the code which rely on
slgcs() to keep moving.
eg: In kkjssrh() does "now = slgcs(&se)" and compares that
to a previous timestamp. After 497 days uptime slgcs()
keeps returning 0xffffffff so "now - kkjlsrt" will
always return 0. .
5. Technical impact on the customer. Include persistent after effects.
In this case DBMS JOBS stopped running after 497 days uptime.
Other symptoms could occur in various places in the code.[/code]

　　好么，原来是计时器溢出了，一检查我的主机:

　　以下为引用的内容：

[code]bash-2.03$ uptime
10:00pm up 500 day(s), 14:57, 1 user, load average: 1.31, 1.09, 1.08
bash-2.03$ date
Fri Nov 19 22:00:14 CST 2004[/code]

　　刚好到事发时是497天多一点.ft.

　　安排重起主机系统..

　　这个问题够郁闷的，NND，谁曾想Oracle这都成...

　　Oracle最后声称:

　　fix made it into 9.2.0.6 patchset

　　在Solaris上的9206尚未发布...晕.

　　好了，就当是个经历吧，如果有问题非常不可思议的话，那么大胆怀疑Oracle吧，是Bug，可能就是Bug。

　　重起以后问题解决，状态如下:

　　以下为引用的内容：

[code]$ sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.3.0 - Production on Fri Nov 26 09:21:21 2004
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.3.0 - Production
SQL> select job,last_date,last_sec,next_date,next_sec from user_jobs;

JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC
---------- --------- ---------------- --------- ----------------
70 26-NOV-04 09:21:04 26-NOV-04 09:26:00
SQL> /
JOB LAST_DATE LAST_SEC NEXT_DATE NEXT_SEC
---------- --------- ---------------- --------- ----------------
70 26-NOV-04 09:26:01 26-NOV-04 09:31:00
SQL>
SQL> select * from v$timer;
HSECS
----------
3388153
SQL> select * from v$timer;
HSECS
----------
3388319
SQL>
FAQ[/code]

Oracle诊断案例：Job任务停止执行

共有 0 条评论

频道文章

最新教程

随机推荐