Thursday, November 7, 2013

Script to monitor Alert log files for an Oracle Database

Recently I was asked to write a script that monitors the Oracle alert log for "ORA-" errors and sends an automated email to the DBA team.

Here are the challenges:
1. How do I make the script scan only the lines that the previous scan did not cover?
Put another way: if scan x happened at 3pm, how do I ensure that scan (x+1) picks up only the lines generated after 3pm? Should I store the timestamp of the last scan, or the high-water mark (HWM) line count? I finally decided to go with the HWM line number.

2. How do I deal with an alert log being recycled? Will my script detect that the alert log has been recycled and automatically reset the HWM so that it starts from line 1 of the new alert log?
Comparing the stored HWM line count with the current line count of the alert log helps here. If the alert log line count is smaller than the stored HWM line count, it is safe to assume the alert log was recycled. In extraordinary circumstances, where the alert log was recycled and the next run of the script happened so late (say, a couple of weeks later) that the new alert log's line count already exceeded the stored HWM line count, my logic would skip the first HWM lines of the new alert log for that run only, and normal operation would resume on subsequent scans. I chose to ignore this possibility because it is extremely remote, and even then only the first scan would fail to detect the ORA- errors.
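The two decisions above can be sketched as one small function. This is only an illustration of the idea, not the script itself: the function name scan_new_lines and its two arguments (log file, counter file) are made up for the example.

```bash
#!/bin/bash
# Sketch only: scan_new_lines and its arguments are hypothetical names.
# Usage: scan_new_lines LOG_FILE COUNTER_FILE
scan_new_lines() {
    local log_file=$1 ctr_file=$2
    local stored=0 current

    # HWM saved by the previous scan (stays 0 on the first run).
    if [ -f "$ctr_file" ]; then
        stored=$(cat "$ctr_file")
    fi

    # Arithmetic expansion strips the padding some wc builds add.
    current=$(( $(wc -l < "$log_file") ))

    # Fewer lines than the stored HWM: treat the log as recycled.
    if [ "$current" -lt "$stored" ]; then
        stored=0
    fi

    # Search only the lines added since the last scan.
    # grep exits non-zero when nothing matches; that is not an error here.
    tail -n "$((current - stored))" "$log_file" | grep -i "ora-" || true

    # Save the new HWM for the next scan.
    echo "$current" > "$ctr_file"
}
```

Calling it twice against the same log prints each ORA- line once: the second call sees only the lines appended after the first. If the log shrinks in between, the whole new log is scanned.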

#!/bin/bash
#Alert Log Monitoring Script
#Usage: alert_monitor.sh
#Author:Gautham Chandrasekaran
#Version 1.0 06 NOVEMBER 2013

EMAIL_LIST=gautha@hotmail.com
ALERT_LOG_FILE=/u01/app/oracle/diag/rdbms/prod/PROD_1/trace/alert_PROD_1.log
EMAIL_FILE=/tmp/alert_mail.log

#Error out if Alert Log is missing
if [ ! -f "$ALERT_LOG_FILE" ] ; then
 echo "**** $ALERT_LOG_FILE FILE MISSING FROM ALERT LOG MONITOR ******" > "$EMAIL_FILE"
 mailx -s "Alert Log Errors" $EMAIL_LIST < "$EMAIL_FILE"
 exit 1
fi



ALERT_LOG_LINE_CNT_FILE=${ALERT_LOG_FILE}.ctr

#First run of the script, or somebody deleted the counter file. Therefore start from zero.
if [ ! -f "$ALERT_LOG_LINE_CNT_FILE" ] ; then
 echo 0 > "$ALERT_LOG_LINE_CNT_FILE"
fi

STORED_LINE_CTR=`cat "$ALERT_LOG_LINE_CNT_FILE"`

#Count the lines currently in the alert log
ALERT_LOG_LINE_CTR=`wc -l < "$ALERT_LOG_FILE"`

#check to see whether somebody has recycled alert log file.
#in this case the STORED_LINE_CTR will be higher than ALERT_LOG_LINE_CTR
#If so just reset STORED_LINE_CTR to 0.
if [ $ALERT_LOG_LINE_CTR -lt $STORED_LINE_CTR ] ; then
STORED_LINE_CTR=0
fi


LINES_TO_MONITOR=`expr $ALERT_LOG_LINE_CTR - $STORED_LINE_CTR`


tail -n "$LINES_TO_MONITOR" "$ALERT_LOG_FILE" | grep -i "ora-" > "$EMAIL_FILE"

#Save the current line count to the ctr file so the next run skips the rows already scanned
echo "$ALERT_LOG_LINE_CTR" > "$ALERT_LOG_LINE_CNT_FILE"


#Email only if the file is not empty
if [ -s "$EMAIL_FILE" ] ; then
 mailx -s "Alert Log Errors" $EMAIL_LIST < "$EMAIL_FILE"
fi
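
One thing to keep in mind: the script counts the lines first and runs tail afterwards. If the database writes to the alert log between those two steps, tail -n counts back from the new end of the file, so a few of the oldest unscanned lines could be missed and the newest ones scanned again on the next run. A possible variation, shown here only as a sketch, is to select the exact line range with sed instead. The helper name print_line_range is hypothetical and not part of the script above.

```bash
#!/bin/bash
# Sketch only: print_line_range is a hypothetical helper.
# Usage: print_line_range FILE FROM TO
# Prints lines FROM+1 .. TO of FILE, regardless of lines appended later.
print_line_range() {
    local file=$1 from=$2 to=$3

    # Nothing to print when no new lines were counted.
    if [ "$to" -le "$from" ]; then
        return 0
    fi

    # -n turns off automatic printing, "p" prints the chosen range,
    # and "q" stops reading once line TO has been handled.
    sed -n "$((from + 1)),${to}p;${to}q" "$file"
}
```

In the script, the tail call could then become something like: print_line_range "$ALERT_LOG_FILE" "$STORED_LINE_CTR" "$ALERT_LOG_LINE_CTR" | grep -i "ora-" > "$EMAIL_FILE"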

6 comments:

  1. Furthermore, if anyone wants to know how to make this work with sitescope 11.2 or later, email me at jesse@josserand.com and I'll explain how to modify it to run as a remote script and how to configure the sitescope script monitor to work with it.

  2. This is scanning all the lines for each run. Any suggestions please?

    Replies
    1. no it does not-
      tail -n $LINES_TO_MONITOR $ALERT_LOG_FILE|grep -i "ora-" > /tmp/alert_mail.log

      tail -n num: Prints the last ‘num’ lines

  3. Thanks for the response! I did "sh script.sh" at 3:00 PM for the first time, and at 3:15 pm second time, email received at two timings has repeated content. Not sure if I am doing something wrong.

  4. All I can say is that the script is used at my production site and has no such issues. Probably the same content occurred again after 3pm, and therefore the 3:15pm scan picked it up.

  5. I did debug the script; it looks like it's taking total rows - 0 during each run. How does it remember the count from last time? Thanks.
