User Tools

Site Tools


monitoring

Monitoring

Making sure that all your backup jobs are running correctly and to schedule can be a pain, particularly if you have a large number of jobs emailing you every day whether or not they succeed. Failed jobs can easily get lost in the noise, and configuration mistakes leading to jobs not running in the first place can be hard to spot.

I've put together a nasty SQL query that attempts to calculate the next expected run time of a job and use this in two ways: 1) A dashboard (unfortunately tightly integrated into internal code, so I can't share it as a whole), and 2) a nagios check that alerts when a job is behind schedule.

It's a bit naive, it relies catalog records for multiple runs of a job in the past and uses the delta to predict the next run time. It's not hugely efficient either, but it gets the job done for me. YMMV.

It uses two additional tables I added to the bacula database to store exceptions for Jobs which should not be alerted on (AlertExclusions) and for individual instances of jobs which should not be taken into account for calculations (JobAlertExclusions) to prevent out-of-schedule backups triggering false alerts.

CREATE TABLE `AlertExclusion` (
  `ExclusionId` INT(11) NOT NULL AUTO_INCREMENT,
  `Name` tinytext NOT NULL,
  `Level` CHAR(1) DEFAULT NULL,
  `ExcludeFlags` tinyint(4) DEFAULT NULL,
  `Expiry` datetime DEFAULT NULL,
  `Username` VARCHAR(64) DEFAULT NULL,
  `Comment` text,
  `Added` datetime NOT NULL,
  PRIMARY KEY (`ExclusionId`)
) ENGINE=InnoDB AUTO_INCREMENT=39 DEFAULT CHARSET=latin1
 
CREATE TABLE `JobAlertExclusion` (
  `JobId` INT(11) NOT NULL,
  `Username` VARCHAR(64) DEFAULT NULL,
  `Comment` text,
  `Added` datetime NOT NULL,
  PRIMARY KEY (`JobId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
 
 SELECT                                                                                 
     t2.Name,                                                                           
     t2.Job,                                                                            
     t2.Level,                                                                          
     t4.CurrentBytes,                                                                   
     t2.ApparentStartTime AS StartTime,                                                 
     UNIX_TIMESTAMP(t2.ApparentStartTime) AS StartTimeStamp,                            
     t2.JobStatus,                                                                      
     t2.JobStatusLong,                                                                  
     t1.ScheduleInterval,                                                               
     t1.AverageRunTime,                                                                 
     t1.AverageSize,                                                                    
     t1.TotalGoodJobs,                                                                  
     t1.TotalGoodJobSize,                                                               
     UNIX_TIMESTAMP(DATE_ADD(t2.SchedTime, INTERVAL t1.ScheduleInterval SECOND)) AS NextRun,
     UNIX_TIMESTAMP(DATE_ADD(t2.SchedTime, INTERVAL t1.ScheduleInterval SECOND)) - UNIX_TIMESTAMP() AS NextRunDiff,
     t5.ExcludeFlags,                                                                   
     t5.Expiry,                                                                         
     UNIX_TIMESTAMP(t5.Expiry) AS ExpiryTimestamp,                                      
     t5.Username,                                                                       
     t5.Comment                                                                         
 FROM                                                                                   
     (                                                                                  
         SELECT                                                                         
             t3.*,                                                                      
             STATUS.JobStatusLong                                                       
         FROM                                                                           
             (                                                                          
                 SELECT                                                                 
                     *,                                                                 
                     GREATEST(StartTime,SchedTime) AS ApparentStartTime                 
                 FROM                                                                   
                     Job                                                                
                 WHERE                                                                  
                     Name!=''                                                           
                     AND TYPE='B'                                                       
                     AND JobStatus NOT IN ('A','E','f')                                 
                     AND PriorJobID=0                                                   
                 ORDER BY                                                               
                     GREATEST(StartTime,SchedTime) DESC                                 
             ) AS t3                                                                    
             LEFT JOIN STATUS ON t3.JobStatus=STATUS.JobStatus                          
         GROUP BY                                                                       
             Name,Level                                                                 
     ) AS t2                                                                            
     LEFT JOIN (                                                                        
         SELECT                                                                         
             Name,                                                                      
             Level,                                                                     
             TIMESTAMPDIFF(SECOND, MIN(SchedTime), MAX(SchedTime))/(COUNT(SchedTime) - 1) AS 'ScheduleInterval',
             AVG(TIMESTAMPDIFF(SECOND, StartTime, EndTime)) AS 'AverageRunTime',        
             AVG(JobBytes) AS 'AverageSize',                                            
             COUNT(JobBytes) AS 'TotalGoodJobs',                                        
             SUM(JobBytes)  AS 'TotalGoodJobSize'                                       
         FROM                                                                           
             Job                                                                        
             LEFT JOIN JobAlertExclusion AS x ON Job.JobId=x.JobId                   
         WHERE                                                                          
             Job.JobStatus IN ('T','E')                                                 
             AND Job.Type='B'                                                           
             AND Job.PriorJobId=0                                                       
             AND x.JobId IS NULL                                                        
         GROUP BY                                                                       
             Job.Name,                                                                  
             Job.Level                                                                  
     ) AS t1 ON t1.Name=t2.Name AND t1.Level=t2.Level                                   
     LEFT JOIN (                                                                        
         SELECT                                                                         
             JobId,                                                                     
             SUM(EndBlock-StartBlock)*64512 AS 'CurrentBytes'                           
         FROM                                                                           
             JobMedia                                                                   
         GROUP BY                                                                       
             JobId                                                                                                                                                                
     ) AS t4 ON t4.JobId=t2.JobId
     LEFT JOIN (                                                                      
         SELECT                                                                       
             *                                                                        
         FROM                                                                         
             (                                                                        
                 SELECT                                                               
                     *                                                                
                 FROM                                                                 
                     AlertExclusion AS t6                                          
                 ORDER BY                                                             
                     Expiry DESC                                                      
             ) AS t7                                                                  
         GROUP BY                                                                     
             t7.Name,                                                                 
             t7.Level                                                                 
     ) AS t5 ON t5.Name=t2.Name AND t5.Level=t2.Level                                 
 WHERE                                                                                
     t5.ExclusionId IS NULL                                                           
     OR t5.Expiry < NOW()                                                             
     OR (t5.ExcludeFlags & 2) = 0

(It also attempts to make a crude estimate of progress based on amount of data written compared to the average historical size. This only seems to work for tape backups, not file backups, I've not yet found out why and would appreciate any attempts to fix it).

monitoring.txt · Last modified: 2014/07/25 08:29 by benroberts