Script showing successfully failing SQL Server agent jobs

Problem: One of your tasks as a production DBA is to ensure that all SQL agent jobs run successfully 100% of the time. Whether this is for end-user reporting or your backup and restore jobs, it is imperative that jobs succeed without any failed steps within them. When a job succeeds but has failed steps within it, that’s what I call successfully failing.

Now comes your daily routine. A quick check within SSMS’ Job Activity Monitor shows that all jobs show a last run outcome of Succeeded. Everything is good right? However, job outcome is not as expected, or end users complain (I mean politely mention) that they are missing some specified data. Looking deeper into the jobs indicate that steps have failed but the overall outcome is a success? What is going on? This could potentially be disastrous. The kicker here is, and you have to ask yourself, do you know how long this job has been successfully failing?

An innocent SQL Server agent job, reporting that is was successful, but was it?

You cannot always trust the output of SSMS’ Job Activity Monitor as we can see from the snippet above. Looking under the hood shows a different story.

Now, some will say to check the job history for the job. But let’s be realistic. You don’t have just one or two jobs to check daily (or multiple times per day), you can have hundreds, and are you going to check each individual job? No, that is a waste of your time.

SQL Server’s log viewer, finding a failed job step is like looking for a needle in a haystack.

Others will say, review the job history, but check all jobs to show every single history event in the log file viewer. Again, you have hundreds of jobs, why would you scroll through a massive list of job histories to find a single job (or potentially not find anything) that has that yellow warning triangle icon in the left? You wouldn’t. You could filter the list, but you are going to spend a lot of time attempting to ascertain the correct filter, if you can come up with one at all (the filtering in SSMS’ log viewer is not all that intuitive). Again, that is a big waste of your valuable DBA time.

Solution: Below is my solution to this problem hopefully saving you some time in your daily routine.

/* 
Author: Brent Sodtke
Created: 2022-09
Description: Shows all SQL Agent jobs that are successful but contain failed job steps.
----------
License: CC BY-SA (Creative Commons Share-alike)
You are free to:
Share - copy and redistribute the material in any medium or format.
Adapt - remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:
Attribution - You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike - If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED
*/
use master
go

set nocount on

declare @history int = 14 -- number in days from script run to look for entries in sysjobhistory table.

/*
attempt # 1 : get all successful job outcomes from sysjobhistory, then gather failed steps matching successful job.
attempt # 2 : get all failed job steps from sysjobhistory, then gather successful job outcomes matching the step(s).
*/

if OBJECT_ID(N'tempdb.dbo.#failedjobsteps') is not null
drop table tempdb.dbo.#failedjobsteps
create table #failedjobsteps
(
instance_id int null
, job_id uniqueidentifier null
, job_name sysname null
, rundatetime datetime null
)
insert into #failedjobsteps
select
   sjh.instance_id
   , sjh.job_id
   , sj.name
   , msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time) as rundatetime
from msdb.dbo.sysjobhistory sjh
join msdb.dbo.sysjobs sj
on sjh.job_id = sj.job_id
where
   sjh.step_id <> 0
   and sjh.run_status = 0
   and msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time) >= getdate()-@history -- putting in a timeframe here since this can take a long time with large job history
order by sjh.instance_id asc

-- ##################################################

if OBJECT_ID(N'tempdb.dbo.#successfuljobrunswithstepfailures') is not null
   drop table tempdbo.dbo.#successfuljobrunswithstepfailures
create table #successfuljobrunswithstepfailures
(
   job_name sysname null
   , job_id uniqueidentifier null
   , rundatetime datetime null
)
while (select count(1) from #failedjobsteps) > 0
begin
   insert into #successfuljobrunswithstepfailures
   select distinct(sj.name), sjh.job_id , msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time)
   from msdb.dbo.sysjobhistory sjh
   join msdb.dbo.sysjobs sj
   on sjh.job_id = sj.job_id
   -- we know that of all the rows from within the sysjobhistory table, job runs are grouped by
   -- (1) the job_id matching
   -- (2) the instance_id is less than or equal to the instance_id of the row containing the job outcome
   -- (3) the run_date & run_time are greater than or equal to the run_date & run_time of the row containing the job outcome
   where sjh.job_id = (select top 1 job_id from #failedjobsteps order by instance_id asc)
   and sjh.instance_id >= (select top 1 instance_id from #failedjobsteps order by instance_id asc)
   and msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time) <= (select top 1 rundatetime from #failedjobsteps order by instance_id asc)
   and sjh.run_status = 1 -- only concerned with successes
   and sjh.step_id = 0 -- only concerned with job outcome steps
   group by sj.name, sjh.job_id , msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time)

   delete from #failedjobsteps where instance_id = (select top 1 instance_id from #failedjobsteps order by instance_id asc)
end

-- select * from #successfuljobrunswithstepfailures;

-- (1) show unique job runs that have at least one failed step. remember, a job can have more than one failed step, be successful and thus
-- have more than one entry within #successfuljobrunswithstepfailures.
-- this will show a list of successful jobs in sysjobhistory having at least one failed job step.

if OBJECT_ID(N'tempdb.dbo.#unique') is not null
   drop table tempdbo.dbo.#unique
create table #unique
(
   job_name sysname null
   , job_id uniqueidentifier null
   , rundatetime datetime null
)
insert into #unique
select
   distinct(job_name)
   , job_id
   , rundatetime
from #successfuljobrunswithstepfailures
group by job_name, job_id, rundatetime
order by rundatetime asc

select * from #unique order by rundatetime desc -- return data back to client for later use; let client determine the count.
-- select count(1) numberofuniquejobruns from #unique

-- (2) show the list of successful outcome jobs having at least one failed step and their most recent execution.
-- this list will be different than (1) since no job_id should occur more than once.
-- this will also show the user a more immediate list (most recent) of jobs that should require attention.

select job_name, job_id, rundatetime -- return data back to client for later use; let client determine the count.
from
(
   select u.*
   , row_number() over (partition by job_name order by rundatetime desc) as rownum
   from #unique u
) u2
where u2.rownum = 1
order by u2.rundatetime desc

The output of the above script will produce something similar to the following:

As the script notes, there are going to be 2 output windows within SSMS for this.

The first, showing all occurrences of jobs with failed steps based on the @history variable, i.e., the number of days back that you want to check for. In this example, all job runs in the past 14 days that have at least one failed job step, but “successful” job outcome are shown. This can be useful in determining if a particular job has been successfully failing for a period of time. Also note that this first output block will contain the same job multiple times should it have executed and successfully failed in that time period.

The second, showing a single row per “successful” job with at least one failed step for its most recent job execution. This is useful to get a quick list of actionable jobs to look into and resolve.

In a future script we’ll explore expanding this concept to include the specific job step IDs and job step names that failed to further pinpoint trouble areas.

And there you have it, a means for checking your entire SQL agent job history for successfully failing jobs.

External Links: Here are some external links to system tables I reference within the script above:
msdb.dbo.sysjobhistory; msdb.dbo.sysjobs.
Unfortunately I could not find any official documentation for scalar function msdb.dbo.agent_datetime within the Microsoft Docs website, however, I will reference this one by Pinal Dave on SQL Authority for your reading pleasure.


Did you find this useful? Did it help you solve an immediate need within your SQL Server agent estate? If so, you might want to check out SQL Agent Insight, a product I’ve been working on during spare time to assist in determining problem areas of SQL Server agent instances.


Did you find this useful? Do you have a better way to achieve the desired result using T-SQL? Let me know in the comments below. Cheers.