Howdy,
Today I'm going to talk about how to troubleshoot SLA's. Troubleshooting SLA's is no simple task. There are a lot of moving parts to the SLA engine and the engine itself has changed multiple times over the last 3 years. With this in mind, my goals for this article are two-fold. First I'd like to give a brief explanation of the workings of the current SLA engine; called the "2011" engine (com.snc.sla.engine.version). Second I will talk about how to approach troubleshooting an SLA issue.


How it works


From an event driven perspective, it all starts with a task (the table that is). Somebody or something submits an Incident, updates a Problem or requests a Service Catalog item. When this happens a business rule called "Run SLAs" kicks off. This Business Rule doesn't do too much except call a Script Include, "TaskSLAController". But that is actually saying quite a lot because the TaskSLAController is the gateway to the SLA Engine.





Transitions

As of the June 2011 re-write of the SLA engine most all of the real work for SLAs are handled by Script Includes. The code is written in a re-usable and nicely encapsulated way that allows extension and customization. It also makes for much more easily understood code and a grouping of tasks into logical parts. Probably the most creative and helpful example of this encapsulation is the code that handles the transitions between stages. The key component of this is a base class Script Include (SLAConditionBase) with methods that can easily be overwritten should you desire custom functionality. (see the Wiki article, Modifying SLA Condition Rules for a great description.) If you notice that your stages are mysteriously going to "Complete" when you expect them to be "Achieved" it might be part of this mechanism that is mis-configured. If you notice systemic issues with calculations or SLA operation then perhaps one of the core Script Includes has been customized and did not update during your last upgrade.


Workflow

Since the 2010 engine (introduced in one of the later Spring 2010 patches), SLA Workflows have been used only for Notification purposes. Before that time they were used to update fields and manage escalations. If you are not getting the Notifications you expect or are not getting them at the right/expected times, Workflow is the place to look for issues.


Breach timers

Breach timers are sys_trigger records (System Scheduler > Scheduled Jobs module). This is a very central table that handles all things scheduled in the system. Every time that an SLA goes into "In Progress" a Breach Timer is generated that will expire at the time indicated in the Planned End Time field. When the breach timers expire they call TaskSLA.breachTimerExpired(). I have not seen many issue with Breach Timers. If you did have an issue then it might look like incidents that breach later than they should. Sys_trigger records do not get destroyed during a restart and there isn't any normal reason for them to stop running unless something extreme like an out-of-memory error occurs. If your sys_trigger queue gets backed up then all the jobs in the queue will stop running. You'll notice that not only do SLA's stop breaching, but Workflow Notification will stop going out too because Workflow Timers run on sys_trigger records too. Remember, everything that can be scheduled is scheduled in the sys_trigger table.


Calculate on Display

Ever wonder what updates SLAs? If you were to set up reporting on your SLAs and run the same report twice you would notice that the elapsed percentage fields on your In Progress SLAs don't change. Seems odd doesn't it? In some cases customers will call in wondering why they can open an Incident, then twenty seconds later look at the Elapsed Percentage field and it is still at zero. The problem is that the Calculated fields on SLAs aren't magically getting upgraded in real time. Unless you turn on the glide.sla.calculate_on_display property, the calculations won't even get upgraded when you look at the Incident. The reason is that SLA calculations only get updated when some event occurs. By default this is when either the Conditions are met for the task_sla to go through a Transition, a Breach Timer expires or a Calculation Job runs (explanation coming up next). Even if you turn on the glide.sla.calculate_on_display System Property (sys_property table) you won't get the results you expect when looking at the task_sla table directly or through the reporting engine. This is because the Calculate on Display business rule only runs when a user looks at a task record not a task_sla record! Definitely something to consider when planning how you will design your SLA reporting.


Calculation Jobs

These are Scheduled Jobs (sys_trigger table) that run at different intervals and update the calculated values of SLAs that will breach soon or have already breached. These Calculation Jobs help keep more reliable calculated values for SLAs that are near breach. For example, if you search in the Scheduled Jobs table for jobs that start with "SLA Update" you'll find a number of jobs; one of which runs every minute and updates all SLAs that will breach within 10 minutes.


Troubleshooting


You can do this the hard way or the real hard way. All kidding aside, most SLA issues take time and effort to solve. Sometimes the effort is in understanding how the SLA engine should be expected to operate (so kudos to you for reading this article). It is worth the effort and time though because for most companies (maybe yours!) SLAs are the primary performance metric for their operation.


The best way to start out is to determine exactly what the expected behavior is. Consider the different areas mentioned above. What mechanism controls the behavior that you believe is acting unexpectedly? What version of the SLA engine are you running and are there any customizations*?


Desk check

Here is my baby. I've been troubleshooting SLAs since Spring 2010 and I finally got tired of piecing together the who/what/whens of SLA mysteries so I wrote a background script that would grab almost all of the pertinent details and output them for me in a timeline. This script is pretty barebones on the display side, but it outputs all the configuration information for the Schedule, Workflow, SLA Definition and then outputs a timeline of all the related activity including the Activities of the Active Workflow Context, Changes to Task fields that affect the Task SLA Transitions, and the current values of the SLA calculations themselves. Note that the script does not force an update to the Task SLA Calculations so it is only as accurate as the last Calculation time. However, that would be a nice feature.



/*
* @Usage - Currently the following two examples are the way I
* imagine this object being used. The first example will return a
* report of a specified number of task_sla's. The second example
* takes a sys_id of a task_sla record and returns a report about
* that task_sla.
*/
var dCheck = newDCheck();
var filters = [
{"field":"end_time","operator":"!=","value":""},
{"field":"sys_created_on","operator":">","value":"2012-04-13 21:00:00"},
{"field":"sys_created_on","operator":"<","value":"2012-04-13 23:00:00"}
];
gs.print(dCheck.getLatestByDefinition('name','ADSK - Incident- P3 Resolution(5 Days) ',3,filters));
gs.print(dCheck.getSlaTimeline('5161366d0a0a0b3000d59bf577381424'));



/*
* @Description - This script speeds up the information gathering process
* about an existing task_sla. It outputs information about the
* contract_sla (SLA Definition), task (Incident, Problem, etc.), task_sla
* (running SLA), wf_context (Running Workflow) and wf_history (History
* of the running Workflow). First it outputs information about the
* definitions of the above items and then it compiles the history of
* changes to each item and orders them by timestamp so that you can piece
* together expected vs. actual behavior. See bottom of script for usage
* information.
*/
function newDCheck() {
return {
//Specify sys_id of the SLA definition
//@param fld {string} field of SLA Definition to query
//@param value {string} value to match agains fld
//@param filters {array} array of objects in format [{field:"",op:"",value:""},...]
getLatestByDefinition : function(fld, value, test, filters) {
if (typeof tests == "undefined")
tests = 3;//number of sla's to test
var out = ["NOTE: All times in GMT\n"];


var sld = new GlideRecord("contract_sla");
sld.addQuery(fld, value);
sld.query();
while (sld.next()) {
//Print conditions from SLA definition
out.push(this.printSlaDefinition(sld));


//Pull out the field names
var flds = this.getSlaConditions(sld);
out.push("fields from conditions: " + flds.join());


//Print information about the SLA schedule SKIPPED


//Get group of task_sla records
var slt = new GlideRecord("task_sla");
slt.addQuery("sla",sld.sys_id+"");
slt.addQuery("sys_created_on","<",gs.daysAgoStart(1));
if (filters)
for (var iz = 0; iz < filters.length; iz++)
slt.addQuery(filters[iz].field, filters[iz].operator, filters[iz].value);
slt.orderByDesc("sys_created_on");
slt.setLimit(tests);
slt.query();
//For each task_sla
id = 0;
while(slt.next()) {
out.push("\n*******" + ++id + "*******");
out.push(this.getSlaTimeline(slt, flds));
}
}
return out.join("\n");
},



printSlaDefinition : function(contract) {
var out = [];
var sldFlds = ['name', 'workflow', 'collection', 'duration', 'duration_type', 'retroactive', 'set_start_to', 'schedule', 'timezone', 'type', 'start_condition', 'stop_condition', 'pause_condition', 'reset_condition'];
for (var ia = 0; ia < sldFlds.length; ia++)
out.push("contract_sla." + sldFlds[ia] + ": " + contract[sldFlds[ia]]);


return out.join("\n");
},



getSlaConditions : function(contract) {
var conds = ['start_condition', 'stop_condition', 'pause_condition', 'reset_condition'];
var flds = [];
for (var ib = 0; ib < conds.length; ib++) {
var cond = contract[conds[ib]].split("^");
for (var ic = 0; ic < cond.length; ic++) {
var fld = cond[ic].split(/[^a-z_]/)[0];
if (fld) flds.push(fld);
}
}
return flds;
},


timeline : {
_timeline : [],


sort : function() {
return this._timeline.sort();
},


push : function(value) {
var i = this._timeline.length;
while(i--) if (this._timeline === value) return;
this._timeline.push(value);
},


join : function(delim) {
return this._timeline.join(delim);
},


reset : function() {
this._timeline = new Array();
}
},


getSlaTimeline : function(sla, condFlds) {
//var timeline = [];
var fields = [];


if (typeof sla == "string") {
var sltask = new GlideRecord("task_sla");
sltask.get(sla);
sla = sltask;
fields.push(this.printSlaDefinition(sla.sla));
} else if (!sla) return;
if (!condFlds) condFlds = this.getSlaConditions(sla.sla);


//Print task information (retroactive value, any field in conditions...)
if (sla.task) {
var taskRec = new GlideRecord("task");
taskRec.addQuery("sys_id",sla.task+"");
taskRec.query();
if (taskRec.next()) {
var tFlds = ['sys_id','number',sla.sla.set_start_to];
for (var ih = 0; ih < tFlds.length; ih++)
fields.push("task." + tFlds[ih] + ": " + sla.task[tFlds[ih]]);
//Add task information to timeline (sys_created_on, sys_updated_on...)


for (var ii = 0; ii < condFlds.length; ii++) {
//Query audit tables for all field names from the conditions list ordered by Created On
var aud = new GlideRecord("sys_audit");
aud.addQuery("documentkey",sla.task.sys_id+"");
aud.addQuery("fieldname",condFlds[ii]);
aud.orderBy("sys_created_on");
aud.query();
if (aud.getRowCount() == 0) this.timeline.push(sla.task.sys_created_on + " Orig " + condFlds[ii] + ": " + sla.task[condFlds[ii]]);
while(aud.next()) {
this.timeline.push(aud.sys_created_on + " Task " + aud.fieldname + " old:" + aud.oldvalue + " new:" + aud.newvalue);
}
//Add audit information to timeline
}
} else {//bad reference
fields.push("ERROR: task is a bad reference");
}
}


//Print task_sla information
var sFlds = ['sys_id','task','start_time', 'end_time', 'planned_end_time', 'duration', 'percentage', 'pause_time', 'pause_duration', 'stage', 'has_breached', 'sys_created_on', 'sys_updated_on'];
for (var ie = 0; ie < sFlds.length; ie++)
fields.push("task_sla."+sFlds[ie] + ": " + sla[sFlds[ie]]);


//Add task_sla information to timeline (start_time, end_time, planned_end)
var tlFlds = ['start_time', 'end_time', 'pause_time', 'planned_end_time', 'sys_created_on', 'sys_updated_on'];
for (var ig = 0; ig < tlFlds.length; ig++)
this.timeline.push(sla[tlFlds[ig]] + " SLA " + tlFlds[ig]);


//Query wf_context for this task_sla
var wf = new GlideRecord("wf_context");
//Print wf_context info
wf.addQuery("id",sla.sys_id+"");
wf.query();
//Query all Activities in the wf_context
while(wf.next()) {
fields.push("wf_context.name: " + wf.name);
fields.push("wf_context.started: " + wf.started);
fields.push("wf_context.ended: " + wf.ended);
//Add Activity start/end to timeline
var wfe = new GlideRecord("wf_history");
wfe.addQuery("context",wf.sys_id+"");
wfe.query();
while(wfe.next()) {
this.timeline.push(wfe.started + " WF " + wfe.activity.name + " (activity) began");
this.timeline.push(wfe.ended + " WF " + wfe.activity.name + " (activity) ended");
}
}


var slaTimeline = fields.join("\n") + "\n**Timeline (GMT)**\n" + this.timeline.sort().join("\n");
this.timeline.reset();
return slaTimeline;
}
}
}


NOTE: I've also attached this as a file to this blog post for download


Gotcha's


Here's a collection of some of the most common mistakes I see.


Duplicate Workflows

SLA Workflows get kicked off when Conditions from the SLA Definition (contract_sla table) match. This means you don't need to have the workflow set up to Run If... If you do set up the Workflow to Run If then you'll get duplicate Active Workflow Contexts.


Retroactive Start Time set to Weird Values

Setting a Retroactive Start time means that at the time of Task SLA (task_sla) creation the Start Time field will be set to some field from the Task record. Task SLAs get started when a Task gets Updated. So if you set the Retroactive Start time to Updated On then you might as well not set it to anything! Also, retroactive start time has some caveats. Check out the Wiki's explanation of prior pause time. In June there were also issues discovered with workflow timers not working correctly with retroactive SLAs and Schedules.


Relative Duration

The biggest thing here is to understand that Relative Duration does not work with Pause Time. To avoid confusion we now hide the Pause Condition field on Relative Duration SLA Definitions. The next biggest thing to understand is the the "current" object is not supported in the scripted section of a Relative Duration. If you use it you will get results, but they will not be consistent. This means that you can not use Relative Duration SLAs to design an SLA that will have a different duration depending on some value on the Task record.


Calculate on Display

I already mentioned this earlier, but it's worth repeating. Setting the Calculate on Display Property means that every time a user views a Task record the calculations of the SLAs in the Related List will be updated. It does NOT mean that every time a Task SLA record is viewed it's calculations will be updated.




Regards, Matthew Watkins




* ServiceNow tracks most customizations to configuration records in your system. During the next upgrade, any changes that would otherwise be applied to configuration records that have been altered will be SKIPPED. You should make sure to check the Upgrade History module to see if any updates have been skipped and merge or overwrite the custom code with the new ServiceNow changes. If you want to know more about this, check the following WIKI articles:

Tracked Customizations in Update Sets

Upgrade History





What about you? What are some techniques you use to troubleshoot SLAs? What are some gotchas you've discovered?