Mwatkins
ServiceNow Employee

Introduction

This guide is written by the ServiceNow Technical Support Performance team (All Articles). We are a global group of experts who help our customers with performance issues. If you have questions about the content of this article, we will try to answer them here. However, if you have urgent questions or specific issues, please see the list of resources on our profile page: ServiceNowPerformanceGTS

Instance performance administration, or "instance hygiene", is probably not at the forefront of most customer ServiceNow administrators' minds. It is easily overlooked because the platform is fairly tolerant of the various demands users place on it. That said, with considered planning and periodic maintenance activities, you can ensure that your instance scales well and runs optimally.

It is worth pointing out that the ServiceNow Cloud Operations teams have a comprehensive set of alerts and monitors in place, so we will always be notified if an instance is in a critical state. Our Support engineers (and our customers) will be notified via an incident on HI, and mitigating actions will be recommended to provide relief. The purpose of the recommendations made in this document is to enable customers to proactively prevent their instances from getting into a critical state in the first place.

Due to the extensively customizable nature of the ServiceNow platform, the recommendations that follow are not an exhaustive list of preventative activities, and some items will offer more benefit than others for your specific implementation.

 

Daily Activities

Review the "System Diagnostics" home page

ReviewSystemDiagnostics.png

The System Diagnostics homepage tracks some high-level statistics for each of the nodes (JVMs) in your instance. Values are either "real time" at the point the page is rendered or cumulative counts (for example, the transactions and errors values) since the node was last started (see JVM uptime). When reviewing this information, do not be concerned if the total number of JVM classes differs between nodes: this metric shows the number of classes which have been loaded and subsequently unloaded on each JVM. Depending on what activities users have been performing on each node, there can legitimately be a disparity in what has been called since that JVM was last started.

You may like to set up a spreadsheet (or a table in your instance) to track the uptime, the number of errors since the last restart, the number of transactions performed, and the number of logged-in users for each node. While the platform does have built-in performance graphs which show this information, they are rendered on a per-node basis. If you spot an uncharacteristic jump in these numbers, it can be a good indicator that there is an underlying performance issue which needs to be identified and addressed.
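
If you want to automate part of that tracking, the sys_cluster_state table holds a row per node, and its stats column carries an XML payload of node metrics. The following is only a sketch to run from a background script; the column names (system_id, status, stats) are out-of-box assumptions, so verify them against your instance's dictionary before relying on them.

    // Sketch: snapshot each node's status from sys_cluster_state.
    // Field names are assumptions - check sys_dictionary for your release.
    var node = new GlideRecord('sys_cluster_state');
    node.query();
    while (node.next()) {
        gs.info('Node: ' + node.getValue('system_id') +
            ' | Status: ' + node.getValue('status'));
        // node.getValue('stats') contains an XML blob of per-node metrics
        // that you could parse and record in your tracking table.
    }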

The System Diagnostics homepage has a number of very useful features out of the box. It is the one place you can go to see performance details across all of your instance's nodes. However, the page can easily be extended to include information about scheduler workers, semaphores, and more. See some ideas for extending the page at How can I see statistics (xmlstats.do & stats.do) for all my nodes?

 

Review yesterday's "slow" transactions

By reviewing the Transactions (All user) information, you can see which transactions are taking more than a specified amount of time. Be sure to have the Client Transaction Timings plugin enabled to capture all of the data. More details on this plugin can be found in Client Transaction Timings.

ReviewYesterday.png

        Filter showing all transactions created yesterday which took more than 3,000 ms to complete

 

In this list, you can view the total response time along with a breakdown of its composite parts (time spent rendering in the browser, time spent on the server processing the transaction, and a calculated time spent on the network), details of which node processed the request, the IP address of the host making the request, the user making the request, and of course when the transaction occurred. The session ID is also captured, so if you wanted to review the application logs to forensically unpick everything a user has performed in their session, you can do so.

What should stand out for you is whether there is a particular time of day when transactions execute slowly, whether these transactions are all being processed by the same node (suggesting one or more transactions or background jobs consuming large quantities of memory), or whether transaction response times are poor across all nodes (which typically signifies the database was working harder than usual, impacting all transactions). You might notice that the top ten slowest queries were all issued by Joe Bloggs and are incident lists. If that is the case, then you can review the user's settings or impersonate that user and try to recreate the issue.
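
If you would rather pull the same data with a background script (for example, to feed the tracking spreadsheet mentioned earlier), the sketch below reproduces the filter from the screenshot. The column names (sys_created_on, response_time, url) reflect the out-of-box transaction log table; confirm them on your instance before using this.

    // Sketch: yesterday's transactions that took longer than 3,000 ms, slowest first.
    var txn = new GlideRecord('syslog_transaction');
    txn.addQuery('sys_created_on', '>=', gs.daysAgoStart(1));
    txn.addQuery('sys_created_on', '<=', gs.daysAgoEnd(1));
    txn.addQuery('response_time', '>', 3000);
    txn.orderByDesc('response_time');
    txn.setLimit(100); // keep the output manageable
    txn.query();
    while (txn.next()) {
        gs.info(txn.getValue('sys_created_on') + ' | ' + txn.getValue('response_time') +
            ' ms | ' + txn.getValue('url'));
    }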

 

For more detailed instructions on how to work with the transaction logs, see our documentation here or this KB article, Troubleshooting Guide: Using the Transaction Logs.

 

Use case consideration

How much data do your users need to interactively review in a single screen?

If you identify that "list" transactions are slow, it is worth checking how much data your users are requesting. When a user elects to "Show 100 rows per page" on a list, this sets a user preference, which means every single list in the platform will pull back 100 rows for that user. This includes related and embedded lists on forms as well as the list views where the preference was set.

UseCaseConsideration.png

User chooses to show 100 rows per page

 

This becomes problematic when a table with many reference fields has to render a list. The platform has to build the relationships for all of the reference fields for all the rows being displayed on screen. For most service environments, agents can't practically use more than 20-30 rows at a time. If the page load is fast, a good case can be made for "paging" to the next chunk of results rather than scrolling down. Consider removing any options for more than 50 rows at a time from the platform. You can refer to Improve performance by displaying "just enough" data for further details or contact ServiceNow Customer Support for assistance.
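
Before removing any options, it can help to gauge how many users have already opted into large page sizes. The following is a quick sketch using the out-of-box sys_user_preference table (the "rowcount" preference is the one set by the "Show x rows per page" menu):

    // Sketch: count users whose list preference is currently 100 rows per page.
    var ga = new GlideAggregate('sys_user_preference');
    ga.addQuery('name', 'rowcount');
    ga.addQuery('value', '100');
    ga.addAggregate('COUNT');
    ga.query();
    if (ga.next()) {
        gs.info('Users with a 100-row preference: ' + ga.getAggregate('COUNT'));
    }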

 

Weekly Activities

Review Scheduled Jobs

By reviewing your scheduled job activity, you can help ensure the smooth running of "background activities", such as scheduled reports, discovery sensors and other routine tasks. We're going to check for anything which is running for more than an hour (3,600,000 ms).

 

  1. Navigate to System Logs > Transactions (Background)
  2. Apply the following filter (note the response may take several minutes to return)
    ReviewScheduledJobs.png

 

If the one-hour filter returns no results, try the same again with a more stringent value, say half an hour (1,800,000 ms). Of course, some scheduled jobs are going to take a long time because, well, they have a lot of work to process. Due to the way the transaction log tables are stored and rotated in the database, it is not possible to use the "group by" function in the list view. You may therefore find it easier to do your trend analysis by exporting the result set to Excel.
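
Because the list view cannot group these results, another option is to do the grouping server-side from a background script. This is only a sketch: it assumes the Transactions (Background) module reads from the syslog_transaction table and that the response_time and url columns hold the values shown in the list, and aggregation over a large, rotated table can itself be slow, so run it out of hours.

    // Sketch: how often each background transaction ran for longer than an hour this week.
    var ga = new GlideAggregate('syslog_transaction');
    ga.addQuery('response_time', '>', 3600000);
    ga.addQuery('sys_created_on', '>=', gs.daysAgoStart(7));
    ga.addAggregate('COUNT');
    ga.groupBy('url');
    ga.query();
    while (ga.next()) {
        gs.info(ga.getValue('url') + ' exceeded one hour ' + ga.getAggregate('COUNT') + ' time(s)');
    }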

If there is a job which has executed multiple times for a long duration, drill down into what is taking the time. The most common culprits are GlideRecord queries which request information from large tables with un-indexed 'where' clauses or sorts/groups. These are often found inside scripted transform maps and sometimes inside script includes or business rules.

 

Configure scheduled jobs to use 'burst' scheduler workers

For instances running Dublin or newer releases, you can insulate against "clogged up" scheduler worker queues by setting the 'priority' field on the sys_trigger entry for the scheduled job to "25". By making this change, you can ensure that core jobs such as event processors, the SMTP sender, the POP reader and the SMS sender get triggered in a timely fashion. Should all the scheduler workers be busy with other jobs, an "important" job which is more than 60 seconds past due will spawn a 'burst scheduler worker' and execute in parallel to the eight core scheduler workers on the node. This is taken care of out of the box from the Geneva release onwards. Caution: this is good insulation, but it should not be used as an excuse not to address the root causes of the other "long running" or "high volume" scheduled jobs.
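
As a sketch of the change itself, the priority can be set from a background script (or simply edited on the sys_trigger form). The job name below is a placeholder, not a recommendation; substitute your own scheduled job and test in sub-production first.

    // Sketch: flag a scheduled job as eligible for burst scheduler workers.
    // 'My Important Job' is a hypothetical name - use your own sys_trigger entry.
    var job = new GlideRecord('sys_trigger');
    job.addQuery('name', 'My Important Job');
    job.query();
    if (job.next()) {
        job.setValue('priority', 25);
        job.update();
    }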

 

Check for "excessive" logging

Looking out for unusually large log files is a relatively crude (although surprisingly accurate) way of spotting potential problems which warrant closer attention. Navigate to System Logs > Utilities > Node Log File Download. Apply a filter of "Name starts with local". This will show you all the application logs for the node your session is active on. Note that the most recent 5 days of log files are unzipped, the remaining files are zipped. The size value is measured in kilobytes. If you notice that one day is significantly larger than the others, or there is a progressive increase in file size, there may be cause to investigate further.

 

ServiceNow Gems has built a Chrome extension that allows you to select the node to which you log in. See the blog entry here: https://servicenowgems.com/2016/12/12/node-switcher-google-chrome-extension/#more-913

 

Bear in mind that all transactions and associated parameters are tracked in the logs, so if the number of users has ramped up or a new piece of functionality has gone live, the log files will naturally increase.

Aside from the obvious "there were errors", a significant spike in log file size may indicate that gs.log or gs.print statements which were used in sub-production testing have not been removed. Unnecessary logging is a bad thing: it makes the tables bulky, which slows maintenance activities such as backups, and it also makes searching the syslog table slow and cumbersome.
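
Two quick checks from a background script can help quantify this. The first counts how many syslog rows were written yesterday; the second looks for business rules that still contain gs.print calls. Both are sketches against out-of-box tables, and the text search will also match legitimate logging, so treat the results as candidates for review rather than a definitive list.

    // Sketch 1: how many log rows were written yesterday?
    var ga = new GlideAggregate('syslog');
    ga.addQuery('sys_created_on', '>=', gs.daysAgoStart(1));
    ga.addQuery('sys_created_on', '<=', gs.daysAgoEnd(1));
    ga.addAggregate('COUNT');
    ga.query();
    if (ga.next()) {
        gs.info('syslog rows created yesterday: ' + ga.getAggregate('COUNT'));
    }

    // Sketch 2: active business rules that still contain gs.print statements.
    var br = new GlideRecord('sys_script');
    br.addQuery('active', true);
    br.addQuery('script', 'CONTAINS', 'gs.print');
    br.query();
    while (br.next()) {
        gs.info('Review business rule: ' + br.getValue('name'));
    }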

 

Trend "top 20" transactions

You may find it useful to trend the top 20 transactions. These may consist solely of the 20 most-executed transactions in a given week, or you may choose to track the most "business critical" transactions (e.g. incident / catalog), or indeed a mix of the two. Common causes of slow form load times are related/embedded lists (either a "bad" query or filter, or the number of rows being requested), a high number of Ajax calls (could these be consolidated into fewer round trips?), or an inefficient client-side script.
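
A starting point for that trend is to aggregate last week's transaction log by URL and record the execution count and average response time, which you can then paste into your tracking spreadsheet. Again, this is a sketch assuming the out-of-box syslog_transaction columns; sort and trim to your top 20 in the spreadsheet.

    // Sketch: execution count and average response time per transaction URL, last 7 days.
    var ga = new GlideAggregate('syslog_transaction');
    ga.addQuery('sys_created_on', '>=', gs.daysAgoStart(7));
    ga.addAggregate('COUNT');
    ga.addAggregate('AVG', 'response_time');
    ga.groupBy('url');
    ga.query();
    while (ga.next()) {
        gs.info(ga.getValue('url') + ' | count: ' + ga.getAggregate('COUNT') +
            ' | avg ms: ' + ga.getAggregate('AVG', 'response_time'));
    }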

 

Monthly Activities

Monitor table growth rates

You may choose to extend the spreadsheet you are using to track the top 20 request response times to include the number of rows in each table. There are broadly two types of data stored in the instance: persistent data (such as task and user information) which you want to retain, and transient data (such as log information or staging data for imports and integrations) which needs to be cleaned away after a given period of time.

Growing data sets can be associated with an increase in response times for end users and an increase in execution time for maintenance tasks such as cloning, backup and restore.

It is normal to see growth of persistent data over time. If you see a correlation between increased table size and a slowdown in response time, it is possible that there are list definitions or GlideRecord queries which need to be refactored, or supported by appropriate indexing, to accommodate the growth of the data.

When looking at the transient data tables, it is important that a sensible data retention policy is being enforced. If a record is "throw-away", i.e. there is no benefit in retaining it once it has been processed, a table cleaner should be set up to remove the row in a timely fashion. With the advent of solid state drives, the table cleaner can comfortably delete approximately one million rows from a table on a daily basis and keep up. In scenarios where more than a million rows need to be purged at a time, table rotation may be a more appropriate solution. Table rotation is a non-trivial piece of platform functionality and, while it is open to users with the admin role, it is recommended that a ticket be raised with the Technical Support team to investigate your individual requirements.

The number of rows can easily be obtained by accessing the list view for that table, for example "incident.list" in the navigator, or "<my instance>.service-now.com/incident_list.do" in the address bar. A count of the table will be displayed. You may want to amend your "show x" records preference to be 10 or 20, to speed up the list rendering time (see earlier inset).
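
If you would rather capture these counts from a monthly scheduled script than read them off the list view, a sketch along these lines will do; the table names below are examples only, so substitute the tables you actually track.

    // Sketch: log row counts for a set of tables being tracked for growth.
    var tables = ['incident', 'sys_email', 'syslog', 'sys_attachment']; // examples only
    for (var i = 0; i < tables.length; i++) {
        var ga = new GlideAggregate(tables[i]);
        ga.addAggregate('COUNT');
        ga.query();
        if (ga.next()) {
            gs.info(tables[i] + ' row count: ' + ga.getAggregate('COUNT'));
        }
    }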

 

Review the "Slow Queries" log

Navigate to System Diagnostics > Stats > Slow Queries. The Slow Queries log groups slow database queries into similar patterns, providing you with an example set of parameters. The pattern is established by removing any specific values from the query structure and creating a hash, so that two queries with the same "hash" are considered the same pattern.

ReviewSlowQueries.png

The slow query log records patterns of slow queries since the beginning of time (or the last time sys_query_pattern was truncated). You may find the results more meaningful by applying a filter to show only patterns which were first sighted in the last month and which occurred more than 100 times. If you click through to an individual query pattern record, you are presented with an example URL where the query was generated from, the first and last sighting, the number of executions and the average execution time. The stack trace of the thread executing the query is also displayed so with some careful forensics, you can cross reference which element on screen requested the information. Once you know this, you have the opportunity to review the gauge / list which made the call and verify whether it would benefit from refactoring or supporting with an index. Many times something as simple as adding "active=1" to a query will significantly reduce the execution time.
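
To illustrate that last point, compare a broad GlideRecord query with one scoped to active records only; the extra condition gives the database a far more selective predicate on large task tables. The incident table and the group sys_id below are purely illustrative.

    var groupSysId = 'replace_with_a_real_group_sys_id'; // placeholder value

    // Before: scans open and closed incidents alike for the assignment group.
    var before = new GlideRecord('incident');
    before.addQuery('assignment_group', groupSysId);
    before.query();

    // After: adding active=true lets the database discard closed records early,
    // which often reduces execution time significantly on large tables.
    var after = new GlideRecord('incident');
    after.addQuery('active', true);
    after.addQuery('assignment_group', groupSysId);
    after.query();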

 

Review "Slow Scripts" [Added in Helsinki]

Similar to the "Slow Queries" module, in Helsinki and later, ServiceNow tracks average execution duration, total duration and total count for server and client-side scripts.

[NOTE: All "Moving average" fields - as shown in the screenshot below - were deprecated and are not usable in any current release (as of April 23, 2020, the current release family is Orlando). This is true for all slow pattern tables.]

Screenshot 2017-08-10 09.51.17.png

 

Review "Slow Transactions" [Added in Helsinki]

Another addition to the slow pattern tables is the "Slow Transactions" table. This feature works as an aggregator of the Transaction Logs table (syslog_transaction). The types of transactions tracked include just about everything you can think of: scheduled jobs, web service integrations, and end-user transactions (including REST /now/ui/ requests, AJAX and Angular).

Screenshot 2017-08-10 09.59.25.png

You can rank transactions by average execution time or see what new transactions have been introduced this month (using the First Sighting field).

Screenshot 2017-08-10 09.54.58.png

 

Redesigned "ServiceNow Performance" Homepage [Added in Jakarta]

The old "ServiceNow Performance" homepage has been completely redesigned in Jakarta. See the excellent documentation in the product doc pages under Performance metrics. The new graphs are much easier to use and understand, with click+drag zoom capability and various breakdown/drill-down options. One completely new type of metric that is tracked is Slow pattern metrics. Slow pattern metrics match against your Slow Scripts, Slow Transactions and Slow Queries (all inherited from the sys_pattern table) to provide a time chart that allows you to visually correlate slow patterns against system-wide performance events!

newPerformanceGraphs.png

 

Index Suggestion Tool [Added in Jakarta]

IndexSuggestions.png

See Index suggestions for slow queries. This tool, introduced in Jakarta, automatically suggests database indexes that might improve the slowest queries in your instance. The tool includes a feature that tracks the impact of the new index on the affected slow query over time and allows you to decide whether it made enough of an impact to keep the index in place. The product documentation has lots of detail about how to use this feature, so I will not duplicate it here. One caveat I will mention is that MySQL has a limit of 64 indexes per table. That may seem like a lot of indexes, but it can easily be reached when the indexes are on the "flattened" task table (the Table Per Hierarchy model).

 

Check user "rowcount" preferences

Perhaps rashly, out of the box we offer users the option to display anything from 10 to 100 rows in any list. The number of rows requested correlates directly with the time taken to render lists and forms. If enough users are requesting "high" numbers of rows, a platform-wide performance degradation may be experienced due to demands on memory in the JVM to render the lists, or CPU demands at the database layer due to expensive queries.

There are three things that can be done to address the issue:

1) Individual users can change their "rowcount" user preference via the hamburger icon (three horizontal lines) on the list UI header.

2) Administrators can manually set the values of the rowcount preference through the module "User Administration > User Preferences" or the list below that already has the filter added for rowcounts:

<my_instance>.service-now.com/sys_user_preference_list.do?sysparm_query=nameLIKErowcount%5Evalue!%3D20%5EORvalue%3DNULL%5Evalue!%3D50%5EORvalue%3DNULL%5Evalue!%3DNULL%5Evalue!%3D10%5EORvalue%3DNULL%5Evalue!%3D15%5EORvalue%3DNULL

3) Administrators can restrict the options that users are allowed to select by setting the "glide.ui.per_page" property 

NOTE: The rowcount setting becomes especially impactful when using the "group by field" option in the list UI. If rowcount is set to 100, each group in the list UI will have up to 100 records in it. For every record displayed in the UI, the platform has to execute hundreds of security and rendering activities. This can all add up very quickly.
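
A sketch combining options 2 and 3 is shown below: it resets existing 100-row preferences to 50 and then trims the selectable options. The comma-separated value format for glide.ui.per_page is an assumption on my part, so confirm it against the property's description on your instance, and test any bulk preference update in sub-production first.

    // Sketch for option 2: reset existing 100-row preferences back to 50.
    var pref = new GlideRecord('sys_user_preference');
    pref.addQuery('name', 'rowcount');
    pref.addQuery('value', '100');
    pref.query();
    while (pref.next()) {
        pref.setValue('value', '50');
        pref.update();
    }

    // Sketch for option 3: limit the row-count choices offered in list UIs.
    // Assumed comma-separated format - verify before applying.
    gs.setProperty('glide.ui.per_page', '10,15,20,50');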

Additional resources:

Troubleshooting Performance - ServiceNow Doc Site

Client Transaction Timings - ServiceNow Doc Site

ServiceNow Dev Site - Technical Best Practices

Improve performance by displaying "just enough" data

Performance Best Practice for Efficient Queries - Top 10 Practices


Comments
Mwatkins
ServiceNow Employee

Update coming soon! There are some great performance administration tools in Helsinki and Jakarta that didn't make it on to this list.


Deepak Ingale1
Mega Sage

We are waiting for an update


Mwatkins
ServiceNow Employee

Thanks for the reminder, Deepak! As promised, I have updated the document with details about some of the great performance administration features introduced in Helsinki and Jakarta. These are also outlined in the ServiceNow documentation site here:


Jakarta: Platform performance release notes


Helsinki: Database performance tuning tools


Deepak Ingale1
Mega Sage

Thank you Matthew for quick response


Matt88
Kilo Expert

Hi Matt



I'd appreciate it if you or someone else could corroborate these findings. I'm currently conducting a thorough investigation into the performance of some REST API transactions, and I've found the output for "Transaction Execution Time Trend" to be misleading.



The problem lies in the use of "total server execution time", which is aggregated data stored in time periods. As per the graph below, it would appear that execution time for my REST API is averaging 13 seconds, but looking at individual transactions in the transaction log for the same time span reveals that every API call actually completed in less than 2.5 seconds, except for 4 transactions which were still under 10 seconds.



In this regard, I don't really see any meaningful information to be gleaned from this graph - I can't even work out how the maximum and average are calculated (I found the minimum came from a period of time where a single execution required 982 ms).



Do you agree?


find_real_file.png



Regards


Matt


Mwatkins
ServiceNow Employee

I believe the graph you are showing is from the "ServiceNow Slow Performance 1-Day" homepage available in Helsinki and Istanbul. I am vaguely aware of problems with the accuracy of those graphs, so you may well be correct. All the "Slow Performance" homepages were removed in Jakarta, and the "ServiceNow Performance" homepage now replaces their functionality with the "Slow Pattern" graph set.


See Slow pattern metrics


BhupeshG
Giga Guru

I have a query



Why are tables extended from task? Why don't we have separate physical tables for incident, change, and so on?



Because of task table flattening, the indexes are exhausted.



How can we get a runtime log which shows the latest SQL statement at the top? The debug method is very clumsy.


adamjgreenberg
ServiceNow Employee

Bhupesh,



The union statements that joined large tables together were expensive, and as such the developers decided to flatten the table hierarchy for better performance.



The Slow Queries table will highlight queries with poor performance. You can also run an explain plan on each of these to ensure the query either has, or is properly using, the appropriate indexes.



You can also ask the CS-Performance team to get index usage by table if you happen to hit the 64 index limit.


Lukas15
Tera Contributor

Awesome work, thank you!


S Hall
ServiceNow Employee

Additional links for managing performance are found here

Gabriela Cortes
ServiceNow Employee

Question: when building a Service Catalog, many of the topics have variables specific to the process, and since the variables are all stored in another table, is indexing enough to generate the reports quickly? Is there something customers can consider to ensure we have the best efficiency in this regard?
