Instance Performance Maintenance and Administration

Version 10

    Introduction

         Instance performance administration, or “Instance Hygiene”, is probably not at the forefront of most ServiceNow administrators' minds. It is easily overlooked because the platform is fairly tolerant of the various demands users put on it. That said, with considered planning and periodic maintenance activities, you can ensure that your instance scales well and runs optimally.

         It is worth pointing out that the ServiceNow Cloud Operations teams have in place a comprehensive set of alerts and monitors, so we will always be notified if an instance is in a critical state. Our Support engineers (and our customers) will be notified via an incident on HI, with mitigating actions recommended to provide relief. The purpose of the recommendations made in this document is to enable customers to proactively prevent their instances from getting into a critical state in the first place.

         Due to the extensively customizable nature of the ServiceNow platform, the recommendations that follow are not an exhaustive list of preventative activities and some items will offer more benefit than others for your specific implementation.

    Daily Activities

    Review the “System Diagnostics” home page

    ReviewSystemDiagnostics.png

     

     

         The System Diagnostics homepage tracks some high-level statistics for each of the nodes (JVMs) in your instance. Values are either “real time” at the point the page is rendered or are cumulative counts (for example the transactions and errors values) since the node was last started (see JVM UP time). When reviewing this information, do not be concerned if the total number of JVM Classes differs between nodes. This metric shows the number of classes which have been loaded and subsequently unloaded on each JVM. Depending on what activities users have been performing on each node, there could legitimately be a disparity in what has been called since that JVM was last started.

         You may like to set up a spreadsheet (or a table in your instance) to track the uptime, number of errors since last restart, the number of transactions performed and the number of logged-in users for each node. While the platform does have built-in performance graphs which show this information, they are rendered on a per-node basis. If you spot an uncharacteristic jump in these numbers, it can be a good indicator that there is an underlying performance issue which needs to be identified and addressed.
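If you keep those readings in a spreadsheet, spotting an "uncharacteristic jump" can be automated. The following is a minimal Python sketch, not a platform feature: it assumes you have copied the cumulative error counter for one node into a list, one reading per day. The function names and the threshold factor of 3 are hypothetical choices for illustration.

```python
from statistics import median


def daily_deltas(cumulative):
    """Convert cumulative counter readings into per-reading increases.

    The counters on the diagnostics page are cumulative since the node was
    last started, so a reading lower than the previous one is treated as a
    restart and the new reading is taken as the increase.
    """
    deltas = []
    for prev, cur in zip(cumulative, cumulative[1:]):
        deltas.append(cur - prev if cur >= prev else cur)
    return deltas


def flag_jumps(deltas, factor=3.0):
    """Return indexes whose value exceeds `factor` times the median of all
    earlier values. The factor is an assumed threshold, tune it to taste."""
    flagged = []
    for i in range(1, len(deltas)):
        baseline = median(deltas[:i])
        if baseline > 0 and deltas[i] > factor * baseline:
            flagged.append(i)
    return flagged
```

For example, readings of `[100, 220, 330, 50, 900]` become increases of `[120, 110, 50, 850]` (the drop to 50 is treated as a restart), and only the last increase is flagged.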

         The System Diagnostics homepage has a number of very useful features out-of-the-box. It is the one place you can go to see performance details across all your instance's nodes. However, the page can be easily extended to include information about scheduler workers, semaphores and more. See some ideas for extending the page at How can I see statistics (xmlstats.do & stats.do) for all my nodes?

     

    Review yesterday’s “slow” transactions

         By reviewing the Transactions (All user) information, you can see which transactions are taking more than a specified amount of time. Be sure to have the Client Transaction Timings Plugin enabled to capture all the data. More details on this plugin can be found on https://wiki.servicenow.com/index.php?title=Client_Transaction_Timings.

    ReviewYesterday.png

         Filter showing all transactions created yesterday which took more than 3,000 ms to complete

     

         In this list, you can view the total response time along with a breakdown of its component parts (time spent rendering in the browser, time spent on the server processing the transaction and a calculated time spent in the network), details of which node processed the request, the IP address of the host making the request, the user making the request and of course when the transaction occurred. The session ID is also captured, so if you want to review the application logs to forensically unpick everything a user has performed in their session, you can do so.

         Look for patterns: is there a particular time of day when transactions execute slowly? Are the slow transactions all being processed by the same node (suggesting one or more transactions or background jobs consuming large quantities of memory), or are response times poor across all nodes (which typically signifies the database was working harder than usual, impacting all transactions)? You might notice that the ten slowest transactions were all issued by Joe Bloggs and are incident lists. If that is the case, you can review the user’s settings or impersonate that user and try to recreate the issue.
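If you export the filtered list, the "time of day" and "same node" questions can be answered with a quick tally. The Python sketch below is illustrative only: the keys `created`, `node` and `response_time_ms` are hypothetical names for columns in your export, and the timestamp format is an assumption.

```python
from collections import Counter
from datetime import datetime


def summarize_slow(rows, threshold_ms=3000):
    """Count slow transactions per hour of day and per node.

    `rows` is a list of dicts with the (assumed) keys 'created',
    'node' and 'response_time_ms'.
    """
    by_hour = Counter()
    by_node = Counter()
    for row in rows:
        if row["response_time_ms"] <= threshold_ms:
            continue  # not slow, ignore
        hour = datetime.strptime(row["created"], "%Y-%m-%d %H:%M:%S").hour
        by_hour[hour] += 1
        by_node[row["node"]] += 1
    return by_hour, by_node
```

A tally concentrated in one hour points to a time-of-day problem, while a tally concentrated on one node points to something specific to that JVM.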

     

    For more detailed instructions on how to work with the transaction logs, see our wiki article here or this KB article, Troubleshooting Guide: Using the Transaction Logs.

     

    Use case consideration

    How much data do your users need to interactively review in a single screen?

         If you identify that “list” transactions are slow, it is worth checking how much data your users are requesting. When a user elects to “Show 100 rows per page” on a list, this sets a user preference which means every single list in the platform will pull back 100 rows for that user. This includes related and embedded lists on forms as well as the list views where the preference was set.

    UseCaseConsideration.png

    User chooses to show 100 rows per page

     

         This becomes problematic when a table with many reference fields has to render a list. The platform has to build the relationships for all of the reference fields for all the rows being displayed on screen. For most service environments, agents can’t practically use more than 20 to 30 rows at a time. If the page load is fast, a good case can be made for “paging” to the next chunk of results rather than scrolling down. Consider removing any options for more than 50 rows at a time from the platform. You can refer to Improve performance by displaying "just enough" data for further details or contact ServiceNow Customer Support for assistance.

     

    Weekly Activities

    Review Scheduled Jobs

         By reviewing your scheduled job activity, you can help ensure the smooth running of “background activities”, such as scheduled reports, discovery sensors and other routine tasks. We’re going to check for anything which is running for more than an hour (3,600,000 ms).

     

    1. Navigate to System Logs > Transactions (Background)
    2. Apply the following filter (note the response may take several minutes to return)
      ReviewScheduledJobs.png

     

         If the filter returns no results for an hour, try again with a lower threshold, say half an hour (1,800,000 ms). Of course, some scheduled jobs are going to take a long time because, well, they have a lot of work to process. Due to the way the transaction log tables are stored and rotated in the database, it is not possible to use the “group by” function in the list view. You may therefore find it easier to do your trend analysis by exporting the result set to Excel.
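As an alternative to Excel, the exported result set can be grouped with a few lines of Python. This is a sketch under assumptions: the export is a CSV file, and the column names `url` (which job ran) and `response_time` (duration in ms) are hypothetical, so adjust them to match the headers in your own export.

```python
import csv
import io
from collections import defaultdict


def long_running_jobs(csv_text, threshold_ms=3_600_000):
    """Group exported background transactions by job name.

    Returns (job name, number of long runs, average duration in ms)
    tuples, most frequent first. Column names are assumed, see above.
    """
    durations = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        ms = int(row["response_time"])
        if ms > threshold_ms:
            durations[row["url"]].append(ms)
    summary = [(name, len(v), sum(v) / len(v)) for name, v in durations.items()]
    return sorted(summary, key=lambda item: item[1], reverse=True)
```

Jobs that appear repeatedly at the top of this summary are the ones worth drilling into first.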

         If there is a job which has executed multiple times for a long duration, drill down into what is taking the time. The most common culprits are GlideRecord queries which request information from large tables with un-indexed ‘where’ clauses or sorts/groups. These are often found inside scripted transform maps and sometimes inside script includes or business rules.

     

    Configure scheduled jobs to use ‘burst’ scheduler workers

         For instances running Dublin or newer releases, you can insulate against “clogged up” scheduler worker queues by setting the ‘priority’ field on the sys_trigger entry for the scheduled job to “25”. By making this change, you can ensure that core jobs such as event processors, SMTP sender, POP reader and SMS sender get triggered in a timely fashion. Should all the scheduler workers be busy with other jobs, an “important” job which is more than 60 seconds past due will spawn a ‘burst scheduler worker’ and execute in parallel to the core 8 schedulers on the node. From the Geneva release onwards, this is taken care of out of the box. Caution: This is good insulation, but it should not be used as an excuse to avoid addressing the root causes of other “long running” or “high volume” scheduled jobs.

     

    Check for “excessive” logging

         Looking out for unusually large log files is a relatively crude (although surprisingly accurate) way of spotting potential problems which warrant closer attention. Navigate to System Logs > Utilities > Node Log File Download. Apply a filter of “Name starts with local”. This will show you all the application logs for the node your session is active on. Note that the most recent 5 days of log files are unzipped; the remaining files are zipped. The size value is measured in kilobytes. If you notice that one day is significantly larger than the others, or there is a progressive increase in file size, there may be cause to investigate further.
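Both checks ("one day significantly larger" and "a progressive increase") are easy to script once you have noted down the sizes. The Python sketch below is illustrative: the file names in the usage note are made up, and "significantly larger" is assumed here to mean more than twice the median size. Compare like with like, since zipped and unzipped files are not directly comparable.

```python
from statistics import median


def oversized_logs(sizes_kb, factor=2.0):
    """Return names of log files larger than `factor` times the median size.

    `sizes_kb` maps file name -> size in kilobytes. The factor of 2 is an
    assumed definition of "significantly larger".
    """
    if not sizes_kb:
        return []
    typical = median(sizes_kb.values())
    return sorted(name for name, size in sizes_kb.items() if size > factor * typical)


def steadily_growing(sizes_in_date_order):
    """True when every file is larger than the one before it
    (requires at least three files to call it a trend)."""
    sizes = list(sizes_in_date_order)
    return len(sizes) >= 3 and all(b > a for a, b in zip(sizes, sizes[1:]))
```

For instance, `oversized_logs({"log_day1": 100, "log_day2": 110, "log_day3": 95, "log_day4": 400, "log_day5": 105})` returns `["log_day4"]`, because the median is 105 KB and only the 400 KB file exceeds twice that.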

     

    ServiceNow Gems has built a Chrome extension that allows you to select the node to which you log in. See the blog entry here: https://servicenowgems.com/2016/12/12/node-switcher-google-chrome-extension/#more-913

     

         Bear in mind that all transactions and associated parameters are tracked in the logs, so if the number of users has ramped up or a new piece of functionality has gone live, the log files will naturally increase.

         Aside from the obvious “there were errors”, a significant spike in log file size may indicate that gs.log or gs.print statements which were used in sub-production testing have not been removed. Unnecessary logging is a bad thing – it makes the tables bulky, which slows maintenance activities such as backups and also makes searching the syslog table slow and cumbersome.

     

    Trend “top 20” transactions

         You may find it useful to trend the top 20 transactions. This list may consist solely of the 20 most executed transactions in a given week, or you may choose to track the most “business critical” transactions (e.g. incident / catalog), or indeed a mix of both. Common causes of slow form load times are related/embedded lists (either a “bad” query or filter, or the number of rows being requested), a high number of Ajax calls (could these be consolidated into fewer round trips?) or an inefficient client-side script. Refer to Troubleshooting Performance - ServiceNow Wiki for advice on how to investigate the performance of individual transactions.

     

    Monthly Activities

    Monitor table growth rates

         You may choose to extend the spreadsheet you are using to track the top 20 request response times to include the number of rows in the table. There are broadly two types of data which will be stored in the instance: persistent data (such as task/user info) which you want to retain and transient data (such as log information/staging data for imports or integrations) which needs to be cleaned away after a given period of time.

    Growing data sets can be associated with an increase in response times for end users and an increase in execution time for maintenance tasks such as cloning, backup and restore.

         It is normal to see growth of persistent data over time. If you see a correlation between increased table size and a slowdown in response time, it is possible that there are list definitions or GlideRecord queries which need to be refactored or supported by appropriate indexing to accommodate the growth of the data.
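Once you have a few months of row counts in your tracking spreadsheet, the month-over-month growth rate is a simple calculation. A minimal Python sketch (the function name is hypothetical):

```python
def growth_rates(monthly_counts):
    """Return month-over-month growth as percentages, rounded to 1 decimal.

    `monthly_counts` is a list of row counts, oldest first. A month that
    follows an empty table yields None because the rate is undefined.
    """
    rates = []
    for prev, cur in zip(monthly_counts, monthly_counts[1:]):
        if prev == 0:
            rates.append(None)
        else:
            rates.append(round((cur - prev) / prev * 100, 1))
    return rates
```

For counts of 1,000, 1,100 and 1,210 rows this gives 10.0% for each month. Putting these rates next to the response times you are already tracking makes any correlation easier to see.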

         When looking at the transient data tables, it is important that a sensible data retention policy is being enforced. If a record is “throw away”, i.e. there is no benefit in retaining it once it has been processed, a table cleaner should be set up to remove the row in a timely fashion. With the advent of solid state drives, the table cleaner can comfortably delete approximately 1m rows from a table on a daily basis and keep up. In scenarios where more than 1m rows need to be purged at a time, table rotation may be a more appropriate solution. Table rotation is a non-trivial piece of platform functionality and, while it is open to users with the admin role, it is recommended that a ticket be raised with the Technical Support team to investigate your individual requirements.

         The number of rows can easily be obtained by accessing the list view for that table, for example “incident.list” in the navigator, or “<my instance>.service-now.com/incident_list.do” in the address bar. A count of the table will be displayed. You may want to amend your “show x” records preference to be 10 or 20, to speed up the list rendering time (see earlier inset).

     

    Review the “Slow Queries” log

         Navigate to System Diagnostics > Stats > Slow Queries. The platform records any SQL statement which takes more than 100ms to complete. The Slow Queries log groups these transactions into similar patterns, providing you with an example set of parameters.

    ReviewSlowQueries.png

         The slow query log records patterns of slow queries since the beginning of time (or the last time sys_query_pattern was truncated). You may find the results more meaningful by applying a filter to show only patterns which were first sighted in the last month and which occurred more than 100 times. If you click through to an individual query pattern record, you are presented with an example URL where the query was generated from, the first and last sighting, the number of executions and the average execution time. The stack trace of the thread executing the query is also displayed so with some careful forensics, you can cross reference which element on screen requested the information. Once you know this, you have the opportunity to review the gauge / list which made the call and verify whether it would benefit from refactoring or supporting with an index. Many times something as simple as adding “active=1” to a query will significantly reduce the execution time.
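The suggested filter (first sighted in the last month, more than 100 occurrences) can also be applied to an export of the list, with the added benefit of ranking patterns by their total cost. This Python sketch is illustrative: the keys `first_sighting`, `count` and `avg_ms` are hypothetical names for the exported columns.

```python
from datetime import date, timedelta


def recent_frequent_patterns(patterns, today, days=30, min_count=100):
    """Keep patterns first seen within `days` of `today` that occurred more
    than `min_count` times, ordered by total time (count * average).

    Each pattern is a dict with the (assumed) keys 'first_sighting'
    (a datetime.date), 'count' and 'avg_ms'.
    """
    cutoff = today - timedelta(days=days)
    keep = [
        p for p in patterns
        if p["first_sighting"] >= cutoff and p["count"] > min_count
    ]
    return sorted(keep, key=lambda p: p["count"] * p["avg_ms"], reverse=True)
```

Ranking by total time rather than average time is a design choice: a moderately slow query executed thousands of times often costs the database more than a very slow query executed a handful of times.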

     

    Review "Slow Scripts" [Added in Helsinki]

         Similar to the "Slow Queries" module, in Helsinki and later, ServiceNow tracks average execution duration, total duration and total count for server-side and client-side scripts. You can also see the monthly, daily and hourly moving averages; you may need to personalize the list view to see these (Personalize a v2 list). These aggregate average times are very helpful for seeing whether a script recently became slower (daily is much higher than monthly) or for measuring the effectiveness of improvements that you make (daily is much lower than monthly).
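The daily-versus-monthly comparison described above can be expressed as a simple ratio test. The Python sketch below is an illustration, not platform behavior: the function name is hypothetical and the ratio of 1.5 is an assumed definition of "much higher" or "much lower".

```python
def classify_trend(daily_avg_ms, monthly_avg_ms, ratio=1.5):
    """Compare a script's daily moving average with its monthly one.

    Returns 'slower' when the daily average is at least `ratio` times the
    monthly average, 'faster' when it is at most 1/ratio of it, and
    'stable' otherwise. The ratio is an assumed threshold.
    """
    if monthly_avg_ms <= 0:
        return "unknown"
    change = daily_avg_ms / monthly_avg_ms
    if change >= ratio:
        return "slower"
    if change <= 1 / ratio:
        return "faster"
    return "stable"
```

A script with a daily average of 300 ms against a monthly average of 100 ms is classified as "slower", which is the signal to look at what changed recently.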

    Screenshot 2017-08-10 09.51.17.png

    Review "Slow Transactions" [Added in Helsinki]

         Another addition to the slow pattern tables is the "Slow Transactions" table. This feature works as an aggregator of the Transaction Logs table (syslog_transaction). The types of transactions tracked include just about everything you can think of: scheduled jobs, web service integrations and end user transactions (including REST /now/ui/, AJAX and Angular).

    Screenshot 2017-08-10 09.59.25.png

    You can rank transactions by average execution time or see what new transactions have been introduced this month (using the First Sighting field).

    Screenshot 2017-08-10 09.54.58.png

     

    Redesigned "ServiceNow Performance" Homepage [Added in Jakarta]

         The old "ServiceNow Performance" homepage has been completely redesigned in Jakarta. See the excellent documentation in the product doc pages under Performance metrics. The new graphs are much easier to use and understand, with click+drag zoom capability and various breakdown/drill-down options. One completely new type of metric that is tracked is Slow pattern metrics. Slow pattern metrics match against your Slow Scripts, Slow Transactions and Slow Queries (all inherited from the sys_pattern table) to provide a time chart that allows you to visually correlate slow patterns against system-wide performance events!

    newPerformanceGraphs.png

    Index Suggestion Tool [Added in Jakarta]

    IndexSuggestions.png

         See Index suggestions for slow queries. This tool, introduced in Jakarta, automatically suggests database indexes that might improve the slowest queries in your instance. The tool includes a feature that tracks the impact of the new index on the affected slow query over time and allows you to decide whether it made enough of an impact to keep the index in place. The product documentation has lots of details about how to use this feature, so I will not duplicate them here. One caveat I will mention is that MySQL has a limit of 64 indexes per table. That may seem like a lot of indexes, but it can easily be hit when the indexes are on the "flattened" task table under the Table Per Hierarchy model.

     

    Check user “rowcount” preferences

         Perhaps rashly, out of the box we offer users the option to display anything from 10 to 100 rows in any list. The number of rows requested correlates directly with the time taken to render lists and forms. If enough users are requesting “high” numbers of rows, a platform-wide performance degradation may be experienced due to demands on memory in the JVM to render the lists or CPU demands at the database layer due to expensive queries.

    There are three things that can be done to address the issue:

    1) Individual users can change their “rowcount” user preference via the hamburger icon (three horizontal lines) on the list UI header.

    2) Administrators can manually set the values of the rowcount preference through the module "User Administration > User Preferences" or the list below that already has the filter added for rowcounts:

    <my_instance>.service-now.com/sys_user_preference_list.do?sysparm_query=nameLIKErowcount%5Evalue!%3D20%5EORvalue%3DNULL%5Evalue!%3D50%5EORvalue%3DNULL%5Evalue!%3DNULL%5Evalue!%3D10%5EORvalue%3DNULL%5Evalue!%3D15%5EORvalue%3DNULL

    3) Administrators can restrict the options that users are allowed to select by setting the "glide.ui.per_page" property

    http://wiki.servicenow.com/index.php?title=System_Performance_Best_Practices - Default_Row_Count
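The filter in option 2 is hard to read in its URL-encoded form. The short Python sketch below decodes it using only the standard library so you can see which values it excludes. The helper names are hypothetical; the encoded string is copied from the URL above.

```python
from urllib.parse import quote, unquote

# Encoded query copied from the sys_user_preference_list.do URL above.
ENCODED = (
    "nameLIKErowcount%5Evalue!%3D20%5EORvalue%3DNULL%5Evalue!%3D50"
    "%5EORvalue%3DNULL%5Evalue!%3DNULL%5Evalue!%3D10%5EORvalue%3DNULL"
    "%5Evalue!%3D15%5EORvalue%3DNULL"
)


def decode_query(encoded):
    """Percent-decode the query and split it into its '^'-separated terms."""
    return unquote(encoded).split("^")


def excluded_values(terms):
    """Collect the numeric row counts that the filter excludes."""
    prefix = "value!="
    return [
        int(t[len(prefix):])
        for t in terms
        if t.startswith(prefix) and t[len(prefix):].isdigit()
    ]


def encode_query(terms):
    """Re-encode terms for use in a URL, leaving '!' unescaped as above."""
    return quote("^".join(terms), safe="!")
```

Decoding shows that the filter matches preferences whose name contains "rowcount" and whose value is not empty and not 10, 15, 20 or 50, in other words the users who have chosen a larger row count.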

     

    NOTE: The rowcount setting becomes especially impactful when using the "group by field" option in the list UI. If rowcount is set to 100, each group in the list UI will have up to 100 records in it. For every record displayed in the UI, the platform has to execute hundreds of security and rendering activities. This can all add up very quickly.

    Additional resources:

    Troubleshooting Performance - ServiceNow Wiki

    Platform performance http://wiki.servicenow.com/index.php?title=System_Performance_Best_Practices

    Client Transaction Timings - ServiceNow Wiki

    Improve performance by displaying "just enough" data

    Performance Best Practice for Efficient Queries - Top 10 Practices

    _________________________
    Matthew Watkins, ServiceNow
