GTSPerformance
Tera Guru

 


Overview

This guide is written by the ServiceNow Technical Support Performance team (All Articles). We are a global group of experts who help our customers with performance issues. If you have questions about the content of this article, we will try to answer them here. However, if you have urgent questions or specific issues, please see the list of resources on our profile page: ServiceNowPerformanceGTS

 

This article assumes you are already very familiar with scripting in ServiceNow. It discusses the performance implications of, and best practices for, Before Query Business Rules (official ServiceNow product documentation link).

 

In one of our earlier community articles (Performance Best Practice for Before Query Business Rules) we referred to the possibility of caching data. To jog the reader's memory:

 

Leverage some type of caching strategy to avoid frequent lookups of static data. For example, a user's region does not frequently change. Instead of adding a complex condition that requires a JOIN between the region table and a many-to-many table to every single task query, is there a way that you can just cache each user's region membership in memory for the duration of their user session and add a simple condition to task that does not require a JOIN (e.g. task.region IN (...list of region sys_id's from memory to which the user belongs...))? This can be done using an actual in-memory cache (e.g. gs.getSession().putClientData()) and/or using a "truth table" that stores the simple results of more complex calculations.

 

In this article we will look more closely at the type of caching strategies which are commonly used and the difference these can make to transactional (and therefore instance) performance. To help make this relatable we will describe this in the context of a customer escalation recently worked on by our team. Although this customer had a use case specific to them, the concepts in this article can be applied in a wide variety of scenarios.

 

Background

The customer's instance was accessed by users from multiple geographic locations (territories) and it was a business requirement that data in certain tables should be segregated based on territory. At a high level this was designed as follows:

  • Each user was assigned a specific territory according to their geographic location (via a custom field on the sys_user table)
  • Territories and their relationships were defined via records in a custom 'territory' table - the way in which this was implemented meant that there was the concept of 'territory hierarchies' (i.e. users in the United Kingdom should have access to records owned by / in the Northern Ireland, Wales, Scotland, and England territories whereas users in the Great Britain territory should only have access to the Wales, Scotland, and England territories)
  • Records in certain key tables (such as task and cmdb) were assigned an 'owning territory' via a custom field which referenced the 'territory' table

To enforce data segregation the customer had implemented a set of Before Query Business Rules and a script include. Again, at a high level, these operated as follows:

  • Each time a user executed a transaction which triggered SQL queries against a key / segregated table one of the before query business rules would be executed
  • The corresponding business rule would call one of many functions in the script include (depending on the exact nature of segregation required)
  • The script include would use the territory table to determine a list of territories to which that user should have access (depending on the end user territory and use case of the function called) before returning an encoded query - commonly this would read something similar to 'AND territory IN (territory1, territory2, ..., territoryN)'
  • The Business Rule would append the encoded query to the SQL query being executed (a minimal sketch of this pattern follows below)
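
For illustration, a minimal sketch of this pattern might look something like the following. The Script Include name (TerritorySegregationHelper) and function name (getEncodedQueryForTable) are hypothetical placeholders, not the customer's actual code:

// Before Query Business Rule on a segregated table (e.g. task)
(function executeRule(current, previous /*null when async*/) {
    // Hypothetical Script Include that walks the territory hierarchy for the current user
    var helper = new TerritorySegregationHelper();
    var encodedQuery = helper.getEncodedQueryForTable(current.getTableName(), gs.getUserID());

    // Append the territory restriction to the query being executed
    if (encodedQuery)
        current.addEncodedQuery(encodedQuery);
})(current, previous);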

 

Performance Issues

First and foremost we should point out that whilst Before Query Business Rules are a valid mechanism by which this type of use case can be implemented, they may not be the best solution. Again, as mentioned in our previous article:

 

Roles, ACL's, Domain Separation and other out-of-box security features are the preferred and recommended method of applying data security and/or segregation to ServiceNow data. They are optimized for performance and cause less technical debt than a custom scripted Query Business Rule. Ideally, if you are attempting to achieve data security/segregation in a way that will result in some logical evaluation before every single call to a frequently accessed table, you should try very hard to use something other than a Before Query Business Rule. It should probably be a last resort when features that are actually designed to accomplish the same goal have been exhausted.

 

In this escalation, however, the customer's objective was not to completely rewrite / redesign their data segregation code (which was deeply ingrained into their platform design) but simply to optimise what they already had. On reviewing the performance of the instance it was clear that these Business Rules were having a significant impact on certain transactions. For example:

 

2021-06-01 06:02:47 (585) Default-thread-5 6C7C6C6CDBE0FC98AC705F8BD396194C txid=2dc2b028db28 EXCESSIVE *** End  #11860627 /pm_project.do, user: userxzy, total time: 0:00:41.346, processing time: 0:00:41.344, total wait: 0:00:00.002, semaphore wait: 0:00:00.002, SQL time: 0:00:24.343 (count: 12,284), business rule: 0:00:27.257 (count: 2,998), ACL time: 0:00:01.435, UI Action time: 0:00:00.934, Cache build time: 0:00:00.682, source: xxx.xxx.xxx.xxx null

 

Note that the above transaction spent:

  • 41.344s executing on the application node (processing time)
  • 27.257s (i.e. 66% of total processing time) executing 2998 business rules

On further inspection it was found that the transaction:

  • Did not execute any individual Business Rules which were slow (i.e. took >= 100ms to execute) - Business Rule time was therefore high due to a very large number of reasonably quick (in isolation) Business Rules
  • Executed the Before Query data segregation Business Rules many thousands of times (often repeatedly against the same underlying table / function in the Script Include)

When the Before Query data segregation Business Rules were debugged it was found that:

  • A single invocation of a before query business rule was averaging around 10ms execution time
  • The vast majority of this 10ms was spent traversing the 'territory' table determining the hierarchy of territories the given user should have access to (and building the corresponding encoded query)

Unfortunately this is a scenario we see relatively often when working with customer instances - the Business Rule / Script Include code was well written and there is no reason to believe that it was not thoroughly tested before being deployed into production. It is likely, however, that whoever wrote this code found that it took an insignificant time to execute in isolation and therefore assumed that its impact on the platform / end users would be negligible. What they did not consider is that certain transactions might execute these Business Rules many thousands of times - even if a single execution of the Business Rule is fast, this design does not scale well (as evidenced above). Ultimately this caused significant performance degradation across the instance and a general loss of confidence in the ServiceNow platform.

 

Whilst working with the customer, one other aspect of design became apparent that we could ultimately use to our advantage. When we reviewed the Before Query Business Rules we found that they all:

  • Called one of three different functions (or entry points) in the corresponding script include
  • Passed the name of the table the Business Rule was running against and details of the current user to the Script Include

Likewise the Script Include would always return a consistent encoded query for each combination of function (entry point) / table / user. Furthermore, the design meant that the same function / table / user combinations were being passed to the Script Include over and over again within a single transaction. In short, the vast majority of Business Rule time was spent calculating the same encoded query over and over again. This meant that if the encoded query needed to be calculated only once per function / table / user combination (and then re-used where applicable) it was likely that we could significantly improve the performance / reduce the impact of these Business Rules.

 

The next question was how the platform could store / reuse specific encoded queries. There are a few options here - a cache table, the user's session, or the ScopedCacheManager API. We will discuss each (along with its pros and cons) below.

 

Using a Cache Table

Remember that, in this case, the Script Include produced an encoded query in response to a function (entry point) / table / user combination - as a result the following table was deployed to hold this information:

 

SQL> show columns from u_mf_cache;
+------+----------------+-------------+------+-----+---------+-------+
| Port | Field          | Type        | Null | Key | Default | Extra |
+------+----------------+-------------+------+-----+---------+-------+
| 3403 | u_encodedquery | mediumtext  | YES  |     |         |       | <=== HOLDS CALCULATED ENCODED QUERY
| 3403 | u_table        | varchar(40) | YES  |     |         |       | <=== TABLE PROVIDED TO SCRIPT INCLUDE
| 3403 | u_user         | varchar(32) | YES  | MUL |         |       | <=== USER SYS_ID PROVIDED TO SCRIPT INCLUDE
| 3403 | sys_id         | char(32)    | NO   | PRI |         |       |
| 3403 | sys_updated_by | varchar(40) | YES  |     |         |       |
| 3403 | sys_updated_on | datetime    | YES  |     |         |       | <=== DATE / TIME CACHE RECORD CREATED / LAST UPDATED
| 3403 | sys_created_by | varchar(40) | YES  |     |         |       |
| 3403 | sys_created_on | datetime    | YES  |     |         |       |
| 3403 | sys_mod_count  | int(11)     | YES  |     |         |       |
| 3403 | u_function     | varchar(40) | YES  |     |         |       | <=== FUNCTION / ENTRY POINT WITHIN SCRIPT INCLUDE CALLED BY BUSINESS RULE
+------+----------------+-------------+------+-----+---------+-------+ 

 

The Script Include was then modified such that:

  • As soon as it was called it would query the cache table looking for a record with corresponding function (u_function), table (u_table), and user (u_user)
  • If such a record was found, then the date / time the cache record was last updated / created (sys_updated_on) was compared against the current date / time - if the record was created / updated within a certain period (i.e. last 1 hour) the Script Include would immediately return the encoded query from the record in the cache table (u_encodedquery) and exit
  • If no such record existed or the cache record was too old (stale), then the Script Include would calculate the encoded query in the normal way but then insert results into the cache table (or update any existing record accordingly)

This mechanism meant that even if no record existed for a given function / table / user combination the Script Include would simply calculate the required encoded query (as it had been doing previously) and place the result in the cache table - this operation took approximately the same amount of time as a single invocation of the 'non caching' Script Include (i.e. it introduced negligible degradation for a single execution of the Business Rule / Script Include). If, however, a cache record did exist for the given function / table / user combination, then the Script Include (and therefore Business Rule) could 'short circuit'; completing significantly more quickly than the 'non caching' Script Include.
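
As a simplified sketch (the cache lookup wraps the original calculation, which is represented here by a hypothetical _calculateTerritoryEncodedQuery function), the modified Script Include logic looked broadly like this:

getEncodedQuery: function(fn, tableName, userID) {
    // Cache entries older than 1 hour are treated as stale
    var oneHourAgo = new GlideDateTime();
    oneHourAgo.addSeconds(-3600);

    var cache = new GlideRecord('u_mf_cache');
    cache.addQuery('u_function', fn);
    cache.addQuery('u_table', tableName);
    cache.addQuery('u_user', userID);
    cache.query();

    if (cache.next()) {
        // Internal date format (yyyy-MM-dd HH:mm:ss, UTC) sorts lexically, so a string compare is safe here
        if (cache.getValue('sys_updated_on') >= oneHourAgo.getValue())
            return cache.getValue('u_encodedquery'); // cache hit - short circuit
    } else {
        cache.initialize(); // no cache record yet - prepare a new one
    }

    // Cache miss (or stale entry) - calculate as before, then write the result back to the cache table
    var encodedQuery = this._calculateTerritoryEncodedQuery(fn, tableName, userID);
    cache.setValue('u_function', fn);
    cache.setValue('u_table', tableName);
    cache.setValue('u_user', userID);
    cache.setValue('u_encodedquery', encodedQuery);
    if (cache.isNewRecord())
        cache.insert();
    else
        cache.update();
    return encodedQuery;
}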

 

In summary, it was expected that, for transactions which might only execute the Business Rules a handful of times with different function / table / user combinations, any performance degradation from maintaining the cache table would be negligible; certainly not enough to be noticed by end users. Transactions that executed the Business Rules many thousands of times with repeated function / table / user combinations, however, could be expected to show significant performance improvements as the vast majority of Business Rule executions would 'short circuit', exit early, and, therefore, the overall Business Rule time of the transactions would be drastically reduced.

 

Note that the structure of the cache table and design of the Script Include meant that the cache table would constantly service SQL queries such as 'SELECT ... FROM u_mf_cache WHERE u_user = [x] AND u_table = [y] AND u_function = [z]'. To ensure that these queries remained performant under load and as the size of the table increased a composite index was added across the (u_user, u_table) columns.

 

Advantages of a cache table:

  • No restriction as to the size of data (i.e. encoded queries) which can be placed in the table - as such this approach makes sense if large data structures / strings need to be cached
  • Simple to understand and implement - the cache table can be interacted with via standard APIs such as GlideRecord
  • Allows security issues to be dealt with easily - consider a user who has an incorrect territory associated with them: this will generate incorrect encoded queries in the cache table, allowing the user access to data which they shouldn't have. A Business Rule could be added to the sys_user table which, if a user's territory changes, immediately purges all of that user's cache entries (as sketched below). This would ensure that as soon as the incorrect territory is corrected the user is immediately forced to generate / use new (correct) encoded queries
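
For example, a minimal sketch of such a purge might be an after-update Business Rule along these lines (the condition "territory field changes" refers to whatever custom field your instance actually uses):

// After Business Rule on sys_user, condition: the user's territory field changes
(function executeRule(current, previous /*null when async*/) {
    // Remove every cached encoded query for this user so it is rebuilt on next use
    var cache = new GlideRecord('u_mf_cache');
    cache.addQuery('u_user', current.getUniqueValue());
    cache.deleteMultiple();
})(current, previous);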

Disadvantages of a cache table:

  • Every single invocation of a before query business rule will perform a SQL query against the cache table - this adds (some) latency to transactions / the instance
  • In busy instances the cache table could become extremely large (in terms of record count) and busy (in terms of SQL queries / inserts / updates) which could cause the table to become a point of contention in the underlying database. A good indexing strategy (as described above) will help to mitigate this to an extent

Caching Data in User Sessions

Instead of caching data in a table in the underlying database, it is also possible to cache string-based data in a user's session in memory. Various APIs are available to do this, e.g.:

  • To cache data in a user's session:

 

var session = gs.getSession();
session.putClientData([key], [data]);

For example:
var session = gs.getSession();
session.putClientData('foo', 'bar');

 

  • To retrieve cached data from a user's session:

 

var session = gs.getSession();
session.getClientData([key]);

For example:
var session = gs.getSession();
gs.info(session.getClientData('foo'));
...
*** Script: bar

 

 

  • To clear cached data from a user's session:

 

var session = gs.getSession();
session.clearClientData([key]);

For example:
var session = gs.getSession();
session.clearClientData('foo');
gs.info(session.getClientData('foo'));
...
*** Script: null

 

To leverage caching data in user sessions the Script Include was further modified such that:

  • When an encoded query for a given function / table / user combination was built, a cache entry was created within the user's session, i.e.:
    • Key: mfc_[table]_[function]
    • Data: JSON String containing two pieces of information
      • Calculated encoded query
      • Unix timestamp tracking when the cache entry was created
  • When called, the Script Include would:
    • Immediately check the user's session to determine if a cache entry for the given table / function combination was found (i.e. key of mfc_[table]_[function])
    • If found then the Script Include would check the Unix timestamp held within the cache entry and compare against the current date / time
    • If the cache entry was recent enough (created in the past 1 hour) the Script Include would immediately return the cached encoded query (i.e. data from key with mfc_[table]_[function]) - if the cache entry was stale (i.e. created > 1 hour ago) the cache entry would be cleared
    • If no cache entry was found in the user's session or the cached data was found to be stale, the Script Include would recalculate the encoded query (as normal) and add a corresponding key / data back to the user's session

Note that keys for data within the user's session do not include details of the corresponding user (which was of great importance when using a cache table) - this is because each user's session cache is private to that user and is not shared. This means that any cache entries in a user's session can only be accessed by code initiated by that user and therefore it is not necessary to record details of the user in cache.
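
Put together, a simplified sketch of the session-based variant (again using a hypothetical _calculateTerritoryEncodedQuery function to represent the original, expensive calculation) looked broadly like this:

getEncodedQuery: function(fn, tableName) {
    var MAX_AGE_MS = 60 * 60 * 1000; // entries older than 1 hour are considered stale
    var key = 'mfc_' + tableName + '_' + fn;
    var session = gs.getSession();

    var raw = session.getClientData(key);
    if (raw) {
        var entry = JSON.parse(raw);
        if ((new Date().getTime() - entry.created) < MAX_AGE_MS)
            return entry.encodedQuery; // cache hit - short circuit
        session.clearClientData(key);  // stale - discard and recalculate below
    }

    // Cache miss - calculate as before, then store the result in the user's session
    var encodedQuery = this._calculateTerritoryEncodedQuery(fn, tableName, gs.getUserID());
    session.putClientData(key, JSON.stringify({
        encodedQuery: encodedQuery,
        created: new Date().getTime() // Unix timestamp (ms) used for the staleness check
    }));
    return encodedQuery;
}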

 

Advantages of caching data in a user's session:

  • The user's session can be interacted with via standard APIs - as a result it is extremely easy to implement caching within session objects
  • Session objects / data are held within memory (heap) on application nodes - as a result, interacting with the session (and therefore cached data) is extremely fast and does not involve any kind of database query
  • Session objects are created for both interactive (i.e. user based) and non-interactive (i.e. scheduled job) based sessions - as a result, caching data in user sessions is also valid when executing scheduled jobs

Disadvantages of caching data in a user's session:

  • Application nodes run with relatively small amounts of available memory (heap is always sized at 2Gb and a significant proportion of this is taken by other objects required during normal operation of the node) - if large / numerous cache entries are held in user sessions this can quickly cause memory contention on application nodes, leading to widespread performance degradation / service-impacting events. User sessions should only be used to cache a small number of relatively short pieces of data (a simple size guard is sketched after this list). For some ideas about how to keep your code efficient in terms of memory usage, see the article Performance Best Practices for Server-side Coding in ServiceNow and read the sections "Running out of memory due to storing dot-walked GlideElement fields", "Not limiting number of returned records when querying very large tables" and "Large arrays or objects". If your instance runs out of memory or runs low on memory it can cause an outage.
  • Sessions are created on login and destroyed on logout (causing all data held within the corresponding session object in memory to be lost). This is not of particular concern for user sessions which tend to be relatively long lived but may impact integrations which execute frequently using a new session each time as they may calculate / cache the same data over and over again with little benefit
  • Session objects are private (i.e. there is no way in which one user can manipulate the contents of any other user's session object) - if a user caches a problematic encoded query (due to them having an 'incorrect' territory associated) the problematic encoded query will continue to be used until either the user logs out (and their session is destroyed) or the cached entries are considered stale. Even if an administrator addresses the 'incorrect' territory, the user will still access data related to the prior 'territory' for a period of time
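
As a crude illustration of the first point, a guard can be used so that unexpectedly large values are never written to the session (the 32KB threshold below is an arbitrary example, not a documented platform limit):

// Helper that only caches reasonably small strings in the user's session
function putInSessionIfSmall(key, payload) {
    var MAX_CACHEABLE_LENGTH = 32 * 1024; // arbitrary example threshold
    if (payload && payload.length < MAX_CACHEABLE_LENGTH) {
        gs.getSession().putClientData(key, payload);
        return true;
    }
    // Too large to cache safely - callers simply recalculate on the next invocation
    gs.warn('Skipping session cache for key ' + key);
    return false;
}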

 

Caching Data using the ScopedCacheManager

The ScopedCacheManager was introduced in Tokyo as an API that lets developers implement instance-wide server-side caching using the concept of cacheable pairs. A cacheable pair is formed when cached values are paired with a persistent data structure. When the data structure is altered via a create / update / delete operation, the related cache is flushed - and on the next access attempt the cache will need to be rebuilt. For full documentation see the developer guide on our official doc site: https://docs.servicenow.com/bundle/washingtondc-api-reference/page/integrate/guides/scopedcachemgr/c...

 

Advantages of ScopedCacheManager

  • ScopedCacheManager is a fully supported ServiceNow platform cache mechanism. Using it keeps your instance closer to out-of-the-box and avoids some of the risks associated with customization when compared to the other methods discussed in this article.
  • ScopedCacheManager provides 4 types of cacheable pairs. Depending on the use case, this flexibility may allow developers to avoid unnecessary cache flushing.
  • ScopedCacheManager provides a maximum entry control and a Least Recently Used (LRU) data structure so that the cache does not grow too large and freshly accessed entries are not evicted.

 

Disadvantages of ScopedCacheManager

  • If not carefully designed, ScopedCacheManager is susceptible to excessive cache flushing, i.e. cache thrashing. Cache thrashing is when the cache is constantly being rebuilt, negating the performance advantage of having a value cached in the first place.
  • Since ScopedCacheManager is flushed based on updates to the underlying data store, developers do not control exactly when the cache is flushed. If the underlying data store is frequently updated this can lead to very short-lived cache objects. Conversely, if the underlying data is not frequently updated it can lead to very "stale" cache objects.
  • ScopedCacheManager has an API to limit the number of objects that can be stored, but it does not limit how much memory the cache consumes. Developers should take care to consider how much memory their cache could take if storing many large objects.

 

Performance Improvements Seen

To demonstrate the effectiveness of caching data significant testing was performed both with and without caching enabled. Initially, this testing was somewhat synthetic in that a Background Script was executed to call a single function within the script include 10,000 times, calculating total time taken - for example:

 

var impUser = new GlideImpersonate();
var initialUser = impUser.impersonate('d37741dcdbae2b0414f4bc2ffe96194c'); // switch to the test user; returns the previously impersonated user
var stopWatch = new GlideDateTime();
for (var i = 0; i < 10000; i++) {
   var smf = new [script_include]();
   smf.[function]('pm_project', 'd37741dcdbae2b0414f4bc2ffe96194c');
}
gs.info('TOTAL MS: ' + (new GlideDateTime().getNumericValue() - stopWatch.getNumericValue()) + ' AVERAGE PER ITERATION MS: ' + ((new GlideDateTime().getNumericValue() - stopWatch.getNumericValue()) / 10000));
impUser.impersonate(initialUser); // revert to the original user

 

For a given function, timings were as follows (note that this was similar across all functions within the Script Include):

  • No caching: 72.328s / 7.2328ms per iteration
  • Caching using table: 7.864s / 0.7864ms per iteration (89% improvement vs no caching)
  • Caching using session: 0.861s / 0.0861ms per iteration (99% improvement vs no caching)

Further testing involved identifying slow transactions being executed in the customer's production instance where those transactions were adversely affected by high Business Rule time (which, in turn, was determined to be due to large numbers of executions of Before Query Business Rules). In general, with caching enabled, transaction processing time (i.e. time spent executing on an application node) improved by between 55 - 65%. Transactions which were not adversely affected by execution of large numbers of Before Query Business Rules saw very limited (essentially no) performance degradation from use of caching.

 

Summary

Caching of data to improve performance is a somewhat advanced topic that is likely to only be of benefit in very specific scenarios (essentially wherever an expensive logical operation using a consistent set of inputs / returning a consistent set of results is executed over and over again). If your instance does have such a use case, however, implementing caching of data as described in this article can provide significant performance improvements and allow your instance to scale as expected.


Comments
Mwatkins
ServiceNow Employee

So much good content in this article! The way I'd sum this up is:

  1. Do you write a good deal of custom code?
  2. Does that code involve frequent lookups of the same exact data?
  3. Is the cost of those frequent lookups causing performance issues?

If you answered yes to all those things then you should probably think about caching!

One thing to watch out for: your data should be small enough to be safely saved in memory. Total accumulated memory per node should be less than 1MB - preferably much less. See the advice about large arrays/objects and GlideElements in Performance Best Practices for Server-side Coding in ServiceNow. Depending on how much data you want to cache, you might need to use the Cache Table method rather than Session Caching.

Gabriela Cortes
ServiceNow Employee

Question. When building a Service Catalog, many of the topics have variables specific to the process, and the variables are all stored in another table. Is indexing enough to generate the reports quickly? Is there something customers can consider to ensure we have the best efficiency in this regard?

GTSPerformance
Tera Guru

While this article doesn't mention Glide Properties (the sys_properties table), it seems prudent to add a note about them here since they are a server-side method of caching data that can be used to improve performance. One particularly confusing topic regarding Glide Properties is the performance impact incurred when they are changed. When you change a Glide Property it can cause a system-wide cache flush, leading to potentially severe performance degradation for anywhere from 5 to 30 minutes. In some rare cases it can even cause hours of impact. This has been observed in cases where nodes are down hard enough during the cache flush that the load balancer takes them offline, causing session imbalance. Let us explain why.

 

ServiceNow is a clustered platform with multiple "nodes" or JVMs working as a single "instance" of ServiceNow (e.g. acme.service-now.com). Properties (records in the sys_properties table) are cached in node memory. This is a way of avoiding database calls for commonly accessed values. When a property is changed on one node, it tells all the other nodes to dump their property caches and get the new values of all their properties from the database again. This has a completely trivial impact on system performance. This part happens regardless of whether the ignore_cache field is set to true or not.

 

However, if ignore_cache is set to false - this was the default value until the Rome or perhaps San Diego release - then we not only flush the specific cache related to properties, but we also flush the whole Glide System cache! Let me say that again and, please, pay attention to the double negative: if a property is set to not ignore the cache then we are telling it to flush the whole Glide System cache! This is what triggers the significant performance impact. So why would we do it? The reason that this cache flush is done is to make sure we flush any dependencies or stale values in other caches that might be related to the old value of the property that was just changed.

 

For example, imagine that you have a UI cache that stores the rendered HTML of a page on the server side, on a per-session basis. The purpose of such a cache would be to avoid rendering the same page over and over for the same user. Now suppose further that the way the HTML renders depends on the value of a property that was just changed. If we don't flush this UI cache then the old value of the property will still be used in the rendered HTML, and any changes you expect to see in the UI based on the change of the related property would not be reflected until the user starts a new session - i.e. until they log out and log back in. In this scenario you would want to make sure that ignore_cache is set to false so that we will not ignore the cache flush, thereby ensuring that any dependent cache is also flushed.

 

So, in summary, if you have a property value that will be frequently updated (more than, say, once a month) and you know that there are no other caches that might depend on the value of the property, then set ignore_cache = true. That way the system will only flush the property-specific cache when the property is updated and not the whole Glide System cache.
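
As a quick sketch of that last point (the property name below is purely illustrative), the flag can be set directly on the property record:

// Mark a frequently updated property so that changing it only flushes the property cache,
// not the whole Glide System cache ('my.frequently.updated.flag' is an illustrative name)
var prop = new GlideRecord('sys_properties');
if (prop.get('name', 'my.frequently.updated.flag')) {
    prop.setValue('ignore_cache', 'true');
    prop.update();
}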

Alternatively, you could just use some table besides sys_properties in which to store the value. However, we recognize that if you were to use a new custom table that would incur licensing charges and so this might not be a viable option.

 

Best regards, your performance team

ServiceNowPerformanceGTS

 

NOTE: This whole discussion about sys_properties.ignore_cache is only in relation to ServiceNow server-side caching. It has nothing to do with the caching mechanisms implemented on the client side (within browsers / user agents).

 

ArjunBasnet
Tera Expert

Thanks for this article @GTSPerformance, it provides great insight into the platform. Could you update this article to cover the usage of ScopedCacheManager and the role it plays, particularly in server-side data processing? Is there any other article that describes the usage and internal workings of ScopedCacheManager?

GTSPerformance
Tera Guru

@ArjunBasnet Thanks! We've updated the article to include a note about ScopedCacheManager. We haven't used this method ourselves, but a very similar feature has been available inside ServiceNow's java layer for a long time and we are familiar with it. It is a good addition to any developer's arsenal of caching methods. As with any caching method, be aware that ScopedCacheManager puts pressure on Java's heap memory and, therefore, carries risks and tradeoffs to system performance and scalability.

Version history
Last update: 02-06-2024 10:49 AM