06-01-2021 04:19 AM - edited 02-06-2024 10:49 AM
< Previous Article | Next Article >
Performance Best Practice for Before Query Business Rules | Database Performance: Improving Slow OR and JOIN Queries
This guide is written by the ServiceNow Technical Support Performance team (All Articles). We are a global group of experts who help our customers with performance issues. If you have questions about the content of this article we will try to answer them here. However, if you have urgent questions or specific issues, please see the list of resources on our profile page: ServiceNowPerformanceGTS
This article assumes you are very familiar with scripting in ServiceNow already. It will discuss the performance implications and best practices regarding Before Query Business Rules (official ServiceNow product documentation link).
In one of our earlier community articles (Performance Best Practice for Before Query Business Rules) a reference was made to potentially caching data. To jog the reader's memory:
Leverage some type of caching strategy to avoid frequent lookups of static data. For example, a user's region does not frequently change. Instead of adding a complex condition that requires a JOIN between the region table and a many-to-many table to every single task query, is there a way that you can just cache each user's region membership in memory for the duration of their user session and add a simple condition to task that does not require a JOIN (e.g. task.region IN (...list of region sys_id's from memory to which the user belongs...))? This can be done using an actual in-memory cache (e.g. gs.getSession().putClientData()) and/or using a "truth table" that stores the simple results of more complex calculations.
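The session-caching idea in that quote can be sketched in plain JavaScript. Since gs.getSession() only exists inside the platform, a minimal stand-in session object is used here; the cache key, the region sys_ids and queryRegionsFromDatabase() are all hypothetical names for illustration:

```javascript
// Stand-in for gs.getSession(); session client data values must be strings,
// so the sys_id list is serialized with JSON.
var session = {
  _data: {},
  putClientData: function (key, value) { this._data[key] = String(value); },
  getClientData: function (key) {
    return Object.prototype.hasOwnProperty.call(this._data, key) ? this._data[key] : null;
  }
};

var lookups = 0; // counts how often the "expensive" JOIN-style lookup runs

// Hypothetical expensive lookup (in ServiceNow this would be a GlideRecord
// query across the region table and a many-to-many table).
function queryRegionsFromDatabase(userSysId) {
  lookups++;
  return ['reg_a_sys_id', 'reg_b_sys_id'];
}

function getUserRegions(userSysId) {
  var key = 'region_membership.' + userSysId;
  var cached = session.getClientData(key);
  if (cached !== null) return JSON.parse(cached); // cache hit: no lookup
  var regions = queryRegionsFromDatabase(userSysId);
  session.putClientData(key, JSON.stringify(regions)); // cache for the session
  return regions;
}
```

However many times getUserRegions() is called within the session, the expensive lookup runs only once per user.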
In this article we will look more closely at the type of caching strategies which are commonly used and the difference these can make to transactional (and therefore instance) performance. To help make this relatable we will describe this in the context of a customer escalation recently worked on by our team. Although this customer had a use case specific to them, the concepts in this article can be applied in a wide variety of scenarios.
Background
The customer's instance was accessed by users from multiple geographic locations (territories) and it was a business requirement that data in certain tables should be segregated based on territory. At a high level this was designed as follows:
To enforce data segregation the customer had implemented a set of Before Query Business Rules and a script include. Again, at a high level, these operated as follows:
Performance Issues
First and foremost we should point out that whilst Before Query Business Rules are a valid mechanism by which this type of use case can be implemented, they may not be the best solution. Again, as mentioned in our previous article:
Roles, ACL's, Domain Separation and other out-of-box security features are the preferred and recommended method of applying data security and/or segregation to ServiceNow data. They are optimized for performance and cause less technical debt than a custom scripted Query Business Rule. Ideally, if you are attempting to achieve data security/segregation in a way that will result in some logical evaluation before every single call to a frequently accessed table, you should try very hard to use something other than a Before Query Business Rule. It should probably be a last resort when features that are actually designed to accomplish the same goal have been exhausted.
In this escalation, however, the customer's objective was not to completely rewrite / redesign their data segregation code (which was deeply ingrained into their platform design) but simply to optimise what they already had. On reviewing the performance of the instance it was clear that these Business Rules had a significant impact on certain transactions. For example:
2021-06-01 06:02:47 (585) Default-thread-5 6C7C6C6CDBE0FC98AC705F8BD396194C txid=2dc2b028db28 EXCESSIVE *** End #11860627 /pm_project.do, user: userxzy, total time: 0:00:41.346, processing time: 0:00:41.344, total wait: 0:00:00.002, semaphore wait: 0:00:00.002, SQL time: 0:00:24.343 (count: 12,284), business rule: 0:00:27.257 (count: 2,998), ACL time: 0:00:01.435, UI Action time: 0:00:00.934, Cache build time: 0:00:00.682, source: xxx.xxx.xxx.xxx null
Note that the above transaction spent:
On further inspection it was found that the transaction:
When the Before Query data segregation Business Rules were debugged it was found that:
Unfortunately this is a scenario we see relatively often when working with customer instances. The Business Rule / Script Include code was well written, and there is no reason to believe that it was not exhaustively tested before being deployed into production. It is likely, however, that whoever wrote this code found that it took an insignificant time to execute in isolation and therefore assumed that its impact on the platform / end users would be negligible. What they did not consider is that this design means certain transactions might execute these Business Rules many thousands of times - even if a single execution of the Business Rule is fast, the design does not scale well (as evidenced above). Ultimately this caused significant performance degradation across the instance and a general loss of confidence in the ServiceNow platform.
Whilst working with the customer, one other aspect of design became apparent that we could ultimately use to our advantage. When we reviewed the Before Query Business Rules we found that they all:
Likewise the script include would always return a consistent encoded query for each combination of function (entry point) / table / user. Furthermore, the design meant that the same function / table / user combinations were being passed to the script include over and over again in single transactions. In short the vast majority of Business Rule time was spent calculating the same encoded query over and over again. This meant that if the encoded query could only be calculated once per function / table / user combination (then re-used if applicable) it was likely that we could significantly improve the performance / reduce the impact of these business rules.
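The "calculate once, then reuse" idea can be modeled in a few lines of plain JavaScript as memoization keyed on the function / table / user combination. The function and table names and the encoded query below are purely illustrative:

```javascript
var computeCount = 0; // tracks how often the expensive calculation runs

// Stand-in for the expensive Script Include logic that builds an encoded query.
function buildEncodedQuery(fn, table, userSysId) {
  computeCount++;
  return 'u_territoryINemea_sys_id'; // hypothetical encoded query result
}

var queryCache = {}; // lives for the duration of one transaction

function getEncodedQuery(fn, table, userSysId) {
  var key = fn + '|' + table + '|' + userSysId;
  if (!(key in queryCache)) {
    queryCache[key] = buildEncodedQuery(fn, table, userSysId);
  }
  return queryCache[key]; // repeated combinations short-circuit here
}
```

A transaction that calls getEncodedQuery() thousands of times with the same combination pays for the calculation only once.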
The next question was how the platform could store and reuse specific encoded queries. There were two possible options here - a cache table or the user's session. We will discuss each (along with their pros and cons) below.
Using a Cache Table
Remember that, in this case, the Script Include produced an encoded query in response to a function (entry point) / table / user combination - as a result the following table was deployed to hold this information:
SQL> show columns from u_mf_cache;
+------+----------------+-------------+------+-----+---------+-------+
| Port | Field | Type | Null | Key | Default | Extra |
+------+----------------+-------------+------+-----+---------+-------+
| 3403 | u_encodedquery | mediumtext | YES | | | | <=== HOLDS CALCULATED ENCODED QUERY
| 3403 | u_table | varchar(40) | YES | | | | <=== TABLE PROVIDED TO SCRIPT INCLUDE
| 3403 | u_user | varchar(32) | YES | MUL | | | <=== USER SYS_ID PROVIDED TO SCRIPT INCLUDE
| 3403 | sys_id | char(32) | NO | PRI | | |
| 3403 | sys_updated_by | varchar(40) | YES | | | |
| 3403 | sys_updated_on | datetime | YES | | | | <=== DATE / TIME CACHE RECORD CREATED / LAST UPDATED
| 3403 | sys_created_by | varchar(40) | YES | | | |
| 3403 | sys_created_on | datetime | YES | | | |
| 3403 | sys_mod_count | int(11) | YES | | | |
| 3403 | u_function | varchar(40) | YES | | | | <=== FUNCTION / ENTRY POINT WITHIN SCRIPT INCLUDE CALLED BY BUSINESS RULE
+------+----------------+-------------+------+-----+---------+-------+
The Script Include was then modified such that:
This mechanism meant that even if no record existed for a given function / table / user combination the Script Include would simply calculate the required encoded query (as it had been doing previously) and place the result in the cache table - this operation took approximately the same amount of time as a single invocation of the 'non caching' Script Include (i.e. it introduced negligible degradation for a single execution of the Business Rule / Script Include). If, however, a cache record did exist for the given function / table / user combination, then the Script Include (and therefore Business Rule) could 'short circuit'; completing significantly more quickly than the 'non caching' Script Include.
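Purely as an illustration of that flow, here is the cache-table lookup modeled in plain JavaScript, with an array standing in for u_mf_cache (in the instance these steps would be GlideRecord query and insert calls); expensiveEncodedQuery() and its return value are hypothetical:

```javascript
var u_mf_cache = [];   // stand-in for the cache table shown above
var calculations = 0;  // tracks invocations of the expensive calculation

function expensiveEncodedQuery(fn, table, user) {
  calculations++;
  return 'u_territoryINemea_sys_id'; // hypothetical encoded query
}

function getCachedEncodedQuery(fn, table, user) {
  // 1. Look for an existing cache record for this combination.
  for (var i = 0; i < u_mf_cache.length; i++) {
    var rec = u_mf_cache[i];
    if (rec.u_function === fn && rec.u_table === table && rec.u_user === user) {
      return rec.u_encodedquery; // 'short circuit': skip the calculation
    }
  }
  // 2. Cache miss: calculate as before and store the result for next time.
  var q = expensiveEncodedQuery(fn, table, user);
  u_mf_cache.push({ u_function: fn, u_table: table, u_user: user, u_encodedquery: q });
  return q;
}
```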
In summary, it was expected that, for transactions which might only execute the Business Rules a handful of times with different function / table / user combinations, any performance degradation from maintaining the cache table would be negligible; certainly not enough to be noticed by end users. Transactions that executed the Business Rules many thousands of times with repeated function / table / user combinations, however, could be expected to show significant performance improvements as the vast majority of Business Rule executions would 'short circuit', exit early, and, therefore, the overall Business Rule time of the transactions would be drastically reduced.
Note that the structure of the cache table and design of the Script Include meant that the cache table would constantly service SQL queries such as 'SELECT ... FROM u_mf_cache WHERE u_user = [x] AND u_table = [y] AND u_function = [z]'. To ensure that these queries remained performant under load and as the size of the table increased a composite index was added across the (u_user, u_table) columns.
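For illustration only (customers do not normally run DDL directly against instance databases; indexes are created through the ServiceNow UI or by ServiceNow support, and the index name here is hypothetical), the composite index described above would look something like:

```sql
-- Composite index so lookups filtered on u_user and u_table avoid a full scan
CREATE INDEX u_mf_cache_user_table ON u_mf_cache (u_user, u_table);
```

Because u_user is the leading column, this index also services queries that filter on u_user alone.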
Advantages of a cache table:
Disadvantages of a cache table:
Caching Data in User Sessions
Instead of caching data in a table in the underlying database, it is also possible to cache string-based data in a user's session in memory. Various APIs are available to do this, e.g.:
var session = gs.getSession();
session.putClientData([key], [data]);
For example:
var session = gs.getSession();
session.putClientData('foo', 'bar');
var session = gs.getSession();
session.getClientData([key]);
For example:
var session = gs.getSession();
gs.info(session.getClientData('foo'));
...
*** Script: bar
var session = gs.getSession();
session.clearClientData([key]);
For example:
var session = gs.getSession();
session.clearClientData('foo');
gs.info(session.getClientData('foo'));
...
*** Script: null
To leverage caching data in user sessions the Script Include was further modified such that:
Note that keys for data within the user's session do not include details of the corresponding user (which was of great importance when using a cache table) - this is because each user's session cache is private to that user and is not shared. This means that any cache entries in a user's session can only be accessed by code initiated by that user and therefore it is not necessary to record details of the user in cache.
Advantages of caching data in a user's session:
Disadvantages of caching data in a user's session:
Caching Data using the ScopedCacheManager
The ScopedCacheManager was introduced in the Tokyo release as an API for developers to implement instance-wide server-side caching using the concept of cacheable pairs. A cacheable pair links cached values to a persistent data structure; when the data structure is altered via a Create/Update/Delete operation, the related cache is flushed, and on the next access attempt the cached value is rebuilt. For full documentation see the developer guide on our official doc site: https://docs.servicenow.com/bundle/washingtondc-api-reference/page/integrate/guides/scopedcachemgr/c...
Advantages of ScopedCacheManager
Disadvantages of ScopedCacheManager
Performance Improvements Seen
To demonstrate the effectiveness of caching data significant testing was performed both with and without caching enabled. Initially, this testing was somewhat synthetic in that a Background Script was executed to call a single function within the script include 10,000 times, calculating total time taken - for example:
var impUser = new GlideImpersonate();
var initialUser = impUser.impersonate('d37741dcdbae2b0414f4bc2ffe96194c');
var stopWatch = new GlideDateTime();
for (var i = 0; i < 10000; i++) {
var smf = new [script_include]();
smf.[function]('pm_project', 'd37741dcdbae2b0414f4bc2ffe96194c');
}
gs.info('TOTAL MS: ' + (new GlideDateTime().getNumericValue() - stopWatch.getNumericValue()) + ' AVERAGE PER ITERATION MS: ' + ((new GlideDateTime().getNumericValue() - stopWatch.getNumericValue()) / 10000));
impUser.impersonate(initialUser);
For a given function, timings were as follows (note that this was similar across all functions within the Script Include):
Further testing involved identifying slow transactions being executed in the customer's production instance where those transactions were adversely affected by high Business Rule time (which, in turn, was determined to be due to large numbers of executions of Before Query Business Rules). In general, with caching enabled, transaction processing time (i.e. time spent executing on an application node) improved by between 55 - 65%. Transactions which were not adversely affected by execution of large numbers of Before Query Business Rules saw very limited (essentially no) performance degradation from use of caching.
Summary
Caching of data to improve performance is a somewhat advanced topic that is likely to only be of benefit in very specific scenarios (essentially wherever an expensive logical operation using a consistent set of inputs / returning a consistent set of results is executed over and over again). If your instance does have such a use case, however, implementing caching of data as described in this article can provide significant performance improvements and allow your instance to scale as expected.
So much good content in this article! The way I'd sum this up is:
If you answered yes to all those things then you should probably think about caching!
One thing to watch out for: your data should be small enough to be safely saved in memory. Total accumulated memory per node should be less than 1MB - preferably much less. See the advice about large arrays/objects and GlideElements in Performance Best Practices for Server-side Coding in ServiceNow. Depending on how much data you want to cache, you might need to use the Cache Table method rather than Session Caching.
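One way to enforce that advice is a size guard before writing to the session cache: skip caching when the serialized payload is too large. This is a minimal sketch; the 32KB per-entry limit is an illustrative choice, not a platform constant:

```javascript
var MAX_CACHE_BYTES = 32 * 1024; // illustrative per-entry budget, not a platform limit

function safeToCache(value) {
  var serialized = JSON.stringify(value);
  // Rough size estimate; multi-byte characters may occupy more space.
  return serialized.length <= MAX_CACHE_BYTES;
}
```

A caller would only invoke putClientData() when safeToCache() returns true, falling back to the cache-table approach (or no caching) otherwise.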
Question. When building a Service Catalog where many of the items have variables specific to the process, since the variables are all stored in another table, is indexing enough to generate the reports quickly? Is there something customers can consider to ensure we have the best efficiency in this regard?
While this article doesn't mention Glide Properties (the sys_properties table), it seems prudent to add a note about them here since they are a server-side method of caching data that can be used to improve performance. One particularly confusing topic regarding Glide Properties is the performance impact incurred when they are changed. When you change a Glide Property it can cause a system-wide cache flush leading to potentially severe performance degradation for anywhere from 5 to 30 minutes. In some rare cases it can even cause hours of impact; this has been observed where nodes are struggling hard enough during the cache flush that the load balancer takes them offline, causing session imbalance. Let us explain why.
ServiceNow is a clustered platform with multiple "nodes" or JVMs working as a single "instance" of ServiceNow (e.g. acme.service-now.com). Properties (records in the sys_properties table) are cached in node memory. This is a way of avoiding database calls for commonly accessed values. When a property is changed on one node, it tells all the other nodes to dump their property caches and get the new values of all their properties from the database again. This has a completely trivial impact on system performance, and it happens regardless of whether the ignore_cache field is set to true or not.
However, if ignore_cache is set to false - this was the default value until the Rome or perhaps San Diego release - then we not only flush the specific cache related to properties, but we also flush the whole Glide System cache! Let us say that again and, please, pay attention to the double negative: if a property is set to not ignore the cache then we are telling it to flush the whole Glide System cache. This is what triggers the significant performance impact. So why would we do it? This cache flush is done to make sure we flush any dependencies or stale values in other caches that might be related to the old value of the property that was just changed.
For example, imagine that you have a UI cache that stores the rendered HTML of a page on the server side, on a per-session basis. The purpose of such a cache would be to avoid rendering the same page over and over for the same user. Now suppose further that the way the HTML renders depends on the value of a property that was just changed. If we don't flush this UI cache then the old value of the property will still be used in the rendered HTML, and any changes you expect to see in the UI based on the change of the related property would not be reflected until the user starts a new session - i.e. until they log out and log back in. In this scenario you would want to make sure that ignore_cache is set to false so that we will not ignore the cache flush, thereby ensuring that any dependent cache is also flushed.
So, in summary, if you have a property value that will be frequently updated (more than, say, once a month) and you know that there are no other caches that might depend on the value of the property, then set ignore_cache = true. That way the system will only flush the property-specific cache when the property is updated and not the whole Glide System cache.
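That flush behavior can be summarised in a toy model: changing a property always refreshes the property cache, but only ignore_cache = false additionally triggers the expensive system-wide flush. This is purely illustrative of the behavior described above, not platform code:

```javascript
var propertyCache = {};            // per-node cache of sys_properties values
var glideSystemCacheFlushes = 0;   // counts the expensive system-wide flushes

function setProperty(name, value, ignoreCache) {
  propertyCache = {};              // property cache is always rebuilt (cheap)
  propertyCache[name] = value;
  if (!ignoreCache) {
    glideSystemCacheFlushes++;     // whole Glide System cache flushed (expensive)
  }
}
```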
Alternatively, you could just use some table besides sys_properties in which to store the value. However, we recognize that if you were to use a new custom table that would incur licensing charges and so this might not be a viable option.
Best regards, your performance team
NOTE: This whole discussion about sys_property.ignore_cache relates only to ServiceNow server-side caching. It has nothing to do with the caching mechanisms implemented on the client side within browsers/user agents.
Thanks for this article @GTSPerformance, it provides insight into the platform. Could you update this article with the usage of ScopedCacheManager and what role it plays, particularly in server-side data processing? Is there any other article that best describes the usage and internal workings of ScopedCacheManager?
@ArjunBasnet Thanks! We've updated the article to include a note about ScopedCacheManager. We haven't used this method ourselves, but a very similar feature has been available inside ServiceNow's java layer for a long time and we are familiar with it. It is a good addition to any developer's arsenal of caching methods. As with any caching method, be aware that ScopedCacheManager puts pressure on Java's heap memory and, therefore, carries risks and tradeoffs to system performance and scalability.