Have you ever asked yourself why your events are not being processed and your emails are not being sent? From a customer's perspective, it can be very difficult to identify what is happening in these cases. There are a few reasons you may be experiencing this:
- The event processor has been running for several hours and is possibly stuck.
- The event processor has not been able to start.
- The event processor is running for a long time but is actually processing a lot of events.
- The event processor is running for a long time but is working on a rather large event, which could require processing numerous business rules or records.
You can do some troubleshooting on your own to determine whether the event processor has been unable to start even though it has already been queued in the system scheduler.
Troubleshooting the Events Processor and the Scheduler's Workers
1) Check the scheduled jobs to verify if the event processor has already been queued.
To do so, connect to the instance and go to System Scheduler > Scheduled Jobs > Scheduled Jobs. On this page, search for the "events process" job and make sure that you can see the "state" column. If it shows as "queued," the job has been scheduled and is waiting for an available worker. This normally indicates that the Scheduler has a lot of scheduled jobs and all of its workers are busy running jobs.
A second reason the job may remain in the "queued" state is that it was running on a node when that node abruptly restarted. In this case, all that is needed is to reschedule it: change the state of the job back to "ready" and set the Next Action value to a time in the near future.
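The check above can be sketched in Python. This is a minimal, illustrative example that assumes you have exported the Scheduled Jobs list to CSV; the column names ("name", "state", "next_action") are assumptions about the export format, not guaranteed field names.

```python
import csv
import io

# Illustrative CSV export of the Scheduled Jobs list; the column names
# ("name", "state", "next_action") are assumptions made for this sketch.
sample_export = """name,state,next_action
events process,queued,2015-01-01 00:00:00
SMTP Sender,ready,2015-01-01 00:05:00
"""

def find_stuck_jobs(export_text, job_name="events process"):
    """Return rows for the given job that are sitting in the 'queued' state."""
    reader = csv.DictReader(io.StringIO(export_text))
    return [row for row in reader
            if job_name in row["name"].lower() and row["state"] == "queued"]

for job in find_stuck_jobs(sample_export):
    print(job["name"], "is queued; consider resetting it to 'ready'")
```

A job flagged this way is a candidate for the reschedule described above: set its state back to "ready" and its Next Action to a time in the near future.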
2) Check the stats page and xmlstats
If the "events process" job has been queued, you will want to check the stats page of the node you are on and the xmlstats pages of the other primary nodes on the instance (you can reach these pages from the Diagnostics page). On the stats page, look for the Background Scheduler and check whether all the workers have a job assigned to them and what the value of "queue length" is. You can check the xmlstats page for similar information.
The goal is to see whether there are long queues in the node schedulers. If the "queue length" is greater than 0, we can confirm that the scheduler has a lot of jobs scheduled and will take longer to complete them. This often occurs when Discovery jobs are running concurrently.
Next, check the xmlstats page. For example, you can search for "scheduler.queue.length" to see how many jobs are queued up; it corresponds to the "queue length" value on the stats page. The same considerations made for the stats page apply to the xmlstats page. If you look under "scheduler.workers," you will be able to see whether the workers are executing jobs, just as on the stats page.
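Because xmlstats is XML, the values above can be pulled out programmatically. Here is a minimal Python sketch using the standard library's XML parser; the element layout in the sample is an illustrative assumption, not the exact output of the real xmlstats page.

```python
import xml.etree.ElementTree as ET

# Illustrative xmlstats-style snippet; the element and attribute layout
# is an assumption made for this sketch, not the instance's exact output.
sample = """
<xmlstats>
  <scheduler.queue.length>5</scheduler.queue.length>
  <scheduler.workers>
    <worker name="glide.scheduler.worker.0" job="events process"/>
    <worker name="glide.scheduler.worker.1" job="Discovery - Sensors"/>
  </scheduler.workers>
</xmlstats>
"""

root = ET.fromstring(sample)

# queue length > 0 means jobs are waiting for a free worker
queue_length = int(root.findtext("scheduler.queue.length"))

# workers that currently have a job assigned to them
busy = [w.get("name") for w in root.find("scheduler.workers") if w.get("job")]

print("queue length:", queue_length)
print("busy workers:", busy)
```

If every worker is busy and the queue length is above zero, the scheduler is backed up and the "events process" job will wait its turn.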
3) Check which jobs are taking a long time to run
If the above checks show that there are long queues on the scheduler in all nodes, you will need to check which jobs are taking a long time to run. Once you have identified the scheduled jobs that have been running for an extended amount of time, you will have to decide whether to stop them yourself or ask Customer Support to stop them. Please note that you will be able to stop long-running scheduled jobs only on the node you are logged in to (i.e., the node whose stats page you see). If your long-running scheduled jobs are on another node, you will need to ask Customer Support to stop them.
If your long-running jobs are shown on the stats page, you can go to the All Active Transactions page (under User Administration > All Active Transactions). Make sure that you can see the "Uncancelable" and "Thread" columns; these will help you identify the long-running job and whether it is possible to cancel it. If the "Uncancelable" field shows true, you will not be able to stop the job yourself. To stop a job, select the checkbox next to its record and, under "Actions on selected rows...," choose "Kill." Please note that choosing "Delete" will not stop the job, although it will make it disappear from the view.
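The triage logic above can be sketched as a short Python filter. The records and field names below are made up for illustration; they mirror the columns mentioned above but are not the actual transaction schema.

```python
# Illustrative records mimicking the All Active Transactions columns;
# the field names, ages, and threshold are assumptions for this sketch.
transactions = [
    {"name": "events process", "age_seconds": 7200, "uncancelable": False},
    {"name": "Discovery - Sensors", "age_seconds": 5400, "uncancelable": True},
    {"name": "SMTP Sender", "age_seconds": 30, "uncancelable": False},
]

LONG_RUNNING = 3600  # treat anything over an hour as long running (assumption)

def killable(txns, threshold=LONG_RUNNING):
    """Long-running transactions you can stop yourself (Uncancelable = false)."""
    return [t["name"] for t in txns
            if t["age_seconds"] > threshold and not t["uncancelable"]]

print(killable(transactions))
```

Transactions that are long running but uncancelable fall outside this list; those are the ones to raise with Customer Support.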
As of Eureka, we have introduced burst workers. These workers allow high-priority jobs (such as the events processor) to be prioritized and run regardless of the queue in the scheduler. There is, however, a situation that can still prevent these high-priority jobs from running in a timely fashion: a burst worker becoming stuck idle. This means that when its job finished, instead of closing off the thread, the worker became blocked and is now unusable. You can check whether this is the case by looking at your stats page for a burst worker without a job running on it; this covers the node you are logged in to. For the other nodes, check the xmlstats page for idle burst workers by searching for "burst.worker". Should you find one (or more) in the "scheduler.workers" tag, check whether a job is currently running on it. If not, you have an idle burst worker, which will need to be checked by Customer Support.
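The idle-burst-worker check can be sketched the same way as the earlier xmlstats parsing. Again, the XML layout and worker names below are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative "scheduler.workers" snippet; element and attribute
# names are assumptions made for this sketch.
sample = """
<scheduler.workers>
  <worker name="glide.scheduler.worker.0" job="events process"/>
  <worker name="glide.burst.worker.0" job=""/>
  <worker name="glide.burst.worker.1" job="text index events"/>
</scheduler.workers>
"""

workers = ET.fromstring(sample)

# A burst worker with no job assigned may be stuck idle and
# should be reported to Customer Support.
idle_burst = [w.get("name") for w in workers
              if "burst.worker" in w.get("name") and not w.get("job")]

print("idle burst workers:", idle_burst)
```

A burst worker appearing here with no job is the idle case described above and should be flagged to Customer Support.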
If none of the above checks show that the events processor is queued waiting on a scheduler worker, then you are probably in one of the other situations mentioned at the beginning of this article.