This article outlines some issues that were recently discovered and addressed, in the hope that any customers still using Servicedesk 7.0 will feel additional motivation to take the steps necessary to move to the current release.
The What
After many years of using Change Management, one Servicedesk 7.0 customer discovered that their change requests were not moving from one phase to the next. Requests would stall without completing the CAB approval or the implementation stage, and when they did progress to the next part of the process, progress was extremely slow and erratic. Some new change requests did eventually process, but not on a consistent, predictable schedule.
Some of the troubleshooting steps that were attempted included:
- Restarting the Symantec Workflow services
- Resetting IIS
- Recycling the application pools
- Rebooting the Servicedesk server and other related servers, including the application server, notification server (SMP), and database server
We also republished the Change Management core workflow, to no avail.
Upon review of the log files, the following error was observed:
Error,Wednesday, November 27, 2013 6:16:57 AM,Cannot process timeouts and escalations now because there are active threads running against this process.
The question we asked ourselves was why the Timeouts and Escalations process was unable to run successfully.
The Why
The 7.0 version of the Change Management process (and Incident Management) used a “Merge” component, which essentially “waits” for other portions of the process to complete and to provide the data the process needs to move forward. While it waits, it lets the appropriate “messages” pass through the message handler. This Merge component was deemed to be a problem after 7.0, and subsequent releases of the applications did not include it.
Each major process (Change, Incident, Problem) has a “Timeouts and Escalations” process that is executed to determine what messages should be handled. This comes in the form of a SQL query to the Messages table, looking for specific messages related to the process. If there are timeout messages to be processed, or an escalation (something next to do, when a specific date/time has been reached) to be handled, the system acts upon the messages found. That query for Change Management looks something like this:
exec sp_executesql N'SELECT M.MessageId
FROM Messages AS M
WHERE QueueName=@exchangeName
AND (LeaseExpireTime IS NULL OR LeaseExpireTime<GETDATE())
AND M.MessageId IN (SELECT MessageId FROM MessageProperties WHERE QueueName=@exchangeName AND AttributeKey=@attr0Key AND AttributeNumValue<@attr0NumValue )',
N'@attr0NumValue bigint,@attr0Key nvarchar(12),@exchangeName nvarchar(45)',
@attr0NumValue=635260623575664642,@attr0Key=N'TRIGGER_DATE',@exchangeName='local.workflowsqlexchange-change_management.tasks'
The MessageProperties table holds the attributes for these messages. Each message has four property entries. The two key fields in this query are:
AttributeNumValue = The current date/time in ticks (you can get the current value with the PowerShell command [DateTime]::Now.Ticks)
AttributeKey = What we are looking for. In this case we were looking for a ‘TRIGGER_DATE’, whose value is compared against the tick count passed in ‘@attr0NumValue’. Other possible entries here are DataTypeOfPayload, SizeOfPayload, and the important one, TRACKING_ID. We will use this later.
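To make the tick values easier to read, here is a minimal sketch in Python (not part of Servicedesk; the helper name ticks_to_datetime is our own) that converts a .NET tick count, which counts 100-nanosecond intervals from 0001-01-01 00:00:00, back into a date and time. Applied to the @attr0NumValue in the query above, it should give a time on 23 January 2014.

```python
from datetime import datetime, timedelta

# .NET DateTime ticks are 100-nanosecond intervals counted from 0001-01-01 00:00:00.
DOTNET_EPOCH = datetime(1, 1, 1)


def ticks_to_datetime(ticks: int) -> datetime:
    """Convert a .NET tick count to a naive datetime.

    Python datetimes only resolve microseconds, so the last decimal digit
    of the tick count (the 100 ns part) is dropped.
    """
    return DOTNET_EPOCH + timedelta(microseconds=ticks // 10)


if __name__ == "__main__":
    # @attr0NumValue from the captured timeouts query
    print(ticks_to_datetime(635260623575664642))
```

The same helper can be pointed at any AttributeNumValue returned for a TRIGGER_DATE row to see when that message was meant to fire.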
For this customer, this query was returning more than 890 messages that needed to be processed, just for Change Management’s timeouts and escalations. The query also ran every 30 seconds, a hard-coded interval for all timeouts and escalations in Servicedesk. Essentially, the process could not complete one run before it was called to run again, and so on. This was also causing some of the other Change Management processes to stop running.
The Way
The next task was to work out how to resolve this. It involved a bit of trial and error, but by following the steps here, we were able to resolve the problem.
1. We needed to stop IIS from resetting when Change Management stopped working. We could tell when that happened because we had set INFO level logging on for Change, and when the timeouts process was running correctly, we would see an entry like:
“Checking Count …”
This entry indicated that Timeouts and Escalations was running. In IIS, we went to the application pool supporting the Servicedesk applications and turned OFF the Idle timeout, which is normally set to 20 minutes. We did not want the pool to shut down, or even recycle. We believe that the process running over the top of itself was what caused the recycle.
2. We assumed that the backlog of current requests was backing the system up, so we took all of the messages that the query returned and backed them up into a secondary table.
--Backup
/*CREATE TABLE MessageProperties2 (MessageID Varchar(50), Value Decimal(38,4))
INSERT INTO MessageProperties2 (MessageId, Value)
SELECT MessageId, AttributeNumValue
FROM MessageProperties
WHERE AttributeKey = 'TRIGGER_DATE'
AND AttributeNumValue < 635230523254287593
AND QueueName = 'local.workflowsqlexchange-change_management.tasks'
AND messageid <> '9befa56a-91f5-4654-9a97-ad4c431f3c37'
*/
3. We then moved all of the messages returned by the current timeouts query to a new date/time by changing their tick count (AttributeNumValue). This looked like:
--update
/*
UPDATE MessageProperties SET AttributeNumValue = 935230523254287593
WHERE AttributeKey = 'TRIGGER_DATE'
AND AttributeNumValue < 635230523254287593
AND QueueName = 'local.workflowsqlexchange-change_management.tasks'
*/
The timeouts and escalations process was now in a position where, each time it executed every 30 seconds, it only had to deal with messages that had been added since its last run. The result for the customer was that all new change requests started processing at a normal rate, allowing proper execution.
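Why the update takes the parked messages out of play can be checked with a little arithmetic. The sketch below is in Python and uses hypothetical helper names of our own (datetime_to_ticks, is_due). It converts a date to ticks and mirrors the timeouts query's test, AttributeNumValue less than the current tick count, to show that a message with an old trigger date is still selected while one moved to the parked value is not.

```python
from datetime import datetime

# .NET DateTime ticks are 100-nanosecond intervals counted from 0001-01-01 00:00:00.
DOTNET_EPOCH = datetime(1, 1, 1)
TICKS_PER_SECOND = 10_000_000

CUTOFF_TICKS = 635230523254287593  # cutoff used in the backup and update scripts
PARKED_TICKS = 935230523254287593  # far-future value written by the update script


def datetime_to_ticks(moment: datetime) -> int:
    """Convert a naive datetime to a .NET tick count using integer arithmetic."""
    delta = moment - DOTNET_EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10


def is_due(trigger_ticks: int, now: datetime) -> bool:
    """Mirror of the query's test: AttributeNumValue < current tick count."""
    return trigger_ticks < datetime_to_ticks(now)


if __name__ == "__main__":
    now = datetime.now()
    print(is_due(CUTOFF_TICKS - 1, now))  # a trigger date before the cutoff is still due
    print(is_due(PARKED_TICKS, now))      # a parked message is not picked up
```

The same conversion can be used to pick a cutoff or a parked value deliberately from a calendar date rather than copying a tick count by hand.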
4. We let the system rest over a weekend and a holiday. It automatically processed a few outstanding requests that had previously been bottled up.
5. We then started copying back some of the messages we had changed the dates on. We used the following SQL:
--restore
/*
UPDATE TOP (10) MessageProperties
SET MessageProperties.AttributeNumValue = MessageProperties2.Value
FROM MessageProperties
INNER JOIN MessageProperties2
ON MessageProperties.MessageId COLLATE DATABASE_DEFAULT = MessageProperties2.MessageId COLLATE DATABASE_DEFAULT
AND MessageProperties.AttributeKey = 'TRIGGER_DATE'
AND MessageProperties.AttributeNumValue = 935230523254287593
*/
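The backup, park, and restore-in-batches pattern from steps 2, 3, and 5 can be rehearsed safely away from the Servicedesk database. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database rather than SQL Server, a cut-down MessageProperties table, made-up message IDs, and function names of our own. SQLite has no UPDATE TOP, so the batch is selected with LIMIT instead.

```python
import sqlite3

QUEUE = "local.workflowsqlexchange-change_management.tasks"
CUTOFF = 635230523254287593  # tick cutoff used in the article's scripts
PARKED = 935230523254287593  # far-future value the messages are parked at
OVERDUE = "AttributeKey = 'TRIGGER_DATE' AND AttributeNumValue < ? AND QueueName = ?"


def build_demo(conn, count):
    """Create a simplified MessageProperties table holding `count` overdue messages."""
    conn.execute(
        "CREATE TABLE MessageProperties "
        "(MessageId TEXT, QueueName TEXT, AttributeKey TEXT, AttributeNumValue INTEGER)"
    )
    rows = [(f"msg-{i:03d}", QUEUE, "TRIGGER_DATE", CUTOFF - 1 - i) for i in range(count)]
    conn.executemany("INSERT INTO MessageProperties VALUES (?, ?, ?, ?)", rows)


def overdue_count(conn):
    """Count the messages the timeouts query would currently have to handle."""
    sql = "SELECT COUNT(*) FROM MessageProperties WHERE " + OVERDUE
    return conn.execute(sql, (CUTOFF, QUEUE)).fetchone()[0]


def backup_and_park(conn):
    """Steps 2 and 3: copy the original trigger dates aside, then park the messages."""
    conn.execute("CREATE TABLE MessageProperties2 (MessageId TEXT, Value INTEGER)")
    conn.execute(
        "INSERT INTO MessageProperties2 (MessageId, Value) "
        "SELECT MessageId, AttributeNumValue FROM MessageProperties WHERE " + OVERDUE,
        (CUTOFF, QUEUE),
    )
    conn.execute(
        "UPDATE MessageProperties SET AttributeNumValue = ? WHERE " + OVERDUE,
        (PARKED, CUTOFF, QUEUE),
    )


def restore_batch(conn, size=10):
    """Step 5: give up to `size` parked messages their original trigger date back."""
    parked = conn.execute(
        "SELECT mp.MessageId, b.Value FROM MessageProperties mp "
        "JOIN MessageProperties2 b ON b.MessageId = mp.MessageId "
        "WHERE mp.AttributeKey = 'TRIGGER_DATE' AND mp.AttributeNumValue = ? "
        "ORDER BY mp.MessageId LIMIT ?",
        (PARKED, size),
    ).fetchall()
    for message_id, original in parked:
        conn.execute(
            "UPDATE MessageProperties SET AttributeNumValue = ? "
            "WHERE MessageId = ? AND AttributeKey = 'TRIGGER_DATE'",
            (original, message_id),
        )
    return len(parked)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_demo(conn, 25)
    backup_and_park(conn)
    while restore_batch(conn) > 0:
        print("overdue messages now visible to the timeouts query:", overdue_count(conn))
```

In this toy version nothing consumes the restored messages, so the overdue count simply climbs back up batch by batch. On the real system, the point of the small batches was to give the timeouts process time to work through each group before the next one was reintroduced.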
We started re-introducing these messages 10 at a time, once a day, and started seeing slow progress. As we added subsequent days of messages, we noticed that some of the messages kept reappearing in our group. We were running a query to check the status of those:
--this query checks on the problem messages
SELECT m.messageposteddate, mp.*
FROM MessageProperties mp with (Nolock)
join messages m with (nolock) on (mp.messageid = m.messageid)
WHERE mp.AttributeKey = 'TRIGGER_DATE'
AND mp.AttributeNumValue < 635230523254287593
AND mp.QueueName = 'local.workflowsqlexchange-change_management.tasks'
AND mp.messageid <> '9befa56a-91f5-4654-9a97-ad4c431f3c37'
order by m.messageid
--order by m.messageposteddate
We would take those messages and capture the TRACKING_ID of each, using a couple of simple queries, shown here:
--Specific Message Search
--Select * from MessageProperties where MessageID = 'ee77a5e2-d499-46bc-9a30-551b7eb942ab'
--Specific Task Search
--Select * from task where taskID = '9befa56a-91f5-4654-9a97-ad4c431f3c37'
--Specific Message Properties Search
--Select * from messageProperties where AttributeKey='TRIGGER_DATE' and messageid = '9befa56a-91f5-4654-9a97-ad4c431f3c37'
We would then take the results one by one and plug them into the Change WorkflowManagement Service in IIS (SD.ServicedeskChangeManagement.WorkflowManagmentService.asmx). We would use the Tracking ID of the message to determine the tasks, using the GetCurrentWorkflowTasks operation (method), and then find the TaskID of the Merge task.
Using that TaskID, we would then use the ChangeTriggerTiming operation (method) to change the time at which the trigger for the task message would execute. In addition to the TaskID, which we had, we usually used the number 0 and a newtime of “01-01-2014 15:30”. The date could be any date, as long as it was shortly before the actual time at which you were working on this.
We would do this for about 5 or 10 messages per day. Within about 1 week we saw substantial improvement, and the number of outstanding requests dropped. Within 2 weeks they were all pretty much handled correctly.
Summary:
This was a slow, tedious process, but we were able to save the customer the headache of losing the problem change messages. We believe that we could develop a workflow process to automate some of this going forward. However, given that the Merge component is no longer used, that may not be necessary.
Other related SQL QUERIES used:
/*select mp2.* from MessageProperties2 mp2
join messageproperties mp with (nolock) on
mp.messageid = mp2.messageid
and mp.Attributekey = 'trigger_date'
*/
--select * from messageproperties2
--Verify the restore
SELECT DISTINCT(MessageProperties.AttributeNumValue - MessageProperties2.Value) FROM MessageProperties, MessageProperties2
WHERE MessageProperties.MessageId COLLATE DATABASE_DEFAULT = MessageProperties2.MessageId COLLATE DATABASE_DEFAULT
AND MessageProperties.AttributeKey = 'TRIGGER_DATE'
AND QueueName = 'local.workflowsqlexchange-change_management.tasks'