mcmcghee

Commands stuck in "executing" state


So we have a number of machines where the LT agent gets stuck in an "executing" state.

 

nHSM7U1.png

 

When this happens, the only thing the machine will do is check in to the LT server. The fix is a reboot or a restart of the LT service.

 

My question is: how can I monitor for this? I would like to detect machines in this state and automatically run psexec/psservice from another machine at the same location to restart the agent.

 

Probably the bigger question is: does anyone else have this problem?


I've had the problem in the past. Haven't found the cause, though the last time it happened, ESET and Webroot were fighting over clock cycles.

 

If you want to monitor for it, create an internal monitor that does a count of the commands in the executing status and use the agent ID as the identity field.


I've written a script and monitor that detect agents stuck in an executing state, locate another machine on the same network (preferring master PCs), and then use that PC to remotely stop the LabTech service, kill the process, and start the service again.
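
For readers who want to see the stop / kill / start sequence spelled out, here is a minimal sketch in Python. It is not the content of the attached scripts. The service name `LTService`, the process image `LTSVC.exe`, and the function name are assumptions for illustration; check them against your own agents. The sketch only builds command lines for the standard Windows `sc` and `taskkill` tools and does not run anything.

```python
from typing import List


def build_restart_commands(
    target_host: str,
    service_name: str = "LTService",   # assumed service name; verify on your agents
    process_image: str = "LTSVC.exe",  # assumed process image name
) -> List[List[str]]:
    """Return the stop / kill / start command lines for a remote agent.

    Nothing is executed here. A caller on the helper machine could pass each
    list to subprocess.run() under an account with admin rights on the target.
    """
    unc_host = "\\\\" + target_host  # sc.exe expects the form \\HOSTNAME
    return [
        ["sc", unc_host, "stop", service_name],
        ["taskkill", "/S", target_host, "/IM", process_image, "/F"],
        ["sc", unc_host, "start", service_name],
    ]


# Hypothetical host name, printed only to show the resulting command lines.
for command in build_restart_commands("WORKSTATION-07"):
    print(" ".join(command))
```

Keeping the commands as data makes it easy to log them or review them before anything touches a remote machine.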

 

The scripts and monitor are attached in a ZIP. Make sure the monitor calls the "CCP - Repair Broken Agent" script. After importing the scripts, edit "CCP - Repair Broken Agent" and make sure the script it runs is "CCP - Reset LTSVC on remote PC". If anyone can improve the monitor or script, feel free.

 

The script relies on the location having admin credentials set, since it uses "shell as admin".

 

The monitor will trigger if it finds a PC that has two or more commands stuck "executing" for more than 15 minutes within the last hour.
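
As a rough illustration of that trigger rule, the sketch below applies it to in-memory rows in Python. It is not the monitor itself: the `Command` row shape and the function name are hypothetical, and the status value 2 for "executing" is taken from the queries posted elsewhere in this thread.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple

EXECUTING = 2  # status value used for "executing" in the queries in this thread


class Command(NamedTuple):
    computer_id: int
    status: int
    date_updated: datetime


def find_stuck_agents(
    commands: Iterable[Command],
    now: datetime,
    min_stuck: int = 2,
) -> List[int]:
    """Return IDs of agents with at least `min_stuck` commands that have been
    executing for more than 15 minutes but were updated within the last hour."""
    oldest = now - timedelta(minutes=60)
    newest = now - timedelta(minutes=15)
    counts = Counter(
        c.computer_id
        for c in commands
        if c.status == EXECUTING and oldest < c.date_updated < newest
    )
    return sorted(cid for cid, n in counts.items() if n >= min_stuck)


now = datetime(2016, 7, 6, 9, 0)
sample = [
    Command(101, EXECUTING, now - timedelta(minutes=20)),
    Command(101, EXECUTING, now - timedelta(minutes=45)),
    Command(102, EXECUTING, now - timedelta(minutes=5)),   # too recent
    Command(103, EXECUTING, now - timedelta(minutes=30)),  # only one command
]
print(find_stuck_agents(sample, now))  # [101]
```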

CCP Agent Fix.zip


I realize this is an old thread, and the scripts are much appreciated! I can't seem to get the monitor to run the first script however. The monitor will definitely build the query and find the agents, but the first script never runs. I have also tried to run the Repair broken agent script manually on the affected agents, but they just remain in a queued state forever.

 

Any ideas?


To the two above: it doesn't work that way. The script is meant to be run on another machine in the same network, and it requires the parameters passed to it by the monitor to know which agent it is repairing.

 

Oh, and I can confirm this does work and has saved me much time and hassle. Thanks CCP!


rgreen83 is mostly correct. The script executes on the computer that is faulty and then runs some SQL to pick another computer in the same Labtech location as the faulty one. This does mean you need to use locations correctly though :). I've actually just changed ours to use an alert template that runs the script instead, to see if it increases stability a little, but as rgreen83 has said - It's saving us a lot of time and hassle as well.


It's been a while since I wrote this, but the script attempts to find a computer on the same network based on the router public address, client and location.

If you have dual WANs at a site, it may have trouble picking the best computer.

It will first try to run it from a "MASTER". Failing that, it will try to run from a non-master computer.

 

It does not execute commands or show the script on the "stuck" computer, as that would be pointless. It will instead show and execute the script on the next best online computer it can find based on the above.

You also need to make sure your admin passwords are set correctly, which they should be if you're using LabTech properly.
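
The selection order described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SQL the script actually runs: the `Computer` fields and the `pick_helper` name are hypothetical, and the tie-break on computer ID is added only to make the choice deterministic.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Computer:
    computer_id: int
    client_id: int
    location_id: int
    router_address: str  # public address the agent checks in from
    is_master: bool
    is_online: bool


def pick_helper(stuck: Computer, candidates: Iterable[Computer]) -> Optional[Computer]:
    """Pick an online computer on the same network as `stuck`, masters first."""
    same_network = [
        c
        for c in candidates
        if c.computer_id != stuck.computer_id
        and c.is_online
        and c.client_id == stuck.client_id
        and c.location_id == stuck.location_id
        and c.router_address == stuck.router_address
    ]
    # False sorts before True, so masters (where `not is_master` is False)
    # come first; computer_id is only a tie-breaker.
    same_network.sort(key=lambda c: (not c.is_master, c.computer_id))
    return same_network[0] if same_network else None


stuck = Computer(10, 1, 1, "203.0.113.5", False, True)
fleet = [
    stuck,
    Computer(11, 1, 1, "203.0.113.5", False, True),
    Computer(12, 1, 1, "203.0.113.5", True, True),
    Computer(13, 1, 1, "198.51.100.9", True, True),  # same site, second WAN link
    Computer(14, 1, 2, "203.0.113.5", True, True),   # different location
]
print(pick_helper(stuck, fleet).computer_id)  # 12
```

Computer 13 in the sample shows the dual-WAN caveat: it sits at the same location but checks in from a different public address, so it is never considered.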

 

Since I first wrote this, I also found that adding the following step just before the "Run Script: Reset LTSVC on Remote PC" step, and again just before the "LOG: Script Complete..." step, fixed a bug where the script kept the computer ID from the last time it ran.

 

Function: Variable Set

Type: Reload Computer Variables

Parameter: @ComputerID@


So my question, CCP, would be: what do you mean by "use locations correctly"? We have ours separated out a bit, like DataCenter and workstations. Should these be combined, especially in the case of single servers in the "datacenter"?


Yep, all machines at the same physical location (really the same network) are in the same location.


This used to happen to us until we disabled tunnels. The agent would use up CPU/RAM until it crashed many hours later, and no commands would go through. LT10 is supposed to fix this, but we had a similar problem and had to disable tunnels globally.


Just had to put a reply in. I had an issue with a server today, started writing a script, and then remembered to check here. This is awesome and works beautifully!

 

To import the SQL statement, you need to use the SQL option under Import where you also find the XML option for importing the 2 XML files.

After that, you just need to follow CCP's instructions in the post with the attachment and make sure the scripts are referenced correctly and set the monitor to run the script.

 

Cheers,

Stuart.


Sorry, me again.

I'm getting some false positives due to KeepAlive for some long running processes. Does anyone have any suggestions on how to exclude these from the result?


Thanks to CCP for the original scripts and monitor. I was able to build on those and haven't had the problem since.

 

Quoting the post above:

"I'm getting some false positives due to KeepAlive for some long running processes. Does anyone have any suggestions on how to exclude these from the result?"

 

I had the same issue, so I modified the monitor as below:

 

INSERT INTO `Agents` (`Name`,`LocID`,`ClientID`,`ComputerID`,`DriveID`,`CheckAction`,`AlertAction`,`AlertMessage`,`ContactID`,`interval`,`Where`,`What`,`DataOut`,`Comparor`,`DataIn`,`LastScan`,`LastFailed`,`FailCount`,`IDField`,`AlertStyle`,`Changed`,`Last_Date`,`Last_User`,`ReportCategory`,`TicketCategory`,`Flags`,`GUID`,`AgentDefaultGUID`,`WarningCount`,`DeviceId`) Values('LT - Commands Stuck Executing','0','0','0','','0','72','~~~%NAME% %STATUS% on %CLIENTNAME%\\%COMPUTERNAME% at %LOCATIONNAME% for %FIELDNAME% result %RESULT%.!!!~~~%NAME% %STATUS% on %CLIENTNAME%\\%COMPUTERNAME% at %LOCATIONNAME% for %FIELDNAME% result %RESULT%.','1','300','commands','Status','commands.DateUpdated < DATE_ADD(NOW(),INTERVAL -15 MINUTE) AND commands.DateUpdated > DATE_ADD(NOW(),INTERVAL -60 MINUTE) AND Computers.ComputerID in (SELECT ComputerID FROM commands WHERE commands.status=2 GROUP BY commands.ComputerID HAVING Count(*) > 2) AND Computers.ComputerID NOT IN (SELECT ComputerID FROM commands WHERE commands.status=3 GROUP BY commands.ComputerID HAVING COUNT(*) > 1) and\r\n Computers.LastContact > DATE_ADD(NOW(),INTERVAL -15 MINUTE)','1','2','2016/07/06 08:40:44','2016/07/06 02:08:44','0','ComputerID','0','6303','2016/07/06 08:40:44','root@localhost','23','134','0','','','0','0');

 

am6GwVN.png


Can I get a little help? Sorry... but the monitor resets and the query fails in both the originally imported monitor and @mcmcghee's contribution. I'll admit that this could entirely be user error (i.e. me...).


Seems like this issue is related to Webroot; at least for me it is. I had trouble with the query because it was returning too many results and timing out.


Commands may get stuck for a few reasons, but if you are seeing a lot of commands for adding/removing monitors and you're running Webroot, it's a known issue and Webroot is working on a fix.

 

Use this SQL query to find stuck commands. It counts them per agent:

SELECT cmd.computerid, c.name, COUNT(cmd.computerid) AS NumOfTimes 
FROM commands AS cmd
LEFT JOIN computers AS c ON cmd.computerid = c.computerid
WHERE cmd.status = 2 
GROUP BY cmd.computerid
ORDER BY NumOfTimes DESC;

 

You can use this query to kill any monitor commands:

 

DELETE FROM commands WHERE STATUS = 2 AND command IN (84,85);

 

You will need to kill/restart the LT agent and it should start processing commands again.


We were having issues with the SQL query timing out in LT, and an almost 4 minute execution time in SQLYog just for 70 some odd rows. The problem is that the IN portion of the query is executed on every row that is returned. We changed the query so that the subquery is only executed once. Change the additional condition to this:

 

 

commands.DateUpdated < DATE_ADD(NOW(),INTERVAL -15 MINUTE) 
AND commands.DateUpdated > DATE_ADD(NOW(),INTERVAL -60 MINUTE) 
AND Computers.ComputerID IN (SELECT * FROM ( SELECT ComputerID FROM commands WHERE STATUS=2 GROUP BY commands.ComputerID HAVING COUNT(*) > 2) AS subquery)
AND Computers.ComputerID NOT IN (SELECT * FROM (SELECT ComputerID FROM commands WHERE commands.status=3 GROUP BY commands.ComputerID HAVING COUNT(*) > 1) AS subquery2) 
AND Computers.LastContact > DATE_ADD(NOW(),INTERVAL -15 MINUTE)

 

It executes almost instantly.

 

This also uses the modification suggested by mcmcghee.
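
To make the logic of the rewritten condition easier to follow, here is a sketch of it in Python rather than SQL. It is an illustration only, not LabTech code: the row shapes and function names are hypothetical, and status 3 is carried over from the query without interpreting what it means. Both ID sets are built once before the row loop, which is the role the derived tables play in the query above, and every row is then checked against them.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple, Set


class Command(NamedTuple):
    computer_id: int
    status: int
    date_updated: datetime


def ids_with_more_than(
    commands: Iterable[Command], status: int, threshold: int
) -> Set[int]:
    """Mirror of: SELECT ComputerID FROM commands WHERE status = ?
    GROUP BY ComputerID HAVING COUNT(*) > ?"""
    counts = Counter(c.computer_id for c in commands if c.status == status)
    return {cid for cid, n in counts.items() if n > threshold}


def failing_rows(
    commands: List[Command],
    last_contact: Dict[int, datetime],
    now: datetime,
) -> List[Command]:
    """Return the command rows that satisfy the additional condition."""
    # Each set is computed once here, not once per row in the loop below.
    executing = ids_with_more_than(commands, status=2, threshold=2)
    excluded = ids_with_more_than(commands, status=3, threshold=1)
    window_start = now - timedelta(minutes=60)
    window_end = now - timedelta(minutes=15)
    return [
        c
        for c in commands
        if window_start < c.date_updated < window_end
        and c.computer_id in executing
        and c.computer_id not in excluded
        and last_contact[c.computer_id] > now - timedelta(minutes=15)
    ]


now = datetime(2016, 7, 6, 9, 0)
rows = (
    [Command(1, 2, now - timedelta(minutes=30))] * 3    # three executing
    + [Command(2, 2, now - timedelta(minutes=30))] * 3  # three executing...
    + [Command(2, 3, now - timedelta(minutes=30))] * 2  # ...but two status-3 rows
)
contact = {1: now - timedelta(minutes=2), 2: now - timedelta(minutes=2)}
print(sorted({c.computer_id for c in failing_rows(rows, contact, now)}))  # [1]
```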

 

SRWqiKt.png


Just FYI,

 

We created a quick plugin to assist people who do not want all the hassle of trying to set up a monitor and fixes for agents.

 

https://www.plugins4labtech.com/products/stalled-labtech-agent-detector

 

The new plugin searches LabTech for failed agents and lets you see whether they are on and how bad they are. Selecting an agent then lets you push a set of service restart commands down to agents on that network that can use RPC to kill and restart the failed services.

 

Pretty cool.

 

Enjoy.

