dlong93

Speed optimization of On-Prem Automate


So, we seem to have weird performance issues in our on-premise Automate system: long load times, and sometimes the Automate client just stops loading anything and leaves our techs with the never-ending spinning green circle. Oddly, this happens most noticeably when our techs are in house, even when they connect using the local IP of the Automate server.

We have about 800 agents and some scripts that run fairly regularly. The system runs on a Xeon E3-1220 CPU (4 cores), 24GB of RAM, and two SSDs in RAID 1.

As of right now the system has only been up for about 30 days and is using 11.6GB of the 24GB of RAM.

 

I have worked with CW Support on some of this in the past, but I thought I'd reach out and see whether others running on-prem have the same issues, or whether there is something we could be doing to make our system run better.

 

I've attached a CrystalDiskMark test result in case our disk speed still isn't good enough. I know disk speed was one of the things an Automate support rep pointed out before, and our numbers were even worse prior to switching to SSDs.

 

Thanks in advance!

[Attachment: 2018-07-31_11-03-25.jpg (CrystalDiskMark results)]

Edited by dlong93


I'm told this is solved in Patch 9. It's terrible. We have started to use the web version (/automate) to help alleviate the issue a bit.

 

Did it seem to start with Patch 6 for everyone else?

Edited by gerrick


I am not sure if it actually started with Patch 6 or if it's just been this way for a while. It is hard to tell because I get complaints about the system (some really silly and small) on a fairly regular basis.

 

A lot of our techs (myself included) have started using the web version to get by, especially off-site, just because it moves so much faster!

Edited by dlong93

20 hours ago, dlong93 said:

I am not sure if it actually started with Patch 6 or if it's just been this way for a while. It is hard to tell because I get complaints about the system (some really silly and small) on a fairly regular basis.

 

A lot of our techs (myself included) have started using the web version to get by, especially off-site, just because it moves so much faster!

I wish the web version's timeout was better.


I've still not heard anything from CW Support, and there is still no one assigned to the ticket... it's been 3 days...


We had the same issue, tried to reach out to support, and were given comically bad answers. Our CWA server runs on a SAN with tier 1 storage (enterprise SSDs), and for the most part it lives in cache. It should be MILES ahead of what is required... however, when we talked to support, they suggested that we move it to a physical server with 5400 RPM hard drives.

I honestly thought he was joking, but he went on to show saved benchmarks that he had "proving" that 5400 RPM drives are the best for read/write at the 4k level, which is apparently where CWA lives and dies. I've been disappointed by CW support before, but that was an all-time low.

To fix our issue, we ended up: disabling all plugins > deleting all of the plugins that we didn't use (trust me, you've probably got plugins installed that you don't use) > rebooting > re-enabling the remaining plugins. Make sure that you're not overloading your server with logs (besides, if you highlight everything, then you highlight nothing). When we installed Third Wall, we logged way too much information and had to dial it back quite a bit.

Also, how much RAM is dedicated to MySQL?
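If you're not sure, you can check from any MySQL client; these are stock MySQL statements, nothing Automate-specific:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- RAM the InnoDB buffer pool may use, in bytes
SHOW VARIABLES LIKE 'max_connections';          -- the configured connection ceiling
SHOW STATUS LIKE 'Max_used_connections';        -- high-water mark since the last restart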

 

If you're still stuck, then you might hop in Slack and chat with us. I don't always check Slack, but it's almost always running, so feel free to ping me to get my attention.


I'm glad we're not the only ones having this issue. I hope that CW gets to the bottom of this. I'm relying on the web control center more and more lately.


We've had CC lockups for some time. We've recently done some things, while working with the CWA product team, that have significantly improved our day-to-day perception of performance (we're on-prem CWA 12 Patch 7, 8K agents):

  1. Disabled the Veeam plugin - it is officially not compatible with CWA 12, but we rolled the dice. We lost. It has a poorly crafted query that resulted in a recurrent, lengthy lock on our Computers table, creating a waterfall of queued agent check-ins, flooding IIS, etc.
  2. We were missing S&H plugin indexes in our DB, which they restored. This primarily affected Control Center/computer management screen load times for Super Admin and other elevated-privilege accounts.
  3. We added a property that disables the Patch Approval calculation that was occurring every 30 minutes. Ours takes over 2 hours (which is its own ongoing issue). It now runs only once, at midnight:
Property added:
Name: GroupPatching30Minutes
Value: False
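
If you'd rather add it straight to the database, here's a minimal sketch, assuming the stock Automate properties table with its Name/Value columns (confirm the schema on your own server and take a backup first):

-- Sketch only: add the property if it isn't already present
INSERT INTO properties (Name, Value)
SELECT 'GroupPatching30Minutes', 'False'
FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM properties WHERE Name = 'GroupPatching30Minutes');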

 

Hope it helps someone in some way.



Am I missing it, or is there nothing in the release notes for Patch 9 that indicates they fixed the timeout issue in the thick client?


I've been in multiple LabTech/Automate environments, 200-1500 agents. Put simply, it has always underperformed. I've had the VM server on-prem in the same building, the server in a data center, and the server in another state. I've done in-place upgrades and a solo upgrade to 12.

I would recommend only having plugins you're utilizing, and keeping them up to date. Ensure the maintenance tasks that are set up out of the box are functioning. I've never used the web version. I think if this product is used as intended in its major parts, it's worth the wait.


Ensure max connections in the SQL ini file is set to 3000, not 3 × agent count as the support tool says.
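
On a live server that's two stock MySQL statements (the change is lost on restart unless you also set max_connections under [mysqld] in the ini file):

SET GLOBAL max_connections = 3000;      -- takes effect immediately
SHOW VARIABLES LIKE 'max_connections';  -- confirm the new value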

Trim your database wherever you can, and get it as small as possible. That means 1 day of event logs, cleaning up ticket tables, truncating Windows Update tables, anything. Also make sure you have as much RAM as you can, and that the SQL ini file is modified per the install guide to suit your RAM count.
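As a rough illustration only (the table and column names below are stand-ins from memory, so verify them against your own schema and take a backup before deleting anything):

-- Keep only 1 day of event logs (hypothetical table/column names)
DELETE FROM eventlogs WHERE TimeGen < NOW() - INTERVAL 1 DAY;

-- Empty an accumulated Windows Update table outright (hypothetical name)
TRUNCATE TABLE hotfixdata;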

If your CPU is busy, add more. 

If it's possible, add SSD storage. I appreciate that it's not always available, and I was a hater on this, but the more IO the better.

It's a hungry baby that needs to be fed resources and demands love. 


If you'd like a sneak peek at the testing metrics ConnectWise will be employing in the future to help prevent configuration problems, voilà:

The command line syntax is diskspd.exe -c10G -t4 -si16K -b16K -d30 -L -o1 -w100 -D -Sh C:\testfile.dat, and it fully supports changing the path, test file size, and block size to find out what performance your disk provides at differing metrics.

You can choose read or write as well with this Microsoft tool.
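
For anyone unfamiliar with the tool, here is roughly what each flag in that string does (per Microsoft's diskspd documentation; double-check against your version's built-in help):

-c10G   create a 10GB test file
-t4     4 worker threads per target
-si16K  sequential access with a shared interlocked offset, 16K stride
-b16K   16K block size per I/O
-d30    run for 30 seconds
-L      measure and report latency statistics
-o1     1 outstanding I/O per thread (queue depth 1)
-w100   100% writes (use -w0 for a pure read test)
-D      capture per-interval IOPS statistics (for standard deviation)
-Sh     disable both software caching and hardware write caching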
------
The goal is to suggest 3000 IOPS in the 'Total' column shown below, on the 100% write @ 16k metric, as the 'minimum spec' we require for new builds.
For anyone who doubts performance, or does not believe performance is a factor in their server's issues, we first need to confirm your current Automate patch version and whether certain indexes have been added to the database to work around known issues with patching queries and networkdevices-table queries (which happen to also join the computers table). Once we rule out those factors, it should come as no surprise that disk is important to review.

Without these indexes (coming in Patches 11 and 12 of Automate 12), agent check-ins and Control Center logins can back up and spike IIS/w3wp.exe CPU usage, along with MySQLd.exe's, by way of held-open connections eating resources and responsiveness.

Some of these are known issues and are included in a current patch, a future patch, or no patch (yet).
To find out whether that is the case for you, please contact Automate Support and get a ticket in, with a subject like 'Server Performance Review', to confirm your server is optimized with all the known fixes we have.
------
To get the bigger picture, also run a 4k write IOPS test, to gain insight into the contrast between these two properties of the disk subsystem.
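That is the same string with only the block-size flags swapped; my variant, not an official one: diskspd.exe -c10G -t4 -si4K -b4K -d30 -L -o1 -w100 -D -Sh C:\testfile.dat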
Below, I've run two 16k write tests, the same tests the validation would run before a server install.

*You must edit the C:\testfile.dat path to point at the drive you wish to test, no matter where diskspd.exe itself is located; the exe's location won't change the results (the test runs on the drive you specify via the path in the command-line syntax).

[Screenshots: diskspd test results, SSD test on the left and HDD test on the right]
On the right is a benchmark of a directly connected (SATA 6G) 1TB 7200RPM 'performance' Hitachi Deskstar spinning disk.
On the left, a Samsung 850 EVO 250GB running the test virtualized.

Make note of total IOPS in these screenshots, under the 'Total: I/Os per second' output of the test result screen.
From these results, the HDD scores ~3,716 operations per second in total when writing 16k blocks, while the SSD does 12,264 IOPS on a 16k write.

Keep in mind that varying the test file size via -c10G (the default 10GB test file) changes little for SSDs, while HDDs can be up to 30% faster on the first sectors of the disk than the last. So on new drives, use a size closer to the total disk space to see the drive's average IOPS overall, at the expense of a longer test run.
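For instance, on a mostly empty 1TB spinner, something like diskspd.exe -c800G -t4 -si16K -b16K -d30 -L -o1 -w100 -D -Sh D:\testfile.dat would exercise most of the platter (the 800G size and D: path here are purely illustrative).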

The test will give you an idea of whether your disk meets our requirements for new server builds.
Example: any server scoring below 3000 IOPS may run into EDF rebuild, search rebuild, patching rebuild, nightly maintenance, or intermittent performance issues as varying queries run throughout the day. Some of these trace back to IO issues, others to optimizations we must apply through patches.

This is directly linked to high average CPU usage on the DB AND web server, due to 'managing the idle time' between IO requests and transaction commits.
Some of this is preventable, and some of it we can rewrite code to optimize, but get this out of the way first: run a benchmark!

[Attachment: diskspd.exe]

Edited by dfleisch


Aren't the values of -t4 and -o1 artificially limiting, failing to exercise the system's real-world potential and potentially producing erroneously low numbers? For example, in one test using -t4 and -o1 I might get 1,700 IOPS and 26 MBps, but with more realistic values of -t16 and -o16 I might get 7,000 IOPS and 110 MBps.
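(For reference, that higher-parallelism run is the same string with just those two flags changed: diskspd.exe -c10G -t16 -si16K -b16K -d30 -L -o16 -w100 -D -Sh C:\testfile.dat.)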


Yes, by design. If a deadlock happens, MySQL waits on the write at queue depth 1 to complete and then releases, allowing others to process.

The test takes advantage of this fact and makes sure to test the worst-case scenario @ 16k write. We really should be testing 4k blocks instead: after some time spent graphing MySQL servers running our application, it turns out 4k blocks are what's used, NOT 16k as previously thought.

