timwiser

Running Automate in Azure


Has anyone here taken the plunge and moved their Automate server to Microsoft Azure?  We're considering it and would like to see how other partners are finding it.

We've got just over 4,000 agents and have plenty of scripts running in the system.

I'd appreciate some feedback if you're using this platform, including the number of agents you have in it and what spec/level of Azure VM you're running.  Also, are you running a split server or have you got all your eggs in one basket?


We've been running Automate in Azure for over 3 years now, right at 4,000 agents today. We're currently on a DS3v2 instance type. I'd love to have more RAM, but it starts getting pricey, and so far this has worked quite well for us.

You absolutely MUST run on an instance with SSD storage to get acceptable I/O for Automate. We took 3 SSD disks and pooled them with Storage Spaces for performance (essentially RAID 0).
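For anyone wanting to replicate that, here's a rough PowerShell sketch of the striped-pool setup (pool and disk names are placeholders, so adjust for your environment):

# Find the data disks that are eligible for pooling (the attached SSDs)
$disks = Get-PhysicalDisk -CanPool $true

# Create the pool, then a 'Simple' (striped, RAID 0-style) virtual disk across all of them
New-StoragePool -FriendlyName "AutomatePool" -StorageSubSystemFriendlyName "Windows Storage*" -PhysicalDisks $disks
New-VirtualDisk -StoragePoolFriendlyName "AutomatePool" -FriendlyName "AutomateData" -ResiliencySettingName Simple -UseMaximumSize

# Then initialize, partition, and format the new disk as usual (Initialize-Disk / New-Partition / Format-Volume)

Just remember 'Simple' means zero redundancy - lose any one disk and you lose the volume - so backups are non-negotiable.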

Let me know if you have any other questions. I'm happy to share our lessons learned.

On 10/30/2017 at 7:52 AM, tlphipps said:

We've been running Automate in Azure for over 3 years now, right at 4,000 agents today. [...]

Sorry to resurrect an old post, but we're contemplating shifting our Automate workload into Azure. If you had to redo your setup, what would you use today? I was looking at the E2s v3 model (2 cores, 16GB RAM). We have approx. 1,300 endpoints. We'll be going with the 3 SSD disks as well.

Also, one thing I've always been confused about: is support for Azure covered by the Microsoft partner hours, or is it necessary to purchase standard support? It's an extra $100/month, so it adds up.

You still happy running it in Azure?

Any other advice you can give would be great.

Thanks!

11 minutes ago, avdlaan said:

Sorry to resurrect an old post, but we're contemplating shifting our Automate workload into Azure. If you had to redo your setup, what would you use today? [...]

We still run in Azure and really like it. It's still pricier than doing something on-prem, but it's tremendously flexible, and I swear the performance of Azure VMs is faster than the "same specs" on a physical box or a Hyper-V VM on-prem. We're now above 5,000 agents and had to move up to a bigger VM instance type to get more CPU and RAM in the mix (and we probably need to start exploring a split-server config).

I'd look into "managed disks" if I were starting over now, as I understand you can get similar performance without needing Storage Spaces striping across multiple SSDs (which is how we're currently set up). Basically, find guidance on running SQL Server IaaS in Azure and assume you need to follow that same basic setup.
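If you do go the managed-disk route, here's a minimal az CLI sketch of what I mean (resource group, disk, and VM names are made up for illustration):

# Create a 1TiB Premium SSD managed disk (1TiB lands in the P30 performance tier)
az disk create --resource-group automate-rg --name automate-data --size-gb 1024 --sku Premium_LRS

# Attach it to the Automate VM
az vm disk attach --resource-group automate-rg --vm-name automate-vm --name automate-data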

For support: to get actual Azure support, you need to pay for it. But I hear you can buy it, use it for a few days or weeks, then cancel, and they'll refund the unused time. We haven't needed it, so I haven't tried that.


Hi Folks,

We're trying to deploy Automate to Azure - we'll have a max of 1,500 agents for the next year or two - and we built a DS12 with managed Premium SSD.

ConnectWise came along to do the install and said the disk performance is not good enough for Automate.

They run a diskspd command. We went to Microsoft, who say that command is not valid for Azure; they ran their own command and we are getting the expected performance.

Anybody have any thoughts on this? I see tlphipps is running 5,000 agents on Azure without an issue.

It all seems mad to me.

Cheers

Michael


Actually really glad somebody else asked this. I recently ran the diskspd tool against our Azure instance as well and got results well below what CW recommends. But we're now at 5,400 agents and still see really great performance, IMHO. I mentioned previously that we're using striped disks in storage pools, and that we increased our instance size. We're up to an instance size with 56GB RAM and have 50GB available for MySQL. After some DB tuning to clear out crap, our DB size hovers around 46GB or so, which means we're basically keeping the whole thing in RAM. When I look at disk performance in Resource Monitor, I rarely see spikes or queue build-up, which says to me that we don't really have any disk performance issues despite what the CW-recommended tool is saying.
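For anyone curious how the "whole DB in RAM" bit works mechanically: it's the InnoDB buffer pool doing the caching. A rough example of the relevant my.ini lines (values illustrative rather than our exact config):

[mysqld]
# Cache up to 50GB of table/index data in RAM; a ~46GB database then fits entirely
innodb_buffer_pool_size = 50G
# Multiple buffer pool instances reduce contention on a pool this large
innodb_buffer_pool_instances = 8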

For now I'm happy where we're at and after doing some DB cleanup and getting all the latest patches for CWA installed, our performance is better than ever. I'm definitely starting to look at split-server config as we continue growing. But based on what I've seen/heard from others on performance, I'm not sure I really expect much gain over what we have right now.

14 minutes ago, tlphipps said:

But we're now at 5,400 agents and still see really great performance IMHO. I mentioned previously we're using striped disks in storage pools. I also mentioned increasing our instance size. We're up to an instance size with 56GB RAM and have 50GB available for MySQL.

Would you mind sharing what you're using? Michael Martin was using a DS12, and I'd earlier spec'd an E2s v3 (2 cores, 16GB RAM). What are you running?

I'm just trying to get a handle on what others are doing as Azure spec'ing is still a bit of a mystery to me.


Sure. We're currently running on a DS13 (8 vCPU; 56GB RAM).

I'm interested in some of the newer sizes, and especially in managed disks, but sadly can't easily switch to either of those.
So if somebody else wanted to do some testing for me… that'd be awesome!

36 minutes ago, tlphipps said:

Sure. We're currently running on a DS13 (8 vCPU; 56GB RAM).

Thanks for that!

Wow, I see that's $820/month just for the server. That's pretty pricey. For that money it seems a heck of a lot cheaper to do it in-house, but perhaps that's short-sighted...


Yeah, that's often a topic of conversation here.

For us, we have to factor in hardware cost and colo cost (or redundancies added to the office); we're already a multi-state organization, so Azure helps with that some; and we're pushing clients to Azure, so being 'all in' ourselves helps a bit. And in the 4 years we've been in Azure, we've not had a single outage. Not that it CAN'T happen, but it's awfully nice never really worrying about our internal 'infrastructure.'

Pricey for sure, but as long as we're pricing our services correctly, if you spread that cost amongst 5,400 agents, it's not all that bad.

I'm seriously considering doing a reserved instance for this to really cut the cost down.


Same here. We couldn't do what we do without the awesome sharing that takes place on this forum and in Slack. Anywhere I can give back in any way, I'm 100% in.

And I also love seeing/hearing what others are doing with Azure too so I know I'm not on an island!


@tlphipps, do you think you still need Storage Spaces with managed Premium SSD?

We had two managed Premium SSDs - C: was 127GB and D: was 1TB. We contacted Microsoft and they ran diskspd and got 125 MiB/s, whereas when ConnectWise ran their command they got 15-20 MiB/s. MS said the CW command was not suitable for cloud/virtual disks.

We're stuck between a rock and a hard place: CW won't proceed with the install, and MS says nothing is wrong. My gut says there's nothing wrong either, and with 24GB RAM and a max of 1,500 agents I just can't see it being an issue.

The CW command is: diskspd.exe -c10G -t4 -si16K -b16K -d30 -L -o1 -w100 -D -Sh c:\temp\testfile.dat
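For reference, here's my reading of what those switches do, from the diskspd documentation (so treat it as my interpretation):

diskspd.exe
  -c10G    create a 10 GiB test file
  -t4      run 4 worker threads
  -si16K   sequential access through a shared interlocked offset, 16K stride
  -b16K    16K block size (InnoDB's default page size)
  -d30     run for 30 seconds
  -L       measure latency statistics
  -o1      1 outstanding I/O per thread, i.e. queue depth 1
  -w100    100% writes
  -D       capture IOPS statistics per interval
  -Sh      disable software caching and hardware write caching
  c:\temp\testfile.dat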


My understanding is that 'managed disks' do NOT need Storage Spaces striping to achieve great I/O, but I don't have any VMs using them at this point to confirm that.

I CAN confirm that I get pretty dismal results when running that command from CW on my Azure system - basically reporting 15.25 MiB/s. I can also confirm that with 5,400 agents on our system, there's NO WAY that's indicative of our actual performance. Sorry I can't help with CW refusing to move forward, though; no real ideas there.


The command we give consultants to run to pre-check a server tests with a queue depth of 1. Microsoft can likely scale the Azure disks to run well at queue depths of 32 or higher, but MySQL is not going to behave like that.

MySQL uses 4k and 16k block sizes on disk and writes (randomly) at a single queue depth.
The test we give you validates whether the server can sustain the required IOPS on a single MySQL query, something that varies widely across hardware setups.
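If you want to sanity-check the 16k figure on your own server, InnoDB's page size is exposed as a server variable (16KB is the MySQL default), e.g.:

mysql -e "SHOW VARIABLES LIKE 'innodb_page_size';"
# expected on a default install: innodb_page_size = 16384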

From experience, we are not able to do an EDF rebuild, search rebuild, or other intensive tasks below a certain number of IOPS on that 16k random-write, 4-thread diskspd test, and as a result server-down tickets come in with complaints that the application is not usable.

To give you some history: back when Azure came out, people were running the base storage plans and couldn't meet spec at these 16k and 4k block sizes, but if you've upgraded to SSD on Azure there's a better chance you'll pass.

The issue is that even Azure's striped SSD setups are still many times slower than a single SSD drive, which today is very inexpensive.
If you're paying for Azure and getting less performance than a $100 piece of hardware, I find that a waste of resources - EXCEPT that cloud hosting may suit a company's situation better because of the managed OS/hardware, ISP, and power/heating/cooling/redundancy guarantees.

Because of this, I UNDERSTAND why people are doing it, but there has to be some sort of minimum spec, because most server-down tickets boil down to 'why doesn't the application work?'. Answering 'why can't the server perform well on a single thread even though it scales well at a high queue depth?' is generally the answer to the problem: the storage is not optimized for the type of load the application generates.

There have been talks about standardizing a disk performance test for some time here at ConnectWise's Automate division.
To define minimum specs, we have to cut off the slower servers somewhere.

Yes, I've seen systems showing great IO perform slowly on an EDF rebuild or other queries due to row counts / optimizations we may need to add.

Yes, I've seen slow disks work for the intended purpose of serving MySQL queries quickly.

There are always outliers and scenarios where you may be able to get by. HOWEVER, in the worst case, when a query runs slow, IIS (the web server) can queue up a bunch of connections (due to a deadlock on, say, the computers table while an EDF rebuild runs), and the agent check-in requests filtering through the web server WILL stall out the GUI, since once 100 requests are pending (waiting on MySQL) further connections go into the queue.
BECAUSE OF THIS, the problem may not be obvious until a long-running query comes along, and as you grow it can get worse and worse.

The diskspd syntax pasted above is the best defense we have against your server not working correctly long-term: simply find a provider that can sustain, at minimum, the IO we suggest.

Amazon EC2 instances using EBS-optimized volumes run at about 3,200 IOPS on that test, and we've set the minimum at that spec, since single 7200 RPM SATA disks also post these (or higher) numbers.

It sounds like Microsoft is trying to convince people that the config they've chosen for their platform and storage pools is 'best', yet from your note the Automate server requires 3 SSDs JUST to sustain the performance our apps need, while a single ~$100 Samsung SSD outperforms the 3-SSD stripe ten times over.

The important part is this: they are artificially throttling the IO and get to choose what qualifies as an 'SSD'.
So while the application may work after 3 SSDs are combined, the results of the IOPS test we run may still not be optimal for Automate's application calls, even if it 'works' to a point.

Heck, at that ~$800 price, why not build a server with hardware that benchmarks at 33,000 IOPS (NVMe SSD) for pennies by comparison, ship it to a datacenter, and call it a day?
The minimum spec is NOT voodoo to reach, and if you are below 3,000 IOPS, you have no excuse for going against experts who have direct experience with the application's design.

Azure is simply not the best cost/performance choice for Automate. Can you get it to work? Maybe. How well? A little more time wasted troubleshooting may answer that once the load scales with company expansion.

Be careful out there with what you buy into!

Edited by dfleisch


@dfleisch I greatly appreciate you jumping in with all of this incredibly detailed and helpful information. It's great to see CW employees contributing more and more in these forums.

I don't doubt that MS is doing 'trickery' with how they implement and present some of the Azure virtual hardware.

I will say though that I thought some of your comments toward the end of the post were maybe a bit more snarky than they needed to be. Or maybe I just read too much into it. I completely respect your expertise and appreciate the technical detail. At the end of the day, we're currently running over 5,300 agents on a single Azure VM and we're quite happy with the performance. But as we continue growing I'm certainly always evaluating our choices of on-prem/cloud/whatever to make sure it works well for our business.


@tlphipps

Please forgive me if I came across that way.
I read through the thread and wanted to put out a message to everyone first: my goal isn't to insult anyone over their choice of provider - heck, that's a business decision more than anything.

This started as a small post, but I want to make a few things clear about our approach and the reason these tests are run - *to help you guys stay up and stable*, for one.

The goal of my last post was to call out what works and what doesn't, and to forewarn anyone, new or seasoned, who may not be used to looking at these numbers and troubleshooting an issue caused by IO.

First, the volume of IO calls made by Automate is massive. The number and variety of those calls is complex enough to create issues, but they can be hard to identify because each server differs in how the problem exposes itself and manifests.
Your first line of defense is to meet the requirements.

Automate 12.x and 2019.x run a variety of services that require a certain minimum spec. The CWAFileService, Solution Center, the LTAgent startup routines, and more all require a low-latency, high-throughput disk. Most IO issues first surface when the server is rebooted, via timeouts while all agents are ALSO trying to check in - I've seen 10,000 requests in web-garden-type IIS queue buckets, which eventually creates a 100% CPU problem and slows the queries further.

If your server cannot process that many IOPS per minute, it may take 30+ minutes to 'settle down', or at worst require a 'rename' of the eventlogs table, before the LTAgent startup finishes.
Without an LTAgent startup/POST you may never get the server online; these outages happen because the server couldn't process X data in X time.

Add custom monitors and the wild variety of query calls (which differ per company) into the mix, and you've got a recipe for disaster wherever IO is not plentiful per thread and a thread deadlocks.

If you've bought into Azure and have it working, I am not going to tell you to move off it!
Running hardware hosted by the makers of Windows can provide lots of benefits and advantages vs. other providers.

I just don't want to see a partner reaching out to Microsoft and going through all this trouble to find out what's going on with performance, when a design choice at the provider (one they will defend) has optimized the IO for purposes other than Automate's needs on the storage plan that fits the partner's budget.

It's good to have data showing the SSD stripe options may provide the performance required at this instance size/type, as previously this was very cost-prohibitive.
What I'd like to raise awareness of is that it takes very little to create a server-down situation by backing up the web server once IO demand exceeds availability on a *per-thread performance* basis.

That threshold creeps closer as the DB's table row counts and data-set sizes grow, and for what you're spending, you want a much larger window of headroom for increased load IF the numbers are anything like what I saw back when Azure was released.
To give y'all an idea how bad it was, I've attached two screenshots.

--The first is an Azure server from 2015; note the lack of write speed and the poor 4k performance, with results that would likely fail the diskspd test.
--The second is a newer Azure server from 2018 on a higher IO plan, but still showing only 60MB/s sequential (4TB SATA drives do 150MB/s today); the 16K write MB/s looks better here.

Latency measurements were not taken.
Since these screenshots pre-date the diskspd test we use, the combined throughput-and-latency measurement we now use to derive IOPS wasn't available, so we can't get a completely clear picture.

Personally, I have not run the numbers on a newer 3x SSD Azure setup, and I may have gotten a bit ahead of myself.
The current VMs may have more headroom thanks to the improvements MS has made; all I am saying is: just be careful!

If the plan works for your business and provides features and security you aim for within the budget with performance looking good, that's all that matters.

In my eyes, the bottom line is:
1. We've troubleshot Azure performance issues since day one and got a bad taste for their servers as a result. Azure has improved, but how much is hard to say, as I don't have raw numbers; I'm speaking in general terms (it would be good to run benchmarks on the different tiers).
2. I dislike the trend of a growing company spec'ing its server to the 'recommended' specs on our documentation site, then later going down due to an infrastructure scaling / hardware problem.

One of the great things we've done of late is to provide a standardized benchmark that sets an IOPS bar, in the hope that you'll keep that headroom.
Happy MSPing out there!

Azure Bad Performance.png

YourDisk-Azure.png


Hi dfleisch

Thanks for the response - it frustrates me a little that we have to go to an unofficial channel such as MSPGeek to get a technical response from Automate, but anyway.

Azure is unfortunately a no-brainer for most MSPs: we get credits for being a partner, and more often than not we're selling it. Equally, buying our own tin is, in my view, old hat and not a good way to do business: you have to manage the hardware, manage DR/BC/backups, monitor hardware, monitor internet lines, and have failover firewalls and switches - or you go to the cloud and forget all about it.

With regards to the MS vs. Amazon debate (because that's what this is really about), MS came back with this (part of their response):

"Thank you for sharing the information. Your software vendor’s DBA team is right when he mentioned “Amazon's EC2 instances that use EBS Optimized volumes are going to run about 3200 IOPS on that test”. However, this is due to the burst mode for General Purpose (gp2) storage type for EBS where IOPS is the size of the volume (you configure during creation in GiB) * 3, with minimum of 100 IOPS and max of 10K IOPS. But due to burst mode, every gp2 volume regardless of size starts with 5.4 million I/O credits at 3000 IOPS which used in full capacity will last you about 30 minutes. Workload will continue to see very good IOPS as long as credits are replenished faster than they are consumed. But if IOPS credit is used up, then it will be slow. Here is the doc for your reference https://aws.amazon.com/blogs/database/understanding-burst-vs-baseline-performance-with-amazon-rds-and-gp2/. On the other hand, if you go with the storage option of EBS, you can actually configure how much IOPS you want in your VM but will have additional price. "

My question to you, leaving aside all of the stuff above, is about the block sizes (16k vs. higher): is this a MySQL issue or a consequence of how Automate is written? For example, if we increase the block size on the diskspd command to 5M, we are fine.

Equally, every Azure VM has a temp drive with higher IOPS than Premium SSD, though it's obviously not persistent - could MySQL's temp files be placed there to speed up the system? (See the sketch below for what I mean.)
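Something like this in my.ini is what I have in mind, assuming the ephemeral drive is D: (the directory would need to exist and be writable by the MySQL service; temp tables are disposable, so losing them on a redeploy should be harmless):

[mysqld]
# Put MySQL's temporary tables/files on the fast, non-persistent local SSD
tmpdir = D:/mysqltmp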

Also, if there's sufficient RAM, can the DB not be loaded fully into RAM? What are the drawbacks / potential issues with that?

Forgive me if these seem like lame or potentially dumb questions, but since the only responses to date from Automate have been "computer says no" and "Azure is not suitable", I need to come up with some sort of plan for moving forward - and the many partners I know running Automate on Azure may need more to go on than "you should have gone with AWS".

Thanks


Michael



Hi Michael, 

It seems to me that the MySQL server instance is programmed to use 4k and 16k block sizes for reading/writing table data.
I ran HDTune Pro's Disk Monitor tool to see which block sizes were most common while the application was running.

Give it a shot and see if you can isolate the same!
https://www.hdtune.com/download.html

While our other services do things besides talk to SQL, they still mostly rely on MySQL for a response to a query, and that is the bottleneck.
We need to do more testing to confirm the finer details, but this should give you a general idea of what the software's calls use.

Edited by dfleisch

1 hour ago, Michael Martin said:

equally, buying our own tin is, in my view, old hat and not a good way to do business: you have to manage the hardware, manage DR/BC/backups, monitor hardware, monitor internet lines, and have failover firewalls and switches - or you go to the cloud and forget all about it.

Sorry, Michael, but that is not true, and it's one of the biggest misunderstandings of the cloud. Simply moving your infrastructure to the cloud does not eliminate the need for DR/BC/backups. You still have to back up the servers, you still have to monitor those backups, and you still need a DR solution in place in case your server dies in the cloud. Thinking you can do away with all of these just because you moved your server to Azure is short-sighted and false.

Call Microsoft and ask their support whether that new VM you just spun up in Azure is backed up, and they'll say no. What happens if you hit data corruption or ransomware - how do you recover on a cloud server with no backup? You still need a backup solution in place.

So many people want to claim that moving servers to the cloud removes all the headaches, and that's simply untrue. We had more server outages when our servers were in Azure than when they ran on our own infrastructure. On top of that, the spec of server required to run Automate in Azure often makes the VM much more expensive than hosting it yourself.

I'm not anti-cloud, but we have to be careful that we realize it doesn't always make sense for everyone in every instance.

29 minutes ago, avdlaan said:

Sorry, Michael, but that is not true, and it's one of the biggest misunderstandings of the cloud. Simply moving your infrastructure to the cloud does not eliminate the need for DR/BC/backups.

Hi avdlaan,

Yeah, I'm aware you still need them - perhaps my wording was not clear (or wrong). However, running BC/DR and backups in Azure is a doddle, plus you can monitor, patch, and get performance stats all from within one console.

It makes life a ton easier. Plus, it's only cheaper in-house if you already have the infrastructure, in which case I'd agree; but if you don't have virtual hosts with redundant storage, multiple switches and firewalls, backup internet lines, etc., then firing up a server in Azure or equivalent is a ton easier.

Edited by Michael Martin

On 2/22/2019 at 3:47 PM, Michael Martin said:

Yeah, I'm aware you still need them - perhaps my wording was not clear (or wrong). However, running BC/DR and backups in Azure is a doddle, plus you can monitor, patch, and get performance stats all from within one console.

Thanks for the response Michael, that makes more sense! Certainly I can appreciate that the management of it is easier.

