
Running Automate in Azure


On 2/22/2019 at 1:38 PM, Michael Martin said:

Equally, on each Azure VM we have a temp drive which has higher IOPS than premium SSD but is obviously not persistent. Can MySQL temp logs be utilised there to speed up the system?

Also, if the RAM is sufficient, can the DB not be loaded fully into RAM? What are the drawbacks / potential issues with this?

Forgive me if these seem lame or potentially dumb questions, but as the only response to date from Automate has been "computer says no" and "Azure is not suitable", I need to come up with some sort of a plan for how to move forward. Equally, the partners I know running Automate on Azure may need more information than "you should have gone with AWS".

Thanks
Michael

Michael,

The Azure temp drive has good IOPS and is not persistent, so it behaves much like a RAM disk. MySQL's temp directory can be pointed there, but if the server crashes and the drive comes back without the files that were force-closed (and possibly corrupted), the MySQL error log gets littered with errors about missing #sql files, and that can lead you down a dark path of ibdata1 inflation and long-term performance impact. Database rebuilds are frequently needed because of that inflation, so I would avoid this at all costs unless you're prepared to rebuild the DB a lot.

Other points: we suggest an innodb_buffer_pool_size of roughly 50% of the on-disk data size.

A 10GB DB on disk (excluding logs) → a 5GB MySQL buffer pool, or higher, is best.
On a 32GB server we use 21GB for the buffer pool, allowing roughly 20% overhead, since Windows and other processes (including Windows' own caching routines) need RAM.

If that 21GB buffer on a 32GB server is backing a 42GB database, it technically meets best practice.
Larger databases, or databases whose hot working set exceeds 50% of the total on-disk data size under \Labtech\, may need further optimization or more RAM.

The "buffer_pool_instances" goes along with buffer pool size. Less instances and lots of connections? Thread Lock Contention occurs.
Buffer pool instances should be 40 if 40GB buffer is used, but 39 and lower would also work..

41 instances and 40Gb buffer = no no since it doesn't match the 1:1 (or less) rule of 1GB (no less) or more per buffer pool instance rule.
Max value is 64 for this line above in quotes.
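
To make the above concrete, here is a minimal my.cnf sketch for the 32GB-server example; the values are illustrative, not an official spec, and should be scaled to your own data size:

-----------
[mysqld]
# roughly 50% (or more) of the on-disk data size, leaving ~20% headroom for Windows, IIS, and OS caching on a 32GB box
innodb_buffer_pool_size = 21G
# keep at least 1GB of buffer pool per instance (hard maximum is 64 instances)
innodb_buffer_pool_instances = 21
-----------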

I would prioritize disk performance and 4 cores first, then work on adding more RAM. A real SSD, 4 cores, and 16GB of RAM is the minimum spec I would set for servers with a 0-20GB database.

Edited by dfleisch

On 2/21/2019 at 1:59 PM, dfleisch said:

The command we give consultants to run to pre-check a server runs a test using a queue depth of 1. Microsoft is likely able to scale the Azure disks to run well at queue depths of 32 or higher, but MySQL is not going to behave like that.

MySQL will use 4k and 16k block sizes on the disk and write randomly at a single queue depth.

That's only true for an individual query. Simultaneous queries take advantage of parallelism and thus a much greater potential queue depth. What individual queries are so demanding that they require thousands of IOPS? That sounds like a query-structure issue if it's essential to daytime operations.


We are worried about deadlocking and per-query execution times, and I will explain why.
The processing loops in the software require certain queries to complete before others can be processed.

Simply put: you're waiting on one query because it locked a table to do updates. If you can only process that query at 5MB/s and it needs 40MB/s to complete in 1 second, then your server is not going to magically work until you have faster commit performance at QD1.

So while a high IOPS number at QD32 helps recovery time after a deadlock, by clearing the queues quickly, we only care about a single query's write performance in the moment of the deadlock. We don't care that you can do 1MB/s x 32 = 32MB/s at that queue depth, because we are waiting on the 1MB/s per thread that is bottlenecking the server's queues.

Take, for example, the query which rebuilds EDFs for computers. It locks the computers table, and agent check-ins (which, with 10k agents, pile up pretty quickly) are blocked until it's done.
On a server with slow per-transaction IOPS, these queries are larger and higher in row count, so they take anywhere from about 11 seconds on a fast server up to 2-3 minutes on a slow one.

How often do you want to see 20,000 requests in IIS waiting to commit to the table when these queries run multiple times a day and an EDF rebuild stretches to 3 minutes? Answer: you never want to see that many in the queue. If you do, performance at the higher queue depths certainly helps, and we are not asking you to go without acceptable performance at higher queue depths; the point is that per-thread performance is the problem, and if you don't meet our minimum you will be affected by lockouts and reliability concerns.

When that rebuild runs (like many other queries which do something similar, for example for commands), the 2-minute gaps in processing (locks) for the INSERT queries from agent check-ins are going to make the server unresponsive while the IIS queues are high, by design of IIS. Soon port 3306 is going away and only web calls will be made by the fat client, even locally.
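
For anyone who wants to see which query is holding the lock during one of these stalls, here is a rough sketch using the standard InnoDB tables in MySQL 5.6/5.7 (nothing Automate-specific); it lists each waiting transaction next to the transaction blocking it:

-----------
SELECT
    r.trx_mysql_thread_id AS waiting_thread,
    r.trx_query           AS waiting_query,
    b.trx_mysql_thread_id AS blocking_thread,
    b.trx_query           AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;
-----------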

We can get around these pauses in web-server response times by using various tweaks to IIS and MySQL, but in the end we are simply band-aiding the real issue.
While waiting for a single query to run on a base Azure instance, sure, the server can queue 32 other items at that same data rate and complete them. But unless the query is redeveloped and re-released (hint: it already has been, many times) with code that prevents locking, creates multiple partial locks instead, or doesn't lock at all, we are at the mercy of this and other queries with this design.
--The requirement is there simply for a worst-case scenario: a core query, an in-house or third-party plugin, a custom or stock monitor, or a script that is not completing quickly, for whatever reason.


**IMPORTANT** Not all environments are the same; people use various custom queries.
So, for a core-product minimum recommendation for disk performance, to set a standard for how fast the product works, we make the generalization that everyone should be able to write that 16k block at a certain speed, since everything done in the application or web interface relies on queries.
We set the 3,000 IOPS number not only because it's an easy spec to meet, but because it gives smaller and medium-sized servers a quick enough turnaround on any straggling locking or full-table-scan queries that major delays are avoided.

Power users who decide they want a 'very fast' or 'faster' product will go above and beyond this spec.

Yes, there are limits to application efficiency based on design choices, but wouldn't you rather have a server that can process its way past a worst-case scenario than one which can't?

While the product continues to evolve, we are eyeing different ways to further improve performance and speed, but the simple fact is: a single 7200RPM SATA disk can run 3,200 IOPS at 16k random write with QD=1 and 4 threads.

Why are excuses being made to work around the fact that the server does not meet this spec, which every bargain-bin consumer desktop with a single spinning disk will hit, when higher-density drives past the 1TB mark have been around since, what, 2008?

Now, keep in mind that MySQL 5.7 supports 32k and 64k pages, whereas 5.6 allows only 16k, 8k, and 4k, so you may gain some benefit there, at the expense of wasted space, if Azure scales better with larger block sizes in a QD1 scenario.
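
If anyone experiments with the larger page sizes, note that innodb_page_size can only be set when the MySQL data directory is first initialized; it cannot be changed on an existing install. A hedged sketch of what that my.cnf entry would look like (illustrative only, MySQL 5.7):

-----------
[mysqld]
# 32k pages are only supported on MySQL 5.7, and this must be set before the data directory is initialized
innodb_page_size = 32K
-----------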


Hi David, you may want to have the team re-evaluate which diskspd options are being used, to ensure the test models real-world situations. -si16k with -w100 forces a sequential write. A single spinning disk is great at sequential writes by its very physical construction: park the drive head and vomit data to the disk, like a reverse record player. Spinning HDDs are terrible at random I/O due to drive-head seek times. What we're going to see with real-world scenarios and server storage systems is random reads and writes, which is why heavy caching is used on the read side (both memory and SSD caches), and write-back buffers in storage controllers for writes (along with other optimizations by the storage controller).
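
For comparison, here is a rough sketch of the two shapes of test, using documented diskspd flags; the file path and test-file size are placeholders, and this is not the exact command set ConnectWise distributes:

-----------
:: sequential 16k writes, in the style of the -si16k -w100 test quoted above
diskspd.exe -b16K -si16K -w100 -o1 -t4 -d60 -Sh -L -c10G D:\diskspd-test.dat

:: random 16k writes at queue depth 1, closer to the single-thread commit pattern being debated here
diskspd.exe -b16K -r -w100 -o1 -t4 -d60 -Sh -L -c10G D:\diskspd-test.dat
-----------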

Under real workloads, a single HDD is more like a 170 IOPS device. We don't go spending tens or hundreds of thousands of dollars on SANs for nothing. They're needed for shared storage, and to achieve random read and write performance that individual drives cannot provide. It doesn't matter that a single HDD with no workloads can achieve a synthetic sequential write of 3000 IOPS; that's not modeling anything realistic or relevant.

This also doesn't model MySQL behavior well (specifically InnoDB tables). "InnoDB uses asynchronous disk I/O where possible, by creating a number of threads to handle I/O operations, while permitting other database operations to proceed while the I/O is still in progress." Dirty DB memory pages are not written immediately; they're asynchronous under normal operations. That's even before storage subsystem operations come into play, which will return an ACK of the write request to the OS and App long before it's actually written to any disk.
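
For anyone who wants to watch that asynchronous flushing behaviour on their own server, a quick sketch using standard MySQL status counters and variables (nothing Automate-specific):

-----------
-- pages modified in the buffer pool but not yet flushed to disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
-- the background flush rate InnoDB is allowed to use
SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%';
-----------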

Does CW have any metrics that can be shared for how many IOs per agent these core operations take on each run? And the total KB of data written per agent? Those would help us size correctly and also plan for growth.

You mention that EDF rebuilds (and other core product features) lock the Computers table and prevent agent check-ins. This is a major problem. You mention it has been partially addressed. Why can't it be more fully addressed? This is a huge scaling concern that eventually can't be solved by throwing more hardware at it. What is the next option when a single SSD can't effectively band-aid poor SQL design? Are we going to have to scale out to multiple, isolated servers, one set per X thousand agents?

Even before locks are addressed, if there are core operations that benefit from additional tuning, wouldn't that be desirable? "We can get around these pauses in web-server response times by using various tweaks to IIS and MySQL, but in the end we are simply band-aiding the real issue." I'm certain everyone dedicates an entire server (or servers) to CW Automate; there should be no problem tuning those servers to LT-specific needs. It's not a zero-sum game; we can make improvements in many areas.

Specific question: Is having enough RAM so the entirety of the LT database set can reside in memory a practical requirement for good LT performance? If so, great, everyone can adapt to that. We just need to be told. It's not a normal thing; one expects a certain percentage of "cold" data. But if there's no real "cold" data in core LT, that's fine, we can size and tune accordingly.

Last, from another thread: "The test takes advantage of this fact and makes sure to test a worst case scenario @ 16k write. We really need to be testing 4k blocks instead as after some time graphing MySQL servers running our application, the 4k block is used and NOT 16k as previously thought." How did this only just now surface? You can imagine the questions this sparks in your customers, even though I don't think the difference between 4k and 16k is ultimately terribly important in the face of order-of-magnitude architectural challenges.

This is not meant to be a "bash ConnectWise" post. We're all genuinely interested in solutions, but we're trying to profile something that's a bit of a black box unless we start doing our own SQL profiling, which we shouldn't have to. The limited behind-the-scenes information we are given is concerning: table locks that we're told to throw hardware at. Irrelevant benchmarks.


The real-world performance differences between tasks on a SAN vs an internal SSD are not likely to be easily quantifiable if the SAN is optimized for the workload.

We are talking about workloads that may or may not take advantage of certain aspects of storage. A common such workload: MySQL.

The differences should be smaller (when you spend that much on a network-attached storage solution, it should mirror an SSD, to a point). To be more scientific about it: by looking at boot times or application ready-times in seconds, an individual will generally say "OK, this will work" and go for it, but metrics dig much deeper; subjective 'feel' is not the same as benchmark numbers that show how much better or worse things actually are.

Besides all that, Automate works better or worse with different hardware, like any software.

There will only be a small difference, a couple of seconds, in OS boots and in normal, less intensive, less latency-sensitive apps, where nothing would change the user's opinion about speed. So the slower SAN may never be flagged for replacement, or worse, it was purchased in the first place with the main goal of replacing all local storage in the company, meaning the money may have been better spent elsewhere for certain applications.

From my understanding, most IT companies make their storage work for them based on application load-time goals:
Does the solution work for productivity goals (speed), cost of ownership, and disaster-recovery plans?
Most check YES and purchase on that basis, but may not deep-dive into benchmarking every app and its speed difference.
--This is a reactive situation: can we improve on x, y, and z, all while not changing the real-world feel in general?
--Since we spent $x,000 on this solution, why not move everything to it? They claim 100 users can use it without any impact to performance!

With database-type workloads, SANs no longer work 'as well' per dollar, as they incur additional latency and overhead by adding another layer on top of the on-disk storage controller: between the client and the disk there is now a CPU, RAM, a SATA controller, and software overhead.
--This reply is NOT meant to educate you on a SAN's downfalls, nor to knock SANs as a storage solution.
It is simply here to state what we see, not to quantify how much of an impact 'no longer works as well' has on performance versus another storage type.

Would you purchase a SAN because it has enough space for the entire company, meets your backup and management-GUI requirements, saves you money, AND has enough IOPS for your productivity goal, and just take the plunge?
-Most would. That's not to say anyone who does isn't smart, but there's a little more to it!

Any SAN / NAS / storage device has physical limits imposed by its design, because additional moving parts have been added.

Any way I slice it, MySQL works better if you have a drive that does fast random writes.

--So ask yourself: has a test been run to quantify the $20,000 SAN as 'faster' or 'cheaper per GB' (with redundancy built into the equation) versus another route?

Or is the choice made for convenience?

For apps needing more performance (for whatever reason), there is sometimes a convenience tax, and that tax can be new hardware or a more expensive monthly investment. In terms of optimization, I think most hardware will work fine.

Because these applications generate thousands of calls per second when busy, performance is not free. So it goes both ways: hardware and software need to work together, and performance is the result of that effort.

Specific things like high latency just make it crawl, which is why I am using a SAN as an example of what may also expose weakness in applications that rely on MySQL.
I don't think Azure specifically has high latency; it's just an optimization choice on their end.

That's what led up to diskspd being used: set a minimum spec, and if it's not met, try to meet it, with MB/s and latency together effectively determining the IOPS.
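
A rough back-of-the-envelope, assuming the 16k block size discussed earlier in the thread: 3,000 IOPS x 16KB works out to roughly 47MB/s of random-write throughput at QD1, so the MB/s and IOPS figures are really two views of the same measurement.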

If a server is under-spec, it shows a low result in most cases

Most issues come down to the hardware configuration: the SATA controller choice (which sits between the OS and the physical disk), SAN / NAS (shared storage) latency spikes at random times (which are VERY hard to track), or something else, all of which create negative impact.

Those who take the time to scientifically measure and estimate the numbers will appreciate it in more ways than I can express!
--HOWEVER, exceeding the goal is more desirable for resolving a performance issue than being under spec, just at spec, or near spec, IMO.

Comparing results from servers without performance issues, mainly in order to establish a minimum-spec goal, can help estimate the needed speed and type of disk.

With a standardized benchmark run, if a server scores 800 IOPS and, after an upgrade, scores 10,000 IOPS, we can effectively say the transaction will complete QUICKER, but defining HOW MUCH quicker, or whether the problems will go away, is not as cut-and-dried.

Here are some ideas I have:
Measure example workloads produced by HUGE (10k+ agents), large, medium, and smaller (<100) Automate servers by generating IO patterns that simulate a live box, then use some logic to suggest hardware and/or IOPS minimums based on those specs. This can ensure query completion times are met and exceeded for 'minimum vs recommended' specs.

As for MySQL RAM allocations:
Caching 90-100GB of an on-disk datadir in RAM when you have 128GB of RAM may be a reality for some, but for others it's less than ideal to have to drop $3k on RAM.
--This goes along with 'what we typically see': no, you don't need 128GB of RAM and a 100GB buffer if the DB is 100GB. Have I seen servers with that? Sure. Were they fast? Not always.

Ideally, 64GB of RAM with a 50GB buffer on a 100GB on-disk DB is great to have.
--Is it always attainable, or needed? If not, why?
Well, these questions are often answered by dollars: spend more money and get x% better performance, or spend more time tuning and logging data and get a better result.
Where does one cross the line and say "$100 is not worth 1%, I'll stop there"?
With more advanced benchmarking techniques, maybe we can define this better.

I do have a MySQL-specific suggestion for innodb_buffer_pool_size, in the form of a query I've borrowed from Stack Exchange which may help clarify:
---------------------------------------------------------------------------------------------------
Recommended Buffer Pool Size --> https://dba.stackexchange.com/questions/27328/how-large-should-be-mysql-innodb-buffer-pool-size/27341#27341
This will give you the RIBPS (Recommended InnoDB Buffer Pool Size), based on all InnoDB data and indexes plus an additional 60%.


-- Total InnoDB data + index size, times 1.6, rounded up to whole GB
SELECT CEILING(Total_InnoDB_Bytes*1.6/POWER(1024,3)) RIBPS FROM
(SELECT SUM(data_length+index_length) Total_InnoDB_Bytes
 FROM information_schema.tables WHERE engine='InnoDB') A;

-----------
'More Concise formula':
-----------
SELECT CONCAT(CEILING(RIBPS/POWER(1024,pw)),SUBSTR(' KMGT',pw+1,1))
Recommended_InnoDB_Buffer_Pool_Size FROM
(
    SELECT RIBPS,FLOOR(LOG(RIBPS)/LOG(1024)) pw
    FROM
    (
        SELECT SUM(data_length+index_length)*1.1*growth RIBPS
        FROM information_schema.tables AAA,
        (SELECT 1.25 growth) BBB
        WHERE ENGINE='InnoDB'
    ) AA
) A;

-----------------------
Find how much memory (in GB) is actually in use by InnoDB data in the buffer pool at this moment:

-- Note: on MySQL 5.7.6+ (with show_compatibility_56=OFF), query performance_schema.global_status instead of information_schema.global_status
SELECT (PagesData*PageSize)/POWER(1024,3) DataGB FROM
(SELECT variable_value PagesData
 FROM information_schema.global_status
 WHERE variable_name='Innodb_buffer_pool_pages_data') A,
(SELECT variable_value PageSize
 FROM information_schema.global_status
 WHERE variable_name='Innodb_page_size') B;

---------------------------------------------------------------------------------------------------

I was not aware that the test forces a sequential write with -w100 and the other flag.
Maybe you can suggest an alternative; I haven't had a chance to really look into this yet. Are you saying the syntax doesn't reflect real MySQL performance differences by the numbers when run on multiple servers? From what I see, all of the servers we run the test on which are slow end up below the minimum spec of 3,000 IOPS.

A test tailored to real workloads rather than a synthetic one would be nice!

Edited by dfleisch
Clarifications to better align position


The diskspd parameters being used to "prove" that customers have inadequate disk performance do not prove any such thing: they're not modeled after MySQL I/O patterns, and they're apparently not modeled after the I/O pattern that a dedicated Automate server generates. It is a stunning fact that ConnectWise's architecture, QA, and support teams do not have any existing benchmarks that simulate the demand of the application at scale.

This is why larger customers (ourselves, and large peers I have talked to) often feel like our production environments are being used to QA CW's own software. CW has apparently not dedicated a team to generating IOmeter tests, or diskspd tests, or anything else to simulate the app. CW has not asked a partner for a large production database. These come from your own statements. I mean, come on... set up a lab with 20,000 virtual desktops. Am I to believe CW's company budget can't afford that? If not, maybe Thoma Bravo has a little money to invest in the continued operations of CW Automate.

Here are the CLI flags for diskspd: https://github.com/Microsoft/diskspd/wiki/Command-line-and-parameters . ConnectWise can develop and validate proper test patterns; I'm not giving CW free labor. Have the team look at the mutually exclusive options -r and -s, see which one says random and which one sequential, and which one is in the "standard" diskspd set CW's teams currently use...

The very point of having published specifications, tested and validated specifications, is so that customers can purchase with confidence. Not buy hardware, find out things aren't running well at all, and then be told to throw more money and hardware at it. Usually, required specifications are oversized considerably, because the software vendor has no interest in getting into a finger-pointing battle, and the purchasing customer has no interest in going back to their leadership to ask for more money immediately.

Here are some things that make us large partners immensely concerned about CW software quality:

1. Locks are the death of SQL. They must be avoided absolutely everywhere they can be. Software developers working in small environments never have to worry about them, but applications designed to scale must factor them into all query design.

2. The software vendor's DBAs must be constantly monitoring the application to identify under-performing queries, as part of design and QA, not only in response to customer complaint.

3. 10,000 connections, a few thousand updates a second, are drops in the bucket for SQL databases. We should not be seeing an application fall on its face at these levels.

4. Scenarios where Automate falls apart are apparently already known, the root causes are known, but we the customers are still suffering from them. EDF rebuilds with locks. Etc. It's not steady-state operations that the partners I talk to (or we ourselves) suffer from; it's when Automate suddenly stops responding, and stays that way for seconds, tens of seconds, or minutes.

5. Asking customers to throw hardware at the problem, when the above are known, is asking them to spend their money to band-aid CW's application. And there's not even a guarantee from CW that it will resolve things! It's a message of "make your disks faster because they don't meet this arbitrary value of Apples (when Oranges are what matters), then we can talk more if you're still having problems." Where's the detailed troubleshooting? Where's the look at actual metrics to properly isolate a root cause of disk bottleneck, and even to distinguish a read bottleneck from a write bottleneck? Read vs. write have different potential solutions, which do include query changes, changes to MySQL parameters, changes to O/S parameters, and changes to quantities of RAM; it's not all about disk. Disk is the last resort for database performance; DBAs avoid it at all costs, because it's historically an order of magnitude (or more) slower than other optimizations.

(I may have hijacked this thread a bit, in that I'm not running Automate in Azure, nor have 2019 plans to do so, but I am watching closely for 2020 and beyond. On-prem absolutely will be replaced by commodity cloud for a huge percentage of workloads; it already is...)



I want to publicly thank both @dfleisch and @herrchin for their engagement in this thread and conversation. It's totally OK to disagree and have different thoughts/approaches to things.

As a long-time CW partner (Labtech user since 2011), I agree fully with @herrchin's suggestions and well-reasoned responses. I appreciate @dfleisch sharing the 'how we got here' information but, honestly, I think it's more telling of the failures on the CW side in properly setting standards and expectations for partners. I want to be VERY clear, that's not a jab or knock toward @dfleisch. It's a comment on CW, the company's, approach.

@herrchin does a great job here explaining our (partner) frustrations with the approach CW has taken on this and while it's great they came up with some benchmarks, I think it's clear from comments by both parties that the benchmarks being used aren't actually correct for the real-life workloads seen on CW Automate servers. So to steal @herrchin's example, CW is asking us to measure apples where oranges are what's needed. That's definitely doing a disservice to those partners (current and future) who are using or looking at using Azure or other services that have 'scored badly' on the apples test.

Not that anyone cares at this point, but I'll reiterate that we're happily running Automate in Azure on a single VM (8 CPU, 56GB RAM) with over 5,300 agents. And I'm about to grow that to close to 6,000 agents before the year is out. I've contemplated splitting our server roles, but to be honest, everyone that looks at our instance performance (including employees coming from other Automate shops) is blown away at the performance we have (compared to Automate in general.....we still hit the usual Automate slowdown points). Yes, it's an expensive VM. But it's how we've chosen to structure our business. It's working well for us. And it's performing awesomely. Even when paging to disk, etc. we see NO slowdown due to disk i/o. But our 'apples score' using the current diskspd recommendations is abysmal.

I truly hope some CW product managers see and engage with the information from this thread. Especially the points/info raised in @herrchin's last post. THAT is how CW can improve the product and their partner satisfaction and confidence level.

 

