
dfleisch last won the day on November 14 2018

dfleisch had the most liked content!

Community Reputation

2 Neutral

My Information

  • Agent Count
    Less than 100

  1. dfleisch

    Running Automate in Azure

    The real-world performance differences between tasks on a SAN vs an internal SSD are _not_ likely to be easily quantifiable IF the SAN is optimized for the workload. We are talking about workloads that may or may not take advantage of certain aspects of storage; a common one is MySQL. The differences should be small (when you spend that much on a network-attached storage solution, it SHOULD mirror an SSD, to a point).

    To be more scientific: after reviewing the impact on boot times and application ready-time in seconds, an individual will generally say "OK, this will work" and go for it. Metrics dig much deeper than that, though, and a subjective 'feel' does not show what benchmark numbers do: how much better or worse the storage actually is. Besides all that, our application works better or worse with different hardware, like any software.

    There will be only a small difference, a couple of seconds, in OS boots and in normal apps that are less intensive and less latency sensitive. Nothing changes in the user's opinion about speed, so the slower SAN may not be flagged for replacement. Even worse, it may have been purchased in the first place with the main goal of replacing all local storage in the company, meaning the money may have been better spent elsewhere for certain applications.

    There are two points, IMHO, after 6+ years of seeing Automate servers. One of them, for price/performance under MySQL: do not pick a SAN, NAS, or Azure unless you'd like to spend extra money for less performance in Automate. This is NOT a fact; it's a generalization based on real-world experience seeing Automate servers of ALL types.

    From my understanding, most IT companies make their storage work for them based on application load-time goals: does the solution work for productivity goals (speed), cost of ownership, and disaster recovery plans?
    Most check YES and purchase on this basis, but may not deep dive into benchmarking every app and its speed difference.
    --This is a reactive situation: can we improve on x, y, and z, all while not changing the real-world feel in general?
    --Since we spent $x,000 on this solution, why not move everything to it? They claim 100 users can use it without any impact on performance!

    With database-type workloads, SANs no longer work 'as well' per $, as they incur additional latency and overhead by adding another layer on top of the on-disk storage controller: between the client and the disk there is now a CPU, RAM, a SATA controller, and software overhead.
    --This reply is NOT going to educate you on a SAN's downfalls, nor knock the SAN as a storage solution. It is simply here to explain what is fact based on what we see, NOT to quantify how much of an impact 'no longer work as well' has on performance vs another storage type.

    Let's just start here: would you purchase a SAN because it has enough space for the entire company, meets your backup and management-GUI requirements, saves you money, AND has enough IOPS for your productivity goal, and then just take the plunge?
    -Most would. I'm not saying anyone who does is not smart, but there's a little more to it!

    Understand: a SAN, a NAS, or any storage device has limits that are physical constraints of its design, due to the additional moving parts being added.
    Do I understand MySQL to a tee? Definitely not, nor do I understand 100% of the time WHY faster IO may help.
    I DO understand: MySQL works better if you have a drive that does fast random IO writes.
    --What are you LTAdmins basing the 'it meets IOPS specs' answer on when choosing a storage solution? Likely real-world performance, right? What if the Automate application has a code issue, and something helps that code issue?
    Would you negate all these advantages and just purchase a fast(er) disk to keep the server online, or wait for the software vendor to develop a fix that works with a hardware choice that was misaligned with requirements? Is there a problem with the software that is just plain annoying and cannot be fixed, or can one go out and buy a piece of hardware that will make it work? Most people experiencing the issues JUST WANT IT FIXED.
    --I know of maybe 1-5 companies that WANT TO UNDERSTAND before it's fixed, and these companies may be careful in the design phases once the bottleneck is explained. Others just use the software, outline requirements, and if they are roughly met, proceed with the roll-out.

    This is a standard human thing: check, and deploy if validated. While some LTAdmins are pickier and more into hardware and speed than others, most of us just read a doc and go ahead with it. BIG selling point on APPLE products: it just works, right? Every single application out there can be more complex, less complex, faster, slower, or in between. Point being, EVERY SINGLE APPLICATION is unique, and yet APPLE sells a lot of hardware just by saying it will work without issues. It doesn't matter if you're smart, dumb, new, or old to the world; that's the best part, it just works...
    SO, if you want it to JUST WORK, work around the bug for now and upgrade the hardware. Maybe we should be more like APPLE and come out with these items. I don't think anyone would be mad about us rolling out a better benchmark tool ; )

    --So ask yourself: has a test been run to quantify the $20,000 SAN as 'faster' or 'cheaper per GB with redundancy built in' than another solution, like 1 SAN and 1 cheap SSD for the apps needing performance? Do you benchmark the real-world performance change for 'most apps' and then stop, or do you go through ALL of your applications one by one and compare them side by side, like Anandtech.com does in their reviews?
    WELL, the answer is: you shouldn't need to do ANY of this 'thinking'. It is the software company's goal to design software that is easy to use, is SOMETHING YOU WANT TO USE, is easy to source hardware for so it runs efficiently, and hopefully has very small or non-existent bugs that might change your opinion of how you use the software. IF the software is becoming a problem performance-wise or bug-wise and it's no longer meeting your goal, a vendor may say something like what we have here:
    --> We know about a problem, and the only way to get around it is to do this.
    Do you as a consumer conform to the request and follow the suggestion, or do you keep the current setup knowing its downfalls? Most of the audience would reasonably attempt to meet the request, and will certainly let us know if it doesn't work for them!

    It would be nice if the company was working towards supporting ALL setups, and we'd like to think THIS IS HOW IT SHOULD BE! In terms of optimizations where 'MOST HARDWARE' will work, we are close but still far away, as these applications create thousands of calls per second at 10k+ agents. I think any vendor making MSP software will have a challenge staggering that IO count; however, with MySQL as the core of the product we have additional complexities. Specific things like high latency just make it crawl. So again, do you work around the issues until a patch lets your SAN process that work properly at that IO rate, or do you run an SSD in the meantime, which is a known solution that works? Does the SSD at some point show the application's weakness?
    --I think we are closer than ever, as row counts go up and MySQL is ONLY able to process certain 'full table scan type operations' so fast.

    This brings it back to: well, we need to optimize. Sure we do, and we always will be optimizing, but a TIMELINE for specific items is just NOT there YET. This brings us to a classic argument: My hardware is fast, why is the app slow?
    We meet spec. How can I fix this?

    As we start answering questions, it becomes a bit less clear why we even suggest anything hardware-wise in the first place, instead of just asking partners to run a tool that tests hardware speed in various aspects before the form can be filled out and Automate purchased! Sure, someone can fake the test and run it on the desktop they are filling out the form on to reach the 'next step' of, let's say, an 'am I eligible' form that validates the server before purchase, but it's up to the partner to ACTUALLY validate on the server and not fudge results.

    It is up to the end-user to accept any bugs, defects, and/or problems, including employing work-arounds that are well known to help, while it is up to the VENDOR to either fix them quickly or expect loss of subscribers if the work-around is not attainable by the customer base.

    The conversation on a typical performance review ticket goes something like this (Q = partner, A = MSP answer):
    --Q What do you mean, change the disk? The DB server meets the specs on your page!
    --A Your server says it has an 11,000 IOPS $200 SSD, and by all means it should score that number, but we got 800 IOPS.
    --Q We spent $20,000 on the SAN and JUST upgraded the drives to Enterprise class SSDs, AND have 10 of them!
    --A Configuration or environmental factors may be at play here; the SAN adds complexity, and performance may be slower under these workloads vs a single SSD. You could also reduce the data footprint by reducing row counts in tables.. what do you think?
    <Partner moves SSDs over to the Hyper-V Host and sets up a direct cable from the SSD to the SATA Controller>
    --Q How is performance now? We didn't want to disable features in the software, so we upgraded the hardware.
    --A Latency is now lower but MB/s is still lacking; you're only @ 1,100 IOPS now. Can you tell me a bit more about the hardware running the server? What SATA controller do you have?
    --Q How can we get better performance? On paper we have everything your docs asked for, and we don't see any mention of an $800 RAID card being needed, so this should 'work'.
    --A I am not in front of the server to consult on hardware setup or design, but I can make a suggestion based on the information you give me. Can you tell me what RAID card you currently use for the Hyper-V Host?
    --Q We run a MEGARAID 1000.
    --A Ah, that's why! The MEGARAID 1000 has no write cache and is considered slow by others on the forums, check this link out: www.serverraid0000.com/forums/MEGARAID1000-slow.html

    In this scenario, the partner simply was not aware of the problem and naturally believed he met spec, as he DID based on our docs. At that point, all we have to offer is: upgrade your RAID controller and the issue should go away; no one with a MEGARAID 9000 series card sees this. Now we are making hardware suggestions for software slow-downs on the basis that it WILL help, but we cannot quantify HOW MUCH it will help, and cannot GUARANTEE it will FIX anything.

    The partner is now under the assumption that it is our responsibility to define more detail on our requirements page, and that the vendor (us) has direct causation regarding the performance issue. Yes, we should do this, and yes, we should have that more clearly defined, SO I think this is something useful to have documented. However, the partner now has a bad taste because we explained WHY it broke but didn't fix what caused them to make a ticket. It is now on US to resolve a problem that otherwise would have been fixed by a doc update or a hardware min spec being set.
    ..If the SOFTWARE was better optimized, it'd work fine though.. this is also true, can't argue there.

    As you can see, we eventually get into a tough situation where yes, it's the software's fault, but yes, a hardware change fixes it and makes the real-world performance acceptable *to a point*.
    Commonly, the partner can then pose the conflict of interest in favor of or against Connectwise: do I upgrade the hardware, or do I wait for the software to be fixed (or ditch it altogether)? What can Connectwise do to standardize the requirement goal IF the hardware configured per our KB is not performing as expected?
    Make a test! SO, that's what led up to Diskspd being used.

    CURRENTLY, the Automate product's KB site outlines requirements and suggests the software works fine for 'a certain agent count' only when running on certain hardware specs: we EXPECT y speed on said hardware for the software to work. *IF the hardware runs at 30% of that speed, even though the same hardware out of the box in a perfect environment benchmarks @ 100%, you've got a scenario where you meet spec but don't perform like it, thus TEST TIME! IF YOU DO NOT meet specs, expect problems as a result. IF, after reading the requirements, one cannot figure out why the software is still slow, they make a ticket, and eventually the issue gets escalated as P1 / Server Down to our team, which uses said TEST.

    Today we have multiple tools at our disposal to expose hardware-related problems, but only one is standard for Automate servers:
    -The standardized test you see us run IS NOT PERFECT.
    -IT DOES expose slow servers which are the CAUSE, and it provides a baseline to compare others against.
    How much does this test 'lie' vs the performance needed in the real world? Not clear.

    Anytime you make a support case, WE SHOULDN'T BLAME YOUR CONFIG for being slower, but we SHOULD outline WHY it's slower and what that causes in the end, so you can make a decision. This is easy to see, and how to resolve the issue is ultimately your decision. We then asked Development to make a test that can be standardized to help us with this push, as it's the #1 issue we see. Some push back and say, well, it does this at this.. well, sorry to say, we have not vetted (through product management) the other syntax.
    Until then, this is the DiskSpd.exe test we run and compare other servers against. This is the tool that tells us a server is going to be slow before it is found to be the source of the problem. Development sent us this and said: MySQL behavior mimics this test, it uses 16k blocks, use this, it's going to be a good test. When we run it on a server that is under spec, it shows a low result. Good, we will use this. Suggestion to the partner IF the server doesn't meet spec: upgrade until the score meets or exceeds 3000 IOPS.

    I am not here to explain specifically and scientifically what makes hardware perform slower than expected, but I can tell you most of the issues come from the hardware configuration, the SATA controller choice (the controller sits between the OS and the physical disk), SAN / NAS (shared storage) latency spikes at random times, which are VERY hard to track, or something else that creates the negative impact we try to avoid: slow software.

    You're absolutely right that we were late to the party on the 16k vs 4k argument, and I'm a little surprised no one else caught onto this when monitoring the server. HOWEVER, as you say, real-world wise, it's not going to change our suggestions: we need x operation to complete in x time, or y people leave because the software doesn't work. IF this 16k raw data is 20% off vs a 4k test, so be it! Most drives scale higher in performance as the block size increases, and they do this in a linear fashion. You can see this here>> https://windows-cdn.softpedia.com/screenshots/ATTO-Disk-Benchmark_8.png
    After a certain point, performance in MB/s is nearly the same from 64k all the way to 64MB tests. Each result would then be 20% higher than the real-sector-size (4k) speed we are testing.

    If you are an MSP and you're trying to minimize the slow-downs in your application, what do you do?
    Create a ticket with the vendor, who has seen the issue before, and/or check your other applications on the SAN / Hyper-V to see if they are slow too. Most often I hear: well, the other one is not slow, why is yours? Well, we are special. After many tries, the vendor comes up with a critique of performance despite the on-paper specs being met or exceeded and other servers running fine as mentioned; Automate, however, is special performance-wise.
    --The partner (you) comes back with: But x, y, and z factors all point back to your software not working right on the suggested configuration you told us to get! It's slow and we meet spec!
    As a result, we chose not to rely on subjective differences in other servers' speed and/or MB/s graphs like Disk Queue, and instead made a test that can give us raw data.

    --------------------------
    >>>> SO, if we change the test syntax, the same server scoring 3000 on 16k will score, let's say, 1000 on 4k (not a real number, just an example), so all other servers not meeting spec would then be at 200 IOPS vs 1200 on the 4k test.

    Point being, we can change the test, make it use SysBench, IOMeter, or another test pattern, and get better results and numbers more accurate to the ACTUAL IOs needed. But in the grand scheme of 'is the server working' or 'how do I stop my server from locking up', does it really matter for real-world performance and for quantifiable data we can base suggestions on? For a server down / P1 to get resolved, upgrading IOPS fixes the problem. For companies planning and designing a specific storage solution for the business WHILE meeting the required MySQL performance on TOP of other needs, this would be GOLD.

    Would a change in test methodology, and/or documentation sites where we provide tools to benchmark and estimate across these widely varied servers, make partners happier? Sure!
    Both the ones who look to scientifically measure and estimate the numbers and the ones who don't will certainly appreciate it in more ways than I can express!
    --HOWEVER, exceeding the goal is more desirable in support cases about performance impact, as 'they just need it fixed' or 'don't want to think about it again as agent count increases'. SO, while a long-term development goal would be to design a tool to 'benchmark' and/or 'trace' the real-world impact and/or IOPS requirement, we are a little far out right now, as is this Diskspd tool I've discussed.
    --Would I LOVE to see this added? It would FIX all the problems we see due to disk IO bottlenecking, WHILE defining just how much of a bottleneck each item creates. Certainly a wonderful idea, Herrchin!

    Because my job description is not performance analysis (or not yet), we use the 16k diskspd test on servers with performance problems to establish impact. This is mainly for comparing or estimating how much of an impact the disk may have, since we can compare against results from servers without performance issues, mainly in order to establish a min spec goal. With a standardized benchmark, if a server scores '800 IOPS' and then it's upgraded and scores '10,000 IOPS', we can effectively say the transaction will complete QUICKER, but defining HOW MUCH quicker, or 'IF THE PROBLEMS WILL GO AWAY', is not as cut-and-dried.

    The goal is to make the product as efficient as possible while letting dev focus on improvements and bug fixes. I would like to see certain queries optimized before others: the EDF rebuild, the search rebuild, and other queries which end up locking are the primary pain points. However, some may NEVER see it, while others may see it with as few as 1000 agents and a single plugin enabled. Then there are other servers without the issue that see higher row counts and have LESS trouble with a similar setup, but with an NVMe SSD. This is the primary 'ah-ha': one can throw IO at things and lessen the impact.
    --What's not to like about IO? You can use it for something later if we fix the issue!!! Will the issue get fixed if everyone is on SSDs and not seeing the deadlocking? Sure.. this can make development delay and/or cancel things as hardware matures and performance gets cheaper, so it's VERY important to hope we spot it, or to spot it yourself and define it technically.
    --The community shapes our software's design direction, and this is a great place to express your expectations! Please get out there and add a 'feature request' on our forums; the more people chime in about it, the better! https://product.connectwise.com/communities/5-automate-enhancements

    Once we have a TON of people on Azure with issues (if you can hold out) chiming in on our forums about the 'unacceptable performance, and how Azure is not the issue and our software is', the better the chance the development team will tear into another path: 'how do we optimize' vs 'what is wrong on the server that causes this in this env. & not in others (aka upgrade the disk)'.

    SO, one must determine whether we take the time to run the diskspd.exe test on all servers (which appears to be working to single out slow servers with the given syntax), or get more complicated and create an IOMeter standard test routine (which takes time to develop) that STILL doesn't cover all scenarios of the 50,000+ servers running Automate out there. Since the test patterns vary highly with the DB, who'd like to volunteer a full DB dump for us to run some tests on? This is near impossible without an internal team dedicated to this sort of thing, so this is KEY!
    >>(Once people) start making a very loud racket about the problem and define technically why they think a specific feature, query, or design choice is not working correctly for their business needs.. soon the team will be forced to address the complaints! If these definitions of issues hold water, and I think a lot of you are going to be good at defining them for us, you as MSPs can only gain!
    New tools which benchmark a storage sub-system, and a couple of real-world tests that move past the current DiskSpd syntax, sound exciting and would be great to have! Here are some ideas I have: measure example workloads from HUGE (10k+ agents), large, medium, and smaller (<100) Automate servers by generating IO patterns that simulate a live box, then use some logic to suggest hardware and/or IOPS mins based on these specs. This can ensure query completion times are met and exceeded for 'min vs recommended' specs. We do not have any such IO tests developed currently, so this could be a future goal.

    As for MySQL RAM allocs: caching 90-100GB of an on-disk datadir in RAM when you have 128GB of RAM may be a reality for some, but for others it's less than ideal to have to drop $3k on RAM.
    --This goes along with 'what we typically see': NO, you don't need 128GB of RAM and a 100GB buffer if the DB is 100GB. Have I seen servers with that? Sure. Were they fast? Not always. Would we like to see 64GB RAM / 50GB buffer on a 100GB on-disk DB size? Yes, sure, that would be a goal and ideal.
    --Is it always attainable? If it's not, why? Well, these questions are often answered by $. Spend more money, get x% better performance. Where do we cross the line and say, $100 is not worth 1%, I'll stop there? With advanced benchmark techniques (which I think we should have) maybe we can better define this, but currently they are just not there to answer these questions.
    I do have a MySQL-specific 'suggestion' for innodb_buffer_pool_size, in the form of a query which I've borrowed from Stack Exchange and which may help clarify:

    ---------------------------------------------------------------------------------------------------
    Recommended Buffer Pool Size
    --> https://dba.stackexchange.com/questions/27328/how-large-should-be-mysql-innodb-buffer-pool-size/27341#27341

    This will give you the RIBPS, Recommended InnoDB Buffer Pool Size, based on all InnoDB data and indexes with an additional 60%:

    SELECT CEILING(Total_InnoDB_Bytes*1.6/POWER(1024,3)) RIBPS FROM
    (SELECT SUM(data_length+index_length) Total_InnoDB_Bytes
    FROM information_schema.tables WHERE engine='InnoDB') A;

    ----------- 'More concise formula': -----------

    SELECT CONCAT(CEILING(RIBPS/POWER(1024,pw)),SUBSTR(' KMGT',pw+1,1))
    Recommended_InnoDB_Buffer_Pool_Size FROM
    (
        SELECT RIBPS,FLOOR(LOG(RIBPS)/LOG(1024)) pw
        FROM
        (
            SELECT SUM(data_length+index_length)*1.1*growth RIBPS
            FROM information_schema.tables AAA,
            (SELECT 1.25 growth) BBB
            WHERE ENGINE='InnoDB'
        ) AA
    ) A;

    -----------------------
    Find how many GB of memory are actually in use by InnoDB data in the InnoDB buffer pool at this moment:

    SELECT (PagesData*PageSize)/POWER(1024,3) DataGB FROM
    (SELECT variable_value PagesData
    FROM information_schema.global_status
    WHERE variable_name='Innodb_buffer_pool_pages_data') A,
    (SELECT variable_value PageSize
    FROM information_schema.global_status
    WHERE variable_name='Innodb_page_size') B;
    ---------------------------------------------------------------------------------------------------

    I was not aware that the test is forcing a sequential write with w100 and that other variable. Maybe you can suggest an alternative. I didn't get a chance to really look into this yet. Are you saying the syntax does not reflect real MySQL performance differences, by the numbers, when run on multiple servers?
    From what I see, all servers (that we run the test on) which are slow end up below the min spec of 3000 IOPS. Special cases like Azure are interesting because we often see performance issues there until high monthly costs add 'just enough IO' to push through the connection queue fast enough to stop the complaints, while still being slower than recommended. Businesses moving off that platform report better load times and less locking, for a lower monthly rate, when running Automate outside of Azure. This is a subjective observation, and it could be limited to the lower-tier disks on Azure only and not the '3 SSDs' combined, OR I just don't hear back from those partners.

    I guess I am getting a little muddy on the actual motivations for using the MS virtualized platform if it's a known performance problem at the tiers commonly afforded by medium-sized businesses. The end goal for most users of their cloud is to decrease the complexity of server and/or application management, so it's going to be interesting to see how this develops. We should come up with a test that is more tailored AND more accurate in the future; that I can agree on. There is no timeline, but thanks for bringing these points up!
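
As a rough illustration of the first Stack Exchange sizing query quoted in this post, the same arithmetic (total InnoDB data plus indexes, plus 60%, rounded up to whole GiB) can be sketched in Python. The function name and the sample sizes are hypothetical; only the 1.6 multiplier and the rounding come from the quoted query.

```python
GIB = 1024 ** 3  # bytes per GiB, matching POWER(1024,3) in the query


def recommended_buffer_pool_gib(total_innodb_bytes: int) -> int:
    """Mirror CEILING(Total_InnoDB_Bytes * 1.6 / POWER(1024,3)).

    Integer math (x * 16 / 10) is used instead of a float 1.6 so the
    rounding up is exact.
    """
    scaled = total_innodb_bytes * 16   # numerator of x * 1.6
    divisor = 10 * GIB
    return -(-scaled // divisor)       # ceiling division


if __name__ == "__main__":
    # Hypothetical on-disk sizes, not measurements from any real server.
    for size_gib in (10, 42, 100):
        print(size_gib, "GiB of InnoDB data ->",
              recommended_buffer_pool_gib(size_gib * GIB), "GiB buffer pool")
```

For a 100GB database this formula lands at 160GB, noticeably more than the 64GB RAM / 50GB buffer goal mentioned earlier in this post, so treat it as an upper-end estimate rather than a requirement.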
  2. dfleisch

    Speed optimization of On-Prem Automate

    Yes, by design. If a deadlock happens, it will wait on the write speed @ QD=1 to complete and then release, allowing others to process. The test takes advantage of this fact and makes sure to test a worst-case scenario @ 16k write. We really need to be testing 4k blocks instead: after some time graphing MySQL servers running our application, we found the 4k block is used and NOT 16k as previously thought.
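
To put the 16k vs 4k point in numbers: an IOPS score only turns into a data rate once the block size is known. A minimal sketch, assuming IOPS and block size are the only inputs (the function name is made up for illustration; the 800 and 3000 IOPS figures are the ones quoted elsewhere on this page):

```python
def throughput_mib_per_s(iops: float, block_kib: float) -> float:
    """Data rate implied by an IOPS score at a given block size."""
    return iops * block_kib / 1024.0


if __name__ == "__main__":
    for iops in (800, 3000):
        for block_kib in (4, 16):
            rate = throughput_mib_per_s(iops, block_kib)
            print(f"{iops} IOPS @ {block_kib}k = {rate:.2f} MiB/s")
```

The same IOPS figure at 4k moves a quarter of the data it moves at 16k, which is why a score taken at one block size cannot be compared directly with a score taken at the other.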
  3. dfleisch

    Running Automate in Azure

    We are worried about deadlocking and per-query execution times, and I will explain why. The process loops in the software which request that certain queries are processed need those queries to complete before others can process. Simply put: you're waiting on one query because it locked a table to do updates. If you cannot process that query faster than 5MB/s and it needs 40MB/s to complete in 1 second, then your server is not going to magically work UNTIL you have faster commit performance @ QD1. SO, while a high IOPS number @ QD32 helps recovery time after a deadlock by clearing the queues quickly, we are only worried about a single query's write performance in the moment of the deadlock. We don't care if you can do 1MB/s x 32 = 32MB/s @ that queue depth, because we are waiting on the 1MB/s PER THREAD that is bottlenecking the server queues.

    Take for example the query which rebuilds EDFs for computers. It locks the computers table and agent check-ins (for a 10k-agent server these can pile up pretty quickly) until it's done. These queries are larger and higher in row count, so they take anywhere from 11 seconds (fast server) to 2-3 minutes on a server with slow per-transaction IOPS. How often do you want to see 20,000 requests in IIS waiting to commit to the table, if these queries run multiple times a day and an EDF rebuild extends up to 3 minutes? Answer: you NEVER want to see that many in the queue. IF you do, performance at the higher queue depth certainly helps, and it's not like we are asking you to go WITHOUT acceptable performance at higher queue depths; it's just that per-thread performance is the problem, and if you don't meet our minimum, you will be affected by lock-outs and reliability concerns.
    When that rebuild runs (like many other queries which do something similar, e.g. for commands), the 2-minute gaps in processing (locks) for the INSERT queries of agent check-ins are going to make the server stop responding while the IIS queues are high, BY DESIGN of IIS. Soon 3306 is going away and ONLY web calls will be made by the fat client, even locally. We can get around these pauses in web-server response times with various tweaks to IIS and MySQL, but in the end we are simply band-aiding the real issue.

    While waiting for a single query to run on a base Azure instance, sure, the server can queue 32 other items @ that same data rate and complete them. But unless the query is re-developed and re-released (hint: it already has been, many times) with code that prevents locking, creates multiple partial locks instead, or doesn't lock at all, we are at the mercy of this and other queries with this design.
    --The requirement is there simply for a worst-case scenario: a core query, in-house or third-party plugin, custom or stock monitor, and/or script that may not be completing quickly *for whatever reason*.

    **IMPORTANT** Not ALL environments are the same; people use various custom queries. SO, for a core-product min recommendation for disk performance, to set a standard for how fast the product works, we make a generalization that everyone will be fast enough to write that 16k block at a certain speed, as EVERYTHING done in the application or web interface relies on queries. We have set the 3,000 IOPS number not only because it's an easy spec to meet, but because it gives small and medium-size servers a quick enough turn-around on any straggling 'locking' or full-table-scan type queries that major delays are avoided. Power users who want a 'very fast' or 'faster' product will go above and beyond this spec.
    Yes, there are limits to application design and efficiency based on design choices, but wouldn't you rather have a server that can process past a worst-case scenario than one which can't? While the product continues to evolve, we are eyeing different ways to further improve performance and speed, but the simple fact is: a SINGLE 7200RPM SATA disk runs 3200 IOPS @ 16k random write, QD=1 and threads=4. Why are excuses being made to work around the fact that the server does NOT meet this spec (which every bargain-bin consumer desktop with a single spinning disk will do), when higher-density drives past the 1TB mark have been a thing since, what, 2008?

    Now, keep in mind that MySQL 5.7 currently supports 32k and 64k pages, vs 5.6 allowing only 16k, 8k, and 4k, so maybe you'll gain some benefit there, at the expense of wasted space, if Azure scales better under larger block sizes in a QD1 scenario.
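
The QD1 vs QD32 argument in this post comes down to simple arithmetic. A minimal sketch, reading the '40MB/s to complete in 1 second' example as a locking query that has to commit 40 MB (that reading, and the function names, are assumptions for illustration):

```python
def seconds_to_commit(data_mb: float, per_thread_mb_s: float) -> float:
    """Time one locking query needs when it is limited to a single thread."""
    return data_mb / per_thread_mb_s


def aggregate_mb_s(per_thread_mb_s: float, queue_depth: int) -> float:
    """Total throughput when queue_depth requests run in parallel."""
    return per_thread_mb_s * queue_depth


if __name__ == "__main__":
    print(seconds_to_commit(40, 40))   # 1.0 s at the required rate
    print(seconds_to_commit(40, 5))    # 8.0 s at 5 MB/s
    # 1 MB/s per thread looks like 32 MB/s at QD32 ...
    print(aggregate_mb_s(1, 32))       # 32
    # ... but the single locking query still only sees 1 MB/s.
    print(seconds_to_commit(40, 1))    # 40.0 s
```

The aggregate number helps drain the queue once the lock releases, but every check-in waiting behind the lock still waits for the single-thread time.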
  4. Mike, please create a ticket for the server down / ERT team to review this with you. I'll take care of top-to-bottom troubleshooting on this instead of Support. We are confident we can identify an issue, but we do need to start a remote session / call to ID a few points before starting. David Fleisch, ConnectWise Automate Support
  5. dfleisch

    Running Automate in Azure

    Michael, the Azure temp drive has good IOPS and is not persistent, so it seems it could be a RAM disk. The temp directory for MySQL can be placed there, but if the server crashes and the drive is not available even to show the files that were force-closed (which may be corrupted), the MySQL log will be littered with errors about missing #sql files, and this can lead you down a dark path of IBDATA1 inflation, resulting in long-term performance impact. DB rebuilds are often done because of inflation, so I would avoid this at all costs unless you're ready to rebuild the DB a lot.

    Other points: we suggest allocating 50% of the on-disk data size to the in-RAM buffer_pool_size. 10GB DB on disk (excluding logs)? A 5GB buffer for MySQL, OR higher, is best. On a 32GB server we use 21GB for the buffer pool, since there is 20% overhead and Windows and other processes need RAM (including the Windows caching routines). If your 21GB buffer on that 32GB server has a 42GB DB, it's technically meeting best practice. Larger DBs, or DBs that need a larger RAM buffer to cache queries hitting a table that may exceed 50% of the total on-disk data size for \Labtech\, may need further optimization or more RAM.

    The "buffer_pool_instances" setting goes along with buffer pool size. Fewer instances and lots of connections? Thread lock contention occurs. Buffer pool instances should be 40 if a 40GB buffer is used, but 39 and lower would also work. 41 instances with a 40GB buffer is a no-no, since it breaks the rule of 1GB or more per buffer pool instance (1:1 or less). The max value for this setting is 64.

    I would prioritize disk performance and 4 cores FIRST, then work on more RAM. An SSD (a real SSD) and 4 cores / 16GB RAM is the min spec I would set for servers with 0~20GB DB sizes.
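
The sizing rules of thumb in this post can be written down as a small helper. This is only a sketch of the rules as stated here (buffer pool at least 50% of on-disk data, at least 1GB of buffer per instance, at most 64 instances); the function names are hypothetical and this is not an official ConnectWise or MySQL calculator.

```python
MAX_INSTANCES = 64  # upper limit for buffer pool instances, per the post


def min_buffer_pool_gb(db_on_disk_gb: float) -> float:
    """50% of the on-disk data size (excluding logs)."""
    return db_on_disk_gb / 2


def max_pool_instances(buffer_pool_gb: float) -> int:
    """Largest instance count that keeps >= 1GB of buffer per instance."""
    return max(1, min(MAX_INSTANCES, int(buffer_pool_gb)))


def instances_ok(instances: int, buffer_pool_gb: float) -> bool:
    """True if the instance count respects the 1GB-per-instance rule."""
    return 1 <= instances <= max_pool_instances(buffer_pool_gb)


if __name__ == "__main__":
    print(min_buffer_pool_gb(10))    # 5.0
    print(min_buffer_pool_gb(42))    # 21.0
    print(max_pool_instances(40))    # 40
    print(instances_ok(39, 40))      # True
    print(instances_ok(41, 40))      # False
```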
  6. dfleisch

    Running Automate in Azure

    Hi Michael, it seems to me that the MySQL server instance is programmed to use 4k and 16k blocks for reading and writing table data. I ran HDTune Pro and used its Disk Monitor tool to see which block sizes were most common while the application was running. Give it a shot and see if you can isolate the same! https://www.hdtune.com/download.html While our other services do things besides talk to SQL, most of the time they still rely on MySQL for a response to a query, and that is the bottleneck. We need to do more testing to confirm some of the finer details, but for now this should give you a general idea of what the software calls use.
  7. dfleisch

    Running Automate in Azure

    @tlphipps Please forgive me if I came across this way. I read through the thread and wanted to put out a message to everyone first: my goal isn't to insult anyone about their choice in providers; heck, that's a business decision more than anything. This started as a small post, but I want to make a few items clear about our approach and the reason tests are being run. *To help you guys stay up and stable!* for one. The goal of my last post was to call out and describe what works and what doesn't, and to forewarn someone, new or seasoned, who may not be used to looking at these numbers and troubleshooting an issue caused by IO. First, the volume of IO calls made by Automate is massive. The number and variety of calls being this complex creates issues, and sometimes they are hard to identify because each server differs in how the problem exposes itself and how it manifests. Your first line of defense is to meet requirements. Automate 2019.x and Automate 12.x run a variety of services that require a certain min spec. The CWAFILESERVICE, Solution Center, startup routines of LTAgent, and more all require a low-latency, high-throughput disk. Most IO issues are first exposed right after the server is rebooted, by way of timeouts while all agents are ALSO trying to check in, and I've seen 10,000 queued up in web-garden type IIS queue buckets, which eventually creates a 100% CPU problem and slows the queries further. If your server cannot process this many I/O operations (per minute), it may take 30+ minutes to 'settle' down, or at worst a 'rename' of the eventlogs table, before the startup of LTAgent finishes. Without an LTAgent startup/POST, you may never get the server online, so these issues occur because it couldn't process x data in x time. Add in custom monitors and a wild variety of query calls that vary per company, and you've got a recipe for disaster in cases where IO is not plentiful per thread and a thread deadlocks.
If you've bought into Azure and have it working, I am not going to tell you to move off it! Running on hardware hosted by the makers of Windows can provide lots of benefits and advantages vs other providers. I just don't wish to see a partner reaching out to Microsoft and going through all this trouble to find out what is going on with performance, when a design choice at the provider (one which they will defend) has driven the IO to be optimized for a purpose other than Automate's needs on the storage plan that fits their budget. Seeing that the SSD stripe options may provide the required performance, it is good to have data on the instance size / type, as this was previously very cost prohibitive. What I'd like to raise awareness of is the fact that it takes very little to create a server down situation by backing up the web server when IO demand exceeds availability on a *per thread performance basis*. The threshold could be close to being exceeded as the DB's table row counts and data-set sizes increase, and for the cost one would spend, you want a much larger window of opportunity to increase load IF the numbers are anything like what I saw back when they released the Azure cloud. To give y'all an idea how bad it was, I've attached two screenshots. --First is an Azure server in 2015; notice the lack of write speed and 4k performance, with results that would likely fail our DISKSPD.exe test. --Second is a newer Azure server from 2018 on a higher IO plan, still showing only 60MB/s sequentially (4TB SATA drives are at 150MB/s today). 16K write MB/s looks better here. Latency measurements were not taken. Since these screenshots pre-date the diskspd.exe test we use, the MB/s, ms latency, and IOPS measurements needed to get a clear picture were not available. Personally, I have not run the numbers on a newer 3x SSD Azure setup, and I may have got a bit ahead of myself.
The current VMs may have more headroom with the improvements MS has made, but all I am saying is: just be careful!! If the plan works for your business and provides the features and security you aim for within budget, with performance looking good, that's all that matters. In my eyes, the bottom line is: 1. We've troubleshot Azure performance issues from day one and got a bad taste for their servers as a result. --Azure has improved; how much? Hard to say, as I don't have raw numbers / data. I'm just speaking in general (it would be good to run some benchmarks on different tiers). 2. I dislike the trend of a growing company running into issues with ConnectWise after speccing the server to our 'recommended' specs on the Documentation site, then finding later that they are going down due to an infrastructure scaling / hardware problem. One of the great things we've done as of late is to develop and provide a standardized benchmark that sets an IOPS bar, and we hope you will want to keep that headroom. Happy MSPing out there!
  8. dfleisch

    Running Automate in Azure

    The command we give consultants to pre-check a server runs a test using a queue depth of 1. Microsoft is likely able to scale the Azure disks to run well with queue depths of 32 or higher, but MySQL is not going to behave like that. MySQL will use 4k and 16k block sizes on the disk and write (randomly) at a single queue depth. The test we give you to check the server validates whether the server can sustain the required IOPS on a single MySQL query, something that varies widely depending on hardware setup. From experience, we are not able to do an EDF rebuild, Search Rebuild, or other intensive tasks below a certain number of IOPS on that 16k random write, 4 thread test syntax for Diskspd.exe, and as a result server down tickets come in with complaints that the application is not usable. To give you some history, back when Azure came out, people were running the base storage plans and could not meet specs on these 16k and 4k block sizes, but if you've upgraded to an SSD on Azure there should be a better chance you'll pass. The issue is that even Azure's striped SSD setups are still many times slower than a single SSD drive, which today is very inexpensive. If you're paying for Azure and getting less performance than a $100 piece of hardware, I find this to be a waste of resources, EXCEPT for the fact that cloud hosting may be better for a company's situation because of the management side of the OS / hardware, ISP, and power / heating / cooling / redundancy guarantees. Because of this, I UNDERSTAND why people are doing it, but there has to be some sort of min spec, because most server down tickets are focused on 'why doesn't the application work'. Answering the question "Why is the server not able to perform well on a single thread, when it scales well with a high queue depth?" generally gives the answer to the problem: the storage is not optimized for the type of load the application calls for.
There have been talks about standardizing a disk performance test for some time here at ConnectWise's Automate division. To define min specs, we have to cut off the slower servers somewhere. Yes, I've seen systems showing great IO perform slowly on an EDF rebuild or other queries due to row count or optimizations we may need to add. Yes, I've seen slow disks work for the intended purpose of serving MySQL queries quickly. There are always outliers and scenarios where you may be able to get by, HOWEVER, in the worst-case type of situation where a query is running slow, IIS (the WEB SERVER) can queue a bunch of connections (due to a deadlock on, say, computers while an EDF rebuild runs), and those agent check-in requests filtering through the web server WILL stall out the GUI, as the waiting connections must go into the 'queue' after 100 requests are added (and are waiting on MySQL). BECAUSE OF THIS, the problem may not be obvious until a query runs that takes a while, and as you grow this can get worse and worse. This DiskSPD.exe syntax pasted above is going to be the best defense we have against your server not working correctly long-term, simply by finding a provider that can sustain the IO we suggest at minimum. Amazon's EC2 instances that use EBS Optimized volumes are going to run about 3200 IOPS on that test, and so we've set the min spec there, as single 7200 RPM SATA disks are also posting these (or higher) numbers. It sounds like Microsoft is trying to convince people that the config they've chosen for running their platform and storage pools is 'best', and from your note the Automate server requires 3 SSDs JUST to sustain the performance required by our apps, while a single ~$100 Samsung outperforms the '3 SSD' stripe 10 times over. The important part of this is: they are artificially throttling the IO and get to choose what qualifies as an 'SSD'.
SO, while the application may work after 3 SSDs are combined, the results of the IOPS test we choose to run may still not be optimal for Automate's application calls, but it may 'work' to a point. Heck, for that ~$800 price, why is one NOT building a server with hardware that benchmarks @ 33,000 IOPS (NVMe SSD) for pennies in comparison, sending that hardware to a datacenter, and calling it a day? --The min spec is NOT voodoo to reach, and if you are below 3000 IOPS, you have no excuse besides your own choice to go against experts who have direct experience with the application's design. Azure is simply not the best choice for cost/performance on Automate. Can you get it to work? Maybe. How well? A little more time wasted troubleshooting may answer that once the load scales with company expansion. Be careful out there with what you buy into!
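To make the queue-depth point concrete, here is a rough worked example using Little's law (I/Os in flight = IOPS x average latency). It assumes the pre-check test keeps 4 threads x 1 outstanding I/O in flight, as described in this post, and that the disk is in a steady state. The function name is made up for this sketch, and the figures it produces are implied averages, not measurements.

```python
def avg_latency_ms(iops: float, threads: int = 4, queue_depth: int = 1) -> float:
    """Average per-I/O latency implied by an IOPS figure, via Little's law."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    in_flight = threads * queue_depth  # I/Os outstanding at any moment
    return in_flight / iops * 1000.0   # seconds converted to milliseconds


if __name__ == "__main__":
    # IOPS figures quoted in the post for the same test shape.
    for label, iops in [
        ("min spec", 3000),
        ("EC2 EBS Optimized", 3200),
        ("NVMe SSD", 33000),
    ]:
        print(f"{label}: {iops} IOPS -> ~{avg_latency_ms(iops):.2f} ms per I/O")
```

With so few I/Os in flight, the only way to raise IOPS is to lower per-I/O latency, which is why storage tuned for deep queues can look fine on paper and still fall short on this test.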
  9. That URL is below, and it can also use a custom port if your agent templates are designed to change the check-in port by editing the 'server address'. Here's the specific URL it's hitting by default, for agent ID 89: http://FQDN/LabTech/agent.aspx?89c5&10 The text and numbers after the ? are specific encoded items we can decode, such as what it is sending, how many items, etc. We've seen cases where the WEB GUI can be reached but an IPSEC policy blocks agent communication through LTSVC.exe, so reaching the GUI does not guarantee communication. Better to try hitting that URL and look for the agent version returned, to verify that the URL / ASP page can be served (again, this is not a guarantee even if agent.aspx can be reached; it's just a guideline). Better still, run Wireshark to see whether deep packet inspection is the culprit when troubleshooting communication issues.
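A quick way to try that check from a machine on the agent's network is a small Python script. This is only a sketch: the function names and the example host are made up, the query string is copied from the example URL above without decoding it, and what the server sends back is not documented here, so the script simply returns the status and body for you to inspect.

```python
import urllib.request
from typing import Optional, Tuple


def agent_checkin_url(host: str, port: Optional[int] = None,
                      query: str = "89c5&10") -> str:
    """Build the default agent check-in URL, optionally with a custom port."""
    netloc = host if port is None else f"{host}:{port}"
    return f"http://{netloc}/LabTech/agent.aspx?{query}"


def fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Request the URL and return (HTTP status, response body as text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

For example, `fetch(agent_checkin_url("automate.example.com", port=8080))` would request the page on a custom port (the hostname here is a placeholder). A timeout or connection error points at the network path rather than the application, and as noted above, a successful response is still only a guideline.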
  10. Yes, this is a known limitation of Chrome 71 and the Automate Web Control Center. Our fix for now is to use Firefox until we can patch this limitation in Automate 12 patch build .492+. Take this site as an example of a working patch .489 server on Chrome 71: https://dfleisch.ddns.net/automate
  11. dfleisch

    Intel Processors

    Maybe doing this by Hardware ID then having a cross-reference for diff gens of Intel Procs? https://stackoverflow.com/questions/7480556/how-to-get-hardware-id-for-a-network-adapter-programmatically-in-c-sharp The Hardware ID via WMI query should be available to query with a remote monitor..
  12. dfleisch

    Speed optimization of On-Prem Automate

    If you'd like a sneak peek at the testing metrics ConnectWise is employing going forward to help prevent configuration problems, voilà: diskspd.exe -c10G -t4 -si16K -b16K -d30 -L -o1 -w100 -D -Sh C:\testfile.dat This command line syntax fully supports changing the path, test file size, and block size to find out what performance your disk provides at differing metrics. You can choose read or write as well with this Microsoft tool. ------ The goal is to suggest 3000 IOPS in the 'total' column shown below, for the 100% write @ 16k metric, as the 'minimum spec' we require for new builds. For anyone who doubts performance, or does not believe performance is a factor in issues on the server, we first need to confirm your current Automate patch version. --Certain indexes have been added to the database to work around known issues with patching queries and networkdevices table queries that also join the 'computers' table. Once we rule out these factors, it should come as no surprise that the disk is important to review. Without these indexes (coming in Patch 11 and 12 of Automate 12), agent check-ins and CC logins can back up and spike IIS server / w3wp.exe CPU use along with MySQLd.exe's usage, by way of held-open connections eating resources and responsiveness. Some of these are known issues that are included in a current patch, a future patch, or no patch (yet). To find out if this is the case, please contact Automate Support and get a ticket in, with a subject like 'Server Performance review', to see whether your server is optimized using all the known fixes we have. ------ To get the bigger picture, also run a 4k write IOPS test, in order to gain insight into the contrast between these two properties of the disk sub-system. Below, I've run two 16k write tests as the validation would do pre-server install.
*You must edit the path C:\testfile.dat to point at the drive you wish to test, no matter where diskspd.exe is located; the exe location won't change the results (the test runs on the drive you specify via the path in the command line syntax). I've attached a benchmark of a direct-connected (SATA 6G) 1TB 7200RPM 'Performance' Hitachi Deskstar spinning disk on the right. On the left, a Samsung 850 EVO 250GB is running the test virtualized. We need to make note of the total IOPS in these screenshots, under the 'Total:' row of the 'I/Os per second' column in the test result output in CMD.exe. From these test results, the HDD scores ~3716 operations per second in total when writing 16k blocks, while the SSD does 12,264 IOPS on a 16k write. Keep in mind that varying the testfile.dat size via -c10G (a 10GB test file by default) makes little difference on SSDs, while HDDs can be up to 30% faster on the first vs the last sector of the disk. On new drives, use a file size closer to the total disk space to see average IOPS across the whole drive, at the expense of a longer test run time. The test gives you an idea of whether your disk meets our requirements for new server builds. Example: any server scoring below '3000 IOPS' may see issues with EDF rebuilds, search rebuilds, patching rebuilds, or nightly maintenance, or intermittent performance issues as varying queries run throughout the day. Some of these trace back to IO issues, others to an optimization we must apply through patches. This is directly linked with high average CPU usage by the DB AND WEB server due to 'managing the idle time' between IO requests and transaction commits. Some of this is preventable, and some of it we can optimize by rewriting code, but get this out of the way first: run a benchmark! diskspd.exe
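If you want to script variations of the test, the command line can be assembled like this. It is a sketch, not an official tool: the helper names are made up, the flags are exactly the ones in the command above, and changing both `-si` and `-b` together for the 4k run is an assumption on my part. The conversion helper just multiplies IOPS by block size to show what the quoted results mean in MiB/s.

```python
from typing import List

MIN_TOTAL_IOPS = 3000  # suggested floor for the 100% write @ 16k test


def diskspd_args(test_file: str, block: str = "16K", file_size: str = "10G",
                 threads: int = 4, seconds: int = 30,
                 write_pct: int = 100) -> List[str]:
    """Rebuild the command from the post with a different path or sizes."""
    return ["diskspd.exe", f"-c{file_size}", f"-t{threads}", f"-si{block}",
            f"-b{block}", f"-d{seconds}", "-L", "-o1", f"-w{write_pct}",
            "-D", "-Sh", test_file]


def throughput_mib_s(total_iops: float, block_kib: int = 16) -> float:
    """Throughput implied by an IOPS figure at a given block size."""
    return total_iops * block_kib / 1024


def meets_min_spec(total_iops: float) -> bool:
    """Compare the 'Total' I/Os per second figure against the suggested floor."""
    return total_iops >= MIN_TOTAL_IOPS


if __name__ == "__main__":
    # 16k run on D: and a 4k run for contrast (paths are examples only).
    print(" ".join(diskspd_args(r"D:\testfile.dat")))
    print(" ".join(diskspd_args(r"D:\testfile.dat", block="4K")))
    # Totals quoted in the post for the HDD and the SSD.
    for label, iops in [("HDD", 3716), ("SSD", 12264)]:
        print(f"{label}: {iops} IOPS = {throughput_mib_s(iops):.1f} MiB/s,"
              f" meets min spec: {meets_min_spec(iops)}")
```

You can pass the list to `subprocess.run` on a Windows machine that has diskspd.exe on the PATH. The test file is written to the drive named in the path, so point it at the volume that will hold the database.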