dfleisch

Members
  • Content Count

    10
  • Joined

  • Last visited

  • Days Won

    1

dfleisch last won the day on November 14 2018

dfleisch had the most liked content!

Community Reputation

2 Neutral

My Information

  • Agent Count
    Less than 100


  1. Mike, please create a ticket for the server down / ERT team to review this with you; I'll take care of top-to-bottom troubleshooting on this rather than going through standard Support. We are confident we can identify an issue, but we do need to start a remote session / call to identify a few points before starting. David Fleisch, ConnectWise Automate Support
  2. dfleisch

    Running Automate in Azure

    Michael, the Azure temp drive has good IOPS but is not persistent, so it is effectively a RAM disk. MySQL's temp directory can be pointed there, but if the server crashes and the drive comes back without even the files that were force-closed (they may be corrupted), the MySQL log will be littered with errors about missing #sql files, and this can lead you down a dark path of IBDATA1 inflation, resulting in long-term performance impact. DB rebuilds are already done often due to inflation, so I would avoid this at all costs unless you're ready to rebuild the DB a lot.

    Other points: we suggest allocating 50% of the on-disk data size to the in-RAM buffer pool (innodb_buffer_pool_size). A 10GB DB on disk (excluding logs) means a 5GB MySQL buffer, or higher, is best. On a 32GB server we use a 21GB buffer pool, since roughly 20% overhead plus Windows and other processes (including Windows caching routines) need RAM. If your 21GB buffer on that 32GB server is backing a 42GB DB, it technically meets best practice. Larger DBs, or DBs whose queries hit tables exceeding 50% of the total on-disk data size for \Labtech\, may need further optimization or more RAM.

    The buffer pool instance count (innodb_buffer_pool_instances) goes along with buffer pool size. Too few instances with lots of connections means thread lock contention. With a 40GB buffer, 40 instances is right, and 39 or fewer also works; 41 instances with a 40GB buffer is a no-go, since the rule is at least 1GB per buffer pool instance. The maximum value for this setting is 64.

    I would prioritize disk performance and 4 cores FIRST, then work on more RAM. A real SSD, 4 cores, and 16GB RAM is the minimum spec I would set for servers with a 0-20GB DB.
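    In case it helps to see the sizing rules above in one place, here is a minimal sketch in Python; the rules (50% of on-disk data size, ~20% RAM overhead, at least 1GB per instance, 64 instances max) come from this post, while the 4GB OS floor is my own illustrative assumption:

        # Sketch of the buffer pool sizing rules described above; illustrative only.
        def suggest_buffer_pool(data_size_gb: float, server_ram_gb: float):
            target = data_size_gb * 0.5            # buffer pool >= 50% of on-disk data size
            ceiling = server_ram_gb * 0.8 - 4.0    # ~20% overhead plus an assumed 4GB for the OS
            pool_gb = min(target, ceiling)
            instances = max(1, min(64, int(pool_gb)))   # at least 1GB per instance, 64 max
            return round(pool_gb, 1), instances

        print(suggest_buffer_pool(data_size_gb=42, server_ram_gb=32))
        # -> (21.0, 21), matching the 21GB-on-a-32GB-server example above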
  3. dfleisch

    Running Automate in Azure

    Hi Michael, It seems to me the MySQL server instance is set up to read and write table data in 4k and 16k sized blocks. I ran HDTune Pro's Disk Monitor tool to get an idea of which block sizes were most common while the application was running. Give it a shot and see if you can isolate the same! https://www.hdtune.com/download.html While our other services do things besides talk to SQL, they still mostly rely on MySQL for a response back through a query, and that is the bottleneck. We need to do more testing to confirm some of the more detailed items, but for now this should give you a general idea of what the software's calls use.
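    If you want to double-check the 16k figure on your own server, InnoDB exposes its page size as a server variable. A minimal sketch, assuming the mysql-connector-python package and placeholder credentials:

        import mysql.connector  # assumes the mysql-connector-python package is installed

        # Placeholder host/credentials; point this at your Automate database server.
        cnx = mysql.connector.connect(host="localhost", user="root", password="***")
        cur = cnx.cursor()
        cur.execute("SHOW VARIABLES LIKE 'innodb_page_size'")
        print(cur.fetchone())   # typically ('innodb_page_size', '16384'), i.e. 16k pages
        cnx.close()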
  4. dfleisch

    Running Automate in Azure

    @tlphipps Please forgive me if I came across that way. I read through the thread and wanted to put out a message to everyone first: my goal isn't to insult anyone about their choice of provider; heck, that's a business decision more than anything. This started as a small post, but I want to make a few items clear about our approach and the reason these tests are being run: *to help you guys stay up and stable*, for one. The goal of my last post was to call out what works and what doesn't, and to forewarn anyone, new or seasoned, who may not be used to looking at these numbers and troubleshooting an issue caused by IO.

    First, the volume of IO calls made by Automate is massive. The number and variety of calls is complex enough to create issues, but they are sometimes hard to identify because each server differs in how the problem exposes itself and how it manifests. Your first line of defense is to meet requirements. Automate 2019.x and Automate 12.x run a variety of services that require a certain minimum spec. The CWAFILESERVICE, Solution Center, startup routines of LTAgent, and more all require a low-latency, high-throughput disk. Most IO issues are first exposed when the server is rebooted, by way of timeouts while all agents are ALSO trying to check in; I've seen 10,000 requests in web-garden-type IIS queue buckets, which eventually creates a 100% CPU problem and slows the queries further. If your server cannot process that many IO operations per minute, it may take 30+ minutes to settle down, or at worst require a rename of the eventlogs table, before the startup of LTAgent finishes. Without an LTAgent startup/POST you may never get the server online, and these issues occur because it couldn't process X data in X time. Add custom monitors and the wide variety of query calls that differ per company, and you've got a recipe for disaster in cases where IO is not plentiful per thread and a thread deadlocks.

    If you've bought into Azure and have it working, I am not going to tell you to move off it! Running hardware hosted by the makers of Windows can provide plenty of benefits and advantages versus other providers; I just don't wish to see a partner reaching out to Microsoft and going through all this trouble to find out what is going on with performance, when a design choice at the provider (one which they will defend) has optimized the IO for purposes outside of Automate's needs on the storage plan that fits their budget. If the SSD stripe options can provide the performance required, it is good to have data on the instance size / type, as previously this was very cost-prohibitive.

    What I'd like to raise awareness of is the fact that it takes very little to create a server-down situation by backing up the web server once IO demand exceeds availability on a *per-thread performance basis*. That threshold gets closer as the DB's table row counts and data-set sizes increase, and for the cost one would spend, you want a much larger window of headroom IF the numbers are anything like what I saw back when the Azure cloud was released. To give y'all an idea of how bad it was, I've attached two screenshots. --The first is an Azure server in 2015; notice the lack of write speed and 4k performance, results that would likely fail the DISKSPD.exe test. --The second is a newer Azure server from 2018 on a higher IO plan, still showing only 60MB/s sequential (4TB SATA drives are at 150MB/s today); the 16K write MB/s looks better here.
    Latency measurements were not taken. Since these screenshots pre-date the diskspd.exe test we use, the MB/s and latency figures needed to derive IOPS were not available, so they don't give a fully clear picture. Personally, I have not run the numbers on a newer 3x SSD Azure setup, and I may have gotten a bit ahead of myself. The current VMs may have more headroom with the improvements MS has made; all I am saying is: just be careful! If the plan works for your business and provides the features and security you aim for within budget, with performance looking good, that's all that matters. In my eyes, the bottom line is: 1. We've troubleshot Azure performance issues from day one and got a bad taste for their servers as a result. --Azure has improved; how much is hard to say, as I don't have raw numbers / data. Just speaking in general (it would be good to run some benchmarks on different tiers). 2. I dislike the trend of a growing company running into issues with ConnectWise after spec'ing the server to our 'recommended' specs on the documentation site, then finding out later they are going down due to an infrastructure scaling / hardware problem. One of the great things we've done as of late is to develop and provide a standardized benchmark to set an IOPS bar, in the hope that you'll keep that headroom. Happy MSPing out there!
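    To make the relationship between throughput, block size, and latency concrete, here is a small illustrative calculation; the 60MB/s figure comes from the 2018 screenshot mentioned above, and the queue-depth-1 assumption is mine:

        # Illustrative arithmetic only; assumes one outstanding IO (queue depth 1).
        block_kb = 16                      # MySQL-style 16k writes
        throughput_mb_s = 60               # sequential figure from the 2018 screenshot
        iops = throughput_mb_s * 1024 / block_kb
        latency_ms = 1000 / iops           # average time per IO at queue depth 1
        print(f"{iops:.0f} IOPS, ~{latency_ms:.2f} ms per 16k write")
        # -> 3840 IOPS, ~0.26 ms per 16k write; random 16k writes are usually far
        #    slower than sequential, so treat this as an upper bound, not a pass.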
  5. dfleisch

    Running Automate in Azure

    The command we give consultants to run to pre-check a server uses a queue depth of 1. Microsoft can likely scale the Azure disks to run well at queue depths of 32 or higher, but MySQL is not going to behave like that: MySQL will use 4k and 16k block sizes on disk and write (randomly) at a single queue depth. The test we give you validates whether the server can sustain the required IOPS on a single MySQL query, something that varies widely depending on the hardware setup. From experience, we are not able to do an EDF rebuild, search rebuild, or other intensive tasks below a certain number of IOPS on that 16k random-write, 4-thread Diskspd.exe test, and as a result server-down tickets come in with complaints that the application is not usable.

    To give you some history, back when Azure came out people were running the base storage plans and were not able to meet spec on these 16k and 4k block sizes, but if you've upgraded to an SSD on Azure there is a better chance you'll pass. The issue is that even Azure's striped SSD setups are still many times slower than a single SSD drive, which today is very inexpensive. If you're paying for Azure and getting less performance than a $100 piece of hardware, I find that a waste of resources, EXCEPT for the fact that cloud hosting may be better for a company's situation because of the management of the OS / hardware, ISP, and power / heating / cooling / redundancy guarantees. Because of this, I UNDERSTAND why people are doing it, but there has to be some sort of minimum spec, because most server-down tickets boil down to 'why doesn't the application work'; answering the question "why is the server not able to perform well on a single thread, but scales well with a high queue depth" is generally the answer to the problem: the storage is not optimized for the type of load the application generates.

    There have been talks about standardizing a disk performance test for some time here at ConnectWise's Automate division. To define minimum specs, we have to cut off the slower servers somewhere. Yes, I've seen systems showing great IO perform slowly on an EDF rebuild or other queries due to row counts / optimizations we may need to add. Yes, I've seen slow disks work for the intended purpose of serving MySQL queries quickly. There are always outliers and scenarios where you may be able to get by. HOWEVER, in the worst-case situation where a query runs slowly, IIS (the web server) can queue a pile of connections (due to a deadlock on, say, computers while an EDF rebuild runs), and those agent check-in requests filtering through the web server WILL stall the GUI, because waiting connections go into the queue once 100 requests are pending (all waiting on MySQL). Because of this, the problem may not be obvious until a long-running query comes along, and as you grow it can get worse and worse.

    The DiskSPD.exe syntax pasted above is the best defense we have against your server not working correctly long-term: simply find a provider that can sustain the IO we suggest at minimum. Amazon EC2 instances using EBS-optimized volumes run about 3200 IOPS on that test, and we've set the minimum around that spec, since single 7200 RPM SATA disks also post these (or higher) numbers.
    It sounds like Microsoft is trying to convince people that the config they've chosen for their platform and storage pools is 'best', and from your note the Automate server requires 3 SSDs JUST to sustain the performance required by our apps, while a single ~$100 Samsung outperforms the '3 SSD' stripe ten times over. The important part is: they are artificially throttling the IO and get to choose what qualifies as an 'SSD'. So while the application may work after 3 SSDs are combined, the results of the IOPS test we run may still not be optimal for Automate's application calls, though it may 'work' to a point. Heck, for that ~$800 price, why not build a server with hardware that benchmarks at 33,000 IOPS (NVMe SSD) for pennies in comparison, send that hardware to a datacenter, and call it a day? --The minimum spec is NOT voodoo to reach, and if you are below 3000 IOPS, you have no excuse besides your own choices for going against experts with direct experience of the application's design. Azure is simply not the best cost/performance choice for Automate. Can you get it to work? Maybe. How well? A little more time wasted troubleshooting may answer that once the load scales with company expansion. Be careful out there with what you buy into!
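    To illustrate why "scales well with a high queue depth" doesn't help a single MySQL query, here's a back-of-the-envelope sketch; the 2 ms per-IO latency is an assumed example, not a measured value:

        # Back-of-the-envelope illustration; 2 ms per-IO latency is an assumed example.
        latency_ms = 2.0                       # assumed time to complete one 16k write
        qd1_iops = 1000 / latency_ms           # one outstanding IO: what a single query sees
        qd32_iops = 32 * qd1_iops              # 32 outstanding IOs, assuming perfect scaling
        print(f"QD1: {qd1_iops:.0f} IOPS, QD32: up to {qd32_iops:.0f} IOPS")
        # -> QD1: 500 IOPS, QD32: up to 16000 IOPS. A QD32 benchmark can look great
        #    while a single-threaded MySQL query is still stuck at the QD1 number.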
  6. That URL is below, and it can also be a custom port if your agent templates are designed to change the check-in port via an edit of the 'server address'. Here's the specific URL it hits by default, for agent ID 89: http://FQDN/LabTech/agent.aspx?89c5&10 The text and numbers after the ? are specific encoded items we can decode, like what it is sending, how many items, etc. We've seen cases where the web GUI can be reached but an IPSEC policy blocks agent communication through LTSVC.exe, so this does not guarantee communication; it's better to hit that URL and look for the agent version returned, to verify the URL / ASP can be served (again, not a guarantee even if agent.aspx can be reached; just a guideline). Better yet, run Wireshark to see whether deep packet inspection is the culprit when troubleshooting communication issues.
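     A quick sketch of the kind of check described above, using only the Python standard library; the FQDN and agent ID 89 below are placeholders, and what the endpoint returns on your build is not guaranteed:

        import urllib.request

        # Placeholder FQDN and agent ID; substitute your Automate server and a real agent ID.
        url = "http://automate.example.com/LabTech/agent.aspx?89c5&10"
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read(200).decode(errors="replace")
                print(resp.status, body)   # look for an agent version string in the response
        except Exception as exc:
            print("check-in URL not reachable:", exc)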
  7. Yes, this is a known limitation of Chrome 71 and the Automate Web Control Center. Our fix is to use Firefox for now, until we can patch this limitation in Automate 12 patch build .492+. Take this site as an example of a working patch .489 server on Chrome 71: https://dfleisch.ddns.net/automate
  8. dfleisch

    Intel Processors

    Maybe do this by Hardware ID, then keep a cross-reference for the different generations of Intel processors? https://stackoverflow.com/questions/7480556/how-to-get-hardware-id-for-a-network-adapter-programmatically-in-c-sharp The hardware ID from a WMI query should be available to a remote monitor.
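    As a rough sketch of the WMI angle outside of C#, the processor name and ID can be pulled with a WMI query; this example uses the built-in wmic client from Python's standard library, and the generation-guessing regex at the end is purely illustrative:

        import re
        import subprocess

        # Query Win32_Processor via the built-in wmic client (Windows only).
        out = subprocess.run(
            ["wmic", "cpu", "get", "Name,ProcessorId", "/format:list"],
            capture_output=True, text=True, check=True,
        ).stdout

        name = re.search(r"Name=(.+)", out).group(1).strip()
        print(name)
        # Illustrative only: pull the leading digit of e.g. "i7-8700" to guess the generation.
        m = re.search(r"i[3579]-(\d)\d{3}", name)
        if m:
            print("Looks like gen", m.group(1))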
  9. dfleisch

    Speed optimization of On-Prem Automate

    If you'd like a sneak peek at the testing metrics ConnectWise is employing going forward to help prevent configuration problems, voila: diskspd.exe -c10G -t4 -si16K -b16K -d30 -L -o1 -w100 -D -Sh C:\testfile.dat This command-line syntax fully supports changing the path, test file size, and block size to find out what performance your disk provides at different settings. You can choose read or write as well with this Microsoft tool.

    ------ The goal is to suggest 3000 IOPS in the 'total' column shown below for the 100% write @ 16k metric as the minimum spec we require for new builds. For anyone who doubts performance, or does not believe performance is a factor in issues on the server, we first need to check your current Automate patch version. Certain indexing has been added to the database to work around known issues with patching queries and networkdevices-table queries that also join the computers table; once we rule out those factors, it should come as no surprise that the disk is important to review. Without these indexes (coming in patches 11 and 12 of Automate 12), agent check-ins and Control Center logins can back up and spike IIS / w3wp.exe CPU use, along with MySQLd.exe's usage, by way of held-open connections eating resources and responsiveness. Some of these are known issues and are included in a current, future, or (as yet) no patch. To find out if this is the case, please contact Automate Support and open a ticket with a subject like 'Server performance review' to see if your server is optimized with all the known fixes we have.

    ------ To get the bigger picture, also run a 4k write IOPS test, to gain insight on the contrast between these two properties of the disk subsystem. Below, I've run two 16k write tests as the validation would do pre-server-install. *You must edit the path of C:\testfile.dat to the drive you wish to test, no matter where diskspd.exe is located; the exe location won't change the results (the test runs on the drive you specify via the path in the command-line syntax).

    I've attached a benchmark of a directly connected (SATA 6G) 1TB 7200RPM 'Performance' Hitachi Deskstar spinning disk on the right. On the left, a Samsung 850 EVO 250GB running the test virtualized. Make note of the total IOPS in these screenshots under the 'Total: I/Os per second' output of the test result in CMD.exe. From these results, the HDD scores ~3,716 operations per second in total when writing 16k blocks, while the SSD does 12,264 IOPS on a 16k write.

    Keep in mind, testfile.dat size variations (as modified by -c10G, the default 10GB test file) make little difference on SSDs, while HDDs can be up to 30% faster on the first versus the last sector of the disk, so on new drives use a size closer to the total disk space to see average IOPS across the whole drive, at the expense of a longer run time.

    The test will give you an idea of whether your disk meets our requirements for new server builds. For example: any server scoring below 3000 IOPS may see EDF rebuild, search rebuild, patching rebuild, nightly maintenance, or intermittent performance issues as varying queries run throughout the day. Some of these trace back to IO issues, others to an optimization we must apply through patches. This is directly linked to high average CPU usage by the DB AND web server due to 'managing the idle time' between IO requests and transaction commits. Some of this is preventable, some of it we can re-write code to optimize, but get this out of the way first: run a benchmark!
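    For the 4k contrast test mentioned above, one way to script it is to change the stride and block size from the 16k run and capture the output; a minimal sketch assuming diskspd.exe is on PATH and D: is the drive under test (the 'total:' line extraction is approximate and may need adjusting for your diskspd version):

        import subprocess

        # 4k variant of the 16k test above; assumes diskspd.exe is on PATH and that D:
        # is the volume being tested. Only -si/-b and the target path differ.
        cmd = ["diskspd.exe", "-c10G", "-t4", "-si4K", "-b4K", "-d30", "-L",
               "-o1", "-w100", "-D", "-Sh", r"D:\testfile.dat"]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print(result.stdout)

        # Very rough: surface the totals line(s) to read 'I/Os per second' from;
        # the exact output layout can differ between diskspd versions.
        for line in result.stdout.splitlines():
            if line.strip().lower().startswith("total:"):
                print("candidate totals line:", line.strip())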