In the previous post, you learned about how Alteryx Analytics Hub utilizes storage, and how to modify the locations where that storage exists.
Today, we’ll start with a brief detour into the land of IOPS, throughput, and burst bucket credits. Next, we’ll look at the two types of disks I actually used during testing.
We’ll also explore some of the common metrics folks use to measure disk performance. I’m honestly not sure these metrics are as applicable to our scenario as they are to others, but they’re important enough to cover.
Then we’ll start looking at data. We’ll see execution time and queue time for our workflows on 4-core machines running different numbers of workers and with different disk types, sizes, and layouts.
I’ll spoil some of the fun upfront by saying you’ll see only a few black-and-white “answers” here. Instead, you’ll get a feel for what is directionally correct for you when you set up your AAH rig. Sorry. Life can be that way.
Basic EBS Disk Stuff You Should Know
What are IOPS (and why do they suck)?
Tech Target defines IOPS this way:
IOPS (input/output operations per second) is the standard unit of measurement for the maximum number of reads and writes to non-contiguous storage locations. IOPS is pronounced EYE-OPS.
IOPS is frequently referenced by storage vendors to characterize performance in solid-state drives (SSD), hard disk drives (HDD) and storage area networks. However, an IOPS number is not an actual benchmark, and numbers promoted by vendors may not correspond to real-world performance.
Especially if you watched the “IOPS suck” video, you’ll understand that this measure can be somewhat arbitrary and easy to game. That said, when dealing with the same vendor (in this case, AWS), we can be relatively certain that “more is better”.
What is throughput?
Again, from Tech Target:
Throughput measures how many units of information a system can process in a period of time. It can refer to the number of I/O operations per second, but is typically measured in bytes per second.
I’m far from an expert on this stuff, but I’m always more interested in throughput than IOPS. Throughput can generally be expressed as (average request size) x #IOPS. For example, if the disk activity I need generates 1000 IOPS with a request size of 4 KB each, my throughput is 1000 x 4 KB = 4000 KB/sec, or about 4 MB/sec.
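If it helps, here’s that same arithmetic as a tiny snippet:

```python
iops = 1000          # IO operations per second
avg_request_kb = 4   # average request size in KB

# Throughput ≈ average request size x IOPS
throughput_kb_s = avg_request_kb * iops           # 4000 KB/sec
print(f"~{throughput_kb_s / 1000:.0f} MB/sec")    # ~4 MB/sec
```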
AWS EBS Disk Types
I played with two disk types (and only two) because it quickly became apparent that some of AWS’s fancier offerings would be overkill.
General Purpose SSD
I did the lion’s share of my testing with General Purpose 500 GB SSD drives. Nerds generally refer to these as gp2s. The larger the disks are, the more IOPS they deliver. At 500 GB, the gp2 will deliver 1500 IOPS, which in every situation seemed to be more than enough for what I was doing.
Back to the gp2 disk. Per AWS’s docs, a 500 GB gp2 not only delivers a baseline of 1500 IOPS, but it can burst up to 3000 IOPS for a limited amount of time by burning “burst bucket credits”. Think of your burst bucket credits like an automatically regenerating bag of “powerups”. The bigger your disk, the faster your powerups regenerate, too. In fact, if you use a gp2 disk with a size of 1TB or larger, you will never run out of burst bucket credits.
A single gp2 volume has a maximum throughput of 250 MiB/s but could be as low as 128 MiB/s based on the size of the volume itself.
As the name suggests, the gp2 is a good, all-around disk. It’ll cost you about $0.10 / GB per month.
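If you want to sanity-check gp2 sizing yourself, here’s a quick back-of-the-envelope sketch. It assumes the ~3 IOPS-per-GB baseline (with AWS’s published floor of 100 and ceiling of 16,000 IOPS), which is how a 500 GB volume lands at 1500:

```python
def gp2_baseline_iops(size_gb: int) -> int:
    """Rough gp2 baseline: ~3 IOPS per GB, floored at 100 and capped at 16,000."""
    return min(max(3 * size_gb, 100), 16_000)

# At 1 TB+, the baseline meets or exceeds the 3000 IOPS burst ceiling,
# which is why those volumes never exhaust burst bucket credits.
for size_gb in (100, 500, 1_000, 2_000):
    print(f"{size_gb:>5} GB gp2 -> ~{gp2_baseline_iops(size_gb)} baseline IOPS")
```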
Throughput Optimized HDD
The Throughput Optimized HDD (known as st1) is a cheap mechanical disk. It costs less than half of what a gp2 does, at $0.045 / GB per month. This disk is pretty good at reading and writing large, sequential files. It’s NOT very good at lots of random reading and writing of small files.
To make things more confusing, AWS doesn’t use IOPS to measure performance on this type of disk. Instead, they use throughput. The baseline performance of an st1 is 40 MiB/s per 1 TB size of the volume. So, my 500 GB st1 disk is “only” going to give me 20 MiB/s throughput, which doesn’t sound like much.
However, the st1 also has a bursting capability. For a 500 GB disk, that means I can burst up to about 125 MiB/s while I have credits. That’s more like it!
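Here’s the same kind of back-of-the-envelope math for st1. It assumes the published 40 MiB/s-per-TB baseline and a 250 MiB/s-per-TB burst rate (both capped at 500 MiB/s), which lines up with the 20 and ~125 MiB/s figures above for a 500 GB volume:

```python
def st1_throughput_mibps(size_gb: int) -> tuple:
    """Rough st1 (baseline, burst) throughput in MiB/s for a given volume size."""
    size_tb = size_gb / 1_000
    baseline = min(40 * size_tb, 500)   # 40 MiB/s per TB, capped at 500 MiB/s
    burst = min(250 * size_tb, 500)     # 250 MiB/s per TB while burst credits last
    return baseline, burst

print(st1_throughput_mibps(500))   # -> (20.0, 125.0)
```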
Already, you’re probably thinking, “Hmmm, this sucker might not be good for workloads with lots of temp files being read and written”. Correct – we’ll see proof of that in a moment.
Comparing Performance with fio
Let’s compare performance between a 500 GB gp2 disk and a 500 GB st1 disk. We’ll test two different scenarios with a disk benchmarking tool named fio (I’ve sketched roughly equivalent fio commands right after the list below):
- Randomly reading and writing from (and to) 8 files using 8k blocks
- Making sequential writes to 8 files using 256k blocks
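For the curious, here’s a rough sketch of fio invocations that approximate those two scenarios. These aren’t necessarily the exact flags I ran; treat the file sizes, runtimes, and the D:\fio-test target directory as placeholders:

```python
import subprocess

# Hypothetical directory on the volume under test
target_dir = r"D:\fio-test"

common = [
    "--numjobs=8",            # 8 parallel jobs, one file each
    "--size=1G",              # size of each test file (illustrative)
    "--direct=1",             # bypass the OS cache
    "--ioengine=windowsaio",  # native async IO on Windows
    "--runtime=60", "--time_based",
    "--group_reporting",
    "--directory=" + target_dir,
]

# Scenario 1: random reads and writes across 8 files using 8k blocks
random_8k = ["fio", "--name=rand-rw-8k", "--rw=randrw", "--bs=8k"] + common

# Scenario 2: sequential writes to 8 files using 256k blocks
seq_256k = ["fio", "--name=seq-write-256k", "--rw=write", "--bs=256k"] + common

for cmd in (random_8k, seq_256k):
    subprocess.run(cmd, check=True)
```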
gp2:
Here’s the first test:
I’m getting ~3000 IOPS (my 1500 IOPS disk is bursting to 3000) and ~23.6 MiB/s throughput, which is about what the formula from earlier predicts: 3000 IOPS x 8 KB blocks ≈ 23.4 MiB/s. Bouncing all over the disk with small requests just doesn’t add up to much throughput.
Here’s the second test:
Because my block size is so much bigger and I’m making sequential writes, I don’t need as many IO actions to write my data. So, my IOPS are lower, but my throughput is like 10X higher than it was before at 253 MiB/s.
st1:
First test:
Wow. This blows. 130 IOPS and just over 1 MiB/s throughput (130 IOPS x 8 KB is barely 1 MiB/s). This disk is CLEARLY not made for bouncing around and reading/writing lots of small files.
Here’s the second test where we write files sequentially using 256k blocks:
We still have very low IOPS, but now we’re at least getting nearly passable throughput at 35 MiB/s.
Honest Conclusion:
Part of the challenge here is understanding exactly how all the various components of Alteryx Analytics Hub actually USE your disks.
Are those services mostly writing big, fat files? Constantly performing a random mix of non-sequential reads and writes across many files? I have a fair “feel” for this, but I don’t know for sure and I don’t want to spend tons of time profiling the IO characteristics of AAH running “my test workload”, only to find that those characteristics are totally different for “your real-world workload”.
Do you really want to think this hard about your disk and worry about whether you made the right decisions? Probably not. So my takeaway is: don’t use mechanical storage and you’ll sleep way better at night.
Just one more nugget of info, I promise.
I have some Windows-related Performance Monitor disk metrics for you. I tracked these throughout all the tests I ran to see what was happening on my volumes. When and if necessary, I can attempt to correlate disk performance to workflow execution and queue time.
Average Disk sec / Transfer is a great disk counter to “cut through the noise” – it measures how long a disk transfer takes. Folks often refer to this metric as disk latency. You’ll find some varied opinions around what makes for “good” and “bad” values. I think this article represents a pretty happy medium:
https://techcommunity.microsoft.com/t5/sql-server-support/slow-i-o-sql-server-and-disk-i-o-performance/ba-p/333983
- Counter: Physical Disk / Logical Disk – Avg. Disk sec/Transfer
- Definition: Measures average latency for read or write operations
- Values: < .005 Excellent; .005–.010 Good; .010–.015 Fair; > .015 Investigate
The values above are in seconds, so .015 is 15 ms; latency above 15 ms for long periods of time is probably not good.
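If you want to bake those bands into your own monitoring (I collected the counter via Performance Monitor, but you could script it however you like), here’s a trivial sketch using the thresholds from the table above:

```python
def rate_disk_latency(avg_sec_per_transfer: float) -> str:
    """Bucket an Avg. Disk sec/Transfer reading (in seconds) per the table above."""
    if avg_sec_per_transfer < 0.005:
        return "excellent"
    if avg_sec_per_transfer < 0.010:
        return "good"
    if avg_sec_per_transfer < 0.015:
        return "fair"
    return "investigate"

print(rate_disk_latency(0.004))   # a ~4 ms reading, like my gp2 averages -> "excellent"
```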
There’s also Average Disk Queue Length: it’s the average number of both read (Avg. Disk Read Queue Length) and write (Avg. Disk Write Queue Length) requests that were queued for the selected disk during the performance data interval. Unlike Current Disk Queue Length, Avg. Disk Queue Length is a derived value and not a direct measurement.
Here’s what PowerAdmin has to say about this counter:
For both Current and Avg. Disk Queue Length, 5 or more requests per disk could suggest that the disk subsystem is bottlenecked….
…if the Avg. Disk Queue Length is greater than 2 per hard disk for a prolonged period of time, it may produce a bottlenecked system.
Can we start talking about Alteryx now? Please?
Yes. You’ve been very patient with me.
We’re going to begin by testing on a basic, small machine: EC2’s m5a.2xlarge which is an 8 vCPU, 32 GB RAM machine. The “m” series is a “good enough” workhorse which is evenly balanced across CPU, RAM and potential disk & network throughput. It is not especially great in any one category. Also, remember that 8 vCPUs is the equivalent of 4 physical cores. So this machine meets the minimum requirements for CPU on Alteryx Analytics Hub and has 2x the minimum recommended RAM.
8 vCPU, Single Drive Only
I ran two tests with all disk activity limited to a single, 500 GB gp2 disk. EVERYTHING is running off of C:
I ran one test using the “Alteryx Approved” worker formula of 1 worker per 2 cores. I then bumped up the worker count by 1 (to 3 workers) and re-ran the same workloads.
Let’s take a look at disk activity. The chart below shows CPU usage on the machine alongside disk latency.
The first viz shows 2 workers in action, with CPU activity as a grey area chart. Each one of those dots is a reading of disk latency. Most of the disk latency readings are < 5ms, which is good. I dropped an Average reference line on this chart, and while you can’t see the actual value, it’s around 4ms (3.9ms, to be exact). You can’t see the average CPU on this run either, but I looked it up for you: 50.5%. If you find the animated gif annoying, I also uploaded a basic screenshot of the dashboard. Download it here.
The second viz shows 3 workers running the same workload. Note that CPU utilization is definitely higher at an average of ~75%. The average disk latency is about the same, however. In fact, it happens to be a touch lower at 3.3ms. We have some CPU headroom on this machine, but probably not enough to attempt adding a 4th worker.
The final Dashboard simply combines both charts into a single canvas. Note that the periods of high-latency disk activity are few and short. As I grab some of them you can see that they last for maybe 10-15s each time.
On this machine, it appears that we’ll bottleneck on CPU before we actually start getting into trouble on our disk subsystem.
Now, let’s look at execution time on the two tests. Below, we see the 10 primary workflows I executed while testing. Blue dots represent a single execution of a workflow on the 2-worker machine, orange dots represent an execution on the machine while it’s running 3 workers.
As you can see, workflows tended to execute more quickly when the machine was running only two workers: blue dots generally fell more toward the left of the box plot (they executed more quickly), while orange dots sat more to the right (executing more slowly). In longer-running, more IO-intensive workflows, the pattern was a bit clearer.
8 vCPU, Storage Folder on a Distinct Drive
What happens when we move the Storage folder to a new, 500 GB gp2 disk with 1500 IOPS of its own? Not much, it turns out. Here, I re-ran the same tests with either 2 or 3 workers, and execution time stayed mostly the same:
I honestly expected a marginal improvement from using a separate disk for Storage, but I don’t really see it. At times, it appears the opposite happens: moving data between disks (Storage on D:, Staging and Windows Temp on C:) slows us down, especially if neither disk is very busy to begin with; we’re just wasting effort for no real gain.
Here’s the same data aggregated to show average execution time per workflow and machine configuration. I used the execution time of the 2-worker machine running all activity on C: as a benchmark. The label on each bar represents how much slower each execution is compared against the group’s benchmark.
The one where I use a mechanical drive (st1)
Because I’ve watched this movie a few times, I guessed that using a mechanical drive to host the Staging folder could be deadly. It was!
First, take a look at these AWS CloudWatch charts which are tracking disk performance on the new mechanical drive that hosts Staging.
Start with the bottom-most time series, which shows the Burst Balance Credits slowly getting used up. It took approximately 6 hours of continuously running workflows on a 2-worker, 8 vCPU box to eat up all my credits. The moment we lost the ability to burst, things started going south rapidly. Note how Average Read and Write Latency shoot up. The Average Queue Length jumps, too.
Here’s what the same activity looks like from inside Windows using the disk counters we’ve talked about before. I’m zooming in on the moment we run out of burst credits.
The magenta line represents the point at which Burst Bucket credits are exhausted. Yes, I’m using a logarithmic axis for disk latency because things get so bad. I’ve also done some additional labeling to make things clear.
Did you note how CPU Utilization seems to drop in sympathy with poor disk performance?
Until the disk goes bye-bye, execution time actually looks pretty decent. The dark green marks show execution time on par with what we saw using our “benchmark” blue 2-worker, 8 vCPU box where everything lives on C:
However, after I remove the filter on time, the real story emerges. Hold on to your hats…
Ouch. All of the outliers outside the whiskers on the box plot belong to the mechanical drive executions once Burst Bucket is exhausted.
Another view, this time using the bar chart, unfiltered:
Keep in mind the numbers above include all the “nice, fast” executions we got on the “Staging on Mechanical” test for the first 6-ish hours. So when this disk goes down, it goes down hard.
Lesson learned
Even on a relatively small machine that normally wouldn’t be able to generate enough disk activity to really get you into trouble, you can shoot yourself in the foot using mechanical drives.
What about Queuing?
What about job queuing? On a smaller machine executing lots of schedules, you’re going to have a queue, period. Get a bigger machine. I’ll show you some metrics so you can get a feel for where the longest queue time for our jobs was, but it’s going to be high across the board. This metric will become more interesting in the NEXT installment when we start playing with a 16 vCPU (8 core) machine.
Our “Staging on Mechanical” configuration is the clear laggard. The configurations using more workers do better in terms of lowering queue time.
Remember, nothing comes free. Lower queue time is partially offset by longer execution time on the 3-worker rigs: the 3-worker workflow executions take 10-20% more time to get done. At scale, I’d seriously consider taking that trade-off, though. More on that in the next post.