The eternal question is not “Why am I here?”, “Is there life after death?”, or even “Does (s)he really love me, or is it all about my cool lightsaber collection?”
No, the real question we all want answered is “What sort of storage should I use with Alteryx Analytics Hub, and how should I lay it out?”
Answering the question
To address this conundrum, I spent lots of time on AWS EC2 experimenting. EC2 gives me lots of flexibility in terms of quickly picking different types of machines, storage, etc. EC2 makes me happy.
To do my testing, I put together 10–15 different workflows (some of which execute under AMP, others under our original engine). Each workflow leverages files for input, output, or both. Typically, I used a *.yxzp to include the file and force an unzip, which adds more IO.
Some of the workflows are pretty simple, generating 300K or 3M rows and writing them to an output *.yxdb. These execute in anywhere from about one second to two minutes, depending on row count and execution engine.
I also created much larger workflows that take 2GB and 4GB input files, then pivot, filter, and do all sorts of wonderful stuff. Again, some of these workflows run under Engine 1, and some under AMP. In addition, I typically configured these workflows to leverage disk-intensive output – for example, I export as Tableau .hyper files, or an Avro container file.
Overall, my goal was to stress different disks and disk layouts in a fairly repeatable way. I did so with schedules executing on a particular cadence, each of which fires off one of the workflows above. Specifically, I wanted to:
- Figure out what storage and patterns might make workflows execute artificially slowly because of disk saturation
- Figure out what disk types and layouts to AVOID
- See if I can increase throughput or lower AAH’s queue length by increasing engines-per-worker beyond the “2 cores per workflow” rule of thumb
- Figure out the perfect combination of disk types, worker counts, and CPU to make workflows go “really fast”. (Just kidding: that would be a waste of time, since you won’t be executing my test workflows to solve your problems.)
How does AAH consume storage, anyway?
Alteryx Analytics Hub has a bunch of components that use your disk:
- The Repository leverages disk space for your PostgreSQL data, configuration, log, and control files in <INSTALL_DIR>\Alteryx Analytics Hub\Postgres\data
- AAH drops log and audit files on the hard disk at <INSTALL_DIR>\Alteryx Analytics Hub\Logs
- Alteryx Analytics Hub stores your assets either in a file system folder called Storage (default) or in the PostgreSQL repository. The location of this folder is generally <INSTALL_DIR>\vfs\storage, but it can be moved
The workers associated with Alteryx Analytics Hub can use a ton of disk for temporary storage:
- Workers leverage a Staging folder to do temporary work. This location can get very busy. You’ll normally find it in C:\ProgramData\Alteryx\Service\Staging. It is moveable.
- Your Workers launch copies of the Alteryx Engine, and each engine leverages your Windows Temp folder for some work they do. The location of Windows Temp can vary, but it is often situated in C:\Windows\Temp. This folder can be moved, too.
How fast your workflows execute can depend on what sort of disks host Storage, Staging, and the Windows Temp folder, and on how much IO those workflows need to stay happy. To test this, I thought about a bunch of different techniques, and tried most of them:
- “The Default”: Throw everything on the C: drive and see what happens
- “The Happy Medium”: Put AAH Storage on a distinct drive
- “The Other Happy Medium”: Put Worker’s Staging area on a distinct drive, and leave Storage on C:
- “The Crazy Man”: Put Storage and Staging on two distinct drives.
- “You Have Too Much Time on Your Hands”: Do the Crazy Man, PLUS change the location of the TEMP folder to a third, distinct drive. I’ll be using 4 drives with this approach. Overkill, you think?
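Before committing a layout to one of these schemes, it can be useful to get a rough feel for each drive’s raw sequential write speed. This isn’t the harness I used for the real tests — it’s just a minimal Python sketch (the drive paths are placeholders for your own mounts) that writes a file and reports observed throughput:

```python
import os
import tempfile
import time


def measure_write_throughput(path, size_mb=256, block_kb=1024):
    """Sequentially write size_mb of data to a temp file under `path`
    and return the observed throughput in MB/s."""
    block = os.urandom(block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    fd, fname = tempfile.mkstemp(dir=path)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force data to disk before stopping the clock
        elapsed = time.perf_counter() - start
        return size_mb / elapsed
    finally:
        os.remove(fname)


# Compare candidate drives (paths below are hypothetical examples):
# for drive in [r"C:\Temp", r"D:\Staging"]:
#     print(drive, round(measure_write_throughput(drive), 1), "MB/s")
```

Keep in mind this only measures sequential writes on an otherwise idle disk; real workflow IO is burstier, which is exactly why I ran scheduled workflows instead of relying on synthetic numbers.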
I also experimented with different types of EBS storage on AWS:
- General Purpose SSD (gp2) volumes at 100 GB, 200 GB, and 500 GB
- Throughput Optimized HDD (st1) at 500 GB
Don’t want to wait for the next post?
Well, shame on you. I made charts and everything. Fine, my abbreviated guidance is:
If you don’t use st1 disks for Staging and you keep your gp2 disks at ~200 GB or larger, you should be pretty much OK. Your mileage may vary.
You can go away now.
Moving Storage, Staging, and Windows Temp
Based on what you’re about to learn, you may decide you want to move your Storage, Staging, or Windows Temp folders to another disk. Here’s how you complete said tasks…
When you install AAH, you can specify whether you want to use file system based storage (default) or store your assets inside the PostgreSQL repository database itself. You cannot flip back and forth between the storage modes, so choose wisely.
As an aside, I personally don’t see any real reason to use PostgreSQL storage. The option might be useful if you have very few large assets and want to simplify your backup and restore process. That said, it can sometimes take longer for AAH to store and retrieve things using this mechanism, so I don’t like it.
Assuming you choose file system based storage, you can change its location using this article. Make sure you back up the entire system before you play with this technique.
Your workers control the location of Staging. If you have 4 machines acting as workers, you’ll need to repeat this process 4 times, once on each machine…or not! It’s up to you.
Here’s what to do:
- In Windows Explorer, navigate to C:\Program Files\Alteryx\Alteryx Analytics Hub and open the CutlassSettings.yml file.
- Update the worker.staging_directory property, making sure you “escape” backslashes in the path
- Restart the worker using .\ayxworker.ps1 -restart on a machine running only workers. If you have a single node of AAH, use .\ayxhub.ps1 -restart instead, but keep in mind you’ll be restarting the entire server!
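To make the “escape your backslashes” step concrete, here’s what the edited setting might look like. The path is a hypothetical example, and I’m showing the property as the dotted key named above — match whatever shape (dotted or nested) your CutlassSettings.yml already uses:

```yaml
# Hypothetical example: point Staging at a dedicated D: volume.
# Each backslash in the Windows path is doubled ("escaped").
worker.staging_directory: "D:\\AAH\\Staging"
```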
You’ll change the location of Windows Temp on the operating system itself. Rather than write up the steps here, I’m going to be lazy and point you to another article.
Frankly, I’m unsure whether you should change the TMP variable too, so I normally do both just to be safe. Remember that this setting affects everything on your system, not just AAH.
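If you’d rather script the change than click through System Properties, something along these lines should do it from an elevated PowerShell prompt. D:\Temp is a placeholder target; create it first, and reboot afterward so services (including AAH) pick up the new location:

```powershell
# Create the new temp location, then repoint TEMP and TMP machine-wide.
New-Item -ItemType Directory -Path 'D:\Temp' -Force
[Environment]::SetEnvironmentVariable('TEMP', 'D:\Temp', 'Machine')
[Environment]::SetEnvironmentVariable('TMP',  'D:\Temp', 'Machine')
```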
In the next post, we’ll compare how 2, 3, 4, and 6 workers on the same machine stress your disk and also move around Storage, Staging, and Windows Temp to see what happens. It’s going to be very exciting. At least for me.