So it’s Wednesday afternoon and you think, “I wonder if Unifi can…”
If the task in question can be accomplished on Hadoop or via the operating system Unifi runs on, it’s quite likely “Unifi can”… Your secret “Arsenal of Yes” weapon is the Operating System job. The OS Job allows you to go off-road whenever necessary and do all sorts of things from trivial to mind-bendy.
What is an Operating System Job?
In short, the OS job allows you to extend OOB functionality of Unifi. Here are some common examples we see:
- Processing unstructured data before Unifi ingests it
- Importing PDF documents and arbitrary data as Unifi Datasets
- Executing advanced statistical functions or performing ML on data
You can code the logic of these jobs with the language of your choice: Python, Node, Java, even Bash: Whatever you can execute or call from Linux. You have the ability to feed your OS job parameters which inform how it executes. Finally, the job can (and actually should) return status back to the Unifi workflow so it can make intelligent decisions for you downstream.
A Use Case
As an example, this post will focus on grabbing data “lost” inside a PDF and making it available in Unifi. Here’s the example PDF I’m using: I’m a PDF
We’re going to use the tabula-py library to extract text from the sample PDF and save it as CSV. Unifi consumes the CSV.
There are lots of different ways you can approach this task, and I’m taking a simple, friction-free path. It takes a single line of code to do the conversion. That qualifies as no big thing.
The key thing to remember is you can use the tool you want to rather than a driver/approach dictated by a vendor. The tabula-py library is pretty flexible and my fave, but there are others.
Here’s some code:
```python
#!/usr/bin/env python
import tabula
import argparse
import sys
import subprocess

parser = argparse.ArgumentParser(description='Process a PDF to CSV')
parser.add_argument('--input', help='Path to the input PDF')
parser.add_argument('--output', help='Path to the output CSV')
args = parser.parse_args()

try:
    # Write pdf to csv
    tabula.convert_into(args.input, args.output, output_format="csv")
    try:
        # Copy csv to HDFS
        subprocess.call(["hadoop", "fs", "-put", "-f", args.output, "/demo/"])
    except Exception as err:
        print(err)
        sys.exit(1)
    sys.exit(0)
except Exception as err:
    print(err)
    sys.exit(1)
```
The script above is really simple. It takes:
- --input: Represents the path (including the name) to the PDF you want to process
- --output: You got it, the path of the CSV to be created.
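If you want to sanity-check the flag handling outside Unifi, you can exercise the same parser directly (the paths below are made up for illustration):

```python
import argparse

# Mirror of the script's CLI; the file paths are hypothetical
parser = argparse.ArgumentParser(description='Process a PDF to CSV')
parser.add_argument('--input', help='Path to the input PDF')
parser.add_argument('--output', help='Path to the output CSV')

# Simulate: python pdf_to_csv.py --input /demo/report.pdf --output /demo/report.csv
args = parser.parse_args(['--input', '/demo/report.pdf', '--output', '/demo/report.csv'])
print(args.input)
print(args.output)
```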
I installed tabula-py on the Unifi host with pip install tabula-py while logged in as the unifi user. pip took care of installing pandas (which took a while) and the other prerequisites.
How do you call an Operating System Job?
When calling an OS job, you have two parameters to worry about:
- Command: The program that will be doing the work. In this case, python.
- Arguments: The path to the script you’re going to execute as well as any arguments/parameters you’re passing in. Arguments are separated by a space. Use double-quotes to wrap values which contain a space.
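For the PDF-to-CSV script above, the two properties might look something like this (the script path and file names here are assumptions, not values from the actual job):

```
Command:   python
Arguments: /home/unifi/pdf_to_csv.py --input /home/unifi/report.pdf --output /home/unifi/report.csv
```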
Note how the first line of my script doesn’t make any assumptions about the location of the interpreter (in this case, Python). The /usr/bin/env shebang finds the “correct” Python based on what’s in the environment’s PATH.
If I want to run a shell script instead, I can pretty much count on it containing a shebang pointing at its interpreter (#!/bin/bash). In that case, you’ll need to chmod +x the script. By definition, the #!/bin/bash line tells the system where the script’s interpreter lives, so you don’t have to type /bin/bash into the Command property. Instead, use the path to your script (à la /home/unifi/somescript.sh) as the Command. You’ll continue to use Arguments to pass additional parameters to the script.
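As a sketch of the shell-script route, here’s a hypothetical script being created, marked executable, and run; the path and the echo body are stand-ins, not anything from a real job:

```shell
# Write a hypothetical script with a shebang (body is a stand-in for real work)
cat > /tmp/somescript.sh <<'EOF'
#!/bin/bash
echo "processing ${1:-nothing}"
exit 0
EOF

# The shebang only works once the script is executable
chmod +x /tmp/somescript.sh

# Run it the way Unifi would: script path as Command, extras as Arguments
/tmp/somescript.sh demo.pdf
echo "exit status: $?"
```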
Using your job in a Workflow
Once you have a working OS Job, you’ll generally want to leverage its output in some part of a larger workflow. After my CSV has been generated, I might want to consume it in a Data Prep Transform job.
In the simple example to the left, our first job converts a PDF to CSV.
It leans on an On Success Condition to execute the PDF Ingest data prep job, which grabs the newly-generated CSV file and joins it with other information.
The Success Condition depends on the Python script to tell it “all is well,” which is why you see me doing a sys.exit(0). If I were doing the same thing in a shell script, I’d drop in an exit 0 instead.
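You can see the exit-status contract from the calling side. This sketch (not Unifi’s actual internals, just an illustration of the convention) runs two child processes and reads the codes a Success Condition would react to:

```python
import subprocess
import sys

# A child that exits cleanly vs. one that signals failure.
# An On Success Condition keys off exactly this kind of return code.
ok = subprocess.run([sys.executable, "-c", "import sys; sys.exit(0)"])
bad = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])

print("success job returned:", ok.returncode)   # 0 -> downstream job fires
print("failed job returned:", bad.returncode)   # non-zero -> it doesn't
```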
Putting it all together
What good is a workflow if you don’t have a way to execute it? For that, you’ll need a Schedule. Create a schedule, drop your Workflow in, configure your Data Transformation Job’s output:
That’s it. Not difficult. You can do this.
Once you have the basic approach down, it’s quite literally trivial to churn out variations on the same idea. Here I am scraping data out of HTML tables so I can use them in Unifi:
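That HTML-table variation can be sketched with nothing but the standard library (this is my own minimal illustration, not the exact script from the post; pandas.read_html is another common shortcut):

```python
import csv
import io
from html.parser import HTMLParser

# Minimal table scraper: collect the text of each <td>/<th> into rows.
class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], [], None

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._cell = []          # start buffering cell text

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)  # row complete
            self._row = []

# Stand-in HTML; in practice this would be read from a file or URL argument
html = "<table><tr><th>name</th><th>qty</th></tr><tr><td>widget</td><td>3</td></tr></table>"
parser = TableParser()
parser.feed(html)

# Write the scraped rows as CSV, the same hand-off the PDF job used
buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)
print(buf.getvalue())
```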