No, PII isn’t a four-letter word.
Made you stop and think, eh? Mission accomplished. Anyway, dealing with personally identifiable information (PII) can be a real bummer. PII is useful and often critical to the cycle of analytics, but must be treated very carefully. Users who don’t know any better can expose PII quite easily with tools like Tableau, QlikView, and Looker.
In Unifi 2.4, we’ve added the first of many features that will help you work with PII and move you down the governance, risk management, and compliance (GRC) road without much effort on your part.
What am I talking about? PII tagging. When creating or editing a Dataset, any column is fair game and can be identified as PII:
In the example above, I’m deciding that Marital Status is PII and flagging it as such. (Wouldn’t it be cool if Unifi did this automatically for me? Or even automatically masked the values? Don’t worry, the robots are coming.)
After throwing this flag, I get immediate feedback. Check out the image on the right:
By the way, are you paying attention? Do you notice those fancy data types in the column headers? The iconography? Unifi has been doing quite a bit of AI work and you can see a small part of the “automagic” payoff over yonder. We’ve taken three string fields and auto-classified them based on data sampling. Who knew that “bald” was a hair color! Alan, there’s hope for you yet.
Below is a whole bunch of columns from the same dataset, all of which have been auto-classified. I also threw the PII flag on many of them.
Note the creditCard column. The data in the CSV file is synthetic and happens to contain only fake Mastercard numbers. Our AI gremlins were smart enough to catch that fact and let me know that this is a “Mastercard” column vs. a more generic Credit Card or a really generic String. We’re getting really meta.
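I’m not going to spill how our gremlins actually do it, but if you want a feel for the general idea of classifying a column from a sample of its values, here is a deliberately simple, rule-based sketch. It is not Unifi’s method. It assumes the publicly documented Mastercard prefixes (51 to 55, and 2221 to 2720), a 16-digit length, and the Luhn checksum. The function names and the 0.9 threshold are made up for illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_mastercard(value: str) -> bool:
    """Rule-based check: 16 digits, a Mastercard prefix, and a valid checksum."""
    digits = value.replace(" ", "").replace("-", "")
    if not (digits.isascii() and digits.isdigit() and len(digits) == 16):
        return False
    in_classic_range = 51 <= int(digits[:2]) <= 55
    in_2_series = 2221 <= int(digits[:4]) <= 2720
    return (in_classic_range or in_2_series) and luhn_valid(digits)


def classify_column(samples, threshold=0.9):
    """Label a column "Mastercard" if enough sampled values match, else "String".

    The threshold is an arbitrary assumption, not a Unifi setting.
    """
    if not samples:
        return "String"
    hits = sum(looks_like_mastercard(value) for value in samples)
    return "Mastercard" if hits / len(samples) >= threshold else "String"


# Well-known Mastercard test numbers, not real cards.
label = classify_column(["5555555555554444", "5105 1051 0510 5100"])
```

A real classifier would need to tell dozens of types apart and cope with dirty data, which is where sampling plus something smarter than a prefix table earns its keep.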
Now at this point you may be thinking, “Yeah, that’s fine, but so what? What can I actually do now?” Well, Unifi is building the answer to those questions as I type… and I don’t want to ruin the surprise.
That said, here’s something you can do right this minute:
Tracking PII with Tableau
Unifi is really good at generating metadata from your data. All of this information is stored, and easily accessible to you via the UI or from inside the Unifi system catalog.
I’m going to take some work I did in my last post and rev it a tiny little bit.
Each dataset column we store has a distinct PII flag that signals whether or not the information is sensitive. If you know where to find this metadata, you’re pretty much set. You can ask questions like:
- Who has access to PII?
- Who doesn’t have access to PII?
- Which users have accessed PII and when?
- How much PII am I adding in my enterprise over time?
You’re going to want to go after the uf_dataset_column.is_pii column to begin with. The basic ERD you work with looks like so:
You’ll then want to add the uf_dataset (datasets) and uf_src (data sources) tables into the mix to give the PII a bit more context and add user-friendly grouping. A complete Tableau data source will look something like this:
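If you’d rather poke at the same join outside of Tableau, here is a rough sketch using Python’s built-in sqlite3 module and a throwaway in-memory database. Only the table names and uf_dataset_column.is_pii come from the catalog as described above. The key columns (id, dataset_id, src_id), the name columns, and the sample rows are my assumptions for illustration, so check them against your own catalog before borrowing the query.

```python
import sqlite3

# Throwaway in-memory stand-in for the system catalog.
# Key and name columns are ASSUMED; only the table names and is_pii
# are taken from the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE uf_src (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE uf_dataset (id INTEGER PRIMARY KEY, src_id INTEGER, name TEXT);
CREATE TABLE uf_dataset_column (
    id INTEGER PRIMARY KEY, dataset_id INTEGER, name TEXT, is_pii INTEGER
);

-- Made-up sample rows.
INSERT INTO uf_src VALUES (1, 'Salesforce'), (2, 'CSV Upload');
INSERT INTO uf_dataset VALUES (10, 1, 'Salesforce Customer'), (20, 2, 'Web Logs');
INSERT INTO uf_dataset_column VALUES
    (100, 10, 'Customer Annual Income', 1),
    (101, 10, 'Twitter Handle', 1),
    (102, 10, 'Region', 0),
    (200, 20, 'URL', 0),
    (201, 20, 'Status Code', 0);
""")

# One row per dataset: how many columns it has and how many are tagged PII.
PII_BY_DATASET = """
SELECT s.name, d.name, COUNT(*) AS total_columns, SUM(c.is_pii) AS pii_columns
FROM uf_dataset_column AS c
JOIN uf_dataset AS d ON d.id = c.dataset_id
JOIN uf_src AS s ON s.id = d.src_id
GROUP BY s.name, d.name
ORDER BY pii_columns DESC, d.name
"""

rows = conn.execute(PII_BY_DATASET).fetchall()
for source, dataset, total_columns, pii_columns in rows:
    print(f"{source} / {dataset}: {pii_columns} of {total_columns} columns are PII")
```

The data source and dataset names are only there to give the PII some context and a friendly grouping, which is the same reason they get pulled into the Tableau data source.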
After we have a data source, it becomes trivial to see who has access to PII across my entire enterprise. In the viz below, each mark represents an attribute (column) of a dataset. Orange marks highlight attributes which someone has tagged as PII.
It took me about 5 minutes to put this thing together, yet I can see:
- Which of my datasets are most complex from an attribute standpoint
- Which datasets contain PII, and how much of it
- What that PII represents
- Potential gaps in my ACLs and/or policies that allow users access to stuff they shouldn’t have
Below, I’m doing a bit of data exploration. If I had to guess, I’d bet that someone meant to strip permissions on Customer Annual Income and Twitter Handle from the users Ethan and Fabian. Either that, or perhaps those two gents shouldn’t have access to the Salesforce Customer dataset at all, since Gwen doesn’t. Hard to say, but something just doesn’t look right. I need to investigate further.
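The eyeball test can also be roughed out in code. The sketch below is a heuristic of my own, not a Unifi feature: given a group of peers who you’d expect to have matching permissions, it reports every PII column where some of them have access and others don’t. The function name, the shape of the grant data, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def split_pii_access(grants, pii_columns, peers):
    """Find PII columns that only some of the peers can see.

    grants:      iterable of (user, column) pairs (hypothetical shape)
    pii_columns: set of column names tagged as PII
    peers:       set of users expected to share the same permissions
    Returns {column: (users_with_access, users_without_access)}.
    """
    users_by_column = defaultdict(set)
    for user, column in grants:
        if column in pii_columns and user in peers:
            users_by_column[column].add(user)

    findings = {}
    for column, users in users_by_column.items():
        missing = peers - users
        if missing:  # access is split across the peer group
            findings[column] = (sorted(users), sorted(missing))
    return findings


# Made-up grants that mirror the situation described above.
grants = [
    ("Ethan", "Customer Annual Income"),
    ("Fabian", "Customer Annual Income"),
    ("Ethan", "Twitter Handle"),
    ("Fabian", "Twitter Handle"),
    ("Ethan", "Region"),
    ("Fabian", "Region"),
    ("Gwen", "Region"),
]
findings = split_pii_access(
    grants,
    pii_columns={"Customer Annual Income", "Twitter Handle"},
    peers={"Ethan", "Fabian", "Gwen"},
)
```

A split isn’t proof of a mistake. It just tells me which columns deserve a second look.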
Even with the great Permissions Explorer Unifi gives you, this potential issue could slip through the cracks unless I’m making a point to eyeball each attribute. Here, the problem just jumps out at you at a glance.
What about folks who not only have permissions on PII data, but actually access it? The treemap below groups users by their function and then shows column-level access in each of the datasets they have been GRANTed. The larger the mark in the treemap, the more times a user has accessed the dataset column in question.
Doesn’t look good for Fabian. He seems very interested in the Customer Annual Income column.
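The sizing logic behind a treemap like this is just a count of access events per user and column. Here is a minimal sketch, assuming the access history can be exported as (user, column) pairs. That export format and the sample events are my invention, not something the catalog is documented to provide.

```python
from collections import Counter

# Made-up access events: one (user, column) pair per access.
access_log = [
    ("Fabian", "Customer Annual Income"),
    ("Fabian", "Customer Annual Income"),
    ("Fabian", "Customer Annual Income"),
    ("Ethan", "Twitter Handle"),
    ("Ethan", "Region"),
    ("Fabian", "Twitter Handle"),
]
pii_columns = {"Customer Annual Income", "Twitter Handle"}

# Count only accesses to columns tagged as PII.
pii_access_counts = Counter(
    (user, column) for user, column in access_log if column in pii_columns
)

# The biggest mark in the treemap corresponds to the highest count.
top_pair, top_count = pii_access_counts.most_common(1)[0]
print(f"{top_pair[0]} accessed {top_pair[1]} {top_count} times")
```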
Summary
PII functionality in Unifi: Useful. More coming. You’ll love it. Except for Fabian.