A Deeper Dive into PII

In the December town hall meetings for the password implementation, I provided a brief overview of personally identifiable information (PII), and a more in-depth review of a “confidentiality impact assessment.” This blog post digs a little deeper into “what is PII?”

The book Managing Data for Patron Privacy: Comprehensive Strategies for Libraries does an excellent job of describing the types of PII. According to the authors, Briney and Yoose, there are three types of PII and two categories of data sets. The three types of PII are direct, indirect, and behavioral. Data sets can be classified as aggregate data or de-identified data.

Direct PII consists of information sufficient to identify a person from a single data point. This is what most people think of when they think about PII. Examples of direct PII include name, physical and email addresses, biometric data, phone numbers, and official identifiers such as Social Security numbers or driver's license numbers. A library card number is a direct identifier, and library patron records include several fields that hold direct identifiers.

Indirect PII consists of information about a person that can be combined to identify a specific person. Briney and Yoose list examples of indirect identifiers such as age, race, gender identity, year in school, veteran status, socioeconomic status, and disability status.
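To make the idea of "combined" identifiers concrete, here is a minimal Python sketch. The patron records, field names, and the unique_combinations helper are all hypothetical and invented for illustration; they do not come from the book or from Sierra. The sketch counts how many records share each combination of indirect identifiers and reports the combinations that match exactly one person.

```python
from collections import Counter

# Hypothetical patron records that hold only indirect identifiers.
patrons = [
    {"age_range": "18-24", "year_in_school": "senior", "veteran": False},
    {"age_range": "18-24", "year_in_school": "senior", "veteran": False},
    {"age_range": "18-24", "year_in_school": "senior", "veteran": True},
    {"age_range": "25-34", "year_in_school": "graduate", "veteran": False},
    {"age_range": "25-34", "year_in_school": "graduate", "veteran": False},
]


def unique_combinations(records, fields):
    """Return the combinations of field values that match exactly one record."""
    counts = Counter(tuple(record[field] for field in fields) for record in records)
    return [combo for combo, n in counts.items() if n == 1]


# One field alone does not single anyone out in this sample.
print(unique_combinations(patrons, ["age_range"]))

# Three fields together isolate one patron.
print(unique_combinations(patrons, ["age_range", "year_in_school", "veteran"]))
```

In this made-up sample, age range alone points to no one, but adding year in school and veteran status leaves one patron standing alone.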

The third type of PII is behavioral data, which “describes a person’s actions instead of a person’s attributes.” Behavioral data is based on the recurring activities of a person that, taken together, can be used to identify that person. To quote the authors again, “this could be the patron who has checked out hundreds and hundreds of books, the person who regularly uses niche resources, or the patron who always attends library workshops.” Reading history and proxy history are perfect examples of behavioral data that will be better protected by passwords in Sierra.

When data about individuals is combined into data sets, as in the case of the hundreds of thousands of patron records in Marmot’s Sierra, it can be handled as either aggregate or de-identified data. Aggregate data consists of “one or a few data points about a group and is not about individuals within that group.” These are the types of statistics that are often included in annual reports, such as how many people visited the library or how many items circulated. The risk with aggregate data arises when too many data points are summarized together, or when the number of people in a group becomes small enough that a specific person can be identified from the combination of data.
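One common way to reduce the small-group risk described above is to withhold any count that falls below a minimum size before publishing a report. The Python sketch below illustrates that idea only. The threshold of 5, the attendance figures, and the suppress_small_counts helper are hypothetical; they are not drawn from the book or from any Marmot policy.

```python
# Hypothetical threshold, chosen only for illustration.
MIN_GROUP_SIZE = 5


def suppress_small_counts(counts, minimum=MIN_GROUP_SIZE):
    """Replace any count below `minimum` with None so it is not published."""
    return {group: (n if n >= minimum else None) for group, n in counts.items()}


# Hypothetical workshop attendance, grouped by age range.
attendance_by_age = {"18-24": 42, "25-34": 17, "65+": 2}

report = suppress_small_counts(attendance_by_age)
print(report)
```

Here the two attendees in the "65+" group are hidden in the report, since a count that small could point to specific people.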

Finally, de-identified data is individualized data that has been stripped of identifiers. De-identified data is often used when libraries conduct assessments of programs or services, in order to study patterns of behavior across many individuals. As Briney and Yoose point out, de-identified data still carries the risk that it may be possible to re-identify people within the data set, particularly if de-identified data is combined with other types or sets of data.
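The re-identification risk can be sketched in a few lines of Python. Everything below is hypothetical: the records, the field names, the placeholder patron names, and the link_records helper are invented for illustration. The sketch shows how a de-identified data set, once matched against an outside data set on shared fields, can tie a name back to a record.

```python
# Hypothetical de-identified assessment data: names and card numbers removed.
deidentified = [
    {"zip": "81501", "birth_year": 1950, "workshops_attended": 14},
    {"zip": "81501", "birth_year": 1988, "workshops_attended": 2},
    {"zip": "81601", "birth_year": 1988, "workshops_attended": 5},
]

# Hypothetical outside data set that still carries names.
outside = [
    {"name": "Patron A", "zip": "81501", "birth_year": 1950},
    {"name": "Patron B", "zip": "81601", "birth_year": 1975},
]


def link_records(deid_records, named_records, keys):
    """Pair each named person with a de-identified record when exactly one matches."""
    matches = []
    for person in named_records:
        hits = [r for r in deid_records if all(r[k] == person[k] for k in keys)]
        if len(hits) == 1:  # a single match re-identifies the record
            matches.append((person["name"], hits[0]))
    return matches


for name, record in link_records(deidentified, outside, ["zip", "birth_year"]):
    print(name, "->", record)
```

In this made-up example, "Patron A" is matched to a single record and its workshop attendance, even though the assessment data held no names.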

The next blog post, in two weeks, will look at the spectrum of risk associated with these types and categories of PII.