Chloe McAree (McAteer)
Revolutionising Redaction - My Final Year Project

With major changes in data-protection regulations, it has become even more important that personal data is secured and only seen by the right people. With this being such a hot topic right now, I was extremely keen to work on a final year project that could tackle this ever-increasing issue.

I am currently finishing my final year of studying software engineering at Queen's University Belfast and have been sponsored by Kainos to complete this project. In this blog I plan on taking you on a journey through my project, including the need for it and a high-level overview of the technologies used.

So, what is the problem?

Redaction is the process of removing sensitive information from documents, and it needs to be carried out across many different industries. It typically involves someone manually going through a document word by word looking for sensitive information to remove, which is of course a very time-consuming and tedious task that is prone to human error.

I’m sure we have all seen a few documents that have ended up looking like this:

One of the major problems with redaction is that when a lot of words have been removed, the whole context of the document can be lost.

My Solution

The overall aim of this project is to identify and remove sensitive information within a document, whilst still retaining its overall context. The project is divided into 3 main features:

  1. Auto-Redactor: aims to take a user's document, automatically identify and replace the sensitive words, and present the user with a redacted copy of their document.

  2. User Selection Redaction: aims to take the user's document, automatically identify the sensitive words, and send these words back to the user so they can select which ones they want removed.

  3. Labeller: aims to make the process of retraining the models to recognise more words easier and quicker.

How can I identify sensitive words?

This project revolves around the ability to recognise sensitive words within documents. To do this I am making use of Natural Language Processing (NLP), a sub-field of Artificial Intelligence concerned with the ability of computers to process, analyse and understand human languages.

Within NLP, the project places a particular focus on Named Entity Recognition (NER), which searches a body of text and classifies named entities into predefined categories. These categories correspond to the different types of sensitive words, e.g. people's names, organisations, monetary values, locations and dates.

To help me with this task I came across a really awesome open-source Python library called spaCy, which is built for advanced Natural Language Processing.

spaCy is amazing and helps to show the grammatical structure of text through a range of linguistic annotations. It can identify different attributes of the text, such as the base form of each word, whether a word contains alphabetic characters or digits, and sentence boundaries, and it can tag parts of speech, e.g. whether a word is a noun, verb or adjective.
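To give a flavour of what this looks like in practice, here is a small snippet (not the project's actual code) using spaCy's default small English model:

```python
import spacy

# Assumes the small English model has been downloaded:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Chloe paid Acme Ltd £500 in Belfast on 4 May 2019.")

# Token-level linguistic annotations: base form, part of speech,
# and whether the token is alphabetic or a digit
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_alpha, token.is_digit)

# Named entities classified into predefined categories
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Chloe PERSON, £500 MONEY, Belfast GPE
```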

Handling Different Types of Documents

As redaction takes place across so many different industries, I wanted the project to be able to cater to the different styles of documents that might be uploaded. For example, the phrasing of words in a legal document will probably be different from those in a hospital report. To be able to deal with this I need to make use of multiple AI models. But what are these?

AI models are binary data produced by showing a system enough examples for it to make predictions that generalise across a language. For this project I used the default spaCy model and built upon it with my own training data.

The project consists of three separate models, so that documents can be redacted using a domain-specific model. For example, if the user is uploading a legal document it will use the model that has been trained specifically on legal documents. The three models that the project consists of are legal, insurance and general.
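As a rough sketch, choosing a model can be as simple as mapping the user's selected document type to a saved model. The paths below are hypothetical stand-ins for wherever the trained models actually live:

```python
import spacy

# Hypothetical locations of the three trained models; the real
# project's storage will differ.
MODEL_PATHS = {
    "legal": "models/legal",
    "insurance": "models/insurance",
    "general": "models/general",
}

def load_model(document_type: str):
    """Load the domain-specific model, falling back to the general one."""
    path = MODEL_PATHS.get(document_type, MODEL_PATHS["general"])
    return spacy.load(path)
```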

Auto-Redactor

One of the main pieces of functionality within the project is the auto-redactor. I have built it so that once the user uploads a file, they need to select its type, e.g. legal, insurance or general, so the application can work out which model should be used.

Once the document has been uploaded, a range of different processes occur in order to make predictions about the different types of words/phrases in the document. If you are interested in hearing more about this process, in-depth descriptions and examples can be found here.

My main aim for the auto-redactor was for it not only to remove the sensitive information from the document, but to retain the document's context. To achieve this, once the sensitive words have been identified, I replace each one with its label. For example, if a name is identified it is removed and the word ‘PERSON’ is put in its place. This ensures that the content of the document is still intact, but the document no longer contains any sensitive information.
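A simplified sketch of that replacement step (the project's real implementation may differ), using the character offsets spaCy provides for each entity:

```python
def redact(nlp, text: str) -> str:
    """Replace every identified entity with its label, so the document
    keeps its context but loses the sensitive values."""
    doc = nlp(text)
    redacted = text
    # Replace from the end of the text backwards so that earlier
    # character offsets remain valid after each substitution
    for ent in reversed(doc.ents):
        redacted = redacted[:ent.start_char] + ent.label_ + redacted[ent.end_char:]
    return redacted

# e.g. "Chloe works at Kainos in Belfast."
#   -> "PERSON works at ORG in GPE."
```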

User Selection Redaction

During initial project research I discovered that sometimes, not all sensitive words need to be removed from a document. For example, if a business is working with a particular client, maybe information about that client needs to be removed, but information about the business itself can remain.

To handle this use case, I created a feature called User Selection Redaction, which works by giving the user the choice to preview the sensitive words that have been identified.

When the user uploads a document, if they selected the preview option, the list of sensitive words identified by the model is sent back to them.

The user can then simply select the words they want removed from the document. Only the selected words are replaced by their labels, and the user is sent a copy of the redacted document, which they can download.
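Reusing the replacement approach from the auto-redactor, a minimal sketch of this two-step flow might look like the following, with `selected_words` standing in for whatever the user ticks in the UI:

```python
def preview_entities(nlp, text: str):
    """Return the identified sensitive words for the user to review."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

def redact_selected(nlp, text: str, selected_words: set) -> str:
    """Replace only the entities the user chose with their labels."""
    doc = nlp(text)
    redacted = text
    # Work backwards so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.text in selected_words:
            redacted = redacted[:ent.start_char] + ent.label_ + redacted[ent.end_char:]
    return redacted
```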

Retraining Models

Another major piece of functionality for this project is the labeller. I have built this tool to make the process of retraining the models to recognise more words easier and quicker.

Retraining the models is not about getting them to memorise more examples, but about getting the individual models to improve their weights so that they can generalise across more and more documents. In the auto-redactor the models make predictions based on documents they have seen, so the only way to improve their accuracy at identifying all the sensitive words in a document is to retrain them with a lot more data.

The labeller allows the user to highlight a word by selecting it and then hitting the button for the label they wish to assign to it.

To actually use these words to retrain an existing model, the start index and end index of each word in relation to the whole file need to be acquired. These indices, together with the label for each sensitive word, are used to build up the training data along with the original file.
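For illustration, a single training example in spaCy's (v2-style) format pairs the raw text with exactly those offsets and labels; the sentence and offsets here are made up:

```python
text = "Chloe paid Acme Ltd in Belfast."

# One training example: the original text plus a (start, end, label)
# triple for each word highlighted in the labeller
train_example = (
    text,
    {"entities": [(0, 5, "PERSON"), (11, 19, "ORG"), (23, 30, "GPE")]},
)
```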

Without going into too much detail, this process takes every word in the training data and uses the model to make a prediction for what named entity it thinks the word is; it then looks at the label attached to the word in the training data to see if the prediction was correct. If the prediction was not correct, the model adjusts its weights to get the correct result next time. The updated weights are then saved back to the model.
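A condensed sketch of that loop, written in the style of spaCy v2's training API; the project's real retraining code will differ in its details:

```python
import random
import spacy

def retrain(model_path: str, train_data: list, iterations: int = 20):
    """Show the model each labelled example repeatedly, nudging its
    weights whenever its prediction disagrees with the label."""
    nlp = spacy.load(model_path)
    # Freeze every pipeline component except the named entity recogniser
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.resume_training()  # keep the existing weights
        for _ in range(iterations):
            random.shuffle(train_data)
            for text, annotations in train_data:
                nlp.update([text], [annotations], sgd=optimizer)
    nlp.to_disk(model_path)  # persist the updated model
```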

Project Completion

On completion, it is clear that a project like this has significant potential in both the private and public sectors, as it can help put text data in a GDPR-compliant position.

Not only does this project make the redaction process quicker and easier, but it also has the potential to be revolutionary for the AI industry. Finding available data to train AI models on is difficult when it contains sensitive information. At the minute there are many tasks that AI could assist with but can't, because the required datasets contain personal information and so cannot be used. With a product like this, datasets could easily be anonymised and put to use.

Nearing completion, this project was listed as one of the finalists for the prestigious Megaw Memorial Lecture, a competition within Queen's University that is open to all final year students within the School of Electronics, Electrical Engineering and Computer Science to present their final year project.

After presenting both the business need for this project and the technology used to create it, the project was named the overall winner and was presented with the Megaw Memorial Award by the Institution of Engineering and Technology (IET).

I would also like to say a huge thank you to Kainos for sponsoring this project and for providing me with Jordan McDonald as a project mentor; his guidance throughout the project has been invaluable.