Extracting data from documents has become essential, both to meet regulatory requirements and to ensure banks run effective businesses. The need to Know Your Customer goes beyond ticking boxes. Historically, firms have had to invest significant time and money in solutions that must be codified or trained before they deliver business value. Modern AI solutions take a different approach, requiring little to no training but yielding unpredictable results. Is there a middle ground?
What is Data Extraction and Why Bother?
The building blocks of a good business, be that in financial services or any other industry, are good data. If what you know about your customers, your supply chain, or even your own employees contains gaps or errors, you’re in trouble. That might mean regulatory trouble or legal challenges. But poor data quality is also likely to hurt you commercially through lost business, missed opportunities, poor customer experience and reputational damage.
Data, therefore, must be central to most business strategies.
In the case of financial services, the regulatory and legal cost of not having up-to-date, high quality data about a bank’s customers can be very real. Recent history has seen fines in the US well in excess of $100m for failings in data governance and control that led to poor customer outcomes. Similar stories can be found across the UK and Europe as well.
But why is it so difficult to create and maintain good quality data? On the surface the solution is simple: ask your customers to provide evidence of the critical things you need to know in order to trade with them. But the forms this evidence can take vary wildly depending on the account type, the risk profile of the customer, the risk appetite of the institution, and the regulatory guidelines at the time, all of which continue to change and evolve. Balance that against a shareholder-driven imperative to control costs, as well as to avoid annoying customers, and the problem becomes more complex.
In retail banking and most forms of corporate banking, firms tend to rely on information coming solely from customers in the first instance: passports, ID cards, driving licences, utility bills, proof of income and so on. These documents provide the basic information needed to perform background checks against sanctions lists and other law enforcement registers, as well as to screen for negative news stories.
In Capital Markets and larger commercial banking relationships, there tends to be a reliance on a combination of public source information, such as news sites, company websites, government and state registers, as well as paid data aggregation providers and of course, the customers themselves. However, in this domain we see the documentary evidence shift from ID-related documents to sources that provide ownership information about the business, sources of revenue, trading locations and supply chain details.
The fundamentals of what needs to be done are the same in all cases. Source the documents that you need, open them, find the data points you want and copy them into your upstream systems. Assuming you source the documents well, the task of reading and extracting data is not complex for humans to undertake. But it is repetitive, prone to mistakes and at scale, can be very expensive.
Here is where technology comes in. There are many different software products in the market that can help to fully or partially automate the data extraction process. In recent months, thanks to the current AI boom, there has been a significant increase in focus on whether newer tools can overcome some of the challenges of solutions that have been available for a long time.
Reliably automating the extraction of data from documentation is nothing new. Optical Character Recognition, or OCR for short, has been around for decades. OCR is the process of turning an image, such as a scanned document, into a machine-readable format, and can be extended to search the resulting text for what you need. OCR does have its limits, most notably when documents are long, have complex layouts, or contain data that spans multiple pages.
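As a minimal sketch of that basic step, assuming the open-source Tesseract engine is available via the pytesseract and Pillow libraries (an assumption for illustration, not a description of any particular vendor product), OCR-and-search might look like this:

```python
# A minimal OCR sketch: turn a scanned page into text, then search it.
# Assumes Tesseract is installed locally and exposed via pytesseract.
import re

import pytesseract
from PIL import Image


def extract_text(image_path: str) -> str:
    """Run OCR over a scanned document image and return plain text."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image)


def find_date_of_birth(text: str) -> str | None:
    """Search the OCR output for a simple date pattern (illustrative only)."""
    match = re.search(r"\b\d{2}[/.-]\d{2}[/.-]\d{4}\b", text)
    return match.group(0) if match else None


if __name__ == "__main__":
    text = extract_text("scanned_passport.png")  # hypothetical file name
    print(find_date_of_birth(text))
```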
These limits can be overcome to an extent, particularly if the layout of the document never, or rarely, changes. It is then possible to codify the positions on the page that are likely to contain the data you need. Think about a passport: there are only so many layouts in the world, and it can be reasonably efficient to codify those into a system, allowing you to automate the extraction of data from them. For documents that have less rigid structures, or that change layout frequently, this approach is more difficult to use unless the data you want to extract is simple.
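To make the template idea concrete, here is a rough sketch using the same pytesseract and Pillow assumptions. The field names and pixel coordinates below are invented for illustration; in practice they would be calibrated per known layout.

```python
# A rough template-extraction sketch: crop known regions of a fixed-layout
# document and OCR each region separately. Coordinates are illustrative only.
import pytesseract
from PIL import Image

# Hypothetical field positions (left, top, right, bottom) for one known layout.
PASSPORT_TEMPLATE = {
    "surname": (350, 180, 820, 220),
    "given_names": (350, 230, 820, 270),
    "date_of_birth": (350, 330, 600, 370),
}


def extract_fields(image_path: str, template: dict) -> dict:
    """OCR each templated region and return a field -> value mapping."""
    page = Image.open(image_path)
    results = {}
    for field, box in template.items():
        region = page.crop(box)
        results[field] = pytesseract.image_to_string(region).strip()
    return results


print(extract_fields("scanned_passport.png", PASSPORT_TEMPLATE))
```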
Machine Learning is another method, often used in conjunction with OCR. Machine Learning essentially allows you to train a model to recognise patterns from the training data you provide. In other words, gather enough samples of the documents you want to extract data from, annotate all of the data you want extracted, and feed that to a Machine Learning model; it can then learn what you want, and how you want it, and replicate the process on unseen documents. Machine Learning relies less on consistency of layouts than OCR by itself and can be applied to more complex and lengthy document types. It is also simpler to explain and predict than newer forms of AI and, implemented correctly, can give you a highly robust and accessible audit trail for where the data came from. Its main drawbacks are the time (and therefore cost) needed to train, and the number of samples required to achieve sufficient accuracy.
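A deliberately simplified sketch of this supervised approach is below: a small scikit-learn pipeline that learns to label OCR'd lines of text as field types. The training examples are invented, and a production-grade document model would need far more data and usually layout-aware features, not just text.

```python
# A simplified sketch of the machine-learning approach: train a classifier
# that labels OCR'd lines of text as field types. The labelled samples are
# invented and far too few for real-world accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

training_lines = [
    "Registered office: 25 Example Street, London",
    "Total revenue for the year: 4,300,000 USD",
    "Ultimate beneficial owner: Jane Doe (52%)",
    "Principal place of business: Dublin, Ireland",
]
training_labels = ["address", "revenue", "ownership", "address"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(training_lines, training_labels)

# Apply the trained model to a line from an unseen document.
unseen = ["Beneficial owner: ACME Holdings Ltd (100%)"]
print(model.predict(unseen))  # expected to print something like ['ownership']
```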
Recent advancements in Artificial Intelligence, particularly Large and Small Language Models, take a different approach. Instead of pre-labelling hundreds or thousands of samples, the approach relies on a very heavily trained language model that can connect information in vast and interesting ways. I deliberately avoid the term “understands” here, because that’s not technically true, even though it often feels like it when interacting with something like GPT. AI can use reasoning to predict what answer to give you based on the question you ask and the information it has access to. Provide it with a document and ask it to summarise the contents, or pick out specific data points, and it will usually do a good job, particularly if the document is not too long and the data required is not heavily nuanced. The principal drawbacks are the run cost of these models, the complexity of predicting or explaining why the model gave the answer it did, and evidencing where in the document the answer came from. Consistency is often an issue as well. Feed an AI model the same document ten times with the same questions and it is highly unlikely you’ll get the same answers each time. This can be a big problem in heavily regulated use cases.
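As a hedged sketch only, using the OpenAI Python client as one example of an LLM API (the model name is illustrative, and prompts, costs and output stability vary by provider), LLM-based extraction might look like this. Pinning the temperature to zero reduces, but does not eliminate, the run-to-run variation described above.

```python
# A hedged sketch of LLM-based extraction via the OpenAI Python client.
# The model name is illustrative; real code should also guard against
# replies that are not valid JSON.
import json

from openai import OpenAI

client = OpenAI()  # expects an API key in the OPENAI_API_KEY environment variable


def extract_with_llm(document_text: str) -> dict:
    prompt = (
        "Extract the registered company name, registered address and "
        "country of incorporation from the document below. "
        "Respond with JSON only, using the keys name, address, country.\n\n"
        + document_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,        # reduces, but does not remove, variability
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```

Running the same call several times on the same document and comparing the outputs is one simple way to surface the consistency problem in practice.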
| Solution | Pros | Cons |
| --- | --- | --- |
| OCR | Mature, cheap to run and effective when layouts are fixed and the required data is simple | Struggles with long documents, complex layouts and data that spans multiple pages |
| Machine Learning | Handles more complex and lengthy documents; less dependent on consistent layouts; explainable, with a robust audit trail when implemented correctly | Time and cost to train; needs large volumes of annotated samples to reach sufficient accuracy |
| AI (LLMs) | Requires little to no training; copes with varied documents and phrasing | Run cost; difficult to explain or evidence where an answer came from; inconsistent results across identical runs |
At FinTrU, we take a combination approach to extracting data for our clients. We have developed our own unique Intelligent Document Processing product called TrU Label, which blends humans, OCR, Machine Learning and, very soon, AI, to ensure that accuracy is always 99% or above.
TrU Label delivers three important moments of value:
Even where FinTrU has no pre-trained model in place, the existing framework model can be updated. This means that if firms are able to supply data, FinTrU has the in-house talent and flexibility to take on large-scale labelling projects and build a completely bespoke model to suit a firm’s needs.
But data is only useful if you can do something with it. Most software vendors operating in the Intelligent Document Processing market offer a solution that transforms unstructured data into structured data. That has value, but it’s not the end game. TrU Label can transform unstructured data into structured data, and then turn that structured data into useful information.
If you have 10 documents about a single customer, we extract the data from each document and then compare the results across all of them to create a single consolidated report. This is configurable according to each client’s unique rules and policies. At this stage we have provided useful information.
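As a purely illustrative sketch (not a description of how TrU Label is implemented), consolidating fields extracted from several documents about the same customer might look like this, with a simple configurable source-priority rule deciding which document wins when values disagree:

```python
# Purely illustrative consolidation sketch: merge fields extracted from
# several documents about one customer into a single report, using a
# configurable source-priority rule to resolve disagreements.
from collections import defaultdict

# Hypothetical extraction results: one dict per source document.
extracted = [
    {"source": "certificate_of_incorporation", "name": "Acme Trading Ltd", "country": "UK"},
    {"source": "annual_report", "name": "ACME Trading Limited", "country": "UK"},
    {"source": "customer_questionnaire", "name": "Acme Trading Ltd", "country": "United Kingdom"},
]

# Configurable rule: which document types to trust first, per field.
PRIORITY = {
    "name": ["certificate_of_incorporation", "annual_report", "customer_questionnaire"],
    "country": ["certificate_of_incorporation", "customer_questionnaire", "annual_report"],
}


def consolidate(records: list[dict]) -> dict:
    """Build one consolidated report from per-document extraction results."""
    by_field = defaultdict(dict)
    for record in records:
        for field, value in record.items():
            if field != "source":
                by_field[field][record["source"]] = value
    report = {}
    for field, values_by_source in by_field.items():
        for source in PRIORITY.get(field, []):
            if source in values_by_source:
                report[field] = values_by_source[source]
                break
    return report


print(consolidate(extracted))
```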
But we don’t stop there. In active development is the ability to apply the firm’s rules and policies to compare the new client information against the existing “on-file” information and identify material differences. The word “material” is critical here. Just flagging differences is not effective. If a client moves three blocks down the road in Manhattan, for most banks that’s unlikely to be a material change. If they start deriving revenue from a sanctioned country? Different story.
"Data is single fields, such as names, addresses, facts and figures. Information is the de-duplicated and consolidated view of that data. Intelligence is the information transformed into recommendations or action." - Steven Hewlett-Light
That’s what our materiality assessment feature will do for us shortly. It transforms the information gained in the previous step into actionable business intelligence that firms can use to determine the next course of action to take with a given client.
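Again purely as a hedged sketch (the rules, field names and country list below are invented, not FinTrU's policy engine), a materiality check over consolidated information could be expressed as a small set of per-field rules:

```python
# An invented sketch of a materiality check: compare newly consolidated
# information with what is on file and flag only the changes a firm's
# policy would treat as material. Rules and fields are hypothetical.

SANCTIONED_COUNTRIES = {"Country A", "Country B"}  # placeholder list


def address_rule(old: str, new: str) -> bool:
    # Treat an address change as material only if the country changes.
    return old.split(",")[-1].strip() != new.split(",")[-1].strip()


def revenue_country_rule(old: str, new: str) -> bool:
    # Any new revenue source in a sanctioned country is material.
    return new in SANCTIONED_COUNTRIES and old not in SANCTIONED_COUNTRIES


MATERIALITY_RULES = {
    "registered_address": address_rule,
    "revenue_country": revenue_country_rule,
}


def material_changes(on_file: dict, latest: dict) -> dict:
    """Return only the field changes that the configured rules flag as material."""
    flags = {}
    for field, rule in MATERIALITY_RULES.items():
        if field in on_file and field in latest and rule(on_file[field], latest[field]):
            flags[field] = {"was": on_file[field], "now": latest[field]}
    return flags


on_file = {"registered_address": "10 Main St, New York, USA", "revenue_country": "USA"}
latest = {"registered_address": "55 Broad St, New York, USA", "revenue_country": "Country A"}
print(material_changes(on_file, latest))  # only the revenue change should be flagged
```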
Privacy and security are also critical to the way that TrU Label works. Our models are not like major LLMs: not everything processed through the product immediately feeds the models. We hand-pick what we use for training, both to ensure accuracy and to ensure that we only process what we absolutely need. This puts our clients in control of how their data is used to improve the models, and ensures that the model they are using is tuned and trained specifically to meet their needs.
It's this unique combination of subject matter expertise, technical delivery capability and real-world client insight that allows us to understand what our clients need and respond rapidly. It differentiates us from other IDP vendors. We do a few important things exceptionally well.
The following points are important for firms to consider when deciding on their approach to automating the extraction of data from documentation as part of regulatory processes such as KYC:
Written by:
Steven Hewlett-Light is FinTrU's Head of Product, responsible for the vision and strategy behind the technology products we develop for our clients. Steven has over 20 years' experience in financial services, working across a wide range of industry verticals. He has extensive experience creating software solutions that solve complex regulatory challenges, including the use of Artificial Intelligence and Machine Learning, and working with clients to ensure those solutions meet their individual needs and requirements.