Automated data extraction from annual reports to help a large North American bank make investment decisions.


The client is a large bank with operations in North America. It provides M&A, corporate finance advisory and valuation services to mid-sized and private organizations. The client needs to undertake in-depth analysis of annual reports to extract key financial data, which is often buried within exhaustive notes, schedules, footnotes and tables appended to the primary financial statements. Under the existing process, annual reports were analyzed manually to extract the relevant data.

Reading long-winded and complex annual reports can be tedious and monotonous. Even after long hours spent scanning the documents manually, there is a genuine risk of vital data points being overlooked or omitted. The resulting data may be inaccurate or incomplete, which is a serious problem when it feeds into investment decisions.

The client needed a robust solution that would automate data extraction from annual reports by emulating the intelligence of the human mind. Without this intelligence, the automation would fail to achieve its objective of extracting data points needed to make critical investment decisions.


The length, design and varying layouts of annual reports make it difficult to automate data extraction from them, especially when the data needs to be read and extracted from:

  • Scanned documents
  • Tables, particularly complex tables
  • Images and infographics

Widely used document scanning and data extraction technologies such as optical character recognition (OCR) fail to perform optimally when faced with the challenges described above. This results in the extraction of inaccurate or corrupt data, which defeats the purpose of automation.


To be successful, our solution had to not only overcome the shortcomings of other data extraction technologies but also implement a standard methodology based on pre-defined rule sets. We deployed ScaleCred’s proprietary pixel-based machine learning technology to achieve the following:

  • Pre-processing documents to split them into individual pages and to identify financial statements, notes and schedules, tables and images before commencing the actual data extraction process.
  • Granular data extraction through pixel-based machine reading capabilities. By converting the content of annual reports into pixels, the technology overcomes limitations with regard to variations in font size, style and color in documents.
  • Disaggregating table rows and columns into individual records for the purposes of data extraction.
  • Continuously training our deep learning algorithms to observe even minute variations in annual reports, so that the extracted data is accurate and reliable enough for making investment decisions.
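
The table disaggregation step listed above can be pictured with a short Python sketch. It is not ScaleCred's proprietary technology, only an illustration of the idea. It assumes the table text has already been read into a header row and data rows. The function name `disaggregate_table`, the record fields and the sample figures are all hypothetical.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, float]]


def disaggregate_table(header: List[str], rows: List[List[str]]) -> List[Record]:
    """Turn a table into one record per (line item, column) cell.

    Hypothetical helper: assumes the first column holds the line-item
    label and the remaining columns hold numeric text.
    """
    records: List[Record] = []
    for row in rows:
        line_item, cells = row[0], row[1:]
        for column, cell in zip(header[1:], cells):
            text = cell.strip()
            # Skip empty cells and dash placeholders.
            if not text or text == "-":
                continue
            # Accounting convention: parentheses denote a negative amount.
            negative = text.startswith("(") and text.endswith(")")
            value = float(text.strip("()").replace(",", ""))
            records.append(
                {
                    "line_item": line_item,
                    "column": column,
                    "value": -value if negative else value,
                }
            )
    return records


# Illustrative sample data, not taken from any real annual report.
header = ["Line item", "2022", "2021"]
rows = [
    ["Revenue", "1,250", "1,100"],
    ["Net loss", "(75)", "-"],
]
records = disaggregate_table(header, rows)
```

Each resulting record carries its own row and column labels, so a value can be validated or stored without reference to the original table layout.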


By automating data extraction from annual reports, the client was able to achieve:

  • 5x increase in data accuracy.
  • 10x increase in the speed of analysis.
  • Higher coverage of annual reports.