Implementing Automated KPI Extraction from Financial Reports: Part 2

Introduction

This second part of the case study builds on the infrastructure and cost-optimization strategies described in Part 1 and focuses on the retrieval-augmented generation (RAG) pipeline that powers automated KPI extraction from financial reports.

Technical implementation

Core components

Our KPI extraction pipeline is built on three fundamental components, each implemented as modular actions within the Ryax platform for maximum flexibility and maintainability.

Our system architecture integrates vision-language modeling with efficient information retrieval, orchestrated through Ryax's workflow engine.

Integrating Vision Language Models (VLMs) like ColPali into document processing workflows enhances the recognition of financial table values. ColPali directly embeds document images, capturing both textual and visual elements without complex preprocessing. It employs a late interaction mechanism, comparing query tokens with document image patches to improve retrieval accuracy. This approach streamlines the extraction of financial data from documents, improving efficiency and accuracy.

1. Document processing

The document processing stage handles the critical first step of converting financial reports into a standardized format for analysis. Ryax's workflow engine orchestrates parallel PDF processing while managing memory constraints effectively.

Key features:

Automatic PDF to image conversion with configurable DPI settings
Memory-efficient parallel processing through chunking (50-page chunks)
Support for both digital and scanned documents through adaptive preprocessing
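
As a rough illustration of the chunking idea above, the following sketch splits a document into page ranges of at most 50 pages so that each range can be rendered and released before the next one starts. The function name `page_chunks` is hypothetical, and the mention of pdf2image is an assumption, since the article does not name the converter it uses.

```python
from typing import Iterator, Tuple


def page_chunks(total_pages: int, chunk_size: int = 50) -> Iterator[Tuple[int, int]]:
    """Yield 1-indexed, inclusive (first_page, last_page) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


# Each range can be rendered on its own, for example by passing it as the
# first_page/last_page arguments of pdf2image.convert_from_path together
# with a dpi value (assumed converter, not confirmed by the article).
for first, last in page_chunks(120):
    print(f"render pages {first}-{last}")
```

Because only one chunk of page images is held at a time, peak memory depends on the chunk size and DPI rather than on the length of the report.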

2. Information retrieval system

The cornerstone of our KPI extraction pipeline is an information retrieval system that combines visual and semantic understanding. Through Ryax's workflow engine, we implement a multi-stage retrieval process whose ranking quality is comparable to results reported on BEIR benchmarks. At its core lies the ColPali embedding model, which processes document pages through carefully controlled batching:

# Efficient embedding generation with memory management

top_images = resize_images(original_images, factor=factor)

embeddings = get_colpali_embeddings(model, processor, top_images)

This adaptive approach enables processing of high-resolution financial documents while managing GPU memory constraints effectively. The embeddings capture both visual layout and textual content, crucial for understanding financial tables and statements.
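
One way such an adaptive scheme could work is to try a list of resize factors in order and fall back to the next one whenever embedding runs out of memory. The sketch below is an assumption about the control flow, not the article's implementation: `embed_with_backoff`, its parameters, and the toy stand-ins are hypothetical, and the real pipeline would pass its own resize and embedding functions.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Image = TypeVar("Image")
Embedding = TypeVar("Embedding")


def embed_with_backoff(
    images: Sequence[Image],
    resize: Callable[[Sequence[Image], float], Sequence[Image]],
    embed: Callable[[Sequence[Image]], Embedding],
    factors: Sequence[float],
) -> Tuple[float, Embedding]:
    """Try each resize factor in order until embedding fits in memory."""
    last_error = None
    for factor in factors:
        try:
            return factor, embed(resize(images, factor))
        except MemoryError as exc:
            # GPU frameworks raise their own out-of-memory exception type;
            # swap it in here when wiring this to a real model.
            last_error = exc
    raise RuntimeError("no resize factor fit in memory") from last_error


# Toy stand-ins: an "image" is just its pixel count, and embedding fails
# above a fixed pixel budget.
def fake_resize(images, factor):
    return [int(pixels * factor) for pixels in images]


def fake_embed(images):
    if sum(images) > 1000:
        raise MemoryError("over budget")
    return [pixels / 1000 for pixels in images]


factor, vectors = embed_with_backoff(
    [800, 800], fake_resize, fake_embed, factors=(1.0, 0.75, 0.5)
)
print(factor, vectors)
```

Returning the factor that succeeded makes it possible to log how often documents had to be downscaled, which is useful when judging the effect on retrieval quality.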

We implement ColBERT-style late interaction matching, allowing for:

Fine-grained similarity computation between queries and documents
Maintenance of contextual information throughout the matching process
Efficient handling of both structured and unstructured document regions
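
To make the late interaction idea concrete, here is a minimal pure-Python sketch of the MaxSim scoring rule used by ColBERT-style models: each query-token embedding is matched with its most similar page-patch embedding, and the per-token maxima are summed. The toy two-dimensional vectors are invented for illustration; real embeddings come from the model and are normally scored with batched tensor operations.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(
    query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]
) -> float:
    """Sum over query tokens of the best-matching page-patch similarity."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]    # two query-token embeddings
page_a = [[0.9, 0.1], [0.2, 0.8]]   # patches that match both tokens
page_b = [[0.9, 0.1], [0.8, 0.2]]   # patches that match only the first token

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)
print(score_a > score_b)
```

Page A scores higher because every query token finds a well-matching patch, which is the fine-grained behaviour the list above describes.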

Our implementation achieves performance metrics competitive with specialized financial QA systems:

nDCG@10 [1] scores comparable to, or above, published FiQA-2018 benchmark results
40% accuracy maintained on scanned documents
Resilient performance across various document formats

The system employs a multi-vector indexing strategy that balances retrieval accuracy with computational efficiency:

Document pages are represented by multiple embedding vectors
Contextual information is preserved through regional embeddings
Efficient similarity search implementation for rapid KPI location

[1] nDCG (normalized Discounted Cumulative Gain) measures ranking quality by evaluating how well retrieved results are ordered. It is computed as the ratio of DCG (which gives higher weight to relevant items at the top) to IDCG (the DCG of the ideal ranking). nDCG@k evaluates the ranking of the top k results, e.g., nDCG@5 for the top 5 and nDCG@10 for the top 10, rewarding systems that place relevant results earlier. This metric is widely used in RAG systems to assess the quality of retrieved context documents for query-driven tasks.
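
The definition above can be written out in a few lines. This sketch uses the linear-gain form of DCG with a log2 rank discount and, as a simplifying assumption, derives the ideal ranking from the same list of relevance labels, which is only exact when that list contains every relevant item for the query.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order the system returned the pages.
print(ndcg_at_k([1, 0, 0], k=3))  # relevant page ranked first
print(ndcg_at_k([0, 0, 1], k=3))  # same page ranked third
```

Moving the single relevant page from rank 1 to rank 3 halves the score in this example, which shows how strongly the metric rewards early placement.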

The diagram illustrates how Ryax's workflow engine orchestrates these components, managing GPU resources and data flow between stages. The platform's built-in monitoring capabilities allow tracking of embedding quality and retrieval performance across different document types.

Performance optimizations include:

Dynamic batch sizing based on available GPU memory
Automated fallback to CPU processing for memory-intensive documents
Caching of intermediate embeddings for frequently accessed documents
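
The caching optimization can be sketched as a lookup keyed by a hash of the document bytes, so that a report seen before is not embedded again. The class below is an in-memory illustration under that assumption; `EmbeddingCache` is a hypothetical name, and the article does not say where or how the real cache is stored.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 digest of the document bytes."""

    def __init__(self, embed: Callable[[bytes], List[float]]) -> None:
        self._embed = embed
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the embed function was called

    def get(self, document: bytes) -> List[float]:
        key = hashlib.sha256(document).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(document)
        return self._store[key]


# Stand-in embedding function for the demo; a real one would call the model.
cache = EmbeddingCache(lambda data: [float(len(data))])
cache.get(b"annual-report-2023")
cache.get(b"annual-report-2023")  # served from the cache
print(cache.misses)
```

Hashing the content rather than the URL means a re-uploaded copy of the same file still hits the cache, while a revised report with the same name does not.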

This retrieval system forms the bridge between raw document processing and precise KPI extraction, maintaining consistent performance across both digital and scanned reports. Through Ryax's infrastructure, we achieve reliable scaling while managing computational resources efficiently.

3. Value extraction

The final stage of our automated pipeline leverages advanced vision-language modeling for precise KPI extraction. Implementation through Ryax's workflow engine enables efficient model deployment and resource management.

Qwen2-VL model integration for visual-text understanding:

Our system uses the Qwen2-VL 7B-parameter model, deployed as a containerized component within Ryax's infrastructure. The model processes identified relevant pages through carefully crafted prompts:

prompt = f"""Analyze ALL provided images to:

1. Identify the table containing values related to: {term_list}

2. Extract the SINGLE most relevant value for year {year}

Format: "VALUE|UNIT" (e.g., "2,801|£'000s")"""

This structured approach enables consistent extraction across various document formats while maintaining contextual understanding.
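
Because the prompt fixes the answer format to "VALUE|UNIT", the model's reply can be parsed with a small helper. The sketch below is one possible parser, not the article's code: `parse_kpi_response` is a hypothetical name, and it assumes commas are thousands separators and that parentheses denote negative amounts, as is common in financial statements.

```python
from typing import Tuple


def parse_kpi_response(response: str) -> Tuple[float, str]:
    """Split a 'VALUE|UNIT' answer into a float and a unit string."""
    value_text, sep, unit = response.strip().strip('"').partition("|")
    if not sep or not unit.strip():
        raise ValueError(f"expected 'VALUE|UNIT', got {response!r}")
    cleaned = value_text.replace(",", "").strip()
    # Accounting notation: (1,234) means -1234.
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = "-" + cleaned[1:-1]
    # float() raises ValueError for anything that is not a number.
    return float(cleaned), unit.strip()


value, unit = parse_kpi_response("2,801|£'000s")
print(value, unit)
```

Raising on malformed replies, rather than returning a default, lets the workflow route those documents to manual validation instead of storing a silent zero.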

Workflow architecture

1. Input processing

Document ingestion and preprocessing is handled through a series of Ryax actions:

PDF downloading and validation
Parallel image extraction (configurable DPI and batch size)
Memory-efficient document chunking

The workflow interface demonstrates the sequential processing stages with built-in monitoring and error handling.

2. KPI location

The system employs a multi-stage approach for precise KPI identification. The page selection process is as follows:

Initial embedding generation through ColPali
Relevance scoring using late interaction matching
Top-K page selection (configurable, typically K=3)
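
The last step of the list above, Top-K selection, reduces to picking the K highest-scoring pages. A minimal sketch with the standard library follows; the mapping of page numbers to scores and the function name `top_k_pages` are illustrative assumptions.

```python
import heapq
from typing import Dict, List


def top_k_pages(scores: Dict[int, float], k: int = 3) -> List[int]:
    """Return the k page numbers with the highest relevance scores, best first."""
    return heapq.nlargest(k, scores, key=scores.__getitem__)


# Page number -> late-interaction relevance score (made-up values).
page_scores = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}
print(top_k_pages(page_scores))
```

Keeping K small limits how many page images are sent to the vision-language model, which is where most of the per-document compute is spent.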

Performance metrics show consistent accuracy across different document types:

Clean PDFs: ~50% accuracy
Scanned documents: ~40% accuracy
Resilient performance with varying document quality

3. Data extraction

The extraction process combines vision-language understanding with structured data parsing:

Dynamic prompt engineering based on target KPIs
Automated unit normalization and standardization
JSON output formatting for database integration
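
As a sketch of what unit normalization and JSON formatting might look like, the helper below splits a unit string such as "£'000s" into a currency code and a multiplier and serializes the result. The lookup tables, function names, and JSON field names are all hypothetical; the article does not describe its normalization rules or its database schema.

```python
import json
from typing import Tuple

# Hypothetical lookup tables; extend them to match the reports being processed.
CURRENCY_SYMBOLS = {"£": "GBP", "$": "USD", "€": "EUR"}
SCALE_SUFFIXES = {"'000s": 1_000, "k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}


def normalize_unit(unit: str) -> Tuple[str, int]:
    """Split a unit such as "£'000s" into (currency code, multiplier)."""
    unit = unit.strip()
    currency = CURRENCY_SYMBOLS.get(unit[:1])
    if currency is None:
        raise ValueError(f"unknown currency in unit {unit!r}")
    suffix = unit[1:].strip().lower()
    if suffix and suffix not in SCALE_SUFFIXES:
        raise ValueError(f"unknown scale in unit {unit!r}")
    return currency, SCALE_SUFFIXES.get(suffix, 1)


def to_json_record(kpi: str, value: float, unit: str) -> str:
    """Serialize one extracted KPI with its unit split into explicit fields."""
    currency, scale = normalize_unit(unit)
    record = {"kpi": kpi, "value": value, "scale": scale, "currency": currency}
    return json.dumps(record, sort_keys=True)


print(to_json_record("cx_opex", 2801.0, "£'000s"))
```

Storing the scale next to the raw value, instead of multiplying it in, keeps the number identical to what appears in the report and leaves the conversion decision to the consumer.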

4. Manual validation workflow

The manual verification workflow provides a critical human-in-the-loop validation process to ensure accuracy of extracted KPIs while maintaining the efficiency of automated processing. This hybrid approach combines the speed of automated extraction with the precision of human verification.

The verification interface provides a comprehensive form for KPI validation, showing:

Document URL for traceability
Year of financial data
Currency specification
JSON format KPI values for review
Dry run option for safe validation

This interface serves as the main entry point for human validators to review and adjust extracted KPIs before committing them to the database.

The verification process follows a structured approach:

Initial review:

- Retrieve JSON output from automated extraction workflow
- Access automatically identified relevant pages
- Review initial KPI extraction results

Validation steps:

Compare values against source documents
Adjust numerical values if needed

{

  "cx_fi_income_net_trading": 2801.0,

  "cx_fi_imp": 32.0,

  "cx_opex": 1552801.0

}

Verify currency and units
Option to perform dry run validation
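
The dry run option can be thought of as running every check without writing anything. The following sketch shows that pattern for a KPI payload shaped like the JSON above; `validate_kpis`, `submit`, and the `commit` callback are hypothetical names, and the checks are deliberately minimal.

```python
import math
from typing import Callable, Dict, List


def validate_kpis(payload: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; empty means the payload is valid."""
    errors = []
    for name, value in payload.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name}: expected a number, got {type(value).__name__}")
        elif not math.isfinite(value):
            errors.append(f"{name}: value must be finite")
    return errors


def submit(
    payload: Dict[str, object],
    commit: Callable[[Dict[str, object]], None],
    dry_run: bool = True,
) -> List[str]:
    """Validate the payload and commit it only when valid and not a dry run."""
    errors = validate_kpis(payload)
    if not errors and not dry_run:
        commit(payload)
    return errors


committed = []  # stands in for the database in this demo
print(submit({"cx_opex": 1552801.0}, committed.append, dry_run=True))
print(len(committed))
```

Defaulting `dry_run` to True means a validator has to opt in to writing, which matches the "safe validation" intent described for the interface.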

The verification workflow serves as a crucial complement to the automated extraction process, ensuring data accuracy while maintaining operational efficiency. The interface design and implementation focus on user experience while maintaining robust data validation and security measures.

Current performance

Information retrieval from financial documents, particularly tables, remains a significant challenge in the field. Our implementation, built on Ryax's workflow engine, demonstrates competitive performance while addressing practical deployment concerns.

Benchmark context

This comprehensive benchmark comparison presents nDCG scores across different model architectures, from simple lexical approaches to advanced late-interaction models. Of particular interest is the FiQA-2018 row, highlighting the challenging nature of financial document processing. The ColBERT architecture, which we adapted for our implementation, achieves a score of 0.317, with only the re-ranking approach (BM25+CE) performing marginally better at 0.347. This validates our architectural choices while highlighting the inherent complexity of financial information extraction.

The complexity of financial document processing can be better understood by examining the characteristics of various document understanding tasks.

While financial document processing shares characteristics with Industrial (DocVQA) and Table (TabFQuAD) tasks, it presents unique challenges due to the structured yet variable nature of financial reports. The query volumes (500-1600 per dataset) provide context for evaluation robustness.

Domain-specific performance

Our system demonstrates robust performance across different document types:

Clean digital documents

- 50% average accuracy for KPI extraction
- Consistent performance across different financial statement layouts
- Reliable handling of various unit formats (thousands, millions)

Scanned documents

- 40% accuracy maintained despite OCR challenges
- Resilient to common scanning artifacts
- Effective handling of table structure variations

Our system's performance must be considered in the context of different document processing approaches, from basic text extraction to advanced captioning.

Technical optimizations

Through Ryax's workflow engine, we implement several critical optimizations:


The Qwen2-VL 7B vision-language model processes queries and documents, while the retrieval system manages efficient page selection. The Top-K pages component focuses processing on the most relevant content, optimizing both accuracy and resource usage. This architecture, implemented through Ryax's workflow engine, ensures efficient resource utilization while maintaining processing accuracy.

Resource management

Dynamic GPU allocation for vision-language models
Automated memory optimization through image resizing (1.0x to 4.0x)
Efficient batch processing with failure recovery

Processing pipeline

Parallel CPU operations for document preprocessing
Smart chunking for large document handling
Automated sanity checks ensuring result consistency

Current limitations

Current bottleneck in lexical embeddings:
- Lexical gap in financial terminology
- Context preservation in table structures
- Tradeoffs between embedding quality and compute resources

Memory constraints for large documents:
- Memory requirements for high-resolution documents
- GPU utilization and optimization for large batches

Processing speed vs accuracy tradeoffs:
- Complex table layouts impact performance
- Unit conversion reliability
- Multipage context maintenance

These limitations inform our ongoing development roadmap, with Ryax's modular architecture enabling incremental improvements without disrupting production workflows.

Our measured performance demonstrates that practical, production-ready financial document processing is achievable while maintaining competitive accuracy. The system's ability to handle both clean and scanned documents with consistent performance makes it particularly valuable for real-world applications.
