
Claims Processing Innovation Project
birkle IT empowered a leading insurance provider in Germany with scalable, reliable, and agile solutions.
Innovating with flexibility
In today’s rapidly evolving world of business technology, success hinges on innovation and adaptability. For large, DAX-listed companies, managing complex IT systems while keeping up with market demands presents a significant challenge.
Together with a leading insurer, birkle IT set out to answer a bold question: Can artificial intelligence (AI) transform the way claims are processed by improving the management and analysis of documents?
By leveraging cutting-edge machine learning (ML) and deep learning (DL) techniques, the project aimed not to create a ready-to-use algorithm but to explore whether AI could be both feasible and valuable in practice.
A flexible approach to innovation
The project’s approach was refreshingly flexible. Instead of following a fixed plan, the team remained open to new ideas and surprising insights. A pivotal moment came during a workshop in Munich, where the team thoroughly examined the day-to-day workflows of the insurance company’s employees.
This hands-on session focused on observing, questioning, and rethinking the problem so that the solution would align with real-world processes.
Dissecting AI’s intelligence
The project set out to tackle the high-stakes challenges of document-heavy claims handling, aiming to lay the foundation for an AI-powered future in claims processing.
To ensure the project’s success, the team defined bold, measurable goals, asking if AI could:
- Make 90% of claim-related documents accurately searchable via full-text extraction.
- Reliably recognize and extract critical information like bank details or addresses in 80% of cases.
- Spot-check gross and net invoice amounts to validate 50% of payouts.
- Match partial damage details to existing claims 50% of the time.

Simplifying complex claims processing
A vital insight emerged from the July 2019 workshop in Munich: Claims processing depended on identifying roles, such as appraiser, claimant, or workshop, often buried in a mountain of documents. To address this, the team focused on extracting communication data (addresses, names, and IBANs) from a curated set of document types, including invoices and expert opinions.
To train the AI, a process was developed to dynamically pull thousands of archived PDFs. Despite delays in development, this provided a rich dataset: over 2,200 expert opinions, nearly 5,000 invoices, and more.
Overcoming the challenge of labeling data
Training the AI started with the careful task of labeling thousands of documents by identifying specific data and linking it to roles.
Initially, it took two minutes to label each document. Later, semi-automated scripts sped up the process to just 30 seconds, although they sometimes caused errors.
Despite these difficulties, the team labeled 766 documents, prioritizing quality over quantity to ensure accurate results.
Main achievements of the pilot
- Clearly identified the roles and document types needed for claims processing.
- Developed a reusable tool for extracting data from documents in future projects.
- Trained the AI effectively using both manual and automated labeling techniques.

Technical details: three phases of development
The project unfolded in three phases: Preprocessing, Training, and Post-Processing.
Python (v3.7.3) and Key Libraries Utilized:
- OpenCV: Extracted labels and processed documents.
- NumPy & SciPy: Managed arrays and provided useful helper functions.
- pdfminer.six: Enabled text extraction from PDFs.
- scikit-learn: Applied Naive Bayes and SVM algorithms for data classification.
- TensorFlow & Keras: Developed and trained neural networks.
- pdf2image: Converted PDFs into images for easier processing.
- PyPDF2: Facilitated editing of PDF documents.
- difflib (SequenceMatcher): Compared and assessed string similarity.
- Training took place on a GPU server within birkle IT’s network; the final scripts and models were not delivered to the client, so integrating them into the client’s system will require additional work.
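To make the preprocessing phase more concrete, the sketch below shows how text boxes with both content and position can be pulled from a PDF using pdfminer.six, one of the libraries listed above. It is an illustrative reconstruction under stated assumptions, not the project’s delivered code; the file name is hypothetical, and scanned documents would additionally need pdf2image plus OCR.

```python
# Minimal preprocessing sketch (not the project's actual code): extract text
# boxes with content and position from a digitally readable PDF.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer


def extract_text_boxes(pdf_path):
    """Return one record per text box: page number, bounding box, text."""
    boxes = []
    for page_no, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox  # coordinates in PDF points
                boxes.append({
                    "page": page_no,
                    "bbox": (x0, y0, x1, y1),
                    "text": element.get_text().strip(),
                })
    return boxes


if __name__ == "__main__":
    # "expert_opinion.pdf" is a hypothetical file name.
    for box in extract_text_boxes("expert_opinion.pdf")[:5]:
        print(box["page"], box["bbox"], box["text"][:40])
```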
Model selection and evaluation in claims processing
Several models, including Naive Bayes, SVM, and neural networks, were tested. The neural network outperformed the simpler models by learning complex patterns and optimizing through parameters like layers and epochs.
Models were evaluated using precision, recall, and accuracy metrics.
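The project’s exact feature design and network configuration were not published; the sketch below only illustrates the kind of comparison described, assuming numeric feature vectors X (one row per bounding box) and integer-encoded role labels y are already available. The file names, layer sizes, and epoch count are assumptions.

```python
# Illustrative model comparison (assumed data files and hyperparameters).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from tensorflow import keras

# Hypothetical inputs: feature vectors per bounding box, integer role labels (0..n-1).
X, y = np.load("features.npy"), np.load("labels.npy")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline 1: Naive Bayes.
nb = GaussianNB().fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test)))

# Baseline 2: Support vector machine.
svm = SVC(kernel="rbf").fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))

# Feed-forward neural network; the layer count and epochs are the kind of
# parameters the project tuned iteratively.
n_classes = int(len(np.unique(y)))
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(X.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)
y_pred = np.argmax(model.predict(X_test), axis=1)
print(classification_report(y_test, y_pred))
```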
Notable highlights
- Bounding boxes were used to create feature vectors, capturing both positional and content-based information.
- Neural networks outperformed simpler models, improving through iterative feature and parameter optimization.
- For training, the K-fold method was employed, dividing the dataset into 10 equal parts. In each iteration, one part was used for testing while the remaining data was used for training. The average results from all 10 runs served as the benchmark, providing more reliable results than a single train/test split; a minimal sketch of this evaluation loop follows below.
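The sketch below shows what such a 10-fold evaluation loop could look like; the classifier, data files, and macro-averaged metrics are illustrative assumptions rather than the project’s exact setup.

```python
# Illustrative 10-fold evaluation: each part is held out once for testing,
# and the metrics are averaged over all ten runs.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.svm import SVC  # stand-in classifier; the project also used neural networks

X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

precisions, recalls, accuracies = [], [], []
for train_idx, test_idx in kfold.split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    precisions.append(precision_score(y[test_idx], y_pred, average="macro", zero_division=0))
    recalls.append(recall_score(y[test_idx], y_pred, average="macro", zero_division=0))
    accuracies.append(accuracy_score(y[test_idx], y_pred))

print(f"precision {np.mean(precisions):.2f}  "
      f"recall {np.mean(recalls):.2f}  "
      f"accuracy {np.mean(accuracies):.2f}")
```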
Results from model evaluation
During training, performance varied for roles with little data. The precision for the “Workshop” role, for example, fluctuated from one fold to the next because the training set contained few workshop-related entries. Despite this, the final model performed well overall.
Here’s how the final model performed, based on precision, recall, and accuracy:
- Censor role: Precision 77%, Recall 70%, Accuracy 95%
- Claimant role: Precision 73%, Recall 44%, Accuracy 95%
- Workshop role: Precision 78%, Recall 28%, Accuracy 95%
To further test its effectiveness, the model was checked against a separate set of 22 documents, which included 16,897 bounding boxes. The results for this test were:
- Censor role: Precision 82%, Recall 44%, Accuracy 95%
- Claimant role: Precision 82%, Recall 36%, Accuracy 95%
- Workshop role: Precision 88%, Recall 24%, Accuracy 95%
What we learned from the evaluation
- The K-fold method provided more reliable results by training on different subsets of the data, helping to reduce overfitting.
- While the “Censor” and “Workshop” roles showed strong precision and accuracy, the “Claimant” role had lower recall, suggesting there’s room for improvement.
- Small and imbalanced training datasets, especially for roles like “Workshop,” caused some performance variation, but overall, the model proved effective at identifying key roles in documents.
Postprocessing & Observations
Visualization: The AI’s predicted bounding boxes were color-coded so that reviewers could easily spot the model’s mistakes. Over time, the AI’s role recognition improved, and the reviews also identified areas for further refinement.
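A color-coded review image of this kind can be produced with OpenCV, one of the libraries listed above. The sketch below is only illustrative; the role names, colors, and prediction format are assumptions.

```python
# Illustrative review rendering: draw predicted, color-coded role boxes onto
# a page image so a human reviewer can spot mistakes quickly.
import cv2

ROLE_COLORS = {                 # BGR colors (assumed palette)
    "censor": (0, 200, 0),      # green
    "claimant": (0, 140, 255),  # orange
    "workshop": (255, 0, 0),    # blue
}


def draw_predictions(image_path, predictions, out_path):
    """predictions: list of (x0, y0, x1, y1, role) in pixel coordinates."""
    page = cv2.imread(image_path)
    for x0, y0, x1, y1, role in predictions:
        color = ROLE_COLORS.get(role, (0, 0, 255))  # red for unknown roles
        cv2.rectangle(page, (int(x0), int(y0)), (int(x1), int(y1)), color, 2)
        cv2.putText(page, role, (int(x0), int(y0) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
    cv2.imwrite(out_path, page)


# Hypothetical usage:
# draw_predictions("page_1.png", [(120, 340, 420, 380, "claimant")], "page_1_review.png")
```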
What was learned:
- Early collaboration is critical: Success relied on clear communication and the early participation of experts such as data scientists. Their insights shaped both planning and evaluation, preventing costly missteps.
- Data quality drives performance: OCR errors left 6% of pages unreadable, directly impacting model performance. Future iterations should explore OCR tools that offer richer metadata and higher accuracy.
- Manual labeling is a necessary investment: At least 1,500 labeled documents per type were found to be essential for effective training. While automated labeling introduced up to 10% error, manual effort ensured reliable data for the model.
- Adding context improves feature vectors: Incorporating additional context, such as data from neighboring bounding boxes, could significantly enhance feature vectors (a small sketch follows this list). Comparing data across documents would further refine the model’s predictive power.
- Limited data causes overfitting: Overfitting arose from training on a small dataset. Expanding and diversifying the dataset would mitigate this and improve generalization.
- Errors highlight refinement areas: False positives often appeared near correct predictions or in incorrect roles. Grouping similar terms and experimenting with ensemble models could minimize these errors and enhance accuracy.
- Scaling and advanced tools are the future: Expanding the datasets and integrating advanced architectures such as recurrent neural networks (RNNs) would help the model better capture relationships between words across documents, improving both performance and adaptability.
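As a hypothetical illustration of the neighboring-box idea mentioned above, the sketch below extends each box’s feature vector with the features of its nearest neighbour on the page; the function name and input layout are assumptions, not the project’s implementation.

```python
# Illustrative context augmentation: extend each box's features with those of
# its nearest neighbour, so the model sees local context (e.g. a label that
# sits next to a value).
import numpy as np


def add_neighbor_context(features, centers):
    """features: (n_boxes, n_features); centers: (n_boxes, 2) box midpoints."""
    features = np.asarray(features, dtype=float)
    centers = np.asarray(centers, dtype=float)
    augmented = []
    for i in range(len(features)):
        distances = np.linalg.norm(centers - centers[i], axis=1)
        distances[i] = np.inf                 # ignore the box itself
        nearest = int(np.argmin(distances))
        augmented.append(np.concatenate([features[i], features[nearest]]))
    return np.asarray(augmented)
```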
Achievements
The project made significant strides in leveraging AI for document role classification.
- Accurate role detection: The model reliably identified roles such as “claimant” and “workshop” in expert documents, with precision and recall reaching 70-80% for the strongest roles.
- Efficiency boost: Neural networks reduced manual effort by recognizing patterns and automating role assignment in documents.
Areas for improvement
While the project achieved notable successes, several challenges remain:
- Testing on more documents: We need to see how well the AI handles different types of documents.
- Consistent results: Role detection isn’t perfect yet, especially for less frequent roles like “workshop.”
The road ahead
To build on these achievements and address existing challenges, the next steps include:
- More data, better results: A larger pool of labeled documents translates directly into higher AI accuracy.
- Smarter AI: Add advanced models like RNNs and improve data processing to boost performance.
- Rule + AI combo: Combining rule-based checks with the model would make it even more reliable, for example when validating extracted contact details such as IBANs; a small sketch follows this list.
- Clear benchmarks: Set measurable goals against which the next iteration can be evaluated.
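As a hypothetical illustration of such a rule + AI combination, the sketch below validates an IBAN candidate proposed by the model using a deterministic structure and checksum check (ISO 13616 MOD-97); the confidence threshold in the usage comment is an assumption.

```python
# Illustrative rule-based check: confirm a model-extracted IBAN candidate
# before it is used in claims processing.
import re


def is_valid_iban(candidate):
    """Check IBAN structure and the ISO 13616 MOD-97 checksum."""
    iban = candidate.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    rearranged = iban[4:] + iban[:4]                          # move country code and check digits to the end
    digits = "".join(str(int(ch, 36)) for ch in rearranged)   # A=10 ... Z=35
    return int(digits) % 97 == 1


# Hypothetical usage: accept an extraction only if the model is confident
# AND the deterministic rule passes.
# if confidence > 0.8 and is_valid_iban(extracted_text):
#     ...
```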
Contact our insurance expert now to find out more about claims processing and innovation in the insurance sector!
Max Fuchs
Senior Business Development Manager

