Mid Sweden University

miun.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Artificial intelligence application for feature extraction in annual reports: AI-pipeline for feature extraction in Swedish balance sheets from scanned annual reports
Mid Sweden University, Faculty of Science, Technology and Media, Department of Computer and Electrical Engineering (2023-).
2024 (English)Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
Abstract [sv]

Hantering av ostrukturerade och fysiska dokument inom vissa områden, såsom finansiell rapportering, medför betydande ineffektivitet i dagsläget. Detta examensarbete fokuserar på utmaningen att extrahera data från ostrukturerade finansiella dokument, specifikt balansräkningar i svenska årsredovisningar, genom att använda en AI-driven pipeline. Syftet är att utveckla en metod för att automatisera datautvinning och möjliggöra förbättrad dataanalys. Projektet fokuserade på att automatisera utvinning av finansiella poster från balansräkningar genom en kombination av Optical Character Recognition (OCR) och en modell för Named Entity Recognition (NER). TesseractOCR användes för att konvertera skannade dokument till digital text, medan en BERT-baserad NER-modell tränades för att identifiera och klassificera relevanta finansiella poster. Ett Python-skript användes för att extrahera de numeriska värdena som är associerade med dessa poster. Projektet fann att NER-modellen uppnådde hög prestanda, med ett F1-score på 0,95, vilket visar dess effektivitet i att identifiera finansiella poster. Den fullständiga pipelinen lyckades extrahera över 99% av posterna från balansräkningar med en träffsäkerhet på cirka 90% för numerisk data. Projektet drar slutsatsen att kombinationen av OCR och NER är en lovande lösning för att automatisera datautvinning från ostrukturerade dokument med liknande attribut som årsredovisningar. Framtida arbeten kan utforska att förbättra träffsäkerheten i OCR och utvidga utvinningen till andra sektioner av olika typer av ostrukturerade dokument.

Abstract [en]

The persistence of unstructured and physical document management in fields such as financial reporting presents notable inefficiencies. This thesis addresses the challenge of extracting valuable data from unstructured financial documents, specifically balance sheets in Swedish annual reports, using an AI-driven pipeline. The objective is to develop a method to automate data extraction, enabling enhanced data analysis capabilities. The project focused on automating the extraction of financial posts from balance sheets using a combination of Optical Character Recognition (OCR) and a Named Entity Recognition (NER) model. TesseractOCR was used to convert scanned documents into digital text, while a fine-tuned BERT-based NER model was trained to identify and classify relevant financial features. A Python script was employed to extract the numerical values associated with these features. The study found that the NER model achieved high performance metrics, with an F1-score of 0.95, demonstrating its effectiveness in identifying financial entities. The full pipeline successfully extracted over 99% of features from balance sheets with an accuracy of about 90% for numerical data. The project concludes that combining OCR and NER technologies could be a promising solution for automating data extraction from unstructured documents with similar attributes to annual reports. Future work could explore enhancing OCR accuracy and extending the methodology to other sections of different types of unstructured documents.

Place, publisher, year, edition, pages
2024. , p. 77
Keywords [en]
Artificial intelligence, Feature extraction, Named Entity Recognition, BERT, Optical Character Recognition, financial documents
Keywords [sv]
Artificiell intelligens, Datautvinning, Named Entity Recognition, BERT, Optical Character Recognition, finansiella dokument
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:miun:diva-51628Local ID: DT-V24-G3-048OAI: oai:DiVA.org:miun-51628DiVA, id: diva2:1875843
Subject / course
Computer Engineering DT1
Educational program
Computer Science TDATG 180 higher education credits
Supervisors
Examiners
Available from: 2024-06-24 Created: 2024-06-24 Last updated: 2024-06-24Bibliographically approved

Open Access in DiVA

fulltext(1311 kB)142 downloads
File information
File name FULLTEXT01.pdfFile size 1311 kBChecksum SHA-512
08d225d6e2b74985837814ee540880039404f06d1d1b8c8344bb4db6dd59ce0b0a6bd86e9b7a0122ee790982793c80a7cc639ea866522acfbe1f49d251029215
Type fulltextMimetype application/pdf

Search in DiVA

By author/editor
Nilsson, Jesper
By organisation
Department of Computer and Electrical Engineering (2023-)
Software Engineering

Search outside of DiVA

GoogleGoogle Scholar
Total: 142 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 451 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf