Mid Sweden University

Bilingual Auto-Categorization Comparison of two LSTM Text Classifiers
Mid Sweden University, Faculty of Science, Technology and Media, Department of Information Systems and Technology.
Mid Sweden University, Faculty of Science, Technology and Media, Department of Information Systems and Technology.
Mid Sweden University, Faculty of Science, Technology and Media, Department of Information Systems and Technology. ORCID iD: 0000-0002-1797-1095
Mid Sweden University, Faculty of Science, Technology and Media, Department of Information Systems and Technology.
2019 (English)
In: 2019 8th International Congress on Advanced Applied Informatics (IIAI-AAI), 2019
Conference paper, Published paper (Refereed)
Abstract [en]

Multilingual problems such as auto-categorization are not easy tasks. One option is to train a separate model for each language; another is to build the model in one base language and automatically translate text from other languages into that base language. Each language is biased toward its own grammar and syntax, which makes its content hard to express in other languages. Translating from a language into a non-verbal language could potentially have a positive impact on the categorization results. A non-verbal language could, for example, be pure information in the form of knowledge-graph relations extracted from the text. In this article, a comparison is conducted between Chinese and Swedish. Two categorization models are developed and validated on each dataset. The purpose is to make an auto-categorization model that works for any language. One model is built upon LSTM and optimized for Swedish; the other is an improved Bidirectional-LSTM Convolution model optimized for Chinese. The improved algorithm is trained on both languages and compared with the LSTM algorithm. The Bidirectional-LSTM algorithm performs approximately 20 percentage points better than the LSTM algorithm, which is significant.
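The gap reported between the two classifiers is an absolute difference between two scores, not a relative improvement. A minimal sketch of that arithmetic follows. It assumes the metric is plain accuracy, which the abstract does not state; the labels and predictions are made-up placeholders rather than data from the paper, and `accuracy` and `point_gap` are hypothetical helper names.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return hits / len(gold)


def point_gap(acc_a: float, acc_b: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return (acc_a - acc_b) * 100.0


# Made-up toy labels (assumption: not taken from the paper's datasets).
gold = ["sport", "politics", "sport", "culture", "politics"]
lstm_pred = ["sport", "sport", "sport", "politics", "politics"]       # 3 of 5 correct
bilstm_pred = ["sport", "politics", "sport", "politics", "politics"]  # 4 of 5 correct

gap = point_gap(accuracy(gold, bilstm_pred), accuracy(gold, lstm_pred))
print(f"gap: {gap:.1f} percentage points")
```

In this toy case the scores are 60% and 80%, so the gap is 20 percentage points, while the relative improvement is about 33%. The two readings differ enough that the unit matters.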

Place, publisher, year, edition, pages
National Category
Computer Sciences
URN: urn:nbn:se:miun:diva-37261
DOI: 10.1109/IIAI-AAI.2019.00127
Scopus ID: 2-s2.0-85080902973
ISBN: 978-1-7281-2627-2 (electronic)
OAI: oai:DiVA.org:miun-37261
DiVA, id: diva2:1352551
8th International Congress on Advanced Applied Informatics, Toyama, Japan, July 7-11 (Main Event) & 12 (Forum), 2019
SMART (Smarta system och tjänster för ett effektivt och innovativt samhälle)
Available from: 2019-09-19 Created: 2019-09-19 Last updated: 2021-04-01 Bibliographically approved
In thesis
1. Extracting Text into Meta-Data: Improving machine text-understanding of news-media articles
2021 (English)
Licentiate thesis, comprehensive summary (Other academic)
Alternative title [sv]
Extrahera Meta-Data från texter: Förbättra förståelsen för nyheter med hjälp av maskininlärning
Abstract [en]

Society is constantly in need of information. It is important to consume event-based information of what is happening around us as well as facts and knowledge. As society grows, the amount of information to consume grows with it. This thesis demonstrates one way to extract and represent knowledge from text in a machine-readable way for news media articles. Three objectives are considered when developing a machine learning system to retrieve categories, entities, relations and other meta-data from text paragraphs. The first is to sort the terminology by topic; this makes it easier for machine learning algorithms to understand the text and the unique words used. The second objective is to construct a service for use in production, where scalability and performance are evaluated. Features are implemented to iteratively improve the model predictions, and several versions are run at the same time to, for example, compare them in an A/B test. The third objective is to further extract the gist of what is expressed in the text. The gist is extracted in the form of triples by connecting two related entities using a combination of natural language processing algorithms. 

The research presents a comparison between five different auto-categorization algorithms, an evaluation of their hyperparameters, and an assessment of how they would perform under the pressure of thousands of large, concurrent predictions. The aim is to build an auto-categorization system that can be used in the news media industry to help writers and journalists focus more on the story rather than on filling in meta-data for each article. The best-performing algorithm is a Bidirectional Long Short-Term Memory neural network. Three different information extraction algorithms for extracting the gist of paragraphs are also compared. The proposed information extraction algorithm supports extracting information from texts in multiple languages with competitive accuracy compared with the state-of-the-art OpenIE and MinIE algorithms, which can extract information in a single language. The use of multilingual models helps local news media to write articles in different languages, which in turn helps integrate immigrants into society.
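The triple form described in the abstract, two related entities connected by a relation, can be sketched as a small data structure. This is only an illustration: the `Triple` class, its field names, and the example entities are assumptions, and the thesis's actual extraction pipeline is not reproduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Triple:
    """A (subject, relation, object) statement linking two entities."""
    subject: str
    relation: str
    obj: str

    def as_tuple(self) -> Tuple[str, str, str]:
        """Return the triple as a plain tuple, e.g. for storage or comparison."""
        return (self.subject, self.relation, self.obj)


# Hand-written, hypothetical example; a real system would derive
# the entities and the relation from text with NLP components.
example = Triple(subject="Alice", relation="works for", obj="Acme")
print(example.as_tuple())
```

Making the class frozen keeps triples hashable, so duplicates extracted from different paragraphs can be collapsed by placing them in a set.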

Place, publisher, year, edition, pages
Sundsvall: Mid Sweden University, 2021. p. 55
Mid Sweden University licentiate thesis, ISSN 1652-8948 ; 181
National Category
Natural Language Processing; Computer Sciences
urn:nbn:se:miun:diva-41775 (URN)
978-91-89341-02-9 (ISBN)
2021-04-29, C312 / via Zoom, Holmgatan 10, Sundsvall, 14:00 (English)

At the time of the public defence the following papers were unpublished: paper 4 submitted.

Available from: 2021-04-07 Created: 2021-04-01 Last updated: 2025-02-01 Bibliographically approved

By author/editor
Lindén, Johannes; Wang, Xutao; Forsström, Stefan; Zhang, Tingting