Mid Sweden University

Extracting Text into Meta-Data: Improving machine text-understanding of news-media articles
Mid Sweden University, Faculty of Science, Technology and Media, Department of Information Systems and Technology.
2021 (English)
Licentiate thesis, comprehensive summary (Other academic)
Alternative title
Extrahera Meta-Data från texter : Förbättra förståelsen för nyheter med hjälp av maskininlärning (Swedish)
Abstract [en]

Society is constantly in need of information. It is important to consume event-based information about what is happening around us, as well as facts and knowledge. As society grows, the amount of information to consume grows with it. This thesis demonstrates one way to extract knowledge from news media articles and represent it in a machine-readable way. Three objectives are considered when developing a machine learning system to retrieve categories, entities, relations and other meta-data from text paragraphs. The first is to sort the terminology by topic; this makes it easier for machine learning algorithms to understand the text and the unique words used. The second objective is to construct a service for use in production, where scalability and performance are evaluated. Features are implemented to iteratively improve the model predictions, and several versions are run at the same time so that they can, for example, be compared in an A/B test. The third objective is to further extract the gist of what is expressed in the text. The gist is extracted in the form of triples by connecting two related entities using a combination of natural language processing algorithms.

The research presents a comparison of five auto-categorization algorithms, with an evaluation of their hyperparameters and of how they perform under the load of thousands of large, concurrent predictions. The aim is to build an auto-categorization system that can be used in the news media industry to help writers and journalists focus on the story rather than on filling in meta-data for each article. The best-performing algorithm is a Bidirectional Long Short-Term Memory neural network. Three information extraction algorithms for extracting the gist of paragraphs are also compared. The proposed information extraction algorithm supports extracting information from texts in multiple languages, with accuracy competitive with the state-of-the-art OpenIE and MinIE algorithms, which extract information in a single language. The multilingual models help local news media publish articles in different languages, which can help integrate immigrants into society.

Place, publisher, year, edition, pages
Sundsvall: Mid Sweden University, 2021, p. 55
Series
Mid Sweden University licentiate thesis, ISSN 1652-8948 ; 181
National Category
Natural Language Processing; Computer Sciences
Identifiers
URN: urn:nbn:se:miun:diva-41775
ISBN: 978-91-89341-02-9 (print)
OAI: oai:DiVA.org:miun-41775
DiVA, id: diva2:1541717
Presentation
2021-04-29, C312 / via Zoom, Holmgatan 10, Sundsvall, 14:00 (English)
Note

At the time of the public defence the following papers were unpublished: paper 4 submitted.

Available from: 2021-04-07 Created: 2021-04-01 Last updated: 2025-02-01 Bibliographically approved
List of papers
1. Evaluating Combinations of Classification Algorithms and Paragraph Vectors for News Article Classification
2018 (English) In: Proceedings of the 2018 Federated Conference on Computer Science and Information Systems / [ed] Maria Ganzha, Leszek Maciaszek, Marcin Paprzycki, Warsaw: Polskie Towarzystwo Informatyczne, 2018, p. 489-495, article id 8511213
Conference paper, Published paper (Refereed)
Abstract [en]

News companies need to automate the process of writing about popular and new events and make it more effective. Current technologies involve robotic programs that fill in values in templates and website listeners that notify editors when changes are made, so that the editor can read up on the source change on the actual website. Editors can provide news faster and better if they are directly provided with abstracts of the external sources and categorical meta-data that describes what the text is about. This article focuses on evaluating critical parameter modifications of the four classification algorithms Decision Tree, Random Forest, Multi-Layer Perceptron and Long Short-Term Memory in combination with the paragraph vector algorithms Distributed Memory and Distributed Bag of Words, with the aim of categorising news articles. The results show that Decision Tree and Multi-Layer Perceptron are stable within a short interval, while Random Forest is more dependent on the parameters best split and number of trees. The most accurate model is the Long Short-Term Memory model, which achieves an accuracy of 71%.

Place, publisher, year, edition, pages
Warsaw: Polskie Towarzystwo Informatyczne, 2018
Series
Annals of Computer Science and Information Systems, ISSN 2300-5963
National Category
Computer Sciences
Identifiers
urn:nbn:se:miun:diva-34767 (URN)
10.15439/2018F110 (DOI)
000454652300071 ()
2-s2.0-85057226648 (Scopus ID)
978-83-949419-7-0 (ISBN)
Conference
3rd International Workshop on Language Technologies and Applications (LTA'18) at 2018 Federated Conference on Computer Science and Information Systems, FedCSIS 2018; Poznan; Poland; 9 September 2018 through 12 September 2018
Projects
SMART (Smarta system och tjänster för ett effektivt och innovativt samhälle)
Available from: 2018-10-23 Created: 2018-10-23 Last updated: 2021-04-01 Bibliographically approved
2. Bilingual Auto-Categorization Comparison of two LSTM Text Classifiers
2019 (English) In: 2019 8th International Congress on Advanced Applied Informatics (IIAI-AAI), 2019
Conference paper, Published paper (Refereed)
Abstract [en]

Multilingual problems such as auto-categorization are not easy tasks. One option is to train a separate model for each language; another is to build the model in one base language and automatically translate other languages into that base language. Each language is biased towards its own grammar and syntax, which makes its content difficult to express in other languages. Translating from one language into a non-verbal language could potentially have a positive impact on the categorization results. A non-verbal language could, for example, be pure information in the form of knowledge graph relations extracted from the text. In this article a comparison is conducted between Chinese and Swedish. Two categorization models are developed and validated on each dataset. The purpose is to make an auto-categorization model that works for any language. One model is built upon LSTM and optimized for Swedish, and the other is an improved Bidirectional-LSTM Convolution model optimized for Chinese. The improved algorithm is trained on both languages and compared with the LSTM algorithm. The Bidirectional-LSTM algorithm performs approximately 20 percentage points better than the LSTM algorithm, which is significant.

National Category
Computer Sciences
Identifiers
urn:nbn:se:miun:diva-37261 (URN)
10.1109/IIAI-AAI.2019.00127 (DOI)
2-s2.0-85080902973 (Scopus ID)
978-1-7281-2627-2 (ISBN)
Conference
8th International Congress on Advanced Applied Informatics, Toyama, Japan, July 7-11 (Main Event) & 12 (Forum), 2019
Projects
SMART (Smarta system och tjänster för ett effektivt och innovativt samhälle)
Available from: 2019-09-19 Created: 2019-09-19 Last updated: 2021-04-01 Bibliographically approved
3. Productify news article classification model with Sagemaker
2020 (English) In: Advances in Science, Technology and Engineering Systems, ISSN 2415-6698, Vol. 5, no 2, p. 13-18
Article in journal (Refereed) Published
Abstract [en]

News companies need to automate the process of writing about popular and new events and make it more effective. Current technologies involve robotic programs that fill in values in templates and website listeners that notify editors when changes are made, so that the editor can read up on the source change on the actual website. Editors can provide news faster and better if they are directly provided with abstracts of the external sources and categorical meta-data that describes what the text is about. To make categorical meta-data a reality, an auto-categorization model was created and optimized for Swedish articles written by local news journalists. The problem was that it was not scalable enough to use out of the box. Instead of keeping a local model that could make good predictions on the text documents, the model is deployed in the cloud and an API is created. The API can be accessed from the tools in which the articles are written, so these services can automatically assign categories to an article once the journalist has finished writing it. To scale to several thousand simultaneously categorized articles while also simplifying the workflow of deploying new models, the API is uploaded to Sagemaker, where several models are trained; once an improved model is found, it is used in production, so that the system organically adapts to newly written articles. An evaluation of the Sagemaker API concluded that the complexity of this solution was polynomial.
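
The promotion step described above, in which an improved model replaces the one in production, can be sketched as a comparison on a shared evaluation score. This is an illustration under stated assumptions, not the setup used in the paper: `ModelVersion`, `select_production_model`, the `min_gain` margin and the accuracy values are all hypothetical, and no Sagemaker API is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical record of a trained model and its evaluation score."""
    name: str
    accuracy: float  # assumed: measured on one shared held-out evaluation set


def select_production_model(current, candidates, min_gain=0.0):
    """Return the model to serve in production.

    A candidate replaces the incumbent only if it beats the best score
    seen so far by more than `min_gain`; otherwise the incumbent stays.
    """
    best = current
    for candidate in candidates:
        if candidate.accuracy > best.accuracy + min_gain:
            best = candidate
    return best


# Made-up scores, for illustration only.
current = ModelVersion("v1", 0.70)
candidates = [ModelVersion("v2", 0.68), ModelVersion("v3", 0.74)]
chosen = select_production_model(current, candidates)
```

A non-zero `min_gain` would keep the production model from being swapped on differences that may only reflect evaluation noise.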

Keywords
Big data, Data mining, Editors, Journalists, Machine learning, Natural language processing, News events, NLP, Paragraph vectors, Text analysis
National Category
Media and Communications
Identifiers
urn:nbn:se:miun:diva-38834 (URN)
10.25046/aj050202 (DOI)
2-s2.0-85082473779 (Scopus ID)
Available from: 2020-04-07 Created: 2020-04-07 Last updated: 2025-02-07 Bibliographically approved
4. Multi-language Information Extraction with Text Pattern Recognition
2021 (English) In: Computer Science & Information Technology (CS & IT): Natural Language Computing, 7th International Conference on Natural Language Computing (NATL 2021), November 27-28, 2021, London, United Kingdom / [ed] David C. Wyld, Dhinaharan Nagamalai, 2021, Vol. 11, p. 1-17
Conference paper, Published paper (Refereed)
Abstract [en]

Information extraction is the task of extracting meta-data from text. The research in this article proposes a new information extraction algorithm called GenerateIE. The proposed algorithm identifies pairs of entities and the relations between them, as described in a piece of text. The extracted meta-data is useful in many areas, but this research focuses on using it in news-media contexts to provide the gist of written articles for analytics and paraphrasing of news information. GenerateIE is compared with existing state-of-the-art algorithms and offers two benefits. First, GenerateIE provides the co-referenced word as the entity instead of a pronoun such as he, she or it, which is more useful for knowledge graphs. Second, GenerateIE can be applied to multiple languages without changing the algorithm itself, apart from the underlying natural language text-parsing. The performance of GenerateIE is not significantly better than that of state-of-the-art algorithms, but it offers competitive results.
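
The first benefit named above, reporting the co-referenced word rather than a pronoun as the entity, can be illustrated with a minimal sketch. It is not the GenerateIE implementation: the `Triple` type, the `resolve_pronouns` helper, the pronoun list and the hand-written antecedent map are all hypothetical, and a real system would obtain the mapping from a coreference resolver.

```python
from dataclasses import dataclass, replace

# Hypothetical illustration only; not the GenerateIE algorithm from the paper.
PRONOUNS = {"he", "she", "it", "they"}


@dataclass(frozen=True)
class Triple:
    """An (entity, relation, entity) statement extracted from text."""
    subject: str
    relation: str
    obj: str


def _resolve(entity, antecedents):
    """Return the antecedent for a pronoun, or the entity unchanged."""
    key = entity.lower()
    if key in PRONOUNS:
        return antecedents.get(key, entity)
    return entity


def resolve_pronouns(triples, antecedents):
    """Replace pronoun entities with the mentions they refer to.

    `antecedents` maps a lower-cased pronoun to its antecedent. It is
    written by hand here; it is assumed to come from a coreference step.
    """
    return [
        replace(
            t,
            subject=_resolve(t.subject, antecedents),
            obj=_resolve(t.obj, antecedents),
        )
        for t in triples
    ]


raw = [
    Triple("The mayor", "opened", "the bridge"),
    Triple("She", "thanked", "the engineers"),
]
resolved = resolve_pronouns(raw, {"she": "The mayor"})
```

After resolution both triples share the same subject string, so they would attach to a single node in a knowledge graph instead of leaving a dangling "She" entity.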

Keywords
Information Extraction, IE, Information representation, Knowledge Graph, Natural Language Processing, NLP, Pattern Recognition, Entity Recognition
National Category
Computer Sciences
Identifiers
urn:nbn:se:miun:diva-41798 (URN)
978-1-925953-54-1 (ISBN)
Conference
7th International Conference on Natural Language Computing (NATL 2021), London, United Kingdom, November 27 - 28, 2021
Available from: 2021-04-01 Created: 2021-04-01 Last updated: 2022-01-03 Bibliographically approved

Open Access in DiVA

fulltext (1472 kB), 668 downloads
File information
File name: FULLTEXT02.pdf
File size: 1472 kB
Checksum (SHA-512): 1f706e9da3792e32d31f7db63e2d84119d31a06ff5037e4bcc3e01c6b338302c451100efbba88c26d4437189224d5ce0a54574ef38a94f500e2372ae95e163af
Type: fulltext
Mimetype: application/pdf

Authority records

Lindén, Johannes
