Mid Sweden University

Main title: Understand and Utilise Unformatted Text Documents by Natural Language Processing algorithms
Mid Sweden University, Faculty of Science, Technology and Media, Department of Information Systems and Technology.
2017 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
Abstract [en]

News companies need to automate and streamline the editor's process of writing about new and trending events. Current technologies involve robotic programs that fill in values in templates and website listeners that notify editors when changes are made, so that the editor can read up on the change at the source website. Editors can deliver news faster and better if they are provided directly with abstracts of the external sources. This study applies deep learning algorithms to automatically formulate abstracts and tag sources with appropriate tags based on the context. The study is a full-stack solution that addresses both the editors' need for speed and the training, testing and validation of the algorithms. Decision Tree, Random Forest, Multi-Layer Perceptron and phrase document vectors are used to evaluate the categorisation, and Recurrent Neural Networks are used to paraphrase unformatted texts. In the evaluation, models trained by the algorithms with varying parameters are compared based on the F-score. The results show that the F-score increases with the number of documents used in training and decreases with the number of categories the algorithm needs to consider. The Multi-Layer Perceptron performs best, followed by Random Forest and finally Decision Tree. Document length matters: when longer documents are considered during training, the score increases considerably. A user survey about the paraphrase algorithms shows that the paraphrase results are insufficient to satisfy editors' needs. It confirms a need for more memory to conduct longer experiments.
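
As a rough illustration of the F-score used to compare the categorisation models, the sketch below computes a per-category F1 and its unweighted (macro) average in plain Python. The abstract does not say which averaging the thesis uses, so macro averaging is an assumption here, and the function names and category labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} from paired true and predicted category labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted this label, but it was wrong
            fn[truth] += 1   # missed the true label
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-category F1 scores (assumed averaging)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical labels, for illustration only
y_true = ["sport", "sport", "politics", "economy"]
y_pred = ["sport", "politics", "politics", "economy"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7778
```

With macro averaging every category carries equal weight, so adding rarely seen categories can pull the score down, which is one plausible reading of the reported drop as the number of categories grows.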

Place, publisher, year, edition, pages
2017. p. 47
Keywords [en]
Machine learning, data mining, big data, news events, journalists, editors, text analysis, natural language processing, nlp, document vectors, seq2seq, recurrent neural network
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:miun:diva-31043
Local ID: DT-V17-A2-001
OAI: oai:DiVA.org:miun-31043
DiVA, id: diva2:1116941
Subject / course
Computer Engineering DT1
Educational program
Master of Science in Engineering - Computer Engineering TDTEA 300 higher education credits
Supervisors
Examiners
Available from: 2017-06-28 Created: 2017-06-28 Last updated: 2017-06-28 Bibliographically approved

Open Access in DiVA

fulltext (2205 kB), 440 downloads
File information
File name: FULLTEXT01.pdf
File size: 2205 kB
Checksum (SHA-512): 0412c47de3d9ded69f7793ee7ac73deaf43320a03533e9052f769002eefbdd8780f337a48f63d79611eb9a43251fe62a422ac9e88c913dc5284a073cd824fe5d
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Lindén, Johannes
By organisation
Department of Information Systems and Technology
Computer Systems

Total: 440 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 18802 hits