Part-of-speech tagging and detection of social media texts

Neunerdt, Melanie; Mathar, Rudolf; Zesch, Torsten

doi:HT018885168

Part-of-speech tagging and detection of social media texts = Automatische Wortartenbestimmung und Erkennung von Social Media Texten

Neunerdt, Melanie

2015 & 2016

Statement of responsibility: Melanie Neunerdt

Edition: 1st edition

Imprint: Aachen : Apprimus Verlag 2015

Extent: v, 123 pages : illustrations, diagrams

ISBN: 978-3-86359-402-2

Series: Elektro- und Informationstechnik

Dissertation, RWTH Aachen, 2015

Additional series: Edition Wissenschaft Apprimus. Also published on the publication server of RWTH Aachen University

Approving faculty
Fak06

Main examiner/Reviewers
Mathar, Rudolf (Thesis advisor) ; Zesch, Torsten (Thesis advisor)

Date of oral examination/habilitation
2015-09-09

Online
URN: urn:nbn:de:hbz:82-rwth-2016-008692
URL: https://publications.rwth-aachen.de/record/567622/files/567622.pdf
URL: https://publications.rwth-aachen.de/record/567622/files/567622.pdf?subformat=pdfa

Institutions

Lehrstuhl und Institut für Theoretische Informationstechnik (613410)

Subject description (keywords)
Electrical engineering, electronics (free) ; natural language processing (free) ; part-of-speech tagging (free) ; text classification (free) ; natürliche Sprachverarbeitung (free) ; automatische Wortartenbestimmung (free) ; Textklassifikation (free)

Subject classification
DDC: 621.3

Abstract
This dissertation presents novel concepts in the field of automatic text processing. In addition to models and algorithms for automatic part-of-speech tagging, it presents methods for social media text detection and Web page cleaning. The first part discusses various approaches to the classification and detection of social media texts in Web pages. It presents sequential methods that classify a sequence of text segments based on high-dimensional feature vectors. The second part presents and discusses a part-of-speech tagger for non-standardized social media texts.

This thesis contributes to sequence labeling tasks in the field of Natural Language Processing by introducing novel concepts, models and algorithms for Part-of-Speech (POS) tagging, social media text detection and Web page cleaning. First, the task of social media text classification in Web pages is addressed, where sequences of Web text segments are classified based on a high-dimensional feature vector. New features motivated by social media text characteristics are introduced and investigated with respect to different classifiers. Two classification problems in the context of social media text classification are treated: (1) social media text detection and (2) Web page cleaning for social media platforms. A new Web page corpus, designed specifically to train and test the classifiers on representative Web pages, is created. Moreover, a POS tagger for social media texts is developed. A specialized tagger is needed because of the specific characteristics of social media texts and their high degree of non-standardization. Based on these factors, a Markov model tagger with parameter estimation enhancements for social media texts is proposed. Particular focus is put on reliable estimation for non-standardized tokens such as out-of-vocabulary words. To that end, methods are proposed to improve the reliability of probability estimation. Moreover, a novel approach is presented that maps unknown tokens either to tokens known from training or to a class represented by regular expressions. Finally, for the remaining unknown tokens, semi-supervised auxiliary lexica and adequate estimation from prefix and suffix information are proposed. Furthermore, we propose to combine sparse in-domain social media training data and a newspaper corpus by an oversampling technique, which improves POS tagging accuracies significantly.
Training and evaluation of the proposed POS tagger are performed on a new manually annotated German social media text corpus. Tagging accuracies are presented and compared to accuracies achieved with state-of-the-art POS taggers. Finally, we show that the proposed social media text detection and Web cleaning methods, as well as the presented POS tagger, can be efficiently used in the context of information retrieval for Web page corpus construction. By applying Web page cleaning and social media text detection to Web page corpora obtained from Web crawlers, the generated corpus can be further refined.
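
The mapping of unknown tokens mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the class labels, the regular expressions, the normalization steps (lowercasing, collapsing repeated characters) and the function name `map_unknown_token` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical token classes. The actual regular expressions used in the
# thesis are not given in the abstract; these are illustrative assumptions.
TOKEN_CLASSES = [
    ("<URL>", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
    ("<EMOTICON>", re.compile(r"^[:;=8][-o*']?[)(\]\[dDpP/\\|]+$")),
    ("<NUMBER>", re.compile(r"^\d+([.,]\d+)*$")),
]

# Matches runs of three or more identical characters ("sooooo" -> "so").
REPEATED_CHARS = re.compile(r"(.)\1{2,}")


def map_unknown_token(token, vocabulary):
    """Map a token to a known token or a regex class label, else None.

    `vocabulary` is the set of tokens seen in training (assumed lowercase).
    """
    # Known tokens need no mapping.
    if token in vocabulary:
        return token
    # Tokens that fall into a class described by a regular expression.
    for label, pattern in TOKEN_CLASSES:
        if pattern.match(token):
            return label
    # Assumed normalization 1: case folding.
    lowered = token.lower()
    if lowered in vocabulary:
        return lowered
    # Assumed normalization 2: collapse character repetitions.
    collapsed = REPEATED_CHARS.sub(r"\1", lowered)
    if collapsed in vocabulary:
        return collapsed
    # Still unknown: left to other estimation methods (e.g. affix-based).
    return None
```

With a training vocabulary such as `{"so", "gut"}`, this sketch maps `"sooooo"` to `"so"` and `":-)"` to the `<EMOTICON>` class, while a token that matches nothing returns `None` and would be handled by the remaining estimation methods the abstract lists.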
