
Part-of-speech tagging and detection of social media texts = Automatische Wortartenbestimmung und Erkennung von Social Media Texten



Statement of responsibility: Melanie Neunerdt

Edition: 1st edition

Imprint: Aachen : Apprimus Verlag 2015

Extent: v, 123 pages : illustrations, diagrams

ISBN: 978-3-86359-402-2

Series: Elektro- und Informationstechnik


Dissertation, RWTH Aachen, 2015

Additional series: Edition Wissenschaft Apprimus. Also published on the publication server of RWTH Aachen University


Approving faculty
Fak06

Main examiner/Reviewer
;

Date of oral examination/habilitation
2015-09-09

Online
URN: urn:nbn:de:hbz:82-rwth-2016-008692
URL: https://publications.rwth-aachen.de/record/567622/files/567622.pdf
URL: https://publications.rwth-aachen.de/record/567622/files/567622.pdf?subformat=pdfa

Institutions

  1. Lehrstuhl und Institut für Theoretische Informationstechnik (613410)

Subject description (keywords)
electrical engineering, electronics (free) ; natural language processing (free) ; part-of-speech tagging (free) ; text classification (free) ; natürliche Sprachverarbeitung (free) ; automatische Wortartenbestimmung (free) ; Textklassifikation (free)

Subject classification
DDC: 621.3

Abstract
This dissertation comprises novel concepts in the field of automatic text processing. In addition to models and algorithms for automatic part-of-speech tagging, methods for social media text detection and Web page cleaning are presented. The first part of the thesis discusses various approaches to the classification and detection of social media texts in Web pages. Sequential methods are presented that classify a sequence of text segments on the basis of high-dimensional feature vectors. The second part presents and discusses a part-of-speech tagger for automatic part-of-speech tagging of non-standardized social media texts.

This thesis contributes to sequence labeling tasks in the field of Natural Language Processing by introducing novel concepts, models and algorithms for Part-of-Speech (POS) tagging, social media text detection and Web page cleaning. First, the task of social media text classification in Web pages is addressed, where sequences of Web text segments are classified based on a high-dimensional feature vector. New features motivated by social media text characteristics are introduced and investigated with respect to different classifiers. Two classification problems in the context of social media text classification are treated: (1) social media text detection and (2) Web page cleaning for social media platforms. A new Web page corpus, designed specifically to train and test the classifiers on representative Web pages, is created. Moreover, a POS tagger for social media texts is developed. A specialized tagger is needed because of the specific characteristics of social media texts and their high degree of non-standardization. Based on these factors, a Markov model tagger with parameter estimation enhancements tailored to social media texts is proposed. Particular focus is put on reliable estimation for non-standardized tokens such as out-of-vocabulary words. To that end, methods are proposed to improve the reliability of probability estimation. Moreover, a novel approach is presented that maps unknown tokens either to tokens known from training or to tokens that fall into a class represented by regular expressions. Finally, for the remaining unknown tokens, semi-supervised auxiliary lexica and adequate estimation from prefix and suffix information are proposed. Furthermore, we propose to combine sparse in-domain social media training data and a newspaper corpus by an oversampling technique, which improves POS tagging accuracies significantly.
Training and evaluation of the proposed POS tagger are performed on a new manually annotated German social media text corpus. Tagging accuracies are presented and compared to accuracies achieved with state-of-the-art POS taggers. Finally, we show that the proposed social media text detection and Web cleaning methods, as well as the presented POS tagger, can be used efficiently in the context of information retrieval for Web page corpus construction. By applying Web page cleaning and social media text detection to Web page corpora obtained from Web crawlers, the generated corpus can be further refined.


Document type
Book/Dissertation / PhD Thesis

Format
online, print

Language
English

External identifiers
HBZ: HT018885168

Internal identifiers
RWTH-2016-00869
Record ID: 567622

Participating countries
Germany

The record appears in these collections:
Document types > Theses > Ph.D. Theses
Document types > Books > Books
Faculty of Electrical Engineering and Information Technology (Fac.6)
Publication server / Open Access
Public records
Publications database
613410

 Record created 2016-02-02, last modified 2024-07-15