Part-of-speech tagging and detection of social media texts

Neunerdt, Melanie; Mathar, Rudolf; Zesch, Torsten

doi:HT018885168

Marc 21

001			567622
005			20240715100535.0
020	_	_	\|a 978-3-86359-402-2
024	7	_	\|2 URN \|a urn:nbn:de:hbz:82-rwth-2016-008692
024	7	_	\|2 HBZ \|a HT018885168
024	7	_	\|2 Laufende Nummer \|a 35466
037	_	_	\|a RWTH-2016-00869
041	_	_	\|a English
082	_	_	\|a 621.3
100	1	_	\|0 P:(DE-82)013817 \|a Neunerdt, Melanie \|b 0
245	_	_	\|a Part-of-speech tagging and detection of social media texts \|c Melanie Neunerdt \|h online, print
246	_	3	\|a Automatische Wortartenbestimmung und Erkennung von Social Media Texten \|y German
250	_	_	\|a 1st edition
260	_	_	\|a Aachen \|b Apprimus Verlag \|c 2015
260	_	_	\|c 2016
300	_	_	\|a v, 123 pages : illustrations, diagrams
336	7	_	\|2 DataCite \|a Output Types/Dissertation
336	7	_	\|0 PUB:(DE-HGF)3 \|2 PUB:(DE-HGF) \|a Book \|m book
336	7	_	\|2 ORCID \|a DISSERTATION
336	7	_	\|2 BibTeX \|a PHDTHESIS
336	7	_	\|0 2 \|2 EndNote \|a Thesis
336	7	_	\|0 PUB:(DE-HGF)11 \|2 PUB:(DE-HGF) \|a Dissertation / PhD Thesis \|b phd \|m phd
336	7	_	\|2 DRIVER \|a doctoralThesis
490	0	_	\|a Elektro- und Informationstechnik
500	_	_	\|a Additional series: Edition Wissenschaft Apprimus. - Also published on the publication server of RWTH Aachen University
502	_	_	\|a Dissertation, RWTH Aachen, 2015 \|b Dissertation \|c RWTH Aachen \|d 2015 \|g Fak06 \|o 2015-09-09
520	3	_	\|a This dissertation presents novel concepts in the field of automatic text processing. In addition to models and algorithms for automatic part-of-speech tagging, it introduces methods for social media text detection and Web page cleaning. The first part of the thesis discusses different approaches to the classification and detection of social media texts in Web pages. It presents sequential methods that classify a sequence of text segments based on high-dimensional feature vectors. The second part presents and discusses a part-of-speech tagger for automatic part-of-speech tagging of non-standardized social media texts. \|l ger
520	_	_	\|a This thesis contributes to sequence labeling tasks in the field of Natural Language Processing by introducing novel concepts, models, and algorithms for Part-of-Speech (POS) tagging, social media text detection, and Web page cleaning. First, the task of social media text classification in Web pages is addressed, where sequences of Web text segments are classified based on a high-dimensional feature vector. New features motivated by social media text characteristics are introduced and investigated with respect to different classifiers. Two classification problems in the context of social media text classification are treated: (1) social media text detection and (2) Web page cleaning for social media platforms. A new Web page corpus, designed specifically to train and test the classifiers on representative Web pages, is created. Moreover, a POS tagger for social media texts is developed. A specialized tagger is needed because of the specific characteristics of social media texts and their high degree of non-standardization. Based on these factors, a Markov model tagger with parameter estimation enhancements for social media texts is proposed. Particular focus is put on reliable estimation for non-standardized tokens such as out-of-vocabulary words. To that end, methods are proposed to improve the reliability of probability estimation. Moreover, a novel approach is presented that maps unknown tokens either to tokens known from training or to classes represented by regular expressions. Finally, for the remaining unknown tokens, semi-supervised auxiliary lexica and adequate estimation from prefix and suffix information are proposed. Furthermore, we propose to combine sparse in-domain social media training data and a newspaper corpus by an oversampling technique, which improves POS tagging accuracies significantly.
Training and evaluation of the proposed POS tagger are performed on a new manually annotated German social media text corpus. Tagging accuracies are presented and compared to accuracies achieved with state-of-the-art POS taggers. Finally, we show that the proposed social media text detection and Web page cleaning methods, as well as the presented POS tagger, can be used efficiently in the context of information retrieval for Web page corpus construction. By applying Web page cleaning and social media text detection to Web page corpora obtained from Web crawlers, the generated corpus can be further refined. \|l eng
591	_	_	\|a Germany
653	_	7	\|a Elektrotechnik, Elektronik
653	_	7	\|a natural language processing
653	_	7	\|a part-of-speech tagging
653	_	7	\|a text classification
653	_	7	\|a natürliche Sprachverarbeitung
653	_	7	\|a automatische Wortartenbestimmung
653	_	7	\|a Textklassifikation
700	1	_	\|0 P:(DE-82)IDM00567 \|a Mathar, Rudolf \|b 1 \|e Thesis advisor
700	1	_	\|a Zesch, Torsten \|b 2 \|e Thesis advisor
856	4	_	\|u https://publications.rwth-aachen.de/record/567622/files/567622.pdf \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/567622/files/567622_source.zip \|y restricted
856	4	_	\|u https://publications.rwth-aachen.de/record/567622/files/567622.gif?subformat=icon \|x icon \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/567622/files/567622.jpg?subformat=icon-180 \|x icon-180 \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/567622/files/567622.jpg?subformat=icon-700 \|x icon-700 \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/567622/files/567622.pdf?subformat=pdfa \|x pdfa \|y OpenAccess
909	C	O	\|o oai:publications.rwth-aachen.de:567622 \|p dnbdelivery \|p VDB \|p driver \|p urn \|p open_access \|p openaire
910	1	_	\|0 I:(DE-HGF)0 \|6 P:(DE-82)013817 \|a TI \|b 0
914	1	_	\|y 2016
915	_	_	\|0 StatID:(DE-HGF)0510 \|2 StatID \|a OpenAccess
920	1	_	\|0 I:(DE-82)613410_20140620 \|k 613410 \|l Lehrstuhl und Institut für Theoretische Informationstechnik \|x 0
980	1	_	\|a FullTexts
980	_	_	\|a phd
980	_	_	\|a VDB
980	_	_	\|a book
980	_	_	\|a I:(DE-82)613410_20140620
980	_	_	\|a UNRESTRICTED
