Adaptive subspace methods for high-dimensional variable selection

Staerk, Christian; Cramer, Erhard; Kateri, Maria; Ntzoufras, Ioannis
doi:10.18154/RWTH-2018-226562
000729869 001__ 729869
000729869 005__ 20230408005633.0
000729869 0247_ $$2HBZ$$aHT019763045
000729869 0247_ $$2datacite_doi$$a10.18154/RWTH-2018-226562
000729869 0247_ $$2Laufende Nummer$$a37413
000729869 037__ $$aRWTH-2018-226562
000729869 041__ $$aEnglish
000729869 082__ $$a510
000729869 1001_ $$0P:(DE-588)1163838640$$aStaerk, Christian$$b0$$urwth
000729869 245__ $$aAdaptive subspace methods for high-dimensional variable selection$$cvorgelegt von Christian Staerk, M.Sc.$$honline
000729869 246_3 $$aAdaptive Subspace Methoden für hoch-dimensionale Variablenselektion$$yGerman
000729869 260__ $$aAachen$$c2018
000729869 300__ $$a1 online resource (v, 214 pages) : illustrations
000729869 3367_ $$2DataCite$$aOutput Types/Dissertation
000729869 3367_ $$2ORCID$$aDISSERTATION
000729869 3367_ $$2BibTeX$$aPHDTHESIS
000729869 3367_ $$02$$2EndNote$$aThesis
000729869 3367_ $$0PUB:(DE-HGF)11$$2PUB:(DE-HGF)$$aDissertation / PhD Thesis$$bphd$$mphd
000729869 3367_ $$2DRIVER$$adoctoralThesis
000729869 500__ $$aPublished on the publication server of RWTH Aachen University
000729869 502__ $$aDissertation, RWTH Aachen University, 2018$$bDissertation$$cRWTH Aachen University$$d2018$$gFak01$$o2018-04-26
000729869 5203_ $$aRasante Entwicklungen in der Informationstechnologie, der Genomforschung und weiteren Gebieten haben dazu geführt, dass heutzutage oftmals hoch-dimensionale Daten beobachtet werden, bei denen die Anzahl der Variablen wesentlich größer ist als die Anzahl der Beobachtungen. In solchen Situationen ist man insbesondere an der Selektion von erklärenden Variablen interessiert, um ein Modell mit möglichst wenigen Variablen zu finden, welches die beobachteten Daten gut beschreibt. Diese Arbeit handelt von dem Problem der Variablenselektion im Rahmen von hoch-dimensionalen generalisierten linearen Modellen (GLM). Viele Variablenselektionsmethoden wie das Lasso-Verfahren basieren auf der Lösung von $\ell_1$-regularisierten, konvexen Relaxierungen des ursprünglichen Problems. Eine wichtige Motivation für diese Arbeit ist es hingegen, Lösungen zu $\ell_0$-regularisierten, diskreten Problemen zu finden, die etwa von Modellselektionskriterien wie dem Extended Bayesian Information Criterion (EBIC) induziert werden und im Allgemeinen NP-schwer sind. Zu diesem Zweck wird die Adaptive Subspace (AdaSub) Methode vorgestellt, welche auf der Idee basiert, mehrere niedrig-dimensionale Teilprobleme des ursprünglich hoch-dimensionalen Problems adaptiv zu lösen. AdaSub ist ein stochastisches Verfahren, in welchem die individuellen Wahrscheinlichkeiten, mit denen die jeweiligen Variablen berücksichtigt werden, gemäß der jeweils aktuell geschätzten "Bedeutsamkeit" adjustiert werden. Es wird gezeigt, dass die Adaption des Verfahrens Bayesianisch motiviert werden kann, und dass die Methode "korrekt" gegen das beste Modell bezüglich des verwendeten Kriteriums konvergiert, sofern die sogenannte Ordered Importance Property (OIP) erfüllt ist. Des Weiteren wird die Variablenselektions-Konsistenz von AdaSub unter geeigneten Bedingungen bewiesen. 
Da für nichtlineare Regressionsmodelle die Lösung der Teilprobleme in AdaSub oftmals zu rechenintensiv ist, werden Varianten von AdaSub eingeführt, die die Teilprobleme mithilfe von Greedy-Verfahren approximativ lösen. Es wird sich herausstellen, dass BackAdaSub, eine Variante basierend auf schrittweiser Rückwärts-Selektion, in vielen Fällen als effizienter "Ersatz-Algorithmus" für AdaSub verwendet werden kann. Es wird gezeigt, dass die Modified Ordered Importance Property (MOIP) eine hinreichende Bedingung für die "korrekte Konvergenz" von BackAdaSub ist, die jedoch eine stärkere Forderung darstellt als die ursprüngliche OIP. Die Performance von AdaSub und BackAdaSub im Vergleich zu anderen bekannten Verfahren wie Lasso, Adaptive Lasso, SCAD und Stability Selection wird anhand von vielfältigen simulierten und realen Datensätzen im Rahmen von linearen und logistischen Regressionsmodellen untersucht. Schließlich wird der sogenannte Metropolized AdaSub (MAdaSub) Algorithmus vorgestellt, um in einem Bayesianischen Kontext aus Posteriori-Modell-Verteilungen zu simulieren. MAdaSub stellt ein adaptives Markov Chain Monte Carlo (MCMC) Verfahren dar, welches die Verteilungen der vorgeschlagenen Modelle ("proposals") basierend auf Informationen von vorherigen Iterationen sequentiell adjustiert. Trotz der kontinuierlichen Adaption des Verfahrens kann gezeigt werden, dass der MAdaSub Algorithmus ergodisch ist, sodass MAdaSub "im Grenzfall" aus der korrekten Zielverteilung simuliert. Anhand von simulierten und realen Datensätzen wird demonstriert, dass MAdaSub selbst für hoch-dimensionale und multimodale Verteilungen stabile Schätzungen von marginalen Posteriori-Inklusionswahrscheinlichkeiten liefern kann.$$lger
000729869 520__ $$aDue to recent advancements in fields such as information technology and genomics, nowadays one commonly faces high-dimensional data where the number of explanatory variables is possibly much larger than the number of observations. In such situations one is particularly interested in variable selection, meaning that one aims at identifying a sparse model with a relatively small subset of variables that fits and ideally explains the observed data well. This thesis deals with the variable selection problem in the setting of high-dimensional generalized linear models (GLMs). While many variable selection methods like the Lasso are based on solving convex $\ell_1$-type relaxations of the original problem, a main motivation of this work is to provide solutions to generally NP-hard $\ell_0$-regularized problems induced by model selection criteria such as the Extended Bayesian Information Criterion (EBIC). For this purpose, the Adaptive Subspace (AdaSub) method is proposed, which is based on the idea of adaptively solving several low-dimensional sub-problems of the original high-dimensional problem. AdaSub is a stochastic algorithm that sequentially adapts the sampling probabilities of the individual variables based on their currently estimated "importance". It is shown that the updating scheme of AdaSub can be motivated in a Bayesian way and that the method "converges correctly" to the best model according to the employed criterion, provided that the so-called ordered importance property (OIP) is satisfied. Furthermore, the variable selection consistency of AdaSub is proved under suitable conditions. Since solving the sampled sub-problems can be computationally expensive for GLMs other than the normal linear model, "greedy" modifications of AdaSub are introduced that provide approximate solutions to the sub-problems. 
It is argued that BackAdaSub, a version of AdaSub based on Backward Stepwise Selection, may be used as an efficient surrogate algorithm. The "correct convergence" of BackAdaSub can be guaranteed under the modified ordered importance property (MOIP), which is a stronger condition than the original OIP. The performance of AdaSub and BackAdaSub in comparison to other prominent competitors such as the Lasso, the Adaptive Lasso, the SCAD and Stability Selection is investigated via various simulated and real data examples in the framework of linear and logistic regression models. Finally, a Metropolized version of AdaSub, called the MAdaSub algorithm, is proposed for sampling from posterior model distributions in the Bayesian variable selection context. MAdaSub is an adaptive Markov Chain Monte Carlo (MCMC) algorithm which sequentially adjusts the proposal distribution based on the information from the previously sampled models. It is shown that the MAdaSub algorithm is ergodic despite its continuing adaptation, i.e. "in the limit" it samples from the correct target distribution. Through simulated and real data examples it is demonstrated that MAdaSub can provide stable estimates of posterior marginal inclusion probabilities even for very high-dimensional and multimodal posterior model distributions.$$leng
000729869 588__ $$aDataset connected to Lobid/HBZ
000729869 591__ $$aGermany
000729869 653_7 $$aHigh-Dimensional Statistics
000729869 653_7 $$aGeneralized Linear Models
000729869 653_7 $$aRegularization
000729869 653_7 $$aSubset Selection
000729869 653_7 $$aAdaSub
000729869 653_7 $$aMarkov Chain Monte Carlo
000729869 7001_ $$0P:(DE-82)IDM00092$$aKateri, Maria$$b1$$eThesis advisor$$uRWTH
000729869 7001_ $$0P:(DE-82)172852$$aNtzoufras, Ioannis$$b2$$eThesis advisor
000729869 7001_ $$0P:(DE-82)IDM00061$$aCramer, Erhard$$b3$$eThesis advisor$$uRWTH
000729869 8564_ $$uhttps://publications.rwth-aachen.de/record/729869/files/729869.pdf$$yOpenAccess
000729869 8564_ $$uhttps://publications.rwth-aachen.de/record/729869/files/729869_source.zip$$yRestricted
000729869 8564_ $$uhttps://publications.rwth-aachen.de/record/729869/files/729869.gif?subformat=icon$$xicon$$yOpenAccess
000729869 8564_ $$uhttps://publications.rwth-aachen.de/record/729869/files/729869.jpg?subformat=icon-1440$$xicon-1440$$yOpenAccess
000729869 8564_ $$uhttps://publications.rwth-aachen.de/record/729869/files/729869.jpg?subformat=icon-180$$xicon-180$$yOpenAccess
000729869 8564_ $$uhttps://publications.rwth-aachen.de/record/729869/files/729869.jpg?subformat=icon-640$$xicon-640$$yOpenAccess
000729869 8564_ $$uhttps://publications.rwth-aachen.de/record/729869/files/729869.jpg?subformat=icon-700$$xicon-700$$yOpenAccess
000729869 8564_ $$uhttps://publications.rwth-aachen.de/record/729869/files/729869.pdf?subformat=pdfa$$xpdfa$$yOpenAccess
000729869 909CO $$ooai:publications.rwth-aachen.de:729869$$pdnbdelivery$$pdriver$$pVDB$$popen_access$$popenaire
000729869 9101_ $$0I:(DE-588b)36225-6$$6P:(DE-82)IDM00061$$aRWTH Aachen$$b3$$kRWTH
000729869 9141_ $$y2018
000729869 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
000729869 9201_ $$0I:(DE-82)116510_20140620$$k116510$$lLehrstuhl für Statistik und Stochastische Modellierung$$x0
000729869 9201_ $$0I:(DE-82)110000_20140620$$k110000$$lFachgruppe Mathematik$$x1
000729869 961__ $$c2018-09-20T10:21:09.426232$$x2018-07-25T09:34:22.859319$$z2018-09-20T10:21:09.426232
000729869 9801_ $$aFullTexts
000729869 980__ $$aphd
000729869 980__ $$aVDB
000729869 980__ $$aUNRESTRICTED
000729869 980__ $$aI:(DE-82)116510_20140620
000729869 980__ $$aI:(DE-82)110000_20140620