Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data.

Title:: Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data.
Authors:: Nadeem K; University of Guelph, Guelph, Ontario, Canada.
Jabri MA; University of Guelph, Guelph, Ontario, Canada.
Source:: PloS one [PLoS One] 2023 Jan 17; Vol. 18 (1), pp. e0280258. Date of Electronic Publication: 2023 Jan 17 (Print Publication: 2023).
Publication Type:: Journal Article; Research Support, Non-U.S. Gov't
Language:: English
Imprint Name(s):: Original Publication: San Francisco, CA : Public Library of Science
MeSH Terms:: Algorithms*
Logistic Models*
Computer Simulation
Molecular Sequence:: Dryad 10.5061/dryad.ttdz08m20
Entry Date(s):: Date Created: 20230117 Date Completed: 20230120 Latest Revision: 20230320
Update Code:: 20240105
PubMed Central ID:: PMC9844919
DOI:: 10.1371/journal.pone.0280258
PMID:: 36649281
: Scientific journal

Full text

We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression (OLR) models in the presence of severe class-imbalance in high-dimensional datasets with correlated signal and noise covariates. Class-imbalance is resolved using response-based subsampling, which we also employ to achieve stability in variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods considered in our study include Lasso, adaptive Lasso (adaLasso) and ridge regression. Our methodology is versatile in the sense that it works effectively for regularization techniques involving both hard-shrinkage (e.g. Lasso) and soft-shrinkage (e.g. ridge) of the regression coefficients. We assess selection performance by conducting a detailed simulation experiment involving varying moderate-to-severe class-imbalance ratios and highly correlated continuous and discrete signal and noise covariates. Simulation results show that our algorithm is robust against severe class-imbalance in the presence of highly correlated covariates, and consistently achieves stable and accurate variable selection with a very low false discovery rate. We illustrate our methodology using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.
Competing Interests: The authors have declared that no competing interests exist.
(Copyright: © 2023 Nadeem, Jabri. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.)
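
The general idea described in the abstract (balance the classes by response-based subsampling, fit a regularized logistic model to each balanced subsample, then aggregate across the ensemble) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the ranking rule (selection frequency, with ties broken by mean absolute coefficient), the proximal-gradient Lasso solver, the penalty value `lam=0.05`, the function names, and the toy data with independent noise covariates are all assumptions made for illustration only.

```python
import math
import random

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty (shrinks z toward zero by t)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_lasso_logistic(X, y, lam, step=0.5, n_iter=100):
    """L1-penalized logistic regression via proximal gradient descent.
    The intercept is not penalized. Assumes covariates are roughly unit scale."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            eta = b0 + sum(b * x for b, x in zip(beta, xi))
            eta = max(-30.0, min(30.0, eta))           # guard against overflow
            r = 1.0 / (1.0 + math.exp(-eta)) - yi      # residual p_i - y_i
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        b0 -= step * g0 / n
        beta = [soft_threshold(beta[j] - step * g[j] / n, step * lam)
                for j in range(p)]
    return b0, beta

def balanced_subsample(y, rng):
    """Keep every minority-class index and draw an equal number from the majority."""
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    minority, majority = sorted((pos, neg), key=len)
    return minority + rng.sample(majority, len(minority))

def rank_covariates(X, y, lam=0.05, n_models=15, seed=0):
    """Fit one Lasso model per balanced subsample and rank covariates by how
    often they receive a nonzero coefficient (an assumed ranking criterion)."""
    rng = random.Random(seed)
    p = len(X[0])
    hits, mags = [0] * p, [0.0] * p
    for _ in range(n_models):
        idx = balanced_subsample(y, rng)
        _, beta = fit_lasso_logistic([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            hits[j] += int(b != 0.0)
            mags[j] += abs(b)
    freq = [h / n_models for h in hits]
    mean_abs = [m / n_models for m in mags]
    ranking = sorted(range(p), key=lambda j: (freq[j], mean_abs[j]), reverse=True)
    return ranking, freq

# Toy data (hypothetical): covariate 0 carries the signal, covariates 1-5 are
# independent noise, and the intercept makes positive cases rare.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(2000)]
y = [1 if data_rng.random() < 1.0 / (1.0 + math.exp(4.5 - 2.0 * x[0])) else 0
     for x in X]

ranking, freq = rank_covariates(X, y)
print("ranking:", ranking)
print("selection frequency:", freq)
```

Aggregating over many balanced subsamples is what gives the selection its stability in this sketch: a noise covariate may enter a single Lasso fit by chance, but it is unlikely to match a true signal covariate in both selection frequency and coefficient size across the whole ensemble.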
