SMOTE for high-dimensional class-imbalanced data

Abstract

Background

Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally, undersampling is helpful, while random oversampling is not. The Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and an empirical point of view, using simulated and real high-dimensional data.
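For readers unfamiliar with the method, the core step of SMOTE can be sketched as follows: each synthetic minority sample is placed at a random point on the line segment joining a minority sample and one of its k nearest minority-class neighbors. This is a minimal illustration, not the implementation used in the article. The function name `smote`, its parameters, and the choice to draw parent samples at random (rather than cycling through every minority sample) are assumptions made for this sketch, which uses only the Python standard library.

```python
import math
import random


def smote(minority, n_synthetic, k=5, rng=None):
    """Sketch of SMOTE: interpolate between minority samples and their
    k nearest minority-class neighbors (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = rng or random.Random()
    k = min(k, len(minority) - 1)

    # Precompute the k nearest neighbors of every minority sample.
    neighbors = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbors.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # parent sample (simplification)
        j = rng.choice(neighbors[i])       # one of its k nearest neighbors
        u = rng.random()                   # interpolation weight in [0, 1)
        x, x_nn = minority[i], minority[j]
        synthetic.append([a + u * (b - a) for a, b in zip(x, x_nn)])
    return synthetic


# Hypothetical usage: 10 minority samples with 3 variables each.
rng = random.Random(0)
minority = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(10)]
new_points = smote(minority, n_synthetic=20, k=3, rng=rng)
```

Because every synthetic point is a convex combination of two existing minority samples, no synthetic value can fall outside the range already observed for that variable.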

Results

While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data.
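The statements about the mean, the variability and the correlation can be illustrated with a short calculation. This is a sketch under simplifying assumptions that do not appear in the abstract: for a single variable, the minority sample X and its selected neighbor X^NN are treated as independent draws with common mean \mu and variance \sigma^2, and the interpolation weight U is uniform on (0, 1) and independent of both. Independence of a sample and its nearest neighbor is only a rough approximation, more plausible when the number of variables is large.

```latex
\begin{align*}
X^{S} &= X + U\,(X^{NN} - X) = (1-U)\,X + U\,X^{NN}, \qquad U \sim \mathrm{Uniform}(0,1) \\[4pt]
\mathrm{E}\!\left[X^{S}\right] &= \bigl(1-\mathrm{E}[U]\bigr)\mu + \mathrm{E}[U]\,\mu = \mu \\[4pt]
\mathrm{Var}\!\left(X^{S}\right)
  &= \mathrm{E}\!\left[(1-U)^2 + U^2\right]\sigma^2
   = \left(\tfrac{1}{3} + \tfrac{1}{3}\right)\sigma^2
   = \tfrac{2}{3}\,\sigma^2 \\[4pt]
\mathrm{Cov}\!\left(X^{S}_{1}, X^{S}_{2}\right)
  &= \mathrm{E}\!\left[(1-U_1)(1-U_2)\right]\sigma^2
   = \tfrac{1}{4}\,\sigma^2
  \quad \text{(two synthetic samples sharing the parent } X\text{)}
\end{align*}
```

Under these assumptions the synthetic samples keep the class mean, have a smaller variance than the original minority samples, and are positively correlated whenever they share a parent sample (here with correlation (1/4)/(2/3) = 3/8).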

Conclusions

In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
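The recommended order of operations (variable selection first, then SMOTE, then k-NN with the Euclidean distance) can be sketched as below. This is an illustrative outline rather than the authors' code. The function names, the 0/1 class labels, the default parameter values and the use of a Welch t-statistic to rank variables are all assumptions made for the example. Only the Python standard library is used.

```python
import math
import random
from statistics import mean, variance


def select_variables(X, y, n_keep):
    """Indices of the n_keep variables with the largest |Welch t| (assumed criterion)."""
    scores = []
    for v in range(len(X[0])):
        a = [x[v] for x, label in zip(X, y) if label == 0]
        b = [x[v] for x, label in zip(X, y) if label == 1]
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        scores.append(abs(mean(a) - mean(b)) / se if se > 0 else 0.0)
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:n_keep]


def smote(minority, n_synthetic, k, rng):
    """Interpolate between minority samples and their k nearest neighbors."""
    k = min(k, len(minority) - 1)
    neighbors = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbors.append(others[:k])
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        j = rng.choice(neighbors[i])
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(minority[i], minority[j])])
    return synthetic


def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training samples (use odd k to avoid ties)."""
    nearest = sorted(range(len(X_train)), key=lambda i: math.dist(x, X_train[i]))[:k]
    votes = sum(y_train[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def smote_knn_predict(X, y, X_new, n_keep=20, k_smote=5, k_nn=5, seed=0):
    rng = random.Random(seed)
    # 1. Variable selection on the original training data, before SMOTE.
    keep = select_variables(X, y, n_keep)
    X_red = [[x[v] for v in keep] for x in X]
    # 2. SMOTE on the reduced data until both classes have the same size.
    minority_label = 1 if 2 * sum(y) < len(y) else 0
    minority = [x for x, label in zip(X_red, y) if label == minority_label]
    n_extra = len(y) - 2 * len(minority)
    X_bal = X_red + smote(minority, n_extra, k_smote, rng)
    y_bal = list(y) + [minority_label] * n_extra
    # 3. k-NN with Euclidean distance on the selected variables only.
    return [knn_predict(X_bal, y_bal, [x[v] for v in keep], k_nn) for x in X_new]
```

Calling `smote_knn_predict(X_train, y_train, X_test, n_keep=10)` returns one predicted label per row of `X_test`. In a real evaluation the variable selection would need to be repeated inside each cross-validation fold, so that the test samples never influence which variables are kept.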

Author and article information

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date (Collection): 2013

Publication date (Electronic): 22 March 2013

Volume: 14

Page: 106

Affiliations

[1] Institute for Biostatistics and Medical Informatics, University of Ljubljana, Vrazov trg 2, Ljubljana, Slovenia

Article

Publisher ID: 1471-2105-14-106

DOI: 10.1186/1471-2105-14-106

PMC ID: 3648438

PubMed ID: 23522326

SO-VID: 732c6452-92b5-4762-a7d7-9fe0e3a41efc

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 24 July 2012

Date accepted: 22 February 2013
