      SMOTE for high-dimensional class-imbalanced data

      research-article
      BMC Bioinformatics
      BioMed Central

          Abstract

          Background

          Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.
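For readers unfamiliar with the method, the interpolation step at the core of SMOTE can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `smote`, its parameters, and the choice to draw the base sample at random (rather than looping over every minority sample a fixed number of times) are assumptions made for brevity. Each synthetic point lies on the line segment between a minority sample and one of its k nearest minority-class neighbors under the Euclidean distance.

```python
import math
import random


def smote(minority, n_synthetic, k=5, rng=None):
    """Return n_synthetic points interpolated between minority samples
    and their k nearest minority-class neighbors (Euclidean distance).

    Hypothetical helper for illustration; needs at least two minority samples.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a base minority sample at random (a simplification).
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the remaining minority samples by distance to x.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(x, m))
        # Choose one of the k nearest neighbors and a random position
        # on the segment between x and that neighbor.
        neighbor = rng.choice(others[:k])
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbor)])
    return synthetic
```

Calling `smote(minority, n_majority - n_minority)` would produce enough synthetic samples to balance the two classes.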

          Results

          While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards the classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers for high-dimensional data if the number of variables is reduced by performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact the class-prediction for high-dimensional data.
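The statement about means, variability and correlation can be made plausible with a short calculation. The following is a sketch under simplifying assumptions that are not taken from the abstract: for a single variable, a synthetic value is written as S = (1-U)X + UX', where X and X' are the values in a minority sample and in its chosen neighbor, treated as independent draws with mean μ and variance σ², and U is uniform on [0, 1] and independent of both. The independence approximation is most reasonable when there are many variables, because a single variable then has little influence on which neighbor is chosen.

```latex
\begin{aligned}
S &= (1-U)\,X + U\,X', \qquad U \sim \mathrm{Uniform}(0,1),\\
\mathbb{E}[S] &= \mathbb{E}[1-U]\,\mu + \mathbb{E}[U]\,\mu = \mu,\\
\operatorname{Var}(S) &= \mathbb{E}\!\left[(1-U)^2 + U^2\right]\sigma^2
  = \left(\tfrac{1}{3} + \tfrac{1}{3}\right)\sigma^2 = \tfrac{2}{3}\,\sigma^2,\\
\operatorname{Cov}(S, X) &= \mathbb{E}[1-U]\,\sigma^2 = \tfrac{1}{2}\,\sigma^2 > 0.
\end{aligned}
```

Under these assumptions the mean is preserved, the variance of a synthetic value is smaller than that of an original value, and each synthetic sample is positively correlated with the sample it was generated from.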

          Conclusions

          In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
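The recommended order of operations (variable selection first, then SMOTE, then k-NN) can be sketched as below. This is an illustrative outline only: all function names and default parameters are hypothetical, variables are ranked by an absolute Welch t-statistic as one possible selection rule, class 1 is assumed to be the minority class, and in a real evaluation the selection step would have to be fitted on training data only.

```python
import math
import random
from statistics import mean, variance


def select_variables(X, y, n_keep):
    """Indices of the n_keep variables with the largest |Welch t-statistic|."""
    def t_stat(j):
        a = [x[j] for x, c in zip(X, y) if c == 0]
        b = [x[j] for x, c in zip(X, y) if c == 1]
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        return abs(mean(a) - mean(b)) / se if se > 0 else 0.0
    ranked = sorted(range(len(X[0])), key=t_stat, reverse=True)
    return sorted(ranked[:n_keep])


def smote(minority, n_new, k, rng):
    """Interpolate between minority samples and their k nearest neighbors."""
    out = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = sorted((m for j, m in enumerate(minority) if j != i),
                        key=lambda m: math.dist(x, m))
        nb = rng.choice(others[:k])
        u = rng.random()
        out.append([a + u * (b - a) for a, b in zip(x, nb)])
    return out


def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training samples (labels 0/1)."""
    nearest = sorted(range(len(X_train)),
                     key=lambda i: math.dist(X_train[i], x))[:k]
    votes = sum(y_train[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def classify(X, y, X_new, n_keep=2, k_smote=5, k_nn=3, seed=0):
    rng = random.Random(seed)
    # Step 1: reduce the number of variables before oversampling.
    keep = select_variables(X, y, n_keep)
    Xr = [[x[j] for j in keep] for x in X]
    # Step 2: oversample the minority class (label 1) until classes balance.
    minority = [x for x, c in zip(Xr, y) if c == 1]
    n_new = max((len(Xr) - len(minority)) - len(minority), 0)
    synth = smote(minority, n_new, k_smote, rng)
    X_bal = Xr + synth
    y_bal = list(y) + [1] * len(synth)
    # Step 3: k-NN on the reduced, balanced data.
    return [knn_predict(X_bal, y_bal, [x[j] for j in keep], k_nn)
            for x in X_new]
```

Running SMOTE on the reduced variable set, rather than on all variables, is the design choice the conclusions argue for; swapping steps 1 and 2 would give the configuration the authors advise against.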

                Author and article information

                Journal
                BMC Bioinformatics
                BioMed Central
                1471-2105
                2013
                22 March 2013
                Volume: 14
                Article number: 106
                Affiliations
                [1 ]Institute for Biostatistics and Medical Informatics, University of Ljubljana, Vrazov trg 2, Ljubljana, Slovenia
                Article
                1471-2105-14-106
                10.1186/1471-2105-14-106
                3648438
                23522326
                732c6452-92b5-4762-a7d7-9fe0e3a41efc
                Copyright ©2013 Blagus and Lusa; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 24 July 2012
                Accepted: 22 February 2013
                Categories
                Research Article

                Bioinformatics & Computational biology