
      Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data

      research-article


          Abstract

          All data are prone to error and require cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values, and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard, that is, returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42%, respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them, and removes duplications intelligently. This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
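The sensitivity and specificity figures quoted in the abstract depend on knowing which values are truly erroneous, which is why errors were simulated in a pre-cleaned dataset. The sketch below illustrates how such scores could be computed. It is not the authors' algorithm: the function names, the toy weights and the 0.5 to 60 kg cut-offs are all assumptions chosen for illustration, and the flagging rule is a plain fixed cut-off rather than a non-linear mixed effects model.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # simulated errors that were flagged
    fn: int  # simulated errors that were missed
    tn: int  # genuine values left untouched
    fp: int  # genuine values wrongly flagged

    @property
    def sensitivity(self) -> float:
        # Proportion of true errors that the cleaning method detected.
        total = self.tp + self.fn
        return self.tp / total if total else float("nan")

    @property
    def specificity(self) -> float:
        # Proportion of genuine values that the cleaning method preserved.
        total = self.tn + self.fp
        return self.tn / total if total else float("nan")


def flag_by_cutoff(values: list[float], lower: float, upper: float) -> list[bool]:
    """Hypothetical stand-in for a cleaning method: flag values outside [lower, upper]."""
    return [not (lower <= v <= upper) for v in values]


def score_flags(is_error: list[bool], flagged: list[bool]) -> Confusion:
    """Compare flags against the known positions of simulated errors."""
    if len(is_error) != len(flagged):
        raise ValueError("is_error and flagged must have the same length")
    pairs = list(zip(is_error, flagged))
    return Confusion(
        tp=sum(e and f for e, f in pairs),
        fn=sum(e and not f for e, f in pairs),
        tn=sum(not e and not f for e, f in pairs),
        fp=sum(not e and f for e, f in pairs),
    )


# Toy weight series (kg) with two injected errors: 80.0 and 3.0.
weights_kg = [4.1, 6.0, 80.0, 9.5, 3.0, 12.0]
is_error = [False, False, True, False, True, False]

result = score_flags(is_error, flag_by_cutoff(weights_kg, lower=0.5, upper=60.0))
print(result.sensitivity, result.specificity)
```

In this toy series the fixed cut-off catches the 80.0 kg entry but misses the 3.0 kg entry, which is implausible only in the context of the surrounding measurements. That is the kind of error a method that models the individual growth trajectory is meant to catch.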


                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
                Roles: Conceptualization, Funding acquisition, Investigation, Methodology, Software, Supervision, Writing – review & editing
                Roles: Conceptualization, Funding acquisition, Supervision, Writing – review & editing
                Roles: Conceptualization, Funding acquisition, Supervision, Writing – review & editing
                Roles: Conceptualization, Funding acquisition, Investigation, Project administration, Resources, Supervision, Writing – review & editing
                Role: Editor
                Journal
                PLoS ONE
                Public Library of Science (San Francisco, CA, USA)
                ISSN: 1932-6203
                Published: 24 January 2020
                Volume 15, Issue 1, e0228154
                Affiliations
                [1] The Roslin Institute, The University of Edinburgh, Easter Bush Campus, Midlothian, Edinburgh, United Kingdom
                [2] The Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush Campus, Midlothian, Edinburgh, United Kingdom
                Liverpool John Moores University, UNITED KINGDOM
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Author information
                http://orcid.org/0000-0001-9176-8389
                Article
                Manuscript number: PONE-D-19-16678
                DOI: 10.1371/journal.pone.0228154
                PMCID: PMC6980495
                PMID: 31978151
                © 2020 Woolley et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 12 June 2019
                Accepted: 9 January 2020
                Page count
                Figures: 4, Tables: 7, Pages: 21
                Funding
                Funded by: Biotechnology and Biological Sciences Research Council (funder ID: http://dx.doi.org/10.13039/501100000268)
                Award ID: BB/J01446X/1
                Funded by: Biotechnology and Biological Sciences Research Council (funder ID: http://dx.doi.org/10.13039/501100000268)
                Award ID: BB/J004235/1
                This work was supported by an Institute Strategic Programme Grant from the Biotechnology and Biological Sciences Research Council (https://bbsrc.ukri.org/) to the Roslin Institute [BB/J004235/1] and the lead author was funded by the Biotechnology and Biological Sciences Research Council under the EASTBIO (http://www.eastscotbiodtp.ac.uk/) doctoral training programme [BB/J01446X/1 to CW]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Physical Sciences > Mathematics > Applied Mathematics > Algorithms
                Research and Analysis Methods > Simulation and Modeling > Algorithms
                Research and Analysis Methods > Simulation and Modeling
                Computer and Information Sciences > Data Visualization
                Research and Analysis Methods > Research Assessment > Reproducibility
                Research and Analysis Methods > Research Design > Cohort Studies
                Biology and Life Sciences > Organisms > Eukaryota > Animals > Vertebrates > Amniotes > Mammals > Dogs
                Physical Sciences > Mathematics > Statistics > Statistical Data
                Biology and Life Sciences > Organisms > Eukaryota > Animals > Animal Types > Pets and Companion Animals
                Biology and Life Sciences > Zoology > Animal Types > Pets and Companion Animals
                Custom metadata
                Dogslife weight and height data was collected by the authors and is publicly available from the University of Edinburgh DataShare at https://doi.org/10.7488/ds/2569. SAVSNET data was obtained from a third party so cannot be shared for legal reasons but is available on request from https://www.liverpool.ac.uk/savsnet/using-savsnet-data-for-research/. Banfield data was obtained from a third party so cannot be shared for legal reasons but is published elsewhere at https://doi.org/10.1371/journal.pone.0182064 and can be requested from the authors of this publication. CLOSER data is publicly available and can be downloaded from the UK Data Service at http://doi.org/10.5255/UKDA-SN-8207-1.
