      Can synthetic data be a proxy for real clinical trial data? A validation study

          Abstract

          Objectives

          There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data.

          Setting

          Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method.

          Participants

          Our analysis included 1543 patients from the control arm.

          Primary and secondary outcome measures

          Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets.
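
The abstract does not name the information theoretic metric used for the univariate comparison. As a minimal sketch, assuming the Hellinger distance as one possible choice (an assumption made here for illustration, not a detail confirmed by the abstract), a single categorical variable could be compared between the real and synthetic datasets as follows. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between the empirical distributions of two
    categorical samples: 0 means identical, 1 means no shared categories."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)

    # Compare over every category seen in either sample; a Counter
    # returns 0 for categories it has not seen.
    categories = set(real_counts) | set(synth_counts)
    total = 0.0
    for category in categories:
        p = real_counts[category] / n_real
        q = synth_counts[category] / n_synth
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2.0)


# Made-up toy data, NOT taken from the trial: a binary variable whose
# prevalence is 20% in the "real" sample and 22% in the "synthetic" one.
real = ["yes"] * 20 + ["no"] * 80
synthetic = ["yes"] * 22 + ["no"] * 78

distance = hellinger_distance(real, synthetic)
print(f"Hellinger distance: {distance:.4f}")
```

Because the distance is bounded between 0 and 1, it can be read as a percentage difference, which is one way a statement such as "within 1%" could be operationalised.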

          Results

          Analysis results were similar between the real and synthetic datasets. The univariate distributions differed by less than 1% on an information theoretic metric. All of the bivariate relationships had a CI overlap above 50% on the tau statistic. The main conclusion of the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally. The HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1).
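
The abstract does not spell out how percentage CI overlap is computed. A minimal sketch, assuming the commonly used definition (the length of the shared interval divided by the length of each CI, averaged over the two), is shown below. Applied on the HR scale to the intervals reported above, this assumed definition gives values that round to the stated 61% and 86%. The function name is hypothetical, and clamping disjoint intervals to zero is a simplification chosen for this sketch.

```python
def ci_overlap(real_ci, synth_ci):
    """Average fraction of each confidence interval covered by the other.

    Each argument is a (lower, upper) tuple. Returns a value in [0, 1].
    Assumed definition: mean of shared_length / length over both intervals.
    """
    real_lo, real_hi = real_ci
    synth_lo, synth_hi = synth_ci

    # Length of the region the two intervals have in common.
    shared = min(real_hi, synth_hi) - max(real_lo, synth_lo)
    if shared <= 0:
        # Simplification for this sketch: disjoint intervals score zero.
        return 0.0

    return 0.5 * (shared / (real_hi - real_lo) + shared / (synth_hi - synth_lo))


# Hazard ratio 95% CIs as reported in the Results paragraph.
overall_survival = ci_overlap((1.11, 2.2), (1.44, 2.87))
disease_free_survival = ci_overlap((1.18, 1.95), (1.26, 2.1))

print(f"Overall survival overlap: {overall_survival:.0%}")            # 61%
print(f"Disease-free survival overlap: {disease_free_survival:.0%}")  # 86%
```

Under this definition the overlap equals 100% only when the two intervals coincide, so it penalises both a shift in the estimate and a difference in precision.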

          Conclusions

          The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets.

          Trial registration number

          NCT00079274.

                Author and article information

                Journal
                BMJ Open
                BMJ Publishing Group (BMA House, Tavistock Square, London, WC1H 9JR)
                ISSN 2044-6055
                Published 16 April 2021; volume 11, issue 4, e043497
                Affiliations
                [1] Center for Outcomes Research and Evaluation, Faculty of Medicine, McGill University, Montreal, Québec, Canada
                [2] Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada
                [3] Medicine, McGill University, Montreal, Québec, Canada
                [4] Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, Montreal, Québec, Canada
                [5] Electronic Health Information Laboratory, Children’s Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
                [6] School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada
                Author notes
                Correspondence to Dr Khaled El Emam; kelemam@ehealthinformation.ca
                Author information
                http://orcid.org/0000-0002-6159-0628
                http://orcid.org/0000-0003-3325-4149
                Article
                Publisher ID: bmjopen-2020-043497
                DOI: 10.1136/bmjopen-2020-043497
                PMCID: PMC8055130
                PMID: 33863713
                © Author(s) (or their employer(s)) 2021. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.

                This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:  http://creativecommons.org/licenses/by-nc/4.0/.

                History
                : 06 August 2020
                : 14 January 2021
                : 18 March 2021
                Funding
                Funded by: The GOING-FWD Consortium is funded by the GENDER-NET Plus ERA-NET Initiative;
                Award ID: GNP-78
                Funded by: FundRef http://dx.doi.org/10.13039/501100000024, Canadian Institutes of Health Research;
                Award ID: GNP-161904
                Funded by: FundRef http://dx.doi.org/10.13039/501100000038, Natural Sciences and Engineering Research Council of Canada;
                Award ID: RGPIN-2016-06781
                Categories
                Health Informatics
                Original research

                Medicine
                epidemiology,health informatics,information management,information technology,statistics & research methods