The DISTANCE Model for Collaborative Research: Distributing Analytic Effort Using Scrambled Data Sets

Background: Data-sharing is encouraged to fulfill the ethical responsibility to transform research data into public health knowledge, but data sharing carries risks of improper disclosure and potential harm from release of individually identifiable data. Methods: The study objective was to develop and implement a novel method for scientific collaboration and data sharing which distributes the analytic burden while protecting patient privacy. A procedure was developed wherein an investigator who is external to an analytic coordinating center (ACC) can conduct original research following a protocol governed by a Publications and Presentations (P&P) Committee. The collaborating investigator submits a study proposal and, if approved, develops the analytic specifications using existing data dictionaries and templates. An original data set is prepared according to the specifications and the external investigator is provided with a complete but de-identified and shuffled data set which retains all key data fields but which obfuscates individually identifiable data and patterns; this “scrambled data set” provides a “sandbox” for the external investigator to develop and test analytic code for analyses. The analytic code is then run against the original data at the ACC to generate output which is used by the external investigator in preparing a manuscript for journal submission. Results: The method has been successfully used with collaborators to produce many published papers and conference reports. Conclusion: By distributing the analytic burden, this method can facilitate collaboration and expand analytic capacity, resulting in more science for less money.


Background
"Data should be made as widely and freely available as possible while safeguarding the privacy of participants and protecting confidential and proprietary data." (from NIH Statement on Sharing Research Data, February 26, Primary data collection and cohort creation are expensive endeavors, and the data generated typically far exceed the analytic capacity and time frame supported by the original grant. Data-sharing is encouraged to fulfill the ethical responsibility to transform research data into public health knowledge; the National Institutes of Health require a data-sharing plan for research applications requesting $500,000 or more of direct costs in any single year. [2] However, data sharing carries risks of improper disclosure and potential harm from release of individually identifiable data. The Privacy Rule, [3] as part of a federal mandate to safeguard the rights and welfare of human subjects, provides a framework by which health information can be shared (disclosed) for research purposes. Health information which has been "de-identified" may be used and disclosed freely, as it is no longer considered protected health information. [4] There are two approaches to data de-identification: Expert Determination Method or Safe Harbor Method. [4] The Safe Harbor Method requires "the removal of specified individual identifiers as well as absence of actual knowledge by the covered entity that the remaining information could be used alone or in combination with other information to identify the individual." [4] However, while the Privacy Rules permits de-identified data sets to be shared freely, the responsible covered entity may choose to restrict disclosures; further, de-identified data sets are generally regarded as being of limited value because, typically, relevant data have been removed [5].
The authors here describe a protocol for making de-identified data more productive by enabling an external investigator to collaborate with an analytic coordinating center (ACC). The ACC de-identifies and then shuffles data to create "scrambled data sets," a process which deletes or obfuscates individually identifiable data and patterns while leaving the population characteristics intact. A scrambled data set is useful to the external investigator as a sandbox in which to develop and test statistical code. That code is then run against the original data at the ACC, generating output which is used in preparing a manuscript. As the ACC analysts' time is often a limiting factor for productivity in multi-site studies, this method of collaboration based on shared effort and distributed data analysis has been used to leverage resources for greater productivity.
The Diabetes Study of Northern California (DISTANCE) began in 2005 as a survey follow-up study among a racially stratified cohort of 20,000 patients with diabetes (www.distancesurvey.org). [6] The survey data has been linked to extensive data from the Kaiser Permanente Northern California (Kaiser) electronic health record and the Kaiser Diabetes Registry, which was established in 1994 and currently includes over 230,000 patients with diabetes [7]. Today, the DISTANCE collaboration involves over 40 scientists from multiple institutions and is guided by the Publications and Presentations Committee (P&P) which strives to: (i) ensure accurate, uniform, timely, and high quality reporting of research findings; (ii) preserve the scientific integrity of the study; and (iii) safeguard the rights and confidentiality of participants. The P&P oversees the ACC where final research data resides and all analyses are performed.
Because the procedure described here uses de-identified data, it is not necessarily subject to human subjects protection rules; de-identified data may be used and disclosed freely, as it is no longer considered protected health information. [4] However, in every instance in which we have applied this procedure, the use of original data has been approved by the Institutional Review Board of the Kaiser Foundation Research Institute. The following is a generalized description of the DISTANCE collaborative research method which distributes the analytic effort using scrambled data sets.

Methods
An external investigator with an idea for a study based on ACC data submits a written proposal to the P&P. The ACC provides the investigator with manuscript writing guidelines, data dictionaries and sample statistical specifications. The investigator follows a well delineated protocol and accepts responsibility for some of the analytic effort outside of the ACC (Figure 1). The P&P must give approval for any effort ("manuscript") intended to result in a publication, whether journal article, conference abstract/presentation or public report. The investigator is responsible for adhering to ACC policies and guidelines and for producing the final manuscript for publication. The investigator must possess the necessary skills, or have a qualified analyst, to carry out the analysis required by the proposal.
After P&P approval of the proposal, the investigator works with an assigned ACC analyst to develop detailed analytic specifications. It is helpful for the investigator to review specifications from previous analyses and become familiar with existing data which may include survey responses, clinical and administrative measures and various derived variables. In development of the specifications document, edit mode in word processing is used to track the refinements by investigator and analyst, recording their discussions about questions, comments or changes. In addition, the analytic plan is often presented at collaborator meetings for group feedback.
Figure 1 legend: From idea to publication. This flowchart illustrates the steps for the DISTANCE collaborative research method which distributes analytic effort outside the ACC using scrambled data.
In most cases, but especially in studies aiming to make causal inferences, investigators prepare a directed acyclic graph (DAG), which is a conceptualization of the causal framework underlying the proposed study and which graphically displays assumed causal relationships between variables in the analysis, based on subject-matter knowledge. [8,9] A DAG can help clarify the a priori assumptions, identify potential confounders and mediators and avoid missing important covariates in the initial steps of building a data set. Thus, in addition to developing the relevant conceptual models, analysis of the DAG facilitates model development, with the aim of specifying the most parsimonious statistical model.
During development of the analytic specifications, the ACC analyst advises the investigator on the available data and its limitations, assists in defining the data cut points or transformations, or suggests analytic strategies and model specifications. In particular, the ACC analyst provides the investigator with univariate statistics for variables in the proposed study data set to facilitate an understanding of variable distributions and rates of data missingness. The investigator and analyst review existing cohorts and derived variables to minimize duplication of effort and to use study resources most economically. In many cases, an existing cohort or data set can be used, but additional or updated clinical or administrative health plan data may also be required for the analysis. In some cases, an existing data set can be used for which a scrambled data set has already been prepared. Collaborating clinicians or other members of the writing group can help identify potential covariates, confounders or mediators of particular clinical measures. Clinical data archiving is often very complex, and ACC analysts have background knowledge that can prove invaluable when designing a study. Issues such as changes in the availability and quality of clinical measures over time and changes in methods of measurement are taken into consideration when creating any variable derived from clinical or administrative data.
Once the specifications are complete (Table), the ACC analyst prepares the "original" data set containing only the data elements necessary for the proposed analysis. The ACC analyst then prepares the scrambled data set (described below) which the investigator will use to develop and test analytic code. The analytic work can be shared using any statistical software that is available to both the investigator and ACC analyst; however, if the external investigator and analyst have different versions of the same software, this can present a challenge which is best identified at the beginning of the process.
Of the identifiers addressed by the Privacy Rule, [3,10] typically only medical record numbers and dates are relevant to the proposed research. Medical record numbers are replaced with anonymous study identification numbers (Figure 2). Dates of birth or medical events (e.g., appointments, procedures, hospitalizations) are perturbed by adding or subtracting a random number of days (e.g., ± 0-365) to each date. Alternatively, especially for longitudinal studies, an index or baseline date (e.g., a diagnosis date, baseline survey date or first medication dispensing date) can be identified and perturbed, and then all other dates can be converted to a number representing days pre- or post-baseline.
Figure 2 legend: In this small, mock dataset, original data is transformed into scrambled data. Gender is randomly reordered; each birth date has a random number added or subtracted; a set of smoking questions is randomly reordered; height, weight and calculated body mass index are randomly reordered.
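The identifier replacement and date perturbation described above can be illustrated with a minimal Python sketch. It is not the ACC's actual code: the function names, the field names (mrn, birth_date, study_id) and the representation of records as dictionaries are all assumptions made for illustration.

```python
import random
from datetime import date, timedelta


def deidentify(records, id_field="mrn", date_fields=("birth_date",),
               max_shift_days=365, seed=None):
    """Replace record numbers with study IDs and perturb each date.

    All field names here are hypothetical; the input records are not modified.
    """
    rng = random.Random(seed)
    # Hand out study IDs in random order so they carry no trace of the MRN.
    study_ids = list(range(1, len(records) + 1))
    rng.shuffle(study_ids)
    out = []
    for rec, sid in zip(records, study_ids):
        new = {k: v for k, v in rec.items() if k != id_field}
        new["study_id"] = sid
        for name in date_fields:
            if new.get(name) is not None:
                # Add or subtract a random number of days (0 to max_shift_days).
                shift = rng.randint(-max_shift_days, max_shift_days)
                new[name] = new[name] + timedelta(days=shift)
        out.append(new)
    return out


def to_baseline_offsets(rec, baseline_field, other_date_fields,
                        max_shift_days=365, rng=None):
    """Longitudinal alternative: perturb the baseline date and express
    every other date as days before (negative) or after (positive) baseline."""
    rng = rng or random.Random()
    baseline = rec[baseline_field]
    new = dict(rec)
    for name in other_date_fields:
        if rec.get(name) is not None:
            new[name] = (rec[name] - baseline).days
    shift = rng.randint(-max_shift_days, max_shift_days)
    new[baseline_field] = baseline + timedelta(days=shift)
    return new
```

In the second function the offsets are computed from the true baseline before it is perturbed, so intervals between events are preserved while the calendar dates themselves are not.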
In preparing the scrambled data set, the complete variable structure (and population characteristics) of the data remains intact, but all individually identifiable data are replaced or randomly modified so that individually identifiable patterns are disrupted. There is no technical novelty in this approach to de-identification (also known as "data shuffling" [11]), but a description of this simple method is provided here.
The scrambling process is most simply described for a single data set with rows (observations) and columns (variables), although data with more complicated database architecture (e.g., many-to-many structure) can be accommodated. Each cell within a given column is assigned a random number and then sorted (e.g., low to high) by the assigned random numbers. This process is repeated for each individual column. Sets of columns representing variables that form a scale, derived variable or index (e.g., smoking questions or height and weight with calculated BMI) are randomly sorted as a group in order to maintain their internal validity. The scrambling protocol thus disrupts patterns which could identify an individual (e.g., combinations such as an individual's gender plus smoking status plus weight plus age).
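A minimal Python sketch of this column-wise shuffling, for the simple case of a single rectangular data set, is given here. The function name scramble, the groups parameter and the list-of-dictionaries layout are illustrative assumptions rather than the authors' implementation.

```python
import random


def scramble(rows, groups=(), seed=None):
    """Shuffle each column independently; columns named together in
    `groups` (e.g., height, weight and BMI) are shuffled as one unit."""
    if not rows:
        return []
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    grouped = {c for g in groups for c in g}
    # Every ungrouped column becomes its own single-column unit.
    units = [tuple(g) for g in groups]
    units += [(c,) for c in columns if c not in grouped]
    scrambled = [dict() for _ in rows]
    for unit in units:
        # Assign a random number to each row of this unit, then sort by it.
        keyed = sorted((rng.random(), i) for i in range(len(rows)))
        for target, (_, source) in zip(scrambled, keyed):
            for col in unit:
                target[col] = rows[source][col]
    # Return rows with the original column order.
    return [{c: r[c] for c in columns} for r in scrambled]


if __name__ == "__main__":
    mock = [
        {"sex": "F", "height_cm": 160, "weight_kg": 60, "bmi": 23.4},
        {"sex": "M", "height_cm": 180, "weight_kg": 90, "bmi": 27.8},
        {"sex": "F", "height_cm": 170, "weight_kg": 65, "bmi": 22.5},
    ]
    for row in scramble(mock, groups=[("height_cm", "weight_kg", "bmi")]):
        print(row)
```

Because height, weight and BMI move together, each scrambled row still contains an internally consistent triple, while the link between that triple and the row's other values is broken.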
The scrambled data set has a structure identical to the original data set and retains actual values for each variable (except dates) from the original dataset. The scrambled data set is emailed to the external investigator to develop and test analytic code which should run equally well against the original data. In the scrambled dataset, nonmarginal statistics and associations are meaningless, but missingness and marginal summary statistics (e.g., mean patient weight) for each variable or derived variable are accurate and valid. This allows the investigator to characterize the study population (corresponding to the manuscript Table 1) directly from the scrambled data.
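The claim that missingness and marginal statistics survive scrambling can be checked mechanically. The sketch below is one way to do so, using a hand-made pair of mock data sets; the helper names and the use of None to mark a missing value are assumptions made for illustration.

```python
from collections import Counter


def marginal_summary(rows, column):
    """Row count, missing count and value frequencies for one column."""
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "counts": Counter(present),
    }


def same_marginals(original, scrambled, columns):
    """True if every listed column has identical marginals in both data sets."""
    return all(
        marginal_summary(original, c) == marginal_summary(scrambled, c)
        for c in columns
    )


# Mock data: the second list is one possible scrambling of the first,
# with each column permuted independently.
original = [
    {"sex": "F", "weight_kg": 60},
    {"sex": "M", "weight_kg": None},
    {"sex": "F", "weight_kg": 82},
]
scrambled = [
    {"sex": "M", "weight_kg": 82},
    {"sex": "F", "weight_kg": 60},
    {"sex": "F", "weight_kg": None},
]

print(same_marginals(original, scrambled, ["sex", "weight_kg"]))
```

Identical value frequencies imply identical means, medians and proportions for each variable, which is why the Table 1 characteristics can be drafted from the scrambled data. Cross-tabulations such as weight by sex are not preserved.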
Once the code runs without error on the scrambled dataset, it is emailed to the ACC analyst to run against the original dataset. The ACC analyst corrects minor coding errors as needed and sends to the investigator the output (stripped of individually identifiable data, if any), notes on any changes made to the code and, if needed, the log files. If more complex errors occur, particularly in the code for model specification, the analyst alerts the investigator and asks for revised code. The process is repeated until analyses are complete. By having the collaborating investigator develop the analytic code, a substantial burden is removed from the ACC analyst, whose time may be a limiting factor, and thus ACC productivity is increased.
During this iterative process, the ACC analyst frequently runs code without closely checking the external investigator's methodology or the output. These unmonitored runs are time-saving and acceptable during the development of the model or method, given that the models often change. However, once the process approaches its final iteration, the investigator will ask the ACC analyst to review and approve the code and final output before the investigator prepares the draft manuscript. Once a draft manuscript is completed, the ACC analyst performs a final review, checking the appropriateness of analysis and the consistency between output and manuscript. As with all manuscripts, the investigator actively involves coauthors throughout the process. Targeted calls at critical junctures (e.g., to discuss specifications or focus of the manuscript or discuss a reviewer's comments) are very useful.
When the investigator and the ACC analyst are satisfied and all co-authors have given their final approvals, the journal-ready manuscript is submitted to P&P for final review and approval. After institutional approvals are obtained, the manuscript is ready for submission to the target journal.
The increased ACC productivity motivated the development of a database to track the progress of each manuscript from proposal to publication and to monitor the workload of each ACC analyst. Each manuscript is linked to its supporting grants and, upon publication, the database record is completed with its PubMed hyperlink, PubMed ID, PubMed Central ID and a 100-word summary. The database is also used in generating progress reports.
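As a rough illustration of what one record in such a tracking database might hold, the following Python sketch models a manuscript and a per-analyst workload count. The class, field and status names are hypothetical, as is the treatment of the 100-word summary as an upper limit; the source does not describe the actual database schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManuscriptRecord:
    """One tracked manuscript; all field names are hypothetical."""
    title: str
    investigator: str
    acc_analyst: str
    status: str = "proposal"  # assumed stage label, e.g. proposal -> published
    grants: List[str] = field(default_factory=list)
    pubmed_id: Optional[str] = None
    pmc_id: Optional[str] = None
    pubmed_url: Optional[str] = None
    summary: Optional[str] = None

    def mark_published(self, pubmed_id, pmc_id, summary):
        """Complete the record once the manuscript is published."""
        if len(summary.split()) > 100:
            raise ValueError("summary should not exceed 100 words")
        self.pubmed_id = pubmed_id
        self.pmc_id = pmc_id
        self.pubmed_url = f"https://pubmed.ncbi.nlm.nih.gov/{pubmed_id}/"
        self.summary = summary
        self.status = "published"


def analyst_workload(records):
    """Count manuscripts still in progress for each ACC analyst."""
    return Counter(r.acc_analyst for r in records if r.status != "published")
```

A relational table with the same columns would serve equally well; the point is only that each manuscript carries its grants, analyst, status and publication identifiers in one place.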
The DISTANCE collaborative research method uses scrambled data sets and a protocol which distributes some of the analytic effort outside of the ACC but presents no risk to patient privacy. Scrambled data sets provide a "sandbox" for investigators external to the primary data collection site to develop and test code for statistical models; because they are based on real data, they allow the external investigator to preview summary statistics (e.g., for the manuscript's Table 1) or to independently assess univariate data patterns (e.g., to identify appropriate cut points to categorize variables). In general, the external investigator is responsible for developing the proposal, specifications, analytic code, interpretation of analytic output and preparation of a manuscript for journal submission; the ACC analyst is responsible for preparing the original and scrambled data sets, running the code against the original data set and reviewing the final code and manuscript. The P&P provides guidance and oversight to ensure scientific integrity and quality.

Limitations
The specific method of shuffling data is not novel and there are likely other ways to accomplish the same end. [24] For example, instead of scrambling the data, one could insert random values or dates; however, basic characteristics of the data, such as means, would be lost with no saving of effort. While it is possible that de-identified data could be re-identified [25], the scrambling procedure eliminates the patterns which might permit re-identification and loss of privacy. Occasionally, the back-and-forth in the development of the specifications and the analytic code creates delays, but the external investigator usually drives the process: typically, the code is submitted to the ACC analyst, run on the original data and promptly returned. This protocol works well even when the model specification becomes very complex.
This method is most compatible with hypothesis-driven research; it is less compatible with "data mining" since associations observed in the scrambled data sets are not meaningful. It expands collaborative opportunities to external investigators, especially junior faculty and fellows who may have sufficient funding (e.g., a K-award) to cover their time and who want to hone their analytic skills but lack access to quality data. This approach has been used successfully with many outside investigators, including a former doctoral student (Dr. Lyles) who used scrambled data sets to produce her dissertation and four peer-reviewed journal articles. [13,14,15] The protocol has been replicated and used by the P&P of a DISTANCE sub-study ("Diabetes and Aging in a Multi-ethnic Population," R01-DK081796), and we are developing the methodology to create and manage scrambled datasets for a longitudinal study with differential follow-up across subjects.
While the initial creation of a scrambled data set requires effort, it is small compared to the effort saved by distributing some of the analytic effort to the external investigator. Additionally, scrambled data sets can be reused or amended: the scrambled DISTANCE data set, based on subject responses to the 2005-2006 DISTANCE Survey [6], can be used for subsequent research questions or amended with additional or updated clinical or administrative scrambled data.
The DISTANCE collaborative research method has advantages over other common methods of data sharing and collaboration. Unlike public data archives or limited data sets, there is no risk to confidentiality. Unlike typical de-identified data sets, there is no loss of data quality and the original data set can be easily supplemented with additional data or updated with new follow-up data. The scrambling method is simple and effective and, unlike encryption, cannot be undone with any key; re-identification is not possible without access to the scrambling records which were applied to the original data. Unlike typical analytic coordinating centers, the analysts' time is much less of a limiting factor, so there is no significant bottleneck or loss of productivity. Unlike data enclaves, there is no need to maintain office space or computers designated as secure data access points. There is no need to track the custody of data or its disposition and no risk of improper disclosure, though data agreements are advisable. This method avoids other legal, technical and cultural barriers to data sharing that often complicate multi-site studies, such as the administrative workload associated with executing data use agreements or the other paperwork required when patient data is involved. Scrambled data sets could also be used in studies in which access to original data depends on a lengthy approval process; an investigator could develop analytic code on a scrambled dataset while awaiting receipt of original data. Studies which use a common data model could also use this protocol.

Conclusions
The DISTANCE collaborative research method distributes the analytic effort using scrambled datasets and has been used successfully in collaborations with external investigators. The process creates minimal burden on the ACC and mitigates analytic bottlenecks while also eliminating the risk of improper disclosure of confidential patient data, thereby addressing some of the privacy concerns endemic to collaborative data-sharing endeavors. Finally, it has greatly expanded analytic capacity, resulting in more science for less money.