The production of healthcare data: ensuring that the system starts and ends with the patient

Nicolas Garcelon - Data Science Platform - Imagine Institute of Genetic Diseases

The first case studies written for educational purposes date from 1600 BC and are described on an Egyptian papyrus (Al-Awqati 2006); documents describing patients can be found throughout antiquity. From the 17th century onwards the practice expanded, always with the aim of teaching and of anatomical and diagnostic research (Gillum 2013), but it was only in Paris and Berlin in the early 19th century that the first patient file appeared, namely a file designed to support the treatment and management of the patient (Hess 2010). In France, the 1920s saw the emergence of a middle class that refused to be treated in hospitals reserved for the poor and indigent but lacked the income to seek treatment in private clinics. This led, in the 1950s, to a process of ‘humanisation’ of hospitals, in which care was centred on the patient (Nardin 2010), and to a change in the methods used to collect clinical data. From the 1970s, the patient file became more structured, taking the form of nursing records, medical reports and discharge statements. In the late 1990s, the reports were computerised and became even more structured. Computerisation then accelerated, in particular for the assessment of costs and the reimbursement of treatments (Medical Information Systems Program (PMSI), Diagnosis-Related Groups (DRG)). Computerised patient files first appeared in hospitals in the 2000s, and their use became general practice in 2010.

With the computerisation of the patient file, the reuse of these data for research, hospital management and teaching has become increasingly immediate, to the point where computerised patient files are no longer developed merely to improve patient treatment but to facilitate the reuse of the associated information. This change of focus has made completing the files laborious and ultimately ineffective. We have observed that doctors prefer to fill in free-text fields rather than tick boxes. Free text also allows them to be more precise in setting out their reasoning and in recording doubts, the absence of signs, or diagnostic hypotheses (Hanauer et al. 2015; Raghavan et al. 2014; Shivade et al. 2014). This is especially important in the context of rare diseases, where free text remains the ideal means of preserving the phenotypic richness of the patient’s case while keeping the patient-doctor relationship centred on narrative medicine (Charon 2012). The point is not to exclude coded data from the computerised patient file but, rather, to find an appropriate balance. It would be unrealistic to expect the clinician to record all of the necessary information at the time when treatment is provided.


Medical informatics teams or data scientists must therefore develop methods that make it possible to reuse the data produced during treatment for the purposes of research, teaching and management, without skewing their primary purpose: enabling the treatment of patients (Rosenbloom et al. 2011).

The Institut Imagine’s data science platform forms part of this approach, developing methods and software for clinicians and researchers. In recent years, we have developed a document-oriented data warehouse (Dr Warehouse®) intended to serve three usage scenarios: clinical research, hypothesis generation or data mining, and translational research. Dr Warehouse is installed at the Necker Children’s Hospital and currently contains data on 445,000 patients, 3.4 million medical documents/reports, and 19 million coded results (biology, PMSI). Our experience with it has made it possible to establish certain general principles as to the value of a tool of this kind.

From dirty data to smart data

Data warehouse and clinical research

One of the primary scenarios in which a data warehouse is valuable is locating patients who meet the inclusion criteria for clinical studies. Clinicians are used to Google and PubMed, and need a simple, intuitive search engine that accepts free-text queries. To cover the diverse range of inclusion criteria, it must be possible to search both text and structured data.
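A minimal sketch of a query that combines free-text and structured criteria is given here for illustration. It assumes a hypothetical in-memory `Patient` record and a naive substring match; neither the data model nor the `search` function is taken from Dr Warehouse.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    """Hypothetical in-memory patient record (not the Dr Warehouse schema)."""
    patient_id: str
    birth_year: int
    documents: list = field(default_factory=list)  # free-text reports
    codes: set = field(default_factory=set)        # coded data, e.g. diagnosis codes


def search(patients, text_terms=(), required_codes=(), born_after=None):
    """Return the ids of patients matching every free-text and structured criterion."""
    hits = []
    for p in patients:
        full_text = " ".join(p.documents).lower()
        # Free-text criterion: every term must appear somewhere in the reports.
        if not all(term.lower() in full_text for term in text_terms):
            continue
        # Structured criteria: coded results and demographic data.
        if not set(required_codes) <= p.codes:
            continue
        if born_after is not None and p.birth_year <= born_after:
            continue
        hits.append(p.patient_id)
    return hits


# Invented example data, for illustration only.
patients = [
    Patient("P1", 2012, ["Recurrent fever and abdominal pain."], {"K50"}),
    Patient("P2", 2005, ["Abdominal pain, no fever."], {"K50"}),
    Patient("P3", 2015, ["Recurrent fever since birth."], set()),
]
combined = search(patients, text_terms=["fever"], required_codes=["K50"], born_after=2010)
text_only = search(patients, text_terms=["fever"])
print(combined, text_only)
```

In this toy data the text-only query also returns P2, whose report says ‘no fever’: a plain text match cannot tell a finding from its negation.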

Working with free-text data inevitably produces false positives. Users must therefore be able to display the results and understand why a patient has been found, so that they can exclude that patient from the results or refine the search criteria. Substantial work remains to be done on ergonomics and interface design to offer users a simple, effective tool: overview of results, highlighting of terms found, etc.

Moreover, natural language processing methods must be developed to reduce the number of false positives caused by negated findings or mentions of family history (‘no diabetes’, ‘the father has Crohn’s disease’, etc.), particularly for a data warehouse dealing with paediatric patients and rare diseases. Most of the tools developed in this area are designed for English, and corresponding methods therefore need to be developed for French.
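As an illustration of the kind of rule involved, the sketch below labels a mention as negated, as family history, or as concerning the patient, in the spirit of trigger-based approaches such as NegEx. The cue lists, the function names and the rule of looking only at the text before the term are simplifying assumptions, not a description of the methods used in Dr Warehouse.

```python
import re

# Illustrative trigger lists only; a real system needs far richer,
# language-specific lexicons and a proper notion of scope.
NEGATION_CUES = ["no", "not", "without", "denies", "pas de", "sans", "absence de"]
FAMILY_CUES = ["father", "mother", "brother", "sister", "père", "mère", "frère", "soeur"]


def _has_cue(text, cues):
    """Return True if any cue occurs as a whole word or phrase in text."""
    return any(re.search(r"\b" + re.escape(cue) + r"\b", text) for cue in cues)


def classify_mention(sentence, term):
    """Label a term mention as 'absent', 'family', 'negated' or 'patient'.

    Only the text preceding the term within the sentence is inspected,
    which is a deliberate simplification.
    """
    lowered = sentence.lower()
    position = lowered.find(term.lower())
    if position == -1:
        return "absent"
    context = lowered[:position]
    if _has_cue(context, FAMILY_CUES):
        return "family"
    if _has_cue(context, NEGATION_CUES):
        return "negated"
    return "patient"


for sentence, term in [
    ("No diabetes.", "diabetes"),
    ("The father has Crohn's disease.", "Crohn's disease"),
    ("Pas de diabète.", "diabète"),
    ("The patient presents with diabetes.", "diabetes"),
]:
    print(sentence, "->", classify_mention(sentence, term))
```

A search engine could then keep only the mentions labelled `patient`, or display the label next to each hit so that the user can see why a document was retained or discarded.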

Offering different user levels allows the system to serve a wider population (experts and non-experts). For experts, an advanced search engine can be used to refine the search criteria (time constraints, minimum follow-up) or to broaden a search (terminology expansion: synonyms, hyponyms, etc.).
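Terminology expansion can be sketched as follows. The toy `SYNONYMS` and `HYPONYMS` tables and the `expand_term` function are hypothetical; a real deployment would rely on a curated terminology rather than hand-written dictionaries.

```python
# Hypothetical toy terminology, for illustration only.
SYNONYMS = {
    "seizure": ["convulsion", "epileptic fit"],
}
HYPONYMS = {
    "seizure": ["febrile seizure", "absence seizure"],
    "absence seizure": ["atypical absence seizure"],
}


def expand_term(term, include_hyponyms=True):
    """Return the term, its synonyms and, optionally, all of its narrower terms."""
    expanded = [term] + SYNONYMS.get(term, [])
    if include_hyponyms:
        # Walk the hierarchy breadth-first so that hyponyms of hyponyms are included.
        queue = list(HYPONYMS.get(term, []))
        while queue:
            narrower = queue.pop(0)
            if narrower not in expanded:
                expanded.append(narrower)
                queue.extend(HYPONYMS.get(narrower, []))
    return expanded


print(expand_term("seizure"))
print(expand_term("seizure", include_hyponyms=False))
```

The expanded list can then be submitted as a disjunction (any of these terms), which widens recall at the cost of more results for the user to review.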

It is essential that the user fully understands why a patient is found and how that patient is linked to the query submitted. False positives will be accepted if the tool enables the user to detect them immediately.

One of the principal arguments in favour of systematic coded entry of medical data in the patient file is the time spent re-entering data for research. However, ‘blind’ coding, namely coding without knowing how the data will be used, raises questions about the quality of the coding (exhaustiveness), the loss of information and therefore the actual effectiveness of the process. It is more a question of offering clinicians tools that make it easier to search a patient’s data (data mining) and thus save the time associated with re-entry. Numerous projects are under way in the medical informatics community on the increasingly precise extraction of data from text.


Data warehouse and high-throughput phenotyping

Independently of the ad hoc databases used for research, it is possible to extract new information from text-based clinical data. The ability to obtain, in a few simple clicks, a phenotype description (signs, symptoms, diagnosis, demographic data) relevant to a population is particularly valued by clinicians and researchers. Unlike traditional knowledge bases, a data warehouse makes it possible to weight the relevance of a phenotype automatically on the basis of actual data, which is a great strength.
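One simple way to weight phenotypes from actual data is a TF-IDF-style score, sketched below as an assumption rather than as the method implemented in Dr Warehouse. A phenotype scores highly when it is frequent in the cohort of interest but rare across the warehouse as a whole. The `rank_phenotypes` function and the example data are hypothetical.

```python
import math
from collections import Counter


def rank_phenotypes(cohort, warehouse):
    """Rank the phenotypes of a cohort by a TF-IDF-style score.

    Both arguments map patient ids to the set of phenotype terms extracted
    from that patient's documents. The cohort is assumed to be a subset of
    the warehouse, so every cohort term also occurs in the warehouse.
    """
    n_total = len(warehouse)
    # Number of patients in the whole warehouse presenting each phenotype.
    warehouse_counts = Counter(term for terms in warehouse.values() for term in terms)
    # Number of patients in the cohort presenting each phenotype.
    cohort_counts = Counter(term for terms in cohort.values() for term in terms)
    scores = {}
    for term, count in cohort_counts.items():
        tf = count / len(cohort)                             # frequency within the cohort
        idf = math.log(n_total / warehouse_counts[term])     # rarity in the warehouse
        scores[term] = tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented example: fever is ubiquitous, ataxia is specific to the cohort.
warehouse = {
    "A": {"fever", "ataxia"},
    "B": {"fever", "ataxia"},
    "C": {"fever"},
    "D": {"fever"},
    "E": {"fever"},
    "F": {"fever"},
}
cohort = {pid: warehouse[pid] for pid in ("A", "B")}
ranked = rank_phenotypes(cohort, warehouse)
print(ranked)
```

Here fever receives a score of zero because every patient in the warehouse has it, whereas ataxia is ranked first; a curated knowledge base cannot supply this kind of population-specific weighting.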


Data warehouse and translational research

Algorithms that calculate similarities between patients from text data can be valuable in both clinical and research contexts. In the case of an undiagnosed patient with an atypical profile, a tool of this kind makes it possible to search for similarities with diagnosed patients, which helps the clinician to treat the patient. If a new genetic diagnosis is identified in a patient, the tool enables a retrospective search for similar undiagnosed patients who might be eligible for the genetic test in question. The aim is to make case-based reasoning (CBR) easily accessible and to incorporate it into the daily practice of clinicians and researchers.
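A similarity search of this kind can be sketched with cosine similarity over phenotype vectors. This is one common choice, used here as an assumption; the text does not specify which measure Dr Warehouse uses. The `{term: weight}` representation, the function names and the example patients are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(index_patient, candidates, top_n=3):
    """Rank candidate patients by decreasing similarity to the index patient."""
    scored = [(pid, cosine_similarity(index_patient, vec)) for pid, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Invented example: one undiagnosed patient compared with three diagnosed ones.
undiagnosed = {"ataxia": 1.0, "nystagmus": 1.0}
diagnosed = {
    "D1": {"ataxia": 1.0, "nystagmus": 1.0, "seizure": 1.0},
    "D2": {"fever": 1.0},
    "D3": {"ataxia": 1.0},
}
neighbours = most_similar(undiagnosed, diagnosed)
print(neighbours)
```

The weights could be simple presence indicators, as here, or relevance scores derived from the warehouse, so that rare and specific phenotypes count for more than common ones.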


Dr Warehouse has allowed us to show that it is possible to retain a ‘literary’ patient file and still address all of the usage scenarios presented. All of the methods and interfaces in Dr Warehouse have been developed in close, ongoing cooperation with the clinicians and researchers at the Necker Children’s Hospital and the Institut Imagine. The geographical proximity of developers and users is vital in ensuring that the actual needs of users are met and that those users fully adopt the tools developed.

From knowledge-driven to data-driven

With the generalised use of computerised patient files, biomedical data warehouses and statistical machine learning methods, data-driven approaches are becoming a high-performing, indispensable model for hospital research. The extraordinary advance of text analysis methods based on artificial intelligence has confirmed that free text will no longer be an obstacle to the optimised reuse of data. But there is still much work to be done, in particular on the automatic processing of French and on the creation of a shared, annotated French corpus on which learning algorithms can rely.