Machine learning: Opening doors in pharmaceutical development

  • Adoption of machine learning in early-stage development has been widespread
  • In later-stage development, use cases for machine learning are still being developed
  • From adherence to drug repositioning, clinical site selection to pharmacovigilance, machine learning’s uses are many

Evidera’s Mustafa Oguz and Andrew Cox survey the adoption of machine learning by the pharmaceutical industry and show where it can increase R&D efficiencies

Machine learning (ML) is the science of programming computers to perform tasks based on rules learned from data instead of rules explicitly described by humans. Although statistical methods in health care for tasks such as stroke risk prediction1 have been in use for a long time, three trends enabled the widespread adoption of ML applications in the past decade: an increase in computing resources and cloud services that allow generation and storage of massive quantities of data, the availability and digitisation of diverse data sources, and improvements in ML algorithms which can reveal complex relationships in data that simpler algorithms might miss.

Despite advances in the adoption of ML methods in the pharmaceutical industry, there is room for wider application, especially in late-stage development. According to a 2017 survey of 3,073 companies globally from 14 business sectors, only 16% of health care firms had adopted at least one artificial intelligence (AI) technology, compared with more than 20% in many other industries.2

This article outlines ML applications in the pharmaceutical industry that increase efficiency and allow more convincing value demonstration, broadly following a product’s lifecycle from drug discovery to drug repositioning.

Drug discovery

One of the most promising application areas for ML methods is new drug development, which is estimated to cost $2.6 billion on average.3 Although computational methods have been employed for drug discovery for decades,4,5 ML methods combined with large data sources can deliver deeper insights faster than traditional methods, which rely mostly on numerous costly biochemical experiments.6 For example, ML methods, including deep learning, are used to predict drug-target interactions (DTIs);7 to generate novel molecules predicted to be active against a given biological target;8 to predict cell-penetrating peptides for antisense delivery;9 and to model quantitative structure-activity and structure-property relationships (QSAR/QSPR) of small molecules, for instance to predict blood-brain barrier permeability.10

Clinical trial site and patient selection

A study that analysed data from 151 global clinical trials at 15,965 sites found that 52% of clinical trials exceeded their planned enrollment timelines, with 48% taking significantly longer to complete enrollment.11 Companies also reported that on average 11% of sites in clinical trials failed to enroll any patients at all. ML algorithms can leverage historical data on site performance to maximise the probability that selected sites will recruit patients quickly, keep drop-out rates low, and adhere to the clinical protocol. Such models can be built on features such as a site’s clinical trial experience, infrastructure, and time to first patient enrollment, which industry studies have found to be predictive of future performance.12,13
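The site-scoring idea can be illustrated with a minimal sketch. It fits a logistic regression by gradient descent on a handful of synthetic site records and then scores two candidate sites. The feature names (a scaled experience score and a scaled start-up delay), the data, and the site labels are all invented for illustration; none of them come from the cited studies, and a real model would need far more data and proper validation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x):
    """Estimated probability that a site meets its enrollment target."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Synthetic history (hypothetical): (experience score, start-up delay), both scaled to 0-1.
# Label 1 means the site met its enrollment target in a past trial.
HISTORY = [
    ((0.9, 0.1), 1), ((0.8, 0.2), 1), ((0.7, 0.3), 1),
    ((0.3, 0.8), 0), ((0.2, 0.9), 0), ((0.4, 0.6), 0),
]
WEIGHTS, BIAS = fit_logistic([x for x, _ in HISTORY], [y for _, y in HISTORY])

# Hypothetical candidate sites to rank for a new trial.
CANDIDATES = {"site_X": (0.85, 0.15), "site_Y": (0.25, 0.85)}
SCORES = {name: predict(WEIGHTS, BIAS, x) for name, x in CANDIDATES.items()}
```

Sorting `SCORES` in descending order gives a simple shortlist; in practice the predicted probabilities would be one input alongside feasibility assessments rather than a replacement for them.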

Wearable technology

Increased miniaturisation and longer battery life of electronics have enabled wearable devices that make the collection of continuous and accurate medical data more practical than ever.14 ML methods are routinely employed to convert raw data collected from wearable technologies used in observational studies and clinical trials, such as smartwatches, wristbands, and subcutaneous sensors, into meaningful clinical endpoints.
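As a rough illustration of turning raw sensor readings into an endpoint, the sketch below summarises short windows of accelerometer magnitudes, labels each window with a nearest-centroid classifier, and counts the windows labelled as walking. The readings, the two activity labels, and the assumption that one window equals one minute are all invented for the example; production pipelines use much richer features and validated models.

```python
import statistics


def window_features(samples):
    """Summarise one window of acceleration magnitudes as (mean, standard deviation)."""
    return (statistics.fmean(samples), statistics.pstdev(samples))


def fit_centroids(labelled_windows):
    """Average the feature vectors of each labelled activity (nearest-centroid classifier)."""
    grouped = {}
    for samples, label in labelled_windows:
        grouped.setdefault(label, []).append(window_features(samples))
    return {
        label: tuple(statistics.fmean(dim) for dim in zip(*feats))
        for label, feats in grouped.items()
    }


def classify(samples, centroids):
    """Return the activity whose centroid is closest to this window's features."""
    feats = window_features(samples)
    return min(
        centroids,
        key=lambda label: sum((f - c) ** 2 for f, c in zip(feats, centroids[label])),
    )


# Synthetic labelled windows (acceleration magnitude in g); values are made up.
TRAINING = [
    ([1.00, 1.01, 0.99, 1.00], "rest"),
    ([1.02, 0.98, 1.00, 1.01], "rest"),
    ([0.60, 1.50, 0.70, 1.40], "walking"),
    ([0.50, 1.60, 0.80, 1.30], "walking"),
]
CENTROIDS = fit_centroids(TRAINING)

# Unlabelled recording; each window is assumed to represent one minute.
RECORDING = [
    [1.00, 1.00, 1.01, 0.99],
    [0.55, 1.45, 0.75, 1.35],
    [0.65, 1.55, 0.70, 1.20],
    [0.99, 1.02, 1.00, 1.00],
]
ACTIVE_MINUTES = sum(classify(w, CENTROIDS) == "walking" for w in RECORDING)
```

Here `ACTIVE_MINUTES` plays the role of the derived endpoint: a single number per patient and recording period, computed from thousands of raw samples in a real study.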


ML and natural language processing methods are commonly used to identify patient experiences related to treatments in the real world. Social media in general, and patient forums in particular, offer a rich source of information about adverse events and other problems associated with treatments. The US Food and Drug Administration (FDA) encourages external stakeholders to explore the use of social media tools to identify patient perspectives regarding disease symptoms,15 and is exploring the value of social media as a source of information on the occurrence of adverse events.16 Social media content can be used to complement literature review findings, supplement focus groups, gather expert opinions, and elicit patient interviews. Extracting useful signals from large volumes of social media text is an active area of research.
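One simple way to flag posts that may describe an adverse event is a text classifier. The sketch below trains a multinomial naive Bayes model with Laplace smoothing on six invented posts and classifies a new one. The posts, the two labels, and the whitespace tokenizer are assumptions made for illustration; real pharmacovigilance work needs much larger annotated corpora, medical vocabularies, and human review.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(posts):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in posts:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-posterior for the given text."""
    word_counts, label_counts, vocabulary = model
    total_posts = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_posts in label_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocabulary)  # Laplace smoothing
        score = math.log(n_posts / total_posts)
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training posts; the labels are hypothetical annotations.
POSTS = [
    ("felt dizzy and nauseous after the second dose", "adverse_event"),
    ("got a rash and headache since starting the tablets", "adverse_event"),
    ("severe headache and dizzy spells this week", "adverse_event"),
    ("picked up my prescription at the pharmacy today", "other"),
    ("my doctor appointment is next week", "other"),
    ("the pharmacy was closed so refill is tomorrow", "other"),
]
MODEL = train(POSTS)
FLAGGED = predict("dizzy with a headache after my dose", MODEL)
```

A classifier like this only triages text; posts it flags would still need review before being treated as a safety signal.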

Precision medicine

Precision medicine is a prevention and treatment approach that considers a patient’s genes, environment, and lifestyle.17 According to a survey of 100 pharmaceutical industry leaders, precision medicine has the potential to help accurately identify new drug targets; clarify target patient profiles, thus enabling more targeted clinical trials with smaller patient numbers and faster market access; reduce research and development (R&D) cycle length; and demonstrate benefits more convincingly.18

Delivering on the promise of precision medicine depends on the ability to harmonise diverse data sources and develop predictive models to optimise treatment strategies. A recent study used deep learning methods to develop predictive models based on electronic health records, including free-text notes; these models predicted the risk of inpatient mortality, unplanned readmission within 30 days, and long length of stay, as well as discharge diagnoses.19 Pai and Bader20 reviewed ML algorithms that leverage patient similarity scores based on genomics data and electronic health records to identify subgroups of type 2 diabetes patients, predict tumour subtype in ependymoma, and predict treatment response.

Adherence prediction

Non-adherence to medication is a major cause of revenue loss for the pharmaceutical industry and imposes a very high cost on public health care systems. A report on the economic costs of medication non-adherence estimated the industry’s annual revenue loss at $564 billion globally.21 Unsupervised ML methods can be used to identify patient segments that differ in their characteristics and reasons for non-adherence, so that interventions can be tailored to each group to reduce non-adherence rates.
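Patient segmentation of this kind can be sketched with k-means clustering, a common unsupervised method. The example below groups six synthetic patients by two invented scores, one for forgetfulness and one for cost burden, both scaled to 0-1. The features, the values, and the choice of two clusters are assumptions for illustration only; they are not taken from the cited report.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm); the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Synthetic patients (hypothetical): (forgetfulness score, cost-burden score).
PATIENTS = [
    (0.90, 0.10), (0.10, 0.90), (0.85, 0.20),
    (0.80, 0.15), (0.15, 0.80), (0.20, 0.85),
]
LABELS, CENTROIDS = kmeans(PATIENTS, k=2)
```

With these made-up data, one segment is dominated by forgetfulness and the other by cost, which would point to different interventions (for example reminders versus financial support). Seeding the centroids with the first k points keeps the sketch deterministic; real analyses usually use random or k-means++ initialisation and test several values of k.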

Drug repositioning

With R&D costs growing and approval rates for new compounds low, repositioning existing drugs for new indications can potentially cut development costs. Drug repositioning also shortens drug development time, since the toxicity and safety profiles of repositioning candidates have already been studied.22 Traditional repositioning has largely relied on unexpected associations observed in clinical trials or in medical practice. ML methods promise to accelerate the process through wider use of systematic computational approaches, such as similarity searching, text mining, and network analysis.23
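Similarity searching, one of the computational approaches mentioned above, can be sketched in a few lines. The example ranks a small library of existing drugs by Tanimoto (Jaccard) similarity to a query compound, with each compound represented as a set of feature identifiers. The drug names and feature sets are hypothetical; the integers simply stand in for the on-bits of a structural fingerprint that a cheminformatics toolkit would normally compute.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature identifiers."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(query, library):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scores = [(name, tanimoto(query, feats)) for name, feats in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical library of existing drugs and their fingerprint features.
LIBRARY = {
    "drug_A": frozenset({1, 2, 3, 4}),
    "drug_B": frozenset({3, 4, 5, 6}),
    "drug_C": frozenset({7, 8}),
}
# Hypothetical query: a compound known to be active in the indication of interest.
QUERY = frozenset({1, 2, 3, 5})
RANKED = rank_by_similarity(QUERY, LIBRARY)
```

Drugs at the top of `RANKED` share the most features with the query and would be the first candidates to examine for the new indication; similarity alone is only a screening signal, not evidence of efficacy.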


Adoption of ML in early-stage development has been widespread; use in later-stage development, however, is relatively early in its evolution, with use cases still being developed. There is no doubt that ML approaches can yield benefits in terms of efficiencies, new insights, and actionable evidence. However, knowledge of the appropriate use of ML methods and their potential applications is not widespread in our industry, and there is often a misconception that the process can be automated and produce amazing results with little effort. As with other analytical approaches, such as traditional statistics, a detailed review and understanding of the problem and the data, as well as rigorous attention to methodological considerations, are crucial when ML is applied to health care. Inappropriate application of ML methods can lead to erroneous conclusions and inaccurate performance assessment. At worst, this can lead to mistakes in health care decisions based on evidence derived from ML studies. Regardless of the application area, there are core steps that every ML project must follow to get actionable insights from data.

A more in-depth discussion of machine learning, including how to implement a successful machine learning project, can be found in the white paper “Machine Learning: Biopharma Applications and Overview of Key Steps for Successful Implementation.”


1 Gage BF, Waterman AD, Shannon W, Boechler M, Rich MW, Radford MJ. Validation of Clinical Classification Schemes for Predicting Stroke: Results from the National Registry of Atrial Fibrillation. JAMA. 2001 Jun 13;285(22):2864-70.

2 Bughin J, Hazan E, Ramaswamy S, Chui M, Allas T, Dahlstrom P, Henke N, Trench M. Artificial Intelligence: The Next Digital Frontier? McKinsey Global Institute. Discussion Paper. June 2017. Accessed September 19, 2018.

3 DiMasi JA, Grabowski HG, Hansen RW. Innovation in the Pharmaceutical Industry: New Estimates of R&D Costs. J Health Econ. 2016 May; 47:20-33. doi: 10.1016/j.jhealeco.2016.01.012. Epub 2016 Feb 12.

4 Lo YC, Rensi SE, Torng W, Altman RB. Machine Learning in Chemoinformatics and Drug Discovery. Drug Discov Today. 2018 Aug;23(8):1538-1546. doi: 10.1016/j.drudis.2018.05.010. Epub 2018 May 8.

5 Hiller SA, Golender VE, Rosenblit AB, Rastrigin LA, Glaz AB. Cybernetic Methods of Drug Design. I. Statement of the Problem – The Perceptron Approach. Comput Biomed Res. 1973 Oct;6(5):411-21.

6 Wang Q, Feng Y, Huang J, Wang T, Cheng G. A Novel Framework for the Identification of Drug Target Proteins: Combining Stacked Auto-Encoders with a Biased Support Vector Machine. PLoS One. 2017 Apr 28;12(4):e0176486. doi: 10.1371/journal.pone.0176486. eCollection 2017.

7 Lee I, Nam H. Identification of Drug-Target Interaction by a Random Walk with Restart Method on an Interactome Network. BMC Bioinformatics. 2018 Jun 13;19(Suppl 8):208. doi: 10.1186/s12859-018-2199-x.

8 Olivecrona M, Blaschke T, Engkvist O, Chen H. Molecular De-Novo Design through Deep Reinforcement Learning. J Cheminform. 2017 Sep 4;9(1):48. doi: 10.1186/s13321-017-0235-x.

9 Wolfe JM, Fadzen CM, Choo ZN, Holden RL, Yao M, Hanson GJ, Pentelute BL. Machine Learning To Predict Cell-Penetrating Peptides for Antisense Delivery. ACS Cent Sci. 2018 Apr 25;4(4):512-520. doi: 10.1021/acscentsci.8b00098. Epub 2018 Apr 5.

10 Wang Z, Yang H, Wu Z, Wang T, Li W, Tang Y, Liu G. In Silico Prediction of Chemical Blood-Brain Barrier Permeability with Machine Learning and Re-Sampling Methods. ChemMedChem. 2018 Aug 15. doi: 10.1002/cmdc.201800533.

11 Lamberti MJ, Mathias A, Myles JE, Howe D, Getz K. Evaluating the Impact of Patient Recruitment and Retention Practices. Ther Innov Regul Sci. 2012 July 13; 46(5):573-580. doi: 10.1177/0092861512453040.

12 Getz KA. Predicting Successful Site Performance. Applied Clinical Trials. 2011 Nov 01. Accessed September 19, 2018.

13 Yang E, O’Donovan C, Phillips J, Atkinson L, Ghosh K, Agrafiotis DK. Quantifying and Visualizing Site Performance in Clinical Trials. Contemp Clin Trials Commun. 2018 Jan 31;9:108-114. doi: 10.1016/j.conctc.2018.01.005. eCollection 2018 Mar.

14 Yetisen AK, Martinez-Hurtado JL, Ünal B, Khademhosseini A, Butt H. Wearables in Medicine. Adv Mater. 2018 Jun 11:e1706910. doi: 10.1002/adma.201706910. [Epub ahead of print]

15 U.S. Food & Drug Administration (FDA). Patient-Focused Drug Development: Collecting Comprehensive and Representative Input. Guidance for Industry, Food and Drug Administration Staff, and Other Stakeholders. Draft Guidance. 2018 June. Accessed September 19, 2018.

16 U.S. Food & Drug Administration (FDA). Data Mining at FDA. Accessed September 19, 2018.

17 U.S. National Library of Medicine (NIH). What Is Precision Medicine? Accessed September 19, 2018.

18 Danner S, Solbach T, Ludwig M. Capitalizing on Precision Medicine: How Pharmaceutical Firms Can Shape the Future of Healthcare. Strategy&. 2017 Aug 24. Accessed September 19, 2018.

19 Rajkomar A, Oren E, et al. Scalable and Accurate Deep Learning with Electronic Health Records. npj Digital Medicine. 2018 May 8. Accessed September 19, 2018.

20 Pai S, Bader GD. Patient Similarity Networks for Precision Medicine. J Mol Biol. 2018 Sep 14;430(18 Pt A):2924-2938. doi: 10.1016/j.jmb.2018.05.037. Epub 2018 Jun 1.

21 Forissier T, Firlik K. Estimated Annual Pharmaceutical Revenue Loss Due to Medication Non-Adherence. Capgemini Consulting. 2012 Nov. Accessed September 19, 2018.

22 Papapetropoulos A, Szabo C. Inventing New Therapies Without Reinventing the Wheel: The Power of Drug Repurposing. Br J Pharmacol. 2018 Jan;175(2):165-167. doi: 10.1111/bph.14081.

23 Bolgár B, Arany Á, Temesi G, Balogh B, Antal P, Mátyus P. Drug Repositioning for Treatment of Movement Disorders: From Serendipity to Rational Discovery Strategies. Curr Top Med Chem. 2013;13(18):2337-63.

About the authors

Andrew Paul Cox, PhD, is a research scientist at Evidera focused on data science, epidemiology, and mathematical modelling, with 11 years of experience and publications within each of these disciplines. He has extensive experience conducting complex data driven research projects for a wide range of indications. In addition, he has specialist technical expertise in machine learning and text analysis (especially relating to analysis of health-related social media). Prior to joining Evidera, Dr. Cox worked at the London School of Hygiene and Tropical Medicine and at the Health Protection Agency, London, England where he worked on projects modelling the HIV epidemic in the UK, Uganda, and South Africa. Additionally, he has been involved in modelling studies based around RCTs for new interventions against tuberculosis in South Africa and syphilis in China. Dr. Cox received his PhD in molecular epidemiology from the University of Edinburgh.

Mustafa Oguz, PhD, is a senior research associate with Evidera’s Data Analytics team. Dr. Oguz has experience in machine learning, statistical analysis, modelling and simulation, and developing decision support tools. As part of his role, Dr. Oguz is responsible for using machine learning methods to answer research questions, leading and managing projects, and writing proposals, and he has experience in a wide variety of therapeutic areas. Previously he worked as a policy analysis assistant at the RAND Corporation where he was also a PhD Fellow in policy analysis. Some of his work includes using statistical and machine learning methods to predict West Point applicants’ success as officers, analysing health and economic surveys of US veterans, and developing decision support tools to predict disability caseload for the U.S. Army. Dr. Oguz has a PhD in policy analysis from Pardee RAND Graduate School.
