Abstract and Introduction
Stroke ranks among the leading causes of morbidity and mortality worldwide. New and continuously improving treatment options such as thrombolysis and thrombectomy have revolutionized acute stroke treatment in recent years. In keeping with current trends, the next revolution might well be the strategic use of the steadily increasing amounts of patient-related data for generating models enabling individualized outcome predictions. Milestones have already been achieved in several health care domains, as big data and artificial intelligence have entered everyday life.
The aim of this review is to synoptically illustrate and discuss how artificial intelligence approaches may help to compute single-patient predictions in stroke outcome research in the acute, subacute and chronic stage. We will present approaches considering demographic, clinical and electrophysiological data, as well as data originating from various imaging modalities and combinations thereof. We will outline their advantages, disadvantages, their potential pitfalls and the promises they hold with a special focus on a clinical audience. Throughout the review we will highlight methodological aspects of novel machine-learning approaches as they are particularly crucial to realize precision medicine. We will finally provide an outlook on how artificial intelligence approaches might contribute to enhancing favourable outcomes after stroke.
Why Precision Medicine in Stroke?
In spite of over 10 million yearly strokes worldwide and a global lifetime risk of 25% to suffer a stroke,[1,2] each of these strokes is a unique and very personal experience, leaving each stroke survivor with his or her very own story. Imagine being that one particular patient: you are female, 65 years old and have arterial hypertension as a known comorbidity, yet are otherwise a healthy and independent person. You have noticed a weaker left-sided grip strength for 1 h and now cannot lift your left arm against gravity, and your speech is slurred. Your symptom severity corresponds to a National Institutes of Health Stroke Scale (NIHSS) of 5 (maximum: 42). Initial MRI indicates ischaemia in the right internal capsule with an acute onset and no evidence for a large vessel occlusion. Would you choose a treatment and outcome prediction based on the ‘average’ stroke patient having a comparable NIHSS score, time constellation and imaging findings? Or would you prefer a more personalized version that takes into account (i) your individual constitution with respect to the potential for recovery; and/or (ii) response to a certain treatment, and has the potential to produce individualized predictions? The second, more complex choice, considering high-dimensional information, may become increasingly feasible when merging artificial and human intelligence, as we will outline in more depth in this review.
In well-developed countries, stroke outcome has been steadily improving in recent years: these advancements have been achieved by highly effective recanalizing therapies for acute treatment, such as thrombolysis and thrombectomy,[3,4] high-quality imaging, the stratified extension of therapeutic time windows[5,6] and standardized care on dedicated stroke units. Intense rehabilitation programs[7,8] and secondary prevention, such as anticoagulants and statins,[9,10] are further examples in the subacute and chronic phases. However, most of these post-stroke treatment options have a high number needed to treat, i.e. many patients must be treated to prevent a single unfavourable outcome. Therefore, the optimal and most effective treatment decisions for an individual may not necessarily be derived from population averages.
These insights are not limited to stroke, but pertain to healthcare in general. They have prompted a new focus on individualizing treatments in recent years and ignited increasing numbers of precision medicine endeavours.[11,12] Ever since, more and more optimization aims focus on individuals rather than population averages to increase the efficacy in healthcare.
The Role of Artificial Intelligence for Precision Medicine. Modern artificial intelligence (AI) practices offer the great opportunity to realize the vision of precision medicine.[13–15] AI can be formally defined as ‘the capacity of computers or other machines to exhibit or simulate intelligent behaviour; the field of study concerned with this’ (Oxford English Dictionary, see also Matheny et al.). Of note, the term AI was already introduced about 70 years ago.[17,18] However, since then, AI has also experienced several periods of reduced interest (‘AI winter’) after falling short of expectations. Early AI implementations successfully completed tasks that are usually difficult for humans by applying a sequence of logical rules. Examples may be seen in expert systems that imitate human decision-making processes. These same implementations, however, failed to tackle tasks that are easy for humans, such as image recognition. With the recent coincidence of growing amounts of data, exponentially increasing computational power, affordable computing and storage resources, as well as broad software availability, techniques such as machine learning and deep learning have begun to remedy these previous shortcomings. In general, both machine and deep learning have led to ground-breaking innovations, such as intelligent software to understand language and images or, as a very recent biological example, the prediction of protein structures based on their amino acid sequence (AlphaFold). Machine and deep learning approaches, as modern branches of AI, excel in automatically detecting patterns in data and leveraging those patterns to predict future data (see Box 1 for examples of individual algorithms).[25–27] Deep learning is special in that it leverages artificial neural networks with multiple (‘deep’) levels of representations that facilitate the acquisition of particularly complex functions.
Notably, AI is not a new idea in healthcare, as expert-guided, rule-based medical approaches were already introduced in the 1970s, for example featuring the automated interpretation of ECGs.[18,30] Once again, machine and deep learning have recently enabled substantial improvements and demonstrated performances comparable with highly trained physicians, especially in the fields of radiology, dermatology and ophthalmology. For example, Gulshan and colleagues demonstrated the feasibility of automatically detecting diabetic retinopathy in retinal fundus photographs. Esteva and colleagues predicted skin cancer type as accurately as dermatologists, and Hannun and colleagues constructed a deep learning model that could accurately classify electrocardiograms into 12 rhythm classes. These successful AI implementations hold several promises in the longer term, such as predicting future disease manifestations based on routinely collected healthcare data, or automated screening for certain cancer types in imaging data. In the shorter term, AI-based individualized predictions on clinical outcomes could provide essential information for healthcare professionals, as well as patients, their families and friends.
To foster the potential of machine and deep learning, it will be of particular importance to acquire large datasets, comprising subject-level information on hundreds to thousands of patients. Only then will these datasets have the potential to adequately represent interindividual variability in the presentation of the disease, comorbidities and predisposition,[36,37] and allow for an advantageous performance of AI models. Recent years have already seen the advent of big medical data initiatives, mostly within the framework of population studies that are impressive not only in the number of participants (>500 000), but also in their data depth (number of variables >1000) (e.g. UK Biobank, NIH All of Us research programme in the USA and the Rhineland Study in Germany). First examples of similar developments in stroke research can be observed as well: the virtual international stroke trial archive (VISTA) contains clinical data, such as the NIHSS, comorbidities or laboratory results of 82 000 patients. However, ‘big’ imaging datasets of stroke patients are still at least an order of magnitude smaller (e.g. 2800 structural scans in the MRI-GENIE study,[42,43] 2950 scans in the Meta VCI Map consortium, 1800 scans in ENIGMA or 1333 scans in a unicentre study[46,47]). All in all, there have been calls to accumulate and exploit regularly obtained clinical, imaging and genetic stroke patient data in a collaborative fashion.[48–50]
Article Structure. In the following sections, we will specifically illustrate single-subject prediction scenarios within stroke outcome research in the acute, subacute and chronic stage. Additionally, we will highlight important considerations with respect to methodological approaches in line with the aim of this review. We first address general aspects of motor outcome research after stroke (‘Motor impairment after stroke’ section). Then, we will summarize the statistical foundations necessary to understand the basic principles of AI in healthcare (‘Statistical background for precision medicine: inference versus prediction’ section). Afterwards, we present and discuss recent studies on stroke outcome research with a special focus on those using prediction models, organized depending on the type of data, i.e. clinical data (‘Stroke prognostic scales based on clinical data only’ section), neurophysiological data, and combinations of clinical, neurophysiological and basic imaging data (‘Neurophysiology and combination of biomarkers in individual data’ section), as well as more detailed structural (‘Structural imaging’ section) and functional imaging data (‘Functional imaging’ section). Given their prime importance for the realization of precision medicine, we will outline essential methodological aspects at the beginning and end of each section. Finally, we will present a synopsis of methods as employed in concrete scenarios in motor outcome research post-stroke (‘Overview of employed algorithms’ section), their general advantages and promises (‘General advantages and promises’ section), as well as disadvantages and pitfalls (‘Disadvantages and pitfalls’ section). 
All in all, our review complements previous reviews on the use of AI in stroke, for example with a focus on clinical decision support in the acute phase, acute stroke imaging,[52,53] stroke rehabilitation and prognostic scales on clinical outcomes and mortality[55,56] (see the Supplementary material for our literature research strategy and selection criteria).
Motor Impairment After Stroke
A substantial number of stroke patients find themselves affected by some degree of motor impairment. Studies[7,57] report frequencies as high as 80% and 50%, respectively. The enormous burdens associated with motor impairments with regard to economic costs, rehabilitation need and disability-adjusted life years necessitate optimizing acute and chronic stroke care. While acute stroke treatment has been considerably advanced, leading to both reduced mortality and morbidity in the past decades, it may now be the restorative therapy after stroke that needs to see the same progress. This focus on the subacute-to-chronic post-stroke phase may be of particular importance since only a relatively small fraction of patients presenting with acute ischaemic stroke are eligible for acute treatment options (e.g. 15.9% for thrombolysis and 5.8% for mechanical thrombectomy in Germany in 2017, with comparable numbers in various other countries[63,64]).
Providing accurate outcome predictions has always been a central goal in stroke research. More specifically, predictions may point at the most suitable short- and long-term treatment goals: should the focus of treatment be on true recovery or rather on compensation, when significant behavioural restitution is unlikely? True recovery requires neural repair to allow for an at least partial return to the pre-stroke repertoire of behaviours, e.g. the same grasping movement pattern as present prior to cerebral ischaemia. Compensation implies the substitution of pre-stroke behaviours by newly learned patterns without the necessity of neural repair, e.g. compensatory movements of the shoulder to account for extension deficits of the hand. During rehabilitation, patients often show both phenomena, i.e. a partial recovery, which is complemented by compensatory behaviours. In this context, rehabilitation refers to the entire process of care after brain injury and an ‘active change by which a person who has become disabled acquires the knowledge and skills needed for optimum physical, psychological and social function’. The availability of predictions may help patients and their proxies to be informed about what to expect in the future and plan accordingly. Furthermore, predicting spontaneous recovery after stroke may be crucial to evaluate the effect of intervention studies. Using this information to stratify patients into control and treatment groups could decrease the overall number of patients needed to be recruited, thereby not only rendering significantly more studies feasible in terms of design and financial costs, but also yielding faster results. Last, outcome models could also target the prediction of response to specific therapies, such as non-invasive brain stimulation, and thus support the identification of probable responders before the start of the therapy.
In the same vein, Stinear and colleagues previously defined several prerequisites for rehabilitation prediction tools that may be useful in clinical practice. Accordingly, prediction tools should forecast an outcome that is meaningful for individual patients at a specific time point in the future.
Statistical Background for Precision Medicine: Inference Versus Prediction
Classical inference statistics, such as F- or t-tests, comprise a powerful tool kit to evaluate research hypotheses, and offer explainable results. Null hypothesis testing represents a frequently used example, which is linked to resulting P-values and ensuing statistical significance statements.[69,70] Importantly, these classical statistical instruments were invented almost a hundred years ago, in an era of rather limited data availability and hardly any computational power. In biomedical research, insights were previously commonly gleaned from either observational descriptions of single patients (e.g. Pierre-Paul Broca’s patient Mr Leborgne, called ‘Tan’), or group comparisons. This situation, however, is changing nowadays.
The perception of statistical significance will most probably experience a redefinition in times of emerging big data scenarios. On the one hand, extensive datasets will more frequently lead to statistical significance of effects with (clinically) negligible effect sizes.[73,74] For example, Miller and colleagues conducted 14 million individual association tests between MRI-derived brain phenotypes, e.g. brain volumes or functional connectivity strength between two brain areas, and sociodemographic, neuropsychological or clinical variables in 10 000 UK Biobank participants. These tests resulted in many statistically significant associations, yet these associations sometimes explained less than a percentage point of variance, which calls their relevance into question. On the other hand, the default use and interpretation of P-values has been challenged frequently in recent years. This process was triggered by increasing reports on low reproducibility of research findings. When trying to reproduce the findings of 100 psychological research studies, replication studies produced significant results in only 36%, while original studies reported significant results in 97% of cases. In response to these findings, Benjamin and colleagues suggested a lower level of significance, i.e. P < 0.005, for the discovery of new effects to increase the robustness of findings. Amrhein and colleagues went a step further and recommended relaxing the over-reliance on P-values by completely abandoning dichotomous decisions. These suggestions have prompted lively discussions: while they have been widely supported (the call by Amrhein was accompanied by >800 signatures of international researchers), other statisticians have been more cautious, for example stressing the positive effect of statistical significance as gatekeeper.
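The first point, that large samples render even negligible effects statistically significant, can be made concrete with a short calculation. The following is a minimal sketch under stated assumptions: the correlation of r = 0.03 and the helper name `correlation_significance` are illustrative choices and are not taken from the cited study, and the P-value relies on a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import math


def correlation_significance(r: float, n: int) -> tuple[float, float, float]:
    """Return (explained variance, t statistic, two-sided P) for a Pearson r.

    Hypothetical helper for illustration. The P-value uses the normal
    approximation to the t distribution (adequate for large n).
    """
    explained_variance = r ** 2
    t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return explained_variance, t_stat, p_value


# Assumed values: a weak correlation observed in 10 000 participants.
r2, t_stat, p = correlation_significance(r=0.03, n=10_000)
print(f"n = 10 000: explained variance = {r2:.2%}, t = {t_stat:.2f}, P = {p:.4f}")

# The same correlation in a sample of 100 is far from significant.
r2_small, t_small, p_small = correlation_significance(r=0.03, n=100)
print(f"n = 100: explained variance = {r2_small:.2%}, t = {t_small:.2f}, P = {p_small:.2f}")
```

In this sketch the explained variance is identical in both samples (well below 1%), whereas only the larger sample crosses conventional significance thresholds. Sample size, and not the strength of the association, drives the difference.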
It is also important to realize that statistically significant group differences, as indicated by low P-values, do not generally imply good single-subject level prediction performances, as measured by out-of-sample generalization (Figure 1). The latter, however, is the idea of precision medicine.[37,82–84] In contrast to the previous focus on inference and explanation, recent years have seen an upsurge of AI and, more specifically, machine-learning techniques that predominantly target prediction performance of single-subject outcomes. Examples of these machine-learning models include regularized regression, (deep) neural networks, nearest-neighbour algorithms, random forests and kernel support vector machines (SVMs) (Box 1). Given multiple input variables, such as age, sex, initial stroke severity and comorbidities, these models are trained to predict some specific individual outcome, such as a motor score 3 months after stroke, based on a weighted combination of these input variables, with the highest achievable prediction performance. This performance can be quantified by various established measures, such as explained variance, accuracy, sensitivity, specificity and area under the curve. As the models are evaluated by their generalization capability to previously unseen, i.e. new, data samples, they are well suited to ensure accurate predictions of individual future outcomes. At the same time, these models may not typically and reliably be able to explain their predictions any further or allow for inferences on particular biological mechanisms. This characteristic has prompted the denotation black-box model.
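The performance measures named above can be illustrated with a small worked example. The sketch below is illustrative only: the eight patients, their outcome labels and their predicted probabilities are invented, the coding (1 = favourable outcome) and the threshold of 0.5 are assumptions, and `classification_metrics` is a hypothetical helper. In practice, such measures would be computed on held-out patients who were not used for model training.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, accuracy and AUC for binary outcomes.

    y_true:  1 = favourable outcome, 0 = unfavourable outcome (assumed coding).
    y_score: model-predicted probability of a favourable outcome.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    # AUC as the probability that a randomly chosen positive case receives
    # a higher score than a randomly chosen negative case (ties count 0.5).
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "auc": wins / (len(pos) * len(neg)),
    }


# Hypothetical held-out test set of eight patients (invented values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
metrics = classification_metrics(y_true, y_score)
print(metrics)
# → {'sensitivity': 0.75, 'specificity': 0.75, 'accuracy': 0.75, 'auc': 0.875}
```

Accuracy, sensitivity and specificity depend on the chosen threshold, whereas the area under the curve summarizes the ranking of patients across all possible thresholds.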
Figure 1. Three scenarios to compare group difference and classification analyses. Data are simulated, differences between groups 1 and 2 are determined via two-sample t-tests, and classification via linear methods into groups 1 and 2 is achieved via thresholding (indicated by red dotted lines). (A) A significant group difference is found despite a poor classification performance. (B) Groups do not differ significantly, but classification accuracy is very high. (C) A significant group difference goes along with high classification accuracy. Overall, these three scenarios illustrate that significant group differences do not automatically lead to high classification accuracies, and that high classification accuracies do not automatically imply significant group differences. Adapted from Arbabshirani et al., with permission.
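Scenario A can be reproduced with a small simulation. The following sketch is not the simulation underlying the figure: the group size, the mean shift of 0.1 standard deviations, the random seed and the function name `simulate_scenario_a` are assumptions chosen for illustration, and the P-value again uses a normal approximation. For simplicity, the threshold is placed midway between the two group means and accuracy is computed on the same simulated data.

```python
import math
import random
from statistics import mean, variance


def simulate_scenario_a(n_per_group=10_000, shift=0.1, seed=0):
    """Two large groups with a small mean difference (assumed parameters).

    Returns the t statistic, an approximate two-sided P-value and the
    accuracy of classifying group membership by a single threshold.
    """
    rng = random.Random(seed)
    group1 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group2 = [rng.gauss(shift, 1.0) for _ in range(n_per_group)]
    m1, m2 = mean(group1), mean(group2)

    # Two-sample (Welch) t statistic; P from the normal approximation,
    # which is adequate for groups of this size.
    standard_error = math.sqrt(variance(group1) / n_per_group
                               + variance(group2) / n_per_group)
    t_stat = (m2 - m1) / standard_error
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))

    # Classification by thresholding midway between the two group means.
    threshold = (m1 + m2) / 2
    if m2 >= m1:
        correct = (sum(x < threshold for x in group1)
                   + sum(x >= threshold for x in group2))
    else:
        correct = (sum(x >= threshold for x in group1)
                   + sum(x < threshold for x in group2))
    accuracy = correct / (2 * n_per_group)
    return t_stat, p_value, accuracy


t_stat, p_value, accuracy = simulate_scenario_a()
print(f"t = {t_stat:.2f}, P = {p_value:.2e}, accuracy = {accuracy:.1%}")
```

With these assumed settings, the group difference is highly significant, while the classification accuracy remains only slightly above the chance level of 50%, because the two distributions overlap almost completely.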