Practical Guide to Machine Learning and Artificial Intelligence in Surgical Education Research

Daniel A. Hashimoto; Julian Varas; Todd A. Schwartz

doi:10.1001/jamasurg.2023.6687

Introduction

Artificial intelligence (AI) is the study of machine intelligence as it relates to perceiving data and drawing inferences from them, typically with the goal of approximating human performance on tasks. Modern growth in computing power and access to data has raised interest in AI, leading to investigation of its applications to surgery. Machine learning (ML), a subfield of AI, refers to the study of methods and algorithms that use data to improve task performance. As AI spreads across medicine, interest in applying AI methods to surgical education research is growing. While the potential for AI in surgical education is high, current infrastructure and practices for data capture, storage, labeling, and analysis fall short of what rigorous research using AI methods requires.¹ This practical guide provides an overview of important considerations in using AI techniques and tools in surgical education research and offers a framework to ensure that appropriately designed studies meet expectations for methodologic rigor, particularly regarding use of high-quality data that reflect the phenomenon of interest (Box).

Box.

Summary

Collaborate with a machine learning practitioner from start to finish. Identifying expertise for a particular application can be helpful to align goals and complementary knowledge between an education researcher and a computational scientist.
Clearly define your research question.
Determine appropriateness of using machine learning and whether you have a sufficient sample size for a machine learning approach.
Fully explore your data to ensure it adequately represents your phenomenon of interest.
Appropriately interpret your results within the context of your data. Assess for sources of bias within your data that could impact generalizability. Understand the limitations of your machine learning approach. Assess the validity of the output relative to its intended task.

Using the Methodology

The wide-ranging potential of AI means that it can be used in many different types of surgical education research, from analysis of quantitative data such as test scores to qualitative data such as written feedback and visual data such as operative video.² The broad range of AI applications requires multidisciplinary collaboration with statisticians and computer or data scientists to ensure rigorous investigation and appropriate interpretation of results. As with any area of research, it is critical first to outline a clear question for investigation before considering which methods are appropriate to address it, including whether to use AI techniques. Artificial intelligence can be advantageous with large data sets with complex relationships or with data that do not lend themselves as naturally to conventional statistical analyses (eg, images, video, and natural language). As much of the current attention to AI for surgical education has been on ML, we focus our guide on ML methods.

While ML approaches encompass techniques that are considered traditional statistics, such as linear regression, they also offer methods to analyze complex relationships within data. Ensemble learning methods leverage the outputs of multiple models to improve performance, while deep learning offers approaches to nontabular data such as images and text. Other common methods include tree-based approaches, K-nearest neighbors, logistic and penalized regression, and K-means clustering. Improvements in hardware and access to large data sets have begun to unlock the potential to analyze unique combinations of data (eg, combining video analysis with clinical outcomes to assess learning curves).

Most ML problems are either classification or regression problems, and the choice of methods should be driven by the research question. Depending on the problem and type of data, supervised, unsupervised, or reinforcement learning approaches can be used; however, many current investigations adopt supervised or semisupervised learning.³ A researcher investigating whether videos of laparoscopic performance can predict whether a resident passes the Fundamentals of Laparoscopic Surgery examination may use supervised learning for classification of performance into pass or fail. Natural language processing of personal statements from residency applicants could predict the likelihood of matching into a program. Alternatively, ML tools can be used to facilitate decision-making⁴ and actions (such as case logging⁵) by trainees, or studies can compare groups who do or do not use such tools.

While ML is computationally intensive and time consuming, access to powerful computing resources, including graphics processing units, can accelerate these analyses. Cloud computing, while expensive at an hourly rate, avoids the capital cost of acquiring hardware. Although commercial statistical programs support some ML functionality, ML analyses are currently often performed in R (R Project for Statistical Computing) or Python (Python Software Foundation), with open-source libraries such as PyTorch, Keras, and TensorFlow facilitating deep learning techniques. Automated ML tools are offered by cloud vendors, allowing researchers to use pretrained models in a no-code environment. However, we advise that these tools are best used for simple problems with well-designed and maintained data sets, as they do not yet offer the ability to fine-tune all elements of a model and are limited to publicly available models.

It is critical to understand that ML methods have limitations. Investigators should fully explore and understand the data to ensure they appropriately represent the phenomenon of interest, because sources of bias can permeate any data and lead to biased results. It is crucial to use high-quality annotations or labels for data, particularly in supervised learning and especially if multiple annotators have labeled the data. This is analogous to requiring rater training to achieve high interrater reliability in assessment studies. Regarding algorithm selection, many of these methods are referred to as black box approaches: they automatically select the data features that contribute most to high performance without providing easily interpretable results. Thus, while tools such as saliency and attention maps generated by algorithms may help with interpretation, the lack of true transparency and interpretability may negatively impact the trustworthiness of these methods. Investigators may also introduce their own bias when interpreting results to evaluate algorithmic performance.
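
As an illustration of checking label quality when multiple annotators are involved, the sketch below computes Cohen's kappa, one common chance-corrected agreement statistic for two annotators. The choice of kappa, the function name `cohens_kappa`, and the phase labels are assumptions made for this example; they are not prescribed by this guide, and the data are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: computed from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels from two hypothetical annotators for 10 video clips.
annotator_1 = ["dissect", "dissect", "clip", "clip", "dissect",
               "clip", "dissect", "dissect", "clip", "dissect"]
annotator_2 = ["dissect", "dissect", "clip", "dissect", "dissect",
               "clip", "dissect", "clip", "clip", "dissect"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Here the annotators agree on 8 of 10 clips, but kappa is lower than 0.8 because part of that agreement would be expected by chance. With more than two annotators or ordinal labels, a different statistic would be needed, which is a decision to make with a statistical collaborator.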

Statistical Considerations

The reporting of AI research should follow the same standards of rigor as other research approaches. There are AI-specific versions of reporting guidelines under development to help guide reporting, including the Standards for Reporting of Diagnostic Accuracy (STARD)–AI (diagnostic models), Transparent Reporting of a Multivariable Prediction Model of Individual Prognosis or Diagnosis (TRIPOD)–AI (prediction models), Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT)–AI (interventional trials), and Consolidated Standards of Reporting Trials (CONSORT)–AI (clinical trials) guidelines.

The selection of a particular method should depend on the question and availability of data. There are no one-size-fits-all algorithms, further stressing the importance of partnering with statistical and computational experts when planning and conducting the research. The range and type of algorithms available are evolving rapidly. However, the core principles of analytic approaches remain static. The research question should drive the analysis and selection of an appropriately sized data set. A small data set can result in overfitting or underperformance, while an overly large one can consume excessive resources.

When training ML models, it is important that data from the testing or validation sets do not contaminate the training data. For example, in computer vision studies, it is common to extract hundreds of thousands of frames from a data set of a few hundred videos. If data are split at the frame level, frames from 1 video may end up in both the training and the testing data sets, providing the model with a biased advantage in evaluating the testing set. In such a case, data should be split into training and testing sets at the video level.
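
The video-level split described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `split_by_video`, the input format (one video identifier per frame), and the toy identifiers are assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_video(frame_video_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no video has frames in both sets.

    frame_video_ids: one video identifier per frame (hypothetical input format).
    """
    # Group frame indices by the video they were extracted from.
    frames_by_video = defaultdict(list)
    for idx, video_id in enumerate(frame_video_ids):
        frames_by_video[video_id].append(idx)

    # Shuffle the videos (not the frames) reproducibly, then hold out a fraction.
    videos = sorted(frames_by_video)
    random.Random(seed).shuffle(videos)
    n_test = max(1, round(len(videos) * test_fraction))
    test_videos = set(videos[:n_test])

    train_idx = [i for v in videos if v not in test_videos for i in frames_by_video[v]]
    test_idx = [i for v in videos if v in test_videos for i in frames_by_video[v]]
    return train_idx, test_idx


# Toy example: 5 videos with 4 frames each (invented identifiers).
frame_video_ids = [f"video_{v}" for v in range(5) for _ in range(4)]
train_idx, test_idx = split_by_video(frame_video_ids, test_fraction=0.2, seed=0)

train_videos = {frame_video_ids[i] for i in train_idx}
test_videos = {frame_video_ids[i] for i in test_idx}
print("Videos shared between sets:", train_videos & test_videos)
```

Because whole videos are assigned to one set or the other, the overlap is always empty. In practice the same grouping logic may need to be applied at a higher level (for example, by surgeon or by institution) if frames from different videos are still closely related. Libraries such as scikit-learn also provide group-aware splitters (for example, `GroupShuffleSplit`) that serve the same purpose.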

Finally, evaluation metrics for algorithms are a key consideration. While many investigators are familiar with terms such as accuracy and area under the receiver operating characteristic curve, these should not be viewed as one-size-fits-all metrics. Metric selection should be driven by the research question, the intended application of the result, and the nature of the data.⁶
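
To illustrate why accuracy alone can mislead, the sketch below scores a deliberately uninformative classifier on an imbalanced pass/fail data set. The data are invented, and the function name `binary_metrics` and the choice of sensitivity and specificity as comparison metrics are assumptions made for this example.

```python
def binary_metrics(y_true, y_pred, positive="fail"):
    """Accuracy, sensitivity, and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Invented data: 95 trainees pass and 5 fail; the "model" always predicts pass.
y_true = ["pass"] * 95 + ["fail"] * 5
y_pred = ["pass"] * 100

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The classifier reaches 95% accuracy while identifying none of the trainees who fail (sensitivity of 0), which is the group an educator would most want to detect. Which metric matters depends on the intended use of the prediction.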

Where to Find More Information

JAMA has an overview of ML methodology in the clinical literature that is applicable to surgical education.³ Textbooks offer a broad introduction to concepts in AI and ML for clinicians.⁷ Informatics and engineering conferences such as those by the American Medical Informatics Association and the Medical Image Computing and Computer Assisted Intervention Society offer insights into state-of-the-art models that are being developed for application in medicine.


Article Information

Corresponding Author: Daniel A. Hashimoto, MD, University of Pennsylvania, 3400 Spruce St, 4 Silverstein, Philadelphia, PA 19104 (daniel.hashimoto@pennmedicine.upenn.edu).

Published Online: January 3, 2024. doi:10.1001/jamasurg.2023.6687

Conflict of Interest Disclosures: Dr Hashimoto reported receiving grants from Olympus Corporation to Massachusetts General Hospital; personal fees from Johnson & Johnson, Activ Surgical, and Verily Life Science outside the submitted work; having patents pending for Concept Graph Neural Networks for Surgery, Surgical Phase Recognition with Sufficient Statistical Model, and Surgical Decision Support Using a Decision Theoretic Model, all owned by Massachusetts General Hospital. Dr Varas reported being CEO and founder of Training Competence, official spinoff from the Catholic University of Chile, outside the submitted work. No other disclosures were reported.

References

1. Maier-Hein L, Eisenmann M, Sarikaya D, et al. Surgical data science – from concepts toward clinical translation. Med Image Anal. 2022;76(102306):102306.

2. Ward TM, Mascagni P, Madani A, Padoy N, Perretta S, Hashimoto DA. Surgical data science and artificial intelligence for surgical education. J Surg Oncol. 2021;124(2):221-230.

3. Liu Y, Chen PC, Krause J, Peng L. How to read articles that use machine learning: users' guides to the medical literature. JAMA. 2019;322(18):1806-1816.

4. Mascagni P, Alapatt D, Sestini L, et al. Computer vision in surgery: from potential to clinical value. NPJ Digit Med. 2022;5(1):163.

5. Thanawala R, Jesneck J, Shelton J, Rhee R, Seymour NE. Overcoming systems factors in case logging with artificial intelligence tools. J Surg Educ. 2022;79(4):1024-1030.

6. Maier-Hein L, Reinke A, Godau P, et al. Metrics reloaded: pitfalls and recommendations for image analysis validation. arXiv. Preprint published online June 3, 2022.

7. Hashimoto DA, Rosman G, Meireles OR. Artificial Intelligence in Surgery: Understanding the Role of AI in Surgical Practice. McGraw-Hill Education; 2021.
