Publications

† co-first author   ‡ co-second author

Deployment and Evaluation of an EHR-integrated, Large Language Model-Powered Tool to Triage Surgical Patients

Timothy Keyes†, Jane Wang†, April S. Liang, Stephen P. Ma, Jason Shen, Jerry Liu, Nerissa Ambers, Abby Pandya, Rita Pandya, Jason Hom, Natasha Steele, Jonathan H. Chen, Kevin Schulman

arXiv · 2026

PDF Link
Abstract Surgical co-management (SCM) is an evidence-based model in which hospitalists jointly manage medically complex perioperative patients alongside surgical teams. Despite its clinical and financial value, SCM is limited by the need to manually identify eligible patients. To determine whether SCM triage can be automated, we conducted a prospective, unblinded study at Stanford Health Care in which an LLM-based, electronic health record (EHR)-integrated triage tool (SCM Navigator) provided SCM recommendations followed by physician review. Using pre-operative documentation, structured data, and clinical criteria for perioperative morbidity, SCM Navigator categorized patients as appropriate, not appropriate, or possibly appropriate for SCM. Faculty indicated their clinical judgment and provided free-text feedback when they disagreed. Sensitivity, specificity, positive predictive value, and negative predictive value were measured using physician determinations as a reference. Free-text reasons were thematically categorized, and manual chart review was conducted on all false-negative cases and 30 randomly selected cases from the largest false-positive category. Since deployment, 6,193 cases have been triaged, of which 1,582 (23%) were recommended for hospitalist consultation. SCM Navigator displayed high sensitivity (0.94, 95% CI 0.91-0.96) and moderate specificity (0.74, 95% CI 0.71-0.77). Post-hoc chart review suggested most discrepancies reflect modifiable gaps in clinical criteria, institutional workflow, or physician practice variability rather than LLM misclassification, which accounted for 2 of 19 (11%) false-negative cases. These findings demonstrate that an LLM-powered, EHR-integrated, human-in-the-loop AI system can accurately and safely triage surgical patients for SCM, and that AI-enabled screening tools can augment and potentially automate time-intensive clinical workflows.

Designing Clinically Useful AI: A Blueprint for Impact

Timothy Keyes†, Shyon Parsa†, Dev Dash, Danton Char, Michelle M. Mello, Alison Callahan, Margaret Ann Smith, Sinjin Lee, Thomas Wang, Heidi Salisbury, Shinichi Goto, Vicki Parikh, Kenneth W. Mahaffey, Michael Salerno, Euan A. Ashley, Nigam H. Shah, and Sneha S. Jain

NEJM AI · 2026

PDF Link
Abstract Most artificial intelligence (AI) tools in health care are evaluated on statistical performance for diagnostic accuracy alone, which often fails to account for the realities of the clinical systems into which they may be deployed. This disconnect has contributed to a proliferation of AI tools that perform well in development but fail to gain traction or generate meaningful impact in clinical use. We propose the use of health AI target product profiles, which specify the performance thresholds an AI tool must meet to produce benefit within a specific care setting, accounting for workflow, capacity, and utility trade-offs. Using hypertrophic cardiomyopathy (HCM) detection as an example, we simulate the performance of an AI-augmented clinical program across a range of AI tool characteristics and health care resource constraints to identify the conditions under which clinical value could be realized. Health AI target product profiles can guide AI tool development, inform AI tool selection if multiple AI tools have already been developed, guide implementation strategies for AI-augmented programs, and prevent investment in AI tools that are unlikely to create value. Ultimately, this approach offers a proactive and context-driven pathway for designing clinically useful AI that can empower health systems, patients, and providers as active members of the AI design process.

Development, Evaluation, and Deployment of a Multi-Agent System for Thoracic Tumor Board

Timothy Keyes†, Tim Ellis-Caleo†, Nerissa Ambers, Faraah Bekheet, Wen-wai Yim, Nikesh Kotecha, Nigam H. Shah, Joel Neal

arXiv · 2026

PDF Link
Abstract Tumor boards are multidisciplinary conferences dedicated to producing actionable patient care recommendations with live review of primary radiology and pathology data. Succinct patient case summaries are needed to drive efficient and accurate case discussions. We developed a manual AI-based workflow to generate patient summaries to display live at the Stanford Thoracic Tumor Board. To improve on this manually intensive process, we developed several automated AI chart summarization methods and evaluated them against physician gold standard summaries and fact-based scoring rubrics. We report these comparative evaluations as well as our deployment of the final state automated AI chart summarization tool along with post-deployment monitoring. We also validate the use of an LLM-as-a-judge evaluation strategy for fact-based scoring. This work is an example of integrating AI-based workflows into routine clinical practice.

MedAgentBrief for Hospital Course Summarization: Safety, Use, and Discharge Documentation Burden

Francois Grolleau, Timothy Keyes‡, April S. Liang‡, Stephen P. Ma, Thomas Lew, Tridu R. Huynh, Natasha Steele, Philip Chung, Paige Qin, Gowri Chandra, Stephanie F. Wang, Evan Mullen, Lauren Carpenter, Mita Hoppenfeld, Matthew Morrin, Baffour A. Kyerematen, Nerissa Ambers, Nikesh Kotecha, Emily Alsentzer, Jason Hom, Nigam H. Shah, Kevin Schulman, Jonathan H. Chen

medRxiv · 2026

PDF Link
Abstract

Importance High-quality discharge summaries are essential for safe care transitions but contribute substantially to clinician documentation burden and burnout. While retrospective studies suggest large language models (LLMs) can generate clinical summaries of comparable quality to physicians, prospective data on their safety, utility, and impact on clinician well-being in real-world environments are lacking.

Objective To evaluate the safety, utilization, and impact on clinician burden of MedAgentBrief, an LLM-based agentic workflow for generating hospital course summaries, during prospective clinical deployment.

Design, Setting, and Participants Single-arm prospective pilot study encompassing 384 hospital discharges at one academic inpatient medicine unit from August 1 to October 11, 2025, with baseline comparisons drawn from April 9 to July 31, 2025.

Intervention MedAgentBrief, a custom agentic AI workflow utilizing Gemini 2.5 Pro, generated draft hospital course summaries nightly using the patient’s history and physical and daily progress notes. Drafts were securely emailed to physicians daily for review and optional use.

Main Outcomes and Measures The primary outcome was physician-reported potential for and severity of harm from unedited summaries (AHRQ Common Format Harm Scale). Secondary outcomes included utilization rate, error types (omissions, inaccuracies, hallucinations), time spent in discharge summaries (EHR logs), and changes in cognitive burden (NASA Task Load Index [NASA-TLX]) and burnout (Stanford Professional Fulfillment Index [PFI] Work Exhaustion Scale).

Results The system generated 1274 summaries. Of 384 discharges, physicians utilized AI content in 219 (57%) cases. Feedback on 100 summaries (40.2%) noted omissions (25%) and inaccuracies (20%) but rare hallucinations (2%). Physicians rated 88% of unedited summaries as having no harm potential and 1% as likely to cause moderate harm; no severe harm was reported. Physician burnout scores decreased significantly (1.75 vs 1.20; P = .03). Time savings were heterogeneous: 71% of physicians saw reductions in median documentation time (up to 2.9 minutes).

Conclusions and Relevance An LLM-based agentic workflow produced hospital course summaries that were frequently utilized with mild to minimal risk of harm identified. The intervention was associated with a significant reduction in physician burnout, supporting the viability of AI summarization to mitigate documentation burden.

Adoption and Use of LLMs at an Academic Medical Center

Nigam H. Shah, Nerissa Ambers, Abby Pandya, Timothy Keyes, Juan M. Banda, Srikar Nallan, Carlene Lugtu, Artem A. Trotsyuk, Suhana Bedi, Alyssa Unell, Miguel Fuentes, Francois Grolleau, Sneha S. Jain, Jonathan Chen, Dev Dash, Danton Char, et al.

arXiv · 2026

PDF Link
Abstract While large language models (LLMs) can support clinical documentation needs, standalone tools struggle with “workflow friction” from manual data entry. We developed ChatEHR, a system that enables the use of LLMs with the entire patient timeline spanning several years. ChatEHR enables automations - which are static combinations of prompts and data that perform a fixed task - and interactive use in the electronic health record (EHR) via a user interface (UI). The resulting ability to sift through patient medical records for diverse use-cases such as pre-visit chart review, screening for transfer eligibility, monitoring for surgical site infections, and chart abstraction, redefines LLM use as an institutional capability. This system, accessible after user-training, enables continuous monitoring and evaluation of LLM use. In 1.5 years, we built 7 automations and 1075 users have trained to become routine users of the UI, engaging in 23,000 sessions in the first 3 months of launch. For automations, being model-agnostic and accessing multiple types of data was essential for matching specific clinical or administrative tasks with the most appropriate LLM. Benchmark-based evaluations proved insufficient for monitoring and evaluation of the UI, requiring new methods to monitor performance. Generation of summaries was the most frequent task in the UI, with an estimated 0.73 hallucinations and 1.60 inaccuracies per generation. The resulting mix of cost savings, time savings, and revenue growth required a value assessment framework to prioritize work as well as quantify the impact of using LLMs. Initial estimates are $6M savings in the first year of use, without quantifying the benefit of the better care offered. Such a “build-from-within” strategy provides an opportunity for health systems to maintain agency via a vendor-agnostic, internally governed LLM platform.

DHODH as a Targetable Metabolic Achilles’ Heel for chemo-resistant B-ALL

Yuxuan Liu, Haowen Jiang, Jingjing Liu, Lucille Stuani, Milton J Merchant, Astraea Jager, Abhishek Koladiya, Ti-Cheng Chang, Pablo Domizi, Jolanda Sarno, Ao Wang, Timothy Keyes, Dorra Jedoui, Jodie Meng, Felix Hartmann, Ruida Hou, Carol Fries, Chiara Pirillo, Qingsong Gao, Ilaria Iacobucci, Sean C Bendall, Min Huang, Norman J Lacayo, Kathleen M Sakamoto, Charles G Mullighan, Mignon L Loh, Jiyang Yu, Jun J Yang, Jiangbin Ye, and Kara L Davis

Blood · 2026

PDF Link
Abstract

Relapse remains a major barrier to survival in B-cell acute lymphoblastic leukemia (B-ALL). Both activation of B-cell signaling pathways and increased glucose consumption have been linked to chemo-resistance and relapse risk. Here, we connect these observations, showing that B-ALL cells with active signaling, marked by high phosphorylated ribosomal protein S6 (pS6+), are glucose dependent. Isotope tracing confirms that pS6+ cells are highly glycolytic and rely on glucose for de novo nucleotide synthesis. Uridine, but not other purines or pyrimidines, rescues pS6+ cells from glucose deprivation, highlighting uridine as essential for survival. Active mTOR signaling in pS6+ cells drives de novo pyrimidine synthesis by activating CAD (Carbamoyl phosphate synthetase 2, Aspartate transcarbamylase, and Dihydroorotase), which catalyzes the first steps of de novo pyrimidine synthesis. Inhibiting signaling abolishes glucose dependency and CAD phosphorylation. Primary pS6+ cells express high levels of pyrimidine synthesis proteins, including dihydroorotate dehydrogenase (DHODH), the rate-limiting enzyme in pyrimidine synthesis. Increased DHODH expression correlates with relapse and poor event-free survival. Most B-ALL molecular subtypes exhibit DHODH activity. BAY-2402234, a DHODH inhibitor, effectively kills pS6+ cells in vitro, with IC50 values correlating with pS6 signaling strength across 14 B-ALL patient-derived xenografts (PDX). In vivo, DHODH inhibition prolongs survival and reduces leukemia burden in pS6+ B-ALL models. These findings link active signaling to pyrimidine dependency and relapse risk, highlighting DHODH inhibition as a promising therapeutic strategy for chemo-resistant B-ALL.

Monitoring Deployed AI Systems in Health Care

Timothy Keyes†, Alison Callahan†, Abby S. Pandya†, Nerissa Ambers, Juan M. Banda, Miguel Fuentes, Carlene Lugtu, Pranav Masariya, Srikar Nallan, Connor O’Brien, Thomas Wang, Emily Alsentzer, Jonathan H. Chen, Dev Dash, Matthew A. Eisenberg, Patricia Garcia, Nikesh Kotecha, Anurang Revri, Michael A. Pfeffer, Nigam H. Shah, Sneha S. Jain

arXiv · 2025

PDF Link
Abstract Post-deployment monitoring of artificial intelligence (AI) systems in health care is essential to ensure their safety, quality, and sustained benefit, and to support governance decisions about which systems to update, modify, or decommission. Motivated by these needs, we developed a framework for monitoring deployed AI systems grounded in the mandate to take specific actions when they fail to behave as intended. This framework, which is now actively used at Stanford Health Care, is organized around three complementary principles: system integrity, performance, and impact. System integrity monitoring focuses on maximizing system uptime, detecting runtime errors, and identifying when changes to the surrounding IT ecosystem have unintended effects. Performance monitoring focuses on maintaining accurate system behavior in the face of changing health care practices (and thus input data) over time. Impact monitoring assesses whether a deployed system continues to have value in the form of benefit to clinicians and patients. Drawing on examples of deployed AI systems at our academic medical center, we provide practical guidance for creating monitoring plans based on these principles that specify which metrics to measure, when those metrics should be reviewed, who is responsible for acting when metrics change, and what concrete follow-up actions should be taken, for both traditional and generative AI. We also discuss challenges to implementing this framework, including the effort and cost of monitoring for health systems with limited resources and the difficulty of incorporating data-driven monitoring practices into complex organizations where conflicting priorities and definitions of success often coexist. This framework offers a practical template and starting point for health systems seeking to ensure that AI deployments remain safe and effective over time.

Annotation-free discovery of disease-relevant cells in single-cell datasets

Timothy Keyes†, Erin Craig†, Jolanda Sarno, Jeremy P. D’Silva, Pablo Domizi, Maxim Zaslavsky, Albert Tsai, et al.

Science Advances · 2025

PDF Link
Abstract In single-cell datasets, patient labels indicating disease status (e.g., sick or not sick) are typically available, but individual cell labels indicating which of a patient’s cells are associated with their disease state are generally unknown. To address this, we introduce mixture modeling for multiple-instance learning (MMIL), an expectation-maximization approach that trains cell-level binary classifiers using only patient-level labels. Applied to primary samples from patients with acute leukemia, MMIL accurately separates leukemia from nonleukemia baseline cells, including rare minimal residual disease (MRD) cells; generalizes across tissues and treatment time points; and identifies biologically relevant features with accuracy approaching that of a hematopathologist. MMIL can also incorporate cell labels when they are available, creating a robust framework for leveraging both labeled and unlabeled cells. MMIL provides a flexible modeling framework for cell classification, especially in scenarios with unknown gold-standard cell labels.

Use of a large language model integrated within the electronic medical record for the evaluation of surgical site infections

Eugenia Miranti, Timothy Keyes, Alvaro Ayala, Nerissa Ambers, Gina Newman, Elmer de Leon, Erika Paola Viana-Cardenas, Wajeeha Tariq, Mindy Sampson, and Jorge L. Salinas

Infection Control & Hospital Epidemiology · 2025

PDF Link
Abstract

Our study evaluated a large language model (gpt-4o-mini) for surgical site infection (SSI) adjudication, achieving 100% sensitivity but 69.4% specificity. While reducing the manual screening workload by 66%, the agent generated many false positives, underscoring the need for refined models to improve specificity without compromising accuracy.

MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries

Francois Grolleau, Emily Alsentzer, Timothy Keyes, Philip Chung, Akshay Swaminathan, Asad Aali, Jason Hom, Tridu Huynh, Thomas Lew, April S. Liang, Weihan Chu, Natasha Z. Steele, Christina F. Lin, Jingkun Yang, Kameron C. Black, Stephen P. Ma, Fateme N. Haredasht, Nigam H. Shah, Kevin Schulman, Jonathan H. Chen

Biocomputing 2026: Proceedings of the Pacific Symposium · 2025

PDF Link
Abstract Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an LLM Jury–a multi-LLM majority vote–assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen’s κ = 81%), a performance statistically non-inferior to that of a single human expert (κ = 67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.

Using secure artificial intelligence agents integrated within the electronic medical record for the evaluation of blood culture appropriateness

Guillermo Rodriguez-Nava, Timothy Keyes, Nerissa Ambers, Eugenia Miranti, Erika Paola Viana-Cardenas, Wajeeha Tariq, Mindy Marie Sampson, Jorge Luis Salinas

Infection Control & Hospital Epidemiology · 2025

PDF Link
Abstract We evaluated large language model (LLM)-based agents integrated with the electronic medical record to assess blood culture appropriateness. While sensitivity was high, specificity remained low. Performance was shaped by prompt phrasing, sycophantic behavior, and semantic triggers, reflecting both the potential and limitations of LLMs in real-world clinical decision support.

Target Product Profile to Evaluate the Clinical Utility, Financial Impact, and Ethical Implications of an AI-Based HCM Detection Model

Shyon Parsa, Timothy Keyes, Dev Dash, Michelle Mello, Heidi Salisbury, Alison Callahan, Shinichi Goto, Michael Salerno, Victoria Parikh, Kenneth Mahaffey, Euan Ashley, Nigam Shah, Sneha Jain

Circulation · 2025

Link
Abstract Hypertrophic cardiomyopathy (HCM) remains underdiagnosed despite effective therapies and accessible screening with electrocardiogram (ECG) and echocardiography. Multiple artificial intelligence (AI) tools show promise in identifying missed HCM cases; however, the path from a promising model to clinical impact remains unclear. Without clear performance thresholds and workflow integration parameters, health systems face uncertainty about which tool to adopt and how to responsibly deploy it. We propose the use of Target Product Profiles (TPPs), an extension of the Fair, Useful, Reliable (AI) Models (FURM) Assessment framework, to define the minimum and ideal requirements for AI tools while incorporating resource, financial, and ethical considerations under real-world constraints. We developed a TPP to guide evaluation of an AI-augmented program for improving HCM diagnosis. Using APLUS, a discrete-event simulation engine, we simulated an HCM screening workflow for 134,856 eligible patients within Stanford Health Care, a multi-hospital health system in California. The diagnostic workflow included primary care, echocardiography, triage, and HCM specialty clinic referral. We simulated multiple combinations of model sensitivity (0.5–0.975) and specificity (0.85–0.99), incorporating resource constraints (e.g., HCM clinic capacity) and utility weights reflecting diagnostic delay, misdiagnosis, and mortality. Financial modeling included AI deployment costs and downstream care utilization. Ethical analysis was conducted through stakeholder interviews exploring issues such as perceived risks and benefits, equity, and patient consent. In our simulations, AI models with specificity ≥0.9 reduced HCM-related mortality using the proposed workflow, while lower specificity cutoffs overwhelmed referral capacity with false positive results (Figure 1). With a simulated 50% increase in HCM clinic capacity, a specificity of ≥0.85 was sufficient to achieve benefit. Financial models showed cost-effectiveness concentrated in true positive cases and a net positive effect for the hospital at low false-positive rates (Figure 2). Ethical review highlighted concerns and mitigation strategies around access disparities, patient anxiety from alerts, and subgroup representation. For HCM, a TPP integrating workflow modeling, financial constraints, and ethical insights may help clarify necessary performance metrics in context–offering a roadmap for actionable, deployment-ready AI-augmented programs.

Holistic evaluation of large language models for medical tasks with MedHELM

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, et al.

Nature Medicine · 2025

PDF Link
Abstract

While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks–clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis) and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, systematic comparison of nine frontier LLMs–Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3 and o3-mini–using an automated LLM-jury evaluation method. Our LLM-jury uses multiple AI evaluators to assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.

Epiregulon: Single-cell transcription factor activity inference to predict drug response and drivers of cell states

Tomasz Wlodarczyk, Aaron Lun, Diana Wu, Minyi Shi, Xiaofen Ye, Shreya Menon, Shushan Toneyan, Kerstin Seidel, Timothy Keyes, et al.

Nature Communications · 2025

PDF Link
Abstract

Transcription factors (TFs) and transcriptional coregulators are emerging therapeutic targets. Gene regulatory networks (GRNs) can evaluate pharmacological agents and identify drivers of disease, but methods that rely solely on gene expression often neglect post-transcriptional modulation of TFs. We present Epiregulon, a method that constructs GRNs from single-cell ATAC-seq and RNA-seq data for accurate prediction of TF activity. This is achieved by considering the co-occurrence of TF expression and chromatin accessibility at TF binding sites in each cell. ChIP-seq data allows motif-agnostic activity inference of transcriptional coregulators or TFs harboring neomorphic mutations. Epiregulon accurately predicted the effects of AR inhibition across different drug modalities including an AR antagonist and an AR degrader, delineated the mechanisms of a SMARCA4 degrader by identifying context-dependent interaction partners, and prioritized drivers of lineage reprogramming and tumorigenesis. By mapping gene regulation across various cellular contexts, Epiregulon can accelerate the discovery of therapeutics targeting transcriptional regulators.

The tidyomics ecosystem: enhancing omic data analyses

Timothy Keyes†, William J. Hutchison†, Helena L. Crowell, Jacques Serizay, Charlotte Soneson, Eric S. Davis, Noriaki Sato, Lambda Moses, Boyd Tarlinton, Abdullah A. Nahid, Miha Kosmac, Quentin Clayssen, Victor Yuan, Wancen Mu, Ji-Eun Park, Izabela Mamede, Min Hyung Ryu, Pierre-Paul Axisa, Paulina Paiz, Chi-Lam Poon, Ming Tang, Raphael Gottardo, Martin Morgan, Stuart Lee, Michael Lawrence, Stephanie C. Hicks, Garry P. Nolan, Kara L. Davis, Anthony T. Papenfuss, Michael I. Love, Stefano Mangiola

Nature Methods · 2024

PDF Link
Abstract

The growth of omic data presents evolving challenges in data manipulation, analysis and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary data organization and manipulation standard. Here we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analyzing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.

IFN-gamma-Expressing Myeloid Cells Localize within Lipoproteinosis during Drug-Associated Pulmonary Alveolar Proteinosis occurring in Systemic Juvenile Idiopathic Arthritis

Alea Delmastro, Candace Liu, Xiao-Wen Ding, Serena Tan, Inna Averbukh, Marc Bosse, Timothy Keyes, et al.

bioRxiv · 2024

PDF Link
Abstract

In the United States, approximately one in 1000 children are diagnosed with the autoinflammatory disease, Juvenile Idiopathic Arthritis (JIA). A subset of JIA cases manifests as Systemic JIA (sJIA), which is characterized by joint pain, fevers, rashes, and systemic inflammation. Severe pulmonary complications have not historically been associated with sJIA. Since 2010, inhibitors of interleukin-1 and interleukin-6 (IL-1i/IL-6i) have been the recommended course of treatment for sJIA, yet recent studies show evidence of a severe drug hypersensitivity reaction implicating these medications in a subset of those treated. With this reaction, sJIA patients can develop severe lung disease, including pulmonary alveolar proteinosis (PAP). As this drug-associated lung disease has only recently been identified, the etiology of sJIA drug-associated PAP (sJIA-daPAP) is poorly understood. We used multiplexed ion beam imaging by time-of-flight (MIBI-TOF) to define the cellular immune infiltrate and describe pathological features of PAP in sJIA-daPAP patients. We found an enrichment of eosinophils, neutrophils, and M2 macrophages within regions of lipoproteinosis. These enriched subsets all upregulate IFN-gamma within lipoproteinosis, a signature specific to sJIA-daPAP samples compared to non-sJIA-PAP samples. In a cellular neighborhood analysis, we identified that eosinophils, neutrophils and M2 macrophages frequently co-localize within the same cellular microenvironment, especially in lipoproteinosis regions. Therefore, this spatial coordination may be involved in clearance or persistence of lipoproteinosis in sJIA-daPAP. This study provides a comprehensive overview of sJIA-daPAP immune pathology and suggests cellular mechanisms that drive inflammation in sJIA patients experiencing pulmonary complications associated with delayed drug hypersensitivity during IL-1i/IL-6i treatment.

Sociodemographic factors and research experience impact MD-PhD program acceptance

Darnell K. Adrian Williams, Briana Christophers, Timothy Keyes, Rachit Kumar, Michael C. Granovetter, Alexandria Adigun, Justin Olivera, Jehron Pura-Bryant, Chynna Smith, Chiemeka Okafor, Mahlet Shibre, Dania Daye, Myles H. Akabas

JCI Insight · 2024

PDF Link
Abstract

The 2014 NIH Physician-Scientist Workforce Working Group predicted a future shortage of physician-scientists. Subsequent studies have highlighted disparities in MD-PhD admissions based on race, income, and education. Our analysis of data from the Association of American Medical Colleges covering 2014–2021 (15,156 applicants and 6,840 acceptees) revealed that acceptance into US MD-PhD programs correlates with research experience, family income, and research publications. The number of research experiences was associated with parental education and family income. Applicants were more likely to be accepted with a family income greater than $50,000 or with one or more publications or presentations. Applicants were less likely to be accepted if they had parents without a graduate degree, were Black/African American, were first-generation college students, or were reapplicants, irrespective of the number of research experiences, publications, or presentations. These findings underscore an admissions bias that favors candidates from affluent and highly educated families, while disadvantaging underrepresented minorities.

tidytof: a user-friendly framework for scalable and reproducible high-dimensional cytometry data analysis

Timothy Keyes, Abhishek Koladiya, Yu-Chen Lo, Garry P. Nolan, Kara L. Davis

Bioinformatics Advances · 2023

PDF Link
Abstract While many algorithms for analyzing high-dimensional cytometry data have now been developed, the software implementations of these algorithms remain highly customized–this means that exploring a dataset requires users to learn unique, often poorly interoperable package syntaxes for each step of data processing. To solve this problem, we developed {tidytof}, an open-source R package for analyzing high-dimensional cytometry data using the increasingly popular ‘tidy data’ interface.

Teaching LGBTQ+ health, a web-based faculty development course: program evaluation study using the RE-AIM framework

Michael Albert Gisondi, Timothy Keyes, Shana Zucker, Deila Bumgardner

JMIR Medical Education · 2023

PDF Link
Abstract

Background: Many health professions faculty members lack training on fundamental lesbian, gay, bisexual, transgender, and queer (LGBTQ+) health topics. Faculty development is needed to address knowledge gaps, improve teaching, and prepare students to competently care for the growing LGBTQ+ population.

Objective: We conducted a program evaluation of the massive open online course Teaching LGBTQ+ Health: A Faculty Development Course for Health Professions Educators from the Stanford School of Medicine. Our goal was to understand participant demographics, impact, and ongoing maintenance needs to inform decisions about updating the course.

Methods: We evaluated the course for the period from March 27, 2021, to February 24, 2023, guided by the RE-AIM (Reach, Effectiveness, Adoption, Implementation, and Maintenance) framework. We assessed impact using participation numbers, evidence of learning, and likelihood of practice change. Data included participant demographics, performance on a pre- and postcourse quiz, open-text entries throughout the course, continuing medical education (CME) credits awarded, and CME course evaluations. We analyzed demographics using descriptive statistics and pre- and postcourse quiz scores using a paired 2-tailed t test. We conducted a qualitative thematic analysis of open-text responses to prompts within the course and CME evaluation questions.

Results: Results were reported using the 5 framework domains. Regarding Reach, 1782 learners participated in the course, and 1516 (85.07%) accessed it through a main course website. Of the different types of participants, most were physicians (423/1516, 27.9%) and from outside the sponsoring institution and target audience (1452/1516, 95.78%). Regarding Effectiveness, the median change in test scores for the 38.1% (679/1782) of participants who completed both the pre- and postcourse tests was 3 out of 10 points, or a 30% improvement (P<.001). Themes identified from CME evaluations included LGBTQ+ health as a distinct domain, inclusivity in practices, and teaching LGBTQ+ health strategies. A minority of participants (237/1782, 13.3%) earned CME credits. Regarding Adoption, themes identified among responses to prompts in the course included LGBTQ+ health concepts and instructional strategies. Most participants strongly agreed with numerous positive statements about the course content, presentation, and likelihood of practice change. Regarding Implementation, the course cost US $57,000 to build and was intramurally funded through grants and subsidies. The course faculty spent an estimated 600 hours on the project, and educational technologists spent another 712 hours. Regarding Maintenance, much of the course is evergreen, and ongoing oversight and quality assurance require minimal faculty time. New content will likely include modules on transgender health and gender-affirming care.

Conclusions: Teaching LGBTQ+ Health improved participants’ knowledge of fundamental queer health topics. Overall participation has been modest to date. Most participants indicated an intention to change clinical or teaching practices. Maintenance costs are minimal. The web-based course will continue to be offered, and new content will likely be added.

Single-cell technologies uncover intra-tumor heterogeneity in childhood cancers

Yu-Chen Lo, Yuxuan Liu, Marte Kammersgaard, Abhishek Koladiya, Timothy Keyes, Kara L. Davis

Seminars in Immunopathology · 2023

PDF Link
Abstract

Childhood cancer is the second leading cause of death in children aged 1 to 14. Although survival rates have vastly improved over the past 40 years, cancer resistance and relapse remain a significant challenge. Advances in single-cell technologies enable dissection of tumors to unprecedented resolution. This facilitates unraveling the heterogeneity of childhood cancers to identify cell subtypes that are prone to treatment resistance. The rapid accumulation of single-cell data from different modalities necessitates the development of novel computational approaches for processing, visualizing, and analyzing single-cell data. Here, we review single-cell approaches utilized or under development in the context of childhood cancers. We review computational methods for analyzing single-cell data and discuss best practices for their application. Finally, we review the impact of several studies of childhood tumors analyzed with these approaches and future directions to implement single-cell studies into translational cancer research in pediatric oncology.

Improved Relapse Prediction in Pediatric Acute Myeloid Leukemia By Deconvolving Lineage-Specific and Cancer-Specific Features in Single-Cell Data

Timothy Keyes, Astraea Jager, Mason Krueger, Sylvia Plevritis, Robert Tibshirani, Richard Aplenc, et al.

Blood · 2022

Link
Abstract

Introduction

While most children with acute myeloid leukemia (AML) achieve first remission, nearly 40% will relapse. Of these children, few survive to a second remission even with highly-escalated treatment protocols. Recent studies have shown that many AML patients harbor rare, stem cell-like subpopulations that resist chemotherapy and drive relapse. However, the exact characteristics of these relapse-associated cells are a matter of contention, with reported phenotypes spanning the hematopoietic developmental continuum. In some patients, treatment-resistant cells can be detected as minimal residual disease (MRD), which is often used to predict relapse, albeit with limited accuracy and only after induction chemotherapy. Thus, the identity of treatment-resistant cells as well as their relationship to normal progenitors remain mysterious, thereby limiting the development of targeted therapies for pediatric AML.

Here, we present a computational approach for decomposing high-dimensional single-cell measurements into two components: a lineage-specific component that can be used to align cancer cells with specific stages of myeloid development and a cancer-specific component that can be used to identify aberrant phenotypes unique to AML cells. We show that, together, these components can be used at the time of diagnosis to predict relapse more accurately than clinical information alone.

Methods and Results

Using mass cytometry, we analyzed paired diagnostic and post-induction samples collected from 19 (8 relapse, 11 non-relapse) pediatric patients who enrolled on the Children’s Oncology Group trial AAML1031. All patients were treated on the control arm and consented to banking of tissue for research. We also included 5 bone marrow samples from healthy donors. After thawing, samples were divided in half and stimulated with conditioned medium from the human bone marrow stromal cell line HS-5 to activate relevant signaling pathways, or left unstimulated. An average of 5 × 10^5 cells per patient were analyzed for each condition. The mass cytometry panel included 31 antibodies to surface markers, 6 antibodies to intracellular signaling mediators, and 4 antibodies to intracellular proteins and transcription factors.

Following data collection, the singular value decomposition was applied to the data matrix of healthy single-cell measurements to construct a linear subspace representing the predominant protein expression programs (“eigencells”) within the healthy myeloid developmental continuum. By projecting AML single-cell measurements onto this subspace, we derived a healthy feature vector aligned with the healthy subspace and a cancer-specific feature vector orthogonal to the healthy subspace for each AML cell. These feature vectors, along with clinical metadata about each patient (age, blast percentage at diagnosis, and cytogenetic status), were used as the input to regularized Cox proportional hazards models predicting time-to-relapse for each patient. Using the relative risk scores from the proportional hazards model, patients were assigned to high-risk or low-risk groups according to the optimal log-rank test threshold.
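
A minimal sketch of the subspace decomposition described above, written in Python with NumPy. This is an illustration and not the authors' code: the function names, the number of retained components, the absence of any centering or scaling step, and the synthetic data are all assumptions.

```python
import numpy as np


def healthy_basis(healthy_cells, n_components):
    """Return an orthonormal basis for the healthy subspace.

    healthy_cells is a (cells x markers) matrix. The basis columns are the
    top right singular vectors; n_components is an assumed tuning choice.
    """
    _, _, vt = np.linalg.svd(healthy_cells, full_matrices=False)
    return vt[:n_components].T  # shape: (markers, n_components)


def decompose(cells, basis):
    """Split each cell into a part inside the healthy subspace and a
    residual that is orthogonal to it."""
    lineage = cells @ basis @ basis.T   # projection onto the healthy subspace
    cancer_specific = cells - lineage   # orthogonal remainder
    return lineage, cancer_specific


# Synthetic stand-in data (not real cytometry measurements).
rng = np.random.default_rng(0)
healthy = rng.normal(size=(200, 6))   # 200 cells x 6 markers
aml = rng.normal(size=(50, 6))        # 50 cells x 6 markers

basis = healthy_basis(healthy, n_components=3)
lineage, cancer_specific = decompose(aml, basis)
```

Because the basis columns are orthonormal, the two components sum back to the original measurement, and the cancer-specific component has no projection onto the healthy subspace.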

The baseline clinical model used only age, blast percentage at diagnosis, and cytogenetic status as predictors and predicted relapse status with an accuracy of 13/19 (68%). This baseline model was outperformed by the model constructed using the average value of the single-cell cancer-specific feature vectors for each patient, which predicted relapse status with an accuracy of 16/19 (84%). Interestingly, despite using only information available at diagnosis, the single-cell model also outperformed a clinical model incorporating patients’ MRD status after induction chemotherapy, which predicted relapse with an accuracy of 15/19 (79%). Interrogation of the coefficients of the single-cell feature model revealed specific cellular signaling programs associated with relapse, including enhanced pCreb and pSTAT1 signaling as well as depleted pSTAT5 signaling relative to healthy lineage cells (Figure 1).

Conclusions

These results support the feasibility of predicting relapse in AML as early as diagnosis by leveraging a computational approach that compares cancer cells to the native lineage from which they arise. Validation of this approach in an independent cohort is ongoing and will be presented.

Sexual and gender minority identity disclosure from undergraduate to graduate medical education: perceptions of professional Outness among Medical Students

Timothy Keyes, Shana Zucker, Teddy G. Goetz, Justin L. Jia, Samuel R. Bunting, Mitchell R. Lunn, Leslee L. Subak

Annals of LGBTQ Public and Population Health · 2022

Link
Abstract

Increasingly, medical schools and residency programs seek to recruit trainees from diverse backgrounds, including sexual and gender minority (SGM) people. However, many trainees do not disclose their SGM identity during medical training due to fear of discrimination, which remains a challenge for institutional diversity and inclusion efforts. Despite this, relatively few studies have rigorously quantified trainees’ SGM identity self-disclosure across different stages of medical training. In 2018 and 2019, the Medical Student Pride Alliance (MSPA) distributed a 33-item online questionnaire interrogating practices and attitudes about SGM identity disclosure to medical students at allopathic and osteopathic medical schools in the United States. Here, we analyze these data to compare 1) the degree to which medical students disclose SGM identity in various professional contexts during undergraduate and graduate medical training and 2) students’ attitudes regarding SGM identity disclosure across those contexts. Overall, 1,162 medical students from 125 medical schools responded to the survey. Of these respondents, 629 (54%) were SGM-identified. Among SGM-identified respondents, students were most likely to report SGM identity self-disclosure to peers (91%) and least likely to report SGM identity self-disclosure on applications to residency or post-doctoral work (29%). Cisgender women were less likely to report SGM identity self-disclosure than other genders, and students performing research were more likely to report SGM identity self-disclosure among mentors. Overall, most (>90%) survey respondents supported trainees’ ability to disclose their sexual orientation or gender identity during medical training. This exploratory study provides preliminary evidence that SGM-identifying medical students often do not disclose their sexual orientation or gender identity in evaluative professional contexts. Future work should assess this phenomenon in a larger national sample and propose targeted policies to support SGM inclusion throughout medical training in general and on applications to graduate medical education specifically.

CytofIn enables integrated analysis of public mass cytometry datasets using generalized anchors

Yu-Chen Lo, Timothy Keyes, Astraea Jager, Jolanda Sarno, Pablo Domizi, Ravindra Majeti, Kathleen M. Sakamoto, Norman Lacayo, Charles G. Mullighan, Jeffrey Waters, Bita Sahaf, Sean C. Bendall, Kara L. Davis

Nature Communications · 2022

PDF Link
Abstract

The increasing use of mass cytometry for analyzing clinical samples offers the possibility to perform comparative analyses across public datasets. However, challenges in batch normalization and data integration limit the comparison of datasets not intended to be analyzed together. Here, we present a data integration strategy, CytofIn, using generalized anchors to integrate mass cytometry datasets from the public domain. We show that low-variance controls, such as healthy samples and stable channels, are inherently homogeneous, robust against stimulation, and can serve as generalized anchors for batch correction. Single-cell quantification comparing mass cytometry data from 989 leukemia files pre- and post-normalization with CytofIn demonstrates effective batch correction while recapitulating the gold-standard bead normalization. CytofIn integration of public cancer datasets enabled the comparison of immune features across histologies and treatments. We demonstrate the ability to integrate public datasets without necessitating identical control samples or bead standards for fast and robust analysis using CytofIn.

A cancer biologist’s primer on machine learning applications in high-dimensional cytometry

Timothy Keyes, Pablo Domizi, Yu-Chen Lo, Garry P. Nolan, Kara L. Davis

Cytometry Part A · 2020

PDF Link
Abstract The application of machine learning and artificial intelligence to high-dimensional cytometry data sets has increasingly become a staple of bioinformatic data analysis over the past decade. This is especially true in the field of cancer biology, where protocols for collecting multiparameter single-cell data in a high-throughput fashion are rapidly developed. As the use of machine learning methodology in cytometry becomes increasingly common, there is a need for cancer biologists to understand the basic theory and applications of a variety of algorithmic tools for analyzing and interpreting cytometry data. We introduce the reader to several keystone machine learning-based analytic approaches with an emphasis on defining key terms and introducing a conceptual framework for making translational or clinically relevant discoveries. The target audience consists of cancer cell biologists and physician-scientists interested in applying these tools to their own data, but who may have limited training in bioinformatics.

Progressive B Cell Loss in Revertant X-SCID

Connie H. Lin, Hye Sun Kuehn, Timothy J. Thauland, Christine M. Lee, Suk See De Ravin, Harry L. Malech, Timothy Keyes, Astraea Jager, Kara L. Davis, Maria I. Garcia-Lloret, Sergio D. Rosenzweig, Manish J. Butte

Journal of Clinical Immunology · 2020

PDF Link
Abstract

We report the case of a patient with X-linked severe combined immunodeficiency (X-SCID) who survived for over 20 years without hematopoietic stem cell transplantation (HSCT) because of a somatic reversion mutation. An important feature of this rare case included the strategy to validate the pathogenicity of a variant of the IL2RG gene when the T and B cell lineages comprised only revertant cells. We studied the X-inactivation of sorted T cells from the mother to show that the pathogenic variant was indeed the cause of his SCID. One interesting feature was a progressive loss of B cells over 20 years. CyTOF (cytometry time of flight) analysis of bone marrow offered a potential explanation of the B cell failure, with expansions of progenitor populations that suggest a developmental block. Another interesting feature was that the patient bore extensive granulomatous disease and skin cancers that contained T cells, despite severe T cell lymphopenia in the blood. Finally, the patient had a few hundred T cells on presentation but his TCRs comprised a very limited repertoire, supporting the important conclusion that repertoire size trumps numbers of T cells.

Student Education About Pre-Exposure Prophylaxis (PrEP) Varies Between Regions of the United States

Samuel R. Bunting, Sarah S. Garber, Robert H. Goldstein, Timothy D. Ritchie, Tamzin J. Batteson, Timothy Keyes

Journal of General Internal Medicine · 2020

PDF Link
Abstract

Background Daily, oral pre-exposure prophylaxis (PrEP) is an effective and safe prevention strategy for people at risk for HIV. However, prescription of PrEP has been limited for patients at the highest risk. Disparities in PrEP prescription are pronounced among racial and gender minority patients. A significant body of literature indicates that practicing healthcare providers have little awareness and knowledge of PrEP. Very little work has investigated the education about PrEP among health professionals in training.

Objective The objective of this study was to compare health professions students’ awareness of PrEP and education about PrEP between regions of the US, and to determine if correlations between regional HIV incidence and PrEP use were present.

Design Survey study.

Participants A cross-sectional sample of health professions students (N = 1859) representing future prescribers (MD, DO, PA), pharmacists, and nurses in the US.

Key Results Overall, 83.4% of students were aware of PrEP, but only 62.2% of fourth-year students indicated they had been taught about PrEP at any time during their training. Education about PrEP was most comprehensive in the Northeastern US, the area with the highest PrEP to need ratio (4.7). In all regions, transgender patients and heterosexual men and women were least likely to be presented in education as PrEP candidates, and men who have sex with men were the most frequently presented.

Conclusions There are marked differences in education regarding PrEP both between academic programs and regions of the USA.

Medical Student Pride Alliance: The First National LGBTQ+ Medical Student Affinity Organization

Teddy G. Goetz, Shana Zucker, Timothy Keyes, Michael Gisondi

Medical Education · 2020

PDF Link
Abstract This paper documents how the Medical Student Pride Alliance (MSPA), the first national affinity organization for LGBTQ+ medical students, was founded in the United States in 2018.

Structural and functional features of central nervous system lymphatic vessels

Antoine Louveau, Igor Smirnov, Timothy Keyes, Jacob D. Eccles, Sherin J. Rouhani, J. David Peske, Noel C. Derecki, David Castle, James W. Mandell, Kevin S. Lee, Tajie H. Harris, Jonathan Kipnis

Nature · 2015

PDF Link
Abstract We discovered functional lymphatic vessels lining the dural sinuses. The discovery of the central nervous system lymphatic system may call for a reassessment of basic assumptions in neuroimmunology.

Designing Clinically Useful AI: A Blueprint for Impact

Timothy Keyes†, Shyon Parsa†, Dev Dash, Danton Char, Michelle M. Mello, Alison Callahan, Margaret Ann Smith, Sinjin Lee, Thomas Wang, Heidi Salisbury, Shinichi Goto, Vicki Parikh, Kenneth W. Mahaffey, Michael Salerno, Euan A. Ashley, Nigam H. Shah, and Sneha S. Jain

NEJM AI · 2026

PDF Link
Abstract Most artificial intelligence (AI) tools in health care are evaluated on statistical performance for diagnostic accuracy alone, which often fails to account for the realities of the clinical systems into which they may be deployed. This disconnect has contributed to a proliferation of AI tools that perform well in development but fail to gain traction or generate meaningful impact in clinical use. We propose the use of health AI target product profiles, which specify the performance thresholds an AI tool must meet to produce benefit within a specific care setting, accounting for workflow, capacity, and utility trade-offs. Using hypertrophic cardiomyopathy (HCM) detection as an example, we simulate the performance of an AI-augmented clinical program across a range of AI tool characteristics and health care resource constraints to identify the conditions under which clinical value could be realized. Health AI target product profiles can guide AI tool development, inform AI tool selection if multiple AI tools have already been developed, guide implementation strategies for AI-augmented programs, and prevent investment in AI tools that are unlikely to create value. Ultimately, this approach offers a proactive and context-driven pathway for designing clinically useful AI that can empower health systems, patients, and providers as active members of the AI design process.

Development, Evaluation, and Deployment of a Multi-Agent System for Thoracic Tumor Board

Timothy Keyes†, Tim Ellis-Caleo†, Nerissa Ambers, Faraah Bekheet, Wen-wai Yim, Nikesh Kotecha, Nigam H. Shah, Joel Neal

arXiv · 2026

PDF Link
Abstract Tumor boards are multidisciplinary conferences dedicated to producing actionable patient care recommendations with live review of primary radiology and pathology data. Succinct patient case summaries are needed to drive efficient and accurate case discussions. We developed a manual AI-based workflow to generate patient summaries to display live at the Stanford Thoracic Tumor Board. To improve on this manually intensive process, we developed several automated AI chart summarization methods and evaluated them against physician gold standard summaries and fact-based scoring rubrics. We report these comparative evaluations as well as our deployment of the final state automated AI chart summarization tool along with post-deployment monitoring. We also validate the use of an LLM-as-a-judge evaluation strategy for fact-based scoring. This work is an example of integrating AI-based workflows into routine clinical practice.

MedAgentBrief for Hospital Course Summarization: Safety, Use, and Discharge Documentation Burden

Francois Grolleau, Timothy Keyes‡, April S. Liang‡, Stephen P. Ma, Thomas Lew, Tridu R. Huynh, Natasha Steele, Philip Chung, Paige Qin, Gowri Chandra, Stephanie F. Wang, Evan Mullen, Lauren Carpenter, Mita Hoppenfeld, Matthew Morrin, Baffour A. Kyerematen, Nerissa Ambers, Nikesh Kotecha, Emily Alsentzer, Jason Hom, Nigam H. Shah, Kevin Schulman, Jonathan H. Chen

medRxiv · 2026

PDF Link
Abstract

Importance High-quality discharge summaries are essential for safe care transitions but contribute substantially to clinician documentation burden and burnout. While retrospective studies suggest large language models (LLMs) can generate clinical summaries of comparable quality to physicians, prospective data on their safety, utility, and impact on clinician well-being in real-world environments are lacking.

Objective To evaluate the safety, utilization, and impact on clinician burden of MedAgentBrief, an LLM-based agentic workflow for generating hospital course summaries, during prospective clinical deployment.

Design, Setting, and Participants Single-arm prospective pilot study encompassing 384 hospital discharges at one academic inpatient medicine unit from August 1 to October 11, 2025, with baseline comparisons drawn from April 9 to July 31, 2025.

Intervention MedAgentBrief, a custom agentic AI workflow utilizing Gemini 2.5 Pro, generated draft hospital course summaries nightly using the patient’s history and physical and daily progress notes. Drafts were securely emailed to physicians daily for review and optional use.

Main Outcomes and Measures The primary outcome was physician-reported potential for and severity of harm from unedited summaries (AHRQ Common Format Harm Scale). Secondary outcomes included utilization rate, error types (omissions, inaccuracies, hallucinations), time spent in discharge summaries (EHR logs), and changes in cognitive burden (NASA Task Load Index [NASA-TLX]) and burnout (Stanford Professional Fulfillment Index [PFI] Work Exhaustion Scale).

Results The system generated 1274 summaries. Of 384 discharges, physicians utilized AI content in 219 (57%) cases. Feedback on 100 summaries (40.2%) noted omissions (25%) and inaccuracies (20%) but rare hallucinations (2%). Physicians rated 88% of unedited summaries as having no harm potential and 1% as likely to cause moderate harm; no severe harm was reported. Physician burnout scores decreased significantly (1.75 vs 1.20; P = .03). Time savings were heterogeneous: 71% of physicians saw reductions in median documentation time (up to 2.9 minutes).

Conclusions and Relevance An LLM-based agentic workflow produced hospital course summaries that were frequently utilized with mild to minimal risk of harm identified. The intervention was associated with a significant reduction in physician burnout, supporting the viability of AI summarization to mitigate documentation burden.

Adoption and Use of LLMs at an Academic Medical Center

Nigam H. Shah, Nerissa Ambers, Abby Pandya, Timothy Keyes, Juan M. Banda, Srikar Nallan, Carlene Lugtu, Artem A. Trotsyuk, Suhana Bedi, Alyssa Unell, Miguel Fuentes, Francois Grolleau, Sneha S. Jain, Jonathan Chen, Dev Dash, Danton Char, et al.

arXiv · 2026

PDF Link
Abstract While large language models (LLMs) can support clinical documentation needs, standalone tools struggle with “workflow friction” from manual data entry. We developed ChatEHR, a system that enables the use of LLMs with the entire patient timeline spanning several years. ChatEHR enables automations (static combinations of prompts and data that perform a fixed task) and interactive use in the electronic health record (EHR) via a user interface (UI). The resulting ability to sift through patient medical records for diverse use cases such as pre-visit chart review, screening for transfer eligibility, monitoring for surgical site infections, and chart abstraction redefines LLM use as an institutional capability. This system, accessible after user training, enables continuous monitoring and evaluation of LLM use. In 1.5 years, we built 7 automations and 1075 users have trained to become routine users of the UI, engaging in 23,000 sessions in the first 3 months of launch. For automations, being model-agnostic and accessing multiple types of data was essential for matching specific clinical or administrative tasks with the most appropriate LLM. Benchmark-based evaluations proved insufficient for monitoring and evaluation of the UI, requiring new methods to monitor performance. Generation of summaries was the most frequent task in the UI, with an estimated 0.73 hallucinations and 1.60 inaccuracies per generation. The resulting mix of cost savings, time savings, and revenue growth required a value assessment framework to prioritize work as well as quantify the impact of using LLMs. Initial estimates are $6M savings in the first year of use, without quantifying the benefit of the better care offered. Such a “build-from-within” strategy provides an opportunity for health systems to maintain agency via a vendor-agnostic, internally governed LLM platform.

Monitoring Deployed AI Systems in Health Care

Timothy Keyes†, Alison Callahan†, Abby S. Pandya†, Nerissa Ambers, Juan M. Banda, Miguel Fuentes, Carlene Lugtu, Pranav Masariya, Srikar Nallan, Connor O’Brien, Thomas Wang, Emily Alsentzer, Jonathan H. Chen, Dev Dash, Matthew A. Eisenberg, Patricia Garcia, Nikesh Kotecha, Anurang Revri, Michael A. Pfeffer, Nigam H. Shah, Sneha S. Jain

arXiv · 2025

PDF Link
Abstract Post-deployment monitoring of artificial intelligence (AI) systems in health care is essential to ensure their safety, quality, and sustained benefit, and to support governance decisions about which systems to update, modify, or decommission. Motivated by these needs, we developed a framework for monitoring deployed AI systems grounded in the mandate to take specific actions when they fail to behave as intended. This framework, which is now actively used at Stanford Health Care, is organized around three complementary principles: system integrity, performance, and impact. System integrity monitoring focuses on maximizing system uptime, detecting runtime errors, and identifying when changes to the surrounding IT ecosystem have unintended effects. Performance monitoring focuses on maintaining accurate system behavior in the face of changing health care practices (and thus input data) over time. Impact monitoring assesses whether a deployed system continues to have value in the form of benefit to clinicians and patients. Drawing on examples of deployed AI systems at our academic medical center, we provide practical guidance for creating monitoring plans based on these principles that specify which metrics to measure, when those metrics should be reviewed, who is responsible for acting when metrics change, and what concrete follow-up actions should be taken, for both traditional and generative AI. We also discuss challenges to implementing this framework, including the effort and cost of monitoring for health systems with limited resources and the difficulty of incorporating data-driven monitoring practices into complex organizations where conflicting priorities and definitions of success often coexist. This framework offers a practical template and starting point for health systems seeking to ensure that AI deployments remain safe and effective over time.

Use of a large language model integrated within the electronic medical record for the evaluation of surgical site infections

Eugenia Miranti, Timothy Keyes, Alvaro Ayala, Nerissa Ambers, Gina Newman, Elmer de Leon, Erika Paola Viana-Cardenas, Wajeeha Tariq, Mindy Sampson, and Jorge L. Salinas

Infection Control & Hospital Epidemiology · 2025

PDF Link
Abstract

Our study evaluated a large language model (gpt-4o-mini) for surgical site infection (SSI) adjudication, achieving 100% sensitivity but 69.4% specificity. While reducing the manual screening workload by 66%, the agent generated many false positives, underscoring the need for refined models to improve specificity without compromising accuracy.
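
For reference, sensitivity and specificity follow directly from confusion-matrix counts. The sketch below uses invented counts, not data from the study, and the function name is hypothetical.

```python
def screening_metrics(tp, fp, fn, tn):
    """Compute standard screening metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives relative to a human reference standard.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were flagged
        "specificity": tn / (tn + fp),  # share of non-cases that were cleared
        "ppv": tp / (tp + fp),          # share of flags that were true cases
        "npv": tn / (tn + fn),          # share of clears that were non-cases
    }


# Invented counts for illustration only.
metrics = screening_metrics(tp=8, fp=5, fn=2, tn=35)
```

A tool tuned for high sensitivity flags nearly every true case at the cost of more false positives, which is the trade-off the abstract reports.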

MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries

Francois Grolleau, Emily Alsentzer, Timothy Keyes, Philip Chung, Akshay Swaminathan, Asad Aali, Jason Hom, Tridu Huynh, Thomas Lew, April S. Liang, Weihan Chu, Natasha Z. Steele, Christina F. Lin, Jingkun Yang, Kameron C. Black, Stephen P. Ma, Fateme N. Haredasht, Nigam H. Shah, Kevin Schulman, Jonathan H. Chen

Biocomputing 2026: Proceedings of the Pacific Symposium · 2025

PDF Link
Abstract Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an LLM Jury–a multi-LLM majority vote–assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen’s κ = 81%), a performance statistically non-inferior to that of a single human expert (κ = 67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
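
As an illustration only, the jury-and-agreement idea in this abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the MedFactEval implementation: the function names, the boolean "fact included" encoding, and the toy votes are all hypothetical, and the paper's actual prompts, models, and statistics are not reproduced here.

```python
# Hypothetical sketch: majority vote over several jurors, then Cohen's kappa
# between the jury's verdicts and a reference panel. All data below are toy values.

def majority_vote(votes):
    """Return True when more than half of the jurors say the key fact is included."""
    return sum(votes) > len(votes) / 2

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (True/False) ratings."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance if the two raters were independent.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)  # undefined when expected == 1

# One row per key fact, one entry per (hypothetical) LLM juror.
jury_votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, True, False],
]
jury = [majority_vote(v) for v in jury_votes]
reference = [True, False, True, True]  # toy stand-in for the physician panel
kappa = cohens_kappa(jury, reference)
```

Using an odd number of jurors avoids ties in the vote; with an even number, a tie-breaking rule would have to be chosen.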

Using secure artificial intelligence agents integrated within the electronic medical record for the evaluation of blood culture appropriateness

Guillermo Rodriguez-Nava, Timothy Keyes, Nerissa Ambers, Eugenia Miranti, Erika Paola Viana-Cardenas, Wajeeha Tariq, Mindy Marie Sampson, Jorge Luis Salinas

Infection Control & Hospital Epidemiology · 2025

PDF Link
Abstract We evaluated large language model (LLM)-based agents integrated with the electronic medical record to assess blood culture appropriateness. While sensitivity was high, specificity remained low. Performance was shaped by prompt phrasing, sycophantic behavior, and semantic triggers, reflecting both the potential and limitations of LLMs in real-world clinical decision support.

Target Product Profile to Evaluate the Clinical Utility, Financial Impact, and Ethical Implications of an AI-Based HCM Detection Model

Shyon Parsa, Timothy Keyes, Dev Dash, Michelle Mello, Heidi Salisbury, Alison Callahan, Shinichi Goto, Michael Salerno, Victoria Parikh, Kenneth Mahaffey, Euan Ashley, Nigam Shah, Sneha Jain

Circulation · 2025

Link
Abstract Hypertrophic cardiomyopathy (HCM) remains underdiagnosed despite effective therapies and accessible screening with electrocardiogram (ECG) and echocardiography. Multiple artificial intelligence (AI) tools show promise in identifying missed HCM cases; however, the path from a promising model to clinical impact remains unclear. Without clear performance thresholds and workflow integration parameters, health systems face uncertainty about which tool to adopt and how to responsibly deploy it. We propose the use of Target Product Profiles (TPPs), an extension of the Fair, Useful, Reliable (AI) Models (FURM) Assessment framework, to define the minimum and ideal requirements for AI tools while incorporating resource, financial, and ethical considerations under real-world constraints. We developed a TPP to guide evaluation of an AI-augmented program for improving HCM diagnosis. Using APLUS, a discrete-event simulation engine, we simulated an HCM screening workflow for 134,856 eligible patients within Stanford Health Care, a multi-hospital health system in California. The diagnostic workflow included primary care, echocardiography, triage, and HCM specialty clinic referral. We simulated multiple combinations of model sensitivity (0.5–0.975) and specificity (0.85–0.99), incorporating resource constraints (e.g., HCM clinic capacity) and utility weights reflecting diagnostic delay, misdiagnosis, and mortality. Financial modeling included AI deployment costs and downstream care utilization. Ethical analysis was conducted through stakeholder interviews exploring issues such as perceived risks and benefits, equity, and patient consent. In our simulations, AI models with specificity ≥0.9 reduced HCM-related mortality using the proposed workflow, while lower specificity cutoffs overwhelmed referral capacity with false positive results (Figure 1). With a simulated 50% increase in HCM clinic capacity, a specificity of ≥0.85 was sufficient to achieve benefit.
Financial models showed cost-effectiveness concentrated in true positive cases and a net positive effect for the hospital at low false-positive rates (Figure 2). Ethical review highlighted concerns and mitigation strategies around access disparities, patient anxiety from alerts, and subgroup representation. For HCM, a TPP integrating workflow modeling, financial constraints, and ethical insights may help clarify necessary performance metrics in context–offering a roadmap for actionable, deployment-ready AI-augmented programs.

Holistic evaluation of large language models for medical tasks with MedHELM

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, et al.

Nature Medicine · 2025

PDF Link
Abstract

While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks–clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis) and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, systematic comparison of nine frontier LLMs–Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3 and o3-mini–using an automated LLM-jury evaluation method. Our LLM-jury uses multiple AI evaluators to assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.

DHODH as a Targetable Metabolic Achilles’ Heel for chemo-resistant B-ALL

Yuxuan Liu, Haowen Jiang, Jingjing Liu, Lucille Stuani, Milton J Merchant, Astraea Jager, Abhishek Koladiya, Ti-Cheng Chang, Pablo Domizi, Jolanda Sarno, Ao Wang, Timothy Keyes, Dorra Jedoui, Jodie Meng, Felix Hartmann, Ruida Hou, Carol Fries, Chiara Pirillo, Qingsong Gao, Ilaria Iacobucci, Sean C Bendall, Min Huang, Norman J Lacayo, Kathleen M Sakamoto, Charles G Mullighan, Mignon L Loh, Jiyang Yu, Jun J Yang, Jiangbin Ye, and Kara L Davis

Blood · 2026

PDF Link
Abstract

Relapse remains a major barrier to survival in B-cell acute lymphoblastic leukemia (B-ALL). Both activation of B-cell signaling pathways and increased glucose consumption have been linked to chemo-resistance and relapse risk. Here, we connect these observations, showing that B-ALL cells with active signaling, marked by high phosphorylated ribosomal protein S6 (pS6+), are glucose dependent. Isotope tracing confirms that pS6+ cells are highly glycolytic and rely on glucose for de novo nucleotide synthesis. Uridine, but not other purines or pyrimidines, rescues pS6+ cells from glucose deprivation, highlighting uridine as essential for survival. Active mTOR signaling in pS6+ cells drives de novo pyrimidine synthesis by activating CAD (Carbamoyl phosphate synthetase 2, Aspartate transcarbamylase, and Dihydroorotase), which catalyzes the first steps of de novo pyrimidine synthesis. Inhibiting signaling abolishes glucose dependency and CAD phosphorylation. Primary pS6+ cells express high levels of pyrimidine synthesis proteins, including dihydroorotate dehydrogenase (DHODH), the rate-limiting enzyme in pyrimidine synthesis. Increased DHODH expression correlates with relapse and poor event-free survival. Most B-ALL molecular subtypes exhibit DHODH activity. BAY-2402234, a DHODH inhibitor, effectively kills pS6+ cells in vitro, with IC50 values correlating with pS6 signaling strength across 14 B-ALL patient-derived xenografts (PDX). In vivo, DHODH inhibition prolongs survival and reduces leukemia burden in pS6+ B-ALL models. These findings link active signaling to pyrimidine dependency and relapse risk, highlighting DHODH inhibition as a promising therapeutic strategy for chemo-resistant B-ALL.

Annotation-free discovery of disease-relevant cells in single-cell datasets

Timothy Keyes†, Erin Craig†, Jolanda Sarno, Jeremy P. D’Silva, Pablo Domizi, Maxim Zaslavsky, Albert Tsai, et al.

Science Advances · 2025

PDF Link
Abstract In single-cell datasets, patient labels indicating disease status (e.g., sick or not sick) are typically available, but individual cell labels indicating which of a patient’s cells are associated with their disease state are generally unknown. To address this, we introduce mixture modeling for multiple-instance learning (MMIL), an expectation-maximization approach that trains cell-level binary classifiers using only patient-level labels. Applied to primary samples from patients with acute leukemia, MMIL accurately separates leukemia from nonleukemia baseline cells, including rare minimal residual disease (MRD) cells; generalizes across tissues and treatment time points; and identifies biologically relevant features with accuracy approaching that of a hematopathologist. MMIL can also incorporate cell labels when they are available, creating a robust framework for leveraging both labeled and unlabeled cells. MMIL provides a flexible modeling framework for cell classification, especially in scenarios with unknown gold-standard cell labels.

Epiregulon: Single-cell transcription factor activity inference to predict drug response and drivers of cell states

Tomasz Wlodarczyk, Aaron Lun, Diana Wu, Minyi Shi, Xiaofen Ye, Shreya Menon, Shushan Toneyan, Kerstin Seidel, Timothy Keyes, et al.

Nature Communications · 2025

PDF Link
Abstract

Transcription factors (TFs) and transcriptional coregulators are emerging therapeutic targets. Gene regulatory networks (GRNs) can evaluate pharmacological agents and identify drivers of disease, but methods that rely solely on gene expression often neglect post-transcriptional modulation of TFs. We present Epiregulon, a method that constructs GRNs from single-cell ATAC-seq and RNA-seq data for accurate prediction of TF activity. This is achieved by considering the co-occurrence of TF expression and chromatin accessibility at TF binding sites in each cell. ChIP-seq data allows motif-agnostic activity inference of transcriptional coregulators or TFs harboring neomorphic mutations. Epiregulon accurately predicted the effects of AR inhibition across different drug modalities including an AR antagonist and an AR degrader, delineated the mechanisms of a SMARCA4 degrader by identifying context-dependent interaction partners, and prioritized drivers of lineage reprogramming and tumorigenesis. By mapping gene regulation across various cellular contexts, Epiregulon can accelerate the discovery of therapeutics targeting transcriptional regulators.

The tidyomics ecosystem: enhancing omic data analyses

Timothy Keyes†, William J. Hutchison†, Helena L. Crowell, Jacques Serizay, Charlotte Soneson, Eric S. Davis, Noriaki Sato, Lambda Moses, Boyd Tarlinton, Abdullah A. Nahid, Miha Kosmac, Quentin Clayssen, Victor Yuan, Wancen Mu, Ji-Eun Park, Izabela Mamede, Min Hyung Ryu, Pierre-Paul Axisa, Paulina Paiz, Chi-Lam Poon, Ming Tang, Raphael Gottardo, Martin Morgan, Stuart Lee, Michael Lawrence, Stephanie C. Hicks, Garry P. Nolan, Kara L. Davis, Anthony T. Papenfuss, Michael I. Love, Stefano Mangiola

Nature Methods · 2024

PDF Link
Abstract

The growth of omic data presents evolving challenges in data manipulation, analysis and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary data organization and manipulation standard. Here we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analyzing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.

IFN-gamma-Expressing Myeloid Cells Localize within Lipoproteinosis during Drug-Associated Pulmonary Alveolar Proteinosis occurring in Systemic Juvenile Idiopathic Arthritis

Alea Delmastro, Candace Liu, Xiao-Wen Ding, Serena Tan, Inna Averbukh, Marc Bosse, Timothy Keyes, et al.

bioRxiv · 2024

PDF Link
Abstract

In the United States, approximately one in 1000 children are diagnosed with the autoinflammatory disease, Juvenile Idiopathic Arthritis (JIA). A subset of JIA cases manifests as Systemic JIA (sJIA), which is characterized by joint pain, fevers, rashes, and systemic inflammation. Severe pulmonary complications have not historically been associated with sJIA. Since 2010, inhibitors of interleukin-1 and interleukin-6 (IL-1i/IL-6i) are the recommended course of treatment for sJIA, yet recent studies show evidence of a severe drug hypersensitivity reaction implicating these medications in a subset of those treated. With this reaction, sJIA patients can develop severe lung disease, including pulmonary alveolar proteinosis (PAP). As this drug-associated lung disease has only recently been identified, the etiology of sJIA drug-associated PAP (sJIA-daPAP) is poorly understood. We used multiplexed ion beam imaging by time-of-flight (MIBI-TOF) to define the cellular immune infiltrate and describe pathological features of PAP in sJIA-daPAP patients. We found an enrichment of eosinophils, neutrophils, and M2 macrophages within regions of lipoproteinosis. These enriched subsets all upregulate IFNγ within lipoproteinosis, a signature specific to sJIA-daPAP samples compared to non-sJIA-PAP samples. In a cellular neighborhood analysis, we identified that eosinophils, neutrophils and M2 macrophages frequently co-localize within the same cellular microenvironment, especially in lipoproteinosis regions. Therefore, this spatial coordination may be involved in clearance or persistence of lipoproteinosis in sJIA-daPAP. This study provides a comprehensive overview of sJIA-daPAP immune pathology and suggests cellular mechanisms that drive inflammation in sJIA patients experiencing pulmonary complications associated with delayed drug hypersensitivity during IL-1i/IL-6i treatment.

tidytof: a user-friendly framework for scalable and reproducible high-dimensional cytometry data analysis

Timothy Keyes, Abhishek Koladiya, Yu-Chen Lo, Garry P. Nolan, Kara L. Davis

Bioinformatics Advances · 2023

PDF Link
Abstract While many algorithms for analyzing high-dimensional cytometry data have now been developed, the software implementations of these algorithms remain highly customized–this means that exploring a dataset requires users to learn unique, often poorly interoperable package syntaxes for each step of data processing. To solve this problem, we developed {tidytof}, an open-source R package for analyzing high-dimensional cytometry data using the increasingly popular ‘tidy data’ interface.

Single-cell technologies uncover intra-tumor heterogeneity in childhood cancers

Yu-Chen Lo, Yuxuan Liu, Marte Kammersgaard, Abhishek Koladiya, Timothy Keyes, Kara L. Davis

Seminars in Immunopathology · 2023

PDF Link
Abstract

Childhood cancer is the second leading cause of death in children aged 1 to 14. Although survival rates have vastly improved over the past 40 years, cancer resistance and relapse remain a significant challenge. Advances in single-cell technologies enable dissection of tumors to unprecedented resolution. This facilitates unraveling the heterogeneity of childhood cancers to identify cell subtypes that are prone to treatment resistance. The rapid accumulation of single-cell data from different modalities necessitates the development of novel computational approaches for processing, visualizing, and analyzing single-cell data. Here, we review single-cell approaches utilized or under development in the context of childhood cancers. We review computational methods for analyzing single-cell data and discuss best practices for their application. Finally, we review the impact of several studies of childhood tumors analyzed with these approaches and future directions to implement single-cell studies into translational cancer research in pediatric oncology.

Improved Relapse Prediction in Pediatric Acute Myeloid Leukemia By Deconvolving Lineage-Specific and Cancer-Specific Features in Single-Cell Data

Timothy Keyes, Astraea Jager, Mason Krueger, Sylvia Plevritis, Robert Tibshirani, Richard Aplenc, et al.

Blood · 2022

Link
Abstract

Introduction

While most children with acute myeloid leukemia (AML) achieve first remission, nearly 40% will relapse. Of these children, few survive to a second remission even with highly-escalated treatment protocols. Recent studies have shown that many AML patients harbor rare, stem cell-like subpopulations that resist chemotherapy and drive relapse. However, the exact characteristics of these relapse-associated cells are a matter of contention, with reported phenotypes spanning the hematopoietic developmental continuum. In some patients, treatment-resistant cells can be detected as minimal residual disease (MRD), which is often used to predict relapse, albeit with limited accuracy and only after induction chemotherapy. Thus, the identity of treatment-resistant cells as well as their relationship to normal progenitors remain mysterious, thereby limiting the development of targeted therapies for pediatric AML.

Here, we present a computational approach for decomposing high-dimensional single-cell measurements into two components: a lineage-specific component that can be used to align cancer cells with specific stages of myeloid development and a cancer-specific component that can be used to identify aberrant phenotypes unique to AML cells. We show that, together, these components can be used at the time of diagnosis to predict relapse more accurately than clinical information alone.

Methods and Results

Using mass cytometry, we analyzed paired diagnostic and post-induction samples collected from 19 (8 relapse, 11 non-relapse) pediatric patients who enrolled on the Children’s Oncology Group trial AAML1031. All patients were treated on the control arm and consented to banking of tissue for research. We also included 5 bone marrow samples from healthy donors. After thawing, samples were divided in half and stimulated with conditioned medium from the human bone marrow stromal cell line HS-5 to activate relevant signaling pathways, or left unstimulated. An average of 5 × 10^5 cells per patient were analyzed for each condition. The mass cytometry panel included 31 antibodies to surface markers, 6 antibodies to intracellular signaling mediators, and 4 antibodies to intracellular proteins and transcription factors.

Following data collection, the singular value decomposition was applied to the data matrix of healthy single-cell measurements to construct a linear subspace representing the predominant protein expression programs (“eigencells”) within the healthy myeloid developmental continuum. By projecting AML single-cell measurements onto this subspace, we derived a healthy feature vector aligned with the healthy subspace and a cancer-specific feature vector orthogonal to the healthy subspace for each AML cell. These feature vectors, along with clinical metadata about each patient (age, blast percentage at diagnosis, and cytogenetic status), were used as the input to regularized Cox proportional hazards models predicting time-to-relapse for each patient. Using the relative risk scores from the proportional hazards model, patients were assigned to high-risk or low-risk groups according to the optimal log-rank test threshold.
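
The subspace projection described above can be illustrated with a short NumPy sketch. It is a minimal sketch under stated assumptions, not the authors' code: the function names, the choice to center on the healthy mean, the number of retained components `k`, and the random toy matrices are all hypothetical, and the downstream Cox modeling is omitted.

```python
import numpy as np

def healthy_subspace(healthy, k):
    """Fit a k-dimensional subspace ("eigencells") to healthy cells via SVD.

    healthy: array of shape (n_cells, n_markers).
    Centering on the healthy mean is an assumption made for this sketch.
    """
    mean = healthy.mean(axis=0)
    _, _, vt = np.linalg.svd(healthy - mean, full_matrices=False)
    return mean, vt[:k]  # rows of vt[:k] are orthonormal basis vectors

def decompose(cells, mean, basis):
    """Split each cell into a component inside the healthy subspace and a residual."""
    centered = cells - mean
    lineage = (centered @ basis.T) @ basis  # projection onto the healthy subspace
    cancer = centered - lineage             # orthogonal, "cancer-specific" remainder
    return lineage, cancer

# Toy data standing in for mass cytometry measurements (hypothetical shapes).
rng = np.random.default_rng(0)
healthy_cells = rng.normal(size=(200, 6))
aml_cells = rng.normal(size=(50, 6))

mean, basis = healthy_subspace(healthy_cells, k=3)
lineage, cancer = decompose(aml_cells, mean, basis)

# One possible patient-level summary: the average cancer-specific vector.
patient_feature = cancer.mean(axis=0)
```

By construction, the two components sum to the centered measurement and the cancer-specific component has zero projection onto every retained eigencell.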

The baseline clinical model used only age, blast percentage at diagnosis, and cytogenetic status as predictors and predicted relapse status with an accuracy of 13/19 (68%). This baseline model was outperformed by the model constructed using the average value of the single-cell cancer-specific feature vectors for each patient, which predicted relapse status with an accuracy of 16/19 (84%). Interestingly, despite using only information available at diagnosis, the single-cell model also outperformed a clinical model incorporating patients’ MRD status after induction chemotherapy, which predicted relapse with an accuracy of 15/19 (79%). Interrogation of the coefficients of the single-cell feature model revealed specific cellular signaling programs associated with relapse, including enhanced pCreb and pSTAT1 signaling as well as depleted pSTAT5 signaling relative to healthy lineage cells (Figure 1).

Conclusions

These results support the feasibility of predicting relapse in AML as early as diagnosis by leveraging a computational approach that compares cancer cells to the native lineage from which they arise. Validation of this approach in an independent cohort is ongoing and will be presented.

CytofIn enables integrated analysis of public mass cytometry datasets using generalized anchors

Yu-Chen Lo, Timothy Keyes, Astraea Jager, Jolanda Sarno, Pablo Domizi, Ravindra Majeti, Kathleen M. Sakamoto, Norman Lacayo, Charles G. Mullighan, Jeffrey Waters, Bita Sahaf, Sean C. Bendall, Kara L. Davis

Nature Communications · 2022

PDF Link
Abstract

The increasing use of mass cytometry for analyzing clinical samples offers the possibility to perform comparative analyses across public datasets. However, challenges in batch normalization and data integration limit the comparison of datasets not intended to be analyzed together. Here, we present a data integration strategy, CytofIn, using generalized anchors to integrate mass cytometry datasets from the public domain. We show that low-variance controls, such as healthy samples and stable channels, are inherently homogeneous, robust against stimulation, and can serve as generalized anchors for batch correction. Single-cell quantification comparing mass cytometry data from 989 leukemia files pre- and post normalization with CytofIn demonstrates effective batch correction while recapitulating the gold-standard bead normalization. CytofIn integration of public cancer datasets enabled the comparison of immune features across histologies and treatments. We demonstrate the ability to integrate public datasets without necessitating identical control samples or bead standards for fast and robust analysis using CytofIn.

A cancer biologist’s primer on machine learning applications in high-dimensional cytometry

Timothy Keyes, Pablo Domizi, Yu-Chen Lo, Garry P. Nolan, Kara L. Davis

Cytometry Part A · 2020

PDF Link
Abstract The application of machine learning and artificial intelligence to high-dimensional cytometry data sets has increasingly become a staple of bioinformatic data analysis over the past decade. This is especially true in the field of cancer biology, where protocols for collecting multiparameter single-cell data in a high-throughput fashion are rapidly developed. As the use of machine learning methodology in cytometry becomes increasingly common, there is a need for cancer biologists to understand the basic theory and applications of a variety of algorithmic tools for analyzing and interpreting cytometry data. We introduce the reader to several keystone machine learning-based analytic approaches with an emphasis on defining key terms and introducing a conceptual framework for making translational or clinically relevant discoveries. The target audience consists of cancer cell biologists and physician-scientists interested in applying these tools to their own data, but who may have limited training in bioinformatics.

Progressive B Cell Loss in Revertant X-SCID

Connie H. Lin, Hye Sun Kuehn, Timothy J. Thauland, Christine M. Lee, Suk See De Ravin, Harry L. Malech, Timothy Keyes, Astraea Jager, Kara L. Davis, Maria I. Garcia-Lloret, Sergio D. Rosenzweig, Manish J. Butte

Journal of Clinical Immunology · 2020

PDF Link
Abstract

We report the case of a patient with X-linked severe combined immunodeficiency (X-SCID) who survived for over 20 years without hematopoietic stem cell transplantation (HSCT) because of a somatic reversion mutation. An important feature of this rare case included the strategy to validate the pathogenicity of a variant of the IL2RG gene when the T and B cell lineages comprised only revertant cells. We studied the X-inactivation of sorted T cells from the mother to show that the pathogenic variant was indeed the cause of his SCID. One interesting feature was a progressive loss of B cells over 20 years. CyTOF (cytometry time of flight) analysis of bone marrow offered a potential explanation of the B cell failure, with expansions of progenitor populations that suggest a developmental block. Another interesting feature was that the patient bore extensive granulomatous disease and skin cancers that contained T cells, despite severe T cell lymphopenia in the blood. Finally, the patient had a few hundred T cells on presentation but his TCRs comprised a very limited repertoire, supporting the important conclusion that repertoire size trumps numbers of T cells.

Structural and functional features of central nervous system lymphatic vessels

Antoine Louveau, Igor Smirnov, Timothy Keyes, Jacob D. Eccles, Sherin J. Rouhani, J. David Peske, Noel C. Derecki, David Castle, James W. Mandell, Kevin S. Lee, Tajie H. Harris, Jonathan Kipnis

Nature · 2015

PDF Link
Abstract We discovered functional lymphatic vessels lining the dural sinuses. The discovery of the central nervous system lymphatic system may call for a reassessment of basic assumptions in neuroimmunology.

Sociodemographic factors and research experience impact MD-PhD program acceptance

Darnell K. Adrian Williams, Briana Christophers, Timothy Keyes, Rachit Kumar, Michael C. Granovetter, Alexandria Adigun, Justin Olivera, Jehron Pura-Bryant, Chynna Smith, Chiemeka Okafor, Mahlet Shibre, Dania Daye, Myles H. Akabas

JCI Insight · 2024

PDF Link
Abstract

The 2014 NIH Physician-Scientist Workforce Working Group predicted a future shortage of physician-scientists. Subsequent studies have highlighted disparities in MD-PhD admissions based on race, income, and education. Our analysis of data from the Association of American Medical Colleges covering 2014–2021 (15,156 applicants and 6,840 acceptees) revealed that acceptance into US MD-PhD programs correlates with research experience, family income, and research publications. The number of research experiences was associated with parental education and family income. Applicants were more likely to be accepted with a family income greater than $50,000 or with one or more publications or presentations. Applicants were less likely to be accepted if they had parents without a graduate degree, were Black/African American, were first-generation college students, or were reapplicants, irrespective of the number of research experiences, publications, or presentations. These findings underscore an admissions bias that favors candidates from affluent and highly educated families, while disadvantaging underrepresented minorities.

Teaching LGBTQ+ health, a web-based faculty development course: program evaluation study using the RE-AIM framework

Michael Albert Gisondi, Timothy Keyes, Shana Zucker, Deila Bumgardner

JMIR Medical Education · 2023

PDF Link
Abstract

Background: Many health professions faculty members lack training on fundamental lesbian, gay, bisexual, transgender, and queer (LGBTQ+) health topics. Faculty development is needed to address knowledge gaps, improve teaching, and prepare students to competently care for the growing LGBTQ+ population.

Objective: We conducted a program evaluation of the massive open online course Teaching LGBTQ+ Health: A Faculty Development Course for Health Professions Educators from the Stanford School of Medicine. Our goal was to understand participant demographics, impact, and ongoing maintenance needs to inform decisions about updating the course.

Methods: We evaluated the course for the period from March 27, 2021, to February 24, 2023, guided by the RE-AIM (Reach, Effectiveness, Adoption, Implementation, and Maintenance) framework. We assessed impact using participation numbers, evidence of learning, and likelihood of practice change. Data included participant demographics, performance on a pre- and postcourse quiz, open-text entries throughout the course, continuing medical education (CME) credits awarded, and CME course evaluations. We analyzed demographics using descriptive statistics and pre- and postcourse quiz scores using a paired 2-tailed t test. We conducted a qualitative thematic analysis of open-text responses to prompts within the course and CME evaluation questions.

Results: Results were reported using the 5 framework domains. Regarding Reach, 1782 learners participated in the course, and 1516 (85.07%) accessed it through a main course website. Of the different types of participants, most were physicians (423/1516, 27.9%) and from outside the sponsoring institution and target audience (1452/1516, 95.78%). Regarding Effectiveness, the median change in test scores for the 38.1% (679/1782) of participants who completed both the pre- and postcourse tests was 3 out of 10 points, or a 30% improvement (P<.001). Themes identified from CME evaluations included LGBTQ+ health as a distinct domain, inclusivity in practices, and teaching LGBTQ+ health strategies. A minority of participants (237/1782, 13.3%) earned CME credits. Regarding Adoption, themes identified among responses to prompts in the course included LGBTQ+ health concepts and instructional strategies. Most participants strongly agreed with numerous positive statements about the course content, presentation, and likelihood of practice change. Regarding Implementation, the course cost US $57,000 to build and was intramurally funded through grants and subsidies. The course faculty spent an estimated 600 hours on the project, and educational technologists spent another 712 hours. Regarding Maintenance, much of the course is evergreen, and ongoing oversight and quality assurance require minimal faculty time. New content will likely include modules on transgender health and gender-affirming care.

Conclusions: Teaching LGBTQ+ Health improved participants’ knowledge of fundamental queer health topics. Overall participation has been modest to date. Most participants indicated an intention to change clinical or teaching practices. Maintenance costs are minimal. The web-based course will continue to be offered, and new content will likely be added.

Sexual and gender minority identity disclosure from undergraduate to graduate medical education: perceptions of professional Outness among Medical Students

Timothy Keyes, Shana Zucker, Teddy G. Goetz, Justin L. Jia, Samuel R. Bunting, Mitchell R. Lunn, Leslee L. Subak

Annals of LGBTQ Public and Population Health · 2022

Link
Abstract

Increasingly, medical schools and residency programs seek to recruit trainees from diverse backgrounds, including sexual and gender minority (SGM) people. However, many trainees do not disclose their SGM identity during medical training due to fear of discrimination, which remains a challenge for institutional diversity and inclusion efforts. Despite this, relatively few studies have rigorously quantified trainees’ SGM identity self-disclosure across different stages of medical training. In 2018 and 2019, the Medical Student Pride Alliance (MSPA) distributed a 33-item online questionnaire interrogating practices and attitudes about SGM identity disclosure to medical students at allopathic and osteopathic medical schools in the United States. Here, we analyze these data to compare 1) the degree to which medical students disclose SGM identity in various professional contexts during undergraduate and graduate medical training and 2) students’ attitudes regarding SGM identity disclosure across those contexts. Overall, 1,162 medical students from 125 medical schools responded to the survey. Of these respondents, 629 (54%) were SGM-identified. Among SGM-identified respondents, students were most likely to report SGM identity self-disclosure to peers (91%) and least likely to report SGM identity self-disclosure on applications to residency or post-doctoral work (29%). Cisgender women were less likely to report SGM identity self-disclosure than other genders, and students performing research were more likely to report SGM identity self-disclosure among mentors. Overall, most (>90%) survey respondents supported trainees’ ability to disclose their sexual orientation or gender identity during medical training. This exploratory study provides preliminary evidence that SGM-identifying medical students often do not disclose their sexual orientation or gender identity in evaluative professional contexts. Future work should assess this phenomenon in a larger national sample and propose targeted policies to support SGM inclusion throughout medical training in general and on applications to graduate medical education specifically.

Student Education About Pre-Exposure Prophylaxis (PrEP) Varies Between Regions of the United States

Samuel R. Bunting, Sarah S. Garber, Robert H. Goldstein, Timothy D. Ritchie, Tamzin J. Batteson, Timothy Keyes

Journal of General Internal Medicine · 2020

PDF Link
Abstract

Background Daily, oral pre-exposure prophylaxis (PrEP) is an effective and safe prevention strategy for people at risk for HIV. However, prescription of PrEP has been limited for patients at the highest risk. Disparities in PrEP prescription are pronounced among racial and gender minority patients. A significant body of literature indicates that practicing healthcare providers have little awareness and knowledge of PrEP. Very little work has investigated the education about PrEP among health professionals in training.

Objective The objective of this study was to compare health professions students’ awareness of PrEP and education about PrEP between regions of the US, and to determine if correlations between regional HIV incidence and PrEP use were present.

Design Survey study.

Participants A cross-sectional sample of health professions students (N = 1859) representing future prescribers (MD, DO, PA), pharmacists, and nurses in the US.

Key Results Overall, 83.4% of students were aware of PrEP, but only 62.2% of fourth-year students indicated they had been taught about PrEP at any time during their training. Education about PrEP was most comprehensive in the Northeastern US, the area with the highest PrEP to need ratio (4.7). In all regions, transgender patients and heterosexual men and women were least likely to be presented in education as PrEP candidates, and men who have sex with men were the most frequently presented.

Conclusions There are marked differences in education regarding PrEP both between academic programs and between regions of the US.

Medical Student Pride Alliance: The First National LGBTQ+ Medical Student Affinity Organization

Teddy G. Goetz, Shana Zucker, Timothy Keyes, Michael Gisondi

Medical Education · 2020

PDF Link
Abstract This paper documents how the Medical Student Pride Alliance (MSPA), the first national affinity organization for LGBTQ+ medical students, was founded in the United States in 2018.