Building Scalable Autonomous Prompt Optimization for Clinical Workflows to Aid Physicians

Feb 12, 2026

Advisor: Dr. Rintu Kutum

Students: Tejasdeep Singh

Machine Learning, Healthcare AI

Abstract

This project develops a scalable, privacy-preserving autonomous prompt optimization framework for clinical documentation tasks. By adapting the ProTeGi (Prompt Optimization with Textual Gradients) methodology to medical summarization, the system leverages local open-source large language models and an LLM-as-a-Judge evaluator to iteratively refine prompts based on clinical accuracy rather than generic text similarity. Evaluated on radiology report summarization, the framework demonstrates the ability to evolve from generic instructions to structured, clinically aligned prompt templates without manual prompt engineering.

Project Description

The rapid integration of large language models (LLMs) into healthcare promises significant reductions in physician documentation burden, particularly in domains such as radiology where structured reporting is essential. However, a major barrier to real-world adoption is the “one-size-fits-all” nature of generic LLM outputs, which often fail to align with the stylistic conventions and clinical priorities of individual physicians. Manual prompt engineering has emerged as a workaround, but this approach is brittle, unscalable, and impractical in large healthcare systems.

This capstone project addresses this gap by developing a scalable, autonomous prompt optimization framework tailored for clinical workflows. The work adapts the ProTeGi (Prompt Optimization with Textual Gradients) framework to the domain of medical text generation, specifically radiological impression summarization. Instead of relying on opaque continuous prompt embeddings or manual trial-and-error, the system operates entirely in discrete, human-readable prompt space. This keeps every candidate prompt interpretable and auditable, both of which are critical requirements in safety-sensitive clinical environments.

A key contribution of the project is the introduction of a domain-specific evaluation mechanism termed the “Medical Judge.” Rather than using conventional NLP metrics such as BLEU or ROUGE, the Medical Judge employs a secondary local LLM to assess whether generated summaries capture critical clinical information, including pathology, location, and severity. This binary clinical accuracy signal is then used within a multi-armed bandit framework to guide efficient prompt selection and refinement.
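As an illustration of how such a binary signal might be produced, the following is a minimal sketch, not the project's actual judge. The template wording, the function name judge_summary, and the YES/NO answer protocol are all assumptions made for this example; the llm argument stands in for any callable that sends a prompt to a local model and returns its text reply.

```python
from typing import Callable

# Hypothetical judge instruction; the project's actual wording is not given.
JUDGE_TEMPLATE = (
    "You are reviewing a generated radiology impression.\n"
    "Reference impression:\n{reference}\n\n"
    "Generated impression:\n{candidate}\n\n"
    "Does the generated impression capture the same pathology, location, "
    "and severity as the reference? Answer YES or NO on the first line."
)


def judge_summary(candidate: str, reference: str, llm: Callable[[str], str]) -> int:
    """Return 1 if the judge model accepts the candidate summary, else 0."""
    prompt = JUDGE_TEMPLATE.format(reference=reference, candidate=candidate)
    reply = llm(prompt).strip()
    first_line = reply.splitlines()[0].strip().upper() if reply else ""
    # Anything other than an explicit YES counts as a failure (conservative default).
    return 1 if first_line.startswith("YES") else 0
```

Treating unparseable or empty replies as failures is a design choice for this sketch: in a clinical setting it seems safer for a malformed verdict to lower a prompt's score than to raise it.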

The system is implemented as a fully privacy-preserving pipeline using open-source models deployed locally via Ollama, eliminating dependence on external APIs and mitigating data governance concerns. Experimental results on a subset of de-identified radiology reports demonstrate rapid convergence from generic prompts to structured, clinically meaningful templates within a small number of optimization rounds. Overall, this project provides a proof-of-concept for autonomous, interpretable prompt optimization in healthcare and lays the groundwork for scalable personalization of AI-assisted clinical documentation.

Objectives

Design an autonomous and interpretable prompt optimization framework suitable for clinical use.
Adapt the ProTeGi textual gradient methodology for medical text generation tasks.
Ensure a privacy-first implementation using only local, open-source language models.
Validate the approach on radiology report summarization and analyze convergence behavior.

Methodology

The proposed system treats prompt optimization as a discrete search problem inspired by gradient descent. Clinical reports are first parsed and standardized using a regex-based preprocessing pipeline designed to tolerate inconsistently formatted source data. The optimization engine then iteratively evaluates candidate prompts by generating summaries and comparing them against ground-truth impressions.
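The preprocessing step could look roughly like the sketch below. It assumes reports mark their sections with "FINDINGS:" and "IMPRESSION:" headers, which is common in radiology reports but is not confirmed by the project description; the names HEADER_RE and parse_report are hypothetical.

```python
import re
from typing import Dict

# Assumed section headers; real reports may use other labels or layouts.
HEADER_RE = re.compile(r"^\s*(findings|impression)\s*:", re.IGNORECASE | re.MULTILINE)


def parse_report(text: str) -> Dict[str, str]:
    """Split a raw report into lower-cased section names and cleaned bodies."""
    text = text.replace("\r\n", "\n")  # normalize Windows line endings
    matches = list(HEADER_RE.finditer(text))
    sections: Dict[str, str] = {}
    for i, match in enumerate(matches):
        # A section runs from the end of its header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        # Collapse runs of whitespace and line breaks into single spaces.
        sections[match.group(1).lower()] = " ".join(body.split())
    return sections
```

Under these assumptions, the "findings" text would serve as the model input and the "impression" text as the ground-truth summary used for comparison.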

Textual gradients are produced in the form of natural language critiques generated by an evaluation LLM, highlighting specific clinical deficiencies in the output. These critiques are then used to edit and refine the prompts, while Monte Carlo paraphrasing expands the search space to avoid local optima. A multi-armed bandit strategy based on the Upper Confidence Bound (UCB) algorithm allocates the evaluation budget across candidate prompts, so that more evaluations go to the prompts that appear most promising.
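To make the bandit step concrete, here is a minimal UCB sketch in which each candidate prompt is an arm and each pull is one judged summary returning a reward of 0 or 1. The exploration constant c, the function names, and the pull callback are assumptions for illustration; the project's actual variant and settings are not stated.

```python
import math
from typing import Callable, List, Tuple


def ucb_select(counts: List[int], rewards: List[float], t: int, c: float = 2.0) -> int:
    """Pick the arm with the highest upper confidence bound at round t (t >= 1)."""
    # Evaluate every arm once before trusting any estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        rewards[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(n_arms: int, pull: Callable[[int], float], budget: int) -> Tuple[List[int], List[float]]:
    """Spend `budget` evaluations across arms; return pull counts and reward sums."""
    counts = [0] * n_arms
    rewards = [0.0] * n_arms
    for t in range(1, budget + 1):
        arm = ucb_select(counts, rewards, t, c=2.0)
        counts[arm] += 1
        rewards[arm] += pull(arm)  # e.g. 1.0 if the judge accepts the summary
    return counts, rewards
```

In this sketch, a prompt whose summaries are consistently accepted accumulates most of the budget, while weaker prompts are still sampled occasionally because their confidence bonus grows as they go unvisited.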

All inference and evaluation are performed locally using open-source models (e.g., Qwen and LLaMA variants), ensuring that sensitive clinical data never leaves the system.
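A local call of this kind can be sketched with only the Python standard library against Ollama's HTTP generate endpoint on its default port. The model tag shown in the usage note is a placeholder rather than the project's confirmed choice, and the helper names build_request and generate are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally served model."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With an Ollama server running and a model already pulled, a call such as generate("qwen2.5:7b", "Summarize the findings: ...") would return the model's text; because the request targets localhost, the report text stays on the machine.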

Outcomes and Contributions

Demonstrated autonomous evolution of prompts from generic instructions to structured clinical templates.
Introduced a clinically grounded LLM-as-a-Judge evaluation mechanism.
Delivered a fully local, privacy-preserving implementation suitable for healthcare settings.
Provided empirical evidence of rapid convergence and optimization efficiency in medical summarization tasks.