Abstract:
Background: Large language models (LLMs) are increasingly used in oncology clinical workflows; however, real-world clinician adoption has outpaced formal institutional guidance and systematic safety validation. Existing evaluations often rely on aggregate performance metrics, which may obscure disease-specific safety risks. We sought to (1) characterize real-world AI use and verification behaviors among oncology clinicians and (2) assess disease-dependent safety failures of LLM-based clinical decision support systems.
Methods: We conducted a voluntary, anonymous cross-sectional survey of oncology clinicians that assessed AI access, patterns of use, verification behaviors, perceived accountability, and responses to a standardized oncology clinical scenario. Separately, we developed 216 simulated tumor-board vignettes across five oncology domains: leukemia (n=30), breast (n=50), gastrointestinal (n=50), central nervous system (CNS) metastases (n=50), and gynecologic malignancies (n=50). Each vignette was evaluated using three LLM configurations: (1) unconstrained LLM, (2) NCCN guideline-anchored retrieval-augmented generation (RAG), and (3) literature-anchored RAG. Outputs were independently scored by two board-certified oncologists using a modified Generative Performance Score (mGPS; −1 to +1), incorporating guideline concordance and hallucination penalties. Safety disparity for each output was conservatively defined as the highest severity observed on any scoring axis.
Results: Thirty-one clinicians completed the survey, including fellows (45%) and attending oncologists (29%), representing academic and community practice settings. Despite limited or uncertain access to institution-approved AI tools, nearly all respondents reported independent AI use for professional tasks. Most clinicians reported routinely verifying AI outputs against guidelines or primary literature and maintaining clinician-centered accountability for AI-related errors, while formal institutional governance was frequently absent.
In the vignette-based evaluation, NCCN-anchored RAG demonstrated improved guideline concordance and reduced hallucinations compared with the unconstrained LLM; however, safety performance varied substantially by disease context. Leukemia vignettes showed predominantly low-to-intermediate safety disparity (93%), whereas high disparity was observed in CNS metastases (80%) and gynecologic malignancies (70%), driven by concurrent hallucinations, staging errors, and inappropriate extrapolation. Readability scores did not correlate with safety, and readable outputs frequently obscured clinically significant errors.
Conclusions: Oncology clinicians are already integrating AI into clinical practice with high levels of independent verification but limited institutional oversight. LLM safety is strongly disease-dependent and inadequately captured by aggregate accuracy metrics. Disease-stratified validation frameworks incorporating guideline concordance and hallucination detection are necessary to inform responsible clinical deployment. These findings support the need for clinician-led governance and disease-specific risk stratification prior to broad adoption of AI decision support in oncology.
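
The conservative safety-disparity rule stated in Methods (the highest severity across scoring axes) can be illustrated with a minimal sketch. The three-level scale mirrors the low, intermediate, and high categories reported in Results; the axis names, the `Severity` type, and the `safety_disparity` function are hypothetical names chosen for illustration and are not taken from the study's scoring instrument.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Ordered severity levels (assumed three-level scale)."""
    LOW = 0
    INTERMEDIATE = 1
    HIGH = 2


def safety_disparity(axis_severities: dict[str, Severity]) -> Severity:
    """Return the worst (highest) severity found on any scoring axis."""
    if not axis_severities:
        raise ValueError("at least one scoring axis is required")
    # Conservative rule: a single high-severity axis dominates the result.
    return max(axis_severities.values())


# Hypothetical scores for one LLM output; axis names are illustrative only.
example = {
    "guideline_concordance": Severity.LOW,
    "hallucination": Severity.HIGH,
    "staging": Severity.INTERMEDIATE,
}
print(safety_disparity(example).name)
```

Under this rule an output is rated by its worst axis rather than by an average, so one severe hallucination is not offset by good performance elsewhere.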

