Study cohorts and dataset construction
We established three cohorts, from which we built one training set and two test sets. Table 1 presents the baseline characteristics of the training and test sets, and Figure 1 illustrates the relationships between the cohorts and data sets. The training set consisted of 487,246 ultrasound images from 67,981 thyroid nodules; additionally, we incorporated 48,470 US reports and 11 diagnostic guidelines (listed in Supplementary Table 1) as language training data. To evaluate model performance more rigorously, we established two independent external test sets with distinct validation objectives. Test Set 1 comprised 3376 thyroid nodules from 2964 patients, all with definitive surgical pathological confirmation, thereby enabling assessment of ThyGPT’s diagnostic accuracy and its efficacy in reducing unnecessary fine-needle aspiration (FNA) biopsies. Test Set 2 consisted of 1263 US reports, of which 157 contained errors; this set was designed specifically to evaluate ThyGPT’s capability to detect errors in US reports. Because a substantial proportion of the cases with erroneous US reports lacked surgical pathological confirmation, Test Set 2 was not used for analyzing unnecessary biopsy reduction. The US imaging data were acquired on 65 machines; Supplementary Table 2 presents detailed statistics on these machines, and Supplementary Table 3 provides further information on the multicenter data distribution. Figure 2 shows the overall design of the ThyGPT model for assisting in the diagnosis of thyroid nodules. The model was built on the LLaMA3 framework and the transformer architecture.
Fig. 1: Enrollment of the main cohort and two external test cohorts.
The main cohort of data was sourced from Centers 1 to 4, which was utilized for building the training set. External Test Cohort 1 consisted of data from Centers 5 to 8, and these data were merged to form Test Set 1. This set was used to evaluate the performance of ThyGPT in assisting radiologists in diagnosing thyroid nodules. External Test Cohort 2 comprised data from Center 9, primarily consisting of 157 erroneous reports. These reports were merged with 1106 correct reports from Center 5 to form Test Set 2. This set was utilized to assess the ability of ThyGPT to automatically detect potential errors in US reports.
Fig. 2: Overall design of ThyGPT for assisting in the diagnosis of thyroid nodules.
a Pre-labeled nodules, guidelines, diagnostic reports, and pathological results were utilized for the training of ThyGPT. b The ThyGPT model was developed by integrating multi-head self-attention mechanisms and the transformer architecture (for more details, see Supplementary Fig. 1). c Patients undergo US examinations, and radiologists diagnose patients based on the US images, which are simultaneously input into the ThyGPT model for analysis. Radiologists engage in detailed discussions with ThyGPT regarding the condition of the nodules. d ThyGPT engages in discussions with radiologists and answers their questions. The overall performance of ThyGPT is evaluated from different perspectives.
Table 1 Baseline characteristics
The results were divided into three parts. First, we compared the performance of radiologists in diagnosing thyroid nodules under different conditions: unassisted diagnosis, assistance by traditional feature heatmaps only, and assistance by real-time conversations with ThyGPT. Second, we presented representative examples of conversations between radiologists and ThyGPT. Third, we tested ThyGPT’s ability to detect errors in US reports.
Auxiliary diagnostic performance of ThyGPT
The diagnostic performance of ThyGPT was evaluated on Test Set 1, which comprised 2964 patients and 3376 nodules, of which 1601 were malignant. To thoroughly assess ThyGPT’s capability to assist diagnosis, we conducted a detailed comparison of the diagnostic outcomes when ThyGPT was used by three junior radiologists (fewer than 5 years of experience) and three senior radiologists (more than 10 years of experience). To help radiologists use ThyGPT as an auxiliary diagnostic tool, we established the following usage rules: (1) initially, radiologists and ThyGPT conduct independent evaluations; (2) when the radiologist’s assessment is consistent with ThyGPT’s evaluation, no modification is made; (3) when the radiologist’s diagnosis differs from ThyGPT’s evaluation, the radiologist queries ThyGPT; and (4) ultimately, the radiologist decides, based on the query results, whether to accept ThyGPT’s diagnostic results and modify the final diagnosis.
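As a sketch, the four-step usage rule can be expressed as a small decision function; the function and label names below are illustrative, not part of the study software:

```python
from typing import Optional

def final_diagnosis(radiologist_dx: str, model_dx: str,
                    post_query_dx: Optional[str] = None) -> str:
    """Sketch of the ThyGPT-assisted reading protocol.

    1) Radiologist and ThyGPT evaluate independently.
    2) If they agree, the radiologist's diagnosis stands unchanged.
    3) If they disagree, the radiologist queries ThyGPT and then
       decides (post_query_dx) whether to keep or revise the call.
    """
    if radiologist_dx == model_dx:
        return radiologist_dx  # step 2: consensus, no modification
    # steps 3-4: disagreement resolved by the radiologist after querying
    return post_query_dx if post_query_dx is not None else radiologist_dx

# Consensus case: no modification.
print(final_diagnosis("malignant", "malignant"))             # malignant
# Disagreement: radiologist revises after discussing with ThyGPT.
print(final_diagnosis("benign", "malignant", "malignant"))   # malignant
```

The key design point is that the radiologist always retains the final decision; the model output alone never overrides the human assessment.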
Figure 3a depicts the receiver-operating characteristic (ROC) curve for ThyGPT, with the diagnostic outcomes of radiologists under different conditions represented as discrete points. This figure shows that referring to traditional feature heatmaps aided in improving the diagnostic performance of radiologists. However, when radiologists engaged in communication with ThyGPT, their diagnostic capability substantially improved, even surpassing the performance of ThyGPT alone.
Fig. 3: Clinical radiologists’ diagnostic results and deep learning methods’ ROC curves.
a The green curve displays the ROC curve of ThyGPT for discriminating benign and malignant nodules. The gray dots represent the independent diagnostic results of radiologists; the blue dots indicate radiologists’ diagnoses assisted by feature heatmaps only; and the orange dots denote radiologists’ diagnoses after discussing with ThyGPT. The circular dots represent junior radiologists; the square dots represent senior radiologists; and the different colored pentagrams represent the mean points of radiologists under the different diagnostic modes. We magnified the key area and drew circles centered on each mean point, with the maximum outlier distance as the radius. These circles delineate the coverage ranges of radiologists’ diagnostic results under the different conditions. Compared to independent diagnoses by radiologists (gray) and diagnoses assisted by traditional feature heatmaps (blue), radiologists assisted by the ThyGPT AI copilot exhibited significantly improved diagnostic accuracy (orange). b Histogram distribution of ThyGPT’s predicted probabilities for thyroid nodule malignancy. The horizontal axis represents the predicted scores; the vertical axis denotes the sample count; the pink bins signify benign nodules; and the green bins indicate malignant nodules. c Comparison of confusion matrices in four scenarios: radiologists’ independent diagnosis, ThyGPT’s independent diagnosis, radiologists assisted by feature heatmaps, and radiologists comprehensively assisted by ThyGPT. d ThyGPT assisted radiologists in diagnosis and reduced unnecessary FNAs. The green squares illustrate the proportion of missed FNAs for malignant tumors.
Table 2 presents the specific diagnostic performance under the different assistance modalities. Without ThyGPT, the average sensitivity and specificity of radiologists on the test set were 0.802 (95% confidence interval [CI]: 0.793–0.809) and 0.809 (95% CI: 0.802–0.817), respectively. After radiologists thoroughly communicated with ThyGPT, the average sensitivity increased to 0.893 (95% CI: 0.887–0.899), and the average specificity increased to 0.922 (95% CI: 0.917–0.927). Based on the p values for various evaluation metrics, including the true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), negative predictive value (NPV), and F1 score, the average diagnostic capability of junior radiologists reached or approached that of the AI model, while the average diagnostic performance of senior radiologists significantly surpassed that of the AI model.
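For reference, all of these metrics derive from confusion-matrix counts in the standard way. The sketch below uses illustrative counts (not the study’s per-reader data) chosen so that the resulting sensitivity and specificity land near the reported ThyGPT-assisted averages:

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard diagnostic metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)              # sensitivity / recall
    tnr = tn / (tn + fp)              # specificity
    ppv = tp / (tp + fp)              # precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of PPV and TPR
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "NPV": npv, "F1": f1}

# Illustrative counts for 3376 nodules (1601 malignant, 1775 benign);
# these are hypothetical, not taken from Table 2.
m = metrics(tp=1430, fp=138, tn=1637, fn=171)
print({k: round(v, 3) for k, v in m.items()})
```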
Table 2 The diagnostic performance under different assistance modalities
Figure 3b illustrates the histogram distribution of ThyGPT’s predicted probabilities for thyroid nodule malignancy. The distribution in the figure shows that nodules of different risk levels demonstrated clustering characteristics. Figure 3c depicts the confusion matrices in four scenarios: radiologists’ independent diagnosis, ThyGPT’s independent diagnosis, radiologists assisted by feature heatmaps, and radiologists comprehensively assisted by ThyGPT. The changes in the true positives and true negatives in the confusion matrix indicated that ThyGPT effectively assisted radiologists in assessing the nodule risk.
We further explored how to better integrate ThyGPT into the precision diagnosis and treatment of thyroid nodules. The specific strategy was as follows: (1) High-PPV (H-PPV) nodules: nodules with a ThyGPT-predicted malignancy score exceeding 0.7 (corresponding to a PPV exceeding 0.960) are categorized as H-PPV nodules. Detailed information on the PPV and NPV under different thresholds is provided in Supplementary Table 4. If ThyGPT identifies an H-PPV nodule and the radiologist reviewing ThyGPT’s results raises no objections, bypassing FNA and proceeding directly to surgery can be considered. (2) Moderately suspicious nodules (MSN): nodules with a ThyGPT-predicted malignancy score between 0.3 and 0.7 are defined as MSNs. In such cases, the radiologist may determine whether to perform FNA based on the ACR guidelines. (3) High-NPV (H-NPV) nodules: nodules with a ThyGPT-predicted malignancy score below 0.3 (corresponding to an NPV exceeding 0.975) are categorized as H-NPV nodules. If ThyGPT detects such a nodule and the radiologist raises no objections, follow-up can be considered an appropriate course of action. In practical applications, the thresholds can be adjusted as needed to achieve the desired PPV and NPV.
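The three-tier strategy reduces to a simple thresholding rule. A minimal sketch, using the 0.3 and 0.7 operating points stated above (which, as noted, can be tuned):

```python
def triage(score: float) -> str:
    """Map a ThyGPT-predicted malignancy score to the triage category
    described in the text. Thresholds 0.3 and 0.7 are the paper's
    operating points; shifting them changes the expected PPV and NPV.
    """
    if score > 0.7:
        return "H-PPV: consider surgery without FNA (PPV > 0.960)"
    if score < 0.3:
        return "H-NPV: consider follow-up instead of FNA (NPV > 0.975)"
    return "MSN: decide FNA per ACR guidelines"

print(triage(0.83))  # H-PPV: consider surgery without FNA (PPV > 0.960)
print(triage(0.50))  # MSN: decide FNA per ACR guidelines
print(triage(0.21))  # H-NPV: consider follow-up instead of FNA (NPV > 0.975)
```

Scores exactly at 0.3 or 0.7 fall into the MSN tier here, matching the text’s “between 0.3 and 0.7” wording.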
According to these rules, the proportion of FNAs in the test set decreased from 64.2% to 23.3% (p < 0.001), while the proportion of missed malignant tumors decreased from 11.6% to 5.3% (p < 0.001). A detailed comparison between the ACR criteria and the ThyGPT-assisted H-PPV approach is presented in Supplementary Table 5. Among the 2167 nodules requiring FNA according to ACR criteria, 1459 (67.3%) did not require FNA under the above H-PPV and H-NPV rules. The H-PPV subgroup identified by ThyGPT contained 903 nodules, of which 886 were malignant and 17 were benign, for an accuracy of 98.1%.
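The proportions quoted in this paragraph follow directly from the reported counts, as a quick arithmetic check shows:

```python
# Counts reported for Test Set 1 (see Supplementary Table 5).
acr_fna = 2167                        # nodules requiring FNA under ACR criteria
avoided = 1459                        # of those, exempt under H-PPV/H-NPV rules
h_ppv_total, h_ppv_malignant = 903, 886

print(f"FNA avoided among ACR-indicated nodules: {avoided / acr_fna:.1%}")  # 67.3%
print(f"H-PPV subgroup accuracy: {h_ppv_malignant / h_ppv_total:.1%}")      # 98.1%
```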
Communication between ThyGPT and radiologists
Figures 4 and 5 display several representative cases. In the cases shown in Fig. 4, the radiologists’ initial judgments were not entirely accurate; however, they revised their diagnoses after consulting and reviewing ThyGPT’s analysis. Specifically, for Sample 1 in Fig. 4, the radiologist’s initial diagnosis was a benign nodule with ACR TR category 4, whereas ThyGPT identified the nodule as malignant after analyzing the US image. During the inquiry process, ThyGPT analyzed and explained to the radiologist the relationship between its diagnosis and the nodule’s components. Consequently, the radiologist revised the final diagnosis after considering ThyGPT’s assessment. For Sample 2 in Fig. 4, ThyGPT evaluated the nodule as malignant, highlighting key malignant features present in the solid areas. The radiologist accepted ThyGPT’s assessment and revised the diagnosis following their interactive session.
Fig. 4: Radiologist revises their initial diagnosis after consulting ThyGPT.
a Radiologist’s initial diagnosis: ACR TR category 4, likely to be benign. ThyGPT’s assessment: ACR TR category 4, malignant nodule, with a malignancy score of 0.83. The radiologist consulted ThyGPT and updated the diagnosis: ACR TR category 4, likely to be malignant. Final pathological result: a malignant nodule. b Radiologist’s initial diagnosis: ACR TR category 3, likely to be benign. ThyGPT’s assessment: malignancy score 0.86. The radiologist consulted ThyGPT and updated the diagnosis: likely to be malignant. Final pathological result: a malignant nodule.
Fig. 5: Radiologist insists on the initial diagnosis after communicating with ThyGPT.
a Radiologist’s initial diagnosis: ACR TR category 4, likely to be malignant. ThyGPT’s assessment: ACR TR category 4, likely to be benign, with a malignancy score of 0.29. The radiologist consulted ThyGPT and insisted on their initial diagnosis. Final pathological result: a malignant nodule. b Radiologist’s initial diagnosis: ACR TR category 3, likely to be benign. ThyGPT’s first assessment: ACR TR category 4, likely to be malignant, with a malignancy score of 0.75. ThyGPT’s second assessment: likely to be benign, with a malignancy score of 0.21. The radiologist consulted ThyGPT and insisted on their initial diagnosis. Final pathological result: a benign nodule.
Figure 5 presents cases in which clinical radiologists correctly diagnosed the nodules whereas ThyGPT erred. Sample 1 in Fig. 5 was a malignant nodule, but ThyGPT assessed it as benign, with a score of 0.29. The radiologist found the detection result inaccurate and prompted ThyGPT to re-detect the nodule. ThyGPT accurately detected the nodule in the second round, but its auxiliary diagnostic conclusion still did not convince the radiologist. The radiologist therefore maintained the diagnosis of likely malignant, and the final pathological result supported this diagnosis. Conversely, Sample 2 in Fig. 5 was a benign nodule that ThyGPT incorrectly assessed as malignant. The radiologist deemed the heatmap overly concentrated on hyperechoic areas (and thus of limited reference value), questioned the model’s judgment, and instructed the model to recalculate while ignoring the erroneous features in the hyperechoic areas. The model then provided a correct diagnosis following the radiologist’s directive.
Table 3 provides a detailed analysis of the malignant nodules missed by ThyGPT, radiologists, and model-assisted radiologists. Notably, follicular thyroid cancer (FTC) was the most difficult to identify: radiologists missed 44.7% of cases, compared with 17.0% missed by ThyGPT. Small nodules (<10 mm), particularly TR3 nodules, spongiform/cystic compositions, and iso-/hyperechoic patterns showed higher miss rates across all methods. However, the integration of ThyGPT with radiologists consistently improved detection, reducing missed cases to rates approaching, or even matching, ThyGPT’s standalone performance, suggesting that AI assistance can effectively complement human expertise.
Table 3 Distribution of missed malignancies
Table 4 shows the overall distribution of diagnostic changes made by the different radiologists after using ThyGPT. These data reveal a clear pattern in how radiologists modified their diagnoses when assisted by ThyGPT. Junior radiologists showed a higher alteration rate (11.5%) than senior radiologists (9.9%), indicating that they were more likely to change their initial assessments. Importantly, when radiologists did alter their diagnoses, they were highly accurate, with only 0.2% wrong alterations overall compared to 10.5% correct alterations. This indicates that ThyGPT’s assistance led to predominantly beneficial changes in diagnostic decisions, with a greater impact on junior radiologists, while maintaining a very low error rate across both experience levels.
Table 4 Statistics on diagnostic changes by radiologists after using ThyGPT
We investigated the characteristic distribution of both correct and incorrect diagnostic alterations (see Supplementary Table 6), which revealed that TR4 nodules were more prone to diagnostic modifications, whereas the other characteristics showed a relatively uniform distribution. We also found that ThyGPT’s predicted nodule malignancy risk values were highly correlated with radiologists’ incorrect diagnostic alterations. As shown in Fig. 6, 92.7% (38/41) of incorrect alterations occurred at predicted risks between 0.4 and 0.6 (between the red dashed lines). This suggests that when ThyGPT predicts a malignancy risk within this range, its assessment lies in a marginal zone, and users should exercise additional caution when considering ThyGPT’s nodule analysis.
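In practice, this caution could be operationalized by flagging predictions in the marginal zone. A minimal sketch, assuming the 0.4–0.6 interval is treated as inclusive (the text does not specify boundary handling):

```python
def is_marginal(score: float, lo: float = 0.4, hi: float = 0.6) -> bool:
    """Flag a predicted malignancy score in the marginal zone, where
    92.7% (38/41) of radiologists' incorrect alterations occurred.
    Inclusive bounds are an assumption, not stated in the study.
    """
    return lo <= score <= hi

print(is_marginal(0.43))  # True  (the Fig. 6a example score)
print(is_marginal(0.83))  # False
```

A deployment could surface such a flag alongside the score, prompting the radiologist to weigh the model’s analysis less heavily in this band.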
Fig. 6: An example of incorrectly altered diagnosis by a radiologist, and the distribution of malignancy risk probabilities predicted by ThyGPT.
a Radiologist’s initial diagnosis: ACR TR category 4, likely to be malignant. ThyGPT’s assessment: ACR TR category 4, likely to be benign, with a malignancy score of 0.43. The radiologist consulted ThyGPT and incorrectly altered the diagnosis as: likely to be benign. Final pathological result: a malignant nodule. b Distribution of predicted malignancy risk values for nodules with incorrectly altered diagnoses by radiologists. c Distribution of predicted malignancy risk values for nodules with correctly altered diagnoses by radiologists.
Detection of errors in diagnostic reports
Test Set 2 was primarily used to evaluate ThyGPT’s ability to detect errors in US reports. This data set included a total of 1263 US reports, 157 of which contained errors. These 157 errors were classified into five categories: 35 omissions, 30 insertions, 33 side confusions, 36 inconsistencies, and 23 other errors. These five categories are the common types of errors found in radiological reports. A more detailed definition of these error categories can be found in Supplementary Table 7. Three senior radiologists and three junior radiologists, as well as ThyGPT, were responsible for detecting these errors. Initially, radiologists reviewed the reports separately to identify any errors. Thereafter, ThyGPT searched for errors in the reports independently. Finally, each radiologist combined their own findings with those from ThyGPT to produce the final result.
Figure 7 shows some samples of erroneous reports, radiologists’ independent judgments, and the outcomes obtained with the assistance of ThyGPT. The error detection rate of ThyGPT was 0.905 (142/157; 95% CI: 0.899–0.910), which exceeded that of all radiologists. Figure 8a presents the error detection rate of radiologists and ThyGPT. With the assistance of ThyGPT, radiologists’ average error detection rate increased from 0.764 (120/157; 95% CI: 0.751–0.779) to 0.962 (151/157; 95% CI: 0.959–0.966), and the p value was less than 0.001. Even junior radiologists exhibited error detection rates approaching those of senior radiologists. Figure 8b illustrates the average error detection rates of radiologists for each error type, as well as their error detection and correction rates aided by ThyGPT. Notably, ThyGPT achieved a 100% error detection and correction rate for side confusion errors.
Fig. 7: ThyGPT facilitates the detection of errors in US diagnostic reports.
The first column shows samples of erroneous reports (errors are marked in red). The second column shows the error description and ThyGPT’s automatic detections and corrections. a The report exemplifies a side confusion error. The ThyGPT system successfully detected that the anatomical marker indicated the left thyroid lobe; however, the report text erroneously described findings pertaining to the right thyroid lobe. The feature heatmap generated by ThyGPT highlighted the model’s areas of focus. b The report exemplifies an omission error. The report omitted information regarding the nodule size. ThyGPT’s output provided the correct prompting and the corresponding feature heatmap. c The report exemplifies an inconsistency error. The US images revealed the presence of a halo and internal microcalcifications within the nodule, which contradicted the description in the report. ThyGPT accurately detected this inconsistency and visualized the errors using two heatmaps. For all errors, ThyGPT provided corrections. The third example demonstrates an incomplete correction. After introducing new features, the corresponding ACR TR classification needs to be simultaneously corrected (marked in blue). However, ThyGPT currently lacks the capability to successfully modify such potential cascading errors.
Fig. 8: Comparison of the detection results for the different types of errors in US reports.
a Detection results of radiologists, ThyGPT, and radiologists aided by ThyGPT in percentages. b Error detection rates of radiologists, ThyGPT, and radiologists aided by ThyGPT for each error type; it also demonstrates the proportion of auto-corrected errors by ThyGPT. The correction results were reviewed by two senior radiologists to confirm whether the corrections were successful.
The average time for ThyGPT to process a report was 0.031 s, far shorter than that for radiologists (49.9 s). This processing speed satisfies the requirement for real-time error detection upon report completion. Given ThyGPT’s average error detection rate of 0.905, employing the model to scan US reports immediately upon their completion could catch over 90% of report errors at the outset. More detailed error detection results for the radiologists can be found in Supplementary Fig. 2.