ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees

Zhiyuan Wang1, Jinhao Duan2, Lu Cheng3, Yue Zhang2, Qingni Wang1,
Xiaoshuang Shi1, Kaidi Xu2, Hengtao Shen1, Xiaofeng Zhu1

1School of Computer Science and Engineering, University of Electronic
Science and Technology of China
2Department of Computer Science, Drexel University
3Department of Computer Science, University of Illinois Chicago
Corresponding to: Xiaoshuang Shi <xsshi2013@gmail.com>
Abstract

Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the closed-source nature of the latest large language models (LLMs). This study investigates applying conformal prediction (CP), which can transform any heuristic uncertainty notion into rigorous prediction sets, to black-box LLMs in open-ended NLG tasks. We introduce a novel uncertainty measure based on self-consistency theory, and then develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the CP algorithm. Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods. Furthermore, we achieve strict control over the correctness coverage rate utilizing 7 popular LLMs on 4 free-form NLG datasets, spanning general-purpose and medical scenarios. Additionally, the small size of the calibrated prediction sets further highlights the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.



1 Introduction

Despite advancements in various natural language generation (NLG) tasks Katz et al. (2024); Touvron et al. (2023a); Chen et al. (2023); Duan et al. (2024b, c), large language models (LLMs) are known to hallucinate facts and confidently generate textual information that is incorrect or not grounded in reality Ji et al. (2023); Manakul et al. (2023). Factually incorrect answers can confuse and mislead users, resulting in erroneous conclusions and ultimately undermining the trustworthiness of LLM-based high-stakes applications.

Uncertainty quantification (UQ) provides valuable insights into the reliability of model responses, facilitating risk assessment and hallucination detection Kadavath et al. (2022); Lin et al. (2022a). However, with the proliferation of LLMs served via APIs Achiam et al. (2023), which only allow textual inputs and outputs, black-box uncertainty measures are needed. Conformal prediction (CP) Campos et al. (2024); Angelopoulos and Bates (2021); Quach et al. (2024); Zhao et al. (2024) is known for providing model-agnostic and statistically rigorous uncertainty estimation. CP was primarily employed in classification Angelopoulos and Bates (2021) and regression tasks Wang et al. (2024a). For NLG tasks, CP was first adapted to the multiple-choice question-answering (MCQA) setting, where the acceptable response is selected from a fixed set of options Kumar et al. (2023); Ye et al. (2024), which limits its applicability to real-world open-ended NLG tasks. Conformal language modeling Quach et al. (2024) relies on model likelihoods and calibrates a stopping rule to sample prediction sets from the infinite output space until users are confident that the set covers at least one acceptable response. LofreeCP Su et al. (2024) studies CP for API-only LLMs without logit access by leveraging uncertainty information from diverse sources.

Our study explores adapting CP for general NLG applications. The nonconformity score (NS) in CP serves as a criterion for calibrating prediction sets, which provide coverage guarantees by selecting a set of possible labels that satisfy the NS threshold Angelopoulos and Bates (2021). Since typical logit-based NSs may suffer from miscalibration, and since it is more reliable to analyze uncertainty within the LLM’s true output space, we integrate black-box UQ into the definition of the NS by closely aligning it with the uncertainty condition of the correct answers, which yields a conformal uncertainty criterion. Then, we employ this criterion, derived from a small amount of independent and identically distributed (i.i.d.) calibration data, to construct prediction sets by selecting generations that share a similar uncertainty condition from the unbounded output space on test samples. Typically, CP has two goals: (1) the calibrated prediction set contains the correct answer with at least a user-specified probability; and (2) the average set size is small, which reflects prediction efficiency.

The first challenge is UQ for black-box LLMs. Our solution is inspired by an intuitive observation: If a language model generates more semantically diverse outputs for the same prompt, the uncertainty is likely higher Su et al. (2024); Lin et al. (2023); Xiong et al. (2023). Regardless of the model’s capability to tackle the current problem, the confidence score that the model assigns to a generation can be represented by its frequency within the output space. We approximate the model’s output distribution by sampling multiple answers to the same question. Then, we perform semantic clustering on the sampled generations, and propose to measure the uncertainty of each generation by combining two factors: the frequency of occurrence of the semantic meaning it conveys, and the consistency between its semantic and other semantic clusters augmented by their individual frequency.

Based on this measure, we define the NS of each calibration sample as the uncertainty of the generation that meets the correctness criterion and is semantically most similar to the reference answer. We then calculate the quantile $\hat{q}$ of the NSs over all calibration samples, based on the user-specified upper bound of the error rate $\alpha$. Next, we utilize the conformal uncertainty criterion (i.e., the uncertainty threshold $\hat{q}$) to construct a prediction set for each test sample by selecting, from the candidate generations, those that satisfy the uncertainty condition strictly associated with correctness. Additionally, for black-box UQ, we propose employing the most frequent generation or semantic (i.e., the model’s most confident answer) as a more trustworthy reference object for the query and leveraging it to measure the overall uncertainty of the current UQ process. We term this measure ConU, as it employs the same approach as the conformal uncertainty criterion.

Extensive experimental results show that ConU generally outperforms prior state-of-the-art methods and verify the strict correctness coverage guarantees. Specifically, the prediction sets calibrated by the conformal uncertainty criterion cover the correct answers with at least the user-specified probability under various error rates. Furthermore, the average prediction set size is small, highlighting the prediction efficiency of our approach. To our knowledge, this is the first method in the literature to strictly link the NS with the uncertainty condition aligned with correctness via black-box UQ, thereby developing a more robust conformal uncertainty criterion that provides rigorous correctness coverage guarantees in practical open-ended NLG tasks. Its use of CP for benchmarking UQ in LLMs may also be of independent interest. Our code is available at https://github.com/Zhiyuan-GG/Conformal-Uncertainty-Criterion/tree/main.

In summary, our major contributions are listed as follows:

  • We propose a sampling-based black-box uncertainty measure, termed ConU, that utilizes self-consistency in open-ended NLG tasks and facilitates trustworthy decision-making.

  • We devise a conformal uncertainty criterion by strictly aligning the NS with the uncertainty condition of acceptable answers, and achieve rigorous correctness coverage with at least a user-specified probability, thereby providing robust guarantees under various error rates in practical open-ended NLG applications.

  • We conduct selective prediction leveraging the calibrated prediction sets and obtain promising improvements in model accuracy without requiring additional task-specific fine-tuning or architectural modifications.

2 Related Work

2.1 Uncertainty Quantification in LLMs

Prior work on UQ in LLMs predominantly focuses on white-box information such as token likelihoods or embeddings Da et al. (2024); Kuhn et al. (2023); Duan et al. (2024a); Wang et al. (2024b), internal states or activations Yin et al. (2024); Chen et al. (2024), and model fine-tuning Tian et al. (2023). These methods can suffer from poor calibration and require substantial computational resources. Additionally, researchers lack white-box access to the internal information of LLMs served via APIs. These restrictions call for black-box measures for general UQ of LLM generations.

Recent work Lin et al. (2023) develops several sampling-based uncertainty measures, which can be applied to black-box LLMs by leveraging semantic similarity along with dispersion. Our study follows the sampling setting and proposes to employ the most frequent generation as the reference object to measure the overall uncertainty based on the self-consistency theory Wang et al. (2022).

2.2 Conformal Prediction in LLMs

CP Angelopoulos and Bates (2021); Quach et al. (2024); Campos et al. (2024) has emerged as a theoretically sound and practically useful way to guarantee ground-truth coverage with the aid of a small amount of exchangeable samples for calibration. CP in classification tasks defines the NS, which is correlated with the ground-truth label, obtains the quantile, $\hat{q}$, of NSs for all calibration samples based on a user-specified upper bound of the error rate $\alpha$, and utilizes $\hat{q}$ as a threshold to select possible labels on test samples, thereby establishing prediction sets that guarantee ground truth coverage with at least the probability of $1-\alpha$.
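
As a minimal illustration of the classification procedure described above, the following Python sketch calibrates a threshold and forms a prediction set. It assumes the common choice of NS as one minus the softmax probability of the true label. The function names and toy numbers are hypothetical and are not taken from any cited implementation.

```python
import math
from typing import List, Sequence


def conformal_threshold(scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n + 1)(1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: every label is kept.
        return float("inf")
    return sorted(scores)[rank - 1]


def prediction_set(probs: Sequence[float], q_hat: float) -> List[int]:
    """Keep every label whose score 1 - p does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= q_hat]


# Toy calibration scores: 1 - (softmax probability of the true label).
calibration_scores = [0.10, 0.35, 0.20, 0.70, 0.65, 0.15, 0.30]
q_hat = conformal_threshold(calibration_scores, alpha=0.25)

# Softmax output for one test input over three labels.
labels = prediction_set([0.5, 0.4, 0.1], q_hat)
```

With seven calibration scores and $\alpha = 0.25$, the rank is $\lceil 8 \times 0.75 \rceil = 6$, so the threshold is the sixth smallest score and the test input keeps the two labels whose scores fall below it.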

Recently, researchers have attempted to apply CP to LLMs for principled UQ. The work Mohri and Hashimoto (2024) achieves conformal factuality guarantees by progressively making generations less specific and establishing their corresponding entailment sets until correct answers are encompassed. For correctness coverage, two studies Kumar et al. (2023); Ye et al. (2024) follow CP in classification tasks and convert NLG tasks into MCQA settings. For open-ended NLG, based on the output token sequence logits, the study Quach et al. (2024) develops a stopping rule to sample generations until users are confident that a correct answer is covered in QA tasks, which can be impractical for API-only LLMs. LofreeCP Su et al. (2024) leverages uncertainty information to construct prediction sets that achieve correctness coverage.

This paper focuses on more practical scenarios of black-box LLMs in open-ended NLG tasks. Differing from LofreeCP, we strictly connect the NS with the uncertainty condition aligned with correctness via black-box UQ, which yields a more robust conformal uncertainty criterion for calibrating prediction sets with rigorous correctness coverage guarantees under various error rates, regardless of the complexity of the model or datasets.

3 Method

Our method investigates two key issues: (1) how to estimate the uncertainty in black-box LLMs when we can only access the output texts; and (2) how to provide rigorous guarantees on the error rate in open-ended NLG tasks. We first devise a black-box uncertainty measure grounded in self-consistency to provide the trustworthiness notion of model responses. Furthermore, we utilize the split CP technique to convert the heuristic approximation into a statistically rigorous one, thereby ensuring a more robust and systematic assessment of uncertainty.

3.1 Preliminaries

Following the analysis of black-box LLMs in prior work Xiong et al. (2023); Lin et al. (2023); Manakul et al. (2023), conditioned on each prompt (or question) $x_i$, we employ the most likely generation $\hat{y}_i$ for correctness evaluation. Additionally, we sample a set of $M$ candidate generations $\left\{\hat{y}_m^{(i)}\right\}_{m=1}^{M}$ from the model’s output space for black-box UQ and the derivation of the conformal uncertainty criterion. We denote the reference answer to $x_i$ as $y_i^*$.

3.2 Uncertainty Quantification

For each data point, we first cluster semantics in the $M$ sampled generations and obtain $K$ non-repeated semantics. We denote the number of generations sharing the $k$-th semantic as $V_k$ (i.e., $\sum_{k=1}^{K} V_k = M$) and any one generation in this cluster as $\hat{y}_k^{(i)}$.
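
The clustering step can be sketched as follows. This is only an illustration: the text above does not fix a particular clustering procedure, so the greedy strategy, the `equivalent` predicate, and the toy string-matching stand-in are all assumptions of this sketch.

```python
from typing import Callable, List


def cluster_generations(
    generations: List[str],
    equivalent: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedily group generations into semantic clusters.

    `equivalent` is a hypothetical predicate (an assumption of this sketch);
    in practice it could be backed by an entailment or similarity model.
    """
    clusters: List[List[str]] = []
    for gen in generations:
        for cluster in clusters:
            # Compare against the first member as the cluster representative.
            if equivalent(gen, cluster[0]):
                cluster.append(gen)
                break
        else:
            # No existing cluster matched: start a new semantic.
            clusters.append([gen])
    return clusters


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def toy_equivalent(a: str, b: str) -> bool:
    # Toy stand-in for a semantic equivalence check.
    return normalize(a) == normalize(b)


samples = ["Paris", "paris.", "Lyon", "Paris", "Marseille"]  # M = 5
clusters = cluster_generations(samples, toy_equivalent)      # K clusters
cluster_sizes = [len(c) for c in clusters]                   # V_k per cluster
```

Here the cluster sizes play the role of $V_k$, and they sum to $M$ by construction.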

Building on earlier approaches that utilize self-consistency Wang et al. (2022); Su et al. (2024); Yadkori et al. (2024) as a reliable measure of confidence, we employ the frequency of the $k$-th semantic as its proxy for reliability: $\mathcal{F}\left(\hat{y}_k^{(i)}\right) = \frac{V_k}{M}$. Then, we define the uncertainty score of each candidate generation in $\left\{\hat{y}_m^{(i)}\right\}_{m=1}^{M}$ as

$$
\mathcal{U}\left(\hat{y}_m^{(i)}\right) = 1 - \lambda \cdot \mathcal{F}\left(\hat{y}_m^{(i)}\right) - \left(1-\lambda\right) \cdot \frac{1}{K} \sum_{k=1}^{K} \mathcal{S}\left(\hat{y}_m^{(i)}, \hat{y}_k^{(i)}\right) \mathcal{F}\left(\hat{y}_k^{(i)}\right), \tag{1}
$$

where $\mathcal{F}\left(\hat{y}_m^{(i)}\right)$ refers to the frequency of the semantic that $\hat{y}_m^{(i)}$ conveys, and $\mathcal{S}\left(\cdot,\cdot\right)$ measures the semantic similarity between two generations utilizing a cross-encoder model Reimers and Gurevych (2019). $\mathcal{F}\left(\hat{y}_k^{(i)}\right)$ serves to augment the persuasiveness of the similarity score associated with $\hat{y}_k^{(i)}$.
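
A minimal Python sketch of Eq. (1) is given below, assuming the semantic clusters have already been formed. The exact-match similarity is a toy stand-in for the cross-encoder, and the default `lam=0.5` is a placeholder rather than a value stated in the text. All function and variable names are hypothetical.

```python
from typing import Callable, Sequence


def generation_uncertainty(
    generation: str,
    cluster_index: int,
    representatives: Sequence[str],
    cluster_sizes: Sequence[int],
    similarity: Callable[[str, str], float],
    lam: float = 0.5,  # placeholder weight, not a value from the paper
) -> float:
    """Uncertainty of one candidate generation, following Eq. (1)."""
    total = sum(cluster_sizes)                  # M
    num_clusters = len(cluster_sizes)           # K
    freqs = [v / total for v in cluster_sizes]  # F for each cluster
    # Frequency-weighted similarity to each cluster, averaged over K.
    consistency = sum(
        similarity(generation, rep) * f
        for rep, f in zip(representatives, freqs)
    ) / num_clusters
    return 1.0 - lam * freqs[cluster_index] - (1.0 - lam) * consistency


def exact_match(a: str, b: str) -> float:
    # Toy stand-in for the cross-encoder similarity S.
    return 1.0 if a.lower() == b.lower() else 0.0


reps = ["Paris", "Lyon", "Marseille"]  # one representative per cluster
sizes = [3, 1, 1]                      # V_k, so M = 5
u_paris = generation_uncertainty("Paris", 0, reps, sizes, exact_match)
u_lyon = generation_uncertainty("Lyon", 1, reps, sizes, exact_match)
```

In this toy setting the generation from the largest cluster receives the lower uncertainty score, which matches the intuition that frequent semantics are the more reliable ones.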

To measure the model uncertainty, we select any one generation in the largest semantic cluster to be the most trustworthy generation in the $M$ sampled generations and denote it as $\hat{y}_{mst}^{i}$. Then, we define the uncertainty score of the $i$-th query-response process as

$$
\mathcal{U}\left(\left\{\hat{y}_m^{(i)}\right\}_{m=1}^{M} \,\middle|\, x_i\right) = 1 - \lambda \cdot \mathcal{F}\left(\hat{y}_{mst}^{i}\right) - \left(1-\lambda\right) \cdot \frac{1}{K} \sum_{k=1}^{K} \mathcal{S}\left(\hat{y}_{mst}^{i}, \hat{y}_k^{(i)}\right) \mathcal{F}\left(\hat{y}_k^{(i)}\right). \tag{2}
$$

Intuitively, the most frequent semantic within the candidate generations represents the model’s most confident answer to the current problem. Even though the reference semantic may not necessarily be the correct one, we can measure the degree of the model’s uncertainty by calculating the confidence level of that semantic as well as the deviation between it and other semantics.
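
The query-level score in Eq. (2) can be sketched in Python as follows. As before, the exact-match similarity and the default `lam=0.5` are illustrative assumptions, and the names are hypothetical.

```python
from typing import Callable, Sequence


def overall_uncertainty(
    representatives: Sequence[str],
    cluster_sizes: Sequence[int],
    similarity: Callable[[str, str], float],
    lam: float = 0.5,  # placeholder weight, not a value from the paper
) -> float:
    """Uncertainty of one query-response process, following Eq. (2)."""
    total = sum(cluster_sizes)                  # M
    num_clusters = len(cluster_sizes)           # K
    freqs = [v / total for v in cluster_sizes]  # F for each cluster
    # The largest cluster supplies the most trustworthy generation.
    mst = max(range(num_clusters), key=lambda k: cluster_sizes[k])
    y_mst = representatives[mst]
    consistency = sum(
        similarity(y_mst, rep) * f
        for rep, f in zip(representatives, freqs)
    ) / num_clusters
    return 1.0 - lam * freqs[mst] - (1.0 - lam) * consistency


def exact_match(a: str, b: str) -> float:
    # Toy stand-in for the cross-encoder similarity S.
    return 1.0 if a.lower() == b.lower() else 0.0


# All five samples agree: a single cluster of size 5.
u_confident = overall_uncertainty(["Paris"], [5], exact_match)
# All five samples disagree: five singleton clusters.
u_dispersed = overall_uncertainty(["a", "b", "c", "d", "e"], [1] * 5, exact_match)
```

The two toy queries mark the extremes: full agreement gives a score of zero, while five mutually dissimilar answers give a score close to one.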

Since Eq. (1) can quantify the uncertainty of each candidate generation, we attempt to develop an uncertainty criterion to search for the correct answers within the unfixed output space of the LLM.

3.3 Conformal Correctness Coverage

Following the fundamental requirement in split CP Angelopoulos and Bates (2021), we randomly employ $N$ samples to construct the calibration data set $\left\{\left(x_i, y_i^*\right)\right\}_{i=1}^{N}$, and for each calibration sample we demand that at least one sampled generation $\hat{y}_j^{(i)}$ in $\left\{\hat{y}_m^{(i)}\right\}_{m=1}^{M}$ meets the correctness criterion. Our objective of conformal correctness coverage is as follows: by deriving, on $\left\{\left(x_i, y_i^*\right)\right\}_{i=1}^{N}$, an uncertainty criterion that is closely linked with correctness, we can calibrate an uncertainty (prediction) set $\mathcal{P}\left(x_{test}\right)$ for the test prompt $x_{test}$ by selecting generations that meet the common uncertainty condition, such that the set guarantees correctness coverage under various user-specified error rates. Here, we approximate the prediction region of $x_{test}$ by the $M$ candidate generations $\left\{\hat{y}_m^{(test)}\right\}_{m=1}^{M}$.

Assumptions: (1) There is at least one candidate generation in $\left\{\hat{y}_m^{(test)}\right\}_{m=1}^{M}$ meeting the correctness criterion; (2) Samples in the calibration and test data sets are exchangeable.

As the sampled set $\left\{\hat{y}_m^{(test)}\right\}_{m=1}^{M}$ is a subset of the prediction region, which is impossible to enumerate, we can simplify it by stating that there is at least one correct answer in $\left\{\hat{y}_m^{(test)}\right\}_{m=1}^{M}$. Exchangeability is the fundamental assumption of CP Angelopoulos and Bates (2021). We provide the explanation for Assumption (1) in Appendix B.

Based on the uncertainty measure in Eq. (1), we define the NS of the $i$-th calibration sample as

$$
r_i = r\left(x_i, y_i^*\right) = \mathcal{U}\left(\arg\max_{\hat{y}_j^{(i)}} \mathcal{S}\left(\hat{y}_j^{(i)}, y_i^*\right) \mathcal{E}\left(\hat{y}_j^{(i)}, y_i^*\right)\right), \tag{3}
$$

where $\mathcal{E}\left(\cdot,\cdot\right)$ is the indicator function determining whether two sentences share equivalent semantics, i.e., $\mathcal{E}\left(\hat{y}_j^{(i)}, y_i^*\right) = 1$ indicates that $\hat{y}_j^{(i)}$ is semantically equivalent to $y_i^*$, and $\mathcal{E}\left(\hat{y}_j^{(i)}, y_i^*\right) = 0$ indicates that it is not. That is, the NS $r\left(x_i, y_i^*\right)$ represents the uncertainty condition of the candidate generation $\hat{y}_j^{(i)}$ that has the highest similarity score with the reference answer $y_i^*$ among the generations that are semantically equivalent to $y_i^*$. The criterion for determining semantic equivalence here is the same as that for correctness evaluation (i.e., $\hat{y}_j^{(i)}$ is correct according to $y_i^*$ if $\mathcal{E}\left(\hat{y}_j^{(i)}, y_i^*\right) = 1$).
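
The selection in Eq. (3) can be sketched as below. The sketch first filters the candidates with $\mathcal{E} = 1$ and then takes the one most similar to the reference, which is how the paragraph above describes the argmax. The token-overlap similarity, the substring-based equivalence check, and the precomputed uncertainty table are toy assumptions, and all names are hypothetical.

```python
from typing import Callable, Sequence


def nonconformity_score(
    candidates: Sequence[str],
    reference: str,
    similarity: Callable[[str, str], float],
    is_equivalent: Callable[[str, str], bool],
    uncertainty: Callable[[str], float],
) -> float:
    """NS of one calibration sample, following Eq. (3)."""
    # Keep only candidates with E = 1, i.e. the correct generations.
    correct = [y for y in candidates if is_equivalent(y, reference)]
    if not correct:
        # The calibration set is required to contain a correct generation.
        raise ValueError("no candidate meets the correctness criterion")
    # Among correct candidates, take the one most similar to the reference.
    best = max(correct, key=lambda y: similarity(y, reference))
    return uncertainty(best)


def jaccard(a: str, b: str) -> float:
    # Toy stand-in for the similarity S: token overlap.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def contains_reference(y: str, ref: str) -> bool:
    # Toy stand-in for the equivalence indicator E.
    return ref.lower() in y.lower()


# Hypothetical, precomputed Eq. (1) scores for three candidates.
toy_uncertainty = {"Paris": 0.30, "It is Paris": 0.35, "Lyon": 0.80}

ns = nonconformity_score(
    ["Paris", "It is Paris", "Lyon"],
    "Paris",
    jaccard,
    contains_reference,
    toy_uncertainty.__getitem__,
)
```

Raising an error when no candidate is correct mirrors the requirement that every calibration sample has at least one sampled generation meeting the correctness criterion.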

It is worth emphasizing that we strictly align the NSs with the uncertainty conditions of correct answers within the fresh calibration set. This gives an honest picture of the model’s performance, which is crucial for robust correctness coverage guarantees on new test samples.

Following prior work Angelopoulos and Bates (2021); Quach et al. (2024); Campos et al. (2024), we sort $\left\{r_i\right\}_{i=1}^{N}$ in ascending order ($r_1 \leq \cdots \leq r_N$) and calculate the $\frac{\left\lceil\left(N+1\right)\left(1-\alpha\right)\right\rceil}{N}$ quantile of NSs for all calibration data to develop the conformal uncertainty criterion

$$
\hat{q} = \inf\left\{q : \frac{\left|\left\{i : r_i \leq q\right\}\right|}{N} \geq \frac{\left\lceil\left(N+1\right)\left(1-\alpha\right)\right\rceil}{N}\right\} = r_{\left\lceil\left(N+1\right)\left(1-\alpha\right)\right\rceil}, \tag{4}
$$

where α𝛼\alphaitalic_α is the upper bound of the error rate.
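The threshold in Eq. (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `conformal_threshold` and the calibration scores are made up for the example, and returning infinity when the required rank exceeds $N$ is a common convention in conformal prediction that is assumed here, not stated in the text.

```python
import math
from typing import Sequence


def conformal_threshold(nonconformity_scores: Sequence[float], alpha: float) -> float:
    """Return q_hat, the ceil((N+1)(1-alpha))-th smallest calibration score (Eq. 4)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(nonconformity_scores)
    if n == 0:
        raise ValueError("at least one calibration score is required")
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank into the sorted scores
    if rank > n:
        # Assumed convention: with too few calibration points for this alpha,
        # no finite threshold works, so every candidate is admitted.
        return math.inf
    return sorted(nonconformity_scores)[rank - 1]


# Hypothetical calibration scores r_1..r_10 (made-up numbers, N = 10).
calibration_scores = [0.62, 0.15, 0.48, 0.91, 0.33, 0.27, 0.74, 0.05, 0.56, 0.40]
q_hat = conformal_threshold(calibration_scores, alpha=0.2)
print(q_hat)  # 0.74, the 9th smallest score since ceil(11 * 0.8) = 9
```

Note that the rank uses $N+1$ rather than $N$, which makes the threshold slightly more conservative than the plain empirical $(1-\alpha)$ quantile.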

Table 1: Performance comparison (AUROC) of uncertainty quantification across our proposed method and 8 baseline approaches, evaluated on 5 instruction-tuned LLMs over 4 open-ended NLG datasets. The correctness criterion is based on the sentence similarity measured by the DistilRoBERTa model with a threshold of 0.7. The best UQ methods are in bold and the second-best one is underscored.

White-box methods: PE, LNPE, SE, SAR. Black-box methods: LS, NumSet, Ecc, Deg, ConU.

| Dataset | LLMs | PE | LNPE | SE | SAR | LS | NumSet | Ecc | Deg | ConU |
|---|---|---|---|---|---|---|---|---|---|---|
| TriviaQA | LLaMA-2-7B-Chat | 0.6587 | 0.6459 | 0.7495 | 0.7876 | 0.5571 | 0.7763 | 0.7839 | 0.8103 | 0.8198 |
| TriviaQA | Mistral-7B-Instruct-v0.3 | 0.6620 | 0.5968 | 0.7845 | 0.8306 | 0.5969 | 0.8491 | 0.8596 | 0.8596 | 0.8671 |
| TriviaQA | LLaMA-3-8B-Instruct | 0.7247 | 0.6465 | 0.7934 | 0.8271 | 0.4661 | 0.8201 | 0.7404 | 0.8246 | 0.8275 |
| TriviaQA | Vicuna-13B-v1.5 | 0.5553 | 0.5543 | 0.7568 | 0.7207 | 0.5734 | 0.7629 | 0.6578 | 0.7858 | 0.7926 |
| TriviaQA | LLaMA-2-13B-Chat | 0.6065 | 0.5614 | 0.7624 | 0.7757 | 0.6121 | 0.7885 | 0.8035 | 0.8035 | 0.8048 |
| TriviaQA | Average | 0.6414 | 0.6010 | 0.7693 | 0.7883 | 0.5611 | 0.7994 | 0.7690 | 0.8167 | 0.8224 |
| CoQA | LLaMA-2-7B-Chat | 0.6236 | 0.5618 | 0.7120 | 0.7372 | 0.5403 | 0.7309 | 0.6769 | 0.7613 | 0.7600 |
| CoQA | Mistral-7B-Instruct-v0.3 | 0.6746 | 0.5795 | 0.7062 | 0.7551 | 0.5799 | 0.7481 | 0.6931 | 0.7645 | 0.7652 |
| CoQA | LLaMA-3-8B-Instruct | 0.7495 | 0.6531 | 0.7652 | 0.7902 | 0.4532 | 0.7400 | 0.7288 | 0.7763 | 0.7702 |
| CoQA | Vicuna-13B-v1.5 | 0.5928 | 0.5565 | 0.7110 | 0.6984 | 0.4965 | 0.6832 | 0.6679 | 0.7191 | 0.7106 |
| CoQA | LLaMA-2-13B-Chat | 0.6203 | 0.5634 | 0.7039 | 0.7427 | 0.5534 | 0.7230 | 0.6805 | 0.7546 | 0.7591 |
| CoQA | Average | 0.6522 | 0.5829 | 0.7197 | 0.7472 | 0.5247 | 0.7250 | 0.6894 | 0.7552 | 0.7530 |
| MedQA | LLaMA-2-7B-Chat | 0.4888 | 0.4925 | 0.5341 | 0.5862 | 0.5599 | 0.5933 | 0.5511 | 0.6064 | 0.6120 |
| MedQA | Mistral-7B-Instruct-v0.3 | 0.4613 | 0.4639 | 0.5091 | 0.6397 | 0.5520 | 0.6282 | 0.6562 | 0.6660 | 0.6789 |
| MedQA | LLaMA-3-8B-Instruct | 0.5854 | 0.5781 | 0.6508 | 0.7167 | 0.4522 | 0.7093 | 0.6142 | 0.7159 | 0.7196 |
| MedQA | Vicuna-13B-v1.5 | 0.4970 | 0.4922 | 0.5523 | 0.5854 | 0.5479 | 0.5926 | 0.5383 | 0.6261 | 0.6360 |
| MedQA | LLaMA-2-13B-Chat | 0.4618 | 0.4647 | 0.5277 | 0.5792 | 0.5734 | 0.6041 | 0.5743 | 0.6070 | 0.6153 |
| MedQA | Average | 0.4989 | 0.4983 | 0.5548 | 0.6214 | 0.5371 | 0.6255 | 0.5868 | 0.6443 | 0.6524 |
| MedMCQA | LLaMA-2-7B-Chat | 0.4774 | 0.4848 | 0.5221 | 0.5883 | 0.5531 | 0.6171 | 0.5165 | 0.5983 | 0.6330 |
| MedMCQA | Mistral-7B-Instruct-v0.3 | 0.4971 | 0.4989 | 0.5491 | 0.6944 | 0.5103 | 0.7084 | 0.7170 | 0.7173 | 0.7413 |
| MedMCQA | LLaMA-3-8B-Instruct | 0.5414 | 0.5395 | 0.6244 | 0.6940 | 0.4817 | 0.6992 | 0.5952 | 0.6993 | 0.7098 |
| MedMCQA | Vicuna-13B-v1.5 | 0.4614 | 0.4815 | 0.5550 | 0.5509 | 0.5377 | 0.5891 | 0.5135 | 0.6221 | 0.6448 |
| MedMCQA | LLaMA-2-13B-Chat | 0.4547 | 0.4712 | 0.5385 | 0.5701 | 0.5711 | 0.6378 | 0.6188 | 0.6188 | 0.6414 |
| MedMCQA | Average | 0.4864 | 0.4952 | 0.5578 | 0.6195 | 0.5308 | 0.6503 | 0.5922 | 0.6511 | 0.6741 |

For each test sample, we construct the prediction set as

$$
\mathcal{P}\left(x_{test}\right) = \left\{\hat{y}_{j}^{(test)} : r\left(x_{test}, \hat{y}_{j}^{(test)}\right) \leq \hat{q}\right\}. \tag{5}
$$

It is evident that the most semantically similar generation to $\hat{y}_{j}^{(test)}$ in $\left\{\hat{y}_{m}^{(test)}\right\}_{m=1}^{M}$ is itself, and we obtain $r\left(x_{test}, \hat{y}_{j}^{(test)}\right) = \mathcal{U}\left(\hat{y}_{j}^{(test)}\right)$.
Recall the assumption that $\left\{\hat{y}_{m}^{(test)}\right\}_{m=1}^{M}$ contains at least one correct generation (i.e., $y_{test}^{*} \in \left\{\hat{y}_{m}^{(test)}\right\}_{m=1}^{M}$); then the event $\left\{y_{test}^{*} \in \mathcal{P}\left(x_{test}\right)\right\}$ is equivalent to $\left\{r_{test} = r\left(x_{test}, y_{test}^{*}\right) \leq \hat{q}\right\}$.
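The set construction in Eq. (5) amounts to a simple filter over the sampled generations. The sketch below assumes the uncertainty score of every candidate has already been computed; the function name `prediction_set`, the example answers, and their scores are hypothetical.

```python
from typing import Dict, List


def prediction_set(candidate_scores: Dict[str, float], q_hat: float) -> List[str]:
    """Keep every sampled generation whose uncertainty score is at most q_hat (Eq. 5)."""
    return [answer for answer, score in candidate_scores.items() if score <= q_hat]


# Hypothetical uncertainty scores for the sampled answers to one test question.
candidates = {"Paris": 0.12, "Lyon": 0.81, "paris, france": 0.30}
print(prediction_set(candidates, q_hat=0.74))  # ['Paris', 'paris, france']
print(prediction_set(candidates, q_hat=0.05))  # []
```

The second call shows that a strict threshold can filter out every candidate, leaving an empty prediction set.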

Since the calibration and test samples $\left(x_{1}, y_{1}^{*}\right)$, …, $\left(x_{N}, y_{N}^{*}\right)$, $\left(x_{test}, y_{test}^{*}\right)$ are exchangeable, we have $P\left(r_{test} \leq r_{i}\right) = \frac{i}{N+1}$. Then we conclude

$$
\begin{split}
P\left(y_{test}^{*} \in \mathcal{P}\left(x_{test}\right)\right) &= P\left(r_{test} \leq r_{\left\lceil (N+1)(1-\alpha) \right\rceil}\right) \\
&= \frac{\left\lceil (N+1)(1-\alpha) \right\rceil}{N+1} \\
&\geq 1-\alpha,
\end{split} \tag{6}
$$

and obtain the user-specified lower bound (i.e., $1-\alpha$) of the correctness coverage rate guaranteed by these calibrated prediction sets.
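As a worked instance of Eq. (6), consider the hypothetical values $N = 100$ and $\alpha = 0.1$, chosen purely for illustration and not taken from the experiments:

```latex
\begin{align*}
\left\lceil (N+1)(1-\alpha) \right\rceil &= \left\lceil 101 \times 0.9 \right\rceil = \left\lceil 90.9 \right\rceil = 91, \\
\hat{q} &= r_{91}, \\
P\left(y_{test}^{*} \in \mathcal{P}\left(x_{test}\right)\right) &= \frac{91}{101} \approx 0.901 \geq 0.9 = 1-\alpha .
\end{align*}
```

In this example the threshold is the 91st smallest calibration score, and the resulting coverage probability sits just above the target of 0.9.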

4 Evaluations

4.1 Experimental Set-up

Baselines.

We consider 8 baseline methods, including 4 white-box methods: Predictive Entropy (PE) Kadavath et al. (2022), Length-normalized Predictive Entropy (LNPE) Malinin and Gales (2020), Semantic Entropy (SE) Kuhn et al. (2023), and Shift Attention to Relevance (SAR) Duan et al. (2024a); and 4 black-box approaches: Lexical Similarity (LS) Lin et al. (2022b), Number of Semantic Sets (NumSet) Kuhn et al. (2023); Lin et al. (2023), and two of the most recent state-of-the-art uncertainty quantification methods, Degree Matrix (Deg) Lin et al. (2023) and Eccentricity (Ecc) Lin et al. (2023). More details of baseline methods can be found in Appendix C.1.

Base LLMs.

We conduct empirical evaluations on 7 LLMs encompassing various sizes and architectures for comprehensive analysis, including GPT-3.5-turbo served by OpenAI (OpenAI, 2021), LLaMA-2-7B-Chat Touvron et al. (2023b), Mistral-7B-Instruct-v0.3 Jiang et al. (2023), LLaMA-3-8B-Instruct AI@Meta (2024), Vicuna-13B-v1.5 Zheng et al. (2023), LLaMA-2-13B-Chat Touvron et al. (2023b), and LLaMA-3-70B-Instruct AI@Meta (2024). We utilize the default generation configs and checkpoints provided by the HuggingFace platform (https://huggingface.co/models) for all open-source LLMs.

Figure 1: Target vs. empirical correctness coverage rate.
We test the 4 datasets utilizing the LLaMA-2-7B-Chat model as the generator. Empirically, we achieve strict control over the coverage of correct answers by calibrating prediction sets on 4 free-form QA datasets.

Datasets.

We evaluate the performance of ConU and verify the correctness coverage guarantees on 4 free-form NLG datasets: CoQA Reddy et al. (2019) for the conversational QA task, TriviaQA Joshi et al. (2017) for reading comprehension, MedQA Jin et al. (2021) for solving medical problems, and MedMCQA Pal et al. (2022) for medical entrance exam questions. More details of datasets can be found in Appendix C.2.

Evaluation Metric.

Following prior work Duan et al. (2024a); Wang et al. (2024b), we evaluate the performance of UQ by treating it as the problem of predicting whether to trust a generation given the prompt. We utilize the Area Under the Receiver Operating Characteristic Curve (AUROC), which gauges whether the uncertainty scores can effectively distinguish between correct and incorrect generations. To verify whether the correctness coverage is strictly guaranteed, we evaluate the coverage rate under various user-specified error rates. We also report the average prediction set size to evaluate the prediction efficiency and practicality of our approach.
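To make the AUROC criterion concrete, the following sketch computes it directly from its pairwise definition: the probability that a randomly chosen incorrect generation receives a higher uncertainty score than a randomly chosen correct one. The function name `auroc` and the toy scores are illustrative assumptions; in practice a library routine would typically be used instead of this quadratic loop.

```python
from typing import Sequence


def auroc(uncertainty: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Probability that a random incorrect generation gets a higher uncertainty
    score than a random correct one (ties count as one half)."""
    wrong = [u for u, c in zip(uncertainty, is_correct) if not c]
    right = [u for u, c in zip(uncertainty, is_correct) if c]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect generation")
    wins = 0.0
    for u_wrong in wrong:
        for u_right in right:
            if u_wrong > u_right:
                wins += 1.0
            elif u_wrong == u_right:
                wins += 0.5
    return wins / (len(wrong) * len(right))


# Toy example: two correct and two incorrect generations (made-up scores).
scores = [0.10, 0.40, 0.35, 0.80]
labels = [True, True, False, False]
print(auroc(scores, labels))  # 0.75 (3 of the 4 incorrect/correct pairs are ranked properly)
```

A value of 0.5 corresponds to an uninformative uncertainty score and 1.0 to perfect separation of correct and incorrect generations.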

Correctness and Equivalence Metric.

We utilize sentence similarity Duan et al. (2024a) as the metric for correctness and equivalence evaluation. We employ the cross-encoder model Reimers and Gurevych (2019) with DistilRoBERTa Sanh et al. (2019) serving as the backbone to measure the semantic similarity score between the most likely generation and the reference answer, and set a strict correctness threshold of 0.7.

Table 2: The results of correctness coverage rate (%) on 7 LLMs with various sizes across 4 open-ended NLG datasets. The user-specified error rate $\alpha$ is set to 0.1.

| LLMs | TriviaQA | CoQA | MedQA | MedMCQA |
|---|---|---|---|---|
| LLaMA-2-7B-Chat | 91.00 | 93.37 | 100.00 | 91.32 |
| Mistral-7B-Instruct-v0.3 | 90.83 | 91.87 | 90.70 | 90.39 |
| LLaMA-3-8B-Instruct | 94.27 | 90.73 | 90.46 | 93.17 |
| LLaMA-2-13B-Chat | 91.68 | 91.63 | 91.72 | 92.45 |
| Vicuna-13B-v1.5 | 90.19 | 92.68 | 90.25 | 92.13 |
| LLaMA-3-70B-Instruct | 92.18 | 90.95 | 93.70 | 92.48 |
| GPT-3.5-turbo | 93.14 | 91.66 | 91.78 | 90.36 |

Table 3: The average prediction set size on 7 LLMs with various sizes across 4 open-ended NLG datasets. The user-specified error rate $\alpha$ is set to 0.1.

| LLMs | TriviaQA | CoQA | MedQA | MedMCQA |
|---|---|---|---|---|
| LLaMA-2-7B-Chat | 2.28 | 2.26 | 4.28 | 3.07 |
| Mistral-7B-Instruct-v0.3 | 2.24 | 2.49 | 4.20 | 3.26 |
| LLaMA-3-8B-Instruct | 2.34 | 2.45 | 2.68 | 2.60 |
| LLaMA-2-13B-Chat | 2.19 | 2.28 | 3.40 | 2.73 |
| Vicuna-13B-v1.5 | 2.26 | 2.47 | 3.29 | 2.98 |
| LLaMA-3-70B-Instruct | 1.03 | 1.71 | 2.15 | 1.60 |
| GPT-3.5-turbo | 1.96 | 2.13 | 2.49 | 2.02 |

Hyperparameters.

We randomly sample 5 answers to each question for UQ and 10 candidate generations for the verification of correctness coverage guarantees. We leverage beam search to obtain the most likely generations for correctness evaluation and multinomial sampling for candidate generations Duan et al. (2024a). The maximum length of each generation is set to 128 tokens, and the generation temperature is set to 1.0. The coefficient $\lambda$ introduced in Eq. (1) is set to 0.5. The ratio between the calibration and test sets is set to 1:10 by default.

4.2 UQ in Black-Box LLMs

As defined in failure prediction Xiong et al. (2023), which evaluates whether the uncertainty score can effectively distinguish between correct and incorrect generations, an effective measure should assign higher uncertainty to incorrect generations and lower uncertainty to correct ones. We compare our approach with state-of-the-art methods using AUROC. Experimental results are summarized in Table 1. Generally, our method outperforms the baseline methods in most settings. For instance, our method consistently beats all 8 baseline methods on the TriviaQA dataset. It is worth noting that our method outperforms the other methods by up to 2.4% AUROC on the MedMCQA dataset and 1.29% AUROC on the MedQA dataset, which indicates the potential impact of our method on real-world high-stakes NLG applications. We will discuss the impact of the number of sampled generations on UQ in Section 4.4.

Figure 2: Target correctness coverage rate vs. empirical correctness coverage rate on non-empty prediction sets. We test the 4 datasets utilizing the LLaMA-2-7B-Chat model. We can almost obtain absolute coverage of correct answers in non-empty calibrated prediction sets even at a strict user-accepted error rate.

4.3 Conformal Correctness Coverage

In this section, we verify that the calibrated prediction sets constructed following Eq. (5) indeed achieve rigorous correctness coverage guarantees under various user-specified error rates as described in Eq. (6). Then we explore the utility of prediction sets and conduct selective prediction based on our proposed uncertainty measure.

Empirical Coverage Guarantees.

To verify the derived lower bound of the correctness coverage rate in practice, we randomly split each of the four datasets at a ratio of 1:10, employing the respective portions as the calibration and test sets. We utilize the calibration set to derive the conformal uncertainty criterion specified by the upper bound of the error rate. Then, we measure the correctness coverage rate on the test set and plot the results on the four datasets in Figure 1. We achieve strict control of the correctness coverage rate under various error rates. The verification on other models can be found in Appendix D.
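The evaluation protocol above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the helper names `split_calibration_test` and `empirical_coverage`, the seed, and the toy scores are hypothetical, and the coverage function assumes that the NS of the correct answer is available for every test question.

```python
import random
from typing import List, Sequence, Tuple


def split_calibration_test(
    samples: Sequence, cal_parts: int = 1, test_parts: int = 10, seed: int = 0
) -> Tuple[List, List]:
    """Randomly split samples into calibration and test sets at cal_parts:test_parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_cal = len(shuffled) * cal_parts // (cal_parts + test_parts)
    return shuffled[:n_cal], shuffled[n_cal:]


def empirical_coverage(correct_answer_scores: Sequence[float], q_hat: float) -> float:
    """Fraction of test questions whose correct answer satisfies r <= q_hat,
    i.e. whose prediction set contains the correct answer."""
    if not correct_answer_scores:
        raise ValueError("the test set is empty")
    covered = sum(1 for r in correct_answer_scores if r <= q_hat)
    return covered / len(correct_answer_scores)


calibration, test = split_calibration_test(range(110))
print(len(calibration), len(test))  # 10 100
print(empirical_coverage([0.2, 0.5, 0.9, 0.4], q_hat=0.6))  # 0.75
```

Repeating the split with different seeds and averaging the resulting coverage gives a steadier estimate than a single split, since the guarantee in Eq. (6) holds on average over the draw of the calibration set.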

Following Ye et al. (2024), we set the error rate $\alpha$ to 0.1 and test the coverage rate on 4 datasets utilizing 7 LLMs of various scales. As shown in Table 2, the coverage rate is at least $90\%$ in every setting, indicating that the requirement of correctness coverage guarantees is satisfied. It is worth noting that prior work Ye et al. (2024); Kumar et al. (2023) selects the possible option from a fixed set of choices, whereas we characterize the unbounded answer distribution by sampling and utilize our devised conformal uncertainty criterion to search for the correct answer, which is more practical.

Table 4: The enhancement of model accuracy (%) after conducting selective prediction within the calibrated prediction sets based on the black-box uncertainty measure, utilizing sentence similarity as the criterion for correctness evaluation under the threshold of 0.7.

| Dataset | LLMs | Original | Calibrated |
|---|---|---|---|
| TriviaQA | LLaMA-2-7B-Chat | 68.43 | 70.77 |
| TriviaQA | Mistral-7B-Instruct-v0.3 | 79.04 | 81.45 |
| TriviaQA | LLaMA-3-8B-Instruct | 79.36 | 80.00 |
| TriviaQA | Vicuna-13B-v1.5 | 78.40 | 78.80 |
| TriviaQA | LLaMA-2-13B-Chat | 76.70 | 78.13 |
| CoQA | LLaMA-2-7B-Chat | 73.00 | 75.53 |
| CoQA | Mistral-7B-Instruct-v0.3 | 78.25 | 80.80 |
| CoQA | LLaMA-3-8B-Instruct | 72.93 | 74.67 |
| CoQA | Vicuna-13B-v1.5 | 76.17 | 78.43 |
| CoQA | LLaMA-2-13B-Chat | 80.00 | 81.23 |
| MedQA | LLaMA-2-7B-Chat | 37.88 | 40.80 |
| MedQA | Mistral-7B-Instruct-v0.3 | 38.65 | 43.90 |
| MedQA | LLaMA-3-8B-Instruct | 66.29 | 70.59 |
| MedQA | Vicuna-13B-v1.5 | 44.42 | 46.78 |
| MedQA | LLaMA-2-13B-Chat | 42.07 | 46.15 |

We also evaluate the prediction efficiency of the conformal uncertainty criterion utilizing the average size of these calibrated prediction sets, which is the primary metric for CP Angelopoulos and Bates (2021). Table 3 demonstrates that the average size of prediction sets calibrated by our method remains small across the 4 datasets. For instance, the average set size is 1.03 for the LLaMA-3-70B-Instruct model on the TriviaQA task, indicating that we can almost directly identify the correct answers through these calibrated prediction sets.

We boldly expect that as long as the language model has the capability to solve the current problem, despite the unfixed answer distribution, we can always find the correct generation by performing black-box UQ on each sampled answer and searching for answers meeting the conformal uncertainty criterion, and then limit the selection region to the calibrated prediction set for post-processing.

Utility of Calibrated Prediction Sets.

Since the conformal uncertainty criterion can filter out all the candidate generations for some test samples, we explore the utility of non-empty prediction sets in practice. Figure 2 shows that the non-empty prediction sets achieve a promising correctness coverage rate, rising to 100% as the accepted error rate increases. On the MedQA dataset, with the error rate set to 0.1, we achieve nearly complete correctness coverage. This indicates that, even without reference answers in real-world high-stakes situations, the small candidate range we establish contains the correct answer for posterior selection, while high-uncertainty problems are handed over to experts, which aligns with the selective prediction and abstention criterion.

Based on the proposed uncertainty measure, we conduct post-processing to select the generation with the lowest uncertainty score from each calibrated prediction set and evaluate the total selective accuracy. It is worth noting that the performance depends on the quality of the uncertainty measure. Results are summarized in Table 4. Through posterior selection, we obtain promising accuracy improvement despite several empty prediction sets.
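The posterior selection step can be sketched as below. Several choices here are assumptions made only for illustration: the names `select_answer` and `selective_accuracy` are hypothetical, correctness is checked by exact string match as a simple stand-in for the sentence-similarity criterion, and questions with an empty prediction set are treated as abstentions that are excluded from the accuracy, which may differ from how the reported numbers treat them.

```python
from typing import Dict, List, Optional, Tuple


def select_answer(candidate_scores: Dict[str, float], q_hat: float) -> Optional[str]:
    """Return the lowest-uncertainty answer inside the calibrated prediction set,
    or None when the set is empty (abstain)."""
    admitted = {a: s for a, s in candidate_scores.items() if s <= q_hat}
    if not admitted:
        return None
    return min(admitted, key=admitted.get)


def selective_accuracy(questions: List[Tuple[Dict[str, float], str]], q_hat: float) -> float:
    """Accuracy over the questions that were answered (assumption: abstentions are skipped)."""
    answered = 0
    correct = 0
    for candidate_scores, reference in questions:
        choice = select_answer(candidate_scores, q_hat)
        if choice is None:
            continue  # empty prediction set: abstain
        answered += 1
        correct += int(choice == reference)  # stand-in for the similarity check
    return correct / answered if answered else 0.0


# Made-up candidates and references for three questions.
toy_questions = [
    ({"A": 0.2, "B": 0.6}, "A"),    # selects "A", correct
    ({"C": 0.9, "D": 0.95}, "C"),   # empty set at q_hat = 0.7, abstains
    ({"E": 0.5, "F": 0.1}, "E"),    # selects "F", incorrect
]
print(selective_accuracy(toy_questions, q_hat=0.7))  # 0.5
```

The third toy question illustrates why the result depends on the quality of the uncertainty measure: the correct answer is in the prediction set, but a lower score on a wrong answer leads to the wrong selection.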

4.4 Ablation Studies

Considering that these sampling-based methods integrate multiple generations within the candidate set, we investigate the effects of the number of sampled generations (i.e., $M$) on the performance of UQ. As illustrated in Figure 3, our uncertainty measure consistently outperforms the baseline approaches, and its performance can be further boosted by incorporating more generations. Even with just 4 generations, our method achieves the highest AUROC of 0.8082, demonstrating its generation-efficient nature.

Figure 3: The performance of UQ over various numbers of generations. Results are obtained from the LLaMA-3-8B-Instruct model on the TriviaQA dataset. Our method consistently surpasses 7 baseline methods.

Figure 4: The average coverage rate across 4 datasets at different ratios between the calibration and test set utilizing the LLaMA-3-8B-Instruct model. The red dashed line indicates the lower bound at 0.9 (i.e., $\alpha = 0.1$).

As described in Section 3.3, conformal prediction requires a calibration set to compute the threshold $\hat{q}$. In our prior analysis, we divide the dataset into the calibration and test sets at a fixed ratio of 1:10. Here, we investigate the correctness coverage rate at different size ratios between the calibration and test sets, and present the results in Figure 4. Across the various ratios, we always obtain a strict lower bound of the coverage rate by constructing prediction sets based on our devised conformal uncertainty criterion. This indicates the potential of our method for robust guarantees in real-world open-ended NLG applications.

5 Conclusion

In this work, we introduce ConU, tailored for black-box UQ in open-ended NLG tasks. Relying on CP, which can transform any heuristic approximation into a statistically rigorous uncertainty notion, we develop a robust conformal uncertainty criterion to provide reliable guarantees of correctness coverage under various user-specified error rates. We achieve strict control of the coverage rate across 7 practical LLMs on 4 free-form NLG datasets. Furthermore, the small average prediction set size underscores the efficiency of our method. Utilizing these calibrated prediction sets, we perform selective prediction and obtain improvements in model accuracy. We envisage that our conformal uncertainty criterion can provide new strategies for principled UQ in open-ended NLG tasks.

Acknowledgments

Zhiyuan Wang, Xiaoshuang Shi, and Xiaofeng Zhu were supported by the National Key Research & Development Program of China under Grant (No. 2022YFA1004100).

Limitations

Our approach has some limitations. First, we need to develop an uncertainty criterion to verify whether the correct answer has been sampled from the output space in real-world applications. Second, our findings are limited to the four datasets, and future work will extend to other typical NLG tasks such as document summarization. Finally, we will attempt to expand our conformal uncertainty criterion to non-exchangeable scenarios, aiming to establish a general criterion across different NLG tasks.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
  • Angelopoulos and Bates (2021) Anastasios N Angelopoulos and Stephen Bates. 2021. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511.
  • Angelopoulos et al. (2024) Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. 2024. Conformal risk control. In The Twelfth International Conference on Learning Representations.
  • Campos et al. (2024) Margarida M Campos, António Farinhas, Chrysoula Zerva, Mário AT Figueiredo, and André FT Martins. 2024. Conformal prediction for natural language processing: A survey. arXiv preprint arXiv:2405.01976.
  • Chen et al. (2024) Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. Inside: Llms’ internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744.
  • Chen et al. (2023) Zhipeng Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Chatcot: Tool-augmented chain-of-thought reasoning on chat-based large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14777–14790.
  • Da et al. (2024) Longchao Da, Tiejin Chen, Lu Cheng, and Hua Wei. 2024. Llm uncertainty quantification through directional entailment graph and claim level response augmentation. arXiv preprint arXiv:2407.00994.
  • Duan et al. (2024a) Jinhao Duan, Hao Cheng, Shiqi Wang, Chenan Wang, Alex Zavalny, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. 2024a. Shifting attention to relevance: Towards the uncertainty estimation of large language models. In The 62nd Annual Meeting of the Association for Computational Linguistics.
  • Duan et al. (2024b) Jinhao Duan, Shiqi Wang, James Diffenderfer, Lichao Sun, Tianlong Chen, Bhavya Kailkhura, and Kaidi Xu. 2024b. Reta: Recursively thinking ahead to improve the strategic reasoning of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2232–2246.
  • Duan et al. (2024c) Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. 2024c. Gtbench: Uncovering the strategic reasoning limitations of llms via game-theoretic evaluations. arXiv preprint arXiv:2402.12348.
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Jin et al. (2021) Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.
  • Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
  • Katz et al. (2024) Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. 2024. Gpt-4 passes the bar exam. Philosophical Transactions of the Royal Society A, 382(2270):20230254.
  • Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664.
  • Kumar et al. (2023) Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. 2023. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404.
  • Lin et al. (2022a) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022a. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.
  • Lin et al. (2023) Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2023. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.
  • Lin et al. (2022b) Zi Lin, Jeremiah Zhe Liu, and Jingbo Shang. 2022b. Towards collaborative neural-symbolic graph semantic parsing via uncertainty. Findings of the Association for Computational Linguistics: ACL 2022.
  • Malinin and Gales (2020) Andrey Malinin and Mark Gales. 2020. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations.
  • Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  • Mohri and Hashimoto (2024) Christopher Mohri and Tatsunori Hashimoto. 2024. Language models with conformal factuality guarantees. arXiv preprint arXiv:2402.10978.
  • OpenAI (2021) OpenAI. 2021. Chatgpt.
  • Pal et al. (2022) Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR.
  • Quach et al. (2024) Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. 2024. Conformal language modeling. In International Conference on Learning Representations.
  • Reddy et al. (2019) Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Su et al. (2024) Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. 2024. Api is enough: Conformal prediction for large language models without logit-access. arXiv preprint arXiv:2403.01216.
  • Tian et al. (2023) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Wang et al. (2024a) Fangxin Wang, Lu Cheng, Ruocheng Guo, Kay Liu, and Philip S Yu. 2024a. Equal opportunity of coverage in fair regression. Advances in Neural Information Processing Systems, 36.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  • Wang et al. (2024b) Zhiyuan Wang, Jinhao Duan, Chenxi Yuan, Qingyu Chen, Tianlong Chen, Huaxiu Yao, Yue Zhang, Ren Wang, Kaidi Xu, and Xiaoshuang Shi. 2024b. Word-sequence entropy: Towards uncertainty estimation in free-form medical question answering applications and beyond. arXiv preprint arXiv:2402.14259.
  • Xiong et al. (2023) Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063.
  • Yadkori et al. (2024) Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, et al. 2024. Mitigating llm hallucinations via conformal abstention. arXiv preprint arXiv:2405.01563.
  • Ye et al. (2024) Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F Wong, Emine Yilmaz, Shuming Shi, and Zhaopeng Tu. 2024. Benchmarking llms via uncertainty quantification. arXiv preprint arXiv:2401.12794.
  • Yin et al. (2024) Fan Yin, Jayanth Srinivasa, and Kai-Wei Chang. 2024. Characterizing truthfulness in large language model generations with local intrinsic dimension. arXiv preprint arXiv:2402.18048.
  • Zhao et al. (2024) Tianyi Zhao, Jian Kang, and Lu Cheng. 2024. Conformalized link prediction on graph neural networks. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4490–4499.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.

Appendix A Proof of the Coverage Property

This appendix explains why the conformal uncertainty criterion introduced in Section 3.3 is valid. We reproduce the derivation here for completeness and break the overall procedure down into the following five steps:

Black-box Uncertainty Measure. We first conduct semantic clustering within the $M$ candidate generations and obtain $K$ non-repeated semantics for each sample. Since generations in the $k$-th cluster share the same meaning, we denote any one generation in the $k$-th cluster as $\hat{y}_{k}^{(i)}$. Then we rely on self-consistency and define the uncertainty score of each candidate generation, $\mathcal{U}\left(\hat{y}_{m}^{(i)}\right)$, as described in Eq. (1).
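The clustering step can be sketched as a greedy pass over the candidates. This is a minimal illustration, not the implementation used in our experiments: `cluster_generations` and `same_normalized_string` are hypothetical names, the normalized string match only stands in for a real semantic-equivalence check, and the uncertainty score of Eq. (1) is not reproduced here.

```python
def cluster_generations(generations, equivalent):
    """Greedily group generations; each cluster holds one shared meaning."""
    clusters = []
    for text in generations:
        for cluster in clusters:
            # Compare against the first member, used as cluster representative.
            if equivalent(text, cluster[0]):
                cluster.append(text)
                break
        else:
            clusters.append([text])  # no match: start a new semantic cluster
    return clusters


def same_normalized_string(a, b):
    # Assumption: placeholder for a real semantic-equivalence check.
    return a.strip().lower() == b.strip().lower()


# Made-up candidate generations for one sample (M = 5).
generations = ["Paris", "paris", "Lyon", "PARIS ", "Lyon"]
clusters = cluster_generations(generations, same_normalized_string)
K = len(clusters)  # number of non-repeated semantics
```

With these toy inputs the five generations collapse into two clusters, so $K = 2$.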

NS Definition. For each calibration sample, we select the generation that (1) shares equivalent semantics with the reference answer and (2) among such generations, exhibits the highest semantic similarity to the reference answer. We then define the NS as its uncertainty score calculated following Eq. (1). The first condition tightly couples the NS with correctness, and the second facilitates generation selection in test samples. The NS of the $i$-th calibration data point, $r_{i}$, is given in Eq. (3).
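A hedged sketch of this selection rule follows. The function names, the substring-based equivalence check, the token-overlap similarity, and the numeric scores are all illustrative assumptions; in practice the scores come from Eq. (1) and the checks from a semantic model.

```python
def token_jaccard(a, b):
    # Assumption: crude stand-in for a semantic similarity model.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def contains_reference(candidate, reference):
    # Assumption: toy equivalence check, not a real entailment test.
    return reference.lower() in candidate.lower()


def nonconformity_score(candidates, scores, reference, equivalent, similarity):
    """Return the uncertainty score of the selected correct generation."""
    # Condition (1): keep generations equivalent to the reference answer.
    correct = [i for i, y in enumerate(candidates) if equivalent(y, reference)]
    if not correct:
        return None  # no correct candidate was sampled for this data point
    # Condition (2): among those, take the one most similar to the reference.
    best = max(correct, key=lambda i: similarity(candidates[i], reference))
    return scores[best]


candidates = ["It is Paris", "Paris", "Lyon"]
scores = [0.3, 0.2, 0.8]  # made-up uncertainty scores, one per candidate
ns = nonconformity_score(candidates, scores, "Paris",
                         contains_reference, token_jaccard)
```

Here two candidates are correct, and the second one matches the reference most closely, so its score becomes the NS.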

Conformal Uncertainty Criterion. We calculate the $\frac{\left\lceil\left(N+1\right)\left(1-\alpha\right)\right\rceil}{N}$ quantile of the NSs of all fresh calibration data to obtain our conformal uncertainty criterion (i.e., the uncertainty threshold $\hat{q}$) based on the user-specified error rate $\alpha$. As described in Eq. (4), $\hat{q}=r_{\left\lceil\left(N+1\right)\left(1-\alpha\right)\right\rceil}$, with the NSs sorted in ascending order.
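The threshold computation can be illustrated with a few lines of Python. This is a sketch under the assumption that the NSs are plain floats; `conformal_threshold` is a hypothetical helper name, and the calibration scores below are made up.

```python
import math


def conformal_threshold(nonconformity_scores, alpha):
    """Return the ceil((N+1)(1-alpha))-th smallest calibration NS."""
    n = len(nonconformity_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank into sorted NSs
    if rank > n:
        # Too few calibration points for this alpha: no finite threshold
        # suffices, so every candidate would have to be kept.
        return math.inf
    return sorted(nonconformity_scores)[rank - 1]


# Made-up NSs for N = 9 calibration points.
calibration_ns = [0.10, 0.35, 0.20, 0.50, 0.05, 0.40, 0.25, 0.30, 0.15]
q_hat = conformal_threshold(calibration_ns, alpha=0.25)
```

With $N=9$ and $\alpha=0.25$, the rank is $\lceil 10 \times 0.75 \rceil = 8$, so $\hat{q}$ is the 8th smallest score, 0.40. Returning infinity when the rank exceeds $N$ is a design choice of this sketch for very small calibration sets.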

Construction of Prediction Sets. For each test data point, we construct a prediction set following Eq. (5). Since the generation in $\left\{\hat{y}_{m}^{(test)}\right\}_{m=1}^{M}$ that is semantically equivalent to $\hat{y}_{j}^{(test)}$ and shares the highest semantic similarity with $\hat{y}_{j}^{(test)}$ is $\hat{y}_{j}^{(test)}$ itself, we obtain $r\left(x_{test},\hat{y}_{j}^{(test)}\right)=\mathcal{U}\left(\hat{y}_{j}^{(test)}\right)$. We then calibrate the prediction set by selecting the generations whose uncertainty satisfies the conformal uncertainty criterion, which is closely linked with correctness.
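As an illustration, the selection step amounts to thresholding the per-candidate uncertainty scores. The sketch below assumes the scores and the threshold are already available; `build_prediction_set` is a hypothetical name and all numbers are made up.

```python
def build_prediction_set(candidates, scores, q_hat):
    """Keep each candidate whose uncertainty score is at most q_hat."""
    return [y for y, u in zip(candidates, scores) if u <= q_hat]


# Made-up test-time candidates with made-up uncertainty scores.
test_candidates = ["Paris", "Lyon", "Marseille", "Nice"]
test_scores = [0.10, 0.40, 0.65, 0.90]
prediction_set = build_prediction_set(test_candidates, test_scores, q_hat=0.40)
```

A smaller threshold yields a smaller set, and candidates exactly at the threshold are retained because the criterion is a non-strict inequality.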

Correctness Coverage Guarantees. Under the assumption that there is at least one correct answer in $\left\{\hat{y}_{m}^{(test)}\right\}_{m=1}^{M}$, the event $\left\{y_{test}^{*}\in\mathcal{P}\left(x_{test}\right)\right\}$ is equivalent to $\left\{r_{test}=r\left(x_{test},y_{test}^{*}\right)\leq\hat{q}\right\}$. Since $\left(x_{1},y_{1}^{*}\right)$, …, $\left(x_{N},y_{N}^{*}\right)$, $\left(x_{test},y_{test}^{*}\right)$ are exchangeable, we have $P\left(r_{test}\leq r_{i}\right)=\frac{i}{N+1}$ for the $i$-th smallest calibration NS $r_{i}$. Taking $i=\left\lceil\left(N+1\right)\left(1-\alpha\right)\right\rceil$ gives $P\left(r_{test}\leq\hat{q}\right)=\frac{\left\lceil\left(N+1\right)\left(1-\alpha\right)\right\rceil}{N+1}\geq 1-\alpha$. Ultimately, we achieve rigorous guarantees of the correctness coverage rate on test samples, as described in Eq. (6).
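The rank argument can be checked numerically. The simulation below is only a sanity check under an assumption that is ours, not the paper's setting: scores are drawn i.i.d. from a uniform distribution, which is one simple instance of exchangeability. The function name and parameter values are illustrative.

```python
import math
import random


def simulated_coverage(n_calib, alpha, n_trials, seed=0):
    """Estimate P(r_test <= q_hat) over repeated calibration/test draws."""
    rng = random.Random(seed)
    rank = math.ceil((n_calib + 1) * (1 - alpha))
    hits = 0
    for _ in range(n_trials):
        # Assumption: i.i.d. uniform scores, hence exchangeable.
        calib = sorted(rng.random() for _ in range(n_calib))
        q_hat = calib[rank - 1] if rank <= n_calib else math.inf
        r_test = rng.random()
        hits += r_test <= q_hat
    return hits / n_trials


coverage = simulated_coverage(n_calib=99, alpha=0.25, n_trials=20000)
```

With $N=99$ and $\alpha=0.25$, the rank is 75 and the predicted coverage is $75/100$, so the estimate should land close to 0.75, which is no smaller than $1-\alpha$.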

Appendix B Validity of Assumption (1)

We assume that at least one acceptable response is sampled into the candidate set for each test data point. For each calibration data point, we sample multiple generations from the output space, denoted as $\mathcal{C}_{m}\left(X_{i}\right)=\left\{\hat{Y}_{j}^{(i)}\right\}_{j=1}^{m}$. Then, we define the loss of miscoverage by the candidate set as

$$
l\left(\mathcal{C}_{m}\left(X_{i}\right),Y_{i}^{*}\right)=\mathbf{1}\left\{Y_{i}^{*}\notin\mathcal{C}_{m}\left(X_{i}\right)\right\}, \tag{7}
$$

and the loss is non-increasing in $m$.

We set $A_{N}\left(m\right)=\sum_{i=1}^{N}l\left(\mathcal{C}_{m}\left(X_{i}\right),Y_{i}^{*}\right)$. Given that $l\left(\mathcal{C}_{m}\left(X_{test}\right),Y_{test}^{*}\right)\in\left\{0,1\right\}$, we obtain

$$
\begin{split}
&A_{N+1}\left(m\right)\\
&=\sum_{i=1}^{N+1}l\left(\mathcal{C}_{m}\left(X_{i}\right),Y_{i}^{*}\right)\\
&=A_{N}\left(m\right)+l\left(\mathcal{C}_{m}\left(X_{test}\right),Y_{test}^{*}\right)\\
&\in\left\{A_{N}\left(m\right),A_{N}\left(m\right)+1\right\}.
\end{split} \tag{8}
$$

By the exchangeability of the $N$ calibration data points and the test data point, we have $l_{test}\sim\mathrm{Uniform}\left(\left\{l_{1},\cdots,l_{N},l_{test}\right\}\right)$, where $l_{i}$ is the abbreviation for $l\left(\mathcal{C}_{m}\left(X_{i}\right),Y_{i}^{*}\right)$, as in Angelopoulos et al. (2024). Then, we have

$$
\begin{split}
&\mathbb{E}\left[l\left(\mathcal{C}_{m}\left(X_{test}\right),Y_{test}^{*}\right)\right]\\
&=\frac{A_{N+1}\left(m\right)}{N+1}\\
&\in\left\{\frac{A_{N}\left(m\right)}{N+1},\frac{A_{N}\left(m\right)+1}{N+1}\right\}.
\end{split} \tag{9}
$$

Since we require that at least one acceptable response is sampled into the candidate set for each calibration data point (i.e., $A_{N}\left(m\right)=0$), we obtain $\mathbb{E}\left[l\left(\mathcal{C}_{m}\left(X_{test}\right),Y_{test}^{*}\right)\right]\in\left\{0,\frac{1}{N+1}\right\}$. The expected miscoverage on the test data point is therefore at most $\frac{1}{N+1}$, and Assumption (1) holds in this case.
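The quantities in Eqs. (7) to (9) can be made concrete with a toy computation. The sketch below is illustrative only: the helper names are hypothetical, set membership is approximated by a normalized string match (an assumption), and the calibration data are made up.

```python
from fractions import Fraction


def matches(candidate, reference):
    # Assumption: normalized string match stands in for "acceptable response".
    return candidate.strip().lower() == reference.strip().lower()


def miscoverage_loss(candidates, reference, m):
    """Eq. (7): 1 if none of the first m candidates covers the reference."""
    return int(not any(matches(c, reference) for c in candidates[:m]))


def total_loss(all_candidates, references, m):
    """A_N(m): summed miscoverage loss over the N calibration points."""
    return sum(miscoverage_loss(c, y, m)
               for c, y in zip(all_candidates, references))


def expected_test_loss_values(a_n, n):
    """Eq. (9): the two possible values of the expected test loss."""
    return Fraction(a_n, n + 1), Fraction(a_n + 1, n + 1)


# Made-up calibration data with N = 4 and up to m = 2 candidates each.
all_candidates = [["paris", "lyon"], ["4", "four"], ["blue"], ["mars", "venus"]]
references = ["Paris", "four", "Blue", "Mars"]
losses_by_m = [total_loss(all_candidates, references, m) for m in (1, 2)]
bounds = expected_test_loss_values(losses_by_m[-1], len(references))
```

In this toy case $A_{N}(1)=1$ and $A_{N}(2)=0$, which illustrates that the loss is non-increasing in $m$, and with $A_{N}(2)=0$ the expected test loss is either $0$ or $\frac{1}{5}$.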

Appendix C Implementation Details

C.1 Baselines

We compare ConU with 8 baseline measures. PE is defined as the entropy over the whole generation and LNPE is the length normalized PE. SE tackles the issue of semantic equivalence by gathering generations sharing the same meaning into semantic clusters and calculating cluster-wise entropy. SAR solves the issue of generative inequality and allocates more attention to key tokens and sentences. LS measures the average sentence similarity among sampled responses. NumSet employs the number of semantic sets (equivalence classes) as a reflection of uncertainty. Deg and Ecc treat each generation as one node, calculate the symmetric normalized graph Laplacian, and respectively utilize the degree matrix and the average distance from the center as the uncertainty measures.
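To make three of these baselines concrete, the sketch below gives a Monte Carlo estimate of PE and LNPE from per-token log-probabilities, and NumSet from cluster labels. It is a simplified reading of these measures rather than the reference implementations; the function names and the toy numbers are assumptions.

```python
def predictive_entropy(token_logprobs_per_sample, length_normalize=False):
    """Average negative log-likelihood over sampled generations.

    With length_normalize=True, each generation's negative log-likelihood
    is divided by its token count first (the LNPE variant).
    """
    values = []
    for logprobs in token_logprobs_per_sample:
        nll = -sum(logprobs)
        values.append(nll / len(logprobs) if length_normalize else nll)
    return sum(values) / len(values)


def num_semantic_sets(cluster_ids):
    """NumSet: number of distinct semantic clusters among the samples."""
    return len(set(cluster_ids))


# Made-up token log-probabilities for two sampled generations.
samples = [[-0.5, -0.5], [-1.0, -1.0, -1.0, -1.0]]
pe = predictive_entropy(samples)
lnpe = predictive_entropy(samples, length_normalize=True)
num_set = num_semantic_sets([0, 0, 1, 2, 1])  # made-up cluster labels
```

The longer generation dominates PE but not LNPE, which is the motivation for length normalization.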

We do not compare against the two recent approaches that adapt CP for correctness coverage in open-ended NLG tasks, for several reasons: (1) Conformal language modeling Quach et al. (2024) relies on white-box model likelihoods, which is impractical for recent LLMs served via API without logit access; (2) LofreeCP Su et al. (2024) is sensitive to the choice of dataset and model, and cannot consistently guarantee the correctness coverage rate; (3) Our conformal uncertainty criterion achieves strict control of the correctness coverage rate under various user-specified error rates, model settings, and datasets. It is the first to link black-box UQ with rigorous guarantees of correctness coverage, which meets the requirement for general NLG applications.

C.2 Datasets

CoQA Reddy et al. (2019) is a large-scale conversational QA dataset with more than 127k question-answer pairs equipped with contextual information. TriviaQA Joshi et al. (2017) is a reading comprehension dataset with over 650k question-answer pairs. MedQA Jin et al. (2021) is a medical MCQA dataset collected from professional medical board exams. MedMCQA Pal et al. (2022) is a large-scale MCQA dataset for practical medical entrance exam questions. For the evaluation of UQ, we randomly select 3,000 samples from each dataset. For the verification of correctness coverage guarantees, we utilize the development set (7,983 questions) of CoQA and full validation sets of MedQA and MedMCQA. For TriviaQA, we utilize the same 3,000 samples in UQ evaluations.

For CoQA, we utilize the contextual information combined with the question as the prompt. For TriviaQA and MedMCQA, we randomly select 5 question-answer pairs as a fixed few-shot template and combine it with the current question. For MedQA, we employ 3 question-answer pairs.
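The prompt construction can be sketched as follows. The `Q:`/`A:` layout, the helper names, and the placeholder question-answer pairs are assumptions made for illustration; the exact template wording is not specified here.

```python
import random


def make_few_shot_template(qa_pool, k, seed=0):
    """Sample k question-answer pairs once and render a fixed prefix."""
    rng = random.Random(seed)
    shots = rng.sample(qa_pool, k)
    return "".join(f"Q: {q}\nA: {a}\n\n" for q, a in shots)


def build_prompt(template, question, context=None):
    """Prepend optional context, then the fixed template, then the question."""
    prefix = f"{context}\n\n" if context else ""
    return f"{prefix}{template}Q: {question}\nA:"


# Placeholder pool; real pairs would come from the dataset itself.
qa_pool = [(f"placeholder question {i}", f"placeholder answer {i}")
           for i in range(20)]
template = make_few_shot_template(qa_pool, k=5)  # reused for every question
prompt = build_prompt(template, "placeholder test question")
# CoQA-style usage: context plus question, without few-shot examples.
coqa_prompt = build_prompt("", "placeholder question", context="placeholder passage")
```

Sampling the template once and reusing it keeps the few-shot examples fixed across all test questions, matching the description above.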

Appendix D Robustness of Conformal Uncertainty Criterion

We verify the correctness coverage guarantees on the other 6 LLMs across 4 datasets. As demonstrated in Figures 5 to 10, we achieve rigorous control of the coverage rate under various user-specified error rates across different model settings and datasets. We also report the correctness coverage rate under two strict error rates, 0.05 and 0.01. Table 5 and Table 6 indicate the robustness of our conformal uncertainty criterion.

Figure 5: Target vs. empirical correctness coverage rate. We test the 4 datasets utilizing the Mistral-7B-Instruct-v0.3 model as the generator.

Figure 6: Target vs. empirical correctness coverage rate. We test the 4 datasets utilizing the LLaMA-3-8B-Instruct model as the generator.

Figure 7: Target vs. empirical correctness coverage rate. We test the 4 datasets utilizing the LLaMA-2-13B-Chat model as the generator.

Figure 8: Target vs. empirical correctness coverage rate. We test the 4 datasets utilizing the Vicuna-13B-v1.5 model as the generator.

Figure 9: Target vs. empirical correctness coverage rate. We test the 4 datasets utilizing the LLaMA-3-70B-Instruct model as the generator.

Figure 10: Target vs. empirical correctness coverage rate. We test the 4 datasets utilizing the GPT-3.5-turbo model as the generator.

Table 5: The results of correctness coverage rate (%) on 7 LLMs across 4 open-ended NLG datasets. The user-accepted error rate $\alpha$ is strictly set to 0.05.

| LLMs | TriviaQA | CoQA | MedQA | MedMCQA |
| --- | --- | --- | --- | --- |
| LLaMA-2-7B-Chat | 95.26 | 96.45 | 100.00 | 95.99 |
| Mistral-7B-Instruct-v0.3 | 95.01 | 95.72 | 95.79 | 95.12 |
| LLaMA-3-8B-Instruct | 98.17 | 95.23 | 95.78 | 98.38 |
| LLaMA-2-13B-Chat | 95.04 | 96.96 | 95.15 | 96.59 |
| Vicuna-13B-v1.5 | 97.28 | 95.33 | 95.51 | 97.29 |
| LLaMA-3-70B-Instruct | 95.38 | 95.33 | 95.51 | 97.29 |
| GPT-3.5-turbo | 97.02 | 97.60 | 95.62 | 95.19 |

Table 6: The results of correctness coverage rate (%) on 7 LLMs across 4 open-ended NLG datasets. The user-accepted error rate $\alpha$ is strictly set to 0.01.

| LLMs | TriviaQA | CoQA | MedQA | MedMCQA |
| --- | --- | --- | --- | --- |
| LLaMA-2-7B-Chat | 99.93 | 99.83 | 100.00 | 99.14 |
| Mistral-7B-Instruct-v0.3 | 99.38 | 99.27 | 99.15 | 99.81 |
| LLaMA-3-8B-Instruct | 99.79 | 99.53 | 100.00 | 99.76 |
| LLaMA-2-13B-Chat | 99.06 | 99.13 | 99.51 | 99.48 |
| Vicuna-13B-v1.5 | 99.52 | 100.00 | 99.94 | 100.00 |
| LLaMA-3-70B-Instruct | 99.84 | 99.75 | 99.15 | 99.82 |
| GPT-3.5-turbo | 99.17 | 99.82 | 99.51 | 99.95 |