LLM-Oracle Machines

Jie Wang
Richard Miner School of Computer and Information Sciences, University of Massachusetts, Lowell, MA 01854, USA. Copyright © Jie Wang, 2024.

Abstract

Contemporary AI applications leverage large language models (LLMs) to harness their knowledge and reasoning abilities for natural language processing tasks. This approach shares similarities with the concept of oracle Turing machines (OTMs). To capture the broader potential of these computations, including those not yet realized, we propose an extension to OTMs: the LLM-oracle machine (LLM-OM), by employing a cluster of LLMs as the oracle. Each LLM acts as a black box, capable of answering queries within its expertise, albeit with a delay. We introduce four variants of the LLM-OM: basic, augmented, fault-avoidance, and $\epsilon$-fault. The first two are commonly observed in existing AI applications. The latter two are specifically designed to address the challenges of LLM hallucinations, biases, and inconsistencies, aiming to ensure reliable outcomes.

1 Introduction

In an oracle Turing machine, the oracle embodies a decision problem. It acts as a hypothetical, all-powerful entity that can instantly determine whether a query, an instance generated during the computation, is a positive instance of the decision problem. OTMs have played a significant role in the development of both computation theory and computational complexity theory.

Drawing inspiration from the concept of using external knowledge to assist with computing tasks, and motivated by the recent advancements in LLMs with their powerful knowledge bases and inference capabilities, we use a cluster of LLMs in place of the oracle in OTMs.

We treat LLMs as black boxes, capable of answering queries within their domains of expertise, albeit with a delay (footnote 1: the assumption of delay may be omitted if we are focusing on generating trustworthy results). Both queries and responses are exchanged in natural language, encompassing formatting. Since each LLM is trained on diverse datasets using different technologies, their capabilities vary. Therefore, we use a cluster of LLMs as the oracle, selecting an appropriate LLM to respond to a query during an LLM-OM computation.

Unlike in an OTM, where the oracle represents a decision problem and a query asks whether an instance is positive, a query generated in the computation of an LLM-OM consists of a task and a sequence of specifications, with the response being a solution to complete the task.

In a nutshell, the computation of an LLM-OM takes an input representing a task to accomplish, generates queries, acquires answers to each query from the appropriate LLM in the LLM-oracle, and continues this process until the LLM-OM reaches a halting state with the final answer.

Unlike in an OTM where the oracle reliably provides an answer to a query, LLMs can generate fabricated or misleading information, resulting in incorrect or inadequate answers (e.g., see [1, 2]). LLMs may also provide answers with different meanings to the same query at different times. These issues of information hallucination (or better phrased as “information nonsense” because LLMs cannot distinguish between truth and lies), inadequacy, and inconsistency are common in LLMs.

While advancing technologies aim to mitigate these issues, complete elimination remains challenging. Therefore, we assume that there exists a probability that an LLM may provide an unacceptable answer to a query. Depending on the context of the application, an unacceptable answer could be one that is outright incorrect or one that, while not incorrect, fails to meet the required level of adequacy.

2 LLM-OM Basics and Variants

An LLM-OM can be viewed as a deterministic algorithm with access to an LLM oracle. Similar to an OTM, each computation step in an LLM-OM represents a transition. We denote an LLM-OM as $M$, the input representing the task inquiry (including optional specifications) as $Q$, and the output as the answer $A$ to $Q$. This output $A$ can take various forms, including human-like text, code snippets, or other representations. Both inputs $Q$ and outputs $A$ can be encoded as binary strings for compatibility with traditional computing models.

During the computation, $M$ generates a query in the form of $(x; y_1, y_2, \ldots, y_k)$, denoted as $q_x$, where $x$ is an intermediate task for completing $Q$ and each $y_i$ is an attribute. These attributes can be:

  • Selected text to provide relevant context for $x$.

  • Specific requirement to detail the desired outcome of $x$.

  • Solution format to instruct $M$ on how to express the answer (e.g., text, code).

  • Verification method to specify how to validate the answer’s correctness.

  • Self-critique instruction to guide $M$ to evaluate its own response.

Other specifications with any additional details relevant to completing $x$ may also be attributes. Collectively, these attributes form a prompt that instructs the LLMs on how to approach the intermediate task $x$ effectively.
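The query structure $q_x = (x; y_1, \ldots, y_k)$ can be sketched as a small data type. This is only an illustration: the class name `Query`, the attribute labels, and the `to_prompt` rendering are assumptions introduced here, not part of the model definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """A query q_x = (x; y_1, ..., y_k): a task plus its attributes."""
    task: str                                       # the intermediate task x
    attributes: tuple[tuple[str, str], ...] = ()    # (kind, text) pairs y_1..y_k

    def to_prompt(self) -> str:
        """Render the task and its attributes as one natural-language prompt."""
        lines = [f"Task: {self.task}"]
        for kind, text in self.attributes:
            lines.append(f"{kind}: {text}")
        return "\n".join(lines)


# Hypothetical example query with three attributes.
q = Query(
    task="Summarize the passage in one sentence.",
    attributes=(
        ("Context", "LLM-oracle machines replace the oracle with a cluster of LLMs."),
        ("Solution format", "plain text"),
        ("Self-critique", "Check that the summary adds no new claims."),
    ),
)
prompt = q.to_prompt()
print(prompt)
```

Keeping the attributes as labeled pairs, rather than one free-form string, makes it easy to add or drop a specification without rewriting the whole prompt.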

The computation begins with $M$ interpreting the input task $Q$. If possible, it decomposes $Q$ into a sequence of smaller, more manageable subtasks, denoted as $Q_1, Q_2, \ldots, Q_m$. However, if $Q$ cannot be further broken down, it remains the sole subtask.

2.1 Basic variants: adaptive and non-adaptive

The basic variant of the LLM-OM utilizes two query types: adaptive and non-adaptive.

In an adaptive LLM-OM, subtasks are interdependent: some queries cannot be generated until the answers to earlier subtasks are available. For a subtask $Q_i$, the LLM-OM generates a query $q_{i,x}$ specifying the task within the query. It then retrieves an answer $a_x$ from a chosen LLM in the oracle and uses $a_x$ to determine the next step. The final answer is derived by combining answers from subtasks, potentially involving the LLM-oracle again.

In a non-adaptive LLM-OM, subtasks are independent. Each subtask $Q_i$ is completed by sending a set of independent queries generated during the computation to the LLM-oracle. The final answer is produced solely from these answers, without further interaction with the LLM-oracle.
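The difference between the two query types can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Oracle` type, the function names, and the stand-in `echo_oracle` are hypothetical, and a real LLM-OM would also select an LLM from the cluster for each query.

```python
from typing import Callable, Sequence

Oracle = Callable[[str], str]  # hypothetical: maps a query prompt to an answer


def run_non_adaptive(subtasks: Sequence[str], oracle: Oracle) -> list[str]:
    """All queries are fixed up front; no answer influences another query."""
    queries = [f"Task: {s}" for s in subtasks]
    return [oracle(q) for q in queries]


def run_adaptive(subtasks: Sequence[str], oracle: Oracle) -> list[str]:
    """Each query may embed the answers obtained for earlier subtasks."""
    answers: list[str] = []
    for s in subtasks:
        if answers:
            query = f"Task: {s}\nPrevious answers: {' | '.join(answers)}"
        else:
            query = f"Task: {s}"
        answers.append(oracle(query))
    return answers


def echo_oracle(query: str) -> str:
    """Stand-in oracle that reports how many lines the query had."""
    return f"{len(query.splitlines())} line(s)"


tasks = ["extract entities", "relate the entities"]
non_adaptive = run_non_adaptive(tasks, echo_oracle)
adaptive = run_adaptive(tasks, echo_oracle)
print(non_adaptive)
print(adaptive)
```

With the stand-in oracle, the second adaptive query is visibly longer than its non-adaptive counterpart because it carries the first answer as context.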

Ideally, the final answer $A$ should directly address the original inquiry $Q$ and remain relevant to the topic. Let $\text{toc}(X)$ denote the collection of topics within text $X$.

  • We say that $A$ is relevant to $Q$, denoted by $A \lesssim Q$, if $\text{toc}(Q) \subseteq \text{toc}(A)$.
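Once some implementation of $\text{toc}$ is available, the relevance relation reduces to a subset test. The sketch below assumes a toy keyword lexicon (`TOPIC_KEYWORDS`) purely for illustration; the paper does not prescribe how $\text{toc}$ is computed, and a practical version would need a proper topic model.

```python
import re

# Hypothetical topic lexicon; a real toc(X) would need an NLP topic model.
TOPIC_KEYWORDS = {
    "oracle machines": {"oracle", "turing"},
    "hallucination": {"hallucination", "hallucinate", "fabricated"},
    "retrieval": {"retrieval", "rag"},
}


def toc(text: str) -> set[str]:
    """Toy toc(X): the topics whose keywords occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {topic for topic, keys in TOPIC_KEYWORDS.items() if words & keys}


def is_relevant(answer: str, inquiry: str) -> bool:
    """A is relevant to Q when toc(Q) is a subset of toc(A)."""
    return toc(inquiry) <= toc(answer)


Q = "How does an oracle machine cope with hallucination?"
A_good = "The oracle is a cluster of LLMs, and voting can reduce hallucination."
A_bad = "Retrieval pipelines fetch documents before generation."

print(is_relevant(A_good, Q))
print(is_relevant(A_bad, Q))
```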

Remark 1. The basic functionality of many LLM web applications, like free versions of ChatGPT and Gemini, exemplifies the non-adaptive LLM-OM variant. In these applications, users submit queries, which directly translate to the subtasks in the LLM-OM. The system then independently generates a set of queries based on the user’s input and sends them to the underlying LLM-oracle. Finally, the application presents the user with the final answer derived solely from these responses, without further interaction with the LLM.

2.2 Augmented LLM-OM

The augmented LLM-OM takes a pair $(T, Q)$ as input. Here, $T$ represents an “augmented text” in natural language, acting as verified background knowledge or ground truth. $Q$ remains the task inquiry, specifying what information to extract or infer from $T$. This information can include specific sentences, topics, summaries, entities, relationships between entities, events, event relationships, logical consequences, or numerical consequences. Similar to the basic variant, queries generated during the computation can be either adaptive or non-adaptive.

Ideally, the answer $A$ to the inquiry $Q$ should align with the augmented text $T$. Here’s a formal definition of this alignment. Let $\text{mng}(X)$ denote the set of meanings of text $X$.

  • We say that $A$ complies with $T$ with respect to $Q$, written as $A \lesssim_Q T$, if the following conditions hold:

    1. Relevance: $A$ is relevant to $Q$. Namely, $A \lesssim Q$.

    2. Inclusiveness: $A$ is inclusive in $T$. Namely, $\text{mng}(A) \subseteq \text{mng}(T)$.

Note that even if two texts $X$ and $Y$ have the same set of meanings (i.e., $\text{mng}(X) = \text{mng}(Y)$), it doesn’t necessarily mean they are identical. There are many ways to express the same idea with different wording.

To complete the task $x$ represented by query $q_x$, the LLM-OM $M$ first identifies relevant content, denoted as $C_x$, within the augmented text $T$. It then leverages this content $C_x$ along with an appropriate LLM from the LLM-oracle to generate an answer to $x$, denoted as $a_x$, ideally complying with $C_x$ with respect to $q_x$, namely, $a_x \lesssim_{q_x} C_x$.
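The two steps above (identify $C_x$ in $T$, then build a query that confines the LLM to $C_x$) can be sketched as follows. Word overlap is used here only as a stand-in for content selection; it is an assumption of this sketch, not the method of the paper, and the function names are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lower-cased alphabetic tokens of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_content(task: str, augmented_text: list[str]) -> str:
    """Toy C_x: the passage of T sharing the most words with task x."""
    return max(augmented_text, key=lambda passage: len(words(passage) & words(task)))


def build_query(task: str, content: str) -> str:
    """Form q_x with C_x as context and a requirement to stay within it."""
    return f"Task: {task}\nContext: {content}\nRequirement: answer only from the context."


T = [
    "The oracle in an OTM embodies a decision problem.",
    "An LLM-OM uses a cluster of LLMs as the oracle.",
]
x = "What does an LLM-OM use as the oracle?"
C_x = select_content(x, T)
q_x = build_query(x, C_x)
print(q_x)
```

The resulting `q_x` would then be sent to a chosen LLM in the LLM-oracle; checking that the returned answer actually complies with `C_x` is a separate step.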

Remark 2. Web applications like ChatGPT 4o that employ Retrieval-Augmented Generation (RAG) techniques with an LLM can be seen as a practical application of the augmented LLM-OM framework. In RAG-based applications, the retrieved information acts as the augmented text that confines the LLM’s response. This retrieved information helps the LLM generate answers that are more grounded in factual evidence and more likely to comply with the user’s query.

2.3 Fault-avoidance LLM-OM

The basic and augmented LLM-OM variants do not inherently guarantee consistency, correctness, or adequacy in the final answer $A$ for several reasons:

  1. Limitations of LLMs: Even the most advanced LLMs can be susceptible to biases, factual inaccuracies, and hallucinations in their responses. These issues can directly translate into inconsistencies or errors in the final answer generated by the LLM-OM.

  2. Incomplete or unclear input: If the initial user inquiry $Q$ or the augmented text $T$ in the augmented variant is ambiguous, incomplete, or misleading, it can lead the LLM-OM down an incorrect path and ultimately result in an inadequate or incorrect answer.

  3. Dependence on LLM selection: The choice of LLM from the LLM-oracle can also influence the outcome. Different LLMs have varying strengths and weaknesses, and an inappropriate selection might hinder the generation of a consistent or accurate answer.

  4. Query design challenges: Crafting effective intermediate queries $q_x$ is crucial. Poorly designed queries can lead the LLM-OM to misunderstand the intent or miss key aspects of the overall task, compromising the final answer’s adequacy.

For the purpose of illustration, we assume that both the user inquiry $Q$ and the augmented text $T$ in the augmented variant are well-defined and free from ambiguity, incompleteness, and misleading information. This idealized scenario allows us to focus on the limitations inherent to the LLM-oracle itself, independent of potential issues with the input data. We define these terms (consistency, correctness, and adequacy) as follows:

  • $M$ is consistent if, for any user inquiries $Q$ and $Q'$ posed at different points in time with $\text{mng}(Q') = \text{mng}(Q)$, $M$ outputs an answer $A$ to $Q$ and an answer $A'$ to $Q'$ such that $\text{mng}(A) = \text{mng}(A')$.

  • Suppose $M$ is augmented, with input $(T, Q)$ and output $A$. We say that $A$ is correct for $Q$ with respect to $T$ if the following conditions hold:

    1. Compliance: $A$ complies with the ground truth $T$ with respect to $Q$. Namely, $A \lesssim_Q T$.

    2. Exclusiveness: $A$ doesn’t include unnecessary information outside the scope of $Q$. In other words, if we let $\text{intent}(Q)$ denote the set of intended meanings of $Q$, then $\text{mng}(A) \subseteq \text{intent}(Q)$.

    3. Completeness: $A$ doesn’t miss any information contained in $T$ that is relevant to $Q$. In other words, $(\text{mng}(T) - \text{mng}(A)) \cap \text{intent}(Q) = \emptyset$.

    4. Distribution agreement: If $A$ has more than one meaning, then $\text{dist}(A) = \text{dist}(T_A)$, where $\text{dist}(A)$ denotes the distribution of meanings within answer $A$ and $\text{dist}(T_A)$ denotes the distribution of meanings in the corresponding text for $A$ in the augmented text $T$.

    We say that $M$ is correct with respect to $T$ if for any input $(T, Q)$, $M$ returns an answer $A$ that is correct for $Q$ with respect to $T$.

  • Suppose $M$ is an augmented LLM-OM with input $(T, Q)$ that generates an answer $A$. We say that answer $A$ is adequate for $Q$ with respect to $T$ if $A$ complies with $T$ with respect to $Q$. However, unlike correctness, adequacy allows for some flexibility in the answer:

    1. Non-exclusiveness: $A$ may contain information that extends beyond the specific information requested in the user inquiry $Q$. Namely, $\text{mng}(A) - \text{intent}(Q) \neq \emptyset$ is allowed.

    2. Incompleteness: $A$ may miss information from $T$ that pertains to $Q$. Namely, $(\text{mng}(T) - \text{mng}(A)) \cap \text{intent}(Q) \neq \emptyset$ is allowed.

    3. Distribution discrepancy: $\text{dist}(A) \neq \text{dist}(T_A)$ is allowed.

    Essentially, adequacy acknowledges that the answer $A$ may not be perfect, but it still provides information from the augmented text that is valuable for the inquiry.

    We say that $M$ is adequate if for all inputs $(T, Q)$, $M$ returns an answer $A$ that is adequate for $Q$ with respect to $T$.

  • If $M$ is not augmented, assume the existence of a set of texts, denoted by $U$, that represents the true knowledge and information for the areas of interest. We say that $M$ is correct if $M$ is correct with respect to $U$, and that $M$ is adequate if $M$ is adequate with respect to $U$. Set $U$ is referred to as the absolute truth.

Note that if $A$ is correct for $Q$ with respect to $T$, then $A$ is adequate for $Q$ with respect to $T$, but not vice versa.
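The definitions above can be made concrete once $\text{toc}$, $\text{mng}$, and $\text{intent}$ are represented as finite sets. The sketch below is an illustration under that assumption; it omits the distribution-agreement condition, so `is_correct_without_dist` checks only the first three conditions of correctness. All function names and the labels `m1`, `m2`, `m3` are hypothetical.

```python
def complies(toc_A: set, toc_Q: set, mng_A: set, mng_T: set) -> bool:
    """Compliance: relevance (toc(Q) within toc(A)) and inclusiveness (mng(A) within mng(T))."""
    return toc_Q <= toc_A and mng_A <= mng_T


def is_adequate(toc_A: set, toc_Q: set, mng_A: set, mng_T: set) -> bool:
    """Adequacy only requires compliance."""
    return complies(toc_A, toc_Q, mng_A, mng_T)


def is_correct_without_dist(toc_A: set, toc_Q: set, mng_A: set,
                            mng_T: set, intent_Q: set) -> bool:
    """Compliance, exclusiveness, and completeness (distribution agreement omitted)."""
    exclusive = mng_A <= intent_Q
    complete = not ((mng_T - mng_A) & intent_Q)
    return complies(toc_A, toc_Q, mng_A, mng_T) and exclusive and complete


# Hypothetical meaning labels: T carries m1..m3, Q intends m1 and m2.
topics = {"oracle"}
mng_T = {"m1", "m2", "m3"}
intent_Q = {"m1", "m2"}

exact = {"m1", "m2"}    # exactly what Q asks for
loose = {"m1", "m3"}    # misses m2 and adds m3, yet stays within T

exact_correct = is_correct_without_dist(topics, topics, exact, mng_T, intent_Q)
loose_correct = is_correct_without_dist(topics, topics, loose, mng_T, intent_Q)
loose_adequate = is_adequate(topics, topics, loose, mng_T)
print(exact_correct, loose_correct, loose_adequate)
```

The `loose` answer illustrates the "not vice versa" direction: it is adequate because it complies with $T$, but it fails both exclusiveness and completeness.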

A fault-avoidance LLM-OM is an augmented LLM-OM that is consistent and correct. A weak fault-avoidance LLM-OM is an augmented LLM-OM that is consistent and adequate. It is necessary for $M$ to identify the best-matched content $C_x$ from $T$ for each query $q_x$, and the chosen LLM from the LLM-oracle must comply with $C_x$ when generating an answer $a_x$ to $q_x$.

Verifying the correctness or adequacy of the answer $A$ for a given query $Q$ with respect to $T$ calls for concrete implementations of $\text{toc}(X)$, $\text{mng}(X)$, $\text{intent}(Q)$, and $\text{dist}(X)$ using techniques in natural language processing, machine learning, deep learning, and other methods.

Consistency, however, is harder to verify. We may aim to develop a method that guarantees, with a desired high probability, that $M$ is consistent.

2.4 $\epsilon$-fault LLM-OM

An $\epsilon$-fault LLM-OM is a non-augmented LLM-OM that is consistent and correct with probability $1 - \epsilon$ with respect to the absolute truth for the areas of interest, where $\epsilon$ is some small positive parameter. A weak $\epsilon$-fault LLM-OM is defined similarly, with adequacy in place of correctness.

Let $L_1, L_2, \ldots, L_k$ be the LLMs in the LLM-oracle, where each $L_i$ has a small probability $p_i$ of generating hallucinated answers to queries. Different LLMs may hallucinate on different queries. However, $M$ doesn’t know which LLM will hallucinate an answer to a given query, and $M$ alone cannot verify if an answer is incorrect, as the absolute truth is not provided as an augmented input.

We aim to investigate whether it is possible to utilize these LLMs to obtain an answer to a query with a certain level of guarantee that the answer is correct or adequate with a desired probability $1 - \epsilon$ for some $\epsilon$.

Making certain reasonable assumptions may be useful in the quest to obtain such a result. For example, we may assume that for a given query, there is always an $L_i$ that produces a correct and adequate answer; we just don’t know which one.
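To give a feel for how the $p_i$ could translate into an $\epsilon$, the sketch below computes the probability that fewer than half of the $k$ LLMs hallucinate on a query. It rests on assumptions that the text above does not make: hallucination events are independent across LLMs, and all non-hallucinated answers share the same meaning, so a strict majority of acceptable answers would win a vote. The text does not propose majority voting; it is used here only as one illustrative scheme, and `prob_majority_correct` is a hypothetical name.

```python
def prob_majority_correct(p: list[float]) -> float:
    """P(fewer than half of the k LLMs hallucinate), given independent rates p_i."""
    # dist[j] = probability that exactly j of the LLMs seen so far hallucinate
    dist = [1.0]
    for p_i in p:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p_i)   # this LLM answers acceptably
            nxt[j + 1] += mass * p_i       # this LLM hallucinates
        dist = nxt
    k = len(p)
    # A strict majority is acceptable exactly when j hallucinations satisfy 2*j < k.
    return sum(mass for j, mass in enumerate(dist) if 2 * j < k)


rates = [0.1, 0.1, 0.1]            # hypothetical p_1, p_2, p_3
epsilon = 1.0 - prob_majority_correct(rates)
print(round(epsilon, 6))
```

For three LLMs with $p_i = 0.1$ each, the failure probability under these assumptions is $1 - (0.9^3 + 3 \cdot 0.1 \cdot 0.9^2) = 0.028$, lower than the 0.1 of any single LLM. If hallucinations are correlated across LLMs, this bound no longer holds.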

References

  • [1] J. Li, J. Chen, R. Ren, X. Cheng, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “The dawn after the dark: An empirical study on factuality hallucination in large language models,” 2024. arXiv 2401.03205.
  • [2] R. Stureborg, D. Alikaniotis, and Y. Suhara, “Large language models are inconsistent and biased evaluators,” 2024. arXiv 2405.01724.