Into the Unknown Unknowns: Engaged Human Learning
through Participation in Language Model Agent Conversations
Yucheng Jiang1* Yijia Shao1* Dekun Ma2 Sina J. Semnani1 Monica S. Lam1
1 Stanford University, 2 Yale University
{yuchengj, shaoyj}@stanford.edu, dekun.ma@yale.edu, {sinaj, lam}@cs.stanford.edu
arXiv:2408.15232v1 [cs.CL] 27 Aug 2024

Abstract

While language model (LM)-powered chatbots and generative search engines excel at answering concrete queries, discovering information in the terrain of unknown unknowns remains challenging for users. To emulate the common educational scenario where children/students learn by listening to and participating in conversations with their parents/teachers, we create Collaborative STORM (Co-STORM). Unlike QA systems that require users to ask all the questions, Co-STORM lets users observe and occasionally steer the discourse among several LM agents. The agents ask questions on the user's behalf, allowing the user to discover unknown unknowns serendipitously. To facilitate user interaction, Co-STORM assists users in tracking the discourse by organizing the uncovered information into a dynamic mind map, ultimately generating a comprehensive report as takeaways. For automatic evaluation, we construct the WildSeek dataset by collecting real information-seeking records with user goals. Co-STORM outperforms baseline methods on both discourse trace and report quality. In a further human evaluation, 70% of participants prefer Co-STORM over a search engine, and 78% favor it over a RAG chatbot.

[Figure 1 (diagram omitted). Panels: "Use Search Engines", "Read Existing Reports", "Interact with Chatbots", and "Interact with Co-STORM". Each panel is annotated with the user effort involved (Low or High) and whether it supports interaction and exploring "unknown unknowns". In the Co-STORM panel, the user observes the discourse or takes a turn among a moderator and participants with perspectives A and B, and can request generation of a cited report from the discourse.]

Figure 1: Comparison of different paradigms for learning and information seeking. Co-STORM enables humans to observe and participate in a collaborative discourse among LM agents with different roles. Users can request the system to generate a full-length cited report based on the discourse history and the information collected.
* Equal contribution.

1 Introduction

Recent advancements in language models (LMs) (Bai et al., 2022; OpenAI, 2023; Gemini Team, 2024) and retrieval-augmented generation (RAG) (Lewis et al., 2021) have led to more capable chatbots and emerging generative search engines (Liu et al., 2023a). Compared to traditional search engines and information retrieval (IR) models (Robertson, 1977), these systems fulfill user queries by generating direct responses, effectively addressing known unknowns, where users are aware of their information needs.

However, a gap remains in using these systems for complex information-seeking scenarios, such as academic research, market analysis, and decision-making, where the system should expose users to their unknown unknowns to facilitate knowledge discovery. While the concept of unknown unknowns originally referred to unexpected risks in the military, it is linked to the serendipitous discovery of information in the information research context (Foster and Ford, 2003; Agarwal, 2015). Specifically, Kirzner (1997) directly contrasts such discovery ("the realization that one had overlooked something in fact readily available") with successful search ("the deliberate production of information which one knew one had lacked").

Prior work on automated expository writing (Shen et al., 2023) attempts to help readers reach the terrain of unknown unknowns by curating information from various sources into unified articles with substantial breadth and depth. In particular, by proposing the STORM writing system, Shao et al. (2024) demonstrate that LMs paired with search engines can automatically generate Wikipedia-like articles on arbitrary topics. However, with the static report as the final outcome, STORM does not support any user interaction, which is crucial in complex information seeking, where there is no single gold query and queries instead evolve dynamically towards a goal (Bates, 1989).

This dynamic and exploratory nature makes designing assistance systems challenging. Traditional search engines and RAG chatbots passively react to users' search queries or conversational questions, often inducing echo chamber effects (Sharma et al., 2024) or high cognitive load, as users with limited prior knowledge may even struggle to formulate questions (Kuhlthau, 1991; Belkin et al., 1982).

To surface unknown unknowns and better support user interaction, we propose Collaborative STORM (Co-STORM), an information-seeking assistance system that supports collaborative discourse among users and multiple LM agents. Unlike the one-question-one-answer mode of interaction, Co-STORM allows users to learn by observing and participating occasionally in the discourse, emulating a common educational scenario (Nussbaum, 2008). To facilitate a thought-provoking discourse and serendipitous discovery, Co-STORM simulates two agent types grounded in the search engine: experts who participate by asking or answering questions with different perspectives, and a moderator, a non-expert who knows enough to ask good questions and steers the discourse. The user can jump in at any time to steer the discourse and inject questions and opinions according to their interest. Co-STORM maintains a dynamic, hierarchical mind map to ensure they can easily follow and engage (Buzan, 1974). Upon the conclusion of the discourse, users can request the system to generate a cited report based on the mind map.

For evaluation, we introduce WildSeek, a dataset of topics and user goals from real users engaged in complex information-seeking across multiple domains.^1 We propose automatic metrics to assess both discourse trace and final report quality. Our results show that Co-STORM significantly outperforms RAG chatbots in surfacing in-depth and serendipitous information while providing a more engaging learning experience.

^1 Our data and code will be released upon acceptance.

We further conduct a human evaluation by inviting 20 users with diverse backgrounds to compare Co-STORM with a search engine and a RAG chatbot. 70% preferred Co-STORM over the search engine, and 78% preferred it over the RAG chatbot for the overall information-seeking experience. Participants find that Co-STORM facilitates serendipitous discovery and requires less mental effort.

Our main contributions include:
• We propose Co-STORM, a novel system that combines collaborative discourse emulation, human interaction, and information organization to assist learning and complex information seeking.
• We construct the WildSeek dataset from real-world human information-seeking records to evaluate information-seeking assistance tools.
• Results from both automatic and human evaluation show that Co-STORM can help humans discover unknown unknowns with less mental effort required.

2 Complex Information Seeking

2.1 Problem Formulation

Pirolli (2009) defines complex information-seeking as part of the broader sensemaking process, involving collecting, sifting, understanding, and organizing information from large collections to generate a knowledge product. Prevalent in domains such as investigative journalism, scientific research, and market analysis, this task has the following properties: (1) it requires seeking information from multiple sources to address various facets of a topic rather than retrieving a document that best matches a query; (2) it involves ongoing user interaction rather than processing a single query; (3) it produces a report-like curated information product rather than a single short-form answer. As shown in Table 1, none of the existing information-seeking assistance systems (e.g., Robertson, 1977; Chen et al., 2017; Reddy et al., 2019; Shao et al., 2024) can fully support this task.

Given a user with an initial topic of interest t and an initial goal g, and a repository of information R, the task is to interact with the user and write a custom long-form report tailored to the user's interest;
the report is a sequence of sentences S = s_1 s_2 ... s_n, with each sentence citing a set of retrieved information I ⊂ R, for the sake of verifiability.

                        Multiple Sources   Ongoing Interact   Curated Report
Information Retrieval          ✗                  ✗                 ✗
Single-Turn QA                 ✓                  ✗                 ✗
Conversational QA              ✓                  ✓                 ✗
Report Generation              ✓                  ✗                 ✓
Co-STORM                       ✓                  ✓                 ✓

Table 1: Comparison of different information-seeking assistance systems.

Domain: Economics
Topic: Development of a Shared Trading Currency to Facilitate International Trade
Latent goal: Investigate how a new shared currency could eliminate transaction costs and boost GDP among member countries.

Table 2: A sample data point in the WildSeek dataset for studying complex information-seeking tasks; the topic and goal are provided by users on the publicly available STORM website, and the domain is assigned manually.

2.2 WildSeek: An In-the-Wild Information Seeking Dataset

To study users' interests in complex information-seeking tasks in the wild, we utilized data collected from the open-source STORM web application^2, which generates comprehensive long-form reports based on users' specified topics of interest and goals for using the site. Each data point is a pair comprising a topic and the user's goal. To improve the quality of the dataset, we retain only those data points that are well motivated by applying rule-based filtering followed by binary classification using an LM (gpt-4o-2024-05-13). Next, we use the same LM to predict the taxonomy class of each topic, followed by manual review and refinement. Finally, we downsample the data to create a dataset with 100 data points across 24 different domains. Table 2 shows a sample data point from the dataset; further details about the dataset are in Appendix A.

^2 Available at https://storm.genie.stanford.edu

3 Method

"Tell me and I forget. Teach me and I remember. Involve me and I learn." (Benjamin Franklin)

3.1 Collaborative Discourse Protocol

Nussbaum (2008) emphasizes the importance of collaborative discourse in fostering deep understanding and critical thinking in human learning. Since it is difficult to assemble a group of human experts for collaborative discourse on any topic at any time, we propose Co-STORM (Figure 2) to emulate this process with multiple LM agents to assist human information seeking and learning. Formally, the collaborative discourse, D = {u_1, u_2, ..., u_n}, consists of turn-based textual utterances u_i from one of three roles: the user (§3.3), experts with diverse perspectives (§3.4), and a moderator guiding the discourse and injecting questions (§3.5). The discourse begins with N experts, P = {p_1, ..., p_N}, discussing the topic t for one turn per expert to warm up the conversation. Co-STORM dynamically maintains a mind map (§3.2) to track the discourse and construct shared knowledge between the user and the system.

Utterance Intent. Inspired by the utterance intent taxonomy for information-seeking conversations proposed by Qu et al. (2019), we associate each agent utterance u_i with an agent intent type a_i, where a_i can be one of the following: ORIGINAL QUESTION (which initiates a new question), INFORMATION REQUEST (which seeks additional information from the prior utterance), POTENTIAL ANSWER (which offers a possible answer to a previously posed question), and FURTHER DETAILS (which provides supplementary information to a previous answer). We group ORIGINAL QUESTION and INFORMATION REQUEST as question-asking and the other two intents as question-answering.

Initiative Management. Traum (2003) underscores the necessity of discourse management in multiparty dialogues. While existing systems support either just user initiatives (e.g., QA systems) or just agent initiatives (e.g., STORM), Co-STORM adopts a mixed-initiative approach. When the user actively engages, the system continues the discourse based on the user's question or argument, allowing for a more targeted discussion. Otherwise, the system automatically generates the next turn. The user controls who takes the initiative, as Co-STORM allows the user to take a turn anytime.

Turn Management. If the user does not take the turn at timestamp i, Co-STORM needs to determine which LM agent should generate the next utterance u_i. Its protocol is to let different experts, p_1, ..., p_N, take turns in sequence. To prevent them
[Figure 2 (diagram omitted). The example user wants to learn about AlphaFold 3 in order to report on this technical breakthrough. Panels: "Track the Discourse", showing the mind map (§3.2) with top-level concepts such as "Background and Development", "Technical Aspects", and "Impact and Applications" (with subtopics such as "Drug Discovery Acceleration"); and "Observe and Participate", showing the collaborative discourse among an AI Expert, a Geneticist, a Molecular Biology Expert, the user (§3.3), and the moderator. Perspective-Guided Expert Pipeline (§3.4): Choose Intent, then either Question Asking (Generate Question, Polish Utterance) or Question Answering (Generate Queries and Retrieve from a large information repository such as the Internet, Generate Cited Response, Polish Utterance). Moderator Pipeline (§3.5): Rerank Unused Information, Generate Question, Polish Utterance, Update Participant List. The discourse history is filtered to update the mind map, which is used to generate the cited report.]

Figure 2: Overview of Co-STORM. Co-STORM emulates a collaborative discourse among the user, simulated perspective-guided experts, and a simulated moderator. It maintains a dynamically updated mind map (§3.2) to help the user track and engage in the discourse (§3.3). The simulated expert is prompted to determine the utterance intent based on discourse history and generate a question or an answer grounded in the Internet (§3.4). The simulated moderator is prompted with unused information and the mind map to generate a new question to automatically steer the discourse (§3.5). The mind map can be used to generate a full-length cited report as takeaways. The complete discourse transcript and the associated report are detailed in Appendix §G and §H.
from only expanding on the same point, upon observing L consecutive turns of expert responses with intents being either POTENTIAL ANSWER or FURTHER DETAILS, the system asks the moderator to intervene. In §5.3, we analyze the benefit of this protocol design.

3.2 Tracking the Discourse with a Mind Map

Having shared knowledge or a shared conceptual space is critical for collaboration (Roschelle and Teasley, 1995). To help users track the discourse and reduce their cognitive load, Co-STORM uses a tree-structured mind map M to dynamically organize collected information in the discourse D. Specifically, M = (C, E) is a hierarchical organization of concepts C, where its directed edges E characterize latent parent-child relationships among topics (e.g., in Figure 2, "Drug Discovery Acceleration" is a subtopic of "Impact and Applications"). Each concept c ∈ C is associated with a subset of retrieved information I_c ⊂ I. To ensure M is an intent-driven organization of information, each piece of information is also associated with the question that leads to its retrieval.

Co-STORM dynamically updates the mind map through two operations, insert and reorganize. Insert places a piece of information under the most appropriate concept by first deriving a set of candidate concepts using semantic similarity between its associated question and each concept in C, then prompting the LM to choose the final placement. When a concept c has more than K pieces of information, M triggers the reorganize operation. Reorganize prompts the LM to generate a list of new subtopic names under c, and applies insert to place each piece of information associated with c in the subtree rooted at c. After expansion, Co-STORM adopts a bottom-up cleaning process to iteratively delete concepts with no supporting information and collapse concepts with only one subtopic. More details on the insert operation are included in Appendix B.

3.3 User Participation

When the user injects an utterance u, Co-STORM uses u as the query to retrieve information to
prompt the LM to obtain an updated list of experts, P′. Following this update, the system switches back to the auto-steering mode where the expert or the moderator takes turns according to the turn management protocol introduced in §3.1. Once the user is satisfied with the discourse, Co-STORM generates the final report S as the curated information product of the collaborative discourse. This report is generated section by section, using the mind map M as the outline and the retrieved information I_c associated with each concept c.

3.4 Simulating the Roundtable Participant

Following STORM (Shao et al., 2024), which uses perspective-guided question asking to improve question diversity and quality, Co-STORM personalizes simulated experts with different expertise to represent different perspectives. Co-STORM retrieves the background of topic t with a search query and gives it to an LM to generate the expert list P = {p_1, ..., p_N}. For example, for the topic "AlphaFold3" in Figure 2, the LM suggests an "AI Expert", a "Geneticist", and a "Molecular Biology Expert" to participate in the discourse.

If there is no interruption by the user or the moderator, each expert p_j sequentially takes turns with the following procedure. To generate an utterance from an expert p in turn i: (1) The LM is prompted to choose the intent a_i based on the discourse history {u_1, ..., u_{i−1}} and the expert's perspective p. (2) If the intent a_i is POTENTIAL ANSWER or FURTHER DETAILS, we prompt the LM to generate a search query q, retrieve information with a search engine^3, and generate a response with citations; otherwise, we prompt the LM to directly generate a question based on the discourse history. (3) We use the LM to polish the utterance to make it more chatty and engaging.

^3 Following Shao et al. (2024), when retrieving information with a search engine, we apply rule-based filtering according to the Wikipedia guideline https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources.

3.5 Simulating the Moderator

If all the simulated participants are experts, we find that the discourse tends to consist mostly of utterances with intent FURTHER DETAILS, leading to repetition and niche discussions. The moderator plays the important role of injecting new directions into the discourse. To generate the moderator's utterance, Co-STORM instructs the LM to generate informed questions based on the uncited sources retrieved since the last moderator turn. To choose among the many uncited sources, the moderator reranks each piece of information i based on its similarity to the topic t and its dissimilarity to its associated question q. Formally, the reranking score is

    \cos(i, t)^{\alpha} \, (1 - \cos(i, q))^{1 - \alpha},    (1)

where i, t, q are the corresponding text embeddings and α is a hyperparameter. This reranking function prioritizes information that does not directly answer the original question but is relevant to the topic t. Co-STORM concatenates these reranked sources along with concept names in C to avoid repetitive concepts. This combined context is used to prompt the LM to generate the question for the moderator turn and an updated list of experts, P′.

4 Co-STORM Implementation

The LM component of Co-STORM is implemented using zero-shot prompting via the DSPy framework (Khattab et al., 2023) and gpt-4o-2024-05-13 (see full prompts in Appendix D). We ground Co-STORM on the Internet using the You.com search API^4, although the system is compatible with other search engines or IR systems. Hyperparameters N, K, L, α are set to 3, 10, 2, and 0.5, respectively. The text embeddings in Eq. 1 are obtained from text-embedding-3-small. We set the LM temperature to 1.0 and top_p to 0.9 for all experiments. For human evaluation, we develop a web application (Figure 12) for users to interact with Co-STORM in real time.

^4 https://documentation.you.com/api-reference/search

5 Automatic Evaluation

Automatic evaluation enables scalable testing and allows for consistent simulation of user behavior. We compare Co-STORM with the following baselines: (1) RAG Chatbot, a baseline that retrieves information from the search engine and interacts with the user through a one-question-one-answer paradigm. (2) STORM + QA, a baseline that uses the STORM framework (Shao et al., 2024) to generate a report for a given topic to provide general information. It then allows the user to ask follow-up questions and provides corresponding answers retrieved with the search engine.
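The reranking score of Eq. 1 (§3.5) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-D vectors standing in for text-embedding-3-small embeddings, and the clamping of both factors at zero (so that a negative cosine never meets a fractional exponent) are assumptions made for illustration. Only the formula and the default α = 0.5 come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank_score(info_vec, topic_vec, question_vec, alpha=0.5):
    """Eq. 1: cos(i, t)^alpha * (1 - cos(i, q))^(1 - alpha)."""
    # Assumption (not stated in the paper): clamp both factors at 0 so that
    # a negative base is never raised to a fractional power.
    topic_sim = max(cosine(info_vec, topic_vec), 0.0)
    question_dissim = max(1.0 - cosine(info_vec, question_vec), 0.0)
    return (topic_sim ** alpha) * (question_dissim ** (1.0 - alpha))


def rerank(items, topic_vec, alpha=0.5):
    """Sort (label, info_vec, question_vec) triples, highest score first."""
    return sorted(
        items,
        key=lambda item: rerank_score(item[1], topic_vec, item[2], alpha),
        reverse=True,
    )


if __name__ == "__main__":
    topic = [1.0, 0.0]      # hypothetical topic embedding
    question = [0.0, 1.0]   # hypothetical embedding of the originating question
    sources = [
        ("answers the question only", [0.0, 1.0], question),
        ("on topic, off question", [1.0, 0.0], question),
        ("in between", [1.0, 1.0], question),
    ]
    for label, _, _ in rerank(sources, topic):
        print(label)
```

In this toy setup, the source that is close to the topic but far from the question that retrieved it ranks first, which is the behavior the paper describes for surfacing unused, on-topic information.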
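The turn-management protocol of §3.1, with the reported setting L = 2 from §4, can be sketched as a small scheduling function. This is a hypothetical illustration rather than the released code: the `(role, intent)` history representation, the function names, and the rule that any non-expert or question-asking turn resets the streak are assumptions.

```python
# Intents that count as question-answering in the paper's taxonomy (§3.1).
ANSWER_INTENTS = {"POTENTIAL ANSWER", "FURTHER DETAILS"}


def trailing_expert_answers(history):
    """Count consecutive expert question-answering turns at the end of the history.

    `history` is a list of (role, intent) pairs, where role is "user",
    "expert", or "moderator" (a hypothetical representation).
    """
    count = 0
    for role, intent in reversed(history):
        if role == "expert" and intent in ANSWER_INTENTS:
            count += 1
        else:
            break
    return count


def next_speaker(history, experts, next_expert, L=2, user_takes_turn=False):
    """Return (speaker, index of the expert who speaks after the next expert turn)."""
    if user_takes_turn:
        # Mixed initiative: the user may take a turn at any time.
        return "user", next_expert
    if trailing_expert_answers(history) >= L:
        # Experts have only been answering for L turns: the moderator intervenes.
        return "moderator", next_expert
    # Otherwise the experts take turns in sequence.
    return experts[next_expert], (next_expert + 1) % len(experts)


if __name__ == "__main__":
    experts = ["AI Expert", "Geneticist", "Molecular Biology Expert"]  # N = 3
    history = [("expert", "POTENTIAL ANSWER"), ("expert", "FURTHER DETAILS")]
    print(next_speaker(history, experts, next_expert=2))
```

With the two trailing answer turns in the usage example, the sketch hands the turn to the moderator instead of the third expert.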
                   |                 Report Quality                   | Question-Answering Turn Quality
                   | Relevance  Breadth  Depth   Novelty  Info Diversity | Consistency  Engagement  # Unique URLs
RAG Chatbot        |   3.57      3.50    3.26    2.44        0.595     |    4.37        4.13         2.94
STORM + QA         |   3.61      3.61    3.43    2.50        0.592     |    4.34        4.11         2.89
Co-STORM           |   3.78      3.79    3.77†   3.05†       0.602     |    4.40†       4.33†        6.04†
  w/o Multi-Expert |   3.73      3.75    3.77    2.93        0.589     |    4.40        4.32         5.91
  w/o Moderator    |   3.56      3.69    3.41    2.89        0.577     |    4.39        4.28         5.67

Table 3: Automatic evaluation results for report quality and the quality of question-answering turns in the discourse with simulated users. Ablations are included as follows: "w/o Multi-Expert" denotes 1 expert and 1 moderator, and "w/o Moderator" denotes N experts and 0 moderators. † denotes significant differences (p < 0.05) from a paired t-test between Co-STORM and both baselines. The rubric grading uses a 1-5 scale. All scores reported are the mean values.
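The "Info Diversity" column in Table 3 is defined in §5.2 as one minus the average pairwise cosine similarity over the collected information I. A minimal sketch of that computation follows. The function names and toy vectors are illustrative assumptions; the paper obtains the embeddings from text-embedding-3-small.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def information_diversity(embeddings):
    """1 - sum_{i != j} cos(i, j) / (|I| (|I| - 1)) over ordered pairs."""
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two pieces of information")
    # permutations(..., 2) yields every ordered pair (i, j) with i != j,
    # matching the |I|(|I| - 1) normalizer in the formula.
    total = sum(cosine(a, b) for a, b in permutations(embeddings, 2))
    return 1.0 - total / (n * (n - 1))


if __name__ == "__main__":
    # Hypothetical 2-D embeddings of three retrieved snippets.
    print(information_diversity([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Under this definition, identical embeddings give a diversity of 0 and mutually orthogonal embeddings give 1, so higher values indicate less redundant collected information.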
5.1 Evaluation Setup ment, and No Repetition for question-asking ut-

We use the WildSeek dataset where each data point terances (i.e., utterances with the intent O RIGINAL
Q UESTION or R EQUEST I NFORMATION). For question-
consists of an initial topic t and goal g. We sim-
ulate the user with an LM (gpt-4o-2024-05-13) answering utterances that provide information, we
prompted with t, g, the discourse history D, and assess Consistency and Engagement. We also re-
the instruction for question generation. To ensure port the number of unique cited URLs in these ut-
a fair comparison, we terminate the information- terances to indicate information diversity at the turn
seeking session once it reaches 30 search queries level. Both the rubrics for report evaluation and
for Co-STORM and both baselines. For all meth- utterance evaluation are included in Appendix D.
ods, the final report is generated using the two-
5.3 Automatic Evaluation Results
stage approach of outline generation followed by
section-by-section article generation, as proposed Table 3 presents the evaluation results for report
by STORM (Shao et al., 2024), based on the inter- quality and the quality of question-answering turns
action history. We evaluate the system quality by in the discourse. The question-answering turns
assessing the final report and the interaction history and the final report are the primary sources for hu-
(i.e., discourse) with the automatic metrics defined man learning when they interact with Co-STORM.
in §5.2. STORM + QA considers multiple perspectives in
researching the given topic, indeed leading to im-
5.2 Automatic Metrics proved performance across all four grading dimen-
Report Quality We evaluate the final report on sions of the report quality compared to the RAG
four aspects, Relevance, Broad Coverage (Breadth), Chatbot. However, Co-STORM outperforms it,
Depth, and Novelty, as indicators of the quality of particularly in the Depth and Novelty aspects, by
the whole information-seeking process.5 We em- simulating collaborative discourse with multiple
ploy Prometheus 2 (Kim et al., 2024), a 7B eval- agentic roles, akin to a thought-provoking round ta-
uator LM, to score the report based on a 5-point ble discussion. For discourse quality, the question-
rubric. To further quantify the diversity of the col- answering turns in Co-STORM significantly out-
lected information, we also report the Information perform both baselines in terms of Consistency
Diversity as the average pairwise dissimilarity of and Engagement. This improvement is attributed
I, to collaborative discourse setup, where the LM is
P
i,j∈I,i̸=j cos(i, j)
prompted to generate the answer only when the
1− , retrieved information matches the current question
|I|(|I| − 1)
according to the discourse history (see Listing 2).
where i, j are corresponding text embeddings ob- The utterance polishing step (see Figure 2) also
tained from OpenAI’s text-embedding-3-small. helps as it serves as a self-improving mechanism.
Discourse Quality Since the discourse itself is
valuable for human learning, we also evaluate the 5.4 Ablation Studies
discourse trace using a 5-point rubric to grade each
As discussed in §3, a major innovation of Co-
turn. This grading assesses Novelty, Intent Align-
STORM is the orchestration of two types of LM
5
The same four aspects are used in human evaluation. agents. To assess the benefit, we compare Co-
6
User Expert Moderator Strongly Disagree Disagree Neutral Agree Strongly Agree
Co-STORM 1 N 1 Co-STORM is better than Search Engine

w/o Multi-Expert 1 1 1 Disagree Agree
w/o Moderator 1 N 0 Less Effort 10% 80%
4.5 User Engagement 0% 100%
Addresses Echo Chamber 10% 60%
Overall Experience 10% 70%

4
Co-STORM is better than RAG Chatbot
Disagree Agree
3.5
Less Effort 22% 67%
User Engagement 22% 44%
3 Addresses Echo Chamber 0% 56%
Overall Experience 11% 78%
2.5
Figure 4: Survey results of the pairwise comparison
Novelty Intent Alignment No Repetition
(i.e., agreement on whether Co-STORM is better than
Figure 3: Rubric grading results for question-asking turn Search Engine/RAG Chatbot) in human evaluation.
quality in automatic evaluation with simulated users.
6.1 Evaluation Setup
STORM with two ablations: (1) without multi- We conduct an IRB-approved human evaluation
ple experts with different perspectives (“w/o Multi- to compare Co-STORM with RAG Chatbot and
Expert”), i.e., only a single expert and a moderator, Search Engine by recruiting 20 volunteers on the
and (2) multiple experts but no moderator steering Internet. Participants are randomly split into two
the discourse (“w/o Moderator”). As shown in Ta- groups: one compared Co-STORM with Google
ble 3, the ablated systems perform worse than the Search, while the other group compared it with the
full system across all metrics in both report and RAG Chatbot. Participants are asked to seek infor-
question-answering turn quality. Notably, remov- mation on two topics, one on each system, from
ing the moderator has a greater negative impact the same domain and sharing the same goal. Note
than reducing the number of experts. that we mitigate topic familiarity bias by using two
A key feature of Co-STORM is that LM agents different topics within the same domain. We have
can ask questions on the user’s behalf. As shown prepared five different domains, with each assigned
in Figure 3, the advantage of Co-STORM’s multi- to 2 users in each group. To counterbalance, one
agent design becomes clearer when inspecting the user starts with Co-STORM and switches to the
question-asking turns. Having just one expert and baseline for the other topic, and vice versa.
one moderator can already provide most of the After seeking information for each topic, partici-
benefits. Importantly, the moderator role in Co- pants are instructed to rate their experience based
STORM raises questions based on unused informa- on four grading aspects defined in §5.2 (Relevance,
tion about the topic—such a role represents some- Breadth, Depth, Novelty/Serendipity), using a 5-
body with a much larger known unknowns, effec- point Likert scale. After completing both tasks, par-
tively steering the discourse to help users discover ticipants are asked to provide pairwise preferences
more in the space of their unknown unknowns. regarding the required effort, user engagement, ad-
Another key innovation of Co-STORM is the dressing echo chamber issues, and overall experi-
dynamic mind map. We include controlled experience. We also collect open-ended feedback and
ments on mind map quality in Appendix B. allow participants to optionally leave comments on
6 Human Evaluation each discourse turn and the mind map snapshots
when interacting with Co-STORM. More details on
Human evaluation is essential for assessing systems the human evaluation are included in Appendix E.
designed for collaborative discourse, as it captures
the complexities of human interaction, reflects fa- 6.2 Human Evaluation Results
miliar real-world interactions, and provides critical Table 4 shows the human rating results and Fig-
insights into the system’s effectiveness. We com- ure 4 shows the pairwise comparison results. Co-
pare Co-STORM with two baselines: (1) RAG 6
One participant in the Co-STORM v.s. RAG Chatbot
Chatbot as detailed in section §5. (2) Traditional group submitted the rating but did not leave a usage record, so
Search Engine. we excluded this data point from the aggregated results.
7
Co-STORM v.s. Search Engine Co-STORM v.s. RAG Chatbot
Search Engine Co-STORM Win % (Lose %) p-value RAG Chatbot Co-STORM Win % (Lose %) p-value
Relevance 3.90 4.00 30% (30%) 0.758 3.89 4.22 33% (0%) 0.081
Breadth 3.60 4.10 50% (10%) 0.096 3.11 4.22 67% (0%) 0.013
Depth 3.10 4.00 60% (10%) 0.081 3.11 4.00 56% (33%) 0.069
Serendipity 2.70 3.90 70% (10%) 0.030 2.78 3.78 67% (0%) 0.009
Table 4: Human ratings on different aspects of the information-seeking experience with Co-STORM and Search
Engine (n=10) and with Co-STORM and RAG Chatbot (n=9)6 . The ratings are given on a scale from 1 to 5 with 3
as “Average”. We report the win rate of Co-STORM in pairwise comparison and the p-value in a paired t-test.
STORM helps users find broader and deeper information relevant to their goals. Participants found that Co-STORM uncovers information with greater breadth and depth compared to the search engine and the RAG Chatbot. Specifically, Co-STORM is rated strictly higher in Breadth by 50% of the participants and strictly higher in Depth by 60% of the participants than the search engine. Compared to the RAG Chatbot, Co-STORM receives strictly higher scores in Breadth from 67% of participants and in Depth from 56% of participants. This finding aligns with the automatic evaluation results shown in Table 3. While helping users discover more information, Co-STORM remains aligned with their goals, as participants also rated Co-STORM higher on average in Relevance than both baselines.

Co-STORM provides more serendipitous information with less mental effort required. Participants found that Co-STORM requires less effort, better mitigates the echo chamber issue, and provides a better overall experience. In a more fine-grained evaluation, participants evaluated 32% of Co-STORM’s total utterances, rating 89% of them as effectively “steering the discourse towards a new and interesting direction”. One participant noted, “Co-STORM allows for almost full automation and much better understanding as it brings up topics that the user may not even think of”. Moreover, participants found the mind map helpful. In total, they evaluated 80 snapshots of the dynamic mind map, finding it accurately tracked the discourse 71% of the time. One participant remarked, “Co-STORM is so much less mentally taxing for me to use”.

Co-STORM should support more customization. Among the 19 participants, 4 noted that the RAG Chatbot better follows instructions that have a clear target and mentioned they expect Co-STORM to generate more concise utterances and provide less information in such cases. We view dynamically adapting Co-STORM to users’ evolving mental states and personalizing it to their preferences as a meaningful direction for future work.

7 Related Works

Information-Seeking Support in NLP
NLP research supporting human information-seeking has mainly focused on building question-answering (QA) systems (Chen et al., 2017; Lee et al., 2019; Dasigi et al., 2021; Levy et al., 2021; Yuan et al., 2020). These works often assume that the answer can be found within a single document (Clark et al., 2020) or that users can formulate complex queries (Yang et al., 2018; Chen et al., 2021; Ahmadvand et al., 2023), assumptions that do not hold true in complex information seeking (Butler, 2000; Booth et al., 2009; Byström and Järvelin, 1995). Some more recent works have proposed long-form QA systems (Xu et al., 2023, 2024) and automatic expository writing systems (Balepur et al., 2023; Shen et al., 2023; Shao et al., 2024) to synthesize information from multiple sources. Some other studies have explored conversational search (Kumar and Callan, 2020; Nakamura et al., 2022). However, these works typically ignore human interaction or only passively answer user questions. We construct a multi-agent system with a human-in-the-loop protocol to support effective user interaction for complex and evolving information needs.

Multi-Agent Systems
As LMs advance, a growing body of research explores their use in multi-agent applications (Wu et al., 2023; Nakajima, 2023; Liu et al., 2023b; Wang et al., 2024). Several studies show that multi-agent debate enhances factuality and reasoning compared to using a single LM (Du et al., 2023; Liang et al., 2023), and cooperative role-playing frameworks improve performance on coding or mathematical benchmarks (Li et al., 2023; Hong et al., 2023). While these studies primarily focus on automating tasks, the potential applications extend further. For instance, Generative Agents (Park et al., 2023) instantiate an interactive environment with twenty-five LM agents to study emergent social behaviors, and Michael et al. (2023) show that multi-agent debates help humans supervise model outputs. Our work aligns
with these broader applications by constructing a multi-agent system to facilitate human learning.

Collaborative Discourse for Human Learning
Collaborative discourse has long been valued in classroom settings for its ability to deepen learners’ understanding of concepts, enhance peer learning, and increase engagement (Nussbaum, 2008; Osborne, 2010; Kolodner, 2007; Chinn et al., 2000). Specifically, Nussbaum (2008) argues that not all types of collaborative discourse are equally beneficial to students’ learning, emphasizing the importance of critical discussion where participants assume different points of view. Furthermore, the facilitator role is important in collaborative discourse, with asking questions and providing complementary information as popular strategies (Onrubia et al., 2022).

8 Conclusion

We propose Co-STORM, an information-seeking assistance system that emulates collaborative discourse among users and multiple LM agents. By creating an interactive environment where users can both observe and participate, Co-STORM enhances learning and the complex information-seeking process. To facilitate automatic evaluation, we construct the WildSeek dataset, which captures the information-seeking needs and intents of real Internet users. Experimental results, including extensive human assessments, show that Co-STORM outperforms traditional search engines and RAG chatbots in surfacing unknown unknowns for human learning and reducing users’ mental effort.

Limitations

We design Co-STORM to create an immersive human learning experience by enabling humans to participate in LM agent conversations. Despite the advantages demonstrated through both automatic and human evaluations, several limitations remain. First, the system could better tailor the collaborative discourse to the user’s prior knowledge, skipping basic facts for knowledgeable users and introducing concepts progressively for novices. Second, while Co-STORM employs an effective discourse management mechanism, users sometimes desire more control over the discourse, including managing expert perspectives and customizing the utterance length. Third, extending Co-STORM to support multiple languages would significantly enhance its usefulness and impact. Although current LMs often possess multilingual capabilities, implementing a multilingual Co-STORM requires integrating search engines or retrieval models capable of accessing diverse language sources. Furthermore, managing content across different languages demands robust content moderation and the ability to identify conflicting information to ensure a reliable human learning experience. Finally, compared to the RAG Chatbot, Co-STORM has higher latency due to the need to decide the utterance intent and update the mind map. Although the current latency is acceptable for real-time interaction, as demonstrated in human evaluations, further improving the efficiency of the LM system would provide a smoother user experience.

Acknowledgements

We thank Omar Khattab, Eric Zelikman, Rose E. Wang, Yen-Jen Wang, and Qingyang Tang for their helpful feedback, and the ACL ARR reviewers for their valuable comments. We also appreciate the 20 participants in our user study who made this work possible. We are grateful to You.com for providing a discounted search API that supported our experiments. This work is supported in part by the Verdant Foundation and Microsoft Azure AI credits. Yijia Shao is supported by a Stanford School of Engineering Fellowship.

Ethics Statement

We build and evaluate our work to strictly adhere to ethical standards. The construction of the WildSeek dataset involves collecting data with users’ explicit approval, and we carefully remove all personally identifiable information. In contrast to creative generation tasks, our tasks generate content that may impact how people perceive information and shape their opinions. We design our system to ground generated content on openly accessible external sources available on the general internet, with proper citations. Our experiments and evaluations ensure the accurate delivery of information and significantly reduce hallucinations. We avoid publishing or posting any generated content without careful examination of information accuracy. We believe there are no data privacy issues as we ground our generated content in information accessible to the general public.

The primary risk of our work lies in the common bias issues originating from biases present on the general internet. Following Shao et al. (2024), we mitigate this problem by applying rule-based fil-
tering according to the Wikipedia guideline (footnote 7) and incorporating multiple sources. Additionally, in our web application, we have implemented input topic moderation to reject topics that are sensitive, illegal, or potentially violate personal privacy. However, further information processing modules that serve as filters for internet sources and more robust modules to verify the accuracy of information can be implemented. Additionally, our current work only considers generating and retrieving information from English sources. Extending our system to be compatible with multilingual sources and generation will be beneficial.

7 https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources

References

Naresh Kumar Agarwal. 2015. Towards a definition of serendipity in information behaviour. Information research: an international electronic journal, 20(3):n3.
Ali Ahmadvand, Negar Arabzadeh, Julia Kiseleva, Patricio Figueroa Sanz, Xin Deng, Sujay Jauhar, Michael Gamon, Eugene Agichtein, Ned Friend, and Aniruddha. 2023. Making large language models interactive: A pioneer study on supporting complex information-seeking tasks with implicit constraints. Preprint, arXiv:2205.00584.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, and Jackson Kernion et al. 2022. Constitutional AI: Harmlessness from AI feedback. Preprint, arXiv:2212.08073.
Nishant Balepur, Jie Huang, and Kevin Chang. 2023. Expository text generation: Imitate, retrieve, paraphrase. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11896–11919, Singapore. Association for Computational Linguistics.
Marcia J. Bates. 1989. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–424.
Nicholas J Belkin, Robert N Oddy, and Helen M Brooks. 1982. ASK for information retrieval: Part I. Background and theory. Journal of documentation, 38(2):61–71.
Wayne C Booth, Gregory G Colomb, and Joseph M Williams. 2009. The craft of research. University of Chicago press.
Ruth Butler. 2000. What learners want to know: The role of achievement goals in shaping information seeking, learning, and interest. In Intrinsic and extrinsic motivation, pages 161–194. Elsevier.
Tony Buzan. 1974. Noting. In Using both sides of your brain, chapter 4, pages 71–116. E. P. Dutton, New York.
Katriina Byström and Kalervo Järvelin. 1995. Task complexity affects information seeking and use. Information processing & management, 31(2):191–213.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.
Wenhu Chen, Ming-Wei Chang, Eva Schlinger, William Yang Wang, and William W. Cohen. 2021. Open question answering over tables and text. In International Conference on Learning Representations.
Clark A Chinn, Angela M O’donnell, and Theresa S Jinks. 2000. The structure of discourse in collaborative learning. The Journal of Experimental Education, 69(1):77–97.
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, Online. Association for Computational Linguistics.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.
Allen Foster and Nigel Ford. 2003. Serendipity and information seeking: an empirical study. Journal of documentation, 59(3):321–340.
Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, Singapore. Association for Computational Linguistics.
Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu,
and Jürgen Schmidhuber. 2023. MetaGPT: Meta programming for a multi-agent collaborative framework. Preprint, arXiv:2308.00352.
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024. Prometheus 2: An open source language model specialized in evaluating other language models. Preprint, arXiv:2405.01535.
Israel M Kirzner. 1997. Entrepreneurial discovery and the competitive market process: An Austrian approach. Journal of economic Literature, 35(1):60–85.
Janet L Kolodner. 2007. The roles of scripts in promoting collaborative discourse in learning by design. In Scripting Computer-Supported Collaborative Learning: Cognitive, Computational and Educational Perspectives, pages 237–262. Springer.
Carol C Kuhlthau. 1991. Inside the search process: Information seeking from the user’s perspective. Journal of the American society for information science, 42(5):361–371.
Vaibhav Kumar and Jamie Callan. 2020. Making information seeking easier: An improved pipeline for conversational search. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3971–3980, Online. Association for Computational Linguistics.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.
Sharon Levy, Kevin Mo, Wenhan Xiong, and William Yang Wang. 2021. Open-Domain question-Answering for COVID-19 and other emergent domains. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 259–266, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-augmented generation for knowledge-intensive NLP tasks. Preprint, arXiv:2005.11401.
Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative agents for "mind" exploration of large language model society. In Advances in Neural Information Processing Systems, volume 36, pages 51991–52008. Curran Associates, Inc.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.
Nelson Liu, Tianyi Zhang, and Percy Liang. 2023a. Evaluating verifiability in generative search engines. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7001–7025, Singapore. Association for Computational Linguistics.
Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2023b. Dynamic LLM-agent network: An LLM-agent collaboration framework with agent team optimization. Preprint, arXiv:2310.02170.
Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, and Samuel R. Bowman. 2023. Debate helps supervise unreliable experts. Preprint, arXiv:2311.08702.
Yohei Nakajima. 2023. BabyAGI. https://github.com/yoheinakajima/babyagi.
Kai Nakamura, Sharon Levy, Yi-Lin Tuan, Wenhu Chen, and William Yang Wang. 2022. HybriDialogue: An information-seeking dialogue dataset grounded on tabular and textual data. In Findings of the Association for Computational Linguistics: ACL 2022, pages 481–492, Dublin, Ireland. Association for Computational Linguistics.
E Michael Nussbaum. 2008. Collaborative discourse, argumentation, and learning: Preface and literature review. Contemporary Educational Psychology, 33(3):345–359.
Javier Onrubia, Begona Roca, and Marta Minguela. 2022. Assisting teacher collaborative discourse in professional development: An analysis of a facilitator’s discourse strategies. Teaching and Teacher Education, 113:103667.
OpenAI. 2023. GPT-4 technical report.
Jonathan Osborne. 2010. Arguing to learn in science: The role of collaborative, critical discourse. Science, 328(5977):463–466.
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In The 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), New York, NY, USA. Association for Computing Machinery.
Peter Pirolli. 2009. Powers of 10: Modeling complex information-seeking systems at multiple scales. Computer, 42(3):33–40.
Chen Qu, Liu Yang, W. Bruce Croft, Yongfeng Zhang, Johanne R. Trippas, and Minghui Qiu. 2019. User intent prediction in information-seeking conversations. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, CHIIR ’19. ACM.
Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
Stephen E. Robertson. 1977. Theories and models in information retrieval. Journal of Documentation, 33:126–148.
Jeremy Roschelle and Stephanie D Teasley. 1995. The construction of shared knowledge in collaborative problem solving. In Computer supported collaborative learning, pages 69–97. Springer.
Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. 2024. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations (ICLR).
Yijia Shao, Yucheng Jiang, Theodore A. Kanell, Peter Xu, Omar Khattab, and Monica S. Lam. 2024. Assisting in writing Wikipedia-like articles from scratch with large language models. Preprint, arXiv:2402.14207.
Nikhil Sharma, Q Vera Liao, and Ziang Xiao. 2024. Generative echo chamber? Effect of LLM-powered search systems on diverse information seeking. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–17.
Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo, Jonathan Bragg, Jeff Hammerbacher, Doug Downey, Joseph Chee Chang, and David Sontag. 2023. Beyond summarization: Designing AI support for real-world expository writing tasks. Preprint, arXiv:2304.02623.
David Traum. 2003. Issues in multiparty dialogues. In Workshop on Agent Communication Languages, pages 201–211. Springer.
Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2024. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. Preprint, arXiv:2307.05300.
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework.
Fangyuan Xu, Kyle Lo, Luca Soldaini, Bailey Kuehl, Eunsol Choi, and David Wadden. 2024. KIWI: A dataset of knowledge-intensive writing instructions for answering research questions. Preprint, arXiv:2403.03866.
Fangyuan Xu, Yixiao Song, Mohit Iyyer, and Eunsol Choi. 2023. A critical evaluation of evaluations for long-form question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3225–3245, Toronto, Canada. Association for Computational Linguistics.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
Xingdi Yuan, Jie Fu, Marc-Alexandre Côté, Yi Tay, Chris Pal, and Adam Trischler. 2020. Interactive machine comprehension with information seeking agents. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2325–2338, Online. Association for Computational Linguistics.
Yu Zhang, Yunyi Zhang, Martin Michalski, Yucheng Jiang, Yu Meng, and Jiawei Han. 2023. Effective seed-guided topic discovery by integrating multiple types of contexts. In WSDM’23, pages 429–437.
Yunyi Zhang, Ruozhen Yang, Xueqiang Xu, Jinfeng Xiao, Jiaming Shen, and Jiawei Han. 2024. TELEClass: Taxonomy enrichment and LLM-enhanced hierarchical text classification with minimal supervision. Preprint, arXiv:2403.00165.
A Dataset Details

We constructed WildSeek using a web application (footnote 8) we built that hosts the open-sourced STORM project (Shao et al., 2024), as detailed in §2.2. User privacy was strictly maintained by explicitly obtaining consent each time users logged into our web application. No personally identifiable information was collected, and the entire dataset was manually reviewed to ensure compliance with this standard. We rejected topics that were illegal, harmful, violent, racist, sexual, non-English, based on personal experience, or contained personal information.

8 Our institution’s IRB approved the web application.

At the time of dataset construction, 8,777 users accessed our web application, resulting in the collection of 6,608 unique topic and purpose pairs. Participants were sourced from the general internet and all held valid Google accounts, in accordance with our IRB-approved policy. To ensure broad coverage, we conducted topic classification using gpt-4o-2024-05-13 and human inspection, and then downsampled the collected data to 100 cases, covering 24 fine-grained categories in 6 domains: Science, Health and Fitness, Culture and Society, Lifestyle and Leisure, Social Science and Humanities, and Others. We applied rule-based filtering to exclude non-informative or trivial information-seeking purposes, and the 100 selected topic-purpose pairs were manually labeled by the authors. Table 5 includes example data points from each domain and Figure 5 shows the full taxonomy of the WildSeek dataset.

B Mind Map Insert Operation

As revealed in human evaluation results (see §6), the mind map is crucial for helping users track the discourse and the collected information. Co-STORM dynamically updates the mind map through insert and reorganize operations. In this section, we conduct controlled experiments on different implementations of insert and verify the quality of the mind map updates.

Dynamically organizing collected information into a mind map is challenging. Unlike classic document classification tasks (Zhang et al., 2024, 2023) and recursive summarization tasks (Sarthi et al., 2024; Gao et al., 2023), where either the hierarchical organization or the information to be organized is fixed, mind map insertion involves an evolving hierarchical organization of concepts and an incremental set of information. We compare the Co-STORM insert (§3.2) with two alternative approaches: (1) The Embedding Only baseline selects the placement with the highest semantic similarity using embedding cosine similarity. (2) The Language Model Only baseline directly prompts an LM to choose the best placement within the given hierarchical organization.

We construct an evaluation dataset for the controlled experiments by leveraging the FreshWiki dataset (Shao et al., 2024), which is a collection of recent, high-quality Wikipedia articles. We use the Wikipedia article outline as the concept hierarchy and require each candidate method to find the best placement for a given citation used in the article. The original placement of the citation in the article is deemed as the ground truth. We apply rule-based filtering to retain articles with up to three levels of hierarchy and English citation sources only. Inserting one cited source back into the outline is considered as one task. After downsampling, we derive a dataset consisting of 111 tasks: 33 from first-level sections, 64 from second-level sections, and 14 tasks from third-level sections.

We report the insertion accuracy in Table 6. For tasks where the ground truth placement is in the second or third level, we also consider a placement partially correct if the information is inserted into one of the ancestors of the ground truth placement, and report the partial accuracy. The experimental results show that solely relying on the LM performs poorly, as the hierarchical organization can be wide and deep and the performance heavily depends on the quality of concept names. The Co-STORM insert operation matches or outperforms both baseline approaches in every setting.

C Full Prompts in Co-STORM

In §3.1, we introduce Co-STORM’s collaborative discourse protocol, which includes three key roles: the user, experts, and a moderator. We implement the perspective-guided expert and moderator pipeline using zero-shot prompting of gpt-4o-2024-05-13. Listing 2 and Listing 3 document the full prompts for simulating the expert and the moderator respectively. Co-STORM uses a hierarchical mind map to track the discourse (§3.2) and the mind map insert operation is detailed in Appendix B. Prompts used for the mind map operations can be found in Listing 1.
Root (100)
  Science (58)
    Applied science (51): Agriculture (1), Computer science (45), Environment science (5)
    Natural science (7): Biology (4), Physics (2), Geology (1)
  Health & Fitness (10): Fitness (1), Medicine & healthcare (9)
  Culture & Society (10): Law (1), News (2), Politics (1), Urban study (1), Cultural study (5)
  Lifestyle & Leisure (8): Design (1), Gaming (2), Music (3), Food (1), Travel (1)
  Social science & Humanities (8): Education (1), Management (1), Business and econ (6)
  Others (6): Animal (3), Transportation (2), Emergency management (1)
Figure 5: WildSeek taxonomy. The number in the parenthesis denotes the number of data points classified under the
corresponding category or its descendants.
Domain: Computer Science
  Topic: Blockchain anomaly detection using large models
  Intent: To evaluate the effectiveness of large models in detecting anomalies in blockchain systems compared to existing models.
Domain: Healthcare
  Topic: The effects of NMN supplements on human anti-aging
  Intent: To investigate the efficacy and mechanisms of NMN supplements in slowing down or reversing the aging process in humans.
Domain: Environmental Science
  Topic: Utilization of Weather Forecasting for Wind and Solar Energy Assessment
  Intent: To explore advanced methodologies in integrating weather forecast data for optimizing wind and solar energy evaluations.
Domain: Law
  Topic: Recent legal cases in the US involving hardware technology innovations
  Intent: To investigate the legal precedents and implications of hardware technology innovations in the US.
Domain: Economics
  Topic: Development of a Shared Trading Currency to Facilitate International Trade
  Intent: Investigate how a new shared currency could eliminate transaction costs and boost GDP among member countries.
Table 5: Examples of complex information seeking tasks from the WildSeek dataset.
D Automatic Evaluation Details

Following Shao et al. (2024), we use the Prometheus model (Kim et al., 2024), an open-source rubric grading model for evaluating long-form text based on user-defined criteria. For our experiments, we use prometheus-7b-v2.0 (footnote 9), the state-of-the-art version at the time of our experiments, with its default temperature of 1.0 and top_p of 0.9. As the model has a limited context window, for report evaluation, we omit references and trim the input text to under 2000 words to fit into the model’s context window, following the practice in Shao et al. (2024); for discourse quality evaluation, we reduce the discourse history length by taking the last 2000 words as context. The report quality evaluation and discourse quality evaluation rubrics can be found in Table 9, Table 10, and Table 11.

9 https://huggingface.co/prometheus-eval/prometheus-7b-v2.0

To assess the quality of the automatic evaluation results in §5.3, we randomly sampled 50 data points from the automatic evaluation of discourse quality, with 10 data points for each rubric item, i.e., Novelty, Intent Alignment, and No Repetition for question-asking utterances, and Consistency and Engagement for question-answering utterances, as defined in §5.2. Each data point represents the automatic grading of one utterance on one rubric item. Two independent evaluators provided human grading. We calculate the Pearson correlation between the automatic evaluation scores and the average human grading scores. Table 7 shows that the automatic rubric grading exhibits a positive correlation with human grading, with statistical significance observed for 4 out of the 5 rubric items. Additionally, the experimental results from the human evaluation with real users (Table 4 and Fig-
                      First-Level   Second-Level          Third-Level
Method                Acc.          Acc.    Partial Acc.  Acc.    Partial Acc.
Embedding only        24.24         35.94   65.62         35.71   57.14
Language Model only   3.03          7.81    62.50         7.14    71.43
Co-STORM insert       39.39         51.56   68.75         35.71   71.43
Table 6: Controlled experiment results of different mind map insertion methods (%). A placement is deemed as
partially correct if the information is inserted into one of the ancestors of the ground truth placement.
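The Embedding Only baseline and the exact and partial accuracy metrics in Table 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: outline nodes are represented as tuples of headings from the top level down, the toy vectors stand in for real embeddings, all function names are hypothetical, and exact matches are assumed to count toward partial accuracy as well (consistent with partial accuracy never falling below accuracy in Table 6).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_only_placement(info_vec, node_vecs):
    """Pick the outline node whose embedding is most similar to the information."""
    return max(node_vecs, key=lambda path: cosine(info_vec, node_vecs[path]))


def grade_placement(predicted, truth):
    """Return (exact, partial) correctness for one insertion task."""
    exact = predicted == truth
    # An ancestor is a strictly shorter, non-empty prefix of the ground-truth path.
    is_ancestor = 0 < len(predicted) < len(truth) and truth[:len(predicted)] == predicted
    return exact, exact or is_ancestor  # assumption: exact also counts as partial


def accuracy(results):
    """Exact and partial accuracy (%) over a list of (exact, partial) pairs."""
    n = len(results)
    return (100 * sum(e for e, _ in results) / n,
            100 * sum(p for _, p in results) / n)


# Toy outline with made-up 3-dimensional "embeddings" (illustration only).
node_vecs = {
    ("Background",): [1.0, 0.0, 0.0],
    ("Background", "Development"): [0.9, 0.1, 0.3],
    ("Applications",): [0.0, 1.0, 0.0],
}
info_vec = [0.8, 0.1, 0.5]
truth = ("Background", "Development")

predicted = embedding_only_placement(info_vec, node_vecs)
results = [
    grade_placement(predicted, truth),          # prediction from the baseline
    grade_placement(("Background",), truth),    # ancestor: partially correct
    grade_placement(("Applications",), truth),  # unrelated node: incorrect
]
acc, partial_acc = accuracy(results)
print(predicted, f"acc={acc:.2f}", f"partial={partial_acc:.2f}")
```

Representing each placement as a path makes the ancestor check a simple prefix comparison, which is why a first-level ground truth has no partial-accuracy column in Table 6.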
Rubric item        Pearson Correlation (p-value)
Novelty            0.32 (< 4e-1)
Intent Alignment   0.55 (< 2e-2)
No Repetition      0.50 (< 7e-3)
Consistency        0.50 (< 2e-3)
Engagement         0.34 (< 2e-2)
Table 7: Pearson correlation between average human rubric grading scores and automatic rubric grading
scores on discourse turn quality (n=50).
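To make the comparison behind Table 7 concrete, the following Python sketch averages the scores of two human evaluators per data point and computes the Pearson correlation with the automatic scores. The scores are invented for illustration and the `pearson` helper is a hypothetical name; the sketch does not compute the p-values reported in the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 rubric scores for five utterances (illustration only).
human_a = [4, 3, 5, 2, 4]    # first human evaluator
human_b = [5, 3, 4, 2, 3]    # second human evaluator
automatic = [4, 3, 5, 1, 4]  # automatic rubric grading

# Average the two human grades per data point before correlating.
human_avg = [(a + b) / 2 for a, b in zip(human_a, human_b)]
r = pearson(automatic, human_avg)
print(f"Pearson r = {r:.3f}")
```

Averaging the two evaluators first reduces the influence of individual grading noise on the correlation.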
ure 4) also reveal similar findings to the automatic evaluation results, verifying our automatic evaluation setup.

E Human Evaluation Details

Human evaluation participants voluntarily provided demographic data, including their ages and highest education levels. As shown in Figure 6 and Figure 7, our human evaluation covers a diverse demographic. All participants gave consent to feedback data collection and we ensured no personally identifiable information was stored (see Figure 11). Feedback was collected via an online questionnaire platform (footnote 10) and a web application we built.

Figure 6: Age distribution of participants in the human evaluation.

Figure 7: Education level distribution of participants in the human evaluation.

10 https://www.qualtrics.com

The web application provides participants an interface to perform real-time interaction with Co-STORM. The web application has two modes, Co-STORM mode and RAG chatbot mode. Figure 12 shows a screenshot of the web application in Co-STORM mode. The RAG chatbot mode is similar to the common chatbot interface.

As discussed in §6, we crafted five pairs of complex information-seeking tasks for human evaluation (see Table 8). After completing each task, participants were instructed to rate the information-seeking assistance system they used (i.e., Google Search, RAG Chatbot, or Co-STORM) on the four grading aspects defined in §5.2 using a 1-to-5 Likert scale (Likert question shown in Figure 8). After completing both tasks, participants were asked to provide a pairwise preference by comparing Co-STORM with either Google Search (see Figure 9) or the RAG chatbot (see Figure 10) with the Likert questions.

F Case Study

We present two examples from different topics where the moderator effectively steers the conversation toward engaging directions. Example 13 shows an example of discourse on the topic “The effects of NMN supplements on human anti-aging” where the moderator effectively steers the ongoing discourse to the anti-aging benefits of personalized
15
NMN and then further directs the discourse towards
genetic profiling for personalized NMN supple-
mentation plans. Example 14 highlights moderator
effectively raises a new concept and shifts the dis-
cussion to the topic “The Emergence of Artificial
Super Intelligence: Future Prospects and Impacts”.
The moderator steers the ongoing discourse from
technology hurdles, the role of computation power,
societal impact, risk, and mitigation toward a dis-
cussion on the quantum digital twin.
Additionally, we include a complete discourse
transcript (Appendix §G) and the associated report
(Appendix §H) on the topic of “AlphaFold 3,” as
referenced in Figure 2. In the discourse, the sys-
tem initiates the discussion with steering by the
moderator, focusing on the background and devel-
opment of AlphaFold 3, as well as the technical
advancements in biomolecular structure prediction,
protein-DNA interactions, and its impact on genetic
regulation. The user then directs the discourse to-
wards its applications. Several participants provide
insights into AlphaFold 3’s applications in drug dis-
covery, personalized medicine, and biotechnology.
This is followed by a discussion on self-driving
laboratories (SDLs), again steered by the modera-
tor. Finally, the user shifts the discussion towards
the economic impact and market implications of
AlphaFold 3.
Topic | Goal
GPT-4 Omni / AlphaFold 3 | To investigate the latest technology breakthrough and discover a unique angle to report on it, ensuring more people know about the technology.
Gaza war protests in US colleges / The conviction of Donald J. Trump in 2024 | To investigate the latest news and provide comprehensive coverage, ensuring people receive diverse perspectives on the events.
Privacy Norm with Digital Technologies / Copyright Issues with Language Models | To gain an in-depth understanding of the topic and prepare for a one-hour presentation in a college reading group.
Social Organism / Social Statics and Social Dynamics | To conduct a literature review on a given topic in preparation for a class discussion in a sociology course.
China’s dropping population in recent years / The Humanitarian crisis in Gaza in recent years | To investigate the latest news and find an engaging angle to report it, incorporating background stories and connections to related events to enhance its appeal.
Table 8: Information-seeking tasks used in human evaluation.
Figure 8: Human evaluation grading rubrics for each method (search engine, RAG Chatbot, and Co-STORM).
Evaluation results are shown in Table 4.
Figure 9: Likert question for comparing Co-STORM with Google Search in human evaluation. Evaluation results
are shown in Figure 4.
Figure 10: Likert question for comparing Co-STORM with RAG Chatbot in human evaluation. Evaluation results
are shown in Figure 4.
Figure 11: Screenshot to get consent from participants for gathering feedback data during human evaluations using
Qualtrics.
import dspy


class InsertInformation(dspy.Signature):
    """Your job is to insert the given information to the knowledge base. The knowledge base is a tree based data structure to
    organize the collection information. Each knowledge node contains information derived from thematically similar
    question or intent.
    To decide the best placement of the information, you will be navigated in this tree based data structure layer by layer.
    You will be presented with the question and query leads to this information, and tree structure.
    Output should strictly follow one of options presented below with no other information.
    - 'insert': to place the information under the current node.
    - 'step: [child node name]': to step into a specified child node.
    - 'create: [new child node name]': to create new child node and insert the info under it.
    Example outputs:
    - insert
    - step: node2
    - create: node3
    """

    # Inputs: the question/query that produced the information and the current view of the tree.
    intent = dspy.InputField(prefix="Question and query leads to this info: ", format=str)
    structure = dspy.InputField(prefix="Tree structure: \n", format=str)
    # Output: exactly one of 'insert', 'step: ...', or 'create: ...'.
    choice = dspy.OutputField(prefix="Choice:\n", format=str)


class InsertInformationCandidateChoice(dspy.Signature):
    """Your job is to insert the given information to the knowledge base. The knowledge base is a tree based data structure to
    organize the collection information. Each knowledge node contains information derived from thematically similar
    question or intent.
    You will be presented with the question and query leads to this information, and candidate choices of placement. In these
    choices, -> denotes parent-child relationship. Note that a reasonable choice may not be in these choices.
    If there exists reasonable choice, output "Best placement: [choice index]"; otherwise, output "No reasonable choice".
    """

    # Inputs: the question/query that produced the information and the candidate placements.
    intent = dspy.InputField(prefix="Question and query leads to this info: ", format=str)
    choices = dspy.InputField(prefix="Candidate placement:\n", format=str)
    # Output: "Best placement: [choice index]" or "No reasonable choice".
    decision = dspy.OutputField(prefix="Decision:\n", format=str)
Listing 1: Prompts used for dynamically updating the mind map in Co-STORM.
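The layer-by-layer navigation described in the `InsertInformation` prompt can be sketched without any LM call. In the minimal example below, `Node`, `parse_choice`, and `place` are hypothetical names, the `choose` callback stands in for the LM that returns a choice string, and the fallback behaviors (inserting at the current node when a `step` target does not exist or a depth limit is hit) are assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical mind-map node holding information snippets and child nodes."""
    name: str
    info: list = field(default_factory=list)
    children: dict = field(default_factory=dict)


def parse_choice(choice):
    """Parse 'insert', 'step: <name>' or 'create: <name>' into (action, name)."""
    # Tolerate a leading bullet dash, as in the prompt's example outputs.
    text = choice.strip().lstrip("-").strip()
    if text == "insert":
        return "insert", None
    for action in ("step", "create"):
        prefix = action + ":"
        if text.startswith(prefix):
            name = text[len(prefix):].strip()
            if name:
                return action, name
    raise ValueError(f"unrecognized choice: {choice!r}")


def place(root, snippet, choose, max_depth=10):
    """Walk down from root, asking `choose(node)` for a choice string at each layer."""
    node = root
    for _ in range(max_depth):
        action, name = parse_choice(choose(node))
        if action == "step" and name in node.children:
            node = node.children[name]
            continue
        if action == "create":
            node = node.children.setdefault(name, Node(name))
        # 'insert', 'create', or a 'step' to an unknown child (assumed fallback) end here.
        node.info.append(snippet)
        return node
    node.info.append(snippet)  # assumed fallback: depth limit reached
    return node


# Scripted choices stand in for LM outputs.
root = Node("root", children={"node2": Node("node2")})
scripted = iter(["step: node2", "create: node3"])
target = place(root, "some fact [1]", lambda node: next(scripted))
```

Here the snippet ends up under a newly created `node3` beneath `node2`, mirroring the example outputs listed in the prompt.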
import dspy


class QuestionToQuery(dspy.Signature):
    """You want to answer the question or support a claim using Google search.
    What do you type in the search box?
    The question is raised in a round table discussion on a topic. The question may or may not focus on the topic itself.
    Write the queries you will use in the following format:
    - query 1
    - query 2
    ...
    - query n"""

    topic = dspy.InputField(prefix='Topic context:', format=str)
    question = dspy.InputField(prefix='I want to collect information about: ', format=str)
    queries = dspy.OutputField(prefix="Queries: \n", format=str)


class AnswerQuestion(dspy.Signature):
    """You are an expert who can use information effectively. You have gathered the related information and will now use the
    information to form a response.
    Make your response as informative as possible and make sure every sentence is supported by the gathered information.
    If [Gathered information] is not directly related to the [Topic] and [Question], start your response with "Based on the
    available information, I cannot fully address the question." Then, provide the most relevant answer you can based on the
    available information, and explain any limitations or gaps.
    Use [1], [2], ..., [n] in line (for example, "The capital of the United States is Washington, D.C.[1][3].").
    You DO NOT need to include a References or Sources section to list the sources at the end. The style of writing should be
    formal.
    """

    topic = dspy.InputField(prefix='Topic you are discussing about:', format=str)
    question = dspy.InputField(prefix='You want to provide insight on: ', format=str)
    info = dspy.InputField(prefix='Gathered information:\n', format=str)
    style = dspy.InputField(prefix="Style of your response should be:", format=str)
    answer = dspy.OutputField(
        prefix="Now give your response. (Try to use as many different sources as possible and do not hallucinate.)",
        format=str,
    )


class ConvertUtteranceStyle(dspy.Signature):
    """
    You are an invited speaker in the round table conversation.
    Your task is to make the question or the response more conversational and engaging to facilitate the flow of conversation.
    Note that this is ongoing conversation so no need to have welcoming and concluding words. Previous speaker utterance is
    provided only for making the conversation more natural.
    Note that do not hallucinate and keep the citation index like [1] as it is. Also,
    """

    expert = dspy.InputField(prefix="You are invited as: ", format=str)
    action = dspy.InputField(prefix="You want to contribute to conversation by: ", format=str)
    prev = dspy.InputField(prefix="Previous speaker said: ", format=str)
    content = dspy.InputField(prefix="Question or response you want to say: ", format=str)
    # The prefix is split across two adjacent string literals so the statement stays valid Python.
    utterance = dspy.OutputField(
        prefix="Your utterance (keep the information as much as you can with citations, prefer "
        "shorter answers without loss of information): ",
        format=str,
    )
Listing 2: Prompts used for simulating perspective-guided experts in Co-STORM.
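The output formats requested by `QuestionToQuery` (a dash-prefixed list of queries) and `AnswerQuestion` (inline citations such as [1][3]) lend themselves to simple post-processing. The sketch below is one possible way to do it; the helper names `parse_queries`, `cited_indices`, and `unsupported_citations` are hypothetical, and the sample strings are illustrative rather than actual model outputs.

```python
import re


def parse_queries(raw):
    """Extract queries from lines of the form '- query text'."""
    queries = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("-"):
            query = line[1:].strip()
            if query:
                queries.append(query)
    return queries


def cited_indices(answer):
    """Return the sorted, de-duplicated citation indices such as [1] or [3]."""
    return sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})


def unsupported_citations(answer, num_sources):
    """List cited indices that fall outside the range 1..num_sources."""
    return [i for i in cited_indices(answer) if not 1 <= i <= num_sources]


# Illustrative strings, not real model outputs.
raw_output = "- AlphaFold 3 release date\n- AlphaFold 3 vs AlphaFold 2 accuracy\n"
answer = "The capital of the United States is Washington, D.C.[1][3]."
queries = parse_queries(raw_output)
```

A check like `unsupported_citations(answer, num_sources=2)` flags index 3 here, which is one way to catch citations that point to sources that were never gathered.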
import dspy


class KnowledgeBaseSummmary(dspy.Signature):
    """Your job is to give brief summary of what's been discussed in a roundtable conversation. Contents are thematically
    organized into hierarchical sections.
    You will be presented with these sections where "#" denotes level of section.
    """

    topic = dspy.InputField(prefix="topic: ", format=str)
    structure = dspy.InputField(prefix="Tree structure: \n", format=str)
    output = dspy.OutputField(prefix="Now give brief summary:\n", format=str)


class GroundedQuestionGeneration(dspy.Signature):
    """Your job is to find next discussion focus in a roundtable conversation. You will be given previous conversation summary
    and some information that might assist you discover new discussion focus.
    Note that the new discussion focus should bring new angle and perspective to the discussion and avoid repetition. The
    new discussion focus should be grounded on the available information and push the boundaries of the current
    discussion for broader exploration.
    The new discussion focus should have natural flow from last utterance in the conversation.
    Use [1][2] in line to ground your question.
    """

    topic = dspy.InputField(prefix="topic: ", format=str)
    summary = dspy.InputField(prefix="Discussion history: \n", format=str)
    information = dspy.InputField(prefix="Available information: \n", format=str)
    last_utterance = dspy.InputField(prefix="Last utterance in the conversation: \n", format=str)
    output = dspy.OutputField(prefix="Now give next discussion focus in the format of one sentence question:\n", format=str)


class GenerateExpertWithFocus(dspy.Signature):
    """
    You need to select a group of speakers who will be suitable to have roundtable discussion on the [topic] of specific [focus].
    You may consider inviting speakers having opposite stands on the topic; speakers representing different interest parties;
    Ensure that the selected speakers are directly connected to the specific context and scenario provided.
    For example, if the discussion focus is about a recent event at a specific university, consider inviting students, faculty
    members, journalists covering the event, university officials, and local community members.
    Use the background information provided about the topic for inspiration. For each speaker, add a description of their
    interests and what they will focus on during the discussion.
    No need to include speakers name in the output.
    Strictly follow format below:
    1. [speaker 1 role]: [speaker 1 short description]
    2. [speaker 2 role]: [speaker 2 short description]
    """

    topic = dspy.InputField(prefix='Topic of interest:', format=str)
    background_info = dspy.InputField(prefix='Background information:\n', format=str)
    focus = dspy.InputField(prefix="Discussion focus: ", format=str)
    topN = dspy.InputField(prefix="Number of speakers needed: ", format=str)
    experts = dspy.OutputField(format=str)
Listing 3: Prompts used for simulating the moderator in Co-STORM.
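The numbered "role: description" format that `GenerateExpertWithFocus` asks for can be parsed with a small regular expression. This is a minimal sketch under stated assumptions: `parse_experts` is a hypothetical helper, the roles are borrowed from the example discourse, and the descriptions are placeholders written for illustration.

```python
import re

# Matches lines like "1. Role: short description".
_EXPERT_LINE = re.compile(r"^\s*\d+\.\s*(?P<role>[^:]+):\s*(?P<description>.+?)\s*$")


def parse_experts(raw, top_n=None):
    """Return (role, description) pairs, skipping lines that do not follow the format."""
    experts = []
    for line in raw.splitlines():
        match = _EXPERT_LINE.match(line)
        if match:
            experts.append((match.group("role").strip(), match.group("description")))
    return experts if top_n is None else experts[:top_n]


# Placeholder output for illustration only.
raw_output = """1. Structural Biologist: Focuses on how accurately complexes are predicted.
2. Pharmaceutical Industry Expert: Focuses on implications for drug discovery.
Some stray line without the expected format"""
experts = parse_experts(raw_output, top_n=2)
```

Skipping malformed lines instead of raising is a design choice made here so that a single stray line does not discard an otherwise usable list of speakers.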
Criteria Description Broad Coverage: Does the article provide an in-depth exploration of the topic and have good coverage?
Score 1 Description Severely lacking; offers little to no coverage of the topic’s primary aspects, resulting in a very narrow perspective.
Score 2 Description Partial coverage; includes some of the topic’s main aspects but misses others, resulting in an incomplete portrayal.
Score 3 Description Acceptable breadth; covers most main aspects, though it may stray into minor unnecessary details or overlook some relevant points.
Score 4 Description Good coverage; achieves broad coverage of the topic, hitting on all major points with minimal extraneous information.
Score 5 Description Exemplary in breadth; delivers outstanding coverage, thoroughly detailing all crucial aspects of the topic without including irrelevant information.
Criteria Description Novelty: Does the report cover novel aspects that relate to the user’s initial intent but are not directly derived from it?
Score 1 Description Lacks novelty; the report strictly follows the user’s initial intent with no additional insights.
Score 2 Description Minimal novelty; includes few new aspects but they are not significantly related to the initial intent.
Score 3 Description Moderate novelty; introduces some new aspects that are somewhat related to the initial intent.
Score 4 Description Good novelty; covers several new aspects that enhance the understanding of the initial intent.
Score 5 Description Excellent novelty; introduces numerous new aspects that are highly relevant and significantly enrich the initial intent.
Criteria Description Relevance and Focus: How effectively does the report maintain relevance and focus, given the dynamic nature of the discourse?
Score 1 Description Very poor focus; discourse diverges significantly from the initial topic and intent with many irrelevant detours.
Score 2 Description Poor focus; some relevant information, but many sections diverge from the initial topic.
Score 3 Description Moderate focus; mostly stays on topic with occasional digressions that still provide useful information.
Score 4 Description Good focus; maintains relevance and focus throughout the discourse with minor divergences that add value.
Score 5 Description Excellent focus; consistently relevant and focused discourse, even when exploring divergent but highly pertinent aspects.
Criteria Description Depth of Exploration: How thoroughly does the report explore the initial topic and its related areas, reflecting the dynamic discourse?
Score 1 Description Very superficial; provides only a basic overview with significant gaps in exploration.
Score 2 Description Superficial; offers some detail but leaves many important aspects unexplored.
Score 3 Description Moderate depth; covers key aspects but may lack detailed exploration in some areas.
Score 4 Description Good depth; explores most aspects in detail with minor gaps.
Score 5 Description Excellent depth; thoroughly explores all relevant aspects with comprehensive detail, reflecting a deep and dynamic discourse.
Table 9: Report scoring rubrics on a 1-5 scale for the Prometheus model.
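One way to keep a rubric such as those in Table 9 machine-readable is to store each criterion as a small record and render it to text on demand. The sketch below uses the Novelty rows of Table 9 as data; the `Rubric` class and its `render` layout are illustrative assumptions and do not reproduce the prompt template that the Prometheus model expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical container for one 1-5 scoring criterion."""
    criterion: str
    question: str
    levels: dict  # maps each score 1..5 to its description

    def __post_init__(self):
        if sorted(self.levels) != [1, 2, 3, 4, 5]:
            raise ValueError("a rubric needs descriptions for scores 1 through 5")

    def render(self):
        """Render the rubric as plain text, one line per score level."""
        lines = [f"{self.criterion}: {self.question}"]
        lines += [f"Score {score}: {self.levels[score]}" for score in sorted(self.levels)]
        return "\n".join(lines)


# Wording copied from the Novelty rows of Table 9.
novelty = Rubric(
    criterion="Novelty",
    question="Does the report cover novel aspects that relate to the user's initial intent "
             "but are not directly derived from it?",
    levels={
        1: "Lacks novelty; the report strictly follows the user's initial intent with no additional insights.",
        2: "Minimal novelty; includes few new aspects but they are not significantly related to the initial intent.",
        3: "Moderate novelty; introduces some new aspects that are somewhat related to the initial intent.",
        4: "Good novelty; covers several new aspects that enhance the understanding of the initial intent.",
        5: "Excellent novelty; introduces numerous new aspects that are highly relevant and "
           "significantly enrich the initial intent.",
    },
)
rubric_text = novelty.render()
```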
Criteria Description Novelty: Evaluates the extent to which the conversation turn introduces new and unexpected information that is relevant to the topic at hand. High novelty indicates the conversation is providing fresh insights or perspectives that the user might not have considered, thereby enriching the dialogue and enhancing the user’s understanding of the subject.
Score 1 Description The turn fails to introduce any new or unexpected information, repeating known facts or irrelevant content.
Score 2 Description The turn introduces some new information, but it is mostly predictable or only slightly relevant.
Score 3 Description The turn provides moderately novel information that is relevant and somewhat unexpected.
Score 4 Description The turn introduces new and relevant information that is largely unexpected, sparking interest.
Score 5 Description The turn consistently introduces highly novel and relevant information that is completely unexpected, significantly enhancing the conversation.
Criteria Description Engaging: Measures how interesting and captivating the conversation turn is. An engaging turn holds the user’s attention and encourages them to continue interacting. It often includes elements that are thought-provoking, entertaining, or particularly relevant to the user’s interests.
Score 1 Description The turn is dull and uninteresting, likely causing the user to lose interest.
Score 2 Description The turn has limited engagement, with occasional interesting points but generally fails to captivate the user.
Score 3 Description The turn is moderately engaging, holding the user’s interest but lacking captivating elements.
Score 4 Description The turn is engaging and interesting, encouraging further interaction with minor lapses.
Score 5 Description The turn is highly engaging, consistently holding the user’s interest and encouraging further interaction.
Criteria Description Consistency: Assesses whether the conversation turn contradicts previous statements or established facts. Minimizing contradictions is essential for maintaining trust and coherence in the conversation. A high score indicates that the turn is free from inconsistencies and logically fits with the preceding dialogue.
Score 1 Description The turn frequently contradicts previous statements or established facts, causing confusion.
Score 2 Description The turn occasionally contradicts itself, with some inconsistencies present.
Score 3 Description The turn is mostly free of contradictions, with only minor inconsistencies that do not significantly impact coherence.
Score 4 Description The turn is nearly free of contradictions, with only very rare and minor inconsistencies.
Score 5 Description The turn is entirely free of contradictions, maintaining perfect coherence and logical consistency.
Table 10: Question-answering turn scoring rubrics on a 1-5 scale for the Prometheus model.
Criteria Description Intent Alignment: Assesses how well the conversation turn aligns with the user’s latent intent or goals. It measures the relevance and appropriateness of the response in contributing towards the user’s overall objectives. High intent alignment ensures that the conversation stays focused on the user’s needs and drives towards meaningful outcomes.
Score 1 Description The turn does not align with the user’s latent intent or goals, and may confuse the conversation’s purpose.
Score 2 Description The turn slightly aligns with the user’s latent intent, but does not significantly contribute to the overall goals.
Score 3 Description The turn moderately aligns with the user’s latent intent, contributing to the overall goals in a limited way.
Score 4 Description The turn aligns well with the user’s latent intent, contributing meaningfully to the overall goals.
Score 5 Description The turn perfectly aligns with the user’s latent intent, significantly driving the conversation towards the overall goals.
Criteria Description Repetition: Looks at the degree to which the conversation turn repeats information that has already been provided. Lower scores indicate higher repetition, which can detract from the value of the conversation by failing to introduce new content. Ideally, each turn should add new information or perspectives to the dialogue.
Score 1 Description The turn repeats information already provided without adding any new value.
Score 2 Description The turn has noticeable repetition, with limited new information added.
Score 3 Description The turn includes some repetition, but provides enough new information to be moderately valuable.
Score 4 Description The turn has minimal repetition, mostly introducing new and relevant information.
Score 5 Description The turn does not repeat any information, consistently providing new and valuable content.
Table 11: Question-asking turn scoring rubrics on a 1-5 scale for the Prometheus model.
[Figure 12 callouts: switch between discourse and final report; click to provide feedback for each turn; click a discourse turn to highlight the mind map; click to continue observing or actively take a turn.]
Figure 12: Screenshot of the interactive web application for human evaluation.
Topic: The effects of NMN supplements on human anti-aging

{Previous discourse focuses on the definition of NMN, Cellular Mechanisms of NMN, Side Effects and Genetic Factors, Genetic
Variations and NMN Effectiveness, Methodologies of Recent Clinical Trials on NMN, and Safety and Side Effects.}
Moderator:
That's a fascinating angle! Considering the potential influence of genetic variations, how do you think personalized NMN
supplementation, tailored to individual genetics, lifestyle, and health factors, could optimize its anti-aging benefits?
(inspired by [89][90])
{1 turn omitted}
Geneticist specializing in personalized medicine:

NMN supplementation by considering our unique genetic make-up and lifestyle. Since NMN boosts NAD+ levels, which drop with age
and are crucial for improving insulin sensitivity, mitochondrial function, and lifespan [107][110] {More content omitted}
{1 turn omitted}
Moderator:
How can we create a thorough genetic profiling method to tailor NMN supplementation plans for individuals, ensuring we
maximize those anti-aging benefits?
{rest of discourse omitted}
Figure 13: Example discourse highlight on the topic “The effects of NMN supplements on human anti-aging”.
Topic: The Emergence of Artificial Super Intelligence: Future Prospects and Impacts
{Previous discourse focuses on Technological Hurdles, Role of Computational Power, Societal Impact, Risks and Mitigation.}
Moderator:
Building on that, how do you think the combination of digital twins with AI and quantum computing could speed up the
development and practical use of artificial super intelligence (ASI) across different industries? (inspired by [77])
General Knowledge Provider:

Integrating digital twins with AI and quantum computing is a game-changer for various industries by offering incredibly
precise simulations and optimizations{More content omitted}
In healthcare, QDTs can revolutionize our understanding and delivery of health services through advanced simulations of
biological systems[81]{More content omitted}
QDTs allow for enhanced data management from IoT devices, leading to better asset performance and reduced failure risks
[82][85].
{rest of discourse omitted}
Figure 14: Example discourse highlight on the topic “The Emergence of Artificial Super Intelligence: Future
Prospects and Impacts”.
G Full Collaborative Discourse Example
Topic: AlphaFold 3
Default Background Researcher:
Based on the available information, I cannot fully address the question in a conversational style. However, I can provide a
detailed and accurate explanation in an informative manner.
AlphaFold 3 is the latest advancement in the field of molecular structure prediction, introduced in 2024 by Google DeepMind
and Isomorphic Labs, both subsidiaries of Alphabet[2]. This groundbreaking AI model represents a substantial leap beyond its
predecessor, AlphaFold 2, by expanding its capabilities to predict the structures and interactions of a broader spectrum of
biomolecules[1][2].
Unlike AlphaFold 2, which primarily focused on single-chain proteins, AlphaFold 3 is designed to predict the structures of
complex biomolecular systems. These systems include not only proteins but also DNA, RNA, post-translational modifications, and
small molecules such as drugs and ligands[2][6][9]. The model operates by allowing scientists to input descriptions of these
biomolecular complexes, which it then processes using a sophisticated neural network architecture. This architecture employs a
custom Transformer with triangular attention and a diffusion process to generate the 3D coordinates of every atom in the
specified system[6].
One of the most notable advancements in AlphaFold 3 is its potential application in drug design. By accurately predicting the
structure and interactions of protein-ligand complexes, the model can aid in identifying and designing new molecules that
could serve as effective drugs[3][4][10]. This has significant implications for the pharmaceutical industry, where Isomorphic
Labs is already collaborating with companies to tackle real-world drug design challenges[3][4].
Researchers and scientists can access the majority of AlphaFold 3's capabilities for free through the newly launched AlphaFold
Server, which is an easy-to-use research tool designed to make these advanced predictions accessible to a broader scientific
community[3][4][7]. The introduction of AlphaFold 3 is expected to dramatically accelerate biological research by providing
detailed atomic-level views of biomolecular interactions, thus enhancing our understanding of the biological world and
facilitating the development of new treatments[9][11].
Overall, AlphaFold 3's ability to predict the structures of a wide variety of biomolecules more broadly and accurately than
previous models makes it a revolutionary tool in the fields of structural biology and drug discovery[5][8].
Bioinformatics Specialist:
Based on the available information, I cannot fully address the question. However, I can provide some insights into how
AlphaFold 3 has improved accessibility and usability for researchers aiming to integrate its predictions into their biological
studies.
AlphaFold 3 has introduced several significant enhancements over its predecessors, specifically designed to improve both
accessibility and usability for researchers. One of the most notable advancements is the launch of the AlphaFold Server, which
is a free, user-friendly platform that allows biologists to generate molecular complexes with just a few clicks[13][14]. This
means that researchers, regardless of their computational resources or expertise in machine learning, can now easily access
high-accuracy protein structure predictions.
Additionally, AlphaFold 3 extends its predictive capabilities beyond just protein structures to include a variety of
biomolecular systems, such as nucleic acids, small molecules, ions, and modified residues[12][17]. This broader scope of
prediction is particularly beneficial for researchers involved in complex biological studies and drug discovery, as it allows
them to model interactions across different types of molecules more accurately and comprehensively.
The new diffusion-based architecture of AlphaFold 3 enhances its prediction accuracy and reliability[12][16]. For example, the
model shows a significant improvement in predicting protein-protein complexes and antibody-protein interfaces[18]. Such
improvements are crucial for researchers focusing on immunology and therapeutic antibody development.
Despite these advancements, it's important to note that the downloadable code of AlphaFold 3 has not been released by DeepMind,
unlike its previous models[14][15]. This may limit the flexibility for researchers who wish to modify or integrate the model
deeply into their own systems. Nevertheless, the open accessibility of the AlphaFold Server provides a substantial step
towards democratizing access to cutting-edge protein structure prediction technology.
In summary, AlphaFold 3 enhances accessibility through its user-friendly AlphaFold Server, allowing researchers to easily
generate and use high-accuracy predictions. Its improved predictive capabilities across a range of biomolecular systems
further support its integration into various biological studies. However, the absence of a downloadable code version may be a
limitation for some users.
Pharmaceutical Industry Expert:

Based on the available information, I cannot fully address the question. However, I can provide some relevant insights into
the impact of AlphaFold 3 on drug discovery projects.
AlphaFold 3, developed by Google DeepMind and Isomorphic Labs, represents a significant advancement in the field of protein
structure prediction and has transformative potential for drug discovery[21]. Compared to previous computational methods,
AlphaFold 3 has improved the efficiency and accuracy of these projects in several ways.
Firstly, AlphaFold 3 has enhanced the prediction accuracy for protein structures, including complex interactions like
antibody-antigen bindings, which are crucial for immunology and therapeutic antibody development[24]. This increased accuracy allows
researchers to gain a better understanding of molecular interactions, which is essential for designing effective drugs.
Moreover, AlphaFold 3's ability to predict structures for nearly all molecules in the Protein Data Bank significantly broadens
the scope of potential drug targets[20]. This capability is particularly valuable for tackling neglected diseases, which
often suffer from a lack of research funding[25].
A notable example of AlphaFold 3's impact can be seen in a project conducted by Insilico and the University of Toronto. They
applied AlphaFold to an end-to-end AI-powered drug discovery platform and successfully identified a new drug for a novel
target to treat hepatocellular carcinoma (HCC)[23]. This project demonstrates how AlphaFold 3 can streamline the drug
discovery process by integrating structure prediction with other AI models to identify effective therapeutic compounds
efficiently.
Another example involves the discovery of a potent hit molecule, ISM042-2-048, using AlphaFold in combination with other AI
tools. This molecule showed significant inhibitory activity and selective anti-proliferation effects, showcasing the practical
application of AlphaFold in the hit identification phase of drug discovery[22].
In summary, AlphaFold 3 has made significant strides in improving the efficiency and accuracy of drug discovery projects. By
providing highly accurate protein structure predictions and enabling the identification of new drug targets, it has the
potential to revolutionize the development of new therapies. However, while the implications are promising, the full
transformative impact on drug discovery may still be incremental at this stage[19].
Structural Biologist:
Based on the available information, I cannot fully address the question. However, I can provide the most relevant answer based
on the gathered information, and explain any limitations or gaps.
AlphaFold 3 introduces several specific advancements in predicting complex biomolecular structures, which significantly
enhance our understanding of interactions involving proteins, DNA, RNA, and ligands at the atomic level. One of the key
innovations in AlphaFold 3 is the updated diffusion-based architecture. This architecture enables joint structure prediction
for complexes that include not only proteins but also nucleic acids (DNA and RNA), small molecules, ions, and modified
residues[8]. This comprehensive capability allows for more accurate and detailed modeling of biomolecular interactions,
surpassing many previous specialized tools[8][16].
Furthermore, AlphaFold 3 achieves unprecedented accuracy in predicting protein-ligand interactions, which are crucial for drug
discovery. It is reported to be 50% more accurate than the best traditional methods on the PoseBusters benchmark, and it
does so without the need for any structural information input[26]. This makes AlphaFold 3 the first AI system to outperform
physics-based tools in this domain[26].
In addition to protein-ligand interactions, AlphaFold 3 is capable of modeling the interactions between proteins and nucleic
acids such as DNA and RNA[27]. This capability is particularly important for understanding the molecular mechanisms underlying
genetic regulation and expression, as well as the role of RNA in various cellular processes.
AlphaFold 3's advancements also extend to the modeling of protein interactions with antibodies, which is another critical area
for drug development[28]. The ability to predict these interactions with high accuracy can aid in the design of new
therapeutic molecules and improve the efficiency of drug discovery pipelines[28].
Overall, AlphaFold 3 represents a significant leap forward in biomolecular structure prediction. Its ability to model a wide
range of molecular interactions with high accuracy enhances our understanding of the atomic-level details of these
interactions, which is essential for both basic biological research and practical applications in drug discovery and
development[9][11].
While the gathered information highlights many advancements of AlphaFold 3, it does not provide exhaustive details on all
potential applications and specific improvements over its predecessors. Nonetheless, the advancements mentioned above
underline the transformative impact of AlphaFold 3 in the field of structural biology.
moderator:
That's fascinating! Speaking of advancements, how do you think AlphaFold 3's improved accuracy in predicting protein-DNA and
protein-RNA interactions might deepen our insights into genetic regulation and expression?

Absolutely, AlphaFold 3 (AF3) is a game-changer for understanding genetic regulation and expression. Its diffusion-based
architecture for predicting raw atom coordinates marks a significant leap, allowing it to model protein-DNA and protein-RNA
interactions with much higher accuracy[29][8]. Unlike its predecessor, AlphaFold 2, which focused on individual protein
structures, AF3 can now tackle these complex interactions, making it possible to gain deeper insights into how genetic
processes are regulated.
What's really impressive is the 50% improvement in prediction accuracy over existing methods, with some categories even
doubling in accuracy[30][33]. This means we can now more reliably model these interactions, shedding light on the mechanisms
of gene regulation and expression at a molecular level.
Additionally, AF3's ability to predict interactions across all of life's molecules, including nucleic acids, supports a wide
range of scientific applications, from drug design to genomics research[31][32]. For instance, understanding protein-RNA
interactions is crucial for RNA-based therapeutics, opening new avenues in medical research.
So, in a nutshell, AF3 is significantly enhancing our understanding of genetic regulation and expression by providing highly
accurate models of these essential interactions.
Biochemist/Structural Biologist:
Absolutely, the strides AlphaFold 3 (AF3) has made in predicting protein-DNA and protein-RNA interactions are nothing short of
revolutionary for genetic research and personalized medicine. With its diffusion-based architecture, AF3 goes beyond what
AlphaFold 2 could achieve by predicting raw atom coordinates, allowing it to model a wide range of biomolecular interactions
with high accuracy, including those involving nucleic acids, small molecules, ions, and modified residues [29][8].
One of the standout features of AF3 is its accuracy; it's shown to improve prediction accuracy by at least 50%, and in some
cases, even double it [30]. This is a game-changer for understanding gene regulation and expression, providing atomic-level
insights into how proteins interact with DNA and RNA [11][5]. Imagine being able to visualize these interactions in such
detail; this could lead to breakthroughs in developmental biology, oncology, and neurogenetics [31].
Moreover, AF3's capabilities have profound implications for personalized medicine. By accurately modeling protein-RNA
interactions, it helps us understand how genetic variations affect individual responses to treatments, paving the way for more
personalized and effective therapies [36][37]. It's not just a tool for basic research but a potential catalyst for
transforming medical practices and treatment strategies.
Geneticist:
Building on AlphaFold 3's (AF3) advancements, its predictive capabilities can really push the boundaries of gene regulation
research and genetic therapies. AF3 doesn't just improve on AlphaFold 2's accuracy; it broadens the scope to predict complexes
involving proteins, nucleic acids, small molecules, and ions[8][41]. This expanded understanding of protein interactions in
their full biological context, such as with DNA, RNA, and ligands, is crucial for effective gene regulation and therapy
[38][42].
AF3's enhanced ability to model protein-ligand interactions with high precision can significantly aid in designing targeted
genetic therapies and drugs[8]. For instance, its predecessor has already contributed to drug discovery, including cancer
immunotherapy and COVID-19 vaccine development[43][40]. With AF3, we can expect even more precise modeling of these
interactions, leading to the identification of novel therapeutic targets.
Moreover, DeepMind's commitment to open science, sharing over 200 million protein structures, accelerates collaborative
efforts in genetic therapies by providing valuable data to researchers globally[39]. This open-access resource is a goldmine
for identifying new protein targets and understanding their roles in genetic regulation, paving the way for breakthroughs in
gene therapy and personalized medicine.
In essence, AF3's advanced capabilities offer unprecedented opportunities in both scientific research and therapeutic
development, enhancing our ability to identify novel protein targets for genetic therapies[38][8][41][42].
User:
What are AlphaFold3 impacts and what are its applications?

Sure, let's dive into what makes AlphaFold 3 such a game-changer and where it's making waves.
AlphaFold 3, developed by DeepMind, has taken our ability to predict protein structures to a whole new level, hitting almost
99% accuracy for human proteins[44][45]. This isn't just a cool tech feat; it's a leap in how we understand biology, so much
so that some think it could snag a Nobel Prize down the line[45].
For researchers, AlphaFold 3 is like a supercharged toolkit. It complements traditional methods like X-ray crystallography,
cryo-EM, and NMR spectroscopy, speeding up the pace of discovery and offering new insights into the molecular basis of
diseases[49].
One of the biggest wins is in drug discovery. By predicting protein interactions with other molecules, AlphaFold 3 can help
identify new drug candidates much faster, tackling one of the biggest hurdles in the field[13][1].
Then there's the AlphaFold Database, now with over 200 million protein structure predictions. This open-access treasure trove
is leveling the playing field, allowing researchers everywhere to tap into top-notch structural data[13][48]. It's fueling
advancements in enzyme design, cancer research, and even vaccine development[46][47].
Beyond proteins, AlphaFold 3 is branching out to predict interactions involving DNA, RNA, and small molecules, opening doors
in genomics, biotechnology, and synthetic biology[8][1].
Of course, it's not a silver bullet. Turning these insights into real-world applications will need lots of teamwork between
scientists and industry experts[45].
All in all, AlphaFold 3 is revolutionizing structural biology and paving the way for breakthroughs in drug discovery,
biotechnology, and beyond. Its impact is vast, touching everything from basic research to practical medical and industrial
innovations.
Biochemist:
Absolutely, AlphaFold 3 (AF3) represents a huge leap in protein structure prediction, especially with its diffusion-based
architecture. Unlike AlphaFold 2's Evoformer model, AF3 can accurately model interactions between proteins and various
biomolecules like DNA, RNA, and small molecules with remarkable precision[52][29]. This is groundbreaking for genetic
regulation research, as it allows us to predict how proteins interact with DNA and uncover the mechanisms behind gene
activation and repression[50][8][53].
This enhanced capability is crucial for developing targeted gene therapies. With AF3, we can pinpoint key regulatory elements
and design interventions to modify gene expression precisely. This has the potential to lead to innovative treatments for
genetic disorders, where adjusting gene function is essential[9][31][34].
However, it's worth mentioning that access to AlphaFold 3 is currently limited by DeepMind, which could slow down its
widespread application[51]. Despite this hurdle, the advancements made by AF3 offer a bright future for genetic research and
therapeutic development.
In short, AlphaFold 3's ability to predict protein-DNA interactions with high accuracy could revolutionize our understanding
of genetic regulation and pave the way for targeted gene therapies, even if access constraints may delay its broader use.
Pharmaceutical Industry Representative:
Absolutely, the potential of AlphaFold 3 in drug discovery is immense. Given its advanced ability to predict 3D structures of
complex biomolecular interactions (including proteins, DNA, RNA, and small molecules), it offers a significant advantage in
understanding how drugs bind to their targets [55][7].
One of the standout features of AlphaFold 3 is its capability to predict the effects of post-translational modifications and
ions on molecular systems [55]. This can be a game-changer for identifying new drug targets that were previously considered
too challenging to tackle [55][56].
By improving the accuracy of drug metabolism and toxicity predictions, AlphaFold 3 can substantially cut down the number of
necessary wet lab experiments, saving both time and money [58]. Moreover, its near-complete prediction of human protein
structures opens up unprecedented opportunities for fundamental research and therapeutic development [44].
While these advancements are promising, some experts still see the current impact on drug discovery as incremental, requiring
further validation [54]. Nonetheless, the potential for AlphaFold 3 to streamline drug discovery, especially for complex
diseases, is incredibly exciting [21][20].
moderator:
That's a great point. How do you think integrating AlphaFold 3 with emerging technologies like self-driving laboratories could
further accelerate and innovate the drug discovery process? (inspired by [60][61])

Absolutely, integrating AlphaFold 3 with self-driving laboratories (SDLs) can truly revolutionize drug discovery. AlphaFold 3's
prowess in predicting 3D structures of complex biomolecular systems offers deep insights into molecular interactions crucial
for drug design[55][9]. This means we can now target proteins that were previously considered untouchable with remarkable
precision[55][44].
On the other hand, SDLs bring in the power of machine learning, lab automation, and robotics to conduct high-throughput
experiments autonomously[64][65]. By merging AlphaFold 3's predictive capabilities with SDLs, we can streamline the process of
identifying and optimizing drug candidates. SDLs can swiftly test numerous hypotheses and chemical compounds, guided by the
insights from AlphaFold 3, making the identification of promising therapeutic candidates more efficient than ever[64].
Together, AlphaFold 3 and SDLs not only speed up the discovery process but also translate fundamental scientific insights into
practical drug solutions. While neither is a cure-all alone, their integration creates a powerful synergy, moving us faster
from understanding molecular mechanisms to developing effective drugs[63][11].
In essence, this combo promises to enhance both the efficiency and effectiveness of new drug development, potentially
transforming global healthcare outcomes[62][64].
Computational Biologist:
That sounds incredibly promising! But I'm curious, what are some of the specific challenges and limitations we might encounter
when trying to integrate AlphaFold 3 with self-driving laboratories for high-throughput drug discovery?
User:
Any impact on the stock market?

It's hard to pinpoint AlphaFold 3's direct impact on the stock market, but its revolutionary advances in predicting molecular
structures could shake things up. By speeding up and cutting costs in protein structure prediction, it stands to greatly
benefit pharmaceutical giants like Pfizer, Merck, and Novartis, as well as smaller biotech firms like Schrödinger and
Exscientia[66][67][68][32][69][21][70][71][72]. These advancements might not move the market immediately, but they could
significantly boost the stock performance of companies that leverage this tech. So, keeping an eye on how these firms
integrate AlphaFold 3 could be key to understanding its broader market implications.
Financial Analyst:
Absolutely, let's dive into how AlphaFold 3 is being integrated into the workflows of both pharmaceutical giants and smaller
biotech firms, and the financial implications of this integration.
AlphaFold 3, developed by Google DeepMind, goes beyond predicting individual protein structures to modeling interactions
between multiple proteins, DNA, RNA, and small molecules[73][74]. This makes it a game-changer in drug discovery and
development.
Big pharma companies, especially those partnering with Isomorphic Labs, are using AlphaFold 3 to tackle complex drug design
challenges[80][81]. By integrating this tech, they can speed up the drug discovery process, cut costs, and improve prediction
accuracy, which enhances their development pipelines[78][79].
Smaller biotech firms are also leveraging AlphaFold 3 through platforms like Benchling, which incorporates AlphaFold beta for
predicting 3D structures of novel proteins[75][77]. This helps them innovate more efficiently without needing extensive
computational resources or deep machine learning expertise[78].
However, scaling AlphaFold 3 poses challenges like building the necessary IT infrastructure and ensuring data integrity and
interoperability[76].
In essence, both large and small firms are using AlphaFold 3 to boost efficiency and innovation in drug discovery, with
potentially significant long-term financial benefits.
H Final Report Example
AlphaFold 3
# Background and Development
## Overview of AlphaFold 3
AlphaFold 3, co-developed by Google DeepMind and Isomorphic Labs, was announced on May 8, 2024, and
represents a significant advancement in the prediction of biomolecular structures and interactions[2][7].
Unlike its predecessor, AlphaFold 2, AlphaFold 3 extends its capabilities beyond single-chain proteins
to a wide array of biomolecules including DNA, RNA, post-translational modifications, and small molecules
such as ligands, which are critical in drug discovery[1][2].
The AI model employs a neural network architecture that utilizes a custom Transformer with triangular
attention and a diffusion process to generate 3D coordinates of every atom within the specified
biomolecular system[6]. This allows researchers to input descriptions of complex biomolecular systems
and receive highly accurate predictions of their three-dimensional structures[6].
One of the standout features of AlphaFold 3 is its ability to predict the structures of protein
complexes with enhanced accuracy, particularly in protein-DNA and protein-RNA interactions. This
improvement is expected to facilitate groundbreaking discoveries in gene regulation and expression,
potentially revolutionizing genetic research and personalized medicine[5].
To facilitate broader scientific research, the capabilities of AlphaFold 3 are accessible for free
through the AlphaFold Server, an easy-to-use research tool[3][4]. This accessibility is aimed at
accelerating biological research and drug discovery processes by enabling more precise identification
of drug targets and reducing the time and costs associated with developing new medications, especially
for complex diseases[7][10]. Isomorphic Labs is also collaborating with pharmaceutical companies to
leverage AlphaFold 3’s potential in real-world drug design challenges, with the ultimate goal of
developing new life-changing treatments for patients[3][4].
## Advancements in Biomolecular Structure Prediction
AlphaFold 3 represents a significant leap in the field of biomolecular structure prediction, building
upon the successes of its predecessor, AlphaFold 2. The new model boasts a substantially updated
diffusion-based architecture, which enables joint structure prediction of not only proteins but also
nucleic acids, small molecules, ions, and modified residues[8][38]. This expansion beyond proteins
marks a pivotal advancement, allowing for a more comprehensive understanding of complex biomolecular
interactions and enhancing the accuracy of these predictions.
One of the most notable achievements of AlphaFold 3 is its improved accuracy, which far surpasses many
specialized tools previously used for protein-ligand interactions[8]. This heightened precision has
broad implications, particularly in identifying novel protein targets for genetic therapies, thereby
offering unprecedented opportunities in gene regulation research[38][39]. Furthermore, the model’s
capabilities extend to predicting the structures of complexes involving DNA and RNA, which could
significantly advance our understanding of genetic regulation and aid in the development of targeted
gene therapies[8][42].
DeepMind’s commitment to open science has also played a crucial role in the impact of AlphaFold 3. By
freely publishing the predicted structures of over 200 million proteins, DeepMind fosters collaboration
and knowledge sharing, accelerating scientific discoveries and pharmaceutical development worldwide[39].
This initiative ensures that the benefits of AlphaFold 3 are widely accessible, contributing to
collective scientific progress.
The applications of AlphaFold 3 extend beyond academic research, impacting practical fields such as
vaccine development. For instance, the structure-guided design of COVID-19 vaccines by companies like
Pfizer, Moderna, and Johnson & Johnson benefited from advancements in protein structure prediction,
highlighting the model’s potential in addressing global health challenges[40].
Despite its significant advancements, some researchers have noted limitations in AlphaFold 3’s accuracy
for a subset of its predictions, and the model does not fully reveal the underlying mechanisms of protein
folding[42]. Nevertheless, the broader understanding of biomolecular contexts provided by AlphaFold
3, including the interactions of drug targets with protein binding partners, DNA, RNA, and ligand
cofactors, is expected to lead to more effective therapeutic interventions[43]. This richer contextual
insight underscores the potential for rational, structure-based drug design, as demonstrated in the
examination of TIM-3, a potential target for cancer immunotherapy[43].
## Protein-DNA Interactions
AlphaFold 3 represents a significant advancement in the field of computational biology, specifically
in the prediction of the structure of biomolecular systems. Building upon the foundational work of
AlphaFold 2, which accurately predicted the structure of individual proteins, AlphaFold 3 extends these
capabilities to include complex interactions involving multiple proteins, DNA, RNA, and small molecule
ligands[9][11]. This includes an accurate atomic-level view of how these biomolecules come together and
interact, providing critical insights into the structural impact of post-translational modifications
and ions[11].
By providing detailed predictions of protein-DNA interactions, AlphaFold 3 enhances our understanding
of gene regulation and the molecular basis of various diseases. This capability is crucial for
advancing drug discovery, as it allows researchers to identify potential therapeutic targets more
accurately and to understand how drugs can modify these interactions to produce desired effects[9][11].
The integration of AlphaFold 3 with emerging technologies, such as self-driving laboratories, promises
to further accelerate and innovate the drug discovery process by automating the synthesis, testing, and
optimization of new drug candidates based on precise structural data[9].
# Technical Aspects
## Accessibility Enhancements for Researchers
AlphaFold 3 significantly enhances accessibility and usability for researchers looking to integrate
its predictions into their biological studies. One of the major strides in accessibility is the launch
of the AlphaFold Server, a free and user-friendly research tool powered by AlphaFold 3. This server
is touted as the most accurate tool globally for predicting protein interactions with other molecules
within the cell. Researchers, irrespective of their computational resource availability or machine
learning expertise, can generate molecular complexes with just a few clicks on a single platform[13].
Google DeepMind has made AlphaFold Server openly accessible to facilitate the global research
community’s use of AlphaFold 3, driving advancements in fields such as drug discovery, biotechnology,
genomics, and our foundational understanding of biological systems. However, it is noteworthy that,
unlike its predecessors, DeepMind has not released the downloadable code for AlphaFold 3[14]. Despite
this, the open access to the AlphaFold Database, which houses over 200 million protein structure
predictions, continues to accelerate scientific research[13].
Additionally, the structural predictions provided by AlphaFold 3 extend beyond proteins to include
nucleic acids, small molecules, ions, and modified residues. This comprehensive predictive ability
marks a substantial improvement over the specialized models of AlphaFold 2, which were more limited in
scope[12][16][17]. The enhancements in prediction accuracy, particularly for protein-protein complexes
and antibody-protein interfaces, offer researchers more reliable data to advance their studies[18].
These advancements collectively contribute to a more accessible and powerful tool for the scientific
community.
## Advancements in Biomolecular Structure Prediction
### AlphaFold 3 and Protein-DNA Interactions
AlphaFold 3 has revolutionized the modeling of protein-DNA interactions, an essential component in
understanding genetic regulation. The updated diffusion-based architecture of AlphaFold 3 enables
the joint structure prediction of complexes, including not just proteins, but also nucleic acids
such as DNA, small molecules, ions, and modified residues[8]. This comprehensive approach allows
for significantly improved accuracy over many previous specialized tools, especially in predicting
protein-ligand interactions[8]. By accurately modeling these interactions, AlphaFold 3 provides deeper
insights into the mechanisms of genetic regulation and opens new avenues for developing targeted gene
therapies[8].
### Impact of AlphaFold 3 on Genetic Regulation
#### Protein-DNA Interaction Prediction
AlphaFold 3 marks a significant advancement in the prediction of protein-DNA interactions, offering
enhanced capabilities compared to its predecessor. Unlike AlphaFold 2, which was optimized for predicting
the structure of individual proteins, AlphaFold 3 employs a diffusion-based model that predicts raw
atom coordinates, allowing it to accurately model an array of biomolecular interactions including those
between proteins and nucleic acids like DNA and RNA[29].
The shift to a diffusion-based architecture enables AlphaFold 3 to achieve a remarkable improvement
in prediction accuracy. Specifically, the model shows at least a 50% improvement in predicting the
interactions of proteins with other molecule types, and in certain crucial categories, the accuracy has
doubled compared to existing methods[30]. This enhanced prediction capability can lead to groundbreaking
discoveries in gene regulation mechanisms and revolutionize our approach to genetic research and
personalized medicine[30].
Introduced in collaboration with Isomorphic Labs, AlphaFold 3 goes beyond proteins to encompass a broad
spectrum of biomolecules, including DNA, RNA, and small molecules known as ligands. This comprehensive
approach opens new avenues for transformative science, from developing biorenewable materials and
more resilient crops to accelerating drug design and genomics research[31]. By accurately predicting
the interactions of proteins with DNA, AlphaFold 3 holds the potential to significantly advance our
understanding of genetic regulation and assist in the development of targeted gene therapies[29][31].
#### Protein-RNA Interaction Prediction
AlphaFold 3 has marked a significant leap forward in the field of structural biology by enhancing
its prediction accuracy for protein-DNA and protein-RNA interactions. Building upon the foundational
work of AlphaFold 2, the latest iteration of AlphaFold developed by Google’s DeepMind and Isomorphic
Labs in London can now predict the structure and interactions of a wide array of biomolecular systems
with unprecedented precision[5][11][37]. This includes a dramatic improvement, with at least a 50%
enhancement in accuracy for interactions between proteins and other molecule types compared to existing
methods, and in certain crucial categories, the prediction accuracy has doubled[33].
These advancements hold transformative potential for understanding genetic regulation and expression, as
the more accurate predictions can provide deeper insights into the mechanisms of gene regulation[36][37].
Such detailed atomic-level views of molecular interactions are expected to revolutionize approaches
in genetic research and personalized medicine, paving the way for groundbreaking discoveries in how
genes are regulated and expressed within biological systems[5][11]. This progress also means that the
model is not limited to proteins but extends to DNA, RNA, and other small molecules, enabling a more
comprehensive understanding of biomolecular dynamics[11].
### New Features and Advancements in AlphaFold 3
AlphaFold 3 introduces several groundbreaking features and advancements in the field of biomolecular
structure prediction. One of the most significant improvements is the ability to predict the structure
of a wide variety of biomolecular systems more broadly and accurately than its predecessor, AlphaFold 2.
This has been achieved through the use of diffusion techniques to enhance the underlying architectural
model, allowing for more general predictions[16].
Notably, AlphaFold 3 has set a new benchmark in accuracy for predicting drug-like interactions,
including the binding of proteins with ligands and antibodies with their target proteins. It is 50%
more accurate than the best traditional methods on the PoseBusters benchmark, and it achieves this
without requiring any input of structural information. This makes AlphaFold 3 the first AI system to
outperform physics-based tools for biomolecular structure prediction[26].
Another significant advancement is AlphaFold 3’s ability to model proteins interacting not only with
other proteins but also with other biomolecules, such as DNA and RNA strands[27]. This capability is
particularly valuable for understanding complex biological processes and interactions at the atomic
level. Additionally, AlphaFold 3 excels in modeling protein-ligand interactions, a feature crucial
for drug discovery efforts[27][28]. Accurate predictions of protein-ligand structures facilitate the
identification and design of new molecules, which could potentially be developed into therapeutic
drugs[28].
Early analyses have shown that AlphaFold 3 greatly outperforms AlphaFold 2.3 in certain protein structure
prediction problems relevant to drug discovery, such as antibody binding[28]. This underscores the
system’s potential to significantly impact the pharmaceutical industry by improving the efficiency and
accuracy of drug discovery processes[28].
# Impact and Applications
AlphaFold 3, developed by Google DeepMind in collaboration with Isomorphic Labs, has made significant
strides in biotechnology by accurately predicting the structure and interactions of a wide range of
biological molecules, including proteins, DNA, RNA, and small molecules such as drugs[1][7][9]. This
advancement has substantial implications for several fields, most notably drug discovery and genetic
research.
One of the key impacts of AlphaFold 3 is its potential to dramatically accelerate the drug discovery
process. By enabling precise identification of drug targets, it reduces both the time and costs
associated with developing new medications, particularly for complex diseases[7][19][20][21]. The
model’s ability to predict how proteins interact with other molecules offers invaluable insights
into the mechanisms of diseases and the development of targeted therapies[7][11]. Additionally, the
integration of AlphaFold 3 with emerging technologies like self-driving laboratories could further
innovate the drug discovery process, enhancing efficiency and accuracy[9][11].
In genetic research, AlphaFold 3’s capability to predict protein-DNA interactions could significantly
advance our understanding of genetic regulation, thereby aiding in the development of targeted gene
therapies[8]. By providing an atomic-level view of biomolecular systems, including the structural
impact of post-translational modifications and ions, AlphaFold 3 deepens our understanding of the
biological world[11].
The introduction of AlphaFold Server, a free and accessible research tool powered by AlphaFold
3, has further democratized access to this groundbreaking technology. Researchers can now generate
molecular complexes with minimal computational resources or expertise in machine learning, accelerating
scientific research across the globe[13]. The server and the AlphaFold database provide open access to
over 200 million protein structure predictions, fostering an environment of collaborative scientific
discovery[13][20].
## Drug Discovery Acceleration
AlphaFold 3 represents a significant advancement in drug discovery, offering the potential to
revolutionize the field by enabling more precise identification of drug targets and reducing the
time and costs associated with developing new medications, particularly for complex diseases[56][57].
Developed by Google DeepMind and Isomorphic Labs, AlphaFold 3 builds upon the success of its predecessor,
AlphaFold 2, by providing accurate atomic-level views of the structure of biomolecular systems. This
includes not only proteins but also DNA, RNA, and small molecule ligands, along with their interactions
and structural impacts due to post-translational modifications and ions[11][55].
The AI model’s ability to predict complex protein interactions and structures with high accuracy offers
a new set of drug target candidates to explore, potentially leading to groundbreaking therapeutic
developments[56][57]. Furthermore, the application of AlphaFold 3 in predicting the structural impact
of various molecular systems opens up exciting possibilities for rational drug development against
targets that were previously difficult to modulate[55].
Although the initial impact of AlphaFold and similar models like RoseTTAFold on drug discovery
has been incremental, the potential commercial and scientific value of AlphaFold 3 is vast, with
its transformative potential already being acknowledged as "Nobel Prize-worthy"[19][21]. By accurately
predicting the three-dimensional shapes of proteins and other biomolecules, AlphaFold 3 helps streamline
the process of identifying compounds that will successfully bind to these targets, producing beneficial
health outcomes[57].
Moreover, integrating AlphaFold 3 with emerging technologies such as self-driving laboratories
could further accelerate and innovate the drug discovery process. The combination of AlphaFold 3’s
structural predictions with automated, high-throughput experimentation could dramatically speed up the
validation and optimization of new drug candidates, transforming our understanding and approach to drug
R&D[9][44][55].
## Protein-DNA and Protein-RNA Interaction Predictions
AlphaFold 3 represents a significant advancement in the prediction of biomolecular interactions,
specifically those involving proteins, DNA, and RNA. Unlike its predecessor, AlphaFold 2, which primarily
focused on predicting the structure of individual proteins, AlphaFold 3 employs a diffusion-based
architecture to predict raw atom coordinates. This allows it to model a variety of biomolecular
interactions with high accuracy, including those between proteins and nucleic acids such as DNA and
RNA[29].
Introduced in 2024 by Google DeepMind and Isomorphic Labs, AlphaFold 3 expands its predictive
capabilities beyond proteins to encompass all of life’s molecules. This includes small molecules known as
ligands, which are significant in the context of drug discovery[31]. The ability to predict interactions
between proteins and DNA holds particular promise for advancing our understanding of genetic regulation and
developing targeted gene therapies[29][31].
The predictive power of AlphaFold 3 extends to complex biomolecular interactions, including those
involving protein complexes with DNA, RNA, and various ligands and ions. This enhanced capability
allows for a more comprehensive understanding of biological processes and has the potential to
unlock transformative scientific developments, from biorenewable materials to more resilient crops
and accelerated genomics research[34]. Additionally, AlphaFold 3’s success rate of approximately 70%
in accurately predicting protein-protein interactions underscores its effectiveness[34].
Perhaps one of the most exciting aspects of AlphaFold 3 is its ability to model interactions between
proteins and a wide range of biological molecules, including DNA and RNA. This advancement is critical
for understanding the fundamental mechanisms of life and for identifying potential drug candidates,
reflecting the extensive training set that includes a broad spectrum of molecules[53]. By accurately
predicting these complex interactions, AlphaFold 3 has the potential to revolutionize various fields
within biological research and biotechnology.
## Efficiency and Accuracy Improvements in Drug Discovery
AlphaFold 3 has significantly improved the efficiency and accuracy of drug discovery processes, enabling
more precise identification of drug targets and reducing the time and costs associated with developing
new medications. This advancement is particularly impactful in the context of complex diseases, where
traditional methods have struggled to provide swift and accurate results[7][20][59].
One notable example is the discovery of a more potent hit molecule, ISM042-2-048, through AI-powered
compound generation. This compound demonstrated good inhibitory activity against CDK20, a crucial
protein in hepatocellular carcinoma (HCC), with an IC50 value of 33.4 ± 22.6 nM. It also showed
selective anti-proliferation activity in an HCC cell line, marking the first instance of AlphaFold
being applied to hit identification in drug discovery[22]. Furthermore, scientists at Insilico and the
University of Toronto have integrated AlphaFold into an end-to-end AI-powered drug discovery platform,
leading to the identification of a new drug for a novel target aimed at treating HCC[23].
AlphaFold 3 has also enhanced prediction accuracy for antibody-antigen interactions, a critical area
for immunology and therapeutic antibody development. By blending bioinformatics and physics, AlphaFold
offers a more precise understanding of the exact binding between antibodies and antigens, surpassing
the capabilities of previous computational methods[24].
Moreover, AlphaFold opens new avenues for exploring drug targets, especially in neglected diseases.
These are conditions that receive little research funding because they affect small or low-income
populations, making them less attractive to commercial markets. The expanded scope of AlphaFold
3 to include a diverse range of biomolecules further paves the way for transformational science,
including bio-renewable materials and more resilient crops, alongside accelerating drug discovery and
genomics research[25][59].
## Biotechnology Applications
In 2024, together with Isomorphic Labs, we introduced AlphaFold 3, which predicts the structure and
interactions of all of life’s molecules[1]. AlphaFold 3 goes beyond proteins to a broad spectrum of
biomolecules including DNA, RNA, and even small molecules, also known as ligands, which encompass many
drugs[1].
## Genetic Regulation and Personalized Medicine
AlphaFold 3 has heralded a significant advancement in our understanding of genetic regulation and the
development of personalized medicine. By leveraging a diffusion-based architecture, AlphaFold 3 can
predict the structure and interactions of various biomolecular systems with unprecedented accuracy,
including proteins, nucleic acids, small molecules, ions, and modified residues[8][35]. This enhanced
capability allows for joint structure prediction of complex biological systems, which is critical for
understanding the intricate interactions within cells[35].
One of the most groundbreaking features of AlphaFold 3 is its ability to predict protein-DNA and
protein-RNA interactions with far greater accuracy compared to previous models. The new AlphaFold
model has shown a significant improvement—up to 50% or more—in predicting these interactions, which are
essential for understanding gene regulation and expression[30]. Such precise predictions could lead
to revolutionary discoveries in the mechanisms of gene regulation, potentially transforming genetic
research and opening new avenues for personalized medicine[30][35].
Furthermore, AlphaFold 3’s ability to model how DNA interacts with proteins offers profound insights
into cellular processes and the regulation of genetic codes. This capability can significantly advance
our understanding of genetic regulation and help in the development of targeted gene therapies[50][52].
The improved accuracy in predicting these molecular interactions means that scientists can now explore
genetic pathways with a level of detail previously unattainable, facilitating the creation of more
effective and personalized treatment plans for various genetic disorders[8][52].
Despite these advancements, it is important to note that access to AlphaFold 3 is currently restricted,
which may limit the widespread application of its capabilities in the short term[51]. However, the
potential implications for genetic research and personalized medicine remain vast and promising as the
technology continues to evolve and become more accessible to the scientific community.
## Economic Impact and Market Implications
AlphaFold 3, a groundbreaking artificial intelligence program developed by Google DeepMind and
Isomorphic Labs, has been heralded for its transformative potential in drug discovery and development,
which could have substantial economic implications[21]. The program predicts the structure and
interactions of all of life’s molecules with remarkable accuracy, a significant advancement in the
field of genetics[32][67].
By dramatically reducing the cost and time associated with protein structure determination, AlphaFold 3
has the potential to expedite research and development processes in the pharmaceutical industry, leading
to significant cost savings[66]. This reduction in costs can benefit multiple sectors, particularly
companies focused on biotechnology and pharmaceuticals. Stocks of companies involved in these sectors,
such as Alphabet Inc., ABBV, EXAI, IBM, MRK, MSFT, NVS, ORCL, PFE, SDGR, and SLP, could see a positive
impact due to the advancements brought about by AlphaFold 3[68].
Moreover, the program’s ability to predict protein structures has practical applications in addressing
global health challenges. For instance, researchers at the University of Cambridge are utilizing
AlphaFold to develop a more effective malaria vaccine, while teams at the University of Colorado
are exploring solutions to antibiotic resistance, a major public health concern[70]. These efforts
underscore the potential for AlphaFold 3 to contribute to significant medical breakthroughs, which
could further bolster investor confidence in related sectors.
However, it remains uncertain how exactly AlphaFold 3 will catalyze drug discovery and development.
Many drugs fail to reach the market due to unforeseen interactions between their components and various
parts of the body[71]. Despite this uncertainty, the AI-driven advancements of AlphaFold 3 hold promise
for overcoming some of these traditional hurdles, thereby accelerating the pipeline from research to
market.
# References
[1] AlphaFold - Google DeepMind. Retrieved June 3, 2024, from
https://deepmind.google/technologies/alphafold/
[2] AlphaFold - Wikipedia. Retrieved June 3, 2024, from https://en.wikipedia.org/wiki/AlphaFold
[3] AlphaFold 3 predicts the structure and interactions of all of life’s molecules - Isomorphic Labs.
Retrieved June 3, 2024, from
https://www.isomorphiclabs.com/articles/alphafold-3-predicts-the-structure-and-interactions-of-all-of-lifes-molecules
[4] AlphaFold 3 predicts the structure and interactions of all of life’s molecules. Retrieved June 3,
2024, from https://blog.google/technology/ai/google-deepmind-isomorphic-alphafold-3-ai-model/
[5] AlphaFold 3 upgrade enables the prediction of other types of biomolecular systems. Retrieved June
3, 2024, from https://phys.org/news/2024-05-alphafold-enables-biomolecular.html
[6] Rational drug design with AlphaFold 3 - Isomorphic Labs. Retrieved June 3, 2024, from
https://www.isomorphiclabs.com/articles/rational-drug-design-with-alphafold-3
[7] Google DeepMind’s AlphaFold 3 Could Transform Drug Discovery. Retrieved June 3, 2024, from
https://time.com/6975934/google-deepmind-alphafold-3-ai/
[8] Accurate structure prediction of biomolecular interactions with AlphaFold 3 | Nature. Retrieved
June 3, 2024, from https://www.nature.com/articles/s41586-024-07487-w
[10] A glimpse of the next generation of AlphaFold - Google DeepMind. Retrieved June 3, 2024, from
https://deepmind.google/discover/blog/a-glimpse-of-the-next-generation-of-alphafold/
[12] Accurate structure prediction of biomolecular interactions with AlphaFold 3 | Nature. Retrieved
June 3, 2024, from https://www.nature.com/articles/s41586-024-07487-w
[14] AlphaFold 3: Google DeepMind’s latest AI tech in drug discovery. Retrieved June 3, 2024, from
https://www.prescouter.com/2024/05/google-deepmind-alphafold-3/
[17] AlphaFold 3: A Leap Forward in Biomolecular Structure Prediction — Opportunities and Limitations
| by Freedom Preetham | Meta Multiomics | May, 2024 | Medium. Retrieved June 3, 2024, from
https://medium.com/meta-multiomics/alphafold-3-a-leap-forward-in-biomolecular-structure-prediction-opportunities-and-limitations-924350af1181
[18] AlphaFold3 and its improvements in comparison to AlphaFold2 | by Falk Hoffmann | May, 2024 |
Medium. Retrieved June 3, 2024, from
https://medium.com/@falk_hoffmann/alphafold3-and-its-improvements-in-comparison-to-alphafold2-96815ffbb044
[19] What does AlphaFold mean for drug discovery? Retrieved June 3, 2024, from
https://www.nature.com/articles/d41573-021-00161-0
[20] DeepMind’s latest AlphaFold model is more useful for drug discovery | TechCrunch. Retrieved June
3, 2024, from
https://techcrunch.com/2023/10/31/deepminds-latest-alphafold-model-is-more-useful-for-drug-discovery/
[21] Why AlphaFold 3 is stirring up so much buzz in pharma | PharmaVoice. Retrieved June 3, 2024, from
https://www.pharmavoice.com/news/google-alphafold-3-drug-discovery-pharma-buzz/716496/
[22] AlphaFold accelerates artificial intelligence powered drug discovery: efficient discovery of a
novel CDK20 small molecule inhibitor - PMC. Retrieved June 3, 2024, from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9906638/
[23] First Application of AlphaFold in Identifying Potential Liver Cancer Drug. Retrieved June 3, 2024,
from
https://www.genengnews.com/topics/drug-discovery/first-application-of-alphafold-in-identifying-potential-liver-cancer-drug/
[25] AlphaFold Is The Most Important Achievement In AI—Ever. Retrieved June 3, 2024, from
https://www.forbes.com/sites/robtoews/2021/10/03/alphafold-is-the-most-important-achievement-in-ai-ever/
[27] AlphaFold 3 offers even more accurate protein structure prediction. Retrieved June 3, 2024, from
https://www.drugdiscoverytrends.com/meet-alphafold-3-which-can-accurately-model-more-than-99-of-molecular-types-in-the-protein-data-bank/
[38] AlphaFold3: A Game Changer in Protein Structure Prediction — Part 1 | by Chithra Srinivasan | May,
2024 | Medium. Retrieved June 3, 2024, from
https://medium.com/@csn289/alphafold3-a-game-changer-in-protein-structure-prediction-part-1-b8d9c361bda2
[40] Frontiers | Before and after AlphaFold2: An overview of protein structure prediction. Retrieved
June 3, 2024, from https://www.frontiersin.org/articles/10.3389/fbinf.2023.1120370/full
[41] AlphaFold 3: A Leap Forward in Biomolecular Structure Prediction — Opportunities and Limitations
| by Freedom Preetham | Meta Multiomics | May, 2024 | Medium. Retrieved June 3, 2024, from
https://medium.com/meta-multiomics/alphafold-3-a-leap-forward-in-biomolecular-structure-prediction-opportunities-and-limitations-924350af1181
[44] The Drug Discoverer - Reflecting on DeepMind’s AlphaFold artificial intelligence success – what’s
the real significance for protein folding research and drug discovery? - The Institute of Cancer
Research, London. Retrieved June 3, 2024, from
https://www.icr.ac.uk/blogs/the-drug-discoverer/page-details/reflecting-on-deepmind-s-alphafold-artificial-intelligence-success-what-s-the-real-significance-for-protein-folding-research-and-drug-discovery
[47] Great expectations – the potential impacts of AlphaFold DB | EMBL. Retrieved June 3, 2024, from
https://www.embl.org/news/science/alphafold-potential-impacts/
[48] AlphaFold: Accelerating biological research - Google DeepMind. Retrieved June 3, 2024, from
https://deepmind.google/impact/meet-the-scientists-using-alphafold/
[49] How is AlphaFold2 used by scientists? | AlphaFold. Retrieved June 3, 2024, from
https://www.ebi.ac.uk/training/online/courses/alphafold/validation-and-impact/how-is-alphafold-used-by-scientists/
[50] AlphaFold 3 Will Change the Biological World and Drug Discovery. Retrieved June 3, 2024, from
https://www.analyticsvidhya.com/blog/2024/05/deepmind-ai-alphafold/
[51] Major AlphaFold upgrade offers boost for drug discovery. Retrieved June 3, 2024, from
https://www.nature.com/articles/d41586-024-01383-z
[53] AlphaFold 3: Stepping into the future of structure prediction - Front Line Genomics. Retrieved
June 3, 2024, from
https://frontlinegenomics.com/alphafold-3-stepping-into-the-future-of-structure-prediction/
[54] Analyzing the potential of AlphaFold in drug discovery | MIT News | Massachusetts Institute of
Technology. Retrieved June 3, 2024, from
https://news.mit.edu/2022/alphafold-potential-protein-drug-0906
[58] AlphaFold 3: Revolutionizing drug discovery and molecular biology. Retrieved June 3, 2024, from
https://www.prescouter.com/2024/05/alphafold-3/
[59] Google DeepMind’s AI model AlphaFold 3 can be a gamechanger in drug discovery. Retrieved June 3,
2024, from
https://indiaai.gov.in/article/google-deepmind-s-ai-model-alphafold-3-can-be-a-gamechanger-in-drug-discovery
[61] New study uses AlphaFold and AI to accelerate design of novel drug for liver cancer. Retrieved
June 3, 2024, from
https://acceleration.utoronto.ca/news/new-study-uses-alphafold-and-ai-to-accelerate-design-of-novel-drug-for-liver-cancer
[64] The rise of self-driving labs in chemical and materials sciences | Nature Synthesis. Retrieved
June 3, 2024, from https://www.nature.com/articles/s44160-022-00231-0
[65] Self-driving laboratories to autonomously navigate the protein fitness landscape | Nature Chemical
Engineering. Retrieved June 3, 2024, from https://www.nature.com/articles/s44286-023-00002-4
[66] r/stocks on Reddit: Which stocks will benefit most from alphafold protein fold prediction
advancement? Retrieved June 3, 2024, from
https://www.reddit.com/r/stocks/comments/k7z8hi/which_stocks_will_benefit_most_from_alphafold/
[67] Did Google’s DeepMind Just Revolutionize Medicine? Retrieved June 3, 2024, from
https://www.fool.com/investing/2020/12/05/did-googles-deepmind-just-revolutionize-medicine/
[68] Alphabet’s AlphaFold Is A Gift To Humanity, Not Investors | Seeking Alpha. Retrieved June 3, 2024,
from https://seekingalpha.com/article/4522954-alphabets-alphafold-is-a-gift-to-humanity-not-investors
[69] DeepMind’s AlphaFold AI Origin Story, Influence on Biotech. Retrieved June 3, 2024, from
https://www.businessinsider.com/deepmind-alphafold-ai-origin-story-impact-on-biotech-2023-2
[71] Folds and faults: On AlphaFold 3 - The Hindu. Retrieved June 3, 2024, from
https://www.thehindu.com/opinion/editorial/folds-and-faults-on-alphafold-3/article68161671.ece
[72] Software Engineers Lead Drug Breakthough Terry 11:07 AM - Markman on Tech. Retrieved June 3, 2024,
from https://www.thestreet.com/tech/news/sdgrjdm121420
Table 12: Co-STORM’s generated final report for “Alpha Fold 3”. “#”, “##” indicate the section title and subsection
title respectively. Numbers in brackets indicate the cited references.
