This presentation was provided by William Mattingly of the Smithsonian Institution during the fourth segment of the NISO training series "AI & Prompt Design." Session Four, "Structured Data and Assistants," was held on April 25, 2024.
2. Goals
1. Importance of Structured Data
2. How to Generate Structured Data from LLMs
3. Importance of Consistency in LLM Outputs
4. How to Generate Consistent Responses
5. Vector Databases and Semantic Search
6. Retrieval Augmented Generation
7. Assistants
8. Structured Data
Types of Data
Important Structures
● CSV
● JSON
● HTML/XML
Important Questions:
1. Should the data be hierarchical (nested)?
2. Do I want to preserve the input data? If so, how?
3. What is the intended usage of the data?
4. How much data will I have (scalability)?
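The question of hierarchy can be made concrete with a minimal sketch. It uses one sentence from the Hobbit passage quoted in this deck; the key names (`text`, `entities`, `label`) and the CSV column names are assumptions chosen for illustration, not a schema from the presentation.

```python
import csv
import io
import json

# One sentence and its tagged entities; key names are illustrative assumptions.
sentence = {
    "text": "Not that Belladonna Took ever had any adventures "
            "after she became Mrs. Bungo Baggins.",
    "entities": [
        {"text": "Belladonna Took", "label": "person"},
        {"text": "Bungo Baggins", "label": "person"},
    ],
}

# JSON keeps the nesting: the entity list lives inside the sentence record.
as_json = json.dumps(sentence, indent=2)

# CSV is flat: each entity becomes its own row and the nesting is lost.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["entity", "label"])
for entity in sentence["entities"]:
    writer.writerow([entity["text"], entity["label"]])
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

The JSON form preserves the input sentence alongside its entities, while the CSV form is easier to open in a spreadsheet but drops the source text unless it is repeated on every row.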
14. HTML
<p>
Not that <span class="person">Belladonna
Took</span> ever had any adventures after she
became Mrs. <span class="person">Bungo
Baggins</span>.
<span class="person">Bungo</span>, that was
<span class="person">Bilbo</span>’s father, built
the most luxurious hobbit-hole for her
(and partly with her money) that was to be found
either under <span class="place">The Hill</span>
or over <span class="place">The Hill</span>
or across <span class="place">The Water</span>,
and there they remained to the end of their days.
</p>
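As a rough sketch of how markup like this could be read back into structured data, the following uses Python's standard `html.parser` module to collect each `<span>` together with its `class` value. The `EntityCollector` class name is hypothetical, and the snippet is a shortened copy of the first sentence above.

```python
from html.parser import HTMLParser


class EntityCollector(HTMLParser):
    """Collect (text, class) pairs for every <span> element."""

    def __init__(self):
        super().__init__()
        self.entities = []
        self._label = None   # class of the span currently open, if any
        self._parts = []     # text chunks seen inside that span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._label = dict(attrs).get("class")
            self._parts = []

    def handle_data(self, data):
        if self._label is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._label is not None:
            # Collapse line breaks inside the span into single spaces.
            text = " ".join("".join(self._parts).split())
            self.entities.append((text, self._label))
            self._label = None


snippet = (
    '<p>Not that <span class="person">Belladonna\n'
    'Took</span> ever had any adventures after she became Mrs. '
    '<span class="person">Bungo\nBaggins</span>.</p>'
)

collector = EntityCollector()
collector.feed(snippet)
collector.close()
print(collector.entities)
```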
16. XML
<text>
<sentence>
Not that <person>Belladonna Took</person> ever had any
adventures after she became Mrs. <person>Bungo Baggins</person>.
</sentence>
<sentence>
<person>Bungo</person>, that was <person>Bilbo</person>’s
father, built the most luxurious hobbit-hole for her
(and partly with her money) that was to be found either under
<place>The Hill</place> or over <place>The Hill</place>
or across <place>The Water</place>, and there they remained to
the end of their days.
</sentence>
</text>
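Because the XML version nests entities inside sentences, a parser can recover both the entity and the sentence it came from. The sketch below uses the standard library's `xml.etree.ElementTree`; the `rows` tuple layout is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """<text>
<sentence>
Not that <person>Belladonna Took</person> ever had any
adventures after she became Mrs. <person>Bungo Baggins</person>.
</sentence>
<sentence>
<person>Bungo</person>, that was <person>Bilbo</person>’s
father, built the most luxurious hobbit-hole for her
(and partly with her money) that was to be found either under
<place>The Hill</place> or over <place>The Hill</place>
or across <place>The Water</place>, and there they remained to
the end of their days.
</sentence>
</text>"""

root = ET.fromstring(xml_text)

# Each row is (sentence number, entity type, entity text).
rows = []
for number, sentence in enumerate(root.iter("sentence"), start=1):
    for entity in sentence:
        rows.append((number, entity.tag, entity.text))

for row in rows:
    print(row)
```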
17. Exercise 1 (10 min): Generate Structured Data Output for “John went to Paris on 1 August 2023.”
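One possible target structure for this exercise is sketched below as JSON built in Python. The key names (`person`, `place`, `date`) and the choice to normalize the date to ISO 8601 are assumptions; other answers are equally valid.

```python
import json
from datetime import datetime

text = "John went to Paris on 1 August 2023."

# Normalize the written date so it sorts and compares consistently.
iso_date = datetime.strptime("1 August 2023", "%d %B %Y").date().isoformat()

record = {
    "text": text,        # keep the input so the output can be checked against it
    "person": "John",
    "place": "Paris",
    "date": iso_date,
}

print(json.dumps(record, indent=2))
```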
21. Practical Applications with Real World Data
An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.
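When a passage like this is sent to an LLM for structured extraction, the reply needs to be checked before it is used. The sketch below assumes the model was asked for JSON with three keys (`date`, `place`, `organizations`); those key names, the `parse_reply` helper, and the reply string itself are all hypothetical. The reply is hand-written from the passage above, not real model output.

```python
import json

# Keys we assume the prompt asked for, with the type expected for each.
REQUIRED_KEYS = {"date": str, "place": str, "organizations": list}


def parse_reply(reply: str) -> dict:
    """Parse a JSON reply and check every required key has the expected type."""
    data = json.loads(reply)
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for key: {key}")
    return data


# Hand-written stand-in for a model reply (an assumption, not real output).
reply = """{
  "date": "17 April 1991",
  "place": "Lephoi, Bethulie, Orange Free State (OFS)",
  "organizations": ["ANCYL", "SAP", "ANC"]
}"""

record = parse_reply(reply)
print(record)
```

Rejecting replies that fail this check, and asking again, is one simple way to keep outputs consistent across many records.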
27. Vector Database
How do we use a vector database?
● We populate a vector database by using a machine learning model to vectorize the data and then sending the resulting vectors to the database.
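The populate step can be sketched in a few lines. This is a toy under stated assumptions: a word-count vectorizer stands in for a machine learning embedding model, and a plain Python list stands in for the database. The `vectorize` function and the two sample documents are illustrative only.

```python
from collections import Counter


def vectorize(text, vocabulary):
    """Toy stand-in for an embedding model: one word count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


documents = [
    "john went to paris",
    "bungo built a hobbit-hole under the hill",
]

# Fixed vocabulary so every vector has the same length.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

# The "database": each entry keeps the vector and the original text together.
database = [
    {"id": i, "vector": vectorize(doc, vocabulary), "text": doc}
    for i, doc in enumerate(documents)
]

for entry in database:
    print(entry["id"], entry["vector"], entry["text"])
```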
29. Vector Database
Why use a vector database?
● A vector database stores vector data so that users can query it and find similar items based on vector-level similarity rather than explicit, human-defined similarity.
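Vector-level similarity is commonly measured with cosine similarity, although the slides do not name a specific metric. The sketch below ranks stored vectors against a query vector; the three-dimensional vectors and the names `doc_a`, `doc_b`, and `doc_c` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors; a real system would get these from an embedding model.
stored = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}


def query(vector, top_k=2):
    """Return the names of the top_k stored vectors most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda name: cosine_similarity(vector, stored[name]),
        reverse=True,
    )
    return ranked[:top_k]


result = query([0.8, 0.0, 0.1])
print(result)
```

No one had to label `doc_a` and `doc_b` as related; they rank together because their vectors point in nearly the same direction.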
30. Vector Database
What is it?
● A vector database holds numerous vectors, or embeddings, of data. Sometimes the database also stores the original data alongside these vectors.
33. Vector Database Stacks
What is available to us?
● Python, Annoy, Streamlit
○ Cheap and easy to deploy, but requires a little knowledge to build from scratch
○ Best for smaller databases (under 10,000 items)
● Python, txtai
○ Cheap and easy to use and deploy, though more resource intensive
○ Allows for easy interpretability (via highlighting)
39. RAG
What is it?
● RAG (retrieval-augmented generation) lets you combine the strengths of large language models (LLMs) with vector databases
● It reduces the chance that an LLM will hallucinate (generate fake information)
● It uses a vector database to find material relevant to a query
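The retrieval half of RAG can be sketched without any external service. In this toy version, word overlap stands in for vector similarity, a list stands in for the vector database, and the final prompt is only built, not sent to a model. The function names, the prompt wording, and the sample passages are assumptions for illustration.

```python
def score(query, passage):
    """Toy relevance score: number of distinct query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words)


def retrieve(query, passages, top_k=1):
    """Return the top_k passages with the highest score for the query."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:top_k]


def build_prompt(query, passages, top_k=1):
    """Put the retrieved passages in front of the question for the LLM."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, top_k))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


passages = [
    "John went to Paris on 1 August 2023.",
    "Bungo built the most luxurious hobbit-hole for Belladonna.",
]

prompt = build_prompt("Where did John go?", passages)
print(prompt)
```

Grounding the model in retrieved text, and telling it to admit when the context lacks an answer, is what reduces the room for hallucination.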