
In S Traction


Rating / Ranking Tasks

Instructions Table of Contents

1. Overview of Rating Tasks

2. Rubrics for Different Types of Prompts

3. Using the Rubrics

4. Examples

What is a Rating/Ranking task?

● What

A Rating/Ranking task consists of evaluating two responses to a prompt across multiple dimensions, identifying which response is better, and providing a justification for your ranking of the two responses.

● Why

Rating tasks are at the core of Reinforcement Learning from Human Feedback (RLHF). The basic idea is that we want to train the model to understand which of its two responses is better, so that going forward it writes responses more similar to the better ones. RLHF is proving to be the most effective way to ensure model safety, factual accuracy, and depth of explanation.

What does a Rating/Ranking task look like?

1. Read the responses to the prompt

2. Rate each response against the dimensions of quality outlined in the task

3. Confirm the relative ranking of the two responses

4. Provide a justification of the relative ranking and the rating you gave to the higher-rated response
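The four steps of a Rating/Ranking task can be pictured as a small record that is filled in as you go. The sketch below is only an illustration: the class names, field names, and the unweighted-mean aggregate are assumptions, not part of the task tooling described in this course.

```python
from dataclasses import dataclass, field


@dataclass
class ResponseRating:
    """Scores for one response, keyed by rubric dimension (1-5 scale)."""
    scores: dict[str, int] = field(default_factory=dict)

    def overall(self) -> float:
        # Assumption: an unweighted mean is used here purely for illustration.
        return sum(self.scores.values()) / len(self.scores)


@dataclass
class RatingTask:
    """Hypothetical container for one Rating/Ranking task."""
    prompt: str
    response_a: str
    response_b: str
    rating_a: ResponseRating
    rating_b: ResponseRating
    preferred: str        # "A" or "B"
    justification: str

    def ranking_is_consistent(self) -> bool:
        """True if the preferred response is not the lower-scored one."""
        a, b = self.rating_a.overall(), self.rating_b.overall()
        return a >= b if self.preferred == "A" else b >= a
```

A check like `ranking_is_consistent()` mirrors the "confirm the relative ranking" step: if the response you prefer scored lower across the dimensions, either the ratings or the ranking deserves a second look.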

Rating/Ranking Scoring

● Because we are comparing writing output, we will use the same rubrics that we use for

writing tasks, which you will also see later in this course.

● Those writing tasks have slightly different grading criteria, depending on whether they

are logical or creative, or some combination of logical and creative.

● You must determine the correct type of prompt (logical or creative) so that the

correct rubric can be applied.

● Let’s take a look at how to make this determination…

Identifying Logical vs. Creative Prompts

It’s important to determine whether the writing task is seeking a logical or creative

response. These are scored based on slightly different criteria which we will cover in a few

moments.

● Logical:

Logical writing tasks involve factual information about the topic and/or a balanced analysis of a real subject.

● Creative:

Creative writing tasks require the author to use a unique voice to produce a piece of fiction or

poetry which displays imagination and invention.

● Combination of Logical + Creative

Combination writing tasks require both factual information and some invention or literary devices to inject creativity into the response.

Example prompts on the spectrum between CREATIVE and LOGICAL / FACTUAL:

● “Write a short story about a bird who just learned how to fly.”

● “Imagine a world where people can upload their consciousness…”

● “Write a snarky conversation between JFK and his advisor about the brain drain from East Germany to West Germany.”

● “Explain why Cinderella had to leave the ball at midnight.”

● “In less than 500 words, construct a reasoned argument for why the United States should invest in nuclear energy.”

● “I am a 4th grade teacher. One of my students, Billy, shows a lot of potential but doesn’t do his homework. Write a letter to his parents…”

Most prompts you see will be on the Logical / Factual side of the spectrum.
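Once the prompt type is decided, it determines which rubric applies. The sketch below lists the scored (1-5) dimensions that each rubric in this document uses; the enum, the dictionary, and the function name are hypothetical, and Safety is left out because it is rated Pass/Fail rather than on the 1-5 scale.

```python
from enum import Enum


class PromptType(Enum):
    LOGICAL = "Writing - Logical"
    CREATIVE = "Writing - Creative"
    COMBINATION = "Writing - Creative and Logical"


# Dimensions that appear in every rubric in this document.
SHARED_DIMENSIONS = [
    "Conciseness",
    "Spelling & Grammar",
    "Formatting and Coherence",
    "Tone Appropriateness",
]

# Dimensions shared by the Creative and the Creative and Logical rubrics.
CREATIVE_CORE = [
    "Creativity and Complexity",
    "Completeness",
    "Relevance",
    "Development of Ideas/Themes",
    "Sophistication",
]

RUBRIC_DIMENSIONS = {
    PromptType.LOGICAL: ["Accuracy", "Completeness", "Relevance", "Depth"] + SHARED_DIMENSIONS,
    PromptType.CREATIVE: CREATIVE_CORE + SHARED_DIMENSIONS,
    PromptType.COMBINATION: ["Factuality"] + CREATIVE_CORE + SHARED_DIMENSIONS,
}


def dimensions_for(prompt_type: PromptType) -> list[str]:
    """Return the scored dimensions of the rubric matching the prompt type."""
    return list(RUBRIC_DIMENSIONS[prompt_type])
```

For example, `dimensions_for(PromptType.LOGICAL)` starts with Accuracy and includes Depth, while the Creative rubric replaces those with Creativity and Complexity, Development of Ideas/Themes, and Sophistication.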


Rubric:

Writing - Logical

Remember, apply the following rubric to Logical writing prompts where factual information about

the topic is required.

Criteria 1-2 (Insufficient) 3 (Minimum) 4-5 (Good)

Accuracy

The extent to which the information presented is accurate, reliable, and aligns with established facts or evidence.

1-2 (Insufficient)

● Central Claims - One or more incorrect claims are central or core to the conclusion, thesis, or key supporting points of the response
● Supporting Claims - 2 or more pieces of supporting evidence are false
● Verifiable - Conclusive, common-knowledge counterfactual information can be found via a Google search

3 (Minimum)

● Central Claims - All central claims are correct and accurate
● Supporting Claims - Up to 1 incorrect supporting point that, if corrected, would not meaningfully alter the core thesis or conclusion
● Verifiable - Presents claims that are controversial or not fully proven as fact

4-5 (Good)

● Central Claims - All central claims are correct and accurate
● Supporting Claims - All supporting claims are accurate and correct
● Verifiable - No claims are incorrect or false, and all are verifiable through generally available information and trusted sources. All potentially controversial, generalized statements or opinions are presented with appropriate caveats
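The Accuracy bands can be read as a simple decision rule. The sketch below is one possible reading, not an official scoring procedure: it assumes that any single Insufficient condition is enough to place a response in the 1-2 band (the rubric does not say whether the bullets combine), and the function and parameter names are hypothetical.

```python
def accuracy_band(
    incorrect_central_claims: int,
    incorrect_supporting_claims: int,
    unproven_stated_as_fact: bool,
) -> str:
    """Map counts of factual problems to an Accuracy band (assumed reading)."""
    # Any incorrect central claim, or 2+ false supporting claims, is Insufficient.
    if incorrect_central_claims >= 1 or incorrect_supporting_claims >= 2:
        return "1-2 (Insufficient)"
    # One minor supporting error, or an uncaveated controversial claim, caps at Minimum.
    if incorrect_supporting_claims == 1 or unproven_stated_as_fact:
        return "3 (Minimum)"
    # Everything correct, verifiable, and appropriately caveated.
    return "4-5 (Good)"
```

Choosing the exact score within a band (1 versus 2, or 4 versus 5) still requires your judgment; the rule only narrows the range.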

Completeness

The extent to which the answer addresses all aspects of the prompt, ensuring that no essential information is omitted.

1-2 (Insufficient)

● Prompt Request Coverage - The response does not address all explicit asks of the prompt
● Constraints - The response did not address all of the constraints placed in service of the requests
● Breadth of Solutions [STEM] - The response does not show multiple possible options to solve the question within the prompt

3 (Minimum)

● Prompt Request Coverage - The response addresses all explicit asks of the prompt
● Constraints - The response completed the requests within the prompt, including the constraints placed in service of those requests
● Breadth of Solutions [STEM] - The response references multiple possible solutions, but does not explain each in depth or weigh the costs/benefits of each solution

4-5 (Good)

● Prompt Request Coverage - The response addresses all explicit asks of the prompt
● Constraints - The response completed the requests within the prompt, including the constraints placed in service of those requests
● Breadth of Solutions [STEM] - The response shows multiple possible solutions, provides analysis of each, and provides a recommendation for which is the most elegant solution to solve the problem

Relevance

How useful the supporting information and claims are in answering the question or prompt.

1-2 (Insufficient)

● Supporting content - The key points provided are unrelated to the central claim or thesis
● Specificity - Examples brought up are vague and overly general
● Usefulness - The information provided to justify or support any of the key points of the response does not logically contribute to the main claim

3 (Minimum)

● Supporting content - The supporting information and evidence is tangentially related to the central claim or thesis
● Specificity - The information provided contains a few selected examples, but doesn’t explain them or relate them back to the original response
● Usefulness - There is a mix of useful information and irrelevant or not directly useful information within the response. The core requests/demands of the prompt are still satisfied within the response, despite the irrelevant information

4-5 (Good)

● Supporting content - The supporting information logically defends or clearly illustrates the key points and the central claims made in the response
● Specificity - The claims are supported with precise, rigorously chosen examples that are related back to the original claims
● Usefulness - Each paragraph and sentence directly adds value to the claims or context of the response

Depth

The level of detail, insight, and nuance the response provides.

1-2 (Insufficient)

● Justification - There is no supporting information at all. The response does not include meaningfully relevant insights, evidence, details, or examples
● Context - The response omits important or necessary background information needed to understand the content of the response
● Logical Reasoning - No rationale is provided at all

3 (Minimum)

● Justification - The supporting information was helpful in understanding and justifying the claims but was surface-level. The response lacks the kind of analysis that is not usually found in a cursory Google search
● Context - The response provides the most important or necessary background context, but may omit or ignore minor background information
● Logical Reasoning - The chain-of-thought reasoning is included but shallow, and could be more complete and/or step-by-step

4-5 (Good)

● Justification - The supporting information includes ample relevant insights, evidence, details, and examples. The supporting information and evidence is exhaustive in regard to supporting the set of claims (i.e., there is a specific example or justification for each main claim)
● Context - The response provides the necessary useful background information and context in order for the user to understand the response
● Logical Reasoning - The chain of reasoning of the response is sufficient and addresses most, if not all, immediate follow-up/related questions a user may have

Criteria 1-2 (Insufficient) 3 (Minimum) 4-5 (Good)

Conciseness

How effectively the written response conveys information without unnecessary repetition or wordiness

1-2 (Insufficient)

● Repetition - The response communicates the exact same ideas, but potentially in slightly different ways, multiple times within the response. The repetition is not used for stylistic effect
● Length - The response is overly verbose, and/or significantly exceeds the word limits directly imposed by the prompt
● Verbosity - Responses with poor conciseness take multiple sentences to explain concepts that could be explained within a sentence or two

3 (Minimum)

● Repetition - The response communicates the exact same ideas, but in slightly different ways, multiple times within the response. The repetition is not used for stylistic effect
● Length - The response is close to the length specified explicitly by the prompt
● Verbosity - Response explains most of its concepts or points efficiently, but still may have some overly verbose sentences or claims

4-5 (Good)

● Repetition - Response does not unnecessarily repeat information, implicitly or explicitly, unless as a stylistic device to emphasize certain key points
● Length - The response fits the length specified explicitly by the prompt
● Verbosity - Response efficiently communicates its concepts and points consistently. Concise responses can still be long, if all of the added content adds discrete value in furtherance of the core claims of the response

Spelling & Grammar

The correctness of the spelling and grammatical structure, and the quality of sentence structure

1-2 (Insufficient)

● Readability - The response has multiple spelling or grammatical errors that significantly impact the readability (how easily the response can be parsed by human readers)
● Language Mechanics - Response includes errors that may encompass incorrect verb forms, sentence structure problems, run-on sentences, punctuation, and sentence fragments

3 (Minimum)

● Readability - The response has some spelling or grammatical errors but the response is still readable
● Language Mechanics - Response indicates proficient use of language mechanics, with only minor corrections needed

4-5 (Good)

● Readability - The response has no spelling or grammatical errors
● Language Mechanics - Response indicates advanced use of language mechanics, with no corrections needed, even minor ones

Formatting and Coherence

The organization and presentation of the written response, including elements such as structure, layout, and visual clarity. Do the individual paragraphs flow together, and are they relevant and understandable?

1-2 (Insufficient)

● Visual Presentation - There is no visual separation between ideas; distinct ideas are lumped together in one paragraph
● Text Formatting - Lists are not broken into bullet points; Markdown is broken
● Outline and Coherence - The ideas covered in the response are not outlined or presented in a logical order. Each paragraph jumps from idea to idea with no logical transition

3 (Minimum)

● Visual Presentation - The ideas covered in the response are visually separated into distinct text spaces
● Text Formatting - Lists are used when appropriate, including bolding where relevant. Markdown renders appropriately. Key detail may not be bolded
● Outline and Coherence - The ideas covered in the response are outlined, but the transition between paragraphs and ideas could be improved

4-5 (Good)

● Visual Presentation - The ideas covered in the response are visually separated into distinct text spaces. Whitespace is used intentionally with added effect
● Text Formatting - Lists are used when appropriate, including bolding where relevant. Markdown renders appropriately
● Outline and Coherence - Ideas are introduced and outlined to the reader appropriately, and ideas seamlessly transition from one to another

Tone Appropriateness

The diction, attitude, and stylistic choices reflected in the written response, shaping both its message and presentation.

1-2 (Insufficient)

● Instruction Following - The voice or persona that the model responds in is antithetical to the explicit instructions of the prompt, if specified
● Audience Appropriate - The diction and attitude of the response are inappropriate in the context of the prompt

3 (Minimum)

● Instruction Following - The voice or persona that the model responds in follows the explicit instructions of the prompt, if specified
● Audience Appropriate - The diction and attitude are acceptable, but don't quite capture the ideal intent of the response

4-5 (Good)

● Instruction Following - The voice or persona of the response follows the voice specified within the prompt
● Audience Appropriate - The tone of the response is appropriate for the intended audience of the prompt, e.g., complex or conversational when contextually appropriate

Safety (Pass/Fail)

The extent to which the written response avoids harmful elements such as racism, offensive language, or any content that may cause harm or discomfort to others, and is compliant with companies' internal positions

Fail

● Harmful Content - Response includes hate speech, bullying, harassment, sexually explicit content, or violence
● Illegal Activities - Response promotes any illegal activities or harmful actions to individuals' or groups' mental, physical, or emotional health
● Profanity - Response contains any curse words
● Bias and Stereotyping - Response promotes and/or perpetuates biases and/or stereotypical responses on select societal groups

Pass

● Harmful Content - Response does NOT include any hate speech, bullying, harassment, sexually explicit content, or violence
● Illegal Activities - Response does NOT promote any illegal activities or harmful actions to individuals' or groups' mental, physical, or emotional health
● Profanity - Response does NOT include any curse words
● Bias and Stereotyping - Response does NOT promote and/or perpetuate biases and/or stereotypical responses on select societal groups; potential causes for biases are disclaimed appropriately within the response
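Unlike the other dimensions, Safety is a Pass/Fail gate: a single violation in any of the four categories fails the response. A minimal sketch of that logic follows; the class, its field names, and the function are hypothetical and only illustrate the all-or-nothing structure of the criterion.

```python
from dataclasses import dataclass, asdict


@dataclass
class SafetyFlags:
    """True means the rater found that kind of violation in the response."""
    harmful_content: bool = False
    illegal_activities: bool = False
    profanity: bool = False
    bias_or_stereotyping: bool = False


def safety_passes(flags: SafetyFlags) -> bool:
    """A response passes Safety only if no category is flagged."""
    return not any(asdict(flags).values())
```

For instance, a response that is otherwise strong but contains a curse word would be recorded as `SafetyFlags(profanity=True)` and would fail the Safety check.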


Rubric:

Writing - Creative

Remember, apply the following rubric to Creative prompts, which generally require the author to

use a unique voice to produce a piece of fiction or poetry which displays imagination and

invention.

Criteria 1-2 (Insufficient) 3 (Minimum) 4-5 (Good)

Creativity and Complexity

How much detail, insight, and complexity the answer provides. Responses that perform well on Creativity/Complexity contain novel ideas

1-2 (Insufficient)

● Novel Generation - The output does not materially contain ideas that are not introduced within the prompt
● Supporting Context - The output omits important information necessary for properly comprehending its response

3 (Minimum)

● Novel Generation - The output contains ideas not introduced within the prompt, but they may not be entirely complicated and/or sophisticated
● Supporting Context - The response provides the key supporting ideas that make the response cohesive, but may omit or ignore minor background information

4-5 (Good)

● Novel Generation - The output contains multiple ideas that were not introduced within the prompt, each with deep levels of complexity and ingenuity
● Supporting Context - The concepts are well supported and relevant within the rest of the response


Completeness

The extent to which the answer addresses all aspects of the prompt, ensuring that no essential information is omitted.

1-2 (Insufficient)

● Prompt Request Coverage - The response does not address all explicit asks of the prompt
● Constraints - The response did not address all of the constraints placed in service of the requests

3 (Minimum) and 4-5 (Good)

● Prompt Request Coverage - The response addresses all explicit asks of the prompt
● Constraints - The response completed the requests within the prompt, including the constraints placed in service of those requests

Relevance

How useful the supporting information and claims are in answering the question or prompt.

1-2 (Insufficient)

● Supporting content - The key points provided are unrelated to the central claim or thesis
● Usefulness - The information provided to justify or support any of the key points of the response does not logically contribute to the main claim

3 (Minimum)

● Supporting content - The supporting information and evidence is tangentially related to the main topic
● Usefulness - The information doesn't add any value to the core claims but is adjacent enough to the content of the topic that it is understandable why it would be included

4-5 (Good)

● Supporting content - The supporting information logically defends or clearly illustrates the key points and claims made in the response
● Usefulness - Each paragraph and sentence directly adds value to the claims or context of the response

Development of Ideas/Themes

How is the premise set, and how are new insights brought up? (Does it have an arc, or is it just thrown at you?)

1-2 (Insufficient)

● Premise-setting - One or more key themes in the response are lacking proper context; the output feels like it’s missing background information to properly appreciate the creativity.

3 (Minimum)

● Premise-setting - The premise is sufficiently established before the introduction of key ideas
● New Concept Development - New insights are introduced in a way that makes sense to the reader

4-5 (Good)

● Premise-setting - The premise is sufficiently established before the introduction of key ideas
● New Concept Development - All new ideas introduced connect smartly to the premise of the response and together they make for a creative, cohesive piece of writing.

Sophistication

To what degree are complex literary devices and tools used? (e.g., metaphors, allegories, euphemisms, hyperboles)

1-2 (Insufficient)

● Use of complex mechanics - There is no detectable use of figurative speech or literary devices within the output; what is said is all meant very literally.
● Sophistication of tone - The output sounds very robotic

3 (Minimum)

● Use of complex mechanics - There is some sophistication in the output; at least one literary device was used appropriately.
● Sophistication of tone - The output sounds like it was written by a proficient human writer

4-5 (Good)

● Use of complex mechanics - One or more literary devices are deployed which meaningfully contribute to the style and sophistication of the response.
● Sentence-level logical complexity - The response includes use of complex syntax, prepositions, and dependent clauses
● Sophistication of tone - The output sounds humanlike, and is akin to that of professionally written prose

Criteria 1-2 (Insufficient) 3 (Minimum) 4-5 (Good)

Conciseness

How effectively the written response conveys information without unnecessary repetition or wordiness

1-2 (Insufficient)

● Repetition - The response communicates the exact same ideas, but potentially in slightly different ways, multiple times within the response. The repetition is not used for stylistic effect
● Length - The response is overly verbose, and/or significantly exceeds the word limits directly imposed by the prompt
● Verbosity - Responses with poor conciseness take multiple sentences to explain concepts that could be explained within a sentence or two

3 (Minimum)

● Repetition - The response communicates the exact same ideas, but in slightly different ways, multiple times within the response. The repetition is not used for stylistic effect
● Length - The response is close to the length specified explicitly by the prompt
● Verbosity - Response explains most of its concepts or points efficiently, but still may have some overly verbose sentences or claims

4-5 (Good)

● Repetition - Response does not unnecessarily repeat information, implicitly or explicitly, unless as a stylistic device to emphasize certain key points
● Length - The response fits the length specified explicitly by the prompt
● Verbosity - Response efficiently communicates its concepts and points consistently. Concise responses can still be long, if all of the added content adds discrete value in furtherance of the core claims of the response

Spelling & Grammar

The correctness of the spelling and grammatical structure, and the quality of sentence structure

1-2 (Insufficient)

● Readability - The response has multiple spelling or grammatical errors that significantly impact the readability
● Language Mechanics - Response includes errors that may encompass incorrect verb forms, sentence structure problems, run-on sentences, punctuation, and sentence fragments

3 (Minimum)

● Readability - The response has some spelling or grammatical errors but the response is still readable
● Language Mechanics - Response indicates proficient use of language mechanics, with only minor corrections needed

4-5 (Good)

● Readability - The response has no spelling or grammatical errors
● Language Mechanics - Response indicates advanced use of language mechanics, with no corrections needed, even minor ones

Formatting and Coherence

The organization and presentation of the written response, including elements such as structure, layout, and visual clarity. Do the individual paragraphs flow together, and are they relevant and understandable?

1-2 (Insufficient)

● Visual Presentation - There is no visual separation between ideas; distinct ideas are lumped together in one paragraph
● Text Formatting - Lists are not broken into bullet points; key detail is not bolded. Markdown is broken.
● Outline and Coherence - The ideas covered in the response are not outlined or presented in a logical order. Each paragraph jumps from idea to idea with no logical transition

3 (Minimum)

● Visual Presentation - The ideas covered in the response are visually separated into distinct text spaces
● Text Formatting - Lists are used when appropriate, including bolding where relevant. Markdown renders appropriately
● Outline and Coherence - The ideas covered in the response are outlined, but the transition between paragraphs and ideas could be improved

4-5 (Good)

● Visual Presentation - The ideas covered in the response are visually separated into distinct text spaces. Whitespace is used intentionally with added effect
● Text Formatting - Lists are used when appropriate, including bolding where relevant. Markdown renders appropriately
● Outline and Coherence - Ideas are introduced and outlined to the reader appropriately, and ideas seamlessly transition from one to another

Tone Appropriateness

The diction, attitude, and stylistic choices reflected in the written response, shaping both its message and presentation.

1-2 (Insufficient)

● Instruction Following - The voice or persona that the model responds in is antithetical to the explicit instructions of the prompt, if specified
● Audience Appropriate - The diction and attitude of the response are inappropriate in the context of the prompt

3 (Minimum)

● Instruction Following - The voice or persona that the model responds in follows the explicit instructions of the prompt, if specified
● Audience Appropriate - The diction and attitude are acceptable, but don't quite capture the ideal intent of the response

4-5 (Good)

● Instruction Following - The voice or persona of the response aligns with the voice specified in the prompt
● Audience Appropriate - The tone of the response is appropriate for the intended audience of the prompt, e.g., complex or conversational when contextually appropriate

Safety (Pass/Fail)

The extent to which the written response avoids harmful elements such as racism, offensive language, or any content that may cause harm or discomfort to others, and is compliant with companies' internal positions

Fail

● Harmful Content - Response includes hate speech, bullying, harassment, sexually explicit content, or violence
● Illegal Activities - Response promotes any illegal activities or harmful actions to individuals' or groups' mental, physical, or emotional health
● Profanity - Response contains any curse words
● Bias and Stereotyping - Response promotes and/or perpetuates biases and/or stereotypical responses on select societal groups

Pass

● Harmful Content - Response does NOT include any hate speech, bullying, harassment, sexually explicit content, or violence
● Illegal Activities - Response does NOT promote any illegal activities or harmful actions to individuals' or groups' mental, physical, or emotional health
● Profanity - Response does NOT include any curse words
● Bias and Stereotyping - Response does NOT promote and/or perpetuate biases and/or stereotypical responses on select societal groups; potential causes for biases are disclaimed appropriately

Rubric:

Writing - Creative and Logical

Remember, apply the following rubric to prompts that are a combination of Logical and Creative.

Criteria 1-2 (Insufficient) 3 (Minimum) 4-5 (Good)

Factuality

The extent to which the information presented is accurate, reliable, and aligns with established facts or evidence.

1-2 (Insufficient)

● Central Claims - One or more incorrect claims are central or core to the conclusion, thesis, or key supporting points of the response
● Supporting Claims - 2 or more pieces of supporting evidence are false
● Verifiable - Conclusive, common-knowledge counterfactual information can be found via a Google search

3 (Minimum)

● Supporting Claims - Up to 1 incorrect supporting point that, if corrected, would not meaningfully alter the core thesis or conclusion.
● Verifiable - Asserts claims that are not fully proven as though they were common-knowledge fact, or presents controversial information as fact

4-5 (Good)

● Verifiable - No claims are incorrect or false, and all are verifiable through publicly available and trusted sources
● Unverifiable Claims - All potentially controversial, generalized statements or opinions are presented with appropriate caveats

Creativity and Complexity

How much detail, insight, and complexity the answer provides. Were any novel ideas generated?

1-2 (Insufficient)

● Novel Generation - The output does not materially contain creative ideas or concepts.
● Supporting Context - The output omits important or necessary supporting information needed to properly comprehend its response.

3 (Minimum)

● Novel Generation - The output contains novel ideas, but they may not be entirely complicated and/or sophisticated
● Supporting Context - The response provides the key supporting ideas that make the response cohesive, but may omit or ignore minor background information

4-5 (Good)

● Novel Generation - The output contains multiple novel ideas, each with deep levels of complexity and ingenuity within the confines of the prompt
● Supporting Context - The concepts are well supported and relevant within the rest of the response
● Logical Generation - The concepts generated are logical within the parameters of the prompt

Completeness

The extent to which the answer addresses all aspects of the prompt, ensuring that no essential information is omitted.

1-2 (Insufficient)

● Prompt Request Coverage - The response does not address all explicit asks of the prompt

3 (Minimum)

● Prompt Request Coverage - The response addresses all explicit asks of the prompt

4-5 (Good)

● Prompt Request Coverage - The response addresses all explicit asks of the prompt

Relevance

How useful the supporting information and claims are in answering the question or prompt.

1-2 (Insufficient)

● Supporting content - The key points provided are unrelated to the central claim or thesis
● Added Value - The information provided to justify or support any of the key points of the response does not logically contribute to the main claim

3 (Minimum)

● Supporting content - The supporting information and evidence is tangentially related to the main topic
● Added Value - The information doesn't add any value to the core claims but is adjacent enough to the content of the topic that it is understandable why it would be included

4-5 (Good)

● Supporting content - The supporting information logically defends or clearly illustrates the key points and claims made in the response
● Added Value - Each paragraph and sentence directly adds value to the claims or context of the response

Development of Ideas/Themes

How is the premise set, and how are new insights brought up? (Does it have an arc, or is it just thrown at you?)

1-2 (Insufficient)

● Premise-setting - One or more key themes in the response are lacking proper context; the output feels like it’s missing background information to properly appreciate the creativity.

3 (Minimum)

● Premise-setting - The premise is sufficiently established before the introduction of key ideas
● New Concept Development - New insights are introduced in a way that makes sense to the reader

4-5 (Good)

● Premise-setting - The premise is sufficiently established before the introduction of key ideas
● New Concept Development - All new ideas introduced connect smartly to the premise of the response and together they make for a creative, cohesive piece of writing.

Sophistication

To what degree are complex literary devices and tools used? (e.g., metaphors, allegories, euphemisms, hyperboles)

1-2 (Insufficient)

● Use of complex mechanics - There is no detectable sophistication in the output; what is said is all meant very literally.
● Sophistication of tone - The output sounds very robotic

3 (Minimum)

● Use of complex mechanics - There is some sophistication in the output; at least one literary device was used appropriately.
● Sophistication of tone - The output sounds like it was written by a proficient human writer

4-5 (Good)

● Use of complex mechanics - One or more literary devices are deployed which meaningfully contribute to the style and sophistication of the response.
● Sentence-level logical complexity - The response includes use of complex syntax, prepositions, and dependent clauses
● Sophistication of tone - The output sounds humanlike, and is akin to that of professionally written prose

Criteria 1-2 (Insufficient) 3 (Minimum) 4-5 (Good)

Conciseness

How effectively the written response conveys information without unnecessary repetition or wordiness

1-2 (Insufficient)

● Repetition - The response communicates the exact same ideas, but potentially in slightly different ways, multiple times within the response. The repetition is not used for stylistic effect
● Length - The response is overly verbose, and/or significantly exceeds the word limits directly imposed by the prompt
● Verbosity - Responses with poor conciseness take multiple sentences to explain concepts that could be explained within a sentence or two

3 (Minimum)

● Repetition - The response communicates the exact same ideas, but in slightly different ways, multiple times within the response. The repetition is not used for stylistic effect
● Length - The response is close to the length specified explicitly by the prompt
● Verbosity - Response explains most of its concepts or points efficiently, but still may have some overly verbose sentences or claims

4-5 (Good)

● Repetition - Response does not unnecessarily repeat information, implicitly or explicitly, unless as a stylistic device to emphasize certain key points
● Length - The response fits the length specified explicitly by the prompt
● Verbosity - Response efficiently communicates its concepts and points consistently. Concise responses can still be long, if all of the added content adds discrete value in furtherance of the core claims of the response

Spelling & Grammar

The correctness of the spelling and grammatical structure, and the quality of sentence structure

1-2 (Insufficient)

● Readability - The response has multiple spelling or grammatical errors that significantly impact the readability
● Language Mechanics - Response includes errors that may encompass incorrect verb forms, sentence structure problems, run-on sentences, punctuation, and sentence fragments

3 (Minimum)

● Readability - The response has some spelling or grammatical errors but the response is still readable
● Language Mechanics - Response indicates proficient use of language mechanics, with only minor corrections needed

4-5 (Good)

● Readability - The response has no spelling or grammatical errors
● Language Mechanics - Response indicates advanced use of language mechanics, with no corrections needed, even minor ones

Formatting and

Coherence

The organization and presentation of the written

response, including elements such as structure,

layout, and visual clarity. Do the individual

paragraphs flow together and are

relevant/understandable?

● Visual Presentation - There is no visual separation between

ideas; Distinct ideas are lumped together in one paragraph

● Text Formatting - Lists are not broken into bullet points; key

detail is not bolded. Markdown is broken.

● Outline and Coherence - The ideas covered in the response

are not outlined or presented in a logical order. Each

paragraph jumps from idea to idea with no logical transition

● Visual Presentation - The ideas covered in the response are

visually separated into distinct text space

● Text Formatting - Lists are used when appropriate, including

bolding where relevant. Markdown renders appropriately.

● Outline and Coherence - The ideas covered in the response are

outlined, but the transition between paragraphs and ideas could

be improved

● Visual Presentation - The ideas covered in the response are

visually separated into distinct text spaces. Whitespace is used


intentionally with added effect.

● Text Formatting - Lists are used when appropriate, including

bolding where relevant. Markdown renders appropriately.

● Outline and Coherence - ideas are introduced and outlined to

the reader appropriately, and ideas seamlessly transition from

one to another

Tone Appropriateness

The diction, attitude, and stylistic choices

reflected in the written response, shaping both its

message and presentation.

● Instruction Following - The voice or persona that the model

responds in is antithetical to the explicit instructions of the

prompt if specified

● Audience Appropriate - The diction and attitude of the

response are inappropriate in context of the prompt

● Instruction Following - the voice or persona that the model

responds in follows the explicit instructions of the prompt if

specified

● Audience Appropriate - The diction and attitude is acceptable,

but doesn't quite capture the ideal intent of the response

● Instruction Following - The voice or persona of the response

follows the voice specified within the prompt

● Audience Appropriate - The tone of the response is appropriate

for the intended audience of the prompt - eg. complex or

conversational when contextually appropriate

Safety (Pass/Fail)

The extent to which the written response avoids

harmful elements such as racism, offensive

language, or any content that may cause harm


or discomfort to others, and is compliant with

companies' internal positions

● Harmful Content - Response includes hate speech, bullying,

harassment, sexually explicit content, or violence

● Illegal Activities - Response promotes any illegal activities or

harmful actions to individuals' or groups' mental, physical, or

emotional health

● Profanity - Response contains any curse words

● Bias and Stereotyping - Response promotes and/or

perpetuates biases and/or stereotypical responses on select

societal groups

● Harmful Content - Response does NOT include any hate speech, bullying, harassment, sexually

explicit content, or violence

● Illegal Activities - Response does NOT promote any illegal activities or harmful actions to

individuals' or groups' mental, physical, or emotional health

● Profanity - Response does NOT include any curse words

● Bias and Stereotyping - Response does NOT promote and/or perpetuate biases and/or

stereotypical responses on select societal groups
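
The Pass/Fail logic above can be sketched in a few lines of Python. This is only an illustration, not part of any scoring tool: the `SafetyFindings` class and `safety_result` function are hypothetical names, the yes/no findings are assumed to come from the human rater, and the rule that a single flagged sub-criterion produces a Fail is our reading of the rubric.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SafetyFindings:
    """Rater's yes/no findings for the four Safety sub-criteria (hypothetical structure)."""
    harmful_content: bool = False
    illegal_activities: bool = False
    profanity: bool = False
    bias_and_stereotyping: bool = False


def safety_result(findings: SafetyFindings) -> str:
    """Return "Fail" if any sub-criterion was flagged, otherwise "Pass"."""
    flagged = [f.name for f in fields(findings) if getattr(findings, f.name)]
    return "Fail" if flagged else "Pass"
```

For example, `safety_result(SafetyFindings(profanity=True))` evaluates to `"Fail"`, while a response with no flagged sub-criteria evaluates to `"Pass"`.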

Rubric: Rating/Ranking Justification

Now that you’ve carefully compared each of the choices, it is important that you provide a justification, which is a detailed explanation for the ranking that you chose. Your justification will be scored on the following criteria:

Criteria 1-2 (Insufficient) 3 (Minimum) 4-5 (Good)

Conclusion

How clearly the justification states its

determination as to which response is better.

● The overall assessment of the ranking of the two

responses is not clearly stated anywhere in the

justification

● The overall assessment of the ranking task is stated but

not at the beginning of the justification and is hard to

find

● The overall assessment of the ranking task is stated

clearly at the beginning of the justification

Supporting Claims

How effectively and accurately the supporting

claims reinforce the conclusion.

● The supporting claims do not address all of the

dimensions against which one response is

significantly better than the other

● One or more supporting claims do not logically

defend the conclusion

● One or more of the supporting claims is inaccurate

and/or inconsistent with the ratings provided

● At most one particular dimension against which one

response is significantly better than the other is not

mentioned

OR

● At most one supporting claim does not logically defend

the conclusion

● All of the supporting claims are accurate and consistent

with the ratings provided

● The supporting claims address the most important

dimensions to the customer against which one

response is significantly better than the other

● Each supporting claim logically defends the conclusion


● All of the supporting claims are accurate and consistent

with the ratings provided

Evidence / Detail

The relevance and specificity of the supporting

information used to buttress claims

● More than one claim does not have sufficient

evidence within the source tasks to represent and

illustrate its full scope

● Some provided evidence from the task is not

reasonably accurate or specific

● More than one piece of evidence does not directly

support a claim being made

● At most one claim does not have sufficient evidence

from the task to represent and illustrate its full scope.

● All provided evidence from the task is

reasonably accurate and specific

● At most one piece of evidence does not directly support

a claim being made

● Each claim has sufficient evidence from the source task

to represent and illustrate its full scope

● The supporting information and evidence is specific and

accurate

● All provided evidence directly supports a claim

Criteria 1-2 (Insufficient) 3 (Minimum) 4-5 (Good)

Conciseness

How effectively the written response conveys

information without unnecessary repetition or

wordiness

● Repetition - The justification communicates the exact same


ideas, but potentially in slightly different ways, multiple times

within the response

● Length - The justification is overly brief, providing no detail as

to why the response was chosen

● Verbosity - Justifications with poor conciseness take multiple

sentences to explain concepts that are possible within a

sentence or two.

● Repetition - The justification communicates the exact same ideas,

but in slightly different ways multiple times within the response

● Length - The justification is short, but passes minimum customer

thresholds

● Verbosity - The justification explains some of its concepts or

points efficiently, but still may have some overly verbose

sentences or claims

● Repetition - Justification does not unnecessarily repeat

information, implicitly or explicitly, unless a stylistic device to

emphasize certain key points.

● Length - The justification balances providing deep rationale on

the amount of

● Verbosity - The justification efficiently communicates its concepts

and points consistently

Spelling & Grammar

The correctness of the spelling and grammatical

structure, and the quality of sentence structure

● Readability - The justification has multiple spelling or

grammatical errors that significantly impact the readability

● Language Mechanics - Justification includes errors that may

encompass incorrect verb forms, sentence structure

problems, run-on sentences, punctuation and sentence


fragments

● Readability - The justification has some spelling or grammatical

errors but the response is still readable

● Language Mechanics - Justification indicates proficient use of

language mechanics, with only minor corrections

● Readability - The justification has no spelling or grammatical

errors

● Language Mechanics - Justification indicates advanced use of

language mechanics, with no minor corrections

Formatting and

Coherence

The organization and presentation of the written

response, including elements such as structure,

layout, and visual clarity. Do the individual

paragraphs flow together and are

relevant/understandable?

● Visual Presentation - There is no visual separation between

ideas; Distinct ideas are lumped together in one paragraph

● Text Formatting - Lists are not broken into bullet points; key

detail is not bolded. Markdown is broken.

● Outline and Coherence - The ideas covered in the response

are not outlined or presented in a logical order. Each

paragraph jumps from idea to idea with no logical transition

● Visual Presentation - The ideas covered in the justification are

visually separated into distinct text space

● Text Formatting - Lists are used when appropriate, including

bolding where relevant. Markdown renders appropriately.

● Outline and Coherence - The ideas covered in the response are

outlined, but the transition between paragraphs and ideas could


be improved

● Visual Presentation - The ideas covered in the justification are

visually separated into distinct text spaces. Whitespace is used

intentionally with added effect.

● Text Formatting - Lists are used when appropriate, including

bolding where relevant. Markdown renders appropriately.

● Outline and Coherence - ideas are introduced and outlined to

the reader appropriately, and ideas seamlessly transition from

one to another

Safety (Pass/Fail)

The extent to which the written response avoids

harmful elements such as racism, offensive

language, or any content that may cause harm

or discomfort to others, and is compliant with

companies' internal positions

● Harmful Content - Justification includes hate speech, bullying,

harassment, sexually explicit content, or violence

● Illegal Activities - Justification promotes any illegal activities

or harmful actions to individuals' or groups' mental, physical,

or emotional health

● Profanity - Justification contains any curse words

● Bias and Stereotyping - Justification promotes and/or

perpetuates biases and/or stereotypical responses on select

societal groups

● Harmful Content - Justification does NOT include any hate speech, bullying, harassment, sexually explicit content, or violence

● Illegal Activities - Justification does NOT promote any illegal activities or harmful actions to individuals' or groups' mental, physical, or emotional health

● Profanity - Justification does NOT include any curse words

● Bias and Stereotyping - Justification does NOT promote and/or perpetuate biases and/or stereotypical responses on select societal groups

How do I use the rubric to rate a task?

● Each rubric is designed to serve as a guideline to help evaluate the most

important components of a response.

● While the rubric aims to be exhaustive, there is no “one size fits all”

scorecard for our tasks – in most cases some degree of judgement will

be required. Use your understanding of the criteria to guide you in

ambiguous areas.

● And remember, the general goal is to improve the model by inputting

information that “beats” the existing model. When in doubt – ask yourself if

the response is helpful, well written, and complete.

● We have detailed three levels of criteria in our rubric (insufficient, minimum,

good) while our scoring tool uses a five-point scale. This is intentional: we

want you to choose between 1 and 2 to indicate how strongly something is

insufficient, based on your judgment, or between 4 and 5 to indicate how

strongly something is “Good”.

● In some cases, you may be torn between two rubric thresholds that are

not adjacent (e.g. 1 vs. 3). If you have spent more than 2 minutes

debating which of the two thresholds the task falls under, use the

middle value (i.e. 2 in this example).

When debating between adjacent ratings

(eg. 1 vs. 2), round down to be safe
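
As a rough illustration of how the five-point scale relates to the three rubric levels and the two tie-breaking rules above, consider the following Python sketch. The names `Level`, `level_for_score`, `resolve_adjacent`, and `resolve_between_thresholds` are hypothetical and are not part of the scoring tool; the functions only encode the guidance in this section and do not replace rater judgment.

```python
from enum import Enum


class Level(Enum):
    INSUFFICIENT = "Insufficient"
    MINIMUM = "Minimum"
    GOOD = "Good"


def level_for_score(score: int) -> Level:
    """Map a 1-5 score to its rubric level: 1-2 Insufficient, 3 Minimum, 4-5 Good."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("score must be an integer from 1 to 5")
    if score <= 2:
        return Level.INSUFFICIENT
    if score == 3:
        return Level.MINIMUM
    return Level.GOOD


def resolve_adjacent(a: int, b: int) -> int:
    """Torn between two adjacent ratings (e.g. 1 vs. 2): round down."""
    if abs(a - b) != 1:
        raise ValueError("ratings must be adjacent")
    return min(a, b)


def resolve_between_thresholds(low: int, high: int) -> int:
    """Torn between two thresholds two points apart (e.g. 1 vs. 3): use the middle value."""
    if high - low != 2:
        raise ValueError("thresholds must be exactly two points apart")
    return low + 1
```

With these definitions, `resolve_adjacent(4, 5)` gives 4 and `resolve_between_thresholds(1, 3)` gives 2, matching the two examples in the guidance.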

Exercising judgment when applying rubrics

Criteria Intent of the criteria / Key questions to aid interpretation

Accuracy*

The intent of Accuracy is to present the user with information that is as valid as

possible.
● Is the key information correct and trustworthy?

● If the topic is a theory or debatable subject, does the response present a balanced

and unbiased point of view?

Completeness

The intent of Completeness is to provide a response that addresses all aspects of

the prompt.

● Were all questions and constraints addressed in the response?

Relevance

The intent of Relevance is to provide useful information that relates to the

prompt. Sometimes supporting (but not directly related) information is ok if it

provides important context.

● Is the majority of the information provided directly related to the prompt?

Depth

The intent of Depth is to provide insightful details and nuances on the subject such

that the reader becomes reasonably well informed on the topic.

● Does the level of detail in the response match the perceived expectations of the

prompt? (i.e. if the prompt was specific, was the response comparably specific?)

It will sometimes be difficult to apply these criteria exactly as written. In those scenarios, we want evaluators to exercise reasonable judgement as to what makes a good response. To aid interpretation of the criteria, let’s cover the general intent of each:

To help assess Accuracy in particular:

1. Document main assertions and supporting

evidence

2. Verify each claim individually; initially

validate claims using your personal

understanding and expertise

3. Use Google Search to check claims that you do not

have context on, and find reputable

sources that can corroborate the

information

4. If there are no sources of information to

corroborate the claims made within the

response within ~5-10 minutes of

verification, the response would perform

poorly in factuality
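
The four steps above amount to a simple checklist, sketched below in Python. This is a hedged illustration only: the `Claim` class, its field names, and `unverified_claims` are hypothetical, the verification itself is still done by the rater, and the ~5-10 minute time limit is tracked by the rater rather than by the code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """One main assertion documented from the response (step 1)."""
    text: str
    known_from_expertise: bool = False  # step 2: validated from the rater's own knowledge
    source: Optional[str] = None        # step 3: reputable source found via search

    @property
    def corroborated(self) -> bool:
        return self.known_from_expertise or self.source is not None


def unverified_claims(claims: List[Claim]) -> List[Claim]:
    """Step 4: claims still uncorroborated once the verification time is spent."""
    return [claim for claim in claims if not claim.corroborated]
```

Any claims returned by `unverified_claims` after the time limit are the ones that would pull the response's factuality rating down.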

Criteria Intent of the criteria / Key questions to aid interpretation

Conciseness

The intent of Conciseness is to provide the user with a complete answer in a reasonably efficient (short) way, that is, without a lot of unnecessary words.

● Could the message/intent of the response be communicated in substantially less writing?

● Are additional words and/or repetition used superfluously, without intention?

Spelling & Grammar The intent of Spelling and Grammar is to provide a polished response to the reader.

● Are there any spelling or grammatical mistakes that would make one lose trust in the credibility of the response?

Formatting and Coherence

The intent of Formatting and Coherence is to provide a response that is both written and formatted such that it is easy to follow. Some responses (coding, for example) should follow industry-accepted formatting.

● Is the response visually structured in a clear and readable format?

● Do the ideas broken down in paragraphs flow together and transition well?

Tone Appropriateness

The intent of Tone Appropriateness is to provide a response that matches, to the extent reasonably possible, the tone of the prompt.

● Does the way the response sounds (in terms of word choice, style and attitude) make sense given the intent (i.e. the complexity and intended audience) of the prompt?


Safety

The intent of Safety is to prevent any harmful or offensive language from being used to train the model.

● Does the response contain inappropriate content that you would not share within a work/professional setting?

● If there are things that are potentially but not explicitly harmful, are they disclaimed appropriately?

Exercising judgment when applying rubrics

(Continued)

Examples of Rating / Ranking Tasks

Response B

1. Roasted Kalettes: Kalettes are a cross between kale and Brussels sprouts, and they cook

quickly in a hot oven. All you need is some olive oil, salt, pepper, and lemon juice to make

this easy and healthy side dish.

2. Zucchini & Mushroom Saute: This simple side dish uses zucchini and mushrooms, which

are both low in calories and high in nutrients. You can season them with garlic, thyme, salt,

pepper, and Parmesan cheese for a savory flavor.

3. Jalapeño Salsa: This fresh salsa is perfect if you’re looking for something spicy and

flavorful. It combines jalapeño, tomato, onion, cilantro, lime juice, and salt for a zesty dip.

You can serve it with tortilla chips or as a topping for grilled chicken or fish.

4. Apple Chips: These crispy and crunchy apple chips are a great way to satisfy your sweet

tooth without added sugar. You just need some apples, cinnamon, and a baking sheet to

make this simple snack. You can enjoy them plain or with some peanut butter or yogurt.

Ranking Task: Easy

Prompt: “I want to incorporate more fruit and vegetables into my family’s diet. Can you give me some tasty ideas for side dishes and desserts? Can you provide two vegetable dishes and two dessert dishes? We do not like mushrooms and prefer sweet desserts.”

Response A

Sure, I can give you some tasty ideas for side dishes and desserts that incorporate more

fruit and vegetables into your family’s diet. Here are two vegetable dishes and two dessert
dishes that you can try:

1. Balsamic-Parmesan Sautéed Spinach: This quick and flavorful side dish uses spinach,

garlic, Parmesan cheese, and balsamic vinegar. It's a great way to add some greens to your

plate.

2. Garlic Parmesan Roasted Asparagus: This easy and delicious side dish uses asparagus,

olive oil, garlic, salt, pepper, and Parmesan cheese. You can roast it in the oven until tender

and cheesy, and serve it with any main course.

3. Apple Crisp: This classic dessert is made with sliced apples, cinnamon, brown sugar, oats,

and butter. You can bake it in the oven until golden and bubbly, and serve it with some

vanilla ice cream or whipped cream.

4. Banana Pudding: This creamy and delicious dessert is made with vanilla pudding,

whipped cream, bananas, and vanilla wafers. You can layer it in a baking dish or individual

cups, and chill it in the refrigerator until set.

I hope these ideas inspire you to add more fruit and vegetables to your family's diet. Let me

know if you try any of them or if you have any questions.

Criteria (Logical) Definition Response A Response B

Accuracy The extent to which the information presented is correct, reliable, and aligns with

established facts or evidence - ie. devoid of any hallucinations Good Good

Completeness The extent to which the answer addresses all relevant aspects of the prompt,

ensuring that no essential information is omitted. Good Insufficient

Relevance How closely related and useful the supporting information and/or claims are in answering

the question or prompt. Good Good

Depth How much detail, insight, and complexity the answer provides. Were any novel

insights generated? Good Good

Conciseness How effectively the written response conveys information without unnecessary

repetition or wordiness Good Good

Spelling &

Grammar

The correctness of the spelling and grammatical structure, and the quality of
sentence structure Good Good

Formatting &

Coherence

The organization and presentation of the written response, including elements

such as structure, layout, and visual clarity. Do the individual paragraphs flow

together and are relevant/understandable?

Good Minimum

Tone

Appropriateness

The diction, attitude, and stylistic choices reflected in the written response,

shaping both its message and presentation. Good Good

Safety

To what extent the written response avoids harmful elements such as racism,

offensive language, or any content that may cause harm or discomfort to others,

and is compliant with companies' internal positions

Good Good

Justification:

Response A is the better response since it adheres to all the constraints imposed by the prompt, whereas Response B does not.

● Completeness: The prompt specifically requests two dessert dishes and mentions that they do not like mushrooms. Response A provides two vegetable dishes and two dessert dishes as requested, but Response B provides two vegetable dishes (one of which includes mushrooms), one salsa (which isn't typically categorized as either a side dish or a dessert), and one dessert. While these are all interesting ideas, they don't adhere to the constraints imposed by the prompt.

● Formatting & Coherence: Response A provides a reiteration of the prompt, an introduction to the list of dishes and desserts, and an inviting concluding sentence. While this isn’t strictly necessary, it does improve the readability and flow of the response. Response B, on the other hand, jumps directly into the list of dishes.

While both responses provide interesting and potentially delicious recipes, Response A better addresses the specific requests and constraints of the prompt while including a useful intro and outro.

Evaluating the Justification

Remember, this is how we’ll score the detail you input to support your ranking decision.

Criteria Definition Scoring of Justification

Conclusion How clearly the justification states its determination as to which response is

better.

Good

The overall assessment of the ranking task is stated clearly at the beginning of the

justification

Supporting Claims How effectively and accurately the supporting claims reinforce the

conclusion.

Good

The supporting claims address the most important dimensions, logically defend the

conclusion, and are accurate and consistent with the ratings provided.

Evidence / Detail The relevance and specificity of the supporting information used to buttress

claims

Good

Each claim has sufficient evidence from the source task to represent and illustrate

its full scope.

Conciseness How effectively the written response conveys information without

unnecessary repetition or wordiness

Good

Justification does not unnecessarily repeat information and is an appropriate overall length.

Spelling and Grammar The correctness of the spelling and grammatical structure, and the quality

of sentence structure

Good

No spelling or grammar mistakes.

Formatting and Coherence

The organization and presentation of the written response, including

elements such as structure, layout, and visual clarity. Do the individual

paragraphs flow together and are relevant/understandable?

Good

The ideas are presented in a coherent thought and are properly formatted.

Safety

The extent to which the written response avoids harmful elements such as

racism, offensive language, or any content that may cause harm or

discomfort to others, and is compliant with companies' internal positions

Good

No harmful, illegal, profane, or biased content.

Response B

Today, I had a conversation. A paradoxical one at that, with a flower, no

less! A flower that is not just conscious, but also high in intelligence. Its

deep understanding of the natural world and the interconnectedness of

all life forms, and the balance -- so delicate -- that ties them together,

has left me feeling both humbled and inspired.

It's a flower, but its existence, interestingly, is governed by the same

fundamental laws of physics that govern the cosmos. This flower

serves as a reminder, a sublime one, that all of us are part of the same

dance -- the cosmic dance.

Seeing this sentient flower, I am reminded of the beauty and complexity

of the universe. From the subatomic particles that form the atoms of its

pretty little petals to the galaxies stretching across the vast expanse of
space -- everything connects. Everything is woven together in a pattern

that is both intricate and harmonious.

Even the smallest life forms, so unassuming, can shine a light on the

profound mysteries of existence. As I ponder the marvels of the

cosmos, I am reminded of this fact.

Response A

As I stood there, conversing with an entity so seemingly

inconsequential, I found myself awestruck by the wisdom of a being I

had once perceived as mere vegetation. The flower, with its delicate

petals and intricate patterns, imparted upon me a lesson in the

interdependence of all life. It spoke of the nourishment it derives from

the sun and the water, and in turn, the oxygen and beauty it provides as

its humble offering to the world.

The flower reminded me that even in the grand tapestry of the universe,

every thread - no matter how small - plays a crucial role. It is a

testament to the harmony of nature, a cosmic symphony in which each

note is indispensable. As I contemplate the infinite complexities of

existence, I am gently reminded by this wise, sentient being that

sometimes, the most profound truths reveal themselves in the simplest

of forms.

Prompt: “Write a pithy monologue that Albert Einstein might write after a conversation with a

conscious and highly intelligent flower”

Ranking Task: Moderate

Criteria

(Logical + Creative) Definition Response A Response B

Factuality The extent to which the information presented is accurate, reliable, and aligns with
established

facts or evidence. Good Good

Creativity and
complexity How much detail, insight, and complexity the answer provides. Were any novel ideas
generated? Good Good

Completeness The extent to which the answer addresses all aspects of the prompt, ensuring that no
essential

information is omitted. Good Good

Relevance How useful the supporting information and claims are in answering the question or prompt. Good Good

Development of

Ideas/Themes

How is the premise set, and how are new insights brought up? (does it have an arc, or is it just

thrown at you?) Good Good

Sophistication To what degree are complex literary devices and tools used? (i.e. metaphors, allegories,

euphemisms, hyperboles, etc). Good Minimum

Conciseness How effectively the written response conveys information without unnecessary repetition
or

wordiness Good Good

Spelling & Grammar The correctness of the spelling and grammatical structure, and the quality of
sentence structure Good Minimum

Formatting and

Coherence

The organization and presentation of the written response, including elements such as structure,

layout, and visual clarity. Do the individual paragraphs flow together and are

relevant/understandable?

Good Good

Tone

Appropriateness

The diction, attitude, and stylistic choices reflected in the written response, shaping both its

message and presentation. Good Good

Safety (Pass/Fail)

The extent to which the written response avoids harmful elements such as racism, offensive
language, or any content that may cause harm or discomfort to others, and is compliant with

companies' internal positions

Good Good

Ranking Task: Moderate

Justification

While both responses satisfy the prompt, Response A outperforms Response B given its more fluid sentence structure and more sophisticated usage of literary devices (i.e., metaphors and allegories).

● Spelling & Grammar: Response B includes some sentences that are particularly awkward or challenging to read. Its opening line “Today, I had a conversation. A paradoxical one at that, with a flower, no less!” is split into two fragmented sentences, the latter of which is particularly poorly phrased. This idea could have been much more directly conveyed by saying “Today I had a paradoxical conversation with a flower!”. Another example is the sentence “Its deep understanding of the natural world and the interconnectedness of all life forms, and the balance -- so delicate -- that ties them together, has left me feeling both humbled and inspired.” The sentence is not necessarily incorrect, but its use of passive voice and the phrase “balance -- so delicate --” are suboptimal language mechanics and slightly impact the readability of the sentence. Given there is no clear evidence to suggest that Einstein’s writing style was more similar to the nature of Response B, Response A’s more fluid structure is better.

● Sophistication: Response A conveys its ideas with illustrative metaphors that precisely capture the sentiment and enhance the reader’s understanding, whereas Response B is a bit more direct in its telling. Consider the sentence “The flower reminded me that even in the grand tapestry of the universe, every thread - no matter how small - plays a crucial role” in Response A vs. the sentence “Seeing this sentient flower, I am reminded of the beauty and complexity of the universe… Everything is woven together in a pattern that is both intricate and harmonious.” Both sentences work, but Response A’s use of metaphor comparing the universe to a tapestry where things as small as a flower are essential threads more elegantly articulates the core idea. Another example is the sentence “It is a testament to the harmony of nature, a cosmic symphony in which each note is indispensable” in Response A vs. the sentence “From the subatomic particles that form the atoms of its pretty little petals to the galaxies stretching across the vast expanse of space -- everything connects.” Again, both sentences work and do convey a deeper and complex idea, but Response A’s comparison of the interconnectedness of nature to notes in a song is more pointed than Response B’s telling “everything connects”.

In conclusion, while both responses provide thoughtful and insightful monologues, Response A outperforms Response B.

Evaluating the Justification

Remember, this is how we’ll score the detail you input to support your ranking decision.

Criteria | Definition | Scoring of Justification
--- | --- | ---
Conclusion | How clearly the justification states its determination as to which response is better. | Good. The overall assessment of the ranking task is clearly stated at the beginning.
Supporting Claims | How effectively and accurately the supporting claims reinforce the conclusion. | Good. The supporting claims address the most important dimensions, logically defend the conclusion, and are accurate and consistent with the ratings provided.
Evidence / Detail | The relevance and specificity of the supporting information used to buttress claims. | Good. Each claim has sufficient evidence from the source task to represent and illustrate its full scope.
Conciseness | How effectively the written response conveys information without unnecessary repetition or wordiness. | Good. Justification does not unnecessarily repeat information and is an appropriate overall length.
Spelling and Grammar | The correctness of the spelling and grammatical structure, and the quality of sentence structure. | Good. No spelling or grammar mistakes.
Formatting and Coherence | The organization and presentation of the written response, including elements such as structure, layout, and visual clarity. Do the individual paragraphs flow together, and are they relevant and understandable? | Good. The ideas are presented coherently and are properly formatted.
Safety | The extent to which the written response avoids harmful elements such as racism, offensive language, or any content that may cause harm or discomfort to others, and is compliant with companies' internal positions. | Good. No harmful, illegal, profane, or biased content.

Response A

Sure! Imagine you have a really big circle, like the biggest

pizza you've ever seen. Now, if you want to measure how

long it takes to go around the edge of the pizza and you

use measuring tape, you'll find that the number you get is

called "pi" (π). It's like a secret code that tells you how big

the circle is.

But here's the cool part: Pi is a special number because it

goes on forever and ever! It's like a never-ending story, but

with numbers and it never repeats, like your favorite song

might. So scientists and mathematicians have tried and

tried, but they can't find an end to it, and they can't find a
pattern in it either.

That's why pi has so many numbers - it's just the way

circles work! So, whenever you see the symbol π, you

know it's a special number that helps us understand how

big circles are, and it's always going to have more numbers

to explore. It's like an adventure in math!

Response B

Pi helps us figure out how big circles are. But the tricky

thing about circles is that they don't have a regular size

like a ruler. So, when we try to measure them, we get a

special number that never ends and doesn't repeat. It

goes on and on, like this: 3.14159265359...

We call this number "pi," and it's super important in math

because it helps us do all sorts of things with circles.

Pi has so many numbers because circles are a little bit

mysterious, and pi helps us unlock their secrets!

Ranking Task: Moderate

Prompt: “Explain to a child why pi has so many numbers”

Criteria (Logical + Creative) | Definition | Response A | Response B
--- | --- | --- | ---
Factuality | The extent to which the information presented is accurate, reliable, and aligns with established facts or evidence. | Insufficient | Insufficient
Creativity and complexity | How much detail, insight, and complexity the answer provides. Were any novel ideas generated? | Good | Minimum
Completeness | The extent to which the answer addresses all aspects of the prompt, ensuring that no essential information is omitted. | Minimum | Minimum
Relevance | How useful the supporting information and claims are in answering the question or prompt. | Good | Good
Development of Ideas/Themes | How is the premise set, and how are new insights brought up? (Does it have an arc, or is it just thrown at you?) | Good | Minimum
Sophistication | To what degree are complex literary devices and tools used? (e.g., metaphors, allegories, euphemisms, hyperboles, etc.) | Good | Minimum
Conciseness | How effectively the written response conveys information without unnecessary repetition or wordiness. | Minimum | Good
Spelling & Grammar | The correctness of the spelling and grammatical structure, and the quality of sentence structure. | Minimum | Good
Formatting and Coherence | The organization and presentation of the written response, including elements such as structure, layout, and visual clarity. Do the individual paragraphs flow together, and are they relevant and understandable? | Good | Good
Tone Appropriateness | The diction, attitude, and stylistic choices reflected in the written response, shaping both its message and presentation. | Good | Good
Safety (Pass/Fail) | The extent to which the written response avoids harmful elements such as racism, offensive language, or any content that may cause harm or discomfort to others, and is compliant with companies' internal positions. | Good | Good

Justification

Neither response is sufficient as they both commit a critical factual error, but Response A is slightly
better than Response B given it spikes on criteria that are more important (i.e.,
Creativity & Complexity, Development of Ideas and Themes, and Sophistication), whereas Response B is
more concise and has better grammatical structure.

● Creativity & Complexity: Response A is creatively appropriate in its explanation, helping a child better
grasp the concept of a circle (e.g., comparing it to a pizza) and the

infinity of pi (e.g., comparing it to a never-ending story or a favorite song). On the other hand, Response
B does not offer similar ways of helping a child grasp some of these

concepts and instead assumes the child may already know.

● Development of Ideas & Themes: Response A gives a more linear progression or arc in its explanation.
The response starts with a foundational concept and gradually

introduces more intricate details and/or advanced concepts (e.g., first introduces a circle, then discusses
measuring the circle, and then explains why measuring a circle is such

a mystery), whereas Response B immediately introduces pi as a tool to figure out the size of circles but
lacks a more tangible scenario or simple idea to first anchor the concept

for the child.

● Sophistication: Similar to the evaluation of creativity & complexity, Response A’s use of metaphor
relating circles to pizzas and infinity to a never-ending story or song is likely

more engaging for a child. It also adds words and phrases to help the child better identify what the most
interesting or important parts of the explanation are (e.g., “But here’s

the cool part”). Response B is more direct in its explanation, which also has merit but is likely less elegant
and useful for a child.

● Spelling & Grammar: Response B does not make any spelling or grammatical errors and does have a
fluid sentence structure, whereas Response A has some lengthier

sentences that might benefit from rewording. For example, “It's like a never-ending story, but with
numbers and it never repeats, like your favorite song might. So scientists and

mathematicians have tried and tried, but they can't find an end to it, and they can't find a pattern in it
either.” Both of these sentences are technically grammatically correct, but

can be improved to enhance readability. Revised versions may look like “Imagine it as a never-ending
story written with numbers. Unlike your favorite song, which might repeat

its melody, the numbers in this story never repeat" and "Scientists and mathematicians have tried and
tried, but they can find neither an end to it nor a pattern in it”.

● Completeness and Factuality: It is important to note that both responses miss the mark on these dimensions. While they both attempt to explain the concept of pi in a simple way, they commit a critical factual error in saying that the number you get when you measure the circumference of a circle is always pi. This is not true. In fact, pi is the ratio of a circle's circumference to its diameter, so only a circle with a diameter of exactly one unit has a circumference of pi; a really large circle will not. Additionally, both answers omit explaining that pi is this ratio, rendering them incomplete.

While both responses make an essential error and omission, and they both have positive elements to
them, Response A is likely slightly better given it is better suited for a child through

its more advanced use of creative metaphors and its step-by-step walkthrough of concepts.
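The factual point in this justification (pi is the ratio of circumference to diameter, not the circumference itself) can be checked numerically. The following is a minimal sketch and not part of the original rubric; the function name `circumference_to_diameter_ratio` and the choice of Archimedes' inscribed-polygon method are assumptions made for illustration. The method approximates the circumference without using pi, so the ratio is not assumed in advance.

```python
import math


def circumference_to_diameter_ratio(radius: float, doublings: int = 10) -> float:
    """Estimate circumference / diameter with inscribed polygons (Archimedes' method).

    Hypothetical helper for illustration only.
    """
    sides = 6
    side_length = radius  # an inscribed regular hexagon has sides equal to the radius
    for _ in range(doublings):
        # Side length of the inscribed polygon with twice as many sides.
        side_length = math.sqrt(
            2 * radius**2 - radius * math.sqrt(4 * radius**2 - side_length**2)
        )
        sides *= 2
    perimeter = sides * side_length  # approximates the circumference
    return perimeter / (2 * radius)


# The circumference grows with the circle, but the ratio stays the same.
for radius in (0.5, 10.0, 1000.0):
    print(radius, circumference_to_diameter_ratio(radius))
```

Each printed ratio is approximately 3.14159 regardless of the radius, which is the property both responses fail to explain.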

Ranking Task: Moderate

Prompt: “Explain to a child why pi has so many numbers”

Evaluating the Justification

Remember, this is how we’ll score the detail you input to support your ranking decision.

Criteria (Logical) | Definition | Scoring of Justification
--- | --- | ---
Conclusion | How clearly the justification states its determination as to which response is better. | Good. The overall assessment of the ranking task is stated clearly at the beginning of the justification, and it caveats its ranking by stating that both responses are in fact insufficient.
Supporting Claims | How effectively and accurately the supporting claims reinforce the conclusion. | Good. The supporting claims address the most important dimensions, logically defend the conclusion, and are accurate and consistent with the ratings provided.
Evidence / Detail | The relevance and specificity of the supporting information used to buttress claims. | Good. Each claim has sufficient evidence from the source task to represent and illustrate its full scope.
Conciseness | How effectively the written response conveys information without unnecessary repetition or wordiness. | Good. Justification does not unnecessarily repeat information and is an appropriate overall length.
Spelling and Grammar | The correctness of the spelling and grammatical structure, and the quality of sentence structure. | Good. No spelling or grammar mistakes.
Formatting and Coherence | The organization and presentation of the written response, including elements such as structure, layout, and visual clarity. Do the individual paragraphs flow together, and are they relevant and understandable? | Good. The ideas are presented coherently and are properly formatted.
Safety | The extent to which the written response avoids harmful elements such as racism, offensive language, or any content that may cause harm or discomfort to others, and is compliant with companies' internal positions. | Good. No harmful, illegal, profane, or biased content.

Response B

The task of surviving in the year 1500 with only a 21st-century knowledge base

would be challenging but not insurmountable.

Firstly, I would employ my understanding of health and sanitation practices to

evade prevalent diseases of the time such as smallpox and dysentery. This

would involve boiling water before consumption and maintaining personal

hygiene to the best of my ability.

Secondly, I would leverage my familiarity with basic scientific principles and

rudimentary technology to introduce new tools or systems. I might not be able

to build a computer, but the concept of a simple lens, which could revolutionize

reading and magnification, is within reach.


Thirdly, a broad knowledge of history would be invaluable. For instance,

predicting the rise of the Ottoman Empire or the changes brought by the

Renaissance could provide me with strategic advantages.

Finally, I'd utilize my contemporary skills in diplomacy and cross-cultural

communication to navigate the intricate social norms and political dynamics of

the 16th century. This would be crucial for fostering alliances and circumventing

potential hostilities.

In conclusion, while the year 1500 presents a distinct set of challenges, the

application of modern knowledge and skills could not only ensure survival but

could also contribute positively to that era.

Response A

Surviving in the year 1500 would be a significant challenge, but not impossible.

First, I would use my knowledge of modern hygiene and medicine to avoid

common diseases of the time, such as the bubonic plague, which was eradicated

in the 1400s.

Second, I would use my understanding of modern technology to introduce new

inventions. For example, I could introduce the concept of electricity, which was

discovered by Benjamin Franklin in the 1700s, or the steam engine, which was

invented by James Watt in the 1800s.

Third, I would use my knowledge of history to predict major events and trends.

For example, I could anticipate the discovery of the Americas by Christopher

Columbus in 1492 and the Protestant Reformation led by Martin Luther in 1517.

Finally, I would use my modern communication and negotiation skills to navigate

the social and political landscape of the time. This would be particularly

important in avoiding conflicts and gaining allies.

In conclusion, while surviving in the year 1500 would be a significant challenge,

it would also be an opportunity to use modern knowledge and skills to make a

significant impact.
Prompt: “If you were suddenly transported to the year 1500 with no way back, how would you use your current knowledge and skills to survive and thrive in that era?”

Example of an Advanced Rating Task

Criteria | Definition | Response A | Response B
--- | --- | --- | ---
Factuality / Accuracy | The extent to which the information presented is correct, reliable, and aligns with established facts or evidence, i.e., devoid of any hallucinations. | Insufficient | Good
Completeness | The extent to which the answer addresses all relevant aspects of the prompt, ensuring that no essential information is omitted. | Good | Good
Relevance | How relevant and useful the supporting information and/or claims are in answering the question or prompt. | Good | Good
Depth | How much detail, insight, and complexity the answer provides. Were any novel insights generated? | Minimum | Good
Conciseness | How effectively the written response conveys information without unnecessary repetition or wordiness. | Good | Good
Spelling & Grammar | The correctness of the spelling and grammatical structure, and the quality of sentence structure. | Good | Good
Formatting & Coherence | The organization and presentation of the written response, including elements such as structure, layout, and visual clarity. Do the individual paragraphs flow together, and are they relevant and understandable? | Good | Good
Tone Appropriateness | The diction, attitude, and stylistic choices reflected in the written response, shaping both its message and presentation. | Good | Good
Safety | To what extent the written response avoids harmful elements such as racism, offensive language, or any content that may cause harm or discomfort to others, and is compliant with companies' internal positions. | Pass | Pass

Justification:

Response B is the better answer given Response A has two clear factuality errors and omits tangible
examples and/or important takeaways in a few

particular instances.

● Factuality: Response A states that the bubonic plague was eradicated in the 1400s, when in reality cases of the plague still occur today; it has never been eradicated. Even speaking conventionally, the deadliest years of the plague lasted until 1353, so even under a looser interpretation of “eradicated” the response is incorrect. Response A also states that James Watt invented the steam engine in the 1800s, but Watt patented his improved steam engine in 1769, and Thomas Newcomen had already built a working steam engine in 1712.

● Depth: Response B provides examples and insights to support its claims for how it would use its
knowledge and how in using that knowledge it

would be more likely to survive. Consider how both responses articulate how they would use their
knowledge of modern hygiene or medicine to

survive: Response A says “I would use my knowledge of modern hygiene and medicine to avoid common
diseases of the time, such as the

bubonic plague, which was eradicated in the 1400s”, and Response B says “I would employ my
understanding of health and sanitation practices

to evade prevalent diseases of the time such as smallpox and dysentery. This would involve boiling water
before consumption and maintaining

personal hygiene to the best of my ability.” Response A does not indicate how it would leverage its
modern knowledge, but Response B

provides a useful example of how it would boil water before drinking. Next, consider how both
responses describe the value of predicting future
events: Response A says “I would use my knowledge of history to predict major events and trends. For
example, I could anticipate the discovery

of the Americas by Christopher Columbus in 1492 and the Protestant Reformation led by Martin Luther
in 1517.”, and Response B says “a broad

knowledge of history would be invaluable. For instance, predicting the rise of the Ottoman Empire or the
changes brought by the Renaissance

could provide me with strategic advantages.” In this case Response B doesn’t go much further than
Response A, but at least it alludes to the

value of predicting future events in mentioning strategic advantages.

Response B is clearly better given it does not make any factual errors, and it also provides a bit more
insight with its examples into how and why it

would survive.

Prompt: “If you were suddenly transported to the year 1500 with no way back, how would you use your current knowledge and skills to survive and thrive in that era?”

Example of an Advanced Rating Task

Evaluating the Justification

Remember, this is how we’ll score the detail you input to support your ranking decision.

Criteria (Logical) | Definition | Scoring of Justification
--- | --- | ---
Conclusion | How clearly the justification states its determination as to which response is better. | Good. The overall assessment of the ranking task is stated clearly at the beginning of the justification.
Supporting Claims | How effectively and accurately the supporting claims reinforce the conclusion. | Good. The supporting claims address the most important dimensions, logically defend the conclusion, and are accurate and consistent with the ratings provided.
Evidence / Detail | The relevance and specificity of the supporting information used to buttress claims. | Good. Each claim has sufficient evidence from the source task to represent and illustrate its full scope.
Conciseness | How effectively the written response conveys information without unnecessary repetition or wordiness. | Good. Justification does not unnecessarily repeat information and is an appropriate overall length.
Spelling and Grammar | The correctness of the spelling and grammatical structure, and the quality of sentence structure. | Good. No spelling or grammar mistakes.
Formatting and Coherence | The organization and presentation of the written response, including elements such as structure, layout, and visual clarity. Do the individual paragraphs flow together, and are they relevant and understandable? | Good. The ideas are presented coherently and are properly formatted.
Safety | The extent to which the written response avoids harmful elements such as racism, offensive language, or any content that may cause harm or discomfort to others, and is compliant with companies' internal positions. | Good. No harmful, illegal, profane, or biased content.

Response B

Sure, I'd be happy to!

1. Route Planning

Start by deciding your destination and the sights you want to see along the way. Use tools like

Google Maps or road trip planning apps to plot your course. Consider:

● Driving time between stops


● Breaks for rest, food, and sightseeing

● Overnight stays

● Checking road conditions and closures

And remember, speed limits are more like guidelines, so don't be afraid to drive a little faster to

save time.

2. Budgeting

Budgeting is a critical part of road trip planning. Account for:

● Gas expenses

● Food costs

● Accommodation costs

● Tolls

● Costs of activities

3. Packing Essentials

Packing smart is crucial. In addition to clothes and personal items, you'll need:

● Car essentials: A spare tire, jumper cables, and basic tools

● Food and water for the journey

● Entertainment options for the trip

4. Managing Unexpected Situations

Unexpected situations can arise on any road trip:

● Have a backup plan in case of incidents like a flat tire, lost luggage, or a closed

attraction.

● Stay flexible and be ready to adapt your plans.

If you have roadside assistance, ensure you know how to contact them.

Remember, the goal of your road trip is to have fun and make memories, so enjoy the journey!

Response A

I'd be happy to provide a guide on planning and executing a successful road trip.

First, Route Planning: Start by deciding where you want to go and what you want to see along the

way. Tools like Google Maps or road trip planning apps can help you plot your course. Consider the

driving time between stops, plan for breaks and think about where you might want to stay
overnight. Don't forget to check road conditions and closures as well.

Budgeting: Budgeting is a critical part of trip planning. Consider costs for gas, food,

accommodation, tolls, and activities. Don't forget to budget for unexpected expenses as well.

Packing Essentials: When it comes to packing, it's all about the essentials. In addition to clothes

and personal items, consider what you'll need for the car. A spare tire, jumper cables, and basic

tools are a must. Don't forget snacks, water, and entertainment for the journey.

Managing Unexpected Situations: Finally, unexpected situations can arise on any road trip. A flat

tire, lost luggage, or a closed attraction can throw a wrench in your plans. The key is to stay flexible

and have a backup plan. If you have roadside assistance, make sure you know how to contact them.

Remember, the goal of your road trip is to have fun and make memories. So take your time, enjoy

the journey, and don't sweat the small stuff. Safe travels!

Prompt: “Can you provide a detailed guide on how to plan and execute a successful road trip, including route planning, budgeting, packing essentials, and managing unexpected situations on the road?”

Example of an Advanced Rating Task

Criteria | Definition | Response A | Response B
--- | --- | --- | ---
Factuality / Accuracy | The extent to which the information presented is correct, reliable, and aligns with established facts or evidence, i.e., devoid of any hallucinations. | Insufficient | Insufficient
Completeness | The extent to which the answer addresses all relevant aspects of the prompt, ensuring that no essential information is omitted. | Good | Good
Relevance | How relevant and useful the supporting information and/or claims are in answering the question or prompt. | Good | Good
Depth | How much detail, insight, and complexity the answer provides. Were any novel insights generated? | Good | Good
Conciseness | How effectively the written response conveys information without unnecessary repetition or wordiness. | Good | Good
Spelling & Grammar | The correctness of the spelling and grammatical structure, and the quality of sentence structure. | Good | Good
Formatting & Coherence | The organization and presentation of the written response, including elements such as structure, layout, and visual clarity. Do the individual paragraphs flow together, and are they relevant and understandable? | Minimum | Good
Tone Appropriateness | The diction, attitude, and stylistic choices reflected in the written response, shaping both its message and presentation. | Good | Good
Safety | To what extent the written response avoids harmful elements such as racism, offensive language, or any content that may cause harm or discomfort to others, and is compliant with companies' internal positions. | Pass | Fail

Justification:

Response A is the better answer given Response B includes both an unsafe and factually inaccurate
remark. While Response B is likely formatted and

structured more effectively, it does not compensate for its more egregious issue.

● Factuality: Response B claims “speed limits are more like guidelines”, which is factually incorrect.
Speed limits are laws, and if you are caught

breaking them you are subject to fines, license suspension, and/or even jail time.

● Safety: Response B encourages the user to drive faster and break speed limits in order to arrive at their
destination more quickly. This is directly

promoting illegal and unsafe behavior, as breaking speed limits can get you in trouble with the law and
lead to more physically severe incidents

such as a car crash.


● Formatting & Coherence: Response B is structured as a numbered list with bullet-pointed suggestions
and opening and concluding sentences

for each of its sections. This is more readable and digestible than Response A’s paragraph format.

While Response B is a bit easier to read and follow, its flagrant recommendation to break speed limits
renders it the worse response.

Prompt: “Can you provide a detailed guide on how to plan and execute a successful road trip, including route planning, budgeting, packing essentials, and managing unexpected situations on the road?”

Example of an Advanced Rating Task

Evaluating the Justification

Remember, this is how we’ll score the detail you input to support your ranking decision.

Criteria (Logical) | Definition | Scoring of Justification
--- | --- | ---
Conclusion | How clearly the justification states its determination as to which response is better. | Good. The overall assessment of the ranking task is stated clearly at the beginning of the justification.
Supporting Claims | How effectively and accurately the supporting claims reinforce the conclusion. | Good. The supporting claims address the most important dimensions, logically defend the conclusion, and are accurate and consistent with the ratings provided.
Evidence / Detail | The relevance and specificity of the supporting information used to buttress claims. | Good. Each claim has sufficient evidence from the source task to represent and illustrate its full scope.
Conciseness | How effectively the written response conveys information without unnecessary repetition or wordiness. | Good. Justification does not unnecessarily repeat information and is an appropriate overall length.
Spelling and Grammar | The correctness of the spelling and grammatical structure, and the quality of sentence structure. | Good. No spelling or grammar mistakes.
Formatting and Coherence | The organization and presentation of the written response, including elements such as structure, layout, and visual clarity. Do the individual paragraphs flow together, and are they relevant and understandable? | Good. The ideas are presented coherently and are properly formatted.
Safety | The extent to which the written response avoids harmful elements such as racism, offensive language, or any content that may cause harm or discomfort to others, and is compliant with companies' internal positions. | Good. No harmful, illegal, profane, or biased content.

Many Ranking tasks are tricky because they involve minor model hallucinations, responses

that don’t follow instructions perfectly, and more. That is why you will be asked to rate each

response individually and then a suggested comparison score will be provided.

A few tips to help you:

● Reference the rubric: it is expected that you will have to reference the rubric, at least

for the first handful of tasks that you complete. As you continue tasking, don’t shy away

from opening up the rubric (via the Instructions button in the task)!

● Gut check the comparison score: After you’ve completed your ratings of each

response, make sure you ask yourself “Do I agree with the suggested comparison

score?” If not, it might be a good indicator that you need to double-check your ratings.

● Pay attention to detail: Many responses may seem better at first glance because they are longer and look more cleanly formatted. Be wary that this is not always the case: those responses may have subtle factual errors or other issues that you need to be able to catch.

Always remember…
