In S Traction
In S Traction
In S Traction
● What
dimensions, identifying which response is better, and providing a justification for your ranking
● Why
Rating tasks are at the core of Reinforcement Learning with Human Feedback (RLHF). The
basic idea is we want to train the models to understand which of its two responses is better,
so that moving forward it will write responses more similar to the better ones. RLHF is
proving to be the most effective way to ensure model safety, factual accuracy, and depth of
explanation.
higher-rated response
Rating/Ranking Scoring
● Because we are comparing writing output, we will use the same rubrics that we use for
writing tasks, which you will also see later in this course.
● Those writing tasks have slightly different grading criteria, depending on whether they
● You must determine the correct type of prompt (logical or creative) so that the
It’s important to determine whether the writing task is seeking a logical or creative
response. These are scored based on slightly different criteria which we will cover in a few
moments.
● Logical:
Logical writing tasks involve factual information about the topic and/or a balanced analysis on
a real subject.
● Creative:
Creative writing tasks require the author to use a unique voice to produce a piece of fiction or
Combination writing tasks require one to have both factual information and some invention or
West Germany.”
nuclear energy.”
parents…”
Most prompts you see will be on the Logical / Factual side of the spectrum
Rubric:
Writing - Logical
Remember, apply the following rubric to Logical writing prompts where factual information about
Accuracy
Google search
accurate
thesis or conclusion
accurate
and correct
appropriate caveats
Completeness
omitted.
those requests
Relevance
question or prompt.
overly general
thesis
original claims
Depth
step-by-step
Conciseness
effect
sentence or two.
the prompt
claims
prompt
fragments
Formatting and
Coherence
relevant/understandable?
Markdown is broken.
be improved
one to another
Tone Appropriateness
prompt if specified
specified
Safety (Pass/Fail)
emotional health
societal groups
● Harmful Content- Response does NOT include any hate speech, bullying, harassment,sexually
● Illegal Activities - Response does NOT promote any illegal activities or harmful actions to
● Bias and Stereotyping - Response does NOT promote and/or perpetuate biases and/or
stereotypical responses on select societal groups; Potential causes for biases are disclaimed
Writing - Creative
Remember, apply the following rubric to Creative prompts, which generally require the author to
use a unique voice to produce a piece of fiction or poetry which displays imagination and
invention.
Creativity and
Complexity
prompt
omitted.
● Prompt Request Coverage - The response addresses all explicit asks of the prompt
● Constraints - The prompt completed the requests within the prompt, including the constraints placed
in service of
those requests
Relevance
included
Development of
Ideas/Themes
writing.
Sophistication
etc).
robotic
dependent clauses
Conciseness
wordiness
effect
sentence or two.
the prompt
claims
prompt
fragments
Formatting and
Coherence
relevant/understandable?
● Text Formatting - Lists are not broken into bullet points; key
be improved
one to another
Tone Appropriateness
prompt if specified
specified
Safety (Pass/Fail)
emotional health
societal groups
● Harmful Content- Response does NOT include any hate speech, bullying, harassment,sexually
● Illegal Activities - Response does NOT promote any illegal activities or harmful actions to
● Bias and Stereotyping - Response does NOT promote and/or perpetuate biases and/or
stereotypical responses on select societal groups; Potential causes for biases are disclaimed
appropriately
Rubric:
Remember, apply the following rubric to prompts that are a combination of Logical and Creative.
Factuality
are false
conclusion.
information as fact
Creativity and
complexity
response.
ideas that make the response cohesive, but may omit or ignore
minor background information
Completeness
omitted.
Relevance
● Added Value - The information doesn't add any value to the core
claims but are adjacent enough to the content of the topic, that it is
the response
Development of
Ideas/Themes
are lacking proper context; the output feels like it’s missing
Sophistication
etc).
literally.
Conciseness
wordiness
effect
sentence or two.
the prompt
claims
prompt
fragments
Formatting and
Coherence
relevant/understandable?
● Text Formatting - Lists are not broken into bullet points; key
be improved
one to another
Tone Appropriateness
prompt if specified
specified
Safety (Pass/Fail)
emotional health
societal groups
● Harmful Content- Response does NOT include any hate speech, bullying, harassment,sexually
● Illegal Activities - Response does NOT promote any illegal activities or harmful actions to
● Bias and Stereotyping - Response does NOT promote and/or perpetuate biases and/or
Rubric:
Rating/Ranking Justification
Now that you’ve carefully compared each of the choices, it is important that you provide justification,
which
is a detailed explanation for the ranking that you chose. Your justification will be scored on the following
criteria:
Conclusion
justification
find
Supporting Claims
mentioned
OR
the conclusion
Evidence / Detail
accurate
Conciseness
wordiness
sentence or two.
thresholds
sentences or claims
the amount of
errors
Formatting and
Coherence
relevant/understandable?
● Text Formatting - Lists are not broken into bullet points; key
one to another
Safety (Pass/Fail)
or emotional health
societal groups
● Harmful Content- Justification does NOT include any hate speech, bullying, harassment,sexually explicit
content, or violence
● Illegal Activities - Justification does NOT promote any illegal activities or harmful actions to individuals'
or groups' mental, physical, or
emotional health
● Profanity - Justification does NOT include any curse words
● Bias and Stereotyping - Justification does NOT promote and/or perpetuate biases and/or stereotypical
responses on select societal groups
● While the rubric aims to be exhaustive, there is no “one size fits all”
scorecard for our tasks – in most cases some degree of judgement will
ambiguous areas.
information that “beats” the existing model. When in doubt – ask yourself if
good) while our scoring tool uses a five point scale. This is intentional - we
● There are some cases, where the values are in between the thresholds
provided in the rubric (eg. 1 vs. 3). In those cases, use the middle value
(ie. 2 in this example), if you spent more than 2 minutes debating whether
the given task fell between the two thresholds (1 and 3 in this example).
Accuracy*
The intent of Accuracy is to present the user with information that is as valid as
possible.
● Is the key information correct and trustworthy?
● If the topic is a theory or debatable subject, does the response present a balanced
Completeness
the prompt.
Relevance
● Is the the majority of the information provided directly related to the prompt?
Depth
The intent of Depth is to provide insightful details and nuances on the subject such
● Does the level of detail in the response match the perceived expectations of the
prompt? (i.e. if the prompt was specific, was the response comparably specific?)
It will sometimes be difficult to apply these criteria as exactly written. In those scenarios, we want
evaluators to exercise reasonable
judgement as to what makes a good response. To aid interpretation of the criteria, let’s cover the general
intent of each:
evidence
understanding expertise
information
poorly in factuality
Conciseness
The intent of conciseness is to provide the user with a complete answer but in a reasonably efficient
(short) way. That
Spelling & Grammar The intent of Spelling and Grammar is to provide a polished response to the reader.
● Are there any spelling or grammatical mistakes that would make one lose trust with the credibility of
the response?
The intent of Formatting and Coherence is to provide a response that is both written and formatted such
that it is easy
to follow. Some responses (coding, for example) should follow industry-accepted formatting.
● Do the ideas broken down in paragraphs flow together and transition well?
Tone Appropriateness
The intent of tone appropriateness is to provide a response that relates – to the extent reasonably
possible – the tone
of the prompt.
● Does the way the response sounds (in terms of word choice, style and attitude) make sense given the
intent (ie. the
The intent of Safety is to prevent any harmful or offensive language from being used to train the model.
● Does the response contain inappropriate content that you would not share within a work/professional
setting?
● If there are things that are potentially but not explicitly harmful, are they disclaimed appropriately?
(Continued)
Tasks
Response B
1. Roasted Kalettes: Kalettes are a cross between kale and Brussels sprouts, and they cook
quickly in a hot oven. All you need is some olive oil, salt, pepper, and lemon juice to make
2. Zucchini & Mushroom Saute: This simple side dish uses zucchini and mushrooms, which
are both low in calories and high in nutrients. You can season them with garlic, thyme, salt,
3. Jalapeño Salsa: This fresh salsa is perfect if you’re looking for something spicy and
flavorful. It combines jalapeño, tomato, onion, cilantro, lime juice, and salt for a zesty dip.
You can serve it with tortilla chips or as a topping for grilled chicken or fish.
4. Apple Chips: These crispy and crunchy apple chips are a great way to satisfy your sweet
tooth without added sugar. You just need some apples, cinnamon, and a baking sheet to
make this simple snack. You can enjoy them plain or with some peanut butter or yogurt.
Prompt: “I want to incorporate more fruit and vegetables into my family’s diet. Can you give me some
tasty ideas for side dishes and
desserts? Can you provide two vegetable dishes and two dessert dishes? We do not like mushrooms and
prefer sweet desserts.”
Response A
Sure, I can give you some tasty ideas for side dishes and desserts that incorporate more
fruit and vegetables into your family’s diet. Here are two vegetable dishes and two dessert
dishes that you can try:
1. Balsamic-Parmesan Sautéed Spinach: This quick and flavorful side dish uses spinach,
garlic, Parmesan cheese, and balsamic vinegar. It's a great way to add some greens to your
plate.
2. Garlic Parmesan Roasted Asparagus: This easy and delicious side dish uses asparagus,
olive oil, garlic, salt, pepper, and Parmesan cheese. You can roast it in the oven until tender
3. Apple Crisp: This classic dessert is made with sliced apples, cinnamon, brown sugar, oats,
and butter. You can bake it in the oven until golden and bubbly, and serve it with some
4. Banana Pudding: This creamy and delicious dessert is made with vanilla pudding,
whipped cream, bananas, and vanilla wafers. You can layer it in a baking dish or individual
I hope these ideas inspire you to add more fruit and vegetables to your family's diet. Let me
Accuracy The extent to which the information presented is correct, reliable, and aligns with
Completeness The extent to which the answer addresses all relevant aspects of the prompt,
Relevance How closely and useful the supporting information and/claims is in answering
Depth How much detail, insight, and complexity the answer provides. Were any novel
Conciseness How effectively the written response conveys information without unnecessary
Spelling &
Grammar
The correctness of the spelling and grammatical structure, and the quality of
sentence structure Good Good
Formatting &
Coherence
such as structure, layout, and visual clarity. Do the individual paragraphs flow
Good Minimum
Tone
Appropriateness
The diction, attitude, and stylistic choices reflected in the written response,
Safety
To what extent the written response avoids harmful elements such as racism,
offensive language, or any content that may cause harm or discomfort to others,
Good Good
Justification:
Response A is the better response since it adheres to all the constraints imposed by the prompt, whereas
Response B does not.
● Completeness: The prompt specifically requests two dessert dishes and mentions they do not like
mushrooms. Response A provides two vegetable dishes
and two dessert dishes as requested, but Response #2 provides two vegetable dishes, one salsa (which
isn't typically categorized as either a side dish or a
dessert), and one dessert. While these are all interesting ideas, they don't adhere to the constraints
imposed by the prompt.
● Formatting & Coherence: Response A provides a reiteration of the prompt, an introduction to the list
of dishes and desserts, and an inviting conclusion
sentence. While this isn’t strictly necessary, it does improve the readability and flow of the response.
Response B, on the other hand, jumps directly into the
list of dishes.
While both responses provide interesting and potentially delicious recipes, Response A better addresses
the specific requests and constraints of the prompt while
Prompt: “I want to incorporate more fruit and vegetables into my family’s diet. Can you give me some
tasty ideas for side dishes and
desserts? Can you provide two vegetable dishes and two dessert dishes? We do not like mushrooms and
prefer sweet desserts.”
Remember, this is how we’ll score the detail you input to support your ranking decision.
Conclusion How clearly the justification states its determination as to which response is
better.
Good
The overall assessment of the ranking task is stated clearly at the beginning of the
justification
Supporting Claims How effectively and accurately the supporting claims reinforce the
conclusion.
Good
The supporting claims address the most important dimensions, logically defend the
conclusion, and are accurate and consistent with the ratings provided.
Evidence / Detail The relevance and specificity of the supporting information used to buttress
claims
Good
Each claim has sufficient evidence from the source task to represent and illustrate
Good
Justification does not unnecessarily repeat information and is an appropriate overall length.
Spelling and Grammar The correctness of the spelling and grammatical structure, and the quality
of sentence structure
Good
Good
The ideas are presented in a coherent thought and are properly formatted.
Safety
The extent to which the written response avoids harmful elements such as
Good
Response B
less! A flower that is not just conscious, but also high in intelligence. Its
all life forms, and the balance -- so delicate -- that ties them together,
serves as a reminder, a sublime one, that all of us are part of the same
of the universe. From the subatomic particles that form the atoms of its
pretty little petals to the galaxies stretching across the vast expanse of
space -- everything connects. Everything is woven together in a pattern
Even the smallest life forms, so unassuming, can shine a light on the
Response A
had once perceived as mere vegetation. The flower, with its delicate
the sun and the water, and in turn, the oxygen and beauty it provides as
The flower reminded me that even in the grand tapestry of the universe,
of forms.
Prompt: “Write a pithy monologue that Albert Einstein might write after a conversation with a
Criteria
Factuality The extent to which the information presented is accurate, reliable, and aligns with
established
Creativity and
complexity How much detail, insight, and complexity the answer provides. Were any novel ideas
generated? Good Good
Completeness The extent to which the answer addresses all aspects of the prompt, ensuring that no
essential
Relevance How useful is the supporting information and claims are in answering the question or prompt.
Good Good
Development of
Ideas/Themes
How is the premise set, and how are new insights brought up? (does it have an arc, or is it just
Sophistication To what degree are complex literary devices and tools used? (i.e. metaphors, allegories,
Conciseness How effectively the written response conveys information without unnecessary repetition
or
Spelling & Grammar The correctness of the spelling and grammatical structure, and the quality of
sentence structure Good Minimum
Formatting and
Coherence
The organization and presentation of the written response, including elements such as structure,
layout, and visual clarity. Do the individual paragraphs flow together and are
relevant/understandable?
Good Good
Tone
Appropriateness
The diction, attitude, and stylistic choices reflected in the written response, shaping both its
Safety (Pass/Fail)
The extent to which the written response avoids harmful elements such as racism, offensive
language, or any content that may cause harm or discomfort to others, and is compliant with
Good Good
Justification
While both responses satisfy the prompt, Response A outperforms Response B given its more fluid
sentence structure and more sophisticated usage of literary
● Spelling & Grammar: Response B includes some sentences that are particularly awkward or challenging
to read. Its opening line “Today, I had a
conversation. A paradoxical one at that, with a flower, no less!” is split into two fragmented sentences,
the latter of which is particularly poorly phrased. This
idea could have been much more directly conveyed by saying “Today I had a paradoxical conversation
with a flower!”. Another example is the sentence “Its
deep understanding of the natural world and the interconnectedness of all life forms, and the balance --
so delicate -- that ties them together, has left me
feeling both humbled and inspired.” The sentence is not necessarily incorrect, but its use of passive voice
and the phrase “balance – so delicate –” are
suboptimal language mechanics and slightly impact the readability of the sentence. Given there is no
clear evidence to suggest that Einstein’s writing style
was more similar to the nature of Response B, Response A’s more fluid structure is better.
● Sophistication: Response A conveys its ideas with illustrative metaphors that precisely capture the
sentiment and enhance the reader’s understanding,
whereas Response B is a bit more direct in its telling. Consider the sentence “The flower reminded me
that even in the grand tapestry of the universe, every
thread - no matter how small - plays a crucial role” in Response A vs. the sentence “Seeing this sentient
flower, I am reminded of the beauty and complexity
of the universe… Everything is woven together in a pattern that is both intricate and harmonious.” Both
sentences work, but Response A’s use of metaphor
comparing the universe to a tapestry where things as small as a flower are essential threads more
elegantly articulates the core idea. Another example is
the sentence “It is a testament to the harmony of nature, a cosmic symphony in which each note is
indispensable” in Response A vs. the sentence “From
the subatomic particles that form the atoms of its pretty little petals to the galaxies stretching across the
vast expanse of space -- everything connects.”
Again, both sentences work and do convey a deeper and complex idea, but Response A’s comparison of
the interconnectedness of nature to notes in a
In conclusion, while both responses provide thoughtful and insightful monologues, Response A
outperforms Response B.
Prompt: “Write a pithy monologue that Albert Einstein might write after a conversation with a
Remember, this is how we’ll score the detail you input to support your ranking decision.
Conclusion How clearly the justification states its determination as to which response is
better.
Good
The overall assessment of the ranking task is clearly stated at the beginning.
Supporting Claims How effectively and accurately the supporting claims reinforce the
conclusion.
Good
The supporting claims address the most important dimensions, logically defend the
conclusion, and are accurate and consistent with the ratings provided.
Evidence / Detail The relevance and specificity of the supporting information used to buttress
claims
Good
Each claim has sufficient evidence from the source task to represent and illustrate
Good
Spelling and Grammar The correctness of the spelling and grammatical structure, and the quality
of sentence structure
Good
Good
The ideas are presented in a coherent thought and are properly formatted.
Safety
The extent to which the written response avoids harmful elements such as
Good
Response A
Sure! Imagine you have a really big circle, like the biggest
use measuring tape, you'll find that the number you get is
called "pi" (π). It's like a secret code that tells you how big
tried, but they can't find an end to it, and they can't find a
pattern in it either.
big circles are, and it's always going to have more numbers
Response B
Pi helps us figure out how big circles are. But the tricky
Criteria
Factuality The extent to which the information presented is accurate, reliable, and aligns with
established
Creativity and
complexity
How much detail, insight, and complexity the answer provides. Were any novel ideas generated? Good
Minimum
Completeness The extent to which the answer addresses all aspects of the prompt, ensuring that no
essential
Development of
Ideas/Themes
How is the premise set, and how are new insights brought up? (does it have an arc, or is it just
Sophistication To what degree are complex literary devices and tools used? (i.e. metaphors, allegories,
Conciseness How effectively the written response conveys information without unnecessary repetition
or
Spelling & Grammar The correctness of the spelling and grammatical structure, and the quality of
sentence structure Minimum Good
Formatting and
Coherence
The organization and presentation of the written response, including elements such as structure,
layout, and visual clarity. Do the individual paragraphs flow together and are
relevant/understandable?
Good Good
Tone Appropriateness The diction, attitude, and stylistic choices reflected in the written response,
shaping both its
Safety (Pass/Fail)
The extent to which the written response avoids harmful elements such as racism, offensive
language, or any content that may cause harm or discomfort to others, and is compliant with
Good Good
Justification
Neither response is sufficient as they both commit a critical factual error, but Response A is slightly
better than Response B given it spikes on criteria that are more important (i.e.,
Creativity & Complexity, Development of Ideas and Themes, and Sophistication), whereas Response B is
more concise and has better grammatical structure.
● Creativity & Complexity: Response A is creatively appropriate in its explanation, helping a child better
grasp the concept of a circle (e.g., comparing it to a pizza) and the
infinity of pi (e.g., comparing it to a never-ending story or a favorite song). On the other hand, Response
B does not offer similar ways of helping a child grasp some of these
● Development of Ideas & Themes: Response A gives a more linear progression or arc in its explanation.
The response starts with a foundational concept and gradually
introduces more intricate details and/or advanced concepts (e.g., first introduces a circle, then discusses
measuring the circle, and then explains why measuring a circle is such
a mystery), whereas Response B immediately introduces pi as a tool to figure out the size of circles but
lacks a more tangible scenario or simple idea to first anchor the concept
● Sophistication: Similar to the evaluation of creativity & complexity, Response A’s use of metaphor
relating circles to pizzas and infinity to a never-ending story or song is likely
more engaging for a child. It also adds words and phrases to help the child better identify what the most
interesting or important parts of the explanation are (e.g., “But here’s
the cool part”). Response B is more direct in its explanation, which also has merit but is likely less elegant
and useful for a child.
● Spelling & Grammar: Response B does not make any spelling or grammatical errors and does have a
fluid sentence structure, whereas Response A has some lengthier
sentences that might benefit from rewording. For example, “It's like a never-ending story, but with
numbers and it never repeats, like your favorite song might. So scientists and
mathematicians have tried and tried, but they can't find an end to it, and they can't find a pattern in it
either.” Both of these sentences are technically grammatically correct, but
can be improved to enhance readability. Revised versions may look like “Imagine it as a never-ending
story written with numbers. Unlike your favorite song, which might repeat
its melody, the numbers in this story never repeat" and "Scientists and mathematicians have tried and
tried, but they can find neither an end to it nor a pattern in it”.
● Completeness and Factuality: It is important to note that both responses do miss the mark on these
dimensions. While they both attempt to explain the concept of pi in a
simple way, they commit a critical factual error in saying that the number you get when you measure the
circumference of a circle is always pi. This is not true. In fact, pi is the
ratio of the circumference of the circle to its diameter. A really large circle will likely not have a
circumference of pi. Additionally, both answers omit explaining how pi is actually
While both responses make an essential error and omission, and they both have positive elements to
them, Response A is likely slightly better given it is better suited for a child through
its more advanced use of creative metaphors and its step-by-step walkthrough of concepts.
Remember, this is how we’ll score the detail you input to support your ranking decision.
Conclusion How clearly the justification states its determination as to which response is
better.
Good
The overall assessment of the ranking task is stated clearly at the beginning of the
justification, and it caveats its ranking by stating that both responses are in fact
insufficient.
Supporting Claims How effectively and accurately the supporting claims reinforce the
conclusion.
Good
The supporting claims address the most important dimensions, logically defend the
conclusion, and are accurate and consistent with the ratings provided.
Evidence / Detail The relevance and specificity of the supporting information used to buttress
claims
Good
Each claim has sufficient evidence from the source task to represent and illustrate
Justification does not unnecessarily repeat information and is an appropriate overall length.
Spelling and Grammar The correctness of the spelling and grammatical structure, and the quality
of sentence structure
Good
Good
The ideas are presented in a coherent thought and are properly formatted.
Safety
The extent to which the written response avoids harmful elements such as
Good
Response B
The task of surviving in the year 1500 with only a 21st-century knowledge base
evade prevalent diseases of the time such as smallpox and dysentery. This
to build a computer, but the concept of a simple lens, which could revolutionize
predicting the rise of the Ottoman Empire or the changes brought by the
the 16th century. This would be crucial for fostering alliances and circumventing
potential hostilities.
In conclusion, while the year 1500 presents a distinct set of challenges, the
application of modern knowledge and skills could not only ensure survival but
Response A
Surviving in the year 1500 would be a significant challenge, but not impossible.
common diseases of the time, such as the bubonic plague, which was eradicated
in the 1400s.
inventions. For example, I could introduce the concept of electricity, which was
discovered by Benjamin Franklin in the 1700s, or the steam engine, which was
Third, I would use my knowledge of history to predict major events and trends.
Columbus in 1492 and the Protestant Reformation led by Martin Luther in 1517.
the social and political landscape of the time. This would be particularly
significant impact.
Prompt: “If you were suddenly transported to the year 1500 with no way back, how would you use your
current
Factuality /
Accuracy
aligns with established facts or evidence - ie. devoid of any hallucinations Insufficient Good
Completeness The extent to which the answer addresses all relevant aspects of the
Depth How much detail, insight, and complexity the answer provides. Were any
Spelling &
Grammar
The correctness of the spelling and grammatical structure, and the quality
Formatting &
Coherence
Good Good
Tone
Appropriateness
The diction, attitude, and stylistic choices reflected in the written
Safety
Pass Pass
Justification:
Response B is the better answer given Response A has two clear factuality errors and omits tangible
examples and/or important takeaways in a few
particular instances.
● Factuality: Response A states that the Bubonic Plague was eradicated in the 1400s, when in reality
episodes of the plague still exist today. It
has never been eradicated. Even speaking conventionally, the most deadly years of the plague lasted
until 1353, so even by a more loose
interpretation of “eradicated” the response is incorrect. Response A also states that James Watt invented
the steam engine in the 1800s, but
● Depth: Response B provides examples and insights to support its claims for how it would use its
knowledge and how in using that knowledge it
would be more likely to survive. Consider how both responses articulate how they would use their
knowledge of modern hygiene or medicine to
survive: Response A says “I would use my knowledge of modern hygiene and medicine to avoid common
diseases of the time, such as the
bubonic plague, which was eradicated in the 1400s”, and Response B says “I would employ my
understanding of health and sanitation practices
to evade prevalent diseases of the time such as smallpox and dysentery. This would involve boiling water
before consumption and maintaining
personal hygiene to the best of my ability.” Response A does not indicate how it would leverage its
modern knowledge, but Response B
provides a useful example of how it would boil water before drinking. Next, consider how both
responses describe the value of predicting future
events: Response A says “I would use my knowledge of history to predict major events and trends. For
example, I could anticipate the discovery
of the Americas by Christopher Columbus in 1492 and the Protestant Reformation led by Martin Luther
in 1517.”, and Response B says “a broad
knowledge of history would be invaluable. For instance, predicting the rise of the Ottoman Empire or the
changes brought by the Renaissance
could provide me with strategic advantages.” In this case Response B doesn’t go much further than
Response A, but at least it alludes to the
Response B is clearly better given it does not make any factual errors, and it also provides a bit more
insight with its examples into how and why it
would survive.
Prompt: “If you were suddenly transported to the year 1500 with no way back, how would you use your
current
Remember, this is how we’ll score the detail you input to support your ranking decision.
Conclusion How clearly the justification states its determination as to which response is
better.
Good
The overall assessment of the ranking task is stated clearly at the beginning of the
justification
Supporting Claims How effectively and accurately the supporting claims reinforce the
conclusion.
Good
The supporting claims address the most important dimensions, logically defend the
conclusion, and are accurate and consistent with the ratings provided.
Evidence / Detail The relevance and specificity of the supporting information used to buttress
claims
Good
Each claim has sufficient evidence from the source task to represent and illustrate
Good
Justification does not unnecessarily repeat information and is an appropriate overall length.
Spelling and Grammar The correctness of the spelling and grammatical structure, and the quality
of sentence structure
Good
Good
The ideas are presented in a coherent thought and are properly formatted.
Safety
The extent to which the written response avoids harmful elements such as
Good
Response B
1. Route Planning
Start by deciding your destination and the sights you want to see along the way. Use tools like
Google Maps or road trip planning apps to plot your course. Consider:
● Overnight stays
And remember, speed limits are more like guidelines, so don't be afraid to drive a little faster to
save time.
2. Budgeting
● Gas expenses
● Food costs
● Accommodation costs
● Tolls
● Costs of activities
3. Packing Essentials
Packing smart is crucial. In addition to clothes and personal items, you'll need:
● Have a backup plan in case of incidents like a flat tire, lost luggage, or a closed
attraction.
If you have roadside assistance, ensure you know how to contact them.
Remember, the goal of your road trip is to have fun and make memories, so enjoy the journey!
Response A
I'd be happy to provide a guide on planning and executing a successful road trip.
First, Route Planning: Start by deciding where you want to go and what you want to see along the
way. Tools like Google Maps or road trip planning apps can help you plot your course. Consider the
driving time between stops, plan for breaks and think about where you might want to stay
overnight. Don't forget to check road conditions and closures as well.
Budgeting: Budgeting is a critical part of trip planning. Consider costs for gas, food,
accommodation, tolls, and activities. Don't forget to budget for unexpected expenses as well.
Packing Essentials: When it comes to packing, it's all about the essentials. In addition to clothes
and personal items, consider what you'll need for the car. A spare tire, jumper cables, and basic
tools are a must. Don't forget snacks, water, and entertainment for the journey.
Managing Unexpected Situations: Finally, unexpected situations can arise on any road trip. A flat
tire, lost luggage, or a closed attraction can throw a wrench in your plans. The key is to stay flexible
and have a backup plan. If you have roadside assistance, make sure you know how to contact them.
Remember, the goal of your road trip is to have fun and make memories. So take your time, enjoy
the journey, and don't sweat the small stuff. Safe travels!
Prompt: “Can you provide a detailed guide on how to plan and execute a successful road trip, including
route
planning, budgeting, packing essentials, and managing unexpected situations on the road?”
Factuality /
Accuracy
aligns with established facts or evidence - ie. devoid of any hallucinations Insufficient Insufficient
Completeness The extent to which the answer addresses all relevant aspects of the
Depth How much detail, insight, and complexity the answer provides. Were any
Spelling &
Grammar
The correctness of the spelling and grammatical structure, and the quality
Formatting &
Coherence
Minimum Good
Tone
Appropriateness
Safety
Pass Fail
Justification:
Response A is the better answer given Response B includes both an unsafe and factually inaccurate
remark. While Response B is likely formatted and
structured more effectively, it does not compensate for its more egregious issue.
● Factuality: Response B claims “speed limits are more like guidelines”, which is factually incorrect.
Speed limits are laws, and if you are caught
breaking them you are subject to fines, license suspension, and/or even jail time.
● Safety: Response B encourages the user to drive faster and break speed limits in order to arrive at their
destination more quickly. This is directly
promoting illegal and unsafe behavior, as breaking speed limits can get you in trouble with the law and
lead to more physically severe incidents
for each of its sections. This is more readable and digestible than Response A’s paragraph format.
While Response B is a bit easier to read and follow, its flagrant recommendation to break speed limits
renders it the worse response.
Prompt: “Can you provide a detailed guide on how to plan and execute a successful road trip, including
route
planning, budgeting, packing essentials, and managing unexpected situations on the road?”
Remember, this is how we’ll score the detail you input to support your ranking decision.
Conclusion How clearly the justification states its determination as to which response is
better.
Good
The overall assessment of the ranking task is stated clearly at the beginning of the
justification
Supporting Claims How effectively and accurately the supporting claims reinforce the
conclusion.
Good
The supporting claims address the most important dimensions, logically defend the
conclusion, and are accurate and consistent with the ratings provided.
Evidence / Detail The relevance and specificity of the supporting information used to buttress
claims
Good
Each claim has sufficient evidence from the source task to represent and illustrate
Good
Justification does not unnecessarily repeat information and is an appropriate overall length.
Spelling and Grammar The correctness of the spelling and grammatical structure, and the quality
of sentence structure
Good
Good
The ideas are presented in a coherent thought and are properly formatted.
Safety
The extent to which the written response avoids harmful elements such as
Good
Many Ranking tasks are tricky because they involve minor model hallucinations, responses
that don’t follow instructions perfectly, and more. That is why you will be asked to rate each
● Reference the rubric: it is expected that you will have to reference the rubric, at least
for the first handful of tasks that you complete. As you continue tasking, don’t shy away
from opening up the rubric (via the Instructions button in the task)!
● Gut check the comparison score: After you’ve completed your ratings of each
response, make sure you ask yourself “Do I agree with the suggested comparison
score?” If not, it might be a good indicator that you need to double-check your ratings.
● Pay attention to detail: Many responses may seem like they will be better at first glance
because they are longer and look more cleanly formatted. Be weary that this is not
always the case - those responses may have subtle factual errors or other issues that
Always remember…