demo-video.mp4
60-second overview of our research findings
| Model | Baseline | Correct Hints After | Correct Hints Before | Incorrect Hints After | Incorrect Hints Before |
|---|---|---|---|---|---|
| Gemini 2.0 Flash | 74.2% | 68.3% (-5.9pp) | 69.2% (-5.0pp) | 69.2% (-5.0pp) | 53.3% (-20.9pp) |
| OpenAI GPT-4o-mini | 53.3% | 53.3% (±0.0pp) | 41.7% (-11.6pp) | 51.7% (-1.6pp) | 33.3% (-20.0pp) |
pp = percentage points vs baseline
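The percentage-point deltas in the table are simple differences against each model's baseline. A minimal sketch (the helper name `pp_delta` is ours, not from the repo):

```python
def pp_delta(condition_acc, baseline_acc):
    """Percentage-point difference vs. baseline (both given in %)."""
    return round(condition_acc - baseline_acc, 1)

# GPT-4o-mini, incorrect hints before: 33.3% vs. 53.3% baseline
print(pp_delta(33.3, 53.3))  # -20.0
# Gemini 2.0 Flash, correct hints after: 68.3% vs. 74.2% baseline
print(pp_delta(68.3, 74.2))  # -5.9
```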
How robust are autoregressive LLMs against misleading information, and does the position of this information affect their reasoning accuracy?
Specifically, we investigate whether models can maintain correct reasoning when exposed to incorrect hints, and whether the timing of this exposure (before vs. after questions) affects their robustness to misinformation.
Autoregressive models generate tokens sequentially, with each token conditioned on all previous tokens:
P(response) = P(t₁) × P(t₂ | t₁) × P(t₃ | t₁, t₂) × ... × P(tₙ | t₁, ..., tₙ₋₁)
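This chain-rule factorization can be sketched directly in code. The `log_prob` callback below is a hypothetical stand-in for a model's per-token log-probability; the toy uniform model is only for illustration:

```python
import math

def sequence_log_prob(tokens, log_prob):
    """Score a sequence under the chain rule: each token is
    conditioned only on the tokens to its left, never the right."""
    total = 0.0
    prefix = []
    for t in tokens:
        total += log_prob(t, tuple(prefix))  # log P(t | t_1 ... t_{i-1})
        prefix.append(t)
    return total

# Toy model: uniform over a 4-symbol vocabulary, P(t | prefix) = 1/4.
uniform = lambda t, prefix: math.log(0.25)
print(sequence_log_prob(["a", "b", "c"], uniform))  # 3 * log(0.25)
```

Because each factor conditions only on the prefix, anything placed early in the context (such as a hint) influences every subsequent term.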
This architectural constraint reveals three robustness vulnerabilities:
- Information Position Sensitivity: Models show different robustness levels based on when misleading information appears
- Reasoning Fragility: Models struggle to maintain correct reasoning paths when exposed to contradictory information
- Asymmetric Robustness: Models are significantly less robust to early misinformation than late misinformation
- Gemini: Baseline 74.2% → with correct hints 68-69%
- OpenAI: Shows resistance to correct hints but collapses with incorrect ones
- Implication: Models may be optimizing for coherence over correctness
- Incorrect hints BEFORE: Both models drop ~20 percentage points
- Incorrect hints AFTER: Gemini -5pp, OpenAI -1.6pp
- The ~4x difference suggests that early context anchors reasoning far more strongly than late context
- Gemini: Higher baseline (74.2%) but more susceptible to any hints
- OpenAI: Lower baseline (53.3%) but catastrophic failure with early misinformation (→ 33.3%)
- Pattern: Higher-performing models may show LOWER robustness to misleading information
- 120 questions (60 math, 60 science)
- 3 difficulty levels: Easy, Medium, Hard
- 5 experimental conditions per model:
- Baseline (no hints)
- Correct hints AFTER questions
- Correct hints BEFORE questions
- Incorrect hints AFTER questions
- Incorrect hints BEFORE questions
- Google Gemini 2.0 Flash (Latest multimodal model)
- OpenAI GPT-4o-mini (Efficient GPT-4 variant)
Each question is evaluated under controlled conditions with hints that either help (correct) or mislead (incorrect), positioned either before or after the question text. The model must provide only the final answer, preventing post-hoc rationalization in responses.
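The five conditions differ only in whether a hint is present, whether it is correct, and where it sits relative to the question. A minimal sketch of the prompt assembly (function name, hint wording, and instruction text are our assumptions, not the repo's exact templates):

```python
def build_prompt(question, hint=None, position="after"):
    """Assemble one evaluation prompt. `hint` is None for the
    baseline condition; `position` places it before or after
    the question text."""
    answer_only = "Respond with only the final answer."
    if hint is None:
        return f"{question}\n{answer_only}"
    if position == "before":
        return f"Hint: {hint}\n{question}\n{answer_only}"
    return f"{question}\nHint: {hint}\n{answer_only}"

# The five conditions for a single toy question:
conditions = {
    "baseline":         build_prompt("What is 12 * 7?"),
    "correct_after":    build_prompt("What is 12 * 7?", "84", "after"),
    "correct_before":   build_prompt("What is 12 * 7?", "84", "before"),
    "incorrect_after":  build_prompt("What is 12 * 7?", "74", "after"),
    "incorrect_before": build_prompt("What is 12 * 7?", "74", "before"),
}
```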
The autoregressive architecture creates structural robustness limitations, not just learned behaviors:
```python
# When the hint appears FIRST:
context = [HINT, QUESTION]
# Every generated token is conditioned on the hint.
# The model cannot "unsee" or backtrack from early influence.

# When the hint appears AFTER:
context = [QUESTION, HINT]
# The model has already begun reasoning before seeing the hint,
# so there is less opportunity for it to derail the trajectory.
```

This isn't a bug; it's a fundamental property of left-to-right generation, where the model predicts "what comes next" rather than solving problems.
Standard reasoning benchmarks (GSM8K, MATH, ARC) measure only:
- ✅ Final answer correctness
- ❌ Robustness to framing
- ❌ Resistance to misleading context
- ❌ Actual reasoning vs. pattern matching
Result: A model scoring 70% via robust reasoning and another scoring 70% via easily-swayed pattern matching appear identical.
- Prompt Injection Vulnerability: Early tokens in prompts have outsized influence
- Adversarial Robustness: Models can be derailed by strategic misinformation placement
- Reasoning vs. Rationalization: Models generate plausible-sounding justifications, not logical derivations
- Evaluation Gaps: We're not measuring what we think we're measuring
- Robustness Assessment: Quantifies how vulnerable models are to misleading information
- Simple Protocol: No expensive compute required; just careful prompt manipulation
- Benchmark Blindspot: Exposes critical gap in current evaluation methods
- Quantified Effect: ~20pp accuracy drop with early misinformation (4x worse than late misinformation)
```
Reasoning-Rationalizing/
├── data/
│   ├── no-hints/                    # Baseline questions
│   ├── C-hints/                     # Correct hints
│   └── IC-hints/                    # Incorrect hints
├── notebooks-hintsafter/            # Hints-after-questions experiments
├── notebooks-hintsbefore/           # Hints-before-questions experiments
├── results/                         # Evaluation outputs
└── final_analysis_notebook.ipynb    # Complete analysis & visualizations
```
- Setup Environment

```bash
pip install pandas numpy matplotlib seaborn openai google-generativeai
```

- Configure API Keys

```bash
export OPENAI_API_KEY="your-key"
export GEMINI_API_KEY="your-key"
```

- Run Evaluations

```bash
# Run individual notebooks in notebooks-hintsafter/ and notebooks-hintsbefore/
# Or use final_analysis_notebook.ipynb for complete analysis
```

- Easy Questions: Less affected by hints (higher baseline resistance)
- Medium Questions: Moderate susceptibility
- Hard Questions: Highest variance; models either leverage hints well or fail completely
- Math: More susceptible to incorrect hints (requires precise reasoning)
- Science: Better resistance (more pattern matching, less calculation)
- Expand Model Coverage: Test Llama, Claude, Mistral families
- Analyze Chain-of-Thought: Examine how models justify wrong answers
- Bidirectional Architectures: Compare with models that can "look ahead"
- Adversarial Hint Generation: Systematically find worst-case misleading hints
- Mitigation Strategies: Develop techniques to improve model robustness against misinformation
- Zheng et al., 2023, "Large Language Models Are Not Robust Multiple Choice Selectors"
- Shows LLMs are biased toward certain answer positions (e.g., option A) regardless of content
- Supports our argument that benchmarks miss reasoning quality issues
- Zhao et al., 2021, "Calibrate Before Use: Improving Few-Shot Performance of Language Models"
- Demonstrates systematic biases based on prompt framing, example order, and surface features
- Directly supports our finding that hint position affects accuracy
- Turpin et al., 2023, "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting"
- Shows stated reasoning in CoT doesn't always reflect actual causal process
- Models confabulate reasoning post-hoc
- Core support for our "rationalizing not reasoning" thesis
- Perez et al., 2022 โ "Discovering Language Model Behaviors with Model-Written Evaluations" (Anthropic)
- Documents sycophancy: models change answers when users push back, even if original was correct
- Related to our authority bias / incorrect hint susceptibility findings
- Liu et al., 2023 โ "Lost in the Middle: How Language Models Use Long Contexts"
- Models over-weight information at beginning and end of context
- Directly supports our hypothesis that hints-before have disproportionate influence
- Wei et al., 2022, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"
- Establishes CoT as reasoning evaluation method
- Our work critiques what CoT actually measures
- Chollet, 2019, "On the Measure of Intelligence" (ARC Prize framing)
- Argues current benchmarks don't measure true reasoning/generalization
- Philosophical foundation for our benchmark critique
If you use this research, please cite:
```bibtex
@article{reasoning-rationalizing-2024,
  title={Reasoning or Rationalizing? Testing Model Robustness Against Misleading Information},
  author={Desouki, Nour},
  year={2024},
  journal={arXiv preprint}
}
```

We welcome contributions! Areas of interest:
- Additional model evaluations
- New question domains
- Statistical analysis improvements
- Visualization enhancements
- Sample Size: 120 questions provide strong signal but larger dataset would increase confidence
- Model Selection: Two models tested; generalization needs broader coverage
- Hint Quality: Hand-crafted hints; systematic generation could be more rigorous
- English Only: Multilingual evaluation could reveal language-specific effects
Autoregressive LLMs lack robustness against misleading information. The sequential generation architecture creates fundamental vulnerabilities where models fail to maintain correct reasoning when exposed to incorrect hints, especially when that misinformation appears early. This robustness failure isn't a training issue to be fixed; it's a fundamental architectural limitation that must be understood and mitigated in deployment.
This research reveals that our most advanced language models can be derailed by the simple act of putting misleading information in the wrong place, a vulnerability that no benchmark currently measures.
