🧠 Reasoning or Rationalizing? Testing Model Robustness Against Misleading Information

Demo Video

demo-video.mp4

60-second overview of our research findings

📊 Key Results at a Glance

Model Performance Comparison

| Model | Baseline | Correct Hints After | Correct Hints Before | Incorrect Hints After | Incorrect Hints Before |
|---|---|---|---|---|---|
| Gemini 2.0 Flash | 74.2% | 68.3% (-5.9pp) | 69.2% (-5.0pp) | 69.2% (-5.0pp) | 53.3% (-20.9pp) |
| OpenAI GPT-4o-mini | 53.3% | 53.3% (±0.0pp) | 41.7% (-11.6pp) | 51.7% (-1.6pp) | 33.3% (-20.0pp) |

pp = percentage points vs baseline
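The percentage-point deltas in the table can be recomputed directly from the raw accuracies. A small sketch (the condition keys are our own labels, not names used in the repository):

```python
# Accuracies (%) copied from the results table above.
accuracies = {
    "Gemini 2.0 Flash": {
        "baseline": 74.2,
        "correct_after": 68.3,
        "correct_before": 69.2,
        "incorrect_after": 69.2,
        "incorrect_before": 53.3,
    },
    "OpenAI GPT-4o-mini": {
        "baseline": 53.3,
        "correct_after": 53.3,
        "correct_before": 41.7,
        "incorrect_after": 51.7,
        "incorrect_before": 33.3,
    },
}

def pp_delta(model: str, condition: str) -> float:
    """Accuracy change (percentage points) of a condition vs. the model's baseline."""
    scores = accuracies[model]
    return round(scores[condition] - scores["baseline"], 1)

for model, scores in accuracies.items():
    for condition in scores:
        if condition != "baseline":
            print(f"{model} / {condition}: {pp_delta(model, condition):+.1f}pp")
```
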

🎯 Research Question

How robust are autoregressive LLMs against misleading information, and does the position of this information affect their reasoning accuracy?

Specifically, we investigate whether models can maintain correct reasoning when exposed to incorrect hints, and whether the timing of this exposure (before vs. after questions) affects their robustness to misinformation.

🔬 Robustness Testing Framework

Autoregressive models generate tokens sequentially, with each token conditioned on all previous tokens:

P(response) = P(t₁) × P(t₂|t₁) × P(t₃|t₁,t₂) × ... × P(tₙ|t₁...tₙ₋₁)
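This chain-rule factorization can be made concrete with a toy autoregressive model over a two-token vocabulary (the probability values are illustrative, not from any real model):

```python
import math

# Toy autoregressive model over vocabulary {"A", "B"}:
# P(next_token | prefix) for each prefix seen so far. Values are made up.
conditionals = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"A": 0.3, "B": 0.7},
    ("A", "B"): {"A": 0.9, "B": 0.1},
}

def sequence_prob(tokens):
    """P(t1) * P(t2|t1) * ... -- each factor conditioned on ALL previous tokens."""
    prob = 1.0
    for i, token in enumerate(tokens):
        prob *= conditionals[tuple(tokens[:i])][token]
    return prob

# P("A","B","A") = P("A") * P("B"|"A") * P("A"|"A","B") = 0.6 * 0.7 * 0.9
assert math.isclose(sequence_prob(["A", "B", "A"]), 0.6 * 0.7 * 0.9)
```

Because every factor is conditioned on the full prefix, a misleading token placed early in the context participates in every subsequent conditional distribution.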

This architectural constraint reveals three robustness vulnerabilities:

  1. Information Position Sensitivity: Models show different robustness levels based on when misleading information appears
  2. Reasoning Fragility: Models struggle to maintain correct reasoning paths when exposed to contradictory information
  3. Asymmetric Robustness: Models are significantly less robust to early misinformation than late misinformation

📈 Key Findings

1. Hints Paradoxically Hurt Performance

  • Gemini: Baseline 74.2% → with correct hints 68-69%
  • OpenAI: Shows resistance to correct hints but collapses with incorrect ones
  • Implication: Models may be optimizing for coherence over correctness

2. Position Matters - Robustness Varies with Information Timing

  • Incorrect hints BEFORE: Both models drop ~20 percentage points
  • Incorrect hints AFTER: Gemini -5pp, OpenAI -1.6pp
  • The roughly 4x gap suggests that early context anchors reasoning far more strongly than late context

3. Models Exhibit Different Failure Modes

  • Gemini: Higher baseline (74.2%) but more susceptible to any hints
  • OpenAI: Lower baseline (53.3%) but catastrophic failure with early misinformation (→ 33.3%)
  • Pattern: Higher-performing models may show LOWER robustness to misleading information

๐Ÿ› ๏ธ Experimental Design

Dataset

  • 120 questions (60 math, 60 science)
  • 3 difficulty levels: Easy, Medium, Hard
  • 5 experimental conditions per model:
    • Baseline (no hints)
    • Correct hints AFTER questions
    • Correct hints BEFORE questions
    • Incorrect hints AFTER questions
    • Incorrect hints BEFORE questions

Models Tested

  • Google Gemini 2.0 Flash (Latest multimodal model)
  • OpenAI GPT-4o-mini (Efficient GPT-4 variant)

Methodology

Each question is evaluated under controlled conditions with hints that either help (correct) or mislead (incorrect), positioned either before or after the question text. The model must provide only the final answer, preventing post-hoc rationalization in responses.

๐Ÿ—๏ธ Architectural Interpretation

The autoregressive architecture creates structural robustness limitations, not just learned behaviors:

# When hint appears FIRST:
context = [HINT, QUESTION]
# Every token generated is conditioned on the hint
# Model cannot "unsee" or backtrack from early influence

# When hint appears AFTER:  
context = [QUESTION, HINT]
# Model has already begun reasoning before seeing hint
# Less opportunity for hint to derail the trajectory

This isn't a bug; it's a fundamental property of left-to-right generation, where the model predicts "what comes next" rather than solving problems.

🎯 Why This Matters

Current Benchmarks Are Blind

Standard reasoning benchmarks (GSM8K, MATH, ARC) measure only:

  • ✅ Final answer correctness
  • ❌ Robustness to framing
  • ❌ Resistance to misleading context
  • ❌ Actual reasoning vs. pattern matching

Result: A model scoring 70% via robust reasoning and another scoring 70% via easily-swayed pattern matching appear identical.

Real-World Implications

  1. Prompt Injection Vulnerability: Early tokens in prompts have outsized influence
  2. Adversarial Robustness: Models can be derailed by strategic misinformation placement
  3. Reasoning vs. Rationalization: Models generate plausible-sounding justifications, not logical derivations
  4. Evaluation Gaps: We're not measuring what we think we're measuring

🚀 Contributions

  1. Robustness Assessment: Quantifies how vulnerable models are to misleading information
  2. Simple Protocol: No expensive compute required, just careful prompt manipulation
  3. Benchmark Blindspot: Exposes critical gap in current evaluation methods
  4. Quantified Effect: ~20pp accuracy drop with early misinformation (4x worse than late misinformation)

📂 Repository Structure

Reasoning-Rationalizing/
├── data/
│   ├── no-hints/          # Baseline questions
│   ├── C-hints/           # Correct hints
│   └── IC-hints/          # Incorrect hints
├── notebooks-hintsafter/   # Hints after questions experiments
├── notebooks-hintsbefore/  # Hints before questions experiments
├── results/                # Evaluation outputs
└── final_analysis_notebook.ipynb  # Complete analysis & visualizations

🔄 Reproducing Results

  1. Setup Environment
pip install pandas numpy matplotlib seaborn openai google-generativeai
  2. Configure API Keys
export OPENAI_API_KEY="your-key"
export GEMINI_API_KEY="your-key"
  3. Run Evaluations
# Run individual notebooks in notebooks-hintsafter/ and notebooks-hintsbefore/
# Or use final_analysis_notebook.ipynb for the complete analysis
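Scoring in an evaluation like this amounts to exact-match comparison of final answers against gold labels. A simplified stand-in for that step (this is our illustration, not the actual notebook code; the normalization rules are assumptions):

```python
def normalize(answer: str) -> str:
    """Make final-answer comparison case-, whitespace-, and period-insensitive."""
    return answer.strip().lower().rstrip(".")

def score(predictions: list[str], gold: list[str]) -> float:
    """Accuracy as a percentage, rounded as in the results tables."""
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return round(100 * correct / len(gold), 1)

# Toy run: 3 of 4 answers match after normalization.
preds = ["102", " 96", "Photosynthesis.", "7"]
gold = ["102", "102", "photosynthesis", "7"]
assert score(preds, gold) == 75.0
```
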

📊 Detailed Results

Impact by Difficulty Level

  • Easy Questions: Less affected by hints (higher baseline resistance)
  • Medium Questions: Moderate susceptibility
  • Hard Questions: Highest variance; models either leverage hints well or fail completely

Domain-Specific Effects

  • Math: More susceptible to incorrect hints (requires precise reasoning)
  • Science: Better resistance (more pattern matching, less calculation)

🔮 Future Work

  1. Expand Model Coverage: Test Llama, Claude, Mistral families
  2. Analyze Chain-of-Thought: Examine how models justify wrong answers
  3. Bidirectional Architectures: Compare with models that can "look ahead"
  4. Adversarial Hint Generation: Systematically find worst-case misleading hints
  5. Mitigation Strategies: Develop techniques to improve model robustness against misinformation

📚 References

Benchmark Limitations & Evaluation Critique

  1. Zheng et al., 2023, "Large Language Models Are Not Robust Multiple Choice Selectors"

    • Shows LLMs are biased toward certain answer positions (e.g., option A) regardless of content
    • Supports our argument that benchmarks miss reasoning quality issues
  2. Zhao et al., 2021, "Calibrate Before Use: Improving Few-Shot Performance of Language Models"

    • Demonstrates systematic biases based on prompt framing, example order, and surface features
    • Directly supports our finding that hint position affects accuracy
  3. Turpin et al., 2024, "Chain-of-Thought Reasoning is Unfaithful"

    • Shows stated reasoning in CoT doesn't always reflect actual causal process
    • Models confabulate reasoning post-hoc
    • Core support for our "rationalizing not reasoning" thesis

Sycophancy & Authority Bias

  1. Perez et al., 2022, "Discovering Language Model Behaviors with Model-Written Evaluations" (Anthropic)
    • Documents sycophancy: models change answers when users push back, even if original was correct
    • Related to our authority bias / incorrect hint susceptibility findings

Primacy & Recency Effects

  1. Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts"
    • Models over-weight information at beginning and end of context
    • Directly supports our hypothesis that hints-before have disproportionate influence

Foundational Context

  1. Wei et al., 2022, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"

    • Establishes CoT as reasoning evaluation method
    • Our work critiques what CoT actually measures
  2. Chollet, 2019, "On the Measure of Intelligence" (ARC Prize framing)

    • Argues current benchmarks don't measure true reasoning/generalization
    • Philosophical foundation for our benchmark critique

📚 Citation

If you use this research, please cite:

@article{reasoning-rationalizing-2024,
  title={Reasoning or Rationalizing? Testing Model Robustness Against Misleading Information},
  author={Desouki, Nour},
  year={2024},
  journal={arXiv preprint}
}

๐Ÿค Contributing

We welcome contributions! Areas of interest:

  • Additional model evaluations
  • New question domains
  • Statistical analysis improvements
  • Visualization enhancements

โš ๏ธ Limitations

  1. Sample Size: 120 questions provide a strong signal, but a larger dataset would increase confidence
  2. Model Selection: Two models tested; generalization needs broader coverage
  3. Hint Quality: Hand-crafted hints; systematic generation could be more rigorous
  4. English Only: Multilingual evaluation could reveal language-specific effects

💡 Key Takeaway

Autoregressive LLMs lack robustness against misleading information. The sequential generation architecture creates fundamental vulnerabilities where models fail to maintain correct reasoning when exposed to incorrect hints, especially when that misinformation appears early. This robustness failure isn't a training issue to be fixed; it's a fundamental architectural limitation that must be understood and mitigated in deployment.


This research reveals that our most advanced language models can be derailed by the simple act of putting misleading information in the wrong place, a vulnerability that no benchmark currently measures.
