protocol-bicorder/analysis/README.md
2025-10-28 11:04:06 -06:00

Bicorder synthetic data analysis

This directory concerns a synthetic data analysis conducted with the Protocol Bicorder.

Procedure

See prompts.md for the collection of prompts used in this process. Open question: should these be integrated into the steps below?

  • Document chunking: Gather raw data from protocol-focused texts, including the draft of the author's book, The Protocol Reader, As for Protocols, and Das Protokoll; this produces a CSV list of protocols
    • The dataset includes some LLM hallucinations (protocols that do not appear in the texts), but these are often acceptable examples, so some have been retained (data/output-raw.csv, n=776)
    • Cleaning: Manual review of the protocols listed to remove overly broad or inappropriate entries (data/output-edit.csv, n=TKTK)
  • Dataset elaboration: Expand the dataset with LLM background knowledge
    • Add analyst personas: a control analyst with an academic "view from nowhere" and two others directly involved in the protocol
    • Cleaning: Manual review of the elaborated protocols and analysts for quality and correctness, editing or removing problematic entries
  • Bicorder diagnostic
    • Automated diagnoses
      • For each protocol
        • Start a thread, then for each gradient:
          • Extract the explanation of the gradient's terms
          • Pick a number and append it to the CSV row, followed by a comma
    • Manual audit of analyses: TKTK
    • Test different models
      • Perhaps use a simplified template without citations and descriptions to reduce tokens, and just provide those materials once
    • Iterate on bicorder design?
  • Diagnostic results analysis
    • TKTK need to identify relevant tests
      • Correlations: Which gradients seem to travel together?
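The automated diagnosis loop and the correlation question above can be sketched together in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the scripts actually used here: the gradient names, the `rate` callable (a stand-in for whatever picks the number, such as one LLM call per gradient), and the CSV layout (one row per protocol, one column per gradient) are all hypothetical.

```python
import csv
import io
from itertools import combinations
from math import sqrt

# Hypothetical gradient names; the real bicorder defines its own set.
GRADIENTS = ["gradient_a", "gradient_b", "gradient_c"]


def diagnose(protocols, gradients, rate):
    """Return CSV text: one row per protocol, one column per gradient.

    `rate(protocol, gradient)` is an assumed stand-in for the step that
    picks a number for one gradient of one protocol.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["protocol", *gradients])
    for protocol in protocols:
        # One pass ("thread") per protocol, rating every gradient in order.
        writer.writerow([protocol, *(rate(protocol, g) for g in gradients)])
    return buffer.getvalue()


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def gradient_correlations(csv_text):
    """Map each pair of gradient columns to its Pearson correlation."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    gradients = [name for name in reader.fieldnames if name != "protocol"]
    columns = {g: [float(row[g]) for row in rows] for g in gradients}
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(gradients, 2)
    }
```

Pairs with a coefficient near 1 or -1 are candidates for gradients that "travel together"; whether Pearson correlation is the right test for these ratings is one of the open questions noted above.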