Bicorder synthetic data analysis
This directory concerns a synthetic data analysis conducted with the Protocol Bicorder.
Procedure
See prompts.md for a collection of prompts used in this process. SHOULD THESE BE INTEGRATED BELOW?
- Document chunking: Gathering raw data from protocol-focused texts, including the draft of the author's book, The Protocol Reader, As for Protocols, and Das Protokoll; produces a CSV list of protocols
  - The dataset includes some LLM hallucinations (that is, protocols not in the texts), but the hallucinations are often acceptable examples and so some have been retained (data/output-raw.csv, n=776)
- Cleaning: Manual review of the protocols listed to remove overly broad or inappropriate entries (data/output-edit.csv, n=TKTK)
- Dataset elaboration: Expand the dataset with LLM background knowledge
  - Add analyst personas: a control analyst with an academic "view from nowhere" and two others directly involved in the protocol
  - Cleaning: Manual review of the elaborated protocols and analysts for quality and correctness, editing or removing problematic entries
- Bicorder diagnostic
  - Automated diagnoses
    - For each protocol
      - Start a thread; for each gradient
        - extract term explanation
        - pick a number, add to csv, followed by comma
    - Manual audit of analyses: TKTK
    - Test different models
    - Perhaps use a simplified template without citations and descriptions to reduce tokens, and just provide those materials once
    - Iterate on bicorder design?
- Diagnostic results analysis
  - TKTK need to identify relevant tests
    - Correlations: Which gradients seem to travel together?
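
The automated diagnosis loop described under "Bicorder diagnostic" could be sketched roughly as follows. This is only an illustration under several assumptions: the gradient names and their term explanations are invented placeholders, the 1 to 9 rating scale is a guess, and `rate_gradient` is a hypothetical stand-in for the model call (it returns a deterministic pseudo-random number rather than contacting any LLM).

```python
import csv
import io
import random

# Hypothetical gradients mapped to their term explanations.
# The real bicorder defines its own set; these are placeholders.
GRADIENTS = {
    "gradient_a": "placeholder explanation of the first pair of terms",
    "gradient_b": "placeholder explanation of the second pair of terms",
    "gradient_c": "placeholder explanation of the third pair of terms",
}


def rate_gradient(protocol, gradient, explanation, thread):
    """Hypothetical stand-in for a model call that picks a number.

    A real version would send the thread so far plus the explanation to a
    model and parse the number it returns. This stub ignores both and
    derives a repeatable value from the protocol and gradient names.
    The 1 to 9 range is an assumption.
    """
    rng = random.Random(f"{protocol}|{gradient}")
    return rng.randint(1, 9)


def diagnose(protocols, gradients=GRADIENTS):
    """Return CSV text: one row per protocol, one column per gradient."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["protocol", *gradients])
    for protocol in protocols:
        thread = []  # start a fresh thread for each protocol
        row = [protocol]
        for gradient, explanation in gradients.items():
            # Extract the term explanation, then pick a number.
            score = rate_gradient(protocol, gradient, explanation, thread)
            thread.append(f"{gradient}: {score}")
            row.append(score)
        writer.writerow(row)  # the csv module inserts the commas
    return out.getvalue()


if __name__ == "__main__":
    # Made-up protocol names, for illustration only.
    print(diagnose(["example protocol 1", "example protocol 2"]))
```

Replacing `rate_gradient` with a real model call would leave the rest of the loop unchanged, which may also make it easier to test different models against the same protocols.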
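
For the correlations question (which gradients seem to travel together), one possible starting point is a pairwise correlation over the diagnostic CSV. The sketch below assumes a `protocol` column followed by one numeric column per gradient, which is a guess at the file layout, and it uses Pearson's r only as a candidate statistic; the relevant tests have not yet been chosen, and a rank-based measure may suit ordinal ratings better. The sample data is made up.

```python
import csv
import io
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a column never varies
    return cov / math.sqrt(var_x * var_y)


def gradient_correlations(csv_text):
    """Return ((gradient_a, gradient_b), r) pairs, strongest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Assumed layout: every column except "protocol" is a gradient.
    gradients = [name for name in rows[0] if name != "protocol"]
    columns = {g: [float(r[g]) for r in rows] for g in gradients}
    pairs = {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(gradients, 2)
    }
    # Sort by absolute strength; undefined correlations go last.
    return sorted(
        pairs.items(),
        key=lambda kv: -1.0 if math.isnan(kv[1]) else abs(kv[1]),
        reverse=True,
    )


# Made-up ratings, for illustration only.
SAMPLE = """protocol,g1,g2,g3
p1,1,2,9
p2,3,4,7
p3,5,6,5
p4,7,8,3
"""

if __name__ == "__main__":
    for (a, b), r in gradient_correlations(SAMPLE):
        print(f"{a} ~ {b}: r = {r:+.2f}")
```

Pairs with r near +1 or -1 would be candidates for gradients that travel together, or in opposition, though any such reading would need to be checked against the manual audit.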