protocol-bicorder/analysis/README.md
2025-10-30 10:56:21 -06:00


# Bicorder synthetic data analysis
This directory concerns a synthetic data analysis conducted with the Protocol Bicorder.
Scripts were created with the assistance of Claude Code, but the data processing was done with local models.
## Purpose
This analysis has several purposes:
* To test the usefulness and limitations of the Protocol Bicorder
* To identify any patterns in a synthetic dataset derived from recent works on protocols
## Procedure
See [`prompts.md`](prompts.md) for a collection of prompts used in this process. (Open question: should these be integrated below?)
### Document chunking
This stage gathered raw data from recent protocol-focused texts.
The source texts were book chapter drafts and major protocol-related books, including the draft of the author's book, _The Protocol Reader_, _As for Protocols_, and _Das Protokoll_. Each text was pasted in as plain text and divided into 5000-word files, and the `chunk.sh` script then applied the following prompt to each file:
```yaml
model: "gemma3:12b"
context: "model running on ollama locally, accessed with llm on the command line"
prompt: "Return csv-formatted data (with no markdown wrapper) that consists of a list of protocols discussed or referred to in the attached text. Protocols are defined extremely broadly as 'patterns of interaction,' and may be of a nontechnical nature. Protocols should be as specific as possible, such as 'Sacrament of Reconciliation' rather than 'Religious Protocols.' The first column should provide a brief descriptor of the protocol, and the second column should describe it in a substantial paragraph of 3-5 sentences, encapsulated in quotation marks to avoid breaking on commas. Be sure to paraphrase rather than quoting directly from the source text."
```
The result was a CSV-formatted list of protocols (`protocols_raw.csv`, n=774 total protocols listed).
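
As a rough illustration of the 5000-word split, here is a minimal Python sketch. It is not the contents of `chunk.sh`, which is not reproduced here; the function names `chunk_words` and `write_chunks` and the output file naming are assumptions made for the example.

```python
from pathlib import Path


def chunk_words(text: str, words_per_chunk: int = 5000) -> list[str]:
    """Split text into consecutive chunks of at most words_per_chunk words."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    # split() breaks on any whitespace, so original line breaks are not preserved
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def write_chunks(source: Path, out_dir: Path, words_per_chunk: int = 5000) -> list[Path]:
    """Write each chunk to out_dir as <stem>_001.txt, <stem>_002.txt, and so on.

    The naming scheme is hypothetical, chosen only for this example.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = chunk_words(source.read_text(encoding="utf-8"), words_per_chunk)
    paths = []
    for n, chunk in enumerate(chunks, start=1):
        path = out_dir / f"{source.stem}_{n:03d}.txt"
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```

Each resulting file could then be passed to the model together with the prompt above.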
### Dataset cleaning
The dataset was then manually reviewed. The review involved the following:
* Removal of repetitive formatting material introduced by the LLM
* Correction of formatting errors
* Removal of rows whose contents met the following criteria:
    - Repetition of entries, though some repetitions were simply merged into a single entry
    - Overly broad entries that lacked meaningful context-specificity
    - Overly narrow entries, e.g., referring to specific events
The cleaning process was subjective, so some entries that meet the above criteria may remain in the dataset. The dataset also appears to include some LLM hallucinations (that is, protocols not in the texts), but these are often acceptable examples, so some have been retained. Some degree of noise in the dataset was considered acceptable for the purposes of the study. Some degree of repetition also provides a kind of control case for evaluating the diagnostic process.
The result was a CSV-formatted list of protocols (`protocols_edited.csv`, n=419).
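
The review itself was manual, but a small helper could surface candidate repetitions for a human to inspect. The following is only a sketch of that idea, not part of the recorded procedure; `normalize` and `find_repeats` are hypothetical names, and the sample rows are made up apart from the protocol name borrowed from the prompt above. It assumes the descriptor is in the first column and counts every row, including any header.

```python
import csv
import io
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and collapse punctuation and whitespace so near-identical names match."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def find_repeats(csv_text: str) -> dict[str, list[int]]:
    """Map each normalized descriptor that appears more than once to its row numbers."""
    seen = defaultdict(list)
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if row:  # skip blank lines
            seen[normalize(row[0])].append(row_number)
    return {name: rows for name, rows in seen.items() if len(rows) > 1}


# Made-up sample rows, for illustration only.
sample = (
    'Sacrament of Reconciliation,"A description."\n'
    'Peer review,"A description."\n'
    'sacrament of reconciliation,"Another description."\n'
)
print(find_repeats(sample))
```

Exact-name matching like this would miss repetitions that are phrased differently, so it could only supplement the manual pass, not replace it.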
### Initial diagnostic
In this stage, an LLM applies the bicorder tool to the dataset. For each row in the dataset, the LLM is prompted to assess the protocol on each gradient in turn. The outputs are then added to a CSV output file.
See `WORKFLOW.md` for detailed documentation of the scripts.
- Manual audit of analyses: TKTK
- Test different models
- Perhaps use a simplified template without citations and descriptions to reduce tokens, and just provide those materials once
- Iterate on bicorder design?
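
The row-by-gradient loop described in this section could be sketched roughly as follows. This is an illustration under stated assumptions, not the scripts documented in `WORKFLOW.md`: the gradient names, the prompt wording, and the `ask` callable (a stand-in for whatever sends a prompt to the model) are all hypothetical.

```python
import csv
import io
from typing import Callable

# Hypothetical gradient names; the real gradients are defined by the bicorder.
GRADIENTS = ["gradient_a", "gradient_b"]


def run_diagnostic(
    protocols_csv: str,
    gradients: list[str],
    ask: Callable[[str], str],
) -> str:
    """Query the model once per (protocol, gradient) pair and return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["protocol", *gradients])
    for row in csv.reader(io.StringIO(protocols_csv)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        name, description = row[0], row[1]
        answers = []
        for gradient in gradients:
            # Hypothetical prompt wording, for illustration only.
            prompt = (
                f"Protocol: {name}\nDescription: {description}\n"
                f"Rate this protocol on the gradient: {gradient}"
            )
            answers.append(ask(prompt).strip())
        writer.writerow([name, *answers])
    return out.getvalue()


# A stub in place of a real model call, so the sketch runs offline.
print(run_diagnostic('Sacrament of Reconciliation,"A description."\n', GRADIENTS, lambda p: "3"))
```

Passing the model call in as a function keeps the loop testable without a model, and would make it easier to try different models, as the notes above suggest.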
### Persona elaboration
* Dataset elaboration: Expand the dataset with LLM background knowledge
- Add analyst personas: a control analyst with an academic "view from nowhere" and two others directly involved in the protocol
- Cleaning: Manual review of the elaborated protocols and analysts for quality and correctness, editing or removing problematic entries
```yaml
model: "claude"
context: "Claude Code interface"
prompt: "In protocols.csv, fill in the third and fourth columns with plausible inputs that reflect diversity along cultural, professional, gender, and class lines. The 'analyst' should be a particular persona described generically, like '23-year-old male student in New Delhi,' and the 'standpoint' should more thoroughly describe the analyst's relationship to the protocol, like, 'Learning Brahmanic rituals from elders in order to maintain the tradition, but feeling pulled away from these rituals by contemporary culture.' Replicate each protocol line twice to provide a total of three plausible analyst and standpoint pairs for each protocol."
```
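
The prompt asks for each protocol line to be replicated so that there are three analyst and standpoint pairs per protocol. In the procedure above that replication was done by the model; the sketch below only shows the row structure that results, with the third and fourth columns left blank. The function name `replicate_rows` is an assumption for the example.

```python
import csv
import io


def replicate_rows(protocols_csv: str, copies: int = 3) -> str:
    """Repeat each protocol row `copies` times with analyst and standpoint columns present."""
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(protocols_csv)):
        if not row:
            continue  # skip blank lines
        # Pad or trim to four columns: descriptor, description, analyst, standpoint.
        padded = (row + ["", "", "", ""])[:4]
        for _ in range(copies):
            writer.writerow(padded)
    return out.getvalue()
```

With the default of three copies, a 419-row input would yield 1,257 rows awaiting persona details.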
### Results analysis
* Diagnostic results analysis
    - TKTK need to identify relevant tests
    - Correlations: Which gradients seem to travel together?