# Bicorder synthetic data analysis

This directory concerns a synthetic data analysis conducted with the Protocol Bicorder.

## Procedure

See [`prompts.md`](prompts.md) for a collection of prompts used in this process. SHOULD THESE BE INTEGRATED BELOW?

* Document chunking: Gathering raw data from protocol-focused texts, including the draft of the author's book, _The Protocol Reader_, _As for Protocols_, and _Das Protokoll_; produces a CSV list of protocols
    - The dataset includes some LLM hallucinations---that is, protocols not in the texts---but the hallucinations are often acceptable examples and so some have been retained (`data/output-raw.csv`, n=776)
    - Cleaning: Manual review of the protocols listed to remove overly broad or inappropriate entries (`data/output-edit.csv`, n=TKTK)
* Dataset elaboration: Expand the dataset with LLM background knowledge
    - Add analyst personas: a control analyst with an academic "view from nowhere" and two others directly involved in the protocol
    - Cleaning: Manual review of the elaborated protocols and analysts for quality and correctness, editing or removing problematic entries
* Bicorder diagnostic
    - Automated diagnoses
        - For each protocol
            - Start a thread; for each gradient
                - extract term explanation
                - pick a number, add to csv, followed by comma
    - Manual audit of analyses: TKTK
    - Test different models
    - Perhaps use a simplified template without citations and descriptions to reduce tokens, and just provide those materials once
    - Iterate on bicorder design?
* Diagnostic results analysis
    - TKTK need to identify relevant tests
    - Correlations: Which gradients seem to travel together?
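
As a starting point for the correlation question, here is a minimal sketch using only the Python standard library. It assumes the diagnostic output is a CSV with one row per diagnosis and one numeric column per gradient. The column names (`protocol`, `analyst`, `gradient_a`, and so on) and the sample scores are hypothetical placeholders, not the bicorder's actual fields or data. Pearson's r is used only for illustration; the relevant tests are still to be identified, and a rank-based measure may suit ordinal ratings better.

```python
import csv
import io
import math
from itertools import combinations

# Hypothetical sample data: column names and scores are invented for illustration.
SAMPLE_CSV = """protocol,analyst,gradient_a,gradient_b,gradient_c
handshake,control,1,2,9
handshake,participant,2,3,8
tea ceremony,control,7,8,3
tea ceremony,participant,8,9,1
"""


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def gradient_correlations(csv_text, id_columns=("protocol", "analyst")):
    """Return {(gradient_1, gradient_2): r} for every pair of gradient columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    gradients = [name for name in rows[0] if name not in id_columns]
    columns = {g: [float(row[g]) for row in rows] for g in gradients}
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(gradients, 2)
    }


def strongest_pairs(correlations, threshold=0.7):
    """Keep pairs whose absolute correlation meets the threshold, strongest first."""
    kept = [(pair, r) for pair, r in correlations.items() if abs(r) >= threshold]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    for (a, b), r in strongest_pairs(gradient_correlations(SAMPLE_CSV)):
        print(f"{a} ~ {b}: r = {r:+.2f}")
```

To run this against real output, read the diagnostic CSV into a string and pass it to `gradient_correlations`, adjusting `id_columns` to whatever non-gradient columns the file contains. The 0.7 threshold is an arbitrary illustrative cutoff.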