Restructured data/ into analysis/
27
analysis/README.md
Normal file
@@ -0,0 +1,27 @@
# Bicorder synthetic data analysis

This directory concerns a synthetic data analysis conducted with the Protocol Bicorder.

## Procedure

See [`prompts.md`](prompts.md) for a collection of prompts used in this process. SHOULD THESE BE INTEGRATED BELOW?

* Document chunking: Gathering raw data from protocol-focused texts, including the draft of the author's book, _The Protocol Reader_, _As for Protocols_, and _Das Protokoll_; produces a CSV list of protocols
    - The dataset includes some LLM hallucinations (that is, protocols not in the texts), but the hallucinations are often acceptable examples and so some have been retained (`data/output-raw.csv`, n=776)
    - Cleaning: Manual review of the protocols listed to remove overly broad or inappropriate entries (`data/output-edit.csv`, n=TKTK)
* Dataset elaboration: Expand the dataset with LLM background knowledge
    - Add analyst personas: a control analyst with an academic "view from nowhere" and two others directly involved in the protocol
    - Cleaning: Manual review of the elaborated protocols and analysts for quality and correctness, editing or removing problematic entries
* Bicorder diagnostic
    - Automated diagnoses
        - For each protocol
            - Start a thread; for each gradient
                - Extract the term explanation
                - Pick a number and append it to the CSV, followed by a comma
    - Manual audit of analyses: TKTK
    - Test different models
    - Perhaps use a simplified template without citations and descriptions to reduce tokens, and just provide those materials once
    - Iterate on bicorder design?
* Diagnostic results analysis
    - TKTK need to identify relevant tests
    - Correlations: Which gradients seem to travel together?