# Protocol Bicorder Analysis Workflow

This directory contains scripts for analyzing protocols using the Protocol Bicorder framework with LLM assistance.

The scripts automatically draw the gradients from the current state of the [bicorder.json](../bicorder.json) file.
## Scripts

### Diagnostic data generation (LLM-based)

1. **scripts/bicorder_batch.py** - **[RECOMMENDED]** Process an entire CSV with one command
2. **scripts/bicorder_analyze.py** - Prepares a CSV with empty gradient columns
3. **scripts/bicorder_query.py** - Queries the LLM for each gradient value and updates the CSV (each query is a new chat)

### Manual / JSON-based readings

4. **scripts/json_to_csv.py** - Convert a directory of individual bicorder JSON reading files into a `readings.csv`
5. **scripts/sync_readings.sh** - Sync a readings dataset from a remote git repository, then regenerate the CSV and run analysis (see below)
### Analysis

6. **scripts/multivariate_analysis.py** - Run clustering, PCA, correlation, and feature importance analysis on a readings CSV
7. **scripts/lda_visualization.py** - Generate the LDA cluster separation plot and projection data
8. **scripts/classify_readings.py** - Apply the synthetic-trained LDA classifier to all readings; saves `analysis/classifications.csv`
9. **scripts/visualize_clusters.py** - Additional cluster visualizations
10. **scripts/export_model_for_js.py** - Export the trained model to `bicorder_model.json` (read by `bicorder-app` at build time)
## Syncing a manual readings dataset

If the dataset has a `.sync_source` file (e.g., `data/manual_20260320/`), one command handles everything:

```bash
scripts/sync_readings.sh data/manual_20260320
```

This fetches new JSON files from the remote repo, regenerates `readings.csv`, runs multivariate analysis (with `--min-coverage 0.8` to handle shortform readings), generates the LDA visualization, and saves cluster classifications to `analysis/classifications.csv`.
## Running analysis on any readings CSV

```bash
# Full analysis pipeline
python3 scripts/multivariate_analysis.py data/manual_20260320/readings.csv \
    --min-coverage 0.8 \
    --analyses clustering pca correlation importance

# LDA visualization (cluster separation plot)
python3 scripts/lda_visualization.py data/manual_20260320/readings.csv

# Classify all readings (uses synthetic dataset as training data by default)
python3 scripts/classify_readings.py data/manual_20260320/readings.csv
```

Use `--min-coverage` (0.0–1.0) to drop dimension columns below the given coverage fraction before analysis. This is important for datasets with many shortform readings, where most dimensions are sparsely filled.
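The coverage filter is simple to picture: a dimension column survives only if the fraction of rows with a non-empty value meets the threshold. A sketch of the idea (the gradient column names here are invented; this is not the script's actual code):

```python
# Keep a column only when its non-empty fraction meets the threshold --
# what --min-coverage does, in spirit.
rows = [
    {"open vs closed": "7", "formal vs informal": ""},
    {"open vs closed": "3", "formal vs informal": ""},
    {"open vs closed": "5", "formal vs informal": "5"},
    {"open vs closed": "8", "formal vs informal": ""},
]

def covered_columns(rows, min_coverage):
    kept = []
    for col in rows[0]:
        coverage = sum(1 for r in rows if r[col]) / len(rows)
        if coverage >= min_coverage:
            kept.append(col)
    return kept

print(covered_columns(rows, 0.8))  # ['open vs closed']
```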
## Converting JSON reading files to CSV

If you have a directory of individual bicorder JSON reading files:

```bash
python3 scripts/json_to_csv.py data/manual_20260320/json/ \
    -o data/manual_20260320/readings.csv
```
---

## Quick Start (Recommended, LLM-based)

### Process All Protocols with One Command

```bash
python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
```
This will:

1. Create the analysis CSV with gradient columns
2. For each protocol row, query all gradients (each query is a new chat with full protocol context)
3. Update the CSV automatically with the results
4. Show progress and summary
### Common Options

```bash
# Process only rows 1-5 (useful for testing)
python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv --start 1 --end 5

# Use a specific LLM model
python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv -m mistral

# Add analyst metadata
python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
    -a "Your Name" -s "Your analytical standpoint"
```
---

## Manual Workflow (Advanced)

### Step 1: Prepare the Analysis CSV

Create a CSV with empty gradient columns:

```bash
python3 scripts/bicorder_analyze.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
```

Optional: Add analyst metadata:

```bash
python3 scripts/bicorder_analyze.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
    -a "Your Name" -s "Your analytical standpoint"
```
### Step 2: Query Gradients for a Protocol Row

Query all gradients for a specific protocol:

```bash
python3 scripts/bicorder_query.py analysis_output.csv 1
```

- Replace `1` with the row number you want to analyze
- Each gradient is queried in a new chat with full protocol context
- Each response is automatically parsed and written to the CSV
- Progress is shown for each gradient

Optional: Specify a model:

```bash
python3 scripts/bicorder_query.py analysis_output.csv 1 -m mistral
```
### Step 3: Repeat for All Protocols

For each protocol in your CSV:

```bash
python3 scripts/bicorder_query.py analysis_output.csv 1
python3 scripts/bicorder_query.py analysis_output.csv 2
python3 scripts/bicorder_query.py analysis_output.csv 3
# ... and so on

# OR: Use scripts/bicorder_batch.py to automate all of this!
```
## Architecture

### How It Works

Each gradient query is sent to the LLM as a **new, independent chat**. Every query includes:

- The protocol descriptor (name)
- The protocol description
- The gradient definition (left term, right term, and their descriptions)
- Instructions to rate 1-9
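Concretely, each independent query boils down to assembling one self-contained prompt from those four pieces. A sketch of the shape (the field names and wording here are assumptions, not the scripts' actual prompt text):

```python
# Hypothetical prompt builder -- illustrates the structure described above,
# not the exact prompt used by scripts/bicorder_query.py.
def build_prompt(name, description, gradient):
    return (
        f"Protocol: {name}\n"
        f"Description: {description}\n\n"
        f"Gradient: {gradient['left']} (1) vs. {gradient['right']} (9)\n"
        f"{gradient['left']}: {gradient['left_desc']}\n"
        f"{gradient['right']}: {gradient['right_desc']}\n\n"
        "Rate this protocol on the gradient from 1 to 9. "
        "Reply with a single integer."
    )

# Invented example gradient and protocol for illustration.
gradient = {
    "left": "informal", "left_desc": "norms carried by habit",
    "right": "formal", "right_desc": "rules written down and enforced",
}
prompt = build_prompt("Daily standup", "A short synchronous check-in ritual.", gradient)
print(prompt.splitlines()[0])  # Protocol: Daily standup
```

Because every prompt carries the full protocol and gradient context, no state needs to survive between queries.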
This approach:

- **Simplifies the code** - No conversation state management
- **Prevents bias** - Each evaluation is independent, not influenced by previous responses
- **Enables parallelization** - Queries could theoretically run concurrently
- **Makes debugging easier** - Each query/response pair is self-contained
## Tips

### Dry Run Mode

Test prompts without calling the LLM:

```bash
python3 scripts/bicorder_query.py analysis_output.csv 1 --dry-run
```

This shows you exactly what prompt will be sent for each gradient, including the full protocol context.
### Check Your Progress

View completed values:

```bash
python3 -c "
import csv
with open('analysis_output.csv') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader, 1):
        empty = sum(1 for k, v in row.items() if 'vs' in k and not v)
        print(f'Row {i}: {empty}/23 gradients empty')
"
```
### Batch Processing

Use the `scripts/bicorder_batch.py` script (see the Quick Start section above) for processing multiple protocols.