# Protocol Bicorder Analysis Workflow
This directory contains scripts for analyzing protocols using the Protocol Bicorder framework with LLM assistance.
The scripts automatically draw gradient definitions from the current state of the [bicorder.json](../bicorder.json) file.
## Scripts
### Diagnostic data generation (LLM-based)
1. **scripts/bicorder_batch.py** - **[RECOMMENDED]** Process entire CSV with one command
2. **scripts/bicorder_analyze.py** - Prepares CSV with gradient columns
3. **scripts/bicorder_query.py** - Queries LLM for each gradient value and updates CSV (each query is a new chat)
### Manual / JSON-based readings
4. **scripts/json_to_csv.py** - Convert a directory of individual bicorder JSON reading files into a `readings.csv`
5. **scripts/sync_readings.sh** - Sync a readings dataset from a remote git repository, then regenerate CSV and run analysis (see below)
### Analysis
6. **scripts/multivariate_analysis.py** - Run clustering, PCA, correlation, and feature importance analysis on a readings CSV
7. **scripts/lda_visualization.py** - Generate LDA cluster separation plot and projection data
8. **scripts/classify_readings.py** - Apply the synthetic-trained LDA classifier to all readings; saves `analysis/classifications.csv`
9. **scripts/visualize_clusters.py** - Additional cluster visualizations
10. **scripts/export_model_for_js.py** - Export trained model to `bicorder_model.json` (read by `bicorder-app` at build time)
## Syncing a manual readings dataset
If the dataset has a `.sync_source` file (e.g., `data/readings/manual_20260320/`), one command handles everything:
```bash
scripts/sync_readings.sh data/readings/manual_20260320
```
This fetches new JSON files from the remote repo, regenerates `readings.csv`, runs multivariate analysis (with `--min-coverage 0.8` to handle shortform readings), generates the LDA visualization, and saves cluster classifications to `analysis/classifications.csv`.
## Running analysis on any readings CSV
```bash
# Full analysis pipeline
python3 scripts/multivariate_analysis.py data/readings/manual_20260320/readings.csv \
  --min-coverage 0.8 \
  --analyses clustering pca correlation importance

# LDA visualization (cluster separation plot)
python3 scripts/lda_visualization.py data/readings/manual_20260320/readings.csv

# Classify all readings (uses synthetic dataset as training data by default)
python3 scripts/classify_readings.py data/readings/manual_20260320/readings.csv
```
Use `--min-coverage` (0.0 to 1.0) to drop dimension columns below the given coverage fraction before analysis. This is important for datasets with many shortform readings where most dimensions are sparsely filled.
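The coverage filter can be illustrated with a short sketch (illustrative only; the actual implementation lives in `scripts/multivariate_analysis.py`): for each dimension column, compute the fraction of rows with a non-empty value and keep only columns at or above the threshold. The column names below are hypothetical examples.

```python
def filter_columns_by_coverage(rows, dimension_cols, min_coverage=0.8):
    """Keep dimension columns whose non-empty fraction is >= min_coverage.

    `rows` is a list of dicts (e.g. from csv.DictReader); returns the kept columns.
    """
    n = len(rows)
    kept = []
    for col in dimension_cols:
        filled = sum(1 for row in rows if row.get(col, "").strip())
        if n and filled / n >= min_coverage:
            kept.append(col)
    return kept

# Two readings: the first gradient is always filled, the second never is
rows = [
    {"openness vs closure": "7", "speed vs care": ""},
    {"openness vs closure": "3", "speed vs care": ""},
]
print(filter_columns_by_coverage(rows, ["openness vs closure", "speed vs care"], 0.8))
# prints ['openness vs closure']
```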
## Converting JSON reading files to CSV
If you have a directory of individual bicorder JSON reading files:
```bash
python3 scripts/json_to_csv.py data/readings/manual_20260320/json/ \
  -o data/readings/manual_20260320/readings.csv
```
---
## Quick Start (Recommended, LLM-based)
### Process All Protocols with One Command
```bash
python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
```
This will:
1. Create the analysis CSV with gradient columns
2. For each protocol row, query all gradients (each query is a new chat with full protocol context)
3. Update the CSV automatically with the results
4. Show progress and summary
### Common Options
```bash
# Process only rows 1-5 (useful for testing)
python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv --start 1 --end 5

# Use specific LLM model
python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv -m mistral

# Add analyst metadata
python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
  -a "Your Name" -s "Your analytical standpoint"
```
---
## Manual Workflow (Advanced)
### Step 1: Prepare the Analysis CSV
Create a CSV with empty gradient columns:
```bash
python3 scripts/bicorder_analyze.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
```
Optional: Add analyst metadata:
```bash
python3 scripts/bicorder_analyze.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
  -a "Your Name" -s "Your analytical standpoint"
```
### Step 2: Query Gradients for a Protocol Row
Query all gradients for a specific protocol:
```bash
python3 scripts/bicorder_query.py analysis_output.csv 1
```
- Replace `1` with the row number you want to analyze
- Each gradient is queried in a new chat with full protocol context
- Each response is automatically parsed and written to the CSV
- Progress is shown for each gradient
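The "automatically parsed" step can be sketched as follows. This is a hypothetical approach, not necessarily what `scripts/bicorder_query.py` does: scan the response for the first integer in the 1-9 range.

```python
import re

def parse_rating(response: str):
    """Pull a 1-9 gradient rating out of an LLM response.

    Illustrative only; returns None when no in-range integer is found.
    """
    for match in re.finditer(r"\d+", response):
        value = int(match.group())
        if 1 <= value <= 9:
            return value
    return None

print(parse_rating("I would rate this protocol a 7 out of 9."))  # prints 7
```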
Optional: Specify a model:
```bash
python3 scripts/bicorder_query.py analysis_output.csv 1 -m mistral
```
### Step 3: Repeat for All Protocols
For each protocol in your CSV:
```bash
python3 scripts/bicorder_query.py analysis_output.csv 1
python3 scripts/bicorder_query.py analysis_output.csv 2
python3 scripts/bicorder_query.py analysis_output.csv 3
# ... and so on
# OR: Use scripts/bicorder_batch.py to automate all of this!
```
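If you prefer not to switch to the batch script, the repetition above can also be driven from Python. A minimal sketch, assuming you know the row count; the helper name is hypothetical, but the command invoked is exactly the documented CLI:

```python
import subprocess

def build_query_command(csv_path, row, extra_args=()):
    """Build the documented bicorder_query.py invocation for one row."""
    return ["python3", "scripts/bicorder_query.py", csv_path, str(row), *extra_args]

def query_all_rows(csv_path, n_rows, extra_args=()):
    """Run scripts/bicorder_query.py once per protocol row; stops on first failure."""
    for row in range(1, n_rows + 1):
        subprocess.run(build_query_command(csv_path, row, extra_args), check=True)
```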
## Architecture
### How It Works
Each gradient query is sent to the LLM as a **new, independent chat**. Every query includes:
- The protocol descriptor (name)
- The protocol description
- The gradient definition (left term, right term, and their descriptions)
- Instructions to rate 1-9
This approach:
- **Simplifies the code** - No conversation state management
- **Prevents bias** - Each evaluation is independent, not influenced by previous responses
- **Enables parallelization** - Queries could theoretically run concurrently
- **Makes debugging easier** - Each query/response pair is self-contained
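The per-query context assembly described above might look like the sketch below. The field names are illustrative (drawn from the bicorder.json-style gradient definitions of left/right terms plus descriptions); the real prompt template lives in `scripts/bicorder_query.py`.

```python
def build_gradient_prompt(protocol_name, protocol_description, gradient):
    """Assemble a self-contained prompt for one independent gradient query.

    `gradient` is a dict with illustrative keys: 'left', 'right', and
    their '*_description' counterparts.
    """
    return (
        f"Protocol: {protocol_name}\n"
        f"Description: {protocol_description}\n\n"
        f"Gradient: {gradient['left']} (1) vs {gradient['right']} (9)\n"
        f"{gradient['left']}: {gradient['left_description']}\n"
        f"{gradient['right']}: {gradient['right_description']}\n\n"
        "Rate this protocol on the gradient from 1 to 9. Reply with a single integer."
    )
```

Because every prompt carries the full protocol and gradient context, no conversation state needs to survive between queries, which is what makes the independent-chat design possible.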
## Tips
### Dry Run Mode
Test prompts without calling the LLM:
```bash
python3 scripts/bicorder_query.py analysis_output.csv 1 --dry-run
```
This shows you exactly what prompt will be sent for each gradient, including the full protocol context.
### Check Your Progress
View completed values:
```bash
python3 -c "
import csv
with open('analysis_output.csv') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader, 1):
        empty = sum(1 for k, v in row.items() if 'vs' in k and not v)
        print(f'Row {i}: {empty}/23 gradients empty')
"
```
### Batch Processing
Use the `scripts/bicorder_batch.py` script (see Quick Start section above) for processing multiple protocols.