protocol-bicorder/analysis/WORKFLOW.md
Nathan Schneider 897c30406b Reorganize directory, add manual dataset and sync tooling
- Move all scripts to scripts/, web assets to web/, analysis results
  into self-contained data/readings/<type>_<YYYYMMDD>/ directories
- Add data/readings/manual_20260320/ with 32 JSON readings from
  git.medlab.host/ntnsndr/protocol-bicorder-data
- Add scripts/json_to_csv.py to convert bicorder JSON files to CSV
- Add scripts/sync_readings.sh for one-command sync + re-analysis of
  any dataset backed by a .sync_source config file
- Add scripts/classify_readings.py to apply the LDA classifier to all
  readings and save per-reading cluster assignments
- Add --min-coverage flag to multivariate_analysis.py for sparse/shortform
  datasets; also applies in lda_visualization.py
- Fix lda_visualization.py NaN handling and 0-d array annotation bug
- Update README.md and WORKFLOW.md to document datasets, sync workflow,
  shortform handling, and new scripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-20 17:35:13 -06:00


# Protocol Bicorder Analysis Workflow

This directory contains scripts for analyzing protocols using the Protocol Bicorder framework with LLM assistance.

All scripts automatically read their gradient definitions from the current state of the `bicorder.json` file.

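As an illustration, loading gradient definitions from `bicorder.json` might look like the sketch below. The field names (`gradients`, `left`, `right`) are assumptions for demonstration, not the file's confirmed schema.

```python
import json

# Hypothetical excerpt of bicorder.json; the real schema may differ.
sample = {
    "gradients": [
        {"left": "Centralized", "right": "Decentralized"},
        {"left": "Formal", "right": "Informal"},
    ]
}

def load_gradient_pairs(data):
    """Return (left, right) term pairs for each gradient definition."""
    return [(g["left"], g["right"]) for g in data["gradients"]]

# In a script this would instead be: data = json.load(open("bicorder.json"))
print(load_gradient_pairs(sample))
```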
## Scripts

### Diagnostic data generation (LLM-based)

1. `scripts/bicorder_batch.py` - **[RECOMMENDED]** process an entire CSV with one command
2. `scripts/bicorder_analyze.py` - prepare a CSV with gradient columns
3. `scripts/bicorder_query.py` - query the LLM for each gradient value and update the CSV (each query is a new chat)

### Manual / JSON-based readings

1. `scripts/json_to_csv.py` - convert a directory of individual bicorder JSON reading files into a `readings.csv`
2. `scripts/sync_readings.sh` - sync a readings dataset from a remote git repository, then regenerate the CSV and re-run analysis (see below)

### Analysis

1. `scripts/multivariate_analysis.py` - run clustering, PCA, correlation, and feature-importance analysis on a readings CSV
2. `scripts/lda_visualization.py` - generate the LDA cluster separation plot and projection data
3. `scripts/classify_readings.py` - apply the synthetic-trained LDA classifier to all readings; saves `analysis/classifications.csv`
4. `scripts/visualize_clusters.py` - additional cluster visualizations
5. `scripts/export_model_for_js.py` - export the trained model to `bicorder_model.json` for the web classifier

## Syncing a manual readings dataset

If a dataset has a `.sync_source` file (e.g., `data/readings/manual_20260320/`), one command handles everything:

```sh
scripts/sync_readings.sh data/readings/manual_20260320
```

This fetches new JSON files from the remote repository, regenerates `readings.csv`, runs the multivariate analysis (with `--min-coverage 0.8` to handle shortform readings), generates the LDA visualization, and saves cluster classifications to `analysis/classifications.csv`.
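The exact `.sync_source` format is defined by `sync_readings.sh`; as a purely hypothetical illustration, it might hold nothing more than the remote repository location:

```
# data/readings/manual_20260320/.sync_source (hypothetical contents)
git.medlab.host/ntnsndr/protocol-bicorder-data
```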

## Running analysis on any readings CSV

```sh
# Full analysis pipeline
python3 scripts/multivariate_analysis.py data/readings/manual_20260320/readings.csv \
  --min-coverage 0.8 \
  --analyses clustering pca correlation importance

# LDA visualization (cluster separation plot)
python3 scripts/lda_visualization.py data/readings/manual_20260320/readings.csv

# Classify all readings (uses the synthetic dataset as training data by default)
python3 scripts/classify_readings.py data/readings/manual_20260320/readings.csv
```

Use `--min-coverage` (0.0 to 1.0) to drop dimension columns whose coverage falls below the given fraction before analysis. This matters for datasets with many shortform readings, where most dimensions are only sparsely filled.
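The coverage filter can be pictured as follows. This is a sketch of the idea only, not the actual logic in `multivariate_analysis.py`:

```python
def filter_columns(rows, dim_cols, min_coverage=0.8):
    """Keep only dimension columns filled in at least min_coverage of rows.

    Illustrative sketch; the real multivariate_analysis.py may differ.
    """
    n = len(rows)
    keep = []
    for col in dim_cols:
        filled = sum(1 for r in rows if r.get(col, "").strip())
        if n and filled / n >= min_coverage:
            keep.append(col)
    return keep

rows = [
    {"a": "1", "b": ""},   # shortform reading: dimension "b" left empty
    {"a": "2", "b": "5"},
]
print(filter_columns(rows, ["a", "b"], min_coverage=0.8))  # -> ['a']
```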

## Converting JSON reading files to CSV

If you have a directory of individual bicorder JSON reading files:

```sh
python3 scripts/json_to_csv.py data/readings/manual_20260320/json/ \
  -o data/readings/manual_20260320/readings.csv
```
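In outline, such a conversion might look like the sketch below. It assumes flat, one-reading-per-file JSON objects; the real `json_to_csv.py` may handle nesting and column ordering differently.

```python
import csv
import json
import pathlib

def json_dir_to_csv(json_dir, out_csv):
    """Flatten a directory of one-reading-per-file JSON into a single CSV.

    Sketch only; field handling in the real json_to_csv.py may differ.
    """
    readings = [
        json.loads(p.read_text())
        for p in sorted(pathlib.Path(json_dir).glob("*.json"))
    ]
    if not readings:
        return 0
    # Take the union of keys so shortform readings still fit;
    # missing cells are written as empty strings.
    fields = sorted({k for r in readings for k in r})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(readings)
    return len(readings)
```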

## Process All Protocols with One Command

```sh
python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
```

This will:

1. Create the analysis CSV with gradient columns
2. For each protocol row, query all gradients (each query is a new chat with full protocol context)
3. Update the CSV automatically with the results
4. Show progress and a summary

### Common Options

```sh
# Process only rows 1-5 (useful for testing)
python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv --start 1 --end 5

# Use a specific LLM model
python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv -m mistral

# Add analyst metadata
python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
  -a "Your Name" -s "Your analytical standpoint"
```

## Manual Workflow (Advanced)

### Step 1: Prepare the Analysis CSV

Create a CSV with empty gradient columns:

```sh
python3 scripts/bicorder_analyze.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
```

Optionally, add analyst metadata:

```sh
python3 scripts/bicorder_analyze.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
  -a "Your Name" -s "Your analytical standpoint"
```

### Step 2: Query Gradients for a Protocol Row

Query all gradients for a specific protocol:

```sh
python3 scripts/bicorder_query.py analysis_output.csv 1
```

- Replace `1` with the row number you want to analyze
- Each gradient is queried in a new chat with full protocol context
- Each response is automatically parsed and written to the CSV
- Progress is shown for each gradient
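Pulling a numeric rating out of a model's free-text reply might look like the sketch below; the regex and fallback behavior are assumptions, not the actual parsing in `bicorder_query.py`.

```python
import re

def parse_rating(reply):
    """Extract the first standalone digit 1-9 from an LLM reply, else None.

    Sketch only; the real parsing in bicorder_query.py may differ.
    """
    match = re.search(r"\b([1-9])\b", reply)
    return int(match.group(1)) if match else None

print(parse_rating("I would rate this protocol a 7."))  # -> 7
```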

Optionally, specify a model:

```sh
python3 scripts/bicorder_query.py analysis_output.csv 1 -m mistral
```

### Step 3: Repeat for All Protocols

For each protocol in your CSV:

```sh
python3 scripts/bicorder_query.py analysis_output.csv 1
python3 scripts/bicorder_query.py analysis_output.csv 2
python3 scripts/bicorder_query.py analysis_output.csv 3
# ... and so on

# OR: use scripts/bicorder_batch.py to automate all of this!
```

## Architecture

### How It Works

Each gradient query is sent to the LLM as a new, independent chat. Every query includes:

- The protocol descriptor (name)
- The protocol description
- The gradient definition (left term, right term, and their descriptions)
- Instructions to rate the protocol on a 1-9 scale

This approach:

- **Simplifies the code** - no conversation state to manage
- **Prevents bias** - each evaluation is independent, not influenced by previous responses
- **Enables parallelization** - queries could, in principle, run concurrently
- **Makes debugging easier** - each query/response pair is self-contained
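The per-gradient query described above can be sketched roughly as follows. The wording and the `gradient` dict fields are illustrative assumptions; the actual prompt text lives in `bicorder_query.py`.

```python
def build_prompt(protocol_name, protocol_desc, gradient):
    """Assemble one independent gradient query.

    Illustrative wording only; see bicorder_query.py for the real prompt.
    """
    return (
        f"Protocol: {protocol_name}\n"
        f"Description: {protocol_desc}\n\n"
        f"Gradient: {gradient['left']} (1) vs. {gradient['right']} (9)\n"
        f"  1 = {gradient['left']}: {gradient['left_desc']}\n"
        f"  9 = {gradient['right']}: {gradient['right_desc']}\n\n"
        "Rate this protocol on the 1-9 scale above. "
        "Reply with a single number."
    )

prompt = build_prompt(
    "TCP",
    "A connection-oriented transport protocol.",
    {"left": "Centralized", "left_desc": "control concentrated in one party",
     "right": "Decentralized", "right_desc": "control spread across parties"},
)
print(prompt)
```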

## Tips

### Dry Run Mode

Test prompts without calling the LLM:

```sh
python3 scripts/bicorder_query.py analysis_output.csv 1 --dry-run
```

This shows exactly what prompt will be sent for each gradient, including the full protocol context.

### Check Your Progress

Count how many gradient values are still empty in each row:

```sh
python3 -c "
import csv
with open('analysis_output.csv') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader, 1):
        empty = sum(1 for k, v in row.items() if 'vs' in k and not v)
        print(f'Row {i}: {empty}/23 gradients empty')
"
```

### Batch Processing

Use `scripts/bicorder_batch.py` (see the Process All Protocols with One Command section above) to process multiple protocols automatically.