Reorganize directory, add manual dataset and sync tooling
- Move all scripts to scripts/, web assets to web/, and analysis results into self-contained data/readings/<type>_<YYYYMMDD>/ directories
- Add data/readings/manual_20260320/ with 32 JSON readings from git.medlab.host/ntnsndr/protocol-bicorder-data
- Add scripts/json_to_csv.py to convert bicorder JSON files to CSV
- Add scripts/sync_readings.sh for one-command sync and re-analysis of any dataset backed by a .sync_source config file
- Add scripts/classify_readings.py to apply the LDA classifier to all readings and save per-reading cluster assignments
- Add a --min-coverage flag to multivariate_analysis.py for sparse/shortform datasets; also applied in lda_visualization.py
- Fix lda_visualization.py NaN handling and a 0-d array annotation bug
- Update README.md and WORKFLOW.md to document the datasets, sync workflow, shortform handling, and new scripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -6,16 +6,69 @@ The scripts automatically draw the gradients from the current state of the [bico
## Scripts

-1. **bicorder_batch.py** - **[RECOMMENDED]** Process entire CSV with one command
-2. **bicorder_analyze.py** - Prepares CSV with gradient columns
-3. **bicorder_query.py** - Queries LLM for each gradient value and updates CSV (each query is a new chat)
+### Diagnostic data generation (LLM-based)
-## Quick Start (Recommended)
+1. **scripts/bicorder_batch.py** - **[RECOMMENDED]** Process entire CSV with one command
+2. **scripts/bicorder_analyze.py** - Prepares CSV with gradient columns
+3. **scripts/bicorder_query.py** - Queries LLM for each gradient value and updates CSV (each query is a new chat)

+### Manual / JSON-based readings

+4. **scripts/json_to_csv.py** - Convert a directory of individual bicorder JSON reading files into a `readings.csv`
+5. **scripts/sync_readings.sh** - Sync a readings dataset from a remote git repository, then regenerate the CSV and run analysis (see below)

+### Analysis

+6. **scripts/multivariate_analysis.py** - Run clustering, PCA, correlation, and feature importance analysis on a readings CSV
+7. **scripts/lda_visualization.py** - Generate the LDA cluster separation plot and projection data
+8. **scripts/classify_readings.py** - Apply the synthetic-trained LDA classifier to all readings; saves `analysis/classifications.csv`
+9. **scripts/visualize_clusters.py** - Additional cluster visualizations
+10. **scripts/export_model_for_js.py** - Export the trained model to `bicorder_model.json` for the web classifier

+## Syncing a manual readings dataset

+If the dataset has a `.sync_source` file (e.g., `data/readings/manual_20260320/`), one command handles everything:

+```bash
+scripts/sync_readings.sh data/readings/manual_20260320
+```

+This fetches new JSON files from the remote repo, regenerates `readings.csv`, runs multivariate analysis (with `--min-coverage 0.8` to handle shortform readings), generates the LDA visualization, and saves cluster classifications to `analysis/classifications.csv`.
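The exact `.sync_source` format is not shown in the diff; assuming it simply holds the dataset's remote git URL on a single line (the URL below is the one named in the commit message), it might look like:

```
https://git.medlab.host/ntnsndr/protocol-bicorder-data
```

With that assumption, sync_readings.sh only needs to read the file, pull from that remote, and re-run the conversion and analysis steps listed above.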

+## Running analysis on any readings CSV

+```bash
+# Full analysis pipeline
+python3 scripts/multivariate_analysis.py data/readings/manual_20260320/readings.csv \
+    --min-coverage 0.8 \
+    --analyses clustering pca correlation importance

+# LDA visualization (cluster separation plot)
+python3 scripts/lda_visualization.py data/readings/manual_20260320/readings.csv

+# Classify all readings (uses synthetic dataset as training data by default)
+python3 scripts/classify_readings.py data/readings/manual_20260320/readings.csv
+```

+Use `--min-coverage` (0.0–1.0) to drop dimension columns below the given coverage fraction before analysis. This is important for datasets with many shortform readings, where most dimensions are sparsely filled.
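The coverage filter is easy to reason about in isolation. A minimal pure-Python sketch of the idea (not the actual multivariate_analysis.py implementation; the dimension names are invented for illustration):

```python
def filter_by_coverage(rows, min_coverage=0.8):
    """Drop columns whose non-empty fraction across rows is below min_coverage."""
    if not rows:
        return rows
    n = len(rows)
    keep = [
        col for col in rows[0]
        if sum(1 for r in rows if r[col] not in ("", None)) / n >= min_coverage
    ]
    return [{col: r[col] for col in keep} for r in rows]

# Three readings: "openness" is filled in all 3 (coverage 1.0),
# "formality" only in 1 of 3 (coverage ~0.33), so it is dropped at 0.8.
readings = [
    {"openness": "4", "formality": "2"},
    {"openness": "5", "formality": ""},
    {"openness": "3", "formality": ""},
]
filtered = filter_by_coverage(readings, min_coverage=0.8)
print(sorted(filtered[0].keys()))  # ['openness']
```

Shortform readings thus survive intact as rows; only the sparsely-filled dimension columns disappear.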

+## Converting JSON reading files to CSV

+If you have a directory of individual bicorder JSON reading files:

+```bash
+python3 scripts/json_to_csv.py data/readings/manual_20260320/json/ \
+    -o data/readings/manual_20260320/readings.csv
+```

+---

+## Quick Start (Recommended, LLM-based)

### Process All Protocols with One Command

```bash
-python3 bicorder_batch.py protocols_edited.csv -o analysis_output.csv
+python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
```

This will:
@@ -28,13 +81,13 @@ This will:
```bash
# Process only rows 1-5 (useful for testing)
-python3 bicorder_batch.py protocols_edited.csv -o analysis_output.csv --start 1 --end 5
+python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv --start 1 --end 5

# Use specific LLM model
-python3 bicorder_batch.py protocols_edited.csv -o analysis_output.csv -m mistral
+python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv -m mistral

# Add analyst metadata
-python3 bicorder_batch.py protocols_edited.csv -o analysis_output.csv \
+python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
    -a "Your Name" -s "Your analytical standpoint"
```
@@ -47,12 +100,12 @@ python3 bicorder_batch.py protocols_edited.csv -o analysis_output.csv \
Create a CSV with empty gradient columns:

```bash
-python3 bicorder_analyze.py protocols_edited.csv -o analysis_output.csv
+python3 scripts/bicorder_analyze.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
```

Optional: Add analyst metadata:

```bash
-python3 bicorder_analyze.py protocols_edited.csv -o analysis_output.csv \
+python3 scripts/bicorder_analyze.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
    -a "Your Name" -s "Your analytical standpoint"
```
@@ -61,7 +114,7 @@ python3 bicorder_analyze.py protocols_edited.csv -o analysis_output.csv \
Query all gradients for a specific protocol:

```bash
-python3 bicorder_query.py analysis_output.csv 1
+python3 scripts/bicorder_query.py analysis_output.csv 1
```

- Replace `1` with the row number you want to analyze
@@ -71,7 +124,7 @@ python3 bicorder_query.py analysis_output.csv 1
Optional: Specify a model:

```bash
-python3 bicorder_query.py analysis_output.csv 1 -m mistral
+python3 scripts/bicorder_query.py analysis_output.csv 1 -m mistral
```
### Step 3: Repeat for All Protocols
@@ -79,12 +132,12 @@ python3 bicorder_query.py analysis_output.csv 1 -m mistral
For each protocol in your CSV:

```bash
-python3 bicorder_query.py analysis_output.csv 1
-python3 bicorder_query.py analysis_output.csv 2
-python3 bicorder_query.py analysis_output.csv 3
+python3 scripts/bicorder_query.py analysis_output.csv 1
+python3 scripts/bicorder_query.py analysis_output.csv 2
+python3 scripts/bicorder_query.py analysis_output.csv 3
# ... and so on

-# OR: Use bicorder_batch.py to automate all of this!
+# OR: Use scripts/bicorder_batch.py to automate all of this!
```
## Architecture
@@ -110,7 +163,7 @@ This approach:
Test prompts without calling the LLM:

```bash
-python3 bicorder_query.py analysis_output.csv 1 --dry-run
+python3 scripts/bicorder_query.py analysis_output.csv 1 --dry-run
```
This shows you exactly what prompt will be sent for each gradient, including the full protocol context.
@@ -132,5 +185,5 @@ with open('analysis_output.csv') as f:
### Batch Processing

-Use the `bicorder_batch.py` script (see Quick Start section above) for processing multiple protocols.
+Use the `scripts/bicorder_batch.py` script (see Quick Start section above) for processing multiple protocols.