Set up analysis scripts

analysis/TEST_COMMANDS.md

# Test Commands for Refactored Bicorder

Run these tests in order to verify the refactored code works correctly.

## Test 1: Dry Run - Single Protocol

Test that prompts are generated correctly with protocol context:

```bash
python3 bicorder_query.py protocols_edited.csv 1 --dry-run | head -80
```

**Expected result:**
- Should show "DRY RUN: Row 1, 23 gradients"
- Should show protocol descriptor and description
- Each prompt should include full protocol context
- Should show 23 gradient prompts
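
If you prefer a scripted check over eyeballing the output, here is a minimal sketch that greps for the banner line quoted above (the exact wording is taken from the expected output and is assumed to match the tool verbatim):

```bash
# Look for the dry-run banner; the quoted string is assumed to appear verbatim
python3 bicorder_query.py protocols_edited.csv 1 --dry-run | grep "DRY RUN: Row 1, 23 gradients" \
  && echo "banner OK"
```
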
## Test 2: Verify CSV Structure

Check that the analyze script still creates proper CSV structure:

```bash
python3 bicorder_analyze.py protocols_edited.csv -o test_output.csv
head -1 test_output.csv | tr ',' '\n' | grep -E "(explicit|precise|elite)" | head -5
```

**Expected result:**
- Should show gradient column names like:
  - Design_explicit_vs_implicit
  - Design_precise_vs_interpretive
  - Design_elite_vs_vernacular
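
To count every gradient column rather than spot-checking three, a small sketch; it assumes each of the 23 gradients maps to exactly one column whose name contains `_vs_`, the same convention the batch check in Test 5 relies on:

```bash
# Count header columns that look like gradient columns; 23 is the assumed total
head -1 test_output.csv | tr ',' '\n' | grep -c "_vs_"
```
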
## Test 3: Single Gradient Query (Real LLM Call)

Query just one protocol to test the full pipeline:

```bash
python3 bicorder_query.py test_output.csv 1 -m gpt-4o-mini
```

**Expected result:**
- Should show "Protocol: [name]"
- Should show "[1/23] Querying: Design explicit vs implicit..."
- Should complete all 23 gradients
- Should show "✓ CSV updated: test_output.csv"
- Each gradient should show a value 1-9

**Verify the output:**
```bash
# Check that values were written
head -2 test_output.csv | tail -1 | tr ',' '\n' | tail -25 | head -5
```
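
To confirm that every written value is actually an integer in the 1-9 range (not just present), a sketch along the lines of the batch check in Test 5; the `_vs_` column-name convention is assumed:

```bash
# Check that row 1's gradient values are integers between 1 and 9
python3 -c "
import csv
with open('test_output.csv') as f:
    row = next(csv.DictReader(f))
bad = [k for k in row if '_vs_' in k and not (row[k].isdigit() and 1 <= int(row[k]) <= 9)]
print('all gradient values in 1-9' if not bad else f'out-of-range or empty: {bad}')
"
```
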
## Test 4: Check for No Conversation State

Verify that the tool doesn't create any conversation files:

```bash
# Before running test
llm logs list | grep -i bicorder

# Run a query
python3 bicorder_query.py test_output.csv 2 -m gpt-4o-mini

# After running test
llm logs list | grep -i bicorder
```

**Expected result:**
- Should not see any "bicorder_row_*" or similar conversation IDs
- Each query should be independent
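
For a pass/fail version of the same check, a minimal sketch that looks only for the old per-row naming scheme mentioned above ("bicorder_row_*"); it assumes those IDs would show up verbatim in the `llm logs list` output:

```bash
# Fail loudly if any per-row conversation IDs appear in the logs
if llm logs list | grep -q "bicorder_row"; then
  echo "FAIL: found bicorder_row conversation IDs"
else
  echo "OK: no bicorder_row conversation IDs"
fi
```
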
## Test 5: Batch Processing (Small Set)

Test batch processing on rows 1-3:

```bash
python3 bicorder_batch.py protocols_edited.csv -o test_batch_output.csv --start 1 --end 3 -m gpt-4o-mini
```

**Expected result:**
- Should process 3 protocols
- Should show progress for each row
- Should show "Successful: 3" at the end
- No mention of "initializing conversation"

**Verify outputs:**
```bash
# Check that all 3 rows have values
python3 -c "
import csv
with open('test_batch_output.csv') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader, 1):
        if i > 3:
            break
        gradient_cols = [k for k in row.keys() if '_vs_' in k]
        filled = sum(1 for k in gradient_cols if row[k])
        print(f'Row {i}: {filled}/23 gradients filled')
"
```

## Test 6: Dry Run with Different Model

Test that the model parameter is accepted in a dry run:

```bash
python3 bicorder_query.py protocols_edited.csv 5 --dry-run -m mistral | head -50
```

**Expected result:**
- Should show prompts (the model doesn't matter in a dry run, but the flag should be accepted)
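
A scripted version of the same acceptance check, under the assumption that the command exits with status 0 when the flag is parsed successfully (the instructions above don't state the exit-code behaviour):

```bash
# Treat a zero exit status as "flag accepted"; this exit-code convention is an assumption
python3 bicorder_query.py protocols_edited.csv 5 --dry-run -m mistral > /dev/null \
  && echo "-m flag accepted in dry run"
```
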
## Test 7: Error Handling

Test with an invalid row number:

```bash
python3 bicorder_query.py test_output.csv 999
```

**Expected result:**
- Should show the error: "Error: Row 999 not found in CSV"
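
A scripted variant that just greps for the documented message; redirecting stderr into the pipe is an assumption, in case the tool prints errors there rather than to stdout:

```bash
# Look for the documented error message on either output stream
python3 bicorder_query.py test_output.csv 999 2>&1 | grep -q "Row 999 not found" \
  && echo "error message shown as expected"
```
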
## Test 8: Compare Prompt Structure

Compare the new standalone prompts with the old system-prompt approach:

```bash
# New approach - protocol context in each prompt
python3 bicorder_query.py protocols_edited.csv 1 --dry-run | grep -A 5 "Analyze this protocol"

# The old approach put the protocol in the system prompt only (no longer used)
# Verify that protocol context appears in EVERY gradient prompt
python3 bicorder_query.py protocols_edited.csv 1 --dry-run | grep -c "Analyze this protocol"
```

**Expected result:**
- Should show "23" (protocol context appears in all 23 prompts)

## Cleanup

Remove test files:

```bash
rm -f test_output.csv test_batch_output.csv
```

## Success Criteria

✅ All 23 gradients queried for each protocol
✅ No conversation IDs created or referenced
✅ Protocol context included in every prompt
✅ CSV values properly written (1-9)
✅ Batch processing works without initialization step
✅ Error handling works correctly
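
If you want to run the offline checks in one pass, here is a rough smoke-test sketch built only from the commands above; it skips the real API calls in Tests 3-5, and the `grep` strings, the `_vs_` column convention, and the `2>&1` redirection are assumptions carried over from the earlier tests:

```bash
#!/usr/bin/env bash
# Rough offline smoke test assembled from the commands above; API-calling tests (3-5) are skipped

# Tests 1 and 8: every gradient prompt should carry protocol context
count=$(python3 bicorder_query.py protocols_edited.csv 1 --dry-run | grep -c "Analyze this protocol")
echo "protocol-context prompts: $count (expected 23)"

# Test 2: the analyze script should emit 23 gradient columns (the _vs_ naming is assumed)
python3 bicorder_analyze.py protocols_edited.csv -o test_output.csv
cols=$(head -1 test_output.csv | tr ',' '\n' | grep -c "_vs_")
echo "gradient columns: $cols (expected 23)"

# Test 7: an invalid row should produce the documented error message
python3 bicorder_query.py test_output.csv 999 2>&1 | grep -q "Row 999 not found" \
  && echo "invalid-row error reported"

# Cleanup
rm -f test_output.csv
```
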