Set up analysis scripts

analysis/TEST_COMMANDS.md

# Test Commands for Refactored Bicorder

Run these tests in order to verify the refactored code works correctly.

## Test 1: Dry Run - Single Protocol

Test that prompts are generated correctly with protocol context:

```bash
python3 bicorder_query.py protocols_edited.csv 1 --dry-run | head -80
```

**Expected result:**
- Should show "DRY RUN: Row 1, 23 gradients"
- Should show protocol descriptor and description
- Each prompt should include full protocol context
- Should show 23 gradient prompts
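
If you prefer a scripted check over eyeballing the output, here is a minimal sketch that greps for the banner line quoted above (the exact wording is taken from the expected output and is assumed to match the tool verbatim):

```bash
# Look for the dry-run banner; the quoted string is assumed to appear verbatim
python3 bicorder_query.py protocols_edited.csv 1 --dry-run | grep "DRY RUN: Row 1, 23 gradients" \
  && echo "banner OK"
```
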
## Test 2: Verify CSV Structure

Check that the analyze script still creates proper CSV structure:

```bash
python3 bicorder_analyze.py protocols_edited.csv -o test_output.csv
head -1 test_output.csv | tr ',' '\n' | grep -E "(explicit|precise|elite)" | head -5
```

**Expected result:**
- Should show gradient column names like:
  - Design_explicit_vs_implicit
  - Design_precise_vs_interpretive
  - Design_elite_vs_vernacular
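
To count every gradient column rather than spot-checking three, a small sketch; it assumes each of the 23 gradients maps to exactly one column whose name contains `_vs_`, the same convention the batch check in Test 5 relies on:

```bash
# Count header columns that look like gradient columns; 23 is the assumed total
head -1 test_output.csv | tr ',' '\n' | grep -c "_vs_"
```
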
## Test 3: Single Gradient Query (Real LLM Call)

Query just one protocol to test the full pipeline:

```bash
python3 bicorder_query.py test_output.csv 1 -m gpt-4o-mini
```

**Expected result:**
- Should show "Protocol: [name]"
- Should show "[1/23] Querying: Design explicit vs implicit..."
- Should complete all 23 gradients
- Should show "✓ CSV updated: test_output.csv"
- Each gradient should show a value 1-9

**Verify the output:**
```bash
# Check that values were written
head -2 test_output.csv | tail -1 | tr ',' '\n' | tail -25 | head -5
```
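
To confirm that every written value is actually an integer in the 1-9 range (not just present), a sketch along the lines of the batch check in Test 5; the `_vs_` column-name convention is assumed:

```bash
# Check that row 1's gradient values are integers between 1 and 9
python3 -c "
import csv
with open('test_output.csv') as f:
    row = next(csv.DictReader(f))
bad = [k for k in row if '_vs_' in k and not (row[k].isdigit() and 1 <= int(row[k]) <= 9)]
print('all gradient values in 1-9' if not bad else f'out-of-range or empty: {bad}')
"
```
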
## Test 4: Check for No Conversation State

Verify that the tool doesn't create any conversation files:

```bash
# Before running test
llm logs list | grep -i bicorder

# Run a query
python3 bicorder_query.py test_output.csv 2 -m gpt-4o-mini

# After running test
llm logs list | grep -i bicorder
```

**Expected result:**
- Should not see any "bicorder_row_*" or similar conversation IDs
- Each query should be independent
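
For a pass/fail version of the same check, a minimal sketch that looks only for the old per-row naming scheme mentioned above ("bicorder_row_*"); it assumes those IDs would show up verbatim in the `llm logs list` output:

```bash
# Fail loudly if any per-row conversation IDs appear in the logs
if llm logs list | grep -q "bicorder_row"; then
  echo "FAIL: found bicorder_row conversation IDs"
else
  echo "OK: no bicorder_row conversation IDs"
fi
```
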
## Test 5: Batch Processing (Small Set)

Test batch processing on rows 1-3:

```bash
python3 bicorder_batch.py protocols_edited.csv -o test_batch_output.csv --start 1 --end 3 -m gpt-4o-mini
```

**Expected result:**
- Should process 3 protocols
- Should show progress for each row
- Should show "Successful: 3" at the end
- No mention of "initializing conversation"

**Verify outputs:**
```bash
# Check that all 3 rows have values
python3 -c "
import csv
with open('test_batch_output.csv') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader, 1):
        if i > 3:
            break
        gradient_cols = [k for k in row.keys() if '_vs_' in k]
        filled = sum(1 for k in gradient_cols if row[k])
        print(f'Row {i}: {filled}/23 gradients filled')
"
```

## Test 6: Dry Run with Different Model

Test that the model parameter is accepted in a dry run:

```bash
python3 bicorder_query.py protocols_edited.csv 5 --dry-run -m mistral | head -50
```

**Expected result:**
- Should show prompts (the model doesn't matter in a dry run, but the flag should be accepted)
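
A scripted version of the same acceptance check, under the assumption that the command exits with status 0 when the flag is parsed successfully (the instructions above don't state the exit-code behaviour):

```bash
# Treat a zero exit status as "flag accepted"; this exit-code convention is an assumption
python3 bicorder_query.py protocols_edited.csv 5 --dry-run -m mistral > /dev/null \
  && echo "-m flag accepted in dry run"
```
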
## Test 7: Error Handling

Test with an invalid row number:

```bash
python3 bicorder_query.py test_output.csv 999
```

**Expected result:**
- Should show the error: "Error: Row 999 not found in CSV"
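
A scripted variant that just greps for the documented message; redirecting stderr into the pipe is an assumption, in case the tool prints errors there rather than to stdout:

```bash
# Look for the documented error message on either output stream
python3 bicorder_query.py test_output.csv 999 2>&1 | grep -q "Row 999 not found" \
  && echo "error message shown as expected"
```
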
## Test 8: Compare Prompt Structure

Compare the new standalone prompts with the old system-prompt approach:

```bash
# New approach - protocol context in each prompt
python3 bicorder_query.py protocols_edited.csv 1 --dry-run | grep -A 5 "Analyze this protocol"

# The old approach put the protocol in the system prompt only (no longer used)
# Verify that protocol context appears in EVERY gradient prompt
python3 bicorder_query.py protocols_edited.csv 1 --dry-run | grep -c "Analyze this protocol"
```

**Expected result:**
- Should show "23" (protocol context appears in all 23 prompts)

## Cleanup

Remove test files:

```bash
rm -f test_output.csv test_batch_output.csv
```

## Success Criteria

✅ All 23 gradients queried for each protocol
✅ No conversation IDs created or referenced
✅ Protocol context included in every prompt
✅ CSV values properly written (1-9)
✅ Batch processing works without initialization step
✅ Error handling works correctly
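
If you want to run the offline checks in one pass, here is a rough smoke-test sketch built only from the commands above; it skips the real API calls in Tests 3-5, and the `grep` strings, the `_vs_` column convention, and the `2>&1` redirection are assumptions carried over from the earlier tests:

```bash
#!/usr/bin/env bash
# Rough offline smoke test assembled from the commands above; API-calling tests (3-5) are skipped

# Tests 1 and 8: every gradient prompt should carry protocol context
count=$(python3 bicorder_query.py protocols_edited.csv 1 --dry-run | grep -c "Analyze this protocol")
echo "protocol-context prompts: $count (expected 23)"

# Test 2: the analyze script should emit 23 gradient columns (the _vs_ naming is assumed)
python3 bicorder_analyze.py protocols_edited.csv -o test_output.csv
cols=$(head -1 test_output.csv | tr ',' '\n' | grep -c "_vs_")
echo "gradient columns: $cols (expected 23)"

# Test 7: an invalid row should produce the documented error message
python3 bicorder_query.py test_output.csv 999 2>&1 | grep -q "Row 999 not found" \
  && echo "invalid-row error reported"

# Cleanup
rm -f test_output.csv
```
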