Flatten data/readings/ → data/

Remove the intermediate readings/ subdirectory level — dataset naming
(synthetic_YYYYMMDD, manual_YYYYMMDD) already encodes what the data is.
Update all path references across scripts and docs accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -12,7 +12,7 @@ This guide explains how to integrate the cluster classification system into the
 
 **Version-based compatibility**: The model includes a `bicorder_version` field. The classifier checks that versions match. When bicorder.json structure changes:
 1. Increment the version number in bicorder.json
-2. Retrain the model with `python3 scripts/export_model_for_js.py data/readings/synthetic_20251116/readings.csv`
+2. Retrain the model with `python3 scripts/export_model_for_js.py data/synthetic_20251116/readings.csv`
 3. The new model will have the updated version
 
 This ensures the web app and model stay in sync without complex backward compatibility.
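As an aside on the mechanism this hunk documents: the version gate amounts to a strict equality check at load time. The sketch below is a hypothetical illustration, not code from the repository; the function name, file layout, and the assumption that bicorder.json exposes its version under a `version` key are all mine.

```python
import json

def load_model(model_path: str, bicorder_path: str) -> dict:
    """Load the exported model, refusing to run against a mismatched bicorder.json.

    Illustrative only: assumes the exported model JSON carries a top-level
    `bicorder_version` and bicorder.json a top-level `version` key.
    """
    with open(model_path) as f:
        model = json.load(f)
    with open(bicorder_path) as f:
        bicorder = json.load(f)
    # Strict equality: any structural change to bicorder.json bumps the version,
    # so a stale model fails fast instead of classifying against the wrong schema.
    if model.get("bicorder_version") != bicorder.get("version"):
        raise ValueError(
            f"Model trained against bicorder version {model.get('bicorder_version')}, "
            f"but bicorder.json is {bicorder.get('version')}; "
            "retrain with export_model_for_js.py."
        )
    return model
```

The strictness is the point: no backward-compatibility shims, just retrain and re-export.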
@@ -25,7 +25,7 @@ This ensures the web app and model stay in sync without complex backward compati
 The model is the only artifact produced by this analysis directory that the app consumes. Regenerate it after re-running analysis on the synthetic dataset:
 
 ```bash
-python3 scripts/export_model_for_js.py data/readings/synthetic_20251116/readings.csv
+python3 scripts/export_model_for_js.py data/synthetic_20251116/readings.csv
 ```
 
 ## Quick Start
@@ -6,26 +6,26 @@ Scripts were created with the assistance of Claude Code. Data processing was don
 
 ## Datasets
 
-Readings are organized under `data/readings/<type>_<YYYYMMDD>/`, each self-contained with its own `readings.csv`, `analysis/`, and `json/` subdirectories:
+Readings are organized under `data/<type>_<YYYYMMDD>/`, each self-contained with its own `readings.csv`, `analysis/`, and `json/` subdirectories:
 
-- **`data/readings/synthetic_20251116/`** — 411 protocols from synthetic LLM-generated readings (see detailed procedure below)
-- **`data/readings/manual_20260320/`** — manual readings collected at [git.medlab.host/ntnsndr/protocol-bicorder-data](https://git.medlab.host/ntnsndr/protocol-bicorder-data), continuously expanding
+- **`data/synthetic_20251116/`** — 411 protocols from synthetic LLM-generated readings (see detailed procedure below)
+- **`data/manual_20260320/`** — manual readings collected at [git.medlab.host/ntnsndr/protocol-bicorder-data](https://git.medlab.host/ntnsndr/protocol-bicorder-data), continuously expanding
 
 ### Syncing the manual dataset
 
 The manual dataset is kept current via a `.sync_source` config file and a one-command sync script:
 
 ```bash
-scripts/sync_readings.sh data/readings/manual_20260320
+scripts/sync_readings.sh data/manual_20260320
 ```
 
 This clones the remote repository, copies JSON reading files, regenerates `readings.csv`, runs multivariate analysis (filtering to well-covered dimensions), generates an LDA visualization, and saves per-reading cluster classifications to `analysis/classifications.csv`.
 
 Options:
 ```bash
-scripts/sync_readings.sh data/readings/manual_20260320 --min-coverage 0.8 # default
-scripts/sync_readings.sh data/readings/manual_20260320 --no-analysis # sync JSON only
-scripts/sync_readings.sh data/readings/manual_20260320 --training data/readings/synthetic_20251116/readings.csv
+scripts/sync_readings.sh data/manual_20260320 --min-coverage 0.8 # default
+scripts/sync_readings.sh data/manual_20260320 --no-analysis # sync JSON only
+scripts/sync_readings.sh data/manual_20260320 --training data/synthetic_20251116/readings.csv
 ```
 
 ### Handling shortform readings
@@ -59,7 +59,7 @@ context: "model running on ollama locally, accessed with llm on the command line
 prompt: "Return csv-formatted data (with no markdown wrapper) that consists of a list of protocols discussed or referred to in the attached text. Protocols are defined extremely broadly as 'patterns of interaction,' and may be of a nontechnical nature. Protocols should be as specific as possible, such as 'Sacrament of Reconciliation' rather than 'Religious Protocols.' The first column should provide a brief descriptor of the protocol, and the second column should describe it in a substantial paragraph of 3-5 sentences, encapsulated in quotation marks to avoid breaking on commas. Be sure to paraphrase rather than quoting directly from the source text."
 ```
 
-The result was a CSV-formatted list of protocols (`data/readings/synthetic_20251116/protocols_raw.csv`, n=774 total protocols listed).
+The result was a CSV-formatted list of protocols (`data/synthetic_20251116/protocols_raw.csv`, n=774 total protocols listed).
 
 ### Dataset cleaning
 
@@ -74,7 +74,7 @@ The dataset was then manually reviewed. The review involved the following:
 
 The cleaning process was carried out in a subjective manner, so some entries that meet the above criteria may remain in the dataset. The dataset also appears to include some LLM hallucinations---that is, protocols not in the texts---but the hallucinations are often acceptable examples, and so some have been retained. Some degree of noise in the dataset was considered acceptable for the purposes of the study. Some degree of repetition, also, provides the dataset with a kind of control case for evaluating the diagnostic process.
 
-The result was a CSV-formatted list of protocols (`data/readings/synthetic_20251116/protocols_edited.csv`, n=411).
+The result was a CSV-formatted list of protocols (`data/synthetic_20251116/protocols_edited.csv`, n=411).
 
 
 ### Initial diagnostic
@@ -83,24 +83,24 @@ This diagnostic used the file now at `bicorder_analyzed.json`, though the script
 
 For each row in the dataset, and on each gradient, a series of scripts prompts the LLM to apply each gradient to the protocol. The outputs are then added to a CSV output file.
 
-The result was a CSV-formatted list of protocols (`data/readings/synthetic_20251116/readings.csv`, n=411).
+The result was a CSV-formatted list of protocols (`data/synthetic_20251116/readings.csv`, n=411).
 
 See detailed documentation of the scripts at `WORKFLOW.md`.
 
 ### Manual and alternate model audit
 
-To test the output, a manual review of the first 10 protocols in the `data/readings/synthetic_20251116/protocols_edited.csv` dataset was produced in the file `data/readings/synthetic_20251116/readings_manual.csv`. (Alphabetization in this case seems a reasonable proxy for a random sample of protocols. It includes some partially overlapping protocols, as does the dataset as a whole.) Additionally, three models were tested on the same cases:
+To test the output, a manual review of the first 10 protocols in the `data/synthetic_20251116/protocols_edited.csv` dataset was produced in the file `data/synthetic_20251116/readings_manual.csv`. (Alphabetization in this case seems a reasonable proxy for a random sample of protocols. It includes some partially overlapping protocols, as does the dataset as a whole.) Additionally, three models were tested on the same cases:
 
 ```bash
-python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o data/readings/synthetic_20251116/readings_mistral.csv -m mistral -a "Mistral" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
+python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o data/synthetic_20251116/readings_mistral.csv -m mistral -a "Mistral" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
 ```
 
 ```bash
-python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o data/readings/synthetic_20251116/readings_gpt-oss.csv -m gpt-oss -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
+python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o data/synthetic_20251116/readings_gpt-oss.csv -m gpt-oss -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
 ```
 
 ```bash
-python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o data/readings/synthetic_20251116/readings_gemma3-12b.csv -m gemma3:12b -a "Gemma3:12b" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
+python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o data/synthetic_20251116/readings_gemma3-12b.csv -m gemma3:12b -a "Gemma3:12b" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
 ```
 
 A Euclidean distance analysis (`python3 scripts/compare_analyses.py`) found that the `gpt-oss` model was closer to the manual example than the others. It was therefore selected to be the model used for conducting the bicorder diagnostic on the dataset.
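The comparison step described above can be illustrated with a minimal average-Euclidean-distance sketch between two readings files. This is not the actual `compare_analyses.py`: the function name, the pairing of rows by order, and the explicit list of gradient columns are assumptions for illustration.

```python
import csv
import math

def avg_euclidean_distance(path_a: str, path_b: str, gradient_cols: list[str]) -> float:
    """Average per-row Euclidean distance between two readings CSVs,
    computed over the given gradient columns, with rows paired by order.

    Illustrative sketch only, not the repository's compare_analyses.py.
    """
    with open(path_a) as fa, open(path_b) as fb:
        rows_a = list(csv.DictReader(fa))
        rows_b = list(csv.DictReader(fb))
    distances = []
    for ra, rb in zip(rows_a, rows_b):
        # Squared difference on each gradient, then the usual square root.
        sq = sum((float(ra[c]) - float(rb[c])) ** 2 for c in gradient_cols)
        distances.append(math.sqrt(sq))
    return sum(distances) / len(distances)
```

A lower average distance to `readings_manual.csv` is what favored `gpt-oss` over the other models.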
@@ -112,13 +112,13 @@ Average Euclidean Distance:
 3. readings_mistral.csv - Avg Distance: 13.33
 ```
 
-Command used to produce `data/readings/synthetic_20251116/readings.csv` (using the Ollama cloud service for the `gpt-oss` model):
+Command used to produce `data/synthetic_20251116/readings.csv` (using the Ollama cloud service for the `gpt-oss` model):
 
 ```bash
-python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o data/readings/synthetic_20251116/readings.csv -m gpt-oss:20b-cloud -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision"
+python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o data/synthetic_20251116/readings.csv -m gpt-oss:20b-cloud -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision"
 ```
 
-The result was a CSV-formatted list of protocols (`data/readings/synthetic_20251116/readings.csv`, n=411).
+The result was a CSV-formatted list of protocols (`data/synthetic_20251116/readings.csv`, n=411).
 
 ### Further analysis
 
@@ -126,7 +126,7 @@ The result was a CSV-formatted list of protocols (`data/readings/synthetic_20251
 
 Per-protocol values are meaningful for the bicorder because, despite varying levels of appropriateness, all of the gradients are structured as ranging from "hardness" to "softness"---with lower values associated with greater rigidity. The average value for a given protocol, therefore, provides a rough sense of the protocol's hardness.
 
-Basic averages appear in `data/readings/synthetic_20251116/readings-analysis.ods`.
+Basic averages appear in `data/synthetic_20251116/readings-analysis.ods`.
 
 #### Univariate analysis
 
@@ -136,7 +136,7 @@ First, a plot of average values for each protocol:
 
 This reveals a linear distribution of values among the protocols, aside from exponential curves only at the extremes. Perhaps the most interesting finding is a skew toward the higher end of the scale, associated with softness. Even relatively hard, technical protocols appear to have significant soft characteristics.
 
-The protocol value averages have a mean of 5.45 and a median of 5.48. In comparison to the midpoint of 5, the normalized midpoint deviation is 0.11. By contrast, the Pearson coefficient measures the skew at just -0.07, which means that the relative skew of the data is actually slightly downward. So the distribution of protocol values is very balanced but has a consistent upward deviation from the scale's baseline. (These calculations are in `data/readings/synthetic_20251116/readings-analysis.odt[averages]`.)
+The protocol value averages have a mean of 5.45 and a median of 5.48. In comparison to the midpoint of 5, the normalized midpoint deviation is 0.11. By contrast, the Pearson coefficient measures the skew at just -0.07, which means that the relative skew of the data is actually slightly downward. So the distribution of protocol values is very balanced but has a consistent upward deviation from the scale's baseline. (These calculations are in `data/synthetic_20251116/readings-analysis.odt[averages]`.)
 
 Second, a plot of average values for each gradient (with gaps to indicate the three groupings of gradients):
 
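The summary statistics quoted in this hunk can be reproduced with the standard library. The spreadsheet's exact formulas are not shown, so the definitions below (mean deviation from the scale midpoint normalized by half the scale range, and Pearson's second skewness coefficient) are plausible readings of the terms used, not confirmed implementations.

```python
import statistics

def midpoint_deviation(values: list[float], midpoint: float, half_range: float) -> float:
    """Deviation of the mean from the scale midpoint, normalized by half the
    scale range. (An assumed reading of 'normalized midpoint deviation'.)"""
    return (statistics.mean(values) - midpoint) / half_range

def pearson_skew(values: list[float]) -> float:
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev.
    Negative when the mean sits below the median, as with the reported -0.07."""
    return 3 * (statistics.mean(values) - statistics.median(values)) / statistics.stdev(values)
```

With mean 5.45 and median 5.48, the mean-minus-median term is negative, matching the sign of the reported coefficient.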
@@ -165,21 +165,21 @@ Claude Code created a `multivariate_analysis.py` tool to conduct this analysis.
 
 ```bash
 # Run all analyses (default)
-python3 scripts/multivariate_analysis.py data/readings/synthetic_20251116/readings.csv
+python3 scripts/multivariate_analysis.py data/synthetic_20251116/readings.csv
 
 # Run specific analyses only
-python3 scripts/multivariate_analysis.py data/readings/synthetic_20251116/readings.csv --analyses
+python3 scripts/multivariate_analysis.py data/synthetic_20251116/readings.csv --analyses
 clustering pca
 ```
 
 Initial manual observations:
 
 * The correlations generally seem predictable; for example, the strongest is between `Design_static_vs_malleable` and `Experience_predictable_vs_emergent`, which is not surprising
-* The elite vs. vernacular distinction appears to be the most predictive gradient (`data/readings/synthetic_20251116/analysis/plots/feature_importances.png`)
+* The elite vs. vernacular distinction appears to be the most predictive gradient (`data/synthetic_20251116/analysis/plots/feature_importances.png`)
 
 
 
 
 Claude's interpretation:
 
@@ -7,7 +7,7 @@ Run these tests in order to verify the refactored code works correctly.
 Test that prompts are generated correctly with protocol context:
 
 ```bash
-python3 scripts/bicorder_query.py data/readings/synthetic_20251116/protocols_edited.csv 1 --dry-run | head -80
+python3 scripts/bicorder_query.py data/synthetic_20251116/protocols_edited.csv 1 --dry-run | head -80
 ```
 
 **Expected result:**
@@ -21,7 +21,7 @@ python3 scripts/bicorder_query.py data/readings/synthetic_20251116/protocols_edi
 Check that the analyze script still creates proper CSV structure:
 
 ```bash
-python3 scripts/bicorder_analyze.py data/readings/synthetic_20251116/protocols_edited.csv -o test_output.csv
+python3 scripts/bicorder_analyze.py data/synthetic_20251116/protocols_edited.csv -o test_output.csv
 head -1 test_output.csv | tr ',' '\n' | grep -E "(explicit|precise|elite)" | head -5
 ```
 
@@ -76,7 +76,7 @@ llm logs list | grep -i bicorder
 Test batch processing on rows 1-3:
 
 ```bash
-python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o test_batch_output.csv --start 1 --end 3 -m gpt-4o-mini
+python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o test_batch_output.csv --start 1 --end 3 -m gpt-4o-mini
 ```
 
 **Expected result:**
@@ -106,7 +106,7 @@ with open('test_batch_output.csv') as f:
 Test that model parameter works in dry run:
 
 ```bash
-python3 scripts/bicorder_query.py data/readings/synthetic_20251116/protocols_edited.csv 5 --dry-run -m mistral | head -50
+python3 scripts/bicorder_query.py data/synthetic_20251116/protocols_edited.csv 5 --dry-run -m mistral | head -50
 ```
 
 **Expected result:**
@@ -129,11 +129,11 @@ Compare the new standalone prompts vs old system prompt approach:
 
 ```bash
 # New approach - protocol context in each prompt
-python3 scripts/bicorder_query.py data/readings/synthetic_20251116/protocols_edited.csv 1 --dry-run | grep -A 5 "Analyze this protocol"
+python3 scripts/bicorder_query.py data/synthetic_20251116/protocols_edited.csv 1 --dry-run | grep -A 5 "Analyze this protocol"
 
 # Old approach would have had protocol in system prompt only (no longer used)
 # Verify that protocol context appears in EVERY gradient prompt
-python3 scripts/bicorder_query.py data/readings/synthetic_20251116/protocols_edited.csv 1 --dry-run | grep -c "Analyze this protocol"
+python3 scripts/bicorder_query.py data/synthetic_20251116/protocols_edited.csv 1 --dry-run | grep -c "Analyze this protocol"
 ```
 
 **Expected result:**
@@ -27,10 +27,10 @@ The scripts automatically draw the gradients from the current state of the [bico
 
 ## Syncing a manual readings dataset
 
-If the dataset has a `.sync_source` file (e.g., `data/readings/manual_20260320/`), one command handles everything:
+If the dataset has a `.sync_source` file (e.g., `data/manual_20260320/`), one command handles everything:
 
 ```bash
-scripts/sync_readings.sh data/readings/manual_20260320
+scripts/sync_readings.sh data/manual_20260320
 ```
 
 This fetches new JSON files from the remote repo, regenerates `readings.csv`, runs multivariate analysis (with `--min-coverage 0.8` to handle shortform readings), generates the LDA visualization, and saves cluster classifications to `analysis/classifications.csv`.
@@ -39,15 +39,15 @@ This fetches new JSON files from the remote repo, regenerates `readings.csv`, ru
 
 ```bash
 # Full analysis pipeline
-python3 scripts/multivariate_analysis.py data/readings/manual_20260320/readings.csv \
+python3 scripts/multivariate_analysis.py data/manual_20260320/readings.csv \
 --min-coverage 0.8 \
 --analyses clustering pca correlation importance
 
 # LDA visualization (cluster separation plot)
-python3 scripts/lda_visualization.py data/readings/manual_20260320/readings.csv
+python3 scripts/lda_visualization.py data/manual_20260320/readings.csv
 
 # Classify all readings (uses synthetic dataset as training data by default)
-python3 scripts/classify_readings.py data/readings/manual_20260320/readings.csv
+python3 scripts/classify_readings.py data/manual_20260320/readings.csv
 ```
 
 Use `--min-coverage` (0.0–1.0) to drop dimension columns below the given coverage fraction before analysis. This is important for datasets with many shortform readings where most dimensions are sparsely filled.
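The coverage filter described above amounts to keeping only the columns whose non-empty fraction meets the threshold. A minimal sketch follows; the real logic inside `multivariate_analysis.py` is not shown in this diff, so treat the function below as illustrative.

```python
def filter_by_coverage(rows: list[dict], min_coverage: float) -> list[str]:
    """Return column names whose fraction of non-empty values is >= min_coverage.

    With --min-coverage 0.8, a dimension column left blank in more than 20% of
    readings (common with shortform readings) is dropped before analysis.
    Illustrative sketch, not the repository's implementation.
    """
    if not rows:
        return []
    kept = []
    for col in rows[0]:
        filled = sum(1 for r in rows if (r.get(col) or "").strip())
        if filled / len(rows) >= min_coverage:
            kept.append(col)
    return kept
```

Raising the threshold toward 1.0 trades dimensional richness for rows that are fully comparable.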
@@ -57,8 +57,8 @@ Use `--min-coverage` (0.0–1.0) to drop dimension columns below the given cover
 If you have a directory of individual bicorder JSON reading files:
 
 ```bash
-python3 scripts/json_to_csv.py data/readings/manual_20260320/json/ \
--o data/readings/manual_20260320/readings.csv
+python3 scripts/json_to_csv.py data/manual_20260320/json/ \
+-o data/manual_20260320/readings.csv
 ```
 
 ---
@@ -68,7 +68,7 @@ python3 scripts/json_to_csv.py data/readings/manual_20260320/json/ \
 ### Process All Protocols with One Command
 
 ```bash
-python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
+python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
 ```
 
 This will:
@@ -81,13 +81,13 @@ This will:
 
 ```bash
 # Process only rows 1-5 (useful for testing)
-python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv --start 1 --end 5
+python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv --start 1 --end 5
 
 # Use specific LLM model
-python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv -m mistral
+python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv -m mistral
 
 # Add analyst metadata
-python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
+python3 scripts/bicorder_batch.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
 -a "Your Name" -s "Your analytical standpoint"
 ```
 
@@ -100,12 +100,12 @@ python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edi
 Create a CSV with empty gradient columns:
 
 ```bash
-python3 scripts/bicorder_analyze.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
+python3 scripts/bicorder_analyze.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv
 ```
 
 Optional: Add analyst metadata:
 ```bash
-python3 scripts/bicorder_analyze.py data/readings/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
+python3 scripts/bicorder_analyze.py data/synthetic_20251116/protocols_edited.csv -o analysis_output.csv \
 -a "Your Name" -s "Your analytical standpoint"
 ```
 
(30 binary image files moved into the flattened layout; file sizes unchanged.)