Reorganize directory, add manual dataset and sync tooling
- Move all scripts to scripts/, web assets to web/, analysis results into self-contained data/readings/<type>_<YYYYMMDD>/ directories
- Add data/readings/manual_20260320/ with 32 JSON readings from git.medlab.host/ntnsndr/protocol-bicorder-data
- Add scripts/json_to_csv.py to convert bicorder JSON files to CSV
- Add scripts/sync_readings.sh for one-command sync + re-analysis of any dataset backed by a .sync_source config file
- Add scripts/classify_readings.py to apply the LDA classifier to all readings and save per-reading cluster assignments
- Add --min-coverage flag to multivariate_analysis.py for sparse/shortform datasets; also applies in lda_visualization.py
- Fix lda_visualization.py NaN handling and 0-d array annotation bug
- Update README.md and WORKFLOW.md to document datasets, sync workflow, shortform handling, and new scripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -1,9 +1,42 @@
-# Bicorder synthetic data analysis
+# Bicorder data analysis

-This directory concerns a synthetic data analysis conducted with the Protocol Bicorder.
+This directory concerns analyses conducted with the Protocol Bicorder across multiple datasets.

Scripts were created with the assistance of Claude Code. Data processing was done largely with either local models or the Ollama cloud service, which does not retain user data. Thanks to [Seth Frey (UC Davis)](https://enfascination.com/) for guidance, but all mistakes are the responsibility of the author, [Nathan Schneider](https://nathanschneider.info). This is the work of a researcher working with AI outside their field of expertise and should be treated as a playful experiment, not a model of rigorous methodology.
+## Datasets
+
+Readings are organized under `data/readings/<type>_<YYYYMMDD>/`, each self-contained with its own `readings.csv`, `analysis/`, and `json/` subdirectories:
+
+- **`data/readings/synthetic_20251116/`** — 411 protocols from synthetic LLM-generated readings (see detailed procedure below)
+- **`data/readings/manual_20260320/`** — manual readings collected at [git.medlab.host/ntnsndr/protocol-bicorder-data](https://git.medlab.host/ntnsndr/protocol-bicorder-data), continuously expanding
+
+### Syncing the manual dataset
+
+The manual dataset is kept current via a `.sync_source` config file and a one-command sync script:
+
+```bash
+scripts/sync_readings.sh data/readings/manual_20260320
+```
+
+This clones the remote repository, copies JSON reading files, regenerates `readings.csv`, runs multivariate analysis (filtering to well-covered dimensions), generates an LDA visualization, and saves per-reading cluster classifications to `analysis/classifications.csv`.
+
+Options:
+
+```bash
+scripts/sync_readings.sh data/readings/manual_20260320 --min-coverage 0.8  # default
+scripts/sync_readings.sh data/readings/manual_20260320 --no-analysis  # sync JSON only
+scripts/sync_readings.sh data/readings/manual_20260320 --training data/readings/synthetic_20251116/readings.csv
+```
+
+### Handling shortform readings
+
+Many manual readings use the shortform bicorder (9 key dimensions rather than all 23). Two analysis strategies handle this:
+
+1. **Multivariate analysis with `--min-coverage`**: Drops dimension columns below the coverage threshold so analysis runs on the shared well-filled dimensions (e.g., 8 dimensions at 80% coverage for the current manual dataset).
+2. **Classifier (`classify_readings.py`)**: Applies the synthetic-trained LDA model to all readings, filling any missing dimensions with a neutral value (5). The `completeness` column in the output flags readings where confidence is limited by sparse data.
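The two strategies above can be sketched in a few lines of plain Python. The dimension names, readings, and threshold below are invented for illustration; the real scripts operate on the full 23-gradient `readings.csv`:

```python
NEUTRAL = 5  # neutral fill value used by the classifier, per the text above

# Hypothetical shortform readings; keys are gradient (dimension) names
readings = [
    {"Design_static_vs_malleable": 3, "Elite_vs_vernacular": 7},
    {"Design_static_vs_malleable": 6},
]
dims = ["Design_static_vs_malleable", "Elite_vs_vernacular", "Layer_infra_vs_interface"]

# Strategy 1: keep only dimensions present in at least min_coverage of readings
min_coverage = 0.8
coverage = {d: sum(d in r for r in readings) / len(readings) for d in dims}
kept = [d for d in dims if coverage[d] >= min_coverage]

# Strategy 2: fill each missing dimension with the neutral value and record
# per-reading completeness, like the classifier's `completeness` column
rows = [
    {"completeness": sum(d in r for d in dims) / len(dims),
     **{d: r.get(d, NEUTRAL) for d in dims}}
    for r in readings
]
```

Strategy 1 trades readings' breadth for column reliability; strategy 2 keeps every reading but dilutes sparse ones toward the scale midpoint.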
+---

## Purpose
This analysis has several purposes:
@@ -26,7 +59,7 @@ context: "model running on ollama locally, accessed with llm on the command line
prompt: "Return csv-formatted data (with no markdown wrapper) that consists of a list of protocols discussed or referred to in the attached text. Protocols are defined extremely broadly as 'patterns of interaction,' and may be of a nontechnical nature. Protocols should be as specific as possible, such as 'Sacrament of Reconciliation' rather than 'Religious Protocols.' The first column should provide a brief descriptor of the protocol, and the second column should describe it in a substantial paragraph of 3-5 sentences, encapsulated in quotation marks to avoid breaking on commas. Be sure to paraphrase rather than quoting directly from the source text."
```

-The result was a CSV-formatted list of protocols (`protocols_raw.csv`, n=774 total protocols listed).
+The result was a CSV-formatted list of protocols (`data/readings/synthetic_20251116/protocols_raw.csv`, n=774 total protocols listed).
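The quoting requirement in the prompt matters because the second column contains multi-sentence prose with commas; Python's `csv` module handles such rows directly. The sample row below is invented:

```python
import csv
import io

# Invented sample row in the format the prompt requests: a short descriptor,
# then a description quoted so its internal commas don't split columns
raw = 'Sacrament of Reconciliation,"A rite of confession, penance, and absolution."\n'
rows = list(csv.reader(io.StringIO(raw)))
print(rows[0][0])  # Sacrament of Reconciliation
```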
### Dataset cleaning
@@ -41,7 +74,7 @@ The dataset was then manually reviewed. The review involved the following:
The cleaning process was carried out in a subjective manner, so some entries that meet the above criteria may remain in the dataset. The dataset also appears to include some LLM hallucinations---that is, protocols not in the texts---but the hallucinations are often acceptable examples, and so some have been retained. Some degree of noise in the dataset was considered acceptable for the purposes of the study. Some degree of repetition, likewise, provides the dataset with a kind of control case for evaluating the diagnostic process.

-The result was a CSV-formatted list of protocols (`protocols_edited.csv`, n=411).
+The result was a CSV-formatted list of protocols (`data/readings/synthetic_20251116/protocols_edited.csv`, n=411).

### Initial diagnostic
@@ -50,42 +83,42 @@ This diagnostic used the file now at `bicorder_analyzed.json`, though the script
For each row in the dataset, and on each gradient, a series of scripts prompts the LLM to apply each gradient to the protocol. The outputs are then added to a CSV output file.
-The result was a CSV-formatted list of protocols (`diagnostic_output.csv`, n=411).
+The result was a CSV-formatted list of protocols (`data/readings/synthetic_20251116/readings.csv`, n=411).

See detailed documentation of the scripts at `WORKFLOW.md`.
### Manual and alternate model audit
-To test the output, a manual review of the first 10 protocols in the `protocols_edited.csv` dataset was produced in the file `diagnostic_output_manual.csv`. (Alphabetization in this case seems a reasonable proxy for a random sample of protocols. It includes some partially overlapping protocols, as does the dataset as a whole.) Additionally, three models were tested on the same cases:
+To test the output, a manual review of the first 10 protocols in the `data/readings/synthetic_20251116/protocols_edited.csv` dataset was produced in the file `data/readings/synthetic_20251116/readings_manual.csv`. (Alphabetization in this case seems a reasonable proxy for a random sample of protocols. It includes some partially overlapping protocols, as does the dataset as a whole.) Additionally, three models were tested on the same cases:

```bash
-python3 bicorder_batch.py protocols_edited.csv -o diagnostic_output_mistral.csv -m mistral -a "Mistral" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
+python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o data/readings/synthetic_20251116/readings_mistral.csv -m mistral -a "Mistral" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
```
```bash
-python3 bicorder_batch.py protocols_edited.csv -o diagnostic_output_gpt-oss.csv -m gpt-oss -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
+python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o data/readings/synthetic_20251116/readings_gpt-oss.csv -m gpt-oss -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
```
```bash
-python3 bicorder_batch.py protocols_edited.csv -o diagnostic_output_gemma3-12b.csv -m gemma3:12b -a "Gemma3:12b" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
+python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o data/readings/synthetic_20251116/readings_gemma3-12b.csv -m gemma3:12b -a "Gemma3:12b" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
```
-A Euclidean distance analysis (`./venv/bin/python3 compare analyses.py`) found that the `gpt-oss` model was closer to the manual example than the others. It was therefore selected to be the model used for conducting the bicorder diagnostic on the dataset.
+A Euclidean distance analysis (`python3 scripts/compare_analyses.py`) found that the `gpt-oss` model was closer to the manual example than the others. It was therefore selected to be the model used for conducting the bicorder diagnostic on the dataset.

```
Average Euclidean Distance:
-1. diagnostic_output_gpt-oss.csv - Avg Distance: 11.68
-2. diagnostic_output_gemma3-12b.csv - Avg Distance: 13.06
-3. diagnostic_output_mistral.csv - Avg Distance: 13.33
+1. readings_gpt-oss.csv - Avg Distance: 11.68
+2. readings_gemma3-12b.csv - Avg Distance: 13.06
+3. readings_mistral.csv - Avg Distance: 13.33
```
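The comparison presumably computes something like the average per-protocol Euclidean distance between two score tables. A minimal sketch, with invented scores rather than the actual data:

```python
def avg_euclidean(a, b):
    """Average per-protocol Euclidean distance between two score tables
    (rows = protocols, columns = gradient values)."""
    dists = [
        sum((x - y) ** 2 for x, y in zip(row_a, row_b)) ** 0.5
        for row_a, row_b in zip(a, b)
    ]
    return sum(dists) / len(dists)

# Invented scores for two readings of the same two protocols
manual = [[5, 6, 7], [4, 4, 4]]
model = [[5, 8, 7], [1, 4, 4]]
print(avg_euclidean(manual, model))  # (2.0 + 3.0) / 2 = 2.5
```

A lower average distance means a model's gradient scores track the manual reading more closely, which is the selection criterion described above.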
-Command used to produce `diagnostic_output.csv` (using the Ollama cloud service for the `gpt-oss` model):
+Command used to produce `data/readings/synthetic_20251116/readings.csv` (using the Ollama cloud service for the `gpt-oss` model):

```bash
-python3 bicorder_batch.py protocols_edited.csv -o diagnostic_output.csv -m gpt-oss:20b-cloud -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision"
+python3 scripts/bicorder_batch.py data/readings/synthetic_20251116/protocols_edited.csv -o data/readings/synthetic_20251116/readings.csv -m gpt-oss:20b-cloud -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision"
```
-The result was a CSV-formatted list of protocols (`diagnostic_output.csv`, n=411).
+The result was a CSV-formatted list of protocols (`data/readings/synthetic_20251116/readings.csv`, n=411).

### Further analysis
@@ -93,7 +126,7 @@ The result was a CSV-formatted list of protocols (`diagnostic_output.csv`, n=411
Per-protocol values are meaningful for the bicorder because, despite varying levels of appropriateness, all of the gradients are structured as ranging from "hardness" to "softness"---with lower values associated with greater rigidity. The average value for a given protocol, therefore, provides a rough sense of the protocol's hardness.
-Basic averages appear in `diagnostic_output-analysis.ods`.
+Basic averages appear in `data/readings/synthetic_20251116/readings-analysis.ods`.

#### Univariate analysis
@@ -103,7 +136,7 @@ First, a plot of average values for each protocol:
This reveals a linear distribution of values among the protocols, aside from exponential curves only at the extremes. Perhaps the most interesting finding is a skew toward the higher end of the scale, associated with softness. Even relatively hard, technical protocols appear to have significant soft characteristics.
-The protocol value averages have a mean of 5.45 and a median of 5.48. In comparison to the midpoint of 5, the normalized midpoint deviation is 0.11. In comparison, the Pearson coefficient measures the skew at just -0.07, which means that the relative skew of the data is actually slightly downward. So the distribution of protocol values is very balanced but has a consistent upward deviation from the scale's baseline. (These calculations are in `diagnostic_output-analysis.odt[averages]`.)
+The protocol value averages have a mean of 5.45 and a median of 5.48. In comparison to the midpoint of 5, the normalized midpoint deviation is 0.11. In comparison, the Pearson coefficient measures the skew at just -0.07, which means that the relative skew of the data is actually slightly downward. So the distribution of protocol values is very balanced but has a consistent upward deviation from the scale's baseline. (These calculations are in `data/readings/synthetic_20251116/readings-analysis.odt[averages]`.)
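For reference, these measures can be reproduced as follows. The score list is invented, and "normalized midpoint deviation" is read here as the mean's distance from the midpoint 5 divided by the half-range 4 of a 1-9 scale, which is an assumption about the spreadsheet's formula:

```python
import statistics as st

scores = [3.2, 4.8, 5.1, 5.5, 5.9, 6.2, 7.4]  # invented per-protocol averages

mean, median = st.mean(scores), st.median(scores)

# Assumed definition: (mean - midpoint) / half-range on a 1-9 scale
midpoint_dev = (mean - 5) / 4

# Pearson's second skewness coefficient: 3 * (mean - median) / stdev
pearson_skew = 3 * (mean - median) / st.pstdev(scores)
```

As in the reported figures, a mean above the midpoint with a median above the mean yields a positive midpoint deviation alongside a slightly negative Pearson skew.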
Second, a plot of average values for each gradient (with gaps to indicate the three groupings of gradients):
@@ -132,21 +165,21 @@ Claude Code created a `multivariate_analysis.py` tool to conduct this analysis.
```bash
# Run all analyses (default)
-venv/bin/python3 multivariate_analysis.py diagnostic_output.csv
+python3 scripts/multivariate_analysis.py data/readings/synthetic_20251116/readings.csv

# Run specific analyses only
-venv/bin/python3 multivariate_analysis.py diagnostic_output.csv --analyses clustering pca
+python3 scripts/multivariate_analysis.py data/readings/synthetic_20251116/readings.csv --analyses clustering pca
```
Initial manual observations:
* The correlations generally seem predictable; for example, the strongest is between `Design_static_vs_malleable` and `Experience_predictable_vs_emergent`, which is not surprising
-* The elite vs. vernacular distinction appears to be the most predictive gradient (`analysis_results/plots/feature_importances.png`)
+* The elite vs. vernacular distinction appears to be the most predictive gradient (`data/readings/synthetic_20251116/analysis/plots/feature_importances.png`)


|
||||

|
||||
|
||||

|
||||

|
||||
|
||||
Claude's interpretation:
@@ -405,7 +438,7 @@ This simple version-matching approach ensures compatibility without complex stru
### Files
- `bicorder_model.json` (5KB) - Trained LDA model with coefficients and scaler parameters
-- `bicorder-classifier.js` - JavaScript implementation for real-time classification in web app
+- `web/bicorder-classifier.js` - JavaScript implementation for real-time classification in web app
- `ascii_bicorder.py` (updated) - Python script now calculates automated analysis values
- `../bicorder.json` (updated) - Added bureaucratic ↔ relational gradient to analysis section
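Applying a stored model of this kind amounts to standardizing the input and taking a dot product with the discriminant coefficients. The key names below are assumptions about `bicorder_model.json`'s schema, not its documented format, and the numbers are invented:

```python
# Hypothetical stand-in for the stored model: standardization parameters
# plus linear-discriminant coefficients (key names are assumptions)
model = {
    "scaler_mean": [5.0, 5.0],
    "scaler_scale": [2.0, 2.0],
    "coef": [0.8, -0.3],
    "intercept": 0.1,
}

def lda_score(x, m):
    """Standardize the gradient values, then apply the linear discriminant."""
    z = [(xi - mu) / s for xi, mu, s in zip(x, m["scaler_mean"], m["scaler_scale"])]
    return sum(c * zi for c, zi in zip(m["coef"], z)) + m["intercept"]

print(lda_score([7.0, 4.0], model))  # 0.8*1.0 + (-0.3)*(-0.5) + 0.1 = 1.05
```

Because only means, scales, coefficients, and an intercept are needed, the same arithmetic is easy to mirror in JavaScript for the web classifier.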
@@ -417,7 +450,7 @@ The calculation happens automatically when generating bicorder output:
python3 ascii_bicorder.py bicorder.json bicorder.txt
```
-For web integration, see `INTEGRATION_GUIDE.md` for details on using `bicorder-classifier.js` to provide real-time classification as users fill out diagnostics.
+For web integration, see `INTEGRATION_GUIDE.md` for details on using `web/bicorder-classifier.js` to provide real-time classification as users fill out diagnostics.

### Key Features