Bicorder synthetic data analysis
This directory concerns a synthetic data analysis conducted with the Protocol Bicorder.
Scripts were created with the assistance of Claude Code. Data processing was done largely with either local models or the Ollama cloud service, which does not retain user data. Thanks to Seth Frey (UC Davis) for guidance, but all mistakes are the responsibility of the author, Nathan Schneider. WARNING: This is the work of a researcher working with AI outside their field of expertise and should be treated as a playful experiment, not a model of rigorous methodology.
Purpose
This analysis has several purposes:
- To test the usefulness and limitations of the Protocol Bicorder
- To identify potential improvements to the Protocol Bicorder
- To identify any patterns in a synthetic dataset derived from recent works on protocols
Procedure
Document chunking
This stage gathered raw data from recent protocol-focused texts.
The source material consisted of book chapter drafts and major protocol-related books, including the draft of the author's book, The Protocol Reader, As for Protocols, and Das Protokoll. The texts were pasted in plain text and divided into 5000-word files, and the following prompt was applied to each chunk with the chunk.sh script:
model: "gemma3:12b"
context: "model running on ollama locally, accessed with llm on the command line"
prompt: "Return csv-formatted data (with no markdown wrapper) that consists of a list of protocols discussed or referred to in the attached text. Protocols are defined extremely broadly as 'patterns of interaction,' and may be of a nontechnical nature. Protocols should be as specific as possible, such as 'Sacrament of Reconciliation' rather than 'Religious Protocols.' The first column should provide a brief descriptor of the protocol, and the second column should describe it in a substantial paragraph of 3-5 sentences, encapsulated in quotation marks to avoid breaking on commas. Be sure to paraphrase rather than quoting directly from the source text."
The result was a CSV-formatted list of protocols (protocols_raw.csv, n=774 total protocols listed).
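The chunk.sh script is not reproduced here, but the splitting-and-prompting step amounts to something like the following Python sketch. The file names are placeholders and the PROMPT constant abbreviates the prompt quoted above; this is an illustration, not the actual script.

```python
# Illustrative sketch of the chunking step (the actual workflow used chunk.sh).
import subprocess
from pathlib import Path

PROMPT = "Return csv-formatted data (with no markdown wrapper) ..."  # full prompt as quoted above
WORDS_PER_CHUNK = 5000

def split_into_chunks(text: str, size: int = WORDS_PER_CHUNK) -> list[str]:
    """Divide the text into consecutive chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

source = Path("source_text.txt")  # placeholder input file
with open("protocols_raw.csv", "a", encoding="utf-8") as sink:
    for i, chunk in enumerate(split_into_chunks(source.read_text(encoding="utf-8"))):
        # Roughly equivalent to: llm -m gemma3:12b "<prompt> <chunk>"
        result = subprocess.run(["llm", "-m", "gemma3:12b", f"{PROMPT}\n\n{chunk}"],
                                capture_output=True, text=True, check=True)
        sink.write(result.stdout)
        print(f"chunk {i} processed")
```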
Dataset cleaning
The dataset was then manually reviewed. The review involved the following:
- Removal of repetitive formatting material introduced by the LLM
- Correction or removal of formatting errors
- Removal of rows whose contents met the following criteria:
- Repetition of entries---though some repetitions were simply merged into a single entry
- Overly broad entries that lacked meaningful context-specificity
- Overly narrow entries, e.g., referring to specific events
The cleaning process was carried out subjectively, so some entries that meet the above criteria may remain in the dataset. The dataset also appears to include some LLM hallucinations (that is, protocols not actually discussed in the texts), but these are often acceptable examples and so some have been retained. Some degree of noise in the dataset was considered acceptable for the purposes of the study. The remaining repetition also provides the dataset with a kind of control case for evaluating the consistency of the diagnostic process.
The result was a CSV-formatted list of protocols (protocols_edited.csv, n=411).
Initial diagnostic
This diagnostic used the version of the bicorder now preserved at bicorder_analyzed.json; the scripts are set up to read ../bicorder.json, which has since been updated based on this analysis.
For each row in the dataset, a series of scripts prompts the LLM to apply each gradient to the protocol. The outputs are then added to a CSV output file.
The result was a CSV-formatted list of protocols (diagnostic_output.csv, n=411).
See detailed documentation of the scripts at WORKFLOW.md.
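As a rough illustration of that loop, here is a minimal sketch. The bicorder.json structure, column layout, and 1-9 rating instruction are assumptions; the real implementation is bicorder_batch.py, documented in WORKFLOW.md.

```python
# Illustrative sketch of the per-protocol, per-gradient prompting loop.
import csv
import json
import subprocess

def ask_model(prompt: str, model: str = "gpt-oss") -> str:
    """Send one prompt to a model via the `llm` CLI (Ollama backend)."""
    out = subprocess.run(["llm", "-m", model, prompt],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def main() -> None:
    # Assumed structure: bicorder.json lists gradients under "gradients", each with a "name".
    with open("../bicorder.json", encoding="utf-8") as f:
        gradients = json.load(f)["gradients"]
    with open("protocols_edited.csv", newline="", encoding="utf-8") as src, \
         open("diagnostic_output.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for name, description in csv.reader(src):  # assumes two columns per row
            row = [name]
            for grad in gradients:
                # One prompt per protocol-gradient pair, asking for a numeric rating.
                prompt = (f"Protocol: {name}. {description}\n"
                          f"Rate this protocol on the gradient '{grad['name']}' "
                          f"from 1 to 9. Reply with only the number.")
                row.append(ask_model(prompt))
            writer.writerow(row)

if __name__ == "__main__":
    main()
```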
Manual and alternate model audit
To test the output, a manual review of the first 10 protocols in the protocols_edited.csv dataset was produced in the file diagnostic_output_manual.csv. (Alphabetization in this case seems a reasonable proxy for a random sample of protocols. It includes some partially overlapping protocols, as does the dataset as a whole.) Additionally, three models were tested on the same cases:
python3 bicorder_batch.py protocols_edited.csv -o diagnostic_output_mistral.csv -m mistral -a "Mistral" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
python3 bicorder_batch.py protocols_edited.csv -o diagnostic_output_gpt-oss.csv -m gpt-oss -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
python3 bicorder_batch.py protocols_edited.csv -o diagnostic_output_gemma3-12b.csv -m gemma3:12b -a "Gemma3:12b" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision" --start 1 --end 10
A Euclidean distance analysis (./venv/bin/python3 compare_analyses.py) found that the gpt-oss model's output was closest to the manual baseline, so gpt-oss was selected for conducting the bicorder diagnostic on the full dataset. A sketch of the distance calculation appears after the listing below.
Average Euclidean Distance:
1. diagnostic_output_gpt-oss.csv - Avg Distance: 11.68
2. diagnostic_output_gemma3-12b.csv - Avg Distance: 13.06
3. diagnostic_output_mistral.csv - Avg Distance: 13.33
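The distance measure is roughly the following: a minimal sketch of what the comparison computes, assuming matching numeric gradient columns and identical protocol order across files (not the actual compare_analyses.py code).

```python
# Illustrative sketch: mean per-protocol Euclidean distance from the manual baseline.
import numpy as np
import pandas as pd

def avg_euclidean_distance(candidate_csv: str, manual_csv: str) -> float:
    manual = pd.read_csv(manual_csv)
    candidate = pd.read_csv(candidate_csv)
    cols = manual.select_dtypes("number").columns  # the gradient rating columns
    diffs = candidate[cols].to_numpy(float) - manual[cols].to_numpy(float)
    return float(np.linalg.norm(diffs, axis=1).mean())

for path in ["diagnostic_output_gpt-oss.csv",
             "diagnostic_output_gemma3-12b.csv",
             "diagnostic_output_mistral.csv"]:
    print(path, round(avg_euclidean_distance(path, "diagnostic_output_manual.csv"), 2))
```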
Command used to produce diagnostic_output.csv (using the Ollama cloud service for the gpt-oss model):
python3 bicorder_batch.py protocols_edited.csv -o diagnostic_output.csv -m gpt-oss:20b-cloud -a "GPT-OSS" -s "A careful ethnographer and outsider aspiring to achieve a neutral stance and a high degree of precision"
The result was a CSV-formatted list of protocols (diagnostic_output.csv, n=411).
Further analysis
Basic averages
Per-protocol averages are meaningful for the bicorder because all of the gradients are structured as ranging from "hardness" to "softness" (with lower values indicating greater rigidity), even if that framing fits some gradients better than others. The average value for a given protocol therefore provides a rough sense of the protocol's overall hardness.
Basic averages appear in diagnostic_output-analysis.ods.
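In pandas terms, the spreadsheet calculations amount to simple row and column means. A sketch, assuming the gradient columns are the only numeric columns in the CSV:

```python
# Per-protocol and per-gradient means used in the plots below.
import pandas as pd

df = pd.read_csv("diagnostic_output.csv")
gradient_cols = df.select_dtypes("number").columns  # the 23 gradient columns

protocol_means = df[gradient_cols].mean(axis=1)  # one average per protocol
gradient_means = df[gradient_cols].mean(axis=0)  # one average per gradient

print(protocol_means.describe())
print(gradient_means.sort_values())
```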
Univariate analysis
First, a plot of average values for each protocol:
This reveals a roughly linear distribution of values across the protocols, apart from exponential-looking tails at the extremes. Perhaps the most interesting finding is a skew toward the higher end of the scale, associated with softness. Even relatively hard, technical protocols appear to have significant soft characteristics.
The protocol value averages have a mean of 5.45 and a median of 5.48. Relative to the scale midpoint of 5, the normalized midpoint deviation is 0.11. The Pearson skewness coefficient, by contrast, is just -0.07, meaning the distribution itself skews slightly downward. So the distribution of protocol values is quite balanced internally but shows a consistent upward deviation from the scale's baseline. (These calculations are in diagnostic_output-analysis.ods[averages].)
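These summary statistics can be reproduced along the following lines. This is a sketch: it assumes a 1-9 gradient scale (midpoint 5, half-range 4) and uses Pearson's second skewness coefficient; the actual calculations live in the spreadsheet.

```python
# Summary statistics for the per-protocol averages (assumptions noted below).
import pandas as pd

df = pd.read_csv("diagnostic_output.csv")
protocol_means = df.select_dtypes("number").mean(axis=1)

mean, median, std = protocol_means.mean(), protocol_means.median(), protocol_means.std()
MIDPOINT, HALF_RANGE = 5, 4  # assumption: gradients run from 1 to 9
midpoint_deviation = (mean - MIDPOINT) / HALF_RANGE
pearson_skew = 3 * (mean - median) / std  # Pearson's second skewness coefficient

print(f"mean={mean:.2f}  median={median:.2f}")
print(f"normalized midpoint deviation={midpoint_deviation:.2f}")
print(f"Pearson skew={pearson_skew:.2f}")
```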
Second, a plot of average values for each gradient (with gaps to indicate the three groupings of gradients):
This indicates that a few of the gradients appear to have outsized responsibility for the high skew of the protocol averages.
- Entanglement_exclusive_vs_non-exclusive is the highest by nearly a full point
- Three others have averages over 7: Design_precise_vs_interpretive, Design_documenting_vs_enabling, and Design_technical_vs_social
There is also an extreme at the bottom end: Design_durable_vs_ephemeral is the lowest by a full point.
It is not clear whether these extremes reveal anything other than a bias in the dataset or the LLM interpreter. Their descriptions may contribute to the extremes.
Multivariate analysis
Expectations:
- There are some gradients whose values are highly correlated. These might point to redundancies in the bicorder design.
- Some correlations might be revealing about connections in the characteristics of protocols, but these should be considered carefully as they may be the result of design or LLM interpretation.
Claude Code created a multivariate_analysis.py tool to conduct this analysis. Usage:
# Run all analyses (default)
venv/bin/python3 multivariate_analysis.py diagnostic_output.csv
# Run specific analyses only
venv/bin/python3 multivariate_analysis.py diagnostic_output.csv --analyses clustering pca
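The tool's internals are not documented here, but the analyses reported below correspond to standard scikit-learn operations, roughly along these lines (an illustrative sketch, not the actual multivariate_analysis.py code):

```python
# Rough outline of the reported analyses: correlations, 2-cluster K-Means, and PCA.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diagnostic_output.csv")
X = df.select_dtypes("number")                 # the 23 gradient columns

correlations = X.corr()                        # pairwise Pearson correlations
scaled = StandardScaler().fit_transform(X)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
pca = PCA(n_components=3).fit(scaled)

print(correlations.round(2))
print(pd.Series(clusters).value_counts())      # sizes of the two protocol families
print(pca.explained_variance_ratio_)           # variance explained per component
```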
Initial manual observations:
- The correlations generally seem predictable; for example, the strongest is between Design_static_vs_malleable and Experience_predictable_vs_emergent, which is not surprising
- The elite vs. vernacular distinction appears to be the most predictive gradient (analysis_results/plots/feature_importances.png)
Claude's interpretation:
- Two Fundamental Protocol Types (K-Means Clustering)
The data reveals two distinct protocol families (216 vs 192 protocols):
Cluster 1: "Vernacular/Emergent Protocols"
- Examples: Marronage, Songlines, Access-Centered Practices, Ethereum Proof of Work, Sangoma Healing Practices
- Characteristics:
- HIGH: elite→vernacular (6.4), malleable (7.3), flocking→swarming (6.4)
- LOW: self-enforcing→enforced (3.4), sovereign→subsidiary (2.9)
Cluster 2: "Institutional/Standardized Protocols"
- Examples: ISO standards, Greenwich Mean Time, Building Codes, German Bureaucratic Prose, Royal Access Protocol
- Characteristics:
- HIGH: self-enforcing (7.1), sovereign (6.0)
- LOW: elite→vernacular (1.8), flocking→swarming (2.3), static (3.5)
- Key Structural Dimensions (PCA)
Three principal components explain 55% of variance:
PC1 (main axis of variation): Elite/Static/Flocking ↔ Self-enforcing
- Essentially captures the Vernacular vs. Institutional divide
PC2: Sufficient/Crystallized ↔ Kafka-esque
- Measures protocol "completeness" vs. bureaucratic nightmare quality
PC3: Universal/Technical/Macro ↔ Particular/Embodied
- Scale and abstraction level
Strong Correlations (Most Significant Relationships)
Static ↔ Predictable (r=0.61): Unchanging protocols create predictable experiences
Elite ↔ Self-enforcing (r=-0.58): Elite protocols need external enforcement; vernacular ones self-enforce
Self-enforcing ↔ Flocking (r=-0.56): Self-enforcing protocols resist swarming dynamics
Exclusion ↔ Kafka (r=0.52): Exclusionary protocols feel Kafka-esque
Most Discriminative Dimension (Feature Importance)
Design_elite_vs_vernacular (22.7% importance) is by far the most powerful predictor of protocol type, followed by Entanglement_flocking_vs_swarming (13.8%).
- Most "Central" Protocols (Network Analysis)
These protocols share the most dimensional similarities with others:
VPN Usage (Circumvention Protocol) - bridges many protocol types
Access Check-in - connects accessibility and participation patterns
Quadratic Voting - spans governance dimensions
Outliers (DBSCAN found 281!)
Most protocols are actually quite unique - DBSCAN identified 281 outliers, suggesting the dataset contains many distinctive protocol configurations that don't fit neat clusters. Only 10 tight sub-clusters exist.
- Category Prediction Power
- Design dimensions predict clustering with 90.4% accuracy
- Entanglement dimensions: 89.2% accuracy
- Experience dimensions: only 78.3% accuracy
This suggests Design and Entanglement are more fundamental than Experience.
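These category-level accuracies presumably come from training a classifier to predict cluster membership from each category's gradients alone. A hedged sketch of that kind of check (the classifier choice and cross-validation setup are assumptions, not the actual analysis code):

```python
# How predictive each gradient category is of the two-cluster split (illustrative).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diagnostic_output.csv")
X = df.select_dtypes("number")
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# Accuracy of predicting the cluster label from each category's gradients alone.
for category in ["Design", "Entanglement", "Experience"]:
    cols = [c for c in X.columns if c.startswith(category)]
    score = cross_val_score(RandomForestClassifier(random_state=0),
                            X[cols], labels, cv=5).mean()
    print(f"{category}: {score:.1%}")
```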
The core insight: Protocols fundamentally divide between vernacular/emergent/malleable forms and institutional/standardized/static forms, with the elite↔vernacular dimension being the strongest predictor of all other characteristics.
Comments:
- The distinction along the lines of "Vernacular/Emergent" and "Institutional/Standardized" tracks well with
- Strange that "Ethereum Proof of Work" appears in the "Vernacular/Emergent" family
Conclusions
Improvements to the bicorder
These findings indicate some options for improving on the current version of the bicorder.
Reflections on manual review:
- The exclusion/exclusive names could be improved to overlap less
- Should have separate values for "n/a" (0, though that could distort averages) and "both" (5)
- Remove the analysis section, or use analyses here for what becomes most meaningful
Claude report on possible improvements:
- Prioritize High-Impact Dimensions ⭐
The current 23 dimensions aren't equally informative. Reorder by importance:
Tier 1 - Critical (>10% importance):
- Design_elite_vs_vernacular (22.7%) - THE most discriminative dimension
- Entanglement_flocking_vs_swarming (13.8%)
- Design_static_vs_malleable (10.2%)
Tier 2 - Important (5-10%):
- Entanglement_self-enforcing_vs_enforced (9.2%)
- Entanglement_obligatory_vs_voluntary (8.0%)
- Experience_exclusion_vs_inclusion (5.9%)
Tier 3 - Supplementary (<5%):
- All remaining 17 dimensions
Recommendation: Reorganize the tool to present Tier 1 dimensions first, or mark them as "core diagnostics" vs. "supplementary diagnostics."
- Consider Reducing Low-Value Dimensions
Several dimensions have low discriminative power AND low variance:
Candidates for removal/consolidation:
- Entanglement_exclusive_vs_non-exclusive (0.6% importance, σ=1.64, mean=8.5)
- Responses heavily cluster at "non-exclusive" - not discriminating
- Design_durable_vs_ephemeral (1.2% importance, σ=2.41)
- Entanglement_defensible_vs_exposed (1.1% importance, σ=2.41)
Recommendation: Either remove these or combine into composite measures. Going from 23→18 dimensions would reduce analyst burden by ~20% with minimal information loss.
- Add Composite Scores 📊
Since PC1 explains the main variance, create derived metrics:
"Protocol Type Score" (based on PC1 loadings): Score = elite_vs_vernacular(0.36) + static_vs_malleable(0.33) + flocking_vs_swarming(0.31) - self-enforcing_vs_enforced(0.29)
- High score = Institutional/Standardized
- Low score = Vernacular/Emergent
"Protocol Completeness Score" (based on PC2): Score = sufficient_vs_insufficient(0.43) + crystallized_vs_contested(0.38) - Kafka_vs_Whitehead(0.36)
- Measures how "finished" vs. "kafkaesque" a protocol feels
Recommendation: Display these composite scores alongside individual dimensions to provide quick high-level insights.
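If implemented literally, these composite scores are just weighted sums of the relevant gradient columns. A minimal sketch using the loadings above; the column names, including their category prefixes, are assumptions and would need to match the actual headers in diagnostic_output.csv.

```python
# Derived composite scores from the PC1/PC2 loadings listed above (illustrative).
import pandas as pd

df = pd.read_csv("diagnostic_output.csv")

# Column names (including category prefixes) are assumed; adjust to the actual CSV headers.
df["protocol_type_score"] = (
    0.36 * df["Design_elite_vs_vernacular"]
    + 0.33 * df["Design_static_vs_malleable"]
    + 0.31 * df["Entanglement_flocking_vs_swarming"]
    - 0.29 * df["Entanglement_self-enforcing_vs_enforced"]
)
df["completeness_score"] = (
    0.43 * df["Design_sufficient_vs_insufficient"]
    + 0.38 * df["Design_crystallized_vs_contested"]
    - 0.36 * df["Experience_Kafka_vs_Whitehead"]
)
print(df[["protocol_type_score", "completeness_score"]].describe())
```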
- Highlight Key Correlations 🔗
The tool should alert analysts to important relationships:
Strong positive correlations:
- Static ↔ Predictable (0.61)
- Elite ↔ Static (0.53)
- Exclusion ↔ Kafka (0.52)
Strong negative correlations:
- Elite ↔ Self-enforcing (-0.58)
- Self-enforcing ↔ Flocking (-0.56)
Recommendation: When an analyst rates a dimension, show a tooltip: "Protocols rated as 'elite' tend to also be 'static' and require 'enforcement'"
- Flag Potential Redundancy
Two dimension pairs show moderate correlation within the same category:
- Entanglement_abstract_vs_embodied ↔ Entanglement_flocking_vs_swarming (r=-0.56)
- Design_documenting_vs_enabling ↔ Design_static_vs_malleable (r=0.53)
Recommendation: Consider merging these or making one primary and the other optional.
- Rebalance Categories ⚖️
Current split: Design (8), Entanglement (8), Experience (7)
Performance by category:
- Design: 90.4% predictive accuracy
- Entanglement: 89.2% predictive accuracy
- Experience: 78.3% predictive accuracy
Recommendation:
- Strengthen Design (add 1-2 high-variance dimensions)
- Trim Experience (remove low-performers, down to 5)
- Result: 9 Design, 8 Entanglement, 5 Experience = 22 dimensions (down from 23)
- Add Diagnostic Quality Indicators
Based on variance analysis, flag dimensions where responses are too clustered:
- Entanglement_exclusive_vs_non-exclusive: 93% of protocols rate 7-9
- Design_technical_vs_social: Mean=7.6, heavily skewed toward "social"
Recommendation: Consider revising these gradient definitions or endpoints to achieve better distribution.
- Create Shortened Version 🎯
For rapid assessment, create a "Bicorder Core" with just the top 8 dimensions:
- elite_vs_vernacular ⭐⭐⭐
- flocking_vs_swarming
- static_vs_malleable
- self-enforcing_vs_enforced
- obligatory_vs_voluntary
- exclusion_vs_inclusion
- universal_vs_particular (high variance)
- explicit_vs_implicit (high variance)
This captures ~65% of the discriminative power in 1/3 the time.
- Add Comparison Features
The network analysis shows some protocols are highly "central" (similar to many others):
- VPN Usage (Circumvention Protocol)
- Access Check-in
- Quadratic Voting
Recommendation: After rating a protocol, show: "This protocol is most similar to: [X, Y, Z]" based on dimensional proximity.
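Dimensional proximity could be implemented as a nearest-neighbor lookup over the gradient vectors. A minimal sketch (not part of the current tool; it assumes the first CSV column holds the protocol name):

```python
# Find the protocols closest to a given one in the 23-dimensional gradient space.
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("diagnostic_output.csv")
names = df.iloc[:, 0]                          # assumes column 0 is the protocol name
X = StandardScaler().fit_transform(df.select_dtypes("number"))

nn = NearestNeighbors(n_neighbors=4).fit(X)    # self plus three nearest neighbors
_, idx = nn.kneighbors(X[:1])                  # neighbors of the first protocol
print(f"Most similar to {names.iloc[0]}:", list(names.iloc[idx[0][1:]]))
```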
Summary of Recommendations
Immediate actions:
- ✂️ Remove 3-5 low-value dimensions → 20 dimensions
- 🔄 Reorder dimensions by importance (elite_vs_vernacular first)
- ➕ Add 2 composite scores (Protocol Type, Completeness)
Medium-term enhancements:
- 🎯 Create "Bicorder Core" (8-dimension quick version)
- 💡 Add contextual tooltips about correlations
- 📊 Show similar protocols after assessment
This would make the tool ~20% faster to use while maintaining 95%+ of its discriminative power.
Questions:
- What makes the elite/vernacular distinction so "important"? Is it because of some salience of the description?
- Why is the sovereign/subsidiary distinction, which is such a central part of the theorizing here, not more "important"? Would this change with a different description of the values?
Future work: Description modification
Hypothesis: The phrasing of gradient value descriptions could have a significant impact on the salience of particular gradients.
Method: Modify the descriptions and run the same tests, see if the results are different.
Future work: Persona elaboration
Hypothesis: Changing the analyst and their standpoint could result in interesting differences of outcomes.
Method: Alongside the dataset of protocols, generate diverse personas, such as a) personas used to evaluate every protocol, and b) protocol-specific personas that reflect different relationships to the protocol. Modify the test suite to include personas as an additional dimension of the analysis.
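One way to set this up with the existing script would be to loop over persona strings passed to the -s flag. A sketch with hypothetical personas (the persona descriptions and output names are placeholders):

```python
# Run the existing batch script once per persona (illustrative sweep).
import subprocess

# Hypothetical persona standpoints; the actual personas would be generated
# alongside the protocol dataset.
personas = {
    "bureaucrat": "A career administrator who values predictability and documentation",
    "organizer": "A grassroots organizer suspicious of imposed rules and attentive to power",
    "engineer": "A systems engineer focused on interoperability and failure modes",
}

for tag, standpoint in personas.items():
    subprocess.run([
        "python3", "bicorder_batch.py", "protocols_edited.csv",
        "-o", f"diagnostic_output_{tag}.csv",
        "-m", "gpt-oss:20b-cloud",
        "-a", tag,
        "-s", standpoint,
    ], check=True)
```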



