# Protocol virtues study

The purpose of this study is to produce a sample of virtues for living among protocols, derived inductively from two recent, open-access books on protocols: _[The Protocol Reader](https://summerofprotocols.com/research/protocol-reader)_ and _[As for Protocols](https://www.fulcrum.org/concern/monographs/3t945t729?locale=en)_. It also experiments with the use of LLMs in that process.

This document describes the study's method in detail.

## Preliminary experiments

The `script.sh` file contains the instructions used to run an LLM on the texts. It was used to test several local models, first on the introductions and then on the full text of both books.

Observations:

* The ministral-3 run took considerably longer than the gemma3 and lfm2.5-thinking runs, which took comparable time.
* The ministral-3 output (`outputs/output-ministral3-20260310.csv`) is largely nonsensical and extremely repetitive; it attempts to cite the source text but does so inaccurately.
* The lfm2.5-thinking output (`outputs/output-lfm25-20260310.csv`) is hallucinatory and does not appear to draw meaningfully on the source text.
* The gemma3 output (`outputs/output-gemma3-20260310.csv`) is constructive and plausible; although its source-text quotations are not exact, they resemble actual passages closely enough that most can be located and confirmed.
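
Checking whether an inexact LLM quotation corresponds to a real passage can be partly automated. The following is a minimal sketch, not the procedure used in this study: it slides a window over the source text and reports the most similar passage using Python's standard `difflib`. The function name `best_match` and the `step` parameter are hypothetical, and it assumes the source text is available locally as a plain string.

```python
from difflib import SequenceMatcher


def best_match(quote: str, source: str, step: int = 20) -> tuple[float, str]:
    """Return (similarity, passage) for the source window most similar to quote.

    Similarity is difflib's ratio in [0, 1]; 1.0 means an exact
    (case-insensitive) match. A larger step is faster but may skip over
    the best alignment; step=1 checks every position.
    """
    window = len(quote)
    needle = quote.lower()
    # If the source is no longer than the quote, compare against all of it.
    if window == 0 or len(source) <= window:
        ratio = SequenceMatcher(None, needle, source.lower(), autojunk=False).ratio()
        return ratio, source
    best_ratio, best_passage = 0.0, ""
    for start in range(0, len(source) - window + 1, step):
        passage = source[start:start + window]
        ratio = SequenceMatcher(None, needle, passage.lower(), autojunk=False).ratio()
        if ratio > best_ratio:
            best_ratio, best_passage = ratio, passage
    return best_ratio, best_passage
```

A quotation whose best ratio is high but below 1.0 is a candidate paraphrase that still needs to be confirmed by reading the passage in context; any threshold for "high" is a judgment call.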

Based on these observations, along with a broader recognition of the limits of LLM interpretation, I am opting for a more manual method, while using closely scrutinized LLM outputs as a corrective.

The gemma3 output from the introductions to both books will nevertheless be retained.

## Method

The method for producing a list of virtues is as follows:

* **Highlighting** any passages that state or imply virtues for living well among protocols, through a manual re-reading of _The Protocol Reader_ and _As for Protocols_
* **Coding** through manual grouping of the passages according to a set of concise candidate virtues
* **Analysis** of the coding to identify patterns

This method seeks to obtain a list of virtues grounded in the researcher's manual reading, while consulting an LLM interpretation to catch anything the researcher overlooked.

### Highlighting

Highlighting involved re-reading the books on KOReader. I highlighted passages that seemed to relate, directly or indirectly, to virtues for life among protocols (n=134). Of those, 62 were from _As for Protocols_ and 72 were from _The Protocol Reader_. The highlights were exported into text files and then gathered into `text_coding/snippets.csv`.
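
The per-book counts can be re-derived from the gathered file. This is a minimal sketch that assumes `text_coding/snippets.csv` has a header row with a column identifying the source book; the column name `book` is an assumption and may not match the actual file.

```python
import csv
from collections import Counter
from typing import Iterable


def count_by_book(lines: Iterable[str], column: str = "book") -> Counter:
    """Count rows per value of `column` in CSV data with a header row.

    `lines` can be an open file or any iterable of CSV lines.
    """
    return Counter(row[column] for row in csv.DictReader(lines))


# Usage (the column name is hypothetical):
# with open("text_coding/snippets.csv", newline="", encoding="utf-8") as f:
#     counts = count_by_book(f)
#     print(counts, sum(counts.values()))
```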

### Coding

The `text_coding/snippets.csv` data was ported into `text_coding/coding.ods` for coding. An initial list of virtues was derived from the gemma3 analyses of the introductions to the books (`outputs/output-gemma3-20260310.csv`, with 49 virtues, and `outputs/output-gemma3-AsForProtocolsIntro-20260315.csv`, with 22 virtues). Duplicates were removed along with entries that did not seem to qualify as virtues, resulting in a combined set of 56 virtues.
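
The mechanical part of combining the two lists (removing duplicates) can be sketched as follows. This illustrates the general idea only: the function name `merge_virtues` is hypothetical, matching is by case and whitespace only, and deciding which entries do not qualify as virtues remains a manual judgment that is passed in as `exclude`.

```python
from typing import Iterable


def _normalize(name: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(name.split())


def merge_virtues(*lists: Iterable[str], exclude: Iterable[str] = ()) -> list[str]:
    """Merge virtue lists in order, dropping duplicates and excluded names.

    Names are compared case-insensitively after whitespace cleanup; the
    first spelling encountered is the one that is kept.
    """
    excluded = {_normalize(name).casefold() for name in exclude}
    seen: set[str] = set()
    merged: list[str] = []
    for virtues in lists:
        for name in virtues:
            clean = _normalize(name)
            key = clean.casefold()
            if not key or key in seen or key in excluded:
                continue
            seen.add(key)
            merged.append(clean)
    return merged
```

Near-synonyms are not detected by this kind of matching, so a manual pass over the merged list is still needed.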

I then reviewed all of the highlighted snippets (in the `coding` tab of `text_coding/coding.ods`), coding each snippet with whatever virtue names seemed relevant to it. Digital copies of the books were on hand for consulting the surrounding context.

Additional virtues were added (n=36) when the text appeared to communicate something not yet represented on the list. They were placed at the top of the list, which is seen first during coding, to prioritize the use of manually identified virtues.

Virtues were identified interpretively; their identification depended on the sense of the text, not necessarily the literal use of words, though I made efforts to use words from the texts where appropriate.

Nineteen of the 56 LLM-suggested virtues were not applied to any of the snippets.
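
A count like this can be checked against the coding data. The sketch below assumes each snippet's codes are stored in a single cell as a separator-delimited string; that layout, the `;` separator, and the function names are all assumptions, not a description of `text_coding/coding.ods`.

```python
from typing import Iterable


def split_codes(cell: str, sep: str = ";") -> list[str]:
    """Split one coding cell into a list of trimmed virtue names."""
    return [part.strip() for part in cell.split(sep) if part.strip()]


def unused_virtues(candidates: Iterable[str],
                   coded_snippets: Iterable[Iterable[str]]) -> list[str]:
    """Return the candidate virtues that no snippet was coded with.

    Comparison ignores case and surrounding whitespace.
    """
    applied = {code.strip().casefold()
               for codes in coded_snippets
               for code in codes}
    return [v for v in candidates if v.strip().casefold() not in applied]
```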

### Analysis

An initial analysis (`results` tab of the above spreadsheet, or `text_coding/results.csv`), aided by several LLM tools (kimi-k2.5, glm-5, minimax-m2.5), reveals a distribution with several clusters alongside the outlier of "Adaptability." The groupings, however, do not produce any clear, natural cutoffs. It appears best to treat these virtues as a continuum rather than lean too hard on the clustering, which is not statistically significant.

A multivariate analysis of the raw coding (`coding` tab of the spreadsheet, or `text_coding/coding.csv`) suggests, again, that "Adaptability" is not only high in frequency but also a central hub. "Care" and "Consent" show the strongest association with each other, although neither is very frequent. Interestingly, the _As for Protocols_ snippets have higher network density than the _The Protocol Reader_ ones. See the multivariate analysis produced by [kimi-k2.5](https://ollama.com/library/kimi-k2.5) in `text_coding/analysis/`.
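
For readers unfamiliar with the terms, the sketch below shows one common way to compute hub-like centrality, pairwise association, and network density from coded snippets. It is not the analysis in `text_coding/analysis/`; the specific metrics (degree, Jaccard index, edge density) are assumptions about what such an analysis might use, and the example data in the test is invented for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable


def cooccurrence(snippets: Iterable[Iterable[str]]) -> tuple[Counter, Counter]:
    """Return (virtue frequencies, pair co-occurrence counts).

    Each snippet is a collection of virtue codes; pairs are stored as
    alphabetically sorted tuples so (a, b) and (b, a) are the same edge.
    """
    freq: Counter = Counter()
    pairs: Counter = Counter()
    for codes in snippets:
        unique = sorted(set(codes))
        freq.update(unique)
        pairs.update(combinations(unique, 2))
    return freq, pairs


def degree(pairs: Counter) -> Counter:
    """Number of distinct virtues each virtue co-occurs with."""
    deg: Counter = Counter()
    for a, b in pairs:
        deg[a] += 1
        deg[b] += 1
    return deg


def density(freq: Counter, pairs: Counter) -> float:
    """Share of possible virtue pairs that co-occur at least once."""
    n = len(freq)
    return 0.0 if n < 2 else len(pairs) / (n * (n - 1) / 2)


def jaccard(freq: Counter, pairs: Counter) -> dict[tuple[str, str], float]:
    """Jaccard index per pair: snippets with both / snippets with either."""
    return {(a, b): count / (freq[a] + freq[b] - count)
            for (a, b), count in pairs.items()}
```

Under these definitions, a virtue can be a hub (high degree) without being part of the strongest association, and a pair can be strongly associated while both members are infrequent, which is the pattern described above. Comparing density between the two books would mean running `cooccurrence` separately on each book's snippets.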

## Data stewardship

Both books are available freely on the internet in open-access editions. Initial processing was done with a local LLM, without transferring data to a cloud provider. Subsequent analysis was conducted with the Ollama cloud service, which does not permit model training on prompt data and does not retain prompts or responses.

The source texts are not included in this repository.