amachine
What is A-Machine?
A-Machine is a work in progress library for constructing epsilon-machines1 and other stochastic models for generating structured symbol sequences with ground truth causal structure and information-theoretic complexity. The intended use case is the study of neural network learning dynamics and internal representations.
Quickstart
Installation
# CPU only
pip install a-machine
# With GPU support (requires CUDA 13)
pip install "a-machine[cuda]" --extra-index-url https://pypi.nvidia.com
Example
More examples are available at the gitlab repo.
import amachine as am
# Seeds random and np.random globally
am.srand_global( 42 )
# May have multiple recurrent subgraphs, terminal states, or tranistory states
m = am.random_machine(
n_states=23,
symbols=[ '0', '1' ],
connectedness=0.35,
randomness=0.25 )
# Collapse to the largest recurrent subgraph
m.collapse_to_largest_strongly_connected_subgraph()
# Minimize the machine in case it's not minimal -> epsilon-machine.
m.minimize( retain_names=True )
# Entropy rate, statistical complexity
print( f"h_mu : {m.h_mu()}" )
print( f"C_mu : {m.C_mu()}" )
# Display the epislon machine
m.draw_graph()

HMMs and Epsilon Machines
An $\epsilon$-machine is the minimal, optimally predictive, unifilar, hidden Markov model of a stationary stochastic process. A large body of theory linking them to complex dynamical systems, information theory, and physics, has been developed by James P. Crutchfield and collaborators under the name computational mechanics 12.
A-Machine's HMM implementation is designed primarily for constructing $\epsilon$-machines to generate training data with interesting ground truth theoretical properties. This should be useful for making informed hypotheses about, designing experiments for, and interpreting the learning dynamics and internal representations of neural networks.
More expressive finite generative models, such as non-unifilar HMMs or probabilistic context-free grammars, can represent a broader class of languages than a finite $\epsilon$-machine, but may lack the theoretical tractability that makes $\epsilon$-machines particularly analytically useful. Thus, utilizing $\epsilon$-machines instead of alternatives, may involve a tradeoff between expressiveness and interpretability.
Complexity Measures
Complexity measures implemented within the HMM class include:
- statistical complexity $C_\mu$,
- entropy rate $h_{\mu}$,
- single symbol entropy H(1),
- anticipated information $\rho_{\mu}$,
- Excess entropy $\mathbf{E}$,
- Syncronization information $\mathbf{S}$,
- Transient information $\mathbf{T}$,
- Crypticity $\chi$, and
- Block convergence measures $\mathbf{S}(L), \mathbf{T}(L), \mathcal{H}(L),$ and $h_{\mu}(L)$.
Citations, definitions, and interpretations are in their individual documentation sections linked above.
Isomorphic Construction
A few ways to construct $\epsilon$-machines are impemented so far:
random_machine, which generates a random HMM that can be reduced to an $\epsilon$-machine,
star_join, which joins a list of machines to a central hub,
star, which combines random_machine and star_join,
isomorphic_to, which creates a copy of an existing machine with new labels,
full_isomorphic_rotation, which creates and star joins $n$ isomorphic machines, where $n$ is the number of symbols, and the $i^{th}$ machine's state $j$ emits the $k^{th}$ symbol in the alphabet, then the ${i+1}^{th}$ machine's state $j$ will emit symbol $k+1 \bmod n$, and

- unique_isomorphic, which creates and star joins
n_machinesisomorphic machines, wheren_machinesis a parameter, and each has its own alphabet.

Syntagmatics and Paradigmatics
The obvious differences between full_isomorphic_rotation and unique_isomorphic are that the former has $n$ sub-machnes = $n$ symbols, while the latter can have any number of sub-machines but the full alphabet grows with the number of machines. Both introduce learnable structural redundancies, which may be experimentally interesting, but they may be considered somewhat unnatural and structually boring relative to human language.
Imagine a world where every now and then, each word in the dictionary moves up one entry but takes on the meaning of the word it replaced. This is analogous to what full_isomorphic_rotation does. On the other hand, unique_isomorphic is like cloning the whole dictionary, and then exclusively using one dictionary at a time, while switching to a new one every now and then.
Neither are very good anologs for isomorphic substructure in human lanuage, where there are consistent rules, that creates coherent, rich, and varied relational structure.
And this is the type of structure we want to also in our $\epsilon$-machines, so that the insights gleaned may be generalize better to help us understand how artificial intelligence learns from, internally represents, and uses human language.
The work in progress framework for this is conceptually grounded in concepts borrowed from paradigmatic analysis and syntagmatic analysis. Instead studying how words can be chosen in positions within a body of text, we consider how symbols can be chosen for a transtion in a state machine. Within the hard constraints, some stochastic variability among valid choices can be permitted.
There is one complication, and that is an $\epsilon-$machine must be minimal and unifilar. You can't simply change a symbol on an edge at random, as it may break the structural properties of the machine.
Once suitable syntagmatics are defined for a single machine, we can apply the same concepts to joined collections of machines, in a way that is analogous to discourse structure in human language. At this level, we can create $\epsilon$-machines, with hierarchical isomorphic substructure.
A framework and API for generating models with these properties is still being worked out.
Mixed State Presentation
The mixed states presentation construction 34 tracks beliefs about which intrinsic state the $\epsilon$-machicine is in as an observer attempts to syncronize with the $\epsilon$ machines. MSP construction can be useful for analyzing learning dynamics, because this is the task that a neural network learning from the generated data must perform to minimize loss. This has been emipirically demonstrated in transformer models (in at least one instance) by Shai et al. 5.
The MSP can explode in terms of the number of states that are produced, or even be inifite for non-synchronizable processes. The belief states transitions also may produce fractal-like patterns in the probability simplex.
A-Machine implements MSP construction from a given $\epsilon$ machine. Since the MSP may have too many states to compute, a cap on the max number of states is imposed, and when the cap is reached, the MSP is closed by mapping the unresolved frontier of belief states back to the nearest existing belief states. When this happens, the final MSP is only an approximation.
The MSP also yields closed form expressions for a range of complexity measures 6, such as $\mathbf{E}, \mathbf{S}$, and $\mathbf{T}$, which can be solved for using spectral decompositon. A-machine implements this approach to computing these complexity measures, in addition to the alternative, block entropy convergence. When the MSP construction excedes the max states cap, comparing the measures computed from its approximation with those approximated by block entropy convergence is useful for corroboration and gaining additional insight into the synronization dynamics.
Block Entropy Convergence
As mentioned, block entropy convergence supports estimation of a range of complexity measures, and allows you to observe the convergence in uncertainty as a function of $L$ (after seeing all blocks of length $L$). A-machine implements this as a C++ extension.
Data Generation
Data generation generates stochastic sequences from a given machine, optionally along with the sequence of hidden states that were traversed, and also optionally with an additional symbol sequence with what we are calling isomorphic_shift applied. This changes the symbols that were emited by a state, for symbols that would have been emitted by alternative isomorphic states (should they exist).
import amachine as am
am.srand_global(42)
m = am.random_machine(
n_states=11,
symbols=[ '0', '1', '2' ],
connectedness=0.65,
randomness=0.37 )
# Collapse to the largest recurrent subgraph
m.collapse_to_largest_strongly_connected_subgraph()
# Minimize the machine -> epsilon-machine.
m.minimize()
# Create an isomorphic machine with new labels
m_iso = am.isomorphic_to( m, alphabet=[ '3', '4', '5' ] )
m_iso.isoclass = 0
m.isoclass = 0
for j, state in enumerate( m.states ) :
m.states[ j ].add_isomorph( m_iso.states[ j ].name )
m_iso.states[ j ].add_isomorph( m.states[ j ].name )
# Join them together
m_star = am.star_join(
exit_symbol='x',
enter_symbols=['a','b'],
machines=[ m, m_iso ],
mode_residency_factor=0.75
)
m_star.draw_graph()
m_star.generate_data(
file_prefix="./data/iso_012-345", # saves as iso_012-345.parqet
n_gen=2_000_000_000, # 2 billion symbols
include_states=True,
isomorphic_shifts={1}
)

The two sequences will be aligned, and with identical local and global statistics. Thus, you can train on one of them, and then analyze activation patterns and residuals on both of them in parallel during inference, to study if and how the neural network learned polysemantic representations that exploit the structural redundancies.
Cleaner and more general methods for constructing graphs with ismorphic substructure are in the works.
-
Crutchfield, "The calculi of emergence: computation, dynamics and induction.", 1994. https://csc.ucdavis.edu/~cmg/papers/CalcEmerg.pdf ↩
-
Shalizi, Rohilla, and Crutchfield. "Computational mechanics: Pattern and prediction, structure and simplicity.", 2001. https://arxiv.org/abs/cond-mat/9907176 ↩
-
Blackwell, David Harold. "The entropy of functions of finite-state Markov chains.", 1959. ↩
-
Jurgens & Crutchfield "Shannon Entropy Rate of Hidden Markov Processes", 2021. https://link.springer.com/article/10.1007/s10955-021-02769-3 ↩
-
Shai et al., "Transformers Represent Belief State Geometry in their Residual Stream", 2024. https://arxiv.org/abs/2405.15943 ↩
-
Crutchfield, "Exact complexity: The spectral decomposition of intrinsic computation", 2016. https://csc.ucdavis.edu/~cmg/papers/ec.pdf ↩
1""" 2# What is A-Machine? 3 4A-Machine is a work in progress library for constructing epsilon-machines[^1] and other stochastic models for generating structured symbol sequences with ground truth causal structure and information-theoretic complexity. The intended use case is the study of neural network learning dynamics and internal representations. 5 6# Quickstart 7 8## Installation 9 10```bash 11# CPU only 12pip install a-machine 13 14# With GPU support (requires CUDA 13) 15pip install "a-machine[cuda]" --extra-index-url https://pypi.nvidia.com 16``` 17 18##Example 19 20[More examples](https://gitlab.com/tneuroth/a-machine/-/tree/main/examples) are available at the gitlab repo. 21 22```python 23import amachine as am 24 25# Seeds random and np.random globally 26am.srand_global( 42 ) 27 28# May have multiple recurrent subgraphs, terminal states, or tranistory states 29m = am.random_machine( 30 n_states=23, 31 symbols=[ '0', '1' ], 32 connectedness=0.35, 33 randomness=0.25 ) 34 35# Collapse to the largest recurrent subgraph 36m.collapse_to_largest_strongly_connected_subgraph() 37 38# Minimize the machine in case it's not minimal -> epsilon-machine. 39m.minimize( retain_names=True ) 40 41# Entropy rate, statistical complexity 42print( f"h_mu : {m.h_mu()}" ) 43print( f"C_mu : {m.C_mu()}" ) 44 45# Display the epislon machine 46m.draw_graph() 47``` 48 49<img src="resources/am_graph.png" alt="AM Graph" style="width: 80%; margin-left: 10%;"> 50 51# HMMs and Epsilon Machines 52 53An $\\epsilon$-machine is the minimal, optimally predictive, unifilar, hidden Markov model of a stationary stochastic process. A large body of theory linking them to complex dynamical systems, information theory, and physics, has been developed by James P. Crutchfield and collaborators under the name computational mechanics [^1][^2]. 54 55A-Machine's [HMM implementation](amachine/am_hmm.html#HMM) is designed primarily for constructing $\\epsilon$-machines to generate training data with interesting ground truth theoretical properties. This should be useful for making informed hypotheses about, designing experiments for, and interpreting the learning dynamics and internal representations of neural networks. 56 57More expressive finite generative models, such as non-unifilar HMMs or probabilistic context-free grammars, can represent a broader class of languages than a finite $\\epsilon$-machine, but may lack the theoretical tractability that makes $\\epsilon$-machines particularly analytically useful. Thus, utilizing $\\epsilon$-machines instead of alternatives, may involve a tradeoff between expressiveness and interpretability. 58 59## Complexity Measures 60 61Complexity measures implemented within the HMM class include: 62 63* [statistical complexity $C_\\mu$](amachine/am_hmm.html#HMM.C_mu), 64* [entropy rate $h_{\\mu}$](amachine/am_hmm.html#HMM.h_mu), 65* [single symbol entropy H(1)](amachine/am_hmm.html#HMM.H_1), 66* [anticipated information $\\rho_{\\mu}$](amachine/am_hmm.html#HMM.rho_mu), 67* [Excess entropy $\\mathbf{E}$](amachine/am_hmm.html#HMM.E), 68* [Syncronization information $\\mathbf{S}$](amachine/am_hmm.html#HMM.S), 69* [Transient information $\\mathbf{T}$](amachine/am_hmm.html#HMM.S), 70* [Crypticity $\\chi$](amachine/am_hmm.html#HMM.chi), and 71* [Block convergence measures $\\mathbf{S}(L), \\mathbf{T}(L), \\mathcal{H}(L),$ and $h_{\\mu}(L)$](amachine/am_hmm.html#HMM.block_convergence). 72 73Citations, definitions, and interpretations are in their individual documentation sections linked above. 74 75## Isomorphic Construction 76 77A few ways to construct $\\epsilon$-machines are impemented so far: 78 79* [random_machine](amachine/am_create.html#random_machine), which generates a random HMM that can be reduced to an $\\epsilon$-machine, 80 81* [star_join](amachine/am_create.html#star_join), which joins a list of machines to a central hub, 82 83* [star](amachine/am_create.html#star), which combines random_machine and star_join, 84 85* [isomorphic_to](amachine/am_create.html#isomorphic_to), which creates a copy of an existing machine with new labels, 86 87* [full_isomorphic_rotation](amachine/am_create.html#full_isomorphic_rotation), which creates and star joins $n$ isomorphic machines, where $n$ is the number of symbols, and the $i^{th}$ machine's state $j$ emits the $k^{th}$ symbol in the alphabet, then the ${i+1}^{th}$ machine's state $j$ will emit symbol $k+1 \\bmod n$, and 88 89<img src="resources/iso_rotation.png" alt="iso rotation" style="width: 100%; margin-left: 0%;"> 90 91* [unique_isomorphic](amachine/am_create.html#unique_isomorphic), which creates and star joins `n_machines` isomorphic machines, where `n_machines` is a parameter, and each has its own alphabet. 92 93<img src="resources/unique_iso.png" alt="unique iso" style="width: 100%; margin-left: 0%;"> 94 95##Syntagmatics and Paradigmatics 96 97The obvious differences between `full_isomorphic_rotation` and `unique_isomorphic` are that the former has $n$ sub-machnes = $n$ symbols, while the latter can have any number of sub-machines but the full alphabet grows with the number of machines. Both introduce learnable structural redundancies, which may be experimentally interesting, but they may be considered somewhat unnatural and structually boring relative to human language. 98 99Imagine a world where every now and then, each word in the dictionary moves up one entry but takes on the meaning of the word it replaced. This is analogous to what `full_isomorphic_rotation` does. On the other hand, `unique_isomorphic` is like cloning the whole dictionary, and then exclusively using one dictionary at a time, while switching to a new one every now and then. 100 101Neither are very good anologs for isomorphic substructure in human lanuage, where there are consistent rules, that creates coherent, rich, and varied relational structure. 102 103And this is the type of structure we want to also in our $\\epsilon$-machines, so that the insights gleaned may be generalize better to help us understand how artificial intelligence learns from, internally represents, and uses human language. 104 105The work in progress framework for this is conceptually grounded in concepts borrowed from [paradigmatic analysis](https://en.wikipedia.org/wiki/Paradigmatic_analysis) and [syntagmatic analysis](https://en.wikipedia.org/wiki/Syntagmatic_analysis). Instead studying how words can be chosen in positions within a body of text, we consider how symbols can be chosen for a transtion in a state machine. Within the hard constraints, some stochastic variability among valid choices can be permitted. 106 107There is one complication, and that is an $\\epsilon-$machine must be minimal and unifilar. You can't simply change a symbol on an edge at random, as it may break the structural properties of the machine. 108 109Once suitable syntagmatics are defined for a single machine, we can apply the same concepts to joined collections of machines, in a way that is analogous to [discourse structure](https://en.wikipedia.org/wiki/Discourse_relation) in human language. At this level, we can create $\\epsilon$-machines, with hierarchical isomorphic substructure. 110 111A framework and API for generating models with these properties is still being worked out. 112 113## Mixed State Presentation 114 115The mixed states presentation construction [^3][^4] tracks beliefs about which intrinsic state the $\\epsilon$-machicine is in as an observer attempts to syncronize with the $\\epsilon$ machines. MSP construction can be useful for analyzing learning dynamics, because this is the task that a neural network learning from the generated data must perform to minimize loss. This has been emipirically demonstrated in transformer models (in at least one instance) by Shai et al. [^5]. 116 117The MSP can explode in terms of the number of states that are produced, or even be inifite for non-synchronizable processes. The belief states transitions also may produce fractal-like patterns in the probability simplex. 118 119A-Machine implements [MSP construction](amachine/am_machine.html#get_msp) from a given $\\epsilon$ machine. Since the MSP may have too many states to compute, a cap on the max number of states is imposed, and when the cap is reached, the MSP is closed by mapping the unresolved frontier of belief states back to the nearest existing belief states. When this happens, the final MSP is only an approximation. 120 121The MSP also yields closed form expressions for a range of complexity measures [^6], such as $\\mathbf{E}, \\mathbf{S}$, and $\\mathbf{T}$, which can be solved for using spectral decompositon. A-machine implements this approach to computing these complexity measures, in addition to the alternative, block entropy convergence. When the MSP construction excedes the max states cap, comparing the measures computed from its approximation with those approximated by block entropy convergence is useful for corroboration and gaining additional insight into the synronization dynamics. 122 123## Block Entropy Convergence 124 125As mentioned, [block entropy convergence](amachine/am_hmm.html#HMM.block_convergence) supports estimation of a range of complexity measures, and allows you to observe the convergence in uncertainty as a function of $L$ (after seeing all blocks of length $L$). A-machine implements this as a [C++ extension](https://gitlab.com/tneuroth/a-machine/-/blob/main/src/amachine/am_fast/am_fast.cpp). 126 127## Data Generation 128 129[Data generation](amachine/am_hmm.html#HMM.generate_data) generates stochastic sequences from a given machine, optionally along with the sequence of hidden states that were traversed, and also optionally with an additional symbol sequence with what we are calling [isomorphic_shift](amachine/am_hmm.html#HMM.isomorphic_shift) applied. This changes the symbols that were emited by a state, for symbols that would have been emitted by alternative isomorphic states (should they exist). 130 131```python 132import amachine as am 133 134am.srand_global(42) 135 136m = am.random_machine( 137 n_states=11, 138 symbols=[ '0', '1', '2' ], 139 connectedness=0.65, 140 randomness=0.37 ) 141 142# Collapse to the largest recurrent subgraph 143m.collapse_to_largest_strongly_connected_subgraph() 144 145# Minimize the machine -> epsilon-machine. 146m.minimize() 147 148# Create an isomorphic machine with new labels 149m_iso = am.isomorphic_to( m, alphabet=[ '3', '4', '5' ] ) 150 151m_iso.isoclass = 0 152m.isoclass = 0 153 154for j, state in enumerate( m.states ) : 155 m.states[ j ].add_isomorph( m_iso.states[ j ].name ) 156 m_iso.states[ j ].add_isomorph( m.states[ j ].name ) 157 158# Join them together 159m_star = am.star_join( 160 exit_symbol='x', 161 enter_symbols=['a','b'], 162 machines=[ m, m_iso ], 163 mode_residency_factor=0.75 164) 165 166m_star.draw_graph() 167 168m_star.generate_data( 169 file_prefix="./data/iso_012-345", # saves as iso_012-345.parqet 170 n_gen=2_000_000_000, # 2 billion symbols 171 include_states=True, 172 isomorphic_shifts={1} 173) 174``` 175 176<img src="resources/iso.png" alt="iso machine" style="width: 100%; margin-left: 0%;"> 177 178The two sequences will be aligned, and with identical local and global statistics. Thus, you can train on one of them, and then analyze activation patterns and residuals on both of them in parallel during inference, to study if and how the neural network learned polysemantic representations that exploit the structural redundancies. 179 180*Cleaner and more general methods for constructing graphs with ismorphic substructure are in the works.* 181 182[^1]: Crutchfield, "The calculi of emergence: computation, dynamics and induction.", 1994. 183 <https://csc.ucdavis.edu/~cmg/papers/CalcEmerg.pdf> 184 185[^2]: Shalizi, Rohilla, and Crutchfield. "Computational mechanics: Pattern and prediction, structure and simplicity.", 2001. 186 <https://arxiv.org/abs/cond-mat/9907176> 187 188[^3]: Blackwell, David Harold. "The entropy of functions of finite-state Markov chains.", 1959. 189 190[^4]: Jurgens & Crutchfield "Shannon Entropy Rate of Hidden Markov Processes", 2021. 191 <https://link.springer.com/article/10.1007/s10955-021-02769-3> 192 193[^5]: Shai et al., "Transformers Represent Belief State Geometry in their Residual Stream", 2024. 194 <https://arxiv.org/abs/2405.15943> 195 196[^6]: Crutchfield, "Exact complexity: The spectral decomposition of intrinsic computation", 2016. 197 <https://csc.ucdavis.edu/~cmg/papers/ec.pdf> 198 199""" 200__version__ = "0.2.0" 201 202from .am_create import ( 203 random_machine, 204 star_join, 205 star, 206 isomorphic_to, 207 full_isomorphic_rotation, 208 unique_isomorphic 209) 210 211from .am_hmm import HMM 212 213from .am_syntagmatics import ( 214 Syntagmatics, 215 ExplicitCohesionRule 216) 217 218from .am_substitution_algebra import ( 219 SubstitutionAlgebra 220) 221 222from .am_random import srand_global 223from .am_vocabulary import Vocabulary