GitLab Repo

amachine

What is A-Machine?

A-Machine is a work in progress library for constructing epsilon-machines1 and other stochastic models for generating structured symbol sequences with ground truth causal structure and information-theoretic complexity. The intended use case is the study of neural network learning dynamics and internal representations.

Quickstart

Installation

# CPU only
pip install a-machine

# With GPU support (requires CUDA 13)
pip install "a-machine[cuda]" --extra-index-url https://pypi.nvidia.com

Example

More examples are available at the gitlab repo.

import amachine as am

# Seeds random and np.random globally
am.srand_global( 42 )

# May have multiple recurrent subgraphs, terminal states, or tranistory states
m = am.random_machine( 
        n_states=23, 
        symbols=[ '0', '1' ],
        connectedness=0.35,
        randomness=0.25 )

# Collapse to the largest recurrent subgraph
m.collapse_to_largest_strongly_connected_subgraph()

# Minimize the machine in case it's not minimal -> epsilon-machine.
m.minimize( retain_names=True )

# Entropy rate, statistical complexity
print( f"h_mu : {m.h_mu()}" )
print( f"C_mu : {m.C_mu()}" )

# Display the epislon machine
m.draw_graph()

AM Graph

HMMs and Epsilon Machines

An $\epsilon$-machine is the minimal, optimally predictive, unifilar, hidden Markov model of a stationary stochastic process. A large body of theory linking them to complex dynamical systems, information theory, and physics, has been developed by James P. Crutchfield and collaborators under the name computational mechanics 12.

A-Machine's HMM implementation is designed primarily for constructing $\epsilon$-machines to generate training data with interesting ground truth theoretical properties. This should be useful for making informed hypotheses about, designing experiments for, and interpreting the learning dynamics and internal representations of neural networks.

More expressive finite generative models, such as non-unifilar HMMs or probabilistic context-free grammars, can represent a broader class of languages than a finite $\epsilon$-machine, but may lack the theoretical tractability that makes $\epsilon$-machines particularly analytically useful. Thus, utilizing $\epsilon$-machines instead of alternatives, may involve a tradeoff between expressiveness and interpretability.

Complexity Measures

Complexity measures implemented within the HMM class include:

Citations, definitions, and interpretations are in their individual documentation sections linked above.

Isomorphic Construction

A few ways to construct $\epsilon$-machines are impemented so far:

  • random_machine, which generates a random HMM that can be reduced to an $\epsilon$-machine,

  • star_join, which joins a list of machines to a central hub,

  • star, which combines random_machine and star_join,

  • isomorphic_to, which creates a copy of an existing machine with new labels,

  • full_isomorphic_rotation, which creates and star joins $n$ isomorphic machines, where $n$ is the number of symbols, and the $i^{th}$ machine's state $j$ emits the $k^{th}$ symbol in the alphabet, then the ${i+1}^{th}$ machine's state $j$ will emit symbol $k+1 \bmod n$, and

iso rotation

  • unique_isomorphic, which creates and star joins n_machines isomorphic machines, where n_machines is a parameter, and each has its own alphabet.

unique iso

Syntagmatics and Paradigmatics

The obvious differences between full_isomorphic_rotation and unique_isomorphic are that the former has $n$ sub-machnes = $n$ symbols, while the latter can have any number of sub-machines but the full alphabet grows with the number of machines. Both introduce learnable structural redundancies, which may be experimentally interesting, but they may be considered somewhat unnatural and structually boring relative to human language.

Imagine a world where every now and then, each word in the dictionary moves up one entry but takes on the meaning of the word it replaced. This is analogous to what full_isomorphic_rotation does. On the other hand, unique_isomorphic is like cloning the whole dictionary, and then exclusively using one dictionary at a time, while switching to a new one every now and then.

Neither are very good anologs for isomorphic substructure in human lanuage, where there are consistent rules, that creates coherent, rich, and varied relational structure.

And this is the type of structure we want to also in our $\epsilon$-machines, so that the insights gleaned may be generalize better to help us understand how artificial intelligence learns from, internally represents, and uses human language.

The work in progress framework for this is conceptually grounded in concepts borrowed from paradigmatic analysis and syntagmatic analysis. Instead studying how words can be chosen in positions within a body of text, we consider how symbols can be chosen for a transtion in a state machine. Within the hard constraints, some stochastic variability among valid choices can be permitted.

There is one complication, and that is an $\epsilon-$machine must be minimal and unifilar. You can't simply change a symbol on an edge at random, as it may break the structural properties of the machine.

Once suitable syntagmatics are defined for a single machine, we can apply the same concepts to joined collections of machines, in a way that is analogous to discourse structure in human language. At this level, we can create $\epsilon$-machines, with hierarchical isomorphic substructure.

A framework and API for generating models with these properties is still being worked out.

Mixed State Presentation

The mixed states presentation construction 34 tracks beliefs about which intrinsic state the $\epsilon$-machicine is in as an observer attempts to syncronize with the $\epsilon$ machines. MSP construction can be useful for analyzing learning dynamics, because this is the task that a neural network learning from the generated data must perform to minimize loss. This has been emipirically demonstrated in transformer models (in at least one instance) by Shai et al. 5.

The MSP can explode in terms of the number of states that are produced, or even be inifite for non-synchronizable processes. The belief states transitions also may produce fractal-like patterns in the probability simplex.

A-Machine implements MSP construction from a given $\epsilon$ machine. Since the MSP may have too many states to compute, a cap on the max number of states is imposed, and when the cap is reached, the MSP is closed by mapping the unresolved frontier of belief states back to the nearest existing belief states. When this happens, the final MSP is only an approximation.

The MSP also yields closed form expressions for a range of complexity measures 6, such as $\mathbf{E}, \mathbf{S}$, and $\mathbf{T}$, which can be solved for using spectral decompositon. A-machine implements this approach to computing these complexity measures, in addition to the alternative, block entropy convergence. When the MSP construction excedes the max states cap, comparing the measures computed from its approximation with those approximated by block entropy convergence is useful for corroboration and gaining additional insight into the synronization dynamics.

Block Entropy Convergence

As mentioned, block entropy convergence supports estimation of a range of complexity measures, and allows you to observe the convergence in uncertainty as a function of $L$ (after seeing all blocks of length $L$). A-machine implements this as a C++ extension.

You also plot block entropy curves and block measures, by calling amachine.HMM.draw_block_measure_curves, and amachine.HMM.draw_block_entropy_curve.

block measures plots

Data Generation

Data generation generates stochastic sequences from a given machine, optionally along with the sequence of hidden states that were traversed, and also optionally with an additional symbol sequence with what we are calling isomorphic_shift applied. This changes the symbols that were emited by a state, for symbols that would have been emitted by alternative isomorphic states (should they exist).

import amachine as am

am.srand_global(42)

m = am.random_machine( 
        n_states=11, 
        symbols=[ '0', '1', '2' ],
        connectedness=0.65,
        randomness=0.37 )

# Collapse to the largest recurrent subgraph
m.collapse_to_largest_strongly_connected_subgraph()

# Minimize the machine -> epsilon-machine.
m.minimize()

# Create an isomorphic machine with new labels
m_iso = am.isomorphic_to( m, alphabet=[ '3', '4', '5' ] )

m_iso.isoclass = 0
m.isoclass     = 0

for j, state in enumerate( m.states ) : 
    m.states[ j ].add_isomorph( m_iso.states[ j ].name )
    m_iso.states[ j ].add_isomorph( m.states[ j ].name )

# Join them together
m_star = am.star_join(
    exit_symbol='x', 
    enter_symbols=['a','b'],
    machines=[ m, m_iso ],
    mode_residency_factor=0.75 
)

m_star.draw_graph()

m_star.generate_data(
    file_prefix="./data/iso_012-345", # saves as iso_012-345.parqet
    n_gen=2_000_000_000, # 2 billion symbols    
    include_states=True,
    isomorphic_shifts={1}
)

iso machine

The two sequences will be aligned, and with identical local and global statistics. Thus, you can train on one of them, and then analyze activation patterns and residuals on both of them in parallel during inference, to study if and how the neural network learned polysemantic representations that exploit the structural redundancies.

Cleaner and more general methods for constructing graphs with ismorphic substructure are in the works.


  1. Crutchfield, "The calculi of emergence: computation, dynamics and induction.", 1994. https://csc.ucdavis.edu/~cmg/papers/CalcEmerg.pdf 

  2. Shalizi, Rohilla, and Crutchfield. "Computational mechanics: Pattern and prediction, structure and simplicity.", 2001. https://arxiv.org/abs/cond-mat/9907176 

  3. Blackwell, David Harold. "The entropy of functions of finite-state Markov chains.", 1959. 

  4. Jurgens & Crutchfield "Shannon Entropy Rate of Hidden Markov Processes", 2021. https://link.springer.com/article/10.1007/s10955-021-02769-3 

  5. Shai et al., "Transformers Represent Belief State Geometry in their Residual Stream", 2024. https://arxiv.org/abs/2405.15943 

  6. Crutchfield, "Exact complexity: The spectral decomposition of intrinsic computation", 2016. https://csc.ucdavis.edu/~cmg/papers/ec.pdf 

  1"""
  2# What is A-Machine?
  3
  4A-Machine is a work in progress library for constructing epsilon-machines[^1] and other stochastic models for generating structured symbol sequences with ground truth causal structure and information-theoretic complexity. The intended use case is the study of neural network learning dynamics and internal representations.
  5
  6# Quickstart
  7
  8## Installation
  9
 10```bash
 11# CPU only
 12pip install a-machine
 13
 14# With GPU support (requires CUDA 13)
 15pip install "a-machine[cuda]" --extra-index-url https://pypi.nvidia.com
 16```
 17
 18##Example
 19
 20[More examples](https://gitlab.com/tneuroth/a-machine/-/tree/main/examples) are available at the gitlab repo.
 21
 22```python
 23import amachine as am
 24
 25# Seeds random and np.random globally
 26am.srand_global( 42 )
 27
 28# May have multiple recurrent subgraphs, terminal states, or tranistory states
 29m = am.random_machine( 
 30	n_states=23, 
 31	symbols=[ '0', '1' ],
 32	connectedness=0.35,
 33	randomness=0.25 )
 34
 35# Collapse to the largest recurrent subgraph
 36m.collapse_to_largest_strongly_connected_subgraph()
 37
 38# Minimize the machine in case it's not minimal -> epsilon-machine.
 39m.minimize( retain_names=True )
 40
 41# Entropy rate, statistical complexity
 42print( f"h_mu : {m.h_mu()}" )
 43print( f"C_mu : {m.C_mu()}" )
 44
 45# Display the epislon machine
 46m.draw_graph()
 47```
 48
 49<img src="resources/am_graph.png" alt="AM Graph" style="width: 80%; margin-left: 10%;">
 50
 51# HMMs and Epsilon Machines
 52
 53An $\\epsilon$-machine is the minimal, optimally predictive, unifilar, hidden Markov model of a stationary stochastic process. A large body of theory linking them to complex dynamical systems, information theory, and physics, has been developed by James P. Crutchfield and collaborators under the name computational mechanics [^1][^2].
 54
 55A-Machine's [HMM implementation](amachine/am_hmm.html#HMM) is designed primarily for constructing $\\epsilon$-machines to generate training data with interesting ground truth theoretical properties. This should be useful for making informed hypotheses about, designing experiments for, and interpreting the learning dynamics and internal representations of neural networks.
 56
 57More expressive finite generative models, such as non-unifilar HMMs or probabilistic context-free grammars, can represent a broader class of languages than a finite $\\epsilon$-machine, but may lack the theoretical tractability that makes $\\epsilon$-machines particularly analytically useful. Thus, utilizing $\\epsilon$-machines instead of alternatives, may involve a tradeoff between expressiveness and interpretability.
 58
 59## Complexity Measures
 60
 61Complexity measures implemented within the HMM class include:
 62
 63* [statistical complexity $C_\\mu$](amachine/am_hmm.html#HMM.C_mu),
 64* [entropy rate $h_{\\mu}$](amachine/am_hmm.html#HMM.h_mu),
 65* [single symbol entropy H(1)](amachine/am_hmm.html#HMM.H_1), 
 66* [anticipated information $\\rho_{\\mu}$](amachine/am_hmm.html#HMM.rho_mu),
 67* [Excess entropy $\\mathbf{E}$](amachine/am_hmm.html#HMM.E),
 68* [Syncronization information $\\mathbf{S}$](amachine/am_hmm.html#HMM.S),
 69* [Transient information $\\mathbf{T}$](amachine/am_hmm.html#HMM.S),
 70* [Crypticity $\\chi$](amachine/am_hmm.html#HMM.chi), and
 71* [Block convergence measures $\\mathbf{S}(L), \\mathbf{T}(L), \\mathcal{H}(L),$ and $h_{\\mu}(L)$](amachine/am_hmm.html#HMM.block_convergence).
 72
 73Citations, definitions, and interpretations are in their individual documentation sections linked above.
 74
 75## Isomorphic Construction
 76
 77A few ways to construct $\\epsilon$-machines are impemented so far:
 78
 79* [random_machine](amachine/am_create.html#random_machine), which generates a random HMM that can be reduced to an $\\epsilon$-machine,
 80
 81* [star_join](amachine/am_create.html#star_join), which joins a list of machines to a central hub,
 82
 83* [star](amachine/am_create.html#star), which combines random_machine and star_join,
 84
 85* [isomorphic_to](amachine/am_create.html#isomorphic_to), which creates a copy of an existing machine with new labels,
 86
 87* [full_isomorphic_rotation](amachine/am_create.html#full_isomorphic_rotation), which creates and star joins $n$ isomorphic machines, where $n$ is the number of symbols, and the $i^{th}$ machine's state $j$ emits the $k^{th}$ symbol in the alphabet, then the ${i+1}^{th}$ machine's state $j$ will emit symbol $k+1 \\bmod n$, and
 88
 89<img src="resources/iso_rotation.png" alt="iso rotation" style="width: 100%; margin-left: 0%;">
 90
 91* [unique_isomorphic](amachine/am_create.html#unique_isomorphic), which creates and star joins `n_machines` isomorphic machines, where `n_machines` is a parameter, and each has its own alphabet.
 92
 93<img src="resources/unique_iso.png" alt="unique iso" style="width: 100%; margin-left: 0%;">
 94
 95##Syntagmatics and Paradigmatics
 96
 97The obvious differences between `full_isomorphic_rotation` and `unique_isomorphic` are that the former has $n$ sub-machnes = $n$ symbols, while the latter can have any number of sub-machines but the full alphabet grows with the number of machines.  Both introduce learnable structural redundancies, which may be experimentally interesting, but they may be considered somewhat unnatural and structually boring relative to human language.
 98
 99Imagine a world where every now and then, each word in the dictionary moves up one entry but takes on the meaning of the word it replaced. This is analogous to what `full_isomorphic_rotation` does. On the other hand, `unique_isomorphic` is like cloning the whole dictionary, and then exclusively using one dictionary at a time, while switching to a new one every now and then.
100
101Neither are very good anologs for isomorphic substructure in human lanuage, where there are consistent rules, that creates coherent, rich, and varied relational structure. 
102
103And this is the type of structure we want to also in our $\\epsilon$-machines, so that the insights gleaned may be generalize better to help us understand how artificial intelligence learns from, internally represents, and uses human language.
104
105The work in progress framework for this is conceptually grounded in concepts borrowed from [paradigmatic analysis](https://en.wikipedia.org/wiki/Paradigmatic_analysis) and [syntagmatic analysis](https://en.wikipedia.org/wiki/Syntagmatic_analysis). Instead studying how words can be chosen in positions within a body of text, we consider how symbols can be chosen for a transtion in a state machine. Within the hard constraints, some stochastic variability among valid choices can be permitted. 
106
107There is one complication, and that is an $\\epsilon-$machine must be minimal and unifilar. You can't simply change a symbol on an edge at random, as it may break the structural properties of the machine.
108
109Once suitable syntagmatics are defined for a single machine, we can apply the same concepts to joined collections of machines, in a way that is analogous to [discourse structure](https://en.wikipedia.org/wiki/Discourse_relation) in human language. At this level, we can create $\\epsilon$-machines, with hierarchical isomorphic substructure.
110
111A framework and API for generating models with these properties is still being worked out.
112
113## Mixed State Presentation
114
115The mixed states presentation construction [^3][^4] tracks beliefs about which intrinsic state the $\\epsilon$-machicine is in as an observer attempts to syncronize with the $\\epsilon$ machines. MSP construction can be useful for analyzing learning dynamics, because this is the task that a neural network learning from the generated data must perform to minimize loss. This has been emipirically demonstrated in transformer models (in at least one instance) by Shai et al. [^5].
116
117The MSP can explode in terms of the number of states that are produced, or even be inifite for non-synchronizable processes. The belief states transitions also may produce fractal-like patterns in the  probability simplex.
118
119A-Machine implements [MSP construction](amachine/am_machine.html#get_msp) from a given $\\epsilon$ machine. Since the MSP may have too many states to compute, a cap on the max number of states is imposed, and when the cap is reached, the MSP is closed by mapping the unresolved frontier of belief states back to the nearest existing belief states. When this happens, the final MSP is only an approximation.
120
121The MSP also yields closed form expressions for a range of complexity measures [^6], such as $\\mathbf{E}, \\mathbf{S}$, and $\\mathbf{T}$, which can be solved for using spectral decompositon. A-machine implements this approach to computing these complexity measures, in addition to the alternative, block entropy convergence. When the MSP construction excedes the max states cap, comparing the measures computed from its approximation with those approximated by block entropy convergence is useful for corroboration and gaining additional insight into the synronization dynamics.
122
123## Block Entropy Convergence
124
125As mentioned, [block entropy convergence](amachine/am_hmm.html#HMM.block_convergence) supports estimation of a range of complexity measures, and allows you to observe the convergence in uncertainty as a function of $L$ (after seeing all blocks of length $L$). A-machine implements this as a [C++ extension](https://gitlab.com/tneuroth/a-machine/-/blob/main/src/amachine/am_fast/am_fast.cpp).
126
127You also plot block entropy curves and block measures, by calling [`amachine.HMM.draw_block_measure_curves`](amachine/am_hmmhtml#HMM.draw_block_measure_curves), and [`amachine.HMM.draw_block_entropy_curve`](amachine/am_hmm.html#HMMdraw_block_entropy_curve). 
128
129<img src="resources/curves.png" alt="block measures plots" style="width: 100%; margin-left: 0%;">
130
131## Data Generation
132
133[Data generation](amachine/am_hmm.html#HMM.generate_data) generates stochastic sequences from a given machine, optionally along with the sequence of hidden states that were traversed, and also optionally with an additional symbol sequence with what we are calling [isomorphic_shift](amachine/am_hmm.html#HMM.isomorphic_shift) applied. This changes the symbols that were emited by a state, for symbols that would have been emitted by alternative isomorphic states (should they exist).
134
135```python
136import amachine as am
137
138am.srand_global(42)
139
140m = am.random_machine( 
141        n_states=11, 
142        symbols=[ '0', '1', '2' ],
143        connectedness=0.65,
144        randomness=0.37 )
145
146# Collapse to the largest recurrent subgraph
147m.collapse_to_largest_strongly_connected_subgraph()
148
149# Minimize the machine -> epsilon-machine.
150m.minimize()
151
152# Create an isomorphic machine with new labels
153m_iso = am.isomorphic_to( m, alphabet=[ '3', '4', '5' ] )
154
155m_iso.isoclass = 0
156m.isoclass     = 0
157
158for j, state in enumerate( m.states ) : 
159    m.states[ j ].add_isomorph( m_iso.states[ j ].name )
160    m_iso.states[ j ].add_isomorph( m.states[ j ].name )
161    
162# Join them together
163m_star = am.star_join(
164    exit_symbol='x', 
165    enter_symbols=['a','b'],
166    machines=[ m, m_iso ],
167    mode_residency_factor=0.75 
168)
169
170m_star.draw_graph()
171
172m_star.generate_data(
173    file_prefix="./data/iso_012-345", # saves as iso_012-345.parqet
174    n_gen=2_000_000_000, # 2 billion symbols    
175    include_states=True,
176    isomorphic_shifts={1}
177)
178```
179
180<img src="resources/iso.png" alt="iso machine" style="width: 100%; margin-left: 0%;">
181
182The two sequences will be aligned, and with identical local and global statistics. Thus, you can train on one of them, and then analyze activation patterns and residuals on both of them in parallel during inference, to study if and how the neural network learned polysemantic representations that exploit the structural redundancies.
183
184*Cleaner and more general methods for constructing graphs with ismorphic substructure are in the works.*
185
186[^1]: Crutchfield, "The calculi of emergence: computation, dynamics and induction.", 1994.
187    <https://csc.ucdavis.edu/~cmg/papers/CalcEmerg.pdf>
188
189[^2]: Shalizi, Rohilla, and Crutchfield. "Computational mechanics: Pattern and prediction, structure and simplicity.", 2001.
190    <https://arxiv.org/abs/cond-mat/9907176>
191
192[^3]: Blackwell, David Harold. "The entropy of functions of finite-state Markov chains.", 1959.
193
194[^4]: Jurgens & Crutchfield "Shannon Entropy Rate of Hidden Markov Processes", 2021.
195    <https://link.springer.com/article/10.1007/s10955-021-02769-3>
196
197[^5]: Shai et al., "Transformers Represent Belief State Geometry in their Residual Stream", 2024. 
198    <https://arxiv.org/abs/2405.15943>
199
200[^6]: Crutchfield, "Exact complexity: The spectral decomposition of intrinsic computation", 2016.
201    <https://csc.ucdavis.edu/~cmg/papers/ec.pdf>
202
203"""
204__version__ = "0.2.0"
205
206from .am_create import (
207    random_machine,
208    star_join,
209    star,
210    isomorphic_to,
211    full_isomorphic_rotation,
212    unique_isomorphic
213)
214
215from .am_hmm import HMM
216
217from .am_syntagmatics import (
218    Syntagmatics,
219    ExplicitCohesionRule
220)
221
222from .am_substitution_algebra import (
223    SubstitutionAlgebra
224)
225
226from .am_random import srand_global
227from .am_vocabulary import Vocabulary