@cardmagic cardmagic commented Dec 28, 2025

Summary

  • Add as_json method that returns a Hash representation
  • Add to_json method that returns a JSON string
  • Add from_json class method that accepts either a JSON string or Hash
  • Add save(path) and load(path) for file operations
  • Use versioned JSON format for future compatibility

Usage

# Train and save to file
classifier = Classifier::Bayes.new('Spam', 'Ham')
classifier.train_spam('buy now cheap')
classifier.save('model.json')

# Load from file and continue training
loaded = Classifier::Bayes.load('model.json')
loaded.train_ham('meeting tomorrow')
loaded.classify('special offer')  # => "Spam"

# Get hash representation
hash = classifier.as_json
# => { version: 1, type: 'bayes', categories: {...}, ... }

# Get JSON string
json_string = classifier.to_json
# => '{"version":1,"type":"bayes",...}'

# Load from JSON string
loaded = Classifier::Bayes.from_json(json_string)

# Load from hash (useful when JSON is already parsed)
loaded = Classifier::Bayes.from_json(hash)

Design Decisions

  • JSON over Marshal: Human-readable, portable, version-safe
  • LSI rebuilds on load: Only source data is serialized, not the computed vectors, so the index is rebuilt when a file is loaded. This keeps the JSON portable across GSL and non-GSL environments
  • Versioned format: {"version": 1, "type": "bayes|lsi", ...} allows future format changes
  • Accepts both String and Hash: from_json takes either a JSON string or an already-parsed Hash
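
To illustrate the versioned format and the String-or-Hash handling, here is a minimal sketch using a hypothetical `TinyCounter` class. It is not the code in this PR: the class, its `counts` field, and the error messages are assumptions made for the example. Only the overall shape (`as_json` returns a Hash, `to_json` builds on it, `from_json` checks `version` and `type`) follows the decisions above.

```ruby
require 'json'

# Hypothetical stand-in for a classifier; not the gem's actual implementation.
class TinyCounter
  FORMAT_VERSION = 1
  TYPE = 'tiny_counter'

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def add(word)
    @counts[word] += 1
  end

  # Hash representation, tagged with a format version and a type.
  def as_json(*)
    { version: FORMAT_VERSION, type: TYPE, counts: @counts }
  end

  # JSON string built from as_json.
  def to_json(*args)
    as_json.to_json(*args)
  end

  # Accepts either a JSON string or an already-parsed Hash.
  def self.from_json(source)
    data = source.is_a?(String) ? JSON.parse(source) : source
    data = data.transform_keys(&:to_s) # tolerate symbol keys coming from as_json

    unless data['version'] == FORMAT_VERSION
      raise ArgumentError, "unsupported version: #{data['version'].inspect}"
    end
    raise ArgumentError, "unexpected type: #{data['type'].inspect}" unless data['type'] == TYPE

    instance = new
    data['counts'].each { |word, n| instance.counts[word.to_s] = n }
    instance
  end
end
```

Checking `version` before touching the payload is what would let a later format reject or migrate old files explicitly instead of failing somewhere deep in the restore logic.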

Test plan

  • Round-trip tests for both Bayes and LSI
  • Tests for as_json returning Hash
  • Tests for from_json with both String and Hash
  • Verify classifications match after save/load
  • Test continued training on loaded classifiers
  • All tests pass with bundle exec rake test
  • All tests pass with NATIVE_VECTOR=true bundle exec rake test
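
The round-trip and continued-training checks can be sketched roughly as below. `ToyModel` and `round_trip_consistent?` are hypothetical names invented for this example, and the word-overlap "classifier" is a deliberate simplification, so this shows the shape of the check rather than the tests in this PR.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical toy model: picks the category whose training words overlap the input most.
class ToyModel
  def initialize(words = {})
    @words = Hash.new { |hash, key| hash[key] = [] }
    words.each { |category, list| @words[category] = list.dup }
  end

  def train(category, text)
    @words[category].concat(text.downcase.split)
  end

  def classify(text)
    tokens = text.downcase.split
    @words.max_by { |_category, seen| tokens.count { |t| seen.include?(t) } }&.first
  end

  def save(path)
    File.write(path, JSON.generate({ version: 1, type: 'toy', words: @words }))
  end

  def self.load(path)
    data = JSON.parse(File.read(path))
    new(data.fetch('words'))
  end
end

# True when every sample classifies the same before and after a save/load cycle.
def round_trip_consistent?(model, samples)
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'model.json')
    model.save(path)
    reloaded = model.class.load(path)
    samples.all? { |sample| model.classify(sample) == reloaded.classify(sample) }
  end
end
```

Writing into a temporary directory keeps the check from leaving model files behind in the working tree.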

Closes #17

Provides a cleaner API than raw Marshal for persisting trained
classifiers. Users can now save training state and resume later:

  classifier.save('model.json')
  loaded = Classifier::Bayes.load('model.json')
  loaded.train_spam('more data')  # continue training

Both Bayes and LSI classifiers support:
- to_json / from_json for string serialization
- save(path) / load(path) for file operations

LSI serializes only source data (word_hash, categories), not computed
vectors. The index rebuilds on load, making JSON files portable across
GSL/non-GSL environments.
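
A rough sketch of the "serialize sources, rebuild derived data" idea, using a hypothetical `ToyIndex` whose term-count vectors stand in for the real LSI vectors. The class name, the `items` key, and the vector computation are all assumptions for illustration, not the LSI code in this PR.

```ruby
require 'json'

# Hypothetical toy index; it only illustrates storing sources and rebuilding vectors.
class ToyIndex
  attr_reader :vectors

  def initialize
    @items = {}   # source data: text => category
    @vectors = {} # derived data: text => term-count vector
  end

  def add_item(text, category)
    @items[text] = category
    rebuild
  end

  # Only the source items are serialized; the vectors are left out on purpose.
  def as_json(*)
    { version: 1, type: 'toy_index', items: @items }
  end

  def to_json(*args)
    as_json.to_json(*args)
  end

  # Re-adds each source item, which recomputes the vectors in this process.
  def self.from_json(json)
    data = JSON.parse(json)
    index = new
    data.fetch('items').each { |text, category| index.add_item(text, category) }
    index
  end

  private

  # Recompute every vector from the source texts over a shared, sorted vocabulary.
  def rebuild
    vocabulary = @items.keys.flat_map { |text| text.downcase.split }.uniq.sort
    @vectors = @items.keys.to_h do |text|
      tokens = text.downcase.split
      [text, vocabulary.map { |word| tokens.count(word) }]
    end
  end
end
```

Because the file never contains vectors, whatever backend computes them at load time (GSL or pure Ruby in the real library) does not affect the file format. The sketch rebuilds after every `add_item` for simplicity; a real index would likely rebuild once after restoring all items.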

Closes #17

- Add as_json method that returns a Hash representation
- Modify to_json to use as_json internally
- Modify from_json to accept both String and Hash arguments

This provides more flexibility for serialization workflows.

- Extract restore_state private method to reduce from_json AbcSize
- Change as_json return type to untyped for Steep compatibility
- Use assert_path_exists in tests per Minitest/AssertPathExists
- Add JSON RBS vendor file for type checking
- Regenerate RBS files

Development

Successfully merging this pull request may close these issues.

Save the current state of training