Support for tokenization of languages without spaces #4

@andreekeberg

Description

Need to implement a smarter method of tokenization that takes into account languages that traditionally do not use spaces between words. The current tokenizer produces full-sentence tokens for these languages, which are unsuitable for cosine similarity comparisons. One possible approach is sketched after the list below.

Some of these languages include:

  • Chinese
  • Japanese
  • Thai
  • Khmer
  • Lao
  • Burmese

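A minimal sketch of one way this could work, using the built-in `Intl.Segmenter` API (available in Node 16+ and modern browsers), which segments text into word-like units via ICU dictionaries and therefore handles Chinese, Japanese, Thai, and similar scripts without an extra dependency. The `tokenize` helper, its signature, and the lowercasing step are assumptions for illustration, not part of the library's current API:

```ts
// Hypothetical tokenizer sketch: segment text into word tokens using
// Intl.Segmenter, which falls back gracefully for space-delimited
// languages and uses dictionary-based segmentation for scripts
// without word boundaries.
function tokenize(text: string, locale = 'und'): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const tokens: string[] = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Keep only word-like segments, skipping whitespace and punctuation
    if (isWordLike) {
      tokens.push(segment.toLowerCase());
    }
  }
  return tokens;
}

// Japanese text without spaces is split into word tokens
// (exact boundaries depend on the ICU version), e.g.:
console.log(tokenize('私は日本語を話します', 'ja'));
// [ '私', 'は', '日本語', 'を', '話し', 'ます' ]

// Space-delimited languages tokenize as before
console.log(tokenize('The quick brown fox', 'en'));
// [ 'the', 'quick', 'brown', 'fox' ]
```

Since `Intl.Segmenter` ships with the runtime, this would avoid bundling per-language dictionaries; if finer control is needed for a specific language, dedicated segmenters (e.g. TinySegmenter or kuromoji for Japanese) could be swapped in behind the same interface.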