Segmentation Fault when calling libsais_int #14

@julianmukaj

Description

Suffix array initialized with length: 3055148
Calling libsais_int with parameters:
  buffer.as_ptr(): 0x75fee1dff010
  suffix_array.as_mut_ptr(): 0x6967ee0
  buffer.len() as i32: 3055148
  vocab_size: 100000
  symbol_frequency_table: 0
Segmentation fault (core dumped)

I was just running the datastore creation scripts and they seem to crash on the finalize step in lib.rs. This is on Ubuntu 22 with Python 3.9; the same thing happens on Windows.

I opened this issue prematurely: I fixed the crash by adding +1 to the vocabulary size in lib.rs (see https://discourse.julialang.org/t/segfault-calling-c-function-any-advice/94730/8) and rebuilding the wheel. Maybe it is a model-dependent issue, I'm not sure. Something to track down and handle in future releases, maybe?

I am having trouble with the data reader/search part too:

let end_of_indices = end_of_indices.unwrap();

is not handled if end_of_indices is None (the unwrap panics).

Edit again: increasing the vocabulary size above the tokenizer's vocab size seems to solve the segmentation fault. Whether it crashes or not seems to depend on the datastore data.
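One possible explanation, not confirmed in this thread, is that the buffer sometimes contains a token id equal to (or larger than) the vocab_size passed to libsais_int, which would put a symbol outside the alphabet range 0..k. A rough sketch of deriving the alphabet size from the data instead of hard-coding +1 is below. The helper name alphabet_size_for is hypothetical and is not part of the project's lib.rs.

```rust
/// Hypothetical helper (not from the project's lib.rs): returns the smallest
/// alphabet size `k` that is at least `tokenizer_vocab_size` and for which
/// every token in `tokens` lies in `0..k`.
/// Returns `None` if a token is negative or `k` would overflow `i32`.
fn alphabet_size_for(tokens: &[i32], tokenizer_vocab_size: i32) -> Option<i32> {
    let mut k = tokenizer_vocab_size;
    for &t in tokens {
        if t < 0 {
            // Negative symbols cannot be represented in any alphabet 0..k.
            return None;
        }
        // A token with value `t` needs an alphabet of at least `t + 1`.
        k = k.max(t.checked_add(1)?);
    }
    Some(k)
}
```

If this returns a value larger than the tokenizer's vocab size for a given datastore, that would be consistent with the crash depending on the data.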

if end_of_indices.is_none() {
    return;
}

I couldn't figure out why end_of_indices was sometimes None, so I just return early in that case.
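For reference, a minimal sketch of the same early return written as a match, so the None case is handled where the value is produced and no unwrap is needed. Both functions here are hypothetical stand-ins for illustration; the real search code and its return type are not shown in this issue.

```rust
/// Hypothetical stand-in for the lookup step: index of the last entry
/// below `limit`, or `None` when no entry qualifies.
fn find_end_of_indices(indices: &[usize], limit: usize) -> Option<usize> {
    indices.iter().rposition(|&i| i < limit)
}

/// Hypothetical search step: returns the entries up to and including the
/// end index, or an empty result when there is no end index.
fn search(indices: &[usize], limit: usize) -> Vec<usize> {
    let end = match find_end_of_indices(indices, limit) {
        Some(end) => end,
        // Same effect as the `is_none()` check above: bail out early
        // instead of panicking in `unwrap()`.
        None => return Vec::new(),
    };
    indices[..=end].to_vec()
}
```

This only avoids the panic. It does not explain why the value is None in the first place.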
