- get a dataset with different LLM-generated texts
- clean the data (weird markings), but only if they make up a really large part of the data and could result in weird predictions
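
A rough sketch of the cleaning step, assuming "weird marking" means things like leftover special tokens or markdown fences. The pattern `ARTIFACT_RE`, the helper names, and the 5% threshold are placeholders, not anything decided yet; swap in whatever markings actually show up in the dataset.

```python
import re

# Hypothetical artifact pattern: leftover special tokens such as <|endoftext|>
# and markdown code fences. Adjust to the markings seen in the real data.
ARTIFACT_RE = re.compile(r"<\|[^|>]*\|>|```")


def artifact_fraction(texts):
    """Return the share of texts that contain at least one artifact."""
    if not texts:
        return 0.0
    hits = sum(1 for t in texts if ARTIFACT_RE.search(t))
    return hits / len(texts)


def clean_if_needed(texts, threshold=0.05):
    """Strip artifacts only when more than `threshold` of the texts are affected.

    The threshold is an assumed cutoff for "a really large part of the data".
    """
    if artifact_fraction(texts) <= threshold:
        return list(texts)  # few artifacts: leave the data untouched
    return [ARTIFACT_RE.sub("", t).strip() for t in texts]
```

Checking the fraction first keeps the cleaning optional: if only a handful of texts are affected, the data stays as it is.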