# Generated Text Detection #

[Python 3.11](https://www.python.org/downloads/release/python-3110/) [CUDA 12.2](https://developer.nvidia.com/cuda-12-2-0-download-archive)

This repository contains the HTTP service for the Generated Text Detector. \
To integrate the detector with your project on the SuperAnnotate platform, please follow the instructions in the [Tutorial](tutorial.md).

## How it works ##

### Model ###

The Generated Text Detection model is based on a fine-tuned RoBERTa Large architecture. Trained on a diverse dataset sourced from multiple open datasets, it excels at classifying text inputs as either generated/synthetic or human-written. \
For more details and access to the model, visit its [Hugging Face Model Hub page](https://huggingface.co/SuperAnnotate/roberta-large-llm-content-detector).

## How to run it ##

### API Service Configuration ###

You can deploy the service wherever it is convenient; one basic option is a dedicated EC2 instance. Learn about instance creation and setup [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html). \
Hardware requirements depend on your deployment type. Recommended EC2 instances for each inference mode:
- **GPU**: [**g3s.xlarge**](https://instances.vantage.sh/aws/ec2/g3s.xlarge)
- **CPU**: [**a1.large**](https://instances.vantage.sh/aws/ec2/a1.large)

***NOTES***:

- To verify that everything is functioning correctly, try calling the healthcheck endpoint.
- Also, ensure that the port on which your service is deployed (8080 by default) is open to inbound traffic. Refer to this [**tutorial**](https://stackoverflow.com/questions/5004159/opening-port-80-ec2-amazon-web-services/10454688#10454688) for guidance on opening a port on an EC2 instance.
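
The healthcheck call from the note above can be sketched with `curl`; `<HOST>` is a placeholder for your instance address, and the port assumes the default setup:

```shell
# Smoke test for a freshly deployed service; replace <HOST> with your
# instance address. -k disables certificate verification, which is needed
# when the service runs with a self-signed certificate.
curl -k "https://<HOST>:8080/healthcheck"
# Per the /healthcheck endpoint docs, a healthy service answers: {"healthy": true}
```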

### General Pre-requirements ###

1. **Clone this repo** and move to its root folder.
2. **Create an SSL certificate.** Certificates are required to secure the connection; this is mandatory for integration with the SuperAnnotate platform.
- Generate a self-signed SSL certificate with the following command: `openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes`
3. **Install the necessary dependencies**
- For running as a Python file: [***Python 3.11***](https://www.python.org/downloads/release/python-3110/)
  - GPU inference: [***Nvidia drivers***](https://ubuntu.com/server/docs/nvidia-drivers-installation) and [***CUDA toolkit***](https://developer.nvidia.com/cuda-12-2-2-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local)
- For running as Docker: [***Docker***](https://docs.docker.com/engine/install/ubuntu/)
  - GPU inference: [***Nvidia drivers***](https://ubuntu.com/server/docs/nvidia-drivers-installation); [***CUDA toolkit***](https://developer.nvidia.com/cuda-12-2-2-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local); [***NVIDIA Container Toolkit***](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
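
The `openssl` command in step 2 prompts interactively for subject fields. A non-interactive variant can be sketched by adding `-subj`; the `/CN=localhost` value is an example, substitute your own host name:

```shell
# Generate a self-signed certificate without interactive prompts.
# "/CN=localhost" is an example subject; replace it with your host name.
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem \
    -days 365 -nodes -subj "/CN=localhost"

# Inspect the certificate to confirm the subject was set as intended.
openssl x509 -in cert.pem -noout -subject
```

Keep `key.pem` and `cert.pem` in the repo root, since the `uvicorn` and Docker commands below reference them by those paths.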

### As a Python file ###

1. Install the requirements: `pip install -r generated_text_detector/requirements.txt`
2. Set the Python path variable: `export PYTHONPATH="."`
3. Run the API: `uvicorn --host 0.0.0.0 --port 8080 --ssl-keyfile=./key.pem --ssl-certfile=./cert.pem generated_text_detector.fastapi_app:app`

### As Docker containers ###

#### GPU Version ####

1. Build the image: `sudo docker build -t generated_text_detector:GPU -f Dockerfile_GPU .`
2. Run the container: `sudo docker run --gpus all -e DETECTOR_CONFIG_PATH="etc/configs/detector_config.json" -p 8080:8080 -d generated_text_detector:GPU`

#### CPU Version ####

1. Build the image: `sudo docker build -t generated_text_detector:CPU -f Dockerfile_CPU .`
2. Run the container: `sudo docker run -e DETECTOR_CONFIG_PATH="etc/configs/detector_config.json" -p 8080:8080 -d generated_text_detector:CPU`

## Performance ##

### Benchmark ###

The model was evaluated on a benchmark collected from the same datasets used for training, alongside a closed subset of SuperAnnotate data. \
However, there are no direct intersections of samples between the training data and the benchmark. \
The benchmark comprises 1k samples, with 200 samples per category. \
The model's performance is compared with open-source solutions and popular API detectors in the table below:

| Model/API                                                                                       | Wikipedia | Reddit QA | SA instruction | Papers  | Average |
|-------------------------------------------------------------------------------------------------|----------:|----------:|---------------:|--------:|--------:|
| [Hello-SimpleAI](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta)                |  **0.97** |      0.95 |           0.82 |    0.69 |    0.86 |
| [RADAR](https://huggingface.co/spaces/TrustSafeAI/RADAR-AI-Text-Detector)                       |      0.47 |      0.84 |           0.59 |    0.82 |    0.68 |
| [GPTZero](https://gptzero.me)                                                                   |      0.72 |      0.79 |       **0.90** |    0.67 |    0.77 |
| [Originality.ai](https://originality.ai)                                                        |      0.91 |  **0.97** |           0.77 |**0.93** |**0.89** |
| [LLM content detector](https://huggingface.co/SuperAnnotate/roberta-large-llm-content-detector) |      0.88 |      0.95 |           0.84 |    0.81 |    0.87 |

### Time performance ###

Two inference modes are available: CPU and GPU. \
The table below shows the throughput of the service deployed in each mode:

| Mode | RPS |
|-----:|----:|
| GPU  |  10 |
| CPU  | 0.9 |

*In this test, request texts averaged 500 tokens.*

## Endpoints ##

The following endpoints are available in the Generated Text Detection service:

- **GET /healthcheck**:
  - **Summary**: Ping
  - **Description**: Liveness check
  - **Input Type**: None
  - **Output Type**: JSON
  - **Output Values**:
    - `{"healthy": true}`
  - **Status Codes**:
    - `200`: Successful Response

- **POST /detect**:
  - **Summary**: Main detection endpoint
  - **Description**: Detects generated text and returns a report with a *Generated Score* and a *Predicted Author*
  - **Input Type**: JSON with a string field `text`
  - **Input Value Example**: `{"text": "some text"}`
  - **Output Type**: JSON with 2 fields:
    - `generated_score`: a float value from 0 to 1
    - `author`: one of the following string values:
      - *LLM Generated*
      - *Probably LLM Generated*
      - *Not sure*
      - *Probably human written*
      - *Human*
  - **Output Value Example**:
    - `{"generated_score": 0, "author": "Human"}`
  - **Status Codes**:
    - `200`: Successful Response
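
A `/detect` call against a locally deployed service can be sketched as follows; the host, port, and certificate flag assume the default setup described above:

```shell
# Example /detect request. -k skips verification of the self-signed
# certificate; adjust the host and port to match your deployment.
PAYLOAD='{"text": "some text"}'
curl -k -X POST "https://localhost:8080/detect" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
# The response follows the shape documented above:
# {"generated_score": <float 0..1>, "author": "<label>"}
```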