
Commit ad27066

committed
[ver]: v1.0.0 of generated_text_detector contains the base HTTP service for generated text detection
1 parent c4e861a commit ad27066


42 files changed: +1007 −1 lines

.gitignore

Lines changed: 14 additions & 0 deletions
@@ -127,6 +127,7 @@ venv/
 ENV/
 env.bak/
 venv.bak/
+local_env/

 # Spyder project settings
 .spyderproject
@@ -158,3 +159,16 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
+
+# VS Code IDE files
+.vscode/
+
+# SSL certificates
+cert.pem
+key.pem
+
+# Others
+my_configs/
+temp_data/
+output_dir/
+test.py

Changelog

Lines changed: 14 additions & 0 deletions
## [1.0.0] - 2024-08-05

_First release._

### Added

- **Breaking:** Base functionality for the HTTP service
- **Breaking:** Model for Generated Text Detection inference
- **Breaking:** Template of the form builder for the SuperAnnotate infrastructure


## [0.0.1] - 2024-10-04

_Init._

Dockerfile_CPU

Lines changed: 21 additions & 0 deletions
FROM python:3.11-slim-bullseye

# Set utility env variables
ENV PATH=/generated_text_detector/miniconda/bin:$PATH

# Set workdir
WORKDIR /generated_text_detector

# Install python requirements
COPY generated_text_detector/requirements.txt .
RUN pip3 install -r requirements.txt --no-cache
RUN python -m nltk.downloader punkt

# Copy code to container
COPY generated_text_detector/ generated_text_detector/
COPY etc/ etc/
COPY key.pem cert.pem ./
COPY version.txt ./

EXPOSE 8080
CMD uvicorn --host 0.0.0.0 --port 8080 --ssl-keyfile=./key.pem --ssl-certfile=./cert.pem generated_text_detector.fastapi_app:app

Dockerfile_GPU

Lines changed: 41 additions & 0 deletions
FROM nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04

# Set utility env variables
ENV PATH=/generated_text_detector/miniconda/bin:$PATH

# Install some basic utilities
RUN apt-get update && apt-get install -y \
    curl \
    ca-certificates \
    sudo \
    git \
    bzip2 \
    build-essential \
    libgl1 \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Set workdir
WORKDIR /generated_text_detector

# Install Miniconda and Python
RUN curl -sLo /generated_text_detector/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-py311_24.1.2-0-Linux-x86_64.sh \
    && chmod +x /generated_text_detector/miniconda.sh \
    && /generated_text_detector/miniconda.sh -b -p /generated_text_detector/miniconda \
    && rm /generated_text_detector/miniconda.sh \
    && conda install -y python==3.11 \
    && pip3 install nvitop

# Install python requirements
COPY generated_text_detector/requirements.txt .
RUN pip3 install -r requirements.txt --no-cache
RUN python -m nltk.downloader punkt

# Copy code to container
COPY generated_text_detector/ generated_text_detector/
COPY etc/ etc/
COPY key.pem cert.pem ./
COPY version.txt ./

EXPOSE 8080
CMD uvicorn --host 0.0.0.0 --port 8080 --ssl-keyfile=./key.pem --ssl-certfile=./cert.pem generated_text_detector.fastapi_app:app

README.md

Lines changed: 117 additions & 1 deletion
# Generated Text Detection #

[![Version](https://img.shields.io/badge/version-1.0.0-green.svg)]() [![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/) [![CUDA 12.2](https://img.shields.io/badge/CUDA-12.2-green.svg)](https://developer.nvidia.com/cuda-12-2-0-download-archive)

This repository contains the HTTP service for the Generated Text Detector. \
To integrate the detector with your project on the SuperAnnotate platform, please follow the instructions provided in our [Tutorial](tutorial.md).

## How it works ##

### Model ###

The Generated Text Detection model is based on a fine-tuned RoBERTa Large architecture. Trained on a diverse dataset sourced from multiple open datasets, it excels at classifying text inputs as either generated/synthetic or human-written. \
For more details and access to the model, visit its [Hugging Face Model Hub page](https://huggingface.co/SuperAnnotate/roberta-large-llm-content-detector).
## How to run it ##

### API Service Configuration ###

You can deploy the service wherever it is convenient; one of the basic options is on an EC2 instance. Learn about instance creation and setup [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html). \
Hardware requirements depend on your deployment type. Recommended EC2 instances for each deployment type:

- **GPU**: [**g3s.xlarge**](https://instances.vantage.sh/aws/ec2/g3s.xlarge)
- **CPU**: [**a1.large**](https://instances.vantage.sh/aws/ec2/a1.large)

***NOTES***:

- To verify that everything is functioning correctly, try calling the healthcheck endpoint.
- Also, ensure that the port on which your service is deployed (8080 by default) is open to the global network. Refer to this [**tutorial**](https://stackoverflow.com/questions/5004159/opening-port-80-ec2-amazon-web-services/10454688#10454688) for guidance on opening a port on an EC2 instance.

### General Pre-requirements ###

0. Clone this repo and move to the root folder.
1. **Create an SSL certificate.** Certificates are necessary to make the connection secure; this is mandatory for integration with the SuperAnnotate platform.
    - Generate a self-signed SSL certificate with the following command: `openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes`
2. **Install necessary dependencies**
    - For running as a Python file: [***Python 3.11***](https://www.python.org/downloads/release/python-3110/)
        - GPU inference: [***Nvidia drivers***](https://ubuntu.com/server/docs/nvidia-drivers-installation) and [***CUDA toolkit***](https://developer.nvidia.com/cuda-12-2-2-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local)
    - For running as Docker: [***Docker***](https://docs.docker.com/engine/install/ubuntu/)
        - GPU inference: [***Nvidia drivers***](https://ubuntu.com/server/docs/nvidia-drivers-installation); [***CUDA toolkit***](https://developer.nvidia.com/cuda-12-2-2-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local); [***NVIDIA Container Toolkit***](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
### As python file ###

1. Install requirements: `pip install -r generated_text_detector/requirements.txt`
2. Set the Python path variable: `export PYTHONPATH="."`
3. Run the API: `uvicorn --host 0.0.0.0 --port 8080 --ssl-keyfile=./key.pem --ssl-certfile=./cert.pem generated_text_detector.fastapi_app:app`

### As docker containers ###

#### GPU Version ####

1. Build image: `sudo docker build -t generated_text_detector:GPU -f Dockerfile_GPU .`
2. Run container: `sudo docker run --gpus all -e DETECTOR_CONFIG_PATH="etc/configs/detector_config.json" -p 8080:8080 -d generated_text_detector:GPU`

#### CPU Version ####

1. Build image: `sudo docker build -t generated_text_detector:CPU -f Dockerfile_CPU .`
2. Run container: `sudo docker run -e DETECTOR_CONFIG_PATH="etc/configs/detector_config.json" -p 8080:8080 -d generated_text_detector:CPU`
## Performance ##

### Benchmark ###

The model was evaluated on a benchmark collected from the same datasets used for training, alongside a closed subset of SuperAnnotate. \
However, there are no direct intersections of samples between the training data and the benchmark. \
The benchmark comprises 1k samples, with 200 samples per category. \
The model's performance is compared with open-source solutions and popular API detectors in the table below:

| Model/API | Wikipedia | Reddit QA | SA instruction | Papers | Average |
|--------------------------------------------------------------------------------------------------|----------:|----------:|---------------:|-------:|--------:|
| [Hello-SimpleAI](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta) | **0.97** | 0.95 | 0.82 | 0.69 | 0.86 |
| [RADAR](https://huggingface.co/spaces/TrustSafeAI/RADAR-AI-Text-Detector) | 0.47 | 0.84 | 0.59 | 0.82 | 0.68 |
| [GPTZero](https://gptzero.me) | 0.72 | 0.79 | **0.90** | 0.67 | 0.77 |
| [Originality.ai](https://originality.ai) | 0.91 | **0.97** | 0.77 | **0.93** | **0.89** |
| [LLM content detector](https://huggingface.co/SuperAnnotate/roberta-large-llm-content-detector) | 0.88 | 0.95 | 0.84 | 0.81 | 0.87 |

### Time performance ###

Two inference modes are available: CPU and GPU. \
The table below shows the time performance of the service deployed in each mode:

| Method | RPS |
|-------:|----:|
| GPU | 10 |
| CPU | 0.9 |

\*In this test, request texts average 500 tokens.
## Endpoints ##

The following endpoints are available in the Generated Text Detection service:

- **GET /healthcheck**:
    - **Summary**: Ping
    - **Description**: Alive method
    - **Input Type**: None
    - **Output Type**: JSON
    - **Output Values**:
        - `{"healthy": true}`
    - **Status Codes**:
        - `200`: Successful Response

- **POST /detect**:
    - **Summary**: Main endpoint of detection
    - **Description**: Detects generated text and returns a report with *Generated Score* and *Predicted Author*
    - **Input Type**: JSON with a string field `text`
    - **Input Value Example**: `{"text": "some text"}`
    - **Output Type**: JSON with 2 fields:
        - `generated_score`: float value from 0 to 1
        - `author`: one of the following string values:
            - *LLM Generated*
            - *Probably LLM Generated*
            - *Not sure*
            - *Probably human written*
            - *Human*
    - **Output Value Example**:
        - `{"generated_score": 0, "author": "Human"}`
    - **Status Codes**:
        - `200`: Successful Response
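As a rough illustration of the `/detect` contract above, the endpoint can be called with the Python standard library alone. This is a sketch, assuming the service runs locally on the default port 8080 behind the self-signed certificate, which is why verification is disabled (acceptable only for local testing):

```python
import json
import ssl
import urllib.request

# Assumption: local deployment on the default port from the run instructions
BASE_URL = "https://localhost:8080"


def build_detect_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /detect request with the single required `text` field."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/detect",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect(text: str, base_url: str = BASE_URL) -> dict:
    """Call /detect and return the parsed report,
    e.g. {"generated_score": 0.0, "author": "Human"}."""
    # Self-signed cert.pem: disable hostname/certificate checks for this
    # local call only; never do this against untrusted hosts.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(build_detect_request(text, base_url), context=ctx) as resp:
        return json.load(resp)


if __name__ == "__main__":
    report = detect("Some text to check")
    print(report["generated_score"], report["author"])
```

The returned `author` string is one of the five labels listed above; how `generated_score` is bucketed into those labels is decided server-side.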
