
Commit 2dea7c8

Use gpt-4o's tokenizer (#258)
feat: switch to o200k_base, require tiktoken ≥ 0.7.0, drop Python 3.7

Context
-------
Token counting now uses **o200k_base**, the encoding native to GPT-4o / GPT-4o-mini. That encoding ships only with **tiktoken ≥ 0.7.0**, whose wheels require Python 3.8+. CI already tests 3.8-3.13, so we align our documented minimums.

Changes
-------
* src/gitingest/output_formatters.py – `cl100k_base` → `o200k_base`
* README.md – "Python 3.7+" → "Python 3.8+"
* pyproject.toml
  * `tiktoken` → `tiktoken>=0.7.0` (o200k_base support)
  * remove classifier *Programming Language :: Python :: 3.7*
* requirements.txt – same `tiktoken` bump

Impact
------
* **Breaking** for users pinned to Python 3.7 → upgrade to 3.8+.
* Environments on `tiktoken==0.6.*` must run `pip install -U "tiktoken>=0.7.0"`.
* No other runtime dependencies added.

Co-authored-by: Filip Christiansen <22807962+filipchristiansen@users.noreply.github.com>
1 parent 1dd133c commit 2dea7c8

File tree

4 files changed: +4, -5 lines


README.md

Lines changed: 1 addition & 1 deletion

@@ -28,7 +28,7 @@ You can also replace `hub` with `ingest` in any GitHub URL to access the corresp
 
 ## 📚 Requirements
 
-- Python 3.7+
+- Python 3.8+
 - For private repositories: A GitHub Personal Access Token (PAT). You can generate one at [https://github.com/settings/personal-access-tokens](https://github.com/settings/personal-access-tokens) (Profile → Settings → Developer Settings → Personal Access Tokens → Fine-grained Tokens)
 
 ### 📦 Installation

pyproject.toml

Lines changed: 1 addition & 2 deletions

@@ -11,7 +11,7 @@ dependencies = [
     "python-dotenv",
     "slowapi",
     "starlette>=0.40.0",  # Vulnerable to https://osv.dev/vulnerability/GHSA-f96h-pmfr-66vw
-    "tiktoken",
+    "tiktoken>=0.7.0",  # Support for o200k_base encoding
     "tomli",
     "typing_extensions; python_version < '3.10'",
     "uvicorn>=0.11.7",  # Vulnerable to https://osv.dev/vulnerability/PYSEC-2020-150
@@ -23,7 +23,6 @@ classifiers=[
     "Development Status :: 3 - Alpha",
     "Intended Audience :: Developers",
     "License :: OSI Approved :: MIT License",
-    "Programming Language :: Python :: 3.7",
     "Programming Language :: Python :: 3.8",
     "Programming Language :: Python :: 3.9",
     "Programming Language :: Python :: 3.10",

requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -4,6 +4,6 @@ pydantic
 python-dotenv
 slowapi
 starlette>=0.40.0  # Vulnerable to https://osv.dev/vulnerability/GHSA-f96h-pmfr-66vw
-tiktoken
+tiktoken>=0.7.0  # Support for o200k_base encoding
 tomli
 uvicorn>=0.11.7  # Vulnerable to https://osv.dev/vulnerability/PYSEC-2020-150

src/gitingest/output_formatters.py

Lines changed: 1 addition & 1 deletion

@@ -171,7 +171,7 @@ def _format_token_count(text: str) -> Optional[str]:
         The formatted number of tokens as a string (e.g., '1.2k', '1.2M'), or `None` if an error occurs.
     """
     try:
-        encoding = tiktoken.get_encoding("cl100k_base")
+        encoding = tiktoken.get_encoding("o200k_base")  # gpt-4o, gpt-4o-mini
         total_tokens = len(encoding.encode(text, disallowed_special=()))
     except (ValueError, UnicodeEncodeError) as exc:
         print(exc)
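The hunk above only shows the encoding lookup; the suffix formatting the docstring describes ('1.2k', '1.2M') lives outside this diff. A minimal stand-alone sketch of that suffix step, with thresholds and rounding assumed from the docstring examples (the actual gitingest implementation may differ):

```python
def format_token_count(total_tokens: int) -> str:
    """Render a raw token count with a k/M suffix, e.g. 1234 -> '1.2k'.

    Thresholds and one-decimal rounding are assumptions based on the
    docstring examples; this is not the gitingest source.
    """
    if total_tokens >= 1_000_000:
        return f"{total_tokens / 1_000_000:.1f}M"
    if total_tokens >= 1_000:
        return f"{total_tokens / 1_000:.1f}k"
    return str(total_tokens)


print(format_token_count(999))        # small counts pass through unchanged
print(format_token_count(1_234))      # -> 1.2k
print(format_token_count(1_234_567))  # -> 1.2M
```

Because the raw count comes from `len(encoding.encode(...))`, switching from `cl100k_base` to `o200k_base` changes the number being formatted (o200k_base generally tokenizes text more compactly), but not this display logic.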

0 commit comments
