
Commit 2dea7c8

Use gpt-4o's tokenizer (#258)
feat: switch to o200k_base, require tiktoken ≥ 0.7.0, drop Python 3.7

Context
-------
Token counting now uses **o200k_base**, the encoding native to GPT-4o / GPT-4o-mini. That encoding ships only with **tiktoken ≥ 0.7.0**, whose wheels require Python 3.8+. CI already tests 3.8-3.13, so we align our documented minimums.

Changes
-------
* src/gitingest/output_formatters.py – `cl100k_base` → `o200k_base`
* README.md – "Python 3.7+" → "Python 3.8+"
* pyproject.toml
  * `tiktoken` → `tiktoken>=0.7.0` (o200k_base support)
  * remove classifier *Programming Language :: Python :: 3.7*
* requirements.txt – same `tiktoken` bump

Impact
------
* **Breaking** for users pinned to Python 3.7 → upgrade to 3.8+.
* Environments on `tiktoken==0.6.*` must run `pip install -U "tiktoken>=0.7.0"`.
* No other runtime dependencies added.

Co-authored-by: Filip Christiansen <22807962+filipchristiansen@users.noreply.github.com>
1 parent 1dd133c commit 2dea7c8

File tree

4 files changed: +4, -5 lines


README.md

Lines changed: 1 addition & 1 deletion

@@ -28,7 +28,7 @@ You can also replace `hub` with `ingest` in any GitHub URL to access the corresp
 
 ## 📚 Requirements
 
-- Python 3.7+
+- Python 3.8+
 - For private repositories: A GitHub Personal Access Token (PAT). You can generate one at [https://github.com/settings/personal-access-tokens](https://github.com/settings/personal-access-tokens) (Profile → Settings → Developer Settings → Personal Access Tokens → Fine-grained Tokens)
 
 ### 📦 Installation

pyproject.toml

Lines changed: 1 addition & 2 deletions

@@ -11,7 +11,7 @@ dependencies = [
     "python-dotenv",
     "slowapi",
     "starlette>=0.40.0",  # Vulnerable to https://osv.dev/vulnerability/GHSA-f96h-pmfr-66vw
-    "tiktoken",
+    "tiktoken>=0.7.0",  # Support for o200k_base encoding
     "tomli",
     "typing_extensions; python_version < '3.10'",
     "uvicorn>=0.11.7",  # Vulnerable to https://osv.dev/vulnerability/PYSEC-2020-150
@@ -23,7 +23,6 @@ classifiers=[
     "Development Status :: 3 - Alpha",
     "Intended Audience :: Developers",
     "License :: OSI Approved :: MIT License",
-    "Programming Language :: Python :: 3.7",
     "Programming Language :: Python :: 3.8",
     "Programming Language :: Python :: 3.9",
     "Programming Language :: Python :: 3.10",

requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -4,6 +4,6 @@ pydantic
 python-dotenv
 slowapi
 starlette>=0.40.0  # Vulnerable to https://osv.dev/vulnerability/GHSA-f96h-pmfr-66vw
-tiktoken
+tiktoken>=0.7.0  # Support for o200k_base encoding
 tomli
 uvicorn>=0.11.7  # Vulnerable to https://osv.dev/vulnerability/PYSEC-2020-150

src/gitingest/output_formatters.py

Lines changed: 1 addition & 1 deletion

@@ -171,7 +171,7 @@ def _format_token_count(text: str) -> Optional[str]:
         The formatted number of tokens as a string (e.g., '1.2k', '1.2M'), or `None` if an error occurs.
     """
     try:
-        encoding = tiktoken.get_encoding("cl100k_base")
+        encoding = tiktoken.get_encoding("o200k_base")  # gpt-4o, gpt-4o-mini
         total_tokens = len(encoding.encode(text, disallowed_special=()))
     except (ValueError, UnicodeEncodeError) as exc:
         print(exc)
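The hunk above only shows the encoding lookup; the suffix formatting the docstring describes ('1.2k', '1.2M') lives outside this diff. A minimal stand-alone sketch of that suffix step, with thresholds and rounding assumed from the docstring examples (the actual gitingest implementation may differ):

```python
def format_token_count(total_tokens: int) -> str:
    """Render a raw token count with a k/M suffix, e.g. 1234 -> '1.2k'.

    Thresholds and one-decimal rounding are assumptions based on the
    docstring examples; this is not the gitingest source.
    """
    if total_tokens >= 1_000_000:
        return f"{total_tokens / 1_000_000:.1f}M"
    if total_tokens >= 1_000:
        return f"{total_tokens / 1_000:.1f}k"
    return str(total_tokens)


print(format_token_count(999))        # small counts pass through unchanged
print(format_token_count(1_234))      # -> 1.2k
print(format_token_count(1_234_567))  # -> 1.2M
```

Because the raw count comes from `len(encoding.encode(...))`, switching from `cl100k_base` to `o200k_base` changes the number being formatted (o200k_base generally tokenizes text more compactly), but not this display logic.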

0 commit comments
