Closed
Changes from 2 commits
6 changes: 0 additions & 6 deletions .pre-commit-config.yaml
@@ -58,12 +58,6 @@ repos:
- id: python-use-type-annotations
description: 'Enforce that python3.6+ type annotations are used instead of type comments.'

- repo: https://github.com/PyCQA/isort
rev: 6.0.1
hooks:
- id: isort
description: 'Sort imports alphabetically, and automatically separated into sections and by type.'

- repo: https://github.com/pre-commit/mirrors-eslint
rev: v9.30.1
hooks:
57 changes: 57 additions & 0 deletions README.md
@@ -144,12 +144,60 @@ By default, the digest is written to a text file (`digest.txt`) in your current
- Use `--output/-o <filename>` to write to a specific file.
- Use `--output/-o -` to output directly to `STDOUT` (useful for piping to other tools).

### 🔧 Configure processing limits

```bash
# Set higher limits for large repositories
gitingest https://github.com/torvalds/linux \
--max-files 100000 \
--max-total-size 2147483648 \
--max-directory-depth 25

# Process only Python files up to 1MB each
gitingest /path/to/project \
--include-pattern "*.py" \
--max-size 1048576 \
--max-files 1000
```

See more options and usage details with:

```bash
gitingest --help
```

### Configuration via Environment Variables

You can configure limits and other settings with environment variables, all of which start with the `GITINGEST_` prefix.

#### File Processing Configuration

- `GITINGEST_MAX_FILE_SIZE` - Maximum size of a single file to process *(default: 10485760 bytes, 10 MB)*
- `GITINGEST_MAX_FILES` - Maximum number of files to process *(default: 10000)*
- `GITINGEST_MAX_TOTAL_SIZE_BYTES` - Maximum size of output file *(default: 524288000 bytes, 500 MB)*
- `GITINGEST_MAX_DIRECTORY_DEPTH` - Maximum depth of directory traversal *(default: 20)*
- `GITINGEST_DEFAULT_TIMEOUT` - Default operation timeout in seconds *(default: 60)*
- `GITINGEST_OUTPUT_FILE_NAME` - Default output filename *(default: "digest.txt")*
- `GITINGEST_TMP_BASE_PATH` - Base path for temporary files *(default: system temp directory)*

#### Server Configuration (for self-hosting)

- `GITINGEST_MAX_DISPLAY_SIZE` - Maximum size of content to display in UI *(default: 300000 bytes)*
- `GITINGEST_DELETE_REPO_AFTER` - Repository cleanup timeout in seconds *(default: 3600, 1 hour)*
- `GITINGEST_MAX_FILE_SIZE_KB` - Maximum file size for UI slider in kB *(default: 102400, 100 MB)*
- `GITINGEST_MAX_SLIDER_POSITION` - Maximum slider position in UI *(default: 500)*

#### Example usage

```bash
# Configure for large scientific repositories
export GITINGEST_MAX_FILES=50000
export GITINGEST_MAX_FILE_SIZE=20971520 # 20 MB
export GITINGEST_MAX_TOTAL_SIZE_BYTES=1073741824 # 1 GB

gitingest https://github.com/some/large-repo
```
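
The byte counts in the example above are binary multiples (1 MB = 1024 × 1024 bytes), matching the defaults listed earlier. A minimal sketch for deriving them; `to_bytes` and `UNITS` are hypothetical names used for illustration and are not part of gitingest:

```python
# Hypothetical helper (not part of gitingest): convert a size in binary
# units to the raw byte count expected by the GITINGEST_* size variables.
UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def to_bytes(value: int, unit: str) -> int:
    """Return ``value`` expressed in bytes, using 1024-based units."""
    return value * UNITS[unit.upper()]


print(to_bytes(20, "MB"))  # 20971520
print(to_bytes(1, "GB"))   # 1073741824
```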

## 🐍 Python package usage

```python
@@ -178,6 +226,15 @@ summary, tree, content = ingest("https://github.com/username/private-repo")

# Include repository submodules
summary, tree, content = ingest("https://github.com/username/repo-with-submodules", include_submodules=True)

# Configure limits programmatically
summary, tree, content = ingest(
"https://github.com/username/large-repo",
max_file_size=20 * 1024 * 1024, # 20 MB per file
max_files=50000, # 50k files max
max_total_size_bytes=1024**3, # 1 GB total
max_directory_depth=30 # 30 levels deep
)
```

By default, this doesn't write a file; pass the `output` argument to enable it.
8 changes: 0 additions & 8 deletions pyproject.toml
@@ -112,14 +112,6 @@ case-sensitive = true
[tool.pycln]
all = true

# TODO: Remove this once we figure out how to use ruff-isort
[tool.isort]
profile = "black"
line_length = 119
remove_redundant_aliases = true
float_to_top = true # https://github.com/astral-sh/ruff/issues/6514
order_by_type = true
filter_files = true

# Test configuration
[tool.pytest.ini_options]
51 changes: 45 additions & 6 deletions src/gitingest/__main__.py
@@ -9,16 +9,20 @@
import click
from typing_extensions import Unpack

from gitingest.config import MAX_FILE_SIZE, OUTPUT_FILE_NAME
from gitingest.config import MAX_DIRECTORY_DEPTH, MAX_FILES, MAX_FILE_SIZE, MAX_TOTAL_SIZE_BYTES, OUTPUT_FILE_NAME
from gitingest.entrypoint import ingest_async


class _CLIArgs(TypedDict):
source: str
max_size: int
max_files: int
max_total_size: int
max_directory_depth: int
exclude_pattern: tuple[str, ...]
include_pattern: tuple[str, ...]
branch: str | None
tag: str | None
include_gitignored: bool
include_submodules: bool
token: str | None
@@ -34,6 +38,24 @@ class _CLIArgs(TypedDict):
show_default=True,
help="Maximum file size to process in bytes",
)
@click.option(
"--max-files",
default=MAX_FILES,
show_default=True,
help="Maximum number of files to process",
)
@click.option(
"--max-total-size",
default=MAX_TOTAL_SIZE_BYTES,
show_default=True,
help="Maximum total size of all files in bytes",
)
@click.option(
"--max-directory-depth",
default=MAX_DIRECTORY_DEPTH,
show_default=True,
help="Maximum depth of directory traversal",
)
@click.option("--exclude-pattern", "-e", multiple=True, help="Shell-style patterns to exclude.")
@click.option(
"--include-pattern",
@@ -42,6 +64,7 @@ class _CLIArgs(TypedDict):
help="Shell-style patterns to include.",
)
@click.option("--branch", "-b", default=None, help="Branch to clone and ingest")
@click.option("--tag", default=None, help="Tag to clone and ingest")
@click.option(
"--include-gitignored",
is_flag=True,
@@ -98,7 +121,7 @@ def main(**cli_kwargs: Unpack[_CLIArgs]) -> None:
$ gitingest --include-pattern "*.js" --exclude-pattern "node_modules/*"

Private repositories:
$ gitingest https://github.com/user/private-repo -t ghp_token
$ gitingest https://github.com/user/private-repo --token ghp_token
$ GITHUB_TOKEN=ghp_token gitingest https://github.com/user/private-repo

Include submodules:
@@ -112,9 +135,13 @@ async def _async_main(
source: str,
*,
max_size: int = MAX_FILE_SIZE,
max_files: int = MAX_FILES,
max_total_size: int = MAX_TOTAL_SIZE_BYTES,
max_directory_depth: int = MAX_DIRECTORY_DEPTH,
exclude_pattern: tuple[str, ...] | None = None,
include_pattern: tuple[str, ...] | None = None,
branch: str | None = None,
tag: str | None = None,
include_gitignored: bool = False,
include_submodules: bool = False,
token: str | None = None,
@@ -132,21 +159,29 @@ async def _async_main(
A directory path or a Git repository URL.
max_size : int
Maximum file size in bytes to ingest (default: 10 MB).
max_files : int
Maximum number of files to ingest (default: 10,000).
max_total_size : int
Maximum total size of output file in bytes (default: 500 MB).
max_directory_depth : int
Maximum depth of directory traversal (default: 20).
exclude_pattern : tuple[str, ...] | None
Glob patterns for pruning the file set.
include_pattern : tuple[str, ...] | None
Glob patterns for including files in the output.
branch : str | None
Git branch to ingest. If ``None``, the repository's default branch is used.
Git branch to clone and ingest (default: the default branch).
tag : str | None
Git tag to clone and ingest. If ``None``, no tag is used.
include_gitignored : bool
If ``True``, also ingest files matched by ``.gitignore`` or ``.gitingestignore`` (default: ``False``).
If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
include_submodules : bool
If ``True``, recursively include all Git submodules within the repository (default: ``False``).
token : str | None
GitHub personal access token (PAT) for accessing private repositories.
Can also be set via the ``GITHUB_TOKEN`` environment variable.
output : str | None
The path where the output file will be written (default: ``digest.txt`` in current directory).
The path where the output file is written (default: ``digest.txt`` in current directory).
Use ``"-"`` to write to ``stdout``.

Raises
@@ -170,9 +205,13 @@ async def _async_main(
summary, _, _ = await ingest_async(
source,
max_file_size=max_size,
include_patterns=include_patterns,
max_files=max_files,
max_total_size_bytes=max_total_size,
max_directory_depth=max_directory_depth,
exclude_patterns=exclude_patterns,
include_patterns=include_patterns,
branch=branch,
tag=tag,
include_gitignored=include_gitignored,
include_submodules=include_submodules,
token=token,
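
Two CLI option names differ from the keyword arguments that `ingest_async` accepts: `max_size` is passed as `max_file_size`, and `max_total_size` as `max_total_size_bytes`. A sketch of that renaming as a lookup table; `CLI_TO_INGEST` and `to_ingest_kwargs` are hypothetical names, since the PR itself passes each argument explicitly:

```python
# Hypothetical helper: rename CLI option names to ingest_async keyword names.
CLI_TO_INGEST = {
    "max_size": "max_file_size",
    "max_total_size": "max_total_size_bytes",
}


def to_ingest_kwargs(cli_kwargs: dict) -> dict:
    """Return a copy of ``cli_kwargs`` with CLI-only names translated."""
    return {CLI_TO_INGEST.get(name, name): value for name, value in cli_kwargs.items()}


print(to_ingest_kwargs({"max_size": 1048576, "max_files": 1000}))
# {'max_file_size': 1048576, 'max_files': 1000}
```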
16 changes: 9 additions & 7 deletions src/gitingest/config.py
@@ -3,12 +3,14 @@
import tempfile
from pathlib import Path

MAX_FILE_SIZE = 10 * 1024 * 1024 # Maximum size of a single file to process (10 MB)
MAX_DIRECTORY_DEPTH = 20 # Maximum depth of directory traversal
MAX_FILES = 10_000 # Maximum number of files to process
MAX_TOTAL_SIZE_BYTES = 500 * 1024 * 1024 # Maximum size of output file (500 MB)
DEFAULT_TIMEOUT = 60 # seconds
from gitingest.utils.config_utils import _get_int_env_var, _get_str_env_var

OUTPUT_FILE_NAME = "digest.txt"
MAX_FILE_SIZE = _get_int_env_var("MAX_FILE_SIZE", 10 * 1024 * 1024) # Max file size to process in bytes (10 MB)
MAX_FILES = _get_int_env_var("MAX_FILES", 10_000) # Max number of files to process
MAX_TOTAL_SIZE_BYTES = _get_int_env_var("MAX_TOTAL_SIZE_BYTES", 500 * 1024 * 1024) # Max output file size (500 MB)
MAX_DIRECTORY_DEPTH = _get_int_env_var("MAX_DIRECTORY_DEPTH", 20) # Max depth of directory traversal

TMP_BASE_PATH = Path(tempfile.gettempdir()) / "gitingest"
DEFAULT_TIMEOUT = _get_int_env_var("DEFAULT_TIMEOUT", 60) # Default timeout for git operations in seconds

OUTPUT_FILE_NAME = _get_str_env_var("OUTPUT_FILE_NAME", "digest.txt")
TMP_BASE_PATH = Path(_get_str_env_var("TMP_BASE_PATH", tempfile.gettempdir())) / "gitingest"
66 changes: 47 additions & 19 deletions src/gitingest/entrypoint.py
@@ -33,8 +33,11 @@ async def ingest_async(
source: str,
*,
max_file_size: int = MAX_FILE_SIZE,
include_patterns: str | set[str] | None = None,
max_files: int | None = None,
max_total_size_bytes: int | None = None,
max_directory_depth: int | None = None,
exclude_patterns: str | set[str] | None = None,
include_patterns: str | set[str] | None = None,
branch: str | None = None,
tag: str | None = None,
include_gitignored: bool = False,
@@ -51,17 +54,23 @@ async def ingest_async(
Parameters
----------
source : str
The source to analyze, which can be a URL (for a Git repository) or a local directory path.
A directory path or a Git repository URL.
max_file_size : int
Maximum allowed file size for file ingestion. Files larger than this size are ignored (default: 10 MB).
include_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to include. If ``None``, all files are included.
Maximum file size in bytes to ingest (default: 10 MB).
max_files : int | None
Maximum number of files to ingest (default: 10,000).
max_total_size_bytes : int | None
Maximum total size of output file in bytes (default: 500 MB).
max_directory_depth : int | None
Maximum depth of directory traversal (default: 20).
exclude_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to exclude. If ``None``, no files are excluded.
Glob patterns for pruning the file set.
include_patterns : str | set[str] | None
Glob patterns for including files in the output.
branch : str | None
The branch to clone and ingest (default: the default branch).
Git branch to clone and ingest (default: the default branch).
tag : str | None
The tag to clone and ingest. If ``None``, no tag is used.
Git tag to clone and ingest. If ``None``, no tag is used.
include_gitignored : bool
If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
include_submodules : bool
@@ -70,7 +79,7 @@ async def ingest_async(
GitHub personal access token (PAT) for accessing private repositories.
Can also be set via the ``GITHUB_TOKEN`` environment variable.
output : str | None
File path where the summary and content should be written.
File path where the summary and content are written.
If ``"-"`` (dash), the results are written to ``stdout``.
If ``None``, the results are not written to a file.

@@ -107,6 +116,13 @@ async def ingest_async(
if query.url:
_override_branch_and_tag(query, branch=branch, tag=tag)

if max_files is not None:
query.max_files = max_files
if max_total_size_bytes is not None:
query.max_total_size_bytes = max_total_size_bytes
if max_directory_depth is not None:
query.max_directory_depth = max_directory_depth

query.include_submodules = include_submodules

async with _clone_repo_if_remote(query, token=token):
@@ -121,8 +137,11 @@ def ingest(
source: str,
*,
max_file_size: int = MAX_FILE_SIZE,
include_patterns: str | set[str] | None = None,
max_files: int | None = None,
max_total_size_bytes: int | None = None,
max_directory_depth: int | None = None,
exclude_patterns: str | set[str] | None = None,
include_patterns: str | set[str] | None = None,
branch: str | None = None,
tag: str | None = None,
include_gitignored: bool = False,
@@ -139,17 +158,23 @@ def ingest(
Parameters
----------
source : str
The source to analyze, which can be a URL (for a Git repository) or a local directory path.
A directory path or a Git repository URL.
max_file_size : int
Maximum allowed file size for file ingestion. Files larger than this size are ignored (default: 10 MB).
include_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to include. If ``None``, all files are included.
Maximum file size in bytes to ingest (default: 10 MB).
max_files : int | None
Maximum number of files to ingest (default: 10,000).
max_total_size_bytes : int | None
Maximum total size of output file in bytes (default: 500 MB).
max_directory_depth : int | None
Maximum depth of directory traversal (default: 20).
exclude_patterns : str | set[str] | None
Pattern or set of patterns specifying which files to exclude. If ``None``, no files are excluded.
Glob patterns for pruning the file set.
include_patterns : str | set[str] | None
Glob patterns for including files in the output.
branch : str | None
The branch to clone and ingest (default: the default branch).
Git branch to clone and ingest (default: the default branch).
tag : str | None
The tag to clone and ingest. If ``None``, no tag is used.
Git tag to clone and ingest. If ``None``, no tag is used.
include_gitignored : bool
If ``True``, include files ignored by ``.gitignore`` and ``.gitingestignore`` (default: ``False``).
include_submodules : bool
@@ -158,7 +183,7 @@ def ingest(
GitHub personal access token (PAT) for accessing private repositories.
Can also be set via the ``GITHUB_TOKEN`` environment variable.
output : str | None
File path where the summary and content should be written.
File path where the summary and content are written.
If ``"-"`` (dash), the results are written to ``stdout``.
If ``None``, the results are not written to a file.

@@ -179,8 +204,11 @@ def ingest(
ingest_async(
source=source,
max_file_size=max_file_size,
include_patterns=include_patterns,
max_files=max_files,
max_total_size_bytes=max_total_size_bytes,
max_directory_depth=max_directory_depth,
exclude_patterns=exclude_patterns,
include_patterns=include_patterns,
branch=branch,
tag=tag,
include_gitignored=include_gitignored,
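
The `None` defaults on `max_files`, `max_total_size_bytes`, and `max_directory_depth` let `ingest_async` distinguish an omitted argument from an explicit value, so the query keeps its own defaults unless the caller overrides them. A self-contained sketch of that pattern; `Query` and `apply_limit_overrides` are simplified stand-ins for illustration, not gitingest's real classes:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    """Stand-in for the parsed query object; defaults mirror config.py."""

    max_files: int = 10_000
    max_total_size_bytes: int = 500 * 1024 * 1024
    max_directory_depth: int = 20


def apply_limit_overrides(
    query: Query,
    *,
    max_files: Optional[int] = None,
    max_total_size_bytes: Optional[int] = None,
    max_directory_depth: Optional[int] = None,
) -> Query:
    """Overwrite a limit only when the caller passed a value."""
    if max_files is not None:
        query.max_files = max_files
    if max_total_size_bytes is not None:
        query.max_total_size_bytes = max_total_size_bytes
    if max_directory_depth is not None:
        query.max_directory_depth = max_directory_depth
    return query


query = apply_limit_overrides(Query(), max_files=50_000)
print(query.max_files, query.max_directory_depth)  # 50000 20
```

Checking `is not None` rather than truthiness means an explicit `0` is still applied as an override instead of being silently replaced by the default.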