Change cloud ingestion to download blobs based off path versus file data #2858
Conversation
Pull request overview
This PR refactors the cloud ingestion pipeline to bypass Azure AI Search's 16MB file_data limit by having the extraction skill download blobs directly from Azure Storage using the blob path instead of receiving base64-encoded file content. The change updates the skillset configuration to pass metadata_storage_path instead of file_data, modifies the document extractor function to parse the URL and download blobs via BlobManager, and upgrades the azure-ai-documentintelligence package from beta version 1.0.0b4 to stable version 1.0.2.
Key Changes
- Modified skillset inputs to pass the blob URL (metadata_storage_path) instead of the binary payload (file_data); see the sketch after this list
- Updated the document extractor function to download blobs using BlobManager with URL parsing logic
- Added the required environment variables (AZURE_STORAGE_RESOURCE_GROUP, AZURE_SUBSCRIPTION_ID) to the Bicep infrastructure
- Upgraded the azure-ai-documentintelligence dependency from the beta to the stable release
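A minimal sketch of what the skillset input change could look like using the azure-search-documents models; the skill name, function URI, and output mapping here are illustrative placeholders, not the repository's actual values.

```python
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    WebApiSkill,
)

# Hypothetical extraction skill: the only input is now the blob URL, so the
# indexer no longer streams base64 file content (and no longer hits the 16MB cap).
extraction_skill = WebApiSkill(
    name="document-extractor",  # illustrative name
    uri="https://<function-app>.azurewebsites.net/api/document_extractor",  # placeholder URI
    context="/document",
    inputs=[
        # Before: file_data, file_name, and contentType were passed as inputs.
        InputFieldMappingEntry(
            name="metadata_storage_path", source="/document/metadata_storage_path"
        ),
    ],
    outputs=[OutputFieldMappingEntry(name="pages", target_name="pages")],
)
```

With this shape, allow_skillset_to_read_file_data can be disabled on the indexer parameters, matching the change described below for cloudingestionstrategy.py.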
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| infra/main.bicep | Added AZURE_STORAGE_RESOURCE_GROUP and AZURE_SUBSCRIPTION_ID environment variables to support blob manager initialization |
| app/functions/text_processor/requirements.txt | Updated azure-ai-documentintelligence from 1.0.0b4 to 1.0.2 (stable release) |
| app/functions/figure_processor/requirements.txt | Updated azure-ai-documentintelligence from 1.0.0b4 to 1.0.2 (stable release) |
| app/functions/document_extractor/requirements.txt | Updated azure-ai-documentintelligence from 1.0.0b4 to 1.0.2 (stable release) |
| app/functions/document_extractor/function_app.py | Refactored to use blob path-based download instead of base64 file_data; added blob manager, URL parsing, and removed base64 decoding logic |
| app/backend/prepdocslib/cloudingestionstrategy.py | Changed skillset configuration to pass metadata_storage_path instead of file_data, file_name, and content_type; disabled allow_skillset_to_read_file_data |
```python
class GlobalSettings:
    file_processors: dict[str, FileProcessor]
    azure_credential: ManagedIdentityCredential
    blob_manager: BlobManager
```
Copilot AI commented on Dec 5, 2025:
The GlobalSettings dataclass has been updated to include a blob_manager field, but existing tests create mock settings without this field. This will cause TypeError exceptions in tests since blob_manager is a required field.
Tests like test_document_extractor_requires_single_record, test_document_extractor_handles_processing_exception, and test_document_extractor_process_document_http_error in tests/test_function_apps.py need to be updated to include blob_manager in their mock GlobalSettings objects.
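A possible shape for that fix, assuming the tests construct GlobalSettings directly; the helper name is hypothetical, and MagicMock stands in for the real credential and blob manager.

```python
from unittest.mock import MagicMock

def make_mock_settings() -> "GlobalSettings":
    # blob_manager is the new required field introduced by this PR; without it,
    # constructing the dataclass raises a TypeError.
    return GlobalSettings(
        file_processors={},            # no real processors needed for these tests
        azure_credential=MagicMock(),  # stand-in for ManagedIdentityCredential
        blob_manager=MagicMock(),      # stand-in for BlobManager
    )
```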
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
| "file_data": {"$type": "file", "data": base64.b64encode(b"pdf-bytes").decode("utf-8")}, | ||
| "file_name": "sample.pdf", | ||
| "contentType": "application/pdf", | ||
| "metadata_storage_path": "https://account.blob.core.windows.net/container/sample.pdf", |
Copilot AI commented on Dec 6, 2025:
Tests only verify simple blob paths at the container root (e.g., sample.pdf). Consider adding test cases for:
- Nested paths with subdirectories (e.g., folder/subfolder/file.pdf)
- URL-encoded characters in the path (e.g., spaces encoded as %20 or other special characters)
- Edge cases like trailing slashes or empty blob names

This would ensure the URL parsing logic handles all realistic scenarios from Azure Search indexers; an illustrative parameterized test follows.
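In this sketch, parse_blob_url is a hypothetical stand-in for the extractor's actual URL parsing helper; only the test structure is being suggested.

```python
import pytest
from urllib.parse import unquote, urlparse

def parse_blob_url(url: str) -> tuple[str, str]:
    """Split a blob URL into (container, blob_name), decoding %-escapes."""
    path = unquote(urlparse(url).path).lstrip("/")
    container, _, blob_name = path.partition("/")
    return container, blob_name

@pytest.mark.parametrize(
    ("url", "expected"),
    [
        # Simple path at the container root (the case already covered).
        ("https://account.blob.core.windows.net/container/sample.pdf",
         ("container", "sample.pdf")),
        # Nested path with subdirectories.
        ("https://account.blob.core.windows.net/container/folder/subfolder/file.pdf",
         ("container", "folder/subfolder/file.pdf")),
        # URL-encoded space in the blob name.
        ("https://account.blob.core.windows.net/container/my%20report.pdf",
         ("container", "my report.pdf")),
    ],
)
def test_parse_blob_url(url: str, expected: tuple[str, str]) -> None:
    assert parse_blob_url(url) == expected
```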
Purpose
Previously we were using file_data for ingestion, but that approach has a 16MB limit. This PR changes the extraction skill to always receive the blob path and use our BlobManager to download it.
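A minimal sketch of the new download path, assuming the azure-storage-blob SDK and a managed identity credential; the repository's BlobManager wraps equivalent calls, and its exact interface may differ.

```python
from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobClient

def download_blob(metadata_storage_path: str) -> bytes:
    """Download a blob given the full URL the indexer passes to the skill."""
    # e.g. https://account.blob.core.windows.net/container/folder/file.pdf
    blob_client = BlobClient.from_blob_url(
        metadata_storage_path, credential=ManagedIdentityCredential()
    )
    return blob_client.download_blob().readall()
```

Because the function fetches the blob itself, the payload never passes through the indexer, which is what removes the 16MB file_data ceiling.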
Does this introduce a breaking change?
When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.
Does this require changes to learn.microsoft.com docs?
This repository is referenced by this tutorial, which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial, check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.
Type of change
Code quality checklist
See CONTRIBUTING.md for more details.
- All tests pass (python -m pytest).
- I ran python -m pytest --cov to verify 100% coverage of added lines.
- I ran python -m mypy to check for type errors.
- I ran ruff and black manually on my code.