Change cloud ingestion to download blobs based off path versus file data #2858
Conversation
Pull request overview
This PR refactors the cloud ingestion pipeline to bypass Azure AI Search's 16MB file_data limit by having the extraction skill download blobs directly from Azure Storage using the blob path instead of receiving base64-encoded file content. The change updates the skillset configuration to pass metadata_storage_path instead of file_data, modifies the document extractor function to parse the URL and download blobs via BlobManager, and upgrades the azure-ai-documentintelligence package from beta version 1.0.0b4 to stable version 1.0.2.
Key Changes
- Modified skillset inputs to pass the blob URL (metadata_storage_path) instead of the binary payload (file_data); see the sketch after this list
- Updated the document extractor function to download blobs using BlobManager with URL parsing logic
- Added the required environment variables (AZURE_STORAGE_RESOURCE_GROUP, AZURE_SUBSCRIPTION_ID) to the Bicep infrastructure
- Upgraded the azure-ai-documentintelligence dependency from the beta to the stable release
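A minimal sketch of what the skillset input change could look like using the azure-search-documents models; the skill name, function URI, and output mapping here are illustrative placeholders, not the repository's actual values.

```python
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    WebApiSkill,
)

# Hypothetical extraction skill: the only input is now the blob URL, so the
# indexer no longer streams base64 file content (and no longer hits the 16MB cap).
extraction_skill = WebApiSkill(
    name="document-extractor",  # illustrative name
    uri="https://<function-app>.azurewebsites.net/api/document_extractor",  # placeholder URI
    context="/document",
    inputs=[
        # Before: file_data, file_name, and contentType were passed as inputs.
        InputFieldMappingEntry(
            name="metadata_storage_path", source="/document/metadata_storage_path"
        ),
    ],
    outputs=[OutputFieldMappingEntry(name="pages", target_name="pages")],
)
```

With this shape, allow_skillset_to_read_file_data can be disabled on the indexer parameters, matching the change described below for cloudingestionstrategy.py.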
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| infra/main.bicep | Added AZURE_STORAGE_RESOURCE_GROUP and AZURE_SUBSCRIPTION_ID environment variables to support blob manager initialization |
| app/functions/text_processor/requirements.txt | Updated azure-ai-documentintelligence from 1.0.0b4 to 1.0.2 (stable release) |
| app/functions/figure_processor/requirements.txt | Updated azure-ai-documentintelligence from 1.0.0b4 to 1.0.2 (stable release) |
| app/functions/document_extractor/requirements.txt | Updated azure-ai-documentintelligence from 1.0.0b4 to 1.0.2 (stable release) |
| app/functions/document_extractor/function_app.py | Refactored to use blob path-based download instead of base64 file_data; added blob manager, URL parsing, and removed base64 decoding logic |
| app/backend/prepdocslib/cloudingestionstrategy.py | Changed skillset configuration to pass metadata_storage_path instead of file_data, file_name, and content_type; disabled allow_skillset_to_read_file_data |
```python
class GlobalSettings:
    file_processors: dict[str, FileProcessor]
    azure_credential: ManagedIdentityCredential
    blob_manager: BlobManager
```
Copilot AI commented on Dec 5, 2025:
The GlobalSettings dataclass has been updated to include a blob_manager field, but existing tests create mock settings without this field. This will cause TypeError exceptions in tests since blob_manager is a required field.
Tests like test_document_extractor_requires_single_record, test_document_extractor_handles_processing_exception, and test_document_extractor_process_document_http_error in tests/test_function_apps.py need to be updated to include blob_manager in their mock GlobalSettings objects.
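A possible shape for that fix, assuming the tests construct GlobalSettings directly; the helper name is hypothetical, and MagicMock stands in for the real credential and blob manager.

```python
from unittest.mock import MagicMock

def make_mock_settings() -> "GlobalSettings":
    # blob_manager is the new required field introduced by this PR; without it,
    # constructing the dataclass raises a TypeError.
    return GlobalSettings(
        file_processors={},            # no real processors needed for these tests
        azure_credential=MagicMock(),  # stand-in for ManagedIdentityCredential
        blob_manager=MagicMock(),      # stand-in for BlobManager
    )
```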
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
| "file_data": {"$type": "file", "data": base64.b64encode(b"pdf-bytes").decode("utf-8")}, | ||
| "file_name": "sample.pdf", | ||
| "contentType": "application/pdf", | ||
| "metadata_storage_path": "https://account.blob.core.windows.net/container/sample.pdf", |
Copilot AI commented on Dec 6, 2025:
Tests only verify simple blob paths at the container root (e.g., sample.pdf). Consider adding test cases for:
- Nested paths with subdirectories (e.g., folder/subfolder/file.pdf)
- URL-encoded characters in the path (e.g., spaces encoded as %20 or other special characters)
- Edge cases like trailing slashes or empty blob names

This would ensure the URL parsing logic handles all realistic scenarios from Azure Search indexers; an illustrative parameterized test follows.
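In this sketch, parse_blob_url is a hypothetical stand-in for the extractor's actual URL parsing helper; only the test structure is being suggested.

```python
import pytest
from urllib.parse import unquote, urlparse

def parse_blob_url(url: str) -> tuple[str, str]:
    """Split a blob URL into (container, blob_name), decoding %-escapes."""
    path = unquote(urlparse(url).path).lstrip("/")
    container, _, blob_name = path.partition("/")
    return container, blob_name

@pytest.mark.parametrize(
    ("url", "expected"),
    [
        # Simple path at the container root (the case already covered).
        ("https://account.blob.core.windows.net/container/sample.pdf",
         ("container", "sample.pdf")),
        # Nested path with subdirectories.
        ("https://account.blob.core.windows.net/container/folder/subfolder/file.pdf",
         ("container", "folder/subfolder/file.pdf")),
        # URL-encoded space in the blob name.
        ("https://account.blob.core.windows.net/container/my%20report.pdf",
         ("container", "my report.pdf")),
    ],
)
def test_parse_blob_url(url: str, expected: tuple[str, str]) -> None:
    assert parse_blob_url(url) == expected
```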
Purpose
Previously we were using file_data for ingestion, but that approach has a 16MB limit. This PR changes the extraction skill to always receive the blob path and use our BlobManager to download it.
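A minimal sketch of the new download path, assuming the azure-storage-blob SDK and a managed identity credential; the repository's BlobManager wraps equivalent calls, and its exact interface may differ.

```python
from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobClient

def download_blob(metadata_storage_path: str) -> bytes:
    """Download a blob given the full URL the indexer passes to the skill."""
    # e.g. https://account.blob.core.windows.net/container/folder/file.pdf
    blob_client = BlobClient.from_blob_url(
        metadata_storage_path, credential=ManagedIdentityCredential()
    )
    return blob_client.download_blob().readall()
```

Because the function fetches the blob itself, the payload never passes through the indexer, which is what removes the 16MB file_data ceiling.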
Does this introduce a breaking change?
When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.
Does this require changes to learn.microsoft.com docs?
This repository is referenced by this tutorial, which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial, check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.
Type of change
Code quality checklist
See CONTRIBUTING.md for more details.
- All tests pass (python -m pytest).
- I ran python -m pytest --cov to verify 100% coverage of added lines.
- I ran python -m mypy to check for type errors.
- I ran ruff and black manually on my code.