@pamelafox pamelafox commented Dec 5, 2025

Purpose

Previously we used file_data for ingestion, but file_data has a 16 MB limit. This PR changes the extraction skill to always receive the blob path and use our BlobManager to download the file at that path.
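In outline, the new flow means the skill gets a blob URL (Azure Search's metadata_storage_path) and must split it into container and blob name before downloading. A minimal sketch of that parsing step, assuming the URL shape Azure Storage uses; the helper name and the repo's actual BlobManager logic may differ:

```python
from urllib.parse import unquote, urlparse


def parse_blob_url(url: str) -> tuple[str, str]:
    """Split a blob URL into (container, blob_path).

    e.g. https://account.blob.core.windows.net/container/folder/file.pdf
    yields ("container", "folder/file.pdf").
    """
    parsed = urlparse(url)
    # The path looks like /container/<blob path>; the blob path itself
    # may contain further slashes, so split only on the first one.
    container, _, blob_path = parsed.path.lstrip("/").partition("/")
    if not container or not blob_path:
        raise ValueError(f"Unexpected blob URL: {url}")
    # Indexers URL-encode metadata_storage_path values (e.g. spaces as %20).
    return container, unquote(blob_path)
```

With the container and blob name in hand, the skill can download the bytes through BlobManager instead of relying on the indexer to inline them, which is what removes the 16 MB ceiling.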

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[X] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial,
which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[X] No

Type of change

[X] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works.
  • I ran python -m pytest --cov to verify 100% coverage of added lines.
  • I ran python -m mypy to check for type errors.
  • I either used the pre-commit hooks or ran ruff and black manually on my code.


Copilot AI left a comment


Pull request overview

This PR refactors the cloud ingestion pipeline to bypass Azure AI Search's 16MB file_data limit by having the extraction skill download blobs directly from Azure Storage using the blob path instead of receiving base64-encoded file content. The change updates the skillset configuration to pass metadata_storage_path instead of file_data, modifies the document extractor function to parse the URL and download blobs via BlobManager, and upgrades the azure-ai-documentintelligence package from beta version 1.0.0b4 to stable version 1.0.2.

Key Changes

  • Modified skillset inputs to pass blob URL (metadata_storage_path) instead of binary payload (file_data)
  • Updated document extractor function to download blobs using BlobManager with URL parsing logic
  • Added required environment variables (AZURE_STORAGE_RESOURCE_GROUP, AZURE_SUBSCRIPTION_ID) to Bicep infrastructure
  • Upgraded azure-ai-documentintelligence dependency from beta to stable release
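
In skillset terms, the first two bullets amount to swapping the skill's input mapping: instead of the base64 file payload (capped at 16 MB by the indexer), the skill now receives only the blob URL and fetches the bytes itself. A hypothetical before/after of the input mappings, using standard Azure AI Search enriched-document paths; the repo's exact skillset definition may differ:

```python
# Before: the indexer inlines the file content (requires
# allowSkillsetToReadFileData and is capped at 16 MB).
inputs_before = [
    {"name": "file_data", "source": "/document/file_data"},
    {"name": "file_name", "source": "/document/metadata_storage_name"},
    {"name": "content_type", "source": "/document/metadata_storage_content_type"},
]

# After: only the blob URL is passed; the extraction skill
# downloads the blob directly, so there is no payload size cap.
inputs_after = [
    {"name": "metadata_storage_path", "source": "/document/metadata_storage_path"},
]
```

The extra Bicep environment variables (AZURE_STORAGE_RESOURCE_GROUP, AZURE_SUBSCRIPTION_ID) exist to let the function construct that downloader at startup.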

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
infra/main.bicep Added AZURE_STORAGE_RESOURCE_GROUP and AZURE_SUBSCRIPTION_ID environment variables to support blob manager initialization
app/functions/text_processor/requirements.txt Updated azure-ai-documentintelligence from 1.0.0b4 to 1.0.2 (stable release)
app/functions/figure_processor/requirements.txt Updated azure-ai-documentintelligence from 1.0.0b4 to 1.0.2 (stable release)
app/functions/document_extractor/requirements.txt Updated azure-ai-documentintelligence from 1.0.0b4 to 1.0.2 (stable release)
app/functions/document_extractor/function_app.py Refactored to use blob path-based download instead of base64 file_data; added blob manager, URL parsing, and removed base64 decoding logic
app/backend/prepdocslib/cloudingestionstrategy.py Changed skillset configuration to pass metadata_storage_path instead of file_data, file_name, and content_type; disabled allow_skillset_to_read_file_data


Comment on lines 31 to +36
class GlobalSettings:
file_processors: dict[str, FileProcessor]
azure_credential: ManagedIdentityCredential
blob_manager: BlobManager

Copilot AI Dec 5, 2025


The GlobalSettings dataclass has been updated to include a blob_manager field, but existing tests create mock settings without this field. This will cause TypeError exceptions in tests since blob_manager is a required field.

Tests like test_document_extractor_requires_single_record, test_document_extractor_handles_processing_exception, and test_document_extractor_process_document_http_error in tests/test_function_apps.py need to be updated to include blob_manager in their mock GlobalSettings objects.
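
The fix the comment asks for is mechanical: every place a mock GlobalSettings is built must now also pass blob_manager. A sketch using a stand-in dataclass (field names taken from the snippet above; the real class lives in the function app):

```python
from dataclasses import dataclass
from unittest.mock import MagicMock


# Stand-in mirroring the fields shown in the review snippet.
@dataclass
class GlobalSettings:
    file_processors: dict
    azure_credential: object
    blob_manager: object


def make_mock_settings() -> GlobalSettings:
    # blob_manager is a required dataclass field: omitting it raises
    # TypeError at construction time, which is what broke the tests.
    return GlobalSettings(
        file_processors={},
        azure_credential=MagicMock(),
        blob_manager=MagicMock(),
    )
```

Each affected test (test_document_extractor_requires_single_record and friends) would then build its settings through a helper like this instead of calling the constructor with the old three... two original fields.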


Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.



"file_data": {"$type": "file", "data": base64.b64encode(b"pdf-bytes").decode("utf-8")},
"file_name": "sample.pdf",
"contentType": "application/pdf",
"metadata_storage_path": "https://account.blob.core.windows.net/container/sample.pdf",

Copilot AI Dec 6, 2025


Tests only verify simple blob paths at the container root (e.g., sample.pdf). Consider adding test cases for:

  1. Nested paths with subdirectories (e.g., folder/subfolder/file.pdf)
  2. URL-encoded characters in the path (e.g., spaces encoded as %20 or other special characters)
  3. Edge cases like trailing slashes or empty blob names

This would ensure the URL parsing logic handles all realistic scenarios from Azure Search indexers.
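
The suggested cases translate naturally into table-driven assertions against the blob-name parsing step. A sketch against a hypothetical helper (the repo's actual function name and behavior may differ):

```python
from urllib.parse import unquote, urlparse


def blob_name_from_url(url: str) -> str:
    """Assumed helper: the URL-decoded path after the container segment."""
    path = urlparse(url).path.lstrip("/")
    _, _, blob = path.partition("/")
    if not blob:
        # Covers trailing slashes and container-only URLs.
        raise ValueError(f"No blob name in URL: {url}")
    return unquote(blob)


CASES = {
    # 1. Simple root-level blob (what the tests already cover).
    "https://acct.blob.core.windows.net/c/sample.pdf": "sample.pdf",
    # 2. Nested path with subdirectories.
    "https://acct.blob.core.windows.net/c/folder/subfolder/file.pdf": "folder/subfolder/file.pdf",
    # 3. URL-encoded characters (spaces as %20).
    "https://acct.blob.core.windows.net/c/my%20report.pdf": "my report.pdf",
}

for url, expected in CASES.items():
    assert blob_name_from_url(url) == expected
```

The empty-name edge case (a trailing slash) is worth a separate negative test asserting that ValueError (or whatever the real helper raises) is produced rather than an empty blob name being passed to storage.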

@pamelafox pamelafox merged commit 35a7885 into Azure-Samples:main Dec 6, 2025
30 checks passed
@pamelafox pamelafox deleted the useblobdownload branch December 6, 2025 00:13