@pamelafox
Collaborator

Purpose

Fixes #2817

This pull request refactors the ingestion pipeline to support a new cloud ingestion strategy, and improves modularity by reorganizing setup logic.

The ingestion strategy uses Azure Functions as Custom Web API skills in a skillset connected to a Blob Indexer.

These are the three skills, in order:

  • Document extractor: document_extractor
  • Figure processor: figure_processor
  • Text processor: text_processor
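
For readers new to custom Web API skills, here is a minimal sketch of the request/response envelope each function has to honor: the indexer posts a `values` array of records, and the skill returns one result per `recordId`. The `run_custom_skill` and `count_characters` names are hypothetical and are not taken from this PR; the real functions do far more work per record.

```python
from typing import Any, Callable

Record = dict[str, Any]


def run_custom_skill(payload: Record, process: Callable[[Record], Record]) -> Record:
    """Apply `process` to every record of a custom Web API skill request."""
    results = []
    for record in payload.get("values", []):
        result: Record = {
            "recordId": record["recordId"],  # must be echoed back unchanged
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            result["data"] = process(record.get("data", {}))
        except Exception as exc:
            # Report the failure on this record only, so the rest of the batch survives.
            result["errors"].append({"message": str(exc)})
        results.append(result)
    return {"values": results}


def count_characters(data: Record) -> Record:
    # Hypothetical stand-in for real work such as document extraction.
    return {"length": len(data["text"])}


request = {
    "values": [
        {"recordId": "1", "data": {"text": "hello"}},
        {"recordId": "2", "data": {}},  # missing "text" triggers a per-record error
    ]
}
response = run_custom_skill(request, count_characters)
```

Keeping the envelope handling separate from the per-record logic is one way to let the same processing code run both locally and inside a function.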

I had to refactor parts of prepdocs so that the functions can reuse them, which lets us run the same code locally and in the cloud.

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[ ] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial
which includes deployment, settings and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[ ] No

Type of change

[ ] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works
  • I ran python -m pytest --cov to verify 100% coverage of added lines
  • I ran python -m mypy to check for type errors
  • I either used the pre-commit hooks or ran ruff and black manually on my code.

@pamelafox marked this pull request as draft November 4, 2025 07:05
@pamelafox changed the title from "Prepskills" to "WIP: Cloud ingestion strategy with prepdocs as custom skillset for Azure AI Search Blob Indexer" Nov 4, 2025

logger.warning("No HTML parser available")
# Build mapping of file extensions to parsers using shared select_parser helper.
# Each select attempt may instantiate a DI parser; duplication is acceptable at startup.
def _try_select(ext: str, content_type: str) -> Parser | None:
Collaborator

Naming? maybe "try_select_parser"?
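
To make the suggestion concrete, here is a rough sketch of how the renamed helper might read. It assumes the shared `select_parser` raises `ValueError` when nothing matches, which I have not confirmed against the PR. `Parser`, `TextParser`, and the `candidates` table are stand-ins for illustration only.

```python
from __future__ import annotations


class Parser:
    """Stand-in for the project's parser base class."""


class TextParser(Parser):
    """Stand-in for a concrete parser."""


def select_parser(ext: str, content_type: str) -> Parser:
    # Hypothetical stand-in for the shared helper; assumed to raise when nothing fits.
    if ext in {".txt", ".md"}:
        return TextParser()
    raise ValueError(f"No parser available for {ext} ({content_type})")


def try_select_parser(ext: str, content_type: str) -> Parser | None:
    """Return a parser for the extension, or None when none is available."""
    try:
        return select_parser(ext, content_type)
    except ValueError:
        return None


# Build the extension-to-parser mapping, skipping extensions without a parser.
candidates = {
    ".txt": "text/plain",
    ".md": "text/markdown",
    ".xyz": "application/octet-stream",
}
parsers: dict[str, Parser] = {}
for ext, content_type in candidates.items():
    parser = try_select_parser(ext, content_type)
    if parser is not None:
        parsers[ext] = parser
```

The longer name says what is being selected, which helps once the helper is read outside the setup function.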

figure_processor_auth_resource_id=figure_processor_resource_id,
text_processor_uri=text_processor_uri,
text_processor_auth_resource_id=text_processor_resource_id,
subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
Collaborator

also require_env_var?
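
For context, a helper along these lines is what I have in mind. This is only a sketch: the actual `require_env_var` in the codebase may use a different error type or message.

```python
import os


def require_env_var(name: str) -> str:
    """Return the value of an environment variable, failing loudly if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        # Assumed behavior: a descriptive error instead of the bare KeyError
        # that os.environ[name] would raise.
        raise ValueError(f"Required environment variable {name} is not set")
    return value
```

With that, the call site would read `subscription_id=require_env_var("AZURE_SUBSCRIPTION_ID")`, and a missing value would surface with a clearer message at startup.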

Development

Successfully merging this pull request may close these issues.

Make custom skills for AI Search for prepdocs ingestion code