
Omnipy v0.17.0 Release Notes

Pre-release

@sveinugu sveinugu released this 06 Dec 12:10

Release date: Nov 7, 2024

v0.17.0 of Omnipy was another large release, focused on features for building dynamic URLs
and loading datasets asynchronously from APIs. As a whole, the release was a major step towards
dependable communication with APIs and towards handling large datasets in a concurrent and
efficient manner.

New features in v0.17.0

  • Dynamic building of URLs

    A new model, HttpUrlModel, has been added to support dynamic building of URLs from parts. It
    is more flexible than similar solutions in the standard Python library, in Pydantic, or in
    other third-party libraries, supporting the following features:

    • All parts can be easily edited at any time, using built-in types such as dict and Path
    • Automatic data type conversion (generic Omnipy feature)
    • Continuous validation after each change (generic Omnipy feature)
    • Error recovery: revert to last valid snapshot after invalid change (generic Omnipy feature)
    • Whenever the HttpUrlModel object is converted to a string, e.g. by insertion into a
      StrModel / StrDataset or by being used to fetch data, the URL string is automatically
      constructed from the parts.
    • Builds on top of Url from pydantic_core, which provides basic validation, URL encoding,
      and punycode encoding of international domain names for increased security

    With the HttpUrlDataset, dynamic URLs are scaled up to operate in batch mode, e.g. for building
    URLs for repeated API calls to be fetched concurrently and asynchronously.
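
    As a rough illustration of the idea (not Omnipy's actual API; the class and attribute names
    below are invented), a URL kept as editable parts and rendered to a string only on demand can
    be sketched with the standard library:

```python
# Stdlib sketch of the idea behind HttpUrlModel: keep the URL as
# editable parts and only build the string on demand. Names here are
# illustrative, not Omnipy's actual API.
from urllib.parse import urlencode, urlunsplit


class UrlParts:
    def __init__(self, scheme='https', host='', path='/', query=None):
        self.scheme = scheme
        self.host = host
        self.path = path
        self.query = dict(query or {})  # editable at any time, like a plain dict

    def __str__(self):
        # The URL string is constructed from the parts on conversion
        return urlunsplit(
            (self.scheme, self.host, self.path, urlencode(self.query), ''))


url = UrlParts(host='api.example.com', path='/v1/items')
url.query['page'] = 2  # edit a part after creation
print(str(url))  # https://api.example.com/v1/items?page=2
```

    Omnipy's real model adds validation, type conversion, and snapshot-based error recovery on
    top of this basic parts-to-string pattern.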

  • Dataset upgraded to support state info for per-item tasks

    To support per-item asynchronous tasks, the Dataset class has been upgraded to support state
    information for pending and failed tasks on a per-item basis. This includes storing
    exceptions and other relevant info for each item that has failed or is pending. Dataset
    visualisation has been updated to relay this info to the user in a clear and concise way.
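
    The per-item bookkeeping described above can be sketched in plain Python (an illustrative
    analogy, not Omnipy's internals; all names are invented):

```python
# Illustrative sketch of per-item state bookkeeping for asynchronous
# dataset tasks: each item records whether it is pending, finished, or
# failed, and keeps the exception on failure.
from enum import Enum, auto


class ItemState(Enum):
    PENDING = auto()
    FINISHED = auto()
    FAILED = auto()


class ItemStatus:
    def __init__(self):
        self.state = ItemState.PENDING
        self.exception = None  # stored for failed items

    def finish(self):
        self.state = ItemState.FINISHED

    def fail(self, exc):
        self.state = ItemState.FAILED
        self.exception = exc


statuses = {'a.json': ItemStatus(), 'b.json': ItemStatus()}
statuses['a.json'].finish()
statuses['b.json'].fail(ValueError('bad payload'))
```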

  • Job modifier iterate_over_data_files now supports asynchronous iteration

    The iterate_over_data_files job modifier has been upgraded to support asynchronous iteration
    over data files. This allows for more efficient handling of large datasets, and is especially
    useful when combined with the new Dataset state information for pending and failed tasks
    (see item above).
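
    The pattern can be approximated with plain asyncio (a hedged sketch; the function and
    dataset names are invented for illustration):

```python
# Sketch of asynchronous per-item iteration: each data file is processed
# in its own coroutine and the results are gathered concurrently.
import asyncio


async def process_file(name, content):
    await asyncio.sleep(0)  # stand-in for real async I/O
    return name, content.upper()


async def iterate_over_data_files(dataset):
    tasks = [process_file(name, content) for name, content in dataset.items()]
    return dict(await asyncio.gather(*tasks))


dataset = {'a.txt': 'foo', 'b.txt': 'bar'}
result = asyncio.run(iterate_over_data_files(dataset))
print(result)  # {'a.txt': 'FOO', 'b.txt': 'BAR'}
```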

  • Automatic handling of asynchronous tasks based on runtime environment

    Through the new auto_async job modifier, Omnipy now automatically detects whether the code is
    being run in an asynchronous runtime environment, such as a Jupyter notebook, and adjusts the
    execution of asynchronous tasks accordingly:

    • Technically, if auto_async is set to True (the default), the existing event loop is detected
      and used to run an asynchronous Omnipy Task as an asyncio.Task, allowing tasks to be run in
      the background if run from, e.g., a Jupyter notebook.
    • If no event loop is detected, Omnipy will create a new event loop and close it after the task is
      finished, allowing the task to be run synchronously in a regular Python script, or from the
      console.
    • The auto_async feature alleviates the complexity of running asynchronous tasks in different
      environments, and simplifies the combined use of asynchronous and synchronous tasks.

    Note 1: Omnipy does not yet support asynchronous flows, so asynchronous tasks currently need
    to be run independently.

    Note 2: auto_async does not support the opposite functionality, that is, running blocking
    synchronous tasks in the background in an asynchronous environment. This would require
    running the blocking tasks in threads; however, Omnipy runtime objects (such as configs) are
    not (yet) thread-safe. Hence, synchronous tasks will block the event loop and any
    asynchronous tasks running there.
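
    The detection logic described above can be approximated with plain asyncio (a minimal sketch
    of the behaviour, not Omnipy's implementation):

```python
# Minimal sketch of the auto_async idea: if an event loop is already
# running (e.g. in a Jupyter notebook), schedule the coroutine as a
# background asyncio.Task; otherwise run it in a fresh event loop.
import asyncio


async def my_task():
    await asyncio.sleep(0)
    return 42


def run_auto_async(coro):
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No running loop: create a new event loop, run, and close it
        return asyncio.run(coro)
    # Running loop detected: run in the background as an asyncio.Task
    return loop.create_task(coro)


result = run_auto_async(my_task())
print(result)  # 42 when called from synchronous code
```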

  • Dataset now supports asynchronous loading of data from URLs

    The Dataset class has been upgraded to support asynchronous loading of data from URLs. This
    makes use of the new HttpUrlDataset class for building URLs, the new state information for
    failed and pending per-item tasks, and the asynchronous iteration over data files. The fetching
    is implemented in the new get_*_from_api_endpoint tasks (where * is json,
    bytes, or str), built on top of the asynchronous aiohttp library, and supports the following
    features:

    • Automatic retry of HTTP requests, building on the aiohttp_retry library. Retries are
      configurable to retry for particular HTTP response codes, to retry a specified number of
      times, and to use a specified algorithm to calculate the delay between retries.
    • Rate limiting of HTTP requests, building on the aiolimiter library. Rate limiting is
      configurable to limit the number of requests per time period and to specify the time
      period used for calculation, indirectly also controlling the burst size. Going beyond what
      the aiolimiter library provides, Omnipy ensures that the maximum rate limit is not
      exceeded even for the initial burst of requests.
    • Automatic reset of rate limiter counting and delays for subsequent batches of requests
    • Retries and rate limiting are configured individually for each domain. Omnipy ensures that HTTP
      requests in the same batch (e.g. provided in the same HttpUrlDataset) are coordinated
      according to their domain.
    • The default values for retries and rate limiting are reasonable, so this functionality
      works seamlessly out of the box. However, the defaults can easily be changed if needed.
    • Dataset.load() now supports lists and dicts of paths or URLs (strings or HttpUrlModel
      objects) as input, as well as HttpUrlDataset objects.
    • Due to the asynchronous nature of the get_*_from_api_endpoint tasks, users in an asynchronous
      environment such as Jupyter Notebook can inspect the status of the download tasks while the
      download is in progress by inspecting the Dataset object.
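
    As a much-simplified sketch of per-domain coordination (Omnipy actually builds on aiohttp,
    aiohttp_retry, and aiolimiter; the limiter class and fetch placeholder below are invented
    for illustration):

```python
# Simplified sketch of per-domain rate limiting for batched requests:
# requests to the same domain are spaced out by a minimum interval,
# while different domains are limited independently.
import asyncio
import time
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Allow at most one request per `interval` seconds per domain."""

    def __init__(self, interval: float):
        self.interval = interval
        self._next_allowed: dict[str, float] = {}

    async def wait(self, url: str) -> None:
        domain = urlsplit(url).netloc
        now = time.monotonic()
        allowed = self._next_allowed.get(domain, now)
        self._next_allowed[domain] = max(allowed, now) + self.interval
        if allowed > now:
            await asyncio.sleep(allowed - now)


async def fetch_all(urls, limiter):
    async def fetch(url):
        await limiter.wait(url)
        return f'fetched {url}'  # placeholder for an aiohttp request

    return await asyncio.gather(*(fetch(u) for u in urls))


urls = ['https://a.example/1', 'https://a.example/2', 'https://b.example/1']
results = asyncio.run(fetch_all(urls, DomainRateLimiter(interval=0.01)))
```
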
  • Other new features / bug fixes / refactorings

    • Refactored Model and Dataset repr to make use of IPython pretty printer. Drops support for
      plain Python console for automatic pretty prints
    • Implemented NestedSplitToItemsModel and NestedJoinItemsModel for parsing nested structures of
      any level to/from strings (e.g. "param1=true&param2=42")
    • Implemented MatchItemsModel, which allows for filtering of items in a list based on
      user-defined functions
    • Implemented task create_row_index_from_column() and basic table datasets
    • Added support for optional fields in PydanticRecordModel
    • Fixed lack of to_data() conversion when importing mappings and iterators of models to a
      dataset
    • Refactored models and datasets for split and join, to reduce duplication and to allow
      params to be adjusted for all of them.
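
    The nested split/join round-trip that NestedSplitToItemsModel and NestedJoinItemsModel
    provide can be sketched in plain Python (an illustrative analogy; Omnipy's models add
    validation and type conversion on top):

```python
# Split a string on a list of delimiters, one nesting level per
# delimiter, and join it back. This mirrors the round-trip behaviour
# described above for strings such as "param1=true&param2=42".
def nested_split(text, delimiters):
    if not delimiters:
        return text
    first, *rest = delimiters
    return [nested_split(part, rest) for part in text.split(first)]


def nested_join(items, delimiters):
    if not delimiters:
        return items
    first, *rest = delimiters
    return first.join(nested_join(item, rest) for item in items)


parts = nested_split('param1=true&param2=42', ['&', '='])
print(parts)  # [['param1', 'true'], ['param2', '42']]
assert nested_join(parts, ['&', '=']) == 'param1=true&param2=42'
```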