Commit 1b209d4

Added release notes for v0.17.0 (after the fact)
Reformatted Markdown
1 parent 6bcb463 commit 1b209d4

docs/release_notes.md

Lines changed: 217 additions & 33 deletions
@@ -1,48 +1,232 @@
1-
## Omnipy v0.16.1
1+
## Omnipy v0.17.0
2+
3+
_Release date: Nov 7, 2024_
4+
5+
v0.17.0 of Omnipy was also a **huge** release, with a focus on features for building dynamic URLs
6+
and loading datasets asynchronously from APIs. As a whole, the release was a major step towards
7+
dependable communication with APIs, and the ability to handle large datasets in a concurrent and
8+
efficient manner.
9+
10+
### New features in v0.17.0
11+
12+
- **Dynamic building of URLs**
13+
14+
A new model, `HttpUrlModel`, has been added to support dynamic building of URLs from parts. It is
15+
more flexible than other similar solutions in the standard Python library, `Pydantic`, or other
16+
libraries, supporting the following features:
17+
- All parts can be easily edited at any time, using built-in types such as `dict` and `Path`
18+
- Automatic data type conversion _(generic Omnipy feature)_
19+
- Continuous validation after each change _(generic Omnipy feature)_
20+
- Error recovery: revert to last valid snapshot after invalid change _(generic Omnipy feature)_
21+
- Whenever the `HttpUrlModel` object is converted to a string, i.e. by insertion into a
22+
`StrModel` / `StrDataset` or being used to fetch data, the URL string is automatically
23+
constructed from the parts.
24+
- Builds on top of [`Url`](https://docs.pydantic.dev/2.0/usage/types/urls/) from
25+
`pydantic_core`, which provides basic validation, URL encoding as well as
26+
[punycode](https://en.wikipedia.org/wiki/Punycode) encoding of international domain names for
27+
[increased security](https://www.xudongz.com/blog/2017/idn-phishing/)
28+
29+
With the `HttpUrlDataset`, dynamic URLs are scaled up to operate in batch mode, e.g. for building
30+
URLs for repeated API calls to be fetched concurrently and asynchronously.
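
The idea of editing URL parts with built-in types and only assembling the string on conversion can be illustrated with the standard library alone. The following is a minimal sketch, not Omnipy's `HttpUrlModel`: the `UrlParts` class and its fields are hypothetical, and validation, snapshots and error recovery are left out.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from urllib.parse import quote, urlencode, urlunsplit


@dataclass
class UrlParts:
    """Hypothetical stand-in for a URL model; not part of Omnipy."""
    scheme: str = "https"
    host: str = "localhost"
    path: PurePosixPath = PurePosixPath("/")
    query: dict[str, str] = field(default_factory=dict)

    def __str__(self) -> str:
        # Encode international domain names as punycode (IDNA).
        host = self.host.encode("idna").decode("ascii")
        # Percent-encode the path; urlencode handles the query parameters.
        path = quote(str(self.path))
        return urlunsplit((self.scheme, host, path, urlencode(self.query), ""))


url = UrlParts(host="bücher.example")
url.path = url.path / "api" / "v1" / "items"  # edit the path like a Path
url.query["q"] = "open data"                  # edit the query like a dict
url.query["limit"] = "10"

# The URL string is only built when the object is converted to a string.
print(url)  # https://xn--bcher-kva.example/api/v1/items?q=open+data&limit=10
```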
31+
32+
33+
- **`Dataset` upgraded to support state info for per-item tasks**
34+
35+
To support per-item asynchronous tasks, the `Dataset` class has been upgraded to support state
36+
information for **pending** and **failed** tasks - _on a per-item basis._ This includes storing
37+
exceptions and other relevant info for each item that has failed or is pending. Dataset
38+
visualisation has been updated to relay this info to the user in a clear and concise way.
39+
40+
41+
- **Job modifier `iterate_over_data_files` now supports asynchronous iteration**
42+
43+
The `iterate_over_data_files` job modifier has been upgraded to support asynchronous iteration
44+
over data files. This allows for more efficient handling of large datasets, and is especially
45+
useful when combined with the new `Dataset` state information for pending and failed tasks
46+
(see item above).
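
As a rough illustration of per-item state combined with asynchronous iteration, the sketch below runs one coroutine per item and records whether each item is pending, done or failed. It uses only `asyncio`; `load_item`, `load_all` and the state labels are hypothetical and do not reflect the internals of `Dataset` or `iterate_over_data_files`.

```python
import asyncio


async def load_item(name: str) -> str:
    """Hypothetical per-item task; fails for one specific item."""
    await asyncio.sleep(0.01)
    if name == "broken":
        raise ValueError(f"cannot parse {name}")
    return name.upper()


async def load_all(names: list[str]):
    state = {name: "pending" for name in names}  # every item starts as pending
    data: dict[str, str] = {}
    errors: dict[str, Exception] = {}

    async def run_one(name: str) -> None:
        try:
            data[name] = await load_item(name)
            state[name] = "done"
        except Exception as exc:
            # Keep the exception with the item instead of aborting the batch.
            errors[name] = exc
            state[name] = "failed"

    # Run all per-item tasks concurrently.
    await asyncio.gather(*(run_one(name) for name in names))
    return state, data, errors


state, data, errors = asyncio.run(load_all(["a", "broken", "b"]))
print(state)  # {'a': 'done', 'broken': 'failed', 'b': 'done'}
```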
47+
48+
49+
- **Automatic handling of asynchronous tasks based on runtime environment**
50+
51+
Through the new `auto_async` job modifier, Omnipy now automatically detects whether the code is
52+
being run in an asynchronous runtime environment, such as a Jupyter notebook, and adjusts the
53+
execution of asynchronous tasks accordingly:
54+
- Technically, if `auto_async` is set to `True` (the default), the existing event loop is
55+
detected and used to run an asynchronous Omnipy `Task` as an `asyncio.Task`, allowing tasks to
56+
be run in the background if run from, _e.g._, a Jupyter notebook.
57+
- If no event loop is detected, Omnipy will create a new event loop and close it after the task
58+
is finished, allowing the task to be run synchronously in a regular Python script, or from the
59+
console.
60+
- The `auto_async` feature alleviates the complexity of running asynchronous tasks in different
61+
environments, and simplifies the combined use of asynchronous and synchronous tasks.
62+
63+
_**Note 1:** Omnipy is yet to support asynchronous flows, so asynchronous tasks currently need to
64+
be run independently._
65+
66+
_**Note 2:** `auto_async` does not support the opposite functionality, that is, running blocking
67+
synchronous tasks in the background in an asynchronous environment. This would require running the
68+
blocking tasks in threads, however Omnipy runtime objects (such as configs) are not (yet)
69+
thread-safe. Hence, synchronous tasks will block the event loop and any asynchronous tasks that
70+
are running there._
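
The two branches described for `auto_async` can be approximated with plain `asyncio`. This is a minimal sketch under the assumption that detection amounts to checking for a running event loop; `run_auto` and `fetch_answer` are hypothetical names, not Omnipy API.

```python
import asyncio
from collections.abc import Coroutine
from typing import Any


def run_auto(coro: Coroutine[Any, Any, Any]) -> Any:
    """Hypothetical helper mimicking the behaviour described above."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No running loop (plain script or console): create a new loop,
        # run the coroutine to completion and close the loop afterwards.
        return asyncio.run(coro)
    # A loop is already running (e.g. in a Jupyter notebook): schedule the
    # coroutine as a background asyncio.Task and return it immediately.
    return loop.create_task(coro)


async def fetch_answer() -> int:
    await asyncio.sleep(0.01)
    return 42


result = run_auto(fetch_answer())
print(result)  # 42 when run as a regular script
```

Inside a running event loop the same call returns an `asyncio.Task`, which can be awaited later or left to finish in the background.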
71+
72+
73+
- **`Dataset` now supports asynchronous loading of data from URLs**
74+
75+
The `Dataset` class has been upgraded to support asynchronous loading of data from URLs. This
76+
makes use of the new `HttpUrlDataset` class for building URLs, the new state information for
77+
failed and pending _per-item_ tasks, and the asynchronous iteration over data files. The fetching
78+
is implemented in the new `get_*_from_api_endpoint` tasks (where `*` is `json`,
79+
`bytes`, or `str`), built on top of the asynchronous `aiohttp` library, and supports the following
80+
features:
81+
- Automatic retry of HTTP requests, building on the `aiohttp_retry` library. Retries are
82+
configurable to retry for particular HTTP response codes, to retry a specified number of times
83+
and to use a specified algorithm to calculate the delay between retries.
84+
- Rate limiting of HTTP requests, building on the `aiolimiter` library. Rate limiting is
85+
configurable to limit the number of requests per time period, and to specify the time period
86+
used for calculation, indirectly also controlling the burst size. Adding to what is provided
87+
by the `aiolimiter` library, Omnipy ensures that the maximum rate limit is not exceeded also
88+
for the initial burst of requests.
89+
- Automatic reset of rate limiter counting and delays for subsequent batches of requests
90+
- Retries and rate limiting are configured individually for each domain. Omnipy ensures that
91+
HTTP requests in the same batch (e.g. provided in the same `HttpUrlDataset`) are coordinated
92+
according to their domain.
93+
- The default values for retries and rate limiting are set to reasonable values, so that this
94+
functionality is provided seamlessly for the users. However, these default values can be
95+
easily changed if needed.
96+
- `Dataset.load()` now supports lists and dicts of paths or URLs (strings or `HttpUrlModel`
97+
objects) as input, as well as `HttpUrlDataset` objects.
98+
- Due to the asynchronous nature of the `get_*_from_api_endpoint` tasks, users in an
99+
asynchronous environment such as Jupyter Notebook can inspect the status of the download tasks
100+
while the download is in progress by inspecting the `Dataset` object.
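
To make the interplay of retries and per-domain rate limiting concrete, here is a simplified sketch using only the standard library. It does not use `aiohttp`, `aiohttp_retry` or `aiolimiter`, and it is not how the `get_*_from_api_endpoint` tasks are implemented: `IntervalLimiter`, `fetch_with_retry` and `fake_fetch` are hypothetical, and the HTTP call is simulated.

```python
import asyncio
import time
from collections import defaultdict
from collections.abc import Awaitable, Callable
from urllib.parse import urlsplit


class IntervalLimiter:
    """Hypothetical limiter: at most one request per `interval` seconds."""

    def __init__(self, interval: float) -> None:
        self._interval = interval
        self._next_slot = 0.0

    async def wait(self) -> None:
        # Reserve the next free slot. There is no await between reading and
        # updating the slot, so concurrent callers get distinct slots and
        # even the initial burst is spaced out.
        now = time.monotonic()
        delay = max(0.0, self._next_slot - now)
        self._next_slot = max(now, self._next_slot) + self._interval
        await asyncio.sleep(delay)


async def fetch_with_retry(
    url: str,
    fetch: Callable[[str], Awaitable[tuple[int, str]]],
    limiters: dict,
    attempts: int = 3,
    base_delay: float = 0.01,
    retry_statuses: tuple[int, ...] = (429, 503),
) -> tuple[int, str]:
    limiter = limiters[urlsplit(url).hostname]  # one limiter per domain
    for attempt in range(attempts):
        await limiter.wait()
        status, body = await fetch(url)
        if status not in retry_statuses:
            return status, body
        if attempt < attempts - 1:
            # Exponential backoff between retries.
            await asyncio.sleep(base_delay * 2**attempt)
    return status, body


calls: dict[str, int] = defaultdict(int)


async def fake_fetch(url: str) -> tuple[int, str]:
    """Simulated endpoint: the 'flaky' URL fails once with HTTP 503."""
    calls[url] += 1
    if "flaky" in url and calls[url] == 1:
        return 503, ""
    return 200, f"data from {url}"


async def main() -> list[tuple[int, str]]:
    limiters = defaultdict(lambda: IntervalLimiter(0.01))
    urls = ["https://a.example/flaky", "https://a.example/ok", "https://b.example/ok"]
    return await asyncio.gather(*(fetch_with_retry(u, fake_fetch, limiters) for u in urls))


results = asyncio.run(main())
print([status for status, _ in results])  # [200, 200, 200]
```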
101+
102+
103+
- **Other new features / bug fixes / refactorings**
104+
- Refactored Model and Dataset `__repr__` to make use of IPython pretty printer. Drops support for
105+
plain Python console for automatic pretty prints
106+
- Implemented NestedSplitToItemsModel and NestedJoinItemsModel for parsing nested structures of
107+
any level to/from strings (e.g. `"param1=true&param2=42"`)
108+
- Implemented MatchItemsModel, which allows for filtering of items in a list based on a
109+
user-defined function
110+
- Implemented task `create_row_index_from_column()` and basic table datasets
111+
- Added support for optional fields in `PydanticRecordModel`
112+
- Fixed lack of `to_data()` conversion when importing mappings and iterators of models to a
113+
dataset
114+
- Refactored models and datasets for split and join, to reduce duplication and allow adjustments
115+
of params for all.
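
The nested split/join idea mentioned in the list above can be sketched in a few lines of plain Python. The functions `nested_split` and `nested_join` are hypothetical illustrations and say nothing about how `NestedSplitToItemsModel` and `NestedJoinItemsModel` are implemented.

```python
def nested_split(text: str, delimiters: tuple[str, ...]):
    """Split `text` recursively, one nesting level per delimiter."""
    if not delimiters:
        return text
    first, *rest = delimiters
    return [nested_split(part, tuple(rest)) for part in text.split(first)]


def nested_join(items, delimiters: tuple[str, ...]) -> str:
    """Inverse of nested_split: join nested lists back into a string."""
    if not delimiters:
        return items
    first, *rest = delimiters
    return first.join(nested_join(item, tuple(rest)) for item in items)


parsed = nested_split("param1=true&param2=42", ("&", "="))
print(parsed)  # [['param1', 'true'], ['param2', '42']]
print(nested_join(parsed, ("&", "=")))  # param1=true&param2=42
```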
116+
117+
## Omnipy v0.16.1
118+
2119
_Release date: Sep 20, 2024_
3120

4-
v0.16 of Omnipy is a **huge** release, with a focus on performance and improvements on internals. It is also the first version where we will start providing detailed release notes.
121+
v0.16 of Omnipy is a **huge** release, with a focus on performance and improvements on internals. It
122+
is also the first version where we will start providing detailed release notes.
5123

6-
_Note, the v0.16.1 release notes includes features from the v0.16.0 release, which was yanked due to issues with Python 3.12._
124+
_Note, the v0.16.1 release notes include features from the v0.16.0 release, which was yanked due to
125+
issues with Python 3.12._
7126

8127
### New features in v0.16
9128

10129
- **General speedup**
11-
Performance has been a major focus of the new release. Many of the major new features have been implemented to allow improved efficiency. Execution time of all examples in the [omnipy_examples](https://github.com/fairtracks/omnipy_examples) repo have been improved; in some cases the run times has been reduced to less than one tenth of the previous time. There is now very little overhead added by Omnipy on top of pydantic, so we should expect a major speed boost once support for pydantic v2 is added.
130+
Performance has been a major focus of the new release. Many of the major new features have been
131+
implemented to allow improved efficiency. Execution time of all examples in
132+
the [omnipy_examples](https://github.com/fairtracks/omnipy_examples) repo has been improved; in
133+
some cases the run times have been reduced to less than one tenth of the previous time. There is
134+
now very little overhead added by Omnipy on top of pydantic, so we should expect a major speed
135+
boost once support for pydantic v2 is added.
136+
137+
12138
- **Reimplemented model snapshots for efficiency**
13-
Model snapshots now make use of a memoization dictionary through the Pythons builtin `deepcopy` functionality, greatly speeding up snapshots of hierarchical models. The snapshots and the contents of the memoization dictionary are automatically deleted following garbage collection, thoroughly tested to provide no memory leaks.
139+
Model snapshots now make use of a memoization dictionary through Python's built-in `deepcopy`
140+
functionality, greatly speeding up snapshots of hierarchical models. The snapshots and the
141+
contents of the memoization dictionary are automatically deleted following garbage collection,
142+
thoroughly tested to provide no memory leaks.
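
The memoization dictionary referred to here is the optional second argument of `copy.deepcopy`. The sketch below only demonstrates that standard-library mechanism with plain dicts and lists; how Omnipy manages and cleans up the dictionary is not shown, and the variable names are hypothetical.

```python
import copy

# Two "models" that share the same nested object.
shared = list(range(5))
model_a = {"name": "a", "values": shared}
model_b = {"name": "b", "values": shared}

# Passing the same memo dict to both calls lets deepcopy reuse the copy of
# `shared` made during the first call instead of copying it a second time.
memo: dict = {}
snapshot_a = copy.deepcopy(model_a, memo)
snapshot_b = copy.deepcopy(model_b, memo)

print(snapshot_a["values"] is snapshot_b["values"])  # True: copied only once
print(snapshot_a["values"] is shared)                # False: still a real copy
```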
143+
144+
14145
- **Lazy snapshots**
15-
Models now take snapshots only when they might change the first time, greatly improving efficiency of models with contents that do not change.
146+
Models now take snapshots only the first time they might change, greatly improving efficiency
147+
of models with contents that do not change.
148+
149+
16150
- **Remove unneeded nested Models**
17-
Some models, such as `SplitLinesToColumnsModel` have been are reimplemented to remove second-level Omnipy models, and instead use doubly nested builtin collections, e.g. `Model[list[list[str]]` instead of `Model[list[Model[list[str]]]]`. JSON Model containers now use simple types at the terminal level (e.g. 42 instead of JsonScalarM(42)). For cases where the nested Omnipy models are required, this is now supported by a new non-default option (see next feature).
151+
Some models, such as `SplitLinesToColumnsModel`, have been reimplemented to remove second-level
152+
Omnipy models, and instead use doubly nested builtin collections, e.g. `Model[list[list[str]]]`
153+
instead of `Model[list[Model[list[str]]]]`. JSON Model containers now use simple types at the
154+
terminal level (e.g. 42 instead of JsonScalarM(42)). For cases where the nested Omnipy models are
155+
required, this is now supported by a new non-default option (see next feature).
156+
157+
18158
- **Dynamically convert elements to models**
19-
Support for dynamically generating Model objects from the elements of parent collection Models, e.g. to generate Model[int] objects when iterating through the elements of a Model[list[int]]. Turned off by default through `dynamically_convert_elements_to_models` config for efficiency.
159+
Support for dynamically generating Model objects from the elements of parent collection Models,
160+
e.g. to generate Model[int] objects when iterating through the elements of a Model[list[int]].
161+
Turned off by default through `dynamically_convert_elements_to_models` config for efficiency.
162+
163+
20164
- **Redesigned parametrised models and datasets to keep state**
21-
Previous implementation of parametrised models and datasets required users to specify the parameter every time it was used, making in difficult to specify composite models that include parametrised submodels. Also, the implementation was complex and made it difficult to improve Omnipy with with improved functionality for conversion and serialization. New implementation is based on parametrizing models and datasets as new types in a highly decoupled fashion. It is unfortunately slightly more complex to define parametrized models and datasets in the new solution due to innate complexities in how Python implements type annotations. Having tested a number of alternatives, most of whom did not work out, it is clear that the new solution strikes a good balance between simplicity and flexibility.
165+
Previous implementation of parametrised models and datasets required users to specify the
166+
parameter every time it was used, making it difficult to specify composite models that include
167+
parametrised submodels. Also, the implementation was complex and made it difficult to improve
168+
Omnipy with improved functionality for conversion and serialization. New implementation is
169+
based on parametrizing models and datasets as new types in a highly decoupled fashion. It is
170+
unfortunately slightly more complex to define parametrized models and datasets in the new solution
171+
due to innate complexities in how Python implements type annotations. Having tested a number of
172+
alternatives, most of which did not work out, it is clear that the new solution strikes a good
173+
balance between simplicity and flexibility.
174+
175+
22176
- **Chained models**
23-
A new solution for creating `mini-workflows` by chaining two or more models to form a single chained model. This reduces the need to specify linear flows for parsing, as exemplified in the new [BED file parser](https://github.com/fairtracks/omnipy_examples/blob/master/src/omnipy_examples/bed.py) example in [omnipy_examples](https://github.com/fairtracks/omnipy_examples).
177+
A new solution for creating `mini-workflows` by chaining two or more models to form a single
178+
chained model. This reduces the need to specify linear flows for parsing, as exemplified in the
179+
new [BED file parser](https://github.com/fairtracks/omnipy_examples/blob/master/src/omnipy_examples/bed.py)
180+
example in [omnipy_examples](https://github.com/fairtracks/omnipy_examples).
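
The general idea of chaining parsing steps into one unit can be shown with ordinary function composition. This is only a conceptual sketch: `chain`, `split_lines` and `split_columns` are hypothetical helpers, and Omnipy's chained models are defined on model types rather than on functions.

```python
from functools import reduce


def chain(*steps):
    """Combine several single-argument steps into one callable."""
    def chained(value):
        # Feed the output of each step into the next one.
        return reduce(lambda acc, step: step(acc), steps, value)
    return chained


def split_lines(text: str) -> list[str]:
    return text.splitlines()


def split_columns(lines: list[str]) -> list[list[str]]:
    return [line.split("\t") for line in lines]


parse_table = chain(split_lines, split_columns)
print(parse_table("chr1\t0\t100\nchr2\t5\t50"))
# [['chr1', '0', '100'], ['chr2', '5', '50']]
```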
181+
182+
24183
- **Support for streaming to models by overloading `+` operator**
25-
All models supporting the `+` operator can now be streamed to from builtin types or other models, triggering parsing as specified in the model. Example: `my_table_model = TableOfPydanticRecordsModel[MyColumns](); my_table_model += [['text', 12, True]]`. This in principle allows for large flows to continue where they left off in case of network issues, or faulty data in the middle of a longer stream. Proper failure management is yet to be implemented, but is made much easier through the support of streaming to Models. Basic interactive operations are also much simplified with this feature, e.g. for concatenation of data.
26-
- **Improved automatic conversion**
27-
- Mimicked operations now autoconvert the outputs, e.g. `Model[int](5) + 5 == Model[int](10)`.
28-
- Iterators and other sequence-like types such as range generators are now automatically recognized and converted sequence types such as `list` and `tuple`.
29-
- `PandasModel` and `PandasDataset` now support other models and datasets as input during initialisation.
30-
- **Improvements of model validation**
31-
- Internals of validation functionality in the Model class has been harmonised and simplified.
32-
- Mimicked methods/attributes are validated also when interactive_mode=False
33-
- Pydantic models are validated before accessing attributes
184+
All models supporting the `+` operator can now be streamed to from builtin types or other models,
185+
triggering parsing as specified in the model. Example:
186+
`my_table_model = TableOfPydanticRecordsModel[MyColumns](); my_table_model += [['text', 12, True]]`.
187+
This in principle allows for large flows to continue where they left off in case of network
188+
issues, or faulty data in the middle of a longer stream. Proper failure management is yet to be
189+
implemented, but is made much easier through the support of streaming to Models. Basic interactive
190+
operations are also much simplified with this feature, e.g. for concatenation of data.
191+
192+
193+
- **Improved automatic conversion**
194+
- Mimicked operations now autoconvert the outputs, e.g. `Model[int](5) + 5 == Model[int](10)`.
195+
- Iterators and other sequence-like types such as range generators are now automatically
196+
recognized and converted to sequence types such as `list` and `tuple`.
197+
- `PandasModel` and `PandasDataset` now support other models and datasets as input during
198+
initialisation.
199+
200+
201+
- **Improvements of model validation**
202+
- Internals of validation functionality in the Model class have been harmonised and simplified.
203+
- Mimicked methods/attributes are validated also when interactive_mode=False
204+
- Pydantic models are validated before accessing attributes
205+
206+
34207
- **Better handling of `None` values**
35-
Pydantic v1 made some poor choices in how to handle `None` values, which has been very difficult to rectify within Omnipy. A previous hack to fix this issue has now been replaced with an improved hack which also fixed a number of previously "known issues" in the Omnipy tests. This refactoring is paving the way to a simplified move to pydantic v2, which is on the horizon, but postponed for now to focus on feature completion.
36-
- **Other new features**
37-
- Support for Python 3.12 and Prefect 2.20
38-
- Better support for forward references
39-
- Caching of type-related function calls such as Model.outer_type(), further improving efficiency
40-
- Dataset.load() now supports lists of paths or URLs as input
41-
- Implementation of a SetDeque util class for speedup of various features, including model snapshots
42-
- Support default values for `TypeVar`, through `typing_extensions` (otherwise a Python 3.13 feature)
43-
- Refactoring of root log, fixing issues with a stuck timestamp when running flows
44-
- Reimplemented and fixed `__name__`, `__qualname__`, and `__repr__` for Model and Dataset
45-
- Implemented support for `__call__()`, and `__bool__()` for Models
46-
- Implemented `copy()` for Model and Dataset
47-
- Implemented flexible `__setitem__` and `__delitem__` for Dataset, supporting indexing by ints, slices and tuples.
48-
- A ton of smaller bug fixes, new tests and code cleanup. Some refactoring, especially of new snapshot functionality, is postponed to later versions.
208+
Pydantic v1 made some poor choices in how to handle `None` values, which has been very difficult
209+
to rectify within Omnipy. A previous hack to fix this issue has now been replaced with an improved
210+
hack which also fixed a number of previously "known issues" in the Omnipy tests. This refactoring
211+
is paving the way to a simplified move to pydantic v2, which is on the horizon, but postponed for
212+
now to focus on feature completion.
213+
214+
215+
- **Other new features**
216+
- Support for Python 3.12 and Prefect 2.20
217+
- Better support for forward references
218+
- Caching of type-related function calls such as Model.outer_type(), further improving
219+
efficiency
220+
- Dataset.load() now supports lists of paths or URLs as input
221+
- Implementation of a SetDeque util class for speedup of various features, including model
222+
snapshots
223+
- Support default values for `TypeVar`, through `typing_extensions` (otherwise a Python 3.13
224+
feature)
225+
- Refactoring of root log, fixing issues with a stuck timestamp when running flows
226+
- Reimplemented and fixed `__name__`, `__qualname__`, and `__repr__` for Model and Dataset
227+
- Implemented support for `__call__()`, and `__bool__()` for Models
228+
- Implemented `copy()` for Model and Dataset
229+
- Implemented flexible `__setitem__` and `__delitem__` for Dataset, supporting indexing by ints,
230+
slices and tuples.
231+
- A ton of smaller bug fixes, new tests and code cleanup. Some refactoring, especially of new
232+
snapshot functionality, is postponed to later versions.
