Commit 732224e

cmaloney, vstinner, and maurycy authored
gh-139871: Add bytearray.take_bytes([n]) to efficiently extract bytes (GH-140128)
Update `bytearray` to contain a `bytes` and provide a zero-copy path to "extract" the `bytes`. This allows making several code paths more efficient.

This does not move any codepaths to make use of this new API. The documentation changes include common code patterns which can be made more efficient with this API.

---

When just changing `bytearray` to contain `bytes` I ran pyperformance on a `--with-lto --enable-optimizations --with-static-libpython` build and don't see any major speedups or slowdowns with this; all seems to be in the noise of my machine (generally changes under 5% or benchmarks that don't touch bytes/bytearray).

Co-authored-by: Victor Stinner <vstinner@python.org>
Co-authored-by: Maurycy Pawłowski-Wieroński <5383+maurycy@users.noreply.github.com>
1 parent 2fbd396 commit 732224e

11 files changed, +406 -95 lines changed
Doc/library/stdtypes.rst

Lines changed: 24 additions & 0 deletions
@@ -3173,6 +3173,30 @@ objects.

       .. versionadded:: 3.14

+   .. method:: take_bytes(n=None, /)
+
+      Remove the first *n* bytes from the bytearray and return them as an immutable
+      :class:`bytes`.
+      By default (if *n* is ``None``), return all bytes and clear the bytearray.
+
+      If *n* is negative, index from the end and take the first :func:`len`
+      plus *n* bytes. If *n* is out of bounds, raise :exc:`IndexError`.
+
+      Taking less than the full length will leave remaining bytes in the
+      :class:`bytearray`, which requires a copy. If the remaining bytes should be
+      discarded, use :func:`~bytearray.resize` or :keyword:`del` to truncate
+      then :func:`~bytearray.take_bytes` without a size.
+
+      .. impl-detail::
+
+         Taking all bytes is a zero-copy operation.
+
+      .. versionadded:: next
+
+      See the :ref:`What's New <whatsnew315-bytearray-take-bytes>` entry for
+      common code patterns which can be optimized with
+      :func:`bytearray.take_bytes`.
+
 Since bytearray objects are sequences of integers (akin to a list), for a
 bytearray object *b*, ``b[0]`` will be an integer, while ``b[0:1]`` will be
 a bytearray object of length 1.  (This contrasts with text strings, where
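
Editor's note: the semantics documented in the hunk above can be approximated in pure Python on interpreters that lack the method. The following is a minimal sketch, not part of the commit. The helper name `take_bytes_compat` and its error message are hypothetical, and the fallback path copies rather than taking the zero-copy route.

```python
import operator


def take_bytes_compat(buf: bytearray, n=None) -> bytes:
    """Hypothetical fallback mirroring the documented take_bytes() behaviour."""
    native = getattr(buf, "take_bytes", None)
    if native is not None:
        # The interpreter already provides the method: defer to it.
        return native(n)

    size = len(buf)
    if n is None:
        n = size
    else:
        n = operator.index(n)  # rejects floats with TypeError
        if n < 0:
            n += size  # negative n counts from the end
        if not 0 <= n <= size:
            raise IndexError("n is out of bounds for this buffer")
    data = bytes(buf[:n])  # copy out the prefix (this path is not zero-copy)
    del buf[:n]            # leave the remainder in the bytearray
    return data


buf = bytearray(b"abcdef")
first = take_bytes_compat(buf, 1)   # one byte from the front
head = take_bytes_compat(buf, -3)   # len(buf) + (-3) == 2 bytes
rest = take_bytes_compat(buf)       # everything left; buf becomes empty
```

The three calls follow the same sequence as the "Positive and negative slicing" case in the commit's `test_take_bytes`.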

Doc/whatsnew/3.15.rst

Lines changed: 80 additions & 0 deletions
@@ -307,6 +307,86 @@ Other language changes
   not only integers or floats, although this does not improve precision.
   (Contributed by Serhiy Storchaka in :gh:`67795`.)

+.. _whatsnew315-bytearray-take-bytes:
+
+* Added :meth:`bytearray.take_bytes(n=None, /) <bytearray.take_bytes>` to take
+  bytes out of a :class:`bytearray` without copying. This enables optimizing code
+  which must return :class:`bytes` after working with a mutable buffer of bytes
+  such as data buffering, network protocol parsing, encoding, decoding,
+  and compression. Common code patterns which can be optimized with
+  :func:`~bytearray.take_bytes` are listed below.
+
+  (Contributed by Cody Maloney in :gh:`139871`.)
+
+  .. list-table:: Suggested Optimizing Refactors
+     :header-rows: 1
+
+     * - Description
+       - Old
+       - New
+
+     * - Return :class:`bytes` after working with :class:`bytearray`
+       - .. code:: python
+
+            def read() -> bytes:
+                buffer = bytearray(1024)
+                ...
+                return bytes(buffer)
+
+       - .. code:: python
+
+            def read() -> bytes:
+                buffer = bytearray(1024)
+                ...
+                return buffer.take_bytes()
+
+     * - Empty a buffer getting the bytes
+       - .. code:: python
+
+            buffer = bytearray(1024)
+            ...
+            data = bytes(buffer)
+            buffer.clear()
+
+       - .. code:: python
+
+            buffer = bytearray(1024)
+            ...
+            data = buffer.take_bytes()
+
+     * - Split a buffer at a specific separator
+       - .. code:: python
+
+            buffer = bytearray(b'abc\ndef')
+            n = buffer.find(b'\n')
+            data = bytes(buffer[:n + 1])
+            del buffer[:n + 1]
+            assert data == b'abc\n'
+            assert buffer == bytearray(b'def')
+
+       - .. code:: python
+
+            buffer = bytearray(b'abc\ndef')
+            n = buffer.find(b'\n')
+            data = buffer.take_bytes(n + 1)
+
+     * - Split a buffer at a specific separator; discard after the separator
+       - .. code:: python
+
+            buffer = bytearray(b'abc\ndef')
+            n = buffer.find(b'\n')
+            data = bytes(buffer[:n])
+            buffer.clear()
+            assert data == b'abc'
+            assert len(buffer) == 0
+
+       - .. code:: python
+
+            buffer = bytearray(b'abc\ndef')
+            n = buffer.find(b'\n')
+            buffer.resize(n)
+            data = buffer.take_bytes()
+
 * Many functions related to compiling or parsing Python code, such as
   :func:`compile`, :func:`ast.parse`, :func:`symtable.symtable`,
   and :func:`importlib.abc.InspectLoader.source_to_code`, now allow to pass
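
Editor's note: the "split a buffer at a specific separator" row is the core of a line-oriented protocol reader. The sketch below shows one way it might be applied. The names `take_prefix` and `feed` are hypothetical, and the helper falls back to copy-and-delete when `bytearray.take_bytes` is unavailable.

```python
def take_prefix(buf: bytearray, n: int) -> bytes:
    """Hypothetical helper: remove and return the first n bytes of buf."""
    if hasattr(buf, "take_bytes"):
        return buf.take_bytes(n)
    data = bytes(buf[:n])  # fallback for interpreters without take_bytes()
    del buf[:n]
    return data


def feed(buf: bytearray, chunk: bytes) -> list[bytes]:
    """Append chunk to buf and return every complete line now available."""
    buf += chunk
    lines = []
    while (end := buf.find(b"\n")) != -1:
        lines.append(take_prefix(buf, end + 1))  # keep the separator
    return lines


pending = bytearray()
out = feed(pending, b"HELO exam")          # no newline yet, nothing returned
out += feed(pending, b"ple\nDATA\nparti")  # two lines complete, tail stays buffered
```

Each returned line includes its trailing `b"\n"`, and the incomplete tail remains in `pending` for the next call.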

Include/cpython/bytearrayobject.h

Lines changed: 8 additions & 8 deletions
@@ -5,25 +5,25 @@
 /* Object layout */
 typedef struct {
     PyObject_VAR_HEAD
-    Py_ssize_t ob_alloc;   /* How many bytes allocated in ob_bytes */
+    /* How many bytes allocated in ob_bytes
+
+       In the current implementation this is equivalent to Py_SIZE(ob_bytes_object).
+       The value is always loaded and stored atomically for thread safety.
+       There are API compatibility concerns with removing so keeping for now. */
+    Py_ssize_t ob_alloc;
     char *ob_bytes;        /* Physical backing buffer */
     char *ob_start;        /* Logical start inside ob_bytes */
     Py_ssize_t ob_exports; /* How many buffer exports */
+    PyObject *ob_bytes_object; /* PyBytes for zero-copy bytes conversion */
 } PyByteArrayObject;

-PyAPI_DATA(char) _PyByteArray_empty_string[];
-
 /* Macros and static inline functions, trading safety for speed */
 #define _PyByteArray_CAST(op) \
     (assert(PyByteArray_Check(op)), _Py_CAST(PyByteArrayObject*, op))

 static inline char* PyByteArray_AS_STRING(PyObject *op)
 {
-    PyByteArrayObject *self = _PyByteArray_CAST(op);
-    if (Py_SIZE(self)) {
-        return self->ob_start;
-    }
-    return _PyByteArray_empty_string;
+    return _PyByteArray_CAST(op)->ob_start;
 }
 #define PyByteArray_AS_STRING(self) PyByteArray_AS_STRING(_PyObject_CAST(self))

Include/internal/pycore_bytesobject.h

Lines changed: 8 additions & 0 deletions
@@ -60,6 +60,14 @@ PyAPI_FUNC(void)
 _PyBytes_Repeat(char* dest, Py_ssize_t len_dest,
                 const char* src, Py_ssize_t len_src);

+/* _PyBytesObject_SIZE gives the basic size of a bytes object; any memory allocation
+   for a bytes object of length n should request _PyBytesObject_SIZE + n bytes.
+
+   Using _PyBytesObject_SIZE instead of sizeof(PyBytesObject) saves
+   3 or 7 bytes per bytes object allocation on a typical system.
+*/
+#define _PyBytesObject_SIZE (offsetof(PyBytesObject, ob_sval) + 1)
+
 /* --- PyBytesWriter ------------------------------------------------------ */

 struct PyBytesWriter {
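
Editor's note: the "3 or 7 bytes" figure in the comment above comes from struct tail padding. The sketch below reproduces the arithmetic with `ctypes`. `BytesLayout` is a hypothetical mirror of `PyBytesObject` and assumes the field layout of a default (non-free-threaded, non-debug) build; it is not taken from the commit.

```python
import ctypes


class BytesLayout(ctypes.Structure):
    """Illustrative stand-in for PyBytesObject (assumed default-build layout)."""
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
        ("ob_shash", ctypes.c_ssize_t),
        ("ob_sval", ctypes.c_char * 1),
    ]


padded = ctypes.sizeof(BytesLayout)      # analogue of sizeof(PyBytesObject)
basic = BytesLayout.ob_sval.offset + 1   # analogue of offsetof(..., ob_sval) + 1
saved = padded - basic                   # tail padding that the macro avoids
```

Under this assumed layout the compiler rounds the struct up to pointer alignment, so `saved` is one less than the pointer size: 7 with 8-byte pointers and 3 with 4-byte pointers.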

Lib/test/test_bytes.py

Lines changed: 81 additions & 0 deletions
@@ -1397,6 +1397,16 @@ def test_clear(self):
         b.append(ord('p'))
         self.assertEqual(b, b'p')

+        # Cleared object should be empty.
+        b = bytearray(b'abc')
+        b.clear()
+        self.assertEqual(b.__alloc__(), 0)
+        base_size = sys.getsizeof(bytearray())
+        self.assertEqual(sys.getsizeof(b), base_size)
+        c = b.copy()
+        self.assertEqual(c.__alloc__(), 0)
+        self.assertEqual(sys.getsizeof(c), base_size)
+
     def test_copy(self):
         b = bytearray(b'abc')
         bb = b.copy()
@@ -1458,6 +1468,61 @@ def test_resize(self):
         self.assertRaises(MemoryError, bytearray().resize, sys.maxsize)
         self.assertRaises(MemoryError, bytearray(1000).resize, sys.maxsize)

+    def test_take_bytes(self):
+        ba = bytearray(b'ab')
+        self.assertEqual(ba.take_bytes(), b'ab')
+        self.assertEqual(len(ba), 0)
+        self.assertEqual(ba, bytearray(b''))
+        self.assertEqual(ba.__alloc__(), 0)
+        base_size = sys.getsizeof(bytearray())
+        self.assertEqual(sys.getsizeof(ba), base_size)
+
+        # Positive and negative slicing.
+        ba = bytearray(b'abcdef')
+        self.assertEqual(ba.take_bytes(1), b'a')
+        self.assertEqual(ba, bytearray(b'bcdef'))
+        self.assertEqual(len(ba), 5)
+        self.assertEqual(ba.take_bytes(-5), b'')
+        self.assertEqual(ba, bytearray(b'bcdef'))
+        self.assertEqual(len(ba), 5)
+        self.assertEqual(ba.take_bytes(-3), b'bc')
+        self.assertEqual(ba, bytearray(b'def'))
+        self.assertEqual(len(ba), 3)
+        self.assertEqual(ba.take_bytes(3), b'def')
+        self.assertEqual(ba, bytearray(b''))
+        self.assertEqual(len(ba), 0)
+
+        # Take nothing from emptiness.
+        self.assertEqual(ba.take_bytes(0), b'')
+        self.assertEqual(ba.take_bytes(), b'')
+        self.assertEqual(ba.take_bytes(None), b'')
+
+        # Out of bounds, bad take value.
+        self.assertRaises(IndexError, ba.take_bytes, -1)
+        self.assertRaises(TypeError, ba.take_bytes, 3.14)
+        ba = bytearray(b'abcdef')
+        self.assertRaises(IndexError, ba.take_bytes, 7)
+
+        # Offset between physical and logical start (ob_bytes != ob_start).
+        ba = bytearray(b'abcde')
+        del ba[:2]
+        self.assertEqual(ba, bytearray(b'cde'))
+        self.assertEqual(ba.take_bytes(), b'cde')
+
+        # Overallocation at end.
+        ba = bytearray(b'abcde')
+        del ba[-2:]
+        self.assertEqual(ba, bytearray(b'abc'))
+        self.assertEqual(ba.take_bytes(), b'abc')
+        ba = bytearray(b'abcde')
+        ba.resize(4)
+        self.assertEqual(ba.take_bytes(), b'abcd')
+
+        # Take of a bytearray with references should fail.
+        ba = bytearray(b'abc')
+        with memoryview(ba) as mv:
+            self.assertRaises(BufferError, ba.take_bytes)
+        self.assertEqual(ba.take_bytes(), b'abc')

     def test_setitem(self):
         def setitem_as_mapping(b, i, val):
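
Editor's note: the final case in `test_take_bytes` expects `BufferError` while a `memoryview` is exported. Long-standing resizing methods such as `bytearray.clear()` are guarded the same way, so the sketch below uses `clear()` as a stand-in to show the pattern on interpreters without `take_bytes()`. Treating the two as analogous is an editorial assumption based on the test, not a statement about the implementation.

```python
buf = bytearray(b"abc")

with memoryview(buf):
    try:
        buf.clear()      # resizing while a view is exported is refused
    except BufferError:
        blocked = True
    else:
        blocked = False

buf.clear()              # the view has been released, so this succeeds
```

Callers that hand out views of a buffer need to release them before emptying it, which is what the `with` block guarantees here.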
@@ -2564,6 +2629,18 @@ def zfill(b, a):
             c = a.zfill(0x400000)
             assert not c or c[-1] not in (0xdd, 0xcd)

+        def take_bytes(b, a):  # MODIFIES!
+            b.wait()
+            c = a.take_bytes()
+            assert not c or c[0] == 48  # '0'
+
+        def take_bytes_n(b, a):  # MODIFIES!
+            b.wait()
+            try:
+                c = a.take_bytes(10)
+                assert c == b'0123456789'
+            except IndexError: pass
+
         def check(funcs, a=None, *args):
             if a is None:
                 a = bytearray(b'0' * 0x400000)
@@ -2625,6 +2702,10 @@ def check(funcs, a=None, *args):
         check([clear] + [startswith] * 10)
         check([clear] + [strip] * 10)

+        check([clear] + [take_bytes] * 10)
+        check([take_bytes_n] * 10, bytearray(b'0123456789' * 0x400))
+        check([take_bytes_n] * 10, bytearray(b'0123456789' * 5))
+
         check([clear] + [contains] * 10)
         check([clear] + [subscript] * 10)
         check([clear2] + [ass_subscript2] * 10, None, bytearray(b'0' * 0x400000))

Lib/test/test_capi/test_bytearray.py

Lines changed: 4 additions & 1 deletion
@@ -1,3 +1,4 @@
+import sys
 import unittest
 from test.support import import_helper

@@ -55,7 +56,9 @@ def test_fromstringandsize(self):
         self.assertEqual(fromstringandsize(b'', 0), bytearray())
         self.assertEqual(fromstringandsize(NULL, 0), bytearray())
         self.assertEqual(len(fromstringandsize(NULL, 3)), 3)
-        self.assertRaises(MemoryError, fromstringandsize, NULL, PY_SSIZE_T_MAX)
+        self.assertRaises(OverflowError, fromstringandsize, NULL, PY_SSIZE_T_MAX)
+        self.assertRaises(OverflowError, fromstringandsize, NULL,
+                          PY_SSIZE_T_MAX-sys.getsizeof(b'') + 1)

         self.assertRaises(SystemError, fromstringandsize, b'abc', -1)
         self.assertRaises(SystemError, fromstringandsize, b'abc', PY_SSIZE_T_MIN)

Lib/test/test_sys.py

Lines changed: 1 addition & 1 deletion
@@ -1583,7 +1583,7 @@ def test_objecttypes(self):
         samples = [b'', b'u'*100000]
         for sample in samples:
             x = bytearray(sample)
-            check(x, vsize('n2Pi') + x.__alloc__())
+            check(x, vsize('n2PiP') + x.__alloc__())
         # bytearray_iterator
         check(iter(bytearray()), size('nP'))
         # bytes
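
Editor's note: the format string change from `'n2Pi'` to `'n2PiP'` records the extra `PyObject *ob_bytes_object` field added to the struct. The sketch below uses `struct.calcsize` to show the effect on the fields after the object header. The header itself is left out, and the trailing `0n` alignment padding is an assumption about how the total is rounded.

```python
import struct

# n = Py_ssize_t, P = pointer, i = int; "0n" pads to Py_ssize_t alignment.
old = struct.calcsize("n2Pi" + "0n")    # ob_alloc, ob_bytes, ob_start, ob_exports
new = struct.calcsize("n2PiP" + "0n")   # the same fields plus ob_bytes_object
grew_by = new - old
```

Under these assumptions the struct grows by exactly one pointer, because `ob_exports` already forced padding up to pointer alignment.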
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+Update :class:`bytearray` to use a :class:`bytes` under the hood as its buffer
+and add :func:`bytearray.take_bytes` to take it out.
