weixin_39604598
weixin_39604598
2020-11-27 12:12

Factor out IPythonBuffer from IBufferProtocol

Here is my exploration around the re-implementation of the buffer protocol (#21). I will try to make incremental changes that will keep IronPython working (so all tests should be passing).

Each step is a proposal of an idea that can be discussed and adjusted, so smaller steps are preferable, rather a large dive that goes nowhere.

The PR is writable to the maintainers, so I welcome commits on top in a collaborative effort.

该提问来源于开源项目:IronLanguages/ironpython3

  • 点赞
  • 写回答
  • 关注问题
  • 收藏
  • 复制链接分享
  • 邀请回答

21条回答

  • weixin_39604598 weixin_39604598 5月前

    Since I can't have a call-site that would do automatic conversion I was thinking maybe a helper method?

    Thanks for the commit. It's nice to see things get simpler with the new IBufferProtocol. I agree that UnsafeByteList and UnsafeByteArray can be phased out now.

    The issue with io.BytesIO.writelines reminds me of Bytes(PythonList bytes), where the list elements have to be cast to byte at runtime. The binder only handles simple (but common) cases where there is just one object that needs to support Buffer Protocol, the method can tupe its parameter as IByfferProtocol and there are no ambiguous overloads. For more complicated cases I think we need a helper method, maybe in PythonOps, in the spirit of TryGetEnumerator/GetEnumerator? I would also make a little more general in handling any type that behaves as Memory. Something like:

    c#
    bool TryGetBufferProtocol(object obj, out IBufferProtocol bufferProtocol) {
        bufferProtocol = null;
        if (obj == null) return false;
    
        if (obj is IBufferProtocol bp) {
            bufferProtocol = bp;
        } else if (obj is Memory<byte> mem) {
            bufferProtocol = new MemoryBufferProtocolWrapper(mem);
        } else if (obj is ReadOnlyMemory<byte> rom) {
            bufferProtocol = new MemoryBufferProtocolWrapper(rom);
        } else if (Converter.HasImplicitConversion(obj.GetType(), typeof(Memory<byte>))) {
            bufferProtocol = new MemoryBufferProtocolWrapper((Memory<byte>)obj);
        } else if (Converter.HasImplicitConversion(obj.GetType(), typeof(ReadOnlyMemory<byte>))) {
            bufferProtocol = new MemoryBufferProtocolWrapper((ReadOnlyMemory<byte>)obj);
        }
        return bufferProtocol != null;
    }
    </byte></byte></byte></byte></byte></byte>

    On another tangent of BytesIO, I was thinking that maybe it can use ArrayData as the backing structure in place of bytes[]? BytesIO has the same size-freeze on exports behaviour as array.array:

    python
    >>> import io
    >>> bio = io.BytesIO()
    >>> mv = bio.getbuffer()
    >>> bio.write(b"x")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    BufferError: Existing exports of data: object cannot be re-sized
    >>> mv.release()
    >>> bio.write(b"x") 
    1
    </module></stdin>
    点赞 评论 复制链接分享
  • weixin_39929595 weixin_39929595 5月前

    I would argue it would be easier to convince the library owners to add m.release() rather than gc.collect(), the latter being conditional on which implementation of Python is being used.

    Python 2 did not have a memoryview.release method which means there is probably lots of code where memoryview just "releases" by going out of scope.

    I don't think that running forced garbage collection so frequently is practical, for performance reasons.

    Agreed. Probably not a good idea.

    Or the Debug builds would print a warning when ~MemoryView() has to decrement the exports counter?

    Could be a Debug.Assert(false)?

    点赞 评论 复制链接分享
  • weixin_39929595 weixin_39929595 5月前

    I would also make a little more general in handling any type that behaves as Memory.

    I tried your proposed method and it threw an InvalidCastException? Also, HasImplicitConversion doesn't seem particularly efficient. I wonder how it would compare to an is byte[] if it has to be called frequently.

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    Python 2 did not have a memoryview.release method

    Aren't we lucky then that Python 2 is EOL? 😃

    I tried your proposed method and it threw an InvalidCastException

    Apparently, obj has to be first cast to its limit type before cast to Memory. This will do the trick:

    c#
    bufferProtocol = new MemoryBufferProtocolWrapper(Converter.Convert<memory>>(obj));
    </memory>

    Also, HasImplicitConversion doesn't seem particularly efficient.

    Agreed. That was the reason I did not use in in my modification of PythonOverloadResolver. The overload resolution is already taxed, comparing every overload pairs, and then doing multiple passes. I am still not sure what to do about it.

    In the case of GetBufferProtocol, the choice is more obvious, it is either this or an exeption. A fast track for byte[] and maybe for ArraySegment<byte> will be in order.

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    Actually, if we are using `Converter, it will do the whole job:

    c#
    bufferProtocol = Converter.Convert<ibufferprotocol>(obj);
    </ibufferprotocol>
    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    BytesIO.getbuffer should be reworked to not return a copy

    Agreed. I think there are more parts in BytesIO that could be reworked now that we have System.Buffers.

    Does `CData` work? (very low priority)
    

    To my knowledge, CData works as good (or as bad) as before. That is, it does work except for indirect multidimensional arrays (i.e. arrays of pointers to arrays). That last one will require support for suboffsets in MemoryView — very low priority for me at the moment. I am planning simple general cleanup in CData in the coming weeks. I will also see if there are more loose ends there.

    Should `MemoryView` be an `IPythonBuffer`?
    

    Good point, but debatable. The way I see it memoryview is Python Buffer, but elevated to be a first class Python type. In the early CPython 3 versions even more so, for instance it would return None for shape if ndim == 0, just as it is mandated for Python Buffers. IPythonBuffer is not exactly CPython Buffer, but close enough, so conceptually I think that MemoryView is an IPythonBuffer. But I don't feel strongly about it. Introducing a MemoryViewView of sorts would prevent code like

    C#
    IPythonBuffer buf = memoryView.GetBuffer();
    
    // somewhere else, perhaps in a sub-method:
    if (buf is MemoryView mv) {
       // have a feast...
    }
    

    but I also don't feel strongly of adding extra complexity just to prevent it.

    On my side, besides the remaining TODOs, I see the following: * Constructors/initializers of bytes/bytearray should be able to accept any object implementing Buffer protocol, not just bytes-like objects. * Slice deletion in bytearray/array.array can be simplified and made more efficient. * Perhaps some of the switch on typecode statements in array.array can be simplified or eliminated. On the other hand, what is there now works just fine, so it is more like a mind teaser...

    点赞 评论 复制链接分享
  • weixin_39929595 weixin_39929595 5月前

    This is a good start. Hopefully I can find some time to experiment with this. Here are some of my initial thoughts:

    In particular, I think that several methods in IPythonBuffer should be removed.

    Agreed, most of them can probably go.

    However, we opted for more convenient types IList and ReadOnlyMemory, with the conversion implemented in PythonConversionBinder.

    Well, maybe we should revisit this. While these types are more from a .NET point of view it seems like they're not a great fit for the Python world. In particular IList is terrible because it would force us to make a copy the buffer? Another point which makes using the convenient types moot is that a lot of these methods are already a annoying to invoke from .NET due to the CodeContext argument. Perhaps we should just stick to IBufferProtocol overloads. While we'd need to call GetBuffer every time it doesn't seem like such a big deal and we'd have more controls over the flags (otherwise we'd need some sort of attribute to specify them?).

    We could have IBufferProtocol wrappers (for example static IBufferProtocol ToBufferProtocol(this Array[byte] bytes)) for the convenient .NET types. When invoking from the Python side of things the binder could easily use those wrappers to automatically convert the object to an IBufferProtocol.

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    Well, maybe we should revisit this. While these types are more from a .NET point of view it seems like they're not a great fit for the Python world. [...] Perhaps we should just stick to IBufferProtocol overloads.

    We can revisit it, but I wouldn't dismiss the existing mechanism as yet. First of all, nothing needs to change to support IBufferProtocol as bytes-like input; it can already be used and there is no need for any special parameter attributes. So having support for [BytesLike] in the converter is a bonus, that may be useful or convenient in some cases (assuming that it works correctly). On the other hand, if the "case for those cases" is weak, then it does not warrant the extra complexity. I will try to sum up some pros and cons of both approaches, as I see it.

    Using IBufferProtocol in the method signature: 1. It matches how CPython and offers the most flexibility on when, how (i.e. flags), and for how long a buffer is exported, unlike with the Binder mechanism. With the Binder mechanism, the buffer will be always requested at the entry to the method and released at the exit. This may not always match CPython. For instance, it is conceivable than for some corner cases created by values of other arguments, CPython never does request the buffer but fast-tracks to a default answer. If so, buffer error handling behaviour would be different. (Though I think this is an unlikely scenario). 2. It does not require any special handling code in the Binder, or any parameter attribute. 3. It makes it more difficult for interop, but in many cases methods that accept a bytes-like object are there just to provide an existing .NET functionality to Python. So there is little benefit for C# to call them. And indeed, we could provide a wrapper around byte[] or Memory<byte> for convenience.

    Using [BytesLike] support in the Binder. 1. The change has less impact on the codebase, so the migration can be gradual. The common GetBuffer functionality is factored out to the Binder-generated code. The actual method code is cleaner. 2. If we settle on Memory<byte>/ReadOnlyMemory<byte>, interop is easier. 3. For advanced scenarios, a method can still use IBufferProtocol in its signature.

    In particular IList is terrible because it would force us to make a copy the buffer?

    I agree that IList is not the best match for the buffer semantics. I would like to phase it out and replace with either [BytesLike]ReadOnlyMemory or IBufferProtocol, depending on the choice. Keeping support for IList as a possibility for buffer access will either require some exporters to make a copy of the data or we will need a wrapper around Memory that provides IList. The latter is functionally possible but will not offer efficient element access. One way or another, IList always underperforms, or is at best equal to other alternatives. It does offer an ability to pass discontinuous data, but as you have indicated before, this is not a requirement of bytes-like objects, which are expected to have contiguous byte data.

    So now how to phase it out. Since the full progression of types is IBubberProtocol -> IPythonBuffer -> ReadOnlyMemory -> ReadOnlySpan -> individual byte, I think the best first step is to replace [BytesLike]IList<byte> with [BytesLike]ReadOnlyMemory<byte>. These changes will be useful regardless the final mechanism of handling of bytes-like objects.

    点赞 评论 复制链接分享
  • weixin_39929595 weixin_39929595 5月前

    I think the best first step is to replace [BytesLike]IList with [BytesLike]ReadOnlyMemory

    This seems like a good first step, however, before jumping into it, I'd like to make sure this actually works. In particular, I'm trying to see how we would get from, for example, an array('i') (which is backed by an Array<int>) to a Memory<byte>. Right now we can write:

    C#
    array a = new array("i", new byte[] {1,2,3,4,5});
    using IPythonBuffer buffer = a.GetBuffer();
    ReadOnlyMemory<byte> mem = buffer.ToMemory();
    </byte>

    The issues which we would need to resolve are: 1. buffer.ToMemory() copies the data, it should be able to provide direct access to the memory. 2. a is backed by an ArrayData<int> which in turn is backed by an int[]. I know we can go from the array to a Memory<int>, but how do we go from there to a Memory<byte>? There's MemoryMarshal.Cast for Span but I don't see something like that for Memory. 3. ToMemory returns a ReadOnlyMemory, what do we do when we want write access?

    I guess in all the [BytesLike] cases we only need read-only access, however, we need to support the array scenario where the data is not backed by a byte array.

    Python
    bytes(array.array('i', [1, 2, 3, 4, 5]))
    

    Any ideas?

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    Here are my ideas (none of them tested out, though).

    First of all, although buffer.ToMemory() copies data if buffer comes from an array so does converting buffer to IList<byte>. Eliminating IList<byte> use would allow to remove IPythonBuffer.ToBytes(). It is not a small task because IList<byte> is used in many methods of Bytes and ByteArray, but it will be needed regardless which final solution we settle on.

    As for the IPythonBuffer interface, I wanted to remove GetItem, SetItem, SetSlice, ToBytes, and ToList. If we need ToBytes and ToList they can come as extension methods.

    Yes, it would be easier with spans, but we can't use spans in public methods.

    So I was thinking about replacing ToMemory(), which returns ReadOnlyMemory<byte> and possibly copying data, with two methods: Memory<T>? AsMemory<T>() and ReadOnlyMemory<T>? AsReadOnlyMemory<T>(). Neither of them would copy data. The former would only work for writable buffers (that is, ! IPythonBuffer.IsReadOnly), throwing InvalidOperationException otherwise. The latter would work all the time. But: if T does not "match" IPythonBuffer.Format, null is returned.

    An alternative: forget the generics and always return Memory<byte>/ReadOnlyMemory<byte>. The client then has to do proper data marshalling by dropping to Span and using MemoryMarshal.Cast if needed. This option is currently my favourite.

    This does not solve the ArrayData<int> problem, which should be able to provide both Memory<int>, when BufferFlags.Format is used, and Memory<byte>, for BufferFlags.Simple requests. Maybe a solution is in changing the backing object from int[] to some unmanaged blob?

    Yet another idea: use IBufferProtocol as the designated bytes-like type, and then each method does GetBuffer(BufferFlags.Format) -> AsMemory<T> -> AsSpan<T> -> AsBytes<T>. This could be an utility method in PythonOps, doing the job based on IPythonBuffer.Format Not ideal though, because CPython's array can satisfy simple (i.e. byte-typed) buffer requests too, so ideally the problem should be addressed in array.

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    Here is another idea that I have tested out a little bit. It addresses the array problem locally, in ArrayData<T>. Instead of using T[] as the backing object, it can use byte[]. Providing Memory<byte> for IPythonBuffer would become trivial. Indexed access, on the other hand, would go something like this:

    C#
    public T this[int index] {
        get => MemoryMarshal.Cast<byte t>(_items.Span)[index];
        set => MemoryMarshal.Cast<byte t>(_items.Span)[index] = value;
    }
    </byte></byte>

    I would like to see it tied out. , you have been working on ArrayData<T> on the past, do you prefer to take it on?

    点赞 评论 复制链接分享
  • weixin_39929595 weixin_39929595 5月前

    I remember that having span as a method parameter of a public method was not working well.

    Part of the issue with Span is that it can't be boxed or used as a generic argument so it causes a bunch of issues with call sites and the interpreter (which uses a stack of objects). If we're just using it "behind the scenes" via C# then there should be no issue with it.

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    This was originally "Remove IPythonBuffer.SetItem(...)" but things turned out to be quite interdependent. I tried to keep changes to minimum, just to the level of all tests passing, and commit them to get early feedback on the direction it is going in. The code is not proofread or cleaned up, and is sprinkled with TODO comments.

    There are some additional tests that are passing now (not included) but to my knowledge there are no regressions.

    It is strange that the build fails; does CI do something that a local ./make clean; ./make doesn't? I've successfully tested the commit locally on Windows and macOS.

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    I will try to merge in master and see if it helps.

    点赞 评论 复制链接分享
  • weixin_39929595 weixin_39929595 5月前

    I like it. I added an ArrayDataView to test out the approach. I put the logic there instead of in ByteArray so that it could be reused with array but I'm not sure if it's worth it? I don't think my ArrayDataView is quite correct since it should probably throw whenever Count changes instead of when the array is replaced. Also there might be threading issues with just using a counter as I do?

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    , thanks for the commit. I started working on ByteArrayView, leaving factoring out commonalities between it and ArrayView for later, but looking at your solution I see it can be done right away. Nevertheless, it was a worthwhile exercise because it got me thinking about multithreading early on. After some deliberation I came to a conclusion that it will be best not to make the Buffer Protocol views thread-safe, leaving any synchronization to the client (CPython also does not give any guarantees on thread-safety of buffer data). In the last commit, I wrote some rules of engagement as I see them currently.

    However, obtaining and releasing (and in the case of ByteArray, checking for exports) has to be thread-safe, as the rest of API of the exporting object. I added the modification to the code to ensure that. Perhaps they all can be moved to ArrayData for symmetry. It also means that all CheckBuffer calls have to happen in a locked critical section. This can be better done at ByteArray level. And yes, every method of ByteArray that may modify the length of the data needs to to the check.

    Further, I am not sure about calling Dispose() from the finalizer. I vaguely recall a possibility that a referenced object from a field may already been garbage-collected, in this case, _arrayData. That would lead to an exception, with GC would probably ignore...?

    Handling unformatted buffers is not quite clear to me. I find Python documentation on this point incomplete and ambiguous. Maybe I will have to look into CPython's code in the end to get the full picture. But one thing that is clear is that exporters, when given requests without Formatted flag, MUST provide buffers that report Format as null. See BytesView.

    点赞 评论 复制链接分享
  • weixin_39929595 weixin_39929595 5月前

    I wonder if there's any value in making the ByteArray._bytes field readonly. We could do this be adding an ArrayData<T>.Capacity property (similar to List<T>.Capacity) to allow shrinking the underlying array.

    Personally I don't see the issue with calling Dispose from the finalizer. It's essentially what the dispose pattern says to do? Although I'm not an expert in this domain.

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    I wonder if there's any value in making the ByteArray._bytes field readonly.

    I like the sound of it. This would place all export/resize logic and locking in one class. Feel free to experiment with it and commit your results. My plan for the next steps is to: 1. Provide conversion wrappers for Memory/ReadOnlyMemory as IBufferProtocol. 1. Provide automatic conversion from Memory et al. to IBufferProtocol by the binder. 1. Change all signatures using [BytesLike]ReadOnlyMemory<byte> to IBufferProtocol (that would be mostly in codecs). 1. Eliminate ToMemory() from IPythonBuffer. 1. Let array.array use ArrayDataView.

    After that, I could go back to try to make BytesArray._bytes field readonly.

    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    Personally I don't see the issue with calling Dispose from the finalizer.

    Calling Dispose from the finalizer is not a problem. Disposing managed resources during such call is the problem.

    The task of the Garbage Collector is to collect and recycle any resources that are no longer in use. GC does a pretty good job for managed resources because, well it manages them. But it cannot do it for unmanaged resources, by definition. That's where the finalizer comes in handy. Its job is to assist GC in freeing unmanaged resources.

    A class can also have managed resources in use. The IDispose is meant to allow the programmer to release them as soon as they are not needed, rather than waiting on GC to collect them. Plus any unmanaged resources as well (say, close a file, so that it can be reopened).

    So in a general case, we have the following: 1. The finalizer has to release unmanaged resources 2. IDispose.Dispose() has to release unmanaged as well as managed resources.

    There is an overlap between them so it makes sense to factor it out to a separate method, conventionally void Dispose(bool disposing). When this method is called with true argument, it means it is being called from IDispose.Dispose() and should release both managed and unmanaged resources. After such call, the finalizer is not needed anymore, because all unmanaged resources are already taken care of. If it is called with false argument, it means the call comes from the finalizer. In such case, it has to dispose the unmanaged resources only. If the class can be subclassed, then Dispose(bool) is made protected virtual, so that the subclasses can properly handle their managed and unmanaged resources, but in the same manner, that is, only unmanaged resources are released on calls with disposing == false.

    This all together is the dispose pattern. Since ArrayDataView does not have any unmanaged resources, it does not need any finalizer. If the full dispose pattern were implemented in ArrayDataView, the call from the finalizer would be a no-op, and the type would only incur generation penalty for having a finalizer.

    But all this is theory, the primary reason why I deleted the finalizer in ArrayDataView is because it could turn ordinary bugs into heisenbugs. For a class that does not own any unmanaged resources (hence does not have a finalizer), it may be occasionally OK not to call Dispose() on it after use (although IMO it is a poor programming practice). In the case of ArrayDataView the consequence is keeping its size permanently frozen. But in most cases this is a coding error. This will eventually pop up as an exception down the road when an attempt is made to resize it. If the finalizer calls Dispose() behind the scenes at unpredictable moments, such bug may become difficult to reproduce, depending on anything that influences the demand on GC to collect data.

    In fact, I have recently had to deal with exactly this scenario. After finishing work on ArrayData buffer exports, the full IronPython test run was passing everywhere except on urllib. Because failures were deterministic, I was able to trace it down to a bug in HTTPResponse.client.readinto and come up with a proper workaround (see commit a0da749876a0200e9e472b840c0135d88af16e6b). It would have been much more tricky if the bug was non-deterministic.

    点赞 评论 复制链接分享
  • weixin_39929595 weixin_39929595 5月前

    thanks for the write-up on the dispose pattern. While the ArrayDataView object is perhaps not the place for it, I would argue that we still need some sort of finalizer to decrement the counter. Perhaps on MemoryView? Consider the following example:

    Python
    import gc
    b = bytearray()
    m = memoryview(b)
    del m
    gc.collect() # would not be needed with CPython
    b.append(1)
    

    While it's bad code the fact that it works with CPython means it's more than likely used in the wild...

    Now that we have a proper IBufferProtocol, using UnsafeByteList or UnsafeByteArray becomes mostly obsolete. (The issue you ran into would have been eliminated by removing uses of this.)

    I just pushed a commit, perhaps you have an idea to enable the following:

    Python
    import System
    import io
    b = io.BytesIO()
    b.writelines([System.Array[System.Byte]([1,2,3]), b"\1\2\3"])
    

    Since I can't have a call-site that would do automatic conversion I was thinking maybe a helper method?

    C#
    IBufferProtocol ToBufferProtocol(object obj) {
        if(obj is IBufferProtocol bp) return bp;
        if(obj is byte[] b) return MemoryBufferWrapper(b);
        // ...
        throw PythonOps.TypeError("a bytes-like object is required, not '{0}'", PythonTypeOps.GetName(obj));
    }
    
    点赞 评论 复制链接分享
  • weixin_39604598 weixin_39604598 5月前

    While it's bad code the fact that it works with CPython means it's more than likely used in the wild...

    If it is widely used in existing Python libraries, that is really bad news, because, as you indicate, it won't work with IronPython without gc.collect(). I would argue it would be easier to convince the library owners to add m.release() rather than gc.collect(), the latter being conditional on which implementation of Python is being used.

    The del case can be seen as a special case of a more general case when variables simply go out of scope. The following works in CPython:

    python
    def f(b):
        m = memoryview(b)
    
    b = bytearray()
    f(b)
    b.append(1)
    

    What I labelled a "bug" in HTTPResponse.client is exactly such case. Arguably, it is still bad code because it depends on a specific implementation of garbage collection, which, though documented for CPython, is not a part of the language. Unless we run GC.Collect(1) after each scope end, I see no way to mimic CPython's behaviour. I don't think that running forced garbage collection so frequently is practical, for performance reasons.

    But having a finalizer that reduces the export count may be beneficial to clients of such libraries, assuming the libraries themselves do not suffer from it. If an IronPython programmer uses such a library and encounters this problem, they can then call gc.collect() and move on. Since this is an issue of Python objects left behind without explicit release, I think the finalizer should then go in classes that implement those Python types, in this instance, indeed, MemoryView rather than ArrayDataView.

    Unfortunately, this still has the ability to create heisenbugs, and would have hindered me with HTTPResponse.client. Maybe we provide the finalizer only in Release builds? Or a commandline switch that can disable this behaviour? Or the Debug builds would print a warning when ~MemoryView() has to decrement the exports counter? That would give us an idea how often this is occuring in the wild... Ideally it would only do it for cases when the underlying array is still in use, to avoid warning about benign cases like:

    python
    def f():
        b = bytearray()
        m = memoryview(b)
    
    f()
    

    Though I don't know if it is technically possible to differentiate.

    点赞 评论 复制链接分享

相关推荐