This pull request removes numpy as an extra package and instead implements the buffer API directly. To implement this pull request it was necessary to backport the speed patch from 0.8 so that I can directly derive from types. This does have minor API breakage in that the return types on array slices change and thus a few comparison tests in the test suite changed.
Implementation notes follow....
Buffers and NumPy removal
NumPy was used primarily for supplying Python buffers on slicing. Numpy has always been problematic for JPype. As an optional extra is may or may not be built into the distribution, but if it is compiled in it is required. Therefore, is itn't really on "extra". Therefore, removing it would make distribution for binary versions of JPype much easier.
NumPy returns on slicing is but one part of the three uses of Python buffers in JPype. Thus to properly remove it we need to rework to remove it we need to review all three of the paths. These paths are
- Conversion of Python buffers to Java arrays on setArrayRange.
- Conversion of Java arrays to slices (and then from slices to buffers so that they can be transferred to NumPy.)
- Connecting bytearray to Java direct byte buffers.
We reviewed and revised each of the paths accordingly.
On conversion of Python buffers, the implementation was dated from Python 2.6 era where there was no formal support for buffers. Thus the buffer implementation never consulted the buffer type to see what type of object was being transferred nor how the memory was oriented. This entire section had to be replaced and was given the same conversion rules as NumPy. Any conversion is possible (including lossy ones like float to bool). The rules to trigger the conversion is by slicing just as with NumPy. By replicating the rules of NumPy we hide the fact that NumPy is no longer used and increase typesafety. There was a number of cases where in the past it would reinterpret cast the memory that will now function properly. The old behavior was a useless side effect of the implementation and was unstable with machine architecture so not likely used be a user.
The getArrayRange portion has to be split into two pieces. Under the previous implementation the type of the return changes from Python list to NumPy array depending on the compile option. Thus the test suite tested different behaviors for each. In removing NumPy we replace the Java array slice return with a Java array. Thus the type is always consistent. The Java array that is returned is still backed by the same array as before and has the start, end, and step set appropriately. This does create one change in behavior as the slice now has left assignment (just like NumPy) and before it was a copy. It is difficult exercise this but to do so we have to copy a slice to a new variable then use a second array dereference to assign an element. Because we converted to either list or NumPy, the second dereference could not affect the original.
There is no way to avoid this behavior change without adding a large transaction cost. Which is why NumPy has the exact same behavior as our replacement implementation. We could in principle make the slice read only rather than allowing for double slicing, but that would also be an API change.
There is one other consequence of producing a view of the original having to do
with passing back to Java. As Java does not recognize the concept of an array
view we must force it back to a Java type at the JNI level. We will force copy
the array slice into a Java array at that time. Thus replicates the same
functionality. This induces one special edge case as
java.lang.Serializable which are both implemented by arrays must be aware of
the slice copy.
Before we trigger the conversion to NumPy or list by calling the slice without
[:] on the array. Under the new implementation this is
effectively a no-op. Thus we haven't broken or forced any changes in the API.
The third case of direct byte buffers also revealed problems. Again the type of the buffer was not being check resulting in weird reinterpret cast like behaviors. Additionally, the memory buffer interface was making a very bad assumption about the referencing of the buffer by assuming that referencing the object that holds the buffer is the same as referencing the buffer itself. It was working only because the buffer was being leaked entirely and was likely possible to break under situations as the leaked buffer essentially locked the object forever.
To implement all of this properly unfortunately requires making the Python wrapper of Java arrays a direct type. This is possible in JPype 0.8 series where we converted all classes to CPython roots, thus our only choice is to backport the JPype speed patch into JPype 0.7.
API change summary
- The type of a slice is no longer polymorphic but is always a Java array now.
- The unbounded slice is now a no op.
- Buffer types now trigger conversion routines rather than reinterpret casting the memory.
- Direct buffers now guard the memory layout such that they work only with mutable bytearray like types.
- Assignment of elements though double slicing of an array now affects the original rather than just doing nothing effective like before.
JPype Speed Patch
Speed has always been an issue for JPype. While the interface of JPype is great the underlying implementation is wanting. Part of this was choices made early in the development that favors ease of implementation over speed. This forced a very thin interface using Python capsules and then the majority of the code in a pure Python module.
The speed issue stems from two main paths. The first being the method resolution phase in which we need to consider each method overload argument by argument which means many queries applied to each object. The second is the object construction penality when we return a Java object back to Python. The object bottleneck is both on the cost of the wrappers we produce, but additionaly all of the objects that are constructed to communicate with a Python function. Every int, list, string and tuple we use in the process is another object to construct. Thus returning one object triggers dozens of object to be constructed.
We have addressed these problems in five ways
- Improve resolution of methods by removing the two phase approach to resolving a type match cutting the resolution time in half.
- Caching the types in C so that they don't have to go back to Python to execute a method to check the Python cache and construct if necessary.
- Converting all of the base types to CPython so that they can directly access the C++ guts without going back through entry points.
- Remove all Python new and init method from the Java class tree so that we don't leave C during the construction phase. Thus avoiding having to construct Python objects for each argument used during object construction.
- Adding a Java slot to so that we can directly access Java resources both increasing the speed of the queries and saving us an additional object during construction.
All of these are being implemented for JPype 0.8 series. For now we are backporting the last four to the JPype 0.7 series. The first is not possible to backport as it requires larger structural changes.
Lets briefly discuss each of the items
During method resolution the previous implementation had a two phase approach. First is tried to match the arguments to the Java types in canConvertToJava. In this each Java argument had to walk through each possible type conversion and decide if the conversion is possible. Once the we have the method overload resovled we then perform the conversion by calling convertToJava. That would then walk through each possible type conversion to find the right one and apply it. Thus we did tha work twice.
To prevent this from happening we need to reserve memory for each argument we are trying to convert. When we walk through the list and find a conversion rather than just returning true, we place a pointer to the conversion routine. That way when we call convertToJava we don't have to walk the list a second time but instead go straight to the pointer to get the routine to execute.
This change has two additional consequences. First the primary source of bugs in
the type conversion was a mismatch between canConvertToJava and convertToJava
thus we are removing that problem entirely. The second and more important to
the user is that the type system is now open. By installing a routine we can
now add a user rule. Therefore if we need
java.sql.TimeStamp to accept a
Python time object we just need to add this to the type conversion table at the
Python level. This is implemented in the ClassHints patch. About half of our
customizer code was to try to achieve this on a per method level. Thus this
elimiates a lot of our current Python customizer code. The remaining customizer
code is to rename Java methods to Python methods and that will remain.
Caching of Python wrappers
In the previous implementation there was a text keyed dictionary that was consulted to get type wrappers. To access it C++ called to a Python function that decided when to return a cached type and when to create a new one. This meant dozens of object constructed just to find the wrapper. To solve this we simply move the cache and add it to the JClass directly. We have to back reference the Python class so it can't go away while the JVM is running.
There is one section of code that also uses the wrapper dict in the customizers which needs to decided does a wrapper already exist for the customizer. We have replaced these calls with methods on the module.
Conversion of the Base classes
JPype has a number of base classes (object, primitive, exception, string, array) which hold the methods for the class. If they are implemented as pure Python than every access from C++ to these elements needs to create objects accordingly when then are passed back through the module entry points to get back to C++.
We can avoid this by implementing each of these in CPython first at the module layer and then extending them in the exposed module so that they have the same outward appearance as before.
We made one refinement during the conversion by implementing all of the CPython
classes using the Heep type API which has the distinct advantage that unlike
static types, it can be changed at runtime. Thus from Python we can add
behavior to the heap types simply with by using
type.__setattr_. This was
a bit of a challenge as the documentation on heap types is much more sparse than
for static types. However, after going through the process I would recommend
that all new CPython modules should use heap types rather than static as API is
much better and the result much more flexable and stable. The only downside
being the memory footprint increases from 400 bytes to 900 bytes. There are a
few rough spots in the heap type API in that certain actions like setting the
Buffer have to be added to the type after creation, but otherwise it is a big
improvement. Now if all of the documentation would just drop the old static API
in favor of heap types it would be great.
In order to benefit from moving all of the base classes to C, we have to make sure that derived classes do not transfer control back to Python. Currently this happens due to the factory nature of our classes. The entry point for JObject is shared between the construction of objects from Python and a return from Java. Thus we have to either separate the factory behavior by pushing those types out of the type tree or pushing the factory behavior into the C layer.
We have chosen to split the factories and use overrides of the type system in
the meta class to apply
issubtype behavior. We can further
restrict the type system if we need to by adding verifications that the
__init__ methods must point the original base class
implementations if need. Howver, we have not taken this step as of yet.
The split approach effectively removes these heavy elements from type creation.
The concequence of this is that means all of the rest of logic needs to be in
CPython implementation. These can be rather cumbersome at times.
It is always a slippery slope when pushing code from Python back to CPython. Some thing are needed as they are on the critical path while others are called only occasionally and thus represent no cost to leave in Python. On the other hand some things are easy to implement in CPython because the have direct access rather than having to go through a module entry point. We have gone with the approach that all critical path and all code the eliminates the need for an entry point should be pushed back to C.
In order to get any reasonable speed with Python, the majority of the code needs to be in C. But additionally there needs to be the use of slots which are hard coded locations to search for a particular piece of information. This presents a challenge when wrapping Java as we need a slot for the Java value structure which must appear on Python object, long, float, exception, and type. These different types each have their own memory layout and thus we can't just add the slot at the base as once something is added to the base object it can no longer be used in multiple inheritance. Thus we require a different approach.
Looking through the CPython source, we find they have the same quandary with
__weakref__ slots. Those slots do not appear one
the base object but get added along the way. The method they use is to add
those slots to the back of the object be increasing the basesize of the object
and then referencing them with two offset slots in the type object. If the type
is variable length the slot offset are negative thus referencing from the end of
the object, or positive if the object is a fixed layout.
Thus we tried a few formulations to see what would work best.
Broken tree approach
The problem with just directly adding the slots in the tree is that the Java
inheritance tree forces the order of the Python tree we have to apply. If we
add a slot to
java.lang.Object we have to keep the slot on
java.lang.Throwable but that is not possible because Throwable requires it to
be derived from Python
Exception. Thus if we are going to add a slot to the
base we would have to break the tree into pieces on the Python side. This is
possible due to Python inheritance hacking with some effort.
But this approach had significant down sides. When we go to access the slot we have to first figure out if the slot is present and if not then fall back to looking in the dictionary. But one of the most common cases is one in which the item has no slot at all. Thus if we have to both look for the slot and then hit the dictionary, this is worse than just going to the dictionary in the first place. Thus this defeats the point of a slot in many cases.
Python dict approach
We attempted the same trick by increasing the basesize to account for our extra slot. This leaves to difficulties. First, the slot has no offset so we need to find it each time by implying its location. Second, the extra objects have to be "invisible" during the type construction phase, or Python will conclude the memory layout of the object is in conflict. We can fool the type system by subtracting the extra space from the type during the declaration phase and then adding it back after the base types are created.
This approach failed because the "invisible" part is checked each and every time a new type is added to the system. Thus every dynamic type we add checks the base types for consistency and at some point the type system will find the inconsistency and cause a failure. Therefore, this system can never be robust.
Dict and weakref appear to be very special cases within the Python system and as there is no general facility to replicate them working within the system does not appear to be viable.
Memory hacking approach
The last system we attempted to mess with the memory layout of the object during the creation phase to append our memory after Pythons. To do this we need to override the memory allocator to allocate the requested memory plus our extra. We can then access this appended memory by computing the correct size of the object and thus our slot is on the end.
We can test if the slot is present by looking to see if both
tp_finalize point to our Java slot handlers. This means we are still
effectively a slot as we can test and access with O(1).
The downside of this approach is there are certain cases in which the type of an object can be changed during the destruction phase which means that our slot can point to the wrong space if the basesize is changed out from under us. To guard against this we need to close our type system by imposing a ClassMeta which closes off mixin types that do not inherit from one of the special case base classes we have defined.
The API implications should be small. There was never a functional case where extending a Java object within Python actually made sense as the Python portion is just lost when passed to Java and unlike Proxies there is no way to retrieve it. Further the extending a Java object within Python does not bind the lifespan of the objects so any code that used this is likely already buggy. We will properly support this option with `` at a latter point.
With this limitiation in mind, this appears to be the best implementation - It adds the slot to all of the required types. - The slot is immediately accessable using just two fields (basesize, itemsize) - The slot can be tested for easily (check tp_alloc, tp_finalize) - It closes the type system by forcing a meta class that guards against inappropraite class constuction.
We could in principle add the slot to the "front" of the Python object but that could cause additional issues as we would require also require overriding the deallocation slot to disappear our memory from the free. This is dangerous because any module that mistakenly called the Python generic object base free rather than the slot free would cause Python to crash. Of course that should already be a problem for the module with the bug, but as it would be a correct assumption in 99% of the cases it may not be exposed. Thus in our opinion adding to the tail is safer than adding to the head.
Speed patch implications
Other than improving the speed, the speed patch has a lot of below the hood
changes. So long as the user was not accessing private members there is no API
change, but everything below that is gone. All private symbols like
__javavalue__ as well as all exposed private members
vanish from the interface. There is no longer a distinction between class
java.lang.Class instances for purposes of the casting system.
The wrapper is an extension of a Python type and has the class methods and
fields, and the instance is an extension of a Python object without these.
Both hold Java slots to the same object. Therefore a lot of complexity of the
private portions is effectively removed from the user view. Every path now has
the same resolution, check the Java slot first and if not assume it is Python.
Two private methods now appear on the wrapper (though I may be able to hide them
from the user view.) These are the test entry points
_convertToJava. Thus the speed patch should be transparent all user code that
does not access our private members. That said some code like the Database
wrappers using JPype have roots in some code that did make access to the private
tables. I have sent corrections when we upgraded to 0.7 series thus making them
conforming enough not to touch the private members. But that does mean some
modules may be making such accesses out in the wild.
The good new is after the speed patch pretty much everything that is supposed to be under the hood is now out of the reach of the user. Thus the chance of future API breakage is much lower.
Below the hood changes
- We have tried to prevent backend changes from reaching the API, though this is not always entirely the case. The majority of the cases not already noted appear to be in the range of implementation side effects. The old implementation had bugs that create undefined behaviors like reinterpet casting buffers and the like. It is not possible to both fix the bugs in the backend and make preserve buggy behavours on the front end. We have limited our changes to only those for which we can see no desirable use existing. Calling a list slice assignment on a float from a numpy array of ints and getting a pile of gibberish was as far as we can tell not useful to the user.
- setResource is dropped in favor of caching the module dictionary once at start of the JVM. We have a lot of resources we will need and the setResources method was cumbersome.
- There is a lot of thrashing for the Python module style between C and C++ style. The determining blow was that C++ exception warning showed up when the proper linkage was given. Thus the perferred style flipped from C++ style to C. Thus the naming style change accordingly.
- In addition to the style change there is also an attempt to isolate symbols between the different classes. The older style with a formal header that declares all the symbols at the top encouraged access to the functions and increased the complexity. Moving to a C style and making everything static forces the classes to be much more independent.
- With the change to C style there is natural split in CPython class files between the structure declaration, static methods that implement Python API functions, the declaration of the type, and the exposed C++ style API used by the rest of the module.
- There is some thrashing on how much of the C++ wrapper style Python API to keep. The rewrapping of the API was mainly so support differences between Python 2.7 and 3.x. So we dropped where we could. Only the JPPyObject which acts as memory handling device over pure Python style (because Python style is not exception safe) is strongly needed.
- There is some spacing thrashing between different editors and the continuing
debate of why C was written to have the pointer stick to to the variable
rather than the type. When writing
Object* fooit implies that the star is stuck to type rather than the variable where C reads it as the opposite. Hence there is the endless churn between what is correct
Object *fooand what we would say in which
Object*is actually a type. As Python favors the former and we currently have the latter that means at some point we should just have formatter force consistency.
- We introduced Py_IsInstanceSingle. Is is a missing Python API function first for fast type check method of a single inherited type using the mro. Then something is singlely inherited we can bypass the whole pile of redirects and list searchs and just compare if the mro list matches at the same point counted from the end. As all of our base types are single inherited this saves time and dodges the more expensive calls that would trigger due to the PyJPClassMeta overrides.
- We introduced Py_GetAttrDescriptor. This was previously implemented in Python and was very slow. Python has very good implementations for GetAttr but unfortunately it has the behavior that it always dereferences descriptors. Thus it is useless in the case that we need to get the descriptor from the type tree. We have reimplemented the Python type search without the descriptor dereferencing behavior. It is not nearly as complete as the Python method as we only need to consult the dictionary and not handle every edge case. That of course means that we are much faster.
- As with every rewrite there is the need to cleanup. Anything that wasn't reached by the test bench or was identified as being called only from one location was removed or reencorperated back in into the function that call it.
Fixes #575 Fixes #573 Fixes #567 Fixes #564 Fixes #563 Fixes #542 Fixes #536 Fixes #528
12条回答 默认 最新
- 点赞 评论 复制链接分享
Wow, this is very detailed. I need time to review this. Thanks for all the efforts.点赞 评论 复制链接分享
The following code seems to create a copy, which it shouldn't in my opinion.
python import jpype import numpy as np jpype.startJVM() n = int(1e3) x = jpype.JArray(jpype.JDouble, 1)(n) view = np.array(x[:]) assert view == x == 0.0 view = 1 assert x == 1
The last assertion fails. So there is a copy made instead of viewing the memory inside the JVM. The old (and hacky) Numpy related code allowed to avoid copying memory around. Is there a way to steer this behaviour?点赞 评论 复制链接分享
The behavior on your test is unchanged from the previous version. It shows the same assertion error in both versions. (But for different reasons)
In either case JNI will make copies (and likely more than you are thinking.)
Analyzing the old code.
import jpype import numpy as np jpype.startJVM() n = int(1e3) x = jpype.JArray(jpype.JDouble, 1)(n) view = np.array(x[:])
Here we generate 2 copies of the data typically on the
[:]. 1) We call GetPrimitiveArrayCritical, JNI copies the memory from the Java space to the transfer space. 2) We then call memcpy to copy the transfer buffer into numpy. 3) We then release critical which may copy back again if commit was set.
The first one is supposed to be bypassable, if the JVM supports pinning of arrays, but thus far I haven't seen any cases where it did. (Perhaps it is size triggered).
We should have used GetArrayRegion instead which would have saved a copy as we could have gone straight from Java to numpy array backing.
assert view == x == 0.0 view = 1
So at this point the numpy copy is set to 1 but the Java array is still 0.
assert x == 1
Analyzing the new code.
import jpype import numpy as np jpype.startJVM() n = int(1e3) x = jpype.JArray(jpype.JDouble, 1)(n) view = np.array(x[:])
The slice operator does nothing here as we just get a view of the same array.
np.array(v) then calls for a memory buffer view of the array which produces a pair of Get/Release calls. This is slightly worse than before as we can't use Critical because we do not know how long we will hold the array. Same result on average which is two copies get triggered.
The flag to the buffer makes a difference here. If they specify "WRITE" then we will trigger a 3rd copy to push the transfer area back to the Java memory. I can do some auditing and see what triggers the copy. The four relevant cases are
memoryview, andnp.array` which all call for a Py_buffer to be created.
assert view == x == 0.0 view = 1
So at this point the numpy copy is set to 1 but the Java array is still 0.
assert x == 1 <== Fails because np.array requested a copy.
In general as far as I can tell in Python when you call a class ctor you expect a copy to be produced.
For example consider the following code
import numpy as np n = int(1e3) x = np.zeros(n) view = np.array(x[:]) <== User explicitly is requesting a new array that is a copy of the old assert view == x == 0.0 view = 1 assert x == 1
And we get the same assertion failed.
There is one huge difference though between the implementations...
Try this code on the old version and the new
import jpype jpype.startJVM() n = int(1e3) x = jpype.JArray(jpype.JDouble)(n) view = x[:] <== This should do nothing because we are just saying v is a slice of x assert view == x == 0.0 view = 1 assert x == 1 <== old fails assertion as we made a copy we should not have. New code passes.
CPython lacks a call to get a buffer to request direct transfers of bytes but instead favors request access. Thus I can't make use of the JNI short cuts for getting regions. Thus I am forced to use calls that provide for full access which are less efficient.点赞 评论 复制链接分享
By the way thanks for the quick efforts to get the review started. This pull request is much larger than I was originally hoping as I needed to make use of the direct access to CPython types that I implemented in JPype 0.8. I expect that you will need some level of comments (and possible revision) on the level of memory hacking that I required on the speed patch. I plan to take then notes posted on the pull request when I update the devel guide. At times it felt like I was being forced to copy whole sections of CPython onto a piece of paper as psuedo code and then recomposing them back because CPython lacks simple things like
Unfortunately there will be implementation implications that leak from this. Java JNI is really sloppy about supporting pinning of arrays and I can't guarantee that even with the current efforts that I have covered all the cases. Sometimes it gives a copy and you can do whatever you want until a commit is forced. Sometimes it gives direct access which means anything goes. If you open more than one view to the same array (by casting to an object and back so the views don't share) you can get arguments where the last one to commit rules. This results in frustrating and sometimes really hard to explain behavior. I had to explicitly create views with copies and then use del to remove references to actually get it to pretend to write through. Worse the copy can fail and I am not sure I have accounted for this.
If you want me to try to force the views to all be read only (and thus save us from explaining how JNI functions) I would completely agree to it as it was beyond counter intuitive to me and I know this stuff pretty well.点赞 评论 复制链接分享
- weixin_39995439 2020-12-30 10:05
diff @@ Coverage Diff @@ ## master #577 +/- ## ========================================== - Coverage 54.31% 50.89% -3.43% ========================================== Files 95 99 +4 Lines 6777 7347 +570 Branches 2770 3340 +570 ========================================== + Hits 3681 3739 +58 - Misses 1347 1451 +104 - Partials 1749 2157 +408
| Impacted Files | Coverage Δ | | |---|---|---| | native/common/jp_monitor.cpp |
44.44% <ø> (ø)| :arrow_up: | | native/common/include/jp_javaframe.h |
56% <ø> (ø)| :arrow_up: | | native/common/include/jp_primitivetype.h |
100% <ø> (ø)| :arrow_up: | | native/common/include/jp_arrayclass.h |
100% <ø> (ø)| :arrow_up: | | native/common/jp_primitivetype.cpp |
83.33% <ø> (ø)| :arrow_up: | | native/common/include/jp_proxy.h |
100% <ø> (ø)| :arrow_up: | | jpype/_jcomparable.py |
100% <ø> (ø)| :arrow_up: | | native/common/include/jp_methodoverload.h |
100% <ø> (ø)| :arrow_up: | | native/common/include/jp_env.h |
33.33% <ø> (ø)| :arrow_up: | | native/common/include/jp_method.h |
100% <ø> (ø)| :arrow_up: | | ... and 97 more | |点赞 评论 复制链接分享
rebased and squash to allow for proper coverage check.点赞 评论 复制链接分享
I disabled the ability to write back to Java memory with memory buffers for now. It was so counter intuitive that no one is likely to every use it.点赞 评论 复制链接分享
I hereby swear that this pull request is actually ready to go despite my two days of vane attempts to increase the coverage score. I did turn up 3 segfaults on edge cases so it wasn't wasted effort. Unfortunately 50% of the C++ code is aggressive checking for failures that i can't seem to exercise without creating testing hooks in the code.点赞 评论 复制链接分享
I will put in a followup pull request with the multidimensional array memory buffer support. It appears like it is a very big speed up once I came up with a way to get most of the memory transfer performed on the Java side. I may be able to speed it up more with a direct byte buffer but it should be enough to close #133.点赞 评论 复制链接分享
I have actually forgotten how to do the direct memory transfers on the Python side, but I'm sure it was possible. I just hacked this feature back then, since I needed to avoid memcpy calls at any cost, because the data would blow main memory easily. The depended project is dead for a long time, so I also personally believe nobody really relies upon it.
I think we can merge this now.点赞 评论 复制链接分享
Merged so that I can resume coverage checks (without a massive patch sitting cluttering everything up). We have just a few outstanding patch requests that to get through and we can consider a release.点赞 评论 复制链接分享