Graceful degradation of ObjectSizeCalculator for non-HotSpot JVMs
#14491 opened on Nov 30, 2025
Description
https://github.com/apache/incubator-hudi/issues/860 bug report
JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-234
- Type: Bug
- Affects version(s):
- 0.5.0
- Fix version(s):
- 1.1.0
Comments
30/Aug/19 22:10;xleesf;As indicated in the post [sizeOfAnObject|[https://stackoverflow.com/questions/52353/in-java-what-is-the-best-way-to-determine-the-size-of-an-object/30021105]], I think using Instrumentation, which means defining a MANIFEST.MF and launching a Java agent just to determine the size of an object, is too heavy. Any other suggestions? [~vinoth];;;
02/Sep/19 19:12;DavisBroda;Probably worth getting an Instrumentation version in the product, just to get things working. Then replace it later if it's too heavy.;;;
03/Sep/19 02:45;vinoth;+1 to [~DavisBroda] 's suggestion. First step could be just making it work.
As for specifically using the Instrumentation framework, it seems like you just launch the JVM with the agent once and from then on you can fetch the size estimates? [~xleesf] that seems ok to me, unless this also "profiles" the app per se, which could make it really slow.
Moreover, what concerns me more from that stackoverflow thread are things like: "I tried this and got strange and unhelpful results. Strings were always 32, regardless of size." If it cannot estimate the size of complex object graphs, then I am not sure how useful it is.
I think we can implement a graceful fallback which uses some approximation or always assumes a certain fixed, configurable size, i.e. a FakeEstimator, which may cause additional spilling but at least works. For example, if you said all objects are 1MB and they end up being 1KB, you just spill a lot, but things still work. Does this approach make sense?
Phase 2 after this could be finding a very performant and accurate object size estimator that works across JVMs (if such a thing exists :) )
;;;
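A minimal sketch of the fixed-size fallback idea above, for illustration only. The SizeEstimator interface here is a hypothetical stand-in declared locally so the example compiles on its own, and the FixedSizeEstimator name and constructor are assumptions, not existing project code.

```java
// Hypothetical stand-in for the project's estimator abstraction;
// the real interface may differ.
interface SizeEstimator<T> {
    long sizeEstimate(T t);
}

// Assumes every object occupies the same configurable number of bytes.
final class FixedSizeEstimator<T> implements SizeEstimator<T> {
    private final long assumedSizeBytes;

    FixedSizeEstimator(long assumedSizeBytes) {
        if (assumedSizeBytes <= 0) {
            throw new IllegalArgumentException("assumedSizeBytes must be positive");
        }
        this.assumedSizeBytes = assumedSizeBytes;
    }

    @Override
    public long sizeEstimate(T t) {
        // The argument is ignored on purpose: no JVM-specific introspection is needed.
        return assumedSizeBytes;
    }
}
```

Overestimating the size (say 1MB for records that are really 1KB) only makes the caller spill earlier than necessary; underestimating risks holding too much in memory, so a deliberately large default is the safer choice.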
03/Sep/19 11:49;xleesf;I used Instrumentation to test string size with different values:
"aaaaa",
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa". They all reported 24. So even if we introduce the Instrumentation framework, it may not be as useful as expected. Hence I think we may implement a FakeEstimator as [~vinoth] said.;;;
03/Sep/19 13:35;DavisBroda;Instrumentation is a shallow check, so it's actually noticing the string as an object, and giving the basic size it gives any object, with maybe some shallow consideration of subfields (it sees a char array, and returns the minimum size an array can have, not considering contained elements). Using instrumentation to get a more accurate estimate is more complicated than just calling getObjectSize: it requires duplicating a lot of the current ObjectSizeEstimator's logic for recursively going through an object's subfields, array elements, etc. Not an elegant solution at all, but I don't know of any other API that will work on arbitrary JVMs.
I wonder how different various JVMs are in memory handling. If they aren't that different, then maybe we could just take the value a given object would have in HotSpot JVM, and multiply by some safety factor (1.5? or 2?) to get something a bit larger.;;;
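To make the "recursively going through subfields and array elements" point concrete, here is a rough sketch of a deep-size walk built on top of any shallow-size function. The DeepSizeWalker class is hypothetical; the ToLongFunction parameter stands in for something like Instrumentation::getObjectSize, which would require a Java agent to obtain. Fields that reflection cannot open (for example JDK internals under the module system) are simply skipped, so the result can undercount.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.Set;
import java.util.function.ToLongFunction;

// Hypothetical helper: sums a shallow-size function over every object reachable from a root.
final class DeepSizeWalker {
    private final ToLongFunction<Object> shallowSize;

    DeepSizeWalker(ToLongFunction<Object> shallowSize) {
        this.shallowSize = shallowSize;
    }

    long deepSize(Object root) {
        if (root == null) {
            return 0L;
        }
        // Identity-based set so shared or cyclic references are counted once.
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Object> pending = new ArrayDeque<>();
        visited.add(root);
        pending.push(root);
        long total = 0L;
        while (!pending.isEmpty()) {
            Object current = pending.pop();
            total += shallowSize.applyAsLong(current);
            Class<?> type = current.getClass();
            if (type.isArray()) {
                // Only reference arrays have elements that are objects in their own right.
                if (!type.getComponentType().isPrimitive()) {
                    for (int i = 0; i < Array.getLength(current); i++) {
                        enqueue(Array.get(current, i), visited, pending);
                    }
                }
                continue;
            }
            // Walk instance fields declared on the class and all its superclasses.
            for (Class<?> c = type; c != null; c = c.getSuperclass()) {
                for (Field field : c.getDeclaredFields()) {
                    if (Modifier.isStatic(field.getModifiers()) || field.getType().isPrimitive()) {
                        continue;
                    }
                    try {
                        field.setAccessible(true);
                        enqueue(field.get(current), visited, pending);
                    } catch (RuntimeException | IllegalAccessException e) {
                        // Field could not be opened or read; skip it and accept an undercount.
                    }
                }
            }
        }
        return total;
    }

    private static void enqueue(Object child, Set<Object> visited, Deque<Object> pending) {
        if (child != null && visited.add(child)) {
            pending.push(child);
        }
    }
}
```

With an agent in place, the walker could be constructed as new DeepSizeWalker(instrumentation::getObjectSize), where instrumentation is the instance handed to the agent's premain method.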
04/Sep/19 09:34;xleesf;To make it work on other JVMs first, we could use the value on the HotSpot JVM as a reference and multiply it by a factor to estimate the value on other JVMs. But how should we determine the factor: a different factor for each JVM, or the same factor for all JVMs?
cc [~vinoth] [~DavisBroda];;;
04/Sep/19 14:20;DavisBroda;I don't know enough about different JVMs to say whether different factors would be worthwhile in the long term, but in the short term getting a single factor that works would be preferable, followed by updates later if it proves to be either too large or too small on other JVMs.
As to how to determine the factor, probably get a group of sample objects and write an instrumentation implementation in a test environment. In a simpler test environment, with objects selected so that they are easier to run through instrumentation, it should be easier to get numbers returned. See what the maximum difference is between the two, and then add a bit more for safety. That should give us a good starting value, to be updated if further issues arise.;;;
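One way the calibration described above could be expressed, as a sketch only. The SafetyFactorCalibrator class, its method names, and the idea of expressing "a bit more for safety" as a relative margin are all assumptions made for illustration. It takes the HotSpot estimate and the size measured on the other JVM for each sample object, finds the worst ratio, and pads it.

```java
// Hypothetical helper for deriving a single safety factor from sample measurements.
final class SafetyFactorCalibrator {
    private SafetyFactorCalibrator() {
    }

    // hotspotSizes[i] and measuredSizes[i] must describe the same sample object.
    // margin is the extra headroom as a fraction, e.g. 0.2 for 20 percent.
    static double calibrate(long[] hotspotSizes, long[] measuredSizes, double margin) {
        if (hotspotSizes.length == 0 || hotspotSizes.length != measuredSizes.length) {
            throw new IllegalArgumentException("need one measurement pair per sample object");
        }
        double worstRatio = 1.0; // never shrink the HotSpot-based estimate
        for (int i = 0; i < hotspotSizes.length; i++) {
            if (hotspotSizes[i] <= 0) {
                throw new IllegalArgumentException("HotSpot size must be positive");
            }
            double ratio = (double) measuredSizes[i] / hotspotSizes[i];
            worstRatio = Math.max(worstRatio, ratio);
        }
        return worstRatio * (1.0 + margin);
    }

    // Applies the factor to a HotSpot-based estimate, rounding up.
    static long scaledEstimate(long hotspotSize, double factor) {
        return (long) Math.ceil(hotspotSize * factor);
    }
}
```

The factor is only as good as the sample set: objects whose layout differs most between JVMs (deeply nested graphs, many small arrays) should be represented, otherwise the worst ratio is underestimated.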
04/Sep/19 15:58;vinoth;First of all, this is a solid and very interesting discussion. So thanks everyone :) for contributing
"If they aren't that different, then maybe we could just take the value a given object would have in HotSpot JVM, and multiply by some safety factor (1.5? or 2?) to get something a bit larger."
Good point. But the object sizes could vary dynamically, right? We may not be able to get baseline HotSpot values on a non-HotSpot JVM even if we figure out the safety factors? For example, the HoodieRecord object can have an avro payload and its size could vary based on schema, compression codec, etc.
Let me take a stab at how we use the estimation across the code and see if there is a simpler alternative .. cc [~nishith29] who wrote most of this code.
;;;
06/Sep/19 15:27;vinoth;ObjectSizeCalculator is already abstracted under a SizeEstimator interface, so it should be easy to provide a new implementation. We use it in two places: SpillableMap, to control how much of the hash merge map we retain in memory, and the buffer between the shuffle read and parquet write threads.
ObjectSizeCalculator has a MemoryLayoutSpec defined and it may be possible to provide one for IBM or other JVMs? Maybe we can just raise an issue against the original repo, twitter/commons? ;;;
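Since the calculator already sits behind an estimator interface, graceful degradation could be a thin wrapper around it. The sketch below is illustrative only: SizeEstimator is redeclared locally as a hypothetical stand-in, FallbackSizeEstimator is an invented name, and the assumption that the primary estimator fails with a RuntimeException or LinkageError on an unsupported JVM has not been checked against the actual calculator.

```java
// Hypothetical stand-in for the project's estimator abstraction;
// the real interface may differ.
interface SizeEstimator<T> {
    long sizeEstimate(T t);
}

// Tries a precise estimator first and permanently switches to a fallback once it fails.
final class FallbackSizeEstimator<T> implements SizeEstimator<T> {
    private final SizeEstimator<T> primary;
    private final SizeEstimator<T> fallback;
    private volatile boolean primaryUsable = true;

    FallbackSizeEstimator(SizeEstimator<T> primary, SizeEstimator<T> fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    @Override
    public long sizeEstimate(T t) {
        if (primaryUsable) {
            try {
                return primary.sizeEstimate(t);
            } catch (RuntimeException | LinkageError e) {
                // Assumed failure mode on an unsupported JVM; stop retrying
                // so the cost of the failure is paid only once.
                primaryUsable = false;
            }
        }
        return fallback.sizeEstimate(t);
    }
}
```

The fallback could be a fixed, configurable size, or later a JVM-specific layout if one becomes available upstream.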
09/Sep/19 02:16;xleesf;+1 to raise an issue against the original twitter/commons repo. And listen to the community's ideas before we implement a new one.
I filed an issue to track this. [issue|[https://github.com/twitter/commons/issues/484]] cc [~vinoth];;;