Is MSVC's dynamic memory allocation odd?

Ok, I was using AMD CodeAnalyst (great program, btw) and looked at the assembly behind one of the lines in my program. It was this line:

riTileOrder *soa = new riTileOrder[NodeCount];

Then I looked at the assembly behind this line:

riTileOrder *soa = (riTileOrder *) malloc(sizeof(riTileOrder)*NodeCount);

Which does the same exact thing, you'll notice. Their assembly output (in MSVC) is as follows, respectively:
[code]
; Clock cycle counts based off the Pentium II
; using new, approx 23 clock cycles
mov eax,[ebp-14h]
shl eax, 02h
push eax
call $+00001e15h; (00412680)
add esp, 04h
mov [ebp-58h],eax
mov ecx,[ebp-58h]
mov [ebp-40h],ecx

; using malloc, approx 15 clock cycles
mov eax,[ebp-14h]
shl eax, 02h
push eax
call $+00001ec5h; (00412730)
add esp, 04h
mov [ebp-40h], eax
[/code]
Both of the calls point to the exact same code (I checked); it's just that the offset is different between compiles. Now, why the heck does the malloc version take 8 fewer clock cycles than the one using new? Is this some trick that MSVC's optimizer pulls, or what?

I suck at assembly. I know basically what these instructions do, but not enough to figure out what the difference is.

I'm going to run a test in a moment to see if MSVC's implementation of malloc is actually faster than MSVC's implementation of new (for some unknown reason).

http://druidgames.cjb.net

Comments

  • Well, new doesn't appear to be any slower than malloc (I ran each test a lot of times, back to back, and took the lowest allocation time for each). They both had the same times. But I still don't see why one is 8 clock cycles more than the other on a Pentium II.

    http://druidgames.cjb.net


  • : Ok, I was using AMD CodeAnalyst (great program, btw) and looked at the assembly behind one of the lines in my program. It was this line:
    :
    : riTileOrder *soa = new riTileOrder[NodeCount];
    :
    : Then I looked at the assembly behind this line:
    :
    : riTileOrder *soa = (riTileOrder *) malloc(sizeof(riTileOrder)*NodeCount);
    :
    : Which does the same exact thing, you'll notice. Their assembly output (in MSVC) is as follows, respectively:
    : [code]
    : ; Clock cycle counts based off the Pentium II
    : ; using new, approx 23 clock cycles
    : mov eax,[ebp-14h]
    : shl eax, 02h
    : push eax
    : call $+00001e15h; (00412680)
    : add esp, 04h
    : mov [ebp-58h],eax
    : mov ecx,[ebp-58h]
    : mov [ebp-40h],ecx
    :
    : ; using malloc, approx 15 clock cycles
    : mov eax,[ebp-14h]
    : shl eax, 02h
    : push eax
    : call $+00001ec5h; (00412730)
    : add esp, 04h
    : mov [ebp-40h], eax
    : [/code]
    : Both of the calls point to the exact same code (I checked); it's just that the offset is different between compiles. Now, why the heck does the malloc version take 8 fewer clock cycles than the one using new? Is this some trick that MSVC's optimizer pulls, or what?

    That code sure doesn't look like the optimized version.. is this the debug build?

    : I suck with assembly,

    Me too. Don't know it at all.

    : I'm going to run a test in a moment to see if MSVC's implementation of malloc is actually faster than MSVC's implementation of new (for some unknown reason).

    Well, you can say one thing for sure: new can't possibly be faster than malloc here, since MSVC's new uses malloc to do the actual allocation. The 'extra' stuff that new does is:

    1) automatically determine how large a block is necessary, based on the type. This is exactly the same as malloc(sizeof(MyClass))

    2) call any constructor(s) that need calling.

    If you're allocating storage for a builtin type (i.e. no constructor), the new operator should resolve to a call to malloc. No difference.
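    To make that concrete, here's a minimal sketch; TileOrder is just a stand-in I made up for a constructor-less type like riTileOrder:

    ```cpp
    #include <cassert>
    #include <cstdlib>

    // Stand-in for riTileOrder: no constructor, no destructor.
    struct TileOrder {
        void *pTile;
    };

    int main() {
        const int NodeCount = 100;

        // Both lines request NodeCount * sizeof(TileOrder) bytes; for a
        // type like this, new[] has no constructors to run afterwards.
        TileOrder *a = new TileOrder[NodeCount];
        TileOrder *b = (TileOrder *) malloc(sizeof(TileOrder) * NodeCount);
        assert(a != 0 && b != 0);

        delete [] a;   // memory from new[] must be released with delete[]
        free(b);       // memory from malloc with free -- never mix the two
        return 0;
    }
    ```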

    Looking at the assembly dumps of MSVC++, gcc, and Borland C++, I see that not all compilers are equal in this regard. Both MSVC++ and gcc will just call malloc for built-in types. In other words, using malloc() and new to allocate an array of ints generates identical code. However, Borland C++ has a lot of code surrounding that call to malloc(). This may be because Borland is the most standards-conforming of these compilers: new is supposed to throw an exception if the allocation fails. Neither MSVC++ nor gcc does this; Borland does.

    I also noticed that if the object you're creating has a constructor, all three compilers check the return value of malloc for NULL and won't call the constructor if the allocation fails: MSVC++ and gcc just skip the constructor call (!), while Borland throws.
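    For reference, here is roughly what the conforming behavior looks like; std::nothrow is the standard way to get malloc-style NULL returns instead of an exception (small allocations here so the sketch actually runs):

    ```cpp
    #include <cassert>
    #include <new>      // std::bad_alloc, std::nothrow

    int main() {
        // A conforming new throws std::bad_alloc on failure...
        try {
            int *p = new int[16];
            delete [] p;
        } catch (const std::bad_alloc &) {
            // only reached if the allocation actually fails
        }

        // ...while nothrow new returns a null pointer, like malloc.
        int *q = new (std::nothrow) int[16];
        assert(q != 0);
        delete [] q;
        return 0;
    }
    ```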

    Anyway, interesting stuff to poke around in...

    Cheers,
    Eric


  • Yeah, they're the same speed when run repeatedly; otherwise the first run is always slower (as should be expected). I was going through the source MS provides for their implementation of new, and all it does is call a modified malloc that determines how much memory to allocate.

    The riTileOrder class doesn't have a constructor or destructor; it is just meant to wrap a pointer inside of it. MSVC produces noticeably slower (release mode) code when using a pointer object, compared to when using a class with one pointer inside of it. I have no idea why, and I haven't found a good way to test this yet.

    I think it is some trick the optimizer pulls when using classes compared to pointers. It doesn't seem to apply except when I'm sorting the objects. I use a rewritten and heavily optimized qsort to sort everything (it's a lot faster than either Borland's or MSVC's qsort; I didn't write it, though). Might MSVC optimize one of the operations a qsort must perform when using structs/classes but not when using a raw pointer?

    Weird stuff... :)


    http://druidgames.cjb.net


  • : Yeah, they're the same speed when run repeatedly; otherwise the first run is always slower (as should be expected). I was going through the source MS provides for their implementation of new, and all it does is call a modified malloc that determines how much memory to allocate.

    Well, new is an operator, so the compiler should be at liberty to make no function call at all. The size of the type is known at compile time (the sizeof() operator generates a compile-time constant), so there's no reason that:

    [code]char* p = new char[100];[/code]should not generate [italic]exactly[/italic] the same code as

    [code]char* p = (char*)malloc(100);[/code]In my test, these two lines of code generated identical machine code in the release build under MSVC++. I'm surprised that you found it doing something different...

    : The riTileOrder class doesn't have a constructor or destructor, it is just meant to wrap a pointer inside of it. MSVC produces noticeably slower (release mode) code when using a pointer object, compared to when using a class with one pointer inside of it. I have no idea why, I haven't found a good way to test this yet.

    I'm not sure I understand you... do you mean that using this pointer:

    [code]string* p1;[/code]Is somehow slower than using the same pointer wrapped in a class like this:

    [code]struct MyPointerWrapper
    {
    string* p1;
    };[/code]What exactly do you mean by slower? In what situations?

    : I think it is some trick the optimizer pulls when using classes compared to pointers.

    I don't know what you mean by 'classes compared to pointers'. Do you mean objects on the stack vs. objects on the heap?

    : It doesn't seem to apply except when I'm sorting the objects. I use a rewritten and heavily optimized qsort to sort everything (it's a lot faster than either Borland's or MSVC's qsort; I didn't write it, though). Might MSVC optimize one of the operations a qsort must perform when using structs/classes but not when using a raw pointer?

    I don't know what you mean by "using structs/classes but not when using a raw pointer". Can you elaborate?

    Also, I should mention that the C++ library sort() algorithm can be much faster than any C-style sort that uses a callback function. Why? Because std::sort is a template, and the comparison function is a template parameter instead of a callback. If you supply an inline function (or functor) for this parameter, the comparison ends up being done inline inside the algorithm; no function-call overhead for each comparison... much faster. Templates were the mechanism that finally dethroned Fortran as the speed king of numerical computing.
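    A minimal sketch of the difference (the names here are mine, just for illustration):

    ```cpp
    #include <algorithm>  // std::sort
    #include <cassert>
    #include <cstdlib>    // qsort

    // qsort must reach this through a function pointer on every comparison.
    static int compare_ints(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    // std::sort's comparator is a template parameter, so a functor like
    // this can be inlined at each comparison site.
    struct Less {
        bool operator()(int a, int b) const { return a < b; }
    };

    int main() {
        int a[] = { 3, 1, 2 };
        int b[] = { 3, 1, 2 };

        qsort(a, 3, sizeof(int), compare_ints);   // indirect call per compare
        std::sort(b, b + 3, Less());              // comparator can be inlined

        assert(a[0] == 1 && a[1] == 2 && a[2] == 3);
        assert(b[0] == 1 && b[1] == 2 && b[2] == 3);
        return 0;
    }
    ```

    Both produce the same sorted result; the difference is purely in the call overhead per comparison.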

    Cheers,
    Eric


  • Hmm, I could probably rewrite the advanced qsort I'm using now to take an inline function as a template parameter; that's actually a good idea :). I want to use a custom qsort because I don't want someone using my engine to be at the mercy of their compiler's supplied one, since it is used in speed-critical parts of the engine.

    And when I say slower code, it was literally moving slower. I use a timer to scroll around the screen at a certain rate of pixels per second, and the motion became more jerky; the frame rate dropped about 5 fps just from changing "riSpriteOrder" (it is the same as riTileOrder, except that it is for my sprite-rendering routine; it contained a pointer to a class derived from riBaseSprite) to "riBaseSprite *". I can't explain why, and I haven't tried it under any other compilers (also, I now store other stuff within riSpriteOrder too, including a texture ID and sort position, so it is easier to use a class now).

    I really can't explain why or how it happened. It makes no sense to me, but I figured using a class instead of a pointer isn't that hard, so I just changed it back.

    http://druidgames.cjb.net


  • new:
    [code]mov [ebp-58h],eax
    mov ecx,[ebp-58h]
    mov [ebp-40h],ecx[/code]

    malloc:
    [code]mov [ebp-40h], eax[/code]

    I just pasted the instructions after the call is done.

    I'm not too good at assembly, but I think there is some value in understanding this. I tried this same code in VC++ 6.0 in both debug and optimized builds, and I get the same result each time.

    After the call to malloc/new, I was under the assumption that the return value would be found in eax, so to me the malloc case makes sense. What is the version using new doing with ecx? What is it for? Is it just me, or does it look like it is storing the return value in memory, copying it from memory back into ecx, and then finally writing it back to memory at its final destination...?

    Please enlighten?

  • I've seen member function calls in C++ where ecx seemed to contain a pointer to this...
    Is there something similar going on here? Why would it put the address of the new memory into another place in memory before placing it in its final resting place?

  • : Hmm, I could probably rewrite the advanced qsort I'm using now to take an inline function as a template parameter; that's actually a good idea :). I want to use a custom qsort because I don't want someone using my engine to be at the mercy of their compiler's supplied one, since it is used in speed-critical parts of the engine.
    :
    : And when I say slower code, it was literally moving slower. I use a timer to scroll around the screen at a certain rate of pixels per second, and the motion became more jerky; the frame rate dropped about 5 fps just from changing "riSpriteOrder" (it is the same as riTileOrder, except that it is for my sprite-rendering routine; it contained a pointer to a class derived from riBaseSprite) to "riBaseSprite *". I can't explain why, and I haven't tried it under any other compilers (also, I now store other stuff within riSpriteOrder too, including a texture ID and sort position, so it is easier to use a class now).
    :
    : I really can't explain why or how it happened. It makes no sense to me, but I figured using a class instead of a pointer isn't that hard, so I just changed it back.

    You still haven't explained what you mean by "using a class instead of a pointer". Out of context (i.e. without code) that statement doesn't make a lot of sense. The only thing I can really imagine it to mean is an object on the stack vs. one on the heap, as in:

    [code] MyObject a;
    MyObject* b;[/code]Where a is "using a class" and b is "using a pointer". I could see how this would affect performance by getting rid of one level of indirection and, more importantly, by allowing C++ to generate a static call instead of a dynamic (polymorphic) call. Is this what you're talking about?
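    To illustrate what I mean by static vs. dynamic calls (hypothetical classes):

    ```cpp
    #include <cassert>

    struct Base {
        virtual int value() const { return 1; }
        virtual ~Base() {}
    };
    struct Derived : Base {
        int value() const { return 2; }
    };

    int main() {
        Derived d;
        Base *p = &d;

        // Through the object itself, the compiler knows the exact type,
        // so the call can be resolved (even inlined) at compile time.
        assert(d.value() == 2);

        // Through a Base pointer, the call goes via the vtable at run
        // time -- an extra indirection on every call.
        assert(p->value() == 2);
        return 0;
    }
    ```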

    Curious,
    Eric


  • I meant like this:
    [code]
    // Class
    class riTileOrder {
    public:
    riTile *pTile;
    };

    // Pointer
    riTile *pTile;
    [/code]
    Now do you see why I'm completely confused why that would create different speeds?
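    To show why it confuses me: with no virtual functions and a single member, the wrapper is the same size as the bare pointer (riTile here is just a dummy stand-in for my real class), so I'd expect the compiler to treat them identically:

    ```cpp
    #include <cassert>

    struct riTile { int dummy; };   // stand-in; the real class doesn't matter here

    class riTileOrder {
    public:
        riTile *pTile;
    };

    int main() {
        // No virtual functions, one member: the wrapper is just the pointer.
        assert(sizeof(riTileOrder) == sizeof(riTile *));
        return 0;
    }
    ```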

    http://druidgames.cjb.net


  • : I meant like this:
    : [code]
    : // Class
    : class riTileOrder {
    : public:
    : riTile *pTile;
    : };
    :
    : // Pointer
    : riTile *pTile;
    : [/code]
    : Now do you see why I'm completely confused why that would create different speeds?

    I can see how you would be confused if you thought [italic]that[/italic] was causing your performance hit, but I highly doubt that's the case. It just doesn't make sense. The problem [italic]has[/italic] to be somewhere else, in the code that was changed when switching from the pointer to the class wrapper.

    Not knowing your code, I have no idea how this gets used and where the bottleneck really is... but it should be easy enough to prove that it has nothing to do with that (above). Either that, or I need to take up knitting or something... anything but programming...

    Cheers,
    Eric

  • As soon as I changed it back from the pointer to the class, the slowdown went away (it's been a while since then; I rewrote my entire rendering routines for the new version of the engine, 1.1). I have no way of proving that it was actually that causing it (I really hope it wasn't; I can't see how the compiler would produce any different code, unless someone making MSVC was drunk throughout the production process). But I wish I could tell for certain. My code has changed way too much to test it now (I rewrote all of it when I started on the newer version).

    BTW: All of my new code is on my website if you ever get really bored and want to tell me how bad my abuse of OOP really is =P (I'm not trying to make it OOP, but everyone can pretend I am if they want :). The stupid server always goes down when I want to do something to it (I can't test out my PHP code without uploading it...).

    http://druidgames.cjb.net


  • : Ok, I was using AMD CodeAnalyst (great program, btw) and looked at the assembly behind one of the lines in my program. It was this line:

    Why not simply generate an assembly-language listing?

    : Which does the same exact thing, you'll notice. Their assembly output (in MSVC) is as follows, respectively:

    Only in debug mode. Compiling with /Ox gives:

    [code]mov eax, DWORD PTR ?NodeCount@@3HA ; NodeCount
    lea ecx, DWORD PTR [eax*8]
    push ecx
    call ??2@YAPAXI@Z ; operator new
    add esp, 4
    mov DWORD PTR ?soa@@3PAVriTileOrder@@A, eax ; soa[/code]

    and

    [code]mov eax, DWORD PTR ?NodeCount@@3HA ; NodeCount
    lea ecx, DWORD PTR [eax*8]
    push ecx
    call _malloc
    add esp, 4
    mov DWORD PTR ?soa@@3PAVriTileOrder@@A, eax ; soa[/code]

    respectively. (Obviously, I have no idea what's [italic]in[/italic] your structures.) Similarly,

    [code]class priTileOrder
    {
    riTileOrder *p;
    };

    void test3()
    {
    soa2 = new riTileOrder *[NodeCount];
    }
    void test4()
    {
    soa3 = new priTileOrder[NodeCount];
    }[/code]

    produces *exactly* the same code for both functions. Are you compiling without full optimisation, or do you have other flags enabled?
    --
    [italic][blue]Sunlight[/blue][/italic]


  • I wasn't trying to generate a listing; I just happened to see the assembly and was curious about why it was different in each case. I know that new and malloc shouldn't be any different, but they were in these cases. It may have been a debug build; whatever CodeAnalyst wanted me to use is what I used :). I'm not trying to do the whole C-vs-C++ thing, I was just trying to find out if MSVC is weird :).


    http://druidgames.cjb.net

