Well, technically…

Pass by reference, return by value, etc.
Thursday 8th January 2009

Typically an assignment operator is declared like this.

MyClass& operator=( const MyClass& );

If you’re implementing the copy-and-swap idiom for the sake of exception safety, you might write something like this:

MyClass& operator=( const MyClass& other )
{
    MyClass copy( other );
    Swap( copy );
    return *this;
}

In this case we are copying a parameter which is passed by const reference. Couldn’t we just take it by value instead?

Well, yes, of course we can.

MyClass& operator=( MyClass other )
{
    Swap( other );
    return *this;
}
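Here is a minimal, self-contained sketch of the by-value version. It assumes a hypothetical MyClass that owns a std::vector<int> and counts its copy constructions in a static member; both are illustrative additions, not part of the original. Compiled as C++17, assigning from a named object makes one copy into the parameter, while assigning from a temporary makes none, because the parameter is initialised directly from the temporary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical MyClass for illustration: owns a buffer and counts how
// many times its copy constructor runs.
class MyClass
{
public:
    static int copies;

    explicit MyClass( std::size_t n = 0 ) : data_( n ) {}
    MyClass( const MyClass& other ) : data_( other.data_ ) { ++copies; }

    // Copy-and-swap with the parameter taken by value: any copy is made
    // by the caller when it initialises 'other'.
    MyClass& operator=( MyClass other )
    {
        Swap( other );
        return *this;
    }   // 'other' now holds our old state and is destroyed here

    void Swap( MyClass& other ) noexcept { data_.swap( other.data_ ); }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<int> data_;
};

int MyClass::copies = 0;
```

With `a = b;` the copy happens at the call site; with `a = MyClass( 7 );` there is nothing to copy at all.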

Similarly, you can do things like this for other operators:

MyClass operator+( MyClass l, const MyClass& r )
{
    return l += r;
}
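As a concrete sketch of that pattern, here is a hypothetical two-component Vec2 type (not from the original) with operator+ written in terms of operator+=, keeping the post’s `return l += r;` form. The caller’s left operand is copied into `l`, so it is left unchanged.

```cpp
#include <cassert>

// Hypothetical value type used only to illustrate the idiom.
struct Vec2
{
    double x = 0, y = 0;

    Vec2& operator+=( const Vec2& r )
    {
        x += r.x;
        y += r.y;
        return *this;
    }
};

// Left operand taken by value: this copy is the object we accumulate into.
Vec2 operator+( Vec2 l, const Vec2& r )
{
    return l += r;
}
```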

Too cute? What’s the performance implication? Are there more copies or fewer?

Here’s a test class that I’m going to use to answer some of these questions.


extern "C"
{
    void def( long* ) throw();
    void copy( long* ) throw();
    void assign( long* ) throw();
    void destroy( long* ) throw();
}

struct TestClass
{
    TestClass()                              { def( _data ); }
    TestClass(const TestClass& r)            { copy( _data ); }
    TestClass& operator=(const TestClass& r) { assign( _data ); return *this; }
    ~TestClass()                             { destroy( _data ); }

    long _data[4];
};

The def(), copy(), assign() and destroy() functions are just used to track how many class instances are created and when. They are marked as ‘no throw’ to keep the example simple, but the possibility of having a copy constructor throw does have implications. The heavy use of extern “C” in the examples is just to make the function names in the generated assembler as simple as possible.

The class has some data to make it a realistic size.
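To watch these events at run time rather than in assembler, the hooks can be given real bodies. This sketch assumes simple global counters (g_defs, g_copies, g_assigns and g_destroys are hypothetical names) and uses noexcept in place of throw(), which later standards removed. Once the bodies are visible the compiler may inline them, so this version is for observing behaviour, not for reading generated code.

```cpp
#include <cassert>

// Hypothetical counting bodies for the tracking hooks
static int g_defs = 0, g_copies = 0, g_assigns = 0, g_destroys = 0;

void def( long* ) noexcept     { ++g_defs; }
void copy( long* ) noexcept    { ++g_copies; }
void assign( long* ) noexcept  { ++g_assigns; }
void destroy( long* ) noexcept { ++g_destroys; }

struct TestClass
{
    TestClass()                              { def( _data ); }
    TestClass( const TestClass& )            { copy( _data ); }
    TestClass& operator=( const TestClass& ) { assign( _data ); return *this; }
    ~TestClass()                             { destroy( _data ); }

    long _data[4];
};
```

Default-constructing one object, copy-constructing a second from it and then assigning between them should record one of each event and, at the end of the scope, two destructions.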

Let’s look at the calling side of pass-by-value.

extern "C" void TakeValue( TestClass v ) throw();

extern "C" void g1()
{
    TakeValue( TestClass() );
}

extern "C" void g2()
{
    TestClass v;
    TakeValue( v );
}

Compiling these with a preview release of gcc 4.4 (gcc -S -O3) for x86_64 gives the following, after removing some label noise that the compiler adds for exception-handling purposes:

g1:
    pushq   %rbx
    subq    $32, %rsp
    movq    %rsp, %rdi
    call    def
    movq    %rsp, %rdi
    call    TakeValue
    movq    %rsp, %rdi
    call    destroy
    addq    $32, %rsp
    popq    %rbx
    ret
    .size   g1, .-g1

g2:
    movq    %rbx, -16(%rsp)
    movq    %r12, -8(%rsp)
    subq    $88, %rsp
    leaq    32(%rsp), %r12
    movq    %r12, %rdi
    call    def
    movq    %rsp, %rdi
    call    copy
    movq    %rsp, %rdi
    call    TakeValue
    movq    %rsp, %rdi
    call    destroy
    movq    %r12, %rdi
    call    destroy
    movq    72(%rsp), %rbx
    movq    80(%rsp), %r12
    addq    $88, %rsp
    ret
    .size   g2, .-g2

If you’re not fluent in x64 assembler, the important things to watch for are the calls to the extern functions, which track how many objects are created, and the order in which copying and function calls happen.

Here we can see that if we pass an unnamed temporary directly into a function taking an object by value, only one object is created and there is no copying. If we pass in a named object then, as you might expect, a copy is made. Note that even though we don’t do anything with the original object after the function has returned, the compiler can’t skip the copy, because our undefined extern “C” functions may have side effects that can’t be ignored.
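The same call-site behaviour can be checked at run time. This sketch assumes the tracking hooks are given counting bodies (g_defs, g_copies and g_destroys are hypothetical names) and is compiled as C++17, where initialising a by-value parameter from a temporary is guaranteed not to copy.

```cpp
#include <cassert>

// Hypothetical counting bodies for the tracking hooks
static int g_defs = 0, g_copies = 0, g_destroys = 0;

void def( long* ) noexcept     { ++g_defs; }
void copy( long* ) noexcept    { ++g_copies; }
void destroy( long* ) noexcept { ++g_destroys; }

// TestClass trimmed to the members this example exercises
struct TestClass
{
    TestClass()                   { def( _data ); }
    TestClass( const TestClass& ) { copy( _data ); }
    ~TestClass()                  { destroy( _data ); }

    long _data[4];
};

void TakeValue( TestClass v ) noexcept { (void)v; }

void g1()
{
    TakeValue( TestClass() );   // temporary initialises the parameter directly
}

void g2()
{
    TestClass v;
    TakeValue( v );             // named object: the parameter is a copy
}
```

g1 should record one construction and one destruction; g2 adds a copy and a second destruction, matching the two destroy calls in the g2 listing.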

But in the called function itself, is there any copying or cleanup of the temporary that needs to be done?

Let’s see. We test a function that passes a pointer to its by-value parameter to an undefined function, so that the compiler must assume the object really is used and can’t optimize it out of existence.

extern "C" void TakePointer( TestClass* p ) throw();

extern "C" void TakeValue( TestClass v ) throw()
{
    TakePointer( &v );
}

This compiles to the following assembler.

TakeValue:
    jmp TakePointer
    .size   TakeValue, .-TakeValue

Wow. We just jump (that’s a goto, not a function call) to the function that takes the pointer.

If you consult the x86_64 (aka AMD64) System V calling convention used on Linux, together with the C++ ABI that GCC follows, you find the following rules for passing objects by value. Small plain-data objects (up to 16 bytes in size) are passed in one or two registers or, once the registers assigned for parameters are all allocated, on the stack; larger plain-data objects are copied onto the stack. A class with a non-trivial copy constructor or destructor, like TestClass, is handled differently whatever its size: the caller allocates a temporary, passes a pointer to it, and is responsible for destroying the temporary after the call. In other words, the state of the registers and stack on entry to and exit from a function which takes such an object by value is exactly the same as for a function which takes a pointer to it. (It is also exactly the same for a function taking the object by reference. Once you get down to assembler, references are pointers.)

This (and the ‘jmp TakePointer’) shows that, for a class like TestClass on this platform, passing by value costs exactly the same as constructing a temporary and passing a pointer or a reference to it.
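Which convention applies depends on the type, not just its size. The sketch below approximates the C++ ABI test with standard type traits. PassedByInvisibleReference, PlainBig and PlainSmall are hypothetical names, and the trait is only an approximation: it ignores corner cases such as classes whose copy and move constructors are all deleted.

```cpp
#include <cassert>
#include <type_traits>

// Approximation of the ABI rule: a class whose copy/move constructor or
// destructor is non-trivial is passed as a pointer to a caller-owned
// temporary ("invisible reference"), whatever its size.
template <class T>
constexpr bool PassedByInvisibleReference =
    !( std::is_trivially_copy_constructible<T>::value &&
       std::is_trivially_move_constructible<T>::value &&
       std::is_trivially_destructible<T>::value );

// Same data as the post's TestClass, with user-provided special members
struct TestClass
{
    TestClass() {}
    TestClass( const TestClass& ) {}
    ~TestClass() {}

    long _data[4];
};

// Same size, but plain data: on x86_64 Linux this is copied onto the stack
struct PlainBig { long _data[4]; };

// 16 bytes of plain data: on x86_64 Linux this travels in two registers
struct PlainSmall { long a, b; };

static_assert( sizeof( TestClass ) == sizeof( PlainBig ),
               "same size, different calling convention" );
```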

OK, so what about returning a value? What does that cost?

Here we have two functions that return an object by value. One returns it directly and the other uses a local variable, mutates it and then returns it.

extern "C" void TakeReference( TestClass& v ) throw();

extern "C" TestClass MakeValue()
{
    return TestClass();
}

extern "C" TestClass MakeMutatedValue()
{
    TestClass ret;
    TakeReference( ret );
    return ret;
}

Here are the assembler highlights.

MakeValue:
    pushq   %rbx
    movq    %rdi, %rbx
    call    def
    movq    %rbx, %rax
    popq    %rbx
    ret
    .size   MakeValue, .-MakeValue

MakeMutatedValue:
    pushq   %rbx
    movq    %rdi, %rbx
    call    def
    movq    %rbx, %rdi
    call    TakeReference
    movq    %rbx, %rax
    popq    %rbx
    ret
    .size   MakeMutatedValue, .-MakeMutatedValue

Look, no copies! So what’s going on here? x64 functions can use up to two registers for the return value: rax and rdx. For objects larger than 16 bytes (and for any class with a non-trivial copy constructor or destructor), the caller must allocate space for the return value and pass a pointer to that space in register rdi. The called function hands the same pointer back to the caller in rax. (rdi does not have to be preserved by the called function, so returning the pointer in rax relieves the caller of the need to save it in another preserved register or on the stack.) The other thing to note is that rbx belongs to the calling function and must be preserved by the called function.

So here’s what happens in the first case. A pointer to space for the return value comes into MakeValue in rdi. MakeValue pushes rbx onto the stack so that it can restore the contents of rbx when it returns, then saves the pointer from rdi into rbx. This ensures that when it calls other functions, which may overwrite rdi, it still has the pointer to the space allocated for the return value. MakeValue then runs the (inlined) default constructor for TestClass, which is the call to def. Finally, it copies the return value pointer (saved in rbx) into rax, the return value register, and restores the original value of rbx that it saved on the stack.

In the second case much the same happens. Indeed, the only extra instructions are the movq that sets up the pointer argument for TakeReference and the call to TakeReference itself. Despite the fact that we had a named local variable, the storage used for it was the space allocated for the return value by the caller, and no copy was made for the return statement. This is the “named return value optimization” in action.
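The claim that the named local lives in the caller’s return slot can also be observed at run time. This sketch assumes a hypothetical TakeReference body that records the address it is given; with NRVO, that address is the address of the caller’s object. NRVO is permitted but not required by the standard, so the outcome depends on the compiler; GCC and Clang apply it here by default.

```cpp
#include <cassert>

static int g_copies = 0;
static const void* g_seen = nullptr;   // last address handed to TakeReference

struct TestClass
{
    TestClass() {}
    TestClass( const TestClass& ) { ++g_copies; }
    ~TestClass() {}

    long _data[4];
};

// Hypothetical body: records which object it was handed
void TakeReference( TestClass& v ) noexcept { g_seen = &v; }

TestClass MakeMutatedValue()
{
    TestClass ret;          // with NRVO, built directly in the caller's object
    TakeReference( ret );
    return ret;             // NRVO is permitted, not guaranteed
}
```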

And the cost on the calling side?

extern "C" void TakeReference( TestClass& v ) throw();

extern "C" TestClass MakeValue();

extern "C" void UseReturnedValue()
{
    TestClass v( MakeValue() );
    TakeReference( v );
}

UseReturnedValue:
    pushq   %rbx
    subq    $32, %rsp
    movq    %rsp, %rdi
    call    MakeValue
    movq    %rsp, %rdi
    call    TakeReference
    movq    %rsp, %rdi
    call    destroy
    addq    $32, %rsp
    popq    %rbx
    ret
    .size   UseReturnedValue, .-UseReturnedValue

The sub instruction allocates stack space for the object by moving the stack pointer down 32 bytes. The constructor runs inside MakeValue, and the same stack pointer is then passed as the parameter to TakeReference and to the destructor. No copies here, either.
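Here is a run-time version of the calling side, assuming counting constructor and destructor bodies (the g_ counters are hypothetical). In C++17, initialising v from the temporary returned by MakeValue is guaranteed not to copy, so one construction and one destruction are all we expect.

```cpp
#include <cassert>

static int g_defs = 0, g_copies = 0, g_destroys = 0;

struct TestClass
{
    TestClass()                   { ++g_defs; }
    TestClass( const TestClass& ) { ++g_copies; }
    ~TestClass()                  { ++g_destroys; }

    long _data[4];
};

// Hypothetical stand-in for the undefined TakeReference
void TakeReference( TestClass& ) noexcept {}

TestClass MakeValue() { return TestClass(); }

void UseReturnedValue()
{
    TestClass v( MakeValue() );   // v is constructed in place by MakeValue
    TakeReference( v );
}                                 // v destroyed once here
```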

Excellent: passing and returning values needn’t imply any unnecessary copies, so implementing operator+ like this is fully optimal:

MyClass operator+( MyClass l, const MyClass& r )
{
    return l += r;
}

Or is it?

to be continued…

Author: charles
Filed Under Category: c++
Comments: 2 Comments

2 Comments

Steve January 24th, 2009

By using such techniques to pass parameters by value to save a few CPU cycles, are you not exposing implementation details in the interface? And is that not a bad idea?

I had this exact issue several months ago in some code I was refactoring. Some operator methods had ‘unusual’ pass by value parameters – it just looked (and smelled) wrong. It was only when I delved into the details I realised why this was happening: to save an explicit copy.

It stood out like a sore thumb in the header because “we never pass classes by value, right kids?”.

Is there not a concern for unintentional slicing here as well?

hashpling January 27th, 2009

For the example that I’m looking at, I’m really thinking of the non-inline operator+= as the function of real value. The inline operator+ is an inline helper function to allow the example class to be used more flexibly.

The operator+ implementation is essentially providing client code as efficient a recipe as possible for using operator+= automatically in addition expressions. I don’t really have any problems with slightly different signatures so long as they don’t introduce any semantic gotchas.

As for slicing, because operator+ creates a new object, it’s only really fully applicable to types with value semantics. For this reason I don’t really consider slicing a particular issue.

Consider the following.

class Derived : public Base { /* … */ };

Base operator+( const Base& l, const Base& r ) { /* … */ }

Whatever the details of the implementation, if there is no operator+ defined for Derived types, then the expression d1 + d2 will take two values of derived types and return a result whose complete type is just Base. Conceptually, the answer is sliced from what it could (should?) have been, even if neither parameter has been sliced in and of itself.
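A small sketch of that situation, with hypothetical Base and Derived types: the only operator+ is declared for Base, so adding two Derived values compiles but yields a plain Base.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical types illustrating the point above
struct Base
{
    int value = 0;
};

struct Derived : public Base
{
    int extra = 0;   // information Base knows nothing about
};

Base operator+( const Base& l, const Base& r )
{
    Base result;
    result.value = l.value + r.value;
    return result;
}

// No operator+ for Derived, so d1 + d2 resolves to the Base overload:
// neither argument is sliced, but the result's complete type is Base.
static_assert( std::is_same<decltype( Derived{} + Derived{} ), Base>::value,
               "d1 + d2 yields a Base" );
```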

My preferred method of eliminating slicing concerns is to keep base classes abstract and to only implement free operator implementations for the most derived concrete classes.
