I've written some pseudocode that should explain a problem I've discovered in my real application (Arduino 1.6 - https://github.com/maciejmiklas/LEDDisplay):
Display.h:
#include <stdint.h> // uint8_t

class Display {
public:
    void testRef();
    void testVal();
private:
    typedef struct {
        uint8_t xOnFirstKit;
        uint8_t yOnFirstKit;
        uint8_t xRelKit;
        uint8_t yRelKit;
        uint8_t xRelKitSize;
        uint8_t yRelKitSize;
        uint8_t xDataBytes;
        uint8_t xKit;
        uint8_t yKit;
        uint8_t xOnKit;
        uint8_t yOnKit;
        uint8_t xOnKitSize;
        uint8_t yOnKitSize;
        uint8_t xOnScreenIdx;
        uint8_t yOnScreenIdx;
        uint8_t yDataIdx;
    } KitData;
    inline void paintOnKitRef(KitData *kd);
    inline void paintOnKitVal(KitData kd);
};
Display.cpp:
#include "Display.h"
void Display::testRef(){
    KitData *kd = ....
    for(int i = 0 ; i < 5000 ; i++){
       paintOnKitRef(kd);
       ....
    }
}
void Display::testVal(){
    KitData *kd = ....
    for(int i = 0 ; i < 5000 ; i++){
       paintOnKitVal(*kd);
       ....
    }
}
inline void Display::paintOnKitRef(KitData *kd){
    for(int i = 0 ; i < 100 ; i++){
        kd->yDataIdx++;
        kd->yOnScreenIdx++;
        .....
    }
}
inline void Display::paintOnKitVal(KitData kd){
    for(int i = 0 ; i < 100 ; i++){
        kd.yDataIdx++;
        kd.yOnScreenIdx++;
        .....
    }
}
I have a structure, KitData, which is larger than 16 bytes, so I decided to pass it by pointer instead of by value. It works as expected.
I've measured execution times, and passing by value (testVal()) turns out to be about 30% faster than passing by pointer (testRef()).
Is this normal?
Edit:
The code above is only pseudocode. In my real test, paintOnKitVal() and paintOnKitRef() contain real code that performs many operations and calls other methods. Both methods do the same thing; the only difference is how kd is accessed (through a pointer or with dot notation).
This is the real test class: https://github.com/maciejmiklas/LEDDisplay/blob/callByPoint/Display.cpp
- Execute the test method paint(...): this uses call-by-pointer, as you can see in line 211.
- Comment out line 211 and uncomment line 212: the test will then use call-by-value and the execution time will be shorter.
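For anyone who wants to reproduce the comparison without the whole project, a minimal sketch could look like the one below. It is not the real Display code and rests on several assumptions: it is plain desktop C++ using std::chrono rather than Arduino's timing functions, KitData is a shortened 16-byte stand-in, and the names paintRef, paintVal, timeMicros and runComparison are hypothetical. The timings it prints on a PC will not necessarily match what an AVR board shows.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Shortened, hypothetical stand-in for the real KitData: 16 one-byte fields.
struct KitData {
    uint8_t otherFields[14];
    uint8_t yOnScreenIdx;
    uint8_t yDataIdx;
};

// Accesses the struct through a pointer; changes are visible to the caller.
uint32_t paintRef(KitData *kd) {
    uint32_t sum = 0;
    for (int i = 0; i < 100; i++) {
        kd->yDataIdx++;
        kd->yOnScreenIdx++;
        sum += kd->yDataIdx + kd->yOnScreenIdx;
    }
    return sum;
}

// Works on a private copy; the caller's struct is left untouched.
uint32_t paintVal(KitData kd) {
    uint32_t sum = 0;
    for (int i = 0; i < 100; i++) {
        kd.yDataIdx++;
        kd.yOnScreenIdx++;
        sum += kd.yDataIdx + kd.yOnScreenIdx;
    }
    return sum;
}

// Runs f once and returns the elapsed wall-clock time in microseconds.
template <typename F>
long long timeMicros(F f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}

// Times 5000 calls of each variant, mirroring testRef() and testVal().
void runComparison() {
    KitData kd = {};
    volatile uint32_t sink = 0; // volatile so the optimizer cannot drop the loops

    long long refUs = timeMicros([&] {
        for (int i = 0; i < 5000; i++) sink = sink + paintRef(&kd);
    });
    long long valUs = timeMicros([&] {
        for (int i = 0; i < 5000; i++) sink = sink + paintVal(kd);
    });

    std::printf("sizeof(KitData) = %zu\n", sizeof(KitData));
    std::printf("by pointer: %lld us, by value: %lld us\n", refUs, valUs);
}
```

Calling runComparison() from main() prints both timings. Note that the two variants are not fully equivalent: the pointer version keeps modifying the same struct across calls, while the value version starts from an unchanged copy every time.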
 
     
     
    