It turns out the following works, but the gate is applied in a counter-intuitive way. This may be a result of the bit ordering in Qiskit, which seems to give rise here to a very non-standard implementation, so beware! Specifically, test the following code (you can turn the qubits to |1> using the commented-out x() gates):
q = QuantumRegister(2, 'q')
c = ClassicalRegister(2, 'c')
U4x4 = np.array( [[0, 1, 0, 0], [1, 0, 0, 0],
[0, 0, 0, 1], [0, 0, 1, 0]] )
qc = QuantumCircuit(q,c)
qc.x(q[0])
#qc.x(q[1])
gate4x4 = UnitaryGate(U4x4)
qc.append(gate4x4, [q[0], q[1]] )
qc.measure(q[0],c[0])
qc.measure(q[1],c[1])
qc.draw()
Just by looking at the matrix, you can see that this is supposed to have the following output: |00> -> |01>, |01> -> |00>, |10> -> |11> and |11> -> |10>, where the first bit, i.e. a in |ab> denotes the value measured on q[0]. In other words, if input is q[0]=|1> and q[1]=|0>, one would expect the input state in the standard basis to be (the column vector) (0;0;1;0) so the output would be (0;0;0;1). But try it out by simulating it on Aer and you'll find out that's not the case. With qc.x(q[0]) as shown, the output is (0;0;0;0). To get the expected output you would need to append on [q[1], q[0]] instead. While this can definitely be handled by someone who was made aware, I think this is utterly confusing.
Here is a controlled version of this gate. Again, notice the (very unintuitive) inverse order of the qubits needed in the append instruction in order for the first qubit q[0] to act as the control.
q = QuantumRegister(3, 'q')
c = ClassicalRegister(3, 'c')
U4x4 = np.array( [[0, 1, 0, 0], [1, 0, 0, 0],
[0, 0, 0, 1], [0, 0, 1, 0]] )
k = 4
# This makes a controlled variant of the matrix
C_U = np.vstack([np.hstack([np.eye(k), np.zeros((k,k))]),
np.hstack([np.zeros((k,k)), U4x4])])
qc = QuantumCircuit(q,c)
#qc.x(q[0])
gate3Q = UnitaryGate(C_U)
qc.x(q[0])
qc.append(gate3Q, [q[2], q[1], q[0]] )
qc.measure(q[0],c[0])
qc.measure(q[1],c[1])
qc.measure(q[2],c[2])
qc.draw()
One can easily confirm for themselves what happens by running this code (and turning x(q[0]) on and off, etc.) with
backend = BasicAer.get_backend('qasm_simulator')
shots = 2048
results = execute(qc, backend=backend, shots=shots).result()
answer = results.get_counts()
print(answer)
plot_histogram(answer)
By looking at the definition of C_U and the append, it would seem that the first qubit, q[2], should be the control. But no, it is q[0]. For further reference, here is the Qiskit version I'm running:
{'qiskit-terra': '0.11.0',
'qiskit-aer': '0.3.4',
'qiskit-ignis': '0.2.0',
'qiskit-ibmq-provider': '0.4.4',
'qiskit-aqua': '0.6.1',
'qiskit': '0.14.0'}