Yes, in this case multiple words of data are being read from the banks simultaneously, one word per bank. This requires fine-grained control of memory accesses by the program being executed; otherwise you will run into bank access/scheduling conflicts. In hardware terms, a port is just an interface of connections between two pieces of hardware (serial or parallel, one wire or more). Each memory bank has a port interfacing the shared memory with the GPU cores.
Regarding caches, you might want to see the question What is the difference between a cache and a buffer? for an in-depth look at caches and related nomenclature. With respect to ports, a cache is meant to be transparent to the use of the port: ideally, you get an increase in throughput (or a decrease in latency) from the cache without changing the way the port is used at a high level.
In terms of memory banks, the controller and endpoint of each bank require no change to their interfaces. When a subsequent data word is accessed and that word is already present in the cache hierarchy, the data is simply available/returned faster than if the cache does not yet hold it and the word has to be fetched directly from memory. In both cases the external port interface is identical; only the timing of the signals changes because of the difference in delay.