With every parallel processing task, there is some amount of overhead due to task coordination. pmap applies the mapping function to each element in a separate thread. As the lazy sequence returned by pmap is consumed, the consumer thread must coordinate with the producer threads. The way pmap is defined, this overhead is incurred for each and every element produced.
Considering this, when you use pmap to compute a simple function (such as squaring a number, as in your example), the time it takes for the threads to coordinate their activities swamps the time it takes to actually compute the value. As the docstring says, pmap is "only useful for computationally intensive functions where the time of f dominates the coordination overhead" (emphasis added). In these cases, pmap will take longer than map regardless of how many cores you have.
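To make that concrete, here is a minimal sketch (assuming a squaring function like the sq below stands in for the one in your test); on most machines the pmap version comes out slower:

(defn sq [x] (* x x))                       ; stand-in for the cheap function in your test

(time (dorun (map  sq (range 1000000))))    ; cheap f: plain map is fast
(time (dorun (pmap sq (range 1000000))))    ; coordination overhead dominates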
To actually see a benefit from pmap, you must choose a "harder" problem. In some cases, this may be as simple as partitioning the input sequence into chunks. The sequence of chunks can then be processed with pmap and the results concatenated to form the final output.
For example:
(defn chunked-pmap [f partition-size coll]
  (->> coll                              ; Start with the original collection.
       (partition-all partition-size)    ; Partition it into chunks.
       (pmap (comp doall                 ; Map f over each chunk, and use doall
                   (partial map f)))     ; to force it to be realized in the
                                         ; worker thread.
       (apply concat)))                  ; Concatenate the chunked results
                                         ; to form the return value.
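For instance, it can be used like this (a rough sketch; the squaring function and the partition size of 2048 are arbitrary choices for illustration):

(time (dorun (chunked-pmap #(* % %) 2048 (range 1000000))))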
However, there is also overhead in partitioning the sequence and concatenating the chunks at the end. For example, at least on my machine, chunked-pmap still underperformed map by a significant amount for your example. Still, it may be effective for some functions.
Another way to improve the effectiveness of pmap is to partition the work at a different place in the overall algorithm. For example, suppose we were interested in calculating the Euclidean distance between pairs of points. While parallelizing the square function has proven to be ineffective, we might have some luck parallelizing the entire distance function, as sketched below. Realistically, we would want to partition the task at an even higher level, but that is the gist of it.
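As a sketch of that idea (the point representation and the dist helper here are assumptions for illustration, not code from your question):

(defn dist
  "Euclidean distance between two points given as [x y] pairs."
  [[x1 y1] [x2 y2]]
  (Math/sqrt (+ (Math/pow (- x1 x2) 2)
                (Math/pow (- y1 y2) 2))))

;; Parallelize whole distance computations rather than the inner squaring:
;; each pair of points becomes one unit of parallel work.
(defn pairwise-distances [point-pairs]
  (pmap (fn [[p1 p2]] (dist p1 p2)) point-pairs))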
In short, the performance of parallel algorithms is sensitive to the manner in which the task is partitioned, and you have chosen a level that is too granular for your test.