This is due to a combination of features of modern processors.
The first thing that contributes to a high IPS is the fact that modern processors have multiple execution units that can operate independently. In the below image (borrowed from Wikipedia: Intel Core Microarchitecture) you can see at the bottom that there are eight execution units (shown in yellow) that can all execute instructions concurrently. Not all of those units can secure the same types of instruction, but at least 5 of them can perform an ALU operation and there are three SSE capable units.

Combine that with a long instruction pipeline which can efficiently stack instructions ready for those units to execute instructions (out of order, if necessary) means that a modern processor can have a large number of instructions on the fly at any given time.
Each instruction might take a few clock cycles to execute, but if you can effectively parallelize their execution then you can give yourself a massive boost to IPS at the cost of processor complexity and thermal output.
Keeping these large pipelines full of instructions also needs a large cache that can be prefilled with instructions and data. This contributes to the size of the die and also the amount of heat the processor produces.
The reason this is not done on smaller processors is because it substantially increases the amount of control logic required around the processing cores, as well as the amount of space required and also heat generated. If you want a small, low power, highly responsive processor then you want a short pipeline without too much "extra" stuff surrounding the actual functional cores. So typically they minimise cache, restrict it to only one of each type of unit required to process instructions, and reduce the complexity of every part.
They could make a small processor as complex as as larger processor and achieve a similar performance, but then the power draw and cooling requirements would be exponentially increased.