Firstly, that's a very good question!
The "not fast enough" bit is to do with latency. It would be possible to do away with some of that latency, but only by pushing up the cost.
Latency depends on how fast the memory and bus systems physically are, and also on physical distance: signals take time to travel.
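To put a rough number on the distance point: nothing travels faster than light, so at a given clock rate there is a hard upper bound on how far a signal can get in one cycle. This is a minimal sketch assuming a 3 GHz clock; the velocity factor of 0.5 for real wiring is an illustrative assumption, not a measured figure (actual values depend on the interconnect).

```python
# Speed of light in a vacuum, in metres per second (exact by definition).
C = 299_792_458


def distance_per_cycle_cm(clock_hz, velocity_factor=1.0):
    """Distance (in cm) a signal can travel during one clock cycle.

    velocity_factor is the signal speed as a fraction of c. 1.0 is the
    theoretical vacuum limit; real on-chip and board traces are slower.
    """
    return C * velocity_factor / clock_hz * 100


# 0.5 is an assumed, illustrative velocity factor for real wiring.
for v in (1.0, 0.5):
    print(f"3 GHz, velocity factor {v}: {distance_per_cycle_cm(3e9, v):.1f} cm")
```

At 3 GHz even light covers only about 10 cm per cycle, before allowing for the round trip or any switching delay, which is one reason memory sitting centimetres away on the motherboard can't answer within a cycle or two.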
Faster memory and better bus connections cost more, which is why each cache level further from the processor is bigger and slower than the one before it: it's the best way to get the most bang for your buck.
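The "bigger and slower" trade-off is usually quantified as average memory access time (AMAT): each level's hit time plus its miss rate multiplied by the cost of going one level further out. A minimal sketch; the latencies and miss rates below are hypothetical numbers chosen for illustration, not measurements of any real processor.

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy, in cycles.

    levels: list of (hit_latency_cycles, local_miss_rate), from L1 outwards.
    memory_latency: cycles to reach main memory after missing every level.
    Applies AMAT = hit_time + miss_rate * (AMAT of the next level out).
    """
    time = memory_latency
    # Work inwards from main memory towards L1.
    for hit_latency, miss_rate in reversed(levels):
        time = hit_latency + miss_rate * time
    return time


# Hypothetical figures, for illustration only.
memory = 200
print(f"{amat([(4, 0.10)], memory):.1f}")                          # L1 only
print(f"{amat([(4, 0.10), (12, 0.40)], memory):.1f}")              # L1 + L2
print(f"{amat([(4, 0.10), (12, 0.40), (40, 0.50)], memory):.1f}")  # L1 + L2 + L3
```

This prints 24.0, 13.2 and 10.8. With these made-up numbers the L3, despite being ten times slower than L1, still lowers the average because it catches half of the misses that would otherwise go all the way to main memory.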
Much of modern computer design revolves around latency. If the processor, memory and disk subsystems were more closely matched in speed, we wouldn't bother with extra design features like level 3 caches.
The presence of the cache is also down to the architecture; see the Von Neumann bottleneck.
The cache levels also refer to how close they are to the processor. Level 1 cache is "on-chip" cache, and as such it uses up valuable real estate on the silicon. Only so many transistors fit within a given area: the transistor count depends on the size of the die and on the transistor density (how small the gates are). The bigger the die, the more waste, because each die is more likely to contain an impurity, which means more faulty units and a lower yield. More transistors allow for more complex and powerful processors, so making the level 1 cache bigger could be detrimental to the overall design: it would either use transistors that could go to other logic, or lower the yield by increasing the die size.
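The yield argument can be made concrete with the simplest textbook model, which assumes defects land randomly (Poisson-distributed) across the wafer, giving a yield of Y = exp(-D * A) for defect density D and die area A. This is a sketch only: the defect density is made up, the 30 cm (300 mm) wafer is an assumption, and more elaborate yield models exist.

```python
import math


def poisson_yield(die_area_cm2, defects_per_cm2):
    """Fraction of dies expected to be defect-free: Y = exp(-D * A)."""
    return math.exp(-defects_per_cm2 * die_area_cm2)


def good_dies_per_wafer(die_area_cm2, defects_per_cm2, wafer_diameter_cm=30.0):
    """Rough count of working dies per wafer.

    Ignores edge losses and scribe lines, so it overestimates the real count.
    """
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    gross_dies = wafer_area // die_area_cm2  # whole dies that fit by area
    return gross_dies * poisson_yield(die_area_cm2, defects_per_cm2)


D = 0.5  # hypothetical defect density, defects per cm^2
for area in (1.0, 1.5, 2.0):
    print(f"{area} cm^2 die: yield {poisson_yield(area, D):.0%}, "
          f"~{good_dies_per_wafer(area, D):.0f} good dies per wafer")
```

Doubling the die area from 1 cm^2 to 2 cm^2 halves the number of candidate dies and also drops the yield, so the number of working chips per wafer falls by much more than half.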
Moore's law covers a lot of this. Many people think they understand Moore's law because they know the media's attention-deficit definition; they generally don't.
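The compounding part is at least easy to show. Moore's 1965 observation was about the number of components per chip at the minimum cost per component, which is exactly the cost/density balance described above; he originally put the doubling at every year and revised it to roughly every two years in 1975. A sketch of that idealised trend, assuming a two-year doubling period and a hypothetical starting count:

```python
def projected_transistors(start_count, years, doubling_period_years=2.0):
    """Project a transistor count forward, assuming it doubles every
    doubling_period_years (an idealised Moore's-law trend, not a forecast)."""
    return start_count * 2 ** (years / doubling_period_years)


# Hypothetical starting point of 100 million transistors.
for years in (2, 4, 10):
    print(years, f"{projected_transistors(100e6, years):,.0f}")
```

The point for caches is that the transistor budget grows over time but is always finite at any given price, so every transistor spent on L1 is one not spent on other logic.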
Moore's Law :-
http://arstechnica.com/articles/paedia/cpu/moore.ars/3
Caches in general :-
http://en.wikipedia.org/wiki/CPU_cache
Design is the careful balancing of multiple forces or variables.
So on one level you are right; it's just that your processor design would probably cost you £10,000, and it might not scale as well as 10 x £1,000 processors!
Of course, you also pay for extra complexity. A simple design works fine on a SISD architecture, but as soon as you bring in multiprocessor architectures you have cache snooping and cache coherency to deal with.
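To give a feel for what snooping and coherency involve, here is a toy model of MSI, one of the simplest snooping coherence protocols. It is chosen purely for illustration: real processors typically use richer variants such as MESI or MOESI, and the `Bus`/`Cache` classes are made-up names. It tracks a single memory line shared by several caches over one bus.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"  # this cache holds the only, dirty copy
    SHARED = "S"    # clean copy; other caches may hold it too
    INVALID = "I"   # no valid copy


class Bus:
    """A shared bus that every cache snoops. Models one memory line."""

    def __init__(self, n_caches):
        self.caches = [Cache(i, self) for i in range(n_caches)]
        self.memory_value = 0

    def broadcast_read(self, requester):
        # A cache holding the line Modified writes it back and drops to Shared.
        for c in self.caches:
            if c is not requester and c.state is State.MODIFIED:
                self.memory_value = c.value
                c.state = State.SHARED
        return self.memory_value

    def broadcast_invalidate(self, requester):
        # All other copies go stale once the requester writes.
        for c in self.caches:
            if c is not requester:
                if c.state is State.MODIFIED:
                    self.memory_value = c.value  # flush dirty data first
                c.state = State.INVALID


class Cache:
    def __init__(self, ident, bus):
        self.ident = ident
        self.bus = bus
        self.state = State.INVALID
        self.value = None

    def read(self):
        if self.state is State.INVALID:  # read miss: fetch over the bus
            self.value = self.bus.broadcast_read(self)
            self.state = State.SHARED
        return self.value

    def write(self, value):
        if self.state is not State.MODIFIED:  # gain exclusive ownership first
            self.bus.broadcast_invalidate(self)
        self.value = value
        self.state = State.MODIFIED


bus = Bus(2)
a, b = bus.caches
a.write(42)       # a: Modified, b: Invalid
print(b.read())   # b misses, a writes back; both end up Shared
b.write(7)        # a's copy is invalidated
print(a.read())   # a misses and gets the new value from b
```

Even this toy version needs every write to a shared line to be broadcast to, and acted on by, every other cache, which is the kind of extra traffic and logic a single-processor design never has to pay for.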