"High Performance Computing" is traditionally the purview of universities, governments and Big Science. The difference between what most people have at home and high-performance computing (HPC) is the difference between bicycles and the space race.
But the scene has been changing for quite some time. Clusters of cheap desktops running Linux and one parallel-computing software interface or another were the first to come. A showcase example was Oak Ridge National Laboratory's "Stone SouperComputer", put together from surplus PCs and components that would otherwise have been thrown away. It was used for real work, doing the computational heavy lifting for several ecological modeling projects. Over time, clustering became the primary means of achieving high performance. A quick look at the TOP500 list shows just how prevalent clustering is.
Over the last few years, multicore CPUs have pushed this in a somewhat new direction, first by doubling or quadrupling the number of CPU cores in a computing node. Later, highly parallel processors were added to the clusters as well. By highly parallel, I mean machines like the TOP500's current #1 supercomputer, the Roadrunner, which combines standard multicore AMD Opteron processors (siblings to the Athlon 64 processors common in desktop computers) with IBM PowerXCell processors, a slightly modified version of the Cell processor that powers every Sony PlayStation 3. The PowerXCell features one normal PowerPC core and eight "Synergistic Processing Elements," essentially smaller sub-processors that focus on simple parallelizable tasks.
As the HPC market came to see the utility of many small, simple processing cores, many turned their attention to modern graphics processing units (GPUs). Modern GPUs contain dozens or even hundreds of small processing pipelines, optimized for the types of math and operations that 3D graphics require. As time progressed, 3D graphics, especially games, required more complex computation, to such a degree that each of the pipelines started to resemble a general-purpose, if graphics-optimized, CPU.
Soon, hobbyists and programmers started taking advantage of this, programming their GPUs to perform computations that would normally run on a CPU. By moving the highly-parallel parts of their computation to the GPU, they could perform dozens of operations in the time that a single CPU could do one.
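To make the "highly parallel parts" idea concrete, here is a minimal sketch in Python of the classic a*x + y vector operation, split into independent chunks. It is only an illustration of the decomposition, not GPU code: the names saxpy_chunk and parallel_saxpy are made up for this example, and a thread pool stands in for the GPU's many pipelines. In standard CPython the threads will not actually speed up pure-Python arithmetic; the point is that no chunk depends on any other, which is exactly the property that lets a GPU process them all at once.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_chunk(a, xs, ys):
    """Compute a*x + y for one independent slice of the data."""
    return [a * x + y for x, y in zip(xs, ys)]


def parallel_saxpy(a, xs, ys, workers=4):
    """Split the vectors into contiguous chunks and process each as its own task.

    Hypothetical helper for illustration; the thread pool is a stand-in
    for the many small pipelines of a GPU.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    size = max(1, -(-n // workers))  # ceiling division: elements per chunk
    ranges = [(start, min(start + size, n)) for start in range(0, n, size)]

    def work(bounds):
        lo, hi = bounds
        return saxpy_chunk(a, xs[lo:hi], ys[lo:hi])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input ranges,
        # so the chunks can simply be concatenated.
        parts = list(pool.map(work, ranges))
    return [value for part in parts for value in part]
```

Calling parallel_saxpy(2, [1, 2, 3, 4, 5], [10, 20, 30, 40, 50], workers=2) splits the five elements into two chunks and returns the same list a plain loop would produce.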
ATI (now AMD) and nVidia smelled money. ATI released some of their low-level programming interfaces, and a set of software extensions that allowed programmers easier access to the GPU. nVidia developed a whole programming environment called CUDA, which allows developers to write C code directly for their GPUs. Now nVidia even has a version of their most powerful graphics card with all the graphics hardware removed. Called "Tesla", it's a pure computation engine. You can fill your PC with as many Tesla cards as you have slots to stick 'em in. nVidia also offers dedicated chassis full of Tesla cards, with dedicated high-performance connections to PCs. The Tokyo Institute of Technology has begun adding Tesla units to their Tsubame supercomputer. The US National Center for Atmospheric Research has begun testing Tesla for accelerating particularly obnoxious computation, and found significant gains.
And really, this is sort of taking us full circle. Some of the dead technological offshoots in computing's past looked in this direction. Many hundreds, even thousands, of small processors were used in the Connection Machine computers, each processor being only one bit wide. INMOS developed a radically different type of computing with their transputer. Each chip was a small microprocessor with a small amount of RAM and several inter-transputer network links, and they were designed from the ground up as massively parallel computing engines.
The Connection Machines started with the ideas introduced in the variable-width "bit-slice" processors that preceded them, and took those ideas to their logical extreme: a machine an arbitrary number of bits wide, and code-reconfigurable. It proved to be unsuccessful in implementation, though. With the 5th-generation CM-5, their maker, Thinking Machines, designed a large parallel machine powered by up to 512 Sun SPARC processors.
The INMOS transputer, by contrast, is the evolutionary forebear of today's massively parallel supercomputers. Each transputer combined a small processor, memory, and enough glue logic to allow the device to stand mostly on its own. With network links to multiple other transputers on the same board, in the same case, or even spread throughout multiple cases, a cluster of transputers could appear to be a single virtual system, with individual threads running on each transputer. It was ahead of its time only insofar as there was still a great deal of headroom left in getting higher performance out of traditional computing, and programming for parallel computing is hard.
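The flavor of that design, small independent nodes that talk only over links, can be sketched in a few lines of Python. This is a loose analogy under stated assumptions, not a model of real transputer hardware or its software: each "node" is an ordinary thread, each "link" is a buffered queue.Queue, and the names node, run_pipeline, and STOP are invented for this example. For simplicity the sketch does not handle a stage function that raises an exception.

```python
import queue
import threading

STOP = object()  # sentinel passed down the links to shut the nodes down


def node(func, inbox, outbox):
    """One node: read a value from the incoming link, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # forward the shutdown signal to the next node
            return
        outbox.put(func(item))


def run_pipeline(stages, inputs):
    """Chain one node per stage function, connected only by queues (the 'links')."""
    links = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=node, args=(func, links[i], links[i + 1]))
        for i, func in enumerate(stages)
    ]
    for t in threads:
        t.start()

    # Feed the first link, then signal that no more work is coming.
    for value in inputs:
        links[0].put(value)
    links[0].put(STOP)

    # Drain the last link; order is preserved because every link is FIFO
    # and every stage is a single node.
    results = []
    while True:
        item = links[-1].get()
        if item is STOP:
            break
        results.append(item)

    for t in threads:
        t.join()
    return results
```

For example, run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]) sends each value through two nodes in turn. The nodes share no state and know nothing about each other beyond the links they read from and write to, which is the property that let a transputer network grow by simply wiring in more chips.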
Today, the modern single-threaded microprocessor has hit a performance barrier. Power requirements and the law of diminishing returns have made it difficult to wring higher performance out of more transistors and higher clock speeds. To improve returns on Moore's Law, today's CPU makers have divided their transistor budget across two or more fully functional CPU cores on a single die. Dual-core CPUs are commonplace, and quad-core CPUs are about to become so as well.
With the ubiquity of multicore computing and the aforementioned implicitly parallel nature of graphics processing, parallel programming has been thrust to the forefront of development. In order to achieve adequate performance for any demanding task, developers now must parallelize their computing tasks. What was once a good idea in hardware and optional in software has now become ubiquitous in hardware and required in software.