I've been transferring data between the BeagleBone's PRUs and main memory.  If I use the PRU's SBBO instruction to store a range of PRU registers to main (DDR) memory, I find that I get around 600MB/s -- not bad!
But surprisingly, when I try to read that data back, the main CPU seems to go much slower.
I wrote some sample code for the main CPU to sum up all the bytes in a big (many MB) ordinary buffer.  I got ~300MB/sec.  Using the LDM instruction I got that up to 600MB/sec and in one case over 1GB/sec.  So in general the main CPU seems to have no trouble accessing main memory.
But when I run the same code on the buffer allocated by the uio_pruss kernel module, I get only about a tenth of that: 30MB/s, or closer to to 40MB/s when using LDM.
Kumar Abhishek from the BeagleLogic project helped me understand what was going on.  The uio_pruss module allocates that buffer using dma_alloc_coherent(), which is the standard way that linux kernel modules talk to DMA-based peripherals when they need to exchange smallish amounts of data quickly.  It tells the kernel that somebody else is going to be writing to main memory via DMA, so for this block of data, make sure we bypass the cache for every single memory access.
For larger blocks of data travelling in one direction, CPU -> peripheral or peripheral <- CPU, the Dynamic DMA mapping Guide describes that the standard approach is to kmalloc() the memory in the kernel module, then use functions like dma_map_single(), dma_unmap_single(), dma_sync_single_for_cpu(), dma_sync_single_for_device() to make sure entire buffers are safe for access by the peripheral or CPU.
That way, rather than every memory access having to bypass the cache, the kernel can make sure it's safe for the CPU to access everything in the block now that the peripheral is done with it, or vice versa.
Unfortunately, though, on the ARM A8 CPU in the beaglebone, making sure a buffer doesn't already have stale data in the cache (which can happen unexpectedly due to things like speculative preloading) requires the kernel to walk cache line by cache line through the entire buffer, taking longer than the memory transfer it's preparing for!
Kumar reports that he gets upward of 200MB/sec using this approach, dominated by the dma_* kernel calls.  I tried it myself with a simple kernel module and got a bit over 100MB/sec, so it seems plausible to me.
This thread, "dma_sync_single_for_cpu takes a really long time", is worth reading all the way through. 
The only other way I can think of to get faster CPU access to big chunks of data from the PRUs is to tell the L1 and L2 caches to flush themselves, then access the data without calling the dma_sync_* functions at all.  The danger there is that it's very much tied to the specific CPU architecture and is very much not the recommended approach, so nobody's going to sympathize if you get corrupt data, and the only way to know if you've done it right is to try to test all the edge cases you can think of.
