a quick look at the wave/play code shows that it's allocating a few things in SRAM that could probably be shifted to SDRAM, so I think this could be improved.
streaming many samples at one time could be problematic, as a thread is allocated for each sample, which is not particularly efficient (though I suppose the threads will end up io bound, so perhaps it's not too bad if the card is fast enough).
if, however, the multiple wave/plays are due to the multiple subpatches, as per OP, then yeah, this is the kind of issue I expected... SRAM is in short supply, so you will still start hitting many limits. on this kind of platform I think it would be more desirable to have 'fast loading' of alternative patches, rather than having multiple patches loaded at once (and yeah, I recognise the issue with reverb tails etc)
as for immediate tips, review the 'precious resources' thread; checking the preset/mod source allocations and the like in the patch settings can help save a few bytes.