You could build polyphonic blocks, but to keep every audio and CV signal separate you would need as many inlets and outlets as there are voices. In your example, the audio is already summed at the output before it goes into the filter, so the filter receives one mixed signal rather than separate voices.
Some more info on this can be found in this topic:
As far as I know, a voice is allocated when a new note is pressed. A polyphonic subpatch further down the chain that has no keyboard object may not end up with the same voice allocated for that note.
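To illustrate why that matters, here is a rough sketch in Python. It is not how the patcher actually allocates voices; the `VoiceAllocator` class and its round-robin policy are made up purely for illustration. It only shows that a block which never sees the note-on events has no note-to-voice mapping of its own.

```python
class VoiceAllocator:
    """Toy allocator (hypothetical): a voice is assigned only on note-on."""

    def __init__(self, num_voices):
        self.num_voices = num_voices
        self.next_voice = 0
        self.note_to_voice = {}

    def note_on(self, note):
        # Assumed policy: plain round-robin, stealing the oldest voice.
        voice = self.next_voice
        self.next_voice = (self.next_voice + 1) % self.num_voices
        # Drop any note that was still holding the voice being reused.
        for held, used in list(self.note_to_voice.items()):
            if used == voice:
                del self.note_to_voice[held]
        self.note_to_voice[note] = voice
        return voice

    def note_off(self, note):
        # Returns the freed voice index, or None if the note was not held.
        return self.note_to_voice.pop(note, None)


# First block has a keyboard object, so it receives the note events.
oscillators = VoiceAllocator(num_voices=4)
# Second block has no keyboard object, so it never sees a note-on.
filters = VoiceAllocator(num_voices=4)

for note in (60, 64, 67):
    oscillators.note_on(note)

osc_map = dict(oscillators.note_to_voice)
filter_map = dict(filters.note_to_voice)
print(osc_map)     # mapping held by the first block
print(filter_map)  # second block has no mapping at all
```

The second block has nothing telling it which of its voices belongs to which note, so there is no guarantee its voice 1 lines up with voice 1 of the first block.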
There's a good discussion on polyphonic voice assignment that may help:
In my opinion, it's best to build complete mono voices rather than chaining polyphonic subpatches together.