Anything in an image-to-video generation that isn't in the starting frame is always tricky to get right. You could try adding an end frame with the pie already in the face and see how that works.
You could also try having the pie at the edge of the starting frame (with the hand of the thrower), describing it flying through the air in your text prompt (and using motion brushes), and then cropping that edge of the frame out of the resulting video. I've done this in my slime videos by cropping the "hand of the pourer" out of the video.
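If you'd rather do the crop from the command line than in a video editor, here's a rough sketch of one way to do it. It assumes you have ffmpeg installed and uses its `crop` filter; the function name, file names, and the 120-pixel figure are just placeholders I made up for illustration, so adjust them to wherever the hand sits in your frame.

```python
def build_crop_command(src, dst, side, pixels):
    """Build an ffmpeg argument list that trims `pixels` off one side of the frame.

    Hypothetical helper for illustration; it only builds the command,
    it does not run ffmpeg itself.
    """
    if pixels <= 0:
        raise ValueError("pixels must be positive")

    # ffmpeg's crop filter takes out_w:out_h:x:y, where x and y are the
    # top-left corner of the region to keep. in_w and in_h are the input size.
    filters = {
        "left": f"crop=in_w-{pixels}:in_h:{pixels}:0",
        "right": f"crop=in_w-{pixels}:in_h:0:0",
        "top": f"crop=in_w:in_h-{pixels}:0:{pixels}",
        "bottom": f"crop=in_w:in_h-{pixels}:0:0",
    }
    if side not in filters:
        raise ValueError("side must be one of: left, right, top, bottom")

    # Video is re-encoded by the filter; the audio stream is copied unchanged.
    return ["ffmpeg", "-i", src, "-vf", filters[side], "-c:a", "copy", dst]


# Example: trim 120 pixels off the left edge, where the thrower's hand is.
cmd = build_crop_command("pie_raw.mp4", "pie_cropped.mp4", "left", 120)
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)`. Using an even number of pixels is the safer choice, since some encoders complain about odd frame dimensions. Bear in mind the cropped video will have a different aspect ratio, so frame the starting image a bit wider than you need.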
I'm finding 1.6 a bit pointless so far. I've got prompts that work very well in 1.5, and in 1.6 they don't work as well, plus it's slower. I'm guessing it's better for text-to-video. I'm sticking with 1.5 for now.