Idea: Decompose into two stages - (1) RL predicts 6-DoF grasp pose from visual features, (2) state machine + IK executes the grasp.