Abstract: Recently, audio generation tasks have attracted considerable research interest. Despite rapid advancements in generating high-fidelity audio that is coarsely aligned with the text ...