Abstract: Current large speech language models are mainly based on semantic tokens from discretization of self-supervised learned representations and acoustic tokens from a neural codec, following a ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results