Google has released MusicLM, a new AI model that generates high-fidelity music from text descriptions and even from paintings, turning a piece of text or a painting into music in a variety of styles.
MusicLM also has a story mode, which produces a medley that mixes different musical styles. I picked one sample whose mix is very interesting, though not especially pleasant to listen to.
With MusicLM, users can specify instruments, locations, genres, eras, the skill level of the performers, and more, and can adjust the character of the generated music so that multiple versions of the same piece can be produced.
MusicLM is not the first AI model for generating songs; similar products include Riffusion and Dance Diffusion. Google itself has released AudioLM, and OpenAI, the developer of the hugely popular chatbot “ChatGPT”, has launched Jukebox.
MusicLM is a hierarchical sequence-to-sequence model. According to artificial intelligence researcher Keunwoo Choi, MusicLM combines several models, such as MuLan + AudioLM and MuLan + w2v-BERT + SoundStream. AudioLM can be considered the precursor of MusicLM: MusicLM builds on AudioLM's multi-stage autoregressive modeling and conditions the generation on text. It can generate music at a sampling rate of 24 kHz from text descriptions and keep the result consistent over several minutes.
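To make the idea of hierarchical generation more concrete, here is a minimal toy sketch in Python. It is not MusicLM's code: the function names, the token rates, and the arithmetic that stands in for the neural networks are all assumptions made for illustration. Only the 24 kHz sampling rate and the overall order of the stages (text conditioning, then coarse tokens, then fine tokens) come from the description above.

```python
from typing import Dict, List

SAMPLE_RATE_HZ = 24_000  # output sampling rate mentioned in the article

# Hypothetical token rates, chosen only for illustration.
SEMANTIC_TOKENS_PER_SEC = 25
ACOUSTIC_TOKENS_PER_SEC = 50
VOCAB_SIZE = 1024  # hypothetical vocabulary size for both token types


def num_samples(duration_sec: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples needed for a clip of the given length."""
    return int(round(duration_sec * sample_rate_hz))


def text_to_conditioning(prompt: str, dim: int = 4) -> List[int]:
    """Stand-in for the text/music embedding stage (the role MuLan plays)."""
    return [sum(ord(c) for c in prompt[i::dim]) % VOCAB_SIZE for i in range(dim)]


def generate_semantic_tokens(cond: List[int], duration_sec: float) -> List[int]:
    """Stand-in for the coarse stage: each token depends on the previous one."""
    count = int(duration_sec * SEMANTIC_TOKENS_PER_SEC)
    tokens: List[int] = []
    prev = sum(cond) % VOCAB_SIZE
    for _ in range(count):
        prev = (prev * 31 + 7) % VOCAB_SIZE  # toy "autoregressive" update
        tokens.append(prev)
    return tokens


def generate_acoustic_tokens(
    cond: List[int], semantic: List[int], duration_sec: float
) -> List[int]:
    """Stand-in for the fine stage, conditioned on text and coarse tokens."""
    count = int(duration_sec * ACOUSTIC_TOKENS_PER_SEC)
    return [
        (semantic[i * len(semantic) // count] + cond[i % len(cond)]) % VOCAB_SIZE
        for i in range(count)
    ]


def sketch_pipeline(prompt: str, duration_sec: float) -> Dict[str, int]:
    """Run the toy stages in order and report how much data each produces."""
    cond = text_to_conditioning(prompt)
    semantic = generate_semantic_tokens(cond, duration_sec)
    acoustic = generate_acoustic_tokens(cond, semantic, duration_sec)
    return {
        "semantic_tokens": len(semantic),
        "acoustic_tokens": len(acoustic),
        "audio_samples": num_samples(duration_sec),
    }


if __name__ == "__main__":
    print(sketch_pipeline("calm piano with soft rain", 30))
```

The point of the sketch is the shape of the pipeline rather than the numbers: a short text prompt is turned into a conditioning signal, a coarse token sequence is generated first, and a finer sequence is generated on top of it before anything becomes audio. At 24 kHz, a three-minute clip alone corresponds to 4,320,000 samples, which is one reason such models work with compact token sequences instead of predicting raw samples directly.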
To address the lack of evaluation data, the research team introduced MusicCaps, the first evaluation dataset built specifically for text-to-music generation. MusicCaps was developed together with subject matter experts and includes about 5,500 music-text pairs. Google trained MusicLM on a dataset of 280,000 hours of music. Google's experiments show that MusicLM outperforms previous models in both audio quality and how well the music matches the text description.
However, MusicLM also faces the usual problems of all generative AI: technical shortcomings, copyright infringement, ethical disputes, and so on. On the technical side, for example, MusicLM is able to generate vocals, but the results are poor: the lyrics are garbled and their meaning is unclear.