In a study published on the preprint server Arxiv.org, researchers affiliated with the Institute of Computational Perception at Johannes Kepler University Linz and the Austrian Research Institute for Artificial Intelligence describe an AI system that can predict the most likely position in sheet music corresponding to an audio recording, ostensibly outperforming current state-of-the-art image-based score followers in alignment precision.
Score following is the basis for applications like automatic accompaniment, page turning, and synchronizing live performances to visualizations. Existing methods either rely on fixed-size, small snippets of sheet music images or require a computer-readable score representation extracted using optical music recognition. But the researchers' system can uniquely track a full page of sheet music, following musical performances of arbitrary length in an end-to-end fashion.
The team modeled score following as an image segmentation task. Given a musical performance up to a certain point in time, their system predicts a segmentation mask — a small image "piece" — over the score that corresponds to the currently playing music. While trackers that rely only on a fixed-size audio input typically can't distinguish between repeated notes once they fall outside that context window, the proposed system has no difficulty even with passages spanning longer periods of time in the audio, the researchers say.
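To make the segmentation framing concrete, here is a minimal sketch of how a predicted mask over a sheet-music page can be reduced to a single tracking position. The mask here is synthetic and the helper name `mask_to_position` is hypothetical — the paper's actual model is a neural network conditioned on the audio heard so far; only the mask-to-position step is illustrated.

```python
import numpy as np

def mask_to_position(mask: np.ndarray) -> tuple:
    """Reduce a soft segmentation mask (H x W, values in [0, 1])
    to a single (row, col) position on the page via its center of mass."""
    total = mask.sum()
    if total == 0:
        raise ValueError("empty mask: no active region predicted")
    rows, cols = np.indices(mask.shape)
    return (rows * mask).sum() / total, (cols * mask).sum() / total

# Toy "prediction": an active blob centered at row 10, column 40
# on a 64 x 128 rendering of a score page.
mask = np.zeros((64, 128))
mask[8:13, 38:43] = 1.0

row, col = mask_to_position(mask)
print(round(row), round(col))  # -> 10 40
```

Reading out a center of mass rather than a hard argmax is one simple way to turn a soft mask into a stable position estimate; the paper itself does not prescribe this particular reduction.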
In their experiments, the researchers sourced polyphonic piano samples from the Multi-modal Sheet Music Dataset (MSMD), which includes pieces by composers including Bach, Mozart, and Beethoven. After manually identifying and fixing alignment errors, they trained their system on 353 pairs of sheet music and MIDI files.
The coauthors report that their system outperformed all baselines except at the highest threshold, achieving more precise results in terms of time difference (i.e., higher percentages of predictions within tighter error thresholds). It occasionally made errors, which the researchers attribute to the system's freedom to perform "big jumps" across the sheet image. But they assert the experimental results show the system is "very precise" in most contexts.
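The threshold-based evaluation described above can be sketched as follows. The alignment errors below are made-up values for illustration, not the paper's results; the point is only to show how "higher percentages for tighter error thresholds" is computed.

```python
def threshold_accuracy(errors, thresholds):
    """For each threshold t, return the fraction of predictions whose
    absolute alignment error (in seconds) is at most t."""
    n = len(errors)
    return {t: sum(abs(e) <= t for e in errors) / n for t in thresholds}

# Synthetic alignment errors between predicted and true score positions.
errors = [0.02, 0.11, 0.4, 0.06, 1.3, 0.09]
acc = threshold_accuracy(errors, thresholds=[0.05, 0.1, 0.5, 1.0])
print(acc)
```

A tracker is "more precise in terms of time difference" when its accuracy curve is higher at the small thresholds, even if all systems converge near the loosest threshold.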
“Future work will … require testing on scanned or photographed sheet images, to gauge generalization capabilities of the system in the visual domain as well,” the researchers wrote. “The next step towards a system with greater capabilities is to either explicitly or implicitly incorporate a mechanism to handle repetitions in the score as well as in the performance. We assume that the proposed method will be able to acquire this capability quite naturally from properly prepared training data, although we suspect its performance will heavily depend on its implicit encoding of the audio history so far, i. e., how large an auditory context the recurrent network is able to store.”