These datasets are intended for data scientists testing machine-learning (ML) models. My goal is to support the development of classification models that perform well on Irish traditional dance music, which in turn will enable better research into the musical characteristics of this repertoire.
The classification challenge supported here is the classic CSI (Cover Song Identification) problem – "what tune is that?" – but focused on the little-researched area of CSI within a particular folk music culture. To break this problem into a more tractable sub-task, these test datasets focus on the specific ML challenge of matching a human expert's ability to distinguish individual Irish reels from one another.
As background for researchers unfamiliar with this challenge, see my data analysis of this problem:
Data structure
For detailed explanations of the metadata files included here, see this CoverHunterMPS documentation. In these datasets, the "work" identifier for each sample is the irishtune.info (Tunography) Tune ID#, also known as the "Ng number."
Two variants, each generated with different CQT parameters, are shared here. The "main" dataset used the default CoverHunter settings:
- Sample rate: 16 kHz
- Hop size: 0.04 sec
- Octave resolution: 12 steps
- Minimum frequency: 32 Hz
- 96 frequency bins
The "melody" variant used settings better suited for identifying melodic content:
- Sample rate: 16 kHz
- Hop size: 0.04 sec
- Octave resolution: 12 steps
- Minimum frequency: 170 Hz
- 60 frequency bins
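For orientation, here is a minimal sketch of what these settings correspond to in librosa. This is an illustration under assumptions, not the code that produced the datasets (those were built with CoverHunterMPS's own feature-extraction tooling), and the audio file name is a placeholder.

# Illustration only: librosa is assumed here to show what the CQT settings
# above mean; "some_reel.wav" is a placeholder file name.
import librosa
import numpy as np

SR = 16000                # sample rate: 16 kHz
HOP = int(0.04 * SR)      # hop size: 0.04 s -> 640 samples at 16 kHz

y, _ = librosa.load("some_reel.wav", sr=SR)

# "melody" variant: 60 bins, 12 bins per octave, starting at 170 Hz
cqt_melody = np.abs(
    librosa.cqt(y, sr=SR, hop_length=HOP,
                fmin=170.0, n_bins=60, bins_per_octave=12)
)
print(cqt_melody.shape)   # (60, number_of_frames)

# The "main" variant differs only in fmin=32.0 and n_bins=96. At a 16 kHz
# sample rate those settings place the top bin close to the Nyquist
# frequency, which librosa's CQT may refuse; for that variant, treat the
# CoverHunterMPS feature-extraction scripts as the authoritative reference.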
License for your use of these datasets
CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International
Download and Use with CoverHunterMPS
These datasets were prepared and tested for use with the tools/train.py and tools/eval_testset.py scripts in the CoverHunterMPS project, but any model that relies on CQT (Constant-Q Transform) features could use them equally well.
The reels50hard and reels50easy testsets can also be used during model training runs to track how well your model performs against them: install them, then un-comment the corresponding lines in the training-related hparams.yaml file of your CoverHunterMPS project.
Place the testsets you want in the "data" folder of your CoverHunterMPS project and unzip them there (a minimal extraction sketch follows the download list below).
- Reels50hard "main": reels50hard_testset.zip (165 MB)
- Reels50hard "melody": r50h_testset.b60.zip (104 MB)
- Reels50easy "main": reels50easy_testset.zip (103 MB)
- Reels50easy "melody": r50e_testset.b60.zip (65 MB)
- Reels50transpose "melody": r50t_testset.b60.zip (52 MB)
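If you prefer a scripted version of the "place and unzip" step above, here is a minimal sketch; the paths and file names are assumptions, so adjust them to wherever you cloned CoverHunterMPS and downloaded the zip files.

import zipfile
from pathlib import Path

# Assumed location of your CoverHunterMPS clone's data folder.
data_dir = Path("CoverHunterMPS/data")
data_dir.mkdir(parents=True, exist_ok=True)

# Extract whichever testset zips you downloaded into the data folder.
for name in ["reels50easy_testset.zip", "reels50hard_testset.zip"]:
    with zipfile.ZipFile(name) as zf:
        zf.extractall(data_dir)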
Example of using the eval_testset.py script to evaluate a pretrained model with one of these test datasets:
python3 -m tools.eval_testset training/yourmodel data/reels50easy_testset/full.txt data/reels50easy_testset/full.txt -plot_name="tSNE.png"
Reels50hard
Fifty reels, with about five different performances of each reel, selected to be particularly challenging for machine-learning models. Details of each performance, including the reasons for selecting each reel and each performance, are described in the Google sheet Reels50hard dataset.
Changelog:
- First published here in April 2024.
- 21 August 2024:
- Replacement dataset posted after discovering and fixing typos in the Reels50hard Google sheet above, which had led to missing data in the testset. The dataset is now complete; no other changes were made to the intended recordings and their labels.
- Added "melody" variant with higher and narrower frequency range.
Reels50easy
Fifty reels, with exactly three performances of each reel. No reel or performance in this set also appears in Reels50hard. All performances of each tune share the same 8-bar structure. No tune in this set is musically similar to another, as defined by the absence of an associated "(compare ...)" comment in the Tunography.
Changelog:
- First published here in April 2024.
- 21 August 2024: Added "melody" variant with higher and narrower frequency range.
Reels50transpose
Fifty reels, selected to focus on the challenge of recognizing musical identity across transpositions larger than a few semitones. This dataset overlaps as much as possible with the Reels50hard testset, both to minimize the reduction of available training data and to leverage the general difficulty of the Reels50hard combinations. In this set I also trimmed the performances per work down to pairs where possible, so that the test metrics reveal as directly as possible whether the model can handle transposition. Details of each performance, including the reasons for selecting each reel and each performance, are described in the Google sheet Reels50transpose testset.
Changelog:
- First published here 25 October 2024, only in the "melody" variant.