Irish Traditional Music Test Datasets for Machine Learning

These datasets are intended for use by data scientists testing machine learning (ML) models. My intention is to support development of classification models that perform well for Irish traditional dance music, which will then support good research on the musical characteristics of this music.

The classification challenge supported here is the classic Cover Song Identification (CSI) problem – "what tune is that?" – but focused on the barely researched area of CSI within a particular folk musical culture. To reduce this problem to a more tractable sub-task, these test datasets focus on one particular ML challenge: matching a human expert's ability to distinguish specific Irish reels from each other.

As background for researchers unfamiliar with this particular challenge, see some of my data analysis about this problem:

Hugging Face Hosting

As of 18 November 2025, these testsets are also hosted at https://huggingface.co/datasets/alanngnet/irish-tunes-csi.

Data structure

For detailed explanations of the metadata files included here, see this CoverHunterMPS documentation. In these datasets, the "work" identifier for each sample is the (Tunography) Tune ID#, also called the "Ng number."

Two different variants, each generated with different CQT parameters, are shared here. The "main" dataset used the default CoverHunter settings:

The "melody" variant used settings better suited for identifying melodic content:

License for your use of these datasets

CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International

Download and Use with CoverHunterMPS

These datasets were prepared and tested for use with the tools/train.py and tools/eval_testset.py scripts in the CoverHunterMPS project, but any model that relies on CQT (Constant-Q Transform) features could use these equally well.

The test datasets on this page can also be used during model training runs to report how well your model performs against them. To do so, install them and then un-comment the corresponding reels50hard, reels50easy, and reels50transpose lines in the training-related hparams.yaml file of your CoverHunterMPS project.

Place the testsets you want in the "data" folder of your CoverHunterMPS project, and unzip them there.

  1. Reels50easy "main": reels50easy.96bins.zip (103 MB)
  2. Reels50easy "melody": reels50easy.60bins.zip (65 MB)
  3. Reels50hard "main": reels50hard.96bins.zip (164 MB)
  4. Reels50hard "melody": reels50hard.60bins.zip (104 MB)
  5. Reels50transpose "main": reels50transpose.96bins.zip (83 MB)
  6. Reels50transpose "melody": reels50transpose.60bins.zip (52 MB)
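The unzip step can also be scripted. The following is a minimal sketch using only the Python standard library. The helper name `unzip_testsets` is hypothetical and not part of CoverHunterMPS, and it assumes the downloaded archives already sit in the folder you pass in.

```python
import zipfile
from pathlib import Path


def unzip_testsets(data_dir):
    """Extract every .zip archive found in data_dir into data_dir.

    Hypothetical helper for illustration; not part of CoverHunterMPS.
    Returns the sorted file names of the archives that were extracted.
    """
    data_dir = Path(data_dir)
    extracted = []
    for archive in sorted(data_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_dir)  # unzip in place, next to the archive
        extracted.append(archive.name)
    return extracted
```

Calling `unzip_testsets("data")` from the root of your CoverHunterMPS project would extract whichever of the six archives you placed there.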

Example of using the eval_testset.py script to evaluate a pretrained model with one of these test datasets:

python3 -m tools.eval_testset \
	training/yourmodel \
	data/reels50easy_testset/full.txt \
	data/reels50easy_testset/full.txt \
	-plot_name="tSNE.png"

Reels50easy

Fifty reels, with exactly three performances of each reel. None of the reels and none of the performances are also included in Reels50hard. All performances of each tune share the same 8-bar structure. No two tunes are musically similar to each other, meaning that none has an associated "(compare ...)" comment in the Tunography.
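A quick sanity check after installing this testset is to count performances per work in its metadata. The sketch below assumes that full.txt holds one JSON object per line and that each object carries a "work" key. Both are assumptions about the metadata format, so check them against the CoverHunterMPS documentation. The function name is hypothetical.

```python
import json
from collections import Counter


def performances_per_work(lines, work_key="work"):
    """Count how many metadata lines belong to each work.

    Assumes one JSON object per line containing a `work_key` field;
    both are assumptions about the metadata format, not confirmed here.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            counts[json.loads(line)[work_key]] += 1
    return counts
```

Passing the open full.txt file object to this function should, for Reels50easy, yield 50 distinct works with a count of 3 each.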

Changelog:

  1. First published here in April 2024.
  2. 21 August 2024: Added "melody" variant with higher and narrower frequency range. Archived copy of this deprecated version: Reels50easy "main" reels50easy_testset.zip (103 MB) and Reels50easy "melody" r50e_testset.b60.zip (65 MB)
  3. 18 November 2025: No significant change. Renamed folders for readability and consistency with the other testsets, and re-generated all the CQTs anyway to keep them consistent with the other testsets, in case low-level changes in the Tunography audio or Python versions cause any slight numerical differences.

Reels50hard

Fifty reels, with about five different performances of each reel, selected to be particularly challenging for machine-learning models. Details of each performance, including the reasons for selecting each reel and each performance, are described in the Google sheet Reels50hard dataset.

Changelog:

  1. First published here in April 2024.
  2. 21 August 2024:
    1. Replacement dataset posted after discovering and fixing human typos in the above Reels50hard Google sheet which led to missing data in the testset. Dataset is now complete, otherwise no changes made to the intended recordings and their labels.
    2. Added "melody" variant with higher and narrower frequency range.
    3. Archived copy of this deprecated version: Reels50hard "main": reels50hard_testset.zip (165 MB) and Reels50hard "melody": r50h_testset.b60.zip (104 MB).
  3. 18 November 2025: Replaced three samples and renamed the folders for better readability and consistency with the other testsets.
    1. Performance PCny.8_2 for tune #590 was replaced with CMcN.6_2 because the PCny performance had been reclassified to a better-matching tune.
    2. Performance JyHenry.11_2 for tune #187 was replaced with MR+8.17_1 for the same reason.
    3. Performance RogRd.6_2 for tune #543 was replaced with ShMulch.8_1 for the same reason.

Reels50transpose

Fifty reels, selected to focus on the challenge of recognizing musical identity across transpositions larger than a few semitones. This dataset overlaps as much as possible with the Reels50hard testset, to minimize the reduction of available training data and to leverage the general difficulty of the Reels50hard combinations. In this set I also trimmed the performances per work down to pairs where possible, so that the test metrics reveal as clearly as possible whether the model can handle transposition. Details of each performance, including the reasons for selecting each reel and each performance, are described in the Google sheet Reels50transpose testset.
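As a rough illustration of why transposition matters for CQT-based models: CQT bins are spaced geometrically in frequency, so transposing a tune by k semitones moves its energy along the frequency axis by about k times the number of bins per semitone. The toy sketch below assumes a hypothetical one-bin-per-semitone layout and a single made-up frame. It does not reflect the actual CQT settings of these datasets.

```python
def transpose_frame(frame, semitones, bins_per_semitone=1):
    """Shift one CQT frame (a list of bin magnitudes) along the frequency axis.

    Toy model: energy shifted past either edge is dropped and the vacated
    bins are filled with zeros. bins_per_semitone=1 is an assumption made
    for illustration only.
    """
    shift = semitones * bins_per_semitone
    n = len(frame)
    out = [0.0] * n
    for i, value in enumerate(frame):
        j = i + shift
        if 0 <= j < n:  # keep only energy that stays inside the frame
            out[j] = value
    return out
```

A model that only tolerates shifts of a bin or two would treat a frame and its five-semitone transposition as unrelated, which is the failure this testset is meant to expose.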

Changelog:

  1. First published here 25 October 2024, only in the "melody" variant. Archived copy of this deprecated version: r50t_testset.b60.zip (52 MB)
  2. Revised 17 November 2025 because one of these performances had been reclassified to a better-matching work, so I substituted a different performance (for tune ID #187, replaced JyHenry.11_2 with MR+8.17_1). Also added the "main", 96-bin variant. Also renamed the folders for better readability.