From raw parquet to an ML-suitable, clean dataset

Reading raw parquet files into raw dataframes

The first order of business is turning the output of the data ETL into a dataframe ready for processing.

read_parquet[source]

read_parquet(path:str)

Reads a multi-file parquet at path, returning a dataframe of three columns.
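
For example (a hedged sketch; "data/parquet" stands in for wherever the ETL step wrote its files):

# sketch — "data/parquet" is a hypothetical output directory of the ETL step
raw_df = read_parquet("data/parquet")
raw_df.head()  # three columns, presumably pitches, durations and velocities (see the tests below)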

Next we need to tokenize the contents and get hold of the token counters.

preprocess[source]

preprocess(df:DataFrame)

Tokenizes pitches and durations, returning the tokenized dataframe along with pitch and duration counters.

# test
from testing import test_eq, path

from neuralmusic.midi import parse_midi_file

df = parse_midi_file(path("data/ff4-airship.mid"))
df_tok, pitch_count, duration_count = preprocess(df)

test_eq(["7.11.2", "7", "7"], list(df_tok["pitches"][0][0:3]))
test_eq(["quarter", "eighth", "eighth"], list(df_tok["durations"][0][0:3]))
test_eq([110, 110, 110], list(df_tok["velocities"][0][0:3]))

test_eq(43, len(pitch_count))
test_eq(6, len(duration_count))

Transforms

When constructing our data source, we'll build transforms that first extract tuples of values (pitches and durations) from each row, and then numericalize both sides in parallel.

to_dual[source]

to_dual(fields)

Returns a transform that will extract fields from a Series in the form of fastai Tuples.
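
A quick sketch of how the transform might be applied (the field names come from the tokenized dataframe above; the exact call pattern is an assumption):

# sketch — extract the (pitches, durations) pair from one row of the tokenized dataframe
dual = to_dual(["pitches", "durations"])
pitches, durations = dual(df_tok.iloc[0])  # a fastai Tuple, one element per requested field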

dual_numericalize[source]

dual_numericalize(vocabs:Collection[Collection[str]])

Returns a transform that will numericalize each side of the tuple constructing a separate vocabulary for each side.
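
Continuing the sketch above, and assuming the vocabularies are simply the ordered keys of the counters returned by preprocess:

# sketch — one vocabulary per side of the tuple, built here from the preprocess counters
pitch_vocab = list(pitch_count.keys())
duration_vocab = list(duration_count.keys())
num = dual_numericalize([pitch_vocab, duration_vocab])
pitch_ids, duration_ids = num((pitches, durations))  # token indices instead of strings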

We also need to make a splitter that will separate our rows according to a certain ratio, by default 0.2.

make_splitter[source]

make_splitter(df:DataFrame, split:float=0.2)

Returns a splitter that acts on the indices of a dataframe. By default it reserves 20% of the data for validation.
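
A sketch of its use (the splitter's exact call signature is an assumption; the point is that it yields disjoint training and validation indices):

# sketch — reserve 20% of the rows for validation
splitter = make_splitter(df_tok, split=0.2)
train_idxs, valid_idxs = splitter(df_tok)  # two disjoint sets of row indices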

DataLoader

We need a slightly customized DataLoader that, instead of loading single sequences of tokens as in traditional language models, loads tuples of sequences (in our case, a sequence of pitches and a sequence of durations) at the same time.

class DualLMDataLoader[source]

DualLMDataLoader(dataset, lens=None, cache=2, bs=64, seq_len=72, num_workers=0, shuffle=False, pin_memory=False, timeout=0, drop_last=False, indexed=None, n=None, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None) :: LMDataLoader

A Language Model data loader that loads tuples of 2 sequences instead of single sequences. It's used to load pitches and durations at the same time.
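
A minimal sketch of what a batch looks like (hedged; train_ds stands in for the training split of the DataSource built below):

# sketch — each batch pairs a pitch tensor with a duration tensor, for inputs and targets alike
dl = DualLMDataLoader(train_ds, bs=64, seq_len=72)
xb, yb = next(iter(dl))  # xb and yb each hold a (pitches, durations) pair of shape (bs, seq_len)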

Finally we bring everything together by creating a DataSource.

data_source[source]

data_source(df:DataFrame, pitch_vocab:Collection[str], duration_vocab:Collection[str], split:float=0.2, dl_type='DualLMDataLoader')

Creates a DataSource ready to become a databunch.
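
A sketch, reusing the vocabularies built from the preprocess counters above:

# sketch — wire the tokenized dataframe and the two vocabularies into a DataSource
dsrc = data_source(df_tok, pitch_vocab, duration_vocab, split=0.2)
# per the docstring, dsrc is now ready to become a databunch; the exact conversion call is not shown here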

Bringing everything together

Now we can obtain a DataBunch ready for training from a bunch of parquet files just like that:

process[source]

process(path:str, batch_size:int, seq_len:int, validation_split:float=0.2)

Turn raw parquet files into a DataBunch ready for training.
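
For example (a hedged sketch; the path and hyperparameters are illustrative):

# sketch — "data/parquet" is the hypothetical ETL output directory used above
data = process("data/parquet", batch_size=64, seq_len=72, validation_split=0.2)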