Reading raw parquet files into raw dataframes¶
The first order of business is turning the output of the data ETL into a dataframe ready for processing.
Next we need to tokenize its contents and collect frequency counters for the pitch and duration tokens.
# test: parse a MIDI file, preprocess it, and check the resulting tokens and counters
from testing import test_eq, path
from neuralmusic.midi import parse_midi_file
df = parse_midi_file(path("data/ff4-airship.mid"))
df_tok, pitch_count, duration_count = preprocess(df)
test_eq(["7.11.2", "7", "7"], list(df_tok["pitches"][0][0:3]))
test_eq(["quarter", "eighth", "eighth"], list(df_tok["durations"][0][0:3]))
test_eq([110, 110, 110], list(df_tok["velocities"][0][0:3]))
test_eq(43, len(pitch_count))
test_eq(6, len(duration_count))
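For orientation, here is a heavily simplified sketch of what a function with preprocess's signature could do. It is not the project's implementation; it assumes the parsed dataframe already carries per-row lists of pitch and duration tokens (as the test above suggests), and it only builds the two counters.
# Hypothetical sketch only; the real preprocess lives in the neuralmusic codebase.
from collections import Counter
import pandas as pd

def preprocess_sketch(df: pd.DataFrame):
    "Return the tokenized dataframe plus frequency counters for pitches and durations."
    pitch_count, duration_count = Counter(), Counter()
    for pitches, durations in zip(df["pitches"], df["durations"]):
        pitch_count.update(pitches)       # pitch tokens look like "7.11.2", "7" (see the test above)
        duration_count.update(durations)  # duration tokens look like "quarter", "eighth"
    return df, pitch_count, duration_count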
Transforms¶
When constructing our data source, we'll build transforms that first extract a tuple of values (pitches and durations) from each row, and then numericalize both sequences in parallel.
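To make the idea concrete, here is a rough sketch written as plain callables rather than the project's actual fastai Transforms; the vocab construction and function names are assumptions, but pitch_count and duration_count are the counters returned by preprocess above.
# Hypothetical sketch: plain callables standing in for the real transforms.
def row_to_tuple(row):
    "Turn one dataframe row into a (pitches, durations) tuple of token lists."
    return row["pitches"], row["durations"]

def make_numericalizer(counter):
    "Build a token -> index mapping from a frequency counter."
    vocab = {tok: i for i, tok in enumerate(sorted(counter))}
    return lambda tokens: [vocab[t] for t in tokens]

num_pitch = make_numericalizer(pitch_count)
num_dur   = make_numericalizer(duration_count)

def numericalize_pair(pair):
    "Numericalize the pitch and duration sequences in parallel."
    pitches, durations = pair
    return num_pitch(pitches), num_dur(durations)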
We also need a splitter that separates our rows into training and validation sets according to a given ratio, 0.2 by default.
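A minimal sketch of such a splitter, assuming it returns two lists of row indices (training and validation) the way fastai splitters do; the function name is made up for illustration.
# Hypothetical splitter sketch: shuffle row indices and split by ratio.
import random

def split_rows(df, valid_pct=0.2, seed=42):
    "Return (train_idxs, valid_idxs) for the rows of df."
    idxs = list(range(len(df)))
    random.Random(seed).shuffle(idxs)
    cut = int(len(idxs) * valid_pct)
    return idxs[cut:], idxs[:cut]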
DataLoader¶
We need a slightly customized DataLoader: instead of loading a single sequence of tokens, as in traditional language models, it loads tuples of sequences (in our case, a sequence of pitches and a sequence of durations) at the same time, as sketched below.
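To make the batch shape concrete, here is a hedged sketch of the collation step such a DataLoader performs, written with plain PyTorch rather than the project's actual DataLoader subclass; the padding id and function name are assumptions.
# Hypothetical collation sketch: each item is a (pitch_ids, duration_ids) tuple.
import torch

def collate_tuples(items, pad_id=0):
    "Stack a batch of (pitches, durations) index sequences, padding to the longest."
    max_len = max(len(p) for p, _ in items)
    def pad(seq):
        return seq + [pad_id] * (max_len - len(seq))
    pitches   = torch.tensor([pad(p) for p, _ in items])
    durations = torch.tensor([pad(d) for _, d in items])
    return pitches, durations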
Finally we bring everything together by creating a DataSource.
Bringing everything together¶
Now we can obtain a DataBunch ready for training straight from a directory of parquet files.
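As an end-to-end sketch of that last step: the helper name and the folder path below are hypothetical, and the final DataSource/DataBunch call is left as a comment because the exact fastai API it wraps isn't shown here.
# Hypothetical end-to-end sketch, not the project's verbatim code.
import pandas as pd
from pathlib import Path

def load_parquet_rows(folder):
    "Concatenate every parquet file in a folder into one raw dataframe."
    files = sorted(Path(folder).glob("*.parquet"))
    return pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)

raw_df = load_parquet_rows("data/parsed")  # hypothetical path to the ETL output
df_tok, pitch_count, duration_count = preprocess(raw_df)
# The tokenized rows, vocabularies, transforms and splitter then feed the
# DataSource, which in turn produces the DataBunch used for training.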