From raw parquet to an ML-suitable, clean dataset

Reading raw parquet files into raw dataframes

The first order of business is turning the output of the data ETL into a dataframe ready for processing.

read_parquet[source]

read_parquet(path:str)

Reads a multi-file parquet at path, returning a dataframe of three columns.
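
For example (a hedged sketch; "data/parquet" stands in for wherever the ETL step wrote its files):

# sketch — "data/parquet" is a hypothetical output directory of the ETL step
raw_df = read_parquet("data/parquet")
raw_df.head()  # three columns, presumably pitches, durations and velocities (see the tests below)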

Next we need to tokenize the contents and get hold of the token counters.

preprocess[source]

preprocess(df:DataFrame)

Tokenizes pitches and durations, returning the tokenized dataframe along with pitch and duration counters.

# test
from testing import test_eq, path

from neuralmusic.midi import parse_midi_file

df = parse_midi_file(path("data/ff4-airship.mid"))
df_tok, pitch_count, duration_count = preprocess(df)

test_eq(["7.11.2", "7", "7"], list(df_tok["pitches"][0][0:3]))
test_eq(["quarter", "eighth", "eighth"], list(df_tok["durations"][0][0:3]))
test_eq([110, 110, 110], list(df_tok["velocities"][0][0:3]))

test_eq(43, len(pitch_count))
test_eq(6, len(duration_count))

Transforms

When constructing our data source, we'll build transforms that first extract tuples of values (pitches and durations) from each row, and then numericalize both sides in parallel.

to_dual[source]

to_dual(fields)

Returns a transform that will extract fields from a Series in the form of fastai Tuples.
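
A quick sketch of how the transform might be applied (the field names come from the tokenized dataframe above; the exact call pattern is an assumption):

# sketch — extract the (pitches, durations) pair from one row of the tokenized dataframe
dual = to_dual(["pitches", "durations"])
pitches, durations = dual(df_tok.iloc[0])  # a fastai Tuple, one element per requested field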

dual_numericalize[source]

dual_numericalize(vocabs:Collection[Collection[str]])

Returns a transform that will numericalize each side of the tuple constructing a separate vocabulary for each side.
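
Continuing the sketch above, and assuming the vocabularies are simply the ordered keys of the counters returned by preprocess:

# sketch — one vocabulary per side of the tuple, built here from the preprocess counters
pitch_vocab = list(pitch_count.keys())
duration_vocab = list(duration_count.keys())
num = dual_numericalize([pitch_vocab, duration_vocab])
pitch_ids, duration_ids = num((pitches, durations))  # token indices instead of strings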

We also need to make a splitter that will separate our rows according to a certain ratio, by default 0.2.

make_splitter[source]

make_splitter(df:DataFrame, split:float=0.2)

Returns a splitter that acts on the indices of a dataframe. By default it reserves 20% of the data for validation.
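
A sketch of its use (the splitter's exact call signature is an assumption; the point is that it yields disjoint training and validation indices):

# sketch — reserve 20% of the rows for validation
splitter = make_splitter(df_tok, split=0.2)
train_idxs, valid_idxs = splitter(df_tok)  # two disjoint sets of row indices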

DataLoader

We need a slightly customized DataLoader that, instead of loading single sequences of tokens as in traditional language models, loads tuples of sequences (in our case, a sequence of pitches and a sequence of durations) at the same time.

class DualLMDataLoader[source]

DualLMDataLoader(dataset, lens=None, cache=2, bs=64, seq_len=72, num_workers=0, shuffle=False, pin_memory=False, timeout=0, drop_last=False, indexed=None, n=None, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None) :: LMDataLoader

A Language Model data loader that loads tuples of 2 sequences instead of single sequences. It's used to load pitches and durations at the same time.
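
A minimal sketch of what a batch looks like (hedged; train_ds stands in for the training split of the DataSource built below):

# sketch — each batch pairs a pitch tensor with a duration tensor, for inputs and targets alike
dl = DualLMDataLoader(train_ds, bs=64, seq_len=72)
xb, yb = next(iter(dl))  # xb and yb each hold a (pitches, durations) pair of shape (bs, seq_len)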

Finally we bring everything together by creating a DataSource.

data_source[source]

data_source(df:DataFrame, pitch_vocab:Collection[str], duration_vocab:Collection[str], split:float=0.2, dl_type='DualLMDataLoader')

Creates a DataSource ready to become a databunch.
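
A sketch, reusing the vocabularies built from the preprocess counters above:

# sketch — wire the tokenized dataframe and the two vocabularies into a DataSource
dsrc = data_source(df_tok, pitch_vocab, duration_vocab, split=0.2)
# per the docstring, dsrc is now ready to become a databunch; the exact conversion call is not shown here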

Bringing everything together

Now we can obtain a DataBunch ready for training from a bunch of parquet files just like that:

process[source]

process(path:str, batch_size:int, seq_len:int, validation_split:float=0.2)

Turn raw parquet files into a DataBunch ready for training.
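
For example (a hedged sketch; the path and hyperparameters are illustrative):

# sketch — "data/parquet" is the hypothetical ETL output directory used above
data = process("data/parquet", batch_size=64, seq_len=72, validation_split=0.2)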