Turning a bunch of MIDI files into parquet data
Untar'ing the file¶
The first step is to untar the file containing the MIDI files.
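As a rough illustration of this step, here is a minimal sketch using only the standard library. The function name `untar` and the choice to return the extracted `.mid`/`.midi` paths are assumptions made for this example, not necessarily what the notebook's own task does.

```python
import tarfile
from pathlib import Path


def untar(tar_gz_path, outdir):
    """Extract a .tar.gz archive into outdir and return the MIDI files found there.

    Hypothetical helper for illustration only. Archives from untrusted
    sources should be checked before extraction.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_gz_path, "r:gz") as archive:
        archive.extractall(out)
    # Keep only MIDI files, in a deterministic order.
    midi_suffixes = {".mid", ".midi"}
    return sorted(p for p in out.rglob("*") if p.suffix.lower() in midi_suffixes)
```

Returning a sorted list keeps the later partitioning step reproducible across runs.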
Partitioning the files into minibatches¶
Since the tar.gz file may contain a huge number of MIDI files, we'll partition those files into minibatches that we can process in parallel.
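One simple way to build the minibatches is to slice the list of files into fixed-size chunks. The sketch below assumes the files are already available as a list; the helper name `partition` is hypothetical, while `partition_size` mirrors the configuration key used in the tests further down.

```python
def partition(items, partition_size):
    """Split items into consecutive chunks of at most partition_size elements."""
    if partition_size < 1:
        raise ValueError("partition_size must be at least 1")
    return [
        items[start:start + partition_size]
        for start in range(0, len(items), partition_size)
    ]


# The last chunk may be shorter than the others.
batches = partition(["a.mid", "b.mid", "c.mid", "d.mid", "e.mid"], 2)
```

With `partition_size=1`, as in the tests below, every file ends up in its own minibatch.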
Processing a minibatch¶
For each minibatch, we'll go through its MIDI files, parse them, and write them to a separate Parquet file.
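The following sketch shows one possible shape for this step. It is not the notebook's implementation: `parse_minibatch`, `write_minibatch`, and the `parse_midi` callable are hypothetical names, and each parsed song is assumed to be a flat dictionary of column values. Counting files that fail to parse corresponds to the `malformed_songs` statistic checked in the tests.

```python
def parse_minibatch(paths, parse_midi):
    """Parse every file in a minibatch, skipping the ones that fail.

    parse_midi is a caller-supplied function (an assumption of this sketch)
    that turns one MIDI path into a dict of column values.
    Returns the parsed rows and the number of malformed files.
    """
    rows, malformed = [], 0
    for path in paths:
        try:
            rows.append(parse_midi(path))
        except Exception:
            # A file that cannot be parsed is counted and skipped,
            # so one bad song does not abort the whole minibatch.
            malformed += 1
    return rows, malformed


def write_minibatch(rows, out_path):
    """Write the parsed rows of one minibatch to its own Parquet file."""
    import pandas as pd  # imported lazily; needs pandas and fastparquet installed

    pd.DataFrame(rows).to_parquet(out_path, engine="fastparquet", index=False)
```

Keeping parsing and writing separate makes the parsing logic easy to test without touching the file system.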
Merging the Parquet files¶
Once each minibatch has been written to its own Parquet file, merging those files into a single dataset is trivial.
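One way to do the merge, assuming fastparquet is the Parquet library in use (the tests below read the result with it), is `fastparquet.writer.merge`, which writes a shared `_metadata` file so that the directory can be read as one logical dataset. The helper names `list_parts` and `merge_parts`, and the `*.parquet` naming of the per-minibatch files, are assumptions of this sketch.

```python
from pathlib import Path


def list_parts(outdir):
    """Return the per-minibatch Parquet files in outdir, in a stable order."""
    return sorted(str(p) for p in Path(outdir).glob("*.parquet"))


def merge_parts(outdir):
    """Combine the part files in outdir into a single logical dataset."""
    from fastparquet import writer  # imported lazily; needs fastparquet installed

    parts = list_parts(outdir)
    # merge() leaves the part files in place and adds a _metadata file next to them.
    writer.merge(parts)
    return parts
```

After merging, `fastparquet.ParquetFile(outdir)` can load the whole directory at once, which is how the tests below read the output.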
Putting everything together¶
Now we can build the ETL flow!
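To make the overall shape concrete, here is a toy stand-in for such a flow, written in plain Python. `ToyFlow` is a hypothetical class and does not reproduce `build_etl`; it only illustrates the idea of partitioning the inputs and mapping a processing function over the minibatches in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyFlow:
    """Hypothetical stand-in for an ETL flow object exposing run()."""

    def __init__(self, files, partition_size, process):
        self.files = list(files)
        self.partition_size = partition_size
        self.process = process  # called once per minibatch

    def run(self):
        # Partition the inputs into minibatches.
        batches = [
            self.files[start:start + self.partition_size]
            for start in range(0, len(self.files), self.partition_size)
        ]
        # Process the minibatches in parallel; map() preserves their order.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(self.process, batches))


# Example: "processing" a minibatch here just counts its files.
sizes = ToyFlow(["a.mid", "b.mid", "c.mid"], partition_size=2, process=len).run()
```

A real flow would replace `len` with a function that parses the minibatch and writes its Parquet file, followed by a final merge step.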
Testing the ETL¶
The ETL accepts a tar.gz file containing MIDI files as input:
# test
import time  # needed for time.time() below
from testing import test_eq, path
from omegaconf import OmegaConf
import fastparquet
tmp_path = "/tmp/neuralmusic_etl"
targz_path = path("data/midi.tar.gz")
dot_list = [f"tar_gz_path={targz_path}", f"outdir={tmp_path}", "partition_size=1"]
etl_cfg = OmegaConf.from_dotlist(dot_list)
flow = build_etl(etl_cfg)
init_stats()
started_at = time.time()
flow.run()
test_eq(4, total_songs)
test_eq(0, malformed_songs)
test_eq(4, valid_songs)
test_eq(4, valid_rows)
df = fastparquet.ParquetFile(tmp_path, verify=True).to_pandas()
test_eq(4, len(df))
# TODO: figure out order!
# test_eq(["7.11.2", "7", "7"], pitches[0:3])
# test_eq([1.75, 0.5, 0.5], durations[0:3])
# test_eq([110, 110, 110], velocities[0:3])
It also accepts a path to a folder with MIDI files:
# test
import time  # needed for time.time() below
from testing import test_eq, path
from omegaconf import OmegaConf
import fastparquet
tmp_path = "/tmp/neuralmusic_etl"
midi_path = path("data")
dot_list = [f"midi_path={midi_path}", f"outdir={tmp_path}", "partition_size=1"]
etl_cfg = OmegaConf.from_dotlist(dot_list)
flow = build_etl(etl_cfg)
init_stats()
started_at = time.time()
flow.run()
test_eq(4, total_songs)
test_eq(0, malformed_songs)
test_eq(4, valid_songs)
test_eq(4, valid_rows)
df = fastparquet.ParquetFile(tmp_path, verify=True).to_pandas()
test_eq(4, len(df))
# TODO: figure out order!
# test_eq(["7.11.2", "7", "7"], pitches[0:3])
# test_eq([1.75, 0.5, 0.5], durations[0:3])
# test_eq([110, 110, 110], velocities[0:3])