Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/meta-pytorch/torchcodec/1313
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit aff4a53 with merge base 910005c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
void safeReadFileExact(
    std::ifstream& file,
    Container& buffer,
    uint32_t bytesToRead) {
Some driveby changes:
- Renamed to `safeReadFileExact` to be explicit about this function enforcing that `bytesToRead` must be read.
- All callsites already pass `uint32_t`. Since `bytesToRead` gets cast to `std::streamsize` (a signed type) on line 92, using `uint32_t` is safer.
On this call
safeReadFile(file_, chunkBuffer, bytesToReadThisChunk);
bytesToReadThisChunk is an int64_t.
In general, it's best to stick to simpler types whenever possible. For our own interfaces, we should avoid using stuff like uint32 and just try to stick to int, int64_t, etc.
I've updated most constexpr and function arguments to use int64_t, I hope this isn't going too far in the other direction.
My understanding is that the tradeoff between int and int64_t is that int64_t uses some additional bytes, which is not really significant for this use case. Is that correct?
return reinterpret_cast<const T*>(chunkData.data());
}
This function is used to interpret bytes as type T, similar to safeReadFileExact().
Unlike safeReadFileExact(), we do not want to go through file.read() and copy to a buffer.
Instead, this function validates it is safe to read bytes from memory as type T. This allows us to read the WAV sample data with intData[i], which is faster than safeReadFileExact().
See usage in convertToFloatBuffer().
namespace facebook::torchcodec {

- class WavDecoder {
+ class FORCE_PUBLIC_VISIBILITY WavDecoder {
I'm not completely sure why this is needed, but it is.
It was a pain to debug why my builds started failing with this message:
OSError: /home/dev/torchcodec/src/torchcodec/libtorchcodec_custom_ops8.so: undefined symbol: _ZN8facebook10torchcodec10WavDecoder17getSamplesInRangeEdSt8optionalIdE
It's for the same reasons the other APIs need it too: This WavDecoder class relies on a "hidden" class (the torch::stable::Tensor), which causes the compiler to transitively mark it as "hidden" as well.
But it can't be hidden, because we need to link against it in custom_ops and other places. So we have to explicitly mark it as public to override the compiler's default behavior.
It's messy.
I still owe you a thorough investigation / documentation of all this, which I can only do after #1260 lands, so that'll be just before the 2.12 release.
NicolasHug left a comment:
Thanks @Dan-Flores, I made a first pass
STD_TORCH_CHECK(false, "get_wav_all_samples not yet implemented");
std::tuple<torch::stable::Tensor, double> result =
    wavDecoder->getSamplesInRange(start_seconds, stop_seconds);
return std::get<0>(result);
getSamplesInRange is returning a tuple. Does it need to? It seems like std::get<1> is never used.
Good point, let's align this with AudioDecoder. We should be using the second value in the tuple in the python WavDecoder object to make AudioSamples.
const int64_t bytesReadThisChunk = file_.gcount();

if (bytesReadThisChunk <= 0)
    break;
I'm not sure I understand the reason for not using safeReadFileExact above. That is, why are we OK to not actually read the number of samples we expected to read?
If really this is a behavior we want to support, then let's add a strict (or exact) bool parameter to safeReadFile and always call safeReadFile.
This ensures we return something even if the header had an incorrect dataSize or blockAlign, but I'm not certain it's necessary. It's less confusing to use safeReadFile here, so I'll remove the partial read behavior.
break;

const int64_t samplesInChunk = bytesReadThisChunk / header_.blockAlign;
if (samplesInChunk > 0) {
Can this be false without it being in an error state? I.e. do we need to do anything in the (currently non-existing) else{} block?
This is similar to the check above (bytesReadThisChunk <= 0): in either case the read() call returned fewer bytes than expected for a sample, so we should not process any samples. I'll deduplicate the check and refactor this section to make it more obvious.
const double actualPtsSeconds =
    static_cast<double>(startSample) / header_.sampleRate;
this division is just undoing the multiplication that was done above when startSample was defined?
When we calculate startSample as std::round(startSeconds * header_.sampleRate), the resulting startSample may not begin exactly at startSeconds as it needs to match the start of a sample boundary. I'll add a comment to this effect.
…checks, update chunk reading loop
std::to_string(stopSecondsOptional.value()) + ").");
if (startSeconds == stopSecondsOptional.value()) {
  return AudioFramesOutput{
      torch::stable::empty({header_.numChannels, 0}, kStableFloat32), 0.0};
Do we want to return startSeconds instead of 0 here? We should just try to be consistent with AudioDecoder (see comment below about testing this)
Thanks for catching this, this enables the (0, 0) test to pass.
// reading partial samples. See
// https://github.com/FFmpeg/FFmpeg/blob/0f600cbc16b7903703b47d23981b636c94a41c71/libavformat/wavdec.c#L786-L791
size_t alignedChunkSize = DEFAULT_CHUNK_BUFFER_SIZE;
if (header_.blockAlign > 1) {
Are we in a correct state if this is ever <= 1?
Seems like 0 would be an incorrect state, and 1 could still go through the same logic - so should we remove this if() condition?
STD_TORCH_CHECK(
    numSamples <= INT64_MAX / header_.blockAlign,
    "bytesToRead calculation overflow: numSamples * header_.blockAlign");
const int64_t bytesToRead = numSamples * header_.blockAlign;
I'd still recommend renaming blockAlign to something more descriptive. I think it's fair to say that
`const int64_t bytesToRead = numSamples * header_.numBytesPerSample`
is easier to follow than what we currently have (at least for someone who's not an expert in the wav docs). And we have other instances of that in the code since blockAlign is currently used in a lot of places.
We generally want to prioritize the readability of our own code. I get the point about the wav docs using blockAlign, but we can just add a comment to that effect in our header?
const int64_t bytesToRead = numSamples * header_.blockAlign;

while (totalBytesRead < bytesToRead) {
  const int64_t bytesToReadThisChunk = std::min(
This is a confusing name.
The chunk is the entire array that gets filled over multiple iterations. This bytesToReadThisChunk is not the number of bytes to read in that chunk. It's the number of bytes to read for that one iteration, and many of those will make up the entire chunk.
constexpr uint32_t MAX_FMT_CHUNK_SIZE = 200;
// Soundfile's default chunk size. See
// https://github.com/libsndfile/libsndfile/blob/master/src/common.h#L77
constexpr size_t DEFAULT_CHUNK_BUFFER_SIZE = 8192;
Is "Chunk" the right term to refer to this? My understanding is that it's actually a temporary sample buffer.
A chunk, to me, refers to a unit in the wav format - as in the "header chunk", the "data chunk". But these are different things.
Should this be "DefaultSampleBufferSize"
This is a good point, I'll make this change when I reintroduce the buffered reading optimization.
convertSamplesToFloat(chunkBuffer, samplesInChunk, outputPtr);
samplesProcessed += samplesInChunk;
totalBytesRead += bytesReadThisChunk;
}
The entire loop above: can I suggest we just call torch::from_buffer or something similar and then get a float32 tensor by calling .to(torch.float32) and then scale?
Basically do what the Python code would do. That's only temporary, we'll eventually move it to something more like what you have above, but this would greatly simplify the state of the PR and allow us to do all these optimizations as future improvements.
I agree, to make this PR easier to review we can break it up this way.
Removing the buffered read code eliminates some code you left feedback on, I'll incorporate the feedback when I reintroduce the optimization.
startSeconds <= INT64_MAX / header_.sampleRate,
"startSample calculation would overflow: startSeconds * sampleRate");
const int64_t startSample =
    static_cast<int64_t>(std::round(startSeconds * header_.sampleRate));
On the various checks like:
STD_TORCH_CHECK(..., "... could overflow");
I know you're following my recommendation to be very defensive about user input, and we do want to be super careful about avoiding segfaults, bad buffer/file accesses, etc.
But we're starting to have a lot of those, and we might be reaching a point where we're making the code too difficult to follow. The check above is meant to validate that startSeconds * header_.sampleRate <= INT64_MAX, and since sampleRate is uint32 that can only be false if the user passes an absurdly high startSeconds (in the range of ~60 years!)
Maybe we just.. don't care about checking this? The worst case that may happen seems to be that users will just get garbage as output.
Curious @scotts if you have any recommendation on how to deal with this in a principled way.
STD_TORCH_CHECK(false, "File too small to contain chunk:", chunkId);
}
STD_TORCH_CHECK(
    fileSize >= CHUNK_HEADER_SIZE,
- fileSize >= CHUNK_HEADER_SIZE,
+ fileSize >= static_cast<uint64_t>(CHUNK_HEADER_SIZE),
It's a safe implicit promotion, but with all of the different number types flying around, explicit is better than implicit.
samples = torch::stable::subtract(
    zeros, samples, -scale); // 0 - (-scale) * samples = scale * samples
So there might not be a torch::stable::multiply but I'm hoping the * operator is overloaded - we can't just do samples = samples * scale ? 😲
No, unfortunately not:
/home/dev/torchcodec/src/torchcodec/_core/WavDecoder.cpp: In member function ‘facebook::torchcodec::AudioFramesOutput facebook::torchcodec::WavDecoder::getSamplesInRange(double, std::optional<double>)’:
/home/dev/torchcodec/src/torchcodec/_core/WavDecoder.cpp:350:19:
error: no match for ‘operator*’ (operand types are ‘torch::stable::Tensor’ and ‘const double’)
350 | samples = samples * scale;
| ~~~~~~~ ^ ~~~~~
| | |
| | const double
| torch::stable::Tensor
header_.dataOffset <= static_cast<uint64_t>(INT64_MAX - byteOffset),
"dataPosition calculation would overflow: dataOffset + byteOffset ");
const int64_t dataPosition =
    static_cast<int64_t>(header_.dataOffset) + byteOffset;
Nit: just define byteOffset only or dataPosition only, and do something like
int64_t byteOffset = startSample * header_.numBytesPerSample;
byteOffset += static_cast<int64_t>(header_.dataOffset);
It avoids the extra temporary variable.
// Convert to [channels, samples]
samples = torch::stable::transpose(samples, 0, 1);

// We return the actual sample start time, not the requested startSeconds.
let's just clarify that it should be the same as startSeconds up to rounding.
std::array<uint8_t, CHUNK_HEADER_SIZE> chunkHeader;
safeReadFile(file_, chunkHeader, CHUNK_HEADER_SIZE);
// Read chunk size which immediately follows the chunk ID
uint32_t chunkSize = readValue<uint32_t>(chunkHeader, 4);
Unrelated to this PR, I'm surprised that readValue() does a memcpy(). If we're always using it to read integer values, we should be able to read the values in-place using a reinterpret_cast. Getting that right is tricky, though, so that optimization should be a follow-up.
Actually, related to this PR, part of the difficulty is that we're using so many different kinds of integer types. Sometimes we're forced to, though, so I want to better understand. Is it the WAV standard that we have to read these values as uint32_t? That is, is it defined somewhere that the chunk size is stored as a 32-bit unsigned integer?
Edit: Yes, seems so: https://www.mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
It's recommended to use memcpy() for type punning. The data we copy is fairly small, so I don't think there's a big performance loss with this copy.
}
// Skip this chunk and continue searching (odd chunks are padded)
STD_TORCH_CHECK(
    static_cast<uint64_t>(chunkSize) <= UINT64_MAX - CHUNK_HEADER_SIZE - 1,
I don't get this check. Are you trying to establish that adding CHUNK_HEADER_SIZE to chunkSize won't overflow? Why? That is, if you wanted to establish that chunkSize + CHUNK_HEADER_SIZE would not overflow, I think (emphasis on I think), it would be:
static_cast<uint64_t>(chunkSize) <= static_cast<uint64_t>(UINT32_MAX) - static_cast<uint64_t>(CHUNK_HEADER_SIZE)
That is, because chunkSize is a uint32_t, we need to check it against UINT32_MAX, not UINT64_MAX. I'm not sure on the + or - 1 part.
I also may be missing something, so please don't assume I'm correct. But even if we implemented it with the right maxes, I don't think it's protecting against anything? We are adding them together on line 261, but that's all in 64-bit integers. This is the kind of check you'd do if you were also going to do something like chunkSize += CHUNK_HEADER_SIZE, but I don't see that.
This check was redundant, it's been removed in the latest version of this PR.
CHUNK_HEADER_SIZE + static_cast<uint64_t>(chunkSize) + (chunkSize % 2);
STD_TORCH_CHECK(
    startPos <= UINT64_MAX - numBytesToSkip,
    "File position arithmetic would overflow");
Note that this is correct, I think, because startPos is a 64-bit number.
sizeof(typename Container::value_type) == 1,
"Container value_type must be a 1-byte type for safe byte access");
// readValue is always called with positive offset, so casting to size_t is
// safe
Why make a comment when you can make an actual assertion? :) I agree with @NicolasHug that there's a danger to drowning in too many checks, but in this case, a comment and an assertion "cost" the reader about the same.
STD_TORCH_CHECK(offset >= 0, "offset ", offset, " is not positive");
Functions don't get to control how they get called, even when both are written by the same person. :)
template <typename T, typename Container>
- T readValue(const Container& data, size_t offset) {
+ T readValue(const Container& data, int64_t offset) {
Nit on the names of the types here: when I see typename Container, I assume it's going to behave at least something like a member of the Containers library: https://en.cppreference.com/w/cpp/container.html
I think we can make things more clear by naming these types mechanically:
template <typename InType, typename OutType>
OutType readValue(const InType& data, int64_t offset);
I got here because I read line 54, and I wasn't sure which direction T was going in.
offset);
return std::memcmp(data + offset, expected.data(), 4) == 0;
// matchesFourCC is always called with non-negative offset, so casting to
// size_t is safe
Ditto on my point about asserting non-negative. An assertion is easier to understand than a comment. And asserting that an integer is non-negative is easy to read. (I'm going to stop making this point, as it applies in a bunch of places.)
One way to deal with assertion noise is just don't provide a message. :) If you think something is rare, and unlikely to ever be helpful to users, just use the default message. Then the code becomes much more readable:
STD_TORCH_CHECK(offset >= 0);
STD_TORCH_CHECK(dataSize >= 4 && offset <= dataSize - 4);
It's often not the assertion itself that makes it hard to follow the code, but trying to create a nice string. So skip the string. :)
Thanks for the suggestion, I'll apply this to the various size_t conversions where I left a comment.
I can see how the check offset >= 0 indicates something similar to my comment: "all callsites are non-negative", while also enforcing it in the future.
If you think something is rare, and unlikely to ever be helpful to users, just use the default message
Since this function is only ever called by us, I'm positive this error will not occur.
I also suspect many of the overflow/underflow checks will be quite rare, but since those are user facing, perhaps we should keep the longer error strings in those cases.
This PR adds plumbing to enable decoding 32-bit PCM WAV files with get_samples_played_in_range / getSamplesInRange(): the sample data is read via from_blob and converted to float32.

Benchmark:
We will improve performance in a later optimization PR! [Benchmark code](https://www.internalfb.com/phabricator/paste/view/P2252086087)

Audio Decoding Benchmark
File: test/resources/sine_stereo_s32_10min.wav
Warmup: 3, Trials: 10

| Decoder | Time |
| --- | --- |
| AudioDecoder | 0.207s |
| WavDecoder | 0.451s |
| SoundFile | 0.145s |