Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/meta-pytorch/torchcodec/1313
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit aff4a53 with merge base 910005c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
void safeReadFileExact(
    std::ifstream& file,
    Container& buffer,
    uint32_t bytesToRead) {
Some driveby changes:
- Renamed to `safeReadFileExact` to be explicit about this function enforcing that `bytesToRead` must be read.
- All callsites already pass `uint32_t`. Since `bytesToRead` gets cast to `std::streamsize` (a signed type) on line 92, using `uint32_t` is safer.
On this call
safeReadFile(file_, chunkBuffer, bytesToReadThisChunk);
bytesToReadThisChunk is an int64_t.
In general, it's best to stick to simpler types whenever possible. For our own interfaces, we should avoid using stuff like uint32 and just try to stick to int, int64_t, etc.
I've updated most constexpr and function arguments to use int64_t, I hope this isn't going too far in the other direction.
My understanding is that the tradeoff between int and int64_t is that int64_t uses some additional bytes, which is not really significant for this use case. Is that correct?
return reinterpret_cast<const T*>(chunkData.data());
}
This function is used to interpret bytes as type T, similar to safeReadFileExact().
Unlike safeReadFileExact(), we do not want to go through file.read() and copy to a buffer.
Instead, this function validates it is safe to read bytes from memory as type T. This allows us to read the WAV sample data with intData[i], which is faster than safeReadFileExact().
See usage in convertToFloatBuffer().
namespace facebook::torchcodec {

- class WavDecoder {
+ class FORCE_PUBLIC_VISIBILITY WavDecoder {
I'm not completely sure why this is needed, but it is.
It was a pain to debug why my builds started failing with this message:
OSError: /home/dev/torchcodec/src/torchcodec/libtorchcodec_custom_ops8.so: undefined symbol: _ZN8facebook10torchcodec10WavDecoder17getSamplesInRangeEdSt8optionalIdE
It's for the same reasons the other APIs need it too: This WavDecoder class relies on a "hidden" class (the torch::stable::Tensor), which causes the compiler to transitively mark it as "hidden" as well.
But it can't be hidden, because we need to link against it in custom_ops and other places. So we have to explicitly mark it as public to override the compiler's default behavior.
It's messy.
I still owe you a thorough investigation / documentation of all this, which I can only do after #1260 lands, so that'll be just before the 2.12 release.
NicolasHug left a comment:
Thanks @Dan-Flores, I made a first pass
STD_TORCH_CHECK(false, "get_wav_all_samples not yet implemented");
std::tuple<torch::stable::Tensor, double> result =
    wavDecoder->getSamplesInRange(start_seconds, stop_seconds);
return std::get<0>(result);
getSamplesInRange is returning a tuple. Does it need to? It seems like std::get<1> is never used.
Good point, let's align this with AudioDecoder. We should be using the second value in the tuple in the python WavDecoder object to make AudioSamples.
const int64_t bytesReadThisChunk = file_.gcount();

if (bytesReadThisChunk <= 0)
    break;
I'm not sure I understand the reason for not using safeReadFileExact above. That is, why are we OK to not actually read the number of samples we expected to read?
If really this is a behavior we want to support, then let's add a strict (or exact) bool parameter to safeReadFile and always call safeReadFile.
This ensures we return something even if the header had an incorrect dataSize or blockAlign, but I'm not certain it's necessary. It's less confusing to use safeReadFile here, so I'll remove the partial read behavior.
break;

const int64_t samplesInChunk = bytesReadThisChunk / header_.blockAlign;
if (samplesInChunk > 0) {
Can this be false without it being in an error state? I.e. do we need to do anything in the (currently non-existing) else{} block?
This is similar to the check above (bytesReadThisChunk <= 0): in either case the read() call returned fewer bytes than expected for a sample, so we should not process any samples. I'll deduplicate the check and refactor this section to make it more obvious.
const double actualPtsSeconds =
    static_cast<double>(startSample) / header_.sampleRate;
this division is just undoing the multiplication that was done above when startSample was defined?
When we calculate startSample as std::round(startSeconds * header_.sampleRate), the resulting startSample may not begin exactly at startSeconds as it needs to match the start of a sample boundary. I'll add a comment to this effect.
…checks, update chunk reading loop
std::to_string(stopSecondsOptional.value()) + ").");
if (startSeconds == stopSecondsOptional.value()) {
  return AudioFramesOutput{
      torch::stable::empty({header_.numChannels, 0}, kStableFloat32), 0.0};
Do we want to return startSeconds instead of 0 here? We should just try to be consistent with AudioDecoder (see comment below about testing this)
Thanks for catching this, this enables the (0, 0) test to pass.
// reading partial samples. See
// https://github.com/FFmpeg/FFmpeg/blob/0f600cbc16b7903703b47d23981b636c94a41c71/libavformat/wavdec.c#L786-L791
size_t alignedChunkSize = DEFAULT_CHUNK_BUFFER_SIZE;
if (header_.blockAlign > 1) {
Are we in a correct state if this is ever <= 1?
Seems like 0 would be an incorrect state, and 1 could still go through the same logic - so should we remove this if() condition?
STD_TORCH_CHECK(
    numSamples <= INT64_MAX / header_.blockAlign,
    "bytesToRead calculation overflow: numSamples * header_.blockAlign");
const int64_t bytesToRead = numSamples * header_.blockAlign;
I'd still recommend renaming blockAlign to something more descriptive. I think it's fair to say that
`const int64_t bytesToRead = numSamples * header_.numBytesPerSample`
is easier to follow than what we currently have (at least for someone who's not an expert in the wav docs). And we have other instances of that in the code since blockAlign is currently used in a lot of places.
We generally want to prioritize the readability of our own code. I get the point about the wav docs using blockAlign, but we can just add a comment to that effect in our header?
const int64_t bytesToRead = numSamples * header_.blockAlign;

while (totalBytesRead < bytesToRead) {
  const int64_t bytesToReadThisChunk = std::min(
This is a confusing name.
The chunk is the entire array that gets filled over multiple iterations. This bytesToReadThisChunk is not the number of bytes to read in that chunk. It's the number of bytes to read for that one iteration, and many of those will make up the entire chunk.
constexpr uint32_t MAX_FMT_CHUNK_SIZE = 200;
// Soundfile's default chunk size. See
// https://github.com/libsndfile/libsndfile/blob/master/src/common.h#L77
constexpr size_t DEFAULT_CHUNK_BUFFER_SIZE = 8192;
Is "Chunk" the right term to refer to this? My understanding is that it's actually a temporary sample buffer.
A chunk, to me, refers to a unit in the wav format - as in the "header chunk", the "data chunk". But these are different things.
Should this be "DefaultSampleBufferSize"
This is a good point, I'll make this change when I reintroduce the buffered reading optimization.
convertSamplesToFloat(chunkBuffer, samplesInChunk, outputPtr);
samplesProcessed += samplesInChunk;
totalBytesRead += bytesReadThisChunk;
}
The entire loop above: can I suggest we just call torch::from_buffer or something similar and then get a float32 tensor by calling .to(torch.float32) and then scale?
Basically do what the Python code would do. That's only temporary, we'll eventually move it to something more like what you have above, but this would greatly simplify the state of the PR and allow us to do all these optimizations as future improvements.
I agree, to make this PR easier to review we can break it up this way.
Removing the buffered read code eliminates some code you left feedback on, I'll incorporate the feedback when I reintroduce the optimization.
startSeconds <= INT64_MAX / header_.sampleRate,
"startSample calculation would overflow: startSeconds * sampleRate");
const int64_t startSample =
    static_cast<int64_t>(std::round(startSeconds * header_.sampleRate));
On the various checks like:
STD_TORCH_CHECK(..., "... could overflow");
I know you're following my recommendation to be very defensive about user input, and we do want to be super careful about avoiding segfaults, bad buffer/file accesses, etc.
But we're starting to have a lot of those, and we might be reaching a point where we're making the code too difficult to follow. The check above is meant to validate that startSeconds * header_.sampleRate <= INT64_MAX, and since sampleRate is uint32 that can only be false if the user passes an absurdly high startSeconds (in the range of ~60 years!)
Maybe we just.. don't care about checking this? The worst case that may happen seems to be that users will just get garbage as output.
Curious @scotts if you have any recommendation on how to deal with this in a principled way.
STD_TORCH_CHECK(false, "File too small to contain chunk:", chunkId);
}
STD_TORCH_CHECK(
    fileSize >= CHUNK_HEADER_SIZE,
- fileSize >= CHUNK_HEADER_SIZE,
+ fileSize >= static_cast<uint64_t>(CHUNK_HEADER_SIZE),
It's a safe implicit promotion, but with all of the different number types flying around, explicit is better than implicit.
samples = torch::stable::subtract(
    zeros, samples, -scale); // 0 - (-scale) * samples = scale * samples
So there might not be a torch::stable::multiply but I'm hoping the * operator is overloaded - we can't just do samples = samples * scale ? 😲
No, unfortunately not:
/home/dev/torchcodec/src/torchcodec/_core/WavDecoder.cpp: In member function ‘facebook::torchcodec::AudioFramesOutput facebook::torchcodec::WavDecoder::getSamplesInRange(double, std::optional<double>)’:
/home/dev/torchcodec/src/torchcodec/_core/WavDecoder.cpp:350:19:
error: no match for ‘operator*’ (operand types are ‘torch::stable::Tensor’ and ‘const double’)
350 | samples = samples * scale;
| ~~~~~~~ ^ ~~~~~
| | |
| | const double
| torch::stable::Tensor
header_.dataOffset <= static_cast<uint64_t>(INT64_MAX - byteOffset),
"dataPosition calculation would overflow: dataOffset + byteOffset ");
const int64_t dataPosition =
    static_cast<int64_t>(header_.dataOffset) + byteOffset;
Nit: just define byteOffset only or dataPosition only, and do something like
int64_t byteOffset = startSample * header_.numBytesPerSample;
byteOffset += static_cast<int64_t>(header_.dataOffset);
It avoids the extra temporary variable.
// Convert to [channels, samples]
samples = torch::stable::transpose(samples, 0, 1);

// We return the actual sample start time, not the requested startSeconds.
let's just clarify that it should be the same as startSeconds up to rounding.
std::array<uint8_t, CHUNK_HEADER_SIZE> chunkHeader;
safeReadFile(file_, chunkHeader, CHUNK_HEADER_SIZE);
// Read chunk size which immediately follows the chunk ID
uint32_t chunkSize = readValue<uint32_t>(chunkHeader, 4);
Unrelated to this PR, I'm surprised that readValue() does a memcpy(). If we're always using it to read integer values, we should be able to read the values in-place using a reinterpret_cast. Getting that right is tricky, though, so that optimization should be a follow-up.
Actually, related to this PR, part of the difficulty is that we're using so many different kinds of integer types. Sometimes we're forced to, though, so I want to better understand. Is it the WAV standard that we have to read these values as uint32_t? That is, is it defined somewhere that the chunk size is stored as a 32-bit unsigned integer?
Edit: Yes, seems so: https://www.mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
It's recommended to use memcpy() for type punning. The data we copy is fairly small, so I don't think there's a big performance loss with this copy.
}
// Skip this chunk and continue searching (odd chunks are padded)
STD_TORCH_CHECK(
    static_cast<uint64_t>(chunkSize) <= UINT64_MAX - CHUNK_HEADER_SIZE - 1,
I don't get this check. Are you trying to establish that adding CHUNK_HEADER_SIZE to chunkSize won't overflow? Why? That is, if you wanted to establish that chunkSize + CHUNK_HEADER_SIZE would not overflow, I think (emphasis on I think), it would be:
static_cast<uint64_t>(chunkSize) <= static_cast<uint64_t>(UINT32_MAX) - static_cast<uint64_t>(CHUNK_HEADER_SIZE)
That is, because chunkSize is a uint32_t, we need to check it against UINT32_MAX, not UINT64_MAX. I'm not sure on the + or - 1 part.
I also may be missing something, so please don't assume I'm correct. But even if we implemented it with the right maxes, I don't think it's protecting against anything? We are adding them together on line 261, but that's all in 64-bit integers. This is the kind of check you'd do if you were also going to do something like chunkSize += CHUNK_HEADER_SIZE, but I don't see that.
This check was redundant, it's been removed in the latest version of this PR.
CHUNK_HEADER_SIZE + static_cast<uint64_t>(chunkSize) + (chunkSize % 2);
STD_TORCH_CHECK(
    startPos <= UINT64_MAX - numBytesToSkip,
    "File position arithmetic would overflow");
Note that this is correct, I think, because startPos is a 64-bit number.
sizeof(typename Container::value_type) == 1,
"Container value_type must be a 1-byte type for safe byte access");
// readValue is always called with positive offset, so casting to size_t is
// safe
Why make a comment when you can make an actual assertion? :) I agree with @NicolasHug that there's a danger to drowning in too many checks, but in this case, a comment and an assertion "cost" the reader about the same.
STD_TORCH_CHECK(offset >= 0, "offset ", offset, " is not positive");
Functions don't get to control how they get called, even when both are written by the same person. :)
template <typename T, typename Container>
- T readValue(const Container& data, size_t offset) {
+ T readValue(const Container& data, int64_t offset) {
Nit on the names of the types here: when I see typename Container, I assume it's going to behave at least something like a member of the Containers library: https://en.cppreference.com/w/cpp/container.html
I think we can make things more clear by naming these types mechanically:
template <typename InType, typename OutType>
OutType readValue(const InType& data, int64_t offset);
I got here because I read line 54, and I wasn't sure which direction T was going in.
offset);
return std::memcmp(data + offset, expected.data(), 4) == 0;
// matchesFourCC is always called with non-negative offset, so casting to
// size_t is safe
Ditto on my point about asserting non-negative. An assertion is easier to understand than a comment. And asserting that an integer is non-negative is easy to read. (I'm going to stop making this point, as it applies in a bunch of places.)
One way to deal with assertion noise is just don't provide a message. :) If you think something is rare, and unlikely to ever be helpful to users, just use the default message. Then the code becomes much more readable:
STD_TORCH_CHECK(offset >= 0);
STD_TORCH_CHECK(dataSize >= 4 && offset <= dataSize - 4);
It's often not the assertion itself that makes it hard to follow the code, but trying to create a nice string. So skip the string. :)
Thanks for the suggestion, I'll apply this to the various size_t conversions where I left a comment.
I can see how the check offset >= 0 indicates something similar to my comment: "all callsites are non-negative", while also enforcing it in the future.
If you think something is rare, and unlikely to ever be helpful to users, just use the default message
Since this function is only ever called by us, I'm positive this error will not occur.
I also suspect many of the overflow/underflow checks will be quite rare, but since those are user facing, perhaps we should keep the longer error strings in those cases.
This PR adds plumbing to enable decoding 32-bit PCM WAV files with get_samples_played_in_range / getSamplesInRange(): the sample data is read via from_blob and converted to float32.

Benchmark:
We will improve performance in a later optimization PR! [Benchmark code](https://www.internalfb.com/phabricator/paste/view/P2252086087)

Audio Decoding Benchmark
File: test/resources/sine_stereo_s32_10min.wav
Warmup: 3, Trials: 10

| Decoder | Time |
| --- | --- |
| AudioDecoder | 0.207s |
| WavDecoder | 0.451s |
| SoundFile | 0.145s |