[DOC] Document common computer vision patterns #19518

ymrohit wants to merge 1 commit into pytorch:main from ymrohit:docathon-8831-cv-patterns (+301 −6)
(working-with-cv-models)=

# Working with Computer Vision Models

Computer vision deployments depend on a precise boundary between the app and the exported program. Before exporting, write down the tensor contract that your app will satisfy:

- input shape, including whether the model expects `NCHW` (`[batch, channels, height, width]`) or `NHWC` (`[batch, height, width, channels]`)
- input dtype, such as `float32` normalized image values or `uint8` image bytes
- color channel order, such as RGB or BGR
- resize, crop, and normalization rules
- output tensors and the post-processing expected for each task

ExecuTorch runs the graph that you export. It does not infer image layout, resize policy, label mappings, or task-specific post-processing from the model file.
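One lightweight way to make that contract explicit is a small record that lives next to the export script and the app-side tests. The sketch below is illustrative only; `TensorContract` and its field names are not an ExecuTorch API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorContract:
    """Input contract the app must satisfy (illustrative, not an ExecuTorch API)."""

    shape: tuple        # e.g. (1, 3, 224, 224)
    layout: str         # "NCHW" or "NHWC"
    dtype: str          # "float32" or "uint8"
    color_order: str    # "RGB" or "BGR"
    mean: tuple         # per-channel normalization mean
    std: tuple          # per-channel normalization std


# Example contract for an ImageNet-style classifier.
CLASSIFIER_CONTRACT = TensorContract(
    shape=(1, 3, 224, 224), layout="NCHW", dtype="float32", color_order="RGB",
    mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225),
)
```

Keeping one such record per model makes it easy to assert the same values in export code, app code, and validation tests.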
## Put preprocessing in the model when it must be identical

For many CV models, resize, dtype conversion, and normalization should behave exactly the same in test code and in the app. If the operations are exportable and do not rely on platform image APIs, wrap the PyTorch module before export.

This example accepts `uint8` `NCHW` RGB input, converts it to `float32`, resizes it, center-crops it, normalizes it, and then calls the image classifier.

```python
import torch
from torch import nn
import torch.nn.functional as F


class ImageClassifierWithPreprocess(nn.Module):
    def __init__(self, model: nn.Module) -> None:
        super().__init__()
        self.model = model
        self.register_buffer(
            "mean",
            torch.tensor([0.485, 0.456, 0.406], dtype=torch.float32).view(1, 3, 1, 1),
        )
        self.register_buffer(
            "std",
            torch.tensor([0.229, 0.224, 0.225], dtype=torch.float32).view(1, 3, 1, 1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        image = image.to(dtype=torch.float32).div(255.0)
        image = F.interpolate(
            image,
            size=(256, 256),
            mode="bilinear",
            align_corners=False,
        )
        image = image[:, :, 16:240, 16:240]
        image = (image - self.mean) / self.std
        return self.model(image)


wrapped_model = ImageClassifierWithPreprocess(model).eval()
sample_inputs = (torch.zeros(1, 3, 320, 320, dtype=torch.uint8),)
exported_program = torch.export.export(wrapped_model, sample_inputs)
```

Keep preprocessing outside the model when it is better owned by the application, such as camera orientation, EXIF handling, platform-native decoding, user-selected crop rectangles, or UI-specific resizing. In that case, validate the app-side preprocessing against the same PyTorch preprocessing used during export.

If the model expects a crop after resizing, keep that policy in exactly one place. A fixed center crop can be implemented in the wrapper with tensor slicing after `interpolate`; camera- or UI-dependent crops are usually easier to apply before packing pixels into the input tensor.
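The center-crop slice bounds come from simple arithmetic: for a 256×256 resize and a 224×224 crop, the offset is `(256 - 224) // 2 = 16`, giving the `16:240` range used in the wrapper. A minimal helper (illustrative, not part of any API) makes the policy explicit:

```python
def center_crop_bounds(resize: int, crop: int) -> tuple:
    """Return the (start, stop) slice bounds for a centered crop."""
    offset = (resize - crop) // 2
    return offset, offset + crop


# The 256 -> 224 center crop corresponds to slicing 16:240 on both spatial axes.
assert center_crop_bounds(256, 224) == (16, 240)
```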
## Convert images to tensors in app code

Most mobile image APIs expose decoded pixels as interleaved rows. Most PyTorch vision models expect channels-first tensors. If preprocessing stays in the app, explicitly pack pixels into the model's expected layout.

### Android

The following Kotlin helper resizes a `Bitmap`, reads RGB pixels, applies ImageNet-style normalization, and packs the result as `NCHW` `float32` data for `Tensor.fromBlob`.

```kotlin
import android.graphics.Bitmap
import org.pytorch.executorch.Tensor

fun bitmapToNchwTensor(
    bitmap: Bitmap,
    size: Int,
    mean: FloatArray = floatArrayOf(0.485f, 0.456f, 0.406f),
    std: FloatArray = floatArrayOf(0.229f, 0.224f, 0.225f),
): Tensor {
    val resized = Bitmap.createScaledBitmap(bitmap, size, size, true)
    val pixels = IntArray(size * size)
    resized.getPixels(pixels, 0, size, 0, 0, size, size)

    val input = FloatArray(3 * size * size)
    for (i in pixels.indices) {
        val pixel = pixels[i]
        val r = ((pixel shr 16) and 0xff) / 255.0f
        val g = ((pixel shr 8) and 0xff) / 255.0f
        val b = (pixel and 0xff) / 255.0f

        input[i] = (r - mean[0]) / std[0]
        input[size * size + i] = (g - mean[1]) / std[1]
        input[2 * size * size + i] = (b - mean[2]) / std[2]
    }

    return Tensor.fromBlob(input, longArrayOf(1, 3, size.toLong(), size.toLong()))
}
```

If the exported model accepts `uint8` image bytes instead, use `Tensor.fromBlobUnsigned(...)` and keep dtype conversion inside the exported graph.

```kotlin
val inputBytes = ByteArray(3 * width * height)
val inputTensor = Tensor.fromBlobUnsigned(
    inputBytes,
    longArrayOf(1, 3, height.toLong(), width.toLong()),
)
```
### iOS

The following Swift helper draws a `UIImage` into an RGB buffer, normalizes it, and creates a channels-first `Tensor<Float>`.

```swift
import CoreGraphics
import ExecuTorch
import UIKit

func imageToNchwTensor(
  _ image: UIImage,
  size: Int,
  mean: [Float] = [0.485, 0.456, 0.406],
  std: [Float] = [0.229, 0.224, 0.225]
) -> Tensor<Float>? {
  guard let cgImage = image.cgImage else {
    return nil
  }

  let pixelCount = size * size
  var rgba = [UInt8](repeating: 0, count: pixelCount * 4)
  let colorSpace = CGColorSpaceCreateDeviceRGB()

  let didDraw = rgba.withUnsafeMutableBytes { buffer -> Bool in
    guard let baseAddress = buffer.baseAddress,
      let context = CGContext(
        data: baseAddress,
        width: size,
        height: size,
        bitsPerComponent: 8,
        bytesPerRow: size * 4,
        space: colorSpace,
        bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue |
          CGBitmapInfo.byteOrder32Big.rawValue
      ) else {
      return false
    }
    context.draw(cgImage, in: CGRect(x: 0, y: 0, width: size, height: size))
    return true
  }
  guard didDraw else {
    return nil
  }

  var input = [Float](repeating: 0, count: 3 * pixelCount)
  for i in 0..<pixelCount {
    let base = 4 * i
    let r = Float(rgba[base]) / 255.0
    let g = Float(rgba[base + 1]) / 255.0
    let b = Float(rgba[base + 2]) / 255.0

    input[i] = (r - mean[0]) / std[0]
    input[pixelCount + i] = (g - mean[1]) / std[1]
    input[2 * pixelCount + i] = (b - mean[2]) / std[2]
  }

  return Tensor<Float>(input, shape: [1, 3, size, size])
}
```

If your model is exported for `NHWC`, keep the same decoded pixels but pack them in row-major `[height, width, channels]` order and use shape `[1, height, width, 3]`.
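As a quick sanity check on the layout difference, a NumPy sketch (illustrative; the array values are arbitrary) shows how the same pixel lands at transposed indices in the two layouts:

```python
import numpy as np

# An arbitrary 2x2 RGB image in HWC order, with a distinct value per element.
hwc = np.arange(2 * 2 * 3, dtype=np.float32).reshape(2, 2, 3)

nhwc = hwc[np.newaxis, ...]                      # shape (1, 2, 2, 3)
nchw = hwc.transpose(2, 0, 1)[np.newaxis, ...]   # shape (1, 3, 2, 2)

# Pixel (y=1, x=0), channel 2 is the same value in both layouts,
# just indexed in a different order.
assert nhwc[0, 1, 0, 2] == nchw[0, 2, 1, 0]
```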
## Decode common CV outputs

Output tensors are model-specific. Preserve the output schema used during export and keep a small validation test that compares app-side post-processing with PyTorch post-processing.

For TorchVision models, check the [models and pre-trained weights documentation](https://docs.pytorch.org/vision/stable/models.html) for model-specific transforms, categories, and task conventions.

### Image classification

Image classifiers commonly return a logits tensor with shape `[1, num_classes]`. For top-1 classification, find the largest logit and map the index through the same labels file used during training or evaluation.

```kotlin
import org.pytorch.executorch.EValue

val output = module.forward(EValue.from(inputTensor))[0].toTensor()
val logits = output.dataAsFloatArray

var topIndex = 0
for (i in 1 until logits.size) {
    if (logits[i] > logits[topIndex]) {
        topIndex = i
    }
}
val topScore = logits[topIndex]
```

Use `softmax` only when the UI needs probabilities. Ranking classes by logits and by softmax probabilities gives the same order.
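When probabilities are needed, the numerically stable form subtracts the max logit before exponentiating. A short Python sketch with illustrative logit values:

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max logit for numerical stability; it cancels in the ratio.
    z = np.exp(logits - logits.max())
    return z / z.sum()


logits = np.array([2.0, -1.0, 0.5], dtype=np.float32)
probs = softmax(logits)

# softmax is strictly increasing, so the class ranking matches the raw logits.
assert np.array_equal(np.argsort(logits), np.argsort(probs))
```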
### Semantic segmentation

Semantic segmentation models commonly return class scores with shape `[1, classes, height, width]`. For each output pixel, choose the class channel with the largest score, then resize the mask back to the displayed image size if needed.

```kotlin
fun argmaxMask(scores: FloatArray, classes: Int, height: Int, width: Int): IntArray {
    val mask = IntArray(height * width)
    for (y in 0 until height) {
        for (x in 0 until width) {
            val offset = y * width + x
            var bestClass = 0
            var bestScore = scores[offset]
            for (c in 1 until classes) {
                val score = scores[c * height * width + offset]
                if (score > bestScore) {
                    bestScore = score
                    bestClass = c
                }
            }
            mask[offset] = bestClass
        }
    }
    return mask
}
```

See the [DeepLabV3 Android demo](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3/android/DeepLabV3Demo) for an end-to-end ExecuTorch segmentation app that exports a model, runs it on Android, and overlays the predicted mask on an image.

### Object detection and instance segmentation

Detection and instance segmentation models do not have a single universal output format. Common patterns include:

- boxes as `[num_detections, 4]`, usually in `xyxy` or `xywh` coordinates
- labels as `[num_detections]`
- scores as `[num_detections]`
- masks as `[num_detections, height, width]` or `[num_detections, 1, height, width]`

Check whether thresholding, non-maximum suppression, box decoding, and mask resizing are already part of the exported graph. If they are not, keep those steps in the app and document the expected coordinate system. When the model runs on a resized or cropped image, map boxes and masks back to the original image coordinates before rendering overlays.
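Mapping boxes back to the original image is simple scaling when the app resized the full image with no letterboxing or cropping; any padding or crop needs the corresponding inverse transform instead. A hedged Python sketch of the plain-resize case for `xyxy` boxes (the function name is illustrative):

```python
def boxes_to_original(boxes, model_w, model_h, orig_w, orig_h):
    """Scale xyxy boxes from model-input coordinates to original-image coordinates.

    Assumes the app resized the full image to (model_w, model_h) with no
    letterboxing or cropping; adjust if your preprocessing differs.
    """
    sx = orig_w / model_w
    sy = orig_h / model_h
    return [(x1 * sx, y1 * sy, x2 * sx, y2 * sy) for x1, y1, x2, y2 in boxes]
```

A full-frame box on a 224×224 input maps back to the full 640×480 frame, which is a cheap sanity check to keep in tests.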
## Validate the model and app contract

Before shipping a CV model, validate these items:

- The app sends the same dtype, shape, layout, color order, and normalization that the exported graph expects.
- The app uses the same labels, palette, score threshold, and coordinate convention as the PyTorch reference.
- A known image produces matching top classes, masks, or detections in PyTorch and in the ExecuTorch app.
- The preprocessing is applied exactly once. Do not normalize in both the app and the exported model.
- The output code handles model-specific shapes instead of assuming all CV models return classifier logits.
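One way to exercise the first and fourth items is a small parity check: re-implement the reference preprocessing in NumPy and compare it against the tensor the app produced for a known image. All names below are illustrative, not an ExecuTorch API:

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def reference_preprocess(rgb_u8: np.ndarray) -> np.ndarray:
    """Reference NCHW float32 preprocessing for an HWC uint8 RGB image."""
    x = rgb_u8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD                          # broadcast over channels
    return x.transpose(2, 0, 1)[np.newaxis, ...]  # shape [1, 3, H, W]


def preprocess_matches(app_tensor: np.ndarray, rgb_u8: np.ndarray,
                       atol: float = 1e-5) -> bool:
    """True if the app-produced tensor matches the reference within tolerance."""
    return np.allclose(app_tensor, reference_preprocess(rgb_u8), atol=atol)
```

Dump the app-side input tensor for one known image, load it on the host, and assert `preprocess_matches` before debugging anything downstream; double normalization shows up immediately as a large mismatch.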
For the basic export and runtime flow, start with {doc}`getting-started`. For mobile runtime integration, see {doc}`using-executorch-android` and {doc}`using-executorch-ios`.