kornia.contrib#
Models#
Base#
- class kornia.contrib.models.base.ModelBase(*args, **kwargs)[source]#
Abstract model class with some utility functions.
- compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options={}, disable=False)[source]#
- Return type:
ModelBase[ModelConfig]
- load_checkpoint(checkpoint, device=None)[source]#
Load checkpoint from a given url or file.
- Parameters:
checkpoint (str) – The URL or filepath of the respective checkpoint.
device (torch.device | None, optional) – The desired device to load the weights onto and move the model to. Default:
None
- Return type:
None
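A minimal usage sketch, assuming a concrete ModelBase subclass such as the RTDETR model documented below; the checkpoint path is a placeholder:

import torch
from kornia.contrib.models.rt_detr import RTDETR, RTDETRConfig

# RTDETR is a concrete ModelBase subclass (see the RT-DETR section below).
model = RTDETR.from_config(RTDETRConfig(model_type="resnet50d", num_classes=80))

# Load weights from a URL or local file; the path below is a placeholder.
# model.load_checkpoint("/path/to/checkpoint.ckpt", device=torch.device("cpu"))

# Optionally wrap the model with the torch.compile()/dynamo API (torch >= 2.0).
# model = model.compile(backend="inductor")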
Structures#
- class kornia.contrib.models.SegmentationResults(logits, scores, mask_threshold=0.0)[source]#
Encapsulate the results obtained by a Segmentation model.
- Parameters:
- property binary_masks: Tensor#
Binary masks generated from the logits by applying mask_threshold.
Shape will be the same as logits, \((B, C, H, W)\), where \(C\) is the number of masks predicted.
Note
If original_res_logits has been called, the masks are generated from the original-resolution logits. Otherwise, the low-resolution logits (self.logits) are used.
- original_res_logits(input_size, original_size, image_size_encoder)[source]#
Remove padding and upscale the logits to the original image size.
Resize to image encoder input -> remove padding (bottom and right) -> Resize to original size
Note
This method sets an internal original_res_logits, which will be used (when available) to compute the binary masks.
- Parameters:
input_size – The size of the image input to the model, in (H, W) format. Used to remove padding.
original_size – The original size of the image before resizing for input to the model, in (H, W) format.
image_size_encoder – The size of the input image for the image encoder, in (H, W) format. Used to resize the logits back to the encoder resolution before removing the padding.
- Returns:
Batched logits in \((K, C, H, W)\) format, where (H, W) is given by original_size.
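A short sketch of how the structure can be consumed, assuming logits and scores already produced by a segmentation model; the sizes below are illustrative:

import torch
from kornia.contrib.models import SegmentationResults

# Hypothetical low-resolution outputs from a segmentation model: (K, C, H, W) and (K, C).
logits = torch.randn(2, 3, 256, 256)
scores = torch.rand(2, 3)

results = SegmentationResults(logits=logits, scores=scores, mask_threshold=0.0)
low_res_masks = results.binary_masks  # thresholded low-resolution logits, (2, 3, 256, 256)

# Upscale the logits back to the original image resolution before thresholding again.
results.original_res_logits(
    input_size=(768, 1024),           # model input size (H, W), used to strip padding
    original_size=(480, 640),         # original image size (H, W)
    image_size_encoder=(1024, 1024),  # image encoder input size (H, W)
)
full_res_masks = results.binary_masks  # now based on the original-resolution logits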
- class kornia.contrib.models.Prompts(points=None, boxes=None, masks=None)[source]#
Encapsulate the prompt inputs for a model.
- Parameters:
points (optional) – A tuple with the keypoints (x, y coordinates) and their respective labels. Shape \((K, N, 2)\) for the keypoints and \((K, N)\) for the labels. Default:
None
boxes (optional) – Batched box inputs, with shape \((K, 4)\). Expected to be in xyxy format. Default:
None
masks (optional) – Batched mask prompts to the model, with shape \((K, 1, H, W)\). Default:
None
- boxes: Tensor | None = None#
- masks: Tensor | None = None#
- points: tuple[Tensor, Tensor] | None = None#
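A minimal sketch of building the prompt container; the coordinates and box below are illustrative:

import torch
from kornia.contrib.models import Prompts

keypoints = torch.tensor([[[10.0, 20.0], [30.0, 40.0]]])  # (K=1, N=2, 2), xy coordinates
labels = torch.tensor([[1.0, 1.0]])                        # (K=1, N=2), 1 = foreground

prompts = Prompts(
    points=(keypoints, labels),
    boxes=torch.tensor([[5.0, 5.0, 50.0, 50.0]]),  # (K=1, 4), xyxy
    masks=None,
)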
VisualPrompter#
- class kornia.contrib.visual_prompter.VisualPrompter(config=SamConfig(model_type='vit_h', pretrained=True), device=None, dtype=None)[source]#
This class allows the user to run multiple queries with multiple prompts against a model.
At the moment, only the SAM model is supported. The model is loaded based on the given config.
By default, images are transformed so that their long side matches image_encoder.img_size. This Prompter class ensures that the images and the prompts are transformed before prediction. The image is also passed automatically to the preprocess_image method, which normalizes the image and pads it to the size expected by the SAM model, \((\text{image\_encoder.img\_size}, \text{image\_encoder.img\_size})\). By default, the image is normalized with the mean and standard deviation of the SAM dataset values.
- Parameters:
config (SamConfig, optional) – A model config used to build the model. Currently only the SAM model is supported. Default:
SamConfig(model_type='vit_h', pretrained=True)
device (torch.device | None, optional) – The desired device to run the model on. Default:
None
dtype (torch.dtype | None, optional) – The desired dtype for the model. Default:
None
Example
>>> # prompter = VisualPrompter()  # Will load the vit_h by default
>>> # You can load a custom SAM type by modifying the config
>>> prompter = VisualPrompter(SamConfig('vit_b'))
>>> image = torch.rand(3, 25, 30)
>>> prompter.set_image(image)
>>> boxes = Boxes(
...     torch.tensor(
...         [[[[0, 0], [0, 10], [10, 0], [10, 10]]]],
...         device=prompter.device,
...         dtype=torch.float32
...     ),
...     mode='xyxy'
... )
>>> prediction = prompter.predict(boxes=boxes)
>>> prediction.logits.shape
torch.Size([1, 3, 256, 256])
- compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options={}, disable=False)[source]#
Applies the torch.compile(…) / dynamo API to the VisualPrompter API.
Note
For more information about the dynamo API check the official docs https://pytorch.org/docs/stable/generated/torch.compile.html
- Parameters:
fullgraph (bool, optional) – Whether it is OK to break the model into several subgraphs. Default:
False
dynamic (bool, optional) – Use dynamic shape tracing. Default:
False
backend (str, optional) – Backend to be used. Default:
'inductor'
mode (str | None, optional) – Can be either “default”, “reduce-overhead” or “max-autotune”. Default:
None
options (dict[Any, Any], optional) – A dictionary of options to pass to the backend. Default:
{}
disable (bool, optional) – Turn torch.compile() into a no-op for testing. Default:
False
- Return type:
None
Example
>>> # prompter = VisualPrompter()
>>> # prompter.compile()  # You should have torch >= 2.0.0 installed
>>> # Use the prompter methods ...
- predict(keypoints=None, keypoints_labels=None, boxes=None, masks=None, multimask_output=True, output_original_size=True)[source]#
Predict masks for the given image based on the input prompts.
- Parameters:
keypoints (Keypoints | Tensor | None, optional) – Point prompts to the model. Each point is in (X,Y) in pixels. Shape \((K, N, 2)\). Where N is the number of points and K the number of prompts. Default:
None
keypoints_labels – Labels for the point prompts. 1 indicates a foreground point and 0 indicates a background point. Shape \((K, N)\), where N is the number of points and K the number of prompts.
boxes (Boxes | Tensor | None, optional) – A box prompt to the model. If a tensor, it should be in xyxy format. Shape \((K, 4)\). Default:
None
masks (Tensor | None, optional) – A low resolution mask input to the model, typically coming from a previous prediction iteration. Has shape \((K, 1, H, W)\), where for SAM, H=W=256. Default:
None
multimask_output (bool, optional) – If true, the model will return three masks. For ambiguous input prompts (such as a single click), this will often produce better masks than a single prediction. If only a single mask is needed, the model’s predicted quality score can be used to select the best mask. For non-ambiguous prompts, such as multiple input prompts, multimask_output=False can give better results. Default:
True
output_original_size (bool, optional) – If true, the logits of SegmentationResults will be post-processed to match the original input image size. Default:
True
- Return type:
SegmentationResults
- Returns:
A prediction with the logits and scores (the IoU of each predicted mask).
- preprocess_image(x, mean=None, std=None)[source]#
Normalize and pad a tensor.
For normalization: the mean and std passed as arguments take priority; if they are None, the default SAM dataset values are used.
For padding: the tensor is padded on the right and bottom to match the size of self.model.image_encoder.img_size.
- Parameters:
x (Tensor) – The image to be preprocessed
mean (Tensor | None, optional) – Mean for each channel. Default:
None
std (Tensor | None, optional) – Standard deviations for each channel. Default:
None
- Return type:
Tensor
- Returns:
The preprocessed image (normalized if mean and std are available, and padded to the encoder size).
- preprocess_prompts(keypoints=None, keypoints_labels=None, boxes=None, masks=None)[source]#
Validate and preprocess the given prompts to be aligned with the input image.
- Return type:
- set_image(image, mean=None, std=None)[source]#
Set the embeddings from the given image using the image_encoder of the model.
Prepare the given image with the selected transforms and the preprocess method.
- Parameters:
image (Tensor) – RGB image. Normally images are in the range [0, 1]; the model preprocessing normalizes the pixel values with the mean and std defined in its initialization. Expected to be of float32 dtype. Shape \((3, H, W)\).
- Return type:
None
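A sketch of a keypoint-driven query, mirroring the box example above; shapes and coordinates are illustrative, and SamConfig('vit_b') uses randomly initialized weights since pretrained defaults to False:

import torch
from kornia.contrib.models.sam import SamConfig
from kornia.contrib.visual_prompter import VisualPrompter

prompter = VisualPrompter(SamConfig('vit_b'))

image = torch.rand(3, 480, 640)  # RGB image in [0, 1], float32, (3, H, W)
prompter.set_image(image)

# A single foreground click at pixel (x=320, y=240): shapes (K=1, N=1, 2) and (K=1, N=1).
keypoints = torch.tensor([[[320.0, 240.0]]], device=prompter.device)
labels = torch.ones(1, 1, device=prompter.device)

prediction = prompter.predict(
    keypoints=keypoints,
    keypoints_labels=labels,
    multimask_output=True,
)
print(prediction.logits.shape, prediction.scores.shape)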
Edge Detection#
- class kornia.contrib.EdgeDetector[source]#
Detect edges in a given image using a CNN.
By default, it uses the method described in [SRS20].
- Returns:
A tensor of shape \((B,1,H,W)\).
Example
>>> img = torch.rand(1, 3, 320, 320)
>>> detect = EdgeDetector()
>>> out = detect(img)
>>> out.shape
torch.Size([1, 1, 320, 320])
Face Detection#
- class kornia.contrib.FaceDetector(top_k=5000, confidence_threshold=0.3, nms_threshold=0.3, keep_top_k=750)[source]#
Detect faces in a given image using a CNN.
By default, it uses the method described in [FYP+21].
- Parameters:
top_k (int, optional) – the maximum number of detections to return before the nms. Default: 5000
confidence_threshold (float, optional) – the threshold used to discard detections. Default: 0.3
nms_threshold (float, optional) – the threshold used by the nms for iou. Default: 0.3
keep_top_k (int, optional) – the maximum number of detections to return after the nms. Default: 750
- Returns:
A list of B tensors with shape \((N, 15)\) to be used with kornia.contrib.FaceDetectorResult.
Example
>>> img = torch.rand(1, 3, 320, 320)
>>> detect = FaceDetector()
>>> res = detect(img)
- class kornia.contrib.FaceKeypoint(value)[source]#
Define the keypoints detected in a face.
The left/right convention is based on the screen viewer.
- EYE_LEFT = 0#
- EYE_RIGHT = 1#
- MOUTH_LEFT = 3#
- MOUTH_RIGHT = 4#
- NOSE = 2#
- class kornia.contrib.FaceDetectorResult(data)[source]#
Encapsulate the results obtained by the kornia.contrib.FaceDetector.
- Parameters:
data (Tensor) – the encoded results coming from the feature detector with shape \((14,)\).
- property bottom_right: Tensor#
The [x y] position of the bottom-right coordinate of the bounding box.
- get_keypoint(keypoint)[source]#
The [x y] position of a given facial keypoint.
- Parameters:
keypoint (FaceKeypoint) – the keypoint type to return the position.
- Return type:
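A sketch of decoding the detector output with this class, assuming each detection row of the returned \((N, 15)\) tensors can be wrapped in a FaceDetectorResult, as the Returns section above suggests:

import torch
from kornia.contrib import FaceDetector, FaceDetectorResult, FaceKeypoint

img = torch.rand(1, 3, 320, 320)
detect = FaceDetector()

with torch.no_grad():
    dets = detect(img)  # list of B tensors, each with shape (N, 15)

for det in dets[0]:  # detections of the first image
    res = FaceDetectorResult(det)
    print(res.bottom_right)                     # [x, y] of the bottom-right box corner
    print(res.get_keypoint(FaceKeypoint.NOSE))  # [x, y] of the nose keypoint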
Interactive Demo#
Visit the Kornia face detection demo on Hugging Face Spaces.
Object Detection#
- class kornia.contrib.object_detection.BoundingBoxDataFormat(value)[source]#
Enum class that maps bounding box data format.
- XYWH = 0#
- XYXY = 1#
- CXCYWH = 2#
- CENTER_XYWH = 2#
- class kornia.contrib.object_detection.BoundingBox(data, data_format)[source]#
Bounding box data class.
Useful for representing bounding boxes in different formats for object detection.
- Parameters:
data – tuple of bounding box data. The length of the tuple depends on the data format.
data_format – bounding box data format.
- data: tuple[float, float, float, float]#
- data_format: BoundingBoxDataFormat#
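A minimal construction sketch with illustrative values:

from kornia.contrib.object_detection import BoundingBox, BoundingBoxDataFormat

# A box stored as (x, y, width, height).
bbox = BoundingBox(data=(10.0, 20.0, 50.0, 80.0), data_format=BoundingBoxDataFormat.XYWH)
print(bbox.data, bbox.data_format)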
- class kornia.contrib.object_detection.ObjectDetectorResult(class_id, confidence, bbox)[source]#
Object detection result.
- Parameters:
class_id (int) – class id of the detected object.
confidence (float) – confidence score of the detected object.
bbox (BoundingBox) – bounding box of the detected object in xywh format.
- bbox: BoundingBox#
- class kornia.contrib.object_detection.ObjectDetector(model, pre_processor, post_processor)[source]#
This class wraps an object detection model and performs pre-processing and post-processing.
- compile(*, fullgraph=False, dynamic=False, backend='inductor', mode=None, options=None, disable=False)[source]#
Compile the internal object detection model with torch.compile().
- Return type:
None
- forward(images)[source]#
Detect objects in a given list of images.
- Parameters:
images – list of RGB images. Each image is a Tensor with shape \((3, H, W)\).
- Returns:
list of detections found in each image. For each item in the batch, the shape is \((D, 6)\), where \(D\) is the number of detections in the given image and the \(6\) values represent class id, score, and the xywh bounding box.
- class kornia.contrib.object_detection.ResizePreProcessor(size, interpolation_mode='bilinear')[source]#
This module resizes a list of image tensors to the given size.
It also returns the original image sizes for further post-processing.
- forward(imgs)[source]#
Resize the given list of image tensors to the configured size and return the resized images together with their original sizes for later post-processing.
- kornia.contrib.object_detection.results_from_detections(detections, format)[source]#
Convert a detection tensor to a list of ObjectDetectorResult.
- Parameters:
detections (Tensor) – tensor with shape \((D, 6)\), where \(D\) is the number of detections in the given image, \(6\) represents class id, score, and xywh bounding box.
- Return type:
- Returns:
list of ObjectDetectorResult.
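A sketch of converting one image's detections, assuming the format argument accepts the BoundingBoxDataFormat enum and the detections tensor is laid out as described above:

import torch
from kornia.contrib.object_detection import BoundingBoxDataFormat, results_from_detections

# Hypothetical detections for one image: (D, 6) = (class_id, score, x, y, w, h).
detections = torch.tensor([
    [1.0, 0.92, 10.0, 20.0, 50.0, 80.0],
    [3.0, 0.75, 60.0, 40.0, 30.0, 30.0],
])

results = results_from_detections(detections, format=BoundingBoxDataFormat.XYWH)
for r in results:
    print(r.class_id, r.confidence, r.bbox.data)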
Real-Time Detection Transformer (RT-DETR)#
- class kornia.contrib.models.rt_detr.RTDETRModelType(value)[source]#
Enum class that maps RT-DETR model type.
- resnet18d = 0#
- resnet34d = 1#
- resnet50d = 2#
- resnet101d = 3#
- hgnetv2_l = 4#
- hgnetv2_x = 5#
- class kornia.contrib.models.rt_detr.RTDETRConfig(model_type, num_classes, checkpoint=None, neck_hidden_dim=None, neck_dim_feedforward=None, neck_expansion=None, head_hidden_dim=256, head_num_queries=300, head_num_decoder_layers=None, confidence_threshold=0.3)[source]#
Configuration to construct RT-DETR model.
- Parameters:
model_type (RTDETRModelType | str | int) – model variant. Available models are:
ResNetD-18: 0, 'resnet18d' or RTDETRModelType.resnet18d
ResNetD-34: 1, 'resnet34d' or RTDETRModelType.resnet34d
ResNetD-50: 2, 'resnet50d' or RTDETRModelType.resnet50d
ResNetD-101: 3, 'resnet101d' or RTDETRModelType.resnet101d
HGNetV2-L: 4, 'hgnetv2_l' or RTDETRModelType.hgnetv2_l
HGNetV2-X: 5, 'hgnetv2_x' or RTDETRModelType.hgnetv2_x
num_classes (int) – number of classes.
checkpoint (str | None, optional) – URL or local path of model weights. Default:
None
neck_hidden_dim (int | None, optional) – hidden dim for neck. Default:
None
neck_dim_feedforward (int | None, optional) – feed-forward network dim for neck. Default:
None
neck_expansion (float | None, optional) – expansion ratio for neck. Default:
None
head_hidden_dim (int, optional) – hidden dim for head. Default:
256
head_num_queries (int, optional) – number of queries for Deformable DETR transformer decoder. Default:
300
head_num_decoder_layers (int | None, optional) – number of decoder layers for Deformable DETR transformer decoder. Default:
None
- checkpoint: str | None = None#
- confidence_threshold: float = 0.3#
- head_num_decoder_layers: int | None = None#
- head_num_queries: int = 300#
- model_type: RTDETRModelType | str | int#
- neck_dim_feedforward: int | None = None#
- neck_expansion: float | None = None#
- num_classes: int#
- class kornia.contrib.models.rt_detr.RTDETR(backbone, neck, head)[source]#
RT-DETR Object Detection model, as described in https://arxiv.org/abs/2304.08069.
- __init__(backbone, neck, head)[source]#
Construct RT-DETR Object Detection model.
- Parameters:
backbone (ResNetD | PPHGNetV2) – backbone network for feature extraction.
neck (HybridEncoder) – neck network for feature fusion.
head (RTDETRHead) – head network to decode features into detection results.
- forward(images)[source]#
Detect objects in an image.
- Parameters:
images – images to be detected. Shape \((N, C, H, W)\).
- Returns:
logits - Tensor of shape \((N, Q, K)\), where \(Q\) is the number of queries, \(K\) is the number of classes.
boxes - Tensor of shape \((N, Q, 4)\), where \(Q\) is the number of queries.
- static from_config(config)[source]#
Construct RT-DETR Object Detection model from a config object.
- Parameters:
config (RTDETRConfig) – configuration object for RT-DETR.
- Return type:
Note
For config.neck_hidden_dim, config.neck_dim_feedforward, config.neck_expansion, and config.head_num_decoder_layers, if they are None, their values will be replaced with the default values depending on the config.model_type. See the source code for the default values.
- load_checkpoint(checkpoint, device=None)#
Load checkpoint from a given url or file.
- Parameters:
checkpoint (str) – The URL or filepath of the respective checkpoint.
device (torch.device | None, optional) – The desired device to load the weights onto and move the model to. Default:
None
- Return type:
None
- class kornia.contrib.models.rt_detr.DETRPostProcessor(confidence_threshold)[source]#
- forward(logits, boxes, original_sizes)[source]#
Post-process outputs from DETR.
- Parameters:
logits – tensor with shape \((N, Q, K)\), where \(N\) is the batch size, \(Q\) is the number of queries, \(K\) is the number of classes.
boxes – tensor with shape \((N, Q, 4)\), where \(N\) is the batch size, \(Q\) is the number of queries.
original_sizes – list of tuples, each tuple represent (img_height, img_width).
- Returns:
Processed detections. For each image, the detections have shape \((D, 6)\), where \(D\) is the number of detections in that image and the 6 values represent (class_id, confidence_score, x, y, w, h).
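A sketch of a full RT-DETR detection pipeline built from the components documented on this page; the ResizePreProcessor size argument is assumed to accept a single int (check your kornia version for the exact signature), and the model uses random weights unless a checkpoint is loaded:

import torch
from kornia.contrib.models.rt_detr import DETRPostProcessor, RTDETR, RTDETRConfig
from kornia.contrib.object_detection import (
    BoundingBoxDataFormat,
    ObjectDetector,
    ResizePreProcessor,
    results_from_detections,
)

# Build the RT-DETR model from a config (optionally load weights with model.load_checkpoint).
model = RTDETR.from_config(RTDETRConfig(model_type="resnet50d", num_classes=80))

# Wrap it with pre- and post-processing; 640 is a common RT-DETR input size.
detector = ObjectDetector(
    model,
    ResizePreProcessor(640),
    DETRPostProcessor(confidence_threshold=0.3),
)

images = [torch.rand(3, 480, 640), torch.rand(3, 720, 1280)]  # list of RGB images
with torch.inference_mode():
    detections = detector.forward(images)  # one (D, 6) tensor per image

results = results_from_detections(detections[0], format=BoundingBoxDataFormat.XYWH)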
Image Segmentation#
- kornia.contrib.connected_components(image, num_iterations=100)[source]#
Computes the Connected-component labelling (CCL) algorithm.
The implementation is an adaptation of the following repository:
https://gist.github.com/efirdc/5d8bd66859e574c683a504a4690ae8bc
Warning
This is an experimental API subject to changes and optimization improvements.
Note
See a working example here.
- Parameters:
- Return type:
- Returns:
The labels image with the same shape as the input image.
Example
>>> img = torch.rand(2, 1, 4, 5)
>>> img_labels = connected_components(img, num_iterations=100)
Segment Anything (SAM)#
- class kornia.contrib.models.sam.SamModelType(value)[source]#
Map the SAM model types.
- vit_h = 0#
- vit_l = 1#
- vit_b = 2#
- mobile_sam = 3#
- class kornia.contrib.models.sam.SamConfig(model_type=None, checkpoint=None, pretrained=False, encoder_embed_dim=None, encoder_depth=None, encoder_num_heads=None, encoder_global_attn_indexes=None)[source]#
Encapsulate the Config to build a SAM model.
- Parameters:
model_type (str | int | SamModelType | None, optional) – the available models are: Default: None
0, 'vit_h' or kornia.contrib.sam.SamModelType.vit_h
1, 'vit_l' or kornia.contrib.sam.SamModelType.vit_l
2, 'vit_b' or kornia.contrib.sam.SamModelType.vit_b
3, 'mobile_sam' or kornia.contrib.sam.SamModelType.mobile_sam
checkpoint (str | None, optional) – URL or a path to a file with the weights of the model. Default:
None
encoder_embed_dim (int | None, optional) – Patch embedding dimension. Default:
None
encoder_depth (int | None, optional) – Depth of ViT. Default:
None
encoder_num_heads (int | None, optional) – Number of attention heads in each ViT block. Default:
None
encoder_global_attn_indexes (tuple[int, ...] | None, optional) – Encoder indexes for blocks using global attention. Default:
None
- checkpoint: str | None = None#
- encoder_depth: int | None = None#
- encoder_embed_dim: int | None = None#
- encoder_global_attn_indexes: tuple[int, ...] | None = None#
- encoder_num_heads: int | None = None#
- model_type: str | int | SamModelType | None = None#
- pretrained: bool = False#
- class kornia.contrib.models.sam.Sam(image_encoder, prompt_encoder, mask_decoder)[source]#
- __init__(image_encoder, prompt_encoder, mask_decoder)[source]#
SAM predicts object masks from an image and input prompts.
- Parameters:
image_encoder (ImageEncoderViT | TinyViT) – The backbone used to encode the image into image embeddings that allow for efficient mask prediction.
prompt_encoder (PromptEncoder) – Encodes various types of input prompts.
mask_decoder (MaskDecoder) – Predicts masks from the image embeddings and encoded prompts.
- forward(images, batched_prompts, multimask_output)[source]#
Predicts masks end-to-end from provided images and prompts.
This method expects that the images have already been pre-processed: at least normalized, resized, and padded to be compatible with self.image_encoder.
Note
For each image \((3, H, W)\), it is possible to input a batch of \(K\) prompts with \(N\) points each; the results are batched by the number of prompts. So given a prompt batch with \(K=5\) and \(N=10\), the results will have shape \(5 \times C \times H \times W\), where \(C\) is determined by multimask_output. Within each of these \(5 \times C\) masks, it should be possible to find \(N\) instances if the model succeeds.
- Parameters:
images – The image as a torch tensor in \((B, 3, H, W)\) format, already transformed for input to the model.
batched_prompts – A list over the batch of images (list length should be \(B\)), where each element is a dictionary with the following keys. If an image does not have a given prompt, the corresponding key should not be included in its dictionary. The options are:
"points": tuple of (Tensor, Tensor) with the coordinate keypoints and their respective labels. The tuple should look like (keypoints, labels), where:
The keypoints (a tensor) are batched point prompts for this image, with shape \((K, N, 2)\). Already transformed to the input frame of the model.
The labels (a tensor) are batched labels for the point prompts, with shape \((K, N)\), where 1 indicates a foreground point and 0 indicates a background point.
"boxes": (Tensor) Batched box inputs, with shape \((K, 4)\). Already transformed to the input frame of the model.
"mask_inputs": (Tensor) Batched mask inputs to the model, in the form \((K, 1, H, W)\).
multimask_output – Whether the model should predict multiple disambiguating masks, or return a single mask.
- Returns:
A list over the input images, where each element is a SegmentationResults with the following fields:
logits: Low resolution logits with shape \((K, C, H, W)\). Can be passed as mask input to subsequent iterations of prediction, where \(K\) is the number of input prompts, \(C\) is determined by multimask_output, and \(H = W = 256\) is the model output size.
scores: The model's predictions of mask quality (IoU prediction), with shape \((K, C)\).
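A sketch of the batched_prompts structure described above, for a batch of two already pre-processed images (keypoint prompts for the first, a box prompt for the second); the call at the end is commented out because sam_model and images are placeholders:

import torch

# One dictionary per image in the batch; omit the keys for prompts you do not have.
batched_prompts = [
    {   # image 0: K=1 prompt with N=2 keypoints, already in the model input frame
        "points": (
            torch.tensor([[[128.0, 256.0], [300.0, 300.0]]]),  # keypoints, (K, N, 2)
            torch.tensor([[1.0, 0.0]]),                         # labels, (K, N)
        ),
    },
    {   # image 1: K=1 box prompt in xyxy, already in the model input frame
        "boxes": torch.tensor([[100.0, 100.0, 400.0, 400.0]]),  # (K, 4)
    },
]

# outputs = sam_model(images, batched_prompts, multimask_output=True)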
- static from_config(config)[source]#
Build/load the SAM model based on its config.
- Parameters:
config (SamConfig) – The SamConfig data structure. If the model_type is available, build from it; otherwise the parameters set will be used.
- Return type:
- Returns:
The respective SAM model
Example
>>> from kornia.contrib.models.sam import SamConfig
>>> sam_model = Sam.from_config(SamConfig('vit_b'))
- load_checkpoint(checkpoint, device=None)#
Load checkpoint from a given url or file.
- Parameters:
checkpoint (str) – The URL or filepath of the respective checkpoint.
device (torch.device | None, optional) – The desired device to load the weights onto and move the model to. Default:
None
- Return type:
None
Image Patches#
- kornia.contrib.compute_padding(original_size, window_size)[source]#
Compute required padding to ensure chaining of extract_tensor_patches() and combine_tensor_patches() produces the expected result.
- Parameters:
- Return type:
- Returns:
The required padding for (top, bottom, left, right) as a tuple of 4 ints.
Example
>>> image = torch.arange(12).view(1, 1, 4, 3)
>>> padding = compute_padding((4,3), (3,3))
>>> out = extract_tensor_patches(image, window_size=(3, 3), stride=(3, 3), padding=padding)
>>> combine_tensor_patches(out, original_size=(4, 3), window_size=(3, 3), stride=(3, 3), unpadding=padding)
tensor([[[[ 0,  1,  2],
          [ 3,  4,  5],
          [ 6,  7,  8],
          [ 9, 10, 11]]]])
Note
This function is supposed to be used in conjunction with extract_tensor_patches() and combine_tensor_patches().
- kornia.contrib.extract_tensor_patches(input, window_size, stride=1, padding=0)[source]#
Function that extracts patches from tensors and stacks them.
See ExtractTensorPatches for details.
- Parameters:
input (Tensor) – tensor image where to extract the patches with shape \((B, C, H, W)\).
window_size (int | Tuple[int, int]) – the size of the sliding window and the output patch size.
stride (int | Tuple[int, int], optional) – stride of the sliding window. Default: 1
padding (int | Tuple[int, int] | Tuple[int, int, int, int], optional) – Zero-padding added to both sides of the input. Default: 0
- Return type:
- Returns:
the tensor with the extracted patches with shape \((B, N, C, H_{out}, W_{out})\).
Examples
>>> input = torch.arange(9.).view(1, 1, 3, 3)
>>> patches = extract_tensor_patches(input, (2, 3))
>>> input
tensor([[[[0., 1., 2.],
          [3., 4., 5.],
          [6., 7., 8.]]]])
>>> patches[:, -1]
tensor([[[[3., 4., 5.],
          [6., 7., 8.]]]])
- kornia.contrib.combine_tensor_patches(patches, original_size, window_size, stride, unpadding=0)[source]#
Restore input from patches.
See CombineTensorPatches for details.
- Parameters:
patches (Tensor) – patched tensor with shape \((B, N, C, H_{out}, W_{out})\).
original_size (int | Tuple[int, int]) – the size of the original tensor and the output patch size.
window_size (int | Tuple[int, int]) – the size of the sliding window used while extracting patches.
stride (int | Tuple[int, int]) – stride of the sliding window.
unpadding (int | Tuple[int, int] | Tuple[int, int, int, int], optional) – remove the padding added to both sides of the input. Default: 0
- Return type:
- Returns:
The combined patches in an image tensor with shape \((B, C, H, W)\).
Example
>>> out = extract_tensor_patches(torch.arange(16).view(1, 1, 4, 4), window_size=(2, 2), stride=(2, 2))
>>> combine_tensor_patches(out, original_size=(4, 4), window_size=(2, 2), stride=(2, 2))
tensor([[[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]]]])
Note
This function is supposed to be used in conjunction with extract_tensor_patches().
- class kornia.contrib.ExtractTensorPatches(window_size, stride=1, padding=0)[source]#
Module that extracts patches from tensors and stacks them.
In the simplest case, the output value of the operator with input size \((B, C, H, W)\) is \((B, N, C, H_{out}, W_{out})\).
- where
\(B\) is the batch size.
\(N\) denotes the total number of extracted patches stacked in left-right and top-bottom order.
\(C\) denotes the number of input channels.
\(H\), \(W\) denote the height and width of the input in pixels.
\(H_{out}\), \(W_{out}\) denote the patch size defined in the function signature.
window_size is the size of the sliding window and controls the shape of the output tensor; it defines the shape of the output patch.
stride controls the stride to apply to the sliding window and regulates the overlap between the extracted patches.
padding controls the amount of implicit zero-padding on both sides at each dimension.
The parameters window_size, stride and padding can be either:
a single int – in which case the same value is used for the height and width dimension.
a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension.
padding can also be a tuple of four ints – in which case, the first two ints are for the height dimension while the last two ints are for the width dimension.
- Parameters:
window_size (int | Tuple[int, int]) – the size of the sliding window and the output patch size.
stride (int | Tuple[int, int] | None, optional) – stride of the sliding window. Default: 1
padding (int | Tuple[int, int] | Tuple[int, int, int, int] | None, optional) – Zero-padding added to both sides of the input. Default: 0
- Shape:
Input: \((B, C, H, W)\)
Output: \((B, N, C, H_{out}, W_{out})\)
- Returns:
the tensor with the extracted patches.
Examples
>>> input = torch.arange(9.).view(1, 1, 3, 3)
>>> patches = extract_tensor_patches(input, (2, 3))
>>> input
tensor([[[[0., 1., 2.],
          [3., 4., 5.],
          [6., 7., 8.]]]])
>>> patches[:, -1]
tensor([[[[3., 4., 5.],
          [6., 7., 8.]]]])
- class kornia.contrib.CombineTensorPatches(original_size, window_size, unpadding=0)[source]#
Module that combines patches from tensors.
In the simplest case, the output value of the operator with input size \((B, N, C, H_{out}, W_{out})\) is \((B, C, H, W)\).
- where
\(B\) is the batch size.
\(N\) denotes the total number of extracted patches stacked in left-right and top-bottom order.
\(C\) denotes the number of input channels.
\(H\), \(W\) denote the height and width of the input in pixels.
\(H_{out}\), \(W_{out}\) denote the patch size defined in the function signature.
original_size is the size of the original image prior to extracting tensor patches and defines the shape of the output.
window_size is the size of the sliding window used while extracting tensor patches.
unpadding is the amount of padding to be removed. This value must be the same as the padding used while extracting tensor patches.
The parameters original_size, window_size, and unpadding can be either:
a single int – in which case the same value is used for the height and width dimension.
a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension.
unpadding can also be a tuple of four ints – in which case, the first two ints are for the height dimension while the last two ints are for the width dimension.
- Parameters:
patches – patched tensor.
original_size (int | Tuple[int, int]) – the size of the original tensor and the output patch size.
window_size (int | Tuple[int, int]) – the size of the sliding window used.
unpadding (int | Tuple[int, int] | Tuple[int, int, int, int], optional) – remove the padding added to both sides of the input. Default: 0
- Shape:
Input: \((B, N, C, H_{out}, W_{out})\)
Output: \((B, C, H, W)\)
Example
>>> out = extract_tensor_patches(torch.arange(16).view(1, 1, 4, 4), window_size=(2, 2), stride=(2, 2))
>>> combine_tensor_patches(out, original_size=(4, 4), window_size=(2, 2), stride=(2, 2))
tensor([[[[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]]]])
Note
This module is supposed to be used in conjunction with ExtractTensorPatches.
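A sketch using the module interfaces instead of the functional ones, assuming non-overlapping patches (stride equal to the window size) so the round trip restores the input:

import torch
from kornia.contrib import CombineTensorPatches, ExtractTensorPatches

image = torch.arange(16.).view(1, 1, 4, 4)

extractor = ExtractTensorPatches(window_size=(2, 2), stride=(2, 2))
combiner = CombineTensorPatches(original_size=(4, 4), window_size=(2, 2))

patches = extractor(image)    # (1, 4, 1, 2, 2): four non-overlapping 2x2 patches
restored = combiner(patches)  # expected to be (1, 1, 4, 4) again
print(restored.shape)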
Image Classification#
- class kornia.contrib.VisionTransformer(image_size=224, patch_size=16, in_channels=3, embed_dim=768, depth=12, num_heads=12, dropout_rate=0.0, dropout_attn=0.0, backbone=None)[source]#
Vision transformer (ViT) module.
The module is expected to be used as an operator for different vision tasks.
The method is inspired from existing implementations of the paper [DBK+21].
Warning
This is an experimental API subject to changes in favor of flexibility.
- Parameters:
image_size (int, optional) – the size of the input image. Default: 224
patch_size (int, optional) – the size of the patch to compute the embedding. Default: 16
in_channels (int, optional) – the number of channels for the input. Default: 3
embed_dim (int, optional) – the embedding dimension inside the transformer encoder. Default: 768
depth (int, optional) – the depth of the transformer. Default: 12
num_heads (int, optional) – the number of attention heads. Default: 12
dropout_rate (float, optional) – dropout rate. Default: 0.0
dropout_attn (float, optional) – attention dropout rate. Default: 0.0
backbone (Module | None, optional) – an nn.Module to compute the image patches embeddings. Default: None
Example
>>> img = torch.rand(1, 3, 224, 224)
>>> vit = VisionTransformer(image_size=224, patch_size=16)
>>> vit(img).shape
torch.Size([1, 197, 768])
- class kornia.contrib.MobileViT(mode='xxs', in_channels=3, patch_size=(2, 2), dropout=0.0)[source]#
MobileViT module. Default arguments are for MobileViT XXS.
Paper: https://arxiv.org/abs/2110.02178 Based on: https://github.com/chinhsuanwu/mobilevit-pytorch
- Parameters:
mode (str, optional) – 'xxs', 'xs' or 's', defaults to 'xxs'. Default: 'xxs'
in_channels (int, optional) – the number of channels for the input image. Default: 3
patch_size (Tuple[int, int], optional) – image_size must be divisible by patch_size. Default: (2, 2)
dropout (float, optional) – dropout ratio in Transformer. Default: 0.0
Example
>>> img = torch.rand(1, 3, 256, 256)
>>> mvit = MobileViT(mode='xxs')
>>> mvit(img).shape
torch.Size([1, 320, 8, 8])
- class kornia.contrib.TinyViT(img_size=224, in_chans=3, num_classes=1000, embed_dims=[96, 192, 384, 768], depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], window_sizes=[7, 7, 14, 7], mlp_ratio=4.0, drop_rate=0.0, drop_path_rate=0.0, use_checkpoint=False, mbconv_expand_ratio=4.0, local_conv_size=3, activation=nn.GELU, mobile_sam=False)[source]#
TinyViT model, as described in https://arxiv.org/abs/2207.10666
- Parameters:
img_size (optional) – Size of input image. Default:
224
in_chans (optional) – Number of input image’s channels. Default:
3
num_classes (optional) – Number of output classes. Default:
1000
embed_dims (optional) – List of embedding dimensions. Default:
[96, 192, 384, 768]
depths (optional) – List of block counts for each downsampling stage. Default:
[2, 2, 6, 2]
num_heads (optional) – List of attention heads used in self-attention for each downsampling stage. Default:
[3, 6, 12, 24]
window_sizes (optional) – List of self-attention’s window size for each downsampling stage. Default:
[7, 7, 14, 7]
mlp_ratio (optional) – Ratio of MLP dimension to embedding dimension in self-attention. Default:
4.0
drop_rate (optional) – Dropout rate. Default:
0.0
drop_path_rate (optional) – Stochastic depth rate. Default:
0.0
use_checkpoint (optional) – Whether to use activation checkpointing to trade compute for memory. Default:
False
mbconv_expand_ratio (optional) – Expansion ratio used in MBConv block. Default:
4.0
local_conv_size (optional) – Kernel size of convolution used in TinyViTBlock Default:
3
activation (optional) – activation function. Default:
nn.GELU
mobile_sam (optional) – Whether to use modifications for MobileSAM. Default: False
- forward(x)[source]#
Classify images if mobile_sam=False, produce feature maps if mobile_sam=True.
- Return type:
- static from_config(variant, pretrained=False, **kwargs)[source]#
Create a TinyViT model from pre-defined variants.
- Parameters:
variant (str) – TinyViT variant. Possible values: '5m', '11m', '21m'.
pretrained (bool | str, optional) – whether to use pre-trained weights. Possible values: False, True, 'in22k', 'in1k'. For TinyViT-21M (variant='21m'), 'in1k_384' and 'in1k_512' are also available. Default: False
**kwargs (Any) – other keyword arguments that will be passed to TinyViT.
- Return type:
Note
When img_size is different from the pre-trained size, bicubic interpolation will be performed on attention biases. When using pretrained=True, the ImageNet-1k checkpoint ('in1k') is used. For feature extraction or fine-tuning, the ImageNet-22k checkpoint ('in22k') is preferred.
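A short usage sketch; set pretrained=True (or one of the named checkpoints) to download weights instead of using random initialization:

import torch
from kornia.contrib import TinyViT

model = TinyViT.from_config('5m', pretrained=False)

img = torch.rand(1, 3, 224, 224)
logits = model(img)  # classification logits, since mobile_sam=False by default
print(logits.shape)  # expected torch.Size([1, 1000]) with the default num_classes=1000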
Image Stitching#
- class kornia.contrib.ImageStitcher(matcher, estimator='ransac', blending_method='naive')[source]#
Stitch two images with overlapping fields of view.
- Parameters:
matcher (Module) – image feature matching module.
estimator (str, optional) – method to compute homography, either "vanilla" or "ransac". "ransac" is slower but more accurate. Default: 'ransac'
blending_method (str, optional) – method to blend two images together. Only "naive" is currently supported. Default: 'naive'
Note
Current implementation requires strict image ordering from left to right.
import torch
import kornia as K
import kornia.feature as KF
import matplotlib.pyplot as plt
from kornia.contrib import ImageStitcher

# img_left, img_right: RGB image tensors on the GPU (not defined here).
IS = ImageStitcher(KF.LoFTR(pretrained='outdoor'), estimator='ransac').cuda()

# Compute the stitched result with less GPU memory cost.
with torch.inference_mode():
    out = IS(img_left, img_right)

# Show the result
plt.imshow(K.tensor_to_image(out))
Lambda#
- class kornia.contrib.Lambda(func)[source]#
Applies user-defined lambda as a transform.
- Parameters:
- Returns:
The output of the user-defined lambda.
Example
>>> import kornia
>>> x = torch.rand(1, 3, 5, 5)
>>> f = Lambda(lambda x: kornia.color.rgb_to_grayscale(x))
>>> f(x).shape
torch.Size([1, 1, 5, 5])
Distance Transform#
- kornia.contrib.distance_transform(image, kernel_size=3, h=0.35)[source]#
Approximates the Manhattan distance transform of images using cascaded convolution operations.
The value at each pixel in the output represents the distance to the nearest non-zero pixel in the input image. It uses the method described in [PDP20]. The transformation is applied independently across the channel dimension of the images.
- Parameters:
- Return type:
- Returns:
tensor with shape \((B,C,H,W)\).
Example
>>> tensor = torch.zeros(1, 1, 5, 5)
>>> tensor[:,:, 1, 2] = 1
>>> dt = kornia.contrib.distance_transform(tensor)
- kornia.contrib.diamond_square(output_size, roughness=0.5, random_scale=1.0, random_fn=torch.rand, normalize_range=None, device=None, dtype=None)[source]#
Generates Plasma Fractal Images using the diamond square algorithm.
See: https://en.wikipedia.org/wiki/Diamond-square_algorithm
- Parameters:
output_size (Tuple[int, int, int, int]) – a tuple of integers with the BxCxHxW of the image to be generated.
roughness (float | Tensor, optional) – the scale value to apply at each recursion step. Default: 0.5
random_scale (float | Tensor, optional) – the initial value of the scale for recursion. Default: 1.0
random_fn (Callable[..., Tensor], optional) – the callable function to use to sample a random tensor. Default: torch.rand
normalize_range (Tuple[float, float] | None, optional) – whether to min-max normalize the output map. If a range is specified, min-max normalization is applied between the provided bounds. Default: None
device (torch.device | None, optional) – the torch device to place the output map on. Default: None
dtype (torch.dtype | None, optional) – the torch dtype of the output map. Default: None
- Return type:
- Returns:
A tensor with shape \((B,C,H,W)\) containing the fractal image.
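A short usage sketch with illustrative sizes:

import torch
from kornia.contrib import diamond_square

# Generate two single-channel 64x64 plasma fractal maps, min-max normalized to [0, 1].
maps = diamond_square(
    output_size=(2, 1, 64, 64),
    roughness=0.5,
    normalize_range=(0.0, 1.0),
)
print(maps.shape)  # torch.Size([2, 1, 64, 64])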