\n Write a GPU program that performs element-wise addition of two vectors containing 32-bit floating point numbers.\n The program should take two input vectors of equal length and produce a single output vector containing their sum.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in vector C
\n
\n\n
Example 1:
\n
\nInput: A = [1.0, 2.0, 3.0, 4.0]\n B = [5.0, 6.0, 7.0, 8.0]\nOutput: C = [6.0, 8.0, 10.0, 12.0]\n
\n\n
Example 2:
\n
\nInput: A = [1.5, 1.5, 1.5]\n B = [2.3, 2.3, 2.3]\nOutput: C = [3.8, 3.8, 3.8]\n
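A minimal CUDA sketch of the element-wise addition described above. The exact solve signature is not given in this statement, so the parameter list below (device pointers plus a length N) is an assumption; the kernel simply maps one thread to one output element.

#include <cuda_runtime.h>

// One thread per element; A, B, C are assumed to be device pointers.
__global__ void vector_add_kernel(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

// Hypothetical host wrapper illustrating the launch pattern.
void solve(const float* A, const float* B, float* C, int N) {
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    vector_add_kernel<<<blocks, threads>>>(A, B, C, N);
    cudaDeviceSynchronize();
}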
\n Implement a program that performs the Rectified Linear Unit (ReLU) activation function on a vector of 32-bit floating point numbers.\n The ReLU function sets all negative values to zero and leaves positive values unchanged: $$\\text{ReLU}(x) = \\max(0, x)$$\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n Implement a program that performs the leaky ReLU activation function on a vector of floating-point numbers. The leaky ReLU function is defined as:\n $$ f(x) = \\begin{cases}\n x & \\text{if } x > 0 \\\\\n \\alpha x & \\text{if } x \\leq 0\n \\end{cases} $$\n where $\\alpha$ is a small positive constant (0.01 in this problem).\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in vector output
\n
Use $\\alpha = 0.01$ as the leaky coefficient
\n
\n\n
Example 1:
\n
\n Input: x = [1.0, -2.0, 3.0, -4.0]\n Output: y = [1.0, -0.02, 3.0, -0.04]
\n\n
Example 2:
\n
\n Input: x = [-1.5, 0.0, 2.5, -3.0]\n Output: y = [-0.015, 0.0, 2.5, -0.03]
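A short CUDA sketch of the leaky ReLU above, assuming x and y are device pointers of length N (the real signature may differ); it hard-codes the required α = 0.01.

__global__ void leaky_relu_kernel(const float* x, float* y, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float v = x[i];
        y[i] = (v > 0.0f) ? v : 0.01f * v;   // alpha = 0.01
    }
}
// Launch with e.g. <<<(N + 255) / 256, 256>>>.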
\n Implement a program that performs R rounds of parallel hashing on an array of 32-bit integers using the provided hash function.\n The hash should be applied R times iteratively (the output of one round becomes the input to the next).\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n Write a program that multiplies two matrices of 32-bit floating point numbers on a GPU.\n Given matrix $A$ of dimensions $M \\times N$ and matrix $B$ of dimensions $N \\times K$, compute\n the product matrix $C = A \\times B$, which will have dimensions $M \\times K$.\n All matrices are stored in row-major format.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
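A naive CUDA sketch of the row-major matrix product above, with one thread per output element of C; the dimension arguments M, N, K are assumed to be passed alongside the device pointers.

__global__ void matmul_kernel(const float* A, const float* B, float* C,
                              int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. M-1
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. K-1
    if (row < M && col < K) {
        float acc = 0.0f;
        for (int n = 0; n < N; ++n)
            acc += A[row * N + n] * B[n * K + col];    // A is M x N, B is N x K
        C[row * K + col] = acc;
    }
}
// Launch with a 2D grid, e.g. dim3 block(16, 16); dim3 grid((K + 15) / 16, (M + 15) / 16).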
\n Implement a program that copies an $N \\times N$ matrix of 32-bit floating point numbers from input array $A$ to output array $B$ on the GPU. The program should perform a direct element-wise copy so that $B_{i,j} = A_{i,j}$ for all valid indices.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in matrix B
\n
\n\n
Example 1:
\n
\nInput: A = [[1.0, 2.0],\n [3.0, 4.0]]\nOutput: B = [[1.0, 2.0],\n [3.0, 4.0]]\n
\n Write a program that transposes a matrix of 32-bit floating point numbers on a GPU. The\n transpose of a matrix switches its rows and columns. Given a matrix $A$ of dimensions $rows \\times cols$, the transpose $A^T$ will have dimensions $cols \\times rows$. All matrices are stored in row-major format.\n
\n\n\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the matrix output
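A simple CUDA sketch of the transpose above (no shared-memory tiling), assuming device pointers plus rows/cols arguments.

__global__ void transpose_kernel(const float* input, float* output, int rows, int cols) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
        output[c * rows + r] = input[r * cols + c];   // output is cols x rows, row-major
}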
\n Run inference on a PyTorch model. Given an input tensor and a trained torch.nn.Linear model, compute the forward pass and store the result in the output tensor.\n
\n\n
\n The model performs a linear transformation: output = input @ weight.T + bias where weight has shape [output_size, input_size] and bias has shape [output_size].\n
\n\n
Implementation Requirements
\n
\n
Use PyTorch's built-in functions and operations
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output tensor
\n
The model is already loaded and ready for inference
\n Implement the SiLU (Sigmoid Linear Unit) activation function forward pass for 1D input vectors.\n Given an input tensor of shape [N] where N is the number of elements, compute the output using the elementwise formula.\n
\n\n
\n SiLU is defined as:\n $$\n \\begin{align}\n \\sigma(x) &= \\frac{1}{1 + e^{-x}} \\\\\n \\text{SiLU}(x) &= x \\cdot \\sigma(x)\n \\end{align}\n $$\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output tensor
\n Implement the Swish-Gated Linear Unit (SWiGLU) activation function forward pass for 1D input vectors.\n Given an input tensor of shape [N] where N is the number of elements, compute the output using the elementwise formula.\n The input and output tensor must be of type float32.\n
\n\n
\n SWiGLU is defined as:\n
\n
Split input $x$ into two halves: $x_1$ and $x_2$
\n
Compute SiLU on the first half:\n $$\n \\text{SiLU}(x_1) = x_1 \\cdot \\sigma(x_1), \\quad\n \\sigma(x) = \\frac{1}{1 + e^{-x}}\n $$\n
\n Implement a GPU program that performs clipping on 1D input vectors.\n Given an input tensor of shape [N] where N is the number of elements,\n compute the output by clipping each element to a specified range [lo, hi].\n The input and output tensor must be of type float32.\n
\n\n
\n Clipping is defined as:\n
\n
For each element x in the input tensor, \"clip\" the element so that it falls within the allowed range [lo, hi].\n
\n
This operation ensures all values are within the specified range and is commonly used in ML for activation stabilization and pre-quantization.
\n \n\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output tensor
\n
\n\n
Example 1:
\n
\nInput: [1.5, -2.0, 3.0, 4.5], lo = 0.0, hi = 3.5\nOutput: [1.5, 0.0, 3.0, 3.5]\n
\n\n
Example 2:
\n
\nInput: [-1.0, 2.0, 5.0], lo = -0.5, hi = 2.5\nOutput: [-0.5, 2.0, 2.5]\n
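A one-thread-per-element CUDA sketch of the clipping above, assuming lo and hi arrive as scalar arguments.

__global__ void clip_kernel(const float* input, float* output, int N, float lo, float hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) output[i] = fminf(fmaxf(input[i], lo), hi);   // clamp to [lo, hi]
}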
\n Write a GPU program that interleaves two arrays of 32-bit floating point numbers.\n Given two input arrays A and B, each of length N,\n produce an output array of length 2N where elements alternate between the two inputs:\n [A[0], B[0], A[1], B[1], A[2], B[2], ...]\n
\n\n\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output array
\n
\n\n
Example 1:
\n
\nInput: A = [1.0, 2.0, 3.0], B = [4.0, 5.0, 6.0]\nOutput: [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]\n
\n\n
Example 2:
\n
\nInput: A = [10.0, 20.0], B = [30.0, 40.0]\nOutput: [10.0, 30.0, 20.0, 40.0]\n
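A CUDA sketch of the interleave above; each thread writes the pair (A[i], B[i]) into positions 2i and 2i+1 of the output (signature assumed).

__global__ void interleave_kernel(const float* A, const float* B, float* output, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        output[2 * i]     = A[i];
        output[2 * i + 1] = B[i];
    }
}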
\n Implement the Gaussian Error Gated Linear Unit (GEGLU) activation function forward pass for 1D input\n vectors. Given an input tensor of shape [N] where N is the number of elements, compute the output\n using the elementwise formula. The input and output tensor must be of type float32.\n
\n\n
\n GEGLU is defined as:\n
\n
Split input $x$ into two halves: $x_1$ and $x_2$
\n
Compute GELU on the second half:\n $$\n \\text{GELU}(x_2) = \\frac{1}{2} x_2 \\left(1 + \\text{erf}\\left(\\frac{x_2}{\\sqrt{2}}\\right)\\right)\n $$\n
\n Implement a GPU program that converts an RGB image to grayscale on the GPU.\n Given an input RGB image represented as a 1D array of 32-bit floating point values,\n compute the corresponding grayscale image using the standard RGB to grayscale conversion formula.\n
\n\n
The conversion formula is: gray = 0.299 × R + 0.587 × G + 0.114 × B
\n\n
The input array input contains height × width × 3 elements, where the RGB values for each pixel are stored consecutively (R, G, B, R, G, B, ...). The output array output should contain height × width grayscale values.
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the array output
\n
Use the exact coefficients: 0.299 for red, 0.587 for green, 0.114 for blue
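A CUDA sketch of the conversion above with one thread per pixel and the required coefficients; width and height are assumed to be passed so the pixel count can be formed.

__global__ void grayscale_kernel(const float* input, float* output, int num_pixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_pixels) {
        float r = input[3 * i];
        float g = input[3 * i + 1];
        float b = input[3 * i + 2];
        output[i] = 0.299f * r + 0.587f * g + 0.114f * b;
    }
}
// Launched with num_pixels = width * height.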
\n Write a GPU program that applies the sigmoid activation function element-wise to a vector of\n 32-bit floating point numbers. For each element x in the input vector X,\n compute sigmoid(x) = 1 / (1 + exp(-x)) and store the result in the output vector\n Y. The sigmoid function maps any real number to the range (0, 1).\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in vector Y
\n
\n\n
Example 1:
\n
\nInput: X = [0.0, 1.0, -1.0, 2.0]\nOutput: Y = [0.5, 0.7311, 0.2689, 0.8808]\n
\n\n
Example 2:
\n
\nInput: X = [0.5, -0.5, 3.0, -3.0]\nOutput: Y = [0.6225, 0.3775, 0.9526, 0.0474]\n
\n\n
Constraints
\n
\n
1 ≤ N ≤ 100,000,000
\n
Input values are finite 32-bit floating point numbers
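A CUDA sketch of the element-wise sigmoid above (signature assumed), using expf for the single-precision exponential.

__global__ void sigmoid_kernel(const float* X, float* Y, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) Y[i] = 1.0f / (1.0f + expf(-X[i]));
}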
\n Write a program to invert the colors of an image. The image is\n represented as a 1D array of RGBA (Red, Green, Blue, Alpha) values, where each\n component is an 8-bit unsigned integer (unsigned char).\n
\n\n
\n Color inversion is performed by subtracting each color component (R, G, B)\n from 255. The Alpha component should remain unchanged.\n
\n\n
\n The input array\n image will contain width * height * 4 elements. The\n first 4 elements represent the RGBA values of the top-left pixel, the next 4\n elements represent the pixel to its right, and so on.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
\n The final result must be stored in the array\n image\n
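A CUDA sketch of the in-place inversion above; one thread handles one RGBA pixel and leaves the alpha byte untouched (width and height arguments assumed).

__global__ void invert_kernel(unsigned char* image, int num_pixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_pixels) {
        int base = 4 * i;
        image[base]     = 255 - image[base];       // R
        image[base + 1] = 255 - image[base + 1];   // G
        image[base + 2] = 255 - image[base + 2];   // B
        // image[base + 3] (alpha) is left unchanged
    }
}
// Launched with num_pixels = width * height.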
\n Implement a program that performs element-wise addition of two $N \\times N$ matrices containing 32-bit floating point numbers on a GPU.\n The program should take two input matrices of equal dimensions and produce a single output matrix containing their element-wise sum.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in matrix C
\n
\n\n
Example 1:
\n
\nInput: A = [[1.0, 2.0],\n [3.0, 4.0]]\n B = [[5.0, 6.0],\n [7.0, 8.0]]\nOutput: C = [[6.0, 8.0],\n [10.0, 12.0]]\n
\n Implement a program that performs a 1D convolution operation. Given an input array and a kernel (filter), compute the convolved\n output. The convolution should be performed with a \"valid\" boundary condition, meaning the kernel is only applied\n where it fully overlaps with the input.\n
\n\n\n\n
\n The input consists of two arrays:\n
\n
input: A 1D array of 32-bit floating-point numbers.
\n
kernel: A 1D array of 32-bit floating-point numbers representing the convolution kernel.
\n
\nThe output should be written to the output array, which will have a size of input_size - kernel_size + 1.\n\n\n
\n The convolution operation is defined mathematically as:\n
\n Implement a program for multi-head self-attention. Given three input matrices $Q$ (queries), $K$ (keys), and $V$ (values) of size $N \\times d_{\\text{model}}$, compute:\n $$ \\text{MultiHead}(Q,K,V) = \\text{Concat}(\\text{head}_1,\\ldots,\\text{head}_h) $$\n where each head computes:\n $$ \\text{head}_i = \\text{softmax}\\left(\\frac{Q_iK_i^T}{\\sqrt{d_k}}\\right)V_i $$\n with $d_k = d_{\\text{model}}/h$ and $Q_i, K_i, V_i$ being the i-th head's partition of the input matrices.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output array
\n Implement the k-means clustering algorithm for 2D points. Given arrays of x and y coordinates for data points, initial centroids, and other parameters, assign each point to the nearest centroid and update the centroids iteratively. The final centroids and labels should be stored in the output arrays.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in labels, final_centroid_x, and final_centroid_y
\n Implement a radix sort algorithm that sorts an array of 32-bit unsigned integers on a GPU.\n The program should take an input array of unsigned integers and sort them in ascending order using the radix sort algorithm.\n The input parameter contains the unsorted array, and the sorted result should be stored in the output array.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final sorted result must be stored in the output array
\n
Use radix sort algorithm (not other sorting algorithms)
Implement a GPU program that computes the Fast Fourier Transform (FFT) of a complex-valued 1-D signal. Given an input signal array containing N complex numbers stored as interleaved real/imaginary pairs, compute the discrete Fourier transform and store the result in the spectrum array. The FFT converts a time-domain signal into its frequency-domain representation using the formula: $$ X_k = \\sum_{n=0}^{N-1} x_n \\cdot e^{-j 2\\pi kn / N} \\quad \\text{for } k = 0, 1, \\ldots, N-1 $$ The FFT algorithm reduces the computational complexity from O(N²) to O(N log N) by exploiting symmetries in the twiddle factors.
\n\n
Implementation Requirements
\n
\n
External libraries (cuFFT etc.) are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the spectrum array
\n
The kernel must be entirely GPU-resident; no host-side FFT calls
\n
Both input and output use interleaved real/imaginary layout: [real₀, imag₀, real₁, imag₁, ...]
\n Implement a program that finds the shortest path in an unweighted 2D grid using Breadth-First Search (BFS). Given a grid with obstacles and start/end positions, return the minimum number of steps needed to reach the destination.\n
\n\n\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
Return the shortest path length, or -1 if no path exists
\n
Grid cells with value 0 are free, cells with value 1 are obstacles
\n
Movement is allowed in 4 directions: up, down, left, right
Implement Causal (masked) Self-Attention for a given set of matrices. Given the query matrix Q of size M×d, key matrix K of size M×d, and value matrix V of size M×d, your program should compute the output matrix using the formula: $$\\text{Attention}_{\\text{causal}}(Q, K, V) = \\text{softmax}\\Bigl(\\text{masked}\\Bigl( \\frac{QK^T}{\\sqrt{d}} \\Bigr)\\Bigr)V$$
\n\n\n
where masked applies a causal mask that sets all positions corresponding to keys after the current query to $-\\infty$, i.e., for query i and key j:
$$
\\text{masked}(a_{ij}) =
\\begin{cases}
a_{ij}, & j \\le i \\\\
-\\infty, & j > i
\\end{cases}
$$
The softmax function is applied row-wise. Q, K, V, and output are all of data type float32; M and d are of data type int32.
\n\n\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The\n solve function signature must remain unchanged\n
\n
The final result must be stored in the output matrix\n output\n
Implement Linear Attention for a given set of matrices, following the method described in "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention". Given the query matrix Q of size M×d, key matrix K of size M×d, and value matrix V of size M×d, your program should compute the output matrix using the formula:
$$
\\text{LinearAttention}(Q, K, V) = \\frac{\\phi(Q) \\left(\\phi(K)^T V \\right)}{\\phi(Q) \\left(\\sum_j \\phi(K_j) \\right)}
$$
\n\n
\n where $ \\phi(x) $ is a feature map applied element-wise, for example:\n $$\n \\phi(x) = \\text{ELU}(x) + 1 =\n \\begin{cases}\n x + 1, & x > 0 \\\\\n e^x, & x \\le 0\n \\end{cases}\n $$\n All matrices Q, K, V, and output are of type float32, and M and d are of type int32.\n
\n\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The\n solve function signature must remain unchanged\n
\n
The final result must be stored in the output matrix\n output\n
\n Implement Sliding Window Self-Attention for a given set of matrices.\n Before introducing the sliding window version, let's first recall standard Self-Attention.\n
\n\n
1. Standard Softmax Attention
\n
\n Given query matrix Q, key matrix K, and value matrix V, each position i attends to all positions j using a softmax-weighted sum:\n
\n In other words, each query computes similarity with all keys, applies a softmax to get attention weights, and then computes a weighted sum of values.\n
\n\n
2. Sliding Window Self-Attention
\n
\n Sliding Window Attention modifies standard attention by restricting each query to attend only to a local window around its position.\n
\n\n
\n
For each position i, only consider the keys and values within a window of size window_size around i (positions [i-window_size, ..., i+window_size]).
\n
Compute similarity scores between Qi and the keys in this window:
\n Given a weighted directed graph of N vertices represented as an\n N × N distance matrix, compute the shortest path distance between\n every pair of vertices using the Floyd-Warshall algorithm. The matrix is stored as a flat array in\n row-major order: dist[i * N + j] is the weight of the directed edge from vertex\n i to vertex j. A value of +infinity means no direct edge\n exists. The diagonal is always zero. For each intermediate vertex k from 0 to N - 1\n (in order), update all pairs:\n
\n Implement a single GPT-2 transformer decoder block. Given an input tensor\n $x$ of shape (seq_len, 768) and a packed weight buffer containing\n all block parameters, compute the output using pre-norm architecture with\n multi-head self-attention and a feed-forward network with GELU activation.\n
\n\n\n\n
The block uses GPT-2's pre-norm architecture: LayerNorm is applied\nbefore each sub-layer (attention and feed-forward), not after. At a high level:
$$
\\begin{aligned}
x' &= x + \\text{MultiHeadAttn}\\!\\left(\\text{LN}_1(x)\\right) \\\\[4pt]
\\text{output} &= x' + \\text{FeedForward}\\!\\left(\\text{LN}_2(x')\\right)
\\end{aligned}
$$
Layer Norm 1: $x_{\\text{norm}} = \\text{LN}_1(x)$ with parameters $\\gamma_1, \\beta_1$
\n
QKV Projection: $QKV = x_{\\text{norm}} \\cdot W_{qkv} + b_{qkv}$, split into $Q, K, V$ each of shape (seq_len, 768)
\n
Multi-Head Attention: Reshape $Q, K, V$ into 12 heads of dimension 64, compute per-head scaled dot-product attention (no causal mask), then concatenate heads into $A$
\n
Output Projection: $P = A \\cdot W_{\\text{attn}} + b_{\\text{attn}}$
\n
Residual 1: $x' = x + P$
\n
Layer Norm 2: $h_{\\text{norm}} = \\text{LN}_2(x')$ with parameters $\\gamma_2, \\beta_2$
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output tensor
\n
LayerNorm uses $\\epsilon = 10^{-5}$
\n
Use the GELU tanh approximation: $\\text{GELU}(x) = 0.5\\,x\\!\\left(1 + \\tanh\\!\\left(\\sqrt{\\tfrac{2}{\\pi}}\\left(x + 0.044715\\,x^3\\right)\\right)\\right)$
\n
\n\n
Weight Layout
\n
All block parameters are packed into a single contiguous weights buffer\n(7,087,872 floats) in the following order. Index into the buffer using the offsets below\n(e.g. $W_{qkv}[i][j]$ is at weights[1536 + i * 2304 + j]).\nAll 2D matrices are stored in row-major order.
\n\n
Parameter                   Shape          Size        Offset
$\\gamma_1$ (LN1 weight)    (768,)         768         0
$\\beta_1$ (LN1 bias)       (768,)         768         768
$W_{qkv}$                   (768, 2304)    1,769,472   1,536
$b_{qkv}$                   (2304,)        2,304       1,771,008
$W_{\\text{attn}}$          (768, 768)     589,824     1,773,312
$b_{\\text{attn}}$          (768,)         768         2,363,136
$\\gamma_2$ (LN2 weight)    (768,)         768         2,363,904
$\\beta_2$ (LN2 bias)       (768,)         768         2,364,672
$W_{fc}$                    (768, 3072)    2,359,296   2,365,440
$b_{fc}$                    (3072,)        3,072       4,724,736
$W_{\\text{proj}}$          (3072, 768)    2,359,296   4,727,808
$b_{\\text{proj}}$          (768,)         768         7,087,104
\n\n
Example
\n
With seq_len = 4, x uniformly drawn from [−1, 1], and weights randomly initialized (see Weight Layout for the packing structure):
\n Implement a single Llama-style transformer decoder block. Given an input tensor $x$ of shape\n (seq_len, 512), a packed weight buffer, and precomputed RoPE tables, compute the\n output using pre-norm architecture with Grouped Query Attention (GQA), Rotary Position Embeddings\n (RoPE), and a SwiGLU feed-forward network.\n
\n\n\n\n
\n The block follows Llama's pre-norm architecture. Unlike GPT-2, it uses\n RMSNorm (no mean subtraction, no additive bias), Grouped Query\n Attention with 8 query heads and 2 key/value heads, Rotary Position\n Embeddings applied to Q and K, and a SwiGLU feed-forward network.\n None of the linear projections have bias terms.\n
where $M_{\\text{causal}}$ is the upper-triangular causal mask ($-\\infty$ above the diagonal)\nand $\\text{SiLU}(x) = x \\cdot \\sigma(x)$.
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output tensor
\n
RMSNorm uses $\\varepsilon = 10^{-5}$, no additive bias
\n
Apply causal masking: position $i$ attends only to positions $\\le i$
\n
Repeat K and V heads $4\\times$ (GQA groups) before computing attention
\n
cos and sin have shape (seq_len, 32); apply them to both Q and K heads independently
\n
\n\n
Weight Layout
\n
All parameters are packed into a single contiguous weights buffer\n(2,819,072 floats) in the order below. All 2-D matrices are stored row-major\nwith shape (out_dim, in_dim). There are no bias terms.
\n\n
Parameter                   Shape          Size       Offset
$w_1$ (RMSNorm 1 scale)     (512,)         512        0
$W_Q$                       (512, 512)     262,144    512
$W_K$                       (128, 512)     65,536     262,656
$W_V$                       (128, 512)     65,536     328,192
$W_O$                       (512, 512)     262,144    393,728
$w_2$ (RMSNorm 2 scale)     (512,)         512        655,872
$W_{\\text{gate}}$          (1408, 512)    720,896    656,384
$W_{\\text{up}}$            (1408, 512)    720,896    1,377,280
$W_{\\text{down}}$          (512, 1408)    720,896    2,098,176
\n\n
Example
\n
With seq_len = 4, x drawn uniformly from [−1, 1], and randomly\ninitialized weights:
\n Write a program that performs a 2D convolution operation on the GPU. Given an input matrix and a kernel (filter), compute the convolved\n output. The convolution should be performed with a \"valid\" boundary condition, meaning the kernel is only applied\n where it fully overlaps with the input.\n
\n\n\n\n
\n The input consists of:\n
\n
input: A 2D matrix of 32-bit floating-point numbers, represented as a 1D array in row-major order.\n
\n
kernel: A 2D kernel (filter) of 32-bit floating-point numbers, also represented as a 1D array in\n row-major order.
\n
\n\n\n
\n The output should be written to the output matrix (also a 1D array in row-major order). The output matrix will have dimensions:\n
\n Implement a program that performs a 3D convolution operation. Given a 3D input volume and a 3D kernel (filter), compute the convolved\n output. The convolution should use a \"valid\" boundary condition (no padding).\n
\n\n
\n For a 3D convolution, the output at position $(i,j,k)$ is given by:\n
\n Write a GPU program that computes the histogram of an array of 32-bit integers.\n The histogram should count the number of occurrences of each integer value in the range [0, num_bins).\n You are given an input array input of length N and the number of bins num_bins.\n
\n\n
\n The result should be an array of integers of length\nnum_bins, where each element represents\nthe count of occurrences of its corresponding index in the input array.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The\n solve function signature must remain unchanged\n
\n
The final result must be stored in the\n histogram array.\n
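A CUDA sketch of the histogram above using a global atomic per input element; it assumes the histogram buffer is zeroed before the kernel runs. Per-block shared-memory histograms would reduce contention but are omitted for brevity.

__global__ void histogram_kernel(const int* input, int* histogram, int N, int num_bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        int v = input[i];
        if (v >= 0 && v < num_bins)
            atomicAdd(&histogram[v], 1);   // one increment per occurrence
    }
}
// histogram must be zero-initialised first, e.g. cudaMemset(histogram, 0, num_bins * sizeof(int)).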
\n Write a GPU program that computes the prefix sum (cumulative sum) of an array of 32-bit floating point numbers.\n For an input array [a, b, c, d, ...], the prefix sum is [a, a+b, a+b+c, a+b+c+d, ...].\n
\n\n\n\n
Implementation Requirements
\n
\n
Use only GPU native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n Implement a GPU program that computes the dot product of two vectors containing 32-bit floating point numbers.\n The dot product is the sum of the products of the corresponding elements of two vectors.\n
\n
\n Mathematically, the dot product of two vectors $A$ and $B$ of length $n$ is defined as:\n $$\n A \\cdot B = \\sum_{i=0}^{n-1} A_i \\cdot B_i = A_0 \\cdot B_0 + A_1 \\cdot B_1 + \\ldots + A_{n-1} \\cdot B_{n-1}\n $$\n
\n
Implementation Requirements
\n
\n
Use only GPU native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output variable
\n
\n
Example 1:
\n
Input: A = [1.0, 2.0, 3.0, 4.0]\n B = [5.0, 6.0, 7.0, 8.0]\n Output: result = 70.0 (1.0*5.0 + 2.0*6.0 + 3.0*7.0 + 4.0*8.0)
\n
Example 2:
\n
Input: A = [0.5, 1.5, 2.5]\n B = [2.0, 3.0, 4.0]\n Output: result = 15.5 (0.5*2.0 + 1.5*3.0 + 2.5*4.0)
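A CUDA sketch of the dot product above: each block reduces its partial products in shared memory, then contributes one atomicAdd to the scalar result (assumed to be a zero-initialised device pointer). The block size is fixed at 256 to match the shared-memory array.

__global__ void dot_kernel(const float* A, const float* B, float* result, int N) {
    __shared__ float partial[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < N) ? A[i] * B[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction within the block
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(result, partial[0]);     // one atomic per block
}
// Launch with 256 threads per block; *result must start at 0.0f.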
\n Implement a GPU program that performs sparse matrix-vector multiplication.\n Given a sparse matrix $A$ of dimensions $M \\times N$ and a dense vector $x$ of length $N$,\n compute the product vector $y = A \\times x$, which will have length $M$. A is stored in row-major order.\n nnz is the number of non-zero elements in A.\n
\n\n
\n Mathematically, the operation is defined as:\n $$\n y_i = \\sum_{j=0}^{N-1} A_{ij} \\cdot x_j \\quad \\text{for} \\quad i = 0, 1, \\ldots, M-1\n $$\n
\n\n
\n The matrix $A$ is approximately 60 - 70% sparse.\n
\n\n
Implementation Requirements
\n
\n
Use only GPU native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n Implement a basic General Matrix Multiplication (GEMM). Given matrix $A$ of dimensions $M \\times K$, matrix $B$ of dimensions $K \\times N$, input/output matrix $C$ of dimensions $M \\times N$, and scalar multipliers $ \\alpha $ and $ \\beta $, compute the operation:\n $$ C = \\alpha \\cdot (A \\times B) + \\beta \\cdot C_{initial} $$\n
\n
\n The input matrices $A$, $B$, and the initial state of $C$ contain 16-bit floating-point numbers (FP16/half). All matrices are stored in row-major order. The scalars $ \\alpha $ and $ \\beta $ are 32-bit floats.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries other than WMMA are not permitted).
\n
The solve function signature must remain unchanged.
\n
Accumulation during multiplication should use FP32 for better precision before converting the final result to FP16.
\n
The final result must be stored back into matrix C as half.
\n Implement a GPU program to calculate the categorical cross-entropy loss for a batch of predictions.\n Given a matrix of predicted logits $Z$ of size $N \\times C$ and a vector of true class labels true_labels of size $N$, compute the average cross-entropy loss over the batch.\n The loss for a single sample $j$ with logits $z_j = [z_{j1}, \\ldots, z_{jC}]$ and true label $y_j$ is calculated using the numerically stable formula:\n $$ \\text{Loss}_j = \\log\\left(\\sum_{k=1}^{C} e^{z_{jk}}\\right) - z_{j, y_j} $$\n The final output stored in the loss variable should be the average loss over the $N$ samples:\n $$ L = \\frac{1}{N} \\sum_{j=1}^{N} \\text{Loss}_j $$\n The input parameters are logits, true_labels, N (number of samples), and C (number of classes). The result should be stored in loss (a pointer to a single float).\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result (average loss) must be stored in loss
\n
\n\n
Example 1:
\n
Input: N = 2, C = 3\n logits = [[1.0, 2.0, 0.5], [0.1, 3.0, 1.5]]\n true_labels = [1, 1]\nOutput: loss = [0.3548926]
\n\n\n
Example 2:
\n
Input: N = 3, C = 4\n logits = [[-0.5, 1.5, 0.0, 1.0], [2.0, -1.0, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0]]\n true_labels = [3, 0, 1]\nOutput: loss = [0.98820376]
\n Implement a GPU program to calculate the Mean Squared Error (MSE) between\n predicted values and target values. Given two arrays of equal length,\n predictions and targets, compute: $$ \\text{MSE} =\n \\frac{1}{N} \\sum_{i=1}^{N} (predictions_i - targets_i)^2 $$ where N is the\n number of elements in each array.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted.
\n
The solve function signature must remain unchanged.
\n
The final result must be stored in the mse variable.
\n Implement a program that applies a Gaussian blur filter to a 2D image. Given an input image represented as a floating-point array and a Gaussian kernel, the program should compute the convolution of the image with the kernel.\n All inputs and outputs are stored in row-major order.\n
\n\n
\n The Gaussian blur is performed by convolving each pixel with a weighted average of its neighbors, where the weights are determined by the Gaussian kernel. For each output pixel at position (i, j), the value is calculated as:\n\n $$ output[i, j] = \\sum_{m=-k_h/2}^{k_h/2} \\sum_{n=-k_w/2}^{k_w/2} input[i+m, j+n] \\times kernel[m+k_h/2, n+k_w/2] $$\n\n where $k_h$ and $k_w$ are the kernel height and width.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output array
\n
Handle boundary conditions by using zero-padding (treat values outside the image boundary as zeros)
\n Implement a GPU program that, given a 1D array input of 32-bit floating point numbers of length N, selects the k largest elements and writes them in descending order to the output array of length k.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output array
\n Implement a batched matrix multiplication in FP32. Given a batch of matrices A of shape [B, M, K] and a batch of matrices B of shape [B, K, N], compute the output batch C of shape [B, M, N] such that for each batch index b:\n $$\n C_b = A_b \\times B_b\n $$\n All matrices are stored in row-major order and use 32-bit floating point numbers (FP32).\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n Implement a quantized matrix multiplication program for 8-bit signed integer matrices. Given two input matrices A of dimensions $M \\times K$ and B of dimensions $K \\times N$, quantization scales scale_A, scale_B, output scale scale_C, zero-points zero_point_A, zero_point_B, zero_point_C, compute:\n $$\n C_{\\text{quant}}(i, j) = \\mathrm{clamp}\\left(\n \\mathrm{round}\\left(\n \\frac{\n \\sum_{k=0}^{K-1} (A_{ik} - z_A)(B_{kj} - z_B) \\cdot s_A s_B\n }{s_C}\n \\right) + z_C,\\ -128,\\ 127\n \\right)\n $$\n where s_A = scale_A, z_A = zero_point_A, etc.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output matrix C as int8
\n
After accumulation in int32 and scaling in float32, values must be rounded to the nearest integer, shifted by zero_point_C, and clamped to the [-128, 127] range
\n
\n\n
Example 1:
\n
\n Input:\n A = [[1, 2],\n [3, 4]]\n B = [[5, 6],\n [7, 8]]\n M = 2, N = 2, K = 2\n scale_A = 0.1, scale_B = 0.2, scale_C = 0.05\n zero_point_A = 0, zero_point_B = 0, zero_point_C = 0\n\n Output:\n C = [[19, 22],\n [43, 50]]\n
\n\n
Example 2:
\n
\n Input:\n A = [[1, 2]]\n B = [[3],\n [4]]\n M = 1, N = 1, K = 2\n scale_A = 1.0, scale_B = 1.0, scale_C = 1.0\n zero_point_A = 1, zero_point_B = 3, zero_point_C = 5\n\n Output:\n C = [[6]]\n
\n Solve the Ordinary Least Squares (OLS) regression problem on a GPU. Given a feature matrix $X$ of size $n\\_samples \\times n\\_features$ and a target vector $y$ of size $n\\_samples$, compute the coefficient vector $\\beta$ that minimizes the sum of squared residuals:\n $$ \\min_{\\beta} ||X\\beta - y||^2 $$\n\n The closed-form solution to OLS is:\n $$ \\beta = (X^TX)^{-1}X^Ty $$\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted.
\n
The solve function signature must remain unchanged.
\n
The final coefficients must be stored in the beta vector.
\n
Assume that the feature matrix $X$ is full rank (i.e., $X^TX$ is invertible).
\n Solve the logistic regression problem on a GPU. Given a feature matrix $X$ of size $n\\_samples \\times n\\_features$ and a binary target vector $y$ of size $n\\_samples$ (containing only 0s and 1s), compute the coefficient vector $\\beta$ that maximizes the log-likelihood:\n $$ \\max_{\\beta} \\sum_{i=1}^{n} \\left[ y_i \\log(p_i) + (1-y_i) \\log(1-p_i) \\right] $$\n\n where $p_i = \\sigma(X_i^T \\beta)$ and $\\sigma(z) = \\frac{1}{1 + e^{-z}}$ is the sigmoid function.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final coefficients must be stored in the beta vector
\n
The target vector y contains only binary values (0 and 1)
\n Implement Monte Carlo integration on a GPU. Given a set of function values $y_i = f(x_i)$ sampled at random points $x_i$ uniformly distributed in the interval $[a, b]$, estimate the definite integral:\n $$ \\int_a^b f(x) \\, dx \\approx (b - a) \\cdot \\frac{1}{n} \\sum_{i=1}^{n} y_i $$\n\n The Monte Carlo method approximates the integral by computing the average of the function values and multiplying by the interval width.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the result variable
\n
Solutions are tested with absolute tolerance of 1e-2 and relative tolerance of 1e-2
\n
\n\n
Example:
\n
\nInput: a = 0, b = 2, n_samples = 8\n y_samples = [0.0625, 0.25, 0.5625, 1.0, 1.5625, 2.25, 3.0625, 4.0]\nOutput: result = 3.1875\n
\n\n
Constraints
\n
\n
1 ≤ n_samples ≤ 100,000,000
-1000.0 ≤ a < b ≤ 1000.0
-10000.0 ≤ function values ≤ 10000.0
\n
The tolerance is set to 1e-2 to account for the inherent randomness in Monte Carlo methods and floating-point precision variations.
\n\n
Performance is measured with n_samples = 10,000,000
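A CUDA sketch of the estimator above: a grid-stride loop accumulates a per-thread sum of the samples, and each thread folds its contribution into the result with a single atomicAdd, pre-scaled by (b - a) / n_samples. The signature and the zero-initialised result pointer are assumptions.

__global__ void monte_carlo_kernel(const float* y_samples, float* result,
                                   int n_samples, float scale) {
    int start = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float local = 0.0f;
    for (int i = start; i < n_samples; i += stride)   // grid-stride loop
        local += y_samples[i];
    atomicAdd(result, local * scale);                 // scale = (b - a) / n_samples
}
// Host side (assumed): float scale = (b - a) / (float)n_samples; *result starts at 0.0f.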
\n Implement a GPU program that raises a square matrix $A$ of size $N \\times N$ to an integer power $P$. \n The solve function receives a flattened input matrix input (row-major order), an empty output matrix output of the same size, the dimension N, and the exponent P. \n You must compute $\\text{output} = A^{P}$ where matrix multiplication is standard dense multiplication over 32-bit floating point numbers.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted.
\n
The solve function signature must remain unchanged.
\n
The final result must be written to the output array in row-major order.
Implement a GPU program that, for N three-dimensional points stored on the device, fills indices[i] with the index j ≠ i of the point closest to points[i]. Comparing squared Euclidean distance is sufficient; you do not need to compute square roots.
\n\n
Implementation Requirements
\n
\n
The solve function signature must remain unchanged
\n
External libraries are not permitted
\n
The final result must be stored in the indices array
\n
\n\n
Example 1:
\n
Input: points = [(0,0,0), (1,0,0), (5,5,5)]
       indices = [-1, -1, -1]
       N = 3
Output: indices = [1, 0, 1]   # 0 and 1 are mutually nearest; 2 is closest to 1
\n\n
Constraints
\n
\n
1 ≤ N ≤ 100,000
\n
Coordinates are 32-bit floats in the range [-1000, 1000]
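A brute-force CUDA sketch of the nearest-neighbour search above (O(N²) total work, one thread per query point). It assumes the points are stored as interleaved x, y, z floats; the actual layout may differ.

__global__ void nearest_kernel(const float* points, int* indices, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float xi = points[3 * i], yi = points[3 * i + 1], zi = points[3 * i + 2];
    float best = 3.402823e38f;   // ~FLT_MAX
    int best_j = -1;
    for (int j = 0; j < N; ++j) {
        if (j == i) continue;
        float dx = points[3 * j]     - xi;
        float dy = points[3 * j + 1] - yi;
        float dz = points[3 * j + 2] - zi;
        float d2 = dx * dx + dy * dy + dz * dz;   // squared distance is sufficient
        if (d2 < best) { best = d2; best_j = j; }
    }
    indices[i] = best_j;
}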
\n Implement batch normalization forward pass for 2D input tensors. Given an input tensor of shape [N, C] where N is the batch size and C is the number of features, compute the normalized output using learnable scale (gamma) and shift (beta) parameters.\n
\n Implement a 2D max pooling operation for image/feature map downsampling.\n The program should take an input tensor and produce an output tensor by applying max pooling with specified kernel size, stride, and padding.\n
\n\n\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in tensor output
\n
\n\n
Max Pooling Operation
\n
\n For each output position (n, c, h_out, w_out), compute the maximum value over the corresponding input window:\n \n output[n, c, h_out, w_out] = max(input[n, c, h:h+kernel_size, w:w+kernel_size])\n \n where h = h_out * stride and w = w_out * stride\n
Write a GPU program that counts the number of elements equal to the integer value k in an array of 32-bit integers. You are given an input array input of length N and an integer k.
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output variable
\n
\n\n
Example 1:
\n
\nInput: [1, 2, 3, 4, 1], k = 1\nOutput: 2\n
\n\n
Example 2:
\n
\nInput: [5, 10, 5, 2], k = 11\nOutput: 0\n
\n\n
Constraints
\n
\n
1 ≤ N ≤ 100,000,000
\n
1 ≤ input[i], k ≤ 100,000
\n\n
Performance is measured with K = 501,010, N = 100,000,000
Write a GPU program that counts the number of elements equal to the integer value k in a 2D array of 32-bit integers. You are given an input 2D array input of dimensions N x M and an integer k.
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output variable
Write a GPU program that counts the number of elements equal to the integer value p in a 3D array of 32-bit integers. You are given an input 3D array input of dimensions N x M x K and an integer p.
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output variable
\n
\n\n
Example 1:
\n
\nInput: input [[[1, 2, 3],\n [4, 5, 1]],\n [[1, 1, 1],\n [2, 2, 2]]]\n N = 2, M = 2, K = 3\n p = 1\nOutput: output = 5\n
\n\n
Example 2:
\n
\nInput: input [[[5, 10],\n [5, 2],\n [2, 2]]]\n N = 1, M = 3, K = 2\n p = 1\nOutput: output = 0\n
\n\n
Constraints
\n
\n
1 ≤ N, M, K ≤ 1,000
\n
1 ≤ input[i], p ≤ 100
\n\n
Performance is measured with K = 500, M = 500, N = 500
Implement a program that computes the sum of a subarray of 32-bit integers. You are given an input array input of length N, and two indices S and E. S and E are inclusive, 0-based start and end indices; compute the sum of input[S..E].
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output variable
\n
\n\n
Example 1:
\n
\nInput: input = [1, 2, 1, 3, 4], S = 1, E = 3\nOutput: output = 6\n
\n\n
Example 2:
\n
\nInput: input = [1, 2, 3, 4], S = 0, E = 3\nOutput: output = 10\n
Implement a program that computes the sum of a 2D subarray of 32-bit integers. You are given an input 2D array input of dimensions N x M, two row indices S_ROW and E_ROW, and two column indices S_COL and E_COL. S_ROW, E_ROW, S_COL, and E_COL are inclusive, 0-based start and end indices; compute the sum of input[S_ROW..E_ROW][S_COL..E_COL].
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output variable
Implement a program that computes the sum of a 3D subarray of 32-bit integers. You are given an input 3D array input of dimensions N x M x K, two depth indices S_DEP and E_DEP, two row indices S_ROW and E_ROW, and two column indices S_COL and E_COL. All six are inclusive, 0-based start and end indices; compute the sum of input[S_DEP..E_DEP][S_ROW..E_ROW][S_COL..E_COL].
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output variable
\n Write a GPU program that performs parallel reduction on an array of 32-bit floating point numbers to compute their sum.\n The program should take an input array and produce a single output value containing the sum of all elements.\n
\n\n
Implementation Requirements
\n
\n
Use only GPU native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output variable
\n Implement RMS Normalization forward pass for 1D input vectors. Given an input tensor of shape [N] where N is the number of elements, compute the normalized output using a scalar scale (gamma) and shift (beta) parameter.\n
\n Implement a program that computes the maximum sum of any contiguous subarray of length exactly window_size. You are given an array input of length N consisting of 32-bit signed integers, and an integer window_size.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output variable
Implement Attention with Linear Biases (ALiBi), following the method described in "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation", for a given set of matrices. Given the query matrix Q of size M×d, key matrix K of size N×d, and value matrix V of size N×d, your program should compute the output matrix using the formula:
\n where α is a slope controlling the linear bias and Δ = i - j represents the relative position between query i and key j.\n The softmax function is applied row-wise. Q, K, V, output, and α are all of data type float32;\n M, N, d are of data type int32.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The\n solve function signature must remain unchanged\n
\n
The final result must be stored in the output matrix\n output\n
\n Implement a batched matrix multiplication in FP16. Given a batch of matrices A of shape [B, M, K] and a batch of matrices B of shape [B, K, N], compute the output batch C of shape [B, M, N] such that for each batch index b:\n $$\n C_b = A_b \\times B_b\n $$\n All matrices are stored in row-major order and use 16-bit floating point numbers (FP16/half). Accumulation during multiplication should use FP32 for better precision before converting the final result to FP16.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
Accumulation during multiplication should use FP32 for better precision before converting the final result to FP16
\n
The final result must be stored in the C array as half
\n Implement a GPU program that computes the dot product of two vectors containing 16-bit floating point numbers (FP16/half).\n The dot product is the sum of the products of the corresponding elements of two vectors.\n
\n
\n Mathematically, the dot product of two vectors $A$ and $B$ of length $n$ is defined as:\n $$\n A \\cdot B = \\sum_{i=0}^{n-1} A_i \\cdot B_i = A_0 \\cdot B_0 + A_1 \\cdot B_1 + \\ldots + A_{n-1} \\cdot B_{n-1}\n $$\n
\n
\n All inputs are stored as 16-bit floating point numbers (FP16/half). For best precision, accumulation during multiplication should use FP32 before converting the final result to FP16.\n
\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
Accumulation during multiplication should use FP32 for better precision before converting the final result to FP16
\n
The final result must be stored in the output variable as half
\n
\n
Example 1:
\n
Input: A = [1.0, 2.0, 3.0, 4.0]\n B = [5.0, 6.0, 7.0, 8.0]\n Output: result = 70.0 (1.0*5.0 + 2.0*6.0 + 3.0*7.0 + 4.0*8.0)
\n
Example 2:
\n
Input: A = [0.5, 1.5, 2.5]\n B = [2.0, 3.0, 4.0]\n Output: result = 15.5 (0.5*2.0 + 1.5*3.0 + 2.5*4.0)
\n Write a program that computes the softmax function for an array of 32-bit floating-point numbers on a GPU. The softmax function is defined as follows:\n
\n\n
\n For an input array $x$ of length $n$, the softmax of $x$, denoted $\\sigma(x)$, is an array of length $n$ where the $i$-th element is:\n
\n Your solution should handle potential overflow issues by using the \"max trick\". Subtract the maximum value of the input array from each element before exponentiation.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the array output
\n Write a GPU program that implements top-p (nucleus) sampling for LLM inference.\n
\n\n
\n Top-p sampling is a text generation technique where you sample from the smallest set of tokens whose cumulative probability exceeds threshold p.\n This balances randomness and quality better than pure top-k or greedy sampling.\n
\n\n
\n Given logits (unnormalized scores) from a language model:\n
\n
Convert logits to probabilities using softmax
\n
Sort tokens by probability (descending)
\n
Find the smallest set where cumulative probability ≥ p (the "nucleus")
\n
Renormalize the nucleus probabilities to sum to 1
\n
Sample a token from the nucleus using the provided random seed
\n \n\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
Ensure numerical stability when computing softmax
\n
\n\n
Example 1:
\n
\nInput:\n logits = [1.0, 2.0, 3.0, 0.5]\n p = 0.9\n seed = 42\n\nOutput:\n sampled_token = 2 or 1\n (tokens with highest probabilities, sampled randomly)\n
\n\n
Example 2:
\n
\nInput:\n logits = [10.0, 1.0, 1.0]\n p = 0.5\n seed = 123\n\nOutput:\n sampled_token = 0\n (single token dominates the probability mass)\n
\n Implement a GPU program that computes the Rotary Positional Embedding (RoPE) for a batch of query vectors.\n RoPE is a method for encoding positional information in transformer models by rotating the query and key vectors using precomputed cosine and sine components.\n
\n
\n Mathematically, given a query vector $x$ and corresponding cosine and sine vectors, the operation is defined as:\n $$\n \\text{RoPE}(x) = x \\odot \\cos + \\text{rotate\\_half}(x) \\odot \\sin\n $$\n
\n
\n Where $\\odot$ denotes element-wise multiplication. The $\\text{rotate\\_half}(x)$ operation swaps the first and second halves of the vector and negates the first half. For a vector of dimension $d$:\n $$\n \\text{rotate\\_half}([x_1, \\dots, x_{d/2}, x_{d/2+1}, \\dots, x_d]) = [-x_{d/2+1}, \\dots, -x_d, x_1, \\dots, x_{d/2}]\n $$\n
\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The input tensors Q, cos, and sin have shape (M, D), where M is the number of tokens and D is the head dimension
\n
D (head dimension) is guaranteed to be an even number
\n
The final result must be stored in the output variable with the same shape (M, D)
\n
\n
Example 1:
\n
Input: Q = [[1.0, 2.0, 3.0, 4.0],\n [1.0, 1.0, 1.0, 1.0]]\n Cos = [[1.0, 1.0, 1.0, 1.0],\n [0.0, 0.0, 0.0, 0.0]]\n Sin = [[0.0, 0.0, 0.0, 0.0],\n [1.0, 1.0, 1.0, 1.0]]\nOutput: result = [[1.0, 2.0, 3.0, 4.0],\n [-1.0, -1.0, 1.0, 1.0]]\n (Row 0 is identity via Cos; Row 1 is rotated via Sin)
\n
Constraints
\n
\n
Q, cos, and sin have identical dimensions
\n
D % 2 == 0
\n
1 ≤ M, D ≤ 10,000
\n\n
Performance is measured with D = 128, M = 1,048,576
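A CUDA sketch of the rotation above, one thread per element of the (M, D) output; the rotate_half term is reconstructed on the fly from the definition given. Parameter names are assumptions.

__global__ void rope_kernel(const float* Q, const float* cos_t, const float* sin_t,
                            float* output, int M, int D) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= M * D) return;
    int m = idx / D, d = idx % D, half = D / 2;
    const float* q = Q + m * D;
    // rotate_half(x)[d] = -x[d + D/2] for the first half, x[d - D/2] for the second half
    float rot = (d < half) ? -q[d + half] : q[d - half];
    output[idx] = q[d] * cos_t[idx] + rot * sin_t[idx];
}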
\n Implement a GPU program that \"dequantizes\" a weight matrix on the GPU. You are given an input matrix X of shape [M, N] containing quantized values and a scale matrix S of shape [ceil(M/T), ceil(N/T)], where T is the tile size.\n
\n
\n For each element $X_{i,j}$, the corresponding scale factor is $S_{row, col}$ where $row = \\lfloor i / T \\rfloor$ and $col = \\lfloor j / T \\rfloor$.\n The output $Y_{i,j}$ should be computed as:\n $$\n Y_{i,j} = X_{i,j} \\times S_{row, col}\n $$\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the output buffer Y
\n
\n\n
Example 1:
\n
\nInput:\nM = 4, N = 4, TILE_SIZE = 2\nX = [\n [10, 10, 5, 5],\n [10, 10, 5, 5],\n [ 2, 2, 8, 8],\n [ 2, 2, 8, 8]\n]\nS = [\n [0.5, 2.0],\n [4.0, 0.25]\n]\n\nOutput:\nY = [\n [ 5.0, 5.0, 10.0, 10.0],\n [ 5.0, 5.0, 10.0, 10.0],\n [ 8.0, 8.0, 2.0, 2.0],\n [ 8.0, 8.0, 2.0, 2.0]\n]\nExplanation:\nTile (0,0) of X is multiplied by S[0,0] (0.5).\nTile (0,1) of X is multiplied by S[0,1] (2.0).\nTile (1,0) is multiplied by S[1,0] (4.0).\nTile (1,1) is multiplied by S[1,1] (0.25).\n
\n\n
Constraints
\n
\n
1 ≤ M, N ≤ 8192
\n
TILE_SIZE ∈ {16, 32, 64, 128}
\n\n
Performance is measured with M = 8,192, N = 8,192, TILE_SIZE = 128
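A CUDA sketch of the tile-wise dequantization above. The statement does not fix the storage type of the quantized matrix X, so it is read here as float; the point of the example is the indexing into the per-tile scale matrix S.

__global__ void dequantize_kernel(const float* X, const float* S, float* Y,
                                  int M, int N, int T) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= M * N) return;
    int i = idx / N, j = idx % N;
    int scale_cols = (N + T - 1) / T;                     // ceil(N / T) columns of S
    Y[idx] = X[idx] * S[(i / T) * scale_cols + (j / T)];  // per-tile scale factor
}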
\n Implement a GPU program that performs Top-K Gating for Mixture of Experts (MoE) models. Given a logit matrix of shape [M, E] where M is the number of tokens and E is the number of experts, identify the k largest values in each row, extract their indices, and apply softmax to get mixing weights.\n
\n\n
\n For each row i, the operation computes:\n $$\n \\begin{align}\n \\text{indices}_i, \\text{vals}_i &= \\text{TopK}(\\text{logits}_i, k) \\\\\n \\text{vals}_i &= \\text{logits}_i[\\text{indices}_i] \\\\\n \\text{weights}_i &= \\text{Softmax}(\\text{vals}_i)\n \\end{align}\n $$\n
\n\n
\n The selected experts must remain ordered by descending logit value, matching the order returned by\n topk. The topk_weights array must correspond positionally to\n topk_indices in that same order.\n
\n\n
Implementation Requirements
\n
\n
External libraries are not permitted
\n
The solve function signature must remain unchanged
\n
The final result must be stored in the topk_weights and topk_indices arrays
\n
\n\n
Example 1:
\n
\nInput:\n logits = [[1.0, 2.0, 3.0, 4.0],\n [4.0, 3.0, 2.0, 1.0]]\n M = 2, E = 4, k = 2\n\nOutput:\n topk_weights = [[0.7311, 0.2689],\n [0.7311, 0.2689]]\n topk_indices = [[3, 2],\n [0, 1]]\n\nExplanation:\nRow 0: Top-2 values are 4.0 and 3.0 at indices 3 and 2.\n Softmax([4.0, 3.0]) = [0.7311, 0.2689]\nRow 1: Top-2 values are 4.0 and 3.0 at indices 0 and 1.\n Softmax([4.0, 3.0]) = [0.7311, 0.2689]\n
\n\n
Constraints
\n
\n
1 ≤ M ≤ 10,000 (number of tokens)
1 ≤ E ≤ 256 (number of experts)
1 ≤ k ≤ E (top-k selection, typically k=2)
\n Given a 2D grid of 32-bit floating point values, apply one iteration of the 5-point Jacobi stencil:\n each interior cell of the output is set to the average of its four cardinal neighbors (top, bottom,\n left, right) from the input grid. Boundary cells (first/last row and column) are copied unchanged\n from the input to the output.\n
\n\n\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in output
\n
Read exclusively from input and write exclusively to output (do not update input)
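A CUDA sketch of one Jacobi iteration as specified above, reading only from input and writing only to output; the grid is assumed to be rows x cols in row-major order.

__global__ void jacobi_kernel(const float* input, float* output, int rows, int cols) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows || c >= cols) return;
    int idx = r * cols + c;
    if (r == 0 || r == rows - 1 || c == 0 || c == cols - 1) {
        output[idx] = input[idx];   // boundary cells are copied unchanged
    } else {
        output[idx] = 0.25f * (input[idx - cols] + input[idx + cols] +
                               input[idx - 1]    + input[idx + 1]);
    }
}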
Implement a GPU program that computes the softmax attention operation for a given set of matrices. Given the query matrix Q of size M×d, key matrix K of size N×d, and value matrix V of size N×d, your program should compute the output matrix using the formula: $$\\text{Attention}(Q, K, V) = \\text{softmax}\\Bigl( \\frac{QK^T}{\\sqrt{d}} \\Bigr)V,$$ where the softmax function is applied row-wise.
\n
Implementation Requirements
\n
\n
Use only GPU native features (external libraries are not permitted)
\n
The\n solve function signature must remain unchanged\n
\n
The final result must be stored in the output matrix\n output\n
\n Given an array of N 32-bit floating point values and an integer array\n flags of the same length, where flags[i] = 1 marks the start of a new\n segment and flags[i] = 0 continues the current segment, compute the\n exclusive prefix sum within each segment and store the result in\n output. The first element is always a segment start\n (flags[0] = 1). Within each segment, output[i] equals the sum of all\n values elements in the same segment that appear before index i, so the\n first element of every segment is always 0.0.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n Given two sorted arrays A of length M and B of length\n N, both containing 32-bit floating-point values in non-decreasing order, produce a\n single sorted array C of length M + N containing all elements of\n A and B in non-decreasing order.\n
\n\n
Implementation Requirements
\n
\n
Use only GPU native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final merged result must be stored in C
\n
\n\n
Example
\n
\nInput:\n A = [1.0, 3.0, 5.0, 7.0], M = 4\n B = [2.0, 4.0, 6.0, 8.0], N = 4\n\nOutput:\n C = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]\n
\n\n
\nInput:\n A = [-1.0, 1.0, 3.0], M = 3\n B = [2.0], N = 1\n\nOutput:\n C = [-1.0, 1.0, 2.0, 3.0]\n
\n\n
Constraints
\n
\n
1 ≤ M, N ≤ 50,000,000
\n
M + N ≤ 50,000,000
\n
Both A and B are sorted in non-decreasing order
\n
Elements are 32-bit floats
\n
Performance is measured with M = 25,000,000, N = 25,000,000
\n Given a 1D array A of N 32-bit floating point numbers, compact all\n positive elements (A[i] > 0) to the front of the output array out,\n preserving their original relative order. Fill any remaining positions with 0.0.\n Stream compaction is a fundamental GPU primitive used throughout rendering, sparse computation,\n and collision detection.\n
\n\n
Implementation Requirements
\n
\n
Use only native GPU features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
\n The first k positions of out must contain the k elements of\n A where A[i] > 0, in their original order\n
\n
Positions k through N−1 of out must be 0.0
\n
Elements where A[i] = 0.0 are not selected
\n
\n\n
Example
\n
\nInput: A = [1.0, -2.0, 3.0, 0.0, -1.0, 4.0]\nOutput: out = [1.0, 3.0, 4.0, 0.0, 0.0, 0.0]\n
\n\n
Constraints
\n
\n
1 ≤ N ≤ 100,000,000
\n
−1000.0 ≤ A[i] ≤ 1000.0
\n
out is pre-allocated with N elements, initialised to 0.0
\n Implement a GPU program that multiplies a sparse matrix A of dimensions M × N\n by a dense matrix B of dimensions N × K, producing a dense output matrix\n C of dimensions M × K.\n All matrices are stored in row-major order using 32-bit floats.\n The matrix A is approximately 60–70% sparse (i.e., 60–70% of elements are zero),\n and nnz gives the number of non-zero elements in A.\n
\n\n
\n Mathematically, the operation is defined as:\n $$\n C_{ij} = \\sum_{k=0}^{N-1} A_{ik} \\cdot B_{kj} \\quad \\text{for} \\quad i = 0, \\ldots, M-1,\\; j = 0, \\ldots, K-1\n $$\n
\n\n
Implementation Requirements
\n
\n
Use only GPU native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\nRun batched autoregressive inference for a 10-parameter transformer that adds two 10-digit\nnumbers. Given prompts of shape [batch_size, 31] (int32) and a 10-float weight\nbuffer, write output logits of shape [batch_size, 11, 10] — one logit\nrow per decode step over the 10-digit vocabulary (0–9). All tensors are float32 except\nthe int32 prompts.\n
\n\n
\nThe model comes from the\nAdderBoard\ncompetition for the smallest autoregressive transformer that adds 10-digit numbers at\n≥99% accuracy. It encodes carry propagation in 10 learned parameters via RoPE geometry,\ntied embeddings, and SwiGLU gating.\n
\n\n\n\n
Model Architecture
\n\n
Single-layer pre-norm transformer. Hidden dim 2, 1 head, head dim 2, vocab 10 (digits\n0–9), tied input/output embeddings.
\n\n
Each step runs the full sequence [batch_size, seq_len, 2] through:
where a_rev and b_rev are the digits in least-significant-first order,\nzero-padded to 10 digits. The model then generates 11 output tokens (digits of the sum, also\nleast-significant-first).
\n\n
Implementation Requirements
\n
\n
Implement solve(prompts, output, weights, batch_size) with the exact signature shown (JAX exception: solve(prompts, weights, batch_size) returns the output tensor directly)
\n
Do not use any external libraries beyond what the framework provides
\n
The function must write logits into the output buffer (except JAX, which returns it)
\n Compute the 2D Discrete Fourier Transform (2D DFT) of a complex-valued signal stored on the GPU.\n Given a 2D complex input signal of shape M × N, compute its 2D DFT spectrum\n using the row-column decomposition: apply a 1D DFT along each row, then a 1D DFT along each\n column of the result. All values are 32-bit floating point.\n
\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The final result must be stored in spectrum
\n
\n The input and output are stored as 1D arrays of interleaved real and imaginary parts in\n row-major order: element x[m, n] has its real part at index\n 2*(m*N + n) and imaginary part at index 2*(m*N + n) + 1\n
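A minimal CUDA sketch of the row pass; the column pass is the same computation with the roles of m and n swapped. Because every output element of a column depends on every input element of that column, the column pass should read from a temporary buffer (or a separate output buffer) rather than overwriting data that other threads still need.

```cuda
// 1D DFT along each row of an M x N complex signal stored interleaved
// (real at 2*(m*N + n), imaginary at 2*(m*N + n) + 1).
__global__ void dft_rows(const float* in, float* out, int M, int N) {
    const float PI = 3.14159265358979f;
    int m = blockIdx.y;                                    // row index
    int k = blockIdx.x * blockDim.x + threadIdx.x;         // output frequency within the row
    if (m >= M || k >= N) return;
    float re = 0.0f, im = 0.0f;
    for (int n = 0; n < N; ++n) {
        float xr  = in[2 * (m * N + n)];
        float xi  = in[2 * (m * N + n) + 1];
        float ang = -2.0f * PI * (float)k * (float)n / (float)N;
        float c = cosf(ang), s = sinf(ang);
        re += xr * c - xi * s;                             // complex multiply-accumulate
        im += xr * s + xi * c;
    }
    out[2 * (m * N + k)]     = re;
    out[2 * (m * N + k) + 1] = im;
}
```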
\nImplement Grouped Query Attention (GQA), the attention mechanism used in modern large language\nmodels such as LLaMA-3, Mistral, and Gemma. GQA reduces the KV-cache memory footprint during\ninference by sharing key and value heads across groups of query heads. Given query tensor\nQ with num_q_heads heads and key/value tensors K,\nV each with num_kv_heads heads, compute scaled dot-product attention\nwhere every group of num_q_heads / num_kv_heads consecutive query heads attends to\nthe same key and value head. All tensors use float32.\n
\n\n\n\n
Implementation Requirements
\n
\n
Implement the function solve(Q, K, V, output, num_q_heads, num_kv_heads, seq_len, head_dim).
\n
Do not change the function signature or use external libraries beyond the standard GPU frameworks.
\n
Write the result into the provided output buffer.
\n
num_q_heads is always divisible by num_kv_heads.
\n
Use scaled dot-product attention with scale factor 1 / sqrt(head_dim) and a softmax over the key dimension.
\n
\n\n
Example
\n
\n With num_q_heads = 4, num_kv_heads = 2 (groups of 2), seq_len = 3,\n head_dim = 4:\n
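A minimal CUDA sketch of the full (non-causal) GQA computation with one thread per (query head, query position). The layout Q[h, s, d] at offset (h * seq_len + s) * head_dim + d (and likewise for K, V, output) and the bound head_dim <= 128 for the local accumulator are assumptions, since the statement does not fix them here.

```cuda
__global__ void gqa_forward(const float* Q, const float* K, const float* V, float* output,
                            int num_q_heads, int num_kv_heads, int seq_len, int head_dim) {
    int h = blockIdx.y;                                    // query head
    int q = blockIdx.x * blockDim.x + threadIdx.x;         // query position
    if (h >= num_q_heads || q >= seq_len) return;
    int kv = h / (num_q_heads / num_kv_heads);             // key/value head shared by this group
    float scale = rsqrtf((float)head_dim);

    // Pass 1: maximum score for a numerically stable softmax.
    float m = -INFINITY;
    for (int s = 0; s < seq_len; ++s) {
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            dot += Q[(h * seq_len + q) * head_dim + d] * K[(kv * seq_len + s) * head_dim + d];
        m = fmaxf(m, dot * scale);
    }

    // Pass 2: softmax-weighted sum of the shared value head.
    float acc[128];
    for (int d = 0; d < head_dim; ++d) acc[d] = 0.0f;
    float denom = 0.0f;
    for (int s = 0; s < seq_len; ++s) {
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            dot += Q[(h * seq_len + q) * head_dim + d] * K[(kv * seq_len + s) * head_dim + d];
        float w = expf(dot * scale - m);
        denom += w;
        for (int d = 0; d < head_dim; ++d)
            acc[d] += w * V[(kv * seq_len + s) * head_dim + d];
    }
    for (int d = 0; d < head_dim; ++d)
        output[(h * seq_len + q) * head_dim + d] = acc[d] / denom;
}
```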
\n Implement a weight-only INT4 quantized matrix multiplication (W4A16), a core kernel used in\n modern LLM inference. Given a float16 activation matrix x of shape\n M × K and a weight matrix stored in packed INT4 format, compute the output\n matrix y = x × W^T of shape M × N, where\n W is the dequantized float16 weight matrix of shape N × K.\n
\n\n
\n Packing format: Each byte of w_q stores two INT4 weights. The\n high nibble (bits 7–4) holds weight w[n, 2i] and the low nibble (bits\n 3–0) holds w[n, 2i+1]. INT4 values are stored unsigned in the range\n [0, 15] with an offset of 8, so the signed weight is nibble − 8,\n giving values in [−8, 7].\n
\n\n
\n Dequantization: Weights are dequantized group-wise. Each contiguous block of\n group_size weights along the K dimension shares one float16 scale:\n
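A minimal CUDA sketch with one thread per output element. Because the group-wise dequantization formula is not reproduced above, this assumes scales are stored row-major as scales[n, k / group_size] (shape N × K/group_size) and that the dequantized weight is scale × (nibble − 8); treat both as assumptions.

```cuda
#include <cuda_fp16.h>

// y[m, n] = sum_k x[m, k] * dequant(W[n, k]); two INT4 weights per byte of w_q.
__global__ void w4a16_gemm(const half* x, const unsigned char* w_q, const half* scales,
                           half* y, int M, int N, int K, int group_size) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || n >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        unsigned char packed = w_q[n * (K / 2) + k / 2];
        int nibble = (k % 2 == 0) ? (packed >> 4) : (packed & 0x0F);   // high nibble holds w[n, 2i]
        float scale = __half2float(scales[n * (K / group_size) + k / group_size]);
        float w = (float)(nibble - 8) * scale;                         // unsigned nibble -> signed weight
        acc += __half2float(x[m * K + k]) * w;
    }
    y[m * N + n] = __float2half(acc);
}
```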
\n Given two matrices a and x, each of shape [B, L] (batch size × sequence length),\n compute the linear recurrence h of shape [B, L] defined by:\n h[b, 0] = x[b, 0] and h[b, t] = a[b, t] × h[b, t−1] + x[b, t] for t ≥ 1.\n All values are float32. This operation is the core computational primitive of\n State Space Models (SSMs) such as Mamba, S4, and H3.\n
\n\n\n\n
Implementation Requirements
\n
\n
Use only native features (external libraries are not permitted)
\n
The solve function signature must remain unchanged
\n
The result must be stored in the output tensor h
\n
\n\n
Examples
\n\n
Example 1 — exponential decay (a = 0.5, single impulse):
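A minimal CUDA sketch (parameter names assumed): the recurrence is inherently sequential in t, so one thread handles each batch row and walks it left to right. For very long sequences a parallel associative scan over the pairs (a, x) would expose more parallelism.

```cuda
// h[b, 0] = x[b, 0]; h[b, t] = a[b, t] * h[b, t-1] + x[b, t] for t >= 1.
__global__ void linear_recurrence(const float* a, const float* x, float* h, int B, int L) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= B) return;
    float prev = x[b * L];                // h[b, 0]
    h[b * L] = prev;
    for (int t = 1; t < L; ++t) {
        prev = a[b * L + t] * prev + x[b * L + t];
        h[b * L + t] = prev;
    }
}
```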
\n Implement the SwiGLU MLP block — the feedforward network used in LLaMA, Mistral, Gemma, and most\n modern large language models. Given an input matrix x of shape\n [M, d_model] and three weight matrices W_gate, W_up\n (each [d_model, d_ffn]), and W_down ([d_ffn, d_model]),\n compute:\n output = (SiLU(x × W_gate) ⊙ (x × W_up)) × W_down,\n where SiLU(z) = z × sigmoid(z) and ⊙ denotes element-wise\n multiplication. All tensors are float32.\n
\n\n\n\n
Implementation Requirements
\n
\n
Implement the solve function with the signature unchanged.
\n
Do not use external libraries beyond the framework provided.
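A minimal two-kernel CUDA sketch: the first kernel fuses the gate and up projections with the SiLU gating into an intermediate [M, d_ffn] buffer (assumed to be allocated by the host code), the second applies the down projection. Weight layouts follow the statement: W_gate and W_up are [d_model, d_ffn], W_down is [d_ffn, d_model], all row-major.

```cuda
// inter[m, f] = SiLU(x @ W_gate)[m, f] * (x @ W_up)[m, f]
__global__ void swiglu_gate_up(const float* x, const float* W_gate, const float* W_up,
                               float* inter, int M, int d_model, int d_ffn) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || f >= d_ffn) return;
    float g = 0.0f, u = 0.0f;
    for (int k = 0; k < d_model; ++k) {
        float xv = x[m * d_model + k];
        g += xv * W_gate[k * d_ffn + f];
        u += xv * W_up[k * d_ffn + f];
    }
    float silu = g / (1.0f + expf(-g));       // SiLU(g) = g * sigmoid(g)
    inter[m * d_ffn + f] = silu * u;
}

// output = inter @ W_down
__global__ void swiglu_down(const float* inter, const float* W_down, float* output,
                            int M, int d_ffn, int d_model) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || d >= d_model) return;
    float acc = 0.0f;
    for (int k = 0; k < d_ffn; ++k)
        acc += inter[m * d_ffn + k] * W_down[k * d_model + d];
    output[m * d_model + d] = acc;
}
```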
\n Implement a LoRA (Low-Rank Adaptation) linear layer forward pass. Given an input matrix\n x of shape batch × d_in, a base weight matrix W of\n shape d_out × d_in, a LoRA down-projection matrix A of shape\n rank × d_in, and a LoRA up-projection matrix B of shape\n d_out × rank, compute\n output = x × W^T + lora_scale × (x × A^T) × B^T.\n All tensors are float32.\n
\n\n\n\n
Implementation Requirements
\n
\n
Implement the solve function; do not change its signature.
\n
Do not use external libraries beyond those provided.
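A minimal fused CUDA sketch with one thread per output element; passing lora_scale as a kernel argument is an assumption since the exact signature is not shown here. The LoRA term recomputes x × A^T for every output column, which is wasteful; because rank is small, a real kernel would materialize x × A^T once in a separate pass.

```cuda
// output[b, o] = sum_k x[b, k] * W[o, k]
//              + lora_scale * sum_r ( sum_k x[b, k] * A[r, k] ) * B[o, r]
__global__ void lora_forward(const float* x, const float* W, const float* A, const float* B,
                             float* output, float lora_scale,
                             int batch, int d_in, int d_out, int rank) {
    int b = blockIdx.y * blockDim.y + threadIdx.y;
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= batch || o >= d_out) return;
    float acc = 0.0f;
    for (int k = 0; k < d_in; ++k)                    // base path: x @ W^T
        acc += x[b * d_in + k] * W[o * d_in + k];
    for (int r = 0; r < rank; ++r) {                  // LoRA path: lora_scale * (x @ A^T) @ B^T
        float xa = 0.0f;
        for (int k = 0; k < d_in; ++k)
            xa += x[b * d_in + k] * A[r * d_in + k];
        acc += lora_scale * xa * B[o * rank + r];
    }
    output[b * d_out + o] = acc;
}
```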
\n Implement the token verification step of speculative decoding. A draft model proposes $T$ tokens;\n the target model evaluates them in one forward pass and accepts or rejects each. Given $B$\n sequences, produce the verified output tokens. Probability tensors are float32;\n token tensors are int32.\n
\n\n
\n Notation for each sequence $b$, at each draft position $i = 0, \\ldots, T{-}1$:\n
\n
\n
$t_i = \\texttt{draft_tokens}[b, i]$ — the token proposed by the draft model
\n
$p_i(v) = \\texttt{draft_probs}[b, i, v]$ — draft model's probability for token $v$
\n
$q_i(v) = \\texttt{target_probs}[b, i, v]$ — target model's probability for token $v$
\n
$u_i = \\texttt{uniform_samples}[b, i]$ — pre-generated $U[0,1)$ sample for position $i$
\n
\n\n\n\n
\n For each sequence $b$, process positions $i = 0, 1, \\ldots, T{-}1$ left-to-right:\n
If $u_i < \\alpha_i$: accept $t_i$, continue to position $i{+}1$.
\n
If $u_i \\ge \\alpha_i$: reject, stop. Sample replacement from:\n $$\\text{adj}(v) = \\frac{\\max(0,\\; q_i(v) - p_i(v))}{\\sum_{v'} \\max(0,\\; q_i(v') - p_i(v'))}$$\n using inverse CDF with $r = \\texttt{uniform_samples}[b, T]$. If $\\text{adj}$ is all zeros, use uniform $1/V$.\n
\n
If all $T$ tokens accepted: sample a bonus token from $q_{T-1}$ using $\\texttt{uniform_samples}[b, T]$.
\n\n
\n Write results into output_tokens[b, :] (shape $[B, T{+}1]$): accepted/resampled tokens\n fill positions $0$ through the accepted count (inclusive), remaining positions are zero.\n
Do not change the function signature or use external libraries beyond the standard GPU frameworks.
\n
Write results into the provided output_tokens buffer (shape [B, T+1], int32).
\n
Memory layout is row-major: draft_probs[b, i, v] is at offset b*T*V + i*V + v.
\n
\n Inverse CDF sampling: given distribution $\\text{adj}$ (already normalized), find the\n smallest index $k$ where $\\sum_{v=0}^{k} \\text{adj}(v) \\ge r$, where\n $r = \\texttt{uniform_samples}[b, T]$. Clamp the result to $[0, V-1]$.\n
\n
\n If the adjusted distribution is all zeros (i.e., $q_i \\le p_i$ everywhere), fall back to\n the uniform distribution over $V$ tokens.\n
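A minimal CUDA sketch with one thread per sequence and serial vocabulary loops. It assumes uniform_samples has shape [B, T+1] in row-major order and uses the acceptance ratio defined above; a faster kernel would parallelize the vocabulary reductions across a warp or block.

```cuda
__global__ void verify_draft_tokens(const int* draft_tokens, const float* draft_probs,
                                    const float* target_probs, const float* uniform_samples,
                                    int* output_tokens, int B, int T, int V) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= B) return;
    int n_out = 0;
    bool rejected = false;
    for (int i = 0; i < T && !rejected; ++i) {
        int t = draft_tokens[b * T + i];
        const float* p = draft_probs  + (size_t)(b * T + i) * V;   // draft distribution at position i
        const float* q = target_probs + (size_t)(b * T + i) * V;   // target distribution at position i
        float alpha = (p[t] > 0.0f) ? fminf(1.0f, q[t] / p[t]) : 1.0f;
        if (uniform_samples[b * (T + 1) + i] < alpha) {
            output_tokens[b * (T + 1) + n_out++] = t;              // accept the draft token
        } else {
            rejected = true;                                        // reject and resample once
            float norm = 0.0f;
            for (int v = 0; v < V; ++v) norm += fmaxf(0.0f, q[v] - p[v]);
            float r = uniform_samples[b * (T + 1) + T];
            float cdf = 0.0f;
            int chosen = V - 1;                                     // clamp to [0, V-1]
            for (int v = 0; v < V; ++v) {
                float pv = (norm > 0.0f) ? fmaxf(0.0f, q[v] - p[v]) / norm
                                         : 1.0f / V;                // all-zero fallback: uniform
                cdf += pv;
                if (cdf >= r) { chosen = v; break; }
            }
            output_tokens[b * (T + 1) + n_out++] = chosen;
        }
    }
    if (!rejected) {                                                // all T accepted: bonus token
        const float* q_last = target_probs + (size_t)(b * T + (T - 1)) * V;
        float r = uniform_samples[b * (T + 1) + T];
        float cdf = 0.0f;
        int chosen = V - 1;
        for (int v = 0; v < V; ++v) { cdf += q_last[v]; if (cdf >= r) { chosen = v; break; } }
        output_tokens[b * (T + 1) + n_out++] = chosen;
    }
    for (int k = n_out; k < T + 1; ++k)                             // zero the remaining slots
        output_tokens[b * (T + 1) + k] = 0;
}
```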
\n Implement a causal depthwise 1D convolution over a batched sequence tensor\n x of shape (B, L, D), producing an output of the same shape.\n In a depthwise convolution, each channel d is convolved independently using its\n own kernel weight[d, :] — there is no mixing across channels.\n The convolution is causal: output position l may only depend on\n input positions 0, 1, …, l (past and present), never future positions.\n This operation is a key component of state-space models such as Mamba, where it is applied\n before the selective scan to mix local context within each feature channel.\n
\n\n\n\n
\n Formally, for each batch element b, sequence position l, and channel d:\n
\n $$y[b, l, d] = \sum_{k=0}^{K-1} \text{weight}[d, k] \cdot x[b, l - k, d]$$\n where K is the kernel length and positions l − k < 0 are treated as zero (zero-pad the left boundary).\n The tensor layout is channels-last: x[b, l, d] is stored at offset\n b × L × D + l × D + d.\n
\n\n
Implementation Requirements
\n
\n
The solve function signature must remain unchanged
\n
The result must be written into the output tensor
\n
Use only native features (external libraries are not permitted)
\n
Input positions before the start of the sequence (i.e. indices l − k < 0) must be treated as zero
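A minimal CUDA sketch with one thread per output element (b, l, d). The tap ordering assumes weight[d, 0] multiplies the current position and weight[d, k] multiplies the input k steps in the past, matching the l − k indexing used above; K denotes the kernel length (weight is [D, K]).

```cuda
__global__ void causal_depthwise_conv1d(const float* x, const float* weight, float* output,
                                        int B, int L, int D, int K) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= B * L * D) return;
    int d = idx % D;                           // channels-last layout: d is fastest
    int l = (idx / D) % L;
    int b = idx / (D * L);
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        int src = l - k;                       // causal: never look ahead
        if (src >= 0)                          // positions before the sequence start are zero
            acc += weight[d * K + k] * x[(b * L + src) * D + d];
    }
    output[idx] = acc;
}
```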
\n Implement decaying causal attention. Given query matrix Q, key matrix K,\n and value matrix V, each of shape seq_len × d_model, and a scalar\n decay factor gamma ∈ (0, 1], compute the unnormalized causal attention output\n where position n attends to all past positions m ≤ n with weight\n gamma^(n−m):\n
\n Unlike standard softmax attention, there is no normalization — the weights decay geometrically from\n the current position backward. This is the parallel form of the Retention mechanism (RetNet), used\n as a recurrence-friendly alternative to attention in sequence models.\n
\n\n\n\n
Implementation Requirements
\n
\n
Implement the solve function; do not change its signature.
\n
Do not use external libraries beyond those provided.
\n
Write the result into output.
\n
\n\n
Example
\n
Example 1 — with seq_len = 2, d_model = 4, gamma = 0.5:
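A minimal CUDA sketch with one thread per output element (n, d). It computes output[n] = sum over m ≤ n of gamma^(n−m) · (Q[n] · K[m]) · V[m]; following the "unnormalized" wording above it applies no softmax and no 1/sqrt(d_model) factor, so if the intended definition includes a scale, fold it into the score.

```cuda
__global__ void retention_parallel(const float* Q, const float* K, const float* V,
                                   float* output, float gamma, int seq_len, int d_model) {
    int n = blockIdx.y;                                   // current (query) position
    int d = blockIdx.x * blockDim.x + threadIdx.x;        // output feature
    if (n >= seq_len || d >= d_model) return;
    float acc = 0.0f;
    float decay = 1.0f;                                   // gamma^(n - m), starting at m = n
    for (int m = n; m >= 0; --m) {
        float score = 0.0f;
        for (int k = 0; k < d_model; ++k)
            score += Q[n * d_model + k] * K[m * d_model + k];
        acc += decay * score * V[m * d_model + d];
        decay *= gamma;                                   // one more step into the past
    }
    output[n * d_model + d] = acc;
}
```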
\n Implement the forward pass of a State Space Model (SSM) selective scan, the core operation in\n Mamba-style sequence models. Given an input sequence u, time-step parameters\n delta, state-transition matrix A, input projection B,\n output projection C, and skip-connection weights skip, compute the\n output sequence y in float32.\n
\n\n\n\n
Implementation Requirements
\n
\n Implement the function solve(u, delta, A, B, C, skip, y, batch, seq_len, d_model, d_state)\n with the signature unchanged. Do not use external libraries beyond the allowed framework.\n Write the result into the pre-allocated output tensor y.\n
\n
\n For each batch b, position t, and channel d, the computation is:\n
\n The initial hidden state $h_{b,-1,d,n} = 0$ for all $b, d, n$.\n All channels d are independent: they share the same B and C\n projections but have separate state-transition rows in A.\n
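A minimal CUDA sketch with one thread per (batch, channel) pair, scanning sequentially over t and holding the per-channel state vector in local memory (assumes d_state ≤ 64). The update below uses the common Mamba discretization, h = exp(delta·A)·h + delta·B·u and y = C·h + skip·u; since the statement's exact update equations are not reproduced here, treat the discretization and the tensor layouts as assumptions.

```cuda
__global__ void selective_scan(const float* u, const float* delta, const float* A,
                               const float* B, const float* C, const float* skip,
                               float* y, int batch, int seq_len, int d_model, int d_state) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= batch * d_model) return;
    int b = idx / d_model, d = idx % d_model;
    float h[64];
    for (int n = 0; n < d_state; ++n) h[n] = 0.0f;        // h[b, -1, d, :] = 0
    for (int t = 0; t < seq_len; ++t) {
        float dt = delta[(b * seq_len + t) * d_model + d];
        float ut = u[(b * seq_len + t) * d_model + d];
        float yt = 0.0f;
        for (int n = 0; n < d_state; ++n) {
            // Assumed discretization: h = exp(dt * A[d, n]) * h + dt * B[b, t, n] * u[b, t, d]
            h[n] = expf(dt * A[d * d_state + n]) * h[n]
                 + dt * B[(b * seq_len + t) * d_state + n] * ut;
            yt  += C[(b * seq_len + t) * d_state + n] * h[n];
        }
        y[(b * seq_len + t) * d_model + d] = yt + skip[d] * ut;
    }
}
```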
\nImplement decode-phase multi-head attention where the key and value caches are stored as\nint8 with per-token scale factors. This memory layout halves KV-cache bandwidth\nversus float32 and is used in production LLM serving systems such as TensorRT-LLM\nand vLLM. Given a query tensor Q for a single new token, int8 key cache\nK_int8, int8 value cache V_int8, and per-token scales\nk_scale and v_scale, dequantize the caches and compute scaled\ndot-product attention to produce output. All non-integer tensors use\nfloat32.\n
\n\n
Implementation Requirements
\n
\n
Implement the function solve(Q, K_int8, V_int8, k_scale, v_scale, output, num_heads, seq_len, head_dim).
\n
Do not change the function signature or use external libraries beyond the standard GPU frameworks.
\n
Write the result into the provided output buffer.
\n
Dequantize using per-token scales: K_float[h, s, d] = K_int8[h, s, d] × k_scale[h, s] (and analogously for V).
\n
Use scaled dot-product attention with scale factor 1 / sqrt(head_dim) and a softmax over the sequence dimension.
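A minimal CUDA sketch of the decode step with one thread per head. The layouts Q[h, d], K_int8/V_int8[h, s, d], k_scale/v_scale[h, s], output[h, d] (all row-major) and the bound head_dim ≤ 128 for the local accumulator are assumptions; keys and values are dequantized on the fly with their per-token scales.

```cuda
__global__ void int8_kv_decode_attention(const float* Q, const signed char* K_int8,
                                         const signed char* V_int8, const float* k_scale,
                                         const float* v_scale, float* output,
                                         int num_heads, int seq_len, int head_dim) {
    int h = blockIdx.x * blockDim.x + threadIdx.x;
    if (h >= num_heads) return;
    float scale = rsqrtf((float)head_dim);

    // Pass 1: maximum score for a numerically stable softmax.
    float m = -INFINITY;
    for (int s = 0; s < seq_len; ++s) {
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            dot += Q[h * head_dim + d]
                 * (float)K_int8[(h * seq_len + s) * head_dim + d] * k_scale[h * seq_len + s];
        m = fmaxf(m, dot * scale);
    }

    // Pass 2: softmax-weighted sum over the dequantized value cache.
    float acc[128];
    for (int d = 0; d < head_dim; ++d) acc[d] = 0.0f;
    float denom = 0.0f;
    for (int s = 0; s < seq_len; ++s) {
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            dot += Q[h * head_dim + d]
                 * (float)K_int8[(h * seq_len + s) * head_dim + d] * k_scale[h * seq_len + s];
        float w = expf(dot * scale - m);
        denom += w;
        for (int d = 0; d < head_dim; ++d)
            acc[d] += w * (float)V_int8[(h * seq_len + s) * head_dim + d] * v_scale[h * seq_len + s];
    }
    for (int d = 0; d < head_dim; ++d)
        output[h * head_dim + d] = acc[d] / denom;
}
```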