
Conversation

@srkreddy1238
Contributor

@srkreddy1238 srkreddy1238 commented Nov 28, 2025

Introduces the following features on top of texture annotation:

  • Lowering, codegen, and runtime for texture.
  • image2d_array_t support: the added depth dimension allows more allocations to use texture instead of falling back to buffer when texture limits are exceeded.
  • A comprehensive set of schedules for Adreno textures.
  • Texture packing of arbitrary types up to 128 bits (FP16-NCHW8c, INT8-NCHW16c, etc.).
  • A clBufferDescriptor debug dump controlled by CMake options.
  • A pipeline definition for the Adreno target.

While covering these features, the interfaces and passes below were enhanced and need review.

  • alloc_tensor: VDevice information is passed across these APIs. For texture allocation, alloc_storage allocates buffer/image objects as requested, and alloc_tensor then creates a view in any scope. This ensures optimal utilization of the backing memory across different image objects or scopes.
  • Constants saving: handled by adding a memory scope section to the executable. This introduces a new header magic to retain backward compatibility.
  • Static memory planning: mostly a port of the Relay static memory planner, with a mixed-mode allocator.
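
The two-step allocation described above (alloc_storage creating the backing object, alloc_tensor creating scoped views over it) can be sketched in plain Python. The classes and function names below are illustrative only, not TVM's actual Relax builtins:

```python
# Illustrative sketch of the two-step allocation: one storage object is
# allocated, and tensors of any memory scope are created as views over it.
# Names and structure are hypothetical, not TVM's real alloc_storage /
# alloc_tensor APIs.

class Storage:
    """Backing allocation: a buffer or image object of a fixed byte size."""
    def __init__(self, size_bytes, kind):
        self.size_bytes = size_bytes
        self.kind = kind  # e.g. "buffer", "image2d", "image2d_array"

class TensorView:
    """A tensor that aliases a region of an existing Storage."""
    def __init__(self, storage, offset, shape, dtype_bytes, scope):
        nbytes = dtype_bytes
        for d in shape:
            nbytes *= d
        assert offset + nbytes <= storage.size_bytes, "view exceeds storage"
        self.storage, self.offset, self.shape, self.scope = storage, offset, shape, scope

def alloc_storage(size_bytes, kind):
    return Storage(size_bytes, kind)

def alloc_tensor(storage, offset, shape, dtype_bytes, scope):
    return TensorView(storage, offset, shape, dtype_bytes, scope)

# One image-backed storage can serve views in different scopes, which is the
# "optimal utilization of backing memory" point made above.
st = alloc_storage(4 * 64 * 64 * 4, "image2d")
a = alloc_tensor(st, 0, (1, 16, 8, 8, 4), 4, "global.texture")
b = alloc_tensor(st, 0, (1, 4096), 4, "global")
```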

@gemini-code-assist
Contributor

Summary of Changes

Hello @srkreddy1238, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant overhaul to TVM's Adreno backend, primarily by enabling and optimizing texture-based lowering. The changes span across runtime, Relax, and TIR components to ensure that GPU texture memory can be effectively utilized for various operations like convolutions, pooling, and layout transformations. This aims to improve memory efficiency and performance on Adreno devices by providing dedicated schedules and memory management for texture objects, alongside robust mechanisms for propagating memory scope information throughout the compilation pipeline.

Highlights

  • Texture Annotation and Lowering: Introduces comprehensive support for texture annotation, lowering, codegen, and runtime specifically for Adreno GPUs. This enables more efficient memory utilization by leveraging texture memory instead of falling back to buffers when limits are exceeded.
  • image2d_array_t Support: Adds support for image2d_array_t which includes a depth dimension, allowing for more flexible and larger texture allocations, particularly beneficial for NCHW layouts.
  • Adreno Texture Schedules: A comprehensive set of DLight schedules for Adreno textures has been added, including specialized rules for Conv2d, LayoutTransform, Pool2D, and a Fallback mechanism for general operations.
  • Texture Packing: Enables texture packing of arbitrary data types up to 128 bits, supporting formats like FP16-NCHW8c and INT8-NCHW16c, which are crucial for optimizing performance on Adreno GPUs.
  • Memory Scope Propagation: Enhances runtime.Tensor with SetScope and GetScope methods, and updates SaveDLTensor/Load to preserve memory scope information. This ensures that memory allocation decisions, especially for textures, are correctly propagated through the Relax and TIR pipelines.
  • Static Memory Planning Integration: The static memory planner has been updated to account for texture memory scopes and sizes, porting concepts from Relay's static memory planner with a mixed-mode allocator to better manage device-specific memory.
  • New TIR Passes: Introduces InjectTextureAlloc and TextureFlatten TIR passes. InjectTextureAlloc inserts texture allocation intrinsics, while TextureFlatten transforms multi-dimensional buffer accesses into 2D (width, height, depth) texture accesses based on storage scope.
  • OpenCL Codegen and Runtime Updates: Updates the OpenCL codegen to correctly handle image2d_array_t types and texture2d_load/store intrinsics, using int4 for coordinates and managing channel sizes. The OpenCL runtime now supports allocating image2d_array_t with depth and calculates texture memory sizes based on device attributes.
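
The packing arithmetic behind these highlights can be illustrated with a small sketch: the innermost vector dimension of a packed layout determines the texel channel width, and the remaining axes split into the texture's height and width. This is a simplified model; the real axis grouping (runtime::DefaultTextureLayoutSeparator) may differ, so treat the split_at parameter and the function itself as hypothetical:

```python
# Simplified model of mapping a packed NCHW{c} tensor onto a 2D texture.
# The real axis split used by TVM may group axes differently; this only
# illustrates the 64/128-bit channel-width arithmetic.
from math import prod

def texture_shape(logical_shape, dtype_bits, split_at):
    """Split outer axes into (height, width); the innermost axis packs a texel."""
    *outer, vec = logical_shape
    channel_bits = vec * dtype_bits  # bits per texel
    assert channel_bits in (64, 128), "texel must pack to 64 or 128 bits"
    height = prod(outer[:split_at])
    width = prod(outer[split_at:])
    return height, width, channel_bits

# FP16 NCHW8c: 8 lanes x 16 bits = a 128-bit texel
print(texture_shape((1, 4, 32, 32, 8), 16, 2))   # (4, 1024, 128)
# INT8 NCHW16c: 16 lanes x 8 bits = a 128-bit texel
print(texture_shape((1, 2, 16, 16, 16), 8, 2))   # (2, 256, 128)
```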

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant features for Adreno GPU texture-based lowering, including new runtime and compilation passes, scheduling rules, and analysis capabilities. The changes are extensive and well-structured, particularly the refactoring of scheduling rules and the introduction of more robust analysis for memory scopes and buffer information. However, there are several areas that need attention. I've identified some potential runtime errors due to unsafe assumptions about symbolic shapes and struct info, which should be addressed. Additionally, there are instances of dead code, typos in public headers, and use of bare excepts that should be cleaned up to improve code quality and maintainability.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@srkreddy1238 srkreddy1238 force-pushed the texture-lower-ffi branch 2 times, most recently from 5f4e0a6 to 9283ef2 Compare December 3, 2025 18:36
@srkreddy1238 srkreddy1238 force-pushed the texture-lower-ffi branch 2 times, most recently from 7a9922b to 2814d38 Compare December 5, 2025 14:49
@srkreddy1238
Contributor Author

@tqchen managed without altering runtime::Tensor.

VM byte code now adds MemoryScope section.

To keep Load backward compatible, I added a new header magic, kTVMVMBytecodeMagicV2. Let me know if this needs to be handled via VM_VERSION instead.
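
The magic-based dispatch described here can be sketched as follows. The magic values and section layout are placeholders for illustration, not the real kTVMVMBytecodeMagic / kTVMVMBytecodeMagicV2 constants or the actual VM bytecode format:

```python
# Sketch of backward-compatible loading keyed on a header magic. The
# constants, byte layout, and separator here are hypothetical.
import struct

MAGIC_V1 = 0xA1B2C3D4
MAGIC_V2 = 0xA1B2C3D5  # V2 appends a MemoryScope section

def save(constants, scopes=None):
    if scopes is None:
        return struct.pack("<Q", MAGIC_V1) + constants
    return struct.pack("<Q", MAGIC_V2) + constants + b"|SCOPES|" + scopes

def load(blob):
    (magic,) = struct.unpack_from("<Q", blob, 0)
    body = blob[8:]
    if magic == MAGIC_V1:   # old executables: no scope section
        return body, None
    if magic == MAGIC_V2:   # new executables: split out the scope section
        constants, scopes = body.split(b"|SCOPES|", 1)
        return constants, scopes
    raise ValueError("not a recognized bytecode blob")

# Old blobs still load; new blobs carry the extra section.
assert load(save(b"weights")) == (b"weights", None)
assert load(save(b"weights", b"global.texture")) == (b"weights", b"global.texture")
```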

@elvin-n
Contributor

elvin-n commented Dec 10, 2025

Q not related directly to this PR, sorry. Can TVM and Adreno flow be compiled/executed on Snapdragon X1? Windows/Linux?

@srkreddy1238
Contributor Author

Q not related directly to this PR, sorry. Can TVM and Adreno flow be compiled/executed on Snapdragon X1? Windows/Linux?

Yes. It works on Snapdragon laptops (X Elite) with both arm64 (needs the Qualcomm Arm64 OpenCL SDK) and x64 (generic OpenCL works without CLML enabled) builds.

Sample config

           # x64 build (generic OpenCL)
           cp ../cmake/config.cmake .
           Add-Content config.cmake "set(USE_OPENCL $ENV:OPENCL_SDK_ADRENO_X86)"
           Add-Content config.cmake "set(USE_LLVM $ENV:LLVM_CONFIG)"
           Add-Content config.cmake "set(USE_CLML $ENV:OPENCL_SDK_ADRENO_X86)"
           Add-Content config.cmake "set(USE_RPC ON)"
           Add-Content config.cmake "set(USE_CPP_RPC ON)"
           Add-Content config.cmake "set(USE_KALLOC_ALIGNMENT 32)"
           Add-Content config.cmake "set(USE_OPENCL_EXTN_QCOM ON)"
           cmake .. -G "Visual Studio 17 2022" -A x64
           cmake --build . --config Release --parallel $env:NUMBER_OF_PROCESSORS

           # arm64 build (Qualcomm Arm64 OpenCL SDK)
           cd build-arm64
           cp ../cmake/config.cmake .
           Add-Content config.cmake "set(USE_OPENCL $ENV:OPENCL_SDK_ADRENO_ARM64)"
           Add-Content config.cmake "set(USE_CLML $ENV:OPENCL_SDK_ADRENO_ARM64)"
           Add-Content config.cmake "set(USE_RPC ON)"
           Add-Content config.cmake "set(USE_CPP_RPC ON)"
           Add-Content config.cmake "set(USE_KALLOC_ALIGNMENT 32)"
           Add-Content config.cmake "set(USE_OPENCL_EXTN_QCOM ON)"
           cmake .. -G "Visual Studio 17 2022" -A ARM64
           cmake --build . --config Release --parallel $env:NUMBER_OF_PROCESSORS --target tvm_rpc

mlc-llm works too.

launch_param_config_.Init(arg_size.size(), launch_param_tags);
}

#ifdef PROFILE_SHADER_DUMP
Member


This is understandable; however, it would be useful to think about whether we can move this logic outside of the core codebase, e.g. whether such tracing can be done via the driver API, or whether we can have an FFI function hook that the user registers and runs.

using runtime::DefaultTextureLayoutSeparator;
using runtime::IsTextureStorage;

class TextureLoweringBase : public StmtExprMutator {
Member


Given that this and the other transformations are texture-related and not needed in the non-texture case, it might be useful to move them to src/tir/backend/adreno. Is it possible to have an Adreno-specific pipeline for this, like the Relax one? That would give clarity and also freedom of evolution for future changes.

@tqchen
Member

tqchen commented Jan 8, 2026

Thanks @srkreddy1238, I think we are close to merging; I just did another quick round of review.

@tqchen
Member

tqchen commented Jan 8, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces texture-based lowering for Adreno GPUs, enhancing support for image2d_array_t, comprehensive scheduling, and texture packing. It also refactors convolution scheduling, adds fallback and layout transform schedules, and integrates texture allocation into the TIR pipeline. The VM executable format is updated to support memory scopes for constants, ensuring proper device allocation. Overall, the changes significantly improve Adreno GPU backend capabilities for texture operations.

I am having trouble creating individual review comments; my feedback is listed below instead.

tests/python/relax/texture/adreno_utils.py (193)

critical

In the verify function, when comparing tuple outputs, the line tvm.testing.assert_allclose(val1, ret2, rtol=1e-3, atol=1e-3) is incorrect. It compares val1 (an element from the texture-enabled run) against ret2 (the entire tuple of outputs from the non-texture run). This will always fail or produce misleading results. It should compare val1 against val2.

        if isinstance(ret1, tuple):
            for val1, val2 in zip(ret1, ret2):
                tvm.testing.assert_allclose(val1, val2, rtol=1e-3, atol=1e-3)

src/relax/backend/adreno/annotate_custom_storage.cc (284-303)

high

The CollectConsumerScopeInfo class has a TODO comment regarding handling ops that generate tuples (like split). This suggests that the current logic for propagating scope information through tuples might be incomplete or not fully robust for all scenarios. Incomplete handling of tuple-generating ops could lead to incorrect memory scope assignments. Please ensure that tuple-generating operations are correctly handled, or provide a plan for addressing this TODO.

python/tvm/dlight/adreno/fallback.py (78-80)

high

The try...except Exception blocks in schedule_inline_blocks are too broad. This can mask underlying issues and make debugging challenging. Please consider catching more specific exceptions or adding logging to capture the exception details.

                    try:
                        # Would cause an issue inlining to producer with multiple consumers
                        while (
                            len(sch.get_producers(blk)) == 1
                            and len(sch.get_consumers(sch.get_producers(blk)[0])) == 1
                        ):
                            sch.reverse_compute_inline(blk)
                            inlined_once = True
                    except tir.TVMError as e: # Catch specific TVM errors
                        # Log the error or handle it gracefully
                        # print(f"Warning: Could not reverse inline block {block_info.name}: {e}")
                        break

python/tvm/dlight/adreno/fallback.py (67-68)

high

The try...except Exception blocks in schedule_inline_blocks are too broad. This can mask underlying issues and make debugging challenging. Please consider catching more specific exceptions or adding logging to capture the exception details.

                try:
                    sch.compute_inline(blk)
                except tir.TVMError as e: # Catch specific TVM errors
                    # Log the error or handle it gracefully
                    # print(f"Warning: Could not inline block {block_info.name}: {e}")
                    remaining_blocks.append(blk)

tests/python/relax/texture/adreno_utils.py (120)

high

In the build_run function, the relax_pipeline is applied twice: once explicitly with mod = relax_pipeline(mod) and then again implicitly when tvm.compile is called with relax_pipeline=relax_pipeline. This redundant application of the pipeline could lead to unnecessary processing or unexpected behavior. Please remove the explicit mod = relax_pipeline(mod) call.

    relax_pipeline = relax.pipeline.get_default_pipeline(tgt)
    tir_pipeline = tvm.tir.get_default_tir_pipeline(tgt)
    # mod = relax_pipeline(mod) # Remove this line
    ex = tvm.compile(mod, tgt, relax_pipeline=relax_pipeline, tir_pipeline=tir_pipeline)

python/tvm/dlight/adreno/utils.py (70-73)

high

There appears to be a circular import dependency between utils.py and fallback.py. utils.py imports Fallback from fallback.py, and fallback.py likely imports functions from utils.py. This can lead to runtime issues or unexpected behavior. Consider refactoring to break this cycle, perhaps by moving shared utilities to a common, independent module.

python/tvm/dlight/adreno/utils.py (76-79)

high

There appears to be a circular import dependency between utils.py and fallback.py. utils.py imports Fallback from fallback.py, and fallback.py likely imports functions from utils.py. This can lead to runtime issues or unexpected behavior. Consider refactoring to break this cycle, perhaps by moving shared utilities to a common, independent module.

python/tvm/dlight/adreno/utils.py (86-89)

high

There appears to be a circular import dependency between utils.py and fallback.py. utils.py imports Fallback from fallback.py, and fallback.py likely imports functions from utils.py. This can lead to runtime issues or unexpected behavior. Consider refactoring to break this cycle, perhaps by moving shared utilities to a common, independent module.

src/relax/backend/adreno/annotate_custom_storage.cc (400-407)

medium

The SupportsTexture function for Conv2DAttrs returns {true, false}, indicating that texture is used for data but not for weights. While this might be a deliberate design choice or a current limitation, it's worth clarifying the rationale. If weights could also benefit from texture memory, this could represent a missed optimization opportunity.

python/tvm/dlight/analysis/common_analysis.py (259-265)

medium

The is_convolution and is_pool methods in BlockInfo currently raise NotImplementedError. If these methods are intended to be used, they should be implemented. If they are not meant to be part of the generic BlockInfo or are not yet ready, consider making BlockInfo an abstract base class with these as abstract methods, or remove them if they are not intended for this class.

python/tvm/target/target.py (841-864)

medium

The adreno target function now accepts a cfg parameter, which is directly appended to the target keys. There is no validation or clear documentation on the expected values for cfg or its intended impact on the target. This could lead to confusion or unexpected behavior if arbitrary strings are passed. Consider adding validation or more explicit documentation for this parameter.

python/tvm/dlight/adreno/pool.py (56)

medium

The schedule_max_pool method uses a strict check iters_kind != "SSSSSRR". This exact match might be too restrictive and could prevent valid pooling patterns from being scheduled if their iteration kinds vary slightly (e.g., due to different input shapes or optimizations). Consider if a more flexible pattern matching or a different heuristic would be more robust.

src/tir/transforms/inject_texture_alloc.cc (66-71)

medium

In VisitStmt_(const AllocateNode* op), the ICHECK(op->extents.size() >= 3) and ICHECK(channel_size == 128 || channel_size == 64) checks are very specific. The extents.size() >= 3 implies a minimum dimensionality for texture allocation, and the channel_size limits it to 64 or 128 bits. These constraints might be too rigid and could prevent other valid texture configurations.

src/tir/transforms/texture_flatten.cc (108-113)

medium

In VisitStmt_(const BufferRealizeNode* op), similar to InjectTextureAlloc, the ICHECK(op->bounds.size() >= 3) and ICHECK(channel_size == 128 || channel_size == 64) checks are very specific. These constraints might be too rigid and could prevent other valid texture configurations.

src/tir/transforms/texture_flatten.cc (56-58)

medium

The SimplifyOffset function in TextureLoweringBase assumes expr.extent == 1 for all regions. This might not always be true for complex buffer regions, potentially leading to incorrect offset calculations or runtime errors.
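
Under the extent == 1 assumption flagged here, each region degenerates to a single element, so the offset is just the row-major linearization of the region's min coordinates. A small sketch of that assumption (hypothetical helper, not the actual SimplifyOffset code):

```python
# Row-major offset from a buffer region when every range has extent 1,
# i.e. the region selects a single element. Illustrates the assumption
# discussed above; not the real SimplifyOffset implementation.
def point_offset(shape, region):
    """region is a list of (min, extent) pairs, one per dimension."""
    offset = 0
    for dim, (lo, extent) in zip(shape, region):
        assert extent == 1, "this logic only handles point regions"
        offset = offset * dim + lo
    return offset

# Element (2, 3) of a 4x8 buffer sits at flat index 2*8 + 3 = 19.
print(point_offset([4, 8], [(2, 1), (3, 1)]))  # 19
```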

src/tir/transforms/vectorize_loop.cc (524-530)

medium

In Vectorizer::VisitExpr_ for builtin::texture2d_load(), the ICHECK condition lane * dtype.bits() <= op->args[4].as<IntImmNode>()->value implies that the vectorized lane count multiplied by the data bits must be less than or equal to the channel size. This is a reasonable check, but it might be more robust to ensure exact equality or provide clear error messages if the vectorized load is not perfectly aligned with the channel size.

src/tir/transforms/vectorize_loop.cc (541-547)

medium

In Vectorizer::VisitExpr_ for builtin::texture2d_store(), the ICHECK condition lane * dtype.bits() == op->args[4].as<IntImmNode>()->value enforces exact equality between the vectorized lane count multiplied by data bits and the channel size. This strict equality might limit flexibility for certain write patterns where partial writes or writes with different lane counts could be valid.
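
The asymmetry between the two checks described above can be summarized in a few lines: a vectorized texture load may cover at most one texel (lanes * dtype_bits <= channel_bits), while a store must fill the texel exactly (lanes * dtype_bits == channel_bits). This is an illustrative restatement of the ICHECK conditions, not the vectorizer code itself:

```python
# Illustrative restatement of the vectorizer legality checks for texture
# access: loads may under-read a texel, stores must match it exactly.
def can_vectorize_load(lanes, dtype_bits, channel_bits):
    return lanes * dtype_bits <= channel_bits

def can_vectorize_store(lanes, dtype_bits, channel_bits):
    return lanes * dtype_bits == channel_bits

assert can_vectorize_load(4, 16, 128)       # half4 read from a 128-bit texel
assert not can_vectorize_store(4, 16, 128)  # half4 write under-fills the texel
assert can_vectorize_store(8, 16, 128)      # half8 write fills it exactly
```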

python/tvm/dlight/adreno/convolution.py (150-229)

medium

The removal of the Config dataclass and get_configs method hardcodes scheduling parameters directly into schedule_conv2d. While this simplifies the immediate code, it reduces flexibility for future tuning or adaptation to different Adreno models or convolution variants. Consider if these parameters should remain configurable, perhaps through a target attribute or a more explicit configuration mechanism.

python/tvm/dlight/adreno/fallback.py (94)

medium

In schedule_default, accessing block_info.write_bufs(sch)[0].assoc_lps[-1] assumes that write_bufs will always return a non-empty list and assoc_lps will always have a last element. This could lead to an IndexError. Please add checks to ensure these lists are not empty before accessing elements.

        s_loops, r_loops, o_loops = [], [], []
        
        write_buffers = block_info.write_bufs(sch)
        if not write_buffers or not write_buffers[0].assoc_lps:
            # Handle case where no write buffers or associated loops are found
            return

        v_loop = write_buffers[0].assoc_lps[-1]
        if v_loop is None:
            return

tests/python/relax/texture/test_texture_nd.py (61)

medium

In preprocess_pipeline, tvm.tir.transform.BindTarget(Target.current(allow_none=False)) is applied twice. This is redundant and could potentially lead to unexpected behavior if the target state changes between the two calls. Please remove the duplicate call.

            tvm.transform.Sequential(
                [
                    # tvm.tir.transform.BindTarget(Target.current(allow_none=False)), # Remove this line
                    tvm.relax.transform.FoldConstant(),
                    tvm.relax.transform.DecomposeOpsForInference(),
                    tvm.relax.transform.FoldConstant(),
                    tvm.tir.transform.BindTarget(tvm.target.Target.current(allow_none=False)),
                    tvm.relax.transform.ConvertLayout(desired_layouts),
