
Conversation

@srkreddy1238
Contributor

@srkreddy1238 srkreddy1238 commented Nov 28, 2025

Introduces the following features on top of texture annotation:

  • Lowering, codegen, and runtime for texture.
  • image2d_array_t support: the added depth dimension allows more allocations to use texture instead of falling back to buffer when texture limits are exceeded.
  • A comprehensive set of schedules for Adreno textures.
  • Texture packing of arbitrary types up to 128 bits (FP16-NCHW8c, INT8-NCHW16c, etc.).
  • A clBufferDescriptor debug dump controlled by CMake options.
  • A pipeline definition for the Adreno target.

While covering these features, the interfaces and passes below were enhanced and need review.

  • alloc_tensor: VDevice information is passed across these APIs. For texture allocation, alloc_storage allocates buffer/image objects as requested, and alloc_tensor then creates a view in any scope. This ensures optimal utilization of the backing memory across different image objects or scopes.
  • Constants saving: handled by adding a memory scope section to the executable. This introduces a new header magic to retain backward compatibility.
  • Static memory planning: mostly a port of the Relay static memory planner, with a mixed-mode allocator.
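
The two-step allocation described above (alloc_storage creating the backing object, alloc_tensor creating scoped views over it) can be sketched in plain Python. The classes and function names below are illustrative only, not TVM's actual Relax builtins:

```python
# Illustrative sketch of the two-step allocation: one storage object is
# allocated, and tensors of any memory scope are created as views over it.
# Names and structure are hypothetical, not TVM's real alloc_storage /
# alloc_tensor APIs.

class Storage:
    """Backing allocation: a buffer or image object of a fixed byte size."""
    def __init__(self, size_bytes, kind):
        self.size_bytes = size_bytes
        self.kind = kind  # e.g. "buffer", "image2d", "image2d_array"

class TensorView:
    """A tensor that aliases a region of an existing Storage."""
    def __init__(self, storage, offset, shape, dtype_bytes, scope):
        nbytes = dtype_bytes
        for d in shape:
            nbytes *= d
        assert offset + nbytes <= storage.size_bytes, "view exceeds storage"
        self.storage, self.offset, self.shape, self.scope = storage, offset, shape, scope

def alloc_storage(size_bytes, kind):
    return Storage(size_bytes, kind)

def alloc_tensor(storage, offset, shape, dtype_bytes, scope):
    return TensorView(storage, offset, shape, dtype_bytes, scope)

# One image-backed storage can serve views in different scopes, which is the
# "optimal utilization of backing memory" point made above.
st = alloc_storage(4 * 64 * 64 * 4, "image2d")
a = alloc_tensor(st, 0, (1, 16, 8, 8, 4), 4, "global.texture")
b = alloc_tensor(st, 0, (1, 4096), 4, "global")
```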

@gemini-code-assist
Contributor

Summary of Changes

Hello @srkreddy1238, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant overhaul to TVM's Adreno backend, primarily by enabling and optimizing texture-based lowering. The changes span across runtime, Relax, and TIR components to ensure that GPU texture memory can be effectively utilized for various operations like convolutions, pooling, and layout transformations. This aims to improve memory efficiency and performance on Adreno devices by providing dedicated schedules and memory management for texture objects, alongside robust mechanisms for propagating memory scope information throughout the compilation pipeline.

Highlights

  • Texture Annotation and Lowering: Introduces comprehensive support for texture annotation, lowering, codegen, and runtime specifically for Adreno GPUs. This enables more efficient memory utilization by leveraging texture memory instead of falling back to buffers when limits are exceeded.
  • image2d_array_t Support: Adds support for image2d_array_t which includes a depth dimension, allowing for more flexible and larger texture allocations, particularly beneficial for NCHW layouts.
  • Adreno Texture Schedules: A comprehensive set of DLight schedules for Adreno textures has been added, including specialized rules for Conv2d, LayoutTransform, Pool2D, and a Fallback mechanism for general operations.
  • Texture Packing: Enables texture packing of arbitrary data types up to 128 bits, supporting formats like FP16-NCHW8c and INT8-NCHW16c, which are crucial for optimizing performance on Adreno GPUs.
  • Memory Scope Propagation: Enhances runtime.Tensor with SetScope and GetScope methods, and updates SaveDLTensor/Load to preserve memory scope information. This ensures that memory allocation decisions, especially for textures, are correctly propagated through the Relax and TIR pipelines.
  • Static Memory Planning Integration: The static memory planner has been updated to account for texture memory scopes and sizes, porting concepts from Relay's static memory planner with a mixed-mode allocator to better manage device-specific memory.
  • New TIR Passes: Introduces InjectTextureAlloc and TextureFlatten TIR passes. InjectTextureAlloc inserts texture allocation intrinsics, while TextureFlatten transforms multi-dimensional buffer accesses into 2D (width, height, depth) texture accesses based on storage scope.
  • OpenCL Codegen and Runtime Updates: Updates the OpenCL codegen to correctly handle image2d_array_t types and texture2d_load/store intrinsics, using int4 for coordinates and managing channel sizes. The OpenCL runtime now supports allocating image2d_array_t with depth and calculates texture memory sizes based on device attributes.
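
The packing arithmetic behind these highlights can be illustrated with a small sketch: the innermost vector dimension of a packed layout determines the texel channel width, and the remaining axes split into the texture's height and width. This is a simplified model; the real axis grouping (runtime::DefaultTextureLayoutSeparator) may differ, so treat the split_at parameter and the function itself as hypothetical:

```python
# Simplified model of mapping a packed NCHW{c} tensor onto a 2D texture.
# The real axis split used by TVM may group axes differently; this only
# illustrates the 64/128-bit channel-width arithmetic.
from math import prod

def texture_shape(logical_shape, dtype_bits, split_at):
    """Split outer axes into (height, width); the innermost axis packs a texel."""
    *outer, vec = logical_shape
    channel_bits = vec * dtype_bits  # bits per texel
    assert channel_bits in (64, 128), "texel must pack to 64 or 128 bits"
    height = prod(outer[:split_at])
    width = prod(outer[split_at:])
    return height, width, channel_bits

# FP16 NCHW8c: 8 lanes x 16 bits = a 128-bit texel
print(texture_shape((1, 4, 32, 32, 8), 16, 2))   # (4, 1024, 128)
# INT8 NCHW16c: 16 lanes x 8 bits = a 128-bit texel
print(texture_shape((1, 2, 16, 16, 16), 8, 2))   # (2, 256, 128)
```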

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant features for Adreno GPU texture-based lowering, including new runtime and compilation passes, scheduling rules, and analysis capabilities. The changes are extensive and well-structured, particularly the refactoring of scheduling rules and the introduction of more robust analysis for memory scopes and buffer information. However, there are several areas that need attention. I've identified some potential runtime errors due to unsafe assumptions about symbolic shapes and struct info, which should be addressed. Additionally, there are instances of dead code, typos in public headers, and use of bare excepts that should be cleaned up to improve code quality and maintainability.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@srkreddy1238 srkreddy1238 force-pushed the texture-lower-ffi branch 2 times, most recently from 5f4e0a6 to 9283ef2 Compare December 3, 2025 18:36
@srkreddy1238 srkreddy1238 force-pushed the texture-lower-ffi branch 2 times, most recently from 7a9922b to 2814d38 Compare December 5, 2025 14:49
@srkreddy1238
Contributor Author

@tqchen managed without altering runtime::Tensor.

VM byte code now adds MemoryScope section.

To keep Load backward compatible, I added a new header magic, kTVMVMBytecodeMagicV2. Let me know if this needs to be handled via VM_VERSION instead.
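
The magic-based dispatch described here can be sketched as follows. The magic values and section layout are placeholders for illustration, not the real kTVMVMBytecodeMagic / kTVMVMBytecodeMagicV2 constants or the actual VM bytecode format:

```python
# Sketch of backward-compatible loading keyed on a header magic. The
# constants, byte layout, and separator here are hypothetical.
import struct

MAGIC_V1 = 0xA1B2C3D4
MAGIC_V2 = 0xA1B2C3D5  # V2 appends a MemoryScope section

def save(constants, scopes=None):
    if scopes is None:
        return struct.pack("<Q", MAGIC_V1) + constants
    return struct.pack("<Q", MAGIC_V2) + constants + b"|SCOPES|" + scopes

def load(blob):
    (magic,) = struct.unpack_from("<Q", blob, 0)
    body = blob[8:]
    if magic == MAGIC_V1:   # old executables: no scope section
        return body, None
    if magic == MAGIC_V2:   # new executables: split out the scope section
        constants, scopes = body.split(b"|SCOPES|", 1)
        return constants, scopes
    raise ValueError("not a recognized bytecode blob")

# Old blobs still load; new blobs carry the extra section.
assert load(save(b"weights")) == (b"weights", None)
assert load(save(b"weights", b"global.texture")) == (b"weights", b"global.texture")
```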

@elvin-n
Contributor

elvin-n commented Dec 10, 2025

Q not related directly to this PR, sorry. Can TVM and Adreno flow be compiled/executed on Snapdragon X1? Windows/Linux?

@srkreddy1238
Contributor Author

Q not related directly to this PR, sorry. Can TVM and Adreno flow be compiled/executed on Snapdragon X1? Windows/Linux?

Yes. It works on Snapdragon laptops (X Elite) with both arm64 (needs the Qualcomm Arm64 OpenCL SDK) and x64 (generic OpenCL works without CLML enabled) builds.

Sample config

           # x64 build (generic OpenCL)
           cp ../cmake/config.cmake .
           Add-Content config.cmake "set(USE_OPENCL $ENV:OPENCL_SDK_ADRENO_X86)"
           Add-Content config.cmake "set(USE_LLVM $ENV:LLVM_CONFIG)"
           Add-Content config.cmake "set(USE_CLML $ENV:OPENCL_SDK_ADRENO_X86)"
           Add-Content config.cmake "set(USE_RPC ON)"
           Add-Content config.cmake "set(USE_CPP_RPC ON)"
           Add-Content config.cmake "set(USE_KALLOC_ALIGNMENT 32)"
           Add-Content config.cmake "set(USE_OPENCL_EXTN_QCOM ON)"
           cmake .. -G "Visual Studio 17 2022" -A x64
           cmake --build . --config Release --parallel $env:NUMBER_OF_PROCESSORS

           # arm64 build (Qualcomm Arm64 OpenCL SDK)
           cd build-arm64
           cp ../cmake/config.cmake .
           Add-Content config.cmake "set(USE_OPENCL $ENV:OPENCL_SDK_ADRENO_ARM64)"
           Add-Content config.cmake "set(USE_CLML $ENV:OPENCL_SDK_ADRENO_ARM64)"
           Add-Content config.cmake "set(USE_RPC ON)"
           Add-Content config.cmake "set(USE_CPP_RPC ON)"
           Add-Content config.cmake "set(USE_KALLOC_ALIGNMENT 32)"
           Add-Content config.cmake "set(USE_OPENCL_EXTN_QCOM ON)"
           cmake .. -G "Visual Studio 17 2022" -A ARM64
           cmake --build . --config Release --parallel $env:NUMBER_OF_PROCESSORS --target tvm_rpc

mlc-llm works too.

launch_param_config_.Init(arg_size.size(), launch_param_tags);
}

#ifdef PROFILE_SHADER_DUMP
Member


This is understandable; however, it would be useful to think about whether we can move this logic outside of the core codebase, e.g. whether such tracing can be done via the driver API, or whether we can have an FFI function hook that the user registers and runs.

using runtime::DefaultTextureLayoutSeparator;
using runtime::IsTextureStorage;

class TextureLoweringBase : public StmtExprMutator {
Member


Given that this and the other transformations are texture-related and not needed in the non-texture case, it might be useful to move them to src/tir/backend/adreno. Is it possible to have an Adreno-specific pipeline for this, like the Relax one? That would give clarity and also freedom of evolution for future changes.

@tqchen
Member

tqchen commented Jan 8, 2026

Thanks @srkreddy1238, I think we are close to merging; I just did another quick round of review.

@tqchen
Member

tqchen commented Jan 8, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces texture-based lowering for Adreno GPUs, enhancing support for image2d_array_t, comprehensive scheduling, and texture packing. It also refactors convolution scheduling, adds fallback and layout transform schedules, and integrates texture allocation into the TIR pipeline. The VM executable format is updated to support memory scopes for constants, ensuring proper device allocation. Overall, the changes significantly improve Adreno GPU backend capabilities for texture operations.

I am having trouble creating individual review comments; my feedback is listed below instead.

tests/python/relax/texture/adreno_utils.py (193)

critical

In the verify function, when comparing tuple outputs, the line tvm.testing.assert_allclose(val1, ret2, rtol=1e-3, atol=1e-3) is incorrect. It compares val1 (an element from the texture-enabled run) against ret2 (the entire tuple of outputs from the non-texture run). This will always fail or produce misleading results. It should compare val1 against val2.

        if isinstance(ret1, tuple):
            for val1, val2 in zip(ret1, ret2):
                tvm.testing.assert_allclose(val1, val2, rtol=1e-3, atol=1e-3)

src/relax/backend/adreno/annotate_custom_storage.cc (284-303)

high

The CollectConsumerScopeInfo class has a TODO comment regarding handling ops that generate tuples (like split). This suggests that the current logic for propagating scope information through tuples might be incomplete or not fully robust for all scenarios. Incomplete handling of tuple-generating ops could lead to incorrect memory scope assignments. Please ensure that tuple-generating operations are correctly handled, or provide a plan for addressing this TODO.

python/tvm/dlight/adreno/fallback.py (78-80)

high

The try...except Exception blocks in schedule_inline_blocks are too broad. This can mask underlying issues and make debugging challenging. Please consider catching more specific exceptions or adding logging to capture the exception details.

                    try:
                        # Would cause an issue inlining to producer with multiple consumers
                        while (
                            len(sch.get_producers(blk)) == 1
                            and len(sch.get_consumers(sch.get_producers(blk)[0])) == 1
                        ):
                            sch.reverse_compute_inline(blk)
                            inlined_once = True
                    except tir.TVMError as e: # Catch specific TVM errors
                        # Log the error or handle it gracefully
                        # print(f"Warning: Could not reverse inline block {block_info.name}: {e}")
                        break

python/tvm/dlight/adreno/fallback.py (67-68)

high

The try...except Exception blocks in schedule_inline_blocks are too broad. This can mask underlying issues and make debugging challenging. Please consider catching more specific exceptions or adding logging to capture the exception details.

                try:
                    sch.compute_inline(blk)
                except tir.TVMError as e: # Catch specific TVM errors
                    # Log the error or handle it gracefully
                    # print(f"Warning: Could not inline block {block_info.name}: {e}")
                    remaining_blocks.append(blk)

tests/python/relax/texture/adreno_utils.py (120)

high

In the build_run function, the relax_pipeline is applied twice: once explicitly with mod = relax_pipeline(mod) and then again implicitly when tvm.compile is called with relax_pipeline=relax_pipeline. This redundant application of the pipeline could lead to unnecessary processing or unexpected behavior. Please remove the explicit mod = relax_pipeline(mod) call.

    relax_pipeline = relax.pipeline.get_default_pipeline(tgt)
    tir_pipeline = tvm.tir.get_default_tir_pipeline(tgt)
    # mod = relax_pipeline(mod) # Remove this line
    ex = tvm.compile(mod, tgt, relax_pipeline=relax_pipeline, tir_pipeline=tir_pipeline)

python/tvm/dlight/adreno/utils.py (70-73)

high

There appears to be a circular import dependency between utils.py and fallback.py. utils.py imports Fallback from fallback.py, and fallback.py likely imports functions from utils.py. This can lead to runtime issues or unexpected behavior. Consider refactoring to break this cycle, perhaps by moving shared utilities to a common, independent module.

python/tvm/dlight/adreno/utils.py (76-79)

high

There appears to be a circular import dependency between utils.py and fallback.py. utils.py imports Fallback from fallback.py, and fallback.py likely imports functions from utils.py. This can lead to runtime issues or unexpected behavior. Consider refactoring to break this cycle, perhaps by moving shared utilities to a common, independent module.

python/tvm/dlight/adreno/utils.py (86-89)

high

There appears to be a circular import dependency between utils.py and fallback.py. utils.py imports Fallback from fallback.py, and fallback.py likely imports functions from utils.py. This can lead to runtime issues or unexpected behavior. Consider refactoring to break this cycle, perhaps by moving shared utilities to a common, independent module.

src/relax/backend/adreno/annotate_custom_storage.cc (400-407)

medium

The SupportsTexture function for Conv2DAttrs returns {true, false}, indicating that texture is used for data but not for weights. While this might be a deliberate design choice or a current limitation, it's worth clarifying the rationale. If weights could also benefit from texture memory, this could represent a missed optimization opportunity.

python/tvm/dlight/analysis/common_analysis.py (259-265)

medium

The is_convolution and is_pool methods in BlockInfo currently raise NotImplementedError. If these methods are intended to be used, they should be implemented. If they are not meant to be part of the generic BlockInfo or are not yet ready, consider making BlockInfo an abstract base class with these as abstract methods, or remove them if they are not intended for this class.

python/tvm/target/target.py (841-864)

medium

The adreno target function now accepts a cfg parameter, which is directly appended to the target keys. There is no validation or clear documentation on the expected values for cfg or its intended impact on the target. This could lead to confusion or unexpected behavior if arbitrary strings are passed. Consider adding validation or more explicit documentation for this parameter.

python/tvm/dlight/adreno/pool.py (56)

medium

The schedule_max_pool method uses a strict check iters_kind != "SSSSSRR". This exact match might be too restrictive and could prevent valid pooling patterns from being scheduled if their iteration kinds vary slightly (e.g., due to different input shapes or optimizations). Consider if a more flexible pattern matching or a different heuristic would be more robust.

src/tir/transforms/inject_texture_alloc.cc (66-71)

medium

In VisitStmt_(const AllocateNode* op), the ICHECK(op->extents.size() >= 3) and ICHECK(channel_size == 128 || channel_size == 64) checks are very specific. The extents.size() >= 3 implies a minimum dimensionality for texture allocation, and the channel_size limits it to 64 or 128 bits. These constraints might be too rigid and could prevent other valid texture configurations.

src/tir/transforms/texture_flatten.cc (108-113)

medium

In VisitStmt_(const BufferRealizeNode* op), similar to InjectTextureAlloc, the ICHECK(op->bounds.size() >= 3) and ICHECK(channel_size == 128 || channel_size == 64) checks are very specific. These constraints might be too rigid and could prevent other valid texture configurations.

src/tir/transforms/texture_flatten.cc (56-58)

medium

The SimplifyOffset function in TextureLoweringBase assumes expr.extent == 1 for all regions. This might not always be true for complex buffer regions, potentially leading to incorrect offset calculations or runtime errors.
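
Under the extent == 1 assumption flagged here, each region degenerates to a single element, so the offset is just the row-major linearization of the region's min coordinates. A small sketch of that assumption (hypothetical helper, not the actual SimplifyOffset code):

```python
# Row-major offset from a buffer region when every range has extent 1,
# i.e. the region selects a single element. Illustrates the assumption
# discussed above; not the real SimplifyOffset implementation.
def point_offset(shape, region):
    """region is a list of (min, extent) pairs, one per dimension."""
    offset = 0
    for dim, (lo, extent) in zip(shape, region):
        assert extent == 1, "this logic only handles point regions"
        offset = offset * dim + lo
    return offset

# Element (2, 3) of a 4x8 buffer sits at flat index 2*8 + 3 = 19.
print(point_offset([4, 8], [(2, 1), (3, 1)]))  # 19
```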

src/tir/transforms/vectorize_loop.cc (524-530)

medium

In Vectorizer::VisitExpr_ for builtin::texture2d_load(), the ICHECK condition lane * dtype.bits() <= op->args[4].as<IntImmNode>()->value implies that the vectorized lane count multiplied by the data bits must be less than or equal to the channel size. This is a reasonable check, but it might be more robust to ensure exact equality or provide clear error messages if the vectorized load is not perfectly aligned with the channel size.

src/tir/transforms/vectorize_loop.cc (541-547)

medium

In Vectorizer::VisitExpr_ for builtin::texture2d_store(), the ICHECK condition lane * dtype.bits() == op->args[4].as<IntImmNode>()->value enforces exact equality between the vectorized lane count multiplied by data bits and the channel size. This strict equality might limit flexibility for certain write patterns where partial writes or writes with different lane counts could be valid.
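
The asymmetry between the two checks described above can be summarized in a few lines: a vectorized texture load may cover at most one texel (lanes * dtype_bits <= channel_bits), while a store must fill the texel exactly (lanes * dtype_bits == channel_bits). This is an illustrative restatement of the ICHECK conditions, not the vectorizer code itself:

```python
# Illustrative restatement of the vectorizer legality checks for texture
# access: loads may under-read a texel, stores must match it exactly.
def can_vectorize_load(lanes, dtype_bits, channel_bits):
    return lanes * dtype_bits <= channel_bits

def can_vectorize_store(lanes, dtype_bits, channel_bits):
    return lanes * dtype_bits == channel_bits

assert can_vectorize_load(4, 16, 128)       # half4 read from a 128-bit texel
assert not can_vectorize_store(4, 16, 128)  # half4 write under-fills the texel
assert can_vectorize_store(8, 16, 128)      # half8 write fills it exactly
```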

python/tvm/dlight/adreno/convolution.py (150-229)

medium

The removal of the Config dataclass and get_configs method hardcodes scheduling parameters directly into schedule_conv2d. While this simplifies the immediate code, it reduces flexibility for future tuning or adaptation to different Adreno models or convolution variants. Consider if these parameters should remain configurable, perhaps through a target attribute or a more explicit configuration mechanism.

python/tvm/dlight/adreno/fallback.py (94)

medium

In schedule_default, accessing block_info.write_bufs(sch)[0].assoc_lps[-1] assumes that write_bufs will always return a non-empty list and assoc_lps will always have a last element. This could lead to an IndexError. Please add checks to ensure these lists are not empty before accessing elements.

        s_loops, r_loops, o_loops = [], [], []
        
        write_buffers = block_info.write_bufs(sch)
        if not write_buffers or not write_buffers[0].assoc_lps:
            # Handle case where no write buffers or associated loops are found
            return

        v_loop = write_buffers[0].assoc_lps[-1]
        if v_loop is None:
            return

tests/python/relax/texture/test_texture_nd.py (61)

medium

In preprocess_pipeline, tvm.tir.transform.BindTarget(Target.current(allow_none=False)) is applied twice. This is redundant and could potentially lead to unexpected behavior if the target state changes between the two calls. Please remove the duplicate call.

            tvm.transform.Sequential(
                [
                    # tvm.tir.transform.BindTarget(Target.current(allow_none=False)), # Remove this line
                    tvm.relax.transform.FoldConstant(),
                    tvm.relax.transform.DecomposeOpsForInference(),
                    tvm.relax.transform.FoldConstant(),
                    tvm.tir.transform.BindTarget(tvm.target.Target.current(allow_none=False)),
                    tvm.relax.transform.ConvertLayout(desired_layouts),
