
Report correct L2 cache size on ARM (Neoverse V1/V2)#372

Merged
malfet merged 4 commits into pytorch:main from Rohanjames1997:wrong_l2
Mar 12, 2026

Conversation

@Rohanjames1997 (Contributor)

Fixes #369

This is my first attempt at fixing the L2 size reporting issue. Comments and feedback are welcome.

I tested on Arm Neoverse V1 and V2 EC2 instances. I reused the reproducer in #369

#include <cstddef>
#include <cstdio>
#include <cpuinfo.h>
size_t l2_bytes() {
  if (!cpuinfo_initialize()) return 0;
  const cpuinfo_processor* p = cpuinfo_get_current_processor();
  if (!p || !p->cache.l2) return 0;
  return p->cache.l2->size; // bytes
}
int main() { std::printf("%zu\n", l2_bytes()); }

This now returns the expected results on Arm Neoverse V1 and V2.


Additionally, here is the output of ./cache-info.

1. On Neoverse-V1:
Max cache size (upper bound): 4194304 bytes
L1 instruction cache: 64 x 64 KB, 4-way set associative (256 sets), 64 byte lines, shared by 1 processors
L1 data cache: 64 x 64 KB, 4-way set associative (256 sets), 64 byte lines, shared by 1 processors
L2 data cache: 64 x 1 MB (inclusive), 8-way set associative (2048 sets), 64 byte lines, shared by 1 processors
2. On Neoverse-V2:
Max cache size (upper bound): 4194304 bytes
L1 instruction cache: 64 x 64 KB, 4-way set associative (256 sets), 64 byte lines, shared by 1 processors
L1 data cache: 64 x 64 KB, 4-way set associative (256 sets), 64 byte lines, shared by 1 processors
L2 data cache: 64 x 2 MB (inclusive), 8-way set associative (4096 sets), 64 byte lines, shared by 1 processors

@Radu2k left a comment

This looks great, thanks for addressing the issue! The only suggestion I would have is to add a check that the level matches the index used, or to check all indexes and match by level. This slipped through my fingers when I wrote the implementation in vLLM and could lead to wrong cache values.

Here is the output of tree -L 2 at /sys/devices/system/cpu/cpu0/cache; if you cat the level file, you should see 2 at index2:

.
├── index0
│   ├── allocation_policy
│   ├── coherency_line_size
│   ├── level
│   ├── number_of_sets
│   ├── shared_cpu_list
│   ├── shared_cpu_map
│   ├── size
│   ├── type
│   ├── uevent
│   ├── ways_of_associativity
│   └── write_policy
├── index1
│   ├── allocation_policy
│   ├── coherency_line_size
│   ├── level
│   ├── number_of_sets
│   ├── shared_cpu_list
│   ├── shared_cpu_map
│   ├── size
│   ├── type
│   ├── uevent
│   ├── ways_of_associativity
│   └── write_policy
├── index2
│   ├── allocation_policy
│   ├── coherency_line_size
│   ├── level
│   ├── number_of_sets
│   ├── shared_cpu_list
│   ├── shared_cpu_map
│   ├── size
│   ├── type
│   ├── uevent
│   ├── ways_of_associativity
│   └── write_policy
├── index3
│   ├── allocation_policy
│   ├── coherency_line_size
│   ├── level
│   ├── number_of_sets
│   ├── shared_cpu_list
│   ├── shared_cpu_map
│   ├── size
│   ├── type
│   ├── uevent
│   ├── ways_of_associativity
│   └── write_policy
└── uevent
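The level check suggested above could look something like this minimal sketch. The helper name and its use are hypothetical, not the actual PR code; it just illustrates reading the integer in a sysfs-style level file and returning 0 on any failure so the caller can fall back to defaults.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the integer stored in a sysfs-style "level"
 * file. Returns 0 on any failure (missing file, empty file) so the
 * caller can fall back to the hardcoded defaults. */
static unsigned read_cache_level(const char* path) {
	FILE* f = fopen(path, "r");
	if (f == NULL) {
		return 0;
	}
	char buf[16];
	unsigned level = 0;
	if (fgets(buf, sizeof(buf), f) != NULL) {
		level = (unsigned) strtoul(buf, NULL, 10);
	}
	fclose(f);
	return level;
}
```

A caller would then verify that, for example, read_cache_level("/sys/devices/system/cpu/cpu0/cache/index2/level") returns 2 before trusting the size reported under index2.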

@Rohanjames1997 (Contributor, Author)

Thanks @Radu2k! That's a good catch.

I can error out with 0 and fall back to the previous hardcoded values if the level file under index2 does not read 2. Will push that commit now.


checking indexes and matching by level

I tried this as well, although it can get a little involved, with a few more potential file opens. If you and the other reviewers deem it necessary, I can push that fix next.

cc: @malfet

@Rohanjames1997 Rohanjames1997 requested a review from Radu2k February 5, 2026 17:59
@Radu2k left a comment

Approved, LGTM 👍

@Rohanjames1997 (Contributor, Author)

Will fix the linter soon. Anything else of note? @malfet

@malfet (Contributor) commented Feb 9, 2026

@Rohanjames1997 I wonder why reading from sysfs is an ARM-specific thing rather than a generic Linux one?
And if those entries are not available, what would it return now (0, it seems) compared to the previous behavior?

I know that many sysfs entries are not available from AWS Lambdas for security reasons.

@malfet left a comment

Please:

  • Fix lint
  • Use existing APIs/abstractions
  • Add some links to the docs explaining why sysfs should be used, which kernel versions added support for this entry, and whether it's arch specific or generic


/* Verify the index actually corresponds to the requested cache level */
snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%u/cache/index%u/level", cpu_id, cache_level);
FILE* file = fopen(path, "r");
Contributor

Can you explain why cpuinfo_linux_parse_small_file should not be used there?

Contributor (Author)

Thanks! I had no idea about that function 🙂

@Rohanjames1997 (Contributor, Author) commented Feb 9, 2026

@malfet Thanks much for the review! As you can tell this is my first contrib to the repo.

It looks like x86 CPUs have a CPUID instruction that directly returns cache properties (src/x86/cache/deterministic.c:14-95). ARM has no equivalent, so hardcoded values were used before.
If sysfs is unavailable, the function returns 0 and the code falls back to the existing hardcoded values: the same behavior as before.

Does that sound reasonable?

I will fix lint, use existing APIs & update docs next.
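The fallback behavior described above amounts to something like the following sketch. All names here are hypothetical (the stub stands in for the sysfs query and simply pretends sysfs could not be read); only the shape of the fallback matches what the comment describes.

```c
#include <stdint.h>

/* Hypothetical hardcoded default, standing in for the values the
 * library used before this PR. */
#define HARDCODED_L2_SIZE (1024u * 1024u)

/* Hypothetical stand-in for the sysfs query; returns 0 when sysfs is
 * unavailable (e.g. in a restricted AWS Lambda environment). */
static uint32_t query_l2_size_from_sysfs(void) {
	return 0; /* pretend the sysfs entries could not be read */
}

/* When the sysfs query fails, keep the pre-PR behavior by falling
 * back to the hardcoded value. */
static uint32_t effective_l2_size(void) {
	uint32_t size = query_l2_size_from_sysfs();
	if (size == 0) {
		size = HARDCODED_L2_SIZE;
	}
	return size;
}
```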

Comment on lines +39 to +47
uint32_t value = 0;
const char* p = text_start;
while (p < text_end && *p >= '0' && *p <= '9') {
    value = value * 10 + (*p - '0');
    p++;
}
if (p == text_start || value == 0) {
    return false;
}
Contributor

A) This code looks identical to the code on lines 25-29; why not move it to a shared function?
B) Doesn't this function already exist in the standard library, as strtoull?

Contributor (Author)

Good catch. I've extracted it into a helper function and used strtoul.

Contributor (Author)

Hmm, looking at it more closely, I see that similar parsing logic is present in processors.c.

One probable reason could be that cpuinfo_linux_parse_small_file passes a non-null-terminated buffer (text_start to text_end), while strtoul requires a null-terminated string and would read past the buffer.

What do you think @malfet ?
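One way to reconcile strtoul with a non-null-terminated slice, sketched here under the assumption stated above (the helper name is hypothetical): copy the [text_start, text_end) range into a small null-terminated stack buffer before parsing.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: apply strtoul to a non-null-terminated
 * [text_start, text_end) slice by first copying it into a bounded,
 * null-terminated stack buffer. Returns 0 on any failure (empty
 * slice, slice too long, or no digits consumed). */
static unsigned long parse_bounded_uint(const char* text_start,
                                        const char* text_end) {
	char buf[32];
	size_t len = (size_t) (text_end - text_start);
	if (len == 0 || len >= sizeof(buf)) {
		return 0;
	}
	memcpy(buf, text_start, len);
	buf[len] = '\0';
	char* end = NULL;
	unsigned long value = strtoul(buf, &end, 10);
	return (end == buf) ? 0 : value;
}
```

This keeps strtoul from reading past the buffer at the cost of one bounded copy per parse.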

Contributor

Do you know if there is a document somewhere explaining how to properly query cache sizes? I.e., couldn't one use cpuid to uniquely identify whether the L2 cache is shared or not?

I.e., shouldn't one be able to detect the cache configuration during

switch (midr & (CPUINFO_ARM_MIDR_IMPLEMENTER_MASK | CPUINFO_ARM_MIDR_PART_MASK)) {

or somewhere inside chipset.c?

@fbarchard (Collaborator)

Could you include N1 and N2 as well?

@Rohanjames1997 (Contributor, Author)

@fbarchard I unfortunately don't have access to N1 and N2 machines to test my changes.

@zhili03

zhili03 commented Mar 11, 2026

Hi @Rohanjames1997, nice work, and thanks for fixing the bug! We have run the test on Neoverse N1 (AWS Graviton 2), and the results below are as expected:

Max cache size (upper bound): 4194304 bytes
L1 instruction cache: 32 x 64 KB, 4-way set associative (256 sets), 64 byte lines, shared by 1 processors
L1 data cache: 32 x 64 KB, 4-way set associative (256 sets), 64 byte lines, shared by 1 processors
L2 data cache: 32 x 1 MB (inclusive), 8-way set associative (2048 sets), 64 byte lines, shared by 1 processors

Besides, the l2c_test.cpp output is 1048576, which is consistent with the 1 MB L2 cache on Neoverse N1. Will look into Neoverse N2 soon if possible.

cc: @Radu2k @malfet @fbarchard

@zhili03

zhili03 commented Mar 11, 2026


Managed to perform the test on Neoverse N2 (Azure D4ps v6); the results are below:

Max cache size (upper bound): 4194304 bytes
L1 instruction cache: 4 x 64 KB, 4-way set associative (256 sets), 64 byte lines, shared by 1 processors
L1 data cache: 4 x 64 KB, 4-way set associative (256 sets), 64 byte lines, shared by 1 processors
L2 data cache: 4 x 1 MB (inclusive), 8-way set associative (2048 sets), 64 byte lines, shared by 1 processors

Again, the l2c_test.cpp output is 1048576. Therefore, the PR produces correct output for Neoverse N2 as well.

@malfet (Contributor) commented Mar 11, 2026

[EDIT] I finally understand why accessing sysfs is the way to go: doing something like https://github.com/torvalds/linux/blob/master/arch/arm64/kernel/cacheinfo.c is not possible from userland. Moreover, parsing ACPI should also be done at the kernel level.

@malfet malfet merged commit 7607ca5 into pytorch:main Mar 12, 2026
12 checks passed
@Rohanjames1997 (Contributor, Author)

Thanks @malfet ! 🚀


Successfully merging this pull request may close these issues:

cpuinfo reports incorrect L2 cache size on ARM (Neoverse V1/V2)