Add Ability to Activate per Sample Timestamps by gmarler · Pull Request #496 · javierhonduco/lightswitch

gmarler · 2026-06-22T16:42:36Z

Currently, lightswitch applies one timestamp per session (every 5 seconds, by default) across all samples collected during the session window. In the case of creating a pprof per session and shipping the result to lightswitch-backend, the timestamp is applied upon actual ingestion time, not collection time.

This PR has the following rationale:

This feature is behind a flag, which is false by default: --enable-ts-per-sample
When enabled, apply a timestamp to every sample collected, much closer to the moment of collection
As a practical matter, nanosec/microsec resolution timestamps are too fine grained, so the timestamps collected are millisec resolution. From past experience, even sub-second profiling generally buckets samples on a per 1-100 ms resolution, so millisec resolution should be perfectly adequate
The timestamp is encoded in a new pprof label: collected_at. This will allow the backend to continue operating as-is until support is added for the new label at that level
This feature allows much finer grained (down to sub-second, eventually) zooming in of profiling time ranges, but even at per-second resolution is quite useful to zero in on shorter lived code path activity
Tests have been added

A separate POC has already been performed, adding support for these timestamps in lightswitch-backend and the UI, just to confirm this approach is functional, and achieves the stated goals.

javierhonduco · 2026-06-23T09:27:49Z

    unwind_state->sample.pid = per_process_id;
    unwind_state->sample.tid = per_thread_id;
-    unwind_state->sample.collected_at = bpf_ktime_get_boot_ns();
+    unwind_state->sample.collected_at = bpf_ktime_get_ns();


Why was this helper changed? We would like to account for the time the computer has been sleeping, no?

Because that helper wasn't available for at RHEL 8.10's kernel, whereas the one I'm using is available universally.

I was surprised at this.

And yes, sleep time isn't accounted for, but the intervals are so small that it's not (currently) an issue.

Would you prefer I add a config flag (like use_boot_ns, for instance) similar to other flags you have in lightswitch_config already so that kernels that have it could use bpf_ktime_get_boot_ns(), and older kernels like RHEL 8.10 can fall back to bpf_ktime_get_ns()?

It wasn't that hard to auto-detect whether the capability exists, just as you've done for other things, so I did that.

BUT... after making that change, something interesting happened - 2 tests started to get verifier failures, but the binary worked just fine.

I explain what I had to do to get rid of that in the commit message for 6d0be10.

What's different now is that the flags default to false, but the check against system info will set them to the correct detected values - apparently the binary already did this, but the tests did not. Fine for the one flag that's apparently available on every kernel we've ever used it with, but the flag I just added isn't always available.

Setting any of these flags to true by default gives the verifier the chance to check for the non-present features, which the verifier apparently doesn't like (and probably only on the older kernels like RHEL 8.10).

Convoluted!

Beyond the fact that lightswitch itself runs without issues on this distro version, additional proof that RedHat 8.10 has support for bpf_ktime_get_boot_ns():

# cat /etc/os-release | grep 8.10 VERSION="8.10 (Ootpa)" VERSION_ID="8.10" PRETTY_NAME="Red Hat Enterprise Linux 8.10 (Ootpa)" REDHAT_BUGZILLA_PRODUCT_VERSION=8.10 REDHAT_SUPPORT_PRODUCT_VERSION="8.10"

# sudo cat /proc/kallsyms |& grep bpf_ktime_get_boot_ns ffffffff9b461c60 T bpf_ktime_get_boot_ns ffffffff9c237b80 R bpf_ktime_get_boot_ns_proto

As you mentioned, the tests hardcode certain settings that might not be available, that's the ultimate root cause of the issue while running the tests on older kernels, but this all seems unrelated to me to the bpf_ktime_get_boot helper. I think we should not have this feature detection since all the kernels we support should have it (RHEL 8 backported it, it seems).

Yes, bpf_ktime_get_boot_ns has been there all the time, it was task_pt_regs_helper that was missing all along on RHEL 8.10, and I mentally switched them.

So bpf_ktime_get_boot_ns doesn't need a detector at present (personally, I'd have it anyway, just so everything is consistent), so I'll remove it.

The genesis of this issue seems to have been that these tests have been failing for quite some time on RHEL 8.10, but since we don't explicitly run the test suite as part of generating the package, this was simply never noticed.

The binary auto-detects whether the helper is available, so the problem never shows up there - I just assumed that if the tests died with a verifier error, the binary surely would, and never tried it.

javierhonduco · 2026-06-23T09:31:47Z

+        long,
+        help = "Attach a per-sample collected_at timestamp (ms) label to pprof output"
+    )]
+    pub(crate) enable_ts_per_sample: bool,


We should maybe emit a warning if the profile format is not pprof and this is enabled since other formats support high resultion timestamps already (see the firefox profiler collector)

Fine - do you not expect to support those timestamp types as well, in possible future PRs?

So the warning would be (currently only for the firefox profiler collector format, and any others you may add in future): "--enable-ts-per-sample is not necessary for output format X - such timestamps are provided by default"?

Or what would you prefer?

Let me think about this, need to make sure we don't under/over design this feature

javierhonduco · 2026-06-23T09:32:29Z

        );
    }

+    fn hash_of(sample: &RawSample) -> u64 {


Should this impement Hash/er directly?

Suppose so. I'll see how involved (or easy) it is.

So, would you prefer this as a trait implementation (which would only be for the tests), or to alternatively use a simpler BuildHasher variant?

I'm doing the trait first, but let me know if you want me to switch.

javierhonduco · 2026-06-23T09:37:27Z

Tangentially related but I was thinking of adding support for OTel which supports "aggregated" profiles à la pprof and also timestamped samples

gmarler · 2026-06-30T16:31:46Z

Tangentially related but I was thinking of adding support for OTel which supports "aggregated" profiles à la pprof and also timestamped samples

Certainly doable, but one format per PR, eh?

- add collected_at to AggregatedSample - propagate collected_at in convert.rs and emit pprof label - RawSample::hash — now includes (self.collected_at / 1_000_000) (ms bucket), so identical stacks at different milliseconds become separate aggregated entries - AggregatedSample — gains pub collected_at: u64 field (milliseconds since Unix epoch) - RawAggregatedSample::process() — sets collected_at: self.sample.collected_at / 1_000_000 (ns → ms) - new hash/bucketing tests - tests for same stack in different ms buckets stays separate; same stack within one ms bucket merges - symbolize_profile — propagates collected_at through symbolization - to_pprof — emits a collected_at numeric pprof label (value = ms, unit = milliseconds) on every sample - test: collected_at label round-trips through protobuf encode/decode

- Replace with what is available

…r to true - The actual binary doesn't fail the way the tests do, because it actually does detection from system info query and resets the flags as needed, and the branch is pruned by the verifier - The reason the tests fail though is that use_task_pt_regs_helper defaults to true, which means the verifier ALWAYS sees a live branch calling bpf_task_pt_regs and bpf_get_current_task_btf. If either of the helpers we need to use (including the one added in this PR) isn't available on RHEL 8.10, the verifier rejects the program with EINVAL.

javierhonduco · 2026-07-01T07:34:32Z

+        let unit_str = &profile.string_table[lbl.num_unit as usize];
+        assert_eq!(unit_str, "milliseconds");
+
+        // Verify after encode/decode round-trip


This test is making sure that the (de)serialization works, we don't need to check for this, we can assume it works and that upstream is well tested

Did you mean:

// Verify in memory

// Verify after encode/decode round-trip

Or both? Or basically the entire test?

javierhonduco · 2026-07-01T07:35:25Z

-    unwind_state->sample.collected_at = bpf_ktime_get_boot_ns();
+    unwind_state->sample.collected_at = lightswitch_config.use_ktime_get_boot_ns
+        ? bpf_ktime_get_boot_ns()
+        : bpf_ktime_get_ns();


Let's back out this change. This helper is always present

- It actually is always available in kernels of interest to us

javierhonduco reviewed Jun 23, 2026

View reviewed changes

gmarler added 5 commits June 30, 2026 12:38

bpf_ktime_get_boot_ns() is not yet available in RHEL 8 kernels

7cb1792

- Replace with what is available

Gate per sample ts with --enable-ts-per-sample bool flag

3ef6c2a

Prefer bpf_ktime_get_boot_ns whenever available

a686630

Convert hash_of utility to a blanket trait implementation

211d6e5

gmarler force-pushed the ts-per-sample branch from bb6c4f9 to 211d6e5 Compare June 30, 2026 17:06

gmarler added 3 commits June 30, 2026 14:21

Update cargo insta after adding new CLI option

97fb1b2

Actually update the cargo insta after review

ce944b5

javierhonduco reviewed Jul 1, 2026

View reviewed changes

Remove needless detection of bpf_ktime_get_boot_ns

b32a05b

- It actually is always available in kernels of interest to us

Conversation

gmarler commented Jun 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmarler Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmarler Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

javierhonduco Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmarler Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

javierhonduco commented Jun 23, 2026

Uh oh!

gmarler commented Jun 30, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmarler Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gmarler Jun 30, 2026 •

edited

Loading

gmarler Jun 30, 2026 •

edited

Loading

javierhonduco Jul 1, 2026 •

edited

Loading

gmarler Jul 1, 2026 •

edited

Loading

gmarler Jul 1, 2026 •

edited

Loading