fix: Rework enrol workflow by stevekeay · Pull Request #1909 · rackerlabs/understack

stevekeay · 2026-04-02T13:20:10Z

This is a somewhat opinionated move from a large, complex workflow
that calls many different Python scripts to a minimal workflow that
executes a single script to take all required steps.

In theory there are benefits to the multiple workflow steps, but in
practice I don't think we really benefit from the hardcore Argo
functionality, and the workflows are fragile, difficult to read, almost
impossible to test, and inscrutable to troubleshoot. By contrast, the
Python script is easily maintainable.

The Python script is "beefed up" with extra capabilities and it now
handles the transitioning of the node through the various stages,
including configuring RAID, running inspection, etc.

Logging and error handling are improved.

Ability to cope with non-standard cabling is improved: we used
to rely completely on a cabling convention to determine which port
should be used for PXE.

We now use the LLDP data reported by the BMC (when available - note that
we have hardware where the LLDP feature does not work), and we enable HTTP
boot on ALL interfaces that the BMC reports to be connected. The server
will attempt each one in turn until it gets a DHCP response.
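The selection described above could be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `BmcInterface` type, its field names, and the link-state fallback for hardware without working LLDP are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BmcInterface:
    """Hypothetical view of one NIC port as reported by the BMC."""
    name: str
    link_up: bool
    # Chassis MAC of the LLDP neighbour, or None when the BMC has no LLDP data.
    lldp_chassis_mac: Optional[str] = None


def http_boot_candidates(interfaces: list[BmcInterface]) -> list[str]:
    """Return the names of every interface that appears to be connected.

    Interfaces with an LLDP neighbour are used when any exist. If the BMC
    reports no LLDP data at all, fall back to link state (an assumption for
    this sketch; the PR does not say what the fallback is).
    """
    with_lldp = [i.name for i in interfaces if i.lldp_chassis_mac]
    if with_lldp:
        return with_lldp
    return [i.name for i in interfaces if i.link_up]
```

Returning a list rather than a single name reflects the idea that HTTP boot is enabled on every connected interface and the server works through them until one gets a DHCP response.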

We persist the set of PXE interfaces in the extra field of the baremetal
node. This is then used by the inspection hook to set the "pxe" flag on
those same interfaces.
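A rough sketch of that hand-off, using plain dictionaries in place of real Ironic objects. The `pxe_interfaces` key name and the choice to identify interfaces by MAC address are assumptions for illustration; the port fields `address` and `pxe_enabled` mirror the Ironic port resource.

```python
EXTRA_KEY = "pxe_interfaces"  # hypothetical key name, not taken from the PR


def persist_pxe_interfaces(node_extra: dict, macs: list[str]) -> dict:
    """Return a copy of the node's extra field with the PXE MACs recorded."""
    updated = dict(node_extra)
    updated[EXTRA_KEY] = sorted(m.lower() for m in macs)
    return updated


def flag_pxe_ports(node_extra: dict, ports: list[dict]) -> list[dict]:
    """Set pxe_enabled on every port whose MAC was persisted at enrol time.

    This is the part an inspection hook would perform; ports not in the
    persisted set have the flag cleared.
    """
    wanted = set(node_extra.get(EXTRA_KEY, []))
    return [
        {**port, "pxe_enabled": port["address"].lower() in wanted}
        for port in ports
    ]
```

Both functions return new objects instead of mutating their inputs, which keeps the sketch easy to test without a running Ironic.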

skrobul · 2026-04-09T08:02:19Z

workflows/argo-events/workflowtemplates/enroll-server.yaml

        image: ghcr.io/rackerlabs/understack/ironic-nautobot-client:latest
        command:
-          - enroll-server
+          - enrol-server


The package only exposes enroll-server, so this will fail.

Good point, fixed in 0aaf487


The Python script is "beefed up" with extra capabilities and it now handles the transitioning of the node through the various stages, including configuring RAID, running inspection, etc.

Logging and error handling are improved.

Ability to cope with non-standard cabling is improved slightly - we used to rely completely on a cabling convention to determine which port should be used for PXE. We now detect that from the LLDP data reported by the BMC (when available - note that we have hardware where the LLDP feature does not work). This provides us with the chassis MAC address of the connected switch. Until we can find an easy way to look this up, we allow the user to pass one or more PXE switch MACs on the command line.
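As an illustration of preferring known switches, the sketch below orders interfaces so that those whose LLDP neighbour matches a user-supplied chassis MAC come first. The `--pxe-switch-mac` flag name, the function names, and the mapping shape are hypothetical; the real script's command-line interface may differ.

```python
import argparse


def normalise_mac(mac: str) -> str:
    """Lower-case a MAC and use ':' separators so comparisons are reliable."""
    return mac.strip().lower().replace("-", ":")


def order_pxe_interfaces(lldp_neighbours: dict[str, str],
                         pxe_switch_macs: list[str]) -> list[str]:
    """Order interfaces with known PXE switches first.

    lldp_neighbours maps interface name -> chassis MAC of the connected
    switch. Interfaces on a known switch come first, then all others;
    each group is sorted by name for a stable result.
    """
    known = {normalise_mac(m) for m in pxe_switch_macs}
    preferred = sorted(
        name for name, mac in lldp_neighbours.items()
        if normalise_mac(mac) in known
    )
    others = sorted(name for name in lldp_neighbours if name not in preferred)
    return preferred + others


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="sketch only")
    # Hypothetical flag; may be given more than once.
    parser.add_argument("--pxe-switch-mac", action="append", default=[],
                        dest="pxe_switch_macs")
    return parser
```

With no switch MACs supplied, every interface with an LLDP neighbour is still returned, so the preference narrows the ordering rather than the set.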

We no longer try to set a single solitary interface as "the" PXE interface. We now set a whole slew of interfaces - the server will try them all and use the one that works. So we want Ironic to support HTTP boot on any of the interfaces listed. At the time this hook runs, we don't know which interface will get used, so we set the "pxe" flag on all of them.

We see HTTP 503 errors from redfish for a while after booting the server.
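One way to ride out those transient 503 responses is a bounded retry loop around each request. The sketch below is generic rather than the PR's implementation: `ServiceUnavailable` is a hypothetical stand-in for whatever exception the HTTP or Redfish client raises on a 503, and the attempt count and delay are arbitrary defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 response from the BMC."""


def retry_on_503(request: Callable[[], T],
                 attempts: int = 10,
                 delay: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call request() until it succeeds or the attempts are used up.

    Only ServiceUnavailable is retried; any other exception propagates
    immediately. The last failure is re-raised once attempts run out.
    """
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == attempts:
                raise
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter lets a test substitute a no-op so the retry logic can be exercised without waiting.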

cardoe

So overall I'm down for the part that gets rid of the argo-workflow and calls it with Python. I'm -1 on all the Redfish API calls because we should be doing this via Ironic. The logic on retrying and rebooting still doesn't cover all the behaviors of the hardware that have been tested via Ironic but there's no point in adding it here.

stevekeay · 2026-04-09T21:57:38Z

> So overall I'm down for the part that gets rid of the argo-workflow and calls it with Python. I'm -1 on all the Redfish API calls because we should be doing this via Ironic. The logic on retrying and rebooting still doesn't cover all the behaviors of the hardware that have been tested via Ironic but there's no point in adding it here.

Note that this PR doesn't add that redfish interaction - it is already happening, I'm mostly just refactoring.

To move in the openstack direction requires significant changes that deserve careful planning and testing. I'm happy to take this on, but I am concerned that we are postponing delivery of a very time-sensitive capability, notably the ability to on-board the RXDB gear. The way they cabled up those racks we have no way to enrol them.

I would therefore suggest the following as follow-on PRs:

I don't think we can easily remove all the redfish interaction without adding significant new features to Ironic. Some of it should be do-able though:

1. setting the password (and other BMC settings) is probably something that needs to stay in enroll, since Ironic expects the credentials to work right away.
2. ironic redfish inspection is probably sufficient to handle the "discovery" part, including enumerating the interfaces on the box
3. setting the BIOS settings can likely be done in a clean step which is defined using the data from (2)
4. Alan's raid thing can probably become a step or a hook
5. We discussed previously - some of this should be happening on every clean, like the RAID and BIOS needs to be ship-shape and we can set a BIOS password but we can not stop them messing with RAID.

I think we still need the external orchestration because it's not obvious how to do all this glue via steps/runbooks/hooks/etc.

skrobul · 2026-04-10T11:49:42Z

Something to explore in future (not for this PR) https://www.dell.com/support/manuals/en-us/poweredge-r7615/idrac9_7.xx_ug/configuring-servers-and-server-components-using-auto-config?guid=guid-d38bd838-ad66-414b-a61f-0cb241f6459e&lang=en-us

stevekeay · 2026-04-10T12:20:28Z

> Something to explore in future (not for this PR) https://www.dell.com/support/manuals/en-us/poweredge-r7615/idrac9_7.xx_ug/configuring-servers-and-server-components-using-auto-config?guid=guid-d38bd838-ad66-414b-a61f-0cb241f6459e&lang=en-us

Yes this is an interesting feature (although I'd put money on it not working properly). It is irksome that they chose the "golden" model, like you configure a server how you like it and then "save" that config to a file. I'd prefer if they documented the options/schema for the file to support us building it ourselves, in true declarative style. It requires a file tailored to the exact hardware (NICs and raid/disks, etc) so we still need the same kind of logic to support this.

Autodiscovery is designed as a one-shot initial setup. I guess you could maybe reset the iDRAC and get it to re-configure the server but there is no real support for managing the server during its lifetime.

stevekeay force-pushed the rework-enrol-workflow branch 2 times, most recently from a8435f8 to ebbdafc Compare April 7, 2026 12:44

stevekeay changed the title ~~Rework enrol workflow~~ fix: Rework enrol workflow Apr 7, 2026

stevekeay force-pushed the rework-enrol-workflow branch 3 times, most recently from d4956f9 to 87bf1fa Compare April 8, 2026 14:14

skrobul requested changes Apr 9, 2026

stevekeay force-pushed the rework-enrol-workflow branch from 87bf1fa to 0aaf487 Compare April 9, 2026 11:58

stevekeay requested a review from skrobul April 9, 2026 20:40

stevekeay added 12 commits April 9, 2026 21:40

Fix test so it is not affected by environment

3136318

Simplify PXE NIC selection algorithm and allow for more edge cases

4e987a4

User can pass a list of known switches to prefer.

Handle case where BIOS setting changes have already been staged

36006ef

Add code comments to document the enrol process

2c0f601

Add extra bios settings step to assist the raid configuration

2d74e7e

Disable PXE boot - we don't use it any more.

626d62b

Delete unused script - this is handled during enrol

fdcf014

Enrol with a list of http-boot-interfaces instead of a single interface

22a016e

Retry redfish requests after powering on the server (takes a long time)

3734e8a

We see HTTP 503 errors from redfish for a while after booting the server.

Use consistent spelling of enroll

1924915

stevekeay force-pushed the rework-enrol-workflow branch from 0aaf487 to 1924915 Compare April 9, 2026 20:41

cardoe reviewed Apr 9, 2026

stevekeay wants to merge 12 commits into main from rework-enrol-workflow
