fix: Rework enrol workflow#1909

Open
stevekeay wants to merge 12 commits into main from rework-enrol-workflow
@stevekeay
Contributor

@stevekeay stevekeay commented Apr 2, 2026

This is a somewhat opinionated move from a large, complex workflow
that calls lots of different python scripts to a minimal workflow that
executes a single script to take all required steps.

In theory there are benefits to the multiple workflow steps, but in
practice I don't think we really benefit from the hardcore argo
functionality, and the workflows are fragile, difficult to read, almost
impossible to test, and inscrutable to troubleshoot. By contrast, the
python script is easily maintainable.

The python script is "beefed up" with extra capabilities and it now
handles the transitioning of the node through the various stages,
including configuring RAID, running inspection, etc.

Logging and error handling are improved.

Ability to cope with non-standard cabling is improved: we used
to rely completely on a cabling convention to determine which port
should be used for PXE.

We now use the LLDP data reported by the BMC (when available - note that
we have hardware where the LLDP feature does not work), and we enable HTTP
boot on ALL interfaces that the BMC reports to be connected. The server
will attempt each one in turn until it gets a DHCP response.
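
A minimal sketch of the interface selection described above, assuming a much simplified data model. `BmcInterface` and `connected_interfaces` are hypothetical names for illustration, not the ones used in this PR, and the real BMC payload is richer than this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BmcInterface:
    """Hypothetical, simplified view of one NIC port as reported by the BMC."""

    name: str
    mac_address: str
    # Chassis MAC of the LLDP neighbour; None when the BMC reports nothing.
    lldp_chassis_mac: Optional[str] = None


def connected_interfaces(interfaces: list[BmcInterface]) -> list[BmcInterface]:
    """Return every interface for which the BMC reports an LLDP neighbour."""
    return [iface for iface in interfaces if iface.lldp_chassis_mac]
```

On hardware where the LLDP feature does not work this returns an empty list; what to do in that case is outside the scope of this sketch.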

We persist the set of PXE interfaces in the extra field of the baremetal
node. This is then used by the inspection hook to set the "pxe" flag on
those same interfaces.
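
Roughly, the hand-off between the enrol script and the inspection hook could look like the sketch below. The `pxe_interfaces` key and both function names are assumptions made for illustration; the key actually used in the node's extra field is not shown in this conversation:

```python
def record_pxe_interfaces(extra: dict, pxe_macs: list[str]) -> dict:
    """Return a copy of the node's extra field with the PXE MACs recorded.

    The "pxe_interfaces" key is a hypothetical name.
    """
    updated = dict(extra)
    updated["pxe_interfaces"] = sorted(mac.lower() for mac in pxe_macs)
    return updated


def flag_pxe_ports(extra: dict, port_macs: list[str]) -> dict[str, bool]:
    """Map each port MAC to whether it should carry the "pxe" flag."""
    recorded = set(extra.get("pxe_interfaces", []))
    return {mac: mac.lower() in recorded for mac in port_macs}
```

The first function stands in for what the enrol script does at enrol time, the second for what the hook does later with the same data.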

@stevekeay stevekeay force-pushed the rework-enrol-workflow branch 2 times, most recently from a8435f8 to ebbdafc Compare April 7, 2026 12:44
@stevekeay stevekeay changed the title Rework enrol workflow fix: Rework enrol workflow Apr 7, 2026
@stevekeay stevekeay force-pushed the rework-enrol-workflow branch 3 times, most recently from d4956f9 to 87bf1fa Compare April 8, 2026 14:14
image: ghcr.io/rackerlabs/understack/ironic-nautobot-client:latest
command:
- enroll-server
- enrol-server
Collaborator

The package only exposes enroll-server so this will fail

Contributor Author

Good point, fixed in 0aaf487

@stevekeay stevekeay force-pushed the rework-enrol-workflow branch from 87bf1fa to 0aaf487 Compare April 9, 2026 11:58
@stevekeay stevekeay requested a review from skrobul April 9, 2026 20:40
stevekeay added 12 commits April 9, 2026 21:40
User can pass a list of known switches to prefer.
The python script is "beefed up" with extra capabilities and it now
handles the transitioning of the node through the various stages,
including configuring RAID, running inspection, etc.

Logging and error handling are improved.

Ability to cope with non-standard cabling is improved slightly - we used
to rely completely on a cabling convention to determine which port
should be used for PXE.   We now detect that from the LLDP data reported
by the BMC (when available - note that we have hardware where the LLDP
feature does not work).  This provides us with the chassis MAC address
of the connected switch.  Until we can find an easy way to look this up,
we allow the user to pass one or more PXE switch MACs on the command
line.
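
A sketch of how the switch MACs passed on the command line might be matched against the LLDP chassis MACs. The `--pxe-switch-mac` flag name and the helper functions are assumptions for illustration, not necessarily what the script exposes:

```python
import argparse


def normalise_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="enrol sketch (hypothetical)")
    # Hypothetical flag: may be given more than once.
    parser.add_argument("--pxe-switch-mac", action="append", default=[])
    return parser.parse_args(argv)


def pxe_candidates(neighbours: dict[str, str], pxe_switch_macs: list[str]) -> list[str]:
    """Return the interfaces whose LLDP chassis MAC is a known PXE switch.

    `neighbours` maps interface name to the chassis MAC reported via LLDP.
    """
    wanted = {normalise_mac(mac) for mac in pxe_switch_macs}
    return [name for name, chassis in neighbours.items() if normalise_mac(chassis) in wanted]
```

Normalising both sides avoids mismatches caused purely by case or separator differences between the BMC output and what the user typed.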
We no longer try to set a single solitary interface as "the" PXE
interface.  We now set a whole slew of interfaces - the server will try
them all and use the one that works.

So we want Ironic to support HTTP boot on any of the interfaces listed.
At the time this hook runs, we don't know which interface will get used,
so we set the "pxe" flag on all of them.
We see HTTP 503 errors from redfish for a while after booting the server.
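
One way to tolerate those transient 503 responses is a bounded retry loop, sketched below. `ServiceUnavailable` and `retry_on_503` are hypothetical names standing in for whatever exception the Redfish client in use actually raises, and the attempt count and delay are arbitrary:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 raised by a Redfish client."""


def retry_on_503(
    call: Callable[[], T],
    attempts: int = 5,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying while it raises ServiceUnavailable.

    Gives up and re-raises after `attempts` tries. `sleep` is injectable
    so the loop can be tested without real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except ServiceUnavailable:
            if attempt == attempts:
                raise
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```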
@stevekeay stevekeay force-pushed the rework-enrol-workflow branch from 0aaf487 to 1924915 Compare April 9, 2026 20:41
Contributor

@cardoe cardoe left a comment

So overall I'm down for the part that gets rid of the argo-workflow and calls it with Python. I'm -1 on all the Redfish API calls because we should be doing this via Ironic. The logic on retrying and rebooting still doesn't cover all the behaviors of the hardware that have been tested via Ironic but there's no point in adding it here.

@stevekeay
Contributor Author

> So overall I'm down for the part that gets rid of the argo-workflow and calls it with Python. I'm -1 on all the Redfish API calls because we should be doing this via Ironic. The logic on retrying and rebooting still doesn't cover all the behaviors of the hardware that have been tested via Ironic but there's no point in adding it here.

Note that this PR doesn't add that Redfish interaction: it is already happening, and I'm mostly just refactoring.

Moving in the OpenStack direction requires significant changes that deserve careful planning and testing. I'm happy to take this on, but I am concerned that we would be postponing delivery of a very time-sensitive capability, notably the ability to on-board the RXDB gear. Because of the way those racks were cabled, we currently have no way to enrol them.

I would therefore suggest the following as follow-on PRs:

I don't think we can easily remove all the redfish interaction without adding significant new features to Ironic. Some of it should be do-able though:

  1. setting the password (and other BMC settings) is probably something that needs to stay in enroll, since Ironic expects the credentials to work right away.

  2. ironic redfish inspection is probably sufficient to handle the "discovery" part, including enumerating the interfaces on the box

  3. setting the BIOS settings can likely be done in a clean step which is defined using the data from (2)

  4. Alan's RAID thing can probably become a step or a hook

  5. As we discussed previously, some of this should happen on every clean: the RAID and BIOS both need to be ship-shape. We can set a BIOS password, but we cannot stop them messing with RAID.

I think we still need the external orchestration because it's not obvious how to do all this glue via steps/runbooks/hooks/etc.

@skrobul
Collaborator

skrobul commented Apr 10, 2026

@stevekeay
Contributor Author

> Something to explore in future (not for this PR) https://www.dell.com/support/manuals/en-us/poweredge-r7615/idrac9_7.xx_ug/configuring-servers-and-server-components-using-auto-config?guid=guid-d38bd838-ad66-414b-a61f-0cb241f6459e&lang=en-us

Yes, this is an interesting feature (although I'd put money on it not working properly). It is irksome that they chose the "golden" model, where you configure a server how you like it and then "save" that config to a file. I'd prefer it if they documented the options/schema for the file so that we could build it ourselves, in true declarative style. It requires a file tailored to the exact hardware (NICs, RAID/disks, etc.), so we would still need the same kind of logic to support this.

Autodiscovery is designed as a one-shot initial setup. I guess you could maybe reset the iDRAC and get it to re-configure the server but there is no real support for managing the server during its lifetime.

3 participants