a8435f8 to ebbdafc
d4956f9 to 87bf1fa
image: ghcr.io/rackerlabs/understack/ironic-nautobot-client:latest
command:
- enroll-server
- enrol-server
87bf1fa to 0aaf487
User can pass a list of known switches to prefer.
The python script is "beefed up" with extra capabilities and now handles the transitioning of the node through the various stages, including configuring RAID, running inspection, etc. Logging and error handling are improved. The ability to cope with non-standard cabling is improved slightly: we used to rely completely on a cabling convention to determine which port should be used for PXE. We now detect that from the LLDP data reported by the BMC (when available - note that we have hardware where the LLDP feature does not work). This provides us with the chassis MAC address of the connected switch. Until we can find an easy way to look this up, we allow the user to pass one or more PXE switch MACs on the command line.
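As a rough illustration of the switch-MAC matching described above, here is a minimal sketch. It is not the code from this PR: the function names, the shape of the LLDP data (a mapping of interface name to the neighbour's chassis MAC, or `None` when the BMC reports nothing), and the example interface names are all assumptions.

```python
def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def select_pxe_interfaces(lldp_neighbors, pxe_switch_macs):
    """Return the interfaces whose LLDP neighbour is a known PXE switch.

    lldp_neighbors: hypothetical mapping of interface name -> chassis MAC
                    (None when the BMC has no LLDP data for that port).
    pxe_switch_macs: MACs supplied by the user on the command line.
    """
    known = {normalize_mac(m) for m in pxe_switch_macs}
    selected = []
    for interface, chassis_mac in lldp_neighbors.items():
        # Skip ports with no LLDP data; some hardware never reports any.
        if chassis_mac and normalize_mac(chassis_mac) in known:
            selected.append(interface)
    return sorted(selected)
```

Normalising both sides before comparing avoids mismatches when the BMC and the command line use different separators or letter case.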
We no longer try to set a single interface as "the" PXE interface. We now set a whole slew of interfaces - the server will try them all and use the one that works. So we want Ironic to support HTTP boot on any of the interfaces listed. At the time this hook runs, we don't know which interface will get used, so we set the "pxe" flag on all of them.
We see HTTP 503 errors from Redfish for a while after booting the server.
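One common way to tolerate a temporary run of 503 responses is to retry with a growing delay. The sketch below is only an illustration of that idea, not the retry logic in this PR: `call_with_retry`, its parameters, and the convention that `request` returns a `(status, body)` tuple are all hypothetical.

```python
import time


def call_with_retry(request, attempts=5, delay=2.0, sleep=time.sleep):
    """Call request() until it returns a status other than 503.

    request: hypothetical zero-argument callable returning (status, body).
    attempts: maximum number of calls before giving up.
    delay: base delay in seconds; the wait grows linearly per attempt.
    sleep: injectable for testing.
    """
    for attempt in range(1, attempts + 1):
        status, body = request()
        if status != 503:
            return status, body
        if attempt < attempts:
            sleep(delay * attempt)  # linear backoff between attempts
    raise TimeoutError(f"still receiving HTTP 503 after {attempts} attempts")
```

Any status other than 503 is returned to the caller unchanged, so genuine errors such as 401 or 404 are not masked by the retry loop.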
0aaf487 to 1924915
cardoe
left a comment
So overall I'm down for the part that gets rid of the argo-workflow and calls it with Python. I'm -1 on all the Redfish API calls because we should be doing this via Ironic. The logic on retrying and rebooting still doesn't cover all the behaviors of the hardware that have been tested via Ironic but there's no point in adding it here.
Note that this PR doesn't add that Redfish interaction - it is already happening, I'm mostly just refactoring. To move in the OpenStack direction requires significant changes that deserve careful planning and testing. I'm happy to take this on, but I am concerned that we are postponing delivery of a very time-sensitive capability, notably the ability to on-board the RXDB gear. The way they cabled up those racks we have no way to enrol them.
I would therefore suggest the following as follow-on PRs:
I don't think we can easily remove all the Redfish interaction without adding significant new features to Ironic. Some of it should be doable though:
I think we still need the external orchestration because it's not obvious how to do all this glue via steps/runbooks/hooks/etc.
Something to explore in future (not for this PR) https://www.dell.com/support/manuals/en-us/poweredge-r7615/idrac9_7.xx_ug/configuring-servers-and-server-components-using-auto-config?guid=guid-d38bd838-ad66-414b-a61f-0cb241f6459e&lang=en-us
Yes this is an interesting feature (although I'd put money on it not working properly). It is irksome that they chose the "golden" model, where you configure a server how you like it and then "save" that config to a file. I'd prefer if they documented the options/schema for the file to support us building it ourselves, in true declarative style. It requires a file tailored to the exact hardware (NICs, RAID/disks, etc) so we still need the same kind of logic to support this. Autodiscovery is designed as a one-shot initial setup. I guess you could maybe reset the iDRAC and get it to re-configure the server but there is no real support for managing the server during its lifetime.
This is somewhat of an opinionated move from a large, complex workflow
that calls lots of different python scripts to a minimal workflow that
executes one single script to take all required steps.
In theory there are benefits to the multiple workflow steps, but in
practice I don't think we really benefit from the hardcore argo
functionality, and the workflows are fragile, difficult to read, almost
impossible to test, and hard to troubleshoot. By contrast the
python script is easily maintainable.
The python script is "beefed up" with extra capabilities and it now
handles the transitioning of the node through the various stages,
including configuring RAID, running inspection, etc.
Logging and error handling are improved.
Ability to cope with non-standard cabling is improved: we used
to rely completely on a cabling convention to determine which port
should be used for PXE.
We now use the LLDP data reported by the BMC (when available - note that
we have hardware where the LLDP feature does not work), and we enable HTTP
boot on ALL interfaces that the BMC reports to be connected. The server
will attempt each one in turn until it gets a DHCP response.
We persist the set of PXE interfaces in the extra field of the baremetal
node. This is then used by the inspection hook to set the "pxe" flag on
those same interfaces.
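
The hand-off between enrolment and the inspection hook could look roughly like the sketch below. It uses plain dictionaries rather than real Ironic objects, and the key name `pxe_interfaces`, the port fields `name` and `pxe_enabled`, and both function names are assumptions for illustration, not the identifiers used in this PR.

```python
PXE_EXTRA_KEY = "pxe_interfaces"  # hypothetical key in the node's extra field


def record_pxe_interfaces(node_extra, interfaces):
    """Return a copy of the node's extra dict with the PXE interfaces stored."""
    updated = dict(node_extra)
    updated[PXE_EXTRA_KEY] = sorted(interfaces)
    return updated


def flag_pxe_ports(node_extra, ports):
    """Return copies of the ports with the pxe flag set from the stored list.

    ports: hypothetical list of dicts, each with at least a "name" key.
    """
    pxe_names = set(node_extra.get(PXE_EXTRA_KEY, []))
    return [{**port, "pxe_enabled": port["name"] in pxe_names} for port in ports]
```

Storing the list on the node means the hook does not have to repeat the LLDP lookup; it only has to match port names against what enrolment recorded.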