Iwo Mergler said:
8 man-years sounds excessive. Did any of those people
have previous Linux-porting experience?
2 are hardened expert developers, 1 intermediate (me), 1 project manager, 1
system manager, 2 testers and 1 writing documentation and handling the
releases back to Open Source (which needs a process so at least one does not
deliberately leak patented IP).
Otherwise a
year of learning curve is realistic. Recent kernels
are a lot easier to port, but take longer to understand.
I suppose I should qualify "Linux port". In my understanding
that means porting the kernel. That usually involves adjusting
a few addresses and rewriting or adapting the low level
assembler stuff. Most new drivers tend to be unnecessary,
as most jellybean hardware is compatible with the same thing
on a different platform. It's not always obvious.
Keep Dreaming ;-)
This particular setup was a disk-less dual Opteron card (actually *using* IPMI
for syslog reporting to a manglement system). The card must reserve one core on
each CPU for a high priority process, a virtual machine, while the rest of the
kernel, IRQs and user stuff goes on the second core. In the case of a kernel
panic, a capture kernel is kept in memory; its only task is to perform a
memory dump over TCP, clean up the mess and restart without losing the data in
the VM (because there is a database in there). If that fails the board reboots.
The Init process deals with reserving the cores and redirecting IRQs. Athlon
was kind-of new two years ago.
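
A minimal sketch of the core split and IRQ redirection described above, not the actual init code. It assumes logical CPUs are numbered socket by socket and that the first core of each socket is the reserved one; both are guesses, and the function names are made up for illustration. The only real interface used is /proc/irq/<n>/smp_affinity, which takes a hex CPU bitmask and needs root to write.

```python
import os


def split_cores(sockets, cores_per_socket):
    """Split logical CPUs into (reserved, housekeeping) lists.

    Assumption: CPUs are numbered socket by socket, and the first core
    of every socket is the one reserved for the high priority process.
    """
    reserved, housekeeping = [], []
    for socket in range(sockets):
        base = socket * cores_per_socket
        reserved.append(base)
        housekeeping.extend(range(base + 1, base + cores_per_socket))
    return reserved, housekeeping


def cpu_mask(cpus):
    """Return the hex bitmask string for a list of logical CPU numbers."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Steer one IRQ to the given CPUs (hypothetical helper, needs root)."""
    path = os.path.join(proc_root, "irq", str(irq), "smp_affinity")
    with open(path, "w") as handle:
        handle.write(cpu_mask(cpus))
```

On a dual-socket, dual-core board, split_cores(2, 2) keeps CPUs 0 and 2 for the VM and leaves 1 and 3 for everything else, so every IRQ would get the mask "a".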
Now, the easy way, the one with "a few addresses adjusted" would be a Linux-BIOS.
Boot up, job done.
However, that is not "industry standard"; the BIOS must be allowed to piss over
everything first (just in case someone wants to run Windows on the board).
Instead PXE-Linux loads the kernel and boots (but PXE-Linux's path naming
algorithm is soo gross that it needs a patch: It assumes one unique kernel per
board id'ed by macaddr; we want the same kernel for *many* boards in a location
*we* name). Then the init process must undo what the BIOS+PXE-Linux broke (as
far as possible) and ... there are some gross hacks to find the address where
the capture kernel is to be loaded (the kernel cannot touch that memory area).
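
For context on the naming complaint: stock PXELINUX, as documented, picks its configuration file by trying a name built from the MAC address first, then the client IP address in upper-case hex with one digit dropped per attempt, then "default". The sketch below only reproduces that lookup order (leaving out the optional client UUID step); it is not the patch mentioned above, and the function name is made up.

```python
import ipaddress


def pxelinux_config_candidates(mac, ip):
    """List the config file names stock PXELINUX tries, in order.

    The optional client-UUID lookup that precedes these is left out.
    """
    # "01" is the ARP hardware type for Ethernet, followed by the MAC
    # address in lower-case hex with dashes instead of colons.
    names = ["01-" + mac.lower().replace(":", "-")]

    # The IPv4 address as 8 upper-case hex digits, then the same string
    # with one more trailing digit removed on each attempt.
    ip_hex = format(int(ipaddress.IPv4Address(ip)), "08X")
    names.extend(ip_hex[:length] for length in range(8, 0, -1))

    names.append("default")
    return names
```

None of these names lets a group of boards in one location share a name chosen by the operator, short of falling through to "default" or matching an IP prefix.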
Because the kernel cannot touch that memory, ECC error correction will not be
triggered and the capture kernel might be corrupt when the real kernel
eventually panics, so we need memory scrubbing. Memory scrubbing is not supported
in the kernel for K8 (Athlon/Opteron), so we have to fix the bleeding-edge EDAC
driver so it does.
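
For reference, the usual kdump way to set aside the memory for a capture kernel is the crashkernel=size@offset boot parameter. The sketch below just pulls those two numbers out of a command line string; it is not the "gross hack" mentioned above, the function name is made up, and it only handles plain decimal values with an optional K, M or G suffix.

```python
import re

# Multipliers for the size suffixes accepted by this sketch.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

_CRASHKERNEL = re.compile(
    r"(?:^|\s)crashkernel=(\d+)([KMGkmg]?)@(\d+)([KMGkmg]?)(?=\s|$)"
)


def parse_crashkernel(cmdline):
    """Return (size, offset) in bytes, or None if no reservation is given."""
    match = _CRASHKERNEL.search(cmdline)
    if match is None:
        return None
    size = int(match.group(1)) * _UNITS[match.group(2).upper()]
    offset = int(match.group(3)) * _UNITS[match.group(4).upper()]
    return size, offset
```

On a running system the string would typically come from /proc/cmdline.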
Then there are thousands of things that do not quite work as advertised - like
truncating of core dumps, for example, which apparently nobody ever used. Getting
kexec to work from one kernel to another was easy - getting kexec to work once
more back to the old kernel was HARD; nobody kexec's twice, apparently. Etc.
The high-resolution timer patch was not available for Athlon back then (kernel
2.6.16) so we had to use /dev/rtc for some "microsleep" stuff. There is a BUG in
/dev/rtc - some interaction between the HPET hardware and the software emulation
of the RTC device used in new kernels, which we never quite got to the bottom of ...
but RTC is a fossil so it will never get fixed. RTC also goes offline every 11
minutes when NTP updates the kernel time. Oh!
The testing, fixing, workarounding and documenting of all the niggles and
failures takes time. However, we are now quite convinced that this system *will*
run for 10 years without a hard reboot and the customer knows how to use it from
the documentation!
The build system takes a lot of time too - getting from the standard kernel
source to the thing we ship in a sane way takes a lot of design. Basically
we do as RedHat does: take a standard kernel and patch the hell out of it before
the build. But just to make the process more fragile and suck more disk and
network bandwidth, we must use Clearcase - the corprat standard!