-
EisNerd
can someone help me with gerrit?
-
EisNerd
I'd like to resolve a merge conflict on a change submitted by someone else (Gordon in this case)
-
andyf
I've seen a couple of reports of freshly upgraded systems crashing in the ZFS z_upgrade task, which is something to do with object accounting upgrades
-
andyf
I have a crash dump but I'm a bit stuck, can anyone help me get a bit further?
-
jbk
that sounds like it might be the same thing arekinath's hit
-
andyf
In both cases, it seems to have been a one-off thing.. no crashes afterwards
-
andyf
I'm just not sure I can get anything more from that dump, and I can't see a candidate for an early dereference in the task callback
-
andyf
jbk - this one seems like it might be relevant: openzfs/zfs c0daec32f839f687a7b6
-
igork1
andyf: how old was the pool before you upgraded it to the latest features?
-
igork
or did you just import an old pool into the latest zfs code?
-
andyf
both that I know of are just upgrades from r34 to r36. It isn't a zfs pool upgrade, it's a dmu object set upgrade
-
andyf
and, from this latest crashdump, a zfs recv was occurring at the same time, which makes that openzfs patch interesting
-
igork
it all depends on the situation - if you have a pool from before object quota was integrated -> save data -> export it -> import it into new code where it needs an upgrade - you have to wait until the full upgrade is done before send/receive
-
igork
if you have a checkpoint - you can revert to it
-
igork
but it all depends on the new features
-
andyf
right, but the system should not panic
-
rmustacc
Failure should never cause a panic.
-
igork
it tries to manipulate every block
-
igork
but it panics to keep the data safe
-
rmustacc
I am hard pressed to believe that's the only option in designing a system.
-
andyf
The fix in openzfs seems to match this case, but I cannot yet prove it.
-
andyf
there was definitely a ZFS recv occurring when the NULL pointer dereference occurred in the taskq
-
andyf
(for both systems that I have crash dumps for)
-
igork
the correct thing would be to try running a scrub
-
andyf
These aren't my systems, they are production systems at different places and I don't have access. I'm sure they run scrub pretty regularly
-
andyf
Looking at openzfs/zfs #5295 - I'm pretty sure this is what we are seeing. Dereference on the call to mutex_lock() right at the start of the task callback.
-
andyf
I'll get that patch pulled over and see if I can replicate it
-
tsoome
EisNerd: normally you ask the author, but if you are testing the code yourself, you can cherry-pick the patch and then rebase onto master
-
myrkraverk
Hi. I want to install Illumos on Virtual Box. What is a good distro for that?
-
myrkraverk
Is there someone with a ready made image?
-
danmcd
I use OmniOSce on VBox. I installed it from a .iso image, and ran with it from there.
-
danmcd
I suspect the other ones will work as well, but for VBox I used OmniOSce (because VBox is on my personal machine, and I'd installed OmniOS way back).
-
myrkraverk
Ok.
-
myrkraverk
I'll try that.
-
myrkraverk
I've used Indiana before, but don't need it specifically for what I want to do now.
-
rmustacc
You could use that just as well with the iso or usb image and set it up.
-
rmustacc
I don't believe there's anything particular about vbox here.
-
tsoome
even oracle solaris can run on vbox;)
-
Cthulhux
omnios is less weird than indiana
-
Cthulhux
in many things
-
andyf
jbk - thanks for the pointer, it does indeed look exactly illumos.org/issues/13194 ( arekinath's issue )
-
andyf
*like
-
jbk
looks like you might have found a fix as well
-
andyf
hopefully. I've merged it over from openzfs, but I'm just trying to replicate the original issue
-
jbk
btw, do you have the script jerry uses to port over openzfs fixes (it's just a large-ish sed script that'll correct the paths in an openzfs patch to illumos paths)
-
jbk
especially for larger patches, it can be a nice starting point
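
A minimal sketch of the path-rewriting idea described here, written in Python rather than sed. This is not jerry's script: the prefix pairs in `PATH_MAP` and the function names are illustrative assumptions, and a real port would need a longer map plus manual fixups afterwards.

```python
import re

# Hypothetical prefix map: OpenZFS tree -> illumos-gate tree. These pairs are
# assumptions for illustration, not the contents of the script mentioned above.
PATH_MAP = [
    ("module/zfs/", "usr/src/uts/common/fs/zfs/"),
    ("include/sys/", "usr/src/uts/common/fs/zfs/sys/"),
    ("cmd/zfs/", "usr/src/cmd/zfs/"),
]

# Only diff header lines are rewritten; hunk bodies are left alone.
HEADER = re.compile(r"^(diff --git |--- |\+\+\+ )")


def rewrite_line(line):
    """Rewrite the a/ and b/ paths on a single diff header line."""
    if not HEADER.match(line):
        return line
    for old, new in PATH_MAP:
        # Match the prefix only directly after git's a/ or b/ path marker.
        pattern = r"(^|\s)([ab]/)" + re.escape(old)
        line = re.sub(pattern, lambda m: m.group(1) + m.group(2) + new, line)
    return line


def rewrite_patch(text):
    """Apply rewrite_line to every line of a patch, keeping line endings."""
    return "".join(rewrite_line(l) for l in text.splitlines(keepends=True))
```

The result still has to be applied with `git apply` or `patch` and reviewed by hand; the rewrite only saves retyping paths.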
-
andyf
no, but I wrote something quickly
-
andyf
this one was pretty small
-
andyf
I don't know if I'll be able to replicate it, but it's a good match for the openzfs bug information
-
andyf
I've written a test case but I need to try and blow up the race a bit
-
jbk
i've got the one rwlock bit still out there, but in some testing it triggered the deadman timer, except the dump was so large, i couldn't extract it, then i accidentally deleted the BE w/ it :)
-
jbk
so i need to try to recreate it to see if it's actually related to the change or not
-
jbk
but i want to get the getgrouplist stuff finished first (need to finish up addressing the first round of feedback now that the holiday is over)
-
pmooney
tsoome: I just ONUed a r34 system onto recent -gate bits and loader appeared to be hosed on the next boot
-
pmooney
wondering if it might have something to do with nextboot
-
tsoome
no messages or anything?
-
tsoome
bios or uefi?
-
tsoome
or both
-
pmooney
if I boot in bios mode, boot2 complains about being unable to find loader/zfsloader
-
pmooney
trying to boot in uefi mode had some quick errors
-
pmooney
I can't capture them
-
pmooney
it prints them only to the vga console and then dumps out to the uefi shell
-
pmooney
I don't remember if I was booting this thing in BIOS mode or UEFI prior to the issue
-
pmooney
the boot options were hosed when I rebooted after ONU
-
pmooney
it's an x9 board, so kinda on the cusp of UEFI support
-
pmooney
(it's kinda shitty)
-
tsoome
hm, ok so we fail to read the pool
-
pmooney
this is that machine with the mirror + SLOG
-
tsoome
what does eeprom -bp tell?
-
andyf
I should probably know this, but does onu end up installing an updated loader?
-
pmooney
tsoome: give me a few minutes
-
pmooney
the omnios live image doesn't run at all over SoL
-
pmooney
so it's local vga console only
-
tsoome
you should be able to copy and install gptzfsboot and loader binary from older BE to restore the boot
-
tsoome
but the big question would be, what actually does happen.
-
tsoome
hm, if there is something odd in bootenv area, that should still leave the ability to read the pool by something like: ls zfs:rpool/ROOT/dataset:
-
tsoome
and lszfs should tell you the default dataset
-
pmooney
it's not making it that far into loader
-
pmooney
at least from the BIOS boot
-
tsoome
if that does not work, there must be something more going on...
-
pmooney
and uefi boot just kicks me to the uefi console
-
tsoome
boot: prompt does allow similar things, you can try zfs:rpool/ROOT/dataset:/boot/loader
-
pmooney
omnios installer does not ship with eeprom
-
pmooney
sure, but like
-
pmooney
unless it can list datasets
-
pmooney
I have no clue
-
tsoome
but anyhow, *if* there was something on bootenv area, the nextboot code should wipe it clean
-
tsoome
enter status to get list of devices
-
pmooney
in boot2?
-
tsoome
including pool
-
tsoome
on boot: prompt, yes
-
tsoome
anyhow, since nextboot code will (or should) wipe bootenv, that would imply there is something else going on.
-
pmooney
certainly possible
-
pmooney
this is my intel test machine, so its sole task is to ONU from omni r34 bits onto -gate bits
-
andyf
eeprom just manipulates /boot/solaris/bootenv.rc - in case that helps
-
pmooney
which it's been doing recently
-
tsoome
no, eeprom -bp will print pool bootenv data
-
andyf
ah, via the new ioctls?
-
tsoome
yes
-
andyf
got it, thanks
-
tsoome
it is a bit buggy still, and basically experimental, but... :D
-
pmooney
tsoome: 'status' shows the disks on the system
-
pmooney
but no loaded pools
-
pmooney
(the ZFS partitions are labeled as such, but that's it)
-
tsoome
experimental in the sense that there is data in bootenv, there is a mechanism via ioctl to manipulate the data, but what exactly is useful and what not...
-
tsoome
so, we fail to recognize the pool(s)
-
pmooney
apparently
-
pmooney
it could be that the boot2 is the originally installed one
-
pmooney
and that the system was booting via uefi previously
-
pmooney
but that doesn't explain why the uefi loader stuff also can't read the pools
-
tsoome
installboot -i can tell the version
-
pmooney
*pool
-
pmooney
unless it ships with the omnios installer, I basically have no userful commands on this machine
-
pmooney
*useful
-
pmooney
or anything in the boot2 prompt
-
tsoome
I guess installboot is there
-
pmooney
otherwise this system is a boat anchor right now
-
pmooney
is there anything else of use to collect from boot2?
-
andyf
/usr/sbin/installboot should be there
-
tsoome
you should be able to import the pool, mount old BE and copy over the /boot to restore the bootability
-
pmooney
the /boot in the efi partition?
-
tsoome
no, the /boot from the old BE
-
tsoome
actually, this omnios media, is it iso or usb?
-
pmooney
usb
-
tsoome
so you can esc out from its boot menu to loader prompt, then set currdev=zfs:rpool/ROOT/dataset: and enter boot
-
tsoome
note the last colon
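
A small sketch of the device string shape used here, in Python, only to make the trailing colon explicit. The helper names are hypothetical and this is not how loader itself parses `currdev`; it just mirrors the `zfs:rpool/ROOT/dataset:` form quoted above.

```python
def loader_currdev(pool, be_name):
    """Build a string of the form zfs:<pool>/ROOT/<be_name>: (note the final colon)."""
    return "zfs:{}/ROOT/{}:".format(pool, be_name)


def split_currdev(spec):
    """Return the dataset name from a zfs:<dataset>: string, or raise ValueError."""
    if not (spec.startswith("zfs:") and spec.endswith(":") and len(spec) > 5):
        raise ValueError("expected zfs:<dataset>: with a trailing colon")
    return spec[4:-1]
```

For example, `loader_currdev("rpool", "dataset")` gives `zfs:rpool/ROOT/dataset:`, and `split_currdev` rejects the same string with the final colon left off.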
-
tsoome
with usb image, you have the same boot: prompt as well, and you can start loader from disk, but in current state that won't help because the zfs reader code is the same...
-
tsoome
I can also give you userboot tree snapshot so you can try to see if userboot.so will read the pool and if not, where it does fail -- userboot can be debugged by normal debugger..
-
pmooney
ok, I'm booted back into the old (r34-ac) BE
-
tsoome
that pool is not anything interesting, is it? just mirror + slog?
-
pmooney
plus the null devices from adding/removing the slog
-
pmooney
this is the machine that exposed the mirror+slog boot issues in the first place
-
tsoome
ok.. but we managed to fix it...
-
tsoome
or maybe that fix was not complete...
-
pmooney
it's been working fine for months
-
pmooney
andyf even backported the stuff so I could set bootfs properly
-
pmooney
(in r24)
-
pmooney
*34
-
arekinath
andyf: hmm... that bug does sound like a plausible match, definitely the most plausible one so far that I've seen
-
arekinath
especially the instructions on how to repro, needing lots of mounts and upgrades going on at once (most easily triggered by constant recv traffic)
-
tsoome
yea, since you can read pool fine with older loader, the nextboot update must have some issue still
-
pmooney
so if I switch back to the old BE, it should use the old loader?
-
pmooney
or do I need to copy over existing files?
-
tsoome
if you want to switch back, you have to force installboot to install previous gptzfsboot
-
andyf
arekinath - I'm building that patch then I'll reboot and run the zfs testsuite, and get it up for review tomorrow. I haven't managed to replicate it yet even with those instructions
-
arekinath
yeah it's a tough one. as I noted, I've only ever been able to get it to happen in prod
-
tsoome
you can use the new BE, but you would need to copy over gptzfsboot + loader and still force installboot to install the older gptzfsboot
-
arekinath
which is super annoying
-
pmooney
tsoome: step 1 is getting this machine so it can just boot on its own
-
pmooney
(on the old BE)
-
wonko
I'm trying to transition from one NIC to another. Everything uses an aggr that sits on top of the physical interfaces. they connect to different switches so I can't do LACP on the switch side. What's going to be the easiest way to transition over?
-
tsoome
pmooney: 148-52-235-80.sta.estpak.ee/userboot.tar is my snapshot (to be extracted in usr/src/boot/sys/boot, it will create test and userboot.so), so you can use it as: ./test -b ./userboot.so -d disk1 -d disk2 -d disk3 ...
-
tsoome
it takes whole disk device or file.
-
pmooney
will look into it once I have this machine booting again
-
tsoome
um, it is quite possible it needs a few more updated files to actually build.. darn. anyhow, it's almost 1am here, need to sleep some.
-
pmooney
tsoome: if you get a tarball of binaries, I'm happy to take it for a spin on that machine
-
pmooney
definitely something we can do later