[Wayland] "Lost connection to Wayland compositor."
Categories
(Core :: Widget: Gtk, defect, P1)
People
(Reporter: emilio, Assigned: stransky)
References
(Blocks 2 open bugs)
Details
(Keywords: topcrash)
Attachments
(4 files, 2 obsolete files)
I've been running Plasma for the last couple of days, and sometimes Firefox or Thunderbird will crash without the crash reporter showing up, with the following message in the journal:
Lost connection to Wayland compositor.
The bad part is that there's actually no crash report or anything, so we just don't see these crashes...
Martin, do you have any idea how to track this down, or possibly intercept these _exit calls from GTK?
Comment 1•3 years ago
Did kwin_wayland crash? If not, it would be great to have a WAYLAND_DEBUG log.
Reporter
Comment 2•3 years ago
It did not. How can I get that log?
Comment 3•3 years ago
Run Firefox as follows: env WAYLAND_DEBUG=1 firefox
Do you run Firefox Nightly? I haven't seen any crashes with Firefox 94, i.e. stable branch?
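For capturing such a log to a file, a minimal Python sketch follows. It assumes the binary is called firefox and is on PATH; the log file name and the helper names wayland_debug_env and run_with_wayland_debug are made up for illustration. libwayland writes the WAYLAND_DEBUG protocol trace to stderr, so that is the stream redirected here.

```python
import os
import subprocess


def wayland_debug_env(base_env=None):
    """Return a copy of the environment with WAYLAND_DEBUG=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["WAYLAND_DEBUG"] = "1"
    return env


def run_with_wayland_debug(command, log_path):
    """Run `command` and append its stderr (where the trace goes) to log_path."""
    with open(log_path, "ab") as log:
        result = subprocess.run(command, env=wayland_debug_env(), stderr=log)
    return result.returncode


# Usage (assumed binary name and log path):
#   run_with_wayland_debug(["firefox"], "wayland-debug.log")
```

The log can get large quickly, so it is probably worth starting a fresh file per session.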
Comment 4•3 years ago
/i.e. stable branch?/i.e. stable branch/
Reporter
Comment 5•3 years ago
Ah, I thought you meant WAYLAND_DEBUG in kwin, thanks. Will run it like that for a while and see if I hit this.
And yeah, this is Firefox Nightly and Thunderbird Daily as well. I've seen this mostly when my computer is super-busy (building or what not).
Comment 6•3 years ago
Sheer luck that I saw the email from this bug. Still, this one might be similar or a duplicate: https://bugzilla.mozilla.org/show_bug.cgi?id=1718851
Note that since then I've switched to Arch (but haven't set it up fully yet as I'm doing some workarounds), so I might not be the right guy to reproduce it.
Tl;dr: Check if there's something useful at the bug link above.
Comment 7•3 years ago
The bug from Comment #6 starts out as a different crash, probably due to EarlyOOM, but at the end of it there are two different error messages about the compositor: one is the one from this bug and the other is again something compositor-related. I don't think there are useful WAYLAND_DEBUG logs there (the one there was not taken when the crash happened).
Reporter
Comment 8•3 years ago
Here's a log of a crashing session. There's not much going on there I think. It seems it crashed with:
Gdk-Message: 20:39:12.453: Error flushing display: Broken pipe
(at the end of the log). Journal for that time is:
Nov 27 20:37:31 ryzen plasmashell[3715]: kf.plasma.quick: Couldn't create KWindowShadow for ToolTipDialog(0x56535cca8580)
Nov 27 20:37:31 ryzen plasmashell[3715]: kf.plasma.quick: Couldn't create KWindowShadow for ToolTipDialog(0x56535cca8580)
Nov 27 20:37:31 ryzen plasmashell[3715]: kf.plasma.quick: Couldn't create KWindowShadow for ToolTipDialog(0x56535cca8580)
Nov 27 20:38:30 ryzen kernel: xhci_hcd 0000:06:00.3: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Nov 27 20:38:30 ryzen kernel: xhci_hcd 0000:06:00.3: Looking for event-dma 00000000ffc368e0 trb-start 00000000ffc368f0 trb-end 00000000ffc368f0 seg-start 00000000ffc36000 seg-end 00000000ffc36ff0
Nov 27 20:38:40 ryzen pipewire[4105]: spa.alsa: surround21:0: snd_pcm_mmap_commit error: Broken pipe
Nov 27 20:39:08 ryzen plasmashell[7578]: ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
Nov 27 20:39:12 ryzen /usr/libexec/gdm-wayland-session[3417]: error in client communication (pid 19408)
Nov 27 20:39:34 ryzen /usr/libexec/gdm-wayland-session[3417]: kwin_libinput: Libinput: event2 - Ultimate Gadget Laboratories Ultimate Hacking Keyboard: client bug: event processing lagging behind by 31ms, your system is too slow
So I guess the issue is the "error in client communication"? That's not very descriptive...
Or maybe the kernel sound error? But sound still works after restarting firefox fwiw, and I don't know why that would end up being a wayland display flush IO error...
Vlad, does the above give you any hint or any further thing I could try to diagnose this?
Comment 9•3 years ago
Heh, that's a big wayland debug log. At quick glance, I don't see anything suspicious besides firefox stopping responding to pointer events before the client connection breaks.
Reporter
Comment 10•3 years ago
Ah, I figured out how to reproduce this consistently! It's just a matter of the parent process main thread being busy for long enough. If I open the browser toolbox and in the console I paste:
var s = Date.now(); while (s + 5000 > Date.now());
Which should just stop responding for a bit, instead we crash.
Reporter
Comment 11•3 years ago
Hmm, apparently not so consistently as I hoped...
Reporter
Comment 12•3 years ago
But frequently enough
Assignee
Comment 13•3 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #0)
I've ran plasma this last couple days and sometimes Firefox or Thunderbird will crash without crash reporter with the following message in the journal:
Lost connection to Wayland compositor.
The bad part is that there's actually no crash report or anything, so we just don't see these crashes...
Martin do you have any idea of how to track this down or possibly intercept these _exit calls from gtk?
Looks like a glib2 issue when some bad condition happens on the socket. Maybe strace can help you here?
Comment 14•3 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #10)
...
Which should just stop responding for a bit, instead we crash.
For the record, this might trigger a yet unsolved problem on Wayland: if the main thread is blocked long enough and does not read from the Wayland socket, the buffer will fill up. Once that happens, the connection gets terminated. This happens especially fast with e.g. 1000Hz mice, see e.g. https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1915#note_1197910
In case that's the problem, there are two solutions:
- wait for compositors/libwayland to figure out how to better handle this
- move wayland socket polling into its own thread (somehow steal it from GTK). This would likely also benefit Vsync timings, as any short hang on the main thread creates stutter in our frame callback based vsyncsource (essentially bug 1675680).
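The failure mode described above can be sketched with a toy model in Python. This is only an illustration under stated assumptions, not libwayland's actual logic: the class BoundedConnection, the function survives_stall, and all the numbers are hypothetical, and real connections also involve kernel socket buffering.

```python
from collections import deque


class BoundedConnection:
    """Toy model of a compositor-to-client connection with a fixed-size buffer."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.queued = deque()
        self.queued_bytes = 0
        self.alive = True

    def send(self, event_bytes):
        """Compositor side: queue one event, or drop the client if it does not fit."""
        if not self.alive:
            return False
        if self.queued_bytes + event_bytes > self.capacity_bytes:
            self.alive = False  # the client sees this as a lost connection
            return False
        self.queued.append(event_bytes)
        self.queued_bytes += event_bytes
        return True

    def drain(self):
        """Client side: read everything pending, as an unblocked main thread would."""
        drained = len(self.queued)
        self.queued.clear()
        self.queued_bytes = 0
        return drained


def survives_stall(capacity_bytes, event_bytes, events_per_ms, stall_ms):
    """Send events for stall_ms milliseconds with no reads; report client survival."""
    conn = BoundedConnection(capacity_bytes)
    for _ in range(stall_ms):
        for _ in range(events_per_ms):
            conn.send(event_bytes)
    return conn.alive
```

With a made-up 1024-byte buffer and 32-byte events at one event per millisecond, survives_stall(1024, 32, 1, 10) stays alive while survives_stall(1024, 32, 1, 100) does not. With zero events per millisecond the model survives a stall of any length, since nothing is queued.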
Comment 15•3 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #11)
Hmm, apparently not so consistently as I hoped...
(In reply to Emilio Cobos Álvarez (:emilio) from comment #12)
But frequently enough
If the theory from above / comment 14 applies, then it would only crash if you create events - e.g. moving the mouse over the surface etc.
Reporter
Comment 16•3 years ago
(In reply to Robert Mader [:rmader] from comment #14)
(In reply to Emilio Cobos Álvarez (:emilio) from comment #10)
...
Which should just stop responding for a bit, instead we crash.
For the record, this might trigger a yet unsolved problem on Wayland: if the main thread is blocked long enough and does not read from the wayland socket, the buffer will fill up. Once that happens, the connection gets terminated. This happens especially fast with e.g. 1000Hz mice, see e.g. https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1915#note_1197910
Indeed it's much easier to trigger if I scroll the mousewheel and move the cursor around while this happens... So it seems like a reasonable theory. I'll try to catch it on rr to dig a bit more, ni?ing myself so I don't forget.
In case that's the problem, there are two solutions:
- wait for compositors/libwayland to figure out how to better handle this
- move wayland socket polling into its own thread (somehow steal it from GTK). This would likely also benefit Vsync timings, as any short hang on the main thread creates stutter in our frame callback based vsyncsource (essentially bug 1675680).
How hard is (2)? I hit this frequently enough to be annoying :-)
Comment 17•3 years ago
See https://gitlab.freedesktop.org/wayland/wayland/-/issues/159, https://gitlab.freedesktop.org/wayland/wayland/-/issues/237, https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188.
I do not believe there is a good way to move polling to its own thread (Gtk owns the display connection and needs to be the one dispatching the callbacks). One could try to also poll and read into queues from another thread, but that would be competing with the main thread (the queues are locked prior to I/O) and would be quite hacky in my opinion. Threaded poll/read will also increase jitter/latency as most messages have to go back to the main thread anyway for dispatch.
Reporter
Comment 18•3 years ago
Thanks Kenny! So if the theory is correct, presumably I'm not hitting this (so often / at all?) on GNOME because:
GNOME detects the client being unresponsive by a timeout, when it does not get a pong reply to a ping event.
So that's an enhancement that KWin could make. But the real fix would be something like https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188
Is my understanding right?
Reporter
Comment 19•3 years ago
So it's a bit unfortunate that GTK is calling _exit rather than exit(). exit() would be caught by the crash reporter via this hook IIUC, but _exit won't run that code. I'll try to see if GTK can be changed to not use _exit...
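The difference between the two calls can be shown with a small Python analogy. It is a sketch under an assumption: the crash reporter hook is treated here as an atexit-style handler, which the thread does not confirm. Python's sys.exit() runs registered atexit handlers, while os._exit() ends the process immediately without running them. The names CHILD_TEMPLATE and run_child are made up for this example.

```python
import subprocess
import sys

# Child program: registers an exit hook, then ends via the given call.
CHILD_TEMPLATE = """
import atexit, os, sys
atexit.register(lambda: print("exit hook ran", flush=True))
{exit_call}
"""


def run_child(exit_call):
    """Run a child interpreter that ends with `exit_call`; return (status, stdout)."""
    code = CHILD_TEMPLATE.format(exit_call=exit_call)
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    return result.returncode, result.stdout.strip()


# Normal exit: registered hooks run before the process ends.
normal = run_child("sys.exit(1)")
# Immediate exit: the process ends without running any hooks.
immediate = run_child("os._exit(1)")
```

Both children end with status 1, but only the first one prints from its hook, which is why a process that goes down through _exit leaves nothing behind for a hook-based reporter to pick up.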
Comment 20•3 years ago
If I am reading the source right, Mutter's ping timeout is 5 seconds, so that's geared towards severe stalls (and the artificial ones induced with SIGSTOP in the linked issues). The issues discussed within the Wayland community have primarily focused on reasonably responsive clients drowning in event floods like high resolution scrolling and high DPI mouse input without having done anything wrong.
The solutions discussed in #wayland@irc.oftc.net have so far been a throttling mechanism to prioritize "important" events when a client is seen to not process their input fast enough, and "large enough"/dynamic/unbounded connection buffers like !188 implements. The current connection buffers are just 4KB, so it might very well be that a bigger buffer would be all it takes to give all reasonable clients enough time to come back to a read.
It might be a good idea to comment on the Wayland issues to highlight that clients are experiencing issues, but in that case we want to establish which category we're in: reasonable stalls (tens or hundreds of milliseconds) with event floods, or severe stalls (seconds) with low-to-normal rates of events.
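A rough back-of-the-envelope check for the two categories, as a Python sketch. The 4KB figure is the one quoted above; the 32-byte average event size and the two event rates are assumptions picked for illustration, and the function names are hypothetical. The model also ignores any buffering in the kernel socket, so real numbers are likely larger.

```python
def seconds_until_full(buffer_bytes, events_per_second, bytes_per_event):
    """Time an unread buffer takes to fill at a steady event rate (inf if no events)."""
    bytes_per_second = events_per_second * bytes_per_event
    if bytes_per_second <= 0:
        return float("inf")
    return buffer_bytes / bytes_per_second


def stall_is_fatal(stall_seconds, buffer_bytes, events_per_second, bytes_per_event):
    """True if a stall of the given length outlasts the buffer in this simple model."""
    return stall_seconds > seconds_until_full(
        buffer_bytes, events_per_second, bytes_per_event
    )


BUFFER_BYTES = 4096    # the 4KB connection buffer mentioned above
BYTES_PER_EVENT = 32   # assumed average size of one input event on the wire

flood = seconds_until_full(BUFFER_BYTES, 1000, BYTES_PER_EVENT)  # 1000 events/s
trickle = seconds_until_full(BUFFER_BYTES, 20, BYTES_PER_EVENT)  # 20 events/s
```

Under these assumptions the flood case fills the buffer in roughly 0.13 seconds, so a stall of a couple hundred milliseconds would already be enough, while the trickle case takes about 6.4 seconds and only a severe stall would hit it.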
Comment 21•3 years ago
FTR, for Mutter there's also a workaround being discussed as we want to unthrottle input devices for Gnome 42 (likely too early for the Wayland solution), which in case of 1000Hz mice would otherwise easily crash many clients like games.
Reporter
Comment 22•3 years ago
I haven't hit this (so much?) recently, but it seems this is well-understood at this point and there's not all that much we can do on our side (other than avoid jank when possible of course).
Comment 23•3 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #22)
I haven't hit this (so much?) recently, but it seems this is well-understood at this point and there's not all that much we can do on our side (other than avoid jank when possible of course).
I agree, we can't do much to avoid the crash but we may be able to improve our crash reports via https://gitlab.gnome.org/GNOME/gtk/-/issues/4514#note_1328534.
Comment 25•3 years ago
I am seeing this too with Thunderbird on GNOME. Thunderbird hangs for a while, then exits. There is no crash report, but I get this in the logs.
% thunderbird
Gdk-Message: 15:59:45.495: Lost connection to Wayland compositor.
Exiting due to channel error.
Exiting due to channel error.
I'll try to grab the status with WAYLAND_DEBUG=1.
This happens very frequently on startup, presumably because there is a lot of background work while the UI hangs. However this often happens during the "steady state" as well. I see this frequently, probably more than once per day on average.
Let me know if I should file a Thunderbird-specific issue, because this seems to be exacerbated by the general slowness of Thunderbird with my not-small mailboxes, but the root cause is likely the same between the two programs. Presumably moving slow things off of the main thread would sufficiently mitigate this issue.
Assignee
Comment 26•3 years ago
Thunderbird is not supposed to run on Wayland yet, as it's based on the old ESR91 line.
Comment 27•3 years ago
Oh ok. I guess I accidentally hardcoded it on when trying to use Firefox on Wayland. Well in case it is helpful anyways...
Comment 28•3 years ago
I've been getting this for months on Fx...
Jul 28 13:01:09 firefox[42125]: Lost connection to Wayland compositor.
Jul 28 13:01:09 plasmashell[43933]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43930]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43793]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43753]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43749]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43402]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43856]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43398]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43266]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43175]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[42432]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[42324]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[42283]: Exiting due to channel error.
Jul 28 13:01:14 systemd[1786]: app-firefox-b6455ea50864449eb4b19acaeb9d238a.scope: Consumed 2min 33.683s CPU time.
Comment 29•3 years ago
Still happening with Fx 103.
Comment 30•3 years ago
BTW, this isn't happening with Chrome and Ozone. Would be nice to get someone to look at it.
Comment 31•3 years ago
(In reply to gbcox from comment #30)
BTW, this isn't happening with Chrome and Ozone. Would be nice to get someone to look at it.
Please see above, e.g. comment 22. There is currently little we can do about it without bigger architectural changes. Wayland compositors can work around the issue pretty far (Gnome does, for example), but there's also a need for a clean solution on the Wayland core level.
Comment 32•3 years ago
BTW, this isn't happening with Chrome and Ozone.
I'm not so sure. I recently somehow managed to cause a strange emergent busy-loop.
(uBlock Origin hiding a modal, causing the website to sometimes go into a frenzy,
presumably attempting to show it as long as it's not visible, but uBO kept hiding it?)
The only reason I was able to put 2 and 2 together is that Firefox started crashing
in minutes of me replicating that uBO rule on it - which led me to this issue.
And AFAICT it's easy to repro using the busy-looping that emilio posted, i.e.:
var s = Date.now(); while (s + 5000 > Date.now());
But you need to inspect an extension instead, to do this in Chrome in an impactful
way (I used "Inspect popup" in the menu for uBO, I suspect most extensions will work).
If it doesn't work at first, just multiply the constant by 2 or 3.
FWIW what I repro'd with was Google Chrome Stable 104.0.5112.101, on Linux/x64.
This is how it then failed for me (manually wrapped/spaced for readability):
Tue 2022-08-30 20:06:17 EEST eddyb-nix user@1000.service/init.scope[14560]:
app-google\x2dchrome-c48d6f01e8f940899cf94e1a9c8b794f.scope: Consumed 3min 29.538s CPU time.
Tue 2022-08-30 20:06:17 EEST eddyb-nix user@1000.service/app-google\x2dchrome-c48d6f01e8f940899cf94e1a9c8b794f.scope[2034552]:
[2034002:2034002:0830/200617.463836:ERROR:wayland_event_watcher.cc(61)] Fatal Wayland communication error: Broken pipe.
Tue 2022-08-30 20:06:17 EEST eddyb-nix user@1000.service/plasma-kwin_wayland.service[15235]:
error in client communication (pid 2034002)
I can't find a matching Chrome bug report other than maybe https://crbug.com/1290059
(but that still could be unrelated). NB: I don't plan to report this to the Chrome bug
tracker myself, since this is so many layers of "drive-by", I'm liable to waste their time.
Comment 33•3 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #26)
Thunderbird is not supposed to run on Wayland yet as it's based on old ESR91 line.
Is this supposed to be fixed now with Thunderbird 102?
I still very often see Thunderbird on Wayland randomly crash the KDE compositor KWin (or some other component?).
The result is always that the mouse cursor is stuck and I can't do anything any more (music, for example, continues to play though).
This does not happen on a specific action; mostly I am just hovering over something in the application. But those crashes only appear when Thunderbird on Wayland is open.
Comment 34•3 years ago
(In reply to lrdarknesss from comment #33)
Is this supposed to be fixed now with Thunderbird 102?
No, please see comment 31 / comment 22. Kwin could do better (Gnome/Mutter does) by avoiding sending pointer events to unresponsive clients, FF/TB could do better in avoiding main thread stalls (AFAIK TB is worse than FF in this regard) - and there could be some bigger changes in our Wayland architecture, creating a Wayland thread. Unfortunately this is a bit hard to pull off in a clean way. Finally there are discussions to handle this better on the Wayland library level.
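As a rough illustration of the compositor-side idea (not sending pointer events to unresponsive clients), here is a Python sketch. It is not how Mutter or KWin implement it: ClientState and its methods are hypothetical, and the policy of keeping only the newest withheld position is an assumption made for the example. The timeout value is a parameter, in the spirit of the 5 second ping timeout mentioned earlier in the thread.

```python
class ClientState:
    """Hypothetical per-client bookkeeping a compositor might keep."""

    def __init__(self, ping_timeout_s):
        self.ping_timeout_s = ping_timeout_s
        self.ping_sent_at = None   # time of the oldest unanswered ping, if any
        self.sent = []             # events actually written to the client
        self.held_motion = None    # newest pointer position withheld so far

    def ping(self, now):
        """Record a ping unless one is already outstanding."""
        if self.ping_sent_at is None:
            self.ping_sent_at = now

    def pong(self):
        """Client answered: mark it responsive and deliver the last held position."""
        self.ping_sent_at = None
        if self.held_motion is not None:
            self.sent.append(("motion", self.held_motion))
            self.held_motion = None

    def responsive(self, now):
        """A client is responsive until a ping has gone unanswered past the timeout."""
        if self.ping_sent_at is None:
            return True
        return now - self.ping_sent_at < self.ping_timeout_s

    def pointer_motion(self, now, position):
        """Forward motion to responsive clients; otherwise keep only the newest one."""
        if self.responsive(now):
            self.sent.append(("motion", position))
        else:
            self.held_motion = position
```

In this model an unresponsive client accumulates at most one pending position no matter how many motion events arrive, so the amount of queued data stops growing with the mouse rate.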
Comment 35•3 years ago
Okay, so Thunderbird on Wayland is not a viable option atm, at least not on KDE.
Comment 36•2 years ago
I'm also seeing this on sway several times an hour (using nightly).
Gdk-Message: 08:29:24.126: Lost connection to Wayland compositor.
Thread 1 "firefox-bin" hit Breakpoint 3, __GI__exit (status=status@entry=1) at ../sysdeps/unix/sysv/linux/_exit.c:27
27 {
(gdb) bt
#0 __GI__exit (status=status@entry=1) at ../sysdeps/unix/sysv/linux/_exit.c:27
#1 0x00007ffff636b7c3 in _gdk_wayland_display_queue_events (display=<optimized out>) at ../gtk/gdk/wayland/gdkeventsource.c:211
#2 0x00007ffff6338029 in gdk_display_get_event (display=0x7ffff7844200) at ../gtk/gdk/gdkdisplay.c:442
#3 0x00007ffff6372728 in gdk_event_source_dispatch (base=<optimized out>, callback=<optimized out>, data=<optimized out>) at ../gtk/gdk/wayland/gdkeventsource.c:120
#4 0x00007ffff621d87b in g_main_dispatch (context=0x7fffe850adf0) at ../glib/glib/gmain.c:3454
#5 g_main_context_dispatch (context=0x7fffe850adf0) at ../glib/glib/gmain.c:4172
#6 0x00007ffff6274c89 in g_main_context_iterate.constprop.0 (context=0x7fffe850adf0, block=0, dispatch=1, self=<optimized out>) at ../glib/glib/gmain.c:4248
#7 0x00007ffff621c132 in g_main_context_iteration (context=0x7fffe850adf0, may_block=0) at ../glib/glib/gmain.c:4313
#8 0x00007fffed6f989f in nsThread::ProcessNextEvent(bool, bool*) () at /home/the8472/opt/firefox/libxul.so
#9 0x00007fffed744054 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) () at /home/the8472/opt/firefox/libxul.so
#10 0x00007fffee2a2e5f in MessageLoop::Run() () at /home/the8472/opt/firefox/libxul.so
#11 0x00007fffee9b8469 in nsBaseAppShell::Run() () at /home/the8472/opt/firefox/libxul.so
#12 0x00007fffec451d45 in nsAppStartup::Run() () at /home/the8472/opt/firefox/libxul.so
#13 0x00007fffec4dac9b in XREMain::XRE_mainRun() () at /home/the8472/opt/firefox/libxul.so
#14 0x00007fffec4db683 in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
#15 0x00007fffec4dba66 in XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
#16 0x00005555555e7cfc in main ()
Edit: note that this kills firefox in a way that it doesn't even submit crash reports.
(In reply to Robert Mader [:rmader] from comment #14)
(In reply to Emilio Cobos Álvarez (:emilio) from comment #10)
...
Which should just stop responding for a bit, instead we crash.
For the record, this might trigger a yet unsolved problem on Wayland: if the main thread is blocked long enough and does not read from the wayland socket, the buffer will fill up. Once that happens, the connection gets terminated. This happens especially fast with e.g. 1000Hz mice, see e.g. https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1915#note_1197910
Is it possible to increase the buffer size?
Comment 39•2 years ago
While I get that a proper solution would need deeper architectural changes, this issue has been around for a while with no remedy, so the feasibility of possible workarounds should be at least considered:
- The mentioned buffer size increase could help with the Wayland design problem. Not sure how feasible it is though as I haven't found much info on it.
- Misbehaving websites could and should be throttled. That problem is likely worthy of a whole separate discussion, but the problem is quite relevant here as resource abuse is what's causing the stall leading to this issue.
I've run into some bloated pages now and then causing this issue, or alternatively, if the site was still barely okay on its own, inspecting the website often pushed it into "crash" territory.
However I lately tend to run into this frequently with Discord which was always a notorious resource hog, and I guess they've kept on adding bloat until it was too much, so nowadays I regularly have to recover the browser session after looking around there. Not necessarily good or relevant metrics, but checked one of the heavier servers, and switching to a channel with plenty of embedded content issues 643 network requests, and seems to finish in 900+ ms according to network measurements when everything is cached. In line with others' observations, I tend to encounter "crashing" most often when I don't keep the cursor perfectly still during these resource abusing events.
While this isn't an issue with most websites even on older and less performant laptops with iGPUs, the fact that a single website can stall event processing long enough even on desktops which are more than powerful enough for the task is a significant problem as it spoils the whole browser session.
Comment 40•2 years ago
Can confirm. See also https://bugs.kde.org/show_bug.cgi?id=469696
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
Gdk-Message: 16:20:03.063: Lost connection to Wayland compositor.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Comment 41•2 years ago
The problem only happens to me when the Firefox window is shown. For example, when I have the Firefox window on another virtual desktop, everything is fine until I switch to that virtual desktop.
Comment 43•2 years ago
Since the issue title says "sometimes" I want to note that for me this happens so frequently that it renders firefox unusable under wayland.
It takes several attempts to even start up and afterwards it only takes a few minutes until it crashes from regular browsing.
In my case that's likely due to the combination of a high refresh rate mouse + a heavy firefox session + sway (which doesn't implement workarounds for this bug).
Comment 44•2 years ago
It does not happen often for me, but there are certain scenarios where it consistently happens.
On Firefox that would be opening a lot (30 or so) tabs at the same time while moving the mouse.
On thunderbird when moving the mouse over the window while it is processing a lot of mails, for example archiving a couple of thousand mails.
Tested using Arch, Plasma Wayland
Firefox 117.0.1
Thunderbird 115.2.2
Reporter
Comment 45•2 years ago
Bug 1859267 should at least give us visibility into how frequent this is.
Reporter
Comment 46•2 years ago
This no longer seems to happen on latest KWin fwiw. Vlad did you implement a workaround or something out of curiosity?
Comment 47•2 years ago
Weird. No, we did not land anything to address this issue.
Comment 48•2 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #46)
This no longer seems to happen on latest KWin fwiw. Vlad did you implement a workaround or something out of curiosity?
Unfortunately I can't find what happened anymore, but if I remember well, there was a change on the KDE side which significantly mitigated the issue.
Remembering it that way because this was a common issue with programs ignoring GUI needs while working on some other task, even including KDE programs like Krusader, and I could reproduce the issue within minutes, but it got way better after a Plasma update.
I'm not sure the issue is completely gone though for the following reasons:
- What I remember was just a mitigation which just significantly lowers the chance of a crash, but doesn't completely eliminate it. I believe updates just get throttled if the program is unresponsive, but not completely stopped, so the connection buffer can still get filled completely.
- Firefox still crashes occasionally under "heavy load" the same way it did earlier, nowadays it's just maybe once a month instead of having a quick reproducer of going on a malware-like site like Discord and just moving around the cursor at key resource abusing moments
Aside from the fundamental issue of not tending to window manager events, it may help to know that by heavy load I mean having likely hundreds of tabs open in multiple windows. I'm not fully sure if window count matters as it's usually correlated to tab count for me to some degree, but generally the more windows I have open, the more likely I observe the following problems:
- During handling one of the windows, Firefox just silently crashes, closing all windows. As it happens rarely nowadays I haven't verified if it's surely the issue being discussed here, but the symptom is the same
- A window (so far never the first/main window) stops painting content while remaining functional evident from forced window content updates obtainable through switching to another window and then back. Now this might not be related to this issue, just noting that a multi-window "stress test" comes with its odd problems
Consider it anecdotal, but I haven't seen either of these problems with just a single window even with likely a 100 or so tabs, so reproducing race condition kind of issues is likely easier with multiple windows.
Also, weaker hardware (mostly laptop) and heavy I/O load in the background surely helped with seeing more issues. The latter was quite interesting, as I used to see KDE background services crash just because I was moving around a lot of data through NFS, and while doing so, Firefox was definitely easier to crash.
Aside from not blocking the event loop tending to GUI needs, the real solution would be something like this: https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188
Tending to events as soon as possible helps with the problem, but we are not dealing with realtime scheduling here. Programs can go an arbitrarily long time without getting scheduled, which apparently tends to be a problem mostly during heavy I/O load, so it needs to be ensured that the Wayland connection buffer doesn't get completely filled in such situations.
Comment 49•2 years ago
Aside from not blocking the event loop tending to GUI needs, the real solution would be something like this: https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188
Note that some people are pushing back against solutions like that because the compositor shouldn't bear a potentially unbounded memory cost (which would cause reliability issues too) due to unresponsive clients.
An alternative is for the client library to handle connection losses more gracefully (like a GPU driver reset): https://invent.kde.org/plasma/kwin/-/wikis/Restarting
Comment 50•2 years ago
I can also still reproduce this on the latest stable plasma version with kwin.
More easily in thunderbird with a very large inbox than with firefox, but it seems to still occur.
Heavy loads combined with mouse movements.
(unless you were talking about plasma 6 or other preview / not yet stable versions)
Comment 51•2 years ago
(In reply to The 8472 from comment #49)
Note that some people are pushing back against solutions like that because the compositor shouldn't bear a potentially unbounded memory cost (which would cause reliability issues too) due to unresponsive clients.
I'm aware of the controversial nature, and I meant the linked change to be one of the possible solutions, not necessarily the best way, but it also goes in 2 directions:
- Changes the hardcoded buffer size to be merely the default buffer size. I don't think this change is that horrible, even if it introduces an indirection for the buffer. The current buffer size (at least with the current event handling strategy) is obviously insufficient, and I'm saying that without hardware I consider fancy. Even knowing that a program was unresponsive, all I had to do was move the cursor through it on the way somewhere else, and by the time I realized my mistake, it was already too late. I get the anti-bloating argument, but the other perspective is that the buffer gets filled up in a fraction of a second, before the user can even react.
- Adds dynamic buffer control with the mentioned unbounded memory cost. I get the problems with that, and this one is surely not necessarily desirable. Aside from silly user hostile program components like "anti-cheats" possibly getting upset about this, I don't think there's a need to keep all events forever if the program is unable to keep up, but on desktops where memory is so abundant, it surely feels odd not having the possibility to handle even just a couple seconds of event processing backlog
(In reply to qlum from comment #50)
I can also still reproduce this on the latest stable plasma version with kwin.
More easily in thunderbird with a very large inbox than with firefox, but it seems to still occur.
Heavy loads combined with mouse movements. (unless you were talking about plasma 6 or other preview / not yet stable versions)
I'm actually quite behind with improvements as far as desktop system needs go:
Operating System: Kubuntu 23.04
KDE Plasma Version: 5.27.4
KDE Frameworks Version: 5.104.0
Qt Version: 5.15.8
Kernel Version: 6.2.0-33-generic (64-bit)
Graphics Platform: Wayland
Kubuntu 23.04 brought a lot of stability improvements to me, including the mitigation of this problem, but as mentioned earlier likely not a complete fix. Kubuntu is not really "fresh" when it comes to KDE changes to begin with, and KDE is improving rapidly in the past few years, so when it comes to desktop usage, even 22.04 (latest LTS) is already considered quite old.
I'm also using Thunderbird with multiple folders containing thousands of emails each, but I don't remember experiencing this problem there even when it was a problem elsewhere. Even the current 115.3.0 version is fine despite the recent UI redesign making it less responsive.
Comment 52•2 years ago
|
||
(In reply to Pedro [:pedrov] from comment #51)
(In reply to qlum from comment #50)
I can also still reproduce this on the latest stable plasma version with kwin.
More easily in thunderbird with a very large inbox than with firefox, but it seems to still occur.
Heavy loads combined with mouse movements. (unless you were talking about plasma 6 or other preview / not yet stable versions)
I'm actually quite behind with improvements as far as desktop system needs go:
Operating System: Kubuntu 23.04
KDE Plasma Version: 5.27.4
KDE Frameworks Version: 5.104.0
Qt Version: 5.15.8
Kernel Version: 6.2.0-33-generic (64-bit)
Graphics Platform: Wayland
Kubuntu 23.04 brought a lot of stability improvements to me, including the mitigation of this problem, but as mentioned earlier likely not a complete fix. Kubuntu is not really "fresh" when it comes to KDE changes to begin with, and KDE is improving rapidly in the past few years, so when it comes to desktop usage, even 22.04 (latest LTS) is already considered quite old.
I'm also using Thunderbird with multiple folders containing thousands of emails each, but I don't remember experiencing this problem there even when it was a problem elsewhere. Even the current 115.3.0 version is fine despite the recent UI redesign making it less responsive.
In Firefox I reproduced it with some effort by opening reddit 250 times and dragging tabs around for a while (to open that many tabs: copy/paste the URL repeatedly in a text editor, paste that into a bookmark folder, and open all bookmarks).
In Thunderbird, I have access to a mail account that gets around 5k new mails a month (mostly notifications). Dragging those 5k to the archive and then moving the mouse around crashes very consistently.
I will say it has gotten harder to trigger the issue in the last few months.
Comment 53•2 years ago
|
||
I am encountering this several times a day on Arch with firefox 118.0.2-1, sway-1.8.1-1, wlroots-0.16.2-2, and gtk3-3.24.38-1, using amdgpu on Polaris.
It seemed to worsen over the past few months as I kept updating these packages.
I can confirm the same lack of crash reporter and terminal output. It only happens when I am interacting with Firefox.
The lack of crash reports may be hiding how prevalent this bug is.
I am using a 1000Hz mouse with a freely scrolling wheel, as well as several windows with hundreds of tabs, so I can see why I would be especially affected.
This has become a serious impediment to getting any work done, and it will force me off of Firefox so as to keep Sway and native Wayland browsing.
I will look into adjusting buffer sizes of the socket for the Wayland connection via the file descriptor as the8472 pointed out on the sway issue.
I will very likely need assistance though, especially if this can't be achieved directly in Bash.
If I can do anything else to help debug, please let me know. I can rebuild with debugging symbols and attach a debugger.
Reporter | ||
Comment 54•2 years ago
|
||
There should be crash reports on nightly (bug 1859267). But AIUI, this doesn't seem very actionable from our end, unless we give up on GTK or steal its wayland events somehow from another thread...
Comment 55•2 years ago
•
|
||
this doesn't seem very actionable from our end
There are workarounds that firefox could implement
A) increase the socket buffer size - This would buy some time until a crash but not help when the main thread is blocked for extended amounts of time
B) implement a unix socket proxy (forwarding payloads + file descriptors) that runs on another thread and buffers backlogged messages in userspace. GTK would then be pointed to the proxy instead of the real wayland socket - this could buy arbitrary amounts of time as long as that thread doesn't get suspended or starved. the thread could maybe use a realtime scheduling priority so it can keep making progress even if the system is oversubscribed
They're not perfect solutions but they'd hopefully paper over a large fraction of those stall-induced crashes.
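A minimal sketch of workaround A, assuming Linux and a connected unix socket. `GrowSendBuffer` is a hypothetical helper name, not something from Firefox or GTK. Per unix(7), `SO_SNDBUF` has an effect on unix domain sockets while `SO_RCVBUF` does not, so the option has to be set on whichever end writes the data that piles up (for events flowing towards the application, that is the compositor or the client-facing socket of a proxy). The kernel may clamp the request, so the sketch reads the granted size back.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: request a larger send buffer on `fd` and return the
// size the kernel actually granted, or -1 on error. On Linux the granted
// value is typically double the request, capped by net.core.wmem_max.
int GrowSendBuffer(int fd, int wantedBytes) {
  assert(wantedBytes > 0);
  if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &wantedBytes,
                 sizeof(wantedBytes)) != 0) {
    return -1;
  }
  int granted = 0;
  socklen_t len = sizeof(granted);
  if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &granted, &len) != 0) {
    return -1;
  }
  return granted;
}
```

As noted above, this only buys time: a stalled reader will still fill the larger buffer eventually.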
Comment 56•1 year ago
|
||
The problem logically still exists even if it's seen less often, but I'm updating my report for clarity: I'm no longer uncertain, I can definitely still reproduce this problem.
The key contributing factor in my case seems to be really heavy background I/O. It became so abusive that it sometimes delayed even desktop actions by more than a minute and made many programs unresponsive, which turned out to be a surprisingly good stability test: well-behaved programs survived without issues, one program likely lost its PipeWire connection (it could never produce sound again but otherwise stayed alive), and Firefox crashed shortly after the cursor was moved over it while it was unresponsive.
My abusive workload was deleting millions of heavily referenced files (tons of reflinked duplicates) on a compressed (zstd:1, could be way "worse") Btrfs partition. Surely not a usual problem, but for bug reproduction purposes it beats everything I've seen before, at least excluding the "good old" HDD swap file hell, which likely wouldn't be taken too seriously these days.
Assignee | ||
Comment 57•1 year ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #54)
There should be crash reports on nightly (bug 1859267). But AIUI, this doesn't seem very actionable from our end, unless we give up on GTK or steal its wayland events somehow from another thread...
It may be possible to bundle the gtk3 library and make changes directly there. That may also solve recent gtk3 issues like the missing xdg-popup resize, or let us add direct rendering to the wl_surface owned by a gtk widget. But I'm not sure how that would work with other gtk3 components like ATK, IM and so on.
Assignee | ||
Comment 58•1 year ago
|
||
(In reply to The 8472 from comment #55)
B) implement a unix socket proxy (forwarding payloads + file descriptors) that runs on another thread and buffers backlogged messages in userspace. GTK would then be pointed to the proxy instead of the real wayland socket - this could buy arbitrary amounts of time as long as that thread doesn't get suspended or starved. the thread could maybe use a realtime scheduling priority so it can keep making progress even if the system is oversubscribed
I'm not sure how to open GdkDisplay over existing wayland display connection. I don't think Gdk provides such API.
Comment 59•1 year ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #58)
(In reply to The 8472 from comment #55)
B) implement a unix socket proxy (forwarding payloads + file descriptors) that runs on another thread and buffers backlogged messages in userspace. GTK would then be pointed to the proxy instead of the real wayland socket - this could buy arbitrary amounts of time as long as that thread doesn't get suspended or starved. the thread could maybe use a realtime scheduling priority so it can keep making progress even if the system is oversubscribed
I'm not sure how to open GdkDisplay over existing wayland display connection. I don't think Gdk provides such API.
WAYLAND_DISPLAY can be an absolute path. So one could set up the proxy, then set the environment variable and let gdk connect to it.
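A rough sketch of that idea, assuming it runs before anything opens the display. `ProxyEnv`, `PointClientAtProxy` and the `wayland-proxy-<pid>` file name are made up for illustration. The helper remembers the compositor's display name (libwayland falls back to `wayland-0` when the variable is unset) so the proxy can still reach the real compositor, then exports an absolute path for the client side.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <unistd.h>

// Hypothetical result type: where the real compositor lives and where the
// proxy is expected to listen.
struct ProxyEnv {
  std::string compositorDisplay;
  std::string proxyPath;
};

// Hypothetical helper: redirect later display connections in this process
// (and its children) to an absolute proxy socket path.
ProxyEnv PointClientAtProxy(const std::string& runtimeDir) {
  assert(!runtimeDir.empty());
  ProxyEnv env;
  const char* current = getenv("WAYLAND_DISPLAY");
  env.compositorDisplay = current ? current : "wayland-0";
  env.proxyPath =
      runtimeDir + "/wayland-proxy-" + std::to_string(getpid());
  setenv("WAYLAND_DISPLAY", env.proxyPath.c_str(), /* overwrite = */ 1);
  return env;
}
```

The proxy itself would have to be listening on `proxyPath` before gdk connects.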
Assignee | ||
Comment 60•1 year ago
|
||
(In reply to The 8472 from comment #59)
WAYLAND_DISPLAY can be an absolute path. So one could setup the proxy and then set the environment variable and let gdk connect to it.
That's interesting. There's also the WAYLAND_SOCKET env variable, which may be even more useful. Is there any example implementation of the 'unix socket proxy' thing we can use as a template, or any library which implements it? My first Google search hasn't revealed anything useful.
Comment 61•1 year ago
•
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #60)
(In reply to The 8472 from comment #59)
That's interesting. There's also the WAYLAND_SOCKET env variable, which may be even more useful. Is there any example implementation of the 'unix socket proxy' thing we can use as a template, or any library which implements it? My first Google search hasn't revealed anything useful.
I'm not aware of any, it probably has to be written from scratch. Waypipe has to do something similar internally but in a more complicated way than a simple unix socket proxy because it has to serialize the wayland protocol instead of just passing through bytes/descriptors.
Structurally it should be similar to implementing a TCP proxy. Plus the SCM_RIGHTS passing.
So that would be an epoll loop to support non-blocking IO to handle multiple connections on a single thread, a bunch of per-connection buffers and then calling sendmsg/recvmsg with cmsg handling to pass the descriptors, if any.
And one would have to study the wayland protocol to figure out what's the maximum amount of file descriptors it can send per number of bytes received to size the buffers properly (or use MSG_PEEK probing to figure out when buffers need to be resized).
https://man7.org/linux/man-pages/man7/unix.7.html
https://man7.org/linux/man-pages/man3/cmsg.3.html
https://man7.org/linux/man-pages/man2/recvmsg.2.html
https://man7.org/linux/man-pages/man2/sendmsg.2.html
https://man7.org/linux/man-pages/man7/epoll.7.html
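The sendmsg/recvmsg part with cmsg handling can be sketched as follows. This is a simplified illustration, not code from the demo or from the later patch: `SendWithFd` and `RecvWithFd` are hypothetical names, only one descriptor per message is handled, and the control buffer is sized accordingly. A real proxy would need room for as many descriptors as the protocol can attach and would sit inside a non-blocking epoll loop.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Send `len` payload bytes plus one file descriptor as SCM_RIGHTS data.
ssize_t SendWithFd(int sock, const void* data, size_t len, int fdToPass) {
  iovec iov{const_cast<void*>(data), len};
  alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fdToPass, sizeof(int));
  return sendmsg(sock, &msg, MSG_NOSIGNAL);
}

// Receive payload bytes; *fdOut is set to a received descriptor or -1.
ssize_t RecvWithFd(int sock, void* buf, size_t len, int* fdOut) {
  assert(fdOut != nullptr);
  *fdOut = -1;
  iovec iov{buf, len};
  alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  ssize_t n = recvmsg(sock, &msg, MSG_CMSG_CLOEXEC);
  if (n < 0) return n;
  for (cmsghdr* c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
    if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
      memcpy(fdOut, CMSG_DATA(c), sizeof(int));
    }
  }
  return n;
}
```

The receiver gets a new descriptor number referring to the same open file, so the proxy has to forward it and then close its own copy.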
Assignee | ||
Comment 62•1 year ago
|
||
Thanks, looks like a bigger project. It may be worth implementing it as a standalone library so other Wayland apps can benefit from it.
Comment 63•1 year ago
|
||
I hacked up a quick demo that manages to prevent firefox crashes when dropping a file on google image search (bug 1792754) or running the reproducer from comment #10
Emphasis on hacked. The code is shoddy.
Assignee | ||
Comment 64•1 year ago
|
||
Cool, Thanks! So at least we know this path is viable.
I was thinking about an implementation based on GIOStream (https://docs.gtk.org/gio/class.IOStream.html), which should handle all aspects of file/socket handling.
Comment 65•1 year ago
|
||
Unless it's documented somewhere I would not expect that to handle the file-descriptor passing. IO streams usually just mean bytes.
Comment 67•1 year ago
|
||
Copying crash signatures from duplicate bugs.
Comment 68•1 year ago
|
||
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on beta
- Top 5 desktop browser crashes on Linux on beta
- Top 5 desktop browser crashes on Linux on release (startup)
For more information, please visit BugBot documentation.
Comment 69•1 year ago
|
||
WAYLAND_DEBUG=1 thunderbird
Comment 70•1 year ago
|
||
Attached another crash report on sway. Can reproduce consistently with a large inbox and several calendars. No errors in journalctl/dmesg. No OOM conditions either.
thunderbird=115.5.0-1
wayland=1.22.0-1
sway=1:1.8.1-4
wlroots=0.17.0-1
gtk3=1:3.24.38-1
Running Arch Linux (x86_64) with linux=6.6.2-arch1-1.
Comment 71•1 year ago
|
||
I'm not sure where I read it, but I think it was Emilio who brought up the idea to just statically build a local copy of GTK3. That would allow us to fix this issue quite nicely - we could introduce a new thread that does the polling on the Wayland socket and is never blocked by other stuff on the main thread.
It would additionally allow us to do all kinds of small changes that make our life easier - we could go as far as ditching the content subsurface and just rendering to the xdg-toplevel - and still rebase on upstream from time to time, just like we do for libwebrtc, which has a way higher volume of changes.
Mutter did that with Clutter and it was a very good decision for its use-case - and with most apps being ported to GTK4, FF is often the only app still using GTK3 anyway.
Emilio, was it you who suggested that - and what do you think about it?
Reporter | ||
Comment 72•1 year ago
|
||
It was Martin (so 301 him). We'd also need to link gdk statically too, right? One concern I'd have would be how it'd interact with the rest of stuff in the system (like gtk modules etc). If we do binary-incompatible changes to gtk, then we'd be in ABI hell...
It would additionally allow us to do all kinds of small changes that make our life easier - we could go as far as ditching the content subsurface and just rendering to the xdg-toplevel.
I think this is somewhat unrelated, but it seems worth reconsidering our widget setup to stop using GTK for most stuff (other than to read theming information and settings perhaps). That means reworking how a lot of things like drag and drop, IME, pointer events, rendering, etc. work without a GtkWindow. Might be worth it in the end tho, and it seems it could be incremental? E.g., we could work on a widget/gtk/WindowWayland that did the kinds of things you're suggesting.
If we'd end up statically linking gtk, then your approach would also work (and be less work of course), but it seems at that point we could just remove a fair bit of abstraction layers altogether?
we could introduce a new thread that does the polling on the Wayland socket and is never blocked by other stuff on the main thread.
It seems that as long as we use gtk/gdk for stuff like settings and widget styling / rendering, that'd open a default display connection on the main thread here. If we could just replace that event source by our own I think that'd allow us to fix this in the way you're suggesting... It might be worth trying to do that without having to fork gdk?
Comment 73•1 year ago
|
||
(In reply to The 8472 from comment #63)
I hacked up a quick demo that manages to prevent firefox crashes when dropping a file on google image search (bug 1792754) or running the reproducer from comment #10
Emphasis on hacked. The code is shoddy.
Quick report after one full day of testing - this demo successfully prevents all crashes for me.
Assignee | ||
Comment 74•1 year ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #72)
It was Martin (so 301 him). We'd also need to link gdk statically too, right? One concern I'd have would be how it'd interact with the rest of stuff in the system (like gtk modules etc). If we do binary-incompatible changes to gtk, then we'd be in ABI hell...
We'd also need a big list of build requires.
Assignee | ||
Comment 75•1 year ago
|
||
I think the easiest immediate fix will be the proxy implementation. But I wonder how to incorporate it into Firefox.
Emilio, may we use the rust demo here? I have zero experience with Rust so I have no idea if that can be utilized somehow. Another option is to implement it in C++ which can be easily integrated then.
Assignee | ||
Comment 77•1 year ago
|
||
Will look at C++ implementation which can be used with Firefox internally.
Reporter | ||
Comment 78•1 year ago
|
||
The rust implementation should be straight-forward to use fwiw, depending on how you want to build it / run it and such.
Assignee | ||
Comment 79•1 year ago
|
||
Assignee | ||
Comment 80•1 year ago
|
||
Comment on attachment 9368399 [details]
WIP: Bug 1743144 [Wayland] Wayland proxy wip
There's an initial Wayland proxy implementation. It's based on https://github.com/the8472/weyland-p5000 WIP.
For some reason the Wayland clipboard doesn't work with this proxy, while the original code (weyland-p5000) is fine.
Comment 81•1 year ago
|
||
I don't know if it causes the clipboard issue, but your code isn't dealing with partial writes. Though if that were the problem I'd expect the assert(ret == mData.size()); to trigger.
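For illustration, one way to handle partial writes in a non-blocking proxy is to keep the unsent tail queued and retry when the socket becomes writable again. This is only a sketch under assumptions: `OutQueue` and `FlushResult` are hypothetical names, and it uses `send()` with `MSG_DONTWAIT` rather than whatever the attached patch does.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>
#include <vector>

enum class FlushResult { Done, WouldBlock, Error };

// Hypothetical per-connection outgoing queue.
struct OutQueue {
  std::vector<char> data;
  size_t sent = 0;  // number of bytes of `data` already written

  // Write as much as the socket accepts right now. A short write is not an
  // error: the remainder stays queued for the next writable notification.
  FlushResult Flush(int fd) {
    assert(sent <= data.size());
    while (sent < data.size()) {
      ssize_t n = send(fd, data.data() + sent, data.size() - sent,
                       MSG_DONTWAIT | MSG_NOSIGNAL);
      if (n > 0) {
        sent += static_cast<size_t>(n);
        continue;
      }
      if (n < 0 && errno == EINTR) continue;
      if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return FlushResult::WouldBlock;
      }
      return FlushResult::Error;
    }
    data.clear();
    sent = 0;
    return FlushResult::Done;
  }
};
```

On `WouldBlock` the caller would typically register the descriptor for `EPOLLOUT` and call `Flush` again once it fires.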
Comment 82•1 year ago
|
||
In the sway issue a user reported that using a proxy by itself is not sufficient because firefox overloads the system to the point where the proxy gets slowed down too.
Increasing the socket buffer sizes did seem to help.
Whether adjusting thread priorities of the proxy (up) and firefox (down) would help is currently being tested.
Assignee | ||
Comment 83•1 year ago
•
|
||
(In reply to The 8472 from comment #81)
I don't know if it causes the clipboard issue, but your code isn't dealing with partial writes. Though if that were the problem I'd expect the assert(ret == mData.size()); to trigger.
I made some progress here. It looks like when the proxy is used, the clipboard transfer from the application to the compositor through the provided fd (the one passed by sendmsg) never finishes. The compositor receives the data from the application, but it doesn't recognize that this is all of it and keeps waiting for more.
If the app runs without the proxy, the fd passed by sendmsg is closed on the application side (I guess), so the compositor gets POLLHUP:
read(32, "Accept", 4194304) = 6
poll([{fd=32, events=POLLIN}], 1, 0) = 1 ([{fd=32, revents=POLLHUP}])
poll([{fd=32, events=POLLIN}], 1, -1) = 1 ([{fd=32, revents=POLLHUP}])
read(32, "", 4194298) = 0
close(32) = 0
(The "Accept" string is the text data sent from the application to the compositor.)
Comment 84•1 year ago
|
||
Are you closing the file descriptors after sending them? In Rust, OwnedFd does that automatically when dropping them.
Basically, SCM_RIGHTS sends a dup of the file descriptor, it doesn't consume them.
Assignee | ||
Comment 85•1 year ago
|
||
(In reply to The 8472 from comment #84)
Are you closing the file descriptors after sending them? In Rust, OwnedFd does that automatically when dropping them.
Basically, SCM_RIGHTS sends a dup of the file descriptor, it doesn't consume them.
Yes, looks like that's the problem here.
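To illustrate why a leaked descriptor keeps the reader waiting: as long as any copy of the write end stays open (for example the one a proxy received via SCM_RIGHTS and never closed), the reader does not see a hangup. A minimal sketch, assuming a small move-only owner similar in spirit to Rust's OwnedFd; `UniqueFd` and `ReaderSeesHangup` are hypothetical names.

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>
#include <utility>

// Move-only owner that closes its descriptor when it goes out of scope.
class UniqueFd {
 public:
  explicit UniqueFd(int fd = -1) : mFd(fd) {}
  UniqueFd(UniqueFd&& other) noexcept : mFd(std::exchange(other.mFd, -1)) {}
  UniqueFd& operator=(UniqueFd&& other) noexcept {
    if (this != &other) {
      Reset();
      mFd = std::exchange(other.mFd, -1);
    }
    return *this;
  }
  UniqueFd(const UniqueFd&) = delete;
  UniqueFd& operator=(const UniqueFd&) = delete;
  ~UniqueFd() { Reset(); }

  int get() const { return mFd; }
  void Reset() {
    if (mFd >= 0) {
      close(mFd);
      mFd = -1;
    }
  }

 private:
  int mFd;
};

// True if poll() currently reports POLLHUP on `readFd`, i.e. every copy of
// the write end has been closed.
bool ReaderSeesHangup(int readFd) {
  assert(readFd >= 0);
  pollfd p{readFd, POLLIN, 0};
  return poll(&p, 1, 0) == 1 && (p.revents & POLLHUP) != 0;
}
```

Wrapping each received descriptor in such an owner right after `recvmsg` and dropping it after the forwarding `sendmsg` avoids holding the transfer open.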
Comment 86•1 year ago
|
||
(Martin Stránský [:stransky] from comment #75)
I think the easiest immediate fix will be the proxy implementation. But I wonder how to incorporate it into Firefox.
Emilio, may we use the rust demo here? I have zero experience with Rust so I have no idea if that can be utilized somehow. Another option is to implement it in C++ which can be easily integrated then.
(In reply to Emilio Cobos Álvarez (:emilio) from comment #78)
The rust implementation should be straight-forward to use fwiw, depending on how you want to build it / run it and such.
Can you make a patch that shows how to integrate https://github.com/the8472/weyland-p5000 ?
(If we can have Rust, why rewrite it in legacy C++?)
Comment 87•1 year ago
|
||
If you want to use my rust code I'll have to clean that up to turn it into a library and do clean error handling. I can do that, but only if there's intent to use it.
Comment 88•1 year ago
|
||
Sorry if I missed this in the discussion but it seems like there is a lot of work considering how to mitigate this by ensuring that we are consistently reading events (either internally or via a proxy) but it seems like we haven't considered fixing the main issue.
Fundamentally, without requiring a real-time OS we can't assume that we (or a proxy that we spawn) can respond to events within any particular deadline. Even without considering SIGSTOP, busy CPUs can starve our processes, and the compositor may be running at a higher priority where it continues to send us events.
It seems that the reliable fix is assuming that we will occasionally get disconnected and supporting reconnection. IIUC this is what Qt did. The stated reason in that patch seems to focus on compositor crashes, but it seems like this would also solve the temporary hang + disconnect case. This seems like a more complete solution to me because it can handle complete process stops, machines stalling for any reason, and compositor crashes.
IIUC we will need some support from GTK to gracefully handle the disconnection. But I would be surprised if they weren't interested in supporting this.
Maybe the proxy is a decent mitigation in the meantime, but it seems to be just an improvement to the current state, rather than a fundamental fix.
Reporter | ||
Comment 89•1 year ago
|
||
Yeah, we'd need to turn it into a lib, and expose it via an FFI function. An empty crate with that crate as a dependency and a cbindgen.toml file, like dom/base/rust, would do. I can help Martin if he wants to pursue this.
Assignee | ||
Comment 90•1 year ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #89)
Yeah, we'd need to turn it into a lib, and expose it via an FFI function. An empty crate with that crate as a dependency and a cbindgen.toml file, like dom/base/rust, would do. I can help Martin if he wants to pursue this.
I have almost production-ready C++ code for it. The thing is that I don't know Rust, so if we adopt the Rust one we also need a maintainer for it. If we use the C++ one, it can be maintained as part of the toolkit/gtk code.
Assignee | ||
Comment 91•1 year ago
|
||
The latest C++ proxy version is attached. There are still some TODOs to check, and I'd like to run some performance tests and add a build option to make it optional.
I'd prefer to use C++ code as I understand it but if there's strong feeling for Rust path I'm not against it (if we have a maintainer for it).
Assignee | ||
Comment 92•1 year ago
|
||
Assignee | ||
Comment 93•1 year ago
|
||
Depends on D196554
Assignee | ||
Comment 94•1 year ago
|
||
Depends on D196555
Assignee | ||
Comment 95•1 year ago
|
||
I did some speed testing and I'm getting 158-195 Speedometer points with the proxy and 160-161 without it, so it may slightly affect performance. It may be worth testing on low-performance hardware too, like ARM (but we can also compile Firefox with the proxy disabled there).
Comment 96•1 year ago
|
||
sched_param param;
if (pthread_attr_getschedparam(&attr, &param) == 0) {
  param.sched_priority = sched_get_priority_max(SCHED_FIFO);
  pthread_attr_setschedparam(&attr, &param);
}
- does this even work? I am under the impression that only root/processes with CAP_SYS_NICE can set RT prios or need to ask rtkit
- the maximum priority is probably overkill. even the lowest realtime priority is still higher than the default scheduling policy. plus it's higher than what sway uses so sway might get preempted on the first message it sends to the socket (leading to a wakeup) which probably increases context switches
Assignee | ||
Comment 97•1 year ago
|
||
(In reply to The 8472 from comment #96)
sched_param param;
if (pthread_attr_getschedparam(&attr, &param) == 0) {
  param.sched_priority = sched_get_priority_max(SCHED_FIFO);
  pthread_attr_setschedparam(&attr, &param);
}
- does this even work? I am under the impression that only root/processes with CAP_SYS_NICE can set RT prios or need to ask rtkit
- the maximum priority is probably overkill. even the lowest realtime priority is still higher than the default scheduling policy. plus it's higher than what sway uses so sway might get preempted on the first message it sends to the socket (leading to a wakeup) which probably increases context switches
Sure, which sched_priority do you suggest to use?
Comment 98•1 year ago
|
||
the lowest, it's what sway does. https://github.com/swaywm/sway/pull/6994/files#diff-055d9d833b1db9e8fa249150a4d00918c36e78fb9e2e130e5663a4a6d73cbd07R20
Assignee | ||
Comment 99•1 year ago
|
||
(In reply to The 8472 from comment #98)
the lowest, it's what sway does. https://github.com/swaywm/sway/pull/6994/files#diff-055d9d833b1db9e8fa249150a4d00918c36e78fb9e2e130e5663a4a6d73cbd07R20
Thanks, will update it.
Assignee | ||
Comment 100•1 year ago
|
||
I'm going to use the proxy with Fedora Firefox package to get real user feedback here.
Assignee | ||
Comment 101•1 year ago
|
||
Tested with sched_get_priority_min(SCHED_RR) and the speedometer performance is the same.
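A sketch of how a thread could request the lowest SCHED_RR priority and still start when the process lacks the privilege for it (CAP_SYS_NICE or a suitable RLIMIT_RTPRIO). `StartProxyThread` is a hypothetical helper, not the code from the patch. The sketch also sets `PTHREAD_EXPLICIT_SCHED` and the policy before the priority, since thread attributes are otherwise inherited from the creating thread.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: try to start `fn` with the lowest SCHED_RR priority;
// if that is refused, start it with default scheduling instead.
// *gotRealtime reports which of the two happened.
bool StartProxyThread(pthread_t* thread, void* (*fn)(void*), void* arg,
                      bool* gotRealtime) {
  assert(thread && fn && gotRealtime);
  *gotRealtime = false;

  pthread_attr_t attr;
  if (pthread_attr_init(&attr) != 0) return false;

  sched_param param{};
  param.sched_priority = sched_get_priority_min(SCHED_RR);
  bool attrOk =
      pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED) == 0 &&
      pthread_attr_setschedpolicy(&attr, SCHED_RR) == 0 &&
      pthread_attr_setschedparam(&attr, &param) == 0;

  int rv = attrOk ? pthread_create(thread, &attr, fn, arg) : -1;
  pthread_attr_destroy(&attr);
  if (rv == 0) {
    *gotRealtime = true;
    return true;
  }
  // Realtime scheduling was refused (typically EPERM): fall back.
  return pthread_create(thread, nullptr, fn, arg) == 0;
}
```

Whether the realtime request succeeds depends on the system configuration, so callers should treat `gotRealtime` as informational only.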
Comment 102•1 year ago
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #100)
I'm going to use the proxy with Fedora Firefox package to get real user feedback here.
I've applied D196554.diff, D196555.diff, D196556.diff on my custom Gentoo build of Firefox and it seems to be working fine as far as I can tell. One quick issue I ran into: if you have multiple Firefox profiles and open the profile selector with firefox -p, it crashes:
/usr/lib64/firefox/firefox -p
Wayland Proxy error: StartProxyServer(): bind() error : Address already in use
[1] 2902 segmentation fault /usr/lib64/firefox/firefox -p
I don't know if you can reproduce that in Fedora but thought I'd mention it.
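For context, bind() reports "Address already in use" when the socket path already exists, for instance when a second instance picks the same name. One possible way to avoid such collisions is to derive the path from the process id plus a counter; the sketch below only illustrates that idea and is not necessarily how the patch handles it. `ListenOnUniqueSocket` and the `wayland-proxy-<pid>-<n>` naming are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Hypothetical helper: create a listening unix socket under `dir` with a
// per-process, per-call unique name. Returns the fd, or -1 on failure.
int ListenOnUniqueSocket(const std::string& dir, std::string* pathOut) {
  assert(pathOut != nullptr);
  static std::atomic<unsigned> counter{0};
  std::string path = dir + "/wayland-proxy-" + std::to_string(getpid()) +
                     "-" + std::to_string(counter++);

  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  if (path.size() >= sizeof(addr.sun_path)) return -1;
  memcpy(addr.sun_path, path.c_str(), path.size() + 1);

  int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
  if (fd < 0) return -1;

  // A stale file left by a crashed process that had the same pid would
  // otherwise make bind() fail with EADDRINUSE.
  unlink(path.c_str());

  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
      listen(fd, 8) != 0) {
    close(fd);
    return -1;
  }
  *pathOut = path;
  return fd;
}
```

The caller is responsible for unlinking the path again on shutdown.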
Comment 103•1 year ago
|
||
I have been experiencing this over the last few weeks; in fact it is the most common reason for my Thunderbird and Firefox crashes under Ubuntu 23.10:
- https://crash-stats.mozilla.org/report/index/bp-25fb92e1-cfb5-4708-837d-492ed0231226
- https://crash-stats.mozilla.org/report/index/bp-38acfaff-48ea-4d0f-b55d-6069f0231219
- https://crash-stats.mozilla.org/report/index/bp-e04e2103-34bd-4820-b014-bedfc0231123
- https://crash-stats.mozilla.org/report/index/bp-fd448f76-450a-41d9-a8e5-bbd810231116
- https://crash-stats.mozilla.org/report/index/bp-6c18dd8a-899e-4730-88af-b4e090231108
- https://crash-stats.mozilla.org/report/index/bp-d7481b5b-fe4b-4bb7-b51b-dc3840231103
- https://crash-stats.mozilla.org/report/index/ade617e6-cf49-4348-9909-ae2490231219
- https://crash-stats.mozilla.org/report/index/c1f24b4c-96ca-4052-82c1-065bc0231212
- https://crash-stats.mozilla.org/report/index/11e661fd-a243-493d-a9e3-222a00231207
In all cases, the STRs involved scrolling with my mouse wheel. It might have been out of focus of a scrollable field, but I can state for sure the browser or mailer was not in a busy state that would make it unresponsive, it was scrolling properly and out of the blue all of a sudden, crashed.
Is it possible we are spending too much time handling the scrolling events and we fail to reply in time to wayland? This is a Logitech G903 mouse, solaar shows DPI set at 3200 and polling at 1kHz.
Comment 104•1 year ago
|
||
Polling at 1K was frequently reported as a potential trigger for this issue.
Comment 105•1 year ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #104)
Polling at 1K was frequently reported as a potential trigger for this issue.
Ok, but is the fault on our side, or in Wayland, or in the desktop?
Comment 106•1 year ago
|
||
This "https://github.com/stransky/wayland-proxy/" works great for me. 0 crashes for 4 days now (whereas before it crashed like 5-20 times a day; KDE wayland Manjaro.). When will this be added to the mainline firefox for everyone (besides Fedora)?
Assignee | ||
Comment 107•1 year ago
|
||
(In reply to :gerard-majax [PTO 13/12-08/01] from comment #103)
In all cases, the STRs involved scrolling with my mouse wheel. It might have been out of focus of a scrollable field, but I can state for sure the browser or mailer was not in a busy state that would make it unresponsive, it was scrolling properly and out of the blue all of a sudden, crashed.
Is it possible we are spending too much time handling the scrolling events and we fail to reply in time to wayland? This is a Logitech G903 mouse, solaar shows DPI set at 3200 and polling at 1kHz.
Botond, do you know if there's a potential response lag during scrolling? Do we do a significant amount of work on the main thread during scroll?
Thanks.
Comment 108•1 year ago
|
||
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit BugBot documentation.
Comment 109•1 year ago
•
|
||
(In reply to Martin Stránský [:stransky] (ni? me) from comment #107)
(In reply to :gerard-majax [PTO 13/12-08/01] from comment #103)
In all cases, the STRs involved scrolling with my mouse wheel. It might have been out of focus of a scrollable field, but I can state for sure the browser or mailer was not in a busy state that would make it unresponsive, it was scrolling properly and out of the blue all of a sudden, crashed.
Is it possible we are spending too much time handling the scrolling events and we fail to reply in time to wayland? This is a Logitech G903 mouse, solaar shows DPI set at 3200 and polling at 1kHz.
Botond, do you know if there's a potential response lag during scrolling? Do we do a significant amount of work on the main thread during scroll?
With the GPU process enabled (layers.gpu-process.enabled=true; not enabled by default on Linux), the parent process main thread does block on a synchronous IPC round-trip to the GPU process for every event (to perform a hit-test to determine which content process to dispatch the event to). See bug 1677509 for more details on this; it may be possible to avoid this using the approach discussed in bug 1677509 comment 10 but this would involve a non-trivial refactor and come with some risk to input handling latency.
With the GPU process disabled, the hit-test happens within the parent process but can still involve the main thread blocking on other threads in a couple of places:
- At the beginning of the hit test, the main thread needs to acquire a lock that is also acquired by the SceneBuilder thread during a "scene swap" (when a new WebRender scene has been built and WebRender and APZ coordinate to start using the new scene at the same time). As far as I'm aware, a scene swap should be fast but IIRC it does involve some synchronous back-and-forth IPC between the SceneBuilder and RenderBackend threads (Nical would know more of the details here).
- The WebRender part of the hit test used to involve a synchronous round-trip to the RenderBackend thread, but we reworked that in bug 1580178 to avoid this in almost all cases. If I'm reading the code right, the only remaining case in which we can block (on this line) is during the first hit-test on a new window, if it's requested sufficiently soon after creating the window that the initial hit tester hasn't been built yet.
Assignee | ||
Comment 110•1 year ago
|
||
Thanks. GPU process is not implemented on Wayland at all.
Comment 111•1 year ago
|
||
Comment 112•1 year ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/6a0659188d6c
https://hg.mozilla.org/mozilla-central/rev/b0d2efdcd6cc
Comment 113•1 year ago
|
||
:stransky Fx122 is affected here but we are near the end of the beta cycle. 122.0b9 builds on 2024-01-12.
What do you think about adding an uplift request on this? Is it low-risk enough to take at this stage?
Assignee | ||
Comment 114•1 year ago
|
||
(In reply to Donal Meehan [:dmeehan] from comment #113)
:stransky Fx122 is affected here but we are near the end of the beta cycle. 122.0b9 builds on 2024-01-12.
What do you think about adding an uplift request on this? Is it low-risk enough to take at this stage?
Better to keep it in nightly only.
Comment 115•1 year ago
|
||
Nightly crash rate seems to have decreased. However, there are still crashes:
Assignee | ||
Comment 116•1 year ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #115)
Nightly crash rate seems to have decreased. However, there are still crashes:
That's interesting, Thanks.
It may be caused by missing Wayland protocol ping handling, as a comment on the Wayland proxy post suggests:
https://mastransky.wordpress.com/2023/12/22/wayland-proxy-load-balancer/#comments
We may look at it and investigate possible improvements.
Comment 117•1 year ago
|
||
Further observation - crash rate for signature <name omitted> | HandleGLibMessage has been driven to zero.
On the other hand, the crash rate for HandleGLibMessage, which had dropped by about half, has suddenly increased in just the last couple of days in 123.0 and 123.0.1, roughly doubling.
Assignee | ||
Comment 118•1 year ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #117)
Further observation - crash rate for signature <name omitted> | HandleGLibMessage has been driven to zero.
On the other hand, the crash rate for HandleGLibMessage, which had dropped by about half, has suddenly increased in just the last couple of days in 123.0 and 123.0.1, roughly doubling.
Would be great to have someone who can reproduce it with latest nightly with proxy enabled. We had reports about 1000Hz mouse issues but that should be solved now. I wonder what else can cause the issues (beside heavy system utilization which may cause delays in event processing).
Comment 119•1 year ago
|
||
I can confirm that since the Wayland proxy landed, both Firefox Nightly and Thunderbird Daily no longer crash because of this, so a 1kHz mouse is OK.