Closed Bug 1743144 Opened 3 years ago Closed 1 year ago

[Wayland] "Lost connection to Wayland compositor."

Categories

(Core :: Widget: Gtk, defect, P1)

RESOLVED FIXED
123 Branch
Tracking Status
firefox-esr115 --- disabled
firefox121 --- wontfix
firefox122 --- wontfix
firefox123 --- fixed

People

(Reporter: emilio, Assigned: stransky)

References

(Blocks 2 open bugs)

Details

(Keywords: topcrash)

Attachments

(4 files, 2 obsolete files)

I've run Plasma for the last couple of days, and sometimes Firefox or Thunderbird will crash without the crash reporter, leaving the following message in the journal:

Lost connection to Wayland compositor.

That's https://gitlab.gnome.org/GNOME/gtk/-/blob/c742debea8d7f2c93f06a5b506851698291cb2da/gdk/wayland/gdkeventsource.c#L237

The bad part is that there's actually no crash report or anything, so we just don't see these crashes...

Martin do you have any idea of how to track this down or possibly intercept these _exit calls from gtk?

Flags: needinfo?(stransky)

Did kwin_wayland crash? If not, it would be great to have WAYLAND_DEBUG log.

It did not. How can I get that log?

Flags: needinfo?(vlad.zahorodnii)

Run firefox as follows: env WAYLAND_DEBUG=1 firefox. Do you run firefox nightly? I haven't seen any crashes with Firefox 94, i.e. stable branch?
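For anyone who wants that log in a file rather than in the terminal, here is a minimal Python sketch. It assumes the WAYLAND_DEBUG trace is written to the process's stderr; the helper name run_with_wayland_debug and the log path are made up for illustration.

```python
import os
import subprocess


def run_with_wayland_debug(cmd, log_path):
    """Run cmd with WAYLAND_DEBUG=1 and append its stderr to log_path.

    Hypothetical helper: the name and the log location are illustrative only.
    """
    # Same effect as running `env WAYLAND_DEBUG=1 <cmd>` from a shell.
    env = dict(os.environ, WAYLAND_DEBUG="1")
    with open(log_path, "ab") as log:
        # Assumption: the protocol trace goes to stderr, so only stderr is redirected.
        return subprocess.run(cmd, env=env, stderr=log).returncode
```

Usage would be something like run_with_wayland_debug(["firefox"], "wayland-debug.log"); expect the file to grow quickly.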

Flags: needinfo?(vlad.zahorodnii)

/i.e. stable branch?/i.e. stable branch/

Ah, I thought you meant WAYLAND_DEBUG in kwin, thanks. Will run it like that for a while and see if I hit this.

And yeah, this is Firefox Nightly and Thunderbird Daily as well. I've seen this mostly when my computer is super-busy (building or what not).

Sheer luck that I saw the email from this bug; still, this one might be similar/duplicate: https://bugzilla.mozilla.org/show_bug.cgi?id=1718851
Note that since then I've switched to Arch (but haven't set it up fully yet as I'm doing some workarounds), so I might not be the right guy to be able to reproduce it.

Tl;dr: Check if there's something useful at the mentioned bug link above.

The bug from Comment #6 at the start of it is a different crash probably due to EarlyOOM, but at the end of it there are two different error messages about the compositor, one is this one from this bug and the other one is again some compositor-ish thingie. I don't think that there are useful WAYLAND_DEBUG logs (the one there was not taken when the crash happened).

Here's a log of a crashing session. There's not much going on there I think. It seems it crashed with:

Gdk-Message: 20:39:12.453: Error flushing display: Broken pipe

(at the end of the log). Journal for that time is:

Nov 27 20:37:31 ryzen plasmashell[3715]: kf.plasma.quick: Couldn't create KWindowShadow for ToolTipDialog(0x56535cca8580)
Nov 27 20:37:31 ryzen plasmashell[3715]: kf.plasma.quick: Couldn't create KWindowShadow for ToolTipDialog(0x56535cca8580)
Nov 27 20:37:31 ryzen plasmashell[3715]: kf.plasma.quick: Couldn't create KWindowShadow for ToolTipDialog(0x56535cca8580)
Nov 27 20:38:30 ryzen kernel: xhci_hcd 0000:06:00.3: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
Nov 27 20:38:30 ryzen kernel: xhci_hcd 0000:06:00.3: Looking for event-dma 00000000ffc368e0 trb-start 00000000ffc368f0 trb-end 00000000ffc368f0 seg-start 00000000ffc36000 seg-end 00000000ffc36ff0
Nov 27 20:38:40 ryzen pipewire[4105]: spa.alsa: surround21:0: snd_pcm_mmap_commit error: Broken pipe
Nov 27 20:39:08 ryzen plasmashell[7578]: ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
Nov 27 20:39:12 ryzen /usr/libexec/gdm-wayland-session[3417]: error in client communication (pid 19408)
Nov 27 20:39:34 ryzen /usr/libexec/gdm-wayland-session[3417]: kwin_libinput: Libinput: event2  - Ultimate Gadget Laboratories Ultimate Hacking Keyboard: client bug: event processing lagging behind by 31ms, your system is too slow

So I guess the issue is the error in client communication? That's not very descriptive...

Or maybe the kernel sound error? But sound still works after restarting firefox fwiw, and I don't know why that would end up being a wayland display flush IO error...

Vlad, does the above give you any hint or any further thing I could try to diagnose this?

Flags: needinfo?(vlad.zahorodnii)

Heh, that's a big wayland debug log. At a quick glance, I don't see anything suspicious besides firefox no longer responding to pointer events before the client connection breaks.

Flags: needinfo?(vlad.zahorodnii)

Ah, I figured out how to reproduce this consistently! It's just a matter of the parent process main thread being busy for long enough. If I open the browser toolbox and in the console I paste:

var s = Date.now(); while (s + 5000 > Date.now());

Which should just stop responding for a bit, instead we crash.

Hmm, apparently not so consistently as I hoped...

But frequently enough

(In reply to Emilio Cobos Álvarez (:emilio) from comment #0)

I've run Plasma for the last couple of days, and sometimes Firefox or Thunderbird will crash without the crash reporter, leaving the following message in the journal:

Lost connection to Wayland compositor.

That's https://gitlab.gnome.org/GNOME/gtk/-/blob/c742debea8d7f2c93f06a5b506851698291cb2da/gdk/wayland/gdkeventsource.c#L237

The bad part is that there's actually no crash report or anything, so we just don't see these crashes...

Martin do you have any idea of how to track this down or possibly intercept these _exit calls from gtk?

Looks like a glib2 issue when some bad condition happens on the socket. Maybe strace can help you here?

Flags: needinfo?(stransky)
See Also: → 1675680

(In reply to Emilio Cobos Álvarez (:emilio) from comment #10)

...
Which should just stop responding for a bit, instead we crash.

For the record, this might trigger a yet-unsolved problem on Wayland: if the main thread is blocked long enough and does not read from the wayland socket, the buffer will fill up. Once that happens, the connection gets terminated. This happens especially fast with e.g. 1000Hz mice, see e.g. https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1915#note_1197910

In case that's the problem, there are two solutions:

  • wait for compositors/libwayland to figure out how to better handle this
  • move wayland socket polling into its own thread (somehow steal it from GTK). This would likely also benefit Vsync timings, as any short hang on the main thread creates stutter in our frame callback based vsyncsource (essentially bug 1675680).
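The failure mode described above can be imitated with a plain Unix socket pair. This is only a sketch in Python, not libwayland: one end plays a client that never reads, the other plays a compositor that writes without blocking, and the point where the kernel refuses more data stands in for the moment the compositor would drop the client. The function name, the 20-byte event size and the buffer size are assumptions for illustration.

```python
import socket


def events_until_disconnect(event_size=20, sndbuf=4096):
    """Count how many fixed-size events are accepted before the sender would block.

    The 'client' end never reads, like a stalled main thread.
    """
    compositor, client = socket.socketpair()
    try:
        # Ask for a small send buffer so the demo finishes quickly. The kernel
        # may adjust the value, so the exact count is platform dependent.
        compositor.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
        compositor.setblocking(False)
        event = bytes(event_size)
        sent = 0
        while True:
            try:
                written = compositor.send(event)
            except BlockingIOError:
                return sent  # buffer full: the client did not read in time
            if written < event_size:
                return sent  # partial write: treat it as full as well
            sent += 1
    finally:
        compositor.close()
        client.close()
```

Calling events_until_disconnect() returns after a finite number of events even though nothing went wrong on the "client" other than not reading, which is the situation a busy main thread ends up in.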

(In reply to Emilio Cobos Álvarez (:emilio) from comment #11)

Hmm, apparently not so consistently as I hoped...

(In reply to Emilio Cobos Álvarez (:emilio) from comment #12)

But frequently enough

If the theory from above / comment 14 applies, then it would only crash if you create events - e.g. moving the mouse over the surface etc.

(In reply to Robert Mader [:rmader] from comment #14)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #10)

...
Which should just stop responding for a bit, instead we crash.

For the record, this might trigger a yet-unsolved problem on Wayland: if the main thread is blocked long enough and does not read from the wayland socket, the buffer will fill up. Once that happens, the connection gets terminated. This happens especially fast with e.g. 1000Hz mice, see e.g. https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1915#note_1197910

Indeed it's much easier to trigger if I scroll the mouse wheel and move the cursor around while this happens... So it seems like a reasonable theory. I'll try to catch it on rr to dig a bit more, ni?ing myself so I don't forget.

In case that's the problem, there are two solutions:

  • wait for compositors/libwayland to figure out how to better handle this
  • move wayland socket polling into its own thread (somehow steal it from GTK). This would likely also benefit Vsync timings, as any short hang on the main thread creates stutter in our frame callback based vsyncsource (essentially bug 1675680).

How hard is (2)? I hit this frequently enough to be annoying :-)

Flags: needinfo?(emilio)

See https://gitlab.freedesktop.org/wayland/wayland/-/issues/159, https://gitlab.freedesktop.org/wayland/wayland/-/issues/237, https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188.

I do not believe there is a good way to move polling to its own thread (Gtk owns the display connection and needs to be the one dispatching the callbacks). One could try to also poll and read into queues from another thread, but that would be competing with the main thread (the queues are locked prior to I/O) and would be quite hacky in my opinion. Threaded poll/read will also increase jitter/latency as most messages have to go back to the main thread anyway for dispatch.
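To make the "poll and read into queues from another thread" idea concrete, here is a rough Python sketch using plain sockets and threads. It is not libwayland or GTK code and it does not model the queue locking or the dispatch hand-off mentioned above; all names are hypothetical. A reader thread drains the socket into an unbounded queue while the main thread is stalled, so the sender never runs out of buffer space.

```python
import queue
import socket
import threading
import time

EVENT = bytes(8)  # stand-in for one protocol message


def run_demo(n_events=50_000, stall_seconds=0.2):
    """Return the number of bytes collected while the 'main thread' was stalled."""
    compositor, client = socket.socketpair()
    inbox = queue.Queue()  # unbounded hand-off queue to the main thread

    def send_events():
        for _ in range(n_events):
            compositor.sendall(EVENT)  # blocks only while the socket buffer is full
        compositor.close()  # EOF tells the reader to stop

    def drain_socket():
        while True:
            chunk = client.recv(65536)
            if not chunk:
                break
            inbox.put(chunk)

    sender = threading.Thread(target=send_events)
    reader = threading.Thread(target=drain_socket)
    sender.start()
    reader.start()

    time.sleep(stall_seconds)  # the main thread is "busy" and reads nothing itself

    sender.join()
    reader.join()
    client.close()

    received = 0
    while not inbox.empty():
        received += len(inbox.get())  # real dispatch would happen here, on the main thread
    return received
```

Note that the memory cost simply moves into the queue, and every message still waits for the main thread before it is handled, which matches the latency concern raised above.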

Thanks Kenny! So if the theory is correct, presumably I'm not hitting this (so often / at all?) on GNOME because:

GNOME detects the client being unresponsive by a timeout, when it does not get a pong reply to a ping event.

So that's an enhancement that KWin could make. But the real fix would be something like https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188

Is my understanding right?

So it's a bit unfortunate that GTK is calling _exit rather than exit(). exit() would be caught by the crash reporter via this hook IIUC, but _exit won't run that code. I'll try to see if GTK can be changed to not use _exit...
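The exit() versus _exit() difference is easy to see in isolation. The sketch below uses Python's atexit and os._exit as an analogy for the C functions; it is not the GTK code or the crash reporter hook, and the names are made up for illustration.

```python
import subprocess
import sys

# Child program: registers an exit hook, then leaves in one of two ways.
CHILD = """
import atexit, os, sys
atexit.register(lambda: print("exit hook ran"))
if sys.argv[1] == "exit":
    sys.exit(1)   # normal exit: registered handlers run
else:
    os._exit(1)   # immediate exit: handlers are skipped
"""


def hook_output(mode):
    """Run the child in the given mode and return what its exit hook printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

hook_output("exit") shows the hook's message, while hook_output("_exit") comes back empty, which is why a hook-based crash reporter never sees the _exit path.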

If I am reading the source right, Mutter's ping timeout is 5 seconds, so that's geared towards severe stalls (and the artificial ones induced with SIGSTOP in the linked issues). The issues discussed within the Wayland community have primarily focused on reasonably responsive clients drowning in event floods like high resolution scrolling and high DPI mouse input without having done anything wrong.
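As a rough model of that ping/pong detection, here is a small Python sketch. It is not Mutter's implementation; the class name is hypothetical and the 5 second default is just the value quoted above. Time is passed in explicitly so the behaviour is easy to follow.

```python
class PingWatchdog:
    """Tracks one outstanding ping and reports whether the client timed out."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout    # seconds; 5 s is the value mentioned above
        self.ping_sent_at = None  # None means no ping is outstanding

    def send_ping(self, now):
        # Keep the timestamp of the oldest unanswered ping.
        if self.ping_sent_at is None:
            self.ping_sent_at = now

    def receive_pong(self):
        self.ping_sent_at = None

    def is_unresponsive(self, now):
        return (
            self.ping_sent_at is not None
            and now - self.ping_sent_at >= self.timeout
        )
```

With a timeout this long, a client that stalls for a few hundred milliseconds during an event flood is never flagged, which is the gap discussed here.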

The solutions discussed in #wayland@irc.oftc.net have so far been a throttling mechanism to prioritize "important" events when a client is seen to not process their input fast enough, and "large enough"/dynamic/unbounded connection buffers like !188 implements. The current connection buffers are just 4KB, so it might very well be that a bigger buffer would be all it takes to give all reasonable clients enough time to come back to a read.
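For a sense of scale, here is a back-of-envelope calculation in Python. The message sizes are my assumption based on the Wayland wire format (an 8-byte header, a pointer motion event with three 32-bit arguments, followed by an argument-less frame event), and the estimate ignores the kernel socket buffer, which holds additional data, so real stalls can last longer before the connection breaks.

```python
HEADER_BYTES = 8            # object id (4 bytes) + size/opcode word (4 bytes)
MOTION_ARGS_BYTES = 12      # time (uint32) + surface_x and surface_y (32-bit fixed)
FRAME_BYTES = HEADER_BYTES  # the frame event is assumed to carry no arguments

# One pointer movement = motion event + frame event (assumption).
BYTES_PER_MOTION = HEADER_BYTES + MOTION_ARGS_BYTES + FRAME_BYTES


def seconds_until_full(buffer_bytes, events_per_second,
                       bytes_per_event=BYTES_PER_MOTION):
    """Time for back-to-back events to fill buffer_bytes if nothing is read."""
    events_that_fit = buffer_bytes // bytes_per_event
    return events_that_fit / events_per_second
```

Under these assumptions seconds_until_full(4096, 1000) is about 0.15 s for a 1000Hz mouse, and seconds_until_full(4096, 125) is a bit over one second for a 125Hz one.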

It might be a good idea to comment on the Wayland issues to highlight that clients are experiencing issues, but in that case we want to establish which category we're in: Reasonable stalls (tens or hundreds of milliseconds) with event floods, or severe stalls (seconds) with low-to-normal rates of events.

FTR, for Mutter there's also a workaround being discussed as we want to unthrottle input devices for Gnome 42 (likely too early for the Wayland solution), which in case of 1000Hz mice would otherwise easily crash many clients like games.

I haven't hit this (so much?) recently, but it seems this is well-understood at this point and there's not all that much we can do on our side (other than avoid jank when possible of course).

Flags: needinfo?(emilio)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #22)

I haven't hit this (so much?) recently, but it seems this is well-understood at this point and there's not all that much we can do on our side (other than avoid jank when possible of course).

I agree, we can't do much to avoid the crash but we may be able to improve our crash reports via https://gitlab.gnome.org/GNOME/gtk/-/issues/4514#note_1328534.

See Also: → 1749574
Depends on: 1759785

I am seeing this too with Thunderbird on GNOME. Thunderbird hangs for a while then exits. No crash report, but I get this in the logs.

% thunderbird
Gdk-Message: 15:59:45.495: Lost connection to Wayland compositor.
Exiting due to channel error.
Exiting due to channel error.

I'll try to grab the status with WAYLAND_DEBUG=1.

This happens very frequently on startup, presumably because there is a lot of background work while the UI hangs. However this often happens during the "steady state" as well. I see this frequently, probably more than once per day on average.

Let me know if I should file a Thunderbird-specific issue, because this seems to be exacerbated by the general slowness of Thunderbird with my not-small mailboxes, but the root cause is likely the same between the two programs. Presumably moving slow things off of the main thread would sufficiently mitigate this issue.

Thunderbird is not supposed to run on Wayland yet as it's based on old ESR91 line.

Attached file thunderbird.log

Oh ok. I guess I accidentally hardcoded it on when trying to use Firefox on Wayland. Well in case it is helpful anyways...

I've been getting this for months on Fx...

Jul 28 13:01:09 firefox[42125]: Lost connection to Wayland compositor.
Jul 28 13:01:09 plasmashell[43933]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43930]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43793]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43753]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43749]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43402]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43856]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43398]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43266]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[43175]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[42432]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[42324]: Exiting due to channel error.
Jul 28 13:01:09 plasmashell[42283]: Exiting due to channel error.
Jul 28 13:01:14 systemd[1786]: app-firefox-b6455ea50864449eb4b19acaeb9d238a.scope: Consumed 2min 33.683s CPU time.

Still happening with Fx 103.

BTW, this isn't happening with Chrome and Ozone. Would be nice to get someone to look at it.

(In reply to gbcox from comment #30)

BTW, this isn't happening with Chrome and Ozone. Would be nice to get someone to look at it.

Please see above, e.g. comment 22. There is currently little we can do about it without bigger architectural changes. Wayland compositors can work around the issue pretty far (Gnome does, for example), but there's also a need for a clean solution on the Wayland core level.

BTW, this isn't happening with Chrome and Ozone.

I'm not so sure. I recently somehow managed to cause a strange emergent busy-loop.
(uBlock Origin hiding a modal, causing the website to sometimes go into a frenzy,
presumably attempting to show it as long as it's not visible, but uBO kept hiding it?)

The only reason I was able to put 2 and 2 together is that Firefox started crashing
in minutes of me replicating that uBO rule on it - which led me to this issue.

And AFAICT it's easy to repro using the busy-looping that emilio posted, i.e.:

var s = Date.now(); while (s + 5000 > Date.now());

But you need to inspect an extension instead, to do this in Chrome in an impactful
way (I used "Inspect popup" in the menu for uBO, I suspect most extensions will work).
If it doesn't work at first, just multiply the constant by 2 or 3.

FWIW what I repro'd with was Google Chrome Stable 104.0.5112.101, on Linux/x64.

This is how it then failed for me (manually wrapped/spaced for readability):

Tue 2022-08-30 20:06:17 EEST eddyb-nix user@1000.service/init.scope[14560]:
    app-google\x2dchrome-c48d6f01e8f940899cf94e1a9c8b794f.scope: Consumed 3min 29.538s CPU time.

Tue 2022-08-30 20:06:17 EEST eddyb-nix user@1000.service/app-google\x2dchrome-c48d6f01e8f940899cf94e1a9c8b794f.scope[2034552]:
    [2034002:2034002:0830/200617.463836:ERROR:wayland_event_watcher.cc(61)] Fatal Wayland communication error: Broken pipe.

Tue 2022-08-30 20:06:17 EEST eddyb-nix user@1000.service/plasma-kwin_wayland.service[15235]:
    error in client communication (pid 2034002)

I can't find a matching Chrome bug report other than maybe https://crbug.com/1290059
(but that still could be unrelated). NB: I don't plan to report this to the Chrome bug
tracker myself, since this is so many layers of "drive-by", I'm liable to waste their time.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #26)

Thunderbird is not supposed to run on Wayland yet as it's based on old ESR91 line.

Is this supposed to be fixed now with Thunderbird 102?
I still very often experience Thunderbird on Wayland randomly crashing the KDE compositor kwin (or some other component?).
The result is always that the mouse cursor is stuck, and I can't do anything any more (music for example continues to play tho).

This does not happen on a certain action, mostly I just hover over something in the application, but those crashes only appear when Thunderbird Wayland is open.

(In reply to lrdarknesss from comment #33)

Is this supposed to be fixed now with Thunderbird 102?

No, please see comment 31 / comment 22. Kwin could do better (Gnome/Mutter does) by avoiding sending pointer events to unresponsive clients, FF/TB could do better in avoiding main thread stalls (AFAIK TB is worse than FF in this regard) - and there could be some bigger changes in our Wayland architecture, creating a Wayland thread. Unfortunately this is a bit hard to pull off in a clean way. Finally there are discussions to handle this better on the Wayland library level.
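One way a compositor could avoid flooding an unresponsive client is to collapse consecutive pointer motion events, since only the latest position matters. The Python sketch below is purely illustrative and is not how KWin or Mutter actually do it; the class name and the event representation are hypothetical.

```python
from collections import deque


class ClientEventQueue:
    """Pending events for one client; collapses pointer motion when it lags."""

    def __init__(self):
        self.pending = deque()
        self.unresponsive = False  # would be set by something like a ping timeout

    def push(self, kind, payload):
        if (
            self.unresponsive
            and kind == "motion"
            and self.pending
            and self.pending[-1][0] == "motion"
        ):
            # Only the latest pointer position matters; overwrite the previous one.
            self.pending[-1] = (kind, payload)
        else:
            # Everything else (buttons, keys, first motion after them) is kept in order.
            self.pending.append((kind, payload))
```

While the client is flagged as unresponsive, a burst of a thousand motion events costs one queue slot instead of a thousand, and button or key events are still delivered in order.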

Okay, so Thunderbird Wayland is not a viable option atm, at least not on KDE.

I'm also seeing this on sway several times an hour (using nightly).

Gdk-Message: 08:29:24.126: Lost connection to Wayland compositor.

Thread 1 "firefox-bin" hit Breakpoint 3, __GI__exit (status=status@entry=1) at ../sysdeps/unix/sysv/linux/_exit.c:27
27	{
(gdb) bt
#0  __GI__exit (status=status@entry=1) at ../sysdeps/unix/sysv/linux/_exit.c:27
#1  0x00007ffff636b7c3 in _gdk_wayland_display_queue_events (display=<optimized out>) at ../gtk/gdk/wayland/gdkeventsource.c:211
#2  0x00007ffff6338029 in gdk_display_get_event (display=0x7ffff7844200) at ../gtk/gdk/gdkdisplay.c:442
#3  0x00007ffff6372728 in gdk_event_source_dispatch (base=<optimized out>, callback=<optimized out>, data=<optimized out>) at ../gtk/gdk/wayland/gdkeventsource.c:120
#4  0x00007ffff621d87b in g_main_dispatch (context=0x7fffe850adf0) at ../glib/glib/gmain.c:3454
#5  g_main_context_dispatch (context=0x7fffe850adf0) at ../glib/glib/gmain.c:4172
#6  0x00007ffff6274c89 in g_main_context_iterate.constprop.0 (context=0x7fffe850adf0, block=0, dispatch=1, self=<optimized out>) at ../glib/glib/gmain.c:4248
#7  0x00007ffff621c132 in g_main_context_iteration (context=0x7fffe850adf0, may_block=0) at ../glib/glib/gmain.c:4313
#8  0x00007fffed6f989f in nsThread::ProcessNextEvent(bool, bool*) () at /home/the8472/opt/firefox/libxul.so
#9  0x00007fffed744054 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) () at /home/the8472/opt/firefox/libxul.so
#10 0x00007fffee2a2e5f in MessageLoop::Run() () at /home/the8472/opt/firefox/libxul.so
#11 0x00007fffee9b8469 in nsBaseAppShell::Run() () at /home/the8472/opt/firefox/libxul.so
#12 0x00007fffec451d45 in nsAppStartup::Run() () at /home/the8472/opt/firefox/libxul.so
#13 0x00007fffec4dac9b in XREMain::XRE_mainRun() () at /home/the8472/opt/firefox/libxul.so
#14 0x00007fffec4db683 in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
#15 0x00007fffec4dba66 in XRE_main(int, char**, mozilla::BootstrapConfig const&) () at /home/the8472/opt/firefox/libxul.so
#16 0x00005555555e7cfc in main ()

Edit: note that this kills firefox in a way that it doesn't even submit crash reports.

(In reply to Robert Mader [:rmader] from comment #14)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #10)

...
Which should just stop responding for a bit, instead we crash.

For the record, this might trigger a yet-unsolved problem on Wayland: if the main thread is blocked long enough and does not read from the wayland socket, the buffer will fill up. Once that happens, the connection gets terminated. This happens especially fast with e.g. 1000Hz mice, see e.g. https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1915#note_1197910

Is it possible to increase the buffer size?

While I get that a proper solution would need deeper architectural changes, this issue has been around for a while with no remedy, so the feasibility of possible workarounds should be at least considered:

  • The mentioned buffer size increase could help with the Wayland design problem. Not sure how feasible it is though as I haven't found much info on it.
  • Misbehaving websites could and should be throttled. That problem is likely worthy of a whole separate discussion, but the problem is quite relevant here as resource abuse is what's causing the stall leading to this issue.

I've run into some bloated pages now and then causing this issue, or alternatively if the site was still barely okay on its own, then inspecting the website often pushed it into "crash" territory.
However I lately tend to run into this frequently with Discord which was always a notorious resource hog, and I guess they've kept on adding bloat until it was too much, so nowadays I regularly have to recover the browser session after looking around there. Not necessarily good or relevant metrics, but checked one of the heavier servers, and switching to a channel with plenty of embedded content issues 643 network requests, and seems to finish in 900+ ms according to network measurements when everything is cached. In line with others' observations, I tend to encounter "crashing" most often when I don't keep the cursor perfectly still during these resource abusing events.

While this isn't an issue with most websites even on older and less performant laptops with iGPUs, the fact that a single website can stall event processing long enough even on desktops which are more than powerful enough for the task is a significant problem as it spoils the whole browser session.

See Also: → 1792754

Can confirm. See also https://bugs.kde.org/show_bug.cgi?id=469696

[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
[Child 15731, Main Thread] WARNING: JSWindowActorChild::SendRawMessage (Conduits, ConduitClosed) not sent: !CanSend() || !mManager || !mManager->CanSend(): file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorChild.cpp:57
Gdk-Message: 16:20:03.063: Lost connection to Wayland compositor.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.

The problem only happens to me when the Firefox window is shown. For example, when I have the Firefox window on another virtual desktop, everything is fine until I switched to that virtual desktop.

Duplicate of this bug: 1844690
Summary: [Wayland/KDE] Firefox and Thunderbird crash on Plasma sometimes with "Lost connection to Wayland compositor." → [Wayland/KDE/Sway] Firefox and Thunderbird crash on Plasma sometimes with "Lost connection to Wayland compositor."
Summary: [Wayland/KDE/Sway] Firefox and Thunderbird crash on Plasma sometimes with "Lost connection to Wayland compositor." → [Wayland/KDE/Sway] Firefox and Thunderbird sometimes crash with "Lost connection to Wayland compositor."
Summary: [Wayland/KDE/Sway] Firefox and Thunderbird sometimes crash with "Lost connection to Wayland compositor." → [Wayland/KDE/Sway] Firefox and Thunderbird sometimes crash with "Lost connection to Wayland compositor." without crash report

Since the issue title says "sometimes" I want to note that for me this happens so frequently that it renders firefox unusable under wayland.
It takes several attempts to even start up and afterwards it only takes a few minutes until it crashes from regular browsing.

In my case that's likely due to the combination of a high refresh rate mouse + a heavy firefox session + sway (which doesn't implement workarounds for this bug).

It does not happen often for me, but there are certain scenarios where it consistently happens.
On Firefox that would be opening a lot (30 or so) tabs at the same time while moving the mouse.
On thunderbird when moving the mouse over the window while it is processing a lot of mails, for example archiving a couple of thousand mails.

Tested using Arch, Plasma Wayland
Firefox 117.0.1
Thunderbird 115.2.2

See Also: → 1859267

Bug 1859267 should at least give us visibility into how frequent this is.

This no longer seems to happen on latest KWin fwiw. Vlad did you implement a workaround or something out of curiosity?

Flags: needinfo?(vlad.zahorodnii)
Summary: [Wayland/KDE/Sway] Firefox and Thunderbird sometimes crash with "Lost connection to Wayland compositor." without crash report → [Wayland/Sway] Firefox and Thunderbird sometimes crash with "Lost connection to Wayland compositor." without crash report

Weird. No, we did not land anything to address this issue.

Flags: needinfo?(vlad.zahorodnii)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #46)

This no longer seems to happen on latest KWin fwiw. Vlad did you implement a workaround or something out of curiosity?

Unfortunately I can't find what happened anymore, but if I remember well, there was a change on the KDE side which significantly mitigated the issue.
Remembering it that way because this was a common issue with programs ignoring GUI needs while working on some other task, even including KDE programs like Krusader, and I could reproduce the issue within minutes, but it got way better after a Plasma update.

I'm not sure the issue is completely gone though for the following reasons:

  • What I remember was just a mitigation which just significantly lowers the chance of a crash, but doesn't completely eliminate it. I believe updates just get throttled if the program is unresponsive, but not completely stopped, so the connection buffer can still get filled completely.
  • Firefox still crashes occasionally under "heavy load" the same way it did earlier, nowadays it's just maybe once a month instead of having a quick reproducer of going on a malware-like site like Discord and just moving around the cursor at key resource abusing moments

Aside from the fundamental issue of not tending to window manager events, it may help to know that by heavy load I mean having likely hundreds of tabs open in multiple windows. I'm not fully sure if window count matters as it's usually correlated to tab count for me to some degree, but generally the more windows I have open, the more likely I observe the following problems:

  • During handling one of the windows, Firefox just silently crashes, closing all windows. As it happens rarely nowadays I haven't verified if it's surely the issue being discussed here, but the symptom is the same
  • A window (so far never the first/main window) stops painting content while remaining functional evident from forced window content updates obtainable through switching to another window and then back. Now this might not be related to this issue, just noting that a multi-window "stress test" comes with its odd problems
    Consider it anecdotal, but I haven't seen either of these problems with just a single window even with likely a 100 or so tabs, so reproducing race condition kind of issues is likely easier with multiple windows.
    Also, weaker hardware (mostly laptop) and heavy I/O load in the background surely helped with seeing more issues. The latter was quite interesting as I used to see KDE background services crash just because I was moving around a lot of data through NFS, and while doing so, Firefox was definitely easier to crash.

Aside from not blocking the event loop tending to GUI needs, the real solution would be something like this: https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188
Tending to events as soon as possible helps with the problem, but we are not dealing with realtime scheduling here, programs can go an arbitrary long time without getting scheduled which apparently tends to be a problem mostly during heavy I/O load, so it needs to be ensured that the Wayland connection buffer doesn't get completely filled in such situations.

Aside from not blocking the event loop tending to GUI needs, the real solution would be something like this: https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188

Note that some people are pushing back against solutions like that because the compositor shouldn't bear a potentially unbounded memory cost (which would cause reliability issues too) due to unresponsive clients.

An alternative is for the client library to handle connection losses more gracefully (like a GPU driver reset): https://invent.kde.org/plasma/kwin/-/wikis/Restarting

I can also still reproduce this on the latest stable plasma version with kwin.
More easily in thunderbird with a very large inbox than with firefox, but it seems to still occur.
Heavy loads combined with mouse movements.

(unless you were talking about plasma 6 or other preview / not yet stable versions)

(In reply to The 8472 from comment #49)

Note that some people are pushing back against solutions like that because the compositor shouldn't bear a potentially unbounded memory cost (which would cause reliability issues too) due to unresponsive clients.

I'm aware of the controversial nature, and I've meant the linked change to be one of the possible solutions, not necessarily the best way, but it also goes in 2 directions:

  • Changes the hardcoded buffer size to be just merely the default buffer size. I don't think this change is that horrible, even if it introduces an indirection for the buffer. The current buffer size (at least with the current event handling strategy) is obviously insufficient, and I'm saying that without hardware I consider fancy. Even knowing that a program was unresponsive, all I had to do was just moving the cursor through it somewhere else, and by the time I've realized my mistake, it was already late. I get the anti-bloating argument, but the other perspective is that the buffer gets filled up in a fraction of second before the user can even react.
  • Adds dynamic buffer control with the mentioned unbounded memory cost. I get the problems with that, and this one is surely not necessarily desirable. Aside from silly user hostile program components like "anti-cheats" possibly getting upset about this, I don't think there's a need to keep all events forever if the program is unable to keep up, but on desktops where memory is so abundant, it surely feels odd not having the possibility to handle even just a couple seconds of event processing backlog
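The two directions can be combined into a buffer that starts at a default size and grows only up to a hard cap. The Python sketch below is not the code from the linked merge request; the class and parameter names are hypothetical, and what happens once the cap is hit (drop the event or disconnect) is left to the caller.

```python
class ConnectionBuffer:
    """Byte buffer that starts small and may grow up to a hard cap."""

    def __init__(self, default_size=4096, max_size=4096):
        if default_size <= 0:
            raise ValueError("default_size must be positive")
        self.capacity = default_size
        self.max_size = max(default_size, max_size)
        self.data = bytearray()

    def write(self, message):
        """Queue message; return False if it would exceed the cap."""
        needed = len(self.data) + len(message)
        if needed > self.max_size:
            return False  # caller decides: drop the event or disconnect the client
        while self.capacity < needed:
            # Grow by doubling, but never past the cap.
            self.capacity = min(self.capacity * 2, self.max_size)
        self.data += message
        return True

    def read_all(self):
        """Hand everything queued so far to the reader and empty the buffer."""
        out, self.data = bytes(self.data), bytearray()
        return out
```

With max_size left at its default this behaves like a fixed 4 KB buffer; raising max_size gives a slow client more headroom while keeping the memory cost bounded.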

(In reply to qlum from comment #50)

I can also still reproduce this on the latest stable plasma version with kwin.
More easily in thunderbird with a very large inbox than with firefox, but it seems to still occur.
Heavy loads combined with mouse movements.

(unless you were talking about plasma 6 or other preview / not yet stable versions)

I'm actually quite behind with improvements as far as desktop system needs go:

Operating System: Kubuntu 23.04
KDE Plasma Version: 5.27.4
KDE Frameworks Version: 5.104.0
Qt Version: 5.15.8
Kernel Version: 6.2.0-33-generic (64-bit)
Graphics Platform: Wayland

Kubuntu 23.04 brought a lot of stability improvements to me, including the mitigation of this problem, but as mentioned earlier likely not a complete fix. Kubuntu is not really "fresh" when it comes to KDE changes to begin with, and KDE is improving rapidly in the past few years, so when it comes to desktop usage, even 22.04 (latest LTS) is already considered quite old.

I'm also using Thunderbird with multiple folders containing thousands of emails each, but I don't remember experiencing this problem there even when it was a problem elsewhere. Even the current 115.3.0 version is fine despite the recent UI redesign making it less responsive.

(In reply to Pedro [:pedrov] from comment #51)

(In reply to qlum from comment #50)

I can also still reproduce this on the latest stable plasma version with kwin.
More easily in thunderbird with a very large inbox than with firefox, but it seems to still occur.
Heavy loads combined with mouse movements.

(unless you were talking about plasma 6 or other preview / not yet stable versions)

I'm actually quite behind with improvements as far as desktop system needs go:

Operating System: Kubuntu 23.04
KDE Plasma Version: 5.27.4
KDE Frameworks Version: 5.104.0
Qt Version: 5.15.8
Kernel Version: 6.2.0-33-generic (64-bit)
Graphics Platform: Wayland

Kubuntu 23.04 brought a lot of stability improvements to me, including the mitigation of this problem, but as mentioned earlier likely not a complete fix. Kubuntu is not really "fresh" when it comes to KDE changes to begin with, and KDE is improving rapidly in the past few years, so when it comes to desktop usage, even 22.04 (latest LTS) is already considered quite old.

I'm also using Thunderbird with multiple folders containing thousands of emails each, but I don't remember experiencing this problem there even when it was a problem elsewhere. Even the current 115.3.0 version is fine despite the recent UI redesign making it less responsive.

In firefox I reproduced it with some effort by opening reddit 250 times and dragging tabs around for a while. (To open that many tabs: copy / paste the url in a text editor, paste the result into a bookmark folder, and open all bookmarks.)

In thunderbird, I have access to a mail account that gets around 5k new mails a month (mostly notifications), dragging those 5k to archive then moving the mouse around is a very consistent crash.

I will say it has gotten harder to trigger the issue in the last few months.

I am encountering this several times a day on Arch with firefox 118.0.2-1, sway-1.8.1-1, wlroots-0.16.2-2, and gtk3-3.24.38-1, using amdgpu on Polaris.
It seemed to worsen over the past few months as I kept updating these packages.
I can confirm the same lack of crash reporter and terminal output. It only happens when I am interacting with Firefox.

The lack of crash reports may be hiding how prevalent this bug is.

I am using a 1000Hz mouse with a freely scrolling wheel, as well as several windows with hundreds of tabs, so I can see why I would be especially affected.
This has become a serious impediment to getting any work done, and it will force me off of Firefox so as to keep Sway and native Wayland browsing.

I will look into adjusting buffer sizes of the socket for the Wayland connection via the file descriptor as the8472 pointed out on the sway issue.
I will very likely need assistance though, especially if this can't be achieved directly in Bash.

If I can do anything else to help debug, please let me know. I can rebuild with debugging symbols and attach a debugger.

There should be crash reports on nightly (bug 1859267). But AIUI, this doesn't seem very actionable from our end, unless we give up on GTK or steal its wayland events somehow from another thread...

this doesn't seem very actionable from our end

There are workarounds that firefox could implement

A) increase the socket buffer size - This would buy some time until a crash but not help when the main thread is blocked for extended amounts of time

B) implement a unix socket proxy (forwarding payloads + file descriptors) that runs on another thread and buffers backlogged messages in userspace. GTK would then be pointed to the proxy instead of the real wayland socket - this could buy arbitrary amounts of time as long as that thread doesn't get suspended or starved. the thread could maybe use a realtime scheduling priority so it can keep making progress even if the system is oversubscribed

They're not perfect solutions but they'd hopefully paper over a large fraction of those stall-induced crashes.

Summary: [Wayland/Sway] Firefox and Thunderbird sometimes crash with "Lost connection to Wayland compositor." without crash report → [Wayland/Sway] Firefox and Thunderbird sometimes crash with "Lost connection to Wayland compositor."
Depends on: 1860153
See Also: → 1861980

Aside from the problem logically still existing even if it's less commonly seen, just updating my report for clarity that I'm no longer uncertain, I can definitely still experience this problem.

The key contributing factor in my case seems to be really heavy background I/O. It got so unusually abusive that it sometimes delayed even desktop actions by more than a minute and made many programs unresponsive. That turned out to be a surprisingly good stability test: well-behaving programs survived without issues, one program likely lost its PipeWire connection (it could never produce sound again but otherwise stayed alive), and Firefox just crashed shortly after the cursor was moved over it while it wasn't responsive.

My abusive workload was deleting millions of heavily referenced (tons of reflinked duplicates) files on a compressed (zstd:1, could be way "worse") Btrfs partition. Surely not a usual problem, but for bug reproducing purposes it beats everything I've seen before, at least excluding the "good old" HDD swap file hell which likely wouldn't be taken too seriously these days anymore.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #54)

There should be crash reports on nightly (bug 1859267). But AIUI, this doesn't seem very actionable from our end, unless we give up on GTK or steal its wayland events somehow from another thread...

It may be possible to bundle gtk3 library and do changes directly there. That may also solve recent gtk3 issues like missing xdg-popup resize or add direct rendering to wl_surface owned by gtk widget. But I'm not sure how that will work with other gtk3 components like ATK or IM and so.

(In reply to The 8472 from comment #55)

B) implement a unix socket proxy (forwarding payloads + file descriptors) that runs on another thread and buffers backlogged messages in userspace. GTK would then be pointed to the proxy instead of the real wayland socket - this could buy arbitrary amounts of time as long as that thread doesn't get suspended or starved. the thread could maybe use a realtime scheduling priority so it can keep making progress even if the system is oversubscribed

I'm not sure how to open GdkDisplay over existing wayland display connection. I don't think Gdk provides such API.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #58)

(In reply to The 8472 from comment #55)

B) implement a unix socket proxy (forwarding payloads + file descriptors) that runs on another thread and buffers backlogged messages in userspace. GTK would then be pointed to the proxy instead of the real wayland socket - this could buy arbitrary amounts of time as long as that thread doesn't get suspended or starved. the thread could maybe use a realtime scheduling priority so it can keep making progress even if the system is oversubscribed

I'm not sure how to open GdkDisplay over existing wayland display connection. I don't think Gdk provides such API.

WAYLAND_DISPLAY can be an absolute path. So one could set up the proxy and then set the environment variable and let gdk connect to it.

(In reply to The 8472 from comment #59)

WAYLAND_DISPLAY can be an absolute path. So one could set up the proxy and then set the environment variable and let gdk connect to it.

That's interesting. There's also the WAYLAND_SOCKET env variable which may be even more useful. Is there any example implementation of the 'unix socket proxy' thing we can use as a template or any library which implements it? My first google search hasn't revealed anything useful.

Priority: -- → P1
Flags: needinfo?(stransky)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #60)

(In reply to The 8472 from comment #59)
That's interesting. There's also the WAYLAND_SOCKET env variable which may be even more useful. Is there any example implementation of the 'unix socket proxy' thing we can use as a template or any library which implements it? My first google search hasn't revealed anything useful.

I'm not aware of any, it probably has to be written from scratch. Waypipe has to do something similar internally but in a more complicated way than a simple unix socket proxy because it has to serialize the wayland protocol instead of just passing through bytes/descriptors.
Structurally it should be similar to implementing a TCP proxy. Plus the SCM_RIGHTS passing.
So that would be an epoll loop to support non-blocking IO to handle multiple connections on a single thread, a bunch of per-connection buffers and then calling sendmsg/recvmsg with cmsg handling to pass the descriptors, if any.
And one would have to study the wayland protocol to figure out what's the maximum amount of file descriptors it can send per number of bytes received to size the buffers properly (or use MSG_PEEK probing to figure out when buffers need to be resized).

https://man7.org/linux/man-pages/man7/unix.7.html
https://man7.org/linux/man-pages/man3/cmsg.3.html
https://man7.org/linux/man-pages/man2/recvmsg.2.html
https://man7.org/linux/man-pages/man2/sendmsg.2.html
https://man7.org/linux/man-pages/man7/epoll.7.html

Thanks, looks like a bigger project. It may be worth implementing it as a standalone library so other wayland apps can benefit from it.

I hacked up a quick demo that manages to prevent firefox crashes when dropping a file on google image search (bug 1792754) or running the reproducer from comment #10
Emphasis on hacked. The code is shoddy.

https://github.com/the8472/weyland-p5000

Cool, Thanks! So at least we know this path is viable.
I was thinking about implementation by GIOStream (https://docs.gtk.org/gio/class.IOStream.html) which should handle all aspects of file/socket handling.

Unless it's documented somewhere I would not expect that to handle the file-descriptor passing. IO streams usually just mean bytes.

Duplicate of this bug: 1860153

Copying crash signatures from duplicate bugs.

Crash Signature: [@ HandleGLibMessage]

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 5 desktop browser crashes on Linux on beta
  • Top 5 desktop browser crashes on Linux on release (startup)

For more information, please visit BugBot documentation.

Attached file thunderbird.log

WAYLAND_DEBUG=1 thunderbird

Attached another crash report on sway. Can reproduce consistently with a large inbox and several calendars. No errors in journalctl/dmesg. No OOM conditions either.

thunderbird=115.5.0-1
wayland=1.22.0-1
sway=1:1.8.1-4
wlroots=0.17.0-1
gtk3=1:3.24.38-1

Running Arch Linux (x86_64) with linux=6.6.2-arch1-1.

I'm not sure where I read it, but I think it was Emilio who brought up the idea to just statically build a local copy of GTK3. That would allow us to fix this issue quite nicely - we could introduce a new thread that does the polling on the Wayland socket and is never blocked by other stuff on the main thread.

It would additionally allow us to do all kind of small changes that make our life easier - we could go as far and ditch the content subsurface and just render to the xdg-toplevel - and still from time to time rebase on upstream, just like we do for libwebrtc, which has a way higher volume of changes.

Mutter did that with Clutter and it was a very good decision for its use-case - and with most apps being ported to GTK4, FF is often the only app still using GTK3 anyway.

Emilio, was it you who suggested that - and what do you think about it?

Flags: needinfo?(emilio)

It was Martin (so 301 him). We'd also need to link gdk statically too, right? One concern I'd have would be how it'd interact with the rest of stuff in the system (like gtk modules etc). If we do binary-incompatible changes to gtk, then we'd be in ABI hell...

It would additionally allow us to do all kind of small changes that make our life easier - we could go as far and ditch the content subsurface and just render to the xdg-toplevel.

I think this is somewhat unrelated, but seems worth reconsidering our widget set up to stop using GTK for most stuff (other than to read theming information and settings perhaps). That means reworking how a lot of things like drag and drop, IME, pointer events, rendering, etc work without a GtkWindow. Might be worth it in the end tho, and it seems it could be incremental? E.g., we could work on a widget/gtk/WindowWayland that did those kinds of things you're suggesting.

If we'd end up statically linking gtk, then your approach would also work (and be less work of course), but it seems at that point we could just remove a fair bit of abstraction layers altogether?

we could introduce a new thread that does the polling on the Wayland socket and is never blocked by other stuff on the main thread.

It seems that as long as we use gtk/gdk for stuff like settings and widget styling / rendering, that'd open a default display connection on the main thread here. If we could just replace that event source by our own I think that'd allow us to fix this in the way you're suggesting... It might be worth trying to do that without having to fork gdk?

Flags: needinfo?(emilio)
Crash Signature: [@ HandleGLibMessage] → [@ HandleGLibMessage] [@ <name omitted> | HandleGLibMessage ]

(In reply to The 8472 from comment #63)

I hacked up a quick demo that manages to prevent firefox crashes when dropping a file on google image search (bug 1792754) or running the reproducer from comment #10
Emphasis on hacked. The code is shoddy.

https://github.com/the8472/weyland-p5000

Quick report after one full day of testing - this demo successfully prevents all crashes for me.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #72)

It was Martin (so 301 him). We'd also need to link gdk statically too, right? One concern I'd have would be how it'd interact with the rest of stuff in the system (like gtk modules etc). If we do binary-incompatible changes to gtk, then we'd be in ABI hell...

We'd also need a big list of build requires.

Flags: needinfo?(stransky)
Summary: [Wayland/Sway] Firefox and Thunderbird sometimes crash with "Lost connection to Wayland compositor." → [Wayland] "Lost connection to Wayland compositor."

I think the easiest immediate fix will be the proxy implementation. But I wonder how to incorporate it into Firefox.

Emilio, may we use the rust demo here? I have zero experience with Rust so I have no idea if that can be utilized somehow. Another option is to implement it in C++ which can be easily integrated then.

Flags: needinfo?(emilio)
Flags: needinfo?(stransky)
Duplicate of this bug: 1868760

Will look at C++ implementation which can be used with Firefox internally.

Assignee: nobody → stransky

The rust implementation should be straight-forward to use fwiw, depending on how you want to build it / run it and such.

Flags: needinfo?(emilio)

Comment on attachment 9368399 [details]
WIP: Bug 1743144 [Wayland] Wayland proxy wip

There's an initial Wayland proxy implementation. It's based on https://github.com/the8472/weyland-p5000 WIP.
For some reason Wayland clipboard doesn't work with this proxy while the original code (weyland-p5000) is fine.

I don't know if it causes the clipboard issue but your code isn't dealing with partial writes. Though if that were the problem i'd expect the assert(ret == mData.size()); to trigger.

In the sway issue a user reported that using a proxy by itself is not sufficient because firefox overloads the system to the point where the proxy gets slowed down too.

Increasing the socket buffer sizes did seem to help.
Whether adjusting thread priorities of the proxy (up) and firefox (down) would help is currently being tested.

(In reply to The 8472 from comment #81)

I don't know if it causes the clipboard issue but your code isn't dealing with partial writes. Though if that were the problem i'd expect the assert(ret == mData.size()); to trigger.

I moved a bit forward here. It looks like when I use the proxy, reading clipboard data through the provided fd (the one passed by sendmsg) from application to compositor never finishes. With the proxy, the compositor receives the data from the application but doesn't recognize that it's complete and keeps waiting for more.

If app is run without proxy, fd passed by sendmsg is closed from application side (I guess) so compositor gets POLLHUP:

read(32, "Accept", 4194304)             = 6
poll([{fd=32, events=POLLIN}], 1, 0)    = 1 ([{fd=32, revents=POLLHUP}])
poll([{fd=32, events=POLLIN}], 1, -1)   = 1 ([{fd=32, revents=POLLHUP}])
read(32, "", 4194298)                   = 0
close(32)                               = 0

(The "Accept" string is text data sent from application to compositor.)

Are you closing the file descriptors after sending them? In Rust OwnedFd does that automatically when dropping them.
Basically, SCM_RIGHTS sends a dup of the file descriptor, it doesn't consume them.

(In reply to The 8472 from comment #84)

Are you closing the file descriptors after sending them? In Rust OwnedFd does that automatically when dropping them.
Basically, SCM_RIGHTS sends a dup of the file descriptor, it doesn't consume them.

Yes, looks like that's the problem here.

(Martin Stránský [:stransky] from comment #75)

I think the easiest immediate fix will be the proxy implementation. But I wonder how to incorporate it into Firefox.

Emilio, may we use the rust demo here? I have zero experience with Rust so I have no idea if that can be utilized somehow. Another option is to implement it in C++ which can be easily integrated then.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #78)

The rust implementation should be straight-forward to use fwiw, depending on how you want to build it / run it and such.

Can you make patch that shows how to integrate https://github.com/the8472/weyland-p5000 ?
(If we can have Rust, why rewrite it in legacy C++?)

Flags: needinfo?(emilio)

If you want to use my rust code I'll have to clean that up to turn it into a library and do clean error handling. I can do that, but only if there's intent to use it.

Sorry if I missed this in the discussion but it seems like there is a lot of work considering how to mitigate this by ensuring that we are consistently reading events (either internally or via a proxy) but it seems like we haven't considered fixing the main issue.

Fundamentally, without requiring a real-time OS we can't assume that we (or a proxy that we spawn) can respond to events within any particular deadline. Even without considering SIGSTOP, busy CPUs can starve our processes, and the compositor may be running at a higher priority where it continues to send us events.

It seems that the reliable fix is assuming that we will occasionally get disconnected and support reconnecting. IIUC this is what Qt did. The stated reason in that patch seems to focus on compositor crashes but it seems like this will also solve the temporary hang + disconnect case. This seems like a more complete solution to me because it can handle complete process stops, machines stalling for any reason and compositor crashes.

IIUC we will need some support from GTK to gracefully handle the disconnection. But I would be surprised if they weren't interested in supporting this.

Maybe the proxy is a decent mitigation in the meantime, but it seems to be just an improvement to the current state, rather than a fundamental fix.

Yeah, we'd need to turn it into a lib, and expose it via an FFI function. An empty crate with that crate as a dependency and a cbindgen.toml file, like dom/base/rust, would do. I can help martin if he wants to pursue this.

Flags: needinfo?(emilio)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #89)

Yeah, we'd need to turn it into a lib, and expose it via an FFI function. An empty crate with that crate as a dependency and a cbindgen.toml file, like dom/base/rust, would do. I can help martin if he wants to pursue this.

I have almost production ready C++ code for it. The thing here is that I don't know rust code so if we adopt the rust one we also need a maintainer for it. If we use C++ one it can be maintained as part of toolkit/gtk code. So if we want to go the Rust path we also need a maintainer for it.

The latest C++ proxy version is attached. There are still some TODOs to check, and I'd like to run some performance tests and add a build option to make it optional.

I'd prefer to use C++ code as I understand it but if there's strong feeling for Rust path I'm not against it (if we have a maintainer for it).

I did some speed testing and I'm getting 158-195 speedometer points with proxy and 160-161 without it. So it may slightly affect performance. May be worth to test on low perf hardware too like arm (but we can also compile Firefox with proxy disabled there).

Attachment #9368399 - Attachment is obsolete: true
Flags: needinfo?(stransky)
  sched_param param;
  if (pthread_attr_getschedparam(&attr, &param) == 0) {
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &param);
  }
  • does this even work? I am under the impression that only root/processes with CAP_SYS_NICE can set RT prios or need to ask rtkit
  • the maximum priority is probably overkill. even the lowest realtime priority is still higher than the default scheduling policy. plus it's higher than what sway uses so sway might get preempted on the first message it sends to the socket (leading to a wakeup) which probably increases context switches

(In reply to The 8472 from comment #96)

  sched_param param;
  if (pthread_attr_getschedparam(&attr, &param) == 0) {
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &param);
  }
  • does this even work? I am under the impression that only root/processes with CAP_SYS_NICE can set RT prios or need to ask rtkit
  • the maximum priority is probably overkill. even the lowest realtime priority is still higher than the default scheduling policy. plus it's higher than what sway uses so sway might get preempted on the first message it sends to the socket (leading to a wakeup) which probably increases context switches

Sure, which sched_priority do you suggest to use?

I'm going to use the proxy with Fedora Firefox package to get real user feedback here.

Tested with sched_get_priority_min(SCHED_RR) and the speedometer performance is the same.

Attachment #9368867 - Attachment is obsolete: true

(In reply to Martin Stránský [:stransky] (ni? me) from comment #100)

I'm going to use the proxy with Fedora Firefox package to get real user feedback here.

I've applied D196554.diff, D196555.diff, D196556.diff on my custom Gentoo build of Firefox and it seems to be working fine as far as I can tell. One quick issue I ran into is if you have multiple Firefox profiles and open the profile selector with firefox -p it crashes:

/usr/lib64/firefox/firefox -p
Wayland Proxy error: StartProxyServer(): bind() error : Address already in use
[1]    2902 segmentation fault  /usr/lib64/firefox/firefox -p

I don't know if you can reproduce that in Fedora but thought I'd mention it.

I have been experiencing this over the last few weeks; in fact it is the most common reason for my thunderbird and firefox crashes under ubuntu 23.10:

In all cases, the STRs involved scrolling with my mouse wheel. It might have been out of focus of a scrollable field, but I can state for sure the browser or mailer was not in a busy state that would make it unresponsive, it was scrolling properly and out of the blue all of a sudden, crashed.

Is it possible we are spending too much time handling the scrolling events and we fail to reply in time to wayland? This is a Logitech G903 mouse, solaar shows DPI set at 3200 and polling at 1kHz.

Flags: needinfo?(stransky)

Polling at 1K was frequently reported as a potential trigger for this issue.

(In reply to Gabriele Svelto [:gsvelto] from comment #104)

Polling at 1K was frequently reported as a potential trigger for this issue.

Ok, but is the fault on our side, or in wayland, or in the desktop?

This "https://github.com/stransky/wayland-proxy/" works great for me. 0 crashes for 4 days now (whereas before it crashed like 5-20 times a day; KDE wayland Manjaro.). When will this be added to the mainline firefox for everyone (besides Fedora)?

(In reply to :gerard-majax [PTO 13/12-08/01] from comment #103)

In all cases, the STRs involved scrolling with my mouse wheel. It might have been out of focus of a scrollable field, but I can state for sure the browser or mailer was not in a busy state that would make it unresponsive, it was scrolling properly and out of the blue all of a sudden, crashed.

Is it possible we are spending too much time handling the scrolling events and we fail to reply in time to wayland? This is a Logitech G903 mouse, solaar shows DPI set at 3200 and polling at 1kHz.

Botond, do you know if there's a potential response lag during scrolling? Do we do significant amount of work in main thread during scroll?
Thanks.

Flags: needinfo?(stransky) → needinfo?(botond)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #107)

(In reply to :gerard-majax [PTO 13/12-08/01] from comment #103)

In all cases, the STRs involved scrolling with my mouse wheel. It might have been out of focus of a scrollable field, but I can state for sure the browser or mailer was not in a busy state that would make it unresponsive, it was scrolling properly and out of the blue all of a sudden, crashed.

Is it possible we are spending too much time handling the scrolling events and we fail to reply in time to wayland? This is a Logitech G903 mouse, solaar shows DPI set at 3200 and polling at 1kHz.

Botond, do you know if there's a potential response lag during scrolling? Do we do significant amount of work in main thread during scroll?

With the GPU process enabled (layers.gpu-process.enabled=true; not enabled by default on Linux), the parent process main thread does block on a synchronous IPC round-trip to the GPU process for every event (to perform a hit-test to determine which content process to dispatch the event to). See bug 1677509 for more details on this; it may be possible to avoid this using the approach discussed in bug 1677509 comment 10 but this would involve a non-trivial refactor and come with some risk to input handling latency.

With the GPU process disabled, the hit-test happens within the parent process but can still involve the main thread blocking on other threads in a couple of places:

  • At the beginning of the hit test, the main thread needs to acquire a lock that is also acquired by the SceneBuilder thread during a "scene swap" (when a new WebRender scene has been built and WebRender and APZ coordinate to start using the new scene at the same time). As far as I'm aware, a scene swap should be fast but IIRC it does involve some synchronous back-and-forth IPC between the SceneBuilder and RenderBackend threads (Nical would know more of the details here).
  • The WebRender part of the hit test used to involve a synchronous round-trip to the RenderBackend thread, but we reworked that in bug 1580178 to avoid this in almost all cases. If I'm reading the code right, the only remaining case in which we can block (on this line) is during the first hit-test on a new window, if it's requested sufficiently soon after creating the window that the initial hit tester hasn't been built yet.
Flags: needinfo?(botond)

Thanks. GPU process is not implemented on Wayland at all.

Pushed by stransky@redhat.com:
https://hg.mozilla.org/integration/autoland/rev/6a0659188d6c
[Wayland] Implement Wayland proxy r=emilio
https://hg.mozilla.org/integration/autoland/rev/b0d2efdcd6cc
[Wayland] Enable Wayland proxy on start r=emilio
Regressions: 1873699
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 123 Branch

:stransky Fx122 is affected here but we are near the end of the beta cycle. 122.0b9 builds on 2024-01-12.
What do you think about adding an uplift request on this? Is it low-risk enough to take at this stage?

Flags: needinfo?(stransky)

(In reply to Donal Meehan [:dmeehan] from comment #113)

:stransky Fx122 is affected here but we are near the end of the beta cycle. 122.0b9 builds on 2024-01-12.
What do you think about adding an uplift request on this? Is it low-risk enough to take at this stage?

Better to keep it in nightly only.

Flags: needinfo?(stransky)
Regressions: 1874089
Regressions: 1874107
Regressions: 1874717
Regressions: 1874857

(In reply to Wayne Mery (:wsmwk) from comment #115)

Nightly crash rate seems to have decreased. However, there are still crashes:

That's interesting, Thanks.

It may be caused by missing Wayland protocol ping handling, as the Wayland proxy comments state:
https://mastransky.wordpress.com/2023/12/22/wayland-proxy-load-balancer/#comments

We may look at it and investigate possible improvements.

Further observation - crash rate for signature <name omitted> | HandleGLibMessage has been driven to zero.

On the other hand, crash rate for HandleGLibMessage which had dropped by about half, in just the last couple days has a sudden increase in crashes in 123.0 and 123.0.1 - roughly doubled.

(In reply to Wayne Mery (:wsmwk) from comment #117)

Further observation - crash rate for signature <name omitted> | HandleGLibMessage has been driven to zero.

On the other hand, crash rate for HandleGLibMessage which had dropped by about half, in just the last couple days has a sudden increase in crashes in 123.0 and 123.0.1 - roughly doubled.

Would be great to have someone who can reproduce it with latest nightly with proxy enabled. We had reports about 1000Hz mouse issues but that should be solved now. I wonder what else can cause the issues (beside heavy system utilization which may cause delays in event processing).

I can confirm that since the Wayland proxy landed, both Firefox Nightly and Thunderbird Daily no longer crash because of this, so a 1kHz mouse is OK.
