Retarget nsHttpChannel handled by HttpChannelParent to a background thread, avoid main thread loop on critical path
Categories
(Core :: Networking: HTTP, enhancement, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox65 | --- | affected |
People
(Reporter: mayhemer, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Whiteboard: [necko-triaged])
Attachments
(1 file, 1 obsolete file)
8.25 KB,
text/plain
|
Details |
Comment 1•7 years ago
|
||
Updated•7 years ago
|
![]() |
Reporter | |
Comment 2•7 years ago
|
||
![]() |
Reporter | |
Comment 3•7 years ago
|
||
![]() |
Reporter | |
Updated•7 years ago
|
Comment 4•7 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #0)
Created attachment 9023355 [details]
bt logThis is the first bug based on Backtrack data :)
We do an unnecessary parent process main thread loop for every chunk of data
we deliver to the content process via PBackground, being it from cache or
net.Retargetting nsHttpChannel is the last piece of the PBg-HTTP project we are
currently missing. This has never been filed or I can't find the bug, and
it has been completely forgotten about.
nsHttpChannel is already implemented nsIThreadRetargetableRequest. It seems we can just try to call nsHttpChannel::RetargetDeliveryTo from HttpBackgroundChannelParent.
![]() |
Reporter | |
Comment 5•7 years ago
|
||
That's what I had in mind, and it would definitely be better than what we have now.
The real affect is tho blocked by bug 1528285. CacheFileInputStream::NotifyListener (from comment 0) also calls OnStartRequest, which now happens on the main thread only.
We could also split this bug to two - to simply retarget nsHttpChannel to call OnData on a suitable worker thread as part 1 and then let call HttpChannelParent::OnStart/StopRequest off the main thread as part 2.
Comment 6•7 years ago
|
||
Comment 7•7 years ago
|
||
Comment 8•7 years ago
•
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=845897e98c2c13353677063328b5330703955083
There are still some test timeouts.
Comment 9•7 years ago
|
||
:selena, I am still discussing how to tackle those failed tests with Honza. Not sure if it's a good idea to turn off retargeting for those tests. If we really decide to do this, then I'll send you the list of tests.
Comment 10•7 years ago
|
||
On second thoughts, it seems not right to turn off retargeting just because of test timeouts.
FWIW, Junior's encournted something like this before [1].
Comment 11•7 years ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #9)
:selena, I am still discussing how to tackle those failed tests with Honza. Not sure if it's a good idea to turn off retargeting for those tests. If we really decide to do this, then I'll send you the list of tests.
I think there is no need to keep this ni.
Comment 12•7 years ago
|
||
Hi John,
Recently, I've been investigating the reason of a media test [1] timeout, which I think might have something to do with my patch in this bug. The log shows that the "canplaythrough" event is not fired. I also found that in the successful cases, the http channel is closed at [2]. For the failed cases, the channel is closed at [3]. Do you know what could make this difference?
Moreover, from networking point of view, it seems that all the data is completely delivered to ChannelMediaResource::Listener::OnDataAvailable. Do you have any idea why the "canplaythrough" event is not fired?
Thanks.
[1] https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/testing/web-platform/tests/2dcontext/imagebitmap/createImageBitmap-origin.sub.html
[2] https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/dom/media/ChannelMediaResource.cpp#838
[3] https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/dom/media/ChannelMediaResource.cpp#689
Comment 13•7 years ago
|
||
Kershaw, may I know which platform are you using for the test? I'd like to reproduce it before checking the issue. Also, will you by any chance be able to enable debug log with MOZ_LOG=nsMediaElement:5
and upload the output? FWICT, canplaythrough
event is fired at [1], and debug log would be very helpful.
[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5562
Comment 14•6 years ago
|
||
(In reply to John Lin [:jhlin][:jolin] from comment #13)
Kershaw, may I know which platform are you using for the test? I'd like to reproduce it before checking the issue. Also, will you by any chance be able to enable debug log with
MOZ_LOG=nsMediaElement:5
and upload the output? FWICT,canplaythrough
event is fired at [1], and debug log would be very helpful.[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5562
I am only using linux platform for the test. Please see the failed test with MOZ_LOG=nsMediaElement:5 enabled at [1]. Unfortunately, I think this is unlikely to reproduce locally. This seems to be only reproducible on try.
The debug log is at [2]. You can search the http channel with the address 0xe6d26000 and you can see this channel is canceled with the status 0x805d0021 (NS_ERROR_PARSED_DATA_CACHED).
Thanks!
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=8df420e6a64674faa1061f0d2791e0c32894ba10&selectedJob=237923266
[2] https://taskcluster-artifacts.net/FcFT28CsRjCGYjkyAVLQEg/0/public/logs/live_backing.log
Comment 15•6 years ago
|
||
Kershaw,
Thanks for the log! It seems the decoder enter dormant state right after it finishes decoding the 1st frame. I tried to add some more log to see why, but couldn't trigger the failure on the try server anymore after rebasing on the latest central code. Could it be some patches landed there already fixes this bug? Can you confirm that the failure is still reproducible? Thanks a lot!
Comment 16•6 years ago
|
||
(In reply to John Lin [:jhlin][:jolin] from comment #15)
Kershaw,
Thanks for the log! It seems the decoder enter dormant state right after it finishes decoding the 1st frame. I tried to add some more log to see why, but couldn't trigger the failure on the try server anymore after rebasing on the latest central code. Could it be some patches landed there already fixes this bug? Can you confirm that the failure is still reproducible? Thanks a lot!
As I mentioned in comment #12, I think this failure may be caused by my patch. Or my patch causes the failure rate higher.
I think it's not easy to reproduce this without my patch.
Could you send the patch with more logs to me? I can try to reproduce this again and send you the log.
Could you elaborate a bit on the reason of dormant state? My patch changes the speed of sending data from parent process to child process a bit faster. Is this the possible reason?
Thanks.
Comment 17•6 years ago
|
||
Honza,
Could you take a look at the talos comparison result [1] of f03bdc6a8047 (without retargeting) and fa2553da1517 (with retargetgin)?
Honestly, I can't tell if retargeting have some improvments from the result. But I think there is also no performance regression.
FWIW, I think it's still worth to continue working on this. What do you think?
![]() |
Reporter | |
Comment 18•6 years ago
|
||
Ah.. I think acreskey might help here more than me, as I said before. I personally have trouble comparing talos results myself.
I usually push with ./mach try -b o -p linux64,macosx64,win32,win64 -u none -t all --rebuild-talos 10
. Still, this doesn't run ALL talos tests anyway (despite the -t all!) I think I should add -shippable (former -pgo) to the list of platforms. You ran only a fraction of the tests and only 5 times? Maybe push with rebuild=20 twice: a vanilla, base cset and your patch applied on it.
Comment 19•6 years ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #16)
(In reply to John Lin [:jhlin][:jolin] from comment #15)
Kershaw,
Thanks for the log! It seems the decoder enter dormant state right after it finishes decoding the 1st frame. I tried to add some more log to see why, but couldn't trigger the failure on the try server anymore after rebasing on the latest central code. Could it be some patches landed there already fixes this bug? Can you confirm that the failure is still reproducible? Thanks a lot!
As I mentioned in comment #12, I think this failure may be caused by my patch. Or my patch causes the failure rate higher.
I think it's not easy to reproduce this without my patch.
Could you send the patch with more logs to me? I can try to reproduce this again and send you the log.
Sorry. I meant to say 'suspend' but typed 'dormant' by mistake. Could you please try again with https://paste.rs/KyZ.diff applied and log enabled with MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5
?
Could you elaborate a bit on the reason of dormant state? My patch changes the speed of sending data from parent process to child process a bit faster. Is this the possible reason?
Thanks.
IIUC, [1] makes sure readyState
advance to HAVE_ENOUGH_DATA
and canplaythrough
event fired. Hope the log can show us how data arrive sooner could affect this.
Comment 20•6 years ago
|
||
Sorry. I meant to say 'suspend' but typed 'dormant' by mistake. Could you please try again with https://paste.rs/KyZ.diff applied and log enabled with
MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5
?
I finally managed to reproduce this again.
Please take a look at the logs [1][2].
[1] https://taskcluster-artifacts.net/c5PXlESrSV-2TRQ-YZDyBA/0/public/logs/live_backing.log
[2] https://taskcluster-artifacts.net/ZuNKcPcLSl-I8uk0fXSq6g/0/public/logs/live_backing.log
Comment 21•6 years ago
|
||
Thanks a lot for your help!
I've compared the logs of passed and failed jobs. It looks like the video element didn't advance to HAVE_ENOUGH_DATA
because mDownloadSuspendedByCache
was not set [1]. Tracing back where that member is set leads me to [2]. The code above it determines whether to call NotifyDataEnded()
or not according to the state of media cache. And I guess data arrival timing could affect that.
Could you please help reproduce one more time with https://paste.rs/mlr.diff applied and MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5,MediaCache:5
to confirm my theory?
[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5448
[2] https://searchfox.org/mozilla-central/source/dom/media/ChannelMediaResource.cpp#407
Comment 22•6 years ago
|
||
(In reply to John Lin [:jhlin][:jolin] from comment #21)
Thanks a lot for your help!
I've compared the logs of passed and failed jobs. It looks like the video element didn't advance to
HAVE_ENOUGH_DATA
becausemDownloadSuspendedByCache
was not set [1]. Tracing back where that member is set leads me to [2]. The code above it determines whether to callNotifyDataEnded()
or not according to the state of media cache. And I guess data arrival timing could affect that.Could you please help reproduce one more time with https://paste.rs/mlr.diff applied and
MOZ_LOG=nsMediaElement:5,MediaDecoder:5,MediaResource:5,MediaCache:5
to confirm my theory?[1] https://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5448
[2] https://searchfox.org/mozilla-central/source/dom/media/ChannelMediaResource.cpp#407
Please see the log at [1][2].
It seems that ChannelMediaResource::Seek is called at [3], so I can't see the second log you added at [4].
[1] https://taskcluster-artifacts.net/V_szf4jKQiekHrdEiKsPDA/0/public/logs/live_backing.log
[2] https://taskcluster-artifacts.net/FfiXLBHfTEOXZn6FwLNi1A/0/public/logs/live_backing.log
[3] https://searchfox.org/mozilla-central/rev/69ace9da347adcc4a33c6fa3d8e074759b91068c/dom/media/ChannelMediaResource.cpp#860
[4] https://searchfox.org/mozilla-central/rev/69ace9da347adcc4a33c6fa3d8e074759b91068c/dom/media/ChannelMediaResource.cpp#396
Comment 23•6 years ago
|
||
Comment 24•6 years ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #23)
Comment 25•6 years ago
|
||
as a note, 20x for each job is a bit overkill- typically 6 is plenty, 10 is top end. Also running this at scale (all tests, all platforms) puts a high load on our limited device pool which causes a backlog for all users on try.
Comment 26•6 years ago
|
||
Comment 28•6 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #27)
Kershaw, what's the status here?
Sorry for the late progress here. I'll push this to try and see how many tests are still failure this week.
I also have to capture the log again for John to analyse the reason of failed media wpt test.
About the talos result, I've run 4 times and I can't tell if there is any improvement or regression of the results. But, I think we can continue working on this bug.
![]() |
Reporter | |
Comment 29•6 years ago
|
||
I'm not pushing, only asking :) Thanks!
Comment 30•6 years ago
|
||
I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?
Comment 31•6 years ago
|
||
(In reply to Andrew Creskey from comment #30)
I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?
I think e10s is disabled for Fenix, right? If yes, this change has no effect.
If e10s is enabled, this change might have some improvements.
Comment 32•6 years ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #31)
(In reply to Andrew Creskey from comment #30)
I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?I think e10s is disabled for Fenix, right? If yes, this change has no effect.
If e10s is enabled, this change might have some improvements.
Kershaw, on android geckoview_example and Fenix both use e10s, multi-process.
Because Fenix uses a fixed geckoview version (68 I believe, but it's regularly updated) I think it's easier to test the performance impact with geckoview_example.
I happen to have a good android baseline revision with a lot of pageload test jobs.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab04ffe5d4d20069bda271eec7e33f390ef6b7ae
If you point me at a patch that will build against this I can kick off some tests.
Comment 33•6 years ago
|
||
(In reply to Andrew Creskey from comment #32)
(In reply to Kershaw Chang [:kershaw] from comment #31)
(In reply to Andrew Creskey from comment #30)
I would be very curious as to whether this change improves the raptor tp6 and tp6-mobile page load tests.
Also, any chance it could impact Bug 1556022?I think e10s is disabled for Fenix, right? If yes, this change has no effect.
If e10s is enabled, this change might have some improvements.Kershaw, on android geckoview_example and Fenix both use e10s, multi-process.
Because Fenix uses a fixed geckoview version (68 I believe, but it's regularly updated) I think it's easier to test the performance impact with geckoview_example.I happen to have a good android baseline revision with a lot of pageload test jobs.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab04ffe5d4d20069bda271eec7e33f390ef6b7ae
If you point me at a patch that will build against this I can kick off some tests.
I've rebased the patch to current central and pushed it to phabricator.
https://phabricator.services.mozilla.com/D22516
Comment 34•6 years ago
|
||
Thanks Kershaw.
We'll be able to see results here as I add repeat jobs:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003&newProject=try&newRevision=05ad273d6a5ad811ba821e1821343ab2dee19904&framework=10
I chose to only test android because I find that it takes time to test platforms and android is sensitive to performance but also reasonably stable.
We can always add tests for more targets to these same pushes.
Comment 35•6 years ago
|
||
... there was a problem with the android job queue so this will take a bit longer to get performance results...
Comment 36•6 years ago
|
||
Surprisingly that try result looks greener than previous push. One possible reason is that the easily failed media wpt test was modified in Bug 1529056.
So, there is no need to ask John's help right now.
I'll also try to fix the review comments and move this bug forward this week.
Comment 37•6 years ago
|
||
Perfherder compare has got a reasonably number of results, feel free to add any jobs to it.
(left hand side is baseline, right hand side has the patch applied)
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003&newProject=try&newRevision=05ad273d6a5ad811ba821e1821343ab2dee19904&framework=10
![]() |
Reporter | |
Comment 38•6 years ago
|
||
Andrew, what other jobs would you suggest to run? From what I see in the log (and how I can understand it) it seems this change doesn't have much impact, the differences seems to be either a noise (but with 25 runs each?) or just a reorder of things w/o bringing a real consistent benefit.
Comment 39•6 years ago
|
||
Honza, the results are a bit puzzling to me.
Certainly there is a more noise than we'd like, so it's hard to say.
This one result, if this is real, is worrisome:
Amazon Load time on the Pixel 2 has almost tripled?
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=05ad273d6a5ad811ba821e1821343ab2dee19904&originalSignature=1995546&newSignature=1995546&framework=10&originalRevision=a15a5d367a7b7a197d90f13ff2670610a09a5003
I'll add tests for Windows.
![]() |
Reporter | |
Comment 40•6 years ago
|
||
One theory: in backtrack logs I often see that, specially for responses coming from the cache, we do fire OnStart/OnData (once)/OnStop in one main thread loop (one event). With retargetting, we are adding another loop to deliver OnStop again on the main thread - in sequence, which can get significantly delayed by main thread blockers... Prioritization could help here, but it's still questionable if overall this will help at all anyway.
We can try to not retarget for cached loads to verify my theory, but to be honest, the same can happen for network loads when they are small enough to again deliver all 3 notifications at the same loop.
We may rather wait for bug 1528285 to be fixed first and use it here to retarget the whole stream listener.
Updated•6 years ago
|
Updated•6 years ago
|
Mass-removing myself from cc; search for 12b9dfe4-ece3-40dc-8d23-60e179f64ac1 or any reasonable part thereof, to mass-delete these notifications (and sorry!)
Updated•3 years ago
|
Comment 43•6 months ago
|
||
As I mentioned in bug 1952237, I think we also need to impl nsIThreadRetargetableStreamListener for CacheFileOutputStream, otherwise we can't retarget channels that are writing to a cache entry.
Description
•