Closed Bug 1562548 Opened 6 years ago Closed 6 years ago

Improve GCM perfomance on aarch32

Categories

(NSS :: Libraries, task, P2)

Tracking

(firefox-esr60 wontfix, firefox-esr68 wontfix, firefox67 wontfix, firefox67.0.1 wontfix, firefox68 wontfix, firefox69 wontfix, firefox71 wontfix, firefox72 wontfix, firefox73 fixed)

RESOLVED FIXED
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox67 --- wontfix
firefox67.0.1 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox71 --- wontfix
firefox72 --- wontfix
firefox73 --- fixed

People

(Reporter: m_kato, Assigned: m_kato)

References

(Blocks 1 open bug, )

Details

(Whiteboard: [geckoview:p2] [bcs:p2])

Attachments

(2 files)

bug 1559012 is for aarch64, but we can improve GCM even if aarch32.

Blocks: 1555407

Makoto - Can I volunteer you for this one? :) (Feel free to un-assign if it's overload).

Assignee: nobody → m_kato
Status: NEW → ASSIGNED
Priority: -- → P2

esr68=affected because we might want to uplift this optimization to Fennec ESR 68.1.

Blocks: 1570313

I'm curious about the attached patch - can I give this a try and see if it helps? Is it waiting on anything?
I've re-profiled speedof.me with webrender and the GCM_DecryptUpdate is very prominent:
https://perfht.ml/2pBJsVr

Flags: needinfo?(m_kato)
Flags: needinfo?(m_kato)

I need to rebase old patch for review

(In reply to Andrew Creskey from comment #5)

I'm curious about the attached patch - can I give this a try and see if it helps? Is it waiting on anything?
I've re-profiled speedof.me with webrender and the GCM_DecryptUpdate is very prominent:
https://perfht.ml/2pBJsVr

Well, this fix will improve a lot of performance of gcmHash_Update that is 22% of this profiling data. About rijndael_encryptBlock128, we have already landed improvement for ARMv8 CPU (MotoG5, Pixel 2 and etc) although it isn't merged into Fenix Nightly.

Attachment #9082577 - Attachment description: Bug 1562548 - Improve GCM perfomance on aarch32 → Bug 1562548 - Improve GCM perfomance on aarch32 using NEON.

Thank you Makoto.
I did apply this patch to an arm32 PGO build of geckoview_example.
Unfortunately I couldn't pick up any performance improvements but all I have to test with are the speedof.me tests.
I think Ideally we would more isolated GCM performance tests.

(In reply to Andrew Creskey from comment #8)

Thank you Makoto.
I did apply this patch to an arm32 PGO build of geckoview_example.
Unfortunately I couldn't pick up any performance improvements but all I have to test with are the speedof.me tests.
I think Ideally we would more isolated GCM performance tests.

This fix is that GCM improves 1.5-2.0 times faster than original even if non-ARMv8 hardware. Since we don't turn on WebRender for 32-bit device, it spends a lot of times on filter processing (bug 961759).

#     mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_gcm_e        69Mb     256      1M       0       0.000   10000.000      10.000         6Mb <-- with this fix
 aes_gcm_e        37Mb     256    654T       0       0.000   10000.000      10.000         3Mb

Thank you for collecting those throughput results, m_kato, they look great.
I did also turn on WebRender for that test, but, like I said, it's not a well defined work flow and there's a lot of noise.
I have a bit of a backlog, but let me needinfo myself to run this change on a real set of pages.
Our current automated pageload tests don't necessarily use the actual cipher suite that a client/server would agree on.

Flags: needinfo?(acreskey)

Makato, I'm curious -- how are you building your measured throughput tests?
Is this in-tree in gecko?
In Bug 1591725 we're looking at a more aggressive build optimization than the current -Oz. It occurred to me that if your tests are built in a standalone app with a different build configuration then we would get different results. AFAIK -Oz may disable some vectorization.

Flags: needinfo?(acreskey)

(In reply to Andrew Creskey from comment #11)

Makato, I'm curious -- how are you building your measured throughput tests?
Is this in-tree in gecko?
In Bug 1591725 we're looking at a more aggressive build optimization than the current -Oz. It occurred to me that if your tests are built in a standalone app with a different build configuration then we would get different results. AFAIK -Oz may disable some vectorization.

This result is by bltest in NSS tree (https://searchfox.org/mozilla-central/source/security/nss/cmd/bltest). And this algorithm is for armv7 using the paper (comment #4) and this doesn't depends on compiler's automatic vectorization.

Flags: needinfo?(acreskey)

Since I don't have commit permission for nss, could you land this? (I guess that sheriffs don't have permission for nss).

Flags: needinfo?(jjones)

Will take in NSS 3.49. Leaving my needinfo until after I branch.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.49
Flags: needinfo?(jjones)
Flags: needinfo?(acreskey)

I do get this build error in new nss-3.49 release on arm, and after a bit of tinkering I found the issue to be the patch discussed in this thread. Here's the error:

/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/9.2.0/../../../../armv7a-unknown-linux-gnueabihf/bin/ld: warning: wildcard match appears in both version 'NSSprivate_3.11' and 'NSSprivate_3.16' in script
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/9.2.0/../../../../armv7a-unknown-linux-gnueabihf/bin/ld: error: Linux2.6_arm_armv7a-unknown-linux-gnueabihf-gcc_glibc_PTH_OPT.OBJ/Linux_SINGLE_SHLIB/gcm-arm32-neon.o: multiple definition of 'gcm_HashMult_hw'
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/9.2.0/../../../../armv7a-unknown-linux-gnueabihf/bin/ld: Linux2.6_arm_armv7a-unknown-linux-gnueabihf-gcc_glibc_PTH_OPT.OBJ/Linux_SINGLE_SHLIB/gcm.o: previous definition here
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/9.2.0/../../../../armv7a-unknown-linux-gnueabihf/bin/ld: error: Linux2.6_arm_armv7a-unknown-linux-gnueabihf-gcc_glibc_PTH_OPT.OBJ/Linux_SINGLE_SHLIB/gcm-arm32-neon.o: multiple definition of 'gcm_HashWrite_hw'
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/9.2.0/../../../../armv7a-unknown-linux-gnueabihf/bin/ld: Linux2.6_arm_armv7a-unknown-linux-gnueabihf-gcc_glibc_PTH_OPT.OBJ/Linux_SINGLE_SHLIB/gcm.o: previous definition here
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/9.2.0/../../../../armv7a-unknown-linux-gnueabihf/bin/ld: error: Linux2.6_arm_armv7a-unknown-linux-gnueabihf-gcc_glibc_PTH_OPT.OBJ/Linux_SINGLE_SHLIB/gcm-arm32-neon.o: multiple definition of 'gcm_HashInit_hw'
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/9.2.0/../../../../armv7a-unknown-linux-gnueabihf/bin/ld: Linux2.6_arm_armv7a-unknown-linux-gnueabihf-gcc_glibc_PTH_OPT.OBJ/Linux_SINGLE_SHLIB/gcm.o: previous definition here
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/9.2.0/../../../../armv7a-unknown-linux-gnueabihf/bin/ld: error: Linux2.6_arm_armv7a-unknown-linux-gnueabihf-gcc_glibc_PTH_OPT.OBJ/Linux_SINGLE_SHLIB/gcm-arm32-neon.o: multiple definition of 'gcm_HashZeroX_hw'
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/9.2.0/../../../../armv7a-unknown-linux-gnueabihf/bin/ld: Linux2.6_arm_armv7a-unknown-linux-gnueabihf-gcc_glibc_PTH_OPT.OBJ/Linux_SINGLE_SHLIB/gcm.o: previous definition here
collect2: error: ld returned 1 exit status

I'm going to attach the full build log in a moment.
Should I open a new bug for this?

Flags: needinfo?(jjones)
Regressions: 1608327

Thanks for opening 1608327, Mike.

Flags: needinfo?(jjones)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: