2021-07-18, 05:22  #23
Jul 2021
3·11 Posts 
That's why I implemented both FFT and NTT from scratch myself, with the skills that I have.
I implemented both so as to have a relative comparison. Both my variants are moderately optimized, so their comparison is quite fair. Also their inner loops and formulas look very much the same. My FFT uses AVX1 SIMD to do DOUBLE multiplies, and the NTT uses the non-SIMD 64x64→128 multiply (built into the CPU).

The reason I see why my FFT is 10x slower than my NTT is that the NTT uses full 64-bit words (actually around 60 bits) for computation, because the NTT doesn't suffer from error growth; there is no error. My FFT, on the other hand, splits the input into very small words of 3-4 bits, so 64/3 ≈ 21x difference in word size. Can anyone correct me if I used the wrong technique of FFT by splitting into 3-bit words? Is there any other way to solve the task? This 3-bit splitting was done to keep the final error within bounds of 0.05-0.1. 3 bits is for input numbers of 2^25 bytes; of course for smaller sizes like 2^20 I use 7-8 bits. These 3 bits were chosen based on experimental results for random inputs of different sizes; a table was built to achieve this.

The NTT doesn't suffer from cumulative error. Instead it uses the Chinese Remainder Theorem (CRT) to restore higher precision. In the CRT I used 3 primes, because an NTT with 64-bit words needs around 64 + 64 + 64 = 192 bits in the final result, meaning 3 primes of 64 bits each are needed in the CRT. So this gives a 3x slowdown of the NTT, but still the NTT is 10x faster than my FFT.

So my conclusion was: if my moderately optimized implementations of both FFT and NTT, with almost the same algorithms and loops, differ by 10x, then I could imagine that if the Prime95 gurus implemented an NTT with the same high level of optimization as they did in GWNUM, it could happen that this imaginary Prime95 NTT would be faster than Prime95's FFT. That's why I created this discussion thread initially. Also, people say that NTT is much slower due to lots of modular divisions, but I don't have any divisions in my code at all, only multiplications.
That is because I used Barrett and Montgomery reduction, which allowed replacing all divisions with multiplications (and shifts/adds). Last fiddled with by moytrage on 2021-07-18 at 05:23
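For readers unfamiliar with the trick mentioned above, here is a minimal Python sketch of Montgomery reduction (a generic textbook form, not the poster's AVX code; the 64-bit NTT-friendly prime 2^64 − 2^32 + 1 is an assumed example), showing that the modular multiply in an NTT inner loop needs only multiplies, shifts and adds:

```python
# Hypothetical 64-bit NTT-friendly prime (2^64 - 2^32 + 1); any odd modulus works here.
P = (1 << 64) - (1 << 32) + 1
BITS = 64
R = 1 << BITS
MASK = R - 1
P_NEG_INV = pow(-P, -1, R)   # P' with P * P' ≡ -1 (mod R), precomputed once

def redc(T):
    """Montgomery reduction: T * R^-1 mod P, using only multiplies, a mask and a shift."""
    m = ((T & MASK) * P_NEG_INV) & MASK
    t = (T + m * P) >> BITS      # (T + m*P) is divisible by R by construction
    return t - P if t >= P else t

def mont_mul(a_bar, b_bar):
    # Inputs/outputs are in Montgomery form (x_bar = x * R mod P).
    return redc(a_bar * b_bar)

def to_mont(x):
    # One-time conversion into Montgomery form; this % is precomputation,
    # not part of the transform's inner loop.
    return (x * R) % P

def from_mont(x_bar):
    return redc(x_bar)
```

The only true division (`%`) appears in the one-time conversion; every modular multiply inside the transform itself is division-free, which is the point being made in the post.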
2021-07-18, 05:35  #24
Jul 2021
3·11 Posts 
This is exactly what is done in my code for both FFT and NTT. I have an input array A and no recursion. The outer loop runs Log(N) times and the inner loop goes through A sequentially from start to finish. To avoid recursion the input array is permuted up front. So I have Log(N) sequential read/write passes over my A array, hence I conclude that my algorithm is already quite cache friendly.
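The structure described above — a bit-reversal permutation followed by Log(N) sequential in-place passes — can be sketched like this (a generic textbook iterative Cooley-Tukey in Python, not the poster's SIMD code):

```python
import cmath

def bit_reverse_permute(a):
    # Reorder the array once so the iterative passes below can run in place.
    n, j = len(a), 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

def fft_iterative(a):
    # log2(n) outer passes; each pass sweeps the array sequentially,
    # which is the cache-friendly access pattern discussed above.
    n = len(a)                       # n must be a power of two
    bit_reverse_permute(a)
    length = 2
    while length <= n:
        w_len = cmath.exp(-2j * cmath.pi / length)
        for start in range(0, n, length):
            w = 1.0
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w
                a[k] = u + v                   # butterfly
                a[k + length // 2] = u - v
                w *= w_len
        length <<= 1
    return a
```

The recursion of the divide-and-conquer formulation is traded for the one-time permutation, exactly as the post describes.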

2021-07-18, 05:55  #25
Jul 2021
3×11 Posts 
I don't understand why NTT should be much slower computationally. People say it is slow due to the many divisions in modular reduction, but I used Barrett and Montgomery reductions everywhere, which allowed me to use no divisions at all, only mult/add operations.
So after Barrett + Montgomery reduction I don't see the reason why FFT should be much faster. Both of them use SIMD and/or built-in CPU multiplications and that's it. So both algorithms look about the same to me. More than that, they both have almost the same code, loops and algorithms. I can't see a real difference. Last fiddled with by moytrage on 2021-07-18 at 06:16
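As a companion to the Montgomery sketch, here is a minimal Python sketch of Barrett reduction (again a generic textbook form, not the poster's code): the runtime `x mod n` becomes one multiply by a precomputed reciprocal, a shift, and at most a couple of conditional subtractions:

```python
def barrett_setup(n):
    # One-time precomputation: mu = floor(2^(2k) / n), k = bit length of n.
    k = n.bit_length()
    return k, (1 << (2 * k)) // n

def barrett_reduce(x, n, k, mu):
    # Computes x mod n for 0 <= x < n^2 (and x < 2^(2k)).
    # The quotient estimate q is at most 2 below floor(x/n), so the
    # correction loop runs at most twice -- no runtime division.
    q = (x * mu) >> (2 * k)
    r = x - q * n
    while r >= n:
        r -= n
    return r
```

`mu` is computed once per modulus (that single `//` is setup cost), after which every reduction in the inner loop is multiply/shift/subtract only — which is why the "NTT is slow because of divisions" argument does not apply here.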
2021-07-18, 06:51  #26
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·11·269 Posts 
If your NTT (which NTT?) is 9.8 sec/iter and your FFT is 10-15 times slower than your NTT — we'll call it 12 — that's ~118 sec/iter for the moytrage FFT, while prime95 you timed at 0.34 sec. That's a ratio of ~346 between FFT implementations.
IBDWT vs. padded FFT is what, 2:1 based on length for padding? Bits/word in a DP FFT is typically 17-19 in prime95 or other established code (CUDALucas, Mlucas, Gpuowl) depending on FFT length, not 3-4, so that would be ~4-6 fold. Memory or cache efficiency probably gets killed by the low bits/word, and perhaps the usage patterns as Prime95 described. He's done a LOT of optimization work over the decades.

The integer based code I thoroughly tested for exponent limit due to iteration error in Gpuowl 1.9 was good for ~20 bits/word, in the M61 Fast Galois transform at 4M length. It was SP or M31 that was expected to provide ~35 bits/word. Author's description at V1.8: https://mersenneforum.org/showpost.p...postcount=224; V1.9 exponent M61 transform limit test results at https://www.mersenneforum.org/showpo...31&postcount=8

I don't know why you're only able to use 3-4 bits/word in your DP based FFT, and yet see 0.07-0.1 roundoff error at that. Other software goes up to ~0.3-0.4 roundoff error or slightly higher for performance, with many more bits/word, for DP based FFTs. Usable bits/word versus FFT length are given in a column on the right of some of the more recent processor-specific benchmarks in the prime95 reference thread.

If you plan to turn your code into a prime testing program, reading background on the development of Gpuowl, Mlucas, CUDALucas and prime95 could be enlightening. There are forum threads for most if not all. If the code is wrong, even rarely, relative speed does not much matter in an application as demanding as primality testing. Every branch, every case, needs well designed test inputs.
2021-07-18, 07:16  #27
Jul 2021
3·11 Posts 
Quote:
Code:
{ {6, 22}, {7, 22}, {8, 20}, {9, 19}, {10, 20}, {11, 17}, {12, 16}, {13, 14}, {14, 13}, {15, 12}, {16, 11}, {17, 10}, {18, 9}, {19, 8}, {20, 7}, {21, 5}, {22, 4}, {23, 3}, {24, 3}, }

These bits I got when I tried to limit the error below 0.1, so each entry is the largest number of bits that still gives an error below 0.1. All errors were measured experimentally on random input numbers. Of course if I limit the error to 0.4 then I get more bits; I just thought limiting to 0.4 is not very safe. Of course if some extra error checking is done then 0.4 is also allowable.

The only reason I can think of why my error would be higher than in Prime95 is maybe the use of the `-ffast-math` switch when compiling. It makes math faster but less accurate. Also I use SIMD, which is less accurate than built-in CPU multiplication. Do I understand correctly that Prime95 somehow, for some reason, uses 17-19 bits with an error of 0.4? Does it use DOUBLE-64 computations? Not 128- or 256-bit floats? Not long-double 80-bit floats? Of course if I use 18 bits instead of 3 bits then I'll be 6x faster. How come Prime95 managed to use 18 bits with 0.4 error?

Actually I didn't want to make any primality-proving program out of my code, especially if it is dozens of times slower than the existing Prime95. I only implemented FFT and NTT from scratch for educational purposes, because I always wanted to learn what FFT/NTT is. If somebody quite clever has already tested that NTT is much slower than IBDWT for the same level of optimization then I'm quite happy; that was my only concern when starting this thread — the concern of whether Prime95 could become faster. If everybody here says that NTT is inherently slower than FFT, then I can understand why FFT was chosen for Prime95. I just can't understand: both FFT and NTT use just multiplications (and a few additions), and taking into account that all NTT divisions can be replaced by the multiplications of Barrett and Montgomery reductions, why is FFT always considered faster?
Both are just multiplications, with around the same number of operations, O(N * Log(N)), so both should be about the same. Integer multiplication is a bit slower than DOUBLE, but NTT can use almost all 64 bits per computational word, while FFT uses only 17-19 bits, as in Prime95. Last fiddled with by moytrage on 2021-07-18 at 07:25
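The three-prime CRT recombination the poster described earlier for the NTT can be sketched in a few lines of Python (a generic textbook CRT, not the poster's code; the particular primes in the example are arbitrary pairwise-coprime choices):

```python
def crt(residues, primes):
    # Reconstruct the unique x mod (p0*p1*p2*...) from its residues,
    # as in the three-prime recombination described in the thread.
    M = 1
    for p in primes:
        M *= p
    x = 0
    for r, p in zip(residues, primes):
        Mi = M // p
        x += r * Mi * pow(Mi, -1, p)   # pow(.., -1, p) = modular inverse
    return x % M
```

With three ~64-bit moduli the reconstructed value covers ~192 bits, enough to hold the exact convolution coefficients that a single 64-bit word cannot.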

2021-07-18, 07:39  #28
Jul 2021
3×11 Posts 
So I re-ran my bits evaluation and found out that for error 0.4 my code needs 8 bits; as a reminder, for error 0.1 it needed 4 bits. Sure, 8-bit words will be 2 times faster than 4-bit words.
I can't see how 18 bits can be used if I do just DOUBLE 64-bit operations everywhere, not float-128 or float-256. Maybe I made some crude mistake somewhere, like doing multiplications/additions in some very wrong order that reduces precision a lot. BTW, does anyone know the exact formula for the error? If two large numbers are pseudo-random, what is the formula for the error based on the numbers' size and the number of bits in a word?

Also, do I understand correctly that if I have K-bit words and N such words then I need a float type that can hold at least (K + K + Log2(N)) bits of mantissa? Because when multiplying two polynomials, each of degree N and with each coefficient having K bits, the product has each coefficient of size no more than (K + K + Log2(N)) bits, since each final coefficient's value is at most N * (2^K)^2 after opening the parentheses.

Taking the above into account, if a word is 19 bits and the large numbers are 2^25 bytes (2^28 bits), then there will be 2^23.75 such words. So by the formula above, the maximal polynomial coefficient after multiplication could be (19 + 19 + 23.75) = 61.75 bits. But the standard IEEE double-64 type has only a 52-bit mantissa, meaning it can't hold 61.75 bits precisely. How come, then, that 19 bits can be used for the word size? Last fiddled with by moytrage on 2021-07-18 at 08:04
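The worst-case bound in the post above is easy to reproduce numerically (a small Python check of the poster's own formula; note it is a worst-case bound — on random inputs roundoff behaves statistically, and later replies point to balanced digits as a further mitigation, which is how established programs exceed it in practice):

```python
import math

def required_mantissa_bits(word_bits, total_bits):
    # Worst-case product coefficient for N words of K bits each is
    # N * (2^K - 1)^2, i.e. about 2K + log2(N) significant bits,
    # matching the (K + K + Log2(N)) estimate in the post.
    n_words = total_bits / word_bits
    return 2 * word_bits + math.log2(n_words)

# The post's example: 2^25-byte (2^28-bit) numbers split into 19-bit words.
print(required_mantissa_bits(19, 2**28))   # ~61.75, above binary64's 53
```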
2021-07-18, 08:10  #29
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1011100011110_{2} Posts 
There's something very unusual about that bits vs. length table. It seems to me it gives for DP about what might be expected for SP. Some examples that use a DP FFT:
CUDALucas v2.06, 17.6 bits/word at 18M FFT length, on GTX1060 GPU Code:
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Jul 18 00:17:41 | M332197331 112030000 0x88fc24530de05291 | 18432K 0.33594 36.0679 360.67s | 91:21:46:05 33.72% |
Code:
Starting M43158617 fft length = 2304K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Jul 24 20:20:44 | M43158617 20000 0x6c847491e15e5282 | 2304K 0.28906 4.0205 80.41s | 2:00:10:42 0.04% |
Code:
2021-07-18 02:02:47 asr2/radeonvii 999299999 OK 103300000 10.34% 0af6d6e23d1b882d 13969 us/it + check 7.43s + save 3.78s; ETA 144d 20:45 17 errors 2
Mlucas V17, 17.6 bits/word on 18M FFT length, one core of Xeon E5645 Code:
M332220523: using FFT length 18432K = 18874368 8-byte floats. this gives an average 17.601676676008438 bits per digit
Using complex FFT radices 36 16 16 32 32
[Jul 22 13:31:39] M332220523 Iter# = 10000 [ 0.00% complete] clocks = 02:16:04.515 [816.4516 msec/iter] Res64: 1A313D709BFA6663. AvgMaxErr = 0.171224865. MaxErr = 0.250000000.
2021-07-18, 08:14  #30
Jul 2021
100001_{2} Posts 
Quote:
Quote:


2021-07-18, 08:30  #31
Jun 2003
3·11·157 Posts 
To OP: Instead of implementing your own FFT (probably too inefficient) or comparing against special-purpose software like P95 (too specialized), try using a high-performance generalized FFT library like FFTW as your FFT baseline. You could also use OpenPFGW, which supports testing of arbitrary numbers using a generic FFT (rather than special-form numbers using IBDWT). It shares the same underlying FFT routines with P95, so it is very high performance.
Having said that, 3-4 bits per FFT limb sounds too conservative. You might be able to get away with more bits. Also, are you using a balanced-digit representation? That will allow you to put many more bits per limb.
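A small Python sketch of what a balanced-digit representation means (a generic illustration, not OpenPFGW or P95 code): digits in [0, base) are recentered around zero, which keeps FFT coefficients small on average so rounding errors partially cancel, allowing more bits per limb:

```python
def to_balanced(digits, base):
    # Convert little-endian digits in [0, base) to balanced digits in
    # roughly [-base/2, base/2], carrying +1 into the next limb.
    out, carry = [], 0
    for d in digits:
        d += carry
        carry = 0
        if d > base // 2:
            d -= base
            carry = 1
        out.append(d)
    if carry:
        out.append(carry)
    return out

def value(digits, base):
    # Helper: both representations denote the same integer.
    return sum(d * base**i for i, d in enumerate(digits))
```

For example, 987 in base 10 becomes the balanced digits [-3, -1, 0, 1], i.e. 1000 − 10 − 3; the represented value is unchanged while each digit's magnitude stays at most base/2.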
2021-07-18, 08:45  #32
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·11·269 Posts 
Quote:
See the paragraph starting "Maximum usable" in https://www.mersenneforum.org/showpo...65&postcount=6. IEEE 754 binary64 is 53-bit (1 implicit) for normalized values.
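The 53-bit significand is easy to verify directly (a tiny Python check, added for illustration):

```python
# binary64 has a 53-bit significand (52 stored bits + 1 implicit bit),
# so consecutive integers are representable exactly only up to 2^53:
assert float(2**53) == float(2**53 + 1)      # 2^53 + 1 rounds back to 2^53
assert float(2**53 - 1) != float(2**53 - 2)  # below 2^53 every integer is exact
```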

2021-07-18, 08:55  #33
Jul 2021
3×11 Posts 
I implemented both FFT and NTT as part of a self-educational process; I didn't know about them before and always wanted to learn and implement them.
And only as a by-product did I find out that the FFT came out slower, only because the FFT used 4 bits and the NTT almost 64 bits per word. So my initial purpose wasn't to invent some cool, fast library faster than Prime95; also, I didn't know in advance that the FFT would turn out to be slower.

Please read my previous post above. Do you think my formulas are incorrect? Using these formulas I can't see how 18-19-bit words in the FFT can be used with the 52-bit mantissa of IEEE double-64.

I compared my FFT and NTT because the two are about the same: the same level of optimization, almost the same skeleton of the algorithm, the same loops, almost the same formulas. Both use only multiplications, no divisions (the NTT uses only multiplications because of Barrett and Montgomery reductions). And I can see clearly that the only reason the FFT is 10x slower than the NTT is that the NTT uses almost-64-bit words while the FFT uses 4-bit words, hence their Fourier table sizes differ 16x, hence the slowdown.

But I already found a way to get some speedup: 8-bit words can be used instead of 4-bit, because 8-bit words give 0.4 error, while before I was targeting 0.1 error. If I can target 0.4 error then of course I can use 8-bit words and get a 2x speedup.