RE: 1.60 results vs 1.61 results. I sorted on CPU time, and both the shortest and longest CPU times (sorry, I said run times earlier) were from 1.60.
The 1.61 results sat a little way in from both ends of the sorted list.
===============================================
AArch64 (both 1.60 and 1.61 results sorted together)
ARM clock 1.8 GHz, core clock 500 MHz
Long 19,035 Short 17,000
Pi 4, 32-bit OS
ARM clock 1.5 GHz, core clock 500 MHz
Long 18,417 Short 16,867
===============================================
From these results, it doesn't appear the 64-bit hardware is being taken advantage of. Of course, I don't know the precision of the calculations involved, and if the lower precision is used, the 64-bit hardware wouldn't matter much. I am guessing the instructions are fully cacheable. The data perhaps not so much. I have no idea how many ways the same numbers are crunched and re-crunched.
An additional wrinkle I thought I had noticed before: the Pi with the 32-bit OS, which I had first, is clocked at 1.5 GHz, whereas the newer one with 8 GB and the 64-bit OS is clocked at 1.8 GHz. The core clock is 500 MHz in both, though. I don't know the hardware well enough to know how the clocks mentioned are utilized, but I suspect that _should_ give the 8 GB one an additional advantage.
I added the ARM and core clocks to the results above.
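To put the clock difference into numbers, here is a rough sketch that converts the CPU times above into clock cycles. It assumes the times are in seconds and that both boards held their stated ARM clock for the whole task (no throttling), neither of which I have verified. The names `results`, `gigacycles` and `cycle_ratio` are just made up for the illustration.

```python
# CPU times in seconds and ARM clocks in GHz, copied from the figures above.
results = {
    "Pi 4, 64-bit OS": {"clock_ghz": 1.8, "short_s": 17000, "long_s": 19035},
    "Pi 4, 32-bit OS": {"clock_ghz": 1.5, "short_s": 16867, "long_s": 18417},
}


def gigacycles(seconds, clock_ghz):
    """Billions of ARM clock cycles, assuming the clock never throttles."""
    return seconds * clock_ghz


def cycle_ratio(key):
    """Cycles used by the 64-bit host divided by cycles used by the 32-bit host."""
    a = results["Pi 4, 64-bit OS"]
    b = results["Pi 4, 32-bit OS"]
    return gigacycles(a[key], a["clock_ghz"]) / gigacycles(b[key], b["clock_ghz"])


for key in ("short_s", "long_s"):
    print(f"{key}: 64-bit host used {cycle_ratio(key):.2f}x the cycles of the 32-bit host")
```

If those assumptions hold, the 64-bit host is spending roughly a fifth more clock cycles per task, so the higher clock is masking a per-cycle slowdown and not adding an advantage.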
I might just buy another Pi 4 (8 GB) to make the hardware really the same. (I know the extra memory won't be used with the 32-bit OS.) I will just move the micro SD over, or perhaps make another copy.
Generally we care about run_times with BOINC projects. The cpu_times are irrelevant. The lower the run_times, the more work can be crunched and reported in a day and the higher RAC produced.
I see a reduction in the run_times with the 1.61 app compared to the 1.60 app. Good for everyone in general, but not the reason for the app. The Einstein developer was unhappy with the too-high invalid ratio (>25%) that the 1.60 app was producing against the new 1.33 BRP4X64 app that the majority of Windows hosts and the bulk of Einstein crunchers were using.
So he recompiled the 1.60 app to produce better validation results, and that became the 1.61 app. The devs, I would say, don't normally have any concern over the speed of their applications, only that they produce valid scientific results.
You would have to ask Bernd how he compiled the source code and what configuration parameters he used to produce the aarch64 application. If you are a fellow developer, you could offer your skills to the project to produce more efficient applications.
Thanks for the challenge, but I am afraid I am not that good, and as you said, even if I made it more efficient, I would worry about screwing up the scientific results.
When I started BOINC, it was with the intention of showing how much better the 64-bit OS is, with its use of double the size and quantity of registers and the ALU... and the increased clock. I thought it would just be automagically faster.
Note: The Raspberry Pi 64-bit OS is barely out of beta itself, and may improve with time. I have no idea how much overhead is involved in the OS that might compromise the runtime results. I have been updating it regularly in hopes of improvements there as well. Also, my runtimes and CPU times are very close to one another.
There's no doubt that the 1.61 app is faster than 1.60. My Jetson Nano and Xavier NX are significantly faster, but my cluster of Pis is only showing an improvement of around 6% - 12%.
My Jetson NX will get a new task done in about 6,500 sec; before it was 8,400 (custom ARM @ 1.4 GHz, using 5 cores out of 6).
The Jetson Nano will get a new task done in 14,500 sec; before it was 17,500 (ARM A57 @ 1.43 GHz, using 3 cores out of 4).
My overclocked Pis (1900/2000), running on all four cores, take roughly 14,500 sec, and before roughly 16,500 sec (if I only use 3 cores, the new run time is about 12,500 sec).
Note my Pi cluster run times are all over the place due to other work they do.
Unfortunately the new 1.61 app is still producing an awful lot of invalids (approx. 20%), whereas with the 32-bit app it's very rare to get an invalid.
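For anyone who wants to redo the arithmetic on those run times, a small sketch follows. It takes the quoted figures at face value (they are rough, and the Pi numbers are noisy), and the helper names `reduction_pct` and `tasks_per_day` are invented for this example, not anything from BOINC.

```python
SECONDS_PER_DAY = 86_400

# (run time with 1.60, run time with 1.61, concurrent tasks); times in seconds
hosts = {
    "Jetson Xavier NX": (8400, 6500, 5),
    "Jetson Nano": (17500, 14500, 3),
    "Overclocked Pi 4": (16500, 14500, 4),
}


def reduction_pct(old_s, new_s):
    """Percentage by which the run time dropped going from old to new."""
    return (old_s - new_s) / old_s * 100


def tasks_per_day(run_time_s, concurrent):
    """Tasks finished per day if every slot runs back-to-back tasks."""
    return SECONDS_PER_DAY / run_time_s * concurrent


for name, (old_s, new_s, slots) in hosts.items():
    print(f"{name}: {reduction_pct(old_s, new_s):.1f}% shorter run time, "
          f"{tasks_per_day(new_s, slots):.1f} tasks/day on {slots} cores")

# Pi 4 on 3 cores instead of 4 (12,500 sec per task, as quoted above)
print(f"Pi 4 on 3 cores: {tasks_per_day(12500, 3):.1f} tasks/day")
```

On these numbers the Pi still completes more tasks per day on four cores than on three, even though each individual task takes longer.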
Undoubtedly, the errors are from too-aggressive fmath config settings in the CXX flags parameter. I believe that is what Bernd said he tweaked once already in the 1.61 app.
But I too am still seeing too high an invalid ratio on my ARM64 hosts. Most projects state that they want an invalid ratio below 10%.
Looks like he needs to take another whack at the application, or loosen the validation settings, which he stated he didn't really want to do.
He needs to sort it out, as the BRP4 data has now taken back prominence over the FGRP5 data in project priority.
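As a generic illustration of why math settings can matter for validation (this is not the Einstein code, and I don't know which flags were actually used): floating-point addition is not associative, so a compiler that is allowed to reorder arithmetic can produce slightly different sums from the same inputs. The tolerance check at the end is only a stand-in for whatever comparison the project validator really does.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two different evaluation orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)     # the two sums are not bit-identical
print(abs(left_to_right - right_to_left)) # but the difference is tiny

# A tolerance-based comparison (hypothetical threshold) accepts both results.
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-9))
```

A single reordering like this is harmless, but over millions of operations the differences can grow, and whether two hosts still agree then depends on how tight the validator's tolerance is.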
With my Intel-based laptop, running 4 CPU jobs, one APU/GPU job, and one NVIDIA GPU job, I am getting
378/5 valid/invalid.
With my AMD-based desktop, running 10 CPU jobs and one AMD GPU job, I am getting
917/20 valid/invalid.
32-bit Pi 4 running 4 jobs
67/13
64-bit Pi 4 running 4 jobs
22/26
I am astounded!!! In the first place, I haven't been keeping track of the invalids on a per-job or per-computer basis; I was assuming until now that the invalids came from the study itself, expected due to the data, and not from the processes.
Now that you have me looking at it on the basis of what code is running and on which processors, the results on the Pis in general are pretty bad, and the ones on AArch64 are especially bad, worse I guess than the results you were seeing! :o
Note: All the jobs are running on real cores. Run N or N-1 Jobs on N cores. 90% of the time with nothing serious going on besides these jobs.
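Turning those counts into invalid percentages makes the gap easier to see. This sketch assumes the two Pi figures (67/13 and 22/26) are in the same valid/invalid order as the laptop and desktop ones; the names `counts` and `invalid_pct` are only for this example.

```python
# (valid, invalid) counts per host, copied from the figures above.
counts = {
    "Intel laptop": (378, 5),
    "AMD desktop": (917, 20),
    "Pi 4, 32-bit": (67, 13),
    "Pi 4, 64-bit": (22, 26),
}


def invalid_pct(valid, invalid):
    """Invalid results as a percentage of all validated results."""
    return invalid / (valid + invalid) * 100


for host, (valid, invalid) in counts.items():
    print(f"{host}: {invalid_pct(valid, invalid):.1f}% invalid")
```

That works out to roughly 1% and 2% on the x86 machines, about 16% on the 32-bit Pi, and a little over half of all results on the 64-bit Pi.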
If you look at the wingmen on your invalid results, you will see that you are always the odd man out against two Windows hosts with the BRP4X64 app. That app is from 2013 and is only SSE enabled.
I have a hunch the 1.60 and 1.61 apps were built with much newer math library functions, probably with the SSE math functions deprecated and newer NEON or similar math API libraries used instead.
That is probably why there is such a high invalid rate. Bernd might have to "dumb down" the 1.61 app even further so that it uses only the much older SSE math libraries, if they even exist in the current ARM64 application environment.
Or do some more heavy lifting and bring the old BRP4X64 into modern development status.
I am guessing invalid results that don't have to do with erroneous or incomplete data have to be sent out to have the same job redone? ... presumably by a different system?
Yes, that is the default BOINC mechanism for data replication. It depends on the project whether that is used though. Some projects only submit a single result because of the nature of the data and applications.
Einstein uses the default data replication of two or more valid results before submitting to the science database.
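A very simplified sketch of the quorum idea described above: results for the same workunit are compared, and the workunit only validates once at least two of them agree. This is not BOINC's validator API. The function name `find_canonical`, the single-number "result", and the tolerance are all assumptions made for illustration.

```python
import math


def find_canonical(results, min_quorum=2, rel_tol=1e-5):
    """Return indices of the first group of at least min_quorum results that
    agree with each other within rel_tol, or None if no such group exists."""
    for candidate in results:
        agreeing = [
            j for j, other in enumerate(results)
            if math.isclose(candidate, other, rel_tol=rel_tol)
        ]
        if len(agreeing) >= min_quorum:
            return agreeing
    return None


# Two hosts agree, the third is the odd man out and would be marked invalid.
print(find_canonical([1.000001, 1.0000012, 1.01]))

# Only two results and they disagree: no quorum yet, so the task goes out again.
print(find_canonical([1.0, 1.01]))
```

In the second case nobody is marked invalid yet; another copy of the task has to be sent to a different host to break the tie.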
Are there any FPGA systems active at E@H?
A Proud member of the O.F.A. (Old Farts Association).