Jetson Nano vs Raspberry PI 4 - CPU comparisons

It may not be fair to compare the new Cortex-A72-based Raspberry PI 4 with the Jetson Nano, which is equipped with the somewhat older Cortex-A57. Maybe.

“You can’t add apples and oranges!”

That is what our teachers told us when we tried to apply the wrong logic to a subject. And they would probably tell us the same right now. The Raspberry PI and the Nvidia Jetson Nano are both single-board computers, but the similarity ends there: they have different processors, different clocks, different uses and a totally different market. Not to mention cost.

Why, then, did we decide to test them together? The reason might seem a bit odd, but here it is: both boards share the ARM architecture. C code written for one can easily be recompiled for the other, and running the same code on different boards can reveal the subtle differences between them.

We will give a short description of both SBCs and their specifications, then we’ll have a look at the code used for the test. I won’t call it a “benchmark” because we just want to measure the performance of our code, not of the computers.

CPU specifications

	Raspberry PI 4	Nvidia Jetson Nano
CPU	Broadcom BCM2711, Quad core Cortex-A72 (ARM v8) 64-bit SoC	64-bit Quad-core ARM Cortex-A57
GPU	Broadcom VideoCore VI @ 500MHz	128-core Maxwell GPU @ 921MHz
Clock frequency	1.5GHz	1.43 GHz
System memory	1GB, 2GB, or 4GB LPDDR4–3200 SDRAM 25.6 GB/s	4GB 64-bit LPDDR4 @ 1600MHz 25.6 GB/s

The first thing to notice is that both SBCs have similar memory configurations. Our goal is to test whether the memory bandwidth is sufficient for both processors.
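
As a rough sanity check on the bandwidth figures in the table, theoretical peak bandwidth is simply the transfer rate multiplied by the bus width. The sketch below is an illustration, not a measurement: the function name is made up for this example, and the 32-bit case is included only to show how the result scales with bus width, since the table does not state the bus width of the PI 4.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# LPDDR4 clocked at 1600 MHz moves data on both clock edges: 3200 MT/s.
# A 64-bit bus (as listed for the Jetson Nano) gives the table's 25.6 GB/s.
print(f"64-bit bus at 3200 MT/s: {peak_bandwidth_gb_s(2 * 1600, 64):.1f} GB/s")

# Assumption for illustration only: the same transfer rate on a 32-bit bus.
print(f"32-bit bus at 3200 MT/s: {peak_bandwidth_gb_s(3200, 32):.1f} GB/s")
```

These are upper bounds; the sustained bandwidth a program like Mlucas actually sees is lower and depends on access patterns and caches.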

It is the new processor, and not the increase in clock frequency, that made the difference between the Raspberry PI 3 and the PI 4. That’s why we now want to test the A72 against the A57.

There are a few differences between the Cortex-A57 and the Cortex-A72 (source: ARM Developer)

The first difference is the L2 cache size. The Cortex-A57 comes with 512 KB, 1 MB, or 2 MB of L2 cache, while the Cortex-A72 adds a fourth option of 4 MB.
The GIC CPU interface can now be disabled for the Cortex-A72.
The Cortex-A72 model offers the option to include or exclude the ACP (Accelerator Coherency Port) interface.
The number of FEQ (Fill/Evict Queue) Entries on the Cortex-A72 has been increased to include options of 20, 24, and 28 compared to the Cortex-A57 which offers 16 or 20 entries. This feature is important for Cycle Model users doing performance analysis and studying the impact of various L2 cache parameters.
A number of system registers are updated with new values to reflect the Cortex-A72. The primary part number field in the Main ID register (MIDR) for Cortex-A72 is 0xD08 vs the Cortex-A57 value of 0xD07 and the Cortex-A53 value of 0xD03. A number of other ID registers change value from 7 on the Cortex-A57 to 8 on the Cortex-A72.
A number of new events are tracked by the Cortex-A72 Performance Monitor Unit (PMU). All of the new events have event numbers 0x100 and greater. There are three main sections covering:
- Branch Prediction
- Queues
- Cache
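
The part numbers quoted in the list above can be read out of a raw MIDR value with a few shifts and masks. The following is a minimal sketch: the field layout follows the ARMv8 MIDR_EL1 definition, while the function names and the two sample register values are illustrative assumptions rather than readings taken from the boards tested here.

```python
# Primary part numbers quoted in the text above.
PART_NAMES = {0xD03: "Cortex-A53", 0xD07: "Cortex-A57", 0xD08: "Cortex-A72"}


def decode_midr(midr: int) -> dict:
    """Split a 32-bit MIDR value into its architectural fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,   # bits [31:24], 0x41 means ARM
        "variant": (midr >> 20) & 0xF,        # bits [23:20]
        "architecture": (midr >> 16) & 0xF,   # bits [19:16]
        "part": (midr >> 4) & 0xFFF,          # bits [15:4], primary part number
        "revision": midr & 0xF,               # bits [3:0]
    }


def core_name(midr: int) -> str:
    """Map the primary part number to a core name, if it is one we know."""
    return PART_NAMES.get(decode_midr(midr)["part"], "unknown")


# Sample values chosen for illustration only.
for sample in (0x410FD083, 0x411FD071):
    fields = decode_midr(sample)
    print(f"{sample:#010x}: part {fields['part']:#05x} -> {core_name(sample)}")
```

On Linux the same part number usually appears as the "CPU part" field in /proc/cpuinfo, which is a quick way to confirm which core a board is really running.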

The Cortex-A72 contains many micro-architecture updates for incremental performance improvement, for example the L2 FEQ size. In a test comparing the Cortex-A57 CPAK and a Cortex-A72 CPAK with the exact same software program, both CPUs reported approximately 21,500 instructions retired. This is the instruction count if the program were viewed as a sequential instruction stream. Both CPUs also do a number of speculative operations; the Cortex-A57 reported about 37,000 instructions speculatively executed and the Cortex-A72 reported 35,700.
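
To put those counts in perspective, one can compare how many instructions each core executed speculatively per instruction it actually retired. This is only a small worked calculation on the approximate figures quoted above; the function name is made up for this example.

```python
RETIRED = 21_500  # approximate instructions retired on both CPUs
SPECULATIVE = {"Cortex-A57": 37_000, "Cortex-A72": 35_700}


def speculation_ratio(speculative: int, retired: int) -> float:
    """Speculatively executed instructions per retired instruction."""
    return speculative / retired


for core, count in SPECULATIVE.items():
    print(f"{core}: {speculation_ratio(count, RETIRED):.2f} speculative per retired")

# Relative reduction in speculative work going from the A57 to the A72.
saving = 1 - SPECULATIVE["Cortex-A72"] / SPECULATIVE["Cortex-A57"]
print(f"A72 executes about {saving:.1%} fewer speculative instructions")
```

With these numbers the A72 does slightly less speculative work for the same retired instruction stream, which is consistent with the incremental improvements described above.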

The screenshots of the instruction events below show first Cortex-A72 followed by Cortex-A57. All of the micro-architecture improvements of the Cortex-A72 combine to provide an incredibly high performance CPU.

Testing software

As our primary interest is math, we used Ernst W. Mayer’s program Mlucas to test our hardware. The tests are the work of Sam Laur of Finland.

Operating system

The Mlucas SIMD build requires a 64-bit OS, so we installed 64-bit Gentoo and removed all the modules we didn’t need (Ethernet, HDMI, the X server and so on). This way the boards would run with maximum efficiency.

Mlucas

Mlucas is one of the programs used to test the 50th and 51st Mersenne primes. It performs the Lucas-Lehmer algorithm to test the primality of multi-million-digit Mersenne candidates. The core of the code uses FFT multiplication through the irrational base discrete weighted transform (IBDWT), a variant of the fast Fourier transform using an irrational base; it was developed by Richard Crandall (Reed College), Barry Fagin (Dartmouth College) and Joshua Doenias (NeXT Software).
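
For readers who have not met the Lucas-Lehmer test before, here is a minimal sketch of the recurrence. It is not how Mlucas implements it: this version relies on Python's built-in big integers for the squaring, whereas Mlucas does that step with the FFT-based IBDWT described above. The function name is chosen for this example.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime.

    p is expected to be a prime exponent; p = 2 is handled separately.
    """
    if p == 2:
        return True
    mersenne = (1 << p) - 1
    s = 4
    # p - 2 iterations of s -> s*s - 2, reduced modulo 2**p - 1.
    for _ in range(p - 2):
        s = (s * s - 2) % mersenne
    return s == 0


prime_exponents = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
print([p for p in prime_exponents if lucas_lehmer(p)])
# prints [3, 5, 7, 13, 17, 19, 31]
```

Each "iter" in the timings reported below corresponds to one pass through that loop, which is why the cost of the modular squaring dominates the whole run.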

Mlucas runs multi-threaded and makes extensive use of hand-tuned assembly and SIMD optimizations. Most of the code is written in plain C, so it can be used on a wide variety of platforms. For this test, we chose to run Mlucas with 4 threads.

Our results

We ran the code on the Raspberry PI 4 with the following results:

18.0
      2048  msec/iter =   55.83  ROE[avg,max] = [0.256171472, 0.312500000]  radices = 128 16 16 32  0  0  0  0  0  0
      2560  msec/iter =   67.83  ROE[avg,max] = [0.235211654, 0.312500000]  radices = 160 16 16 32  0  0  0  0  0  0
      2816  msec/iter =   76.70  ROE[avg,max] = [0.276159794, 0.343750000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =   85.49  ROE[avg,max] = [0.266928258, 0.406250000]  radices = 192 16 16 32  0  0  0  0  0  0
      3328  msec/iter =   91.66  ROE[avg,max] = [0.254067332, 0.343750000]  radices = 208 16 16 32  0  0  0  0  0  0
      3584  msec/iter =  100.41  ROE[avg,max] = [0.271899162, 0.375000000]  radices = 224 16 16 32  0  0  0  0  0  0
      3840  msec/iter =  107.90  ROE[avg,max] = [0.247359254, 0.312500000]  radices = 240 16 16 32  0  0  0  0  0  0

The run on the Jetson Nano gave the following results:

18.0
      2048  msec/iter =   49.13  ROE[avg,max] = [0.000311985, 0.375000000]  radices = 128 16 16 32  0  0  0  0  0  0
      2304  msec/iter =   55.65  ROE[avg,max] = [0.000273731, 0.375000000]  radices = 144 16 16 32  0  0  0  0  0  0
      2560  msec/iter =   60.10  ROE[avg,max] = [0.000236003, 0.312500000]  radices = 160 16 16 32  0  0  0  0  0  0
      2816  msec/iter =   68.26  ROE[avg,max] = [0.000259256, 0.343750000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =   75.20  ROE[avg,max] = [0.000267585, 0.375000000]  radices = 192 16 16 32  0  0  0  0  0  0
      3328  msec/iter =   80.51  ROE[avg,max] = [0.000282796, 0.375000000]  radices = 208 32 16 16  0  0  0  0  0  0
      3584  msec/iter =   86.82  ROE[avg,max] = [0.000254826, 0.343750000]  radices = 224 16 16 32  0  0  0  0  0  0
      3840  msec/iter =   93.43  ROE[avg,max] = [0.000247071, 0.312500000]  radices = 240 16 16 32  0  0  0  0  0  0
      4096  msec/iter =  100.19  ROE[avg,max] = [0.000227303, 0.312500000]  radices = 256 16 16 32  0  0  0  0  0  0
      4608  msec/iter =  112.94  ROE[avg,max] = [0.000248429, 0.312500000]  radices = 288 16 16 32  0  0  0  0  0  0
      5120  msec/iter =  128.99  ROE[avg,max] = [0.000234485, 0.281250000]  radices = 160 32 32 16  0  0  0  0  0  0
      5632  msec/iter =  146.71  ROE[avg,max] = [0.000257845, 0.343750000]  radices = 176 32 32 16  0  0  0  0  0  0
      6144  msec/iter =  161.94  ROE[avg,max] = [0.000247003, 0.312500000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =  172.41  ROE[avg,max] = [0.000266479, 0.375000000]  radices = 208 32 32 16  0  0  0  0  0  0
      7168  msec/iter =  186.51  ROE[avg,max] = [0.000226100, 0.281250000]  radices = 224 32 32 16  0  0  0  0  0  0
      7680  msec/iter =  200.75  ROE[avg,max] = [0.000236377, 0.312500000]  radices = 240 32 32 16  0  0  0  0  0  0
      8192  msec/iter =  221.50  ROE[avg,max] = [0.000237378, 0.312500000]  radices = 256 32 32 16  0  0  0  0  0  0

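So it looks like, in terms of sheer CPU power, the Jetson Nano has an advantage of a little over 10 per cent over the Raspberry PI 4: at the seven FFT lengths that both boards ran, its per-iteration times are roughly 11 to 14 per cent lower.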
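
The comparison can be reproduced directly from the two result listings. The sketch below is one way to do it, assuming each line starts with the FFT length followed by "msec/iter = <time>"; the lines are trimmed to those two fields, and the helper name is made up for this example.

```python
import re

# FFT length, then the per-iteration time in milliseconds.
LINE = re.compile(r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)")

PI4_RESULTS = """
      2048  msec/iter =   55.83
      2560  msec/iter =   67.83
      2816  msec/iter =   76.70
      3072  msec/iter =   85.49
      3328  msec/iter =   91.66
      3584  msec/iter =  100.41
      3840  msec/iter =  107.90
"""

NANO_RESULTS = """
      2048  msec/iter =   49.13
      2304  msec/iter =   55.65
      2560  msec/iter =   60.10
      2816  msec/iter =   68.26
      3072  msec/iter =   75.20
      3328  msec/iter =   80.51
      3584  msec/iter =   86.82
      3840  msec/iter =   93.43
"""


def parse_timings(text: str) -> dict:
    """Map FFT length -> msec/iter for every line that matches."""
    timings = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            timings[int(match.group(1))] = float(match.group(2))
    return timings


pi4 = parse_timings(PI4_RESULTS)
nano = parse_timings(NANO_RESULTS)
shared = sorted(pi4.keys() & nano.keys())  # only lengths run on both boards

ratios = [pi4[n] / nano[n] for n in shared]
for n, ratio in zip(shared, ratios):
    print(f"{n:5d}  PI 4 {pi4[n]:7.2f} ms  Nano {nano[n]:6.2f} ms  PI/Nano {ratio:.3f}")

mean_ratio = sum(ratios) / len(ratios)
print(f"mean PI/Nano time ratio over {len(shared)} lengths: {mean_ratio:.3f}")
```

A ratio above 1 means the PI 4 needs more time per iteration than the Nano at that FFT length. Only the shared lengths are compared, since the PI 4 listing stops at 3840.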

As a side note, running stuff concurrently on the Nano GPU doesn’t affect Mlucas timings at all. I’m running it in the higher power mode, of course (10W) and have installed a 40mm fan on the heat sink. It is PWM controlled so in normal use it doesn’t even spin up. But even running CPU+GPU at full load doesn’t make the fan run at anywhere close to full RPM, so there’s not really any detectable noise.

Conclusions

As a matter of fact, the newer CPU is not always the better one. Maybe the Nano board has a better cooling interface, or more optimized internal data management. Maybe the PI was designed mainly for short bursts of activity and does not handle overheating well. The bottom line is that the Nano runs Mlucas faster than the PI.

By way of another micro-board comparison, the main (4-core A73) processor of the Odroid N2 gets timings ~20% faster than those of the Nano. So somewhat better on a bang-for-buck basis, but similar ballpark overall. The Cortex-A73 is a couple of years newer, and faster (by clock speed) than the A57, so it is not surprising. The A73 core is, however, optimized for different uses (mobile / power efficiency), and the A72 should actually be a bit faster than the A73 at the same clocks. Why do I say this? Well… the A72 has a longer pipeline, wider instruction decode, wider out-of-order dispatch and, finally, one more execution port. But if you compare the Odroid N1 and N2, things get a bit more complicated because of RAM differences, DDR3 vs. DDR4.

On the GPU side, I am working on a full-CUDA program to test the GPU of the Jetson Nano. It is a 921.6 MHz Maxwell-architecture part with compute capability 5.3 (the GeForce 900 generation), so it is getting a bit old by now. Stay tuned for the next analysis.

Luigi_Morelli

Defining who you are is never simple or intuitive, especially when you are constantly trying to improve yourself in life and to grow both professionally and emotionally. I work to contribute to change in the key areas of computer science and to offer reasoned summaries and consulting to ICT companies and publications, but also because what I manage to accomplish gives me satisfaction and pleasure. In the same way I enjoy playing music (sax, keyboards, guitar), singing, writing (I have published 350 scientific articles and 3 books so far, but I have not finished what I have to say) and reading. I love mathematics, logic, philosophy, science and technology, and I pursue the Renaissance ideal of the homo novus, trying to fill in those parts of my life that still seem lacking.

Moreware Blog

Mindware site for Raspberry PI and Arduino prototyping