Jetson Nano vs Raspberry PI 4 – CPU comparisons


It’s not fair to compare the new Cortex-A72-based Raspberry PI 4 against the Jetson Nano, which is equipped with the somewhat older Cortex-A57. Maybe.

“You can’t sum up apples with oranges!”

That is what our teachers told us when we applied the wrong logic to a problem, and they would probably tell us the same right now. The Raspberry PI and the Nvidia Jetson Nano are both single-board computers, but the similarity ends there: they have different processors, different clocks, different uses and a totally different market. Not to mention cost.

Why, then, did we decide to test them together? The reason might seem a bit odd, but here it is: both boards share the ARM architecture. C code written for one can easily be recompiled for the other, and running the same code on different boards can show the subtle differences between them.

We will give a short description of both SBCs and their specifications, then we’ll have a look at the code used for the test. I won’t call it a “benchmark” because we just want to measure the performance of our code, and not the computers.

CPU specifications

| Specification | Raspberry PI 4 | Nvidia Jetson Nano |
|---|---|---|
| CPU | Broadcom BCM2711, quad-core Cortex-A72 (ARM v8) 64-bit SoC | 64-bit quad-core ARM Cortex-A57 |
| GPU | Broadcom VideoCore VI @ 500 MHz | 128-core Maxwell GPU @ 921 MHz |
| Clock frequency | 1.5 GHz | 1.43 GHz |
| System memory | 1 GB, 2 GB, or 4 GB LPDDR4-3200 SDRAM, 25.6 GB/s | 4 GB 64-bit LPDDR4 @ 1600 MHz, 25.6 GB/s |

The first thing to note is that both SBCs have similar memory configurations. Our goal is to test whether the memory bandwidth is sufficient for both processors.

It is the new processor, and not the increase in clock frequency, that made the difference between the Raspberry PI 3 and the PI 4. That’s why we now want to test the A72 against the A57.
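The 25.6 GB/s figure in the table can be checked with a little arithmetic. The sketch below is only an illustration: the function name `peak_bandwidth_gbs` is ours, and it assumes that the Nano's "1600 MHz" is the memory I/O clock with two transfers per cycle (3200 MT/s), as is usual for LPDDR4. It computes a theoretical peak, not a measured value.

```c
/*
 * Theoretical peak memory bandwidth in GB/s:
 *   (mega-transfers per second) x (bytes per transfer) / 1000
 * Hypothetical helper for illustration only.
 */
static double peak_bandwidth_gbs(double mega_transfers_per_s,
                                 unsigned bus_width_bits)
{
    double bytes_per_transfer = bus_width_bits / 8.0;
    return mega_transfers_per_s * bytes_per_transfer / 1000.0;
}
```

For the Nano's 64-bit bus, `peak_bandwidth_gbs(3200.0, 64)` gives 3200 x 8 / 1000 = 25.6 GB/s, which matches the table. The table does not state the bus width of the PI 4, so we do not repeat the calculation for it here.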

There are a few differences between the Cortex-A57 and the Cortex-A72 (source: ARM Developer):

  1. The first difference is the L2 cache size. The Cortex-A57 comes with a 512 KB, 1 MB, or 2 MB L2 cache, while the Cortex-A72 adds a fourth option of 4 MB.
  2. The GIC CPU interface can now be disabled for the Cortex-A72.
  3. The Cortex-A72 model offers the option to include or exclude the ACP (Accelerator Coherency Port) interface.
  4. The number of FEQ (Fill/Evict Queue) Entries on the Cortex-A72 has been increased to include options of 20, 24, and 28 compared to the Cortex-A57 which offers 16 or 20 entries. This feature is important for Cycle Model users doing performance analysis and studying the impact of various L2 cache parameters.
  5. A number of system registers are updated with new values to reflect the Cortex-A72. The primary part number field in the Main ID register (MIDR) for Cortex-A72 is 0xD08 vs the Cortex-A57 value of 0xD07 and the Cortex-A53 value of 0xD03. A number of other ID registers change value from 7 on the Cortex-A57 to 8 on the Cortex-A72.
  6. A number of new events are tracked by the Cortex-A72 Performance Monitor Unit (PMU). All of the new events have event numbers 0x100 and greater. There are three main sections covering:
    • Branch Prediction
    • Queues
    • Cache
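The part numbers in item 5 can be used to tell the cores apart in software. The following is a minimal sketch, assuming the standard MIDR layout in which the primary part number sits in bits [15:4]; the helper names `midr_part_number` and `cortex_name` are ours and are not part of any library.

```c
#include <stdint.h>

/* Extract the primary part number, bits [15:4] of the Main ID register. */
static unsigned midr_part_number(uint32_t midr)
{
    return (midr >> 4) & 0xFFFu;
}

/* Map the part numbers quoted in the list above to core names. */
static const char *cortex_name(uint32_t midr)
{
    switch (midr_part_number(midr)) {
    case 0xD03: return "Cortex-A53";
    case 0xD07: return "Cortex-A57";
    case 0xD08: return "Cortex-A72";
    default:    return "unknown";
    }
}
```

As an example, for an illustrative MIDR value of 0x410FD083 (not read from the boards in this article), `midr_part_number` returns 0xD08 and `cortex_name` returns "Cortex-A72". On Linux the same part number normally shows up as the "CPU part" field in /proc/cpuinfo.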

The Cortex-A72 contains many micro-architecture updates for incremental performance improvement, for example the L2 FEQ size. In a test comparing the Cortex-A57 CPAK and a Cortex-A72 CPAK with the exact same software program, both CPUs reported approximately 21,500 instructions retired. This is the instruction count if the program were viewed as a sequential instruction stream. Both CPUs also do a number of speculative operations; the Cortex-A57 reported about 37,000 instructions speculatively executed and the Cortex-A72 reported 35,700.
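To put the two speculation figures side by side, one can divide the speculatively executed count by the retired count. This is just a sketch of the arithmetic, using the approximate numbers quoted above; the function name `speculation_ratio` is ours.

```c
/*
 * Instructions speculatively executed per instruction retired.
 * A lower value means less work was thrown away or repeated.
 * Hypothetical helper for illustration only.
 */
static double speculation_ratio(double speculated, double retired)
{
    return speculated / retired;
}
```

With the counts above, `speculation_ratio(37000, 21500)` is about 1.72 for the Cortex-A57 and `speculation_ratio(35700, 21500)` is about 1.66 for the Cortex-A72, so in this test the A72 executes slightly fewer instructions speculatively for the same retired work.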

The screenshots of the instruction events below show first Cortex-A72 followed by Cortex-A57. All of the micro-architecture improvements of the Cortex-A72 combine to provide an incredibly high performance CPU.

 

[Screenshot: Cortex-A72 instruction events]

[Screenshot: Cortex-A57 instruction events]

 

Testing software

As our primary interest is mathematics, we used Ernst W. Mayer’s program Mlucas to test our hardware. The tests are the work of Sam Laur of Finland.

Operating system

The Mlucas SIMD build requires a 64-bit OS, so we installed 64-bit Gentoo and removed all the modules we didn’t need (Ethernet, HDMI, the X server and so on). This way the boards would run with maximum efficiency.

Mlucas

Mlucas is one of the programs used to test the 50th and 51st Mersenne primes. It performs the Lucas-Lehmer algorithm to test the primality of multi-million-digit Mersenne candidates. The core of the code uses FFT multiplication through the irrational base discrete weighted transform (IBDWT), a variant of the fast Fourier transform using an irrational base; it was developed by Richard Crandall (Reed College), Barry Fagin (Dartmouth College) and Joshua Doenias (NeXT Software).

Mlucas runs multi-threaded and makes extensive use of hand-tuned assembly and SIMD optimizations. Most of the code is written in plain C, so it can be used on a wide variety of platforms. For this test, we chose to run Mlucas with 4 threads.
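For readers who have not met the Lucas-Lehmer test, here is a toy version in C. It is not how Mlucas does it: Mlucas squares multi-million-digit numbers with FFT/IBDWT arithmetic, while this sketch uses native 64-bit integers and therefore only works for small exponents (we assume an odd prime p between 3 and 31). The function name `lucas_lehmer` is ours.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy Lucas-Lehmer test: returns 1 if M_p = 2^p - 1 is prime, 0 otherwise.
 * The caller must pass an odd prime p; the range check keeps s * s
 * within 64 bits. Hypothetical helper for illustration only.
 */
static int lucas_lehmer(unsigned p)
{
    assert(p >= 3 && p <= 31);

    const uint64_t m = (UINT64_C(1) << p) - 1;  /* Mersenne number M_p */
    uint64_t s = 4;                             /* s_0 = 4 */

    /* s_{i+1} = s_i^2 - 2 (mod M_p), repeated p - 2 times */
    for (unsigned i = 0; i < p - 2; i++)
        s = (s * s + m - 2) % m;

    return s == 0;  /* M_p is prime exactly when s_{p-2} is 0 */
}
```

For example, `lucas_lehmer(7)` returns 1 because 2^7 - 1 = 127 is prime, while `lucas_lehmer(11)` returns 0 because 2^11 - 1 = 2047 = 23 x 89. The squaring inside the loop is the step that dominates the run time for large exponents, which is why Mlucas replaces it with an FFT-based multiplication.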

Our results

We ran the code on the Raspberry PI 4 with the following results:

The run on the Jetson Nano gave the following results:

So it looks like, in terms of sheer CPU power, the Jetson Nano has a 10 percent advantage over the Raspberry PI 4.

As a side note, running stuff concurrently on the Nano GPU doesn’t affect Mlucas timings at all. I’m running it in the higher power mode, of course (10W) and have installed a 40mm fan on the heat sink. It is PWM controlled so in normal use it doesn’t even spin up. But even running CPU+GPU at full load doesn’t make the fan run at anywhere close to full RPM, so there’s not really any detectable noise.

Conclusions

As a matter of fact, the newer CPU is not always the better one. Maybe the Nano board has a better cooling interface, or more optimized internal data management. Maybe the PI was mainly designed for short bursts of activity and does not handle overheating well. The bottom line is that the Nano runs Mlucas faster than the PI.

By way of another micro-board comparison, the main (4-core A73) processor of the Odroid N2 gets timings ~20% faster than those of the Nano. So it is somewhat better on a bang-for-buck basis, but in a similar ballpark overall. The Cortex-A73 is a couple of years newer and faster (by clock speed) than the A57, so this is not surprising. That said, the A73 core is optimized for a different use (mobile / power efficiency), and the A72 should actually be a bit faster than the A73 at the same clocks. Why do I say this? Well… the A72 has a longer pipeline, wider instruction decode, wider out-of-order dispatch, and finally, one more execution port. But if you compare the Odroid N1 and N2, things get a bit more complicated because of RAM differences, DDR3 vs. DDR4.

On the GPU side, I am working on a full-CUDA program to test the GPU of the Jetson Nano. It is a 921.6 MHz Maxwell-architecture part with compute capability 5.3 (the GeForce 900 generation), so it is getting a bit old by now. Stay tuned for the next analysis.
