CUDA with Jetson Nano - Parallel Pollard Rho test

Post Views: 3,086

Math is hard. When your teacher does not like to teach it.

According to Wikipedia,

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing — an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). The CUDA platform is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements, for the execution of compute kernels.

In other, simpler words, Nvidia delivered a new development framework, made of both hardware and software, to parallelize computing tasks, thus reducing their completion times.

The CUDA paradigm is composed of a dedicated hardware with multiple processors, and a clever extension of the C/C++ language, capable to address the power of each and every single processor through simple tasks. When you have to perform repetitive tasks like searches, queries, sorting and convolutions, you can get (partially) rid of heavy loops, and schedule a tiny list of operations to different workers. With thousands of worker threads acting in parallel, the time to complete the task is greatly reduced.

Of course there are some caveats: while the serial Von Neumann scheduling only requires an input, a black box where the elaboration is done instruction after instruction, and an output, parallel architectures need to consider where the data is stored, who is working on it (and how), when will some data be free for the next evaluation. The analysis of concurrency is not easy, but once you know the basis it becomes straightforward.

CUDA Hardware – The Jetson Nano

CUDA reached its 10th incarnation, showing as a powerful and stable development tool. Born as a graphical architecture able to quickly address pixels and drive their colors on a monitor, CUDA has today become a complete system for parallel development. While its share of fame still resides in graphical data treatment and imaging, an ever growing group of developers dedicated time and effort to expose and exploit the parallel power hidden in the bare metal. Engineers and math researchers refactored well-known serial algorithms to harness the blazing speed of multiprocessing.

And rightly so. We have developed a parallel factorization algorithm witn CUDA GPUs, up to 1,000x faster than the original CPU method. Just dedicating one thread to each one of the 960 residual classes (out of 4620) we have such huge speedup. But the Jetson Nano does not have 960 spare computing units, and although it could still benefit from its 128 CUDA cores, we will test a different factorization algorithm on it.

Instead of subdividing the work into 8 blocks using 125 cu each, we will try a totally different approach, a bit less deterministic, but that will take advantage of the somewhat limited CUDA with Jetson Nano using a statistics simulation.

Factorization – The Pollard Rho algorithm

From Wikipedia:

Suppose we need to factorize a number n = p*q, where p is a non-trivial factor. A polynomial modulo n, called g(x) (e.g., g(x) = (x² + 1) mod n, is used to generate a pseudorandom sequence. A starting value, say 2, is chosen, and the sequence continues as x₁=g(2), x₂=g(g(2)), x₃=g(g(g(2))), etc. The sequence is related to another sequence {x_k mod p}. Since p is not known beforehand, this sequence cannot be explicitly computed in the algorithm. Yet, in it lies the core idea of the algorithm.

Because the number of possible values for these sequences are finite (the modular operation assures it), both the {x_k} sequence, which is mod n, and {x_k mod p } sequence will eventually repeat, even though we do not know the latter. Assume that the sequences behave like random numbers. Due to the birthday paradox (we will analyze it o another article), the number of x_k before a repetition occurs is expected to be O(√N), where N is the number of possible values. So the sequence {x_k mod p} will likely repeat much earlier than the sequence {x_k}. Once a sequence has a repeated value, the sequence will cycle, because each value depends only on the one before it. This structure of eventual cycling gives rise to the name “Rho algorithm”, owing to similarity to the shape of the Greek character ρ when the values x₁ mod p, x₂ mod p, etc. are represented as nodes in a directed graph.

Ok. Now let’s take a deep breath, and see what the previous paragraph really means.

Suppose we are running on a long circular race track. How do we know we have completed one cycle? A clever solution is to have two runners A and B with B running twice as fast as A. They start off at the same position and when B overtakes A, we know that B has cycled around at least once.

We have the following algorithm:

x = 2; y = 2; d = 1
    while d == 1:
        x = g(x)
        y = g(g(y))
        d = gcd(|x - y|, n)
    if d == n: 
        return failure
    else:
        return d

x = 2; y = 2; d = 1

while d == 1:

x = g(x)

y = g(g(y))

d = gcd(|x - y|, n)

if d == n:

return failure

else:

return d

Suppose we want to factor the number 8051. We have

n = 8051
x = 2
y = 2
g(x) = (x^2 + 1)
g(g(x)) = g(x^2 + 1)

n = 8051

x = 2

y = 2

g(x) = (x^2 + 1)

g(g(x)) = g(x^2 + 1)

Applying the algorithm, we get the following resolution steps:

iteration	x mod 8051	y = (x^2 + 1)^2 + 1 mod 8051	GCD(\|x − y\|, 8051)
1	2	2	1
2	5 (2^2 +1)	26 (5^2 + 1)	1
3	26 (5^2 + 1)	7474 ((677^2 + 1) mod 8051)	1
4	677 (26^2 + 1)	871	97

Starting with different x and/or y, we could find the other divisor of 8051, 83

If we use a naive trial-factoring algorithm, we should test all prime factors below the square root of N. The trial-factoring algorithm grows exponentially with the digits of the number (2^n/2) .

The Pollard Rho algorithm offers a trade-off between its running time and the probability that it finds a factor. A prime divisor can be achieved with a probability around 0.5, in O(√d) <= O(n^1/4) iterations. This is a heuristic claim, and rigorous analysis of the algorithm remains open.

From theory to practice

To show the algorithm in action (and how a parallel implementation can accelerate outcomes) we prepared a short source code, running the Pollard Rho factorization using one CPU, multiple CPUs through OpenMP, and GPUs. A list of candidates to factorize is provided, to show:

How the algorithm becomes slower as the factor grows
How the slowdown is due to the factor size, and not to the number to factor
How the GPU overhead makes the parallel algorithm slower than the CPU’s for smaller factors
How a correct choice of the blocksize and blocknumber for GPUs can lead to better results
What is the overall gain of the parallel algorithm.

The algorithm we present is limited to 64 bit numbers, but it is fairly easy to extend it to multi-precision arithmetic. Anyway, it is known that the efficiency of Pollard Rho decreases with the length of the factors: in general, when N has more than 40 digits, other methods (such as P-1 and ECM) should be chosen.

The algorithm is very fast for numbers with small factors, but slower in cases where all factors are large. The ρ algorithm’s most remarkable success was the factorization of the eighth Fermat number, $F 8$ = 1238926361552897 * 93461639715357977769163558199606896584051237541638188580280321. The ρ algorithm was a good choice for $F 8$ because the prime factor $p$ = 1238926361552897 is much smaller than the other factor. The factorization took 2 hours on a UNIVAC 1100/42.

You will find more information on Fermat factorization and primality proving on my FermatSearch site.

Differences between CPU and GPU implementation

We have two different kernels to run as host and device.

The host kernel runs on the CPU (or CPUs when OpenMP is enabled). It takes the number(s) to be factored, and loops until a solution is found.

The device kernel runs on the GPU. The (random) values of x and y are pre-computed, indexed and stored inside an array. The array is copied asynchronously after the number to factor. Then the block is launched. Each thread of the device code simultaneously accesses a section of the array, calculates its step (g(x), g(g(x)), absolute value of the sum and gcd), then stores the results back on the array, and checks if the GCD is greater than 1. If it is, we have a factor. The value is saved on the result location. Once this one-pass loop is completed, the result is transferred back to the host memory, and returned to the calling main.

How does it perform?

We used a list of 19 numbers, and tested them on 2 different platforms: a PC with a Intel Core i7 9800X processor and a Nvidia RTX 2060 GPU, and a Nvidia Jetson Nano board.

The PC gave the following results:

Overall results:
Host   Time = 503.134 s
Device Time =   1.576 s
Speedup     = 319.2

Overall results:

Host Time = 503.134 s

Device Time = 1.576 s

Speedup = 319.2

In other words, even a naive implementation of the algorithm in CUDA showed a speedup of 320x. We ran a 256×256 grid on the GPU while we had 1920 computational units, and didn’t take advantage of shared memory speedup or register optimization.

On the Nano we achieved the following timings:

Overall results:
Host   Time = 1893.795 s
Device Time =   39.448 s
Speedup     =   48.0

Overall results:

Host Time = 1893.795 s

Device Time = 39.448 s

Speedup = 48.0

The GPU on the Nano is a Maxwell with a low number of cores and aa clock below 1 GHz, but we achieved a 48x speedup relative to the CPU.

We also notice that CUDA with Jetson Nano using the parallel algorithm is more than 12 times faster than the Intel 9800X. Not bad for a board that costs 15 times less than the PC system.

The source code to experiment CUDA with Jetson Nano is available on Github. We will test other back-to-back CPU vs GPU algorithms soon, to show not only the joy of parallel computing, but also the amaze of the graphic acceleration of CUDA.

Useful links:

Jetson Nano Development Kit

Luigi_Morelli

Definire ciò che si è non risulta mai semplice o intuitivo, in specie quando nella vita si cerca costantemente di migliorarsi, di crescere tanto professionalmente quanto emotivamente. Lavoro per contribuire al mutamento dei settori cardine della computer science e per offrire sintesi ragionate e consulenza ad aziende e pubblicazioni ICT, ma anche perche’ ciò che riesco a portare a termine mi dà soddisfazione, piacere. Così come mi piace suonare (sax, tastiere, chitarra), cantare, scrivere (ho pubblicato 350 articoli scientfici e 3 libri sinora, ma non ho concluso ciò che ho da dire), leggere, Adoro la matematica, la logica, la filosofia, la scienza e la tecnologia, ed inseguo quel concetto di homo novus rinascimentale, cercando di completare quelle sezioni della mia vita che ancora appaiono poco ricche.

Moreware Blog

Mindware site for Raspberry PI and Arduino prototyping

CUDA with Jetson Nano – Parallel Pollard Rho test

CUDA Hardware – The Jetson Nano

Factorization – The Pollard Rho algorithm

From theory to practice

Differences between CPU and GPU implementation

How does it perform?

Like this:

Luigi_Morelli

Leave a ReplyCancel reply

Moreware Blog

Mindware site for Raspberry PI and Arduino prototyping

CUDA Hardware – The Jetson Nano

Factorization – The Pollard Rho algorithm

From theory to practice

Differences between CPU and GPU implementation

How does it perform?

Share this:

Like this:

Luigi_Morelli

Leave a ReplyCancel reply