CDW Ches99 Talk

33 %
67 %
Information about CDW Ches99 Talk

Published on January 5, 2008

Author: BAWare


Montgomery’s Multiplication Technique: How to make it Smaller and Faster:  Montgomery’s Multiplication Technique: How to make it Smaller and Faster Colin D. Walter Computation Department, UMIST, UK Peter Montgomery:  Peter Montgomery “Modular Multiplication without Trial Division” Math. Computation, vol. 44 (1985) 519-521 (A  B) mod M without obtaining digits q  (A  B) / M Motivation:  Motivation Faster RSA Cryptosystem  through pipelined array Safer encryption  against timing or DPA attacks Overview:  Overview RSA & Notation Classical Algorithm Montgomery’s Version Comparison carry propagation digit distribution communication timing/power attacks Conclusion Enigma:  Enigma Special Purpose Colossus (1943-44) Tommy Flowers, Bletchley Park, England. General Purpose ENIAC (1943-46) John Eckert & John Mauchly Philadelphia, US. RSA:  RSA Modulus M of around 1024 bits Two keys d and e such that Ade  A mod M A encrypted to C = Ae mod M C decrypted by A = Cd mod M M = PQ, a product of two large primes e is often small (e.g. a Fermat prime) d satisfies de  1 mod (P–1)(Q–1) Faster H/W = More Secure Encryption :  Faster H/W = More Secure Encryption Work to factorize M doubles for every extra ~15 bits (for key lengths ~210 bits) Work to en/decrypt : ((1024+15)/1024)2 per multiplication ((1024+15)/1024)3 per exponentiation, i.e. only 5% extra! Number representations:  Number representations X = i=0 xiri r = 2k is the radix (prime to M) xi is the ith digit (usually 0  xi < r) n  max no. of digits in any number Redundant reps: wider digit range than 0 .. r1 H/W is built from kk-bit multipliers n fixed by H/W register size n–1 Redundancy:  Redundancy Digits xj split into carry-save parts: xj = xj,s + rxj,c X  X+Y is performed by digit-parallel addition: xj  xj,s + xj1,c + yj No carry propagation: only old carries on right side Multiplication AB :  Multiplication AB Use n digit multipliers to form aiB and add to a partial product P: P := 0 ; For i := n1 downto 0 do P := rP + aiB { Post-condition: P = AB } Slide11:  Either: Use redundancy in P and parallel digit addition to add aiB in one clock cycle Cell j computes aibj in cycle i P := P + aiB (digit-parallel) P in Carry-Save form: pj = pj,s + rpj,c cell j cell j-1 cell j+1 pj,c pj+1,c pj+1,s pj,s pj-1,c pj-1,s pj+1,s pj,c pj,s pj-1,c pj-1,s bj+1 bj bj-1 ai ai Slide12:  or: Pipeline the addition of aiB over n cycles and propagate carries with no redundancy: Cell j computes aibj in cycle i+j carry P := P + aiB (digit-serial) time j+1 time j time j-1 carry carry carry pj+1 pj pj-1 pj+1 bj+1 pj bj pj-1 bj-1 ai ai ai ai Multiplier Complexity:  Multiplier Complexity Assume wires take area but not time (or power). AreaTime2 complexity for un-pipelined k-bit multiplication is bounded below by k2 This can be achieved for time in [log k ..k ] Discrete Fourier Transform has large constants for time and area. Better, but asymptotically poorer designs for k expected here. Slide14:  Cross-over point ? 107 transistors available for RSA  k  64 to accommodate aiB Speed by using at least n multipliers to perform a full length aiB (or equivalent) in one cycle. Real-Time ?:  Real-Time ? Assume : bus is one k-bit digit per cycle k-bit multiplier operates in one cycle Then : AB takes n cycles using n multipliers Throughput is one digit per cycle for multn. Need O(nk) multiplications for decryption Conclude : Need O(nk) rows of n multipliers. Classical Mod Multn Algorithm:  Classical Mod Multn Algorithm { Pre-condition: 0  A < rn } P := 0 ; For i := n1 downto 0 do Begin P := r×P + ai×B ; qi := P div M ; P := P  qi×M ; End { Post-conditions: P = A×B  Q×M, P  (A×B) mod M } Comments:  Carry propagation a problem (it slows finding q) ; Use only top digits of M and P to determine a good multiple of M to remove ; P is bounded by small multiple of M ; Clean up only at end ; Critical path is finding q. Comments Disadvantages:  Disadvantages Redundant rep. for digit-parallel operation Global broadcast of q to each digit position Montgomery’s Mod Multn Algm:  Montgomery’s Mod Multn Algm { Pre-condition: 0  A < rn } P := 0 ; For i := 0 to n1 do Begin qi := (p0+aib0)(-m0-1) mod r ; P := (P + ai×B + qi×M) div r ; { Invariant: 0  P < M+B } End ; { Post-condition: Prn = A×B + Q×M , P  (ABr–n) mod M } Peter Montgomery ::  Peter Montgomery : reverses the multiplication order chooses digits from least to most significant shifts down on each iteration. uses the least significant digits to determine multiple of M to subtract. Computes (AB×r–n) mod M Slide21:  The factor r–n is cleared up in post-processing Any extra multiple of M is removed then qi has no carries to wait for Pipelining of the digits can now take place : compute aibj+1 on the cycle after aibj use a non-redundant representation no broadcasting of qi The Post-Condition:  The Post-Condition m01 exists qi chosen so division by r is exact Define Ai =  j=0 ajrj and Qi analogously Then Ai = Ai1+riai and An = A So ri+1P = Ai×B + Qi×M at end of ith iteration Hence rnP = A×B + Q×M at end. i The Bounds:  The Bounds A converted on-line to non-redundant form Can assume ai  r1 So loop invariant P < M+B Slide24:  If critical path length is computing q: Scale M to ensure (m01) mod r = 1 Shift B up to make b0 = 0 Result: qi = p0 mod r is simple Critical path in repeated cell. Cost: Increase n by 2 Removing rn:  Removing rn The Montgomery class of A is A  rnA mod M Montgomery modr multn is denoted  . Montgomery product of A and B is A  B  A B rn  ABrn  AB mod M. Applying  to A instead of  to A produces Ae in an expn algorithm _ _ _ _ _ _ _ _ _ _ _ ___ ___ Encryption Process:  Process: A  A  Ae  Ae Precompute: R2 = rn  r2n mod M Start with: A  R2  Arn  A mod M Exponentiate to obtain Ae End with: Ae  1  Ae mod M _ _ _ __ __ __ __ _ Encryption Process 2M Bound:  Outputs are re-used as inputs. So need to bound I/O: Suppose an1 = 0 Then P < M+B at end of loop n2 yields P < M+r1B at very end. e.g. If B < 2M then P < 2M 2M Bound Slide28:  Suppose 2rM < rn, A < 2M and R2 < 2M Then A < 2M, Ae < 2M and P = Ae  1 < 2M Final output P satisfies Prn = Ae + QM where Q  rn1. Here Ae < 2M yields Prn < (rn+1)M So P  M P = M  Ae  0 mod M  A  0 mod M A = M should never arise; A = 0 yields P = 0. So no final modular adjustment is necessary. _ __ __ __ __ __ _ Digit-Parallel Implementation:  Digit-Parallel Implementation Classical vs Montgomery: Similarities: Broadcasting of qi and ai Redundant representations Computing qi takes time Differences: Bits to determine qi Slide30:  ai, qi P := P + aiB (digit-parallel, not modular) cell j+1 P in Carry-Save form: pj = pj,s + rpj,c cell j cell j-1 ai, qi mj bj mj+1 bj+1 mj-1 bj-1 pj+1,c pj,c pj-1,c pj+1,s pj-1,s pj,s pj-1,c pj-1,s pj,s pj,c pj+1,s Slide31:  j j+1 j-1 n-1 Digit-Parallel P := rP + aiB - qiM (Classical) pj-1,c pj-1,s pj,s pj,c pj+1,c pj+1,s pn-2,s pn-3,c pj-2,s pj-3,c pj,c mn-1 bn-1 mj+1 bj+1 mj-1 bj-1 mj bj ai qi ai qi qi qi+1 pj-2,c Slide32:  (Montgomery) j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bj Systolic Array (Montgomery):  Systolic Array (Montgomery) Write ith value of P as P(i) =  j=0 p(i1) r j Cells in col j compute p(i)j at time 2i+j : p(i)j + rc(i)j  p(i1)j+1 + c(i)j1 + aibj + qimj Cells in col 0 compute qi at time 2i : qi  (p(i1)1+ai×b0)(m01) mod r Any number of rows may be constructed Different timing schedules are possible n1 Slide34:  cell i,j+1 cell i,j cell i,j-1 cell i+1,j+1 cell i+1,j cell i+1,j-1 carry carry carry carry carry carry carry carry ai ai ai ai ai+1 ai+1 ai+1 ai+1 qi+1 qi+1 qi+1 qi+1 qi qi qi qi mj bj mj+1 bj+1 mj-1 bj-1 mj-1 bj-1 mj-1 bj-1 mj bj mj bj mj+1 bj+1 mj+1 bj+1 p(i+1) p(i+1) p(i+1) p(i+1) p(i) p(i) p(i) p(i) p(i+2) p(i+2) p(i+2) p(i+2) j-1 j-1 j-1 j-2 j-2 j-2 j+1 j+1 j+1 j j j Systolic Array for P := (AB + QM)r-n Slide35:  j+1 j j-1 0 Data Flow for P(i+1) := (P(i) + aiB + qiM)/r pj-1 (n) pj-1 pj-1 pj pj pj+1 pj pj-2 pj-2 p0 (n) (n) (i) (i) (i) (i) (i+1) (i+1) (i+1) ci,j+2 ci,j+1 ci,j ci,j-1 ci,1 ai ai ai ai ai ai qi qi qi qi qi mi mi+1 mi-1 m0 b0 bj-1 bj+1 bj Digit-Serial Implementation (Montgomery):  Digit-Serial Implementation (Montgomery) Advantages: Local communication Shorter critical path Critical path easily in repeated cell Non-redundant representation Digit serial I/O Different digits qi and ai re DPA Digit-Serial Implementation (Montgomery):  Disadvantage: H/W only half used Solutions: Interleave two multiplications E.g. configure exponentiation  75% use Group digits as per Peter Kornerup [94] Digit-Serial Implementation (Montgomery) Slide38:  Other cell boundaries/groupings are possible; Timing front angles in the data dependency graph can be altered; For current speed of array implementations see Blum and Paar [99]; Vuillemin et al. [97] constructed an array; Design is parametrised: by k and no. of rows. Data Dependency Diagrams:  Data Dependency Diagrams Data Dependency Diagrams:  Data Dependency Diagrams Parallel Digit Implementation t = 0 t = 1 t = 2 t = 3 Data Dependency Diagrams:  Data Dependency Diagrams Walter [93] t=4 t=5 t=3 t=6 t=2 t=1 t=0 t=7 1 tick 2 ticks ... Data Dependency Diagrams:  Data Dependency Diagrams Kornerup [94] t=0 t=1 t=2 t=3 t=4 t=5 t=6 Data Integrity:  Data Integrity P = A×B  Q×M or Prn = A×B + Q×M These are easily checked mod m. e.g. m a prime just above the maximum cell output. Cost: ~ one cell in the array i.e. increasing n by 1. On error, abort or re-compute by another route: e.g. M replaced by dM for a digit d prime to r. Timing & Power Attacks:  Timing & Power Attacks Most attacks which succeed on the classical algorithm have equivalents which will succeed on corresponding implementation of Montgomery’s algorithm. With parallel digit processing, the same digits of A and Q are used in every digit slice in the same cycle. So DPA might reveal them. Pipelined version has no equivalent (see data dependency graph). It uses many different digits of A and Q in each cycle. DPA is more difficult. Conclusions:  Conclusions For single k-bit multiplier or array of n parallel cells, classical and Montgomery algorithms are almost equal. For pipelined array, Montgomery method has advantages: smaller time & area constants, better I/O, better against DPA; Pipeline is more complex for 100% use, but faster clock. Parameters can be chosen for specific purposes. Slide46:  Go forth and Multiply

Add a comment

Related presentations

Related pages

bYTEBoss document search engine

CDW_Ches99_Talk digit bryan lukano piti' 10911 ECU izithakazelo zakwa nhlengethwa shimano maa-handbook spd press_3t10_finaltrad goi Trenberth net runrate ...
Read more

Montgomery-Multiplication--PPT | Powerpoint Presentations ...

Source : Sponsored Links. About; Contact Us; Family safe; Privacy Policy; Terms ...
Read more

Montgomery’s Multiplication Technique

Montgomery’s Multiplication Technique: How to make it Smaller and Faster Colin D. Walter Computation Department, UMIST, UK
Read more

CFH Total Document Management Ltd CIPF02 CIPFA Business Ltd COMM04 Community & Voluntary Service COMM12 Commercial Interiors & Storage Ltd COMM31
Read more

bYTEBoss Briefing090813

<<>> Einsatzbriefing <<>> zu ZI 090813 nE Lage Informationen des Imperialen Geheimdienstes zufolge gehen die feindlichen Aktivit ten im Rishi ...
Read more

Supplier List Invoices >£250 - Revenue. _options * This sheet is manipulated by the 'Options...' dialog and should not be changed by hand Supplier
Read more