Extreme Fibonacci

There have been a billion iterations of Fibonacci challenges on this website, so let's spice things up with a Fibonacci challenge of a billion iterations!

Your challenge is to output the first 1000 decimal digits of the 1,000,000,000th Fibonacci number with as short a program as possible. This may then optionally be followed by any additional output of your choosing, including but not limited to the rest of the digits.

I am using the convention that fib 0 = 0, fib 1 = 1.

Your program must be fast enough for you to run it and verify its correctness. For this purpose, here are the first 1000 digits:




Answer

Python 2 + Sympy, 72 bytes

from sympy import*
n=sqrt(5)
print'7'+`((.5+n/2)**1e9/n).evalf(1e3)`[2:]

Try it online!

-10 bytes by removing the practically-0 term, thanks to Jeff Dege
-1 byte (1000 -> 1e3), thanks to Zacharý
-2 bytes by removing the unnecessary variable, thanks to Erik the Outgolfer
-2 bytes by moving to Python 2, thanks to Zacharý
-3 bytes by 11'ing the -11, thanks to ThePirateBay
-3 bytes by swapping str for backticks, thanks to notjagan

Now beats OP's unposted Haskell solution!
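
For readers without Sympy, here is a rough Python 3 sketch of the same idea using the standard-library decimal module instead (a swapped-in technique, not the golfed answer). It evaluates phi**n / sqrt(5) and drops the practically-0 second term of Binet's formula. The function names and the guard-digit count are my own assumptions.

```python
from decimal import Decimal, MAX_EMAX, localcontext

def fib_leading_digits(n, digits):
    """Leading `digits` decimal digits of fib(n), from phi**n / sqrt(5)."""
    with localcontext() as ctx:
        # Guard digits (assumed amount): rounding error in phi is amplified
        # by roughly a factor of n when raised to the n-th power.
        ctx.prec = digits + len(str(n)) + 10
        # fib(10**9) has about 209 million digits, far beyond the default Emax.
        ctx.Emax = MAX_EMAX
        root5 = Decimal(5).sqrt()
        phi = (1 + root5) / 2
        approx = phi ** n / root5          # the ((1-sqrt(5))/2)**n term is dropped
        return "".join(map(str, approx.as_tuple().digits[:digits]))

def fib_exact(n):
    """Plain iteration, only used to cross-check small inputs."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib_leading_digits(10**9, 10))
```

The dropped term shrinks geometrically, so it only matters for very small n. The precision has to grow with both the number of digits wanted and the size of n.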


Answer

Python 2, 106 bytes

a,b=0,1
for c in bin(10**9):
 a,b=2*a*b-a*a,a*a+b*b
 if'1'==c:a,b=b,a+b
 while a>>3340:a/=10;b/=10
print a

Try it online!

No libraries, just integer arithmetic. Runs almost instantly.

The core is the divide-and-conquer identity:

f(2*n)   = 2*f(n)*f(n+1) - f(n)^2
f(2*n+1) = f(n)^2 + f(n+1)^2

This lets us update (a,b) = (f(n),f(n+1)) to double n -> 2*n. Since we want to reach n=10**9, this takes only about log_2(10**9)=30 iterations. We build n up to 10**9 by repeatedly doing n->2*n+c for each digit c of its binary expansion. When c==1, the doubled value is shifted up 2*n -> 2*n+1 with a one-step Fibonacci shift (a,b)=(b,a+b).

To keep the values a,b manageable, we store only their first 1006 digits by floor-dividing by 10 until they are under 2**3340 ~ 1e1006.
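
An ungolfed Python 3 sketch of the same doubling loop may make the steps easier to follow. The function name and the bits parameter are assumptions for illustration; the logic mirrors the golfed code above, with // replacing Python 2's integer /.

```python
def fib_prefix(n, bits=3340):
    """Return an integer whose leading digits match fib(n).

    (a, b) tracks (f(m), f(m+1)) while m is built up bit by bit to n.
    Both values are floor-divided by 10 together whenever a reaches
    2**bits, so only about 1006 leading digits are kept.
    """
    a, b = 0, 1                                  # (f(0), f(1))
    for c in bin(n)[2:]:                         # binary digits, most significant first
        a, b = 2*a*b - a*a, a*a + b*b            # m -> 2*m
        if c == "1":
            a, b = b, a + b                      # 2*m -> 2*m + 1
        while a >> bits:                         # keep a below 2**bits
            a //= 10
            b //= 10
    return a

print(str(fib_prefix(10**9))[:20])
```

For small n no truncation happens, so the result is exact. For n=10**9 only the leading digits are meaningful.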


Answer

x86 32-bit machine code (with Linux system calls), (106) 105 bytes

changelog: saved a byte in the fast version because an off-by-one constant doesn't change the result for Fib(1G).

Or 102 bytes for a version that is 18% slower on Skylake (using mov/sub/cmc instead of lea/cmp in the inner loop to generate carry-out and wrap at 10**9 instead of 2**32). Or 101 bytes for a ~5.3x slower version with a branch in the carry handling in the innermost loop. (I measured a 25.4% branch-mispredict rate!)

Or 104/101 bytes if a leading zero is allowed. (It takes 1 extra byte to hard-code skipping 1 digit of the output, which is what happens to be needed for Fib(10**9).)

Unfortunately, TIO's NASM mode seems to ignore -felf32 in the compiler flags. Here's a link anyway to my full source code, with all the mess of experimental ideas in comments.

This is a complete program. It prints the first 1000 digits of Fib(10**9), followed by some extra digits (the last few of which are wrong), followed by some garbage bytes (not including a newline). Most of the garbage is non-ASCII, so you may want to pipe it through cat -v. It doesn't break my terminal emulator (KDE konsole), though. The "garbage bytes" are storing Fib(999999999). I already had -1024 in a register, so it was cheaper to print 1024 bytes than the proper size.

I'm counting just the machine code (the size of the text segment of my static executable), not the fluff that makes it an ELF executable. (Very tiny ELF executables are possible, but I didn't want to bother with that.) It turned out to be shorter to use stack memory instead of BSS, so I can kind of justify not counting anything else in the binary, since I don't depend on any metadata. (Producing a stripped static binary the normal way makes a 340-byte ELF executable.)

You could make a function out of this code that you could call from C. It would cost a few bytes to save/restore the stack pointer (maybe in an MMX register) and some other overhead, but it would also save bytes by returning with the string in memory instead of making a write(1,buf,len) system call. I think golfing in machine code should earn me some slack here, since nobody else has even posted an answer in any language without native extended precision, but I think a function version of this should still be under 120 bytes without re-golfing the whole thing.


Algorithm:

Brute force a+=b; swap(a,b), truncating as needed to keep only the leading >= 1017 decimal digits. It runs in 1min13s on my computer (or 322.47 billion clock cycles +- 0.05%). (It could be a few % faster with a few extra bytes of code size, or drop to 62 % with a much larger code size from loop unrolling. No clever math, just doing the same work with less overhead.) It's based on @AndersKaseorg's Python implementation, which runs in 12min35s on my computer (4.4GHz Skylake i7-6700k). Neither version has any L1D cache misses, so my DDR4-2666 doesn't matter.
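
As a quick sanity check on those figures, the cycle count and the clock speed quoted above can be converted back into wall-clock time. This is only arithmetic on the numbers in the text; the variable names are mine.

```python
cycles = 322.47e9            # measured clock cycles for the asm version
clock_hz = 4.4e9             # 4.4GHz Skylake i7-6700k
python_seconds = 12 * 60 + 35    # 12min35s for the Python implementation

asm_seconds = cycles / clock_hz
minutes, rest = divmod(asm_seconds, 60)
speedup = python_seconds / asm_seconds

print(f"{int(minutes)}min{rest:.1f}s, about {speedup:.1f}x faster than the Python version")
```

The result agrees with the stated 1min13s, so the cycle count and the run time are consistent with each other.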

Unlike Python, I store the extended-precision numbers in a format that makes truncating decimal digits free. I store groups of 9 decimal digits per 32-bit integer, so a pointer offset discards the low 9 digits. This is effectively base 1 billion, which is a power of 10. (It's pure coincidence that this challenge needs the 1-billionth Fibonacci number, but it does save me a couple of bytes vs. two separate constants.)

Following GMP terminology, each 32-bit chunk of an extended-precision number is called a "limb". Carry-out while adding has to be generated manually with a compare against 1e9, but is then used normally as an input to the usual ADC instruction for the next limb. (I also have to manually wrap to the [0..999999999] range, rather than at 2^32 ~= 4.295e9. I do this branchlessly with lea + cmov, using the carry-out result from the compare.)
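
The limb addition described above can be sketched in Python as follows. This is an illustration only, not the machine code: the function name is an assumption, and limbs are stored least-significant first, as in the asm buffers.

```python
BASE = 10**9   # 9 decimal digits per 32-bit limb

def add_limbs(dst, src, carry_in=0):
    """dst += src in base 10**9, in place. Returns the carry-out of the top limb."""
    carry = carry_in
    for i in range(len(dst)):
        total = dst[i] + src[i] + carry          # what adc computes
        carry = 1 if total >= BASE else 0        # the manual compare against the base
        dst[i] = total - BASE if carry else total   # wrap into [0..999999999]
    return carry

a = [999999999, 0]      # the number 999999999
b = [1, 0]              # the number 1
carry_out = add_limbs(a, b)
print(a, carry_out)     # a is now [0, 1], i.e. 1 * 10**9
```

Since each limb is below 10**9, a sum of two limbs plus a carry is below 2**31, so it never overflows a 32-bit register before the wrap.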

When the last limb produces a non-zero carry-out, the next two iterations of the outer loop read from 1 limb higher than normal, but still write to the same place. This is like doing a memcpy(a, a+4, 114*4) to right-shift by 1 limb, but done as part of the next two addition loops. This happens every ~18 iterations.
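
A simplified Python sketch of the whole brute-force loop with truncation is shown below. It is an assumption-laden illustration: it shifts both buffers explicitly when the top limb carries out, instead of folding the shift into the next two additions as the asm does, and the function name and parameters are mine.

```python
BASE = 10**9

def fib_leading_digits(n, limbcount, digits):
    """Leading `digits` decimal digits of fib(n), keeping only `limbcount` limbs."""
    a = [0] * limbcount                  # fib(0), least-significant limb first
    b = [1] + [0] * (limbcount - 1)      # fib(1)
    for _ in range(n):
        carry = 0
        for i in range(limbcount):       # a += b, limb by limb
            total = a[i] + b[i] + carry
            carry = 1 if total >= BASE else 0
            a[i] = total - BASE if carry else total
        if carry:
            # Out of room: discard the lowest limb of BOTH numbers so they
            # stay aligned, and store the carry as the new top limb of a.
            a = a[1:] + [carry]
            b = b[1:] + [0]
        a, b = b, a                      # swap(a, b)
    value = 0
    for limb in reversed(a):             # a now holds (truncated) fib(n)
        value = value * BASE + limb
    return str(value)[:digits]

print(fib_leading_digits(5000, 8, 40))
```

Each truncation only disturbs the lowest limb that is kept, so the leading digits stay correct as long as there are enough spare limbs below them.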


Hacks for size-saving and performance:

  • The usual stuff like lea ebx, [eax-4 + 1] instead of mov ebx, 1, when I know that eax=4. And using loop in places where LOOP's slowness only has a tiny impact.

  • Truncate by 1 limb for free by offsetting the pointers that we read from, while still writing to the start of the buffer in the adc inner loop. We read from [edi+edx] and write to [edi]. So we can set edx=0 or 4 to get a read-write offset for the destination. We need to do this for 2 successive iterations, first offsetting both, then only offsetting the dst. We detect the 2nd case by looking at esp&4 before resetting the pointers to the front of the buffers (using &= -1024, because the buffers are aligned). See comments in the code.

  • The Linux process-startup environment (for a static executable) zeros most registers, and stack memory below esp/rsp is zeroed. My program takes advantage of this. In a callable-function version of this (where unallocated stack could be dirty), I could use BSS for zeroed memory (at the cost of maybe 4 more bytes to set up pointers). Zeroing edx would take 2 bytes. The x86-64 System V ABI doesn't guarantee either of these, but Linux's implementation of it does zero them (to avoid information leaks out of the kernel). In a dynamically linked process, /lib/ld.so runs before _start and does leave registers non-zero (and probably garbage in memory below the stack pointer).

  • I keep -1024 in ebx for use outside of loops. I use bl as a counter for inner loops, ending at zero (which is the low byte of -1024, thus restoring the constant for use outside the loop). Intel Haswell and later don't have partial-register merging penalties for low8 registers (and in fact don't even rename them separately), so there's a dependency on the full register, like on AMD (not a problem here). This would be horrible on Nehalem and earlier, though, which have partial-register stalls when merging. There are other places where I write partial regs and then read the full reg without xor-zeroing or a movzx, usually because I know that some previous code zeroed the upper bytes. Again, that's fine on AMD and Intel SnB-family, but slow on Intel pre-Sandybridge.

    I use 1024 as the number of bytes to write to stdout (sub edx, ebx), so my program prints some garbage bytes after the Fibonacci digits, because mov edx, 1000 costs more bytes.

  • (not used) adc ebx,ebx with EBX=0 to get EBX=CF, saving 1 byte vs. setc bl.

  • dec/jnz inside an adc loop preserves CF without causing a partial-flag stall when adc reads flags on Intel Sandybridge and later. It's bad on earlier CPUs, but AFAIK free on Skylake. Or at worst, an extra uop.

  • Use memory below esp as a giant red zone. Since this is a complete Linux program, I know I didn't install any signal handlers, and that nothing else will asynchronously clobber user-space stack memory. This may not be the case on other OSes.

  • Take advantage of the stack engine to save uop issue bandwidth by using pop eax (1 uop + occasional stack-sync uop) instead of lodsd (2 uops on Haswell/Skylake, 3 on IvB and earlier, according to Agner Fog's instruction tables). IIRC, this dropped the run time from about 83 seconds to 73. I could probably get the same speed from using a mov with an indexed addressing mode, like mov eax, [edi+ebp], where ebp holds the offset between the src and dst buffers. (It would make the code outside the inner loop more complex, having to negate the offset register as part of swapping src and dst for Fibonacci iterations.) See the "Performance" section below for more.

  • Start the sequence by giving the first iteration a carry-in (one byte stc), instead of storing a 1 in memory anywhere. Lots of other problem-specific stuff is documented in comments.

NASM listing (machine code + source), generated with nasm -felf32 fibonacci-1G.asm -l /dev/stdout | cut -b -28,$((28+12))- | sed 's/^/ /'. (Then I hand-removed some blocks of commented-out stuff, so the line numbering has gaps.) To strip out the leading columns so you can feed it into YASM or NASM, use cut -b 27- <fibonacci-1G.lst > fibonacci-1G.asm.

  1          machine      global _start
  2          code         _start:
  3 address

  4 00000000 B900CA9A3B       mov    ecx, 1000000000       ; Fib(ecx) loop counter
  5                       ;    lea    ebp, [ecx-1]          ;  base-1 in the base(pointer) register ;)
  6 00000005 89CD             mov    ebp, ecx    ; not wrapping on limb==1000000000 doesn't change the result.
  7                                              ; It's either self-correcting after the next add, or shifted out the bottom faster than Fib() grows.
  8
 42
 43                       ;    mov    esp, buf1
 44
 45                       ;    mov    esi, buf1   ; ungolfed: static buffers instead of the stack
 46                       ;    mov    edi, buf2

 47 00000007 BB00FCFFFF       mov    ebx, -1024
 48 0000000C 21DC             and    esp, ebx    ; alignment necessary for convenient pointer-reset
 49                       ;    sar    ebx, 1
 50 0000000E 01DC             add    esp, ebx     ; lea    edi, [esp + ebx].  Can't skip this: ASLR or large environment can put ESP near the bottom of a 1024-byte block to start with
 51 00000010 8D3C1C           lea    edi, [esp + ebx*1]
 52                           ;xchg   esp, edi   ; This is slightly faster.  IDK why.
 53
 54                           ; It's ok for EDI to be below ESP by multiple 4k pages.  On Linux, IIRC the main stack automatically extends up to ulimit -s, even if you haven't adjusted ESP.  (Earlier I used -4096 instead of -1024)
 55                           ; After an even number of swaps, EDI will be pointing to the lower-addressed buffer
 56                           ; This allows a small buffer size without having the string step on the number.
 57
 58                       ; registers that are zero at process startup, which we depend on:
 59                       ;    xor   edx, edx
 60                       ;;  we also depend on memory far below initial ESP being zeroed.
 61
 62 00000013 F9               stc    ; starting conditions: both buffers zeroed, but carry-in = 1
 63                       ; starting Fib(0,1)->0,1,1,2,3 vs. Fib(1,0)->1,0,1,1,2 starting "backwards" puts us 1 count behind
 66
 67                       ;;; register usage:
 68                       ;;; eax, esi: scratch for the adc inner loop, and outer loop
 69                       ;;; ebx: -1024.  Low byte is used as the inner-loop limb counter (ending at zero, restoring the low byte of -1024)
 70                       ;;; ecx: outer-loop Fibonacci iteration counter
 71                       ;;; edx: dst read-write offset (for "right shifting" to discard the least-significant limb)
 72                       ;;; edi: dst pointer
 73                       ;;; esp: src pointer
 74                       ;;; ebp: base-1 = 999999999.  Actually still happens to work with ebp=1000000000.
 75
 76                       .fibonacci:
 77                       limbcount equ 114             ; 112 = 1006 decimal digits / 9 digits per limb.  Not enough for 1000 correct digits, but 114 is.
 78                                                     ; 113 would be enough, but we depend on limbcount being even to avoid a sub
 79 00000014 B372             mov    bl, limbcount
 80                       .digits_add:
 81                           ;lodsd                       ; Skylake: 2 uops.  Or  pop rax  with rsp instead of rsi
 82                       ;    mov    eax, [esp]
 83                       ;    lea    esp, [esp+4]   ; adjust ESP without affecting CF.  Alternative, load relative to edi and negate an offset?  Or add esp,4 after adc before cmp
 84 00000016 58               pop    eax
 85 00000017 130417           adc    eax, [edi + edx*1]    ; read from a potentially-offset location (but still store to the front)
 86                        ;; jz .out   ;; Nope, a zero digit in the result doesn't mean the end!  (Although it might in base 10**9 for this problem)
 87
 88                       %if 0   ;; slower version
                          ;; could be even smaller (and 5.3x slower) with a branch on CF: 25% mispredict rate
 89                           mov  esi, eax
 90                           sub  eax, ebp  ; 1000000000 ; sets CF opposite what we need for next iteration
 91                           cmovc eax, esi
 92                           cmc                         ; 1 extra cycle of latency for the loop-carried dependency. 38,075Mc for 100M iters (with stosd).
 93                                                       ; not much worse: the 2c version bottlenecks on the front-end bottleneck
 94                       %else   ;; faster version
 95 0000001A 8DB0003665C4     lea    esi, [eax - 1000000000]
 96 00000020 39C5             cmp    ebp, eax                ; sets CF when (base-1) < eax.  i.e. when eax>=base
 97 00000022 0F42C6           cmovc  eax, esi                ; eax %= base, keeping it in the [0..base) range
 98                       %endif
 99
100                       %if 1
101 00000025 AB               stosd                          ; Skylake: 3 uops.  Like add + non-micro-fused store.  32,909Mcycles for 100M iters (with lea/cmp, not sub/cmc)
102                       %else
103                         mov    [edi], eax                ; 31,954Mcycles for 100M iters: faster than STOSD
104                         lea   edi, [edi+4]               ; Replacing this with ADD EDI,4 before the CMP is much slower: 35,083Mcycles for 100M iters
105                       %endif
106
107 00000026 FECB             dec    bl                      ; preserves CF.  The resulting partial-flag merge on ADC would be slow on pre-SnB CPUs
108 00000028 75EC             jnz .digits_add
109                           ; bl=0, ebx=-1024
110                           ; esi has its high bit set opposite to CF
111                       .end_innerloop:
112                           ;; after a non-zero carry-out (CF=1): right-shift both buffers by 1 limb, over the course of the next two iterations
113                           ;; next iteration with r8 = 1 and rsi+=4:  read offset from both, write normal.  ends with CF=0
114                           ;; following iter with r8 = 1 and rsi+=0:  read offset from dest, write normal.  ends with CF=0
115                           ;; following iter with r8 = 0 and rsi+=0:  i.e. back to normal, until next carry-out (possible a few iters later)
116
117                           ;; rdi = bufX + 4*limbcount
118                           ;; rsi = bufY + 4*limbcount + 4*carry_last_time
119
120                       ;    setc   [rdi]
123 0000002A 0F92C2           setc   dl
124 0000002D 8917             mov    [edi], edx ; store the carry-out into an extra limb beyond limbcount
125 0000002F C1E202           shl    edx, 2

139                           ; keep -1024 in ebx.  Using bl for the limb counter leaves bl zero here, so it's back to -1024 (or -2048 or whatever)
142 00000032 89E0             mov    eax, esp   ; test/setnz could work, but only saves a byte if we can somehow avoid the  or dl,al
143 00000034 2404             and    al, 4      ; only works if limbcount is even, otherwise we'd need to subtract limbcount first.

148 00000036 87FC             xchg   edi, esp   ; Fibonacci: dst and src swap
149 00000038 21DC             and    esp, ebx  ; -1024  ; revert to start of buffer, regardless of offset
150 0000003A 21DF             and    edi, ebx  ; -1024
151
152 0000003C 01D4             add    esp, edx             ; read offset in src

155                           ;; after adjusting src, so this only affects read-offset in the dst, not src.
156 0000003E 08C2             or    dl, al              ; also set r8d if we had a source offset last time, to handle the 2nd buffer
157                           ;; clears CF for next iter

165 00000040 E2D2             loop .fibonacci  ; Maybe 0.01% slower than dec/jnz overall

169                       to_string:

175                       stringdigits equ 9*limbcount  ; + 18
176                       ;;; edi and esp are pointing to the start of buffers, esp to the one most recently written
177                       ;;;  edi = esp +/- 2048, which is far enough away even in the worst case where they're growing towards each other
178                       ;;;  update: only 1024 apart, so this only works for even iteration-counts, to prevent overlap

180                           ; ecx = 0 from the end of the fib loop
181                           ;and   ebp, 10     ; works because the low byte of 999999999 is 0xff
182 00000042 8D690A           lea    ebp, [ecx+10]         ;mov    ebp, 10
183 00000045 B172             mov    cl, (stringdigits+8)/9
184                       .toascii:  ; slow but only used once, so we don't need a multiplicative inverse to speed up div by 10
185                           ;add   eax, [rsi]    ; eax has the carry from last limb:  0..3  (base 4 * 10**9)
186 00000047 58               pop    eax                  ; lodsd
187 00000048 B309             mov    bl, 9
188                       .toascii_digit:
189 0000004A 99               cdq                         ; edx=0 because eax can't have the high bit set
190 0000004B F7F5             div    ebp                  ; edx=remainder = low digit = 0..9.  eax/=10

197 0000004D 80C230           add    dl, '0'
198                                              ; stosb  ; clobber [rdi], then  inc rdi
199 00000050 4F               dec    edi         ; store digits in MSD-first printing order, working backwards from the end of the string
200 00000051 8817             mov    [edi], dl
201
202 00000053 FECB             dec    bl
203 00000055 75F3             jnz  .toascii_digit
204
205 00000057 E2EE             loop .toascii
206
207                           ; Upper bytes of eax=0 here.  Also AL I think, but that isn't useful
208                           ; ebx = -1024
209 00000059 29DA             sub  edx, ebx   ; edx = 1024 + 0..9 (leading digit).  +0 in the Fib(10**9) case
210
211 0000005B B004             mov   al, 4                 ; SYS_write
212 0000005D 8D58FD           lea  ebx, [eax-4 + 1]       ; fd=1
213                           ;mov  ecx, edi               ; buf
214 00000060 8D4F01           lea  ecx, [edi+1]           ; Hard-code for Fib(10**9), which has one leading zero in the highest limb.
215                       ;    shr  edx, 1 ;    for use with edx=2048
216                       ;    mov  edx, 100
217                       ;    mov byte  [ecx+edx-1], 0xa;'\n'  ; count+=1 for newline
218 00000063 CD80             int  0x80                   ; write(1, buf+1, 1024)
219
220 00000065 89D8             mov  eax, ebx ; SYS_exit=1
221 00000067 CD80             int  0x80     ; exit(ebx=1)
222
  # next byte is 0x69, so size = 0x69 = 105 bytes

There's probably room to golf some more bytes off of this, but I've already spent at least 12 hours on this over 2 days. I don't want to sacrifice speed, even though it's way more than fast enough and there is room to make it smaller in ways that cost speed. Part of my reason for posting is to show how fast I can make a brute-force asm version. If anyone wants to really go for minimum size but maybe 10x slower (e.g. 1 digit per byte), feel free to copy this as a starting point.

The resulting executable (from yasm -felf32 -Worphan-labels -gdwarf2 fibonacci-1G.asm && ld -melf_i386 -o fibonacci-1G fibonacci-1G.o) is 340B (stripped):

size fibonacci-1G
 text    data     bss     dec     hex filename
  105       0       0     105      69 fibonacci-1G

Performance

The inner adc loop is 10 fused-domain uops on Skylake (+1 stack-sync uop every ~128 bytes), so it can issue at one per ~2.5 cycles on Skylake with optimal front-end throughput (ignoring the stack-sync uops). The critical-path latency is 2 cycles, for the adc -> cmp -> next iteration's adc loop-carried dependency chain, so the bottleneck should be the front-end issue limit of ~2.5 cycles per iteration.

adc eax, [edi + edx] is 2 unfused-domain uops for the execution ports: load + ALU. It micro-fuses in the decoders (1 fused-domain uop), but un-laminates in the issue stage to 2 fused-domain uops because of the indexed addressing mode, even on Haswell/Skylake. I thought it would stay micro-fused, like add eax, [edi + edx] does, but maybe keeping indexed addressing modes micro-fused doesn't work for uops that already have 3 inputs (flags, memory, and destination). When I wrote it, I was thinking it wouldn't have a performance downside, but I was wrong. This way of handling truncation slows down the inner loop every time, whether edx is 0 or 4.

It would be faster to handle the read-write offset for the dst by offsetting edi and using edx to adjust the store. So adc eax, [edi] / ... / mov [edi+edx], eax / lea edi, [edi+4] instead of stosd. Haswell and later can keep an indexed store micro-fused. (Sandybridge/IvB would un-laminate it, too.)

On Intel Haswell and earlier, adc and cmovc are 2 uops each, with 2c latency. (adc eax, [edi+edx] is still un-laminated on Haswell, and issues as 3 fused-domain uops.) Broadwell and later allow 3-input uops for more than just FMA (Haswell), making adc and cmovc (and a couple of other things) single-uop instructions, like they have been on AMD for a long time. (This is one reason AMD has done well in extended-precision GMP benchmarks for a long time.) Anyway, Haswell's inner loop should be 12 uops (+1 stack-sync uop occasionally), with a front-end bottleneck of ~3c per iteration in the best case, ignoring stack-sync uops.

Using pop without a balancing push inside a loop means the loop can't run from the LSD (loop stream detector), and has to be re-read from the uop cache into the IDQ every time. If anything, that's a good thing on Skylake, since a 9 or 10 uop loop doesn't quite issue optimally at 4 uops every cycle. This is probably part of why replacing lodsd with pop helped so much. (The LSD can't lock down the uops because that wouldn't leave room to insert a stack-sync uop.) (I measured the above before getting the microcode update that disables the LSD on Skylake.)

I profiled it on Haswell, and found that it runs in 381.31 billion clock cycles (regardless of CPU frequency, since it only uses L1D cache, not memory). Front-end issue throughput was 3.72 fused-domain uops per clock, vs. 3.70 for Skylake. (But of course instructions per cycle was down to 2.42 from 2.87, because adc and cmov are 2 uops on Haswell.)

push to replace stosd probably wouldn't help as much, because adc [esp + edx] would trigger a stack-sync uop every time. And it would cost a byte for std so that lodsd goes the other direction. (mov [edi], eax / lea edi, [edi+4] to replace stosd is a win, going from 32,909Mcycles for 100M iters to 31,954Mcycles for 100M iters. It seems that stosd decodes as 3 uops, with the store-address/store-data uops not micro-fused, so push + stack-sync uops might still be faster than stosd.)

The actual performance of ~322.47 billion cycles for 1G iterations of 114 limbs works out to 2.824 cycles per iteration of the inner loop, for the fast 105B version on Skylake. (See the ocperf.py output below.) That's slower than I predicted from static analysis, but I was ignoring the overhead of the outer loop and any stack-sync uops.

Perf counters for branches and branch-misses show that the inner loop mispredicts once per outer loop (on the last iteration, when it's not taken). That also accounts for part of the extra time.


I could save code size by making the innermost loop have 3-cycle latency for the critical path, using mov esi,eax / sub eax,ebp / cmovc eax, esi / cmc (2+2+3+1 = 8B) instead of lea esi, [eax - 1000000000] / cmp ebp,eax / cmovc (6+2+3 = 11B). The cmov/stosd is off the critical path. (The increment-edi uop of stosd can run separately from the store, so each iteration forks off a short dependency chain.) It used to save another 1B by changing the ebp init instruction from lea ebp, [ecx-1] to mov ebp, ecx, but I discovered that having the wrong ebp didn't change the result. This would let a limb be exactly == 1000000000 instead of wrapping and producing a carry, but this error propagates more slowly than Fib() grows, so it happens not to change the leading 1k digits of the final result. Also, I think that error can correct itself when we're just adding, since there's room in a limb to hold it without overflow. Even 1G + 1G doesn't overflow a 32-bit integer, so it will eventually percolate upwards or be truncated away.

The 3c-latency version is 1 extra uop, so the front-end can issue it at one per 2.75c on Skylake, only slightly faster than the back-end can run it. (On Haswell, it will be 13 uops total, since it still uses adc and cmov.)

In practice it runs a factor of 1.18 slower on Skylake (3.34 cycles per limb), rather than the 3/2.5 = 1.2 that I predicted for replacing the front-end bottleneck with the latency bottleneck from just looking at the inner loop without stack-sync uops. Since the stack-sync uops only hurt the fast version (bottlenecked on the front-end instead of latency), it doesn't take much to explain it, e.g. 3/2.54 = 1.18.
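
The front-end arithmetic used in this section is simple enough to spell out. The sketch below assumes a 4-uop-per-cycle issue width, as implied by "4 uops every cycle" above, and uses only the uop counts and cycle figures quoted in the text. The function and variable names are mine.

```python
ISSUE_WIDTH = 4    # fused-domain uops issued per cycle (assumed from the text)

def front_end_cycles(uops_per_iteration):
    """Best-case cycles per inner-loop iteration if only issue width limits it."""
    return uops_per_iteration / ISSUE_WIDTH

skylake_fast = front_end_cycles(10)    # lea/cmp/cmovc version
skylake_small = front_end_cycles(11)   # mov/sub/cmovc/cmc version, 1 extra uop
haswell_fast = front_end_cycles(12)
haswell_small = front_end_cycles(13)

predicted_slowdown = 3 / skylake_fast          # 3c latency vs. 2.5c issue limit
measured_slowdown = 3.34 / 2.824               # measured cycles per limb
with_stack_sync = 3 / 2.54                     # fast version slightly above 2.5c

print(skylake_fast, skylake_small, haswell_fast, haswell_small)
print(round(predicted_slowdown, 2), round(measured_slowdown, 2), round(with_stack_sync, 2))
```

The measured ratio lands on 1.18 rather than 1.2, which fits the explanation that the fast version pays a little extra for stack-sync uops.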

Another factor is that the 3c-latency version may detect the mispredict on leaving the inner loop while the critical path is still executing (because the front-end can get ahead of the back-end, letting out-of-order execution run the loop-counter uops), so the effective mispredict penalty is lower. Losing those front-end cycles lets the back-end catch up.

If it wasn't for that, we could maybe speed up the 3c cmc version by using a branch in the outer loop instead of branchless handling of the carry_out -> edx and esp offsets. Branch prediction + speculative execution for a control dependency instead of a data dependency could let the next iteration start running the adc loop while uops from the previous inner loop were still in flight. In the branchless version, the load addresses in the inner loop have a data dependency on CF from the last adc of the last limb.

The 2c-latency inner-loop version bottlenecks on the front-end, so the back-end pretty much keeps up. If the outer-loop code was high-latency, the front-end could get ahead issuing uops from the next iteration of the inner loop. (But in this case the outer-loop stuff has plenty of ILP and no high-latency stuff, so the back-end doesn't have much catching up to do when it starts chewing through uops in the out-of-order scheduler as their inputs become ready.)

### Output from a profiled run
$ asm-link -m32 fibonacci-1G.asm && (size fibonacci-1G; echo disas fibonacci-1G) && ocperf.py stat -etask-clock,context-switches:u,cpu-migrations:u,page-faults:u,cycles,instructions,uops_issued.any,uops_executed.thread,uops_executed.stall_cycles -r4  ./fibonacci-1G
+ yasm -felf32 -Worphan-labels -gdwarf2 fibonacci-1G.asm
+ ld -melf_i386 -o fibonacci-1G fibonacci-1G.o
   text    data     bss     dec     hex filename
    106       0       0     106      6a fibonacci-1G
disas fibonacci-1G
perf stat -etask-clock,context-switches:u,cpu-migrations:u,page-faults:u,cycles,instructions,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/,cpu/event=0xb1,umask=0x1,inv=1,cmask=1,name=uops_executed_stall_cycles/ -r4 ./fibonacci-1G
79523178745546834678293851961971481892555421852343989134530399373432466861825193700509996261365567793324820357232224512262917144562756482594995306121113012554998796395160534597890187005674399468448430345998024199240437534019501148301072342650378414269803983873607842842319964573407827842007677609077777031831857446565362535115028517159633510239906992325954713226703655064824359665868860486271597169163514487885274274355081139091679639073803982428480339801102763705442642850327443647811984518254621305295296333398134831057713701281118511282471363114142083189838025269079177870948022177508596851163638833748474280367371478820799566888075091583722494514375193201625820020005307983098872612570282019075093705542329311070849768547158335856239104506794491200115647629256491445095319046849844170025120865040207790125013561778741996050855583171909053951344689194433130268248133632341904943755992625530254665288381226394336004838495350706477119867692795685487968552076848977417717843758594964253843558791057997424878788358402439890396,๏ฟฝX\๏ฟฝ;3๏ฟฝI;ro~.๏ฟฝ'๏ฟฝ๏ฟฝR!q๏ฟฝ๏ฟฝ%๏ฟฝ๏ฟฝX'B ๏ฟฝ๏ฟฝ      8w๏ฟฝ๏ฟฝโ–’วช๏ฟฝ
 ... repeated 3 more times, for the 3 more runs we're averaging over
  Note the trailing garbage after the trailing digits.

 Performance counter stats for './fibonacci-1G' (4 runs):

      73438.538349      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.05% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                 2      page-faults:u             #    0.000 K/sec                    ( +- 11.55% )
   322,467,902,120      cycles:u                  #    4.391 GHz                      ( +-  0.05% )
   924,000,029,608      instructions:u            #    2.87  insn per cycle           ( +-  0.00% )
 1,191,553,612,474      uops_issued_any:u         # 16225.181 M/sec                   ( +-  0.00% )
 1,173,953,974,712      uops_executed_thread:u    # 15985.530 M/sec                   ( +-  0.00% )
     6,011,337,533      uops_executed_stall_cycles:u #   81.855 M/sec                    ( +-  1.27% )

      73.436831004 seconds time elapsed                                          ( +-  0.05% )

( +- x %) is the standard deviation over the 4 runs for that count. It's interesting that it runs such a round number of instructions. That 924 billion is not a coincidence: I guess that the outer loop runs a total of 924 instructions.
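As a quick sanity check on that guess, here is a small Python 3 sketch. The counter values are copied from the perf output above; the variable names are mine, and the one-outer-iteration-per-Fibonacci-step count of 10^9 is an assumption based on the problem size.

```python
# Counter values copied from the perf stat output above (averages over 4 runs).
cycles = 322_467_902_120
instructions = 924_000_029_608

# Assumption: the outer loop runs once per Fibonacci step, i.e. 10**9 times.
outer_iterations = 10**9

insn_per_iteration = instructions / outer_iterations
ipc = instructions / cycles

print(f"instructions per outer iteration: {insn_per_iteration:.3f}")  # 924.000
print(f"instructions per cycle:           {ipc:.2f}")                 # 2.87
```

The leftover ~30k instructions would be startup and printing code, which is negligible next to the main loop.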

uops_issued is a fused-domain count (relevant for front-end issue bandwidth), while uops_executed is an unfused-domain count (the number of uops sent to execution ports). Micro-fusion packs 2 unfused-domain uops into one fused-domain uop, but mov-elimination means that some fused-domain uops don't need any execution port. See the linked question for more about counting uops and the fused vs. unfused domains. (Also see Agner Fog's instruction tables and uarch guide, and other useful links in the SO x86 tag wiki.)

From another run measuring different things: L1D cache misses are totally insignificant, as expected for reading/writing the same two 456B buffers. The inner-loop branch mispredicts once per outer loop (when it is not taken, to leave the loop). (The total time is higher because the computer wasn't totally idle. Probably the other logical core was active some of the time, and more time was spent in interrupts, since the user-space-measured frequency was farther below 4.400GHz. Or multiple cores were active more of the time, lowering the max turbo. I didn't track cpu_clk_unhalted.one_thread_active to see whether HT competition was an issue.)

     ### Another run of the same 105/106B "main" version to check other perf counters
      74510.119941      task-clock:u (msec)       #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                 2      page-faults:u             #    0.000 K/sec
   324,455,912,026      cycles:u                  #    4.355 GHz
   924,000,036,632      instructions:u            #    2.85  insn per cycle
   228,005,015,542      L1-dcache-loads:u         # 3069.535 M/sec
           277,081      L1-dcache-load-misses:u   #    0.00% of all L1-dcache hits
                 0      ld_blocks_partial_address_alias:u #    0.000 K/sec
   115,000,030,234      branches:u                # 1543.415 M/sec
     1,000,017,804      branch-misses:u           #    0.87% of all branches

My code may well run in fewer cycles on Ryzen, which can issue 5 uops per cycle (or 6 when some of them are 2-uop instructions, like AVX 256b stuff on Ryzen). I'm not sure what its front end would do with stosd, which is 3 uops on Ryzen (same as Intel). I think the other instructions in the inner loop have the same latency as on Skylake and are all single-uop (including adc eax, [edi+edx], which is an advantage over Skylake).


This could probably be significantly smaller, but maybe 9x slower, if I stored the numbers as 1 decimal digit per byte. Generating the carry-out with cmp and adjusting with cmov would work the same, but do 1/9th of the work per step. 2 decimal digits per byte (base 100, not 4-bit BCD with a slow DAA) would also work, and div r8 / add ax, 0x3030 turns a 0-99 byte into two ASCII digits in printing order. But 1 digit per byte doesn't need div at all, just looping and adding 0x30. If I stored the bytes in printing order, that would make the 2nd loop really simple.


Using 18 or 19 decimal digits per 64-bit integer (in 64-bit mode) would make it run about twice as fast, but would cost significant code size for all the REX prefixes and for 64-bit constants. 32-bit limbs in 64-bit mode prevent using pop eax instead of lodsd. I could still avoid REX prefixes by using esp as a non-pointer scratch register (swapping the usage of esi and esp), instead of using r8d as an 8th register.

If making a callable-function version, converting to 64-bit and using r8d may be cheaper than saving/restoring rsp. 64-bit mode also can't use the one-byte dec r32 encoding (since it's a REX prefix). But mostly I ended up using dec bl, which is 2 bytes. (I have a constant in the upper bytes of ebx and only use it outside of the inner loops; this works because the low byte of the constant is 0x00.)


High-performance version

For maximum performance (not code golf), you'd want to unroll the inner loop so it runs at most 22 iterations, which is a short enough taken/not-taken pattern for the branch predictors to do well. In my experiments, mov cl, 22 before a .inner: dec cl/jnz .inner loop has very few mispredicts (like 0.05%, far less than one per full run of the inner loop), but mov cl,23 mispredicts from 0.35 to 0.6 times per inner loop. 46 is particularly bad, mispredicting ~1.28 times per inner loop (128M times for 100M outer-loop iterations). 114 mispredicted exactly once per inner loop, the same as I found as part of the Fibonacci loop.

๋‚˜๋Š” ํ˜ธ๊ธฐ์‹ฌ์„ ๊ฐ€์ง€๊ณ  ๊ทธ๊ฒƒ์„ ์‹œ๋„ํ•˜์—ฌ ๋‚ด๋ถ€ ๋ฃจํ”„๋ฅผ 6์œผ๋กœ ํ’€์—ˆ์Šต๋‹ˆ๋‹ค %rep 6(114๋ฅผ ๊ท ๋“ฑํ•˜๊ฒŒ ๋‚˜๋ˆ„๊ธฐ ๋•Œ๋ฌธ). ๊ทธ๊ฒƒ์€ ๋Œ€๋ถ€๋ถ„ ๋ถ„๊ธฐ ๊ฒฐ์„์„ ์ œ๊ฑฐํ–ˆ์Šต๋‹ˆ๋‹ค. ๋‚˜๋Š” edx์Œ์ˆ˜๋ฅผ ๋งŒ๋“ค์–ด mov์ƒ์  ์˜ ์˜คํ”„์…‹์œผ๋กœ ์‚ฌ์šฉ ํ–ˆ์œผ๋ฏ€๋กœ adc eax,[edi]๋งˆ์ดํฌ๋กœ ์œตํ•ฉ ์ƒํƒœ๋ฅผ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (๊ทธ๋ž˜์„œ ํ”ผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค stosd). ๋ธ”๋ก์—์„œ lea์—…๋ฐ์ดํŠธ ediํ•˜๊ธฐ ์œ„ํ•ด ๋ฅผ ๊บผ๋‚ด์„œ %rep6 ์ƒ์  ๋‹น ํ•˜๋‚˜์˜ ํฌ์ธํ„ฐ ์—…๋ฐ์ดํŠธ ๋งŒ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

๋‚˜๋Š” ๊ทธ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค๊ณ  ์ƒ๊ฐํ•˜์ง€๋Š” ์•Š์ง€๋งŒ ์™ธ๋ถ€ ๋ฃจํ”„์˜ ๋ชจ๋“  ๋ถ€๋ถ„ ๋ ˆ์ง€์Šคํ„ฐ ๋‚ด์šฉ์„ ์ œ๊ฑฐํ–ˆ์Šต๋‹ˆ๋‹ค. ์™ธ๋ถ€ ๋ฃจํ”„ ๋์—์„œ CF๊ฐ€ ์ตœ์ข… ADC์— ์˜์กดํ•˜์ง€ ์•Š๋Š” ๋ฐ ์•ฝ๊ฐ„ ๋„์›€์ด๋˜์—ˆ์œผ๋ฏ€๋กœ ์ผ๋ถ€ ๋‚ด๋ถ€ ๋ฃจํ”„ Uops๋ฅผ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์™ธ๋ถ€ ๋ฃจํ”„ ์ฝ”๋“œ๋Š” ์•„๋งˆ๋„ 2 ๊ฐœ์˜ ๋ช…๋ น์œผ๋กœ neg edx๊ต์ฒด ํ•œ ํ›„ (์•„์ง 1 ๊ฐœ๊ฐ€ ์žˆ์—ˆ๊ธฐ ๋•Œ๋ฌธ์—) ๋งˆ์ง€๋ง‰์œผ๋กœ ์ˆ˜ํ–‰ ํ•œ ์ดํ›„๋กœ 8 ๋น„ํŠธ๋ฅผ ๋–จ์–ด ๋œจ๋ฆฌ๊ณ  ๋ށ ์ฒด์ธ์„ ๋‹ค์‹œ ์ •๋ ฌ ํ•œ ์ดํ›„ ๋กœ ์กฐ๊ธˆ ๋” ์ตœ์ ํ™” ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฌผ๊ฑด์„ ๋“ฑ๋กํ•˜์‹ญ์‹œ์˜ค.xchgmov

This is the NASM source of just the Fibonacci loop. It's a drop-in replacement for that section of the original version.

  ;;;; Main loop, optimized for performance, not code-size
%assign unrollfac 6
    mov    bl, limbcount/unrollfac  ; and at the end of the outer loop
    align 32
.fibonacci:
limbcount equ 114             ; 112 = 1006 decimal digits / 9 digits per limb.  Not enough for 1000 correct digits, but 114 is.
                              ; 113 would be enough, but we depend on limbcount being even to avoid a sub
;    align 8
.digits_add:

%assign i 0
%rep unrollfac
    ;lodsd                       ; Skylake: 2 uops.  Or  pop rax  with rsp instead of rsi
;    mov    eax, [esp]
;    lea    esp, [esp+4]   ; adjust ESP without affecting CF.  Alternative, load relative to edi and negate an offset?  Or add esp,4 after adc before cmp
    pop    eax
    adc    eax, [edi+i*4]    ; read from a potentially-offset location (but still store to the front)
 ;; jz .out   ;; Nope, a zero digit in the result doesn't mean the end!  (Although it might in base 10**9 for this problem)

    lea    esi, [eax - 1000000000]
    cmp    ebp, eax                ; sets CF when (base-1) < eax.  i.e. when eax>=base
    cmovc  eax, esi                ; eax %= base, keeping it in the [0..base) range
%if 0
    stosd
%else
  mov    [edi+i*4+edx], eax
%endif
%assign i i+1
%endrep
  lea   edi, [edi+4*unrollfac]

    dec    bl                      ; preserves CF.  The resulting partial-flag merge on ADC would be slow on pre-SnB CPUs
    jnz .digits_add
    ; bl=0, ebx=-1024
    ; esi has its high bit set opposite to CF
.end_innerloop:
    ;; after a non-zero carry-out (CF=1): right-shift both buffers by 1 limb, over the course of the next two iterations
    ;; next iteration with r8 = 1 and rsi+=4:  read offset from both, write normal.  ends with CF=0
    ;; following iter with r8 = 1 and rsi+=0:  read offset from dest, write normal.  ends with CF=0
    ;; following iter with r8 = 0 and rsi+=0:  i.e. back to normal, until next carry-out (possible a few iters later)

    ;; rdi = bufX + 4*limbcount
    ;; rsi = bufY + 4*limbcount + 4*carry_last_time

;    setc   [rdi]
;    mov    dl, dh               ; edx=0.  2c latency on SKL, but DH has been ready for a long time
;    adc    edx,edx    ; edx = CF.  1B shorter than setc dl, but requires edx=0 to start
    setc   al
    movzx  edx, al
    mov    [edi], edx ; store the carry-out into an extra limb beyond limbcount
    shl    edx, 2
    ;; Branching to handle the truncation would break the data-dependency (of pointers) on carry-out from this iteration
    ;;  and let the next iteration start, but we bottleneck on the front-end (9 uops)
    ;;  not the loop-carried dependency of the inner loop (2 cycles for adc->cmp -> flag input of adc next iter)
    ;; Since the pattern isn't perfectly regular, branch mispredicts would hurt us

    ; keep -1024 in ebx.  Using bl for the limb counter leaves bl zero here, so it's back to -1024 (or -2048 or whatever)
    mov    eax, esp
    and    esp, 4               ; only works if limbcount is even, otherwise we'd need to subtract limbcount first.

    and    edi, ebx  ; -1024    ; revert to start of buffer, regardless of offset
    add    edi, edx             ; read offset in next iter's src
    ;; maybe   or edi,edx / and edi, 4 | -1024?  Still 2 uops for the same work
    ;;  setc dil?

    ;; after adjusting src, so this only affects read-offset in the dst, not src.
    or     edx, esp             ; also set r8d if we had a source offset last time, to handle the 2nd buffer
    mov    esp, edi

;    xchg   edi, esp   ; Fibonacci: dst and src swap
    and    eax, ebx  ; -1024

    ;; mov    edi, eax
    ;; add    edi, edx
    lea    edi, [eax+edx]
    neg    edx            ; negated read-write offset used with store instead of load, so adc can micro-fuse

    mov    bl, limbcount/unrollfac
    ;; Last instruction must leave CF clear for next iter
;    loop .fibonacci  ; Maybe 0.01% slower than dec/jnz overall
;    dec ecx
    sub ecx, 1                  ; clear any flag dependencies.  No faster than dec, at least when CF doesn't depend on edx
    jnz .fibonacci
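To make the data flow of the loop above easier to follow, here is a rough Python 3 model of it. This is only a sketch under my own assumptions: the function names are hypothetical, limbs are stored little-endian in a list, and the "drop the lowest limb on carry-out" step is done immediately rather than via read offsets spread over the next two iterations as in the asm.

```python
BASE = 10**9  # 9 decimal digits per 32-bit limb, as in the asm version

def add_limbs(x, y):
    """Add two equal-length little-endian limb lists; return (limbs, carry_out)."""
    out, carry = [], 0
    for xi, yi in zip(x, y):
        s = xi + yi + carry                   # adc: add with incoming carry
        carry = 1 if s >= BASE else 0         # cmp: carry when the limb reached the base
        out.append(s - BASE if carry else s)  # cmovc: keep the limb in [0, BASE)
    return out, carry

def fib_leading_digits(n, limbs, digits):
    """Leading decimal digits of F(n) for n >= 1, keeping only `limbs` limbs."""
    a = [0] * limbs                # F(0)
    b = [1] + [0] * (limbs - 1)    # F(1)
    for _ in range(n):
        c, carry = add_limbs(a, b)
        if carry:
            # Carry out of the top limb: shift both numbers right by one limb
            # (truncating the lowest limb) and store the carry as the new top limb.
            b = b[1:] + [0]
            c = c[1:] + [1]
        a, b = b, c
    top = len(a) - 1
    while top > 0 and a[top] == 0:  # skip leading zero limbs
        top -= 1
    text = str(a[top]) + "".join("%09d" % limb for limb in reversed(a[:top]))
    return text[:digits]

print(fib_leading_digits(10, 20, 5))  # F(10) = 55
```

With only 20 limbs (180 digits) the low-order digits are lost to truncation, but the leading digits stay correct as long as there are enough guard limbs, which is the same reason the asm keeps 114 limbs for 1000 digits.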

Performance:

 Performance counter stats for './fibonacci-1G-performance' (3 runs):

      62280.632258      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.07% )
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                 3      page-faults:u             #    0.000 K/sec                    ( +- 12.50% )
   273,146,159,432      cycles                    #    4.386 GHz                      ( +-  0.07% )
   757,088,570,818      instructions              #    2.77  insn per cycle           ( +-  0.00% )
   740,135,435,806      uops_issued_any           # 11883.878 M/sec                   ( +-  0.00% )
   966,140,990,513      uops_executed_thread      # 15512.704 M/sec                   ( +-  0.00% )
    75,953,944,528      resource_stalls_any       # 1219.544 M/sec                    ( +-  0.23% )
       741,572,966      idq_uops_not_delivered_core #   11.907 M/sec                    ( +- 54.22% )

      62.279833889 seconds time elapsed                                          ( +-  0.07% )

That's for the same Fib(1G), producing the same output in 62.3 seconds instead of 73 seconds. (273.146G cycles vs. 322.467G. Since everything hits in L1 cache, core clock cycles are really all we need to look at.)
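The comparison can be spelled out with a few lines of Python 3. This is a sketch with my own variable names: the cycle counts come from the two perf outputs above, and the per-limb figures are averaged over the whole run (so they include outer-loop overhead), assuming 10^9 outer iterations of 114 limbs each.

```python
# Cycle counts copied from the two perf stat outputs above.
cycles_golfed = 322_467_902_120     # 105/106-byte "main" version
cycles_unrolled = 273_146_159_432   # unrolled high-performance version

# Assumptions: 10**9 outer-loop iterations, 114 limbs per iteration.
outer_iterations = 10**9
limbcount = 114

def cycles_per_limb(cycles):
    """Average cycles per limb addition, including outer-loop overhead."""
    return cycles / (outer_iterations * limbcount)

speedup = cycles_golfed / cycles_unrolled

print(f"golfed:   {cycles_per_limb(cycles_golfed):.2f} cycles/limb")    # 2.83
print(f"unrolled: {cycles_per_limb(cycles_unrolled):.2f} cycles/limb")  # 2.40
print(f"speedup:  {speedup:.2f}x")                                      # 1.18
```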

Note the much lower total uops_issued count, well below the uops_executed count. That means many of them were micro-fused: 1 uop in the fused domain (issue/ROB), but 2 uops in the unfused domain (scheduler / execution units). It also means that few were eliminated in the issue/rename stage (like mov register copying or xor-zeroing, which need to issue but don't need an execution unit). Eliminated uops would unbalance the counts the other way.

branch-misses is down to ~400k, from 1G, so unrolling worked. resource_stalls.any is significant now, which means the front end is no longer the bottleneck: instead the back end is getting behind and limiting the front end. idq_uops_not_delivered.core only counts cycles where the front end didn't deliver uops but the back end wasn't stalled. That's nice and low, indicating few front-end bottlenecks.


Fun fact: the Python version spends more than half its time dividing by 10 rather than adding. (Replacing a/=10 with a>>=64 speeds it up by more than a factor of 2, but changes the result because binary truncation != decimal truncation.)

My asm version is of course optimized specifically for this problem size, with the loop iteration counts hard-coded. Even shifting an arbitrary-precision number will copy it, but my version can just read from an offset for the next two iterations to skip even that.

I profiled the Python version (64-bit python2.7 on Arch Linux):

ocperf.py stat -etask-clock,context-switches:u,cpu-migrations:u,page-faults:u,cycles,instructions,uops_issued.any,uops_executed.thread,arith.divider_active,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses python2.7 ./fibonacci-1G.anders-brute-force.py
795231787455468346782938519619714818925554218523439891345303993734324668618251937005099962613655677933248203572322245122629171445627564825949953061211130125549987963951605345978901870056743994684484303459980241992404375340195011483010723426503784142698039838736078428423199645734078278420076776090777770318318574465653625351150285171596335102399069923259547132267036550648243596658688604862715971691635144878852742743550811390916796390738039824284803398011027637054426428503274436478119845182546213052952963333981348310577137012811185112824713631141420831898380252690791778709480221775085968511636388337484742803673714788207995668880750915837224945143751932016258200200053079830988726125702820190750937055423293110708497685471583358562391045067944912001156476292564914450953190468498441700251208650402077901250135617787419960508555831719090539513446891944331302682481336323419049437559926255302546652883812263943360048384953507064771198676927956854879685520768489774177178437585949642538435587910579974100118580

 Performance counter stats for 'python2.7 ./fibonacci-1G.anders-brute-force.py':

     755380.697069      task-clock:u (msec)       #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               793      page-faults:u             #    0.001 K/sec
 3,314,554,673,632      cycles:u                  #    4.388 GHz                      (55.56%)
 4,850,161,993,949      instructions:u            #    1.46  insn per cycle           (66.67%)
 6,741,894,323,711      uops_issued_any:u         # 8925.161 M/sec                    (66.67%)
 7,052,005,073,018      uops_executed_thread:u    # 9335.697 M/sec                    (66.67%)
   425,094,740,110      arith_divider_active:u    #  562.756 M/sec                    (66.67%)
   807,102,521,665      branches:u                # 1068.471 M/sec                    (66.67%)
     4,460,765,466      branch-misses:u           #    0.55% of all branches          (44.44%)
 1,317,454,116,902      L1-dcache-loads:u         # 1744.093 M/sec                    (44.44%)
        36,822,513      L1-dcache-load-misses:u   #    0.00% of all L1-dcache hits    (44.44%)

     755.355560032 seconds time elapsed

The numbers in (parens) are how much of the time that perf counter was being sampled. When looking at more counters than the HW supports, perf rotates between different counters and extrapolates. That's totally fine for a long run of the same task.

If I ran perf after setting the sysctl kernel.perf_event_paranoid = 0 (or ran perf as root), it would measure 4.400GHz. cycles:u doesn't count time spent in interrupts (or system calls), only user-space cycles. My desktop was almost totally idle, but this is typical.


Answer

Haskell, 83 61 bytes

p(a,b)(c,d)=(a*d+b*c-a*c,a*c+b*d)
t g=g.g.g
t(t$t=<<t.p)(1,1)

Outputs (F_1000000000, F_1000000001). On my laptop, it correctly prints the left paren and the first 1000 digits within 133 seconds, using 1.35 GiB of memory.

How it works

The Fibonacci recurrence can be solved using matrix exponentiation:

$$\begin{pmatrix} F_{i-1} & F_i \\ F_i & F_{i+1} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^i,$$

from which we derive these identities:

$$\begin{pmatrix} F_{i+j-1} & F_{i+j} \\ F_{i+j} & F_{i+j+1} \end{pmatrix} = \begin{pmatrix} F_{i-1} & F_i \\ F_i & F_{i+1} \end{pmatrix} \cdot \begin{pmatrix} F_{j-1} & F_j \\ F_j & F_{j+1} \end{pmatrix},$$
$$F_{i+j} = F_{i+1}F_{j+1} - F_{i-1}F_{j-1} = F_{i+1}F_{j+1} - (F_{i+1} - F_i)(F_{j+1} - F_j),$$
$$F_{i+j+1} = F_i F_j + F_{i+1}F_{j+1}.$$

The p function computes (F_{i+j}, F_{i+j+1}) given (F_i, F_{i+1}) and (F_j, F_{j+1}). Writing f n for (F_n, F_{n+1}), we have p (f i) (f j) = f (i + j).
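The identity p (f i) (f j) = f (i + j) is easy to spot-check. The following is a Python 3 transcription of p, offered as a sketch: fib_pair is a helper I added for the check and is not part of the Haskell answer.

```python
def fib_pair(n):
    """Return (F_n, F_{n+1}) by simple iteration, with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a, b

def p(x, y):
    """Transcription of the Haskell p: combine f i and f j into f (i + j)."""
    a, b = x
    c, d = y
    return (a * d + b * c - a * c, a * c + b * d)

# Spot-check p (f i) (f j) == f (i + j) for small indices.
for i in range(8):
    for j in range(8):
        assert p(fib_pair(i), fib_pair(j)) == fib_pair(i + j)

print(p(fib_pair(3), fib_pair(4)))  # (13, 21), i.e. (F_7, F_8)
```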

Then,

(t=<<t.p) (f i)
= t ((t.p) (f i)) (f i)
= t (p (f i).p (f i).p (f i)) (f i)
= (p (f i).p (f i).p (f i).p (f i).p (f i).p (f i).p (f i).p (f i).p (f i)) (f i)
= f (10 * i),

(t$t=<<t.p) (f i)
= ((t=<<t.p).(t=<<t.p).(t=<<t.p)) (f i)
= f (10^3 * i),

t(t$t=<<t.p) (f i)
= ((t$t=<<t.p).(t$t=<<t.p).(t$t=<<t.p)) (f i)
= f (10^9 * i),

and we plug in f 1 = (1,1).
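The same chain of combinators can be modelled in Python 3 to check the 10x and 1000x steps at a small scale. This is a sketch only: times10 and times1000 are names I made up for (t=<<t.p) and (t$t=<<t.p), and fib_pair is a helper used purely for verification.

```python
def fib_pair(n):
    """Reference (F_n, F_{n+1}) by simple iteration."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a, b

def p(x, y):
    """p (f i) (f j) = f (i + j), as in the Haskell answer."""
    a, b = x
    c, d = y
    return (a * d + b * c - a * c, a * c + b * d)

def t(g):
    """t g = g.g.g: apply g three times."""
    return lambda v: g(g(g(v)))

def times10(x):
    """(t=<<t.p) x = t (t (p x)) x: apply 'add i' nine times to f i, giving f (10*i)."""
    add_i = lambda y: p(x, y)
    return t(t(add_i))(x)

times1000 = t(times10)  # (t$t=<<t.p): three rounds of times10

f1 = (1, 1)
print(times10(f1))                           # (55, 89), i.e. f 10
print(times1000(f1) == fib_pair(1000))       # True
# t(times1000)(f1) would be f (10**9); that is the full-size computation
# and is deliberately not run here.
```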


Answer

Mathematica, 15 34 bytes

Evaluating Fibonacci itself takes some time on my computer, and the front end needs a further 95 (+/- 5) s to display the result.

Fibonacci@1*^9&

First 1000 digits (34 bytes): ⌊Fibonacci@1*^9/1*^208986640⌋&

Longer but faster: ToString@Fibonacci@1*^9~StringTake~1000&
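The exponent 208986640 in the 34-byte version is the digit count of F(10^9) minus 1000. A Python 3 sketch of where it comes from, using Binet's approximation F_n ≈ φ^n / √5 (the function name is mine, and the formula is only used for n >= 2 and relies on the value not sitting right next to a power of 10):

```python
import math

def fib_digit_count(n):
    """Decimal digit count of F_n for n >= 2, from log10(phi**n / sqrt(5))."""
    phi = (1 + math.sqrt(5)) / 2
    return math.floor(n * math.log10(phi) - 0.5 * math.log10(5)) + 1

digits = fib_digit_count(10**9)
print(digits)          # 208987640
print(digits - 1000)   # 208986640, the power of ten to divide by
```

Dividing by 10^208986640 and taking the floor therefore leaves exactly the first 1000 digits.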


Answer

Python 2, 70 bytes

a,b=0,1
i=1e9
while i:
 a,b=b,a+b;i-=1
 if a>>3360:a/=10;b/=10
print a

This ran in 18 minutes and 31 seconds on my laptop, producing the correct 1000 digits followed by 74100118580 (the correct following digits are 74248787892).


Answer

Haskell, 78 bytes

(a%b)n|n<1=b|odd n=b%(a+b)$n-1|r<-2*a*b-a*a=r%(a*a+b*b)$div n 2
1%0$2143923439

Try it online!

Took 48 seconds on TIO. This uses the same recursive formula as my Python answer, but without truncating.

The constant 2143923439 is 10**9-1 reversed in binary, with an extra 1 at the end. Iterating through its binary digits in reverse simulates iterating through the binary digits of 10**9-1. It seems shorter to hardcode this than to compute it.
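Both the constant and the bit-walking trick can be checked with a short Python 3 sketch. The function names encode and fib_from_constant are mine; the loop mirrors the three guards of the Haskell definition (n<1, odd n, otherwise), and the full-size constant is only constructed here, not evaluated.

```python
def encode(m):
    """Reverse the binary digits of an odd m and append an extra 1 bit."""
    assert m % 2 == 1, "m must be odd so the reversed string keeps its length"
    return int(bin(m)[2:][::-1] + "1", 2)

def fib_from_constant(n):
    """Mirror of the Haskell (a%b)n, started as 1%0: returns F(m+1) for n = encode(m)."""
    a, b = 1, 0                      # (F_{-1}, F_0)
    while n >= 1:
        if n % 2:                    # odd n: advance the index by one
            a, b = b, a + b
            n -= 1
        else:                        # even n: double the index
            a, b = 2 * a * b - a * a, a * a + b * b
            n //= 2
    return b

print(encode(10**9 - 1))                 # 2143923439
print(fib_from_constant(encode(99)))     # F(100) = 354224848179261915075
```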