Why is 2 * x * x faster than 2 * ( x * x ) in Python 3.x, for integers?











up vote
20
down vote

favorite
8












The following Python 3.x integer multiplication takes on average between 1.66s and 1.77s:



import time
start_time = time.time()
num = 0
for x in range(0, 10000000):
# num += 2 * (x * x)
num += 2 * x * x
print("--- %s seconds ---" % (time.time() - start_time))


if I replace 2 * x * x with 2 *(x * x), it takes between 2.04 and 2.25. How come?



On the other hand it is the opposite in Java: 2 * (x * x) is faster in Java. Java test link: Why is 2 * (i * i) faster than 2 * i * i in Java?



I ran each version of the program 10 times, here are the results.



   2 * x * x        |   2 * (x * x)
---------------------------------------
1.7717654705047607 | 2.0789272785186768
1.735931396484375 | 2.1166207790374756
1.7093875408172607 | 2.024367570877075
1.7004504203796387 | 2.047525405883789
1.6676218509674072 | 2.254328966140747
1.699510097503662 | 2.0949244499206543
1.6889283657073975 | 2.0841963291168213
1.7243537902832031 | 2.1290600299835205
1.712965488433838 | 2.1942825317382812
1.7622807025909424 | 2.1200053691864014









share|improve this question









New contributor




Waqas Gondal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
















  • 11




    Little hint: Use the timeit module for better statistics
    – user8408080
    13 hours ago






  • 3




    Nearly same question has been asked for java stackoverflow.com/questions/53452713/…
    – quant
    12 hours ago






  • 14




    @quant I apologise if I appear a bit too direct; I believe you did not read the question properly because the link you provided is in the question itself.
    – Ely
    12 hours ago










  • For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
    – smci
    1 hour ago










  • Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
    – smci
    46 mins ago















up vote
20
down vote

favorite
8












The following Python 3.x integer multiplication takes on average between 1.66s and 1.77s:



import time
start_time = time.time()
num = 0
for x in range(0, 10000000):
# num += 2 * (x * x)
num += 2 * x * x
print("--- %s seconds ---" % (time.time() - start_time))


if I replace 2 * x * x with 2 *(x * x), it takes between 2.04 and 2.25. How come?



On the other hand it is the opposite in Java: 2 * (x * x) is faster in Java. Java test link: Why is 2 * (i * i) faster than 2 * i * i in Java?



I ran each version of the program 10 times, here are the results.



   2 * x * x        |   2 * (x * x)
---------------------------------------
1.7717654705047607 | 2.0789272785186768
1.735931396484375 | 2.1166207790374756
1.7093875408172607 | 2.024367570877075
1.7004504203796387 | 2.047525405883789
1.6676218509674072 | 2.254328966140747
1.699510097503662 | 2.0949244499206543
1.6889283657073975 | 2.0841963291168213
1.7243537902832031 | 2.1290600299835205
1.712965488433838 | 2.1942825317382812
1.7622807025909424 | 2.1200053691864014









share|improve this question









New contributor




Waqas Gondal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
















  • 11




    Little hint: Use the timeit module for better statistics
    – user8408080
    13 hours ago






  • 3




    Nearly same question has been asked for java stackoverflow.com/questions/53452713/…
    – quant
    12 hours ago






  • 14




    @quant I apologise if I appear a bit too direct; I believe you did not read the question properly because the link you provided is in the question itself.
    – Ely
    12 hours ago










  • For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
    – smci
    1 hour ago










  • Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
    – smci
    46 mins ago













up vote
20
down vote

favorite
8









up vote
20
down vote

favorite
8






8





The following Python 3.x integer multiplication takes on average between 1.66s and 1.77s:



import time
start_time = time.time()
num = 0
for x in range(0, 10000000):
# num += 2 * (x * x)
num += 2 * x * x
print("--- %s seconds ---" % (time.time() - start_time))


if I replace 2 * x * x with 2 *(x * x), it takes between 2.04 and 2.25. How come?



On the other hand it is the opposite in Java: 2 * (x * x) is faster in Java. Java test link: Why is 2 * (i * i) faster than 2 * i * i in Java?



I ran each version of the program 10 times, here are the results.



   2 * x * x        |   2 * (x * x)
---------------------------------------
1.7717654705047607 | 2.0789272785186768
1.735931396484375 | 2.1166207790374756
1.7093875408172607 | 2.024367570877075
1.7004504203796387 | 2.047525405883789
1.6676218509674072 | 2.254328966140747
1.699510097503662 | 2.0949244499206543
1.6889283657073975 | 2.0841963291168213
1.7243537902832031 | 2.1290600299835205
1.712965488433838 | 2.1942825317382812
1.7622807025909424 | 2.1200053691864014









share|improve this question









New contributor




Waqas Gondal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











The following Python 3.x integer multiplication takes on average between 1.66s and 1.77s:



import time
start_time = time.time()
num = 0
for x in range(0, 10000000):
# num += 2 * (x * x)
num += 2 * x * x
print("--- %s seconds ---" % (time.time() - start_time))


if I replace 2 * x * x with 2 *(x * x), it takes between 2.04 and 2.25. How come?



On the other hand it is the opposite in Java: 2 * (x * x) is faster in Java. Java test link: Why is 2 * (i * i) faster than 2 * i * i in Java?



I ran each version of the program 10 times, here are the results.



   2 * x * x        |   2 * (x * x)
---------------------------------------
1.7717654705047607 | 2.0789272785186768
1.735931396484375 | 2.1166207790374756
1.7093875408172607 | 2.024367570877075
1.7004504203796387 | 2.047525405883789
1.6676218509674072 | 2.254328966140747
1.699510097503662 | 2.0949244499206543
1.6889283657073975 | 2.0841963291168213
1.7243537902832031 | 2.1290600299835205
1.712965488433838 | 2.1942825317382812
1.7622807025909424 | 2.1200053691864014






python python-3.x performance benchmarking integer-arithmetic






share|improve this question









New contributor




Waqas Gondal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|improve this question









New contributor




Waqas Gondal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this question




share|improve this question








edited 1 hour ago









smci

14.4k672104




14.4k672104






New contributor




Waqas Gondal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked 13 hours ago









Waqas Gondal

13810




13810




New contributor




Waqas Gondal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





Waqas Gondal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






Waqas Gondal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.








  • 11




    Little hint: Use the timeit module for better statistics
    – user8408080
    13 hours ago






  • 3




    Nearly same question has been asked for java stackoverflow.com/questions/53452713/…
    – quant
    12 hours ago






  • 14




    @quant I apologise if I appear a bit too direct; I believe you did not read the question properly because the link you provided is in the question itself.
    – Ely
    12 hours ago










  • For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
    – smci
    1 hour ago










  • Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
    – smci
    46 mins ago














  • 11




    Little hint: Use the timeit module for better statistics
    – user8408080
    13 hours ago






  • 3




    Nearly same question has been asked for java stackoverflow.com/questions/53452713/…
    – quant
    12 hours ago






  • 14




    @quant I apologise if I appear a bit too direct; I believe you did not read the question properly because the link you provided is in the question itself.
    – Ely
    12 hours ago










  • For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
    – smci
    1 hour ago










  • Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
    – smci
    46 mins ago








11




11




Little hint: Use the timeit module for better statistics
– user8408080
13 hours ago




Little hint: Use the timeit module for better statistics
– user8408080
13 hours ago




3




3




Nearly same question has been asked for java stackoverflow.com/questions/53452713/…
– quant
12 hours ago




Nearly same question has been asked for java stackoverflow.com/questions/53452713/…
– quant
12 hours ago




14




14




@quant I apologise if I appear a bit too direct; I believe you did not read the question properly because the link you provided is in the question itself.
– Ely
12 hours ago




@quant I apologise if I appear a bit too direct; I believe you did not read the question properly because the link you provided is in the question itself.
– Ely
12 hours ago












For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
– smci
1 hour ago




For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
– smci
1 hour ago












Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
– smci
46 mins ago




Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
– smci
46 mins ago












5 Answers
5






active

oldest

votes

















up vote
11
down vote













First of all, note that we don't see the same thing in Python 2.x:



>>> timeit("for i in range(1000): 2*i*i")
51.00784397125244
>>> timeit("for i in range(1000): 2*(i*i)")
50.48330092430115


So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.



For small enough integers (including the ones we're considering here), CPython actually just uses the O(N2) grade-school digit by digit multiplication algorithm (for larger integers it switches to the Karatsuba algorithm). You can see this yourself in the source.



The number of digits in x*x is roughly twice that of 2*x or x (since log(x2) = 2 log(x)), so the square of the number of digits in x*x (recall that grade-school multiplication is quadratic) is 4 times that of 2*x or x -- so it makes sense that 2*(x*x) is the slower variant.



Note that a "digit" in this context is not a base-10 digit, but a 30-bit value (which are treated as single digits in CPython's implementation). Hence, for x = 10000000, 2*x*x only requires single-digit multiplications whereas 2*(x*x) requires a 2-digit multiplication (since x*x has 2 30-bit digits).



Here's a direct way to see this (Python 3):



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
5.796971936999967
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
4.3559221399999615


Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
3.0912468433380127
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
3.1120400428771973


(One interesting note: If you look at the source, you'll see that the algorithm actually has a special case for squaring numbers (which we're doing here), but even still this is not enough to overcome the fact that 2*(x*x) just requires processing more digits.)






share|improve this answer























  • Is Karatsuba algorithm slower than digit multiplication algorithm?
    – Banghua Zhao
    12 hours ago






  • 1




    @BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
    – arshajii
    12 hours ago






  • 2




    source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
    – B. M.
    8 hours ago






  • 2




    In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
    – Peter Cordes
    8 hours ago








  • 2




    The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
    – user2357112
    6 hours ago


















up vote
6
down vote













Python intern representation of integers is special, it uses slots of 30 bits :



In [6]: sys.getsizeof(2**30-1)
Out[6]: 28 # one slot + heading

In [7]: sys.getsizeof(2**30)
Out[7]: 32 # two slots


So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.



For a human who want to calculate 2*4*4, two ways :




  • (2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

  • 2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.


Python have the same problem. If x is a number such than 2x < B < x² , let x² = aB+b , with a,b <B. is stored in 2 slots, which I note (a|b). Computations leads to (without managing carries here):



   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)
(2*x)*x => (2x)*x =>(2a|2b)


in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.






share|improve this answer



















  • 3




    Where do a, b and X come from, what do they represent and how do they relate to the value of x?
    – Bergi
    10 hours ago










  • @Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
    – user2357112
    6 hours ago




















up vote
3
down vote













If your benchmark is right (didn't check), it may come from the fact that Python integers may be two different things : native integers when they are small (with a quick computation), and big integers when they increase in size (slower computation). The first syntax keeps the size smaller after the first operation while the second syntax may lead to two operations involving big integers.






share|improve this answer



















  • 11




    Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
    – John Coleman
    13 hours ago












  • python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
    – Patrick Artner
    13 hours ago








  • 4




    I still measure a 5% difference using 2 as a value for x.
    – Vincent
    13 hours ago






  • 2




    The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
    – Ely
    12 hours ago


















up vote
2
down vote













From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:



Relevant part of 2 * x * x:



7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 BINARY_MULTIPLY
36 LOAD_FAST 2 (x)
38 BINARY_MULTIPLY
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24


Relevant part of 2 * (x * x):



  7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 LOAD_FAST 2 (x)
36 BINARY_MULTIPLY <=== 1st multiply x*x in a temp value
38 BINARY_MULTIPLY <=== then multiply result with 2
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24





share|improve this answer





















  • If this was the case we'd see the same effect in Python 2, but we don't.
    – arshajii
    11 hours ago






  • 1




    it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
    – Sam Mason
    11 hours ago










  • @arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
    – Eric Fortin
    11 hours ago










  • @SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
    – Eric Fortin
    11 hours ago










  • CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
    – Sam Mason
    11 hours ago


















up vote
-6
down vote













I think is related to how the compiler or interpreter designed, python and java uses the same philosophy, their compilers takes source code and convert it a byte code (.pyc in python, .class in java) then the VM uses the JIT Model to execute generated codes.



In parsing phase the compiler produces an AST used in semantic analysis.



The exact details of how this happens can vary wildly between different languages and different interpreters.






share|improve this answer





















  • How is this relevant to the question?
    – Solomon Ucko
    7 hours ago










  • CPython, the most widely-used Python interpreter, does not JIT-compile to native machine code. It's purely interpreted. This is one reason that loops over array elements, or other low-level code in python is very slow compared to fully-compiled languages (either JIT or ahead-of-time).
    – Peter Cordes
    44 mins ago











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});






Waqas Gondal is a new contributor. Be nice, and check out our Code of Conduct.










draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53570864%2fwhy-is-2-x-x-faster-than-2-x-x-in-python-3-x-for-integers%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























5 Answers
5






active

oldest

votes








5 Answers
5






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
11
down vote













First of all, note that we don't see the same thing in Python 2.x:



>>> timeit("for i in range(1000): 2*i*i")
51.00784397125244
>>> timeit("for i in range(1000): 2*(i*i)")
50.48330092430115


So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.



For small enough integers (including the ones we're considering here), CPython actually just uses the O(N2) grade-school digit by digit multiplication algorithm (for larger integers it switches to the Karatsuba algorithm). You can see this yourself in the source.



The number of digits in x*x is roughly twice that of 2*x or x (since log(x2) = 2 log(x)), so the square of the number of digits in x*x (recall that grade-school multiplication is quadratic) is 4 times that of 2*x or x -- so it makes sense that 2*(x*x) is the slower variant.



Note that a "digit" in this context is not a base-10 digit, but a 30-bit value (which are treated as single digits in CPython's implementation). Hence, for x = 10000000, 2*x*x only requires single-digit multiplications whereas 2*(x*x) requires a 2-digit multiplication (since x*x has 2 30-bit digits).



Here's a direct way to see this (Python 3):



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
5.796971936999967
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
4.3559221399999615


Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
3.0912468433380127
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
3.1120400428771973


(One interesting note: If you look at the source, you'll see that the algorithm actually has a special case for squaring numbers (which we're doing here), but even still this is not enough to overcome the fact that 2*(x*x) just requires processing more digits.)






share|improve this answer























  • Is Karatsuba algorithm slower than digit multiplication algorithm?
    – Banghua Zhao
    12 hours ago






  • 1




    @BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
    – arshajii
    12 hours ago






  • 2




    source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
    – B. M.
    8 hours ago






  • 2




    In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
    – Peter Cordes
    8 hours ago








  • 2




    The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
    – user2357112
    6 hours ago















up vote
11
down vote













First of all, note that we don't see the same thing in Python 2.x:



>>> timeit("for i in range(1000): 2*i*i")
51.00784397125244
>>> timeit("for i in range(1000): 2*(i*i)")
50.48330092430115


So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.



For small enough integers (including the ones we're considering here), CPython actually just uses the O(N2) grade-school digit by digit multiplication algorithm (for larger integers it switches to the Karatsuba algorithm). You can see this yourself in the source.



The number of digits in x*x is roughly twice that of 2*x or x (since log(x2) = 2 log(x)), so the square of the number of digits in x*x (recall that grade-school multiplication is quadratic) is 4 times that of 2*x or x -- so it makes sense that 2*(x*x) is the slower variant.



Note that a "digit" in this context is not a base-10 digit, but a 30-bit value (which are treated as single digits in CPython's implementation). Hence, for x = 10000000, 2*x*x only requires single-digit multiplications whereas 2*(x*x) requires a 2-digit multiplication (since x*x has 2 30-bit digits).



Here's a direct way to see this (Python 3):



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
5.796971936999967
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
4.3559221399999615


Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
3.0912468433380127
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
3.1120400428771973


(One interesting note: If you look at the source, you'll see that the algorithm actually has a special case for squaring numbers (which we're doing here), but even still this is not enough to overcome the fact that 2*(x*x) just requires processing more digits.)






share|improve this answer























  • Is Karatsuba algorithm slower than digit multiplication algorithm?
    – Banghua Zhao
    12 hours ago






  • 1




    @BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
    – arshajii
    12 hours ago






  • 2




    source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
    – B. M.
    8 hours ago






  • 2




    In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
    – Peter Cordes
    8 hours ago








  • 2




    The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
    – user2357112
    6 hours ago













up vote
11
down vote










up vote
11
down vote









First of all, note that we don't see the same thing in Python 2.x:



>>> timeit("for i in range(1000): 2*i*i")
51.00784397125244
>>> timeit("for i in range(1000): 2*(i*i)")
50.48330092430115


So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.



For small enough integers (including the ones we're considering here), CPython actually just uses the O(N2) grade-school digit by digit multiplication algorithm (for larger integers it switches to the Karatsuba algorithm). You can see this yourself in the source.



The number of digits in x*x is roughly twice that of 2*x or x (since log(x2) = 2 log(x)), so the square of the number of digits in x*x (recall that grade-school multiplication is quadratic) is 4 times that of 2*x or x -- so it makes sense that 2*(x*x) is the slower variant.



Note that a "digit" in this context is not a base-10 digit, but a 30-bit value (which are treated as single digits in CPython's implementation). Hence, for x = 10000000, 2*x*x only requires single-digit multiplications whereas 2*(x*x) requires a 2-digit multiplication (since x*x has 2 30-bit digits).



Here's a direct way to see this (Python 3):



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
5.796971936999967
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
4.3559221399999615


Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
3.0912468433380127
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
3.1120400428771973


(One interesting note: If you look at the source, you'll see that the algorithm actually has a special case for squaring numbers (which we're doing here), but even still this is not enough to overcome the fact that 2*(x*x) just requires processing more digits.)






share|improve this answer














First of all, note that we don't see the same thing in Python 2.x:



>>> timeit("for i in range(1000): 2*i*i")
51.00784397125244
>>> timeit("for i in range(1000): 2*(i*i)")
50.48330092430115


So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.



For small enough integers (including the ones we're considering here), CPython actually just uses the O(N2) grade-school digit by digit multiplication algorithm (for larger integers it switches to the Karatsuba algorithm). You can see this yourself in the source.



The number of digits in x*x is roughly twice that of 2*x or x (since log(x2) = 2 log(x)), so the square of the number of digits in x*x (recall that grade-school multiplication is quadratic) is 4 times that of 2*x or x -- so it makes sense that 2*(x*x) is the slower variant.



Note that a "digit" in this context is not a base-10 digit, but a 30-bit value (which are treated as single digits in CPython's implementation). Hence, for x = 10000000, 2*x*x only requires single-digit multiplications whereas 2*(x*x) requires a 2-digit multiplication (since x*x has 2 30-bit digits).



Here's a direct way to see this (Python 3):



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
5.796971936999967
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
4.3559221399999615


Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:



>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)
3.0912468433380127
>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)
3.1120400428771973


(One interesting note: If you look at the source, you'll see that the algorithm actually has a special case for squaring numbers (which we're doing here), but even still this is not enough to overcome the fact that 2*(x*x) just requires processing more digits.)







share|improve this answer














share|improve this answer



share|improve this answer








edited 6 hours ago

























answered 12 hours ago









arshajii

99.8k18178250




99.8k18178250












  • Is Karatsuba algorithm slower than digit multiplication algorithm?
    – Banghua Zhao
    12 hours ago






  • 1




    @BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
    – arshajii
    12 hours ago






  • 2




    source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
    – B. M.
    8 hours ago






  • 2




    In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
    – Peter Cordes
    8 hours ago








  • 2




    The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
    – user2357112
    6 hours ago


















  • Is Karatsuba algorithm slower than digit multiplication algorithm?
    – Banghua Zhao
    12 hours ago






  • 1




    @BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
    – arshajii
    12 hours ago






  • 2




    source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
    – B. M.
    8 hours ago






  • 2




    In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
    – Peter Cordes
    8 hours ago








  • 2




    The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
    – user2357112
    6 hours ago
















Is Karatsuba algorithm slower than digit multiplication algorithm?
– Banghua Zhao
12 hours ago




Is Karatsuba algorithm slower than digit multiplication algorithm?
– Banghua Zhao
12 hours ago




1




1




@BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
– arshajii
12 hours ago




@BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
– arshajii
12 hours ago




2




2




source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
– B. M.
8 hours ago




source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
– B. M.
8 hours ago




2




2




In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
– Peter Cordes
8 hours ago






In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
– Peter Cordes
8 hours ago






2




2




The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
– user2357112
6 hours ago




The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
– user2357112
6 hours ago












up vote
6
down vote













Python intern representation of integers is special, it uses slots of 30 bits :



In [6]: sys.getsizeof(2**30-1)
Out[6]: 28 # one slot + heading

In [7]: sys.getsizeof(2**30)
Out[7]: 32 # two slots


So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.



For a human who want to calculate 2*4*4, two ways :




  • (2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

  • 2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.


Python have the same problem. If x is a number such than 2x < B < x² , let x² = aB+b , with a,b <B. is stored in 2 slots, which I note (a|b). Computations leads to (without managing carries here):



   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)
(2*x)*x => (2x)*x =>(2a|2b)


in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.






share|improve this answer



















  • 3




    Where do a, b and X come from, what do they represent and how do they relate to the value of x?
    – Bergi
    10 hours ago










  • @Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
    – user2357112
    6 hours ago

















up vote
6
down vote













Python intern representation of integers is special, it uses slots of 30 bits :



In [6]: sys.getsizeof(2**30-1)
Out[6]: 28 # one slot + heading

In [7]: sys.getsizeof(2**30)
Out[7]: 32 # two slots


So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.



For a human who want to calculate 2*4*4, two ways :




  • (2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

  • 2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.


Python have the same problem. If x is a number such than 2x < B < x² , let x² = aB+b , with a,b <B. is stored in 2 slots, which I note (a|b). Computations leads to (without managing carries here):



   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)
(2*x)*x => (2x)*x =>(2a|2b)


in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.






share|improve this answer



















  • 3




    Where do a, b and X come from, what do they represent and how do they relate to the value of x?
    – Bergi
    10 hours ago










  • @Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
    – user2357112
    6 hours ago















up vote
6
down vote










up vote
6
down vote









Python intern representation of integers is special, it uses slots of 30 bits :



In [6]: sys.getsizeof(2**30-1)
Out[6]: 28 # one slot + heading

In [7]: sys.getsizeof(2**30)
Out[7]: 32 # two slots


So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.



For a human who want to calculate 2*4*4, two ways :




  • (2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

  • 2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.


Python have the same problem. If x is a number such than 2x < B < x² , let x² = aB+b , with a,b <B. is stored in 2 slots, which I note (a|b). Computations leads to (without managing carries here):



   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)
(2*x)*x => (2x)*x =>(2a|2b)


in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.






share|improve this answer














Python intern representation of integers is special, it uses slots of 30 bits :



In [6]: sys.getsizeof(2**30-1)
Out[6]: 28 # one slot + heading

In [7]: sys.getsizeof(2**30)
Out[7]: 32 # two slots


So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.



For a human who want to calculate 2*4*4, two ways :




  • (2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

  • 2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.


Python have the same problem. If x is a number such than 2x < B < x² , let x² = aB+b , with a,b <B. is stored in 2 slots, which I note (a|b). Computations leads to (without managing carries here):



   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)
(2*x)*x => (2x)*x =>(2a|2b)


in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.







share|improve this answer














share|improve this answer



share|improve this answer








edited 8 hours ago

























answered 13 hours ago









B. M.

12.2k11934




12.2k11934








  • 3




    Where do a, b and X come from, what do they represent and how do they relate to the value of x?
    – Bergi
    10 hours ago










  • @Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
    – user2357112
    6 hours ago
















  • 3




    Where do a, b and X come from, what do they represent and how do they relate to the value of x?
    – Bergi
    10 hours ago










  • @Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
    – user2357112
    6 hours ago










3




3




Where do a, b and X come from, what do they represent and how do they relate to the value of x?
– Bergi
10 hours ago




Where do a, b and X come from, what do they represent and how do they relate to the value of x?
– Bergi
10 hours ago












@Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
– user2357112
6 hours ago






@Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
– user2357112
6 hours ago












up vote
3
down vote













If your benchmark is right (didn't check), it may come from the fact that Python integers may be two different things : native integers when they are small (with a quick computation), and big integers when they increase in size (slower computation). The first syntax keeps the size smaller after the first operation while the second syntax may lead to two operations involving big integers.






share|improve this answer



















  • 11




    Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
    – John Coleman
    13 hours ago












  • python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
    – Patrick Artner
    13 hours ago








  • 4




    I still measure a 5% difference using 2 as a value for x.
    – Vincent
    13 hours ago






  • 2




    The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
    – Ely
    12 hours ago















up vote
3
down vote













If your benchmark is right (didn't check), it may come from the fact that Python integers may be two different things : native integers when they are small (with a quick computation), and big integers when they increase in size (slower computation). The first syntax keeps the size smaller after the first operation while the second syntax may lead to two operations involving big integers.






share|improve this answer



















  • 11




    Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
    – John Coleman
    13 hours ago












  • python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
    – Patrick Artner
    13 hours ago








  • 4




    I still measure a 5% difference using 2 as a value for x.
    – Vincent
    13 hours ago






  • 2




    The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
    – Ely
    12 hours ago













up vote
3
down vote










up vote
3
down vote









If your benchmark is right (didn't check), it may come from the fact that Python integers may be two different things : native integers when they are small (with a quick computation), and big integers when they increase in size (slower computation). The first syntax keeps the size smaller after the first operation while the second syntax may lead to two operations involving big integers.






share|improve this answer














If your benchmark is right (didn't check), it may come from the fact that Python integers may be two different things : native integers when they are small (with a quick computation), and big integers when they increase in size (slower computation). The first syntax keeps the size smaller after the first operation while the second syntax may lead to two operations involving big integers.







share|improve this answer














share|improve this answer



share|improve this answer








edited 13 hours ago

























answered 13 hours ago









Thomas Baruchel

4,55421636




4,55421636








  • 11




    Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
    – John Coleman
    13 hours ago












  • python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
    – Patrick Artner
    13 hours ago








  • 4




    I still measure a 5% difference using 2 as a value for x.
    – Vincent
    13 hours ago






  • 2




    The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
    – Ely
    12 hours ago














  • 11




    Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
    – John Coleman
    13 hours ago












  • python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
    – Patrick Artner
    13 hours ago








  • 4




    I still measure a 5% difference using 2 as a value for x.
    – Vincent
    13 hours ago






  • 2




    The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
    – Ely
    12 hours ago








11




11




Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
– John Coleman
13 hours ago






Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
– John Coleman
13 hours ago














python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
– Patrick Artner
13 hours ago






python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
– Patrick Artner
13 hours ago






4




4




I still measure a 5% difference using 2 as a value for x.
– Vincent
13 hours ago




I still measure a 5% difference using 2 as a value for x.
– Vincent
13 hours ago




2




2




The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
– Ely
12 hours ago




The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
– Ely
12 hours ago










up vote
2
down vote













From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:



Relevant part of 2 * x * x:



7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 BINARY_MULTIPLY
36 LOAD_FAST 2 (x)
38 BINARY_MULTIPLY
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24


Relevant part of 2 * (x * x):



  7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 LOAD_FAST 2 (x)
36 BINARY_MULTIPLY <=== 1st multiply x*x in a temp value
38 BINARY_MULTIPLY <=== then multiply result with 2
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24





share|improve this answer





















  • If this was the case we'd see the same effect in Python 2, but we don't.
    – arshajii
    11 hours ago






  • 1




    it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
    – Sam Mason
    11 hours ago










  • @arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
    – Eric Fortin
    11 hours ago










  • @SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
    – Eric Fortin
    11 hours ago










  • CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
    – Sam Mason
    11 hours ago















up vote
2
down vote













From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:



Relevant part of 2 * x * x:



7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 BINARY_MULTIPLY
36 LOAD_FAST 2 (x)
38 BINARY_MULTIPLY
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24


Relevant part of 2 * (x * x):



  7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 LOAD_FAST 2 (x)
36 BINARY_MULTIPLY <=== 1st multiply x*x in a temp value
38 BINARY_MULTIPLY <=== then multiply result with 2
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24





share|improve this answer





















  • If this was the case we'd see the same effect in Python 2, but we don't.
    – arshajii
    11 hours ago






  • 1




    it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
    – Sam Mason
    11 hours ago










  • @arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
    – Eric Fortin
    11 hours ago










  • @SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
    – Eric Fortin
    11 hours ago










  • CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
    – Sam Mason
    11 hours ago













up vote
2
down vote










up vote
2
down vote









From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:



Relevant part of 2 * x * x:



7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 BINARY_MULTIPLY
36 LOAD_FAST 2 (x)
38 BINARY_MULTIPLY
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24


Relevant part of 2 * (x * x):



  7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 LOAD_FAST 2 (x)
36 BINARY_MULTIPLY <=== 1st multiply x*x in a temp value
38 BINARY_MULTIPLY <=== then multiply result with 2
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24





share|improve this answer












From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:



Relevant part of 2 * x * x:



7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 BINARY_MULTIPLY
36 LOAD_FAST 2 (x)
38 BINARY_MULTIPLY
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24


Relevant part of 2 * (x * x):



  7          28 LOAD_FAST                1 (num)
30 LOAD_CONST 3 (2)
32 LOAD_FAST 2 (x)
34 LOAD_FAST 2 (x)
36 BINARY_MULTIPLY <=== 1st multiply x*x in a temp value
38 BINARY_MULTIPLY <=== then multiply result with 2
40 INPLACE_ADD
42 STORE_FAST 1 (num)
44 JUMP_ABSOLUTE 24






share|improve this answer












share|improve this answer



share|improve this answer










answered 12 hours ago









Eric Fortin

6,61721728




6,61721728












  • If this was the case we'd see the same effect in Python 2, but we don't.
    – arshajii
    11 hours ago






  • 1




    it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
    – Sam Mason
    11 hours ago










  • @arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
    – Eric Fortin
    11 hours ago










  • @SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
    – Eric Fortin
    11 hours ago










  • CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
    – Sam Mason
    11 hours ago


















  • If this was the case we'd see the same effect in Python 2, but we don't.
    – arshajii
    11 hours ago






  • 1




    it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
    – Sam Mason
    11 hours ago










  • @arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
    – Eric Fortin
    11 hours ago










  • @SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
    – Eric Fortin
    11 hours ago










  • CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
    – Sam Mason
    11 hours ago
















If this was the case we'd see the same effect in Python 2, but we don't.
– arshajii
11 hours ago




If this was the case we'd see the same effect in Python 2, but we don't.
– arshajii
11 hours ago




1




1




it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
– Sam Mason
11 hours ago




it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
– Sam Mason
11 hours ago












@arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
– Eric Fortin
11 hours ago




@arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
– Eric Fortin
11 hours ago












@SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
– Eric Fortin
11 hours ago




@SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
– Eric Fortin
11 hours ago












CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
– Sam Mason
11 hours ago




CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
– Sam Mason
11 hours ago










up vote
-6
down vote













I think is related to how the compiler or interpreter designed, python and java uses the same philosophy, their compilers takes source code and convert it a byte code (.pyc in python, .class in java) then the VM uses the JIT Model to execute generated codes.



In parsing phase the compiler produces an AST used in semantic analysis.



The exact details of how this happens can vary wildly between different languages and different interpreters.






share|improve this answer





















  • How is this relevant to the question?
    – Solomon Ucko
    7 hours ago










  • CPython, the most widely-used Python interpreter, does not JIT-compile to native machine code. It's purely interpreted. This is one reason that loops over array elements, or other low-level code in python is very slow compared to fully-compiled languages (either JIT or ahead-of-time).
    – Peter Cordes
    44 mins ago















up vote
-6
down vote













I think is related to how the compiler or interpreter designed, python and java uses the same philosophy, their compilers takes source code and convert it a byte code (.pyc in python, .class in java) then the VM uses the JIT Model to execute generated codes.



In parsing phase the compiler produces an AST used in semantic analysis.



The exact details of how this happens can vary wildly between different languages and different interpreters.






share|improve this answer





















  • How is this relevant to the question?
    – Solomon Ucko
    7 hours ago










  • CPython, the most widely-used Python interpreter, does not JIT-compile to native machine code. It's purely interpreted. This is one reason that loops over array elements, or other low-level code in python is very slow compared to fully-compiled languages (either JIT or ahead-of-time).
    – Peter Cordes
    44 mins ago













up vote
-6
down vote










up vote
-6
down vote









I think is related to how the compiler or interpreter designed, python and java uses the same philosophy, their compilers takes source code and convert it a byte code (.pyc in python, .class in java) then the VM uses the JIT Model to execute generated codes.



In parsing phase the compiler produces an AST used in semantic analysis.



The exact details of how this happens can vary wildly between different languages and different interpreters.






share|improve this answer












I think is related to how the compiler or interpreter designed, python and java uses the same philosophy, their compilers takes source code and convert it a byte code (.pyc in python, .class in java) then the VM uses the JIT Model to execute generated codes.



In parsing phase the compiler produces an AST used in semantic analysis.



The exact details of how this happens can vary wildly between different languages and different interpreters.







share|improve this answer












share|improve this answer



share|improve this answer










answered 12 hours ago









Lamine Kessoum

226




226












  • How is this relevant to the question?
    – Solomon Ucko
    7 hours ago










  • CPython, the most widely-used Python interpreter, does not JIT-compile to native machine code. It's purely interpreted. This is one reason that loops over array elements, or other low-level code in python is very slow compared to fully-compiled languages (either JIT or ahead-of-time).
    – Peter Cordes
    44 mins ago


















  • How is this relevant to the question?
    – Solomon Ucko
    7 hours ago










  • CPython, the most widely-used Python interpreter, does not JIT-compile to native machine code. It's purely interpreted. This is one reason that loops over array elements, or other low-level code in python is very slow compared to fully-compiled languages (either JIT or ahead-of-time).
    – Peter Cordes
    44 mins ago
















How is this relevant to the question?
– Solomon Ucko
7 hours ago




How is this relevant to the question?
– Solomon Ucko
7 hours ago












CPython, the most widely-used Python interpreter, does not JIT-compile to native machine code. It's purely interpreted. This is one reason that loops over array elements, or other low-level code in python is very slow compared to fully-compiled languages (either JIT or ahead-of-time).
– Peter Cordes
44 mins ago




CPython, the most widely-used Python interpreter, does not JIT-compile to native machine code. It's purely interpreted. This is one reason that loops over array elements, or other low-level code in python is very slow compared to fully-compiled languages (either JIT or ahead-of-time).
– Peter Cordes
44 mins ago










Waqas Gondal is a new contributor. Be nice, and check out our Code of Conduct.










draft saved

draft discarded


















Waqas Gondal is a new contributor. Be nice, and check out our Code of Conduct.













Waqas Gondal is a new contributor. Be nice, and check out our Code of Conduct.












Waqas Gondal is a new contributor. Be nice, and check out our Code of Conduct.
















Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53570864%2fwhy-is-2-x-x-faster-than-2-x-x-in-python-3-x-for-integers%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Eastern Orthodox Church

Zagreb

Understanding the information contained in the Deep Space Network XML data?