5 August, 2008
All Puzzles: Here
Puzzle: Devise a fast algorithm for computing the number of 1-bits in an unsigned integer. If the machine supports n-bit integers, can we compute the number of 1-bits in O(log n) machine instructions? Can we compute the number of 1-bits in O(b) machine instructions where b is the number of 1-bits in the integer?
Source: Commonly asked in job interviews for Software Engineers. I first heard it in 1996 when I was a graduate student.
Solution: This article presents six solutions to this problem. Source code in C is available.
1) Iterated Count int bitcount (unsigned int n) { int count = 0; while (n) { count += n & 0x1u; n >>= 1; } return count; }
Iterated Count runs in time proportional to the total number of bits. It simply loops through all the bits, terminating slightly earlier because of the while condition. Useful if 1′s are sparse and among the least significant bits.
2a) Sparse Ones int bitcount (unsigned int n) { int count = 0 ; while (n) { count++ ; n &= (n - 1) ; } return count ; }
Sparse Ones runs in time proportional to the number of 1 bits. The mystical line n &= (n – 1) simply sets the rightmost 1 bit in n to 0.
2b) Dense Ones int bitcount (unsigned int n) { int count = 8 * sizeof(int) ; n ^= (unsigned int) - 1 ; while (n) { count-- ; n &= (n - 1) ; } return count ; }
Dense Ones runs in time proportional to the number of 0 bits. It is the same as Sparse Ones, except that it first toggles all bits (n ~= -1), and continually subtracts the number of 1 bits from sizeof(int).
Sparse Ones and Dense Ones were first described by Peter Wegner in “A Technique for Counting Ones in a Binary Computer“, Communications of the ACM, Volume 3 (1960) Number 5, page 322.
3a) Precompute_8bit static int bits_in_char [256] ; int bitcount (unsigned int n) { // works only for 32-bit ints return bits_in_char [n & 0xffu] + bits_in_char [(n >> 8 ) & 0xffu] + bits_in_char [(n >> 16) & 0xffu] + bits_in_char [(n >> 24) & 0xffu] ; }
Precompute_8bit assumes an array bits_in_char such that bits_in_char[i] contains the number of 1 bits in the binary representation for i. It repeatedly updates count by masking out the last eight bits in n, and indexing into bits_in_char.
3b) Precompute_16bit static char bits_in_16bits [0x1u << 16] ; int bitcount (unsigned int n) { // works only for 32-bit ints return bits_in_16bits [n & 0xffffu] + bits_in_16bits [(n >> 16) & 0xffffu] ; }
Precompute_16bit is a variant of Precompute_8bit in that an array bits_in_16bits[] stores the number of 1 bits in successive 16 bit numbers (shorts).
4) Parallel Count #define TWO(c) (0x1u << (c)) #define MASK(c) \ (((unsigned int)(-1)) / (TWO(TWO(c)) + 1u)) #define COUNT(x,c) \ ((x) & MASK(c)) + (((x) >> (TWO(c))) & MASK(c)) int bitcount (unsigned int n) { n = COUNT(n, 0) ; n = COUNT(n, 1) ; n = COUNT(n, 2) ; n = COUNT(n, 3) ; n = COUNT(n, 4) ; /* n = COUNT(n, 5) ; for 64-bit integers */ return n ; }
Parallel Count carries out bit counting in a parallel fashion. Consider n after the first line has finished executing. Imagine splitting n into pairs of bits. Each pair contains the number of ones in those two bit positions in the original n. After the second line has finished executing, each nibble contains the number of ones in those four bits positions in the original n. Continuing this for five iterations, the 64 bits contain the number of ones among these sixty-four bit positions in the original n. That is what we wanted to compute.
5) Nifty Parallel Count #define MASK_01010101 (((unsigned int)(-1))/3) #define MASK_00110011 (((unsigned int)(-1))/5) #define MASK_00001111 (((unsigned int)(-1))/17) int bitcount (unsigned int n) { n = (n & MASK_01010101) + ((n >> 1) & MASK_01010101) ; n = (n & MASK_00110011) + ((n >> 2) & MASK_00110011) ; n = (n & MASK_00001111) + ((n >> 4) & MASK_00001111) ; return n % 255 ; }
Nifty Parallel Count works the same way as Parallel Count for the first three iterations. At the end of the third line (just before the return), each byte of n contains the number of ones in those eight bit positions in the original n. A little thought then explains why the remainder modulo 255 works.
According to Don Knuth (The Art of Computer Programming Vol IV, p 11), in the first textbook on programming, The Preparation of Programs for an Electronic Digital Computer by Wilkes, Wheeler and Gill (1957, reprinted 1984), pages 191–193 presented Nifty Parallel Count by D B Gillies and J C P Miller.
6) MIT HAKMEM Count int bitcount(unsigned int n) { /* works for 32-bit numbers only */ /* fix last line for 64-bit numbers */ register unsigned int tmp; tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111); return ((tmp + (tmp >> 3)) & 030707070707) % 63; }
MIT HAKMEM Count is funky. Consider a 3 bit number as being 4a+2b+c. If we shift it right 1 bit, we have 2a+b. Subtracting this from the original gives 2a+b+c. If we right-shift the original 3-bit number by two bits, we get a, and so with another subtraction we have a+b+c, which is the number of bits in the original number. How is this insight employed? The first assignment statement in the routine computes tmp. Consider the octal representation of tmp. Each digit in the octal representation is simply the number of 1′s in the corresponding three bit positions in n. The last return statement sums these octal digits to produce the final answer. The key idea is to add adjacent pairs of octal digits together and then compute the remainder modulus 63. This is accomplished by right-shifting tmp by three bits, adding it to tmp itself and ANDing with a suitable mask. This yields a number in which groups of six adjacent bits (starting from the LSB) contain the number of 1′s among those six positions in n. This number modulo 63 yields the final answer. For 64-bit numbers, we would have to add triples of octal digits and use modulus 1023. This is HACKMEM 169, as used in X11 sources. Source: MIT AI Lab memo, late 1970′s.
7) Builtin Instructions
GNU compiler allows for
int __builtin_popcount (unsigned int x);
which translates into a single CPU instruction if the underlying machine architecture supports it. For example, Intel machines have POPCNT (SSE4 Instruction set announced in 2006). Many GCC builtin functions exist.
No Some Heavy Optimization Optimization Optimization Precomp_16 52.94 Mcps 76.22 Mcps 80.58 Mcps Precomp_8 29.74 Mcps 49.83 Mcps 51.65 Mcps Parallel 19.30 Mcps 36.00 Mcps 38.55 Mcps MIT 16.93 Mcps 17.10 Mcps 31.82 Mcps Nifty 12.78 Mcps 16.07 Mcps 29.71 Mcps Sparse 5.70 Mcps 15.01 Mcps 14.62 Mcps Dense 5.30 Mcps 14.11 Mcps 14.56 Mcps Iterated 3.60 Mcps 3.84 Mcps 9.24 Mcps Mcps = Million counts per second
Which of the several bit counting routines is the fastest? Results of speed trials on an i686 are summarized in the table on left. “No Optimization” was compiled with plain gcc. “Some Optimizations” was gcc -O3. “Heavy Optimizations” corresponds to gcc -O3 -mcpu=i686 -march=i686 -fforce-addr -funroll-loops -frerun-cse-after-loop -frerun-loop-opt -malign-functions=4.
Thanks to Seth Robertson who suggested performing speed trials by extendingbitcount.c. Seth also pointed me to MIT_Hackmem routine. Thanks to Denny Gursky who suggested the idea of Precompute_11bit. That would require three sums (11-bit, 11-bit and 10-bit precomputed counts). I then tried Precompute_16bit which turned out to be even faster.
If you have niftier solutions up your sleeves, please send me an e-mail or write comments below!
Further Reading:
- HAKMEM (bit counting is memo number 169), MIT AI Lab, Artificial Intelligence Memo No. 239, February 29, 1972.
- Bit Twiddling Hacks by Sean Anderson at Stanford University.
- Bitwise Tricks and Techniques by Don Knuth (The Art of Computer Programming, Part IV).
On 15 March 2009, this article was discussed on Reddit — you might learn quite a bit by reading the comments therein.
Ads by Google
Optimization Made Easy
Learn How to Solve Optimization Models Easily. Free Trial Downloadwww.lindo.com
Learn How to Solve Optimization Models Easily. Free Trial Downloadwww.lindo.com
5 August, 2008 at 3:47 pm
unsigned char bitcount(unsigned long long a){
// value can’t be greater than 64 and so we can improve the performance
//by declaring the function as unsigned char
__asm {
xor eax, eax // set return to 0
mov edx, dword ptr a // set edx to the first 32 bits (0-31)
test edx, edx // compare edx to zero (get e flag)
je SHORT L2 // jump if equal
L1:
mov ebx, edx // set ebx to edx
sub edx, 1 // decrease edx
add eax, 1 // increase eax
and edx, ebx // set edx to edx & ebx (get e flag)
jne SHORT L1 // jump if not equal zero
L2:
mov edx, dword ptr a + 4 // set edx to the second 32 bits (32-63)
test edx, edx // compare edx to zero (get e flag)
je SHORT L4 // jump if equal
L3:
mov ebx, edx // set ebx to edx
sub edx, 1 // decrease edx
add eax, 1 // increase eax
and edx, ebx // set edx to edx & ebx (get e flag)
jne SHORT L3 // jump if not equal zero
L4: // end
}
}
unsigned char bitcount(unsigned int a){
__asm {
xor eax, eax // set return to 0
mov edx, dword ptr a // set edx to the first 32 bits (0-31)
test edx, edx // compare edx to zero (get e flag)
je SHORT L2 // jump if equal
L1:
mov ebx, edx // set ebx to edx
sub edx, 1 // decrease edx
add eax, 1 // increase eax
and edx, ebx // set edx to edx & ebx (get e flag)
jne SHORT L1 // jump if not equal zero
L2: // end
}
}
5 August, 2008 at 11:27 pm
(now you only have to get your hands on a list of processor that do support this instruction, wich may be tricky)
#define MASK_00110011 (((unsigned int)(-1))/5)
#define MASK_00001111 (((unsigned int)(-1))/17)
#define MASK_00001111_2 (((unsigned int)(-1))/257)
{ \
/* printf(“Source : %X\n”,src); */ \
n=src; \
n = (n & MASK_01010101) + ((n >> 1) & MASK_01010101) ; \
/*printf(“Result 1 : %X\n”,n);*/ \
n = (n & MASK_00110011) + ((n >> 2) & MASK_00110011) ; \
/*printf(“Result 2 : %X\n”,n); */ \
n = (n & MASK_00001111) + ((n >> 4) & MASK_00001111) ; \
/*printf(“Result 3 : %X\n”,n); */ \
n = (n & MASK_00001111_2) + ((n >> 8) & MASK_00001111_2) ; \
/* printf(“Result 4 : %X\n”,n);*/ \
n = (n & 65535) + ((n >> 16) & 65535) ; \
/*printf(“Result : %i\n”,n); */ \
}
11 August, 2008 at 4:52 am
{
unsigned long long tmp;
– ((n >> 1) & 0333333333333333333333ull)
– ((n >> 2) & 0111111111111111111111ull);
return (tmp & 0707070707070707070700ull) % 63
+ (tmp & 7);
}
4 September, 2008 at 9:53 am
16 September, 2008 at 7:58 am
14 (…00001110)
BitCount(49) = 4;
BitCount(12) = 2;
17 September, 2008 at 12:55 am
3 October, 2008 at 1:42 pm
24 November, 2008 at 10:03 pm
3 December, 2008 at 3:02 am
5 December, 2008 at 5:36 am
{
int n, nr = 3, i = 1;
n = nr;
while (nr>1))
{
nr = ((n<<1)&(~n));
n = nr;
nr = nr | 1;
i = 1;
}
else
{
nr = n | (1<<i);
i++;
}
}
}
5 December, 2008 at 5:39 am
{
int n, nr = 3, i = 1;
n = nr;
while (nr>1))
{
nr = ((n<<1)&(~n));
n = nr;
nr = nr | 1;
i = 1;
}
else
{
nr = n | (1<<i);
i++;
}
}
}
5 January, 2009 at 12:02 am
int count = (n > 0);
if((n &= (n – 1))){ ++count;
if((n &= (n – 1))){ ++count;
if((n &= (n – 1))){ ++count;
if((n &= (n – 1))){ ++count;
if((n &= (n – 1))){ ++count;
}}}}}
}
17 January, 2009 at 12:52 pm
15 March, 2009 at 4:08 am
20 August, 2010 at 3:31 pm
15 March, 2009 at 6:32 am
15 March, 2009 at 10:27 am
unsigned smallest, ripple, ones;
// x = xxx0 1111 0000
smallest = x & -x; // 0000 0001 0000
ripple = x + smallest; // xxx1 0000 0000
ones = x ^ ripple; // 0001 1111 0000
ones = (ones >> 2)/smallest; // 0000 0000 0111
return ripple | ones; // xxx1 0000 0111
}
15 March, 2009 at 2:32 pm
15 March, 2009 at 6:16 pm
uint64_t x;
int n = 0;
while (x) {
if (x & 1) foo(n);
x >>= 1;
n++;
}
15 March, 2009 at 6:17 pm
20 April, 2009 at 4:08 am
I need to solve the same urgently…
24 October, 2009 at 4:41 pm
try this:
if(mask & (1 << i)) {
// run something on index i
}
}
12 June, 2010 at 6:35 am
15 March, 2009 at 10:11 pm
The best method for counting bits in a 32-bit integer v is the following:
v = (v & 0×33333333) + ((v >> 2) & 0×33333333); // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0×1010101) >> 24; // count
—
16 March, 2009 at 6:15 pm
16 March, 2009 at 11:07 pm
the magnitude of the difference
between the number of one bits
in two numbers:
| myself |
myself := self.
[ myself == 0 ifTrue: [^theOther bitCount].
theOther == 0 ifTrue: [^myself bitCount].
myself := myself bitAnd: myself – 1.
theOther := theOther bitAnd: theOther – 1.
] repeat
or whatever your target language is.
2) Use whichever definition of #bitCount you prefer.
3) For more speed, inline the calls to #bitCount.
27 March, 2009 at 7:28 am
2 April, 2009 at 11:05 am
15 April, 2009 at 5:40 am
30 April, 2009 at 2:44 am
12 June, 2010 at 6:46 am
18 August, 2009 at 4:39 am
4 September, 2009 at 3:22 am
12 June, 2010 at 6:47 am
4 September, 2009 at 11:01 am
20 November, 2009 at 11:13 am
It was really helpfull
21 January, 2010 at 8:28 pm
#define MASK_0011001100110011 (((unsigned int)(-1))/5)
#define MASK_0000111100001111 (((unsigned int)(-1))/17)
#define MASK_0000000011111111 (((unsigned int)(-1))/257)
#define MASK_1111111111111111 (((unsigned int)(-1))/65537)
int bitcount (unsigned int n)
{
n = (n & MASK_0101010101010101) + ((n >> 1) & MASK_0101010101010101) ;
n = (n & MASK_0011001100110011) + ((n >> 2) & MASK_0011001100110011) ;
n = (n & MASK_0000111100001111) + ((n >> 4) & MASK_0000111100001111) ;
n = (n & MASK_0000000011111111) + ((n >> 8) & MASK_0000000011111111) ;
n = (n & MASK_1111111111111111) + ((n >> 16) & MASK_1111111111111111) ;
}
19 February, 2010 at 11:07 am
28 February, 2010 at 2:02 pm
22 June, 2010 at 2:01 pm
(n & 0xFF) + ((n >> 8) & 0xFF) + ((n >> 16) & 0xFF) + ((n >> 24) & 0xFF, but I don’t understand why this is equivalent to n % 255. The article said that a little thought would explain why it works, but it looks like I’m too thick for that and I haven’t been able to find clues on the web.
22 June, 2010 at 4:18 pm
ABCD%255 =
(((A*256+B)*256+C)*256+D)%255=
(((A*256+B)*256+C)*(255+1))%255+D%255=
((A*256+B)*256+C)%255+D%255=
((A*256+B)*(255+1))%255+C%255+D%255=
(A*256+B)%255+C%255+D%255=
(A*(255+1))%255+B%255+C%255+D%255=
A%255+B%255+C%255+D%255=
A+B+C+D
23 June, 2010 at 4:37 am
I get it now.
Thanks
1 July, 2010 at 11:12 pm
[CODE]
template //8 16 32 or 64bit signed/unsigned.. use union for float/double etc
inline uint8_t bitCountA(T i){
uint8_t c = i & 01;
i /= 2; // FIX: cast to unsigned, or use divide to clear highest bit
while (i){
c += (i & 01);
i >>= 1;
}
return c;
}
[/CODE]
[CODE]
//use “recursive” count #1s in 2bit block, sum 2bit blocks, 4bit blocks etc
template
inline uint8_t bitCount_char_E(uint i){
i -= (0xAAAAAAAAAAAAAAAA & i)>>1; //each 2bit block has #1s in that block
//to sum 2bit blocks, need clear upper 2bit block for overflow (010+010=100)
i = (0×3333333333333333 & i) + ((0xCCCCCCCCCCCCCCCC & i)>>2);
i += (i>>4); //add up 1s in 4bit blocks
i &= 0x0F0F0F0F0F0F0F0F; //clear upper 4bit blocks (reqd for uint_16)
switch (sizeof(uint)){ //use switch case “fall-through”
case 8: i += i>>32; //if (sizeof(uint)>=8) i += i>>32;
case 4: i += i>>16; //if (sizeof(uint)>=4) i += i>>16;
case 2: i += i>>8; //if (sizeof(uint)>=2) i += i>>8;
}
return i; //return char – lower 8bits only
}
[/CODE]
1 July, 2010 at 11:50 pm
16b_lookup = 2891ms ~8.1 clock cycles
Seanders__ = 4421ms ~12.4 clock cycles
8b_lookup_ = 5325ms ~14.9 clock cycles
my_recursi = 5757ms ~16.1 clock cycles (woohoo!)
Kernighan_ = 6708ms
__naive___ = 61148ms
13 July, 2010 at 7:22 am
29 September, 2010 at 1:09 am
int f(unsigned n, int count);
{
return (*pf[n > 0])(n >> 1, count+(n & 1));
}
{
return count;
}
printf(“%d\n”, f(my_number, 0));