
Puzzle: Fast Bit Counting


5 August, 2008

 
 
Puzzle: Devise a fast algorithm for computing the number of 1-bits in an unsigned integer. If the machine supports n-bit integers, can we compute the number of 1-bits in O(log n) machine instructions? Can we compute the number of 1-bits in O(b) machine instructions where b is the number of 1-bits in the integer?
Source: Commonly asked in job interviews for Software Engineers. I first heard it in 1996 when I was a graduate student.
Solution: This article presents six solutions to this problem. Source code in C is available.
1) Iterated Count
int bitcount (unsigned int n) {
   int count = 0;
   while (n) {
      count += n & 0x1u;
      n >>= 1;
   }
   return count;
}
Iterated Count runs in time proportional to the total number of bits. It simply loops through all the bits, terminating slightly earlier because of the while condition. Useful if 1s are sparse and among the least significant bits.
2a) Sparse Ones
int bitcount (unsigned int n)  {
   int count = 0 ;
   while (n)  {
      count++ ;
      n &= (n - 1) ;
   }
   return count ;
}
Sparse Ones runs in time proportional to the number of 1 bits. The mystical line n &= (n - 1) simply sets the rightmost 1 bit in n to 0.
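To see why, note that subtracting 1 flips the rightmost 1 bit and all the 0 bits below it, so ANDing with the original clears exactly that bit. A quick sanity check (the helper function is ours, not part of the original listing):

```c
/* clear_lowest_one: returns n with its rightmost 1 bit set to 0.
   Subtracting 1 flips the rightmost 1 and every 0 below it, so
   ANDing with the original value clears exactly that one bit. */
unsigned int clear_lowest_one(unsigned int n) {
    return n & (n - 1);
}
```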
2b) Dense Ones
int bitcount (unsigned int n)   {
   int count = 8 * sizeof(int) ;
   n ^= (unsigned int)-1 ;
   while (n)  {
      count-- ;
      n &= (n - 1) ;
   }
   return count ;
}
Dense Ones runs in time proportional to the number of 0 bits. It is the same as Sparse Ones, except that it first toggles all bits (n ^= (unsigned int)-1) and then decrements the count, which starts at the total number of bits 8 * sizeof(int), once for each 1 bit in the toggled value.
Sparse Ones and Dense Ones were first described by Peter Wegner in “A Technique for Counting Ones in a Binary Computer”, Communications of the ACM, Volume 3 (1960) Number 5, page 322.
3a) Precompute_8bit
static int bits_in_char [256] ;           

int bitcount (unsigned int n)  {
  // works only for 32-bit ints

  return bits_in_char [n  & 0xffu]
      +  bits_in_char [(n >>  8 ) & 0xffu]
      +  bits_in_char [(n >> 16) & 0xffu]
      +  bits_in_char [(n >> 24) & 0xffu] ;
}
Precompute_8bit assumes an array bits_in_char such that bits_in_char[i] contains the number of 1 bits in the binary representation of i. It masks out successive 8-bit chunks of n, indexes into bits_in_char, and sums the four lookups.
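The article assumes bits_in_char[] has already been populated. One way to fill it once at startup, sketched here (the initializer name is ours), exploits the fact that i has the same 1 bits as i >> 1 plus its lowest bit:

```c
/* Table of 1-bit counts for every byte value; must be filled in once
   before bitcount() is first called. */
static int bits_in_char[256];

/* init_bits_in_char: populate the table using the recurrence
   popcount(i) = popcount(i >> 1) + (i & 1). */
void init_bits_in_char(void) {
    int i;
    bits_in_char[0] = 0;
    for (i = 1; i < 256; i++)
        bits_in_char[i] = bits_in_char[i >> 1] + (i & 1);
}
```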
3b) Precompute_16bit
static char bits_in_16bits [0x1u << 16] ;      

int bitcount (unsigned int n)  {
   // works only for 32-bit ints

   return bits_in_16bits [n         & 0xffffu]
       +  bits_in_16bits [(n >> 16) & 0xffffu] ;
}
Precompute_16bit is a variant of Precompute_8bit in that an array bits_in_16bits[] stores the number of 1 bits in successive 16 bit numbers (shorts).
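A sketch of the corresponding initialization (the function name is ours); the trade-off is a 64 KB table in exchange for halving the number of lookups per call:

```c
/* 64 KB table of 1-bit counts for every 16-bit value. */
static char bits_in_16bits[0x1u << 16];

/* init_bits_in_16bits: same shift-by-one recurrence as the 8-bit
   table; the loop runs 65536 times, once, at startup. */
void init_bits_in_16bits(void) {
    unsigned int i;
    bits_in_16bits[0] = 0;
    for (i = 1; i < (0x1u << 16); i++)
        bits_in_16bits[i] = (char)(bits_in_16bits[i >> 1] + (i & 1));
}
```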
4) Parallel Count
#define TWO(c)     (0x1u << (c))
#define MASK(c) \
  (((unsigned int)(-1)) / (TWO(TWO(c)) + 1u))
#define COUNT(x,c) \
  ((x) & MASK(c)) + (((x) >> (TWO(c))) & MASK(c))

int bitcount (unsigned int n)  {
   n = COUNT(n, 0) ;
   n = COUNT(n, 1) ;
   n = COUNT(n, 2) ;
   n = COUNT(n, 3) ;
   n = COUNT(n, 4) ;
   /* n = COUNT(n, 5) ;    for 64-bit integers */
   return n ;
}
Parallel Count carries out bit counting in a parallel fashion. Consider n after the first line has finished executing. Imagine splitting n into pairs of bits. Each pair contains the number of ones in those two bit positions in the original n. After the second line has finished executing, each nibble contains the number of ones in those four bit positions in the original n. Continuing in this fashion, after five iterations the full 32-bit word contains the number of ones among all thirty-two bit positions in the original n (a sixth iteration handles 64-bit integers). That is what we wanted to compute.
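The folding can be checked against a naive loop. The macros below are copied from the listing above; the two function names and the test harness are ours:

```c
#define TWO(c)     (0x1u << (c))
#define MASK(c)    (((unsigned int)(-1)) / (TWO(TWO(c)) + 1u))
#define COUNT(x,c) (((x) & MASK(c)) + (((x) >> (TWO(c))) & MASK(c)))

/* parallel_bitcount: the five folding steps for a 32-bit word */
unsigned int parallel_bitcount(unsigned int n) {
    n = COUNT(n, 0);  /* each 2-bit field holds a count in 0..2 */
    n = COUNT(n, 1);  /* each nibble holds a count in 0..4 */
    n = COUNT(n, 2);  /* each byte holds a count in 0..8 */
    n = COUNT(n, 3);  /* each 16-bit half holds a count in 0..16 */
    n = COUNT(n, 4);  /* the whole word holds the final count */
    return n;
}

/* naive_bitcount: one-bit-at-a-time reference implementation */
unsigned int naive_bitcount(unsigned int n) {
    unsigned int c = 0;
    while (n) { c += n & 1u; n >>= 1; }
    return c;
}
```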
5) Nifty Parallel Count
#define MASK_01010101 (((unsigned int)(-1))/3)
#define MASK_00110011 (((unsigned int)(-1))/5)
#define MASK_00001111 (((unsigned int)(-1))/17)
 int bitcount (unsigned int n) {
   n = (n & MASK_01010101) + ((n >> 1) & MASK_01010101) ;
   n = (n & MASK_00110011) + ((n >> 2) & MASK_00110011) ;
   n = (n & MASK_00001111) + ((n >> 4) & MASK_00001111) ;
   return n % 255 ;
}
Nifty Parallel Count works the same way as Parallel Count for the first three iterations. At the end of the third line (just before the return), each byte of n contains the number of ones in those eight bit positions in the original n. A little thought then explains why the remainder modulo 255 works.
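As an aid to that little thought: since 256 ≡ 1 (mod 255), reducing a 32-bit value modulo 255 sums its four bytes, provided each byte and their total stay below 255; here each byte holds a count of at most 8. A check of the identity (the helper function is ours):

```c
/* byte_sum: sum of the four bytes of a 32-bit word. When every byte
   and the total are below 255, this equals n % 255 because each
   factor of 256 is congruent to 1 modulo 255. */
unsigned int byte_sum(unsigned int n) {
    return (n & 0xFFu) + ((n >> 8) & 0xFFu)
         + ((n >> 16) & 0xFFu) + ((n >> 24) & 0xFFu);
}
```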
According to Don Knuth (The Art of Computer Programming Vol IV, p 11), in the first textbook on programming, The Preparation of Programs for an Electronic Digital Computer by Wilkes, Wheeler and Gill (1957, reprinted 1984), pages 191–193 presented Nifty Parallel Count by D B Gillies and J C P Miller.
6) MIT HAKMEM Count
int bitcount(unsigned int n) {
   /* works for 32-bit numbers only    */
   /* fix last line for 64-bit numbers */

   register unsigned int tmp;

   tmp = n - ((n >> 1) & 033333333333)
           - ((n >> 2) & 011111111111);
   return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
MIT HAKMEM Count is funky. Consider a 3-bit number as being 4a+2b+c. If we shift it right 1 bit, we have 2a+b. Subtracting this from the original gives 2a+b+c. If we right-shift the original 3-bit number by two bits, we get a, and so with another subtraction we have a+b+c, which is the number of 1 bits in the original number. How is this insight employed? The first assignment statement in the routine computes tmp. Consider the octal representation of tmp. Each digit in the octal representation is simply the number of 1s in the corresponding three bit positions in n. The last return statement sums these octal digits to produce the final answer. The key idea is to add adjacent pairs of octal digits together and then compute the remainder modulo 63. This is accomplished by right-shifting tmp by three bits, adding it to tmp itself and ANDing with a suitable mask. This yields a number in which groups of six adjacent bits (starting from the LSB) contain the number of 1s among those six positions in n. This number modulo 63 yields the final answer. For 64-bit numbers, we would have to add triples of octal digits and use modulo 1023. This is HAKMEM item 169, as used in X11 sources. Source: MIT AI Lab memo, 1972.
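The octal-digit claim can be verified mechanically. The first assignment is reproduced from the routine above; the checking function is ours:

```c
/* hakmem_tmp: the first assignment from the HAKMEM routine */
unsigned int hakmem_tmp(unsigned int n) {
    return n - ((n >> 1) & 033333333333)
             - ((n >> 2) & 011111111111);
}

/* octal_digits_match: 1 if every octal digit of tmp equals the number
   of 1 bits in the matching 3-bit group of n (the top group has only
   two bits in a 32-bit word, but the identity still holds there). */
int octal_digits_match(unsigned int n) {
    unsigned int tmp = hakmem_tmp(n);
    int shift;
    for (shift = 0; shift < 32; shift += 3) {
        unsigned int group = (n >> shift) & 07u;
        unsigned int ones  = (group & 1u) + ((group >> 1) & 1u)
                           + ((group >> 2) & 1u);
        if (((tmp >> shift) & 07u) != ones)
            return 0;
    }
    return 1;
}
```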
7) Builtin Instructions
The GNU compiler provides
int __builtin_popcount (unsigned int x);
which translates into a single CPU instruction if the underlying machine architecture supports it. For example, Intel machines have POPCNT (SSE4 Instruction set announced in 2006). Many GCC builtin functions exist.
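A sketch of how one might fall back to a portable routine when the builtin is unavailable (the wrapper name and the choice of fallback are ours; the fallback is the Nifty Parallel Count from above):

```c
/* popcount32: use the compiler builtin when compiling with GCC or
   Clang, otherwise fall back to a portable parallel count. */
unsigned int popcount32(unsigned int n) {
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned int)__builtin_popcount(n);
#else
    /* Nifty Parallel Count: fold pairs, nibbles, bytes, then sum the
       bytes via the mod-255 trick. */
    n = (n & 0x55555555u) + ((n >> 1) & 0x55555555u);
    n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u);
    n = (n & 0x0F0F0F0Fu) + ((n >> 4) & 0x0F0F0F0Fu);
    return n % 255u;
#endif
}
```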
                  No          Some        Heavy 
            Optimization Optimization Optimization

  Precomp_16 52.94 Mcps   76.22 Mcps   80.58 Mcps
   Precomp_8 29.74 Mcps   49.83 Mcps   51.65 Mcps
    Parallel 19.30 Mcps   36.00 Mcps   38.55 Mcps
         MIT 16.93 Mcps   17.10 Mcps   31.82 Mcps
       Nifty 12.78 Mcps   16.07 Mcps   29.71 Mcps
      Sparse  5.70 Mcps   15.01 Mcps   14.62 Mcps
       Dense  5.30 Mcps   14.11 Mcps   14.56 Mcps
    Iterated  3.60 Mcps    3.84 Mcps    9.24 Mcps

  Mcps = Million counts per second
Which of the several bit counting routines is the fastest? Results of speed trials on an i686 are summarized in the table above. “No Optimization” was compiled with plain gcc. “Some Optimization” was gcc -O3. “Heavy Optimization” corresponds to gcc -O3 -mcpu=i686 -march=i686 -fforce-addr -funroll-loops -frerun-cse-after-loop -frerun-loop-opt -malign-functions=4.
Thanks to Seth Robertson who suggested performing speed trials by extending bitcount.c. Seth also pointed me to the MIT HAKMEM routine. Thanks to Denny Gursky who suggested the idea of Precompute_11bit. That would require three sums (11-bit, 11-bit and 10-bit precomputed counts). I then tried Precompute_16bit which turned out to be even faster.
If you have niftier solutions up your sleeves, please send me an e-mail or write comments below!
Further Reading:
  1. HAKMEM (bit counting is memo number 169), MIT AI Lab, Artificial Intelligence Memo No. 239, February 29, 1972.
  2. Bit Twiddling Hacks by Sean Anderson at Stanford University.
  3. Bitwise Tricks and Techniques by Don Knuth (The Art of Computer Programming, Part IV).

On 15 March 2009, this article was discussed on Reddit — you might learn quite a bit by reading the comments therein.



46 Responses to “Puzzle: Fast Bit Counting”

  1. Hansueli Fuchs Says:
    Written in assembly, maybe the fastest version (apart from the precomputed tables, naturally) is “Sparse Ones”.
    For unsigned long long variables:
    unsigned char bitcount(unsigned long long a){
    // value can’t be greater than 64 and so we can improve the performance
    //by declaring the function as unsigned char
    __asm {
    xor eax, eax // set return to 0
    mov edx, dword ptr a // set edx to the first 32 bits (0-31)
    test edx, edx // compare edx to zero (get e flag)
    je SHORT L2 // jump if equal
    L1:
    mov ebx, edx // set ebx to edx
    sub edx, 1 // decrease edx
    add eax, 1 // increase eax
    and edx, ebx // set edx to edx & ebx (get e flag)
    jne SHORT L1 // jump if not equal zero
    L2:
    mov edx, dword ptr a + 4 // set edx to the second 32 bits (32-63)
    test edx, edx // compare edx to zero (get e flag)
    je SHORT L4 // jump if equal
    L3:
    mov ebx, edx // set ebx to edx
    sub edx, 1 // decrease edx
    add eax, 1 // increase eax
    and edx, ebx // set edx to edx & ebx (get e flag)
    jne SHORT L3 // jump if not equal zero
    L4: // end
    }
    }
    For an unsigned int variable.
    unsigned char bitcount(unsigned int a){
    __asm {
    xor eax, eax // set return to 0
    mov edx, dword ptr a // set edx to the first 32 bits (0-31)
    test edx, edx // compare edx to zero (get e flag)
    je SHORT L2 // jump if equal
    L1:
    mov ebx, edx // set ebx to edx
    sub edx, 1 // decrease edx
    add eax, 1 // increase eax
    and edx, ebx // set edx to edx & ebx (get e flag)
    jne SHORT L1 // jump if not equal zero
    L2: // end
    }
    }
  2. Fidaali Says:
    Isn’t it a bit strange that the nifty parallel actually performs worse than the simple parallel one?
    Also, the modulo operation, as long as it uses a power of two, can be achieved with the & (bitwise AND) instruction.
    y = x % 64 can be written as y = x & 63
    Assembly-wise, the second instruction executes about instantly, but the first one does take time indeed (maybe 5 to 20 times more). Of course it’s possible that the compiler does this optimization. But anyway it’s probably worth it to get rid of the % operator.
    Also I didn’t try any of the code except some home-made nifty parallel (which doesn’t use any % operation), but it’s certainly worth having a regression test which would both validate each given function and make it easier to tune them for different architectures.
    As far as x86 assembly is concerned, it’s probably worth mentioning the SSE instruction POPCNT which counts the number of bits set to 1 ….
    (Now you only have to get your hands on a list of processors that do support this instruction, which may be tricky.)
    For reference purposes, here is what I thought to be equivalent to the nifty parallel bit count. (It’s given as a C #define macro: it assumes that src is a 32-bit operand and that n will hold the result. It hasn’t been tuned for any particular architecture beyond counting a 32-bit operand. There is a very high chance that the implementation is correct, because I use it in some code. But I didn’t design a particular non-regression test suite for it either.)
    It’s probably worth mentioning also that there are no conditional jumps in this code: with modern Intel processors, a conditional jump may easily take the time of 20 shift instructions …
    #define MASK_01010101 (((unsigned int)(-1))/3)
    #define MASK_00110011 (((unsigned int)(-1))/5)
    #define MASK_00001111 (((unsigned int)(-1))/17)
    #define MASK_00001111_2 (((unsigned int)(-1))/257)
    #define bitcountMacro(src,n) \
    { \
    /* printf("Source : %X\n",src); */ \
    n=src; \
    n = (n & MASK_01010101) + ((n >> 1) & MASK_01010101) ; \
    /*printf("Result 1 : %X\n",n);*/ \
    n = (n & MASK_00110011) + ((n >> 2) & MASK_00110011) ; \
    /*printf("Result 2 : %X\n",n); */ \
    n = (n & MASK_00001111) + ((n >> 4) & MASK_00001111) ; \
    /*printf("Result 3 : %X\n",n); */ \
    n = (n & MASK_00001111_2) + ((n >> 8) & MASK_00001111_2) ; \
    /* printf("Result 4 : %X\n",n);*/ \
    n = (n & 65535) + ((n >> 16) & 65535) ; \
    /*printf("Result : %i\n",n); */ \
    }

  3. Georg-Johann Says:
    If no divider hardware is at hand, the mod 255 can be eliminated easily. It is just the digit sum taken over the bytes (taken mod 255).
    Note that the mod 63 is just a digit sum mod 63 where a digit consists of 6 bits (63 is 2**6 - 1).
    Therefore, the MIT algorithm will fail for 64-bit numbers, because the mask in the last line must contain at most 10 of the octal digits 7 (11*6 is greater than 62; each 3-bit packet can hold a value up to 6 at that point).
    So you can fix MIT HACKMEM by
    int bitcount (unsigned long long n)
    {
    unsigned long long tmp;
    tmp = n
    - ((n >> 1) & 0333333333333333333333ull)
    - ((n >> 2) & 0111111111111111111111ull);
    tmp += tmp >> 3;
    return (tmp & 0707070707070707070700ull) % 63
    + (tmp & 7);
    }
    Cheers, Georg-Johann
  4. Paul Says:
    And if I need to compute only the difference of 1-bits in two unsigned integers? Is there something faster than computing the number of 1-bits in the first uint and then subtracting the number of 1-bits in the second uint?
  5. CKret Says:
    To #4 Paul:
    Simplest (and probably fastest) way is to XOR the two values then count the bits.
    int nDifferentBits = BitCount(value1 ^ value2);
    61 (…00111101)
    14 (…00001110)
    These two values have two 1-bits in common but four 1-bits are different.
    61 ^ 14 = 49 (…00110011)
    BitCount(49) = 4;
    If you want to count the 1-bits that are in common use AND instead:
    int nCommonBits = BitCount(value1 & value2);
    61 & 14 = 12 (…00001100)
    BitCount(12) = 2;
  6. CKret Says:
    To #4 Paul:
    Read that a bit (no pun intended) too quickly, so post #5 doesn’t really apply to your question.
  7. balrog Says:
    Notice that Sparse Ones and Parallel Count are the only correct answers to the two questions in the Puzzle (O(b) and O(log n) respectively).
    Precompute and Iterated Count are O(n). Dense Ones is on average O(n). Nifty Parallel Count and HAKMEM can’t be parametrized.
  8. twu Says:
    On Precomp_16, masking of (n >> 16) by 0xffff isn’t necessary, since a right shift on an unsigned int is a logical shift (not an arithmetic shift). That will guarantee that zeroes are present in the upper 16 bits.
  9. Lee Dowling Says:
    I wonder if anyone can help me on a similar problem.
    I have a need to generate, from a given number, the next number whose binary representation has the same bitcount.
    So given a number, say, 5, with binary representation 101, I need the next number with a bitcount of 2, which in this case is 6 (110). However, if run again, I would then need the next number with bitcount 2, which would be 9 (1001).
    There is a potential for going mad in that you could easily overflow, but notification of overflow would be enough rather than actually trying to determine the next number. For example, if asked for “next number with bitcount 2” after 11000000000, it would take quite a long time with a naive implementation to find the next (100000000001) and may well overflow as it adds an extra binary digit.
    My current implementation is naive: a simple loop ++’ing a variable and testing the bitcount each iteration. This is, obviously, a slow way to do such a thing and I have a need to do this lots, quickly, on a slow processor.
    I wonder if anyone knows of a better way?
  10. Grigore Gabriel Says:
    what do you think about this solution?
    void test()
    {
    int n, nr = 3, i = 1;
    n = nr;
    while (nr>1))
    {
    nr = ((n<<1)&(~n));
    n = nr;
    nr = nr | 1;
    i = 1;
    }
    else
    {
    nr = n | (1<<i);
    i++;
    }
    }
    }
  11. Grigore Gabriel Says:
    there are parts missing from the algorithm. let me try again:
    void test()
    {
    int n, nr = 3, i = 1;
    n = nr;
    while (nr>1))
    {
    nr = ((n<<1)&(~n));
    n = nr;
    nr = nr | 1;
    i = 1;
    }
    else
    {
    nr = n | (1<<i);
    i++;
    }
    }
    }
  12. Timo Says:
    I need to count the number of bits in an int for a game I’m working on. It turns out that for my case, at least with my computer (core 2 duo with sse4), the gcc built in function is slow. Since mine is sparse, and known to always be 6 bits or fewer, I gained a significant speed boost by manually loop unrolling the sparse version you provided.
    int sparse_ones_bitcount_6bit(uint64_t n){
    int count = (n > 0);
    if((n &= (n - 1))){ ++count;
    if((n &= (n - 1))){ ++count;
    if((n &= (n - 1))){ ++count;
    if((n &= (n - 1))){ ++count;
    if((n &= (n - 1))){ ++count;
    }}}}}
    return count;
    }

  13.
    [...] i-1; > ++n; > } > return n; > } > > > You can find this and other methods at http://gurmeetsingh.wordpress.com/20…ting-routines/ Ciao [...]
  14. Tomas Says:
    Note that the lookup table versions (precomp_16 and company) require a load from memory unless the tables are already in the cache. Therefore, you may find that real usage of these methods can be a lot slower than your “warm cache” benchmarks.
    • Jakob Heitz Says:
      If the lookup table is not in the cache, then you’re not doing it often enough to care. That is, the code is doing so much other stuff, that any performance increase in the bit-count code will be drowned out by the other stuff.
  15. dbasnett Says:
    Pre-computing all possible masks, I was able to count bits in a 64-bit unsigned integer in around 0.000003000 seconds on my 1.8 GHz laptop.
  16. Alf Says:
    Lee Dowling wants the snoob() function from Hacker’s Delight:
    unsigned snoob(unsigned x) {
    unsigned smallest, ripple, ones;
    // x = xxx0 1111 0000
    smallest = x & -x; // 0000 0001 0000
    ripple = x + smallest; // xxx1 0000 0000
    ones = x ^ ripple; // 0001 1111 0000
    ones = (ones >> 2)/smallest; // 0000 0000 0111
    return ripple | ones; // xxx1 0000 0111
    }
  17. mark Says:
    compute them all ahead of time and look up in linear time if speed is crucial.
  18. bitnewbie Says:
    speaking of bit counting… how would you guys approach this problem: for each set bit, I want to “do” something with the index of that bit as a parameter. e.g. (101)b would foo(0), foo(3).
    naive approach being something like:
    uint64_t x;
    int n = 0;
    while (x) {
    if (x & 1) foo(n);
    x >>= 1;
    n++;
    }
    bit sets are quite sparse and “clustery” if that makes a difference.
    • bitnewbie Says:
      sorry that should be foo(0) and foo(2) (skipping foo(1))
    • ravi Babu Says:
      Hi,
      Is any one would like to help me on this.
      I need to solve the same urgently…
    • Robert Says:
      Hi,
      try this:
      for(i=0; i<8*sizeof(int); i++) {
      if(mask & (1u << i)) {
      // run something on index i
      }
      }
      I think you can get over having an O(number of bits) loop in this case.
    • Mihai Stanimir Says:
      I was looking for the exact thing; I call it a bit array iterator or bit pop iterator.
      It had to be as fast as possible and work well for sparse and dense bit arrays. I code in C# so my solution was influenced by the benchmarks under .NET Framework. I’m using 4 quad words (4×64 bits) but it would also work for single 64 bits or 32 bits.
      Best approach is to generate a list of positions of set bits. For example for 10011100 you would get [3,4,5,8].
      It turns out if you cannot estimate the number of set bits and can afford to precompute sublists then your best bet is to use a word map (I used pointers for this). What this means is to create an array of size 2^16, each position containing a small list of set bit positions. Then what you do is break every quad word in 4 words (4×16 bits), get the precomputed lists and merge them together. Benchmark results show that if you got 50% of bits set then the list generation takes about 9 CPU cycles per list entry. For sparse (2% of bits set) you get 28 cycles per list entry, and for dense (95% of bits set) you get 6 cycles per list entry. This is efficient only if you have to run it many times so you can even out the initial cost of precomputing the long list using a slower algorithm such as Robert’s solution. There’s also a bit memory overhead. Alternatively you can do a similar solution using precomputed bytes instead of precomputed words (at a speed penalty).
      If you can estimate the number of set bits then there are other options:
      According to my benchmarks if over 90% of bits are set then Robert’s algorithm is up to 35% faster (if all bits are set), and there’s no initial cost. However this algorithm can be as slow as 4000 CPU cycles if only a few bits are set.
      If less than 1.2% of bits are set then you’re a bit better with another algorithm, which selects only the next set bit using a little trick. Now using another precomputed array (this time of size 2^22!) you find out the next entry for your final list. CPU cost can go up to 100 cycles for a very sparse array (0.01% bits set). Cost is 34 cycles at 1% bits set, 30 cycles at 50%, and 26 cycles at 95%. This algorithm has a high overhead (both CPU and RAM) and is only useful if you run it many times to even out the overhead cost.
      All these algorithms are created with the CPU cost in mind. They have large initial overhead and require some memory.
      I hope my solutions help (although the implementation is rather complex, which is why I’m omitting it) and that my benchmark results will give you an idea of the bit array iterator overhead.
      I’m still working on improving the algorithms as I use them for NP problem solving.
      Let me know if I can help any further.

  19.
    Have you tried the method from here:

    The best method for counting bits in a 32-bit integer v is the following:
    v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
    c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
    The best bit counting method takes only 12 operations, which is the same as the lookup-table method, but avoids the memory and potential cache misses of a table. It is a hybrid between the purely parallel method above and the earlier methods using multiplies (in the section on counting bits with 64-bit instructions), though it doesn’t use 64-bit instructions. The counts of bits set in the bytes are done in parallel, and the sum total of the bits set in the bytes is computed by multiplying by 0x1010101 and shifting right 24 bits.
    It manages to avoid the division/modulo operations.
  20. Andrew Dalke Says:
    See also http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html, where I benchmark implementations for popcounts of large data sizes. My test size has 16777216 bits instead of 32 bits, which of course has different performance characteristics. I implemented a couple of variations not mentioned here, and gave a few more links to other resources.
  21. Jim Sawyer Says:
    Answering Paul’s question, i.e.
    the magnitude of the difference
    between the number of one bits
    in two numbers:
    Integer>>bitCountDifference: theOther
    | myself |
    myself := self.
    [ myself == 0 ifTrue: [^theOther bitCount].
    theOther == 0 ifTrue: [^myself bitCount].
    myself := myself bitAnd: myself - 1.
    theOther := theOther bitAnd: theOther - 1.
    ] repeat
    Usage:
    diff := anInteger bitCountDifference: anotherInteger
    Notes:
    1) You’ll have to translate back into C,
    or whatever your target language is.
    2) Use whichever definition of #bitCount you prefer.
    3) For more speed, inline the calls to #bitCount.
    -Jim
  22. Kaleb Says:
    To: Timo
    Maybe you already knew this. If you don’t compile with -mpopcnt, gcc won’t emit the SSE4 popcnt insns and you get calls to __popcount[sd]i2, which are just table lookups.
    Unfortunately though my benchmarks on an intel EMT64 (running 64-bit mode) show that the popcnt insns aren’t measurably faster than either Parallel or Nifty Parallel.
  23. typo Says:
    The 16 bit table lookups can be fast but for some people the 64k byte table can be a big waste of memory, particularly for small embedded systems.

  24.
    [...] Hacks HAKMEM Item 169: MIT AI Memo 239, Feb. 29, 1972 Discussion of the Problem on StackOverflow Gurmeet Singh’s Weblog with C code Jesse Farmer’s Blog Magic [...]
  25. Amol Says:
    I’ve a doubt in Precompute_8bit method.
    We are using a static integer array. By default it will be initialised by 0.
    Now, if we AND the passed integer value with 0xffu to determine the number of 1 bits set; suppose the number comes out to be 2, then bits_in_char[2] will be 0 and not the number of 1s set. The same will be the case for the next three statements.
    So, we will always get 0 as the answer. This is wrong.
    • Mihai Stanimir Says:
      Amol, the Precompute_8bit (same with 16bit) method uses a precomputed table. You will have to first populate the array prior to accessing the method. The code doesn’t explicitly show how to precompute it but basically you loop through all 256 values and run another method to compute the value, such as nifty parallel or any other not requiring precomputed data.

  26.
    In 25 years of programming, I’ve never needed to count bits! Despite that, it makes an interesting problem and I’m implementing them all in Forth to find out which performs best :-)
  27. Anant Says:
    Any fast method to count number of 1-bits upto the kth bit position in a 32-bit integer?
  28. Andrew Dalke Says:
    How fast do you want, Anant? You can always apply a bitmask to the input before doing the normal popcount. Either through bit ops (like (1 << k) - 1) or a 32-value lookup table. If there's a distribution of k values (as an extreme, consider when k=1 is used 99% of the time) then you might be able to special-case for extra speed in the common case.
  29. Nikhil Says:
    Thanks a lot.
    It was really helpful.
  30. Nuomnicron Says:
    I was playing around with the Nifty Parallel count and found the same solution as Fidaali had, but I stayed a little more consistent with the previous code. This removes the need for mod by extending the idea 2 more steps:
    #define MASK_0101010101010101 (((unsigned int)(-1))/3)
    #define MASK_0011001100110011 (((unsigned int)(-1))/5)
    #define MASK_0000111100001111 (((unsigned int)(-1))/17)
    #define MASK_0000000011111111 (((unsigned int)(-1))/257)
    #define MASK_1111111111111111 (((unsigned int)(-1))/65537)
    int bitcount (unsigned int n)
    {
    n = (n & MASK_0101010101010101) + ((n >> 1) & MASK_0101010101010101) ;
    n = (n & MASK_0011001100110011) + ((n >> 2) & MASK_0011001100110011) ;
    n = (n & MASK_0000111100001111) + ((n >> 4) & MASK_0000111100001111) ;
    n = (n & MASK_0000000011111111) + ((n >> 8) & MASK_0000000011111111) ;
    n = (n & MASK_1111111111111111) + ((n >> 16) & MASK_1111111111111111) ;
    return n;
    }
  31. Seth Robertson Says:
    Going through the comments I have added the two additional implementations I saw, from Nuomnicron and from Rob Mueller (really Seander), to the speed test of the bitcounts which I originally did for manku when I notified him of the MIT HAKMEM implementation.
    What is surprising to me is that the POPCNT assembler instruction didn’t perform better than precomputed16 and that parallel and Nuomnicron’s version of Nifty are exactly the same speed. The test (best of 5 tries for each implementation) was done on a X5550 2.67GHz Xeon with gcc 4.1.2 using -O4 -mpopcnt -mtune=core2 -march=core2.
    The source code is available from http://www.baka.org/private/bitcount.c
    POPCNT:   504540 cnts/ms
    precomputed16:   504540 cnts/ms
      precomputed:   431965 cnts/ms
          seander:   345184 cnts/ms
         nuonifty:   227376 cnts/ms
         parallel:   227376 cnts/ms
              mit:   202634 cnts/ms
            nifty:   201572 cnts/ms
           sparse:    61477 cnts/ms
            dense:    53450 cnts/ms
         iterated:    41452 cnts/ms
    

  32.
    [...] next task is to determine the number of on bits in a given bitmask.  I found several good algorithms in C, but the most clever of these required foreknowledge of the number of bits in the value.  [...]
  33. Osor77 Says:
    I find this discussion very interesting. I am a student and I’ve followed all you’ve been saying relatively easily, however, I’m still puzzled about the Nifty Parallel Count algorithm.
    Can anyone explain why n % 255 results in the 4 bytes being summed? I achieved the same with
    (n & 0xFF) + ((n >> 8) & 0xFF) + ((n >> 16) & 0xFF) + ((n >> 24) & 0xFF), but I don’t understand why this is equivalent to n % 255. The article said that a little thought would explain why it works, but it looks like I’m too thick for that and I haven’t been able to find clues on the web.
    • Mihai Stanimir Says:
      Hi, I didn’t really study this but I think I have an idea how % 255 works.
      It should work with other numbers too; the idea is to use 2^b - 1 where b is the number of bits in a group.
      First, please note n%255 does not sum the bytes of just any value: the byte values must be lower than 255 and their sum must be lower than 255. That said, let’s start with a simpler example of only 2 bytes instead of 4. Call them together AB where A is the most significant one and B is the least significant one. If you take B alone, B%255=B (for B<255). Now take AB-B which is A*256. (A*256)%255 = (A*(255+1))%255 = A*255%255+A%255 = 0+A%255=A (for A<255). So AB%255 = (A*256 + B)%255 = (A*256)%255+B%255 = A + B.
      You can now generalize this for any number of bytes. For 4 bytes you'd have
      ABCD%255 =
      (((A*256+B)*256+C)*256+D)%255=
      (((A*256+B)*256+C)*(255+1))%255+D%255=
      ((A*256+B)*256+C)%255+D%255=
      ((A*256+B)*(255+1))%255+C%255+D%255=
      (A*256+B)%255+C%255+D%255=
      (A*(255+1))%255+B%255+C%255+D%255=
      A%255+B%255+C%255+D%255=
      A+B+C+D
      This works also for 8 bytes (for 64 bit processors). Just make sure all bytes are smaller than 255 and their sum is smaller than 255.
  34. Osor77 Says:
    Hi Mihai,
    your explanation is very clear and makes sense.
    I get it now.
    Thanks
  35. Ghostbusters Says:
    I sent the code below to RIM after an interview (ran out of time) to explain futility of weird/complicated solutions when 8bit_lookup is easy to understand/maintain and works for all data types (no div mod or shifts):
    [CODE]
    template <typename T> //8 16 32 or 64bit signed/unsigned.. use union for float/double etc
    inline uint8_t bitCountA(T i){
    uint8_t c = i & 01;
    i /= 2; // FIX: cast to unsigned, or use divide to clear highest bit
    while (i){
    c += (i & 01);
    i >>= 1;
    }
    return c;
    }
    [/CODE]
    I haven’t tested out “Parallel” cause nested macros are confusing, but used the idea to come up with something similar (looks just like Seander’s solution):
    [CODE]
    //use “recursive” count #1s in 2bit block, sum 2bit blocks, 4bit blocks etc
    template <typename uint>
    inline uint8_t bitCount_char_E(uint i){
    i -= (0xAAAAAAAAAAAAAAAA & i)>>1; //each 2bit block has #1s in that block
    //to sum 2bit blocks, need clear upper 2bit block for overflow (010+010=100)
    i = (0x3333333333333333 & i) + ((0xCCCCCCCCCCCCCCCC & i)>>2);
    i += (i>>4); //add up 1s in 4bit blocks
    i &= 0x0F0F0F0F0F0F0F0F; //clear upper 4bit blocks (reqd for uint_16)
    switch (sizeof(uint)){ //use switch case “fall-through”
    case 8: i += i>>32; //if (sizeof(uint)>=8) i += i>>32;
    case 4: i += i>>16; //if (sizeof(uint)>=4) i += i>>16;
    case 2: i += i>>8; //if (sizeof(uint)>=2) i += i>>8;
    }
    return i; //return char – lower 8bits only
    }
    [/CODE]
  36. Ghostbusters Says:
    GCC4.5 OPT3(HUGE difference on lookups) 2.8Ghz Core2Duo: from I64u max (FF…) down 1 000 000 000:
    16b_lookup = 2891ms ~8.1 clock cycles
    Seanders__ = 4421ms ~12.4 clock cycles
    8b_lookup_ = 5325ms ~14.9 clock cycles
    my_recursi = 5757ms ~16.1 clock cycles (woohoo!)
    Kernighan_ = 6708ms
    __naive___ = 61148ms
    Without GCC Opt, lookups did poorly: only naive was worse. But with OPT3, wham: low-latency L1 + large cache line = already preloaded next table value (testing consecutive numbers).
    Finally, Seth Robertson has EVEN FASTER lookup results on his slower Xeon (less than 6 cycles!!); his compiler arg string gives worse results for me.
  38. yves Says:
    Didn’t see a recursive solution yet :-)
    Not really efficient – just recursive…
    int z(unsigned n, int count);
    int f(unsigned n, int count);
    int (*pf[2])(unsigned n, int count) = { z,f };
    int f(unsigned n, int count)
    {
    return (*pf[n > 0])(n >> 1, count+(n & 1));
    }
    int z(unsigned n, int count)
    {
    return count;
    }

    printf("%d\n", f(my_number, 0));
