Cardinality Estimation in Linear Time using Sub-Linear Space


The cardinality of a collection A (which might be an ordered or unordered list, a set, or whatnot) is the number of unique values in A. For example, the collections [1,2,3,4] and [1,2,1,3,1,4,3] have the same cardinality of 4 (and also correspond to the same set).

Determining the Cardinality of a Collection: The Naive Approach

Consider a collection A=[1,2,1,3,1,4,3]. How can we systematically determine the cardinality of A? Well, here are two of many ways to do this:

  1. First sort A in ascending order. Then perform a linear scan on A to remove duplicates, which is easy because equal values are now adjacent. Finally, return the size of the resulting duplicate-free collection (now a set). If the initial size of A is n, this method determines the cardinality of A in O(n\, log\, n) time and, with an in-place sort such as heapsort, O(1) extra space (merge-sort gives the same time bound but typically needs O(n) extra space).
  2. Use a hash table: perform a linear scan of A, hashing the values of A. The cardinality of A is the number of keys in the resulting hash table. This uses O(n) expected time but also O(n) extra space.
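Both approaches can be sketched in a few lines of Python. This is only an illustration: the function names are mine, and Python's built-in sort is not an O(1)-space sort, so the first function shows the idea of the scan rather than the space bound.

```python
def cardinality_by_sorting(values):
    """Sort, then count the positions where the value changes."""
    ordered = sorted(values)   # O(n log n) time; note that sorted() makes a copy
    count = 0
    previous = object()        # sentinel that compares unequal to every real value
    for value in ordered:
        if value != previous:  # equal values are adjacent after sorting
            count += 1
            previous = value
    return count


def cardinality_by_hashing(values):
    """Insert every value into a hash set and count the keys."""
    seen = set()
    for value in values:       # O(n) expected time, O(n) extra space
        seen.add(value)
    return len(seen)


A = [1, 2, 1, 3, 1, 4, 3]
print(cardinality_by_sorting(A), cardinality_by_hashing(A))
```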

Notice that we can’t do any better (lower upper-bound) than O(n) because we have to look at the entire input (which is of size n). But can we determine the cardinality of A in O(n) time using sub-linear space (using strictly smaller space than n)?

That’s where probability comes in.

Linear Probabilistic Counting

This is a probabilistic algorithm for counting the number of unique values in a collection. It produces an estimate whose accuracy can be pre-specified by the user, using only a small amount of space that can also be pre-specified. The accuracy of linear counting depends on the load factor (think hash tables), which is the number of unique values in the collection divided by the size of the bit map the algorithm uses. The larger the load factor, the less accurate the result of the linear probabilistic counter. Correspondingly, the smaller the load factor, the more accurate the result. Nevertheless, according to the paper[1], load factors much higher than 1 (e.g. 12) can be used while still achieving high accuracy in cardinality estimation (e.g. 1% error).

Note: in simple hashing, the load factor cannot exceed 1 but in linear counting, the load factor can exceed 1.

Linear counting is a two-step process. In step 1, the algorithm allocates a bit map of a specific size in main memory. Let this size be some integer m. All entries in the bit map are initialized to “0”. The algorithm then scans the collection and applies a hash function to each data value in the collection. The hash function generates a bit map address and the algorithm sets this addressed bit to “1”. In step 2, the algorithm first counts the number of empty bit map entries (equivalently, the number of entries still set to “0”). It then divides this count by the bit map size m, thus obtaining the fraction of empty bit map entries. Call this fraction V_n.

Plug V_n and m into the equation r = -m\, log_e\, V_n to obtain r, which is the estimated cardinality of the collection. Since V_n is at most 1, its logarithm is at most 0, so the minus sign makes r non-negative. The derivation of this equation is detailed in this paper[1].
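The two steps can be sketched in Python as follows. This is a minimal illustration under assumptions of mine, not a reference implementation: SHA-256 over repr(value) stands in for the hash function, the bit map is a bytearray with one byte per entry for readability (a real bit map would pack eight entries per byte), and the name linear_count is hypothetical. The estimator used is -m ln(V_n), as derived in the paper[1].

```python
import hashlib
import math


def linear_count(values, m):
    """Estimate the number of distinct values using an m-entry bit map."""
    bitmap = bytearray(m)                      # step 1: all entries start at 0
    for value in values:
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        address = int.from_bytes(digest[:8], "big") % m
        bitmap[address] = 1                    # duplicates hit the same entry
    empty = m - sum(bitmap)                    # step 2: count entries still 0
    if empty == 0:
        raise ValueError("bit map is full; choose a larger m")
    v_n = empty / m                            # fraction of empty entries
    return -m * math.log(v_n)                  # estimated cardinality


# 10,000 values but only 1,000 distinct ones, with a 4,096-entry bit map.
data = [i % 1000 for i in range(10000)]
print(round(linear_count(data, 4096)))
```

The memory used depends only on m, not on the size of the collection. When every entry ends up set, the estimator is undefined, so this sketch raises an error instead of returning a value.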

Here’s my simple implementation of the Linear Probabilistic Counting Algorithm.

Errors in Counting

Although the linear probabilistic counting method uses far less space than the deterministic approaches, it only produces an estimate, and the estimate becomes less accurate as the load factor grows, as explained above. So this method should be used only when 100% accuracy is not needed, for example, in determining the number of unique visitors to a website.

Probabilistic Alternatives to Linear Probabilistic Counting

The HyperLogLog algorithm is such an alternative. It also runs in linear time (linear in the size of the initial collection). But the HyperLogLog counter usually gives a more accurate estimation of the cardinality count and also uses less space. See this paper for more details on the HyperLogLog counter.


[1] A Linear-Time Probabilistic Counting Algorithm for Database Applications

An Application of Linear Programming in Game Theory

I took the Combinatorial Optimization class at AIT Budapest (Aquincum Institute of Technology) with David Szeszler, a Professor at the Budapest University of Technology and Economics. We touched on some Graph Theory, Linear Programming, Integer Programming, the Assignment Problem, and the Hungarian method. My favorite class in the course was focused on applying Linear Programming in Game Theory. I’ll summarize the most important aspects of that class in this blog post. I hope this piques your interest in Game Theory (and in attending AIT).

Basics of Linear Programming

First, I want to touch on some topics in Linear Programming for those who don’t know much about setting up a linear program (which is basically a system of linear inequalities with a maximization function or a minimization function). You can skip this section if you are confident about the subject.

Linear Programming is basically a field of mathematics that has to do with determining the optimum value in a feasible region. In determining the optimum value, one of two questions can be asked: find the minimum point/region or find the maximum point/region. The feasible region in a linear program is determined by a set of linear inequalities. For a feasible region to even exist, the set of linear inequalities must be solvable.

A typical linear program is given in this form: max\{cx: Ax \leq b\}. c is a row vector of dimension n. A is an m \times n matrix called the constraint matrix. b is a column vector of dimension m. x is a column vector of dimension n. This is called the primal program. The primal program is used to solve maximization problems. The dual of this primal program is of the form min\{yb: yA = c, y \geq 0\}. b, A, c are the same as previously defined. y is a row vector of dimension m. This is called the dual program. The dual is derived from the primal program and is a minimization problem.

Having introduced primal and dual programs, the next important result in line is the duality theorem. The duality theorem states that max\{cx: Ax \leq b\} = min\{yb: yA=c, y \geq 0\}. In other words, the maximum of the primal program is equal to the minimum of the dual program (provided that the primal program is solvable and bounded from above). Using this “tool”, every minimization problem can be converted to a maximization problem and vice versa (as long as the initial problem involves a system of linear inequalities that can be set up as a linear program with a finite number of linear constraints and one objective function).
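To see the duality theorem at work on a small instance, here is a sketch in Python using exact fractions. The data A, b, c are hypothetical numbers I picked for illustration, and solve_2x2 is a helper name of my own. The sketch finds a primal candidate x where both constraints are tight and the dual candidate y with yA = c, then checks that the two objective values coincide. Because cx = yAx \leq yb for every feasible pair, equal values certify that both candidates are optimal.

```python
from fractions import Fraction as F

# Hypothetical data for max{cx : Ax <= b}.
A = [[F(1), F(2)],
     [F(3), F(1)]]
b = [F(4), F(6)]
c = [F(1), F(1)]


def solve_2x2(M, rhs):
    """Solve M z = rhs for a 2x2 matrix M using Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    z0 = (rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det
    z1 = (M[0][0] * rhs[1] - rhs[0] * M[1][0]) / det
    return [z0, z1]


# Primal candidate: the point where both constraints hold with equality.
x = solve_2x2(A, b)
primal_value = c[0] * x[0] + c[1] * x[1]

# Dual candidate: the y with yA = c, found by solving with A transposed.
A_transpose = [[A[0][0], A[1][0]],
               [A[0][1], A[1][1]]]
y = solve_2x2(A_transpose, c)
dual_value = y[0] * b[0] + y[1] * b[1]

assert all(entry >= 0 for entry in y)   # y is feasible for the dual
assert primal_value == dual_value       # so both candidates are optimal
print(x, y, primal_value)
```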

There are linear program solvers out there (both open-source and commercial). Most linear program solvers are based on the simplex method. I acknowledge that the summary of Linear Programming given here is devoid of some details. Linear programming is a large field that cannot be wholly summarized in a few sentences. For more information on linear programming, check out this Wikipedia page.

Sample Game Theory Problem

Suppose that I and my roommate Nick are playing a game called Yo!. The game rules are as follows: if we both say Yo!, I get $2. If I say Yo! but Nick says YoYo!, I lose $3. On the other hand, if we both say YoYo!, I get $4. If I say YoYo! but Nick says Yo!, I lose $3. The rules are summarized in the table below:

                Daniel: Yo!    Daniel: YoYo!
Nick: Yo!       $2             $-3
Nick: YoYo!     $-3            $4

The values make up the payoff matrix, written from my (Daniel’s) point of view. When Daniel gains, Nick loses. When Nick gains, Daniel loses. A negative value (e.g. $-3) indicates that Daniel loses but Nick gains. Now the question surfaces: is there a smart way of playing this game so that I always win? Of course, if I could predict Nick’s next move all the time, then I’d certainly play to win. But I can’t. I must come up with a strategy that reduces the risk of me losing to a minimum and increases my chance of winning. In other words, I want to maximize my minimum expected value. So I wish to know how often I should say Yo! and how often I should say YoYo!. This problem is equivalent to trying to find a probability column vector of dimension 2 (for the two possible responses Yo!, YoYo!). Such a probability vector is called a mixed strategy. For example, a mixed strategy for Daniel could be the column vector: (1/4 \ 3/4)^T. This translates to saying YoYo! three-quarters of the time and saying Yo! a quarter of the time. If Nick says Yo!, my expected value is then 1/4*2 + 3/4*(-3) = -7/4. This mixed strategy doesn’t seem optimal! In fact, it’s not, as we’ll see later!
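The expected value of a mixed strategy can be checked against each of Nick’s two possible responses. The sketch below is illustrative: it assumes the rows of the payoff matrix are Nick’s choices and the columns are Daniel’s, and the function name expected_payoffs is my own.

```python
from fractions import Fraction as F

# Payoff matrix: rows are Nick's choices, columns are Daniel's (Yo!, YoYo!).
A = [[F(2), F(-3)],
     [F(-3), F(4)]]


def expected_payoffs(A, x):
    """Return Ax: Daniel's expected payoff against each pure choice of Nick."""
    return [sum(a * p for a, p in zip(row, x)) for row in A]


x = [F(1, 4), F(3, 4)]          # Yo! a quarter of the time, YoYo! otherwise
payoffs = expected_payoffs(A, x)
worst_case = min(payoffs)       # what Daniel can count on in expectation
print(payoffs, worst_case)
```

Against Yo! the expected payoff is -7/4 and against YoYo! it is 9/4, so the worst case for this mixed strategy is -7/4.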

This kind of Game Theory problem where we wish to obtain an optimal mixed strategy for the Column player (in this case, Daniel) and an optimal mixed strategy for the Row player (in this case, Nick) is called a Two-player, zero sum game. A mixed strategy for the Column player is an n-dimensional probability vector x; that is, a column vector with nonnegative entries that add up to 1. The i^{th} entry of the mixed strategy measures the probability that the Column player will choose the i^{th} column. In any Two-player, zero sum game, the problem is to maximize the worst-case gain of the Column player which is equivalent to finding

max\{min(Ax) : x is a probability vector \} where A represents the payoff matrix

Analogously, the problem of minimizing the worst-case loss of the Row player is equivalent to finding

min\{max(yA) : y is a probability vector \} where A is the payoff matrix

There’s a theorem that states that max\{min(Ax) : x is a probability vector \} = min\{max(yA) : y is a probability vector \} = \mu. We call \mu the common value of the game. This theorem is called the Minimax Theorem.

Minimax Theorem

The Minimax Theorem was proved by John von Neumann (one of the greatest polymaths of all time, I think). It states that “For every two-player, zero sum game the maximum of the minimum expected gain of the Column player is equal to the minimum of the maximum expected losses of the Row player”. In other words, there exist optimum mixed strategies x and y for the Column player and the Row player respectively and a common value \mu such that

  1. No matter how the Row player plays, x guarantees an expected gain of at least \mu to the Column player, and
  2. No matter how the Column player plays, y guarantees an expected loss of at most \mu to the Row player.

Solving the Two-Player, Zero Sum Game

Now let’s try to solve the Yo! game. First, we aim to obtain the mixed strategy for the Column player. Let x be the mixed strategy where x = (x_1, x_2)^T for which x_1, x_2 \geq 0 and x_1 + x_2 = 1. We wish to find the maximum of min(Ax) where A is the payoff matrix. To make this into a linear program, we introduce a variable \mu = min(Ax), so \mu is the worst-case gain of Daniel. We wish to maximize \mu. Since \mu is the minimum entry of Ax, it is at most each entry of Ax, and we obtain the following linear constraints

  • 2x_1-3x_2-\mu \geq 0
  • -3x_1+4x_2-\mu \geq 0
  • x_1 + x_2 = 1
  • x_1, x_2 \geq 0

Solving the linear program gives us x_1=7/12, x_2=5/12 and \mu = -1/12. So the optimal mixed strategy for the Column player is x = (7/12 \ 5/12)^T. This means that if Daniel says Yo! 7/12 of the time and YoYo! 5/12 of the time, his worst-case expected gain will be -1/12. In other words, Daniel will lose at most $1/12 per round on average no matter how Nick plays. According to the minimax theorem, this is optimal.
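For a 2-by-2 game, this solution can be checked without a general linear program solver. With x = (x_1, 1 - x_1), each entry of Ax is a linear function of x_1, so min(Ax) is a concave piecewise-linear function whose maximum over [0, 1] lies at an endpoint or where the two lines cross. The sketch below relies on that observation rather than on the simplex method; the function names are hypothetical.

```python
from fractions import Fraction as F

# Payoff matrix: rows are Nick's choices, columns are Daniel's (Yo!, YoYo!).
A = [[F(2), F(-3)],
     [F(-3), F(4)]]


def worst_case_gain(A, x1):
    """min(Ax) for the mixed strategy x = (x1, 1 - x1)."""
    x = [x1, 1 - x1]
    return min(sum(a * p for a, p in zip(row, x)) for row in A)


def solve_column_player(A):
    """Maximize min(Ax) over probability vectors x for a 2x2 payoff matrix."""
    candidates = [F(0), F(1)]
    # Row i contributes (A[i][0] - A[i][1]) * x1 + A[i][1]; find the crossing.
    slope_gap = (A[0][0] - A[0][1]) - (A[1][0] - A[1][1])
    if slope_gap != 0:
        crossing = (A[1][1] - A[0][1]) / slope_gap
        if 0 <= crossing <= 1:
            candidates.append(crossing)
    best_x1 = max(candidates, key=lambda x1: worst_case_gain(A, x1))
    return [best_x1, 1 - best_x1], worst_case_gain(A, best_x1)


x, mu = solve_column_player(A)
print(x, mu)
```

For larger payoff matrices this shortcut no longer applies and a proper linear program solver is the right tool.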

Note that this doesn’t mean that Daniel will lose every round, but that his expected loss is at most $1/12 per round. If Nick doesn’t play optimally (Nick doesn’t use his optimal mixed strategy), Daniel can do better than this by adjusting his own play to exploit Nick’s!

Nick could obtain his optimal strategy by solving the dual of the primal program to obtain the vector y which will be his optimal mixed strategy.
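As a sketch of what Nick’s side looks like for this particular 2-by-2 game, the code below minimizes max(yA) over y = (y_1, 1 - y_1) directly, instead of setting up and solving the dual program. Each entry of yA is linear in y_1, so max(yA) is convex and piecewise linear, and its minimum over [0, 1] lies at an endpoint or at the crossing point. The function names are hypothetical.

```python
from fractions import Fraction as F

# Payoff matrix: rows are Nick's choices, columns are Daniel's (Yo!, YoYo!).
A = [[F(2), F(-3)],
     [F(-3), F(4)]]


def worst_case_loss(A, y1):
    """max(yA) for Nick's mixed strategy y = (y1, 1 - y1)."""
    y = [y1, 1 - y1]
    return max(sum(a * q for a, q in zip(column, y)) for column in zip(*A))


def solve_row_player(A):
    """Minimize max(yA) over probability vectors y for a 2x2 payoff matrix."""
    candidates = [F(0), F(1)]
    # Column j contributes (A[0][j] - A[1][j]) * y1 + A[1][j]; find the crossing.
    slope_gap = (A[0][0] - A[1][0]) - (A[0][1] - A[1][1])
    if slope_gap != 0:
        crossing = (A[1][1] - A[1][0]) / slope_gap
        if 0 <= crossing <= 1:
            candidates.append(crossing)
    best_y1 = min(candidates, key=lambda y1: worst_case_loss(A, y1))
    return [best_y1, 1 - best_y1], worst_case_loss(A, best_y1)


y, value = solve_row_player(A)
print(y, value)
```

Nick’s optimal mixed strategy comes out as y = (7/12, 5/12) with max(yA) = -1/12, the same common value \mu that the Column player’s program produces, as the minimax theorem predicts.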

The minimax theorem is an interesting and very useful application of Linear Programming in Game Theory. Two-player, zero sum games can also be analyzed using the Nash Equilibrium, which is very closely related to the minimax theorem but applies to two or more players. The Nash Equilibrium was first proposed by John Nash. There are many two-player games, including poker, other card games, betting games, and so on. As a result, Linear Programming is used in the Casino!