<?xml version="1.0" encoding="utf-8"?> 
<rss version="2.0">
 <channel>
  <title>SHRIPHANI PALAKODETY: Posts tagged 'probabilistic-counting'</title>
  <description>SHRIPHANI PALAKODETY: Posts tagged 'probabilistic-counting'</description>
  <link>http://blog.shriphani.com/tags/probabilistic-counting.html</link>
  <lastBuildDate>Sat, 14 Dec 2013 04:12:33 UT</lastBuildDate>
  <pubDate>Sat, 14 Dec 2013 04:12:33 UT</pubDate>
  <ttl>1800</ttl>
  <item>
   <title>Probabilistic Counting</title>
   <link>http://blog.shriphani.com/2013/12/13/probabilistic-counting/?utm_source=probabilistic-counting&amp;utm_medium=RSS</link>
   <guid>urn:http-blog-shriphani-com:-2013-12-13-probabilistic-counting</guid>
   <pubDate>Sat, 14 Dec 2013 04:12:33 UT</pubDate>
   <description>&lt;html&gt;
&lt;p&gt;I recently got my hands on the common-crawl URL dataset from &lt;a href="https://archive.org/details/2013_common_crawl_index_urls"&gt;here&lt;/a&gt; and wanted to compute some stats like the number of unique domains contained in the corpus. This was the perfect opportunity to whip up a script that implemented Durande et. al&amp;rsquo;s &lt;a href="http://algo.inria.fr/flajolet/Publications/DuFl03.pdf"&gt;log-log paper&lt;/a&gt;.&lt;/p&gt;
&lt;!-- more--&gt;

&lt;p&gt;The algorithm operates on a stream and produces an estimate for the cardinality of the input stream. Here&amp;rsquo;s the algorithm itself:&lt;/p&gt;

&lt;div class="brush: clojure"&gt;
 &lt;table class="sourcetable"&gt;
  &lt;tbody&gt;
   &lt;tr&gt;
    &lt;td class="linenos"&gt;
     &lt;div class="linenodiv"&gt;
      &lt;pre&gt; 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47&lt;/pre&gt;&lt;/div&gt;&lt;/td&gt;
    &lt;td class="code"&gt;
     &lt;div class="source"&gt;
      &lt;pre&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;ns &lt;/span&gt;&lt;span class="nv"&gt;probabilistic-counting.log-log&lt;/span&gt;
  &lt;span class="s"&gt;"The LogLog algorithnm"&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:use&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;incanter.core&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:import&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;org.apache.mahout.math&lt;/span&gt; &lt;span class="nv"&gt;MurmurHash&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
           &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;org.apache.commons.lang3&lt;/span&gt; &lt;span class="nv"&gt;SerializationUtils&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;rho&lt;/span&gt;
  &lt;span class="s"&gt;"Number of leading zeros in the bit-representation&lt;/span&gt;
&lt;span class="s"&gt;   Args:&lt;/span&gt;
&lt;span class="s"&gt;    y : the number itself&lt;/span&gt;
&lt;span class="s"&gt;    size : optional : the number of bytes used to represent&lt;/span&gt;
&lt;span class="s"&gt;           the number. Default: 4 bytes/32 bits"&lt;/span&gt;
  &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nv"&gt;y&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rho&lt;/span&gt; &lt;span class="nv"&gt;y&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  
  &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nv"&gt;y&lt;/span&gt; &lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;- &lt;/span&gt;&lt;span class="nv"&gt;size&lt;/span&gt;
         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Math/ceil&lt;/span&gt;
          &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;/ &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Math/log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;+ &lt;/span&gt;&lt;span class="nv"&gt;y&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
             &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Math/log&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))))))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;alpha&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;&amp;lt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt; &lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="mf"&gt;0.79402&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;* &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Math/pow&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;* &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;gamma&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;- &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;/ &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
                      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;/ &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;- &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Math/pow&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;/ &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
                         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Math/log&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
                   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;- &lt;/span&gt;&lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)))))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;log-log&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;xs&lt;/span&gt; &lt;span class="nv"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;let &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;m&lt;/span&gt;       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Math/pow&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="nv"&gt;k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nv"&gt;buckets&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;make-array &lt;/span&gt;&lt;span class="nv"&gt;Integer/TYPE&lt;/span&gt; &lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;do&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map &lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;aset &lt;/span&gt;&lt;span class="nv"&gt;buckets&lt;/span&gt; &lt;span class="nv"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range &lt;/span&gt;&lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;doseq &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt; &lt;span class="nv"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;let &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;h&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;MurmurHash/hash&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;SerializationUtils/serialize&lt;/span&gt; &lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="mi"&gt;1991&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
              &lt;span class="nv"&gt;idx&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;bit-and &lt;/span&gt;&lt;span class="nv"&gt;h&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;- &lt;/span&gt;&lt;span class="nv"&gt;m&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
              &lt;span class="nb"&gt;val &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;
                   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;aget &lt;/span&gt;&lt;span class="nv"&gt;buckets&lt;/span&gt; &lt;span class="nv"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rho&lt;/span&gt; &lt;span class="nv"&gt;h&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
          &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;aset &lt;/span&gt;&lt;span class="nv"&gt;buckets&lt;/span&gt; &lt;span class="nv"&gt;idx&lt;/span&gt; &lt;span class="nv"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;* &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Math/pow&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;/ &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;apply + &lt;/span&gt;&lt;span class="nv"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
         &lt;span class="nv"&gt;m&lt;/span&gt;
         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;alpha&lt;/span&gt; &lt;span class="nv"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)))))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;And here&amp;rsquo;s a test (actual cardinality = 1,000,000):&lt;/p&gt;

&lt;div class="brush: clojure"&gt;
 &lt;table class="sourcetable"&gt;
  &lt;tbody&gt;
   &lt;tr&gt;
    &lt;td class="linenos"&gt;
     &lt;div class="linenodiv"&gt;
      &lt;pre&gt;1
2
3&lt;/pre&gt;&lt;/div&gt;&lt;/td&gt;
    &lt;td class="code"&gt;
     &lt;div class="source"&gt;
      &lt;pre&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;demo-log-log&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;log-log&lt;/span&gt;
   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map &lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mod&lt;/span&gt; &lt;span class="nv"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range &lt;/span&gt;&lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;When run, we get:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;user&amp;gt; (demo-log-log)
1023513.5806580923&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;which is an error of 2.3%.&lt;/p&gt;

&lt;p&gt;Now, let us deploy it on the URL dataset. There are 2,412,755,840 URLs in the format: &lt;code&gt;host_reversed/path/scheme&lt;/code&gt;. The following piece of code constructs the host stream and estimates the cardinality.&lt;/p&gt;

&lt;div class="brush: clojure"&gt;
 &lt;table class="sourcetable"&gt;
  &lt;tbody&gt;
   &lt;tr&gt;
    &lt;td class="linenos"&gt;
     &lt;div class="linenodiv"&gt;
      &lt;pre&gt; 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28&lt;/pre&gt;&lt;/div&gt;&lt;/td&gt;
    &lt;td class="code"&gt;
     &lt;div class="source"&gt;
      &lt;pre&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;ns &lt;/span&gt;&lt;span class="nv"&gt;probabilistic-counting.demo-urls&lt;/span&gt;
  &lt;span class="s"&gt;"Estimate the number of unique domains in the Common crawl dataset"&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:use&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;probabilistic-counting.log-log&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;urls-stream&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;url-file&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;-&amp;gt; &lt;/span&gt;&lt;span class="nv"&gt;url-file&lt;/span&gt;
      &lt;span class="nv"&gt;clojure.java.io/reader&lt;/span&gt;
      &lt;span class="nv"&gt;line-seq&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;hosts-stream&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;url-stream&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;
   &lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;first &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;clojure.string/split&lt;/span&gt; &lt;span class="nv"&gt;%&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
   &lt;span class="nv"&gt;url-stream&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;count-hosts&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;a-hosts-stream&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;log-log&lt;/span&gt; &lt;span class="nv"&gt;a-hosts-stream&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;defn &lt;/span&gt;&lt;span class="nv"&gt;-main&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;args&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;let &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;path &lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;first &lt;/span&gt;&lt;span class="nv"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nv"&gt;num-urls&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;when &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;second &lt;/span&gt;&lt;span class="nv"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;-&amp;gt; &lt;/span&gt;&lt;span class="nv"&gt;args&lt;/span&gt; &lt;span class="nb"&gt;second &lt;/span&gt;&lt;span class="nv"&gt;Integer/parseInt&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nv"&gt;num-urls&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;-&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;path &lt;/span&gt;&lt;span class="nv"&gt;urls-stream&lt;/span&gt; &lt;span class="nv"&gt;hosts-stream&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;take &lt;/span&gt;&lt;span class="nv"&gt;num-urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;count-hosts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;-&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;path &lt;/span&gt;&lt;span class="nv"&gt;urls-stream&lt;/span&gt; &lt;span class="nv"&gt;hosts-stream&lt;/span&gt; &lt;span class="nv"&gt;count-hosts&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;On a portion of the dataset, these are the results:&lt;/p&gt;

&lt;p&gt;Using standard unix tools (&lt;code&gt;uniq&lt;/code&gt; works here without a sort because the dataset is already sorted by hostname):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;➜  probabilistic_counting git:(master) ✗ cat ~/common_crawl_index_urls | head -n 50000 | cut -d "/" -f 1 | uniq | wc -l
9205

➜  probabilistic_counting git:(master) ✗ cat ~/common_crawl_index_urls | head -n 500000 | cut -d "/" -f 1 | uniq | wc -l
22164

➜  probabilistic_counting git:(master) ✗ cat ~/common_crawl_index_urls | head -n 5000000 | cut -d "/" -f 1 | uniq | wc -l
196318

➜  probabilistic_counting git:(master) ✗ cat ~/common_crawl_index_urls | head -n 50000000 | cut -d "/" -f 1 | uniq | wc -l
1525445&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And log-log produces:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;➜  probabilistic_counting git:(master) ✗ lein trampoline run ~/common_crawl_index_urls 50000
9677.974613035705

➜  probabilistic_counting git:(master) ✗ lein trampoline run ~/common_crawl_index_urls 500000
22708.710878857888

➜  probabilistic_counting git:(master) ✗ lein trampoline run ~/common_crawl_index_urls 5000000
194919.74158794453

➜  probabilistic_counting git:(master) ✗ lein trampoline run ~/common_crawl_index_urls 5000000
1602155.9911824786&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And the accuracy is really very good.&lt;/p&gt;

&lt;p&gt;This algorithm works since:&lt;/p&gt;

&lt;ul&gt;
 &lt;li&gt;$ \rho(x) $ computes the position of the LSB in $ x $.&lt;/li&gt;
 &lt;li&gt;The probability that $ \rho(x) $ is $ k $ is $ \frac{1}{2^k} $ (the  probability of obtaining a sequence of $ k &amp;ndash; 1 $ zeroes and a one).&lt;/li&gt;
 &lt;li&gt;Say, the real cardinality is $ n $. Thus, on $\frac{n}{2^k}$ members of  this set, $ \rho(x) $ will yield a value of $ k $.&lt;/li&gt;
 &lt;li&gt;As a result, if you can drive $ 2^k $ as close to $ n $ as possible,  you have a good estimation of cardinality. This is achieved by  maximizing $ k $.&lt;/li&gt;
 &lt;li&gt;And thus, $ \arg\max_{x \in M} \rho(x) $ is an estimator (albeit  biased) for $ \log(n) $.&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;This is one of those extremely sweet papers where the idea (minus the details) fits on a business card and the resulting algorithm has immense practical value.&lt;/p&gt;

&lt;p&gt;The full source code is available in &lt;a href="https://github.com/shriphani/probabilistic-counting"&gt;this github repository&lt;/a&gt;.&lt;/p&gt;&lt;/html&gt;</description></item></channel></rss>