Recently I came across Scala's Algebird. There aren't too many good descriptions of it out there, but the following talk is awesome.
It starts by explaining why add is actually awesome. Given a stream of numbers we don't need to remember anything, we just keep the running total, which is a single number. So a single number captures a property of an endless stream of numbers.
Add is also awesome because we can split the stream of numbers across multiple boxes and then just sum the sums of all the boxes. Great, that's a cluster.
Associativity and commutativity need no long explanation here. Associativity is what lets us have multiple boxes, and since we only send the totals over the network, there is very little data to ship around.
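Here's a tiny sketch of that in plain Scala (no Algebird needed; the 4-way split and the numbers are just my made-up example): each box only keeps its own total, and adding up the partial totals gives the same answer as adding up the whole stream.

```scala
object SplitAndSum {
  def main(args: Array[String]): Unit = {
    val stream = (1 to 1000).toList           // the incoming numbers
    val boxes  = stream.grouped(250).toList   // split the stream across 4 "boxes"

    val partialTotals = boxes.map(_.sum)      // each box only remembers its own total
    val clusterTotal  = partialTotals.sum     // the tiny partial totals are all we ship around

    assert(clusterTotal == stream.sum)        // associativity guarantees the same answer
    println(clusterTotal)                     // 500500
  }
}
```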
Now what if we track max instead of add? Do we need to write the whole logic for max all over again? It turns out that if we can transform add into max, we can reuse the add machinery! A whole box of operations can now be done using only add; you just need to learn how to transform add into max, or into any other function that obeys the rules described below, and we are good to go.
It's not just a standard abstraction, it's a monoid abstraction, aha!
The properties we had here: there is a zero item (ignored by add, it's the identity), grouping doesn't matter (associativity), order doesn't matter (commutativity), and the operation takes two values as input and produces one.
And here we have just defined a commutative monoid.
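To make it concrete, here is a toy version of that abstraction in Scala. This is not Algebird's actual API, just a minimal sketch with names I made up, but it shows the trick: write the "combine everything" loop once against the monoid, and max (or anything else that obeys the rules) comes for free.

```scala
// A toy version of the abstraction; Algebird's real Monoid typeclass is richer,
// and the names below are my own, not Algebird's API.
trait Monoid[T] {
  def zero: T               // the "0 item" that the operation ignores
  def plus(a: T, b: T): T   // associative (and here also commutative): two values in, one out
}

object MonoidDemo {
  val addMonoid: Monoid[Long] = new Monoid[Long] {
    val zero = 0L
    def plus(a: Long, b: Long) = a + b
  }

  val maxMonoid: Monoid[Long] = new Monoid[Long] {
    val zero = Long.MinValue                 // the identity for max
    def plus(a: Long, b: Long) = a max b
  }

  // The "sum the stream" logic is written once and reused for add, max, ...
  def sumAll[T](xs: Seq[T])(m: Monoid[T]): T =
    xs.foldLeft(m.zero)(m.plus)

  def main(args: Array[String]): Unit = {
    val xs = Seq(3L, 9L, 4L, 7L)
    println(sumAll(xs)(addMonoid))   // 23
    println(sumAll(xs)(maxMonoid))   // 9
  }
}
```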
Now if you want an average, you again transform each number into a pair (a small vector) of weight and value, say a count and a running sum, add those pairs component-wise, and only at the end transform back to the average number for presentation.
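A standalone sketch of the average trick (Algebird has its own averaged-value type; the CountSum case class below is just mine for illustration): keep (count, sum) pairs, add them component-wise, and only divide at the very end.

```scala
// A standalone sketch of the average monoid; the case class and its names
// are illustrative, not a library API.
case class CountSum(count: Long, sum: Double) {
  def plus(that: CountSum): CountSum =          // pairwise add: associative and commutative
    CountSum(count + that.count, sum + that.sum)
  def average: Double =                         // only computed at the end, for presentation
    if (count == 0L) 0.0 else sum / count
}

object AvgDemo {
  def main(args: Array[String]): Unit = {
    val zero   = CountSum(0L, 0.0)
    val merged = Seq(10.0, 20.0, 60.0)
      .map(v => CountSum(1L, v))                // lift each number into a (count, sum) pair
      .foldLeft(zero)(_ plus _)
    println(merged.average)                     // 30.0
  }
}
```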
And here comes a stream of monoids:
unique values monoid
avg monoid
histogram monoid
frequency monoid
Hmmm, and a very nice way to find the number of unique values without remembering all the input items: we just hash each item to a value in [0,1] and use the fact that n uniform hashes split the interval into gaps whose expected size is 1/(n+1). So the expected distance from 0 to the smallest hash is 1/(n+1), and from that minimum alone we can estimate the number of unique items (roughly 1/min - 1). To get better results we hash each item with several hash functions. This is the family of tricks behind min hash signatures, HyperLogLog and Bloom filters, where merging two sketches is just an element-wise vector operation (min, max or OR).
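Here is a rough, simplified sketch of that minimum-hash trick in Scala. It is not HyperLogLog and not Algebird's implementation; the hashing scheme and the estimator are my own toy choices, just to show that the sketch never stores the items and that merging two sketches is an element-wise min.

```scala
import scala.util.hashing.MurmurHash3

// A rough sketch of the "keep only the minimum hash" trick for counting uniques.
object MinHashCardinality {
  // Map an item to a pseudo-uniform value in (0, 1], one hash function per seed.
  def hash01(item: String, seed: Int): Double =
    ((MurmurHash3.stringHash(item, seed).toLong & 0xffffffffL) + 1).toDouble / (1L << 32).toDouble

  // The sketch is just the minimum hash seen per seed; it never stores the items.
  def sketch(items: Seq[String], seeds: Seq[Int]): Seq[Double] =
    seeds.map(seed => items.map(hash01(_, seed)).min)

  // Merging the sketches of two boxes is an element-wise min -- again a commutative monoid.
  def merge(a: Seq[Double], b: Seq[Double]): Seq[Double] =
    a.zip(b).map { case (x, y) => x min y }

  // E[min of n uniform hashes] = 1/(n+1), so average the minimums across seeds and invert.
  def estimate(sk: Seq[Double]): Double =
    1.0 / (sk.sum / sk.size) - 1.0

  def main(args: Array[String]): Unit = {
    val items = (1 to 10000).map(i => s"user-${i % 2000}")   // 2000 distinct values, repeated
    val seeds = 1 to 64                                      // more hashes => better estimate
    println(estimate(sketch(items, seeds)))                  // noisy, but roughly 2000

    val (left, right) = items.splitAt(5000)                  // two "boxes" sketch separately,
    val combined = merge(sketch(left, seeds), sketch(right, seeds))
    println(estimate(combined))                              // then merge: same ballpark
  }
}
```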
These are all basically the same idea, tweaked for the specific problem domain.
And you see, unique-value sketches can also be aggregated just like add! Add is really awesome!
The frequency monoid works the same way: alice: 1, bob: 1, alice: 1 => {alice: 2, bob: 1}. Again we can hash the items into a limited set of buckets and project them down into that smaller space, so alice: 1 => [0,0,1,0]; here we are limited to 4 buckets.
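And a sketch of that bucketed counting idea: hash each key into a small fixed vector of counts and merge by element-wise vector addition. Production count-min-sketch style structures use several hash rows to tame collisions; this toy keeps a single row of 4 buckets like the example above.

```scala
import scala.util.hashing.MurmurHash3

// A sketch of the bucketed frequency idea: project each key into a small fixed
// vector of counts and merge by element-wise vector addition.
object BucketCounts {
  val numBuckets = 4

  def toVector(key: String, count: Long): Vector[Long] = {
    val bucket = Math.floorMod(MurmurHash3.stringHash(key), numBuckets)
    Vector.tabulate(numBuckets)(i => if (i == bucket) count else 0L)   // e.g. [0, 0, 1, 0]
  }

  def plus(a: Vector[Long], b: Vector[Long]): Vector[Long] =
    a.zip(b).map { case (x, y) => x + y }                              // vector add is the monoid

  def main(args: Array[String]): Unit = {
    val events = Seq("alice" -> 1L, "bob" -> 1L, "alice" -> 1L)
    val totals = events.map { case (k, c) => toVector(k, c) }.reduce(plus)
    println(totals)  // alice's bucket holds 2 and bob's holds 1 (unless their hashes collide)
  }
}
```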
http://www.infoq.com/presentations/abstract-algebra-analytics