Notes on Map-reduce and Hadoop

This hasn't bubbled up to the top of my list, but I'm starting to set notes aside as I happen onto them, for example, in discussions with people who seem to know something about this topic, and I write them down, draw them up, etc.

Map/reduce is an algorithm that makes use of massively parallel operations over which a set of (or single) results is obtained.

Map takes input data and processes it to produce key-value pairs. Imagine temperature reports for dozens of cities over years. You ask map to produce a list of cities and temperatures for each.

Reduce aggregates the key-value pairs to produce final results. These results are, if you're looking for maximum temperature, a list of cities and the maximum temperature for each.

Ancient world example

Census takers were sent out by Rome to every city, one worker per city. Each counts the number of people in his city and reports it back to the Senate. (This is "map.") The Senate adds up the tally from each city into a single number of Roman citizens. (This is "reduce.")

The point is that this was more efficient than to send out one worker to visit every city and come back years later with a single tally.