Notes from MongoDB Training

Russ Bateman
5 October 2012
last update:

Preliminary notes

For training, if you run a separate mongod alongside the one already running on your host, it must be launched on a different port:

$ mongod --port 30000

Thereafter (as seen occasionally in these notes), run tools against that port explicitly:

$ mongorestore --port 30000 -d digg dump/digg

Here's what the training subdirectory looked like on my host by the time it was over:

~ $ tree -d mongo-training/
mongo-training/
├── bin
└── data
    ├── db
    │   └── journal
    ├── dump
    │   ├── digg
    │   └── training
    ├── rs1
    │   └── journal
    ├── rs2
    │   ├── journal
    │   └── _tmp
    ├── rs3
    │   ├── journal
    │   └── _tmp
    └── rs4
        ├── journal
        └── _tmp

18 directories

For the questions corresponding to the exercise answers below, the original MongoDB Training Course Manual is required; it may no longer exist.



Wednesday morning

1. MongoDB Introduction

To see the configurable options:

$ bin/mongod --help

Journaling (100 ms commit interval by default), disk flush (60 s default), data file size, logfiles, etc. MongoDB writes to a memory-mapped journal file first, then asynchronously writes to the database files. In the event that the node goes down, the journal is replayed on restart.

$ mongod -v
$ mongod -vv (up to 5 v's)

These settings can also be stored in the configuration file, mongodb.conf. Typically this file lives at /etc/mongodb.conf, but in formal, data-center installations it often ends up on a different path, e.g. /data/mongodb/mongodb.conf.
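
For illustration, a minimal 2.x-style configuration file might look like this (port and paths are hypothetical, mirroring the training setup):

port = 30000
dbpath = /data/db
logpath = /var/log/mongodb/mongodb.log
logappend = true
fork = true

...which mongod then picks up with:

$ mongod --config /etc/mongodb.conf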

MongoDB shell

$ mongo --port 30000

In the shell, issue the following to see how MongoDB was launched:

> db.adminCommand( { getCmdLineOpts: 1 } )
{
    "argv" : [
        "/usr/bin/mongod",
        "--config",
        "/etc/mongodb.conf"
    ],
    "parsed" : {
        "config" : "/etc/mongodb.conf",
        "dbpath" : "/var/lib/mongodb",
        "logappend" : "true",
        "logpath" : "/var/log/mongodb/mongodb.log"
    },
    "ok" : 1
}

That first output is from the stock installation; against the training instance launched on port 30000, the same command yields:

> db.adminCommand( { getCmdLineOpts: 1 } )
{
    "argv" : [
        "bin/mongod",
        "--port",
        "30000"
    ],
    "parsed" : {
        "port" : 30000
    },
    "ok" : 1
}

Everything in the shell is a command:

> db.adminCommand( { listDatabases : 1 } )
{
    "databases" : [
        {
            "name" : "training",
            "sizeOnDisk" : 218103808,
            "empty" : false
        },
        {
            "name" : "digg",
            "sizeOnDisk" : 218103808,
            "empty" : false
        },
        {
            "name" : "twitter",
            "sizeOnDisk" : 486539264,
            "empty" : false
        },
        {
            "name" : "local",
            "sizeOnDisk" : 1,
            "empty" : true
        },
        {
            "name" : "test",
            "sizeOnDisk" : 1,
            "empty" : true
        }
    ],
    "totalSize" : 922746880,
    "ok" : 1
}

...which is the long-hand for:

> show dbs
digg    0.203125GB
local    (empty)
test    (empty)
training    0.203125GB
twitter    0.453125GB

...show dbs being a wrapper for the longer command (it merely organizes the output differently).

Here's how to see what the size of a document would be:

> Object.bsonsize( { "hello" : "world" } )
22
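
As a sanity check, those 22 bytes decompose according to the BSON layout (the arithmetic is mine, not the manual's):

    4   int32 document length
  + 1   element type byte (0x02, string)
  + 6   "hello" + NUL
  + 4   int32 string length
  + 6   "world" + NUL
  + 1   document NUL terminator
  = 22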

2. CRUD and the MongoDB shell

Exercise: inserting some documents.

> db.people.insert( { "name" : "Smith", "age" : 30 } )
> for( i = 0; i < 1000; i++ ) { db.people.insert( { a : 20 } ); }
> db.people.find()
{ "_id" : ObjectId("50634c529cf47e02347c2ee2"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ee3"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ee4"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ee5"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ee6"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ee7"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ee8"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ee9"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2eea"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2eeb"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2eec"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2eed"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2eee"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2eef"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ef0"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ef1"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ef2"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ef3"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ef4"), "a" : 20 }
{ "_id" : ObjectId("50634c529cf47e02347c2ef5"), "a" : 20 }
Type "it" for more

Exercise, page 9

> db.exercise.insert( { "_id":"Jill", x : 1 } )
> db.exercise.findOne()
{ "_id" : "Jill", "x" : 1 }
> db.exercise.insert( { "_id":"Jill", x : 2 } )
E11000 duplicate key error index: test.exercise.$_id_  dup key: { : "Jill" }

> db.exercise.insert( { "_id" : 1.78 } )
> db.exercise.find()
{ "_id" : "Jill", "x" : 1 }
{ "_id" : 1.78 }

Auto-generated indices:

> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.people", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "test.exercise", "name" : "_id_" }

Embedded documents

> db.foo.insert( { a : { b : 2 } } )
> db.foo.find()
{ "_id" : ObjectId("506350809cf47e02347c32cb"), "a" : { "b" : 2 } }

(How to discover how 'find' works...)

> db.foo.find
function (query, fields, limit, skip, batchSize, options) {
    return new DBQuery(this._mongo, this._db, this, this._fullName, this._massageObject(query), fields, limit, skip, batchSize, options || this.getQueryOptions());
}



> db.foo.insert( { a : 100, b : 200 } )
> db.foo.find( { a : 100 } )
{ "_id" : ObjectId("506350f89cf47e02347c32cc"), "a" : 100, "b" : 200 }
> db.foo.find( { a : 100, b : 200 }, { a : true } )
{ "_id" : ObjectId("506350f89cf47e02347c32cc"), "a" : 100 }

Getting server status

> db.serverStatus()
{
    "host" : "russ-elite-book:30000",
    "version" : "2.2.0",
    "process" : "mongod",
    "pid" : 4305,
    "uptime" : 6099,
    "uptimeMillis" : NumberLong(6098936),
    "uptimeEstimate" : 6030,
    "localTime" : ISODate("2012-09-26T19:09:23.219Z"),
    "locks" : {
        "." : {
            "timeLockedMicros" : {
                "R" : NumberLong(152395),
                "W" : NumberLong(798928)
            },
            "timeAcquiringMicros" : {
                "R" : NumberLong(5791312),
                "W" : NumberLong(13902)
            }
        },
        "admin" : {
            "timeLockedMicros" : {

            },
            "timeAcquiringMicros" : {

            }
        },
        "local" : {
            "timeLockedMicros" : {
                "r" : NumberLong(5088),
                "w" : NumberLong(0)
            },
            "timeAcquiringMicros" : {
                "r" : NumberLong(829),
                "w" : NumberLong(0)
            }
        },
        "digg" : {
            "timeLockedMicros" : {
                "r" : NumberLong(12318),
                "w" : NumberLong(2280581)
            },
            "timeAcquiringMicros" : {
                "r" : NumberLong(339),
                "w" : NumberLong(29432)
            }
        },
        "test" : {
            "timeLockedMicros" : {
                "r" : NumberLong(2532),
                "w" : NumberLong(1361803)
            },
            "timeAcquiringMicros" : {
                "r" : NumberLong(144),
                "w" : NumberLong(1140)
            }
        },
        "training" : {
            "timeLockedMicros" : {
                "r" : NumberLong(3918),
                "w" : NumberLong(1443396)
            },
            "timeAcquiringMicros" : {
                "r" : NumberLong(235),
                "w" : NumberLong(2509)
            }
        },
        "twitter" : {
            "timeLockedMicros" : {
                "r" : NumberLong(3650),
                "w" : NumberLong(2556026)
            },
            "timeAcquiringMicros" : {
                "r" : NumberLong(206),
                "w" : NumberLong(837704)
            }
        }
    },
    "globalLock" : {
        "totalTime" : NumberLong("6098936000"),
        "lockTime" : NumberLong(798928),
        "currentQueue" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
        },
        "activeClients" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
        }
    },
    "mem" : {
        "bits" : 64,
        "resident" : 142,
        "virtual" : 1068,
        "supported" : true,
        "mapped" : 448,
        "mappedWithJournal" : 896
    },
    "connections" : {
        "current" : 2,
        "available" : 817
    },
    "extra_info" : {
        "note" : "fields vary by platform",
        "heap_usage_bytes" : 65816224,
        "page_faults" : 0
    },
    "indexCounters" : {
        "btree" : {
            "accesses" : 149948,
            "hits" : 149948,
            "misses" : 0,
            "resets" : 0,
            "missRatio" : 0
        }
    },
    "backgroundFlushing" : {
        "flushes" : 101,
        "total_ms" : 171,
        "average_ms" : 1.693069306930693,
        "last_ms" : 0,
        "last_finished" : ISODate("2012-09-26T19:08:44.294Z")
    },
    "cursors" : {
        "totalOpen" : 0,
        "clientCursors_size" : 0,
        "timedOut" : 2
    },
    "network" : {
        "bytesIn" : 108016512,
        "bytesOut" : 19095,
        "numRequests" : 75547
    },
    "opcounters" : {
        "insert" : 65438,
        "query" : 450,
        "update" : 0,
        ...
        "local" : {
            "accessesNotInMemory" : 0,
            "pageFaultExceptionsThrown" : 0
        },
        "test" : {
            "accessesNotInMemory" : 0,
            "pageFaultExceptionsThrown" : 0
        },
        "training" : {
            "accessesNotInMemory" : 0,
            "pageFaultExceptionsThrown" : 0
        },
        "twitter" : {
            "accessesNotInMemory" : 0,
            "pageFaultExceptionsThrown" : 0
        }
    },
    "ok" : 1
}

How findOne() works:

> db.foo.findOne
function (query, fields, options) {
    var cursor = this._mongo.find(this._fullName, this._massageObject(query) || {}, fields, -1, 0, 0, options || this.getQueryOptions());
    if (!cursor.hasNext()) {
        return null;
    }
    var ret = cursor.next();
    if (cursor.hasNext()) {
        throw "findOne has more than 1 result!";
    }
    if (ret.$err) {
        throw "error " + tojson(ret);
    }
    return ret;
}

Links

http://www.littlelostmanuals.com/2011/11/overview-of-basic-mongodb-java-write.html
http://www.mongodb.org/display/DOCS/getLastError+Command
https://gist.github.com/795748
http://whyjava.wordpress.com/2011/12/08/how-mongodb-different-write-concern-values-affect-performance-on-a-single-node/
http://www.slideshare.net/mongodb/mongodb-at-nosql-now-2012


Wednesday afternoon

Embedded documents

> db.foo.find( { "a.b" : 2 } )
{ "_id" : ObjectId("506350809cf47e02347c32cb"), "a" : { "b" : 2 } }
> db.foo.insert( { a : { b : 2 } } )
> db.foo.find( { "a.b" : 2 } )
{ "_id" : ObjectId("506363a59cf47e02347c32ce"), "a" : { "b" : 2 } }
> db.foo.find( { a : { b : 2 } } )
{ "_id" : ObjectId("506363a59cf47e02347c32ce"), "a" : { "b" : 2 } }
> db.foo.insert( { a : { b : 2, c : 1 } } )
> db.foo.find( { a : { b : 2 } } )
{ "_id" : ObjectId("506363a59cf47e02347c32ce"), "a" : { "b" : 2 } }
> db.foo.find( { "a.b" : 2 } )
{ "_id" : ObjectId("506363a59cf47e02347c32ce"), "a" : { "b" : 2 } }
{ "_id" : ObjectId("506363d79cf47e02347c32cf"), "a" : { "b" : 2, "c" : 1 } }

This illustrates that a document,

{
  "a" :
  {
    "b":2,
    "c":1
  }
}

matched by { "a.b":2 } because this query is only concerned with matching that a.b is 2 and not whether the rest of the document matches. { "a":{"b":2} } says that the subdocument (underneath a) is closed to include only { "b":2 }.

Matching operators:

> db.foo.drop()
true
> db.foo.insert( { a:100, b:200 })
> db.foo.insert( { a:50, b:200 })
> db.foo.find()
{ "_id" : ObjectId("506365b29cf47e02347c32d0"), "a" : 100, "b" : 200 }
{ "_id" : ObjectId("506365b79cf47e02347c32d1"), "a" : 50, "b" : 200 }
> db.foo.find( { a:{ $gte:60 } } )
{ "_id" : ObjectId("506365b29cf47e02347c32d0"), "a" : 100, "b" : 200 }

> db.foo.find( { a:{ $in: [50, 60] } } )
{ "_id" : ObjectId("506365b79cf47e02347c32d1"), "a" : 50, "b" : 200 }

Exercises, page 11

1.

> use training
switched to db training
> show collections
scores
system.indexes
> db.scores.findOne()
{
    "_id" : ObjectId("4c90f2543d937c033f424701"),
    "kind" : "quiz",
    "score" : 50,
    "student" : 0
}
> db.scores.find( { "score" : { $lt : 65 } } )
{ "_id" : ObjectId("4c90f2543d937c033f424701"), "kind" : "quiz", "score" : 50, "student" : 0 }
{ "_id" : ObjectId("4c90f2543d937c033f424703"), "kind" : "exam", "score" : 56, "student" : 0 }
{ "_id" : ObjectId("4c90f2543d937c033f424706"), "kind" : "exam", "score" : 58, "student" : 1 }
{ "_id" : ObjectId("4c90f2543d937c033f424709"), "kind" : "exam", "score" : 53, "student" : 2 }
{ "_id" : ObjectId("4c90f2543d937c033f42470a"), "kind" : "quiz", "score" : 58, "student" : 3 }
{ "_id" : ObjectId("4c90f2543d937c033f424710"), "kind" : "quiz", "score" : 54, "student" : 5 }
{ "_id" : ObjectId("4c90f2543d937c033f424711"), "kind" : "essay", "score" : 50, "student" : 5 }
{ "_id" : ObjectId("4c90f2543d937c033f424712"), "kind" : "exam", "score" : 50, "student" : 5 }
{ "_id" : ObjectId("4c90f2543d937c033f424714"), "kind" : "essay", "score" : 53, "student" : 6 }
{ "_id" : ObjectId("4c90f2543d937c033f424715"), "kind" : "exam", "score" : 51, "student" : 6 }
{ "_id" : ObjectId("4c90f2543d937c033f424718"), "kind" : "exam", "score" : 63, "student" : 7 }
{ "_id" : ObjectId("4c90f2543d937c033f424719"), "kind" : "quiz", "score" : 57, "student" : 8 }
{ "_id" : ObjectId("4c90f2543d937c033f42471e"), "kind" : "exam", "score" : 60, "student" : 9 }
{ "_id" : ObjectId("4c90f2543d937c033f42471f"), "kind" : "quiz", "score" : 50, "student" : 10 }
{ "_id" : ObjectId("4c90f2543d937c033f424722"), "kind" : "quiz", "score" : 64, "student" : 11 }
{ "_id" : ObjectId("4c90f2543d937c033f424725"), "kind" : "quiz", "score" : 59, "student" : 12 }
{ "_id" : ObjectId("4c90f2543d937c033f424729"), "kind" : "essay", "score" : 63, "student" : 13 }
{ "_id" : ObjectId("4c90f2543d937c033f42472e"), "kind" : "quiz", "score" : 54, "student" : 15 }
{ "_id" : ObjectId("4c90f2543d937c033f424731"), "kind" : "quiz", "score" : 54, "student" : 16 }
{ "_id" : ObjectId("4c90f2543d937c033f424734"), "kind" : "quiz", "score" : 61, "student" : 17 }

2.

> db.scores.find( { } ).sort( { "score" : 1 } ).limit( 1 )
{ "_id" : ObjectId("4c90f2543d937c033f424701"), "kind" : "quiz", "score" : 50, "student" : 0 }
> db.scores.find( { } ).sort( { "score" : -1 } ).limit( 1 )
{ "_id" : ObjectId("4c90f2543d937c033f42471c"), "kind" : "quiz", "score" : 99, "student" : 9 }

3.

> db.stories.find( { "shorturl" : { $gt : { "view_count" : 1000 } } } ).count()
10000

4.

> db.stories.find( { $or : [ { "media" : "news" }, { "media" : "images" } ] } ).count()
8986
> db.stories.find( { "topic.name" : "Comedy" } ).count()
422
> db.stories.find( { $or : [ { "media" : "news" }, { "media" : "images" } ], "topic.name" : "Comedy" } ).count()
308

5.

> db.stories.find( { $or : [ { "topic.name" : "Television" }, { "media" : "videos" } ] } ).count()
1218

Updating documents...

> db.stuff.insert( { _id:123, "foo" : "bar" } )
> db.stuff.find()
{ "_id" : 123, "foo" : "bar" }
> db.stuff.update( { _id:123 }, { "hello" : "world" } )
> db.stuff.find()
{ "_id" : 123, "hello" : "world" }

This is "replace"--uncommon. This works thus:

> db.func.update
function (query, obj, upsert, multi) {
    assert(query, "need a query");
    assert(obj, "need an object");
    var firstKey = null;
    for (var k in obj) {
        firstKey = k;
        break;
    }
    if (firstKey != null && firstKey[0] == "$") {
        this._validateObject(obj);
    } else {
        this._validateForStorage(obj);
    }
    if (typeof upsert === "object") {
        assert(multi === undefined, "Fourth argument must be empty when specifying upsert and multi with an object.");
        opts = upsert;
        multi = opts.multi;
        upsert = opts.upsert;
    }
    this._db._initExtraInfo();
    this._mongo.update(this._fullName, query, obj, upsert ? true : false, multi ? true : false);
    this._db._getExtraInfo("Updated");
}

This is a $set update, which modifies fields in place rather than replacing the whole document...

> db.stuff.update( { _id:123 }, { $set : { "foo" : "bar" } } )
> db.stuff.find()
{ "_id" : 123, "foo" : "bar", "hello" : "world" }

Pushing...

> db.stuff.insert( { a:1, b:[] } )
> db.stuff.find()
{ "_id" : 123, "foo" : "bar", "hello" : "world" }
{ "_id" : ObjectId("5063737c9cf47e02347c32d3"), "a" : 1, "b" : [] }
> db.stuff.update( { a : 1 }, { $push : { b : 2 } } )
> db.stuff.find()
{ "_id" : 123, "foo" : "bar", "hello" : "world" }
{ "_id" : ObjectId("5063737c9cf47e02347c32d3"), "a" : 1, "b" : [ 2 ] }

Updating, incrementing...

> db.stuff.insert( { _id:1, a : 10 } )
> db.stuff.find()
{ "_id" : 123, "foo" : "bar", "hello" : "world" }
{ "_id" : ObjectId("5063737c9cf47e02347c32d3"), "a" : 1, "b" : [ 2 ] }
{ "_id" : ObjectId("506373ca9cf47e02347c32d4"), "a" : 1, "c" : [ ] }
{ "_id" : 1, "a" : 10 }
> db.stuff.update( { _id:1 }, { $inc : { "a" : 5 } } )
> db.stuff.find()
{ "_id" : 123, "foo" : "bar", "hello" : "world" }
{ "_id" : ObjectId("5063737c9cf47e02347c32d3"), "a" : 1, "b" : [ 2 ] }
{ "_id" : ObjectId("506373ca9cf47e02347c32d4"), "a" : 1, "c" : [ ] }
{ "_id" : 1, "a" : 15 }

Exercises, page 12

1.

> db.scores.update( { "score" : { $gt : 90 } }, { $set : { "grade" : "A" } }, false, true
)
> db.scores.find( { "score" : { $gt : 90 } } )
{ "_id" : ObjectId("4c90f2543d937c033f42470d"), "grade" : "A", "kind" : "quiz", "score" : 98, "student" : 4 }
{ "_id" : ObjectId("4c90f2543d937c033f424713"), "grade" : "A", "kind" : "quiz", "score" : 98, "student" : 6 }
{ "_id" : ObjectId("4c90f2543d937c033f42471c"), "grade" : "A", "kind" : "quiz", "score" : 99, "student" : 9 }
{ "_id" : ObjectId("4c90f2543d937c033f424724"), "grade" : "A", "kind" : "exam", "score" : 96, "student" : 11 }
{ "_id" : ObjectId("4c90f2543d937c033f42472c"), "grade" : "A", "kind" : "essay", "score" : 92, "student" : 14 }
{ "_id" : ObjectId("4c90f2543d937c033f42472f"), "grade" : "A", "kind" : "essay", "score" : 95, "student" : 15 }
{ "_id" : ObjectId("4c90f2543d937c033f424732"), "grade" : "A", "kind" : "essay", "score" : 94, "student" : 16 }
{ "_id" : ObjectId("4c90f2543d937c033f424748"), "grade" : "A", "kind" : "exam", "score" : 92, "student" : 23 }
{ "_id" : ObjectId("4c90f2543d937c033f42474d"), "grade" : "A", "kind" : "essay", "score" : 93, "student" : 25 }
{ "_id" : ObjectId("4c90f2543d937c033f424755"), "grade" : "A", "kind" : "quiz", "score" : 98, "student" : 28 }
{ "_id" : ObjectId("4c90f2543d937c033f424758"), "grade" : "A", "kind" : "quiz", "score" : 94, "student" : 29 }
{ "_id" : ObjectId("4c90f2543d937c033f424761"), "grade" : "A", "kind" : "quiz", "score" : 98, "student" : 32 }
{ "_id" : ObjectId("4c90f2543d937c033f424764"), "grade" : "A", "kind" : "quiz", "score" : 95, "student" : 33 }
{ "_id" : ObjectId("4c90f2543d937c033f424766"), "grade" : "A", "kind" : "exam", "score" : 91, "student" : 33 }
{ "_id" : ObjectId("4c90f2543d937c033f424768"), "grade" : "A", "kind" : "essay", "score" : 98, "student" : 34 }
{ "_id" : ObjectId("4c90f2543d937c033f424774"), "grade" : "A", "kind" : "essay", "score" : 93, "student" : 38 }
{ "_id" : ObjectId("4c90f2543d937c033f424776"), "grade" : "A", "kind" : "quiz", "score" : 91, "student" : 39 }
{ "_id" : ObjectId("4c90f2543d937c033f42477a"), "grade" : "A", "kind" : "essay", "score" : 96, "student" : 40 }
{ "_id" : ObjectId("4c90f2543d937c033f42477b"), "grade" : "A", "kind" : "exam", "score" : 98, "student" : 40 }
{ "_id" : ObjectId("4c90f2543d937c033f424788"), "grade" : "A", "kind" : "quiz", "score" : 98, "student" : 45 }
Type "it" for more

Uh... this didn't really work at first. In fact, I goofed and got "grade" inserted, then fixed it. Trying the second half of the exercise, I couldn't get it to work for a while: the query was fine; the update wasn't, for the reason shown below.

> db.scores.find( { $and : [ { "score" : { $lte : 90 } }, { "score" : { $gt : 80 } } ] }, { $set : { "grade" : "B" } }, false, true )
error: { "$err" : "Unsupported projection option: grade", "code" : 13097 }

Here, I kept doing "find" instead of update (as I was copying and pasting in order not to have to re-type).

> db.scores.find( { $and : [ { "score" : { $lte : 90 } }, { "score" : { $gt : 80 } } ] } )
{ "_id" : ObjectId("4c90f2543d937c033f424707"), "grade" : "B", "kind" : "quiz", "score" : 90, "student" : 2 }
{ "_id" : ObjectId("4c90f2543d937c033f42470f"), "grade" : "B", "kind" : "exam", "score" : 86, "student" : 4 }
{ "_id" : ObjectId("4c90f2543d937c033f4247b0"), "grade" : "B", "kind" : "essay", "score" : 85, "student" : 58 }
{ "_id" : ObjectId("4c90f2543d937c033f4247bf"), "grade" : "B", "kind" : "essay", "score" : 83, "student" : 63 }
{ "_id" : ObjectId("4c90f2543d937c033f4247d1"), "grade" : "B", "kind" : "essay", "score" : 85, "student" : 69 }
{ "_id" : ObjectId("4c90f2543d937c033f4247e6"), "grade" : "B", "kind" : "essay", "score" : 87, "student" : 76 }
{ "_id" : ObjectId("4c90f2543d937c033f4247f2"), "grade" : "B", "kind" : "essay", "score" : 84, "student" : 80 }
{ "_id" : ObjectId("4c90f2543d937c033f4247fb"), "grade" : "B", "kind" : "essay", "score" : 90, "student" : 83 }
{ "_id" : ObjectId("4c90f2543d937c033f4247fc"), "grade" : "B", "kind" : "exam", "score" : 88, "student" : 83 }
{ "_id" : ObjectId("4c90f2543d937c033f424800"), "grade" : "B", "kind" : "quiz", "score" : 90, "student" : 85 }
{ "_id" : ObjectId("4c90f2543d937c033f424803"), "grade" : "B", "kind" : "quiz", "score" : 84, "student" : 86 }
{ "_id" : ObjectId("4c90f2543d937c033f42480a"), "grade" : "B", "kind" : "essay", "score" : 90, "student" : 88 }
{ "_id" : ObjectId("4c90f2543d937c033f424817"), "grade" : "B", "kind" : "exam", "score" : 87, "student" : 92 }
{ "_id" : ObjectId("4c90f2543d937c033f424819"), "grade" : "B", "kind" : "essay", "score" : 84, "student" : 93 }
{ "_id" : ObjectId("4c90f2543d937c033f42481b"), "grade" : "B", "kind" : "quiz", "score" : 89, "student" : 94 }
{ "_id" : ObjectId("4c90f2543d937c033f42481d"), "grade" : "B", "kind" : "exam", "score" : 87, "student" : 94 }
{ "_id" : ObjectId("4c90f2543d937c033f424826"), "grade" : "B", "kind" : "exam", "score" : 87, "student" : 97 }
{ "_id" : ObjectId("4c90f2543d937c033f424829"), "grade" : "B", "kind" : "exam", "score" : 82, "student" : 98 }
{ "_id" : ObjectId("4c90f2543d937c033f42482e"), "grade" : "B", "kind" : "essay", "score" : 83, "student" : 100 }
{ "_id" : ObjectId("4c90f2543d937c033f42483b"), "grade" : "B", "kind" : "exam", "score" : 86, "student" : 104 }
Type "it" for more

2.

> db.scores.update( { "score" : { $lt : 60 } }, { $inc : { "score" : 10 } }, false, true
)
> db.scores.find( { "score" : { $lt : 60 } } )
> db.scores.find( { "score" : { $lt : 70 } } )
{ "_id" : ObjectId("4c90f2543d937c033f424701"), "kind" : "quiz", "score" : 60, "student" : 0 }
{ "_id" : ObjectId("4c90f2543d937c033f424703"), "kind" : "exam", "score" : 66, "student" : 0 }
{ "_id" : ObjectId("4c90f2543d937c033f424706"), "kind" : "exam", "score" : 68, "student" : 1 }
{ "_id" : ObjectId("4c90f2543d937c033f424709"), "kind" : "exam", "score" : 63, "student" : 2 }
{ "_id" : ObjectId("4c90f2543d937c033f42470a"), "kind" : "quiz", "score" : 68, "student" : 3 }
{ "_id" : ObjectId("4c90f2543d937c033f424710"), "kind" : "quiz", "score" : 64, "student" : 5 }
{ "_id" : ObjectId("4c90f2543d937c033f424711"), "kind" : "essay", "score" : 60, "student" : 5 }
{ "_id" : ObjectId("4c90f2543d937c033f424712"), "kind" : "exam", "score" : 60, "student" : 5 }
{ "_id" : ObjectId("4c90f2543d937c033f424714"), "kind" : "essay", "score" : 63, "student" : 6 }
{ "_id" : ObjectId("4c90f2543d937c033f424715"), "kind" : "exam", "score" : 61, "student" : 6 }
{ "_id" : ObjectId("4c90f2543d937c033f424718"), "kind" : "exam", "score" : 63, "student" : 7 }
{ "_id" : ObjectId("4c90f2543d937c033f424719"), "kind" : "quiz", "score" : 67, "student" : 8 }
{ "_id" : ObjectId("4c90f2543d937c033f42471e"), "kind" : "exam", "score" : 60, "student" : 9 }
{ "_id" : ObjectId("4c90f2543d937c033f42471f"), "kind" : "quiz", "score" : 60, "student" : 10 }
{ "_id" : ObjectId("4c90f2543d937c033f424722"), "kind" : "quiz", "score" : 64, "student" : 11 }
{ "_id" : ObjectId("4c90f2543d937c033f424725"), "kind" : "quiz", "score" : 60, "student" : 12 }
{ "_id" : ObjectId("4c90f2543d937c033f424729"), "kind" : "essay", "score" : 63, "student" : 13 }
{ "_id" : ObjectId("4c90f2543d937c033f42472e"), "kind" : "quiz", "score" : 64, "student" : 15 }
{ "_id" : ObjectId("4c90f2543d937c033f424731"), "kind" : "quiz", "score" : 64, "student" : 16 }
{ "_id" : ObjectId("4c90f2543d937c033f424734"), "kind" : "quiz", "score" : 61, "student" : 17 }
Type "it" for more

Write locking (writers are favored over readers) has been per-database since 2.2. This means that if there's contention with lots of writes going on, it may be useful to move to a one-collection-per-database layout, e.g. separate databases for Account, Address, Payment and Partner.
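
A sketch of what that layout looks like from the shell (the database and collection names are hypothetical):

> // each write-heavy collection lives in its own database,
> // so its write lock doesn't block the others
> use accountdb
switched to db accountdb
> db.account.insert( { owner : "Smith" } )
> use paymentdb
switched to db paymentdb
> db.payment.insert( { owner : "Smith", amount : 100 } )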

How does finding a document work...

> db.tweets.find( { "user.followers_count" : 1000 } ).explain()
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 8,
    "nscannedObjects" : 51428,
    "nscanned" : 51428,
    "nscannedObjectsAllPlans" : 51428,
    "nscannedAllPlans" : 51428,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 134,
    "indexBounds" : {

    },
    "server" : "russ-elite-book:30000"
}

This shows that we looked through 51428 documents and it took 134 milliseconds. That isn't good. Create an index:

> db.tweets.ensureIndex( { "user.followers_count" : 1 } )

> show collections
system.indexes
tweets
> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "twitter.tweets", "name" : "_id_" }
{ "v" : 1, "key" : { "user.followers_count" : 1 }, "ns" : "twitter.tweets", "name" : "user.followers_count_1" }

Then, we re-run the query:

> db.tweets.find( { "user.followers_count" : 1000 } ).explain()
{
    "cursor" : "BtreeCursor user.followers_count_1",
    "isMultiKey" : false,
    "n" : 8,
    "nscannedObjects" : 8,
    "nscanned" : 8,
    "nscannedObjectsAllPlans" : 8,
    "nscannedAllPlans" : 8,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "user.followers_count" : [
            [
                1000,
                1000
            ]
        ]
    },
    "server" : "russ-elite-book:30000"
}

...and notice the difference. We only needed to look through 8 documents and it took "no time" at all.

3. Indexing

Multikey indices; tags is an array...

> db.foo.insert( { name:"Raleigh", "tags": [ "north", "carolina", "unc" ] } )
> db.foo.ensureIndex( { tags : 1 } )
> db.foo.find( { tags : "north" } ).explain()
{
    "cursor" : "BtreeCursor tags_1",
    "isMultiKey" : true,
    "n" : 1,
    "nscannedObjects" : 1,
    "nscanned" : 1,
    "nscannedObjectsAllPlans" : 1,
    "nscannedAllPlans" : 1,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "tags" : [
            [
                "north",
                "north"
            ]
        ]
    },
    "server" : "russ-elite-book:30000"
}

A collection may have at most 64 indices.

Polishing off indices: listing and rebuilding them...

> db.foo.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "twitter.foo",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : {
            "tags" : 1
        },
        "ns" : "twitter.foo",
        "name" : "tags_1"
    }
]
> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "twitter.tweets", "name" : "_id_" }
{ "v" : 1, "key" : { "user.followers_count" : 1 }, "ns" : "twitter.tweets", "name" : "user.followers_count_1" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "twitter.foo", "name" : "_id_" }
{ "v" : 1, "key" : { "tags" : 1 }, "ns" : "twitter.foo", "name" : "tags_1" }
> db.foo.reIndex()...

Sparse indexing

A sparse index is one in which documents missing the explicitly indexed fields are simply not included, which narrows the search space. For instance, imagine the following document:

{
    title : "Don't Stop Me Now",
    artist : "Queen",
    metadata :
    {
        genre : "rock",
        length : 120,
        bps : 120,
        key : "A"
    }
}

If the index is built with the fields in the subdocument, metadata, thus:

    > db.foo.ensureIndex( { "metatdata.genre" : 1,
    > ..."metadata.length" : 1,
    > ..."metadata.bps" : 1,
    > ..."metadata.key" : 1 },
    > ...{ sparse : true } } )

...those titles for which there's no record of what key they are in will not encumber the index.

The problem with doing this, however, is that as new fields are conveniently added to the schema, the whole collection must be traversed for each new index added. The advantages to sparse indices are: documents lacking the indexed fields don't encumber the index, which stays small...

...while the disadvantages are: every new metadata field needs an index of its own, and building each one traverses the entire collection.
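
A small demonstration of the sparse behavior (my own sketch, not from the manual): documents without the indexed field never enter the index, so a plan satisfied from the sparse index can omit them.

    > db.nums.insert( { x : 1 } )
    > db.nums.insert( { y : 2 } )                          // no x: left out of the sparse index
    > db.nums.ensureIndex( { x : 1 }, { sparse : true } )
    > db.nums.find().sort( { x : 1 } )                     // if the sort walks the sparse index,
                                                           // the x-less document may not appear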

Multikeying

One solution to this problem, to continue the example, is multikeying, which groups the metadata elements in an array (instead of the subdocument used above), making them all part of the same, single indexed field:

{
    title : "Don't Stop Me Now",
    artist : "Queen",
    metadata :
    [
        { genre : "rock" },
        { length : 120 },
        { bps : 120 },
        { key : "A" }
    ]
}

The index is built and finds conducted thus:

    > db.foo.ensureIndex( { "metadata" : 1 } )
    > db.foo.find( { metadata : { length : 120 } } )

There is a different problem with this approach: anything but a direct match on an array element reverts to the basic cursor, i.e. a simple traversal of the entire collection again. The advantages of this approach were: a single index covers every metadata element, whatever fields documents carry...

...but the disadvantages are: only exact element matches can use the index.

Structured multikeying

To have but a single index, but one that's powerful enough to match on fields no matter how the fields are represented, use this approach. In essence it's an array of key/value pairs.

{
    title : "Don't Stop Me Now",
    artist : "Queen",
    metadata :
    [
        { key : "genre", value : "rock" },
        { key : "length", value : 120 },
        { key : "bps", value : 120 },
        { key : "key", value : "A" }
    ]
}

The index is built and queried thus (an exact pair is matched with $elemMatch):

    > db.foo.ensureIndex( { "metadata.key" : 1, "metadata.value" : 1 } )
    > db.foo.find( { metadata : { $elemMatch : { key : "length", value : 120 } } } )

The advantages to this approach are: one compound index serves every metadata field, present or future, and value matches (exact or range) can use it...

...while the (inevitable) disadvantages are:

The new document is uglier and less straightforward; storage is amplified as "key" and "value" are repeated everywhere, over and over.
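
For example, a range query against the same index (my own sketch; bps comes from the sample document above):

    > db.foo.find( { metadata : { $elemMatch : { key : "bps", value : { $gte : 100, $lte : 140 } } } } )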

4. Schema design

(I took no notes; http://www.mongodb.org/display/DOCS/SQL+to+Mongo+Mapping+Chart is a good resource for recovering SQL addicts.)



Thursday morning

5. Introduction to the drivers

(I took few notes because I'm really a Java guy.)


5. Durability, Availability and Replica Sets

(Yup, the duplicate section number was in fact in the otherwise very nice 10gen manual too. Sorry.)

There's a heartbeat that goes around between mongod processes.

Replication happens in a few seconds or a few minutes.

The primary replica is "elected" by all the active nodes.

By default, you read and write only against the primary (reads from secondaries require slaveOk).

If there's no majority and the primary isn't up, then the system goes into a WAIT state.

How to elect a new primary...

  1. Latest oplog entry.
  2. Priority.

After the former primary comes back up, it's only a secondary.

An arbiter is a node that isn't a replica (performs no database function), but can vote for a new primary.

By default, all replicas start out with equal priority (1).

A capped collection is a circular buffer (the oplog is one).

You can copy the database by freezing the replica for a sufficient number of seconds during the copy (cp -R).



Thursday afternoon

> db.runCommand( { getLastError:1, w:3, wtimeout:1000 } )

--this must not return until the write has been committed to at least 3 replica-set members (or until the 1000 ms wtimeout elapses).

The oplog...

What's in the database?

db
├── digg.0
├── digg.1
├── digg.ns
├── journal
│   ├── j._0
│   ├── prealloc.1
│   └── prealloc.2
├── mongod.lock
├── test.0
├── test.1
├── test.ns
├── training.0
├── training.1
├── training.ns
├── twitter.0
├── twitter.1
├── twitter.2
└── twitter.ns

Exercises, page 23

1.

Three.

2.

DC-A   DC-B
  0      5
  1      4
  2      3
  3      2
  4      1
  5      0

3.

Think about the scenarios above and the answers below.

4.

Three and two.

5.

Nothing can happen as there aren't enough voting members.

6.

?

Exercise, page 26

(I offer a live example of setting up four replica nodes and an arbiter across two pieces of hardware at http://www.javahotchocolate.com/notes/mongodb-replica.html).

Note in what's going on below that I was already using port 30000 in order not to monkey with MongoDB running on 27017.

Start all of this by creating a few subdirectories for the replicas to live in:

$ mkdir rs1 rs2 rs3

erect-replicas.sh:

#!/bin/sh
# subdirectories data/rs1, rs2 and rs3 must already exist.
../bin/mongod --port 30000 --replSet foo --logpath "1.log" --dbpath /data/rs1 --fork
../bin/mongod --port 30001 --replSet foo --logpath "2.log" --dbpath /data/rs2 --fork
../bin/mongod --port 30002 --replSet foo --logpath "3.log" --dbpath /data/rs3 --fork

replica-config.js:

It does no good to run this file through the mongo shell from the command line, because when you then launch the shell interactively, config will still not be defined. Instead, simply copy and paste it into the shell.

config =
    { _id:"foo", members:
        [
            { _id:0, host:"localhost:30000" },
            { _id:1, host:"localhost:30001" },
            { _id:2, host:"localhost:30002" }
        ]
    }

Replica instructions

  1. Create the new subdirectories.
  2. Run erect-replicas.sh to launch the mongod processes for each (this takes a long time).
  3. Launch the mongo shell, copy and paste in the replica config
    $ mongo --port 30000
    > (paste replica-config.js here so that config is defined)
    
  4. Initiate the replicas (this takes a long time):
    > rs.initiate( config )
    
  5. See the status:
    > rs.status()
    

Here's the illustration...

bash shell work...

~/mongo-training/data $ ./erect-replicas.sh
forked process: 5669
all output going to: /home/russ/mongo-training/data/1.log
log file [/home/russ/mongo-training/data/1.log] exists; copied to temporary file [/home/russ/mongo-training/data/1.log.2012-09-27T21-04-17]
child process started successfully, parent exiting
forked process: 5717
all output going to: /home/russ/mongo-training/data/2.log
log file [/home/russ/mongo-training/data/2.log] exists; copied to temporary file [/home/russ/mongo-training/data/2.log.2012-09-27T21-04-17]
forked process: 5724
all output going to: /home/russ/mongo-training/data/3.log
log file [/home/russ/mongo-training/data/3.log] exists; copied to temporary file [/home/russ/mongo-training/data/3.log.2012-09-27T21-04-17]

Mongo shell work...

~/mongo-training/data $ ../bin/mongo --port 30000
MongoDB shell version: 2.2.0
connecting to: 127.0.0.1:30000/test
> config =
... { _id:"foo", members:
... [
... { _id:0, host:"localhost:30000" },
... { _id:1, host:"localhost:30001" },
... { _id:2, host:"localhost:30002" }
... ]
... }
{
    "_id" : "foo",
    "members" : [
        {
            "_id" : 0,
            "host" : "localhost:30000"
        },
        {
            "_id" : 1,
            "host" : "localhost:30001"
        },
        {
            "_id" : 2,
            "host" : "localhost:30002"
        }
    ]
}
> rs.initiate( config )
{
    "info" : "Config now saved locally.  Should come online in about a minute.",
    "ok" : 1
}
> rs.status()
{
    "set" : "foo",
    "date" : ISODate("2012-09-27T21:12:22Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "localhost:30000",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 485,
            "optime" : Timestamp(1348780280000, 1),
            "optimeDate" : ISODate("2012-09-27T21:11:20Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "localhost:30001",
            "health" : 1,
            "state" : 5,
            "stateStr" : "STARTUP2",
            "uptime" : 62,
            "optime" : Timestamp(0, 0),
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2012-09-27T21:12:21Z"),
            "pingMs" : 62
        },
        {
            "_id" : 2,
            "name" : "localhost:30002",
            "health" : 1,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 62,
            "optime" : Timestamp(0, 0),
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2012-09-27T21:12:22Z"),
            "pingMs" : 0
        }
    ],
    "ok" : 1
}
foo:PRIMARY>

Starting a replica...

$ mongod --port 30000 --replSet foo

Looking at replica status...

foo:PRIMARY> rs.conf()
{
    "_id" : "foo",
    "version" : 1,
    "members" : [
        {
            "_id" : 0,
            "host" : "localhost:30000"
        },
        {
            "_id" : 1,
            "host" : "localhost:30001"
        },
        {
            "_id" : 2,
            "host" : "localhost:30002"
        }
    ]
}

Here's how to modify a configuration element:

foo:PRIMARY> rs.conf()
{
    "_id" : "foo",
    "version" : 1,
    "members" : [
        {
            "_id" : 0,
            "host" : "localhost:30000"
        },
        {
            "_id" : 1,
            "host" : "localhost:30001"
        },
        {
            "_id" : 2,
            "host" : "localhost:30002"
        }
    ]
}
foo:PRIMARY> var config=rs.conf()
foo:PRIMARY> config.members[ 2 ].priority = 0
0
foo:PRIMARY> rs.reconfig( config )
Thu Sep 27 16:00:47 DBClientCursor::init call() failed
Thu Sep 27 16:00:47 query failed : admin.$cmd { replSetReconfig: { _id: "foo", version: 2, members: [ { _id: 0, host: "localhost:30000" }, { _id: 1, host: "localhost:30001" }, { _id: 2, host: "localhost:30002", priority: 0.0 } ] } } to: 127.0.0.1:30000
Thu Sep 27 16:00:47 trying reconnect to 127.0.0.1:30000
Thu Sep 27 16:00:47 reconnect 127.0.0.1:30000 ok
reconnected to server after rs command (which is normal)

foo:PRIMARY> rs.conf()
{
    "_id" : "foo",
    "version" : 2,
    "members" : [
        {
            "_id" : 0,
            "host" : "localhost:30000"
        },
        {
            "_id" : 1,
            "host" : "localhost:30001"
        },
        {
            "_id" : 2,
            "host" : "localhost:30002",
            "priority" : 0
        }
    ]
}
foo:PRIMARY>

As soon as you reconfigure, the shell loses connection with the primary until it reconnects.

foo:PRIMARY> rs.status()
{
    "set" : "foo",
    "date" : ISODate("2012-09-27T22:02:39Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "localhost:30000",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 3502,
            "optime" : Timestamp(1348783247000, 1),
            "optimeDate" : ISODate("2012-09-27T22:00:47Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "localhost:30001",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 110,
            "optime" : Timestamp(1348783247000, 1),
            "optimeDate" : ISODate("2012-09-27T22:00:47Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:02:39Z"),
            "pingMs" : 0,
            "errmsg" : "syncing to: localhost:30000"
        },
        {
            "_id" : 2,
            "name" : "localhost:30002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 112,
            "optime" : Timestamp(1348783247000, 1),
            "optimeDate" : ISODate("2012-09-27T22:00:47Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:02:39Z"),
            "pingMs" : 0,
            "errmsg" : "syncing to: localhost:30000"
        }
    ],
    "ok" : 1
}

The errmsg isn't really an error at all.

Hiding a node...

foo:PRIMARY> var config=rs.conf()
foo:PRIMARY> config.members[ 2 ].hidden = true
true
foo:PRIMARY> rs.reconfig( config )
Thu Sep 27 16:08:06 DBClientCursor::init call() failed
Thu Sep 27 16:08:06 query failed : admin.$cmd { replSetReconfig: { _id: "foo", version: 3, members: [ { _id: 0, host: "localhost:30000" }, { _id: 1, host: "localhost:30001" }, { _id: 2, host: "localhost:30002", priority: 0.0, hidden: true } ] } } to: 127.0.0.1:30000
Thu Sep 27 16:08:06 trying reconnect to 127.0.0.1:30000
Thu Sep 27 16:08:06 reconnect 127.0.0.1:30000 ok
reconnected to server after rs command (which is normal)

foo:PRIMARY> rs.conf()
{
    "_id" : "foo",
    "version" : 3,
    "members" : [
        {
            "_id" : 0,
            "host" : "localhost:30000"
        },
        {
            "_id" : 1,
            "host" : "localhost:30001"
        },
        {
            "_id" : 2,
            "host" : "localhost:30002",
            "priority" : 0,
            "hidden" : true
        }
    ]
}
foo:PRIMARY>

Looking at the oplog...

foo:PRIMARY> use local
switched to db local
foo:PRIMARY> show tables
oplog.rs
slaves
system.indexes
system.replset
foo:PRIMARY> db.me.find()
foo:PRIMARY> use twitter
switched to db twitter
foo:PRIMARY> db.tweets.insert( { "name", "Smith" } )
Thu Sep 27 16:11:28 SyntaxError: missing : after property id (shell):1
foo:PRIMARY> use local
switched to db local
foo:PRIMARY> show collections
oplog.rs
slaves
system.indexes
system.replset
foo:PRIMARY> db.oplog.rs.find()
{ "ts" : Timestamp(1348780280000, 1), "h" : NumberLong(0), "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : Timestamp(1348783247000, 1), "h" : NumberLong("6246938530832973841"), "op" : "n", "ns" : "", "o" : { "msg" : "Reconfig set", "version" : 2 } }
{ "ts" : Timestamp(1348783686000, 1), "h" : NumberLong("-9112696901504822284"), "op" : "n", "ns" : "", "o" : { "msg" : "Reconfig set", "version" : 3 } }

bind_ip

This isn't explained very well. bind_ip is used particularly when the host has more than one IP address (interface) associated with it and you don't want MongoDB to be listening on every one of them, but only one or two. It accepts a comma-delimited list.
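
For example, in mongodb.conf (the addresses here are hypothetical):

bind_ip = 127.0.0.1,192.168.1.10

...or on the command line:

$ mongod --bind_ip 127.0.0.1,192.168.1.10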

Adding an arbiter...

This example isn't very complete, but here's how. To add an arbiter, you need

  1. The port on which it will listen.
  2. The name of the replica set.

Launch it thus:

    $ mongod --port 37016 --replSet foo

Add it using the MongoDB shell:

    $ mongo --port 30000
    foo:PRIMARY> rs.addArb( "localhost:37016" )

You'll see the arbiter using rs.status() along with the rest of the replica set details; it appears with "stateStr" : "ARBITER".

Exercises, page 27

1.

foo:PRIMARY> config.members[ 2 ].slaveDelay = 60
60
foo:PRIMARY> rs.reconfig( config )
Thu Sep 27 16:24:35 DBClientCursor::init call() failed
Thu Sep 27 16:24:35 query failed : admin.$cmd { replSetReconfig: { _id: "foo", version: 4, members: [ { _id: 0, host: "localhost:30000" }, { _id: 1, host: "localhost:30001" }, { _id: 2, host: "localhost:30002", priority: 0.0, hidden: true, slaveDelay: 60.0 } ] } } to: 127.0.0.1:30000
Thu Sep 27 16:24:35 trying reconnect to 127.0.0.1:30000
Thu Sep 27 16:24:35 reconnect 127.0.0.1:30000 ok
reconnected to server after rs command (which is normal)

foo:PRIMARY> rs.conf()
{
    "_id" : "foo",
    "version" : 4,
    "members" : [
        {
            "_id" : 0,
            "host" : "localhost:30000"
        },
        {
            "_id" : 1,
            "host" : "localhost:30001"
        },
        {
            "_id" : 2,
            "host" : "localhost:30002",
            "priority" : 0,
            "slaveDelay" : 60,
            "hidden" : true
        }
    ]
}

2.

foo:PRIMARY> rs.status()
{
    "set" : "foo",
    "date" : ISODate("2012-09-27T22:25:43Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "localhost:30000",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 4886,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "localhost:30001",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 66,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:25:41Z"),
            "pingMs" : 0,
            "errmsg" : "syncing to: localhost:30000"
        },
        {
            "_id" : 2,
            "name" : "localhost:30002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 66,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:25:41Z"),
            "pingMs" : 0,
            "errmsg" : "syncing to: localhost:30000"
        }
    ],
    "ok" : 1
}
foo:PRIMARY> db.adminCommand( { replSetStepDown:1 } )
Thu Sep 27 16:30:12 DBClientCursor::init call() failed
Thu Sep 27 16:30:12 query failed : admin.$cmd { replSetStepDown: 1.0 } to: 127.0.0.1:30000
Thu Sep 27 16:30:12 Error: error doing query: failed src/mongo/shell/collection.js:155
Thu Sep 27 16:30:12 trying reconnect to 127.0.0.1:30000
Thu Sep 27 16:30:12 reconnect 127.0.0.1:30000 ok

foo:SECONDARY> rs.status()
{
    "set" : "foo",
    "date" : ISODate("2012-09-27T22:32:45Z"),
    "myState" : 2,
    "syncingTo" : "localhost:30001",
    "members" : [
        {
            "_id" : 0,
            "name" : "localhost:30000",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 5308,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "localhost:30001",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 150,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:32:43Z"),
            "pingMs" : 0
        },
        {
            "_id" : 2,
            "name" : "localhost:30002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 150,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:32:43Z"),
            "pingMs" : 0
        }
    ],
    "ok" : 1
}

3.

~/mongo-training/data $ ps -ef | grep [m]ongo
mongodb   1343     1  0 09:11 ?        00:01:08 /usr/bin/mongod --config /etc/mongodb.conf
russ      5490     1  0 14:57 ?        00:00:38 ../bin/mongod --port 30001 --replSet foo --logpath 2.log --dbpath /data/rs2 --fork
russ      5578     1  0 14:59 ?        00:00:37 ../bin/mongod --port 30002 --replSet foo --logpath 3.log --dbpath /data/rs3 --fork
russ      5669     1  0 15:04 ?        00:00:30 ../bin/mongod --port 30000 --replSet foo --logpath 1.log --dbpath /data/rs1 --fork
russ      5767  2085  0 15:08 pts/2    00:00:00 ../bin/mongo --port 30000
russ      7936  5124  0 16:26 pts/4    00:00:00 mongo --port 30001
~/mongo-training/data $ kill -9 5669

And, from the Mongo shell that was looking at the mongod process we just killed...

foo:SECONDARY> rs.status()
Thu Sep 27 16:36:31 DBClientCursor::init call() failed
Thu Sep 27 16:36:31 query failed : admin.$cmd { replSetGetStatus: 1.0 } to: 127.0.0.1:30000
Thu Sep 27 16:36:31 Error: error doing query: failed src/mongo/shell/collection.js:155
Thu Sep 27 16:36:31 trying reconnect to 127.0.0.1:30000
Thu Sep 27 16:36:31 reconnect 127.0.0.1:30000 failed couldn't connect to server 127.0.0.1:30000

But, from the original secondary, now become primary (in exercise 2):

foo:PRIMARY> rs.status()
{
    "set" : "foo",
    "date" : ISODate("2012-09-27T22:37:36Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "localhost:30000",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:35:46Z"),
            "pingMs" : 0,
            "errmsg" : "socket exception [CONNECT_ERROR] for localhost:30000"
        },
        {
            "_id" : 1,
            "name" : "localhost:30001",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 5985,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "localhost:30002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 780,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:37:34Z"),
            "pingMs" : 0
        }
    ],
    "ok" : 1
}

4. Now, use the original command that created the original primary, which we've just killed, to set a replica back up:

~/mongo-training/data $ cat erect-replicas.sh
#!/bin/sh
# subdirectories data/rs1, rs2 and rs3 must already exist.
../bin/mongod --port 30000 --replSet foo --logpath "1.log" --dbpath /data/rs1 --fork
../bin/mongod --port 30001 --replSet foo --logpath "2.log" --dbpath /data/rs2 --fork
../bin/mongod --port 30002 --replSet foo --logpath "3.log" --dbpath /data/rs3 --fork

~/mongo-training/data $ ../bin/mongod --port 30000 --replSet foo --logpath "1.log" --dbpath /data/rs1 --fork
forked process: 8767
all output going to: /home/russ/mongo-training/data/1.log
log file [/home/russ/mongo-training/data/1.log] exists; copied to temporary file [/home/russ/mongo-training/data/1.log.2012-09-27T22-39-11]
child process started successfully, parent exiting

No! Don't do rs.add() for a replica that already existed. This is not the exercise. Create a brand new replica! So, kill this one for now and don't use it.

~/mongo-training/data $ ps -ef | grep [m]ongo
mongodb   1343     1  0 09:11 ?        00:01:09 /usr/bin/mongod --config /etc/mongodb.conf
russ      5490     1  0 14:57 ?        00:00:39 ../bin/mongod --port 30001 --replSet foo --logpath 2.log --dbpath /data/rs2 --fork
russ      5578     1  0 14:59 ?        00:00:38 ../bin/mongod --port 30002 --replSet foo --logpath 3.log --dbpath /data/rs3 --fork
russ      5767  2085  0 15:08 pts/2    00:00:00 ../bin/mongo --port 30000
russ      7936  5124  0 16:26 pts/4    00:00:00 mongo --port 30001
russ      8767     1  1 16:39 ?        00:00:02 ../bin/mongod --port 30000 --replSet foo --logpath 1.log --dbpath /data/rs1 --fork
~/mongo-training/data $ kill -9 8767
~/mongo-training/data $ ../bin/mongod --port 30003 --replSet foo --logpath "4.log" --dbpath /data/rs4 --fork
forked process: 9031
all output going to: /home/russ/mongo-training/data/4.log

Add the new replica...

foo:PRIMARY> rs.add( "localhost:30003" )
{
    "errmsg" : "exception: need most members up to reconfigure, not ok : localhost:30003",
    "code" : 13144,
    "ok" : 0
}

Oops, didn't create the new subdirectory, so it didn't run:

~/mongo-training/data $ mkdir rs4
~/mongo-training/data $ ../bin/mongod --port 30003 --replSet foo --logpath "4.log" --dbpath /data/rs4 --fork
forked process: 9352
all output going to: /home/russ/mongo-training/data/4.log
log file [/home/russ/mongo-training/data/4.log] exists; copied to temporary file [/home/russ/mongo-training/data/4.log.2012-09-27T22-46-35]
child process started successfully, parent exiting

Try it again:

foo:PRIMARY> rs.add( "localhost:30003" )
{ "down" : [ "localhost:30000" ], "ok" : 1 }

foo:PRIMARY> rs.status()
{
    "set" : "foo",
    "date" : ISODate("2012-09-27T22:49:27Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "localhost:30000",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:44:05Z"),
            "pingMs" : 0,
            "errmsg" : "socket exception [CONNECT_ERROR] for localhost:30000"
        },
        {
            "_id" : 1,
            "name" : "localhost:30001",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 6696,
            "optime" : Timestamp(1348786131000, 1),
            "optimeDate" : ISODate("2012-09-27T22:48:51Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "localhost:30002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 1491,
            "optime" : Timestamp(1348784675000, 1),
            "optimeDate" : ISODate("2012-09-27T22:24:35Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:49:27Z"),
            "pingMs" : 0
        },
        {
            "_id" : 3,
            "name" : "localhost:30003",
            "health" : 1,
            "state" : 5,
            "stateStr" : "STARTUP2",
            "uptime" : 36,
            "optime" : Timestamp(0, 0),
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:49:25Z"),
            "pingMs" : 0
        }
    ],
    "ok" : 1
}

After a few minutes, this will change to a full-blown SECONDARY:

foo:PRIMARY> rs.status()
{
    "set" : "foo",
    "date" : ISODate("2012-09-27T22:49:27Z"),
    "myState" : 1,
    "members" : [
        ...
        {
            "_id" : 3,
            "name" : "localhost:30003",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 122,
            "optime" : Timestamp(1348786131000, 1),
            "optimeDate" : ISODate("2012-09-27T22:48:51Z"),
            "lastHeartbeat" : ISODate("2012-09-27T22:50:51Z"),
            "pingMs" : 0
        }
 ],
 "ok" : 1
}

"Real" or cross-host replicas

This would be the real way to set up replicas (as compared to what we did above), i.e.: one node per hardware host. The steps to do this are:

  1. On each system, start mongod:
    $ mongod --config config-file      # (typically /etc/mongodb.conf)
    
  2. Open a mongo shell connected to the host; this could be on a remote host.
    $ mongo [ --port port-number ] [ --host hostname ]
    
  3. Use rs.initiate() to start the replica set for the current member.
  4. Display the current configuration using rs.conf().
  5. Add members to the replica set using rs.add( hostname ) for each replica node; these hostnames are likely resolved by DNS or added to /etc/hosts. After this, a fully functional replica set will elect a primary within seconds.
  6. Check the status of the replica set at any time using rs.status().

A simple configuration file might contain:

port=37018
replSet=myset
logpath=/mongodb/data/myset.log
dbpath=/mongodb/data/americas
fork=true
logappend=true

10. Host Configuration and Deployment Notes

See page 35.

RAID 10 combines RAID 1 (mirroring) with RAID 0 (striping).

A good idea is to use a couple of replicas to stand in for a backup.

fsync

flushes and comes back...

foo:PRIMARY> use admin
switched to db admin
foo:PRIMARY> db.runCommand( { fsync:1 } )
{ "numFiles" : 6, "ok" : 1 }

with lock, locks against writes...

foo:PRIMARY> db.runCommand( { fsync:1, lock:true } )
{
    "info" : "now locked against writes, use db.fsyncUnlock() to unlock",
    "seeAlso" : "http://dochub.mongodb.org/core/fsynccommand",
    "ok" : 1
}
foo:PRIMARY> db.fsyncUnlock()
{ "ok" : 1, "info" : "unlock completed" }


Friday morning

11. Monitoring MongoDB

(Some monitoring yesterday afternoon already...)

Using mongostat...

~/mongo-training/data $ ../bin/mongostat --port 30000 2
connected to: 127.0.0.1:30000
insert  query update delete getmore command flushes mapped  vsize    res faults    locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       0       0   448m     1g    95m      0 twitter:0.0%          0       0|0     0|0    31b     1k     2   10:08:49
     0      0      0      0       0       0       0   448m     1g    95m      0 twitter:0.0%          0       0|0     0|0    31b     1k     2   10:08:51
     0      0      0      0       0       0       0   448m     1g    95m      0 twitter:0.0%          0       0|0     0|0    31b     1k     2   10:08:53
     0      0      0      0       0       0       0   448m     1g    95m      0 twitter:0.0%          0       0|0     0|0    31b     1k     2   10:08:55
^C

mongotop...

~/mongo-training/data $ ../bin/mongotop --port 30000
connected to: 127.0.0.1:30000

                            ns       total        read       write        2012-09-28T16:10:17
     twitter.system.namespaces         0ms         0ms         0ms
        twitter.system.indexes         0ms         0ms         0ms
    training.system.namespaces         0ms         0ms         0ms
       training.system.indexes         0ms         0ms         0ms
        test.system.namespaces         0ms         0ms         0ms
           test.system.indexes         0ms         0ms         0ms
^C

3. Indexing: database profiling...

> use test
switched to db test
> show collections
exercise
foo
people
system.indexes
> db.setProfilingLevel( 1, 250 )
{ "was" : 0, "slowms" : 100, "ok" : 1 }
> db.foo.insert( { "party":"pooper" } )
> db.foo.find().pretty()
{ "_id" : ObjectId("506365b29cf47e02347c32d0"), "a" : 100, "b" : 200 }
{ "_id" : ObjectId("506365b79cf47e02347c32d1"), "a" : 50, "b" : 200 }
{ "_id" : ObjectId("5063674f9cf47e02347c32d2"), "a" : 100 }
{ "_id" : ObjectId("5065ce8913c1801431eb47fe"), "party" : "pooper" }

8. Sharding

A shard consists of a replica set (or even just a single node, but that's not reliable). Sharding is per collection: documents are partitioned by a shard key made up of one or more fields, and therefore a shard key belongs to a single collection.

See http://www.mongodb.org/display/DOCS/Sharding+Introduction.

If you wish a replica set to be a shard, you add an option. See page 31, second line:

$ mongod --replSet ... --shardsvr

Sharding makes use of:

  1. mongos instance on each application node,*
  2. mongod instance as a configuration server on three separate VMs,**
  3. mongod instance on each replica node.

* Note, however, that this makes problems a lot harder to debug. Do this for performance, but not in developer or QA environments where you would be better off just using a single mongos.
** In developer and QA environments, it's unnecessary to erect more than one configuration server.

mongos, the MongoDB sharding router

The mongos provides an interface for applications to interact with sharded clusters, hiding the complexity of data partitioning. It receives queries from applications and uses metadata from the configuration servers to route them to the appropriate mongod instances (on the various replica nodes). From the application's perspective, mongos behaves identically to a mongod, but it is, of course, more of a router to one or more mongods.

A mongos instance is lightweight and doesn't require a data directory. It can be run on an application server or even a server running a mongod process. By default, it runs on port 27017.

The mongos binary consists of a balancer and a router. Everything else is mongod. The configuration servers are mongod binaries (as shown on this page).

            shard 1    shard 2
documents   1-50       51-100
nodes       s1a        s2a
            s1b        s2b
            s1c        s2c

Note that a document could switch shards. Sharding doesn't work at the document level, but at the chunk level.

Shard 1, documents 1-50, contains 10 chunks (a hypothesized example); shard 2, documents 51-100, holds twenty chunks (but was originally 10). The balancer may ask a shard how many chunks it holds. If the difference is greater than 8, some chunks will be moved from the larger shard to the smaller one. After chunk migration, the migrated chunks exist in both shards. The map is updated at the configuration servers, then the unneeded chunks are removed from the shard giving them up. Even if the system went down before the delete happened, the shard would no longer respond to requests for the migrated chunks.

Chunks are (up to) 64MB; as soon as a chunk grows bigger than that, it is split.

MongoDB 2.2 has "tagged" sharding that obviates maintaining our continent distinction. If you wish to shard such that all of a country's documents end up in a shard whose primary is located in a data center on a particular continent, you'd need to tag them (since using the ISO country code isn't naturally going to result in all of Europe ending up in the EMEA data center). A solution might be to create a continent code to go along with the country code, but with tagging, that's not necessary.
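
A sketch using the 2.2 helpers sh.addShardTag() and sh.addTagRange(); the shard name, namespace, and country values here are hypothetical:

mongos> sh.addShardTag( "emea1", "EMEA" )
mongos> sh.addTagRange( "geo.users", { country:"DE" }, { country:"DF" }, "EMEA" )

The range's minimum is inclusive and its maximum exclusive, so this tags every document whose country is "DE".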

Run the mongos binary on each application server.

Configuration server

A mongod, called a configuration (or config) server, maintains sharded-cluster metadata in a configuration database for the mongos instances. A sharded cluster operates with a group of three configuration servers that use a two-phase commit process, ensuring immediate consistency and reliability. For testing, it's possible to deploy a single configuration server, but this isn't recommended for production, as that mongod would be a single point of failure.

Each configuration server instance can run alongside the usual mongod instance of a common replica node.

Sharding (continued)...

mongos only talks to C1 (the first of the three configuration servers shown). When that one is gone, it talks to C2, etc.

If the query contains the shard key, it will go directly to the shard containing it. See this presentation:
http://www.slideshare.net/fullscreen/TylerBrock/sharding-mongo-dc-2012/71

Queries

By Shard Key   Routed    db.users.find( { email:"[email protected]" } );

How to choose shard key?

  • (Rule of thumb) Choose a field that's common to queries.
  • Chunks should be able to split.

See the bad example (in the slide presentation):

{ node:1 }            // bad: all of one node's documents share a key value, so its chunks can't split
{ node:1, time:1 }    // better: adding time gives the chunks room to split

When results must be merged from multiple shards, mongos sorts everything in memory.

Try to make it so that writes are distributed across the shards, but reads go directly to a single shard.

Sharding instructions...

...how to set up MongoDB sharding. This assumes the presence of one (or more) existing replica sets, ready to be inducted into a shard. In this example of setting up an existing replica set as one shard, our replica set is named "my-replica". Three kinds of port numbers appear below: that of the configuration server we create here, that of the mongos we create here, and that of any node of the existing replica set, whose set-up is not shown here (see higher up in this document for how to do that).

  1. Stand up a MongoDB configuration server or servers: three in production, each on different host hardware (or VM). In testing, one is sufficient, and it needn't be on separate hardware (or a separate VM). (No matter where it is set up, its path and port must be distinct from all other mongod and mongos paths and ports.)
        $ mongod --configsvr --dbpath path --port port

    ...where

    • --dbpath is where the data files that MongoDB creates to hold the configuration data will be kept
    • --port is the (new) port by which the configuration server will be contacted; each configuration server has its own port number.
  2. Start the mongos dæmon.
        $ mongos --configdb hostname:port[,hostname:port,hostname:port] \
                 --port port \
                 --logpath path

    ...where

    • the hostnames and ports are those used to create the configuration servers in the previous step
    • a hostname must be an IP address, a name found in DNS, or a name found in /etc/hosts
    • --port is the (new) port by which this instance of mongos will be contacted
    • --logpath is where the mongos will do its logging

    Repeat this step for as many instances of mongos as you need. As noted elsewhere, this would likely be a single one for testing, but perhaps one on each application node in production.

  3. Add shards, to wit: a replica set. In this example, we add one replica set as a shard. Launch the MongoDB shell to do this.
        $ mongo --port port
        > sh.addShard( "my-replica/hostname:port" )

    ...where

    • the port number on which the shell is run is that of the mongos created in the previous step
    • "my-replica" is the name (we warned you earlier) of the replica set we're using for this example
    • the hostname and port used to add the shard are those of any one of the nodes in the replica set

    Repeat sh.addShard() once for each replica set (shard) you want in the cluster; the registration is kept in the configuration servers and shared by every mongos, so it needn't be repeated per mongos. Do not attach a replica set to more than one sharded cluster.

  4. If you are only using one shard (which is a legitimate topology), you do not need to perform the next step.
  5. Shard the collections you want to spread across multiple shards. This is done from the MongoDB shell session begun in the previous step. We're going to pretend that our sample collection, "bar", is in a database named "foodb". Moreover, our shard key will be a field named "drinkid" plus the ObjectId _id of the documents in the collection.
        > sh.enableSharding( "foodb" )
        > sh.shardCollection( "foodb.bar", { drinkid:1, _id:1 } )

Instructions for two sharded replica sets

The previous set of instructions was for a simple, single shard and its replica set. That isn't much of an example because, while the single sharded replica set is a useful concept, real, industrial-strength sharding involves multiple shards (and their replica sets). In general, follow the instructions above for detail that's not as fully developed in the following. These steps carry the same numbering as the previous ones.

  1. Stand up three MongoDB configuration servers. Remember their port numbers. (See same step above.)
  2.  
  3. Configure a mongos dæmon using the configuration-server dæmon hostnames and port numbers. Remember the port number you specify for this mongos or, if you didn't specify one, it will be 27017 (the default) for the next step. (See same step above.)
  4.  
  5. Add shards by launching a shell and using sh.addShard(). Here's how:
    1. Launch a shell...
          # mongo --host hostname --port port
      

      ...where hostname is the hostname (or IP address) of the host running the mongos and port is the port number, possibly assigned in the mongos' configuration file, or on its command line at launch, on which the mongos is listening. (Remember, the default is 27017.)

    2. Issue the following commands to the MongoDB shell, one per each replica set:
          > sh.addShard( "first-replica-set-name/hostname:port" )
          > sh.addShard( "second-replica-set-name/hostname:port" )
      

    For example, let's say that we connected a shell to our mongos dæmon thus:

        # mongo --host 16.86.192.103 --port 47017
    

    ...and our replica sets were named "humpity" and "dumpity", we would add our replica sets in this way:

        > sh.addShard( "humpity/16.86.192.103:47017" )
        > sh.addShard( "dumpity/16.86.192.103:47017" )
    
  6. For questions about whether you should shard an entire database or only a collection, and which MongoDB shell commands to use to proceed, please see Deploy a Sharded Cluster.

Configuration files

It's much easier to do this stuff with configuration files than by hand, especially each time one, two, or all of the elements of a complex cluster go down. These configuration files are required:

  1. Files that go in /etc/init for upstart modeled upon the one installed there, /etc/init/mongodb.conf.
  2. Files that replace the /etc/mongodb.conf, which is the basic MongoDB configuration, and are placed in one of the following common places:
    • /etc/mongodb.conf —not good because it's easily clobbered.
    • /data/mongodb/mongodb.conf —a more common, customized location.
    • another customized location.


Friday afternoon

TTL - "time to live": "Any document that is 3 days old or older will be deleted."

In MongoDB parlance, a cluster is a sharded set-up.

Siddarth did the exercise on page 31 under OS X. There were two egregious typos in the manual: command-line argument order, and the use of an invalid port number (the highest possible port number is 65535).

$ mongos --configdb localhost:57017,localhost:57018,localhost:57019 \
         --logpath "mongos-1.log" --port 60000 --fork

Then use the shell to connect to the mongos; again, the manual's port number is bogus, so use 60000.

Suggestion: specify a smaller chunk size, as 64MB is too big for a simple test/exercise.
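
One way to do that (a sketch; 1MB is an arbitrarily small value for the exercise) is to overwrite the chunksize document in the config database from a shell attached to the mongos:

mongos> use config
switched to db config
mongos> db.settings.save( { _id:"chunksize", value:1 } )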

Fill the database collection with records before invoking db.adminCommand( { addshard:"s1/localhost:37017" } ):

for( i = 0; i < 10000; i++ )
{
    db.users.insert( { a:i } )
}
mongos> sh.shardCollection( "test.users", { a:1 } )

Have to create an index on the shard key, since the collection already holds documents...

mongos> db.users.ensureIndex( { a:1 } )

Stop the balancer...

> sh.stopBalancer()
Waiting for active hosts...
Waiting for the balancer lock...
Waiting again for active hosts after balancer is off...

Appendix B: GridFS

GridFS is a datastore for big files; it can store, retrieve, list, and remove them.

GridFS fs = new GridFS( ... );    // e.g., in the Java driver

In the database, the chunks collection is the file (in small segments) and the files collection is its metadata.

To see how many chunks there are:

> use test
switched to db test
> db.fs.chunks.count()

mongofiles is a binary that uploads files to MongoDB.
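
For example (a sketch; the file name is hypothetical):

$ mongofiles --port 30000 put photo.png
$ mongofiles --port 30000 list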

Why would you want to store files in a database in the first place? There can be valid reasons for doing it; the best one is reducing complexity. This comes at some expense, however, and it might be worth it. Let's say you have 10GB of data in files like PDFs, images, etc....

The major advantages to doing this are:

  • files will be copied to each of your replicas
  • files will be backed up with your db
  • they're easy to associate with other data and add meta data to
  • don't have to replicate/manage where files are (as when stored statically on multiple web servers)

The major disadvantages are:

  • replica resynchronization will take longer, as it has 10GB more to synchronize
  • back-ups, unless excluded, will take longer, as they cover 10GB more data
  • wasted storage and RAM on stuff that doesn't change
  • it's much slower than serving the content from Apache (etc.) directly as static content

Unless really trying to keep complexity down or have some other special use case, store files like this on Amazon S3 or equivalent.

7. Back-ups and restores

mongodump, mongorestore operate using BSON. mongoimport, mongoexport deal in JSON.

mongodump creates a dump subdirectory with one deeper subdirectory per database. In each, there is a collection.bson file containing the actual data, and a collection.metadata.json file, which is human-readable and contains the metadata.

It's a document-level dump. The option --oplog will dump the oplog too. Other options include --host host, --port port, etc.
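
A sketch of a dump and restore against the training instance, capturing the oplog for a point-in-time restore:

$ mongodump --port 30000 --oplog
$ mongorestore --port 30000 --oplogReplay dump/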

Could instead use MongoDB's fsync command to lock out writes, then copy the data files with Unix rsync.
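
A sketch of that approach (the back-up destination path is hypothetical):

foo:PRIMARY> db.runCommand( { fsync:1, lock:true } )
$ rsync -a /var/lib/mongodb/ /backups/mongodb/
foo:PRIMARY> db.fsyncUnlock()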

Really, though, rely on well-formed and well-behaved replicas to ensure a back-up.

9. Security

$ mongod --auth
$ mongo
> use admin
switched to db admin
> db.addUser( 'sid', 'sid' )
{
    "user" : "sid",
    "readOnly" : false,
    "pwd" : "2cf3ac4bac67006e2ba5795b15b954bb",
    "_id" : ObjectId("5066116413c1801431eb4800")
}
> show dbs
Fri Sep 28 15:07:45 uncaught exception: listDatabases failed:{ "errmsg" : "need to login", "ok" : 0 }
> exit
bye
~/mongo-training/data $ mongo --port 30000 -u sid -p sid admin
MongoDB shell version: 2.2.0
connecting to: 127.0.0.1:30000/admin
> show dbs
admin    0.203125GB
config    0.203125GB
digg    0.203125GB
local    (empty)
test    0.203125GB
training    0.203125GB
twitter    0.453125GB

Now we're here as the admin user. Add another user, but you must add it to a chosen database. Thereafter, you log in with that user (for example):

> use twitter
> db.addUser( 'dan', 'dan', true )
{
    "user" : "dan",
    "readOnly" : true,
    "pwd" : "d41d78804bfdccc043d535435c39db94",
    "_id" : ObjectId("506612051107781f30ddf871")
}
> exit
bye
~/mongo-training/data $ mongo --port 30000 -u dan -p dan twitter

See this stuff...

> db
twitter
> db.system.users.find()
{ "_id" : ObjectId("5066116413c1801431eb4800"), "user" : "sid", "readOnly" : false, "pwd" : "2cf3ac4bac67006e2ba5795b15b954bb" }
{ "_id" : ObjectId("506612051107781f30ddf871"), "user" : "dan", "readOnly" : true, "pwd" : "d41d78804bfdccc043d535435c39db94" }

For inter-shard communication, launch shards using

$ mongod ... --keyFile key.txt ...

key.txt contains a shared secret used to authenticate the cluster's members to one another; it must be identical across all shards.
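
A common way to generate such a key file (a sketch):

$ openssl rand -base64 64 > key.txt
$ chmod 600 key.txt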

Appendix C: Aggregation

http://www.10gen.com/presentations/mongosv-2011/mongodbs-new-aggregation-framework
http://docs.mongodb.org/manual/applications/aggregation*
http://api.mongodb.org/wiki/current/Aggregation.html

* The aggregation pipeline begins with the collection articles and selects the author and tags fields using the $project aggregation operator. The $unwind operator produces one output document per tag. Finally, the $group operator pivots these fields.
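
A sketch of that pipeline in the shell (the articles collection and its author/tags fields follow the linked documentation's example):

> db.articles.aggregate(
      { $project : { author:1, tags:1 } },
      { $unwind : "$tags" },
      { $group : { _id : "$tags", authors : { $addToSet : "$author" } } }
  )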


Wednesday morning-----------------------
        1. MongoDB Introduction
        MongoDB shell
        2. CRUD and the MongoDB shell
        Exercise, inserting
        Exercise, page 9
        Auto-generated indices
        Embedded documents
        Getting server status
        Links
Wednesday afternoon---------------------
        Embedded documents (2)
        Exercises, page 11
        Updating documents
        Exercises, page 12
        3. Indexing
        Sparse indexing
Thursday morning------------------------
        4. Schema design
Thursday afternoon----------------------
        5. Introduction to the drivers
        6. Durability, availability and replica sets
        oplog
        Exercises, page 23
        Exercises, page 26
        erect-replicas.sh
        replica-config.js
        Replica instructions
        Starting a replica
        Hiding a node
        bind_ip
        Adding an arbiter
        Exercises, page 27
        "Real" or cross-host replicas
        10. Host configuration/deployment
        fsync
Friday morning--------------------------
        11. Monitoring MongoDB
        3. Indexing: database profiling
        8. Sharding
        mongos, the MongoDB sharding router
        Configuration server
        Sharding (continued)
        Sharding queries
        Sharding instructions
        Instructions for two sharded replica sets
        Configuration files
Friday afternoon------------------------
        Appendix B: GridFS
        7. Back-ups and restores
        9. Security
        Appendix C: Aggregation