December 20, 2018

Intro to MongoDB (part-1)

I don’t like feeling dumb. I know this is a weird way to start a blog post. I detest feeling out of my element and inadequate. As the tech world continues to inexorably advance – exponentially even, the likelihood that I will keep running into those feelings becomes greater and greater. However, to try to combat this, I will have a number of projects to learn new products in the works. Since there is a title on this blog post and I have shortsightedly titled it the tech that I will be attempting to learn, it would be rather anticlimactic to say what is it now. Jumping in….

What is MongoDB?

The first question is what is MongoDB and what makes it different from other database programs out there? Say for example MS SQL or MySQL or PostgreSQL? To answer that question, I will need to describe a bit of the architecture. (And yes, I am learning this all as I go along. I had a fuzzy idea of what databases were and used them myself for many things. But if you asked me the difference between a relational and non-relational DB and I would have had to go to Google to answer you.) The two main types of databases out there are relational and non-relational. Trying to figure out the simple difference between them was confusing. The simplest way of defining it is using the relational model definition from Wikipedia. ” ..all data is represented in terms of tuples, grouped into relations.” The issue with this was it didn’t describe it well enough for me to understand. So, I kept looking for a simpler definition. The one I found and liked is the following, (found on stackoverflow), “I know exactly how many values (attributes) each row (tuple) in my table (relation) has and now I want to exploit that fact accordingly, thoroughly, and to its extreme.” There are other differences – such as how relational databases are more difficult to scale and work with large datasets. You also need to define all the data you will need before you create the relational DB. Unstructured data or unknowns are difficult to plan for and you may not be able to.

So, what is MongoDB then? Going back to our good friend Wikipedia, it is a cross-platform document-oriented database program. It is also defined as a NoSQL Database. There are number of different types of NoSQL though (this is where I really start feeling out of my element and dumb). There are:

  1. Document databases – These pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
  2. Graph stores – These are used to store information about networks of data such as social connections
  3. Key-Value stores – are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or key) together with its value.
  4. Wide column stores – are optimized for queries over large datasets, and store columns of data together, instead of rows. One example of this type is Cassandra

Why NoSQL?

In our speed crazed society, we value performance. Sometimes too much. But still, performance. SQL Databases were not built to scale easily and to handle the amount of data that some orgs need. To this end, NoSQL databases were built to provide superior performance and the ability to scale easily. Things like Auto-Sharding (distribution of data between nodes), replication without third party software or add-ons, and easy scale out, all add up to high performing databases.

NoSQL databases can also be built without a predefined schema. If you need to add a different type of data to a record, you don’t have to recreate the whole DB schema, you can just add that data. Dynamic schemas make for faster development and less database downtime.

Why MongoDB?

Data Consistency Guarantees – Distributed systems occasionally have the bad rap of eventual data consistency. With MongoDB, this is tunable, down to individual queries within an app. Whether something needs to be near instantaneous or has a more casual need for consistency, MongoDB can do it. You can even configure Replica sets (more about those in a bit) so that you can read from secondary replicas instead of primary for reduced network latency.

Support for multi-document ACID transactions as of 4.0 – So I had no idea what this meant at all. I had to look it up. What it means, is that if you needed to make a change to two different tables at the same time, you were unable to before 4.0. NOW you are able to do both at the same time. Think of a shopping cart inventory. You want to remove the item out of your inventory as the customer is buying it. You would want to do those two transactions at the same time. BOOM! Multi-Document transaction support.

Flexibility – As mentioned above, MongoDB documents are polymorphic. Meaning….They can contain different data from other documents with no ill effects. There is also no need to declare anything as each file is self-describing. However……. There is such a thing as Schema Governance. If your documents MUST have certain fields in them, Data Governance will step and in and structure can be imposed to make sure that data is there.

Speed – Taking what I talked about above a bit further in, there are a number of ways and reasons why MongoDB is much faster. Since a single document is the place for reads and writes for an entity, to pull data usually only requires a single read operation. Query language is also much simpler further enhancing your speed. Going even further you can build “change streams” that allow you to trigger actions based off of configurable stimuli.

Availability – This will be a little longer since there is a bit more meat on this one. MongoDB maintains multiple copies of data using a tech called Replica Sets. They are self-healing and will auto-recover and failover as needed. The replicas can be placed in different regions as mentioned previously so that reads can be from a local source increasing speed.

In order to maintain data consistency, one of the members assumes the role of primary. All others acts as secondaries, and they will repeat the operations in the oplog of the primary. If for some reason the primary goes down, one of the secondaries is elected to primary. How does it decide you may ask? I’m glad you asked! It does it based on who has the latest data (based on a number of parameters), who has the most connectivity with the majority of other replicas, and it could use user-defined priorities. This is all happens quickly and is decided in seconds. When the election is done, all the other replicas will start replicating from the new primary. If for some reason the old primary comes back online, it will automatically discover its role has been occupied and will become secondary. Up to 50 members can be configured per replica set.

Well that’s enough for one blog post as I’m about 1200 words already. Next post will continue with sharding and more MongoDB goodness.