
Intro to MongoDB (part 2)

When I left off in the last part, we were discussing MongoDB's availability features. We will pick up next with:

Scalability – We've gone over replica sets under availability. For those who come from more of an infrastructure/hardware background, this is very close to something like RAID 1: you are creating multiple copies of a set of data and placing them on separate physical nodes (instead of drives). This does allow a bit of scalability, but it is not as efficient as, say, RAID 6. So, what we will get into next is sharding. Sharding is a lot closer to what we hardware people would think of as RAID 6: you are breaking one overall data set into pieces and spreading those across physical nodes. For refresher purposes, let's throw in a graphic. I've modified it slightly so the diagram matches the comparison above.

Now, if we add in shards, this is what it looks like.

I have just 3 nodes there, but you can scale much bigger. Sharding is automatic and built in; no need for 3rd-party add-ons. Rebalancing is done automatically as you add or remove nodes. A bit of planning is needed as to how you want to distribute the data. Sharding is applied based on a shard key, defined by a data modeler, which describes the range of values used to partition the data into chunks. Much like the key on a map tells you how to read it. There are three components to a sharded cluster (a quick setup sketch follows the list):

  • Query Router (mongos) – This provides an interface between client apps and the sharded cluster.
  • Config Server – Stores the metadata for the sharded cluster. Metadata includes the location of all the chunks and the ranges that define them (each shard is broken into chunks). The query routers cache this data and use it to direct read and write operations to the right shard. Config servers also store authentication information and manage distributed locks. They must be unique to a sharded cluster, and they store their information in a "config" database. Config servers can be set up as a replica set, since keeping them available is essential to the sharded cluster.
  • Shard – A subset of the data. Each shard can be deployed as a replica set.
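To make this concrete, here is a minimal sketch of pointing the Python driver (pymongo) at a query router and sharding a collection on a ranged key. The hostnames, database, and collection names are all made up for illustration.

```python
from pymongo import MongoClient

# Connect to the query router (mongos), not to an individual shard.
# Hostname and port are placeholders.
client = MongoClient("mongodb://mongos.example.com:27017")

# Enable sharding on a hypothetical database, then shard a collection
# on a ranged (compound) shard key. Both commands are routed through
# mongos, which consults the config servers for chunk metadata.
client.admin.command("enableSharding", "salesdb")
client.admin.command(
    "shardCollection",
    "salesdb.customers",
    key={"region": 1, "customer_id": 1},  # ranged shard key
)
```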

There are three types of sharding strategies: Ranged, Hashed, and Zoned (sketches of the hashed and zoned variants follow the list).

  • Ranged Sharding – You create the shard key by giving it a range, and all documents whose key values fall within that range are grouped on the same shard. This approach is great for co-locating data, such as all customers within a specific region.
  • Hashed Sharding – Documents are distributed across shards more evenly using an MD5 hash of the shard key. This optimizes write performance and is well suited to ingesting streams of time-series or event data.
  • Zoned Sharding – Developers can define specific criteria for a zone to partition data as needed by the business. This allows much more precise control over where the data is placed physically. If a customer is concerned about data locality, this is a great way to enforce it; reasons might include regulations such as GDPR.
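Here is a rough sketch of what the hashed and zoned variants look like through the same driver. Again, the shard name, databases, fields, and zone ranges are assumptions made for illustration, not taken from a real deployment.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.com:27017")

# Hashed sharding: spread an event/time-series collection evenly by
# hashing a high-cardinality field.
client.admin.command("enableSharding", "telemetry")
client.admin.command(
    "shardCollection", "telemetry.events", key={"device_id": "hashed"}
)

# Zoned sharding: tie a range of shard key values to a named zone and
# assign that zone to a specific shard (e.g. for data locality / GDPR).
client.admin.command("enableSharding", "salesdb")
client.admin.command("shardCollection", "salesdb.orders", key={"region": 1})
client.admin.command("addShardToZone", "shard0000", zone="EU")  # placeholder shard name
client.admin.command(
    "updateZoneKeyRange",
    "salesdb.orders",
    min={"region": "EU"},   # range is inclusive of min,
    max={"region": "EV"},   # exclusive of max: covers "EU", "EU-DE", etc.
    zone="EU",
)
```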

You can learn more by watching the MongoDB webinar on sharding.

Data Security

MongoDB has a number of security features that can be used to keep data safe, which is becoming more and more important with the ever-increasing amount of personal information being stored. The main features are:

  • Authentication. MongoDB integrates with all the main external authentication methods: LDAP, Active Directory, Kerberos, and x.509 certificates. You can take this a step further and implement IP whitelisting as well.
  • RBAC. Role-Based Access Control allows granular permissions to be assigned to a user or application. Developers can also create specific views that expose only the data that is needed.
  • Auditing. Auditing has to be configured, but it is offered and can write to the console or to a JSON or BSON file. The types of operations that can be audited are schema changes, replica set and sharding events, authentication and authorization, and CRUD operations (create, read, update, delete).
  • Encryption. MongoDB supports both at-rest and transport encryption. Transport encryption is handled by TLS/SSL certificates, which must use a minimum 128-bit key length. As of 4.0, TLS 1.0 is disabled if 1.1 is available. Either self-signed or CA-signed certificates can be used, and identity verification is supported for both clients and server node members. FIPS mode is supported, but only in the Enterprise edition. Encryption at rest is handled by the encrypted storage engine introduced in 3.2; the default cipher is AES256-CBC. A minimal connection and user-setup sketch follows this list.
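As a rough illustration of a few of these features together, here is a sketch of connecting over TLS with authentication and creating a role-limited application user through pymongo. Every hostname, path, and credential below is a placeholder.

```python
from pymongo import MongoClient

# Connect over TLS and authenticate against the admin database.
client = MongoClient(
    "mongodb://dbadmin:s3cret@db.example.com:27017/?authSource=admin",
    tls=True,
    tlsCAFile="/etc/ssl/mongo-ca.pem",  # CA used to verify the server cert
)

# RBAC: create an application user that can only read and write one
# database; no server-wide privileges.
client["salesdb"].command(
    "createUser",
    "orders_app",
    pwd="change-me",
    roles=[{"role": "readWrite", "db": "salesdb"}],
)
```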

Next up we will go over a bit of what the hardware should look like.

Intro to MongoDB (part 1)

I don't like feeling dumb. I know this is a weird way to start a blog post. I detest feeling out of my element and inadequate. As the tech world continues to advance inexorably (exponentially, even), the likelihood that I will keep running into those feelings becomes greater and greater. To combat this, I have a number of projects in the works to learn new products. Since this blog post has a title, and I have shortsightedly titled it after the tech I will be attempting to learn, it would be rather anticlimactic to say what it is now. Jumping in….

What is MongoDB?

The first question is: what is MongoDB, and what makes it different from other database programs out there, say MS SQL or MySQL or PostgreSQL? To answer that, I will need to describe a bit of the architecture. (And yes, I am learning this all as I go along. I had a fuzzy idea of what databases were and have used them myself for many things, but if you asked me the difference between a relational and a non-relational DB, I would have had to go to Google to answer you.) The two main types of databases out there are relational and non-relational. Trying to figure out the simple difference between them was confusing. The simplest way of defining it is the relational model definition from Wikipedia: "all data is represented in terms of tuples, grouped into relations." The issue was that this didn't describe it well enough for me to understand, so I kept looking for a simpler definition. The one I found and liked is the following (found on Stack Overflow): "I know exactly how many values (attributes) each row (tuple) in my table (relation) has and now I want to exploit that fact accordingly, thoroughly, and to its extreme." There are other differences, such as how relational databases are more difficult to scale and to use with large datasets. You also need to define all the data you will need before you create a relational DB; unstructured data or unknowns are difficult to plan for, and you may not be able to plan for them at all.

So, what is MongoDB then? Going back to our good friend Wikipedia, it is a cross-platform, document-oriented database program. It is also defined as a NoSQL database. There are a number of different types of NoSQL databases, though (this is where I really start feeling out of my element and dumb). They are (a sample document follows the list):

  1. Document databases – These pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, key-array pairs, or even nested documents.
  2. Graph stores – These are used to store information about networks of data, such as social connections.
  3. Key-Value stores – These are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or key) together with its value.
  4. Wide column stores – These are optimized for queries over large datasets, and store columns of data together instead of rows. One example of this type is Cassandra.
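Since MongoDB is the first type, here is what a single document might look like when inserted from Python; the database, collection, and field names are invented for the example.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder host
db = client["blogdb"]

# One document can hold plain key-value pairs, key-array pairs, and
# nested documents, with no table schema defined up front.
db.posts.insert_one({
    "title": "Intro to MongoDB",
    "tags": ["nosql", "document-db"],   # key-array pair
    "author": {                         # nested document
        "name": "Jane Doe",
        "social": {"twitter": "@janedoe"},
    },
})
```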

Why NoSQL?

In our speed-crazed society, we value performance. Sometimes too much. But still, performance. SQL databases were not built to scale easily or to handle the amount of data that some orgs need. To that end, NoSQL databases were built to provide superior performance and the ability to scale easily. Things like auto-sharding (distribution of data between nodes), replication without third-party software or add-ons, and easy scale-out all add up to high-performing databases.

NoSQL databases can also be built without a predefined schema. If you need to add a different type of data to a record, you don’t have to recreate the whole DB schema, you can just add that data. Dynamic schemas make for faster development and less database downtime.
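For example, here is a sketch of what that looks like in practice (collection and field names are made up): two documents in the same collection with different shapes, and no migration needed.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client["salesdb"]["customers"]

customers.insert_one({"name": "Acme Corp", "country": "US"})
customers.insert_one({
    "name": "Globex",
    "country": "DE",
    "vat_number": "DE123456789",  # a new field added on the fly, no ALTER TABLE
})
```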

Why MongoDB?

Data Consistency Guarantees – Distributed systems occasionally have a bad rap for eventual data consistency. With MongoDB, this is tunable, down to individual queries within an app. Whether something needs to be near-instantaneous or has a more casual need for consistency, MongoDB can do it. You can even configure replica sets (more about those in a bit) so that you read from secondary replicas instead of the primary for reduced network latency.
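A small sketch of what that per-query tuning looks like in pymongo; the hostnames, database, and collection names are assumptions.

```python
from pymongo import MongoClient, ReadPreference, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://db1.example.com,db2.example.com/?replicaSet=rs0")

# Strong guarantees for important operations: acknowledged by a
# majority of replica set members.
orders = client["salesdb"].get_collection(
    "orders",
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)
orders.insert_one({"order_id": 1, "total": 42.50})

# Relaxed, low-latency read from a secondary for something like a dashboard.
reporting = client["salesdb"].get_collection(
    "orders", read_preference=ReadPreference.SECONDARY_PREFERRED
)
recent = list(reporting.find().limit(10))
```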

Support for multi-document ACID transactions as of 4.0 – So I had no idea what this meant at all; I had to look it up. What it means is that if you needed to make a change to two different documents (the rough equivalent of rows in two different tables) at the same time, you could not do it atomically before 4.0. NOW you can do both in a single transaction. Think of a shopping cart: you want to remove the item from your inventory as the customer is buying it, and you want both changes to succeed or fail together. BOOM! Multi-document transaction support.
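Here is what that shopping cart example could look like as a multi-document transaction in pymongo (MongoDB 4.0+ on a replica set; names are illustrative).

```python
from pymongo import MongoClient

client = MongoClient("mongodb://db1.example.com/?replicaSet=rs0")
db = client["shop"]

# Decrement stock and record the order as one all-or-nothing unit.
with client.start_session() as session:
    with session.start_transaction():
        db.inventory.update_one(
            {"sku": "ABC-123", "qty": {"$gte": 1}},
            {"$inc": {"qty": -1}},
            session=session,
        )
        db.orders.insert_one(
            {"sku": "ABC-123", "customer": "jane@example.com"},
            session=session,
        )
        # If either operation raises, the transaction aborts and neither
        # change is applied.
```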

Flexibility – As mentioned above, MongoDB documents are polymorphic, meaning they can contain different data from other documents with no ill effects. There is also no need to declare anything up front, as each document is self-describing. However, there is such a thing as schema governance: if your documents MUST have certain fields in them, schema governance will step in and structure can be imposed to make sure that data is there.
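A sketch of what that governance can look like: a JSON Schema validator attached to a collection so that certain fields are required while everything else stays flexible. Field and collection names are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Every order document must carry a sku and a total; any extra fields
# are still allowed.
db.create_collection(
    "orders",
    validator={
        "$jsonSchema": {
            "bsonType": "object",
            "required": ["sku", "total"],
            "properties": {
                "sku": {"bsonType": "string"},
                "total": {"bsonType": ["double", "int"]},
            },
        }
    },
)
```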

Speed – Taking what I talked about above a bit further, there are a number of reasons why MongoDB is fast. Since a single document is the place for reads and writes for an entity, pulling data usually requires only a single read operation. The query language is also much simpler, further enhancing speed. Going even further, you can build "change streams" that let you trigger actions based on configurable events.
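For instance, a minimal change-stream sketch (a replica set or sharded cluster is required; names are invented): watch a collection and react to each new insert.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://db1.example.com/?replicaSet=rs0")
orders = client["shop"]["orders"]

# Block and receive a change event for every newly inserted order.
with orders.watch([{"$match": {"operationType": "insert"}}]) as stream:
    for change in stream:
        doc = change["fullDocument"]
        print(f"New order for {doc['sku']}")  # e.g. kick off a notification
```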

Availability – This will be a little longer since there is a bit more meat on this one. MongoDB maintains multiple copies of data using a technology called replica sets. They are self-healing and will auto-recover and fail over as needed. The replicas can be placed in different regions, as mentioned previously, so that reads can come from a local source, increasing speed.

In order to maintain data consistency, one of the members assumes the role of primary. All others act as secondaries and replay the operations from the primary's oplog. If for some reason the primary goes down, one of the secondaries is elected primary. How does it decide, you may ask? I'm glad you asked! It decides based on which member has the latest data (according to a number of parameters), which has the most connectivity with the majority of the other replicas, and, optionally, user-defined priorities. This all happens quickly and is decided in seconds. When the election is done, all the other replicas start replicating from the new primary. If the old primary comes back online, it will automatically discover that its role has been taken and will become a secondary. Up to 50 members can be configured per replica set.
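From the application side, all of this is mostly invisible. Here is a sketch of connecting to a replica set (placeholder hostnames and set name); the driver discovers the current primary and re-routes writes after an election.

```python
from pymongo import MongoClient

client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017"
    "/?replicaSet=rs0&retryWrites=true"
)
# The write lands on whichever member is currently primary.
client["shop"]["orders"].insert_one({"sku": "ABC-123"})
```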

Well, that's enough for one blog post, as I'm at about 1,200 words already. The next post will continue with sharding and more MongoDB goodness.