Database Indexing

What is an Index?¶

A database index is a data structure that improves the speed of read operations on a table or collection at the cost of additional storage and slower writes. Think of it like a book index — instead of scanning every page to find a topic, you look up the topic in the index and jump directly to the right page.

Without an index, the database must perform a full table scan (or collection scan in NoSQL), reading every row or document to find matches. With an index, the database can locate the relevant records in logarithmic time using structures like B-trees or hash maps.

How a B-Tree Index Works¶

Most databases (MySQL, MongoDB, CouchDB) use a B-tree (or B+tree variant) as the default index structure. It is a balanced tree where each node can hold multiple keys, and all leaf nodes are at the same depth — guaranteeing O(log n) lookups.

Suppose we have a users table with an index on the age column, containing values: 5, 10, 15, 20, 25, 30, 35, 40, 45. With a max of 3 keys per node, the B-tree looks like this after all insertions:

                         ┌──────┐
                         │ [20] │  ◄── Root node
                         └──┬───┘
                    ┌───────┘└────────┐
                    ▼                 ▼
              ┌──────┐          ┌──────────┐
              │ [10] │          │ [30, 40]  │  ◄── Internal nodes
              └──┬───┘          └──┬──┬──┬──┘
             ┌───┘└──┐        ┌────┘  │  └────┐
             ▼       ▼        ▼       ▼       ▼
           ┌───┐  ┌────┐  ┌────┐  ┌────┐  ┌────┐
           │[5]│  │[15]│  │[25]│  │[35]│  │[45]│  ◄── Leaf nodes
           └─┬─┘  └─┬──┘  └─┬──┘  └─┬──┘  └─┬──┘
             ▼      ▼       ▼       ▼       ▼
           Row #3  Row #7  Row #9  Row #6  Row #8     ◄── Actual table rows

Searching: `WHERE age = 10`¶

Step 1:  Root [20]
         → 10 < 20 → go to left child

Step 2:  Internal [10]
         → found 10 → follow pointer to Row #1

Result:  2 node reads instead of scanning all 9 rows

Searching: `WHERE age = 25`¶

Step 1:  Root [20]
         → 25 > 20 → go to right child

Step 2:  Internal [30, 40]
         → 25 < 30 → go to left child

Step 3:  Leaf [25]
         → found 25 → follow pointer to Row #9

Result:  3 node reads instead of scanning all 9 rows

Range query: `WHERE age BETWEEN 15 AND 30`¶

Step 1:  Root [20]
         → 15 < 20 → go to left child

Step 2:  Internal [10]
         → 15 > 10 → go to right child

Step 3:  Leaf [15]
         → found 15 → collect it, then scan right through linked leaves →

Step 4:  Internal [20] → collect 20
         Leaf [25] → collect 25
         Internal [30] → collect 30 (30 ≤ 30) → stop

Result:  {15, 20, 25, 30} — leaf nodes are linked, so range scans are efficient

In a B+tree (used by MySQL InnoDB), only leaf nodes hold pointers to actual rows, and leaf nodes are linked together in a doubly-linked list. This makes range scans and ordered traversals very efficient — the database walks the linked list instead of climbing back up the tree.

Without an index (full table scan)¶

Scan:  Row #1 → Row #2 → Row #3 → ... → Row #9
       Check every row's age value
       O(n) — must read all rows

With an index (B-tree lookup)¶

Lookup: Root → Leaf → Row pointer
        O(log n) — reads only 2-3 nodes for millions of rows

For a table with 1 million rows, a B-tree index typically requires only 3-4 node reads to find any value, compared to scanning all 1,000,000 rows without an index.

Creating Indexes¶

MySQL¶

-- Create an index on a single column
CREATE INDEX idx_users_email ON users (email);

-- Create a unique index (enforces uniqueness)
CREATE UNIQUE INDEX idx_users_email ON users (email);

-- View existing indexes on a table
SHOW INDEX FROM users;

-- Drop an index
DROP INDEX idx_users_email ON users;

MongoDB¶

// Create an index on a single field
db.users.createIndex({ email: 1 });  // 1 = ascending

// Create a unique index
db.users.createIndex({ email: 1 }, { unique: true });

// View existing indexes on a collection
db.users.getIndexes();

// Drop an index
db.users.dropIndex("email_1");

CouchDB¶

CouchDB uses design documents with views or Mango indexes for querying:

// Create a Mango index via HTTP API
// POST /dbname/_index
{
    "index": {
        "fields": ["email"]
    },
    "name": "idx_users_email",
    "type": "json"
}

# List all indexes
# GET /dbname/_index

# Delete an index
# DELETE /dbname/_index/_design/{design-doc}/json/{index-name}

Indexing Nested Fields (NoSQL)¶

NoSQL documents often have deeply nested structures. Indexing these nested fields allows the database to query them efficiently without scanning every document.

Consider this document structure:

{
    "_id": "user_1",
    "name": "Alice",
    "address": {
        "city": "Seattle",
        "state": "WA",
        "coords": { "lat": 47.6, "lng": -122.3 }
    },
    "orders": [
        { "product": "Widget", "price": 25 },
        { "product": "Gadget", "price": 30 }
    ]
}

MongoDB¶

MongoDB uses dot notation to index nested fields. The B-tree stores the nested value as the key, just like a flat field:

// Index a nested field
db.users.createIndex({ "address.city": 1 });

// Query uses the index
db.users.find({ "address.city": "Seattle" });

// Index a deeply nested field
db.users.createIndex({ "address.coords.lat": 1 });

Multikey indexes — when you index a field inside an array, MongoDB automatically creates a multikey index. It inserts a separate key into the B-tree for each element in the array:

// Index a field inside an array of objects
db.users.createIndex({ "orders.product": 1 });

// Query uses the index — finds any user with a "Widget" order
db.users.find({ "orders.product": "Widget" });

How a multikey index looks in the B-tree for the document above:

B-tree keys:              Points to:
─────────────             ──────────
"Gadget"        ───►      user_1
"Widget"        ───►      user_1

Each array element gets its own entry. If another user also ordered "Widget", both document IDs would be stored under the same key.

Limitation: MongoDB does not allow a compound index on two different array fields in the same document. This is because the cross-product of two arrays would create an explosion of index entries:

// This will fail if both "orders" and "tags" are arrays
db.users.createIndex({ "orders.product": 1, "tags": 1 }); // Error

CouchDB¶

CouchDB Mango indexes support nested fields using dot notation, similar to MongoDB:

// POST /dbname/_index
{
    "index": {
        "fields": ["address.city"]
    },
    "name": "idx_address_city",
    "type": "json"
}

// POST /dbname/_find
{
    "selector": {
        "address.city": "Seattle"
    }
}

For indexing fields inside arrays, Mango indexes do not automatically handle array elements like MongoDB's multikey indexes. Instead, use a MapReduce view to emit each array element individually:

// PUT /dbname/_design/orders
{
    "views": {
        "by_product": {
            "map": "function(doc) { if (doc.orders) { doc.orders.forEach(function(order) { emit(order.product, doc._id); }); } }"
        }
    }
}

# Find all users who ordered "Widget"
# GET /dbname/_design/orders/_view/by_product?key="Widget"

This is CouchDB's equivalent of a multikey index — the view emits one index entry per array element.

Composite Indexes¶

A composite (multi-column) index covers multiple columns in a single index. Column order matters — the database can only use the index efficiently when queries filter or sort by a leftmost prefix of the indexed columns.

The Leftmost Prefix Rule¶

Given a composite index on (country, city, zipcode):

Query filters on	Uses the index?
`country`	Yes
`country` AND `city`	Yes
`country` AND `city` AND `zipcode`	Yes (full index)
`city` only	No — skips the leftmost column
`zipcode` only	No — skips the leftmost columns
`country` AND `zipcode`	Partially — uses `country` only

MySQL¶

CREATE INDEX idx_location ON addresses (country, city, zipcode);

-- Uses the full index
SELECT * FROM addresses WHERE country = 'US' AND city = 'Seattle' AND zipcode = '98101';

-- Uses first two columns of the index
SELECT * FROM addresses WHERE country = 'US' AND city = 'Seattle';

-- Cannot use the index (skips leftmost column)
SELECT * FROM addresses WHERE city = 'Seattle';

MongoDB¶

db.addresses.createIndex({ country: 1, city: 1, zipcode: 1 });

// Uses the full index
db.addresses.find({ country: "US", city: "Seattle", zipcode: "98101" });

// Uses first two fields of the index
db.addresses.find({ country: "US", city: "Seattle" });

// Cannot use the index efficiently (skips leftmost field)
db.addresses.find({ city: "Seattle" });

CouchDB¶

// POST /dbname/_index
{
    "index": {
        "fields": ["country", "city", "zipcode"]
    },
    "name": "idx_location",
    "type": "json"
}

CouchDB Mango indexes follow the same leftmost prefix rule — queries must include fields from left to right in the order they were defined.

Covering Indexes¶

A covering index contains all the columns a query needs, allowing the database to return results entirely from the index without reading the actual table or document. This eliminates the expensive "lookup" step and can dramatically improve performance.

MySQL¶

-- Suppose this query runs frequently:
SELECT email, name FROM users WHERE email = 'alice@example.com';

-- A covering index for this query:
CREATE INDEX idx_users_email_name ON users (email, name);

When MySQL can satisfy a query using only the index, it shows "Using index" in the EXPLAIN output — meaning it never touches the table data.

EXPLAIN SELECT email, name FROM users WHERE email = 'alice@example.com';
-- Look for "Using index" in the Extra column

MongoDB¶

MongoDB calls this a covered query. The index must include all fields in both the query filter and the projection, and the projection must explicitly exclude the _id field (unless _id is part of the index):

db.users.createIndex({ email: 1, name: 1 });

// Covered query — all fields come from the index
db.users.find(
    { email: "alice@example.com" },
    { email: 1, name: 1, _id: 0 }
);

Use explain() to verify:

db.users.find(
    { email: "alice@example.com" },
    { email: 1, name: 1, _id: 0 }
).explain("executionStats");
// Look for "totalDocsExamined": 0 — means the index covered the query

CouchDB¶

CouchDB Mango queries always load the full document by default, so true covering indexes are not directly supported. However, CouchDB views (MapReduce) can emit only the fields you need, effectively serving as a covering index:

// Design document with a view that emits email and name
// PUT /dbname/_design/users
{
    "views": {
        "by_email": {
            "map": "function(doc) { if (doc.type === 'user') { emit(doc.email, { email: doc.email, name: doc.name }); } }"
        }
    }
}

Querying this view returns the email and name directly from the index without loading the full document:

# GET /dbname/_design/users/_view/by_email?key="alice@example.com"

Search Indexes vs Database Indexes¶

The indexes discussed above are database indexes — B-tree or hash structures that speed up exact matches, range queries, and sorting. A search index is a fundamentally different data structure designed for full-text search.

A search index builds an inverted index: it tokenizes text into individual words, normalizes them (lowercasing, stemming, removing stop words), and maps each token back to the documents that contain it. This allows queries like "find all products mentioning 'wireless bluetooth'" to return results ranked by relevance.

	Database Index	Search Index
Structure	B-tree, hash	Inverted index
Optimized for	Exact match, range, sort	Full-text search, relevance ranking
Query style	`WHERE email = 'x'`	`MATCH(description) AGAINST('wireless')`

MySQL¶

MySQL supports full-text search via FULLTEXT indexes:

-- Create a fulltext index
ALTER TABLE products ADD FULLTEXT INDEX idx_ft_description (description);

-- Search using natural language mode
SELECT * FROM products
WHERE MATCH(description) AGAINST('wireless bluetooth' IN NATURAL LANGUAGE MODE);

-- Search using boolean mode (supports +, -, * operators)
SELECT * FROM products
WHERE MATCH(description) AGAINST('+wireless -wired' IN BOOLEAN MODE);

FULLTEXT indexes only work on CHAR, VARCHAR, and TEXT columns, and are supported on InnoDB and MyISAM storage engines.

MongoDB¶

MongoDB offers full-text search through text indexes:

// Create a text index
db.products.createIndex({ description: "text" });

// Search
db.products.find({ $text: { $search: "wireless bluetooth" } });

// Search with relevance score
db.products.find(
    { $text: { $search: "wireless bluetooth" } },
    { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } });

For more advanced search (fuzzy matching, faceted search, autocomplete), MongoDB Atlas offers Atlas Search powered by Apache Lucene.

CouchDB¶

CouchDB does not have built-in full-text search. The common approach is to integrate an external search engine:

Nouveau (CouchDB 4.x) — built-in full-text search powered by Apache Lucene, configured through design documents:

// PUT /dbname/_design/search
{
    "nouveau": {
        "products": {
            "index": "function(doc) { if (doc.type === 'product') { index('description', doc.description, { store: true }); } }"
        }
    }
}

# GET /dbname/_design/search/_nouveau/products?q=description:wireless

For older CouchDB versions, use an external search engine like Elasticsearch with a change feed listener to keep the search index in sync.

Common Pitfalls¶

1. Over-Indexing¶

Every index must be updated on every INSERT, UPDATE, and DELETE. Too many indexes slow down write operations and consume additional disk space. Only create indexes that serve actual query patterns.

-- Bad: indexing every column "just in case"
CREATE INDEX idx_a ON users (first_name);
CREATE INDEX idx_b ON users (last_name);
CREATE INDEX idx_c ON users (age);
CREATE INDEX idx_d ON users (created_at);
CREATE INDEX idx_e ON users (status);

Instead, analyze your query patterns and create targeted indexes — often a few composite indexes cover most queries.

2. Unused Indexes¶

Indexes that no query uses waste storage and slow down writes for no benefit.

MySQL — identify unused indexes:

SELECT * FROM sys.schema_unused_indexes
WHERE object_schema = 'your_database';

MongoDB — check index usage stats:

db.users.aggregate([{ $indexStats: {} }]);
// Indexes with 0 "accesses.ops" since last restart may be unused

3. Indexing Low-Cardinality Columns¶

Columns with very few distinct values (e.g., status with only 'active' / 'inactive', or gender with 'M' / 'F') make poor indexes on their own. When most rows share the same value, the database gains little by using the index over a full scan.

A low-cardinality column can still be useful as part of a composite index combined with a higher-cardinality column:

-- Poor index: low cardinality on its own
CREATE INDEX idx_status ON users (status);

-- Better: combine with a high-cardinality column
CREATE INDEX idx_status_email ON users (status, email);

4. Not Using EXPLAIN¶

Always verify that your queries actually use the indexes you created. The query planner may choose a full scan if it estimates the index would not help.

MySQL:

EXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';
-- Check "type" column: "ref" or "const" = index used, "ALL" = full scan

MongoDB:

db.users.find({ email: "alice@example.com" }).explain("executionStats");
// Check "winningPlan.stage": "IXSCAN" = index used, "COLLSCAN" = full scan

CouchDB:

// POST /dbname/_find
{
    "selector": { "email": "alice@example.com" },
    "execution_stats": true
}
// Check "execution_stats.results_returned" vs "execution_stats.total_docs_examined"

5. Indexing Large or Rarely Queried Fields¶

Avoid indexing fields that hold large values (e.g., long text, blobs) or fields that are rarely used in WHERE clauses or query selectors. Large values bloat the index size, and rarely queried fields provide no benefit.

Scaling Beyond Indexes¶

Indexes trade write speed for read speed. When you need both fast reads and fast writes, the next step is to separate or distribute the workload.

1. Read Replicas¶

Route writes to a primary node and reads to replica nodes. The primary can have fewer indexes (fast writes), while replicas can be heavily indexed (fast reads):

Writes ──► Primary (minimal indexes, fast writes)
                │  replication
          ┌─────┼─────┐
          ▼     ▼     ▼
       Replica Replica Replica (heavily indexed, fast reads)
Reads ──► ─────────────────────

MySQL supports this natively with primary-replica replication. MongoDB uses replica sets where secondary members can serve reads with readPreference: "secondary". CouchDB supports continuous replication between nodes, where any node can serve reads.

2. Sharding / Partitioning¶

Split data across multiple nodes by a shard key. Each node holds a subset of the data with its own indexes, so both reads and writes are distributed:

                    ┌──────────┐
                    │  Router  │
                    └────┬─────┘
              ┌──────────┼──────────┐
              ▼          ▼          ▼
         ┌─────────┐ ┌─────────┐ ┌─────────┐
         │ Shard 1 │ │ Shard 2 │ │ Shard 3 │
         │ Users   │ │ Users   │ │ Users   │
         │ A - H   │ │ I - P   │ │ Q - Z   │
         └─────────┘ └─────────┘ └─────────┘

Each shard maintains its own indexes over a smaller dataset, so both reads and writes are faster per-node. The trade-off is operational complexity — cross-shard queries (e.g., WHERE age > 30 spanning all shards) require a scatter-gather across all nodes.

Choosing a shard key is critical:

High cardinality — the key should have many distinct values to distribute evenly
Query alignment — the key should match your most common query pattern so most queries hit a single shard
Write distribution — avoid monotonically increasing keys (e.g., auto-increment IDs, timestamps) as they funnel all writes to one shard

MongoDB has built-in sharding. MySQL supports partitioning (range, hash, list) at the table level, or application-level sharding with tools like Vitess. CouchDB uses clustering in CouchDB 2.x+ where data is distributed across nodes using consistent hashing.

3. CQRS (Command Query Responsibility Segregation)¶

Use separate data models for reads and writes. The write model is normalized and minimally indexed for fast writes. The read model is denormalized and heavily indexed for fast reads. Changes propagate from write to read model via events:

              Commands (writes)              Queries (reads)
                    │                              ▲
                    ▼                              │
             ┌────────────┐   sync/event    ┌────────────┐
             │ Write Model│ ──────────────► │ Read Model │
             │ normalized │                 │denormalized│
             │ few indexes│                 │many indexes│
             └────────────┘                 └────────────┘

This pattern is especially natural in NoSQL — the read model can be a pre-computed document that matches exactly what the UI needs, eliminating the need for joins or complex queries at read time.

4. Caching¶

Place a cache (Redis, Memcached) in front of the database for frequently accessed data. This reduces read load on the database, which means you can remove indexes that only served those hot queries:

Read request
     │
     ▼
  ┌───────┐  cache hit   ┌──────────┐
  │ Cache │ ────────────► │ Response │
  └───┬───┘               └──────────┘
      │ cache miss
      ▼
  ┌──────────┐
  │ Database │ ──► populate cache ──► Response
  └──────────┘

Common caching strategies:

Cache-aside — application checks cache first, queries database on miss, and populates cache
Write-through — writes go to cache and database simultaneously, reads always hit cache
Write-behind — writes go to cache immediately, then asynchronously flush to database (faster writes, risk of data loss)

Combining Strategies¶

These approaches are often used together. A typical high-scale setup might look like:

Client ──► Cache (Redis) ──miss──► Read Replica (heavily indexed)
                                        │
                                   replicates from
                                        │
Client ──► Write ──────────────► Primary (minimal indexes)
                                        │
                                   sharded across
                                   multiple nodes

Summary¶

Concept	Key Takeaway
Single-column index	Speeds up queries filtering on one column/field
Composite index	Covers multi-column filters; column order matters (leftmost prefix rule)
Covering index	Returns results entirely from the index — no table/document lookup needed
Over-indexing	Slows writes and wastes storage; only index what you query
Search index	Inverted index for full-text search and relevance ranking — different from a database index
Read replicas	Separate read/write workloads — replicas serve indexed reads, primary handles fast writes
Sharding	Distribute data across nodes — each shard has its own smaller indexes
CQRS	Separate write model (normalized, few indexes) from read model (denormalized, many indexes)
Caching	Reduce read load with Redis/Memcached — removes the need for some indexes entirely
EXPLAIN	Always verify your indexes are actually being used

Database Indexing

What is an Index?¶

How a B-Tree Index Works¶

Searching: WHERE age = 10¶

Searching: WHERE age = 25¶

Range query: WHERE age BETWEEN 15 AND 30¶

Without an index (full table scan)¶

With an index (B-tree lookup)¶

Creating Indexes¶

MySQL¶

MongoDB¶

CouchDB¶

Indexing Nested Fields (NoSQL)¶

MongoDB¶

CouchDB¶

Composite Indexes¶

The Leftmost Prefix Rule¶

MySQL¶

MongoDB¶

CouchDB¶

Covering Indexes¶

MySQL¶

MongoDB¶

CouchDB¶

Search Indexes vs Database Indexes¶

MySQL¶

MongoDB¶

CouchDB¶

Common Pitfalls¶

1. Over-Indexing¶

2. Unused Indexes¶

3. Indexing Low-Cardinality Columns¶

4. Not Using EXPLAIN¶

5. Indexing Large or Rarely Queried Fields¶

Scaling Beyond Indexes¶

1. Read Replicas¶

2. Sharding / Partitioning¶

3. CQRS (Command Query Responsibility Segregation)¶

4. Caching¶

Combining Strategies¶

Summary¶

Searching: `WHERE age = 10`¶

Searching: `WHERE age = 25`¶

Range query: `WHERE age BETWEEN 15 AND 30`¶